One recurring theme covered is learning by example. With this information the reader can select the packages that will help process analytical tasks with minimum effort and maximum usefulness. The use of graphical user interfaces (GUIs) is emphasized in this book to further flatten R's famous learning curve. The book aims to help you kick-start your analytics work, with chapters on data visualization and code examples covering web analytics, social media analytics, clustering, regression models, text mining, data mining models, and forecasting.
The book tries to expose the reader to a breadth of business analytics topics without burying the reader in needless depth, and the included references and links allow readers to pursue these topics further. It is aimed at business analysts with basic programming skills who want to use R for business analytics. Note that the scope of the book is neither statistical theory nor graduate-level statistical research; rather, it is written for business analytics practitioners. Business analytics (BA) refers to the field of exploration and investigation of data generated by businesses.
Business intelligence (BI) is the seamless dissemination of information through the organization, primarily involving business metrics, both past and current, for use in decision support. Data mining (DM) is the process of discovering new patterns from large data using algorithms and statistical methods.
To differentiate the three: BI is mostly current reports, BA is models to predict and strategize, and DM matches patterns in big data. The R statistical software is the fastest-growing analytics platform in the world and is established in both academia and corporations for its robustness, reliability, and accuracy. The book starts with an introduction to the subject, placing descriptive models in the context of the overall field as well as within the more specific field of data mining analysis.
Chapter 2 covers data visualization, including directions for accessing the open source R software, as described through Rattle. Both R and Rattle are free to students.
Chapter 3 then describes market basket analysis, comparing it with more advanced models, and addresses the concept of lift. Subsequently, Chapter 4 describes marketing RFM (recency, frequency, monetary) models and compares them with more advanced predictive models. Next, Chapter 5 describes association rules, including the Apriori algorithm, with software support from R.
Chapter 7 goes on to describe link analysis, social network metrics, and the open source NodeXL software, and demonstrates a link analysis application using PolyAnalyst output. Chapter 8 concludes the monograph. Using business-related data to demonstrate models, this descriptive book explains how the methods work, with some citations but without detailed references. Data mining and analytics are the foundation technologies for the new knowledge-based world, where we build models from data and databases to understand and explore our world.
Data mining can improve our businesses, our government, and our lives, and with the right tools anyone can begin to explore this new technology on the path to becoming a data mining professional. This book aims to get you into data mining quickly. Loading some data is the first step in a journey to data mining and analytics. In fact, the loaded dataset is a special kind of list called a data frame, which is one of the most common data structures in R for storing our datasets.
A data frame is essentially a list of columns; the weather dataset, for example, has 24 columns. For a data frame, each column is a vector, and all columns have the same length. If we only want to review certain rows or columns of the data frame, we can index the dataset by row and column. The notation 4:8 is actually equivalent to a call to seq() with two arguments, 4 and 8; the function returns a vector containing the integers from 4 to 8.
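As a rough illustration, indexing a data frame and the colon notation might look like the sketch below. It assumes the weather dataset that accompanies Rattle (depending on the version, the dataset is provided by the rattle package itself or by the companion rattle.data package):

library(rattle)    # the weather dataset accompanies Rattle
                   # (in some versions it is provided by the rattle.data package)
dim(weather)       # the number of observations (rows) and variables (columns)
weather[4:8, 2:4]  # review only rows 4 to 8 and columns 2 to 4 of the data frame
4:8                # the colon notation returns the integers 4 to 8
seq(4, 8)          # the equivalent call to seq() with two arguments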
The question mark shorthand (e.g., ?dim) is automatically converted into a call to help(). The help.search() command searches the installed documentation for a topic, and a third command for searching for help on a topic is RSiteSearch(). The command line, as we have introduced it here, is where we access the full power of R. But not everyone wants to learn and remember commands, so Rattle will get us started quite quickly into data mining with only minimal knowledge of the command line.
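By way of illustration, the three help mechanisms just mentioned look like this at the command line (the search strings are arbitrary examples):

?dim                           # shorthand, automatically converted to help("dim")
help("dim")                    # the same request written out in full
help.search("decision tree")   # search the installed documentation for a topic
RSiteSearch("decision tree")   # search R's online resources (requires internet access)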
R and Rattle Interactions
Rattle generates R commands that are passed on through to R at various times during our interactions with Rattle. We can also interact with R itself directly, and even interleave our interactions with Rattle and R. As noted in Section 2, the textual output shown in Rattle can also be viewed in the R Console using print(). We can replicate that here once we have built the decision tree model as described in Section 2. The R Console window is where we enter R commands directly; we first need to make the window active, usually by clicking the mouse within that window.
For the example below, we assume we have run Rattle on the weather dataset to build a decision tree, as described in Section 2. We can then type the print command at the prompt; we see this in the code box below. The command itself consists of the name of an R function we wish to call (print in this case), followed by a list of arguments we pass to the function. The arguments provide information about what we want the function to do. In this case we are choosing a single digit.
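As a sketch, and assuming Rattle has stored the decision tree in its internal container as crs$rpart (the crs container is described later), the command might look like this:

# Assumes a decision tree has already been built in Rattle, so that the model
# object is available as crs$rpart within Rattle's crs container.
print(crs$rpart, digits = 1)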
Pressing the Enter key then passes the command to R, and R responds with the textual output of the model. The text starts with an indication of the number of observations, followed by the same textual presentation of the model we saw in Section 2. We also talked about functions that we type on the command line to make up the command to be executed. In this book, we will adopt a particular terminology around functions and commands, which we describe here.
In its true mathematical sense, a function is some operation that consumes some data and returns some result. Functions like dim(), seq(), and head(), as we have seen, do this.
Functions might also have what we often call side effects—that is, they might do more than simply returning some result. In fact, the purpose of some functions is actually to perform some other action without necessarily returning a result.
Such functions we will tend to call commands. The function rattle(), for example, does not return any result to the command line as such. Instead, its purpose is to start up the GUI and allow us to start data mining. Whilst rattle() is still a function, we will usually refer to it as a command rather than a function.
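To make the distinction concrete, the sketch below (again assuming the weather dataset that ships with Rattle) calls three functions for their return values and one command for its side effect:

library(rattle)
dim(weather)   # a function: returns the dimensions of the dataset
seq(4, 8)      # a function: returns a vector of integers
head(weather)  # a function: returns the first few observations
rattle()       # a command: called for its side effect of starting the GUI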
The two terms can be used interchangeably. We can use R to write programs that analyse data—we program the data analyses. Note that if we are only using Rattle, then we will not need to program directly. Nonetheless, for the programs we might write, we can take advantage of the numerous programming styles offered by R to develop code that analyses data in a consistent, simple, reusable, transparent, and error-free way.
Mistakenly, we are often trained to think that writing sentences in a programming language is primarily for the benefit of having a computer perform some activity for us. Instead, we should think of the task as really writing sentences that convey to other humans a story—a story about analysing our data. Coincidentally, we also want a computer to perform some activity.
Keeping this simple message in mind, whenever writing in R, helps to ensure we write in such a way that others can easily understand what we are doing and that we can also understand what we have done when we come back to it after six months or more.
Environments as Containers in R
For a particular project, we will usually analyse a collection of data, possibly transforming it and storing different bits of information about it. It is convenient to package all of our data and what we learn about it into some container, which we might save as a binary R object and reload more efficiently at a later time.
As a programming style, we can create a storage space and give it a name. The container is an R environment and is initialised using new.env(). We will store and access the relevant information from this container. We can now also quite easily use the same variable names, but within different containers. Then, when we write scripts to build models, for example, often we will be able to use exactly the same scripts, changing only the name of the container.
This encourages the reuse of our code and promotes efficiencies. This approach is also sympathetic to the concept of object-oriented programming. We will use this approach of encapsulating all of our data and information within a container when we start building models. This makes the variables defined in the data container available within the model container.
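A minimal sketch of this container style follows; the container name weatherDS and the file name are illustrative, the weather dataset is again assumed, and the variables data and nobs are the ones referred to in the surrounding text:

weatherDS <- new.env()                     # create the container (an R environment)
evalq({
  data <- weather                          # store the dataset in the container
  nobs <- nrow(data)                       # along with information about it
}, weatherDS)
save(weatherDS, file = "weatherDS.RData")  # save the container as a binary R object
load("weatherDS.RData")                    # reload it in a later session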
It can at times become tiresome to be wrapping our code up within a container. Whilst we retain the discipline of using containers, we can also quickly interact with the variables in a container without having to specify the container each time. We use attach() and detach() to add a container into, and remove it from, the so-called search path used by R to find variables. Note, though, that any new variable we create while the container is attached is not stored in the container; instead it goes into the global environment. The variables data, nobs, etc., defined in the container become directly accessible. This is useful for quickly testing out ideas, for example, and is provided as a choice if you prefer not to use the container concept yourself.
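For instance, continuing the illustrative weatherDS container from the sketch above:

attach(weatherDS)   # add the container to R's search path
nobs                # variables stored in the container can now be used directly
x <- 42             # but a new variable goes into the global environment, not the container
detach(weatherDS)   # remove the container from the search path when finished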
Containers do, however, provide useful benefits. Rattle uses containers internally to collect together the data it needs; the Rattle container is called crs (the current rattle store).
We have also built our first data mining model, albeit using an already prepared dataset. We have also introduced some of the basics of interacting with the R language. We are now ready to delve into the details of data mining.
Each of the following chapters will cover a specific aspect of the data mining process and illustrate how this is accomplished within Rattle and then further extended with direct coding in R.
Before proceeding, it is advisable to review Chapter 1 as an introduction to the overall data mining process if you have not already done so.
Chapter 3: Working with Data
Data is the starting point for all data mining—without it there is nothing to mine. We often think of data as being numbers or categories, but data can also be text, images, videos, and sounds. Data mining generally only deals with numbers and categories.
Often, the other forms of data can be mapped into numbers and categories if we wish to analyse them using the approaches we present here. Whilst data abounds in our modern era, we still need to scout around to obtain the data we need. This provides a fertile ground for sourcing data but also an extensive headache for us in navigating through a massive landscape.
An early step in a data mining project is to gather all the required data together; this task should not be underestimated. When bringing data together, a number of issues need to be considered, including the provenance (source and purpose) and quality (accuracy and reliability) of the data. Data collected for different purposes may well store different information in confusingly similar ways.
Also, some data requires appropriate permission for its use, and the privacy of anyone the data relates to needs to be considered. Time spent at this stage getting to know your data will be time well spent. In this chapter, we introduce data, starting with the language we use to describe and talk about it. Much of the confusion of terminology in this area is due to the history of data mining, with its roots in many different disciplines, including databases, machine learning, and statistics.
Throughout this book, we will use a consistent and generally accepted nomenclature, which we introduce here. We refer to a collection of data as a dataset. This might be called in mathematical terms a matrix or in database terms a table.
Figure 3. We often view a dataset as consisting of rows, which we refer to as observations, and those observations are recorded in terms of variables, which form the columns of the dataset.
Observations are also known as entities, rows, records, and objects. Variables are also known as fields, columns, attributes, characteristics, and features. The dimension of a dataset refers to the number of observations (rows) and the number of variables (columns). Variables can serve different roles: as input variables or output variables.
Input variables are measured or preset data items. They might also be known as predictors, covariates, independent variables, observed variables, and descriptive variables. Output variables may be identified in the data; they might also be known as target, response, or dependent variables.
In data mining, we often build models to predict the output variables in terms of the input variables. Early on in a data mining project, we may not know for sure which variables, if any, are output variables.
For some data mining tasks, there may be no output variables defined at all. Some variables may serve only to uniquely identify the observations; common examples include social security and other such government identity numbers. Even the date may be a unique identifier for particular observations. We refer to such variables as identifiers. Identifiers are not normally used in modelling, particularly those that are essentially randomly generated. Variables can store different types of data. The values might be the names or the qualities of objects, represented as character strings.
Or the values may be quantitative and thereby represented numerically. At a high level we often only need to distinguish these two broad types of data, as we do here. (Figure: a sample of the weather dataset, in which each column is a variable and each row is an observation.) A categoric variable is one that takes on a single value, for a particular observation, from a fixed set of possible values. Examples include eye colour (with possible values including blue, green, and brown), age group (with possible values young, middle age, and old), and rain tomorrow (with only two possible values, Yes and No).
Categoric variables are always discrete, taking on only one of a fixed set of distinct values. Categoric variables like eye colour are also known as nominal variables, qualitative variables, or factors. The possible values have no order to them.
That is, blue is no less than or greater than green. On the other hand, categoric variables like age group are also known as ordinal variables. The possible values have a natural order to them, so that young is in some sense less than middle age, which in turn is less than old. Numeric variables are also known as quantitative variables. Numeric variables can be discrete (integers) or continuous (real).
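The following sketch, using made-up values, shows how these types are commonly represented in R: an unordered factor for a nominal variable, an ordered factor for an ordinal variable, and numeric vectors for quantitative data.

eye.colour <- factor(c("blue", "green", "brown", "blue"))   # nominal: no order among values
age.group  <- factor(c("young", "old", "middle age"),
                     levels = c("young", "middle age", "old"),
                     ordered = TRUE)                         # ordinal: levels have a natural order
temperature <- c(21.5, 18.0, 25.3)                           # numeric: continuous (real)
children    <- c(0L, 2L, 1L)                                 # numeric: discrete (integer)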
A dataset (or, in particular, different randomly chosen subsets of a dataset) can have different roles. For building predictive models, for example, we often partition a dataset into three independent datasets: a training dataset, a validation dataset, and a testing dataset.
The partitioning is done randomly to ensure each dataset is representative of the whole collection of observations. A validation dataset is also known as a design dataset, since it assists in the design of the model. We build our model using the training dataset and then evaluate its performance on the validation dataset. This may lead us to tune the model, perhaps through setting different model parameters. Once we are satisfied with the model, we assess its expected performance into the future using the testing dataset.
It is important to understand the significance of the testing dataset. This dataset must be a so-called holdout or out-of-sample dataset. It consists of randomly selected observations from the full dataset that are not used in any way in the building of the model. That is, it contains no observations in common with the training or validation datasets. This is important in relation to ensuring we obtain an unbiased estimate of the true performance of a model on new, previously unseen observations.
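A simple sketch of such a random partition, again assuming the weather dataset; the 70/15/15 split is illustrative only:

set.seed(42)                               # make the random partition reproducible
n        <- nrow(weather)
train    <- sample(n, round(0.70 * n))     # 70% of observations for training
rest     <- setdiff(seq_len(n), train)
validate <- sample(rest, round(0.15 * n))  # 15% for validation (tuning the model)
test     <- setdiff(rest, validate)        # remaining observations held out for testing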
We can summarise our generic nomenclature in one sentence: a dataset consists of observations recorded using variables, which comprise a mixture of input variables and output variables, either of which may be categoric or numeric.
Having introduced our generic nomenclature, we also need to relate the same concepts to how they are implemented in an actual system, like R. We do so, briefly, here. R has the concept of a data frame to represent a dataset. A data frame is, technically, a list of variables.
For example, this might be a collection of integers recording the ages of clients. Technically, R refers to what we call a variable within a dataset as a vector.
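We can check this view directly in R; the sketch below again assumes the weather dataset, with MinTemp as one of its columns:

is.list(weather)         # a data frame is technically a list ...
length(weather)          # ... whose elements are the variables (columns)
class(weather$MinTemp)   # each variable is a vector holding a single type of data
nrow(weather)            # the number of observations (rows)
ncol(weather)            # the number of variables (columns)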
Each variable will record the same number of data items, and thus we can picture the dataset as a rectangular matrix, as we illustrated in Figure 3. A data frame is much like a table in a database or a page in a spreadsheet. It consists of rows, which we have called observations, and columns, which we have called variables. Whilst that might be quite obvious, there are subtleties we need to address, as discussed in Chapter 1. We also need data—again, somewhat obvious.
As we suggested above, though, sourcing our data is usually not a trivial matter. We discuss the general data issue here before we delve into some technical aspects of data. In an ideal world, the data we require for data mining will be nicely stored in a data warehouse or a database, or perhaps a spreadsheet.
However, we live in a less than ideal world. Data is stored in many different forms and on many different systems, with many different meanings. Data is everywhere, for sure, but we need to find it, understand it, and bring it together.
Over the years, organisations have implemented well-managed data warehouse systems. They serve as the organisation-wide repository of data.
It is true, though, that despite this, data will always spring up outside of the data warehouse, and it will have none of the careful controls that surround the data warehouse with regard to data provenance and data quality. We will always face the challenge of finding data from many sources within an organisation. Data can also be sourced from outside the organisation. This could include data publicly available, commercially collected, or legislatively obtained.
The data will be in a variety of formats and of varying quality. We delve further into understanding the data in Chapter 5. We consider data quality now. Despite the amount of effort organisations put into ensuring the quality of the data they collect, errors will always occur.
We need to understand issues relating to, for example, consistency, accuracy, completeness, interpretability, accessibility, and timeliness. It is important that we recognise and understand that our data will be of varying quality, and that we treat it accordingly. Chapter 7 covers many aspects of data quality and how we can work towards improving the quality of our available data. Below we summarise some of the issues.
In the past, much data was entered by data entry staff working from forms or directly in conversation with clients. Different data entry staff often interpret different data fields (variables) differently. Such inconsistencies might include using different formats for dates or recording expenses in different currencies in the same field, with no information to identify the currency.
Often in the collection of data, some data is more carefully or accurately collected than other data. For bank transactions, for example, the dollar amounts must be very accurate. Where the data must be accurate, extra resources will be made available to ensure data quality.
Where accuracy is less critical, resources might be saved. In analysing data, it is important to understand these aspects of accuracy. Related to accuracy is the issue of completeness. Some less important data might only be optionally collected, and thus we end up with much missing data in our datasets. Alternatively, some data might be hard to collect, and so for some observations it will be missing.
When analysing data, we need to understand the reasons for missing data and deal with the data appropriately. We cover this in detail in Chapter 7. Another major issue faced by the data miner is the interpretation of the data. Knowing that height is measured in feet or in meters will make a difference to the analysis. We might find that some data was entered as feet and other data as meters (the consistency problem). We might have dollar amounts over many years, and our analysis might need to interpret the amounts in terms of their relative present-day value.
Codes are also often used, and we need to understand what each code means and how different codes relate to each other. As the data ages, the meaning of the different variables will often change or be altogether lost.
We need to understand and deal with this. The accessibility of the right data for analysis will often also be an issue. A typical process in data collection involves checking for obvious data errors in the data supplied and correcting those errors. In collecting tax return data from taxpayers, for example, basic checks will be performed to ensure the data appears correct.
Sometimes the checks might involve discussing the data with its supplier and modifying it appropriately. The original data is often archived, but often it is such data that we actually need for the analysis—we want to analyse the data as supplied originally.
Accessing archived data is often problematic. Accessing the most recent data can sometimes be a challenge. In an online data processing environment, where the key measure of performance is the turnaround time of the transaction, providing other systems with access to the data in a timely manner can be a problem.
This can mean that the data may only be available after a day or so, which may present challenges for its timely analysis. Often, business processes need to be changed so that more timely access is possible. Bringing data together from different sources also requires data matching. That is, we need to identify the same entities (e.g., the same patients) across the different data sources. The doctor might have a unique number to identify his or her own patients, as well as their names, dates of birth, and addresses. A hospital will also record data about patients who are admitted, including their reason for admission, treatment plan, and medications.
The process of data matching might be as simple as joining two datasets together based on shared identifiers that are used in each of the two databases. If the doctor and the hospital share the same unique numbers to identify the patients, then the data matching process is simplified.
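In this simplest case, where a shared identifier exists, data matching reduces to a join. The data frames and column names below are invented purely for illustration:

doctor   <- data.frame(patient.id = c(101, 102, 103),
                       name       = c("Smith", "Jones", "Lee"))
hospital <- data.frame(patient.id       = c(102, 103, 104),
                       admission.reason = c("asthma", "fracture", "observation"))
merge(doctor, hospital, by = "patient.id")   # join the two datasets on the shared identifier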
However, the data matching task is usually much more complex. Data matching often involves, for example, matching of names, addresses, and dates and places of birth, all of which will have inaccuracies and alternatives for the same thing. An idea that can improve data matching quality is that of a trusted data matching bureau.
Many data matching bureaus within organisations almost start each new data matching effort from scratch. However, over time there is the opportunity to build up a data matching database that records relevant information about all previous data matches. Under this scenario, each time a new data matching effort is undertaken, the identities within this database, and their associated information, are used to improve the new data matching. Importantly, the results of the new data matching feed back into the data matching database to improve the quality of the matched entities and thus even improve previously matched data.
A number of commercially available tools assist with the basic task, and the open source Febrl system also provides data matching capabilities. They all aim to identify the same entity in all of the data sources. What we store in our warehouse is data. Data warehouses were topical in the 1990s and primarily vendor-driven, servicing a real opportunity to get on top of managing data.
Inmon provides a detailed introduction. We can view a data warehouse as a large database management system. It is designed to integrate data from many different sources and to support analysis for different objectives. In any organisation, the data warehouse can be the foundation for business intelligence, providing a single, integrated source of data for the whole organisation. Typically, a data warehouse will contain data collected together from multiple sources but focussed around the function of an organisation.
The data sources will often be operational systems, such as transaction processing systems, that run the day-to-day functions of the organisation. Transaction processing systems collect data that gets uploaded to the data warehouse on a regular basis. Well-organised data warehouses, at least from the point of view of data mining, will also be nonvolatile.
Such data warehouses capture data regularly, and older data is not removed. Even when an update is made to correct existing data items, the earlier data must be maintained, creating a massive repository of historic data that can be used to carefully track changes over time. Consider the case of tax returns held by our various revenue authorities.
Many corrections are made to individual tax returns over time. Further changes might be made at a later time as a taxpayer corrects data originally supplied.
Changes might also be the result of audits leading to corrections made by the revenue authority, or a taxpayer may notify the authority of a change in address. Keeping the history of data changes is essential for data mining. It may be quite significant, from a fraud point of view, that a number of clients in a short period of time change their details in a common way.
Similarly, it might be significant, from the point of view of understanding client behaviour, that a client has had ten different addresses in the past 12 months. It might be of interest that a taxpayer always files his or her tax return on time each year, and then makes the same two adjustments subsequently, each year. All of this historic data is important in building a picture of the entities we are interested in.
Whilst the operational systems may only store data for one or two months before it is archived, having this data accessible for many years within a data warehouse for data mining is important. In building a data warehouse, much effort goes into how the data warehouse is structured. It must be designed to facilitate the queries that operate on a large proportion of data.
A careful design that exposes all of the data to those who require it will aid in the data mining process. Data warehouses quickly become unwieldy as more data is collected. This often leads to the development of specific data marts, which can be thought of as tuned subsets of the data warehouse created for specific purposes. An organisation, for example, may have a finance data mart, a marketing data mart, and a sales data mart.
Each data mart will draw its information from various other data collected in the warehouse. Different data sources within the warehouse will be shared by different data marts and present the data in different ways.
A crucial aspect of a data warehouse (and any data storage, in fact) is the maintenance of information about the data—so-called metadata. Metadata helps make the data understandable and thereby useful. We might talk about two types of metadata: technical metadata and business metadata. Technical metadata captures data about the operational systems from which the data was obtained, how it was extracted from the source systems, how it was transformed, how it was loaded into the warehouse, where it is stored within the warehouse, and its structure as stored in the warehouse.
R is a powerful and free software system for data analysis and graphics, with over 5,000 add-on packages available. The book steps through over 30 programs written in all three packages (R, SAS, and SPSS), comparing and contrasting the packages' differing approaches. The programs and practice datasets are available for download.
I point out how they differ using terminology with which you are familiar, and show you which add-on packages will provide results most like those from SAS or SPSS. I provide many example programs done in SAS, SPSS, and R. Knowledge management involves the application of human knowledge (epistemology) with the technological advances of our current society (computer systems and big data), both in terms of collecting data and in analyzing it.
We see three types of analytic tools. Descriptive analytics focus on reports of what has happened. Learn how to perform data analysis with the R language and software environment, even if you have little or no programming experience. The second half of Learning R shows you real data analysis in action by covering everything from importing data to publishing your results.