Consider today’s presentation a first exposure to basic data mining techniques. By the end of the session you should have a basic appreciation for how the methods work and why they could be attractive additions to your six sigma tool kit. You will not be able to perform an analysis yourself when the session is over, but references will be supplied to support starting the learning process. And as always, keep in mind George Box’s dictum: “All models are wrong, but some are useful.”
We’ll zero in on two of the simplest methods to understand, communicate, and perform: classification and regression trees. And we’ll compare them to multiple linear regression (MLR), which is commonly used at this point in a six sigma project. Note that if the response is a discrete variable, multiple linear regression cannot be used; logistic regression is the proper tool in that case. And just because you can assign numerical values to the levels of a discrete variable doesn’t mean you can use MLR—the values are still only categories.
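To make the discrete-response point concrete, here is a minimal sketch in Python with scikit-learn. The pass/fail data are invented purely for illustration:

    # Minimal sketch: a discrete (0/1) response calls for logistic regression,
    # not MLR. Data below are hypothetical.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[1.2], [2.3], [2.9], [3.7], [4.1], [5.0]])  # predictor
    y = np.array([0, 0, 0, 1, 1, 1])                          # 0 = fail, 1 = pass

    model = LogisticRegression().fit(X, y)
    print(model.predict_proba([[3.0]]))  # predicted probability of each class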
A recent article in The Economist cites studies suggesting that advanced algorithms derived by mining techniques will make additional occupations obsolete—ones that even ten years ago would have seemed “safe.”
•Robots in manufacturing are “learning” from experience.
•Computers using algorithms are better at detecting patterns in financial & security images than people are.
•Algorithmic pattern recognition in medical images can detect smaller abnormalities than people are able to.
According to a source quoted in the article, many occupations have p>0.50 of diminishing job prospects over the next two decades. These include airline pilots, machinists, word processors & typists, real estate agents, technical writers, retail sales people, accountants & auditors, and telemarketers. Moneyball and Nate Silver’s accurate predictions of the last two presidential elections made use of predictive analytics. It seems obvious that quality professionals, including those involved in six sigma, will be touched by “big data” at some point in their careers. Yet even if the topic has no direct impact on your job, as a consumer and as a citizen it’s important that you understand the basics.
You can’t assume that the data contained in a file is reliable, by which we mean stable and predictable over time. Trials are run; breakdowns occur. Each of these can either add to the range of values out there (possibly a good thing) or add noise, the details of which are likely lost as time since the event has lengthened. How much useful information a project obtains from a database is at least partly determined by the quality of the effort put into verifying the quality of the data available.
Always verify that the data you will be using contains only valid results. Make sure what you thought you were asking Access or SQL to get is what you really got. If not, fix the query and keep trying until it’s right. In databases with many columns, it’s not uncommon to find missing values, or ones you decide to delete as bogus. Software commonly just deletes the entire record from the analysis, which can quickly make your big dataset not nearly as big. Fix values that are clearly just blunders, like a misplaced decimal point. Sometimes empty cells have placeholders, like ‘9999.’ These can be set to missing, or values can be imputed. Imputation replaces empty or “defective” values with another one; sometimes the mean or median is substituted. Other, more sophisticated methods also exist, but they are beyond what will be covered today.
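A minimal cleaning sketch in Python/pandas, along the lines just described. The file and column names here are hypothetical:

    # Convert a '9999' placeholder to missing, then impute with the median.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("process_data.csv")          # hypothetical file

    # Treat the placeholder as missing rather than as a real measurement.
    df["temp"] = df["temp"].replace(9999, np.nan)

    # Crude-but-common imputation: fill missing values with the column median.
    df["temp"] = df["temp"].fillna(df["temp"].median())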
There are comparable assumptions in the logistic regression case as well.
In noisy industrial systems it’s quite common to encounter low R^2 values, even when one knows all critical variables have been included in the model. Max R^2 = 1 − (fraction of total variance due to measurement error). If one knows the variance components from measurements, one can calculate the maximum value possible and normalize the observed R^2 to an “error free” basis.
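A worked example of that normalization, with invented numbers:

    # Illustrative numbers only, not from a real study.
    pct_meas_error = 0.30          # measurement error = 30% of total variance
    max_r2 = 1 - pct_meas_error    # best R^2 any model could achieve = 0.70

    observed_r2 = 0.55
    normalized_r2 = observed_r2 / max_r2   # ~0.79 on an "error free" basis
    print(max_r2, round(normalized_r2, 2))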
Correlation between predictors is called multicollinearity. In MLR it is detected by asking the software to supply variance inflation factors (VIF) for the predictors. Rule of thumb:
•0 < VIF < ~5: okay
•~5 < VIF < 10: marginal
•VIF > 10: multicollinearity is a problem. Find somebody who can do Principal Components Analysis.
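A minimal VIF sketch in Python/statsmodels. The predictor values are hypothetical; x2 is built to be nearly a multiple of x1, so those two should flag:

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    X = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                      "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],   # ~2 * x1
                      "x3": [5.0, 3.0, 6.0, 2.0, 7.0, 4.0]})
    Xc = add_constant(X)   # VIFs are computed on the design matrix with intercept

    for i, col in enumerate(Xc.columns):
        if col != "const":
            print(col, round(variance_inflation_factor(Xc.values, i), 1))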
Data transformation can be a black art. It requires a working knowledge of mathematical functions that many of us forgot shortly after leaving Algebra II in high school. Not an insurmountable problem, but something to keep in mind when patterns are detected in the residuals.
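One common example of such a transform, sketched with hypothetical data: a log transform straightens out a response that grows multiplicatively.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 3.5, 9.8, 27.1, 80.5])   # roughly exponential in x

    # Regressing log(y) on x linearizes the pattern.
    slope, intercept = np.polyfit(x, np.log(y), 1)
    print(round(slope, 2), round(intercept, 2))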
“Standardized” means that the residual value has been divided by the standard deviation of the residuals. The values are thus the number of standard deviations they lie from the average.
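In code, following that simple definition (the residuals are made up):

    import numpy as np

    residuals = np.array([0.5, -1.2, 0.3, 2.9, -0.8])
    standardized = residuals / residuals.std(ddof=1)
    print(standardized.round(2))   # values beyond ~3 deserve a closer look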
The values on the four panels are a cautionary tale of what happens when one does not do a residuals analysis or, better yet, study the relationships between variables before starting a regression analysis. Developed by Yale statistician Frank Anscombe and published in The American Statistician 27(1): 17–21 (1973). Often called “Anscombe’s Quartet.”
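If you want to poke at the quartet yourself, seaborn ships a copy (the sketch below fetches it from seaborn’s bundled datasets):

    import seaborn as sns

    df = sns.load_dataset("anscombe")
    # Nearly identical means, variances, correlation, and fitted line in all
    # four panels--yet the scatter plots look completely different.
    print(df.groupby("dataset")[["x", "y"]].agg(["mean", "var"]).round(2))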
Trees themselves are maps of the inter-relationships of variables in what may be very complex models. Just because the assumptions we discussed for MLR are no longer a problem does not mean that the methods are foolproof.
The process is the same for both methods. For classification trees, the aim is to put all the “pink dots” in one box and the “green dots” in another. For regression trees, we want to minimize variation within each node. Both methods split parent nodes into two child nodes; other methods allow more splits on a node.

As we’ll see, significant effort is expended prior to submitting the dataset to the software. The effort is ideally the same expended before starting any statistical analysis: assuring that only valid data are included.

Over-fitting occurs in MLR when the analyst focuses only on maximizing the R^2 value. In CART, one can fit models so that each leaf node (the last split on a branch) has only one “color.” Such a tree might have one leaf node for each observation, and it would obviously have no predictive value, so we prune back far enough to eliminate the silliness.

Unlike MLR, in CART methods the original dataset is split into training, validation, and test sets. We build a tree with the first, prune it with the second, and see how well it works with the last (a minimal sketch follows below). There is no reason one wouldn’t want to consider the same process for MLR.
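Here is a minimal sketch of that grow/prune/test workflow in Python/scikit-learn. Note that scikit-learn prunes via a cost-complexity penalty (ccp_alpha) rather than an explicit prune step, so the validation set is used here to pick the penalty:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Split: train to grow the tree, validation to prune, test to judge.
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=1)

    # Candidate pruning strengths, then keep the one that scores best on validation.
    path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_train, y_train)
    best = max(path.ccp_alphas,
               key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=1)
                             .fit(X_train, y_train).score(X_val, y_val))

    final = DecisionTreeClassifier(ccp_alpha=best, random_state=1).fit(X_train, y_train)
    print("test accuracy:", round(final.score(X_test, y_test), 2))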
The dataset used in the first example is now a classic. It was developed by R. A. Fisher and published in the Annals of Eugenics (1936). Fifty plants from each of three species of irises--setosa, virginica, and versicolor--were included in the set. The object was to determine whether the lengths & widths of their petals could discriminate between the three species. We will use a classification tree in place of the much more complex method Fisher devised for the task. At the end, we will have defined the parameter settings that split the three species most successfully and will assess how good a job our model does at discriminating between them.
Black, red, and green dots tend to occupy space with little overlap, but note that some overlap does occur between red and green. One could eyeball where the splits might go (it’s pretty apparent for black and red) and write down rules to convey the info to others. However, most cases aren’t this obvious--nor is the placement of the horizontal line.
This is the classification tree for the iris data. At the top, the root node, notice that each species is one-third of the observations. The software looks at all possible ways of splitting the data for both potential predictors and determines that the purity of the split is maximized if a value of 2.45 cm for petal length is selected. The left daughter node contains only (and all of) the setosa observations. Purity of the right node has increased to 50%, from 33% for each of the other two species. The software repeats the recursive partitioning a second time and determines that splitting the right node at a petal width of 1.75 cm gives the cleanest split of the other two species. Unlike the first split, each of these daughter nodes contains observations from the minority species. Seven items are in the wrong nodes, or 4.6% of the original 150 items. The misclassification rate this represents is a commonly used metric for assessing model performance, sort of like an R^2, but not exactly. While 4.6% sounds good for this case, the boss wouldn’t be keen on your model for predicting mortgage defaults or whether an individual was likely to be a terrorist. The rules for determining into which species to place a future iris plant are in the yellow box. These are easily communicated to others.
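You can reproduce this two-split tree in a few lines of Python/scikit-learn; the printed rules should match the splits described above:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

    # Prints the split rules: petal length <= 2.45 cm, then petal width <= 1.75 cm.
    print(export_text(tree, feature_names=iris.feature_names))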