Comparative Review of Classification Trees


  1. Comparative Review of Classification Trees by Leonardo Auslender, leoldv12 ‘at’ gmail ‘dot’ com Independent Statistical Research Consultant 2013

  2. Contents: 1) Trees/CART: varieties, algorithm; 2) Model Deployment: scoring; 3) Examples; 4) Concluding Remarks: brains, the future. Review of Trees: Auslender, L. (1998), Alacart, Poor Man's Classification Trees, North Eastern SAS Users Group Conference.

  3. 1) Varieties of Tree Methods. A field guide to tree varieties: AID, THAID, CHAID, ID3, C4.5, C5.0, CART, Tree (S+, R).

  4. CART: Classification and Regression Trees. Source: Breiman L., Friedman J., Olshen R., Stone C.: Classification and Regression Trees, Wadsworth International Group, Belmont, CA, 1984.

  5. Aim: separate the two classes by using X1 and X2, producing more homogeneous rectangular regions.

  6. CART: underlying classification algorithm, using misclassification.

     Y   X1   X2   X3   X4
     0    1   10   21    1
     1    1   30    8    1
     0    2    0    8    0
     0    3   10    8    0

  Misclassification(Y | X1 <= 1) = .5; Misclassification(Y | X1 > 1) = 0. Repeat for every value of X1 and for every other X variable, then select the optimal variable and split (CART actually uses Gini). A minimal sketch of this split search follows.
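To make the split search concrete, here is a minimal Python sketch (an illustration, not the author's Alacart/SAS code) that reproduces the misclassification rates quoted above for the candidate split X1 <= 1 and then searches every value of every predictor in the toy table.

rows = [            # (Y, X1, X2, X3, X4) from the toy table above
    (0, 1, 10, 21, 1),
    (1, 1, 30,  8, 1),
    (0, 2,  0,  8, 0),
    (0, 3, 10,  8, 0),
]

def misclassification(ys):
    """Error rate when the node predicts its majority class."""
    if not ys:
        return 0.0
    p1 = sum(ys) / len(ys)        # proportion of 1s in the node
    return min(p1, 1 - p1)        # minority proportion = error rate

def evaluate_split(rows, var_index, cutoff):
    """Misclassification on each side of the split X<var_index> <= cutoff."""
    left  = [y for y, *xs in rows if xs[var_index - 1] <= cutoff]
    right = [y for y, *xs in rows if xs[var_index - 1] >  cutoff]
    return misclassification(left), misclassification(right)

print(evaluate_split(rows, 1, 1))     # X1 <= 1 vs. X1 > 1  ->  (0.5, 0.0)

# Exhaustive search over every (variable, cutoff) candidate:
for j in range(1, 5):
    for cut in sorted({r[j] for r in rows}):
        print(f"X{j} <= {cut}:", evaluate_split(rows, j, cut))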

  7. Basic CART algorithm, for a binary dependent variable or target (0,1). [Diagram: the original percentages of '0's and '1's of the dependent variable, and how they change on either side of a splitting point along the range of continuous variable X_i; percentages shown in the illustration: 70%, 50%, 20%.]

  8. Divide and conquer: recursive partitioning. [Tree diagram: root node with n = 5,000 and 10% HELOC, split on Debits < 19 (yes/no) into child nodes with n = 3,350 and 5% HELOC, and n = 1,650 and 21% HELOC.] A compact recursive-partitioning sketch follows.
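The divide-and-conquer idea can be written as a short recursive function. This is a minimal, self-contained Python sketch on made-up data (not the HELOC example above and not the author's implementation); it picks each split by Gini impurity and stops on small or pure nodes.

def gini(ys):
    """Gini impurity of a 0/1 node."""
    if not ys:
        return 0.0
    p = sum(ys) / len(ys)
    return 2 * p * (1 - p)

def best_split(X, y):
    """Best (feature, cutoff, weighted child impurity), or None."""
    best = None
    for j in range(len(X[0])):
        for cut in sorted({row[j] for row in X}):
            left  = [yi for row, yi in zip(X, y) if row[j] <= cut]
            right = [yi for row, yi in zip(X, y) if row[j] >  cut]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[2]:
                best = (j, cut, score)
    return best

def grow(X, y, depth=0, min_size=2):
    """Recursively partition, printing each node's size and event rate."""
    print("  " * depth + f"node: n={len(y)}, event rate={sum(y) / len(y):.0%}")
    if len(y) < min_size or gini(y) == 0:
        return
    split = best_split(X, y)
    if split is None:
        return
    j, cut, _ = split
    left  = [(row, yi) for row, yi in zip(X, y) if row[j] <= cut]
    right = [(row, yi) for row, yi in zip(X, y) if row[j] >  cut]
    grow([r for r, _ in left],  [yi for _, yi in left],  depth + 1, min_size)
    grow([r for r, _ in right], [yi for _, yi in right], depth + 1, min_size)

# Made-up data: two predictors, binary target.
X = [[10, 1], [15, 0], [25, 1], [30, 0], [40, 1], [45, 0]]
y = [0, 0, 1, 1, 1, 1]
grow(X, y)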

  9. Ideal SAS code to find splits:

  Proc summary data = …. Nway;
     class (all independent vars);
     var depvar;      /* this is 'target', 0/1 */
     output out = ….. Sum = ;
  run;

  For large data sets (large N, large NVAR), hardware and software constraints prevent completion. An illustrative translation of the same idea appears below.
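For intuition, the same idea, summing the 0/1 target over every combination of the predictors, can be sketched in Python with pandas. This is an illustrative translation, not the author's code; the column names are made up, and the same combinatorial-explosion caveat applies.

import pandas as pd

# Made-up miniature data set; 'target' is the 0/1 dependent variable.
df = pd.DataFrame({
    "x1": [1, 1, 2, 3],
    "x2": [10, 30, 0, 10],
    "target": [0, 1, 0, 0],
})

# Sum and count the target over every combination of predictor values,
# which is all the split search needs (same role as PROC SUMMARY / NWAY).
summary = (df.groupby(["x1", "x2"])["target"]
             .agg(events="sum", n="count")
             .reset_index())
print(summary)

# The number of cells grows with every added predictor and level, which is
# the hardware/software constraint noted on the slide.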

  10. Fitted decision tree: interpretation and structure. [Tree diagram: VAR A split at 19 (<19 vs. >= 19); VAR B split at 52 (0-52, 21%, vs. >52); VAR C split (0,1, 45%, vs. >1); terminal nodes with 5% and 25% event rates.]

  11. The cultivation of trees:
  • Split search: which splits are to be considered?
  • Splitting criterion: which split is best?
  • Stopping rule: when should the splitting stop?
  • Pruning rule: should some branches be lopped off?

  12. Possible splits to consider: the most common is binary, because the number of candidate splits grows very fast otherwise. [Chart: number of possible binary splits vs. number of input levels (1-20) for nominal and ordinal inputs; the nominal curve climbs into the hundreds of thousands.] If an ordinal input has 1,000 levels, there are 999 possible binary splits, 999 * 998 / 2 ternary splits, etc. A small counting sketch follows.
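A quick way to see why binary splits are the common choice is to count candidates. The sketch below uses the standard combinatorial counts (L - 1 binary cut points for an ordinal input with L levels, choose-2 for ternary, and 2^(L-1) - 1 subset splits for a nominal input); the 1,000-level figures match the slide.

from math import comb

def ordinal_binary_splits(L):
    """Choose 1 of the L - 1 possible cut points."""
    return L - 1

def ordinal_ternary_splits(L):
    """Choose 2 of the L - 1 possible cut points."""
    return comb(L - 1, 2)

def nominal_binary_splits(L):
    """Any non-empty subset vs. its complement: 2**(L - 1) - 1."""
    return 2 ** (L - 1) - 1

print(ordinal_binary_splits(1000))   # 999, as on the slide
print(ordinal_ternary_splits(1000))  # 999 * 998 / 2 = 498501
print(nominal_binary_splits(20))     # 524287: the steep 'Nominal' curve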

  13. Splitting criterion: Gini, twoing, misclassification, entropy, ...
  A) Minimize the Gini impurity criterion (favors node homogeneity).
  B) Maximize the twoing impurity criterion (favors class separation).
  Empirical results: for binary dependent variables, Gini and twoing are equivalent. For trinomial targets, Gini provides more accurate trees; beyond three categories, twoing performs better. A small sketch of both criteria follows.
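As a concrete reference, here is a small Python sketch of both criteria for one candidate binary split, following the usual CART definitions (Gini impurity decrease, and the twoing value p_L * p_R / 4 * (sum over classes of |p(j|left) - p(j|right)|)^2 as given in Breiman et al.); the class counts are made up.

def gini_impurity(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_decrease(left, right):
    """Impurity decrease of a binary split (criterion A: maximize this)."""
    parent = [l + r for l, r in zip(left, right)]
    n, nl, nr = sum(parent), sum(left), sum(right)
    return (gini_impurity(parent)
            - (nl / n) * gini_impurity(left)
            - (nr / n) * gini_impurity(right))

def twoing(left, right):
    """Twoing value of a binary split (criterion B: maximize this)."""
    nl, nr = sum(left), sum(right)
    n = nl + nr
    separation = sum(abs(l / nl - r / nr) for l, r in zip(left, right))
    return (nl / n) * (nr / n) / 4 * separation ** 2

left, right = [40, 10], [15, 35]   # made-up per-class counts in each child
print(gini_decrease(left, right))  # impurity decrease (about 0.125)
print(twoing(left, right))         # twoing value (about 0.0625)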

  14. The right-sized tree: stunting (stopping rules) vs. pruning (grow large, then cut back).


  18. Benefits of trees:
  • Interpretability: tree-structured presentation.
  • Mixed measurement scales: nominal, ordinal, interval; regression trees.
  • Robustness.
  • Missing values.

  19. …Benefits (cont.): automatically detects interactions (as in AID) through a hierarchical conditioning search, not 'a la' regression analysis; automatically selects input variables. [Figure: the fitted model is a multivariate step function of the inputs.]

  20. Drawbacks of trees:
  • Unstable: small perturbations in the data can lead to big changes in the tree.
  • Linear structures are approximated only in very rough form.
  • Applications may require that rule descriptions for different categories not share the same attributes.
  • It is a conditional structure, and interpretations often misunderstand the conditioning effect.

  21. Drawbacks of trees (cont.):
  • Tends to over-fit, leading to overly optimistic accuracy.
  • Large trees are very difficult to interpret.
  • Tree size is conditioned by the data set size.
  • No valid inferential procedures at present (if it matters).
  • Greedy search algorithm.

  22. Note on missing values.
  1) Missingness NOT in Y (see Wang and Sheng, 2007, JMLR, for a semi-supervised method for missing Y).
  2) Different methods of imputation:
     a) C4.5 probabilistic split: observations with missing values are attached to the child nodes with weights equal to the proportions of non-missing values.
     b) Complete case: eliminate all observations with missing values, and train.
     c) Grand mode/mean: impute the mode if categorical, the mean if continuous.
     d) Separate class: appropriate for categorical variables. For continuous ones, create an extremely large value and thus separate missings from non-missings.
     e) Complete variable case: delete all variables with missing values.
     f) Surrogate (CART default): use surrogate variable(s) whenever the variable is missing. At testing or scoring time, if the variable is missing, use the surrogate(s).
  A sketch of options b) through d) follows.
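The sketch below illustrates options b) complete case, c) grand mode/mean, and d) separate class in Python with pandas, on made-up data; the C4.5 probabilistic split and the CART surrogate options live inside the tree algorithms themselves and are not shown.

import numpy as np
import pandas as pd

# Made-up data with one continuous and one categorical input.
df = pd.DataFrame({
    "income": [50.0, np.nan, 72.0, 61.0],   # continuous
    "region": ["N", "S", None, "S"],        # categorical
})

# b) Complete case: drop every observation with any missing value.
complete_case = df.dropna()

# c) Grand mode / mean: mean for continuous, mode for categorical.
mode_mean = df.copy()
mode_mean["income"] = mode_mean["income"].fillna(mode_mean["income"].mean())
mode_mean["region"] = mode_mean["region"].fillna(mode_mean["region"].mode()[0])

# d) Separate class: an explicit category for categoricals; an extreme
#    sentinel value for continuous inputs, so missings split off cleanly.
separate = df.copy()
separate["region"] = separate["region"].fillna("MISSING")
separate["income"] = separate["income"].fillna(1e12)

print(complete_case, mode_mean, separate, sep="\n\n")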

  23. Tree derivative: random forests (Breiman, 1999). Random forests proceed in the following steps; notice that there is no need to create separate training, validation, and test data sets:
  1. Take a random sample of N observations with replacement ("bagging") from the data set. On average, this selects about 2/3 of the rows; the remaining 1/3 are the "out of bag" (OOB) observations. A new random selection is performed for each tree constructed.
  2. Using the observations selected in step 1, construct a decision tree to its maximum size, without pruning. As the tree is built, allow only a subset of the predictor variables to be considered as possible splitters for each node, selecting that subset at random from the available predictors. For example, if there are ten predictors, choose five of them randomly as candidate splitters, and perform a new random selection for each split. Some predictors (possibly the best one) will not be considered for a given split, but a predictor excluded from one split may be used for another split in the same tree.
  A small sketch of the sampling in step 1 and the per-split predictor subset in step 2 follows.
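A small Python/NumPy sketch of the two random ingredients described above: bootstrap row sampling, which leaves roughly 1/3 of rows out of bag because (1 - 1/N)^N approaches 1/e (about 0.37), and a per-node random subset of candidate predictors. The sample size and predictor count are made up.

import numpy as np

rng = np.random.default_rng(0)
N, n_predictors = 10_000, 10

# Step 1: bootstrap sample of N row indices, drawn with replacement.
in_bag = rng.integers(0, N, size=N)
oob = np.setdiff1d(np.arange(N), in_bag)   # rows never drawn = out of bag
print(f"unique in-bag fraction: {len(set(in_bag)) / N:.3f}")   # ~0.632
print(f"out-of-bag fraction:    {len(oob) / N:.3f}")           # ~0.368

# Step 2: at each node, consider only a random subset of the predictors,
# e.g. 5 of 10, re-drawn independently for every split.
candidate_splitters = rng.choice(n_predictors, size=5, replace=False)
print("candidate splitters for this node:", sorted(candidate_splitters))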

  24. No overfitting or pruning. The "over-fitting" problem appears in large, single-tree models, where the model fits noise in the data and therefore generalizes poorly; that is the basis for pruning those models. In nearly all cases, decision tree forests do not have a problem with over-fitting, and there is no need to prune the trees in the forest. Generally, the more trees in the forest, the better the fit.
  Internal measure of test set (generalization) error. About 1/3 of the observations are excluded from each tree in the forest; these are its "out of bag" (OOB) observations. Each tree has a different OOB set, so each OOB set constitutes an independent test sample. To measure the generalization error of a decision tree forest, the OOB set for each tree is run through that tree and the prediction error rate is computed; the error rates for the trees in the forest are then averaged to obtain the overall generalization error rate for the forest model. There are two advantages to computing the generalization error this way: (1) all observations are used to construct the model, and none have to be held back as a separate test set; (2) the testing is fast, because only one forest has to be constructed (as compared to V-fold cross-validation, where additional trees have to be constructed). An OOB sketch follows.
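One convenient way to see the OOB estimate in practice is scikit-learn's random forest with oob_score=True, sketched below on synthetic data. This is not the software used in the presentation, and scikit-learn aggregates OOB predictions per observation rather than averaging per-tree error rates, but the idea is the same.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real modeling data set.
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,     # more trees generally do not hurt generalization
    max_features="sqrt",  # random predictor subset at each split
    bootstrap=True,       # bootstrap rows for each tree
    oob_score=True,       # score each row only with trees that did not see it
    random_state=0,
).fit(X, y)

print("OOB accuracy estimate:", round(forest.oob_score_, 3))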

  25. 2) Scoring: the workhorse of database marketing. Model deployment.

  26. Scoring recipe:
  • Model (formula) → scoring code → scored data.
  • Data modifications (derived inputs, variable transformations, missing value imputation) → must reproduce the original computation algorithm.

  27. Scoring recipe: example of scoring output generated by Alacart.

  /* PROGRAM ALGOR8.PGM WITH 8 FINAL NODES */
  /* METHOD MISSCL ALACART TEST */
  RETAIN ROOT 1;
  IF ROOT & CURRDUE <= 105.38 & PASTDUE <= 90.36 & CURRDUE <= 12 THEN DO;
     NODE = '4_1 ';
     PRED = 0 ;
     /* % NODE IMPURITY = 0.0399 ; */
     /* BRANCH # = 1 ; */
     /* NODE FREQ = 81 ; */
  END;
  ELSE IF ROOT & CURRDUE <= 105.38 & PASTDUE <= 90.36 & CURRDUE > 12 THEN DO;
     NODE = '4_2 ';
     PRED = 1 ;
     /* % NODE IMPURITY = 0.4478 ; */
     /* BRANCH # = 2 ; */
     /* NODE FREQ = 212 ; */
  END;
  ELSE IF ROOT & CURRDUE <= 105.38 & PASTDUE > 90.36 THEN DO;
     NODE = '3_2 ';
     PRED = 0 ;

  28. Scorability: training data → classifier → scoring code applied to a new case. [Scatter plot of the training data in the (X1, X2) plane with the tree's rectangular partition.] Tree scoring code: if x1 < .47 and x2 < .18, or x1 > .47 and x2 > .29, then red. A scoring-function sketch follows.
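The rule above is exactly the kind of scoring code a tree exports. A tiny Python sketch, using the threshold values shown on the slide (.47, .18, .29):

def score(x1: float, x2: float) -> str:
    """Apply the rule read off the fitted tree to a new case."""
    if (x1 < 0.47 and x2 < 0.18) or (x1 > 0.47 and x2 > 0.29):
        return "red"
    return "not red"

print(score(0.30, 0.10))   # falls in a 'red' rectangle
print(score(0.60, 0.20))   # does not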

