1. CHAID – CART – C4.5 and the others…
Ricco Rakotomalala
Tutoriels Tanagra - http://data-mining-tutorials.blogspot.fr/

2. Main issues of decision tree learning
Choosing the splitting criterion:
• Impurity-based criteria
• Information gain
• Statistical measures of association…
Binary or multiway splits:
• Multiway split: 1 value of the splitting attribute = 1 leaf
• Binary split: finding the best binary grouping
• Grouping only the leaves which are similar regarding the class distribution
Finding the right-sized tree:
• Pre-pruning
• Post-pruning
Other challenges: decision graphs, oblique trees, etc.

3. (No extractable text on this slide.)

4. Splitting criterion: main properties
• S1 (maximum): the leaves are homogeneous.
• S2 (minimum): the conditional distributions are the same in every leaf.
• S3 (intermediate situation): the leaves are more homogeneous regarding Y; X provides information about Y.

5. Splitting criterion: chi-square test statistic for independence and its variants
Contingency table: cross tabulation between Y (rows $y_1, \dots, y_K$) and X (columns $x_1, \dots, x_L$), with cell counts $n_{kl}$, margins $n_{k.}$ and $n_{.l}$, and grand total $n$.
Chi-square statistic (measure of association), comparing the observed and theoretical frequencies under the null hypothesis that Y and X are independent:
$$\chi^2 = \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{\left(n_{kl} - \frac{n_{k.}\, n_{.l}}{n}\right)^2}{\frac{n_{k.}\, n_{.l}}{n}}$$
Tschuprow's t, which allows comparing splits with a different number of leaves:
$$t^2 = \frac{\chi^2}{n \sqrt{(K-1)(L-1)}}$$
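As a concrete reading of these formulas, here is a minimal NumPy sketch (not part of the original slides; the function names are mine) that computes the chi-square statistic and Tschuprow's t² for a K × L contingency table:

```python
import numpy as np

def chi2_statistic(table):
    """Chi-square statistic for a K x L contingency table
    (classes of Y in rows, leaves produced by the split on X in columns)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # expected counts under independence: n_k. * n_.l / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return ((table - expected) ** 2 / expected).sum()

def tschuprow_t2(table):
    """Tschuprow's t^2 = chi2 / (n * sqrt((K-1)(L-1))), comparable across
    splits that produce a different number of leaves L."""
    table = np.asarray(table, dtype=float)
    K, L = table.shape
    n = table.sum()
    return chi2_statistic(table) / (n * np.sqrt((K - 1) * (L - 1)))

# toy split: 2 classes x 3 leaves
print(tschuprow_t2([[10, 0, 5],
                    [0, 10, 5]]))   # ~0.47
```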

6. Splitting criterion: "Improved" CHAID (SIPINA software)
Value of the measure on the three situations of slide 4:
• S1: 1.0
• S2: 0.0
• S3: 0.7746 (the measure ranges from 0.0 to 1.0)

7. Splitting criterion: information gain and gain ratio (C4.5)
Shannon entropy (measure of uncertainty):
$$E(Y) = -\sum_{k=1}^{K} \frac{n_{k.}}{n} \log_2 \frac{n_{k.}}{n}$$
Conditional entropy (expected entropy of Y knowing the values of X):
$$E(Y/X) = \sum_{l=1}^{L} \frac{n_{.l}}{n} \left( -\sum_{k=1}^{K} \frac{n_{kl}}{n_{.l}} \log_2 \frac{n_{kl}}{n_{.l}} \right)$$
Information gain (reduction of uncertainty):
$$G(Y/X) = E(Y) - E(Y/X)$$
(Information) gain ratio, which favors the splits with a low number of leaves:
$$GR(Y/X) = \frac{E(Y) - E(Y/X)}{E(X)}$$
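A small sketch of these formulas (my own helper names, assuming a K × L table of class counts per leaf):

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a vector of class counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(table):
    """Information gain G(Y/X) and gain ratio GR(Y/X) for a K x L table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    e_y = entropy(table.sum(axis=1))                      # E(Y)
    e_y_x = sum(table[:, l].sum() / n * entropy(table[:, l])
                for l in range(table.shape[1]))           # E(Y/X)
    e_x = entropy(table.sum(axis=0))                      # split entropy E(X)
    gain = e_y - e_y_x
    return gain, (gain / e_x if e_x > 0 else 0.0)
```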

8. Splitting criterion: C4.5 (SIPINA software)
Gain ratio on the three situations of slide 4:
• S1: 1.0
• S2: 0.0
• S3: 0.5750 (the measure ranges from 0.0 to 1.0)

9. Splitting criterion: Gini impurity (CART)
Gini index (measure of impurity):
$$I(Y) = 1 - \sum_{k=1}^{K} \left( \frac{n_{k.}}{n} \right)^2$$
Conditional impurity (average impurity of Y conditionally on X):
$$I(Y/X) = \sum_{l=1}^{L} \frac{n_{.l}}{n} \left( 1 - \sum_{k=1}^{K} \left( \frac{n_{kl}}{n_{.l}} \right)^2 \right)$$
Gain:
$$D(Y/X) = I(Y) - I(Y/X)$$
• Gini index viewed as an entropy (cf. Daroczy): D can be viewed as a kind of information gain.
• Gini index viewed as a variance for a categorical variable: CATANOVA (analysis of variance for categorical data), where D is the variance between groups.
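The same structure works for the Gini criterion; a minimal sketch (helper names are mine):

```python
import numpy as np

def gini(counts):
    """Gini impurity of a vector of class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def gini_gain(table):
    """Reduction in impurity D(Y/X) = I(Y) - I(Y/X) for a K x L table of
    class counts per leaf."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    i_y = gini(table.sum(axis=1))                         # I(Y)
    i_y_x = sum(table[:, l].sum() / n * gini(table[:, l])
                for l in range(table.shape[1]))           # I(Y/X)
    return i_y - i_y_x
```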

10. Splitting criterion: C&RT (Tanagra software)
Reduction in Gini impurity on the three situations of slide 4:
• S1: 0.5
• S2: 0.0
• S3: 0.3 (the measure ranges from 0.0 to 1.0)

11. Using an unbiased measure alleviates the data fragmentation problem
Splitting into 4 subsets using the X1 attribute:

Y / X1     A1   B1   C1   D1   Total
positive    2    3    6    3      14
negative    4    4    8    0      16
Total       6    7   14    3      30
Chi-square = 3.9796, Tschuprow's t² = 0.0766

Splitting into 3 subsets using the X2 attribute:

Y / X2     A2   B2   D2   Total
positive    2    9    3      14
negative    4   12    0      16
Total       6   21    3      30
Chi-square = 3.9796, Tschuprow's t² = 0.0938

X2 is better than X1: both splits give the same chi-square, but Tschuprow's t favors the split with fewer leaves.
• Tschuprow's t corrects the bias of the chi-square measure.
• The gain ratio corrects the bias of the information gain.
• The Gini reduction in impurity is biased in favor of variables with more levels (but the CART algorithm necessarily constructs a binary decision tree).
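The two tables above can be checked with scipy (a sketch; the printed values are rounded):

```python
import numpy as np
from scipy.stats import chi2_contingency

def tschuprow_t2(table):
    table = np.asarray(table, dtype=float)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    K, L = table.shape
    return chi2 / (table.sum() * np.sqrt((K - 1) * (L - 1)))

x1 = [[2, 3, 6, 3],    # positive counts in leaves A1..D1
      [4, 4, 8, 0]]    # negative counts
x2 = [[2, 9, 3],       # positive counts in leaves A2, B2, D2
      [4, 12, 0]]      # negative counts

print(tschuprow_t2(x1))   # ~0.0766 (4 leaves)
print(tschuprow_t2(x2))   # ~0.0938 (3 leaves): X2 is preferred
```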

12. (No extractable text on this slide.)

13. Multiway splitting (C4.5): 1 value (level) of the splitting attribute = 1 leaf
(Tree diagram: the root node of 241 instances is split on TYPELAIT, with one child per value: 2%MILK, NOMILK, POWDER, SKIM, WHOLEMILK.)
• The prediction rules are easy to read.
• Data fragmentation problem, especially for small datasets.
• "Large" decision tree with a high number of leaves.
• "Low depth" decision tree.

14. Binary splitting (CART): detecting the best combination into two subsets
(Tree diagram: the root node of 228 instances is split on TYPELAIT into {2%MILK, SKIM} vs. {NOMILK, WHOLEMILK, POWDER}. A sketch of the search for the best grouping follows below.)
• This grouping overcomes the bias of the splitting measure used.
• The data fragmentation problem is alleviated.
• "High depth" decision tree (CART uses a post-pruning process as a remedy).
• Merging into two groups is not always relevant!
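One way the best binary grouping could be found is a brute-force enumeration over the value subsets, scored here with the Gini gain (an illustration under my own assumptions, not CART's exact implementation; the helper names and the counts are mine):

```python
from itertools import combinations
import numpy as np

def gini(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_binary_grouping(table, values):
    """table: K x L array of class counts per attribute value; values: the L labels.
    Enumerates every grouping of the values into two subsets and returns the one
    with the largest reduction in Gini impurity."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    parent = gini(table.sum(axis=1))
    L = table.shape[1]
    best_split, best_gain = None, -np.inf
    for r in range(1, L):
        for left in combinations(range(1, L), r):   # value 0 always stays on the right
            right = [l for l in range(L) if l not in left]
            c_left = table[:, list(left)].sum(axis=1)
            c_right = table[:, right].sum(axis=1)
            gain = parent - (c_left.sum() / n * gini(c_left)
                             + c_right.sum() / n * gini(c_right))
            if gain > best_gain:
                best_split = ({values[l] for l in left}, {values[l] for l in right})
                best_gain = gain
    return best_split, best_gain

# toy class counts (rows: classes, columns: the 5 TYPELAIT values) -- made-up numbers
counts = [[10,  2, 1, 8,  5],
          [ 3, 12, 4, 2, 15]]
print(best_binary_grouping(counts, ["2%MILK", "NOMILK", "POWDER", "SKIM", "WHOLEMILK"]))
```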

15. Merging approach (CHAID): merging the leaves with similar class distributions
Principle: iterative, bottom-up merging of leaves whose class distributions are not significantly different.
• Alleviates the data fragmentation problem.
• Choosing the alpha level for merging is not obvious.
Example: testing whether the {NoMilk, Powder} leaf can be merged with the {WholeMilk} leaf.

           NoMilk, Powder   WholeMilk
High              5             16
Low               1              8
Normal            8             31
Total            14             55

$$\chi^2 = 14 \times 55 \left[ \frac{(5/14 - 16/55)^2}{5+16} + \frac{(1/14 - 8/55)^2}{1+8} + \frac{(8/14 - 31/55)^2}{8+31} \right] = 0.6309$$
p-value = 0.73 with $(3-1)\times(2-1) = 2$ degrees of freedom.
The two leaves are merged if (p-value > alpha level for merging).
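The numbers of this merging test can be reproduced with scipy (a sketch; the alpha level is an assumed value, not taken from the slides):

```python
from scipy.stats import chi2_contingency

# class counts (High, Low, Normal) in the two candidate leaves
table = [[5, 16],    # High:   {NoMilk, Powder} vs {WholeMilk}
         [1,  8],    # Low
         [8, 31]]    # Normal

chi2, p_value, dof, _ = chi2_contingency(table, correction=False)
print(round(chi2, 4), dof, round(p_value, 2))   # 0.6309, 2 degrees of freedom, p ~ 0.73

alpha_merge = 0.05   # assumed merge threshold (not from the slides)
if p_value > alpha_merge:
    print("Distributions not significantly different: merge the two leaves.")
```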

16. (No extractable text on this slide.)

17. Bias-variance tradeoff according to the tree complexity
• Bias ~ how powerful the model is.
• Variance ~ how sensitive the model is to the training set.
(Plot: error rate on the learning sample and on the test sample as a function of the number of leaves.)
• Underfitting: the tree is too small.
• Overfitting: the tree is too large.
• The "optimal" tree size lies between the two.
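The shape of these two curves can be reproduced on any dataset; a sketch with scikit-learn, where the synthetic dataset and the grid of tree sizes are arbitrary choices of mine:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for n_leaves in (2, 5, 10, 20, 50, 100, 200):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0).fit(X_tr, y_tr)
    print(n_leaves,
          round(1 - tree.score(X_tr, y_tr), 3),   # learning-sample error: keeps decreasing
          round(1 - tree.score(X_te, y_te), 3))   # test-sample error: rises again once the tree overfits
```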

18. Pre-pruning: stopping the growing process
Confidence and support criteria:
• Group purity: confidence threshold.
• Support criterion: minimum node size to split, minimum number of instances in the leaves.
+ Easy to understand and easy to use.
- Finding the right thresholds for a given problem is not obvious.
Statistical approach (CHAID):
• Chi-squared test for independence at each split.
- The right alpha level for the splitting test is very hard to determine.
But in practice this approach is often used because:
• the obtained tree reaches a good performance, and the region of "optimal" error rate is large;
• it is fast (the growing is stopped earlier, and there are no additional calculations for post-pruning);
• it is preferred at least in the exploratory phase.
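Support-style and purity-style pre-pruning thresholds map directly onto common tree implementations; a sketch with scikit-learn (which exposes support and impurity thresholds, but not CHAID's chi-squared stopping rule):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_split=20,        # support: do not split nodes with fewer than 20 instances
    min_samples_leaf=10,         # support: every leaf must keep at least 10 instances
    min_impurity_decrease=0.01,  # purity-style stop: ignore splits that gain too little
)
# tree.fit(X_train, y_train)     # hypothetical training data
```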

19. Post-pruning: an additional step to avoid over-dependence on the growing sample
Two steps for decision tree learning:
(1) Growing phase: maximizing the purity of the leaves.
(2) (Post-)pruning phase: minimizing the "true" error rate.
(Plot: error rate on the learning sample vs. the "true" error rate, as a function of the number of leaves.)
Key question: how to obtain a good estimation of the "true" error rate?
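One common concrete form is CART's cost-complexity post-pruning; a sketch with scikit-learn, where a held-out sample plays the role of the "true" error estimate (the dataset is a placeholder of mine):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_grow, X_prune, y_grow, y_prune = train_test_split(X, y, test_size=0.3, random_state=0)

# (1) growing phase: fully developed tree, then the candidate pruning levels
full_tree = DecisionTreeClassifier(random_state=0).fit(X_grow, y_grow)
alphas = full_tree.cost_complexity_pruning_path(X_grow, y_grow).ccp_alphas

# (2) post-pruning phase: keep the pruning level minimizing the error on the held-out sample
best_alpha = max(alphas, key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
                 .fit(X_grow, y_grow).score(X_prune, y_prune))
pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_grow, y_grow)
```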
