CHAID – CART – C4.5 and the others…
Ricco RAKOTOMALALA
Tutoriels Tanagra - http://data-mining-tutorials.blogspot.fr/
Main issues of decision tree learning

Choosing the splitting criterion
• Impurity-based criteria
• Information gain
• Statistical measures of association…

Binary or multiway splits
• Multiway split: 1 value of the splitting attribute = 1 leaf
• Binary split: finding the best binary grouping
• Or grouping only the leaves which are similar regarding the class distribution

Finding the right-sized tree
• Pre-pruning
• Post-pruning

Other challenges: decision graphs, oblique trees, etc.
Splitting criterion: main properties
S1: maximum. The leaves are homogeneous.
S2: minimum. The conditional distributions are the same.
S3: intermediate situation. The leaves are more homogeneous regarding Y; X provides information about Y.
Splitting criterion: chi-square test statistic for independence and its variants

Cross tabulation between Y and X: the K x L contingency table has cells $n_{kl}$ (instances with $Y = y_k$ and $X = x_l$), row totals $n_{k.}$, column totals $n_{.l}$ and grand total $n$.

Chi-square statistic, comparing the observed and the theoretical frequencies (under the null hypothesis: Y and X are independent):
$$\chi^2 = \sum_{k=1}^{K}\sum_{l=1}^{L} \frac{\left(n_{kl} - \frac{n_{k.}\, n_{.l}}{n}\right)^2}{\frac{n_{k.}\, n_{.l}}{n}}$$

Tschuprow's t, a measure of association which allows comparing splits with different numbers of leaves:
$$t^2 = \frac{\chi^2}{n\sqrt{(K-1)(L-1)}}$$
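As an illustration of these formulas (not code from SIPINA or Tanagra), here is a minimal Python sketch computing the chi-square statistic and the squared Tschuprow's t from a contingency table; the function names are my own and zero margins are not handled.

```python
import numpy as np

def chi2_stat(table):
    """Chi-square statistic of a K x L contingency table
    (rows = classes of Y, columns = values of the splitting attribute X)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # theoretical frequencies under independence: n_k. * n_.l / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return ((table - expected) ** 2 / expected).sum()

def tschuprow_t2(table):
    """Squared Tschuprow's t: chi-square normalized by n and by the table
    dimensions, so splits with different numbers of leaves are comparable."""
    table = np.asarray(table, dtype=float)
    K, L = table.shape
    return chi2_stat(table) / (table.sum() * np.sqrt((K - 1) * (L - 1)))
```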
Splitting criterion: values of the “Improved” CHAID measure (SIPINA software) in the three example situations. S1: 1.0, S2: 0.0, S3: 0.7746.
Splitting criterion: information gain and gain ratio (C4.5)

Shannon entropy, a measure of uncertainty:
$$E(Y) = -\sum_{k=1}^{K} \frac{n_{k.}}{n} \log_2 \frac{n_{k.}}{n}$$

Conditional entropy, the expected entropy of Y knowing the values of X:
$$E(Y/X) = -\sum_{l=1}^{L} \frac{n_{.l}}{n} \sum_{k=1}^{K} \frac{n_{kl}}{n_{.l}} \log_2 \frac{n_{kl}}{n_{.l}}$$

Information gain, the reduction of uncertainty:
$$G(Y/X) = E(Y) - E(Y/X)$$

Gain ratio, which favors splits with a low number of leaves:
$$GR(Y/X) = \frac{E(Y) - E(Y/X)}{E(X)}$$
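A minimal Python sketch of the same quantities (illustrative only, not the C4.5 source; `entropy` and `gain_ratio` are hypothetical helper names):

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a vector of class counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(table):
    """Information gain G(Y/X) and gain ratio GR(Y/X) from a K x L
    contingency table (rows = classes of Y, columns = leaves induced by X)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    col_totals = table.sum(axis=0)
    e_y = entropy(table.sum(axis=1))                    # E(Y)
    e_y_x = sum((nl / n) * entropy(table[:, l])         # E(Y/X)
                for l, nl in enumerate(col_totals))
    gain = e_y - e_y_x                                  # G(Y/X)
    e_x = entropy(col_totals)                           # E(X), split entropy
    return gain, (gain / e_x if e_x > 0 else 0.0)       # GR(Y/X)
```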
Splitting criterion: values of the C4.5 gain ratio (SIPINA software) in the three example situations. S1: 1.0, S2: 0.0, S3: 0.5750.
Splitting criterion: Gini impurity (CART)

Gini index, a measure of impurity:
$$I(Y) = 1 - \sum_{k=1}^{K} \left(\frac{n_{k.}}{n}\right)^2$$

Conditional impurity, the average impurity of Y conditionally to X:
$$I(Y/X) = \sum_{l=1}^{L} \frac{n_{.l}}{n} \left(1 - \sum_{k=1}^{K} \left(\frac{n_{kl}}{n_{.l}}\right)^2\right)$$

Gain:
$$D(Y/X) = I(Y) - I(Y/X)$$

The Gini index can be viewed as an entropy (cf. Daroczy): D is then a kind of information gain. It can also be viewed as a variance for a categorical variable (CATANOVA, analysis of variance for categorical data): D is then the between-groups variance.
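The same kind of sketch for the CART criterion (again illustrative, with assumed helper names, not the actual CART or Tanagra code):

```python
import numpy as np

def gini(counts):
    """Gini impurity of a vector of class counts: 1 - sum of squared proportions."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - (p ** 2).sum()

def gini_gain(table):
    """Reduction in impurity D(Y/X) = I(Y) - I(Y/X) from a K x L contingency table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    before = gini(table.sum(axis=1))                          # I(Y)
    after = sum((table[:, l].sum() / n) * gini(table[:, l])   # I(Y/X)
                for l in range(table.shape[1]))
    return before - after                                     # D(Y/X)
```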
Splitting criterion: values of the Gini-based criterion for C&RT (Tanagra software) in the three example situations. S1: 0.5, S2: 0.0, S3: 0.3.
Using an unbiased measure alleviates the data fragmentation problem.

Splitting into 4 subsets using the X1 attribute:

Y / X1      A1   B1   C1   D1   Total
positive     2    3    6    3      14
negative     4    4    8    0      16
Total        6    7   14    3      30

Chi-square = 3.9796 ; Tschuprow's t² = 0.0766

Splitting into 3 subsets using the X2 attribute:

Y / X2      A2   B2   D2   Total
positive     2    9    3      14
negative     4   12    0      16
Total        6   21    3      30

Chi-square = 3.9796 ; Tschuprow's t² = 0.0938

The chi-square statistic is the same for both splits, but Tschuprow's t² is higher for X2: X2 is better than X1.

• Tschuprow's t corrects the bias of the chi-square measure
• The gain ratio corrects the bias of the information gain
• The Gini reduction in impurity is biased in favor of variables with more levels (but the CART algorithm necessarily constructs a binary decision tree)
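The figures of this example can be reproduced with the `chi2_stat` and `tschuprow_t2` sketches given earlier (assuming those helper functions are available):

```python
x1 = [[2, 3, 6, 3],    # positive, X1 splits into 4 subsets
      [4, 4, 8, 0]]    # negative
x2 = [[2, 9, 3],       # positive, X2 splits into 3 subsets
      [4, 12, 0]]      # negative

print(chi2_stat(x1), tschuprow_t2(x1))   # ~3.9796, ~0.0766
print(chi2_stat(x2), tschuprow_t2(x2))   # ~3.9796, ~0.0938 -> X2 is preferred
```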
Multiway splitting (C4.5)
One value (level) of the splitting attribute = one leaf in the splitting process.
(Example tree: splitting on TYPELAIT creates one leaf per value: {2%MILK}, {NOMILK}, {POWDER}, {SKIM}, {WHOLEMILK}.)
• The prediction rules are easy to read
• Data fragmentation problem, especially for small datasets
• “Large” decision tree, with a high number of leaves
• “Low depth” decision tree
Binary splitting (CART)
Detecting the best combination of the attribute values into two subsets.
(Example tree: TYPELAIT is split into {2%MILK, SKIM} vs. {NOMILK, WHOLEMILK, POWDER}.)
• This grouping overcomes the bias of the splitting measure used
• The data fragmentation problem is alleviated
• “High depth” decision tree (CART uses a post-pruning process as a remedy)
• Merging into two groups is not always relevant!
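A naive way to find the best binary grouping is to enumerate every partition of the attribute values into two subsets and score each one, for instance with the Gini gain sketched earlier. The following Python sketch (illustrative only, exponential in the number of values, and assuming the `gini_gain` helper defined above) is not the CART implementation itself:

```python
from itertools import combinations
import numpy as np

def best_binary_split(table, values):
    """Exhaustive search of the best grouping of the L columns of a K x L
    contingency table into two subsets. values: the L category labels."""
    table = np.asarray(table, dtype=float)
    L = table.shape[1]
    best_groups, best_score = None, -1.0
    # Enumerate the 2^(L-1) - 1 distinct binary partitions by always keeping
    # column 0 in the left group, to avoid counting each partition twice.
    for size in range(1, L):
        for rest in combinations(range(1, L), size - 1):
            left = (0,) + rest
            right = tuple(j for j in range(L) if j not in left)
            merged = np.column_stack([table[:, list(left)].sum(axis=1),
                                      table[:, list(right)].sum(axis=1)])
            score = gini_gain(merged)
            if score > best_score:
                best_score = score
                best_groups = ([values[j] for j in left],
                               [values[j] for j in right])
    return best_groups, best_score
```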
Merging approach (CHAID)
Merging the leaves that are similar according to the class distribution (bottom-up strategy).
Principle: iterative merging of leaves whose distributions are not significantly different.
• Alleviates the data fragmentation problem
• Choosing the alpha level for merging is not obvious

Example: should the leaves {NoMilk, Powder} and {WholeMilk} of the TYPELAIT split be merged?

          NoMilk, Powder   WholeMilk
High             5             16
Low              1              8
Normal           8             31
Total           14             55

$$\chi^2 = 14 \times 55 \times \left[ \frac{(5/14 - 16/55)^2}{5 + 16} + \frac{(1/14 - 8/55)^2}{1 + 8} + \frac{(8/14 - 31/55)^2}{8 + 31} \right] = 0.6309$$

$$\text{p-value} = P\left(\chi^2_{[(3-1)\times(2-1)]} \geq 0.6309\right) = 0.73$$

The two leaves are merged if (p-value > alpha level for merging).
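The p-value of this example can be checked with a standard chi-square test on the 3 x 2 table formed by the two candidate leaves (a sketch using scipy, not the CHAID implementation itself; the alpha level below is an arbitrary illustration):

```python
import numpy as np
from scipy.stats import chi2

# Class distributions of the two candidate leaves (High / Low / Normal)
table = np.array([[5, 16],
                  [1,  8],
                  [8, 31]], dtype=float)
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
stat = ((table - expected) ** 2 / expected).sum()     # ~0.6309
df = (table.shape[0] - 1) * (table.shape[1] - 1)      # (3-1) x (2-1) = 2
p_value = chi2.sf(stat, df)                           # ~0.73

ALPHA_MERGE = 0.05   # hypothetical alpha level for merging
if p_value > ALPHA_MERGE:
    print("Distributions not significantly different: merge the two leaves")
```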
Bias-variance tradeoff, according to the tree complexity
Bias ~ how powerful the model is
Variance ~ how sensitive the model is to the training set
(Figure: error rate on the learning sample and on the test sample as a function of the number of leaves, from 0 to about 250.)
Underfitting (the tree is too small) on the left, overfitting (the tree is too large) on the right; the “optimal” tree size lies in between.
Pre-pruning: stopping the growing process

Confidence and support criteria
• Group purity: confidence threshold
• Support criteria: minimum size of a node to split, minimum number of instances in the leaves
+ Easy to understand and easy to use
- Finding the right thresholds for a given problem is not obvious

Statistical approach (CHAID)
• Chi-squared test for independence at each split
- The right alpha level for the splitting test is very hard to determine

But in practice, this approach is often used because:
• the obtained tree reaches a good performance; the region of the “optimal” error rate is large
• it is fast (growing is stopped earlier, no additional computation for post-pruning)
• it is preferred, at least in the exploratory phase
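A hedged sketch of how such stopping criteria might be combined (the threshold values are arbitrary defaults for illustration, not those of CHAID, SIPINA or Tanagra):

```python
def stop_growing(class_counts, candidate_leaf_sizes,
                 min_size_to_split=10, min_leaf_size=5, purity_threshold=0.95):
    """Return True if the current node should not be split further.
    class_counts: class distribution of the current node.
    candidate_leaf_sizes: sizes of the leaves the best candidate split would create."""
    n = sum(class_counts)
    if n < min_size_to_split:                          # support: node too small to split
        return True
    if max(class_counts) / n >= purity_threshold:      # confidence: node pure enough
        return True
    if min(candidate_leaf_sizes) < min_leaf_size:      # support: a leaf would be too small
        return True
    return False
```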
Post-pruning: an additional step to avoid over-dependence on the growing sample

Two steps for decision tree learning:
(1) Growing phase: maximize the purity of the leaves
(2) (Post-)pruning phase: minimize the “true” error rate

(Figure: error rate on the learning sample (“Apprentissage”) and the “true” error rate (“Vraie erreur”) as a function of the number of leaves.)

The key question: how to obtain a good estimation of the “true” error rate?
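One possible answer, sketched here with scikit-learn's cost-complexity pruning purely as an illustration (not necessarily the method used by the algorithms discussed in these slides), is to hold part of the learning sample out of the growing phase and use it to estimate the error of each pruned subtree; the toy data below only makes the sketch self-contained.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy learning sample (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.7, size=600) > 0).astype(int)

# Hold a pruning sample out of the learning sample.
X_grow, X_prune, y_grow, y_prune = train_test_split(X, y, test_size=1/3, random_state=0)

# (1) Growing phase: a large, fully developed tree.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_grow, y_grow)

# (2) Pruning phase: among a sequence of pruned subtrees, keep the one with the
#     best accuracy on the pruning sample (an estimate of the "true" error rate).
alphas = full_tree.cost_complexity_pruning_path(X_grow, y_grow).ccp_alphas
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_grow, y_grow)
     for a in alphas),
    key=lambda t: t.score(X_prune, y_prune),
)
print(best.get_n_leaves(), best.score(X_prune, y_prune))
```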