Chapter 5. Tree-based Methods
Wei Pan, Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455
Email: weip@biostat.umn.edu
PubH 7475/8475 © Wei Pan
Regression And Classification Tree (CART)
◮ § 9.2; Breiman et al (1984); roughly comparable to C4.5 (Quinlan 1993).
◮ Main idea: approximate any f(x) by a piece-wise constant f̂(x).
◮ Use recursive partitioning (Fig 9.2): 1) partition the x space into two regions R1 and R2 by x_j < c_j; 2) partition R1 and R2; 3) then their sub-regions, ... until the model fits the data well.
◮ f̂(x) = Σ_m c_m I(x ∈ R_m); it can be represented as a (decision) tree.
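A minimal R sketch (simulated data, not from the course code) of the piece-wise constant idea: every observation falling into the same region R_m of a fitted tree receives the same predicted constant c_m.

```r
## Piece-wise constant fit f_hat(x) = sum_m c_m I(x in R_m), illustrated with rpart.
library(rpart)

set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- 2 * (x1 < 0.5) + 3 * (x1 >= 0.5 & x2 < 0.3) + rnorm(n, sd = 0.2)
dat <- data.frame(y, x1, x2)

fit  <- rpart(y ~ x1 + x2, data = dat)
pred <- predict(fit, dat)

## Only a few distinct predicted values, one constant per terminal node (leaf):
table(round(pred, 2))
```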
[Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 9]
FIGURE 9.2. Partitions and CART. Top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. Top left panel shows a general partition that cannot be obtained from recursive binary splitting.
Regression Tree
◮ Y: continuous.
◮ Key: 1) determine the splitting variables and split points (e.g. x_j < s), giving R1, R2, ...; 2) determine c_m in each R_m.
◮ In 1), use a sequential or greedy search: for each j and s, define R1(j, s) = {x | x_j < s} and R2(j, s) = {x | x_j ≥ s}, then solve
min_{j,s} [ min_{c1} Σ_{X_i ∈ R1(j,s)} (Y_i − c1)² + min_{c2} Σ_{X_i ∈ R2(j,s)} (Y_i − c2)² ].
◮ In 2), given R1 and R2, ĉ_k = Ave(Y_i | X_i ∈ R_k) for k = 1, 2.
◮ Repeat the process on R1 and R2 respectively, ...
◮ When to stop? Have to stop when the Y_i's in R_m are all equal or too few; the tree size determines the model complexity! A sketch of one greedy split is given below.
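A small R sketch of one step of the greedy search for a single predictor: scan candidate split points s, let c1 and c2 be the region means, and keep the s minimizing the summed residual sums of squares (function and variable names here are illustrative, not from ex5.1.r).

```r
## One greedy split on a single predictor x: minimize RSS(R1) + RSS(R2) over s.
best_split <- function(x, y) {
  ss   <- sort(unique(x))
  cand <- (head(ss, -1) + tail(ss, -1)) / 2   # midpoints between adjacent x values
  rss  <- sapply(cand, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(split = cand[which.min(rss)], rss = min(rss))
}

set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.4, 1, 3) + rnorm(100, sd = 0.3)
best_split(x, y)   # estimated split point should be close to 0.4
```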
◮ A strategy: first grow a large tree, then prune it.
◮ Cost-complexity criterion for tree T:
C_α(T) = RSS(T) + α|T| = Σ_m Σ_{X_i ∈ R_m} (Y_i − ĉ_m)² + α|T|,
where |T| is the # of terminal nodes (leaves) and α > 0 is a tuning parameter to be determined by CV.
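A hedged R sketch (simulated data) of the grow-then-prune strategy in rpart: the complexity parameter `cp` plays the role of α (rescaled by the root node error), and the `xerror` column of the cp table is the cross-validated error used to pick the tree size.

```r
## Grow a large tree, then prune it back using the cross-validated cp table.
library(rpart)

set.seed(1)
n  <- 500
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + 2 * (x1 < 0.3) - 1.5 * (x2 > 0.6) + rnorm(n, sd = 0.5)
dat <- data.frame(y, x1, x2)

big <- rpart(y ~ x1 + x2, data = dat,
             control = rpart.control(cp = 0.001, minsplit = 10))

printcp(big)   # xerror = 10-fold CV error at each value of cp (i.e., each alpha)
best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned  <- prune(big, cp = best_cp)

plot(pruned); text(pruned)   # the pruned tree
```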
[Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 9]
FIGURE 9.4. Results for spam example: misclassification rate vs. tree size. The blue curve is the 10-fold cross-validation estimate of misclassification rate as a function of tree size, with standard error bars. The minimum occurs at a tree size with about 17 terminal nodes (using the "one-standard-error" rule). The orange curve is the test error, which tracks the CV error quite closely. The cross-validation is indexed by values of α, shown above. The tree sizes shown below refer to |T_α|, the size of the original tree indexed by α.
Classification Tree
◮ Y_i ∈ {1, 2, ..., K}.
◮ Classify obs's in node m to the majority class: p̂_mk = Σ_{X_i ∈ R_m} I(Y_i = k)/n_m, k(m) = arg max_k p̂_mk.
◮ Impurity measure Q_m(T) (the squared error was used in regression trees):
1. Misclassification error: (1/n_m) Σ_{X_i ∈ R_m} I(Y_i ≠ k(m)) = 1 − p̂_{m,k(m)}.
2. Gini index: Σ_{k=1}^K p̂_mk (1 − p̂_mk).
3. Cross-entropy or deviance: −Σ_{k=1}^K p̂_mk log p̂_mk.
◮ For K = 2, 1-3 reduce to 1 − max(p̂, 1 − p̂), 2p̂(1 − p̂), and −p̂ log p̂ − (1 − p̂) log(1 − p̂). They look similar; see Fig 9.3.
◮ Example: ex5.1.r (a stand-in sketch is given below).
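Since ex5.1.r itself is not reproduced here, the following is a small stand-in sketch: fitting a classification tree with rpart (which splits on the Gini index by default) on the built-in iris data, and extracting the node class probabilities p̂_mk and majority-class labels k(m).

```r
## Classification tree on iris; Gini index is rpart's default split criterion.
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")
plot(fit); text(fit)

head(predict(fit, type = "prob"))    # estimated class probabilities p_mk per observation's leaf
head(predict(fit, type = "class"))   # majority-class label k(m)

## Resubstitution confusion table:
table(predicted = predict(fit, type = "class"), truth = iris$Species)

## To split on cross-entropy (information) instead of Gini:
fit2 <- rpart(Species ~ ., data = iris, method = "class",
              parms = list(split = "information"))
```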
◮ Advantages:
1. Easy to incorporate unequal losses of misclassifications: (1/n_m) Σ_{X_i ∈ R_m} w_i I(Y_i ≠ k(m)) with w_i = C_k if Y_i = k; see the rpart sketch below.
2. Handling missing data: use a surrogate splitting variable/value at each node (to best approximate the selected one).
◮ Extensions: 1. May use non-binary splits; 2. A linear combination of multiple variables as a splitting variable. More flexible, but better?
◮ +: easy interpretation, i.e. decision trees! −: unstable due to the greedy search and discontinuity; prediction performance is not the best.
◮ R packages tree, rpart; commercial CART.
◮ Other implementations: C4.5/C5.0; FIRM by Prof Hawkins (U of M): to detect interactions; programs by Prof Loh's group (UW-Madison): for count, survival, ... data; regression in each terminal node; ...
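A hedged sketch of advantage 1 in rpart: a loss matrix passed through `parms = list(loss = ...)` makes one kind of misclassification more expensive than the other. The 5:1 penalty below is arbitrary, purely for illustration.

```r
## Unequal misclassification losses via rpart's loss matrix on the built-in kyphosis data.
## Rows = true class (absent, present); columns = predicted class; off-diagonals are costs.
library(rpart)

L <- matrix(c(0, 1,    # truth = absent : predicting "present" costs 1
              5, 0),   # truth = present: predicting "absent"  costs 5
            nrow = 2, byrow = TRUE)

fit_eq <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")
fit_wt <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                parms = list(loss = L))

## The weighted tree trades overall accuracy for fewer of the costly errors:
table(predicted = predict(fit_eq, type = "class"), truth = kyphosis$Kyphosis)
table(predicted = predict(fit_wt, type = "class"), truth = kyphosis$Kyphosis)
```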
[Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 9]
FIGURE 9.5. The pruned tree for the spam example. Nodes split on variables such as ch$, remove, hp, ch!, george, CAPAVE, free, CAPMAX, business, receive, edu, our and 1999; each node is labeled email or spam with its misclassification counts.
[Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 9]
FIGURE 9.6. ROC curves (sensitivity vs. specificity) for the classification rules fit to the spam data. Curves that are closer to the northeast corner represent better classifiers. In this case the GAM classifier dominates the trees. The weighted tree achieves better sensitivity for higher specificity than the unweighted tree. The numbers in the legend represent the area under the curve: Tree 0.95, GAM 0.98, Weighted Tree 0.90.
Application: personalized medicine
◮ Also called subgroup analysis (or Precision Medicine): to identify subgroups of patients that would benefit most from a treatment.
◮ Statistical problem: detect (qualitative) trt-predictor interactions! Quantitative interactions: effects differ in magnitude but in the same direction; qualitative interactions: effects differ in direction.
◮ Many approaches ... one of them is to use trees; a simulated sketch of the idea follows.
◮ Prof Loh's GUIDE: http://www.stat.wisc.edu/~loh/guide.html
◮ An example: http://onlinelibrary.wiley.com/doi/10.1002/sim.6454/abstract
◮ Another example: https://www.ncbi.nlm.nih.gov/pubmed/24983709
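A hedged, simulated R sketch of the idea (not GUIDE itself): with a qualitative trt-by-x interaction, a tree allowed to split on both the treatment indicator and the covariate tends to recover subgroups in which the treatment effect has opposite signs. The data-generating model and cutoff 0.5 are assumptions for illustration only.

```r
## Simulated qualitative trt-by-x interaction: treatment helps when x < 0.5, harms otherwise.
library(rpart)

set.seed(1)
n   <- 1000
x   <- runif(n)
trt <- rbinom(n, 1, 0.5)
y   <- 1 + ifelse(x < 0.5, +1, -1) * trt + rnorm(n, sd = 0.5)
dat <- data.frame(y, x, trt = factor(trt))

fit <- rpart(y ~ x + trt, data = dat, control = rpart.control(cp = 0.005))
plot(fit); text(fit)   # splits on x, then on trt within each x-subgroup

## Subgroup-specific treatment effects recovered from the data:
with(subset(dat, x < 0.5),  tapply(y, trt, mean))   # trt raises the mean response
with(subset(dat, x >= 0.5), tapply(y, trt, mean))   # trt lowers the mean response
```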