Why and how to use random forest variable importance measures (and how you shouldn’t)
Carolin Strobl (LMU München) and Achim Zeileis (WU Wien)
carolin.strobl@stat.uni-muenchen.de
useR! 2008, Dortmund
Introduction
Random forests
◮ have become increasingly popular in, e.g., genetics and the neurosciences [imagine a long list of references here]
◮ can deal with “small n large p”-problems, high-order interactions, correlated predictor variables
◮ are used not only for prediction, but also to assess variable importance
(Small) random forest
[Figure: a small forest of classification trees. Each tree splits on the variables Start, Age, and Number; the trees differ in their split variables and cutpoints.]
Construction of a random forest
◮ draw ntree bootstrap samples from the original sample
◮ fit a classification tree to each bootstrap sample ⇒ ntree trees
◮ this creates a diverse set of trees because
  ◮ trees are unstable w.r.t. changes in the learning data ⇒ ntree different looking trees (bagging)
  ◮ mtry splitting variables are randomly preselected in each split ⇒ ntree even more different looking trees (random forest)
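The two sources of randomness in the construction can be sketched in base R. This is a minimal illustration, not the code of any package: it only draws the bootstrap samples and the preselected candidate variables, and fits no trees. The values of n, p, ntree, and mtry and the helper name candidates_for_split are assumptions chosen for the example.

```r
# Sketch of the two randomization steps only; no trees are fitted here.
set.seed(1)   # arbitrary seed, for reproducibility

n     <- 81   # hypothetical sample size
p     <- 3    # hypothetical number of predictor variables
ntree <- 10   # number of trees in the forest
mtry  <- 2    # number of predictors preselected at each split

# Step 1: draw ntree bootstrap samples, i.e. n row indices with replacement.
boot_idx <- lapply(seq_len(ntree), function(b) {
  sample(n, size = n, replace = TRUE)
})

# Each bootstrap sample contains a different subset of the observations,
# which is why unstable trees fitted to them look different (bagging).
n_distinct <- sapply(boot_idx, function(idx) length(unique(idx)))

# Step 2: at each split, only mtry randomly preselected predictors
# are candidates for splitting (hypothetical helper, for illustration).
candidates_for_split <- function(p, mtry) {
  sort(sample(p, size = mtry))
}

# Candidate sets for five consecutive splits.
split_candidates <- replicate(5, candidates_for_split(p, mtry),
                              simplify = FALSE)
```

Inspecting n_distinct shows that each bootstrap sample omits part of the original observations, and split_candidates shows that the set of eligible variables changes from split to split.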
Random forests in R
◮ randomForest (pkg: randomForest)
  ◮ reference implementation based on CART trees (Breiman, 2001; Liaw and Wiener, 2008)
  – for variables of different types: biased in favor of continuous variables and variables with many categories (Strobl, Boulesteix, Zeileis, and Hothorn, 2007)
◮ cforest (pkg: party)
  ◮ based on unbiased conditional inference trees (Hothorn, Hornik, and Zeileis, 2006)
  + for variables of different types: unbiased when subsampling, instead of bootstrap sampling, is used (Strobl, Boulesteix, Zeileis, and Hothorn, 2007)
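As a hedged sketch, both functions might be called as follows. The kyphosis data from the rpart package and the values of ntree and mtry are assumptions made for illustration, not taken from the slides. The packages randomForest, party, and rpart must be installed.

```r
# Illustrative calls only; data set and tuning values are assumptions.
library(randomForest)
library(party)
data("kyphosis", package = "rpart")

set.seed(42)  # arbitrary seed, for reproducibility

# CART-based reference implementation (bootstrap sampling by default).
rf <- randomForest(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   ntree = 500, mtry = 2, importance = TRUE)
imp_rf <- importance(rf)   # matrix of importance measures, one row per predictor

# Forest of conditional inference trees; cforest_unbiased() sets up
# subsampling without replacement instead of bootstrap sampling.
cf <- cforest(Kyphosis ~ Age + Number + Start, data = kyphosis,
              controls = cforest_unbiased(ntree = 500, mtry = 2))
imp_cf <- varimp(cf)       # permutation importance, one value per predictor
```

With predictors of a single type, as in this example, the selection bias noted above is not expected to show; the calls are meant only to show where the sampling scheme and mtry are set in each package.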