Random Forest
Applied Multivariate Statistics – Spring 2012
Overview
• Intuition of Random Forest
• The Random Forest algorithm
• De-correlation gives better accuracy
• Out-of-bag error (OOB error)
• Variable importance
Intuition of Random Forest
[Figure: three example classification trees, splitting on age (young/old), on sex and height (male/female, tall/short), and on work status and height (retired/working, tall/short); leaves are labeled "healthy" or "diseased".]
New sample: old, retired, male, short
Tree predictions: diseased, healthy, diseased
Majority rule: diseased
The Random Forest Algorithm
Differences to a standard tree
• Train each tree on a bootstrap resample of the data (bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set)
• For each split, consider only m randomly selected variables
• Don't prune
• Fit B trees in this way and aggregate the results by averaging (regression) or majority vote (classification); see the sketch after this list
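A minimal from-scratch sketch of this recipe in R, built on single rpart trees, is shown below. It is for illustration only and simplifies one point: it draws the m variables once per tree, whereas a real Random Forest draws a fresh subset of m variables at every split (use the randomForest package in practice).

```r
# Minimal Random-Forest-style sketch on iris (illustration only)
library(rpart)
set.seed(1)

B <- 100                               # number of trees
m <- 2                                 # variables per tree (simplified: per tree, not per split)
vars <- setdiff(names(iris), "Species")
n <- nrow(iris)

trees <- vector("list", B)
for (b in 1:B) {
  boot  <- sample(n, n, replace = TRUE)   # bootstrap resample of the data
  mvars <- sample(vars, m)                # random subset of m variables
  f <- reformulate(mvars, response = "Species")
  # cp = 0, minsplit = 2: grow a deep tree and don't prune
  trees[[b]] <- rpart(f, data = iris[boot, ],
                      control = rpart.control(cp = 0, minsplit = 2))
}

# Aggregate by majority vote over the B trees
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == iris$Species)              # training accuracy (optimistic)
```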
Why Random Forest works 1/2
Mean squared error = variance + bias²
If trees are sufficiently deep, they have very small bias. How could we improve the variance over that of a single tree?
Why Random Forest works 2/2
Consider the average of B identically distributed trees T_b, each with variance σ² and pairwise correlation ρ:

\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
\]

• The first term decreases if ρ decreases, i.e., if m decreases: de-correlation gives better accuracy
• The second term decreases if the number of trees B increases (irrespective of ρ)
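As a quick numeric illustration of this formula (the values of ρ and B below are chosen arbitrarily, with σ² = 1):

```r
# Variance of the average of B correlated trees: rho*sigma2 + (1-rho)/B * sigma2
var_avg <- function(rho, B, sigma2 = 1) rho * sigma2 + (1 - rho) / B * sigma2

var_avg(rho = 0.5, B = 1)     # 1.000: a single tree
var_avg(rho = 0.5, B = 500)   # 0.501: more trees only remove the second term
var_avg(rho = 0.1, B = 500)   # 0.102: de-correlating (smaller m) lowers the floor
```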
Estimating generalization error: out-of-bag (OOB) error
• Similar to leave-one-out cross-validation, but almost without any additional computational burden
• Each tree is evaluated on its out-of-bag samples: the samples that were not drawn into its bootstrap resample
• The OOB error is a random number, since it is based on random resamples of the data
[Figure: a data set of seven samples (age, height → healthy/diseased) and one bootstrap resample of it; a tree grown on the resample misclassifies one of the four out-of-bag samples, giving an OOB error rate of 1/4 = 0.25]
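With the randomForest package the OOB error comes for free; a minimal example on the built-in iris data:

```r
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
print(rf)                     # printed summary includes the OOB estimate of the error rate
rf$err.rate[rf$ntree, "OOB"]  # OOB error after all 500 trees
plot(rf)                      # OOB and per-class errors as a function of the number of trees
```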
Variable importance for variable i using permutations
• Fit m trees, each on its own bootstrap resample; tree j has OOB error e_j on its OOB data
• Permute the values of variable i in the OOB data of tree j and recompute the error, giving p_j
• d_j = p_j − e_j is the increase in OOB error caused by permuting variable i

\[
\bar d = \frac{1}{m}\sum_{j=1}^{m} d_j, \qquad
s_d^2 = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(d_j - \bar d\bigr)^2, \qquad
v_i = \frac{\bar d}{s_d}
\]
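A hand-rolled sketch of this computation, using the per-tree predictions and in-bag records that the randomForest package exposes. perm_importance is a hypothetical helper written only to mirror the formula above; in practice use importance(rf, type = 1), which computes the same quantity efficiently.

```r
library(randomForest)
set.seed(42)

# keep.inbag = TRUE records which samples each tree saw, so we can find its OOB rows
rf <- randomForest(Species ~ ., data = iris, ntree = 200, keep.inbag = TRUE)

# Sketch of permutation importance for a single variable (slow, illustration only)
perm_importance <- function(rf, x, y, var) {
  d <- numeric(rf$ntree)
  # per-tree predictions on the original data: an n x ntree matrix of class labels
  plain <- predict(rf, x, predict.all = TRUE)$individual
  for (j in seq_len(rf$ntree)) {
    oob  <- rf$inbag[, j] == 0                # rows that are OOB for tree j
    e_j  <- mean(plain[oob, j] != y[oob])     # OOB error of tree j
    xp   <- x
    xp[oob, var] <- sample(xp[oob, var])      # permute variable i among the OOB rows
    perm <- predict(rf, xp, predict.all = TRUE)$individual
    p_j  <- mean(perm[oob, j] != y[oob])      # OOB error after permutation
    d[j] <- p_j - e_j                         # increase in error
  }
  mean(d) / sd(d)                             # v_i = d-bar / s_d
}

perm_importance(rf, iris[, 1:4], iris$Species, "Petal.Width")
```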
Trees vs. Random Forest
Trees:
+ Yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
- Predictions tend to have a high variance

Random Forest:
+ Smaller prediction variance and therefore usually better generalization performance
+ Easy to tune parameters
- Rather slow
- "Black box": rather hard to get insight into the decision rules
Comparing runtime (just for illustration)
• RF copes with up to "thousands" of variables
• Problematic if there are categorical predictors with many levels (randomForest allows at most 32 levels)
[Figure: runtime of RF vs. a single tree when the first predictor is cut into 15 levels]
RF vs. LDA
Random Forest:
+ Can model nonlinear class boundaries
+ OOB error "for free" (no CV needed)
+ Works on continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
- "Black box"
- Slow

LDA:
+ Very fast
+ Discriminants for visualizing group separation
+ Can read off the decision rule
- Can model only linear class boundaries
- Mediocre performance
- No variable selection
- Only works on a categorical response
- Needs CV for estimating the prediction error
Concepts to know
• The idea of Random Forest and how it reduces the prediction variance of trees
• OOB error
• Variable importance based on permutation
R functions to know
• randomForest and varImpPlot from the package randomForest
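A minimal usage example (importance = TRUE is needed so that varImpPlot can show the permutation-based measure):

```r
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
print(rf)                  # shows the OOB error estimate and the confusion matrix
importance(rf, type = 1)   # permutation importance (mean decrease in accuracy)
varImpPlot(rf)             # plots the variable importance measures
```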