
Random Forest: Applied Multivariate Statistics, Spring 2012



  1. Random Forest Applied Multivariate Statistics – Spring 2012

  2. Overview
     - Intuition of Random Forest
     - The Random Forest Algorithm
     - De-correlation gives better accuracy
     - Out-of-bag error (OOB-error)
     - Variable importance

  3. Intuition of Random Forest
     [Figure: three decision trees (Tree 1, Tree 2, Tree 3) that split on attributes such as young/old, male/female, tall/short, and retired/working, with leaves labeled healthy or diseased.]
     - New sample: old, retired, male, short
     - Tree predictions: diseased, healthy, diseased
     - Majority rule: diseased
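
The majority rule on this slide can be sketched in a few lines of Python. The function name `majority_vote` is made up for this illustration; the three predictions are the ones listed on the slide.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most trees.

    If several labels share the top count, the one that appears first wins.
    """
    return Counter(predictions).most_common(1)[0][0]

# Predictions of Tree 1, Tree 2, Tree 3 for the new sample (old, retired, male, short)
tree_predictions = ["diseased", "healthy", "diseased"]
print(majority_vote(tree_predictions))  # diseased
```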

  4. The Random Forest Algorithm

  5. Differences to standard tree
     - Train each tree on a bootstrap resample of the data. (Bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement, i.e., some samples will probably occur multiple times in the new data set.)
     - For each split, consider only m randomly selected variables
     - Don’t prune
     - Fit B trees in this way and aggregate their results by averaging or majority voting
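
A minimal Python sketch of the three random-forest ingredients listed on this slide: the bootstrap resample, the m candidate variables per split, and the aggregation step. It does not grow any trees. The helper names (`bootstrap_resample`, `split_candidates`, `aggregate`) and the toy sizes are assumptions made for illustration, not part of the slides.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_resample(n_samples, rng):
    """Draw n_samples row indices with replacement; duplicates are expected."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def split_candidates(n_variables, m, rng):
    """Pick the m distinct variables that a single split may consider."""
    return rng.sample(range(n_variables), m)

def aggregate(tree_outputs, classification):
    """Majority vote for class labels, average for numeric predictions."""
    if classification:
        return Counter(tree_outputs).most_common(1)[0][0]
    return mean(tree_outputs)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
rows = bootstrap_resample(8, rng)
left_out = sorted(set(range(8)) - set(rows))  # rows this tree never sees
print("resampled rows:", rows)
print("rows left out:", left_out)
print("variables for one split:", split_candidates(n_variables=10, m=3, rng=rng))
print(aggregate([2.0, 3.0, 4.0], classification=False))  # 3.0
```

The rows left out of a resample are the ones the out-of-bag error on a later slide relies on.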

  6. Why Random Forest works 1/2
     - Mean Squared Error = Variance + Bias²
     - If trees are sufficiently deep, they have very small bias
     - How could we improve the variance over that of a single tree?

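  7. Why Random Forest works 2/2
     De-correlation gives better accuracy. The variance of the forest prediction has two parts:
     - One part decreases if the correlation ρ between the trees decreases, i.e., if m decreases
     - The other part decreases if the number of trees B increases (irrespective of ρ)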
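
The variance expression that the annotations on this slide point at is not preserved in the transcript. A standard result that fits them is sketched below. It assumes B identically distributed tree predictions $T_1, \dots, T_B$, each with variance $\sigma^2$ and with pairwise correlation $\rho$; these symbols are chosen here and may differ from the slide's notation.

```latex
\begin{align*}
\operatorname{Var}\Big(\frac{1}{B}\sum_{b=1}^{B} T_b\Big)
  &= \frac{1}{B^2}\Big(\sum_{i=j}\operatorname{Cov}(T_i, T_j)
     + \sum_{i\neq j}\operatorname{Cov}(T_i, T_j)\Big) \\
  &= \frac{1}{B^2}\Big(B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\Big) \\
  &= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
\end{align*}
```

The first term shrinks only when the trees become less correlated, which is what a smaller m achieves. The second term shrinks as B grows, whatever the correlation is.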

  8. Estimating generalization error: Out-of-bag (OOB) error
     - Similar to leave-one-out cross-validation, but almost without any additional computational burden
     - The OOB error is a random number, since it is based on random resamples of the data
     [Figure: a data set of samples described by age (young/old), height (tall/short), and health status (healthy/diseased); a bootstrap resample of it; the tree fitted to the resample; and the out-of-bag samples left out of the resample.]
     - Out-of-bag (OOB) error rate in the example: 1/4 = 0.25
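
A rough Python sketch of how an OOB error can be computed for a whole forest: each sample is predicted only by the trees whose bootstrap resample did not contain it, and the majority vote of those trees is compared with the true label. The base learner `fit_lookup` is a hypothetical stand-in for a real tree, and the toy data are made up; they are not the table from the slide.

```python
import random
from collections import Counter

def fit_lookup(X_bag, y_bag):
    """Hypothetical stand-in for a tree: majority label per distinct row."""
    counts_by_row = {}
    for row, label in zip(X_bag, y_bag):
        counts_by_row.setdefault(row, Counter())[label] += 1
    fallback = Counter(y_bag).most_common(1)[0][0]  # for rows not seen in the bag

    def predict(row):
        counts = counts_by_row.get(row)
        return counts.most_common(1)[0][0] if counts else fallback

    return predict

def oob_error(X, y, fit, n_trees, rng):
    """Error rate where every sample is judged only by trees that never saw it."""
    n = len(X)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        predict = fit([X[i] for i in in_bag], [y[i] for i in in_bag])
        for i in set(range(n)) - set(in_bag):          # out-of-bag samples
            votes[i][predict(X[i])] += 1
    scored = [i for i in range(n) if votes[i]]         # OOB for at least one tree
    if not scored:
        raise ValueError("no sample was ever out of bag; use more trees")
    wrong = sum(votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return wrong / len(scored)

# Made-up toy data: (age, height) -> health status
X = [("old", "tall"), ("old", "short"), ("young", "tall"),
     ("young", "short"), ("old", "short"), ("young", "tall")]
y = ["healthy", "diseased", "healthy", "diseased", "diseased", "healthy"]
print(oob_error(X, y, fit_lookup, n_trees=50, rng=random.Random(0)))
```

Because the resamples are random, a different seed generally gives a different OOB error, which is the point of the second bullet above.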

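  9. Variable importance for variable i using permutations
     [Figure: the data are resampled into data sets 1, …, m; tree j is fitted to resampled data set j and has its own OOB data j.]
     - For each tree j = 1, …, m (here m is the number of trees), compute an OOB error on the OOB data and a second OOB error after permuting the values of variable i in the OOB data set; the slide labels the two errors $e_j$ and $p_j$
     - Difference for tree j: $d_j = e_j - p_j$
     - Mean difference: $\bar{d} = \frac{1}{m} \sum_{j=1}^{m} d_j$
     - Variance of the differences: $s_d^2 = \frac{1}{m-1} \sum_{j=1}^{m} (d_j - \bar{d})^2$
     - Importance of variable i: $v_i = \bar{d} / s_d$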
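
The permutation scheme can be sketched in Python as follows. Two things are assumptions of this sketch rather than statements from the slide: the difference is taken as permuted error minus unpermuted error, so that important variables get positive scores (the slide writes d = e - p without the transcript making clear which error is which), and a score of 0.0 is returned when all differences coincide and the ratio is undefined. The trees are passed in as plain prediction functions, and the names and toy data are made up.

```python
import random
from statistics import mean, stdev

def error_rate(predict, X, y):
    """Fraction of rows whose predicted label differs from the true label."""
    return sum(predict(row) != label for row, label in zip(X, y)) / len(y)

def permutation_importance(trees, oob_sets, var, rng):
    """trees[j] is a predict function; oob_sets[j] is (X_oob, y_oob) for tree j.

    Rows are tuples; at least two trees are needed for the standard deviation.
    """
    diffs = []
    for predict, (X_oob, y_oob) in zip(trees, oob_sets):
        column = [row[var] for row in X_oob]
        rng.shuffle(column)  # permute the values of variable `var` only
        X_perm = [row[:var] + (value,) + row[var + 1:]
                  for row, value in zip(X_oob, column)]
        # Assumed sign convention: permuted error minus unpermuted error
        diffs.append(error_rate(predict, X_perm, y_oob)
                     - error_rate(predict, X_oob, y_oob))
    d_bar = mean(diffs)
    s_d = stdev(diffs)  # sample standard deviation, divisor m - 1
    return d_bar / s_d if s_d > 0 else 0.0  # 0.0 when the ratio is undefined

# Made-up example: both "trees" look only at variable 0 (age)
tree = lambda row: "diseased" if row[0] == "old" else "healthy"
oob_1 = ([("old", "tall"), ("young", "short"), ("old", "short")],
         ["diseased", "healthy", "diseased"])
oob_2 = ([("young", "tall"), ("old", "tall"), ("young", "short"), ("old", "short")],
         ["healthy", "diseased", "healthy", "diseased"])
rng = random.Random(0)
print(permutation_importance([tree, tree], [oob_1, oob_2], var=0, rng=rng))
print(permutation_importance([tree, tree], [oob_1, oob_2], var=1, rng=rng))  # 0.0
```

Permuting height (variable 1) cannot change any prediction here, so every difference is zero and the score is 0.0. The score for age depends on the random permutations drawn.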

  10. Trees vs. Random Forest
      Trees
      + Trees yield insight into decision rules
      + Rather fast
      + Easy to tune parameters
      - Predictions of trees tend to have a high variance
      Random Forest
      + RF has smaller prediction variance and therefore usually a better general performance
      + Easy to tune parameters
      - Rather slow
      - “Black Box”: rather hard to get insights into decision rules

  11. Comparing runtime (just for illustration)
      - Up to “thousands” of variables
      - Problematic if there are categorical predictors with many levels (max: 32 levels)
      [Figure: runtime of RF compared with a single tree. Annotation for RF: first predictor cut into 15 levels.]

  12. RF vs. LDA
      RF
      + Can model nonlinear class boundaries
      + OOB error “for free” (no CV needed)
      + Works on continuous and categorical responses (regression / classification)
      + Gives variable importance
      + Very good performance
      - “Black box”
      - Slow
      LDA
      + Very fast
      + Discriminants for visualizing group separation
      + Can read off decision rule
      - Can model only linear class boundaries
      - Mediocre performance
      - No variable selection
      - Only on categorical response
      - Needs CV for estimating prediction error

  13. Concepts to know
      - Idea of Random Forest and how it reduces the prediction variance of trees
      - OOB error
      - Variable importance based on permutation

  14. R functions to know
      - Functions “randomForest” and “varImpPlot” from package “randomForest”
