Bagging and Random Forests


  1. Bagging and Random Forests. David S. Rosenberg, New York University. DS-GA 1003 / CSCI-GA 2567, April 10, 2018.

  2. Contents
     1. Ensemble Methods: Introduction
     2. The Benefits of Averaging
     3. Review: Bootstrap
     4. Bagging
     5. Random Forests

  3. Ensemble Methods: Introduction

  4. Ensembles: Parallel vs Sequential
     - Ensemble methods combine multiple models.
     - Parallel ensembles: each model is built independently, e.g. bagging and random forests. Main idea: combine many high-complexity, low-bias models to reduce variance.
     - Sequential ensembles: models are generated sequentially; each new model tries to do well where the previous models fall short.

  5. The Benefits of Averaging

  6. A Poor Estimator
     - Let $Z, Z_1, \dots, Z_n$ be i.i.d. with $\mathbb{E}Z = \mu$ and $\operatorname{Var}(Z) = \sigma^2$.
     - We could use any single $Z_i$ to estimate $\mu$. Performance?
     - Unbiased: $\mathbb{E}Z_i = \mu$.
     - The standard error of this estimator is $\operatorname{SD}(Z_i) = \sqrt{\operatorname{Var}(Z_i)} = \sqrt{\sigma^2} = \sigma$.
     - (The standard error is the standard deviation of the sampling distribution of a statistic.)

  7. Variance of a Mean
     - Let $Z, Z_1, \dots, Z_n$ be i.i.d. with $\mathbb{E}Z = \mu$ and $\operatorname{Var}(Z) = \sigma^2$.
     - Consider the average of the $Z_i$'s. It has the same expected value but smaller standard error:
       $\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n Z_i\right] = \mu, \qquad \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n Z_i\right) = \frac{\sigma^2}{n}.$
     - Clearly the average is preferred to a single $Z_i$ as an estimator.
     - Can we apply this to reduce the variance of general prediction functions?
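As a quick sanity check, here is a small simulation of this slide's claim; it is my own sketch rather than part of the deck, and the normal distribution and the constants (mu, sigma, n) are arbitrary choices.

```python
import numpy as np

# Illustrative simulation (assumed setup, not from the slides): compare the
# standard error of a single Z_i with that of the sample mean of n draws.
rng = np.random.default_rng(0)
mu, sigma, n, trials = 5.0, 2.0, 100, 10_000

Z = rng.normal(mu, sigma, size=(trials, n))   # each row: one i.i.d. sample of size n

single = Z[:, 0]               # estimator 1: a single Z_i
sample_mean = Z.mean(axis=1)   # estimator 2: the average of the Z_i's

print("SD of single Z_i :", single.std())       # close to sigma = 2.0
print("SD of sample mean:", sample_mean.std())  # close to sigma / sqrt(n) = 0.2
```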

  8. Averaging Independent Prediction Functions
     - Suppose we have $B$ independent training sets from the same distribution.
     - The learning algorithm gives $B$ decision functions: $\hat{f}_1(x), \hat{f}_2(x), \dots, \hat{f}_B(x)$.
     - Define the average prediction function as
       $\hat{f}_{\text{avg}} = \frac{1}{B}\sum_{b=1}^B \hat{f}_b.$
     - What's random here? The $B$ independent training sets are random, which gives rise to variation among the $\hat{f}_b$'s.

  9. Averaging Independent Prediction Functions
     - Fix some particular $x_0 \in \mathcal{X}$. The average prediction on $x_0$ is
       $\hat{f}_{\text{avg}}(x_0) = \frac{1}{B}\sum_{b=1}^B \hat{f}_b(x_0).$
     - Since the training sets were random, we can view $\hat{f}_{\text{avg}}(x_0)$ and $\hat{f}_1(x_0), \dots, \hat{f}_B(x_0)$ as random variables.
     - We have no idea about the distributions of $\hat{f}_1(x_0), \dots, \hat{f}_B(x_0)$; they could be crazy...
     - But we do know that $\hat{f}_1(x_0), \dots, \hat{f}_B(x_0)$ are i.i.d., and that's all we need here.

  10. Averaging Independent Prediction Functions
      - The average prediction on $x_0$ is $\hat{f}_{\text{avg}}(x_0) = \frac{1}{B}\sum_{b=1}^B \hat{f}_b(x_0).$
      - $\hat{f}_{\text{avg}}(x_0)$ and $\hat{f}_b(x_0)$ have the same expected value, but $\hat{f}_{\text{avg}}(x_0)$ has smaller variance:
        $\operatorname{Var}\left(\hat{f}_{\text{avg}}(x_0)\right) = \frac{1}{B^2}\operatorname{Var}\left(\sum_{b=1}^B \hat{f}_b(x_0)\right) = \frac{1}{B}\operatorname{Var}\left(\hat{f}_1(x_0)\right).$
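To make the $1/B$ variance reduction concrete, here is a sketch using decision trees as the learning algorithm; the synthetic data-generating distribution and all constants are my own assumptions, not part of the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative simulation (assumed synthetic distribution, not from the slides):
# when the B training sets really are independent, the variance of the averaged
# prediction at a fixed x0 is about Var(f_1(x0)) / B.
rng = np.random.default_rng(0)

def make_dataset(n=100):
    # a stand-in for sampling a fresh training set from P_{X x Y}
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=n)
    return X, y

def predictions_at(x0, B):
    # train B trees on B independent training sets; return their predictions at x0
    preds = []
    for _ in range(B):
        X, y = make_dataset()
        preds.append(DecisionTreeRegressor(max_depth=3).fit(X, y).predict(x0)[0])
    return np.array(preds)

x0 = np.array([[1.0]])
single = predictions_at(x0, B=2000)                               # many draws of f_1(x0)
averaged = [predictions_at(x0, B=10).mean() for _ in range(500)]  # many draws of f_avg(x0), B = 10

print("Var of a single tree's prediction:", single.var())
print("Var of the average of 10 trees   :", np.var(averaged))    # roughly 1/10 as large
```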

  11. Averaging Independent Prediction Functions
      - Using $\hat{f}_{\text{avg}} = \frac{1}{B}\sum_{b=1}^B \hat{f}_b$ seems like a win.
      - But in practice we don't have $B$ independent training sets...
      - Instead, we can use the bootstrap.

  12. Review: Bootstrap

  13. The Bootstrap Sample
      - Definition: A bootstrap sample from $\mathcal{D}_n$ is a sample of size $n$ drawn with replacement from $\mathcal{D}_n$.
      - In a bootstrap sample, some elements of $\mathcal{D}_n$ will show up multiple times, and some won't show up at all.
      - Each $X_i$ has probability $(1 - 1/n)^n$ of not being selected.
      - Recall from analysis that for large $n$, $(1 - 1/n)^n \approx 1/e \approx 0.368$.
      - So we expect about 63.2% of the elements of $\mathcal{D}_n$ to show up at least once.
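A two-line check of the 63.2% figure (my own example, with an arbitrary choice of n):

```python
import numpy as np

# What fraction of the original points appears at least once in a bootstrap sample?
rng = np.random.default_rng(0)
n = 10_000

indices = rng.integers(0, n, size=n)        # bootstrap sample = n draws with replacement
frac_present = np.unique(indices).size / n

print(f"fraction appearing at least once: {frac_present:.3f}")   # about 1 - 1/e = 0.632
```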

  14. The Bootstrap Method
      - Definition: The bootstrap method simulates having $B$ independent samples from $P$ by taking $B$ bootstrap samples from the original sample $\mathcal{D}_n$.
      - Given original data $\mathcal{D}_n$, compute $B$ bootstrap samples $\mathcal{D}_n^1, \dots, \mathcal{D}_n^B$.
      - For each bootstrap sample, compute some function $\phi(\mathcal{D}_n^1), \dots, \phi(\mathcal{D}_n^B)$.
      - Work with these values as though $\mathcal{D}_n^1, \dots, \mathcal{D}_n^B$ were i.i.d. samples from $P$.
      - Amazing fact: things often come out very close to what we'd get with independent samples from $P$.
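A minimal sketch of the method, under assumed data: the statistic $\phi$ here is the sample median (my choice for illustration), and the bootstrap replicates stand in for independent samples from $P$ when estimating its standard error.

```python
import numpy as np

# Bootstrap estimate of the standard error of the sample median (assumed data).
rng = np.random.default_rng(0)

data = rng.exponential(scale=2.0, size=200)   # the original sample D_n (hypothetical)
B = 1000

def phi(sample):
    # the statistic of interest, phi(D_n^b); here, the sample median
    return np.median(sample)

boot_stats = np.array([
    phi(rng.choice(data, size=data.size, replace=True))   # one bootstrap sample D_n^b
    for _ in range(B)
])

print("bootstrap estimate of SE(median):", boot_stats.std(ddof=1))
```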

  15. Bagging

  16. Bagging
      - Draw $B$ bootstrap samples $\mathcal{D}^1, \dots, \mathcal{D}^B$ from the original data $\mathcal{D}$.
      - Let $\hat{f}_1, \hat{f}_2, \dots, \hat{f}_B$ be the prediction functions trained on each sample.
      - The bagged prediction function is a combination of these:
        $\hat{f}_{\text{avg}}(x) = \text{Combine}\left(\hat{f}_1(x), \hat{f}_2(x), \dots, \hat{f}_B(x)\right).$
      - How might we combine prediction functions for regression? Binary class predictions? Binary probability predictions? Multiclass predictions? (See the sketch below.)
      - Bagging was proposed by Leo Breiman (1996).
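One natural answer to each of those questions, sketched with hard-coded toy outputs (the numbers are arbitrary placeholders, not from the slides):

```python
import numpy as np

# Natural combination rules for the B model outputs f_1(x), ..., f_B(x):

reg_preds = np.array([2.1, 1.8, 2.4, 2.0])   # regression: average the predictions
print(reg_preds.mean())

class_preds = np.array([1, -1, 1, 1])        # binary classes: majority (consensus) vote
print(1 if class_preds.sum() > 0 else -1)

prob_preds = np.array([0.7, 0.4, 0.9, 0.6])  # binary probabilities: average, then threshold
print(prob_preds.mean() > 0.5)

multi_preds = np.array([2, 0, 2, 1])         # multiclass: plurality vote
print(np.bincount(multi_preds).argmax())
```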

  17. Bagging for Regression
      - Draw $B$ bootstrap samples $\mathcal{D}^1, \dots, \mathcal{D}^B$ from the original data $\mathcal{D}$.
      - Let $\hat{f}_1, \hat{f}_2, \dots, \hat{f}_B : \mathcal{X} \to \mathbb{R}$ be the prediction functions trained on each sample.
      - The bagged prediction function is
        $\hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}_b(x).$
      - Empirically, $\hat{f}_{\text{bag}}$ often performs similarly to what we'd get from training on $B$ independent samples:
        $\hat{f}_{\text{bag}}(x)$ has the same expectation as $\hat{f}_1(x)$, but $\hat{f}_{\text{bag}}(x)$ has smaller variance than $\hat{f}_1(x)$.
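A minimal bagged-regression sketch, using decision trees as the base learner and a synthetic dataset of my own choosing:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Bagging for regression (assumed synthetic data): one tree per bootstrap
# sample, predictions averaged.
rng = np.random.default_rng(0)

X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

B = 50
trees = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))            # bootstrap sample D^b
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

def f_bag(X_new):
    # average the B individual tree predictions
    return np.mean([t.predict(X_new) for t in trees], axis=0)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(f_bag(X_test))
```

scikit-learn's BaggingRegressor packages essentially the same loop, if one prefers not to write it by hand.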

  18. Out-of-Bag Error Estimation
      - Each bagged predictor is trained on about 63% of the data.
      - The remaining 37% are called the out-of-bag (OOB) observations.
      - For the $i$th training point, let $S_i = \{\, b : \mathcal{D}^b \text{ does not contain the } i\text{th point} \,\}$.
      - The OOB prediction on $x_i$ is
        $\hat{f}_{\text{OOB}}(x_i) = \frac{1}{|S_i|}\sum_{b \in S_i} \hat{f}_b(x_i).$
      - The OOB error is a good estimate of the test error.
      - OOB error is similar to cross-validation error: both are computed on the training set.
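Here is how the OOB prediction above might be computed by hand, continuing the same kind of assumed synthetic setup as the bagging sketch:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# OOB error estimation (assumed synthetic data): track which points each
# bootstrap sample leaves out, and average only the trees that never saw point i.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

B, n = 50, len(X)
trees, in_bag = [], []
for _ in range(B):
    idx = rng.integers(0, n, size=n)
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    in_bag.append(np.bincount(idx, minlength=n) > 0)      # which points D^b contains

oob_preds = np.full(n, np.nan)
for i in range(n):
    S_i = [b for b in range(B) if not in_bag[b][i]]       # trees that did NOT see point i
    if S_i:
        oob_preds[i] = np.mean([trees[b].predict(X[i:i+1])[0] for b in S_i])

mask = ~np.isnan(oob_preds)
print("OOB mean squared error:", np.mean((oob_preds[mask] - y[mask]) ** 2))
```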

  19. Bagging Classification Trees
      - Input space $\mathcal{X} = \mathbb{R}^5$ and output space $\mathcal{Y} = \{-1, 1\}$; sample size $n = 30$.
      - Each bootstrap tree is quite different: for example, a different splitting variable at the root (illustrated below).
      - This high degree of variability from small perturbations of the training data is why tree methods are described as high variance.
      - (From HTF Figure 8.9.)
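The HTF figure itself is not reproduced here, but the root-split instability is easy to observe on synthetic data of the same shape (5 features, 30 points); this is my own illustrative dataset, not the one in HTF:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# With a small sample, trees grown on different bootstrap samples often pick
# different splitting variables at the root (assumed synthetic data).
rng = np.random.default_rng(0)

n, d = 30, 5
X = rng.normal(size=(n, d))
y = np.where(X @ rng.normal(size=d) + rng.normal(scale=0.5, size=n) > 0, 1, -1)

root_features = []
for _ in range(10):
    idx = rng.integers(0, n, size=n)                      # bootstrap sample
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    root_features.append(tree.tree_.feature[0])           # feature index used at the root split

print("root splitting variable per bootstrap tree:", root_features)
```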

  20. Comparing Classification Combination Methods
      - Two ways to combine classifications: take the consensus class (majority vote) or average the predicted probabilities.
      - (From HTF Figure 8.10.)
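A hedged sketch comparing the two rules on the same set of bootstrap-trained trees; the synthetic data and the single query point are assumptions for illustration only:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Consensus vote vs averaged probabilities (assumed synthetic data).
rng = np.random.default_rng(0)

n, d = 200, 5
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

B = 25
trees = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

x_new = rng.normal(size=(1, d))

votes = np.array([t.predict(x_new)[0] for t in trees])
consensus = np.bincount(votes).argmax()                        # consensus (majority) class

probs = np.mean([t.predict_proba(x_new)[0] for t in trees], axis=0)
soft = probs.argmax()                                          # class with highest average probability

print("consensus vote:", consensus, " average-probability class:", soft, " p =", probs)
```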

  21. Terms "Bias" and "Variance" in Casual Usage (Warning: Confusion Zone!)
      - Restricting the hypothesis space $\mathcal{F}$ "biases" the fit away from the best possible fit of the training data and towards a [usually] simpler model.
      - Full, unpruned decision trees have very little bias; pruning decision trees introduces bias.
      - Variance describes how much the fit changes across different random training sets.
      - If different random training sets give very similar fits, the algorithm is said to have high stability.
      - Decision trees are found to be high variance (i.e., not very stable).

  22. Conventional Wisdom on When Bagging Helps
      - The hope is that bagging reduces variance without making the bias worse.
      - The general sentiment is that bagging helps most when the base prediction functions are relatively unbiased and have high variance / low stability, i.e. small changes in the training set can cause large changes in predictions.
      - It is hard to find clear and convincing theoretical results on this, but following this intuition leads to improved ML methods, e.g. random forests.

  23. Random Forests

  24. Recall the Motivating Principle of Bagging
      - Averaging $\hat{f}_1, \dots, \hat{f}_B$ reduces variance if they're based on i.i.d. samples from $P_{\mathcal{X} \times \mathcal{Y}}$.
      - Bootstrap samples are independent samples from the training set, but they are not independent samples from $P_{\mathcal{X} \times \mathcal{Y}}$.
      - This dependence limits the amount of variance reduction we can get.
      - It would be nice to reduce the dependence between the $\hat{f}_i$'s...
