

Boosting: more than an ensemble method for prediction
Peter Bühlmann, ETH Zürich

1. Historically: Boosting is about multiple predictions

Data: $(X_1, Y_1), \dots, (X_n, Y_n)$ (i.i.d. or stationary), with predictor variables $X_i \in \mathbb{R}^p$ and response variables $Y_i \in \mathbb{R}$ or $Y_i \in \{0, 1, \dots, J-1\}$.

Aim: estimation of a function $f(\cdot): \mathbb{R}^p \to \mathbb{R}$, e.g. $f(x) = \mathbb{E}[Y \mid X = x]$, or $f(x) = \mathbb{P}[Y = 1 \mid X = x]$ with $Y \in \{0, 1\}$, or a model where the distribution of the survival time $Y$ given $X$ depends on some function $f(X)$ only.

"Historical" view (for classification): Boosting is a multiple-prediction (estimation) and combination method.

Base procedure: an algorithm A mapping data to a function estimate $\hat\theta(\cdot)$, e.g. simple linear regression, a tree, MARS, "classical" smoothing, neural nets, ...

Generating multiple predictions:
weighted data 1 → algorithm A → $\hat\theta_1(\cdot)$
weighted data 2 → algorithm A → $\hat\theta_2(\cdot)$
...
weighted data M → algorithm A → $\hat\theta_M(\cdot)$

Aggregation: $\hat f_A(\cdot) = \sum_{m=1}^M a_m \hat\theta_m(\cdot)$.

How to choose the data weights? And the averaging weights $a_m$?

Classification of lymph nodal status (2 classes) in breast cancer using gene expressions from microarray data: $n = 33$, $p = 7129$ (for CART: gene preselection, reducing to $p = 50$).

method                         test set error   gain over CART
CART                           22.5%            –
LogitBoost with trees          16.3%            28%
LogitBoost with bagged trees   12.2%            46%

This kind of boosting: mainly prediction, not much interpretation.

2. Boosting algorithms

Around 1990: Schapire constructed some early versions of boosting. AdaBoost was proposed for classification by Freund & Schapire (1996).

Data weights (rough original idea): large weights for previously heavily misclassified instances (a sequential algorithm).
Averaging weights $a_m$: large if the in-sample performance in the $m$-th round was good.

Why should this be good?
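To make the reweighting idea concrete, here is a minimal sketch of discrete AdaBoost with decision stumps, assuming labels coded as $\{-1, +1\}$ and a scikit-learn stump as the weak learner (these choices are mine, not from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=100):
    """Discrete AdaBoost sketch; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                          # data weights, start uniform
    stumps, alphas = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)  # weak learner (base procedure A)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)    # weighted in-sample error
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))  # averaging weight a_m
        w = w * np.exp(-alpha * y * pred)            # up-weight misclassified instances
        w = w / np.sum(w)
        stumps.append(stump)
        alphas.append(alpha)

    def predict(X_new):
        agg = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
        return np.sign(agg)

    return predict
```

Misclassified instances get larger weights in the next round, and base learners with small weighted error get larger averaging weights $a_m$: exactly the two ingredients described above.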

Why should this be good? Some common answers 5 years ago: because
• it works so well for prediction (which is quite true)
• it concentrates on the "hard cases" (so what?)
• AdaBoost almost never overfits the data, no matter how many iterations it is run (not true)

A better explanation. Breiman (1998/99): AdaBoost is a functional gradient descent (FGD) procedure.

Aim: find $f^*(\cdot) = \operatorname{argmin}_{f(\cdot)} \mathbb{E}[\rho(Y, f(X))]$; e.g. for $\rho(y, f) = |y - f|^2$, the population minimizer is $f^*(x) = \mathbb{E}[Y \mid X = x]$.

FGD solution: consider the empirical risk $n^{-1} \sum_{i=1}^n \rho(Y_i, f(X_i))$ and do iterative steepest descent in function space.
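As a quick check of the claimed population minimizer (a standard decomposition, not from the slides): for the squared-error loss,
$$
\mathbb{E}\big[(Y - f(X))^2 \mid X = x\big] = \operatorname{Var}(Y \mid X = x) + \big(\mathbb{E}[Y \mid X = x] - f(x)\big)^2,
$$
which is minimized over $f(x)$ by $f(x) = \mathbb{E}[Y \mid X = x]$, so indeed $f^*(x) = \mathbb{E}[Y \mid X = x]$.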

2.1. Generic FGD algorithm

Step 1. Set $\hat f_0 \equiv 0$ and $m = 0$.
Step 2. Increase $m$ by 1. Compute the negative gradient $-\frac{\partial}{\partial f}\rho(Y, f)$ and evaluate it at $f = \hat f_{m-1}(X_i)$: this gives $U_i$, $i = 1, \dots, n$.
Step 3. Fit the negative gradient vector $U_1, \dots, U_n$ by the base procedure: $(X_i, U_i)_{i=1}^n$ → algorithm A → $\hat\theta_m(\cdot)$, e.g. $\hat\theta_m$ fitted by (weighted) least squares; i.e. $\hat\theta_m(\cdot)$ is an approximation of the negative gradient vector.
Step 4. Update $\hat f_m(\cdot) = \hat f_{m-1}(\cdot) + \nu s_m \hat\theta_m(\cdot)$, where $s_m = \operatorname{argmin}_s n^{-1} \sum_{i=1}^n \rho(Y_i, \hat f_{m-1}(X_i) + s\,\hat\theta_m(X_i))$ and $0 < \nu \le 1$; i.e. proceed along an estimate of the negative gradient vector.
Step 5. Iterate Steps 2-4 until $m = m_{\mathrm{stop}}$ for some stopping iteration $m_{\mathrm{stop}}$.
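A minimal sketch of the generic FGD loop, specialized (for concreteness, my choice) to the squared-error loss $\rho(y, f) = |y - f|^2$, for which the negative gradient is proportional to the residual and the line search in Step 4 has a closed form; `base_fit` is a hypothetical stand-in for the base procedure (algorithm A):

```python
import numpy as np

def fgd_boost(X, y, base_fit, nu=0.1, m_stop=100):
    """Generic FGD sketch for rho(y, f) = (y - f)^2.

    base_fit(X, U) must return a fitted function theta_hat with a
    theta_hat(X_new) -> predictions interface (the base procedure A).
    """
    f_hat = np.zeros(len(y))                    # Step 1: f_0 = 0
    learners, steps = [], []
    for m in range(m_stop):                     # Steps 2-5
        U = y - f_hat                           # negative gradient (up to a factor 2) = residuals
        theta_m = base_fit(X, U)                # Step 3: fit base procedure to (X_i, U_i)
        pred = theta_m(X)
        s_m = pred @ U / (pred @ pred + 1e-12)  # Step 4: closed-form line search for squared error
        f_hat = f_hat + nu * s_m * pred
        learners.append(theta_m)
        steps.append(nu * s_m)

    def predict(X_new):
        return sum(s * th(X_new) for s, th in zip(steps, learners))

    return predict
```

Any regression learner with a fit/predict pair can be wrapped as `base_fit`, which is what makes FGD generically applicable as an algorithm.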

Why "functional gradient"? Alternative formulation in function space:

empirical risk functional: $C(f) = n^{-1} \sum_{i=1}^n \rho(Y_i, f(X_i))$
inner product: $\langle f, g \rangle = n^{-1} \sum_{i=1}^n f(X_i) g(X_i)$
negative Gateaux derivative: $-dC(f)(x) = -\frac{\partial}{\partial\alpha} C(f + \alpha 1_x)\big|_{\alpha=0}$, so that $-dC(\hat f_{m-1})(X_i) = U_i$.

If $U_1, \dots, U_n$ are fitted by least squares, this is equivalent to maximizing $\langle -dC(\hat f_{m-1}), \theta \rangle$ with respect to $\theta(\cdot)$ (if $\|\theta\| = 1$), over all possible $\theta(\cdot)$'s from the base procedure; i.e. $\hat\theta_m(\cdot)$ is the best approximation to (most parallel with) the negative gradient $-dC(\hat f_{m-1})$.
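As an illustration (my own worked step, assuming the squared-error loss and distinct design points $X_i$), the Gateaux derivative above reduces to the residuals used in $L_2$Boosting (Section 2.2 below), up to a positive constant:
$$
-dC(\hat f_{m-1})(X_i)
= -\frac{\partial}{\partial\alpha}\,\frac{1}{n}\sum_{k=1}^n \big(Y_k - \hat f_{m-1}(X_k) - \alpha\, 1_{X_i}(X_k)\big)^2 \Big|_{\alpha=0}
= \frac{2}{n}\big(Y_i - \hat f_{m-1}(X_i)\big),
$$
which is proportional to the residual $U_i = Y_i - \hat f_{m-1}(X_i)$; the positive factor $2/n$ does not change which $\hat\theta_m$ the least-squares fit selects.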

By definition, FGD yields an additive combination of base procedure fits: $\nu \sum_{m=1}^{m_{\mathrm{stop}}} s_m \hat\theta_m(\cdot)$.

Breiman (1998): FGD with $\rho(y, f) = \exp(-(2y - 1) f)$ for binary classification yields the AdaBoost algorithm (a great result!).

Remark: FGD cannot be represented as an explicit estimation functional,
$$\hat f_m(\cdot) \neq \operatorname{argmin}_{f \in \mathcal{F}}\; n^{-1} \sum_{i=1}^n \rho(Y_i, f(X_i)) \quad \text{for some function class } \mathcal{F},$$
so FGD is mathematically more difficult to analyze, but generically applicable (as an algorithm!) in very complex models.

2.2. $L_2$Boosting (see also Friedman, 2001)

Loss function $\rho(y, f) = |y - f|^2$; population minimizer $f^*(x) = \mathbb{E}[Y \mid X = x]$.

FGD with base procedure $\hat\theta(\cdot)$: repeated fitting of residuals.
$m = 1$: $(X_i, Y_i)_{i=1}^n$ → $\hat\theta_1(\cdot)$, $\hat f_1 = \nu \hat\theta_1$; residuals $U_i = Y_i - \hat f_1(X_i)$
$m = 2$: $(X_i, U_i)_{i=1}^n$ → $\hat\theta_2(\cdot)$, $\hat f_2 = \hat f_1 + \nu \hat\theta_2$; residuals $U_i = Y_i - \hat f_2(X_i)$
...
$\hat f_{m_{\mathrm{stop}}}(\cdot) = \nu \sum_{m=1}^{m_{\mathrm{stop}}} \hat\theta_m(\cdot)$ (stagewise greedy fitting of residuals)

Tukey (1977): "twicing" for $m_{\mathrm{stop}} = 2$ and $\nu = 1$.
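A minimal sketch of $L_2$Boosting as repeated fitting of residuals, with regression stumps as the base procedure (the stump learner and the value of $\nu$ are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2_boost_stumps(X, y, nu=0.1, m_stop=100):
    """L2Boosting sketch: stagewise greedy fitting of residuals with shrinkage nu."""
    f_hat = np.zeros(len(y))
    learners = []
    for m in range(m_stop):
        U = y - f_hat                                      # current residuals
        stump = DecisionTreeRegressor(max_depth=1).fit(X, U)
        f_hat = f_hat + nu * stump.predict(X)              # f_m = f_{m-1} + nu * theta_m
        learners.append(stump)

    def predict(X_new):
        return nu * sum(s.predict(X_new) for s in learners)

    return predict
```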

Any gain over classical methods? (For additive modeling.)

Ozone data: $n = 300$, $p = 8$.

[Figure: MSE versus boosting iterations (0 to 100) for the ozone data]
- magenta: $L_2$Boosting with stumps (horizontal line = cross-validated stopping)
- black: $L_2$Boosting with componentwise smoothing spline (horizontal line = cross-validated stopping), i.e. smoothing spline fitting against the selected predictor which reduces RSS most
- green: MARS restricted to additive modeling
- red: additive model using backfitting

$L_2$Boosting with stumps or componentwise smoothing splines also yields an additive model:
$$\sum_{m=0}^{m_{\mathrm{stop}}} \hat\theta_m\big(x^{(\hat S_m)}\big) = \hat g_1\big(x^{(1)}\big) + \dots + \hat g_p\big(x^{(p)}\big).$$

Simulated data: non-additive regression function, $n = 200$, $p = 100$.

[Figure: MSE versus boosting iterations (0 to 300) for the simulated data]
- magenta: $L_2$Boosting with stumps
- black: $L_2$Boosting with componentwise smoothing spline
- green: MARS restricted to additive modeling
- red: additive model using backfitting and forward variable selection

Similar for classification.

3. Structured models and choosing the base procedure

We have just seen the componentwise smoothing spline base procedure: it smoothes the response against the one predictor variable which reduces RSS most, and we keep the degrees of freedom fixed for all candidate predictors, e.g. d.f. = 2.5.
⇒ $L_2$Boosting yields an additive model fit, including variable selection.

Componentwise linear least squares: simple linear OLS against the one predictor variable which reduces RSS most,
$$\hat\theta(x) = \hat\beta_{\hat S}\, x^{(\hat S)}, \qquad \hat\beta_j = \sum_{i=1}^n Y_i X_i^{(j)} \Big/ \sum_{i=1}^n \big(X_i^{(j)}\big)^2, \qquad \hat S = \operatorname{argmin}_j \sum_{i=1}^n \big(Y_i - \hat\beta_j X_i^{(j)}\big)^2.$$

First round of estimation: selected predictor variable $X^{(\hat S_1)}$ (e.g. $= X^{(3)}$), with corresponding $\hat\beta_{\hat S_1}$ ⇒ fitted function $\hat f_1(x)$.
Second round of estimation: selected predictor variable $X^{(\hat S_2)}$ (e.g. $= X^{(21)}$), with corresponding $\hat\beta_{\hat S_2}$ ⇒ fitted function $\hat f_2(x)$. Etc.

$L_2$Boosting: $\hat f_m(x) = \hat f_{m-1}(x) + \nu \cdot \hat\theta(x)$
⇒ $L_2$Boosting yields a linear model fit, including variable selection, i.e. a structured model fit. A sketch in code follows below.
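A minimal sketch of $L_2$Boosting with this componentwise linear least-squares base procedure (variable names are illustrative; predictors are assumed centered so the intercept can be ignored):

```python
import numpy as np

def componentwise_l2_boost(X, y, nu=0.1, m_stop=100):
    """Each round: regress current residuals on the single predictor reducing RSS most."""
    n, p = X.shape
    f_hat = np.zeros(n)
    coef = np.zeros(p)                                   # accumulated linear coefficients
    for m in range(m_stop):
        U = y - f_hat                                    # current residuals
        betas = X.T @ U / np.sum(X**2, axis=0)           # beta_j for every predictor j
        rss = np.sum((U[:, None] - X * betas) ** 2, axis=0)
        j = int(np.argmin(rss))                          # selected predictor S_m
        coef[j] += nu * betas[j]
        f_hat += nu * betas[j] * X[:, j]
    return coef                                          # nonzero entries = selected variables

# usage sketch: coef = componentwise_l2_boost(X, y); predictions = X_new @ coef
```

The returned coefficient vector makes the variable-selection property visible: only predictors selected in some round receive a nonzero coefficient.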

For $\nu = 1$, this is known as:
- Matching Pursuit (Mallat and Zhang, 1993)
- the weak greedy algorithm (DeVore & Temlyakov, 1997)
- a version of Boosting (Schapire, 1992; Freund & Schapire, 1996)
- the Gauss-Southwell algorithm: C. F. Gauss in 1803, "Princeps Mathematicorum"; R. V. Southwell in 1933, professor in engineering, Oxford.

Binary lymph node classification in breast cancer using gene expressions: a high-noise problem with $n = 49$ samples and $p = 7129$ gene expressions.

method                 L2Boosting     FPLR     Pelora   1-NN     DLDA     SVM
CV misclassif. error   17.7%          35.25%   27.8%    43.25%   36.12%   36.88%
gene selection         multivariate   best 200 genes from Wilcoxon test (all other methods)

$L_2$Boosting selected 42 out of the $p = 7129$ genes.

For this data set: not good prediction with any of the methods, but $L_2$Boosting may be a reasonable(?) multivariate gene selection method.

[Figure: sorted regression coefficients (roughly between -0.15 and 0.05) of the 42 (out of 7129) selected genes, $n = 49$]

Identifiability problem: strong correlations among some genes
⇒ consider groups of highly correlated genes, biological categories (e.g. GO), ...

Linear model: a multivariate association between genes and tumor type, very different from 2-sample tests for individual genes.

Pairwise smoothing splines: smooth the response against the pair of predictor variables which reduces RSS most, keeping the degrees of freedom fixed for all candidate pairs, e.g. d.f. = 2.5.
⇒ $L_2$Boosting yields a nonparametric interaction model, including variable selection.
