STK-IN4300 Statistical Learning Methods in Data Science
Lecture 10: Statistical Boosting
Riccardo De Bin
debin@math.uio.no

Outline of the lecture
- AdaBoost
  - introduction
  - algorithm
- Statistical Boosting
  - boosting as forward stagewise additive modelling
  - why exponential loss?
  - steepest descent
  - gradient boosting

AdaBoost: introduction

Starting challenge:
"Can [a committee of blockheads] somehow arrive at highly reasoned decisions, despite the weak judgement of the individual members?" (Schapire & Freund, 2014)

Goal: create a good classifier by combining several weak classifiers;
- in classification, a "weak classifier" is a classifier able to produce results only slightly better than a random guess.

Idea: apply a weak classifier repeatedly (iteratively) to modifications of the data;
- at each iteration, give more weight to the misclassified observations.

AdaBoost: introduction

L. Breiman: "[Boosting is] the best off-the-shelf classifier in the world."
- originally developed for classification;
- as a pure machine-learning black box;
- translated into the statistical world (Friedman et al., 2000);
- extended to every statistical problem (Mayr et al., 2014),
  - regression;
  - survival analysis;
  - ...
- interpretable models, thanks to the statistical view;
- extended to work in high-dimensional settings (Bühlmann, 2006).
AdaBoost: introduction

Consider a two-class classification problem, $y_i \in \{-1, 1\}$, $x_i \in \mathbb{R}^p$.

AdaBoost: algorithm

AdaBoost algorithm:
1. initialize the weights, $w^{[0]} = (1/N, \dots, 1/N)$;
2. for $m$ from 1 to $m_{stop}$:
   (a) fit the weak estimator $G(\cdot)$ to the weighted data;
   (b) compute the weighted in-sample misclassification rate,
       $\text{err}^{[m]} = \sum_{i=1}^N w_i^{[m-1]} \mathbb{1}(y_i \neq \hat{G}^{[m]}(x_i))$;
   (c) compute the voting weights, $\alpha^{[m]} = \log\big((1 - \text{err}^{[m]})/\text{err}^{[m]}\big)$;
   (d) update the weights,
       $\tilde{w}_i = w_i^{[m-1]} \exp\{\alpha^{[m]} \mathbb{1}(y_i \neq \hat{G}^{[m]}(x_i))\}$,
       $w_i^{[m]} = \tilde{w}_i / \sum_{i=1}^N \tilde{w}_i$;
3. compute the final result,
   $\hat{G}_{\text{AdaBoost}}(x) = \text{sign}\big(\sum_{m=1}^{m_{stop}} \alpha^{[m]} \hat{G}^{[m]}(x)\big)$.

(A minimal code sketch of this algorithm is given after the first-iteration example below.)

AdaBoost: example

First iteration:
- apply the classifier $G(\cdot)$ on observations with weights:

  i      1     2     3     4     5     6     7     8     9    10
  w_i   0.10  0.10  0.10  0.10  0.10  0.10  0.10  0.10  0.10  0.10

- observations 1, 2 and 3 are misclassified $\Rightarrow$ $\text{err}^{[1]} = 0.3$;
- compute $\alpha^{[1]} = 0.5 \log\big((1 - \text{err}^{[1]})/\text{err}^{[1]}\big) \approx 0.42$;
- set $\tilde{w}_i = w_i \exp\{\alpha^{[1]} \mathbb{1}(y_i \neq \hat{G}^{[1]}(x_i))\}$:

  i      1     2     3     4     5     6     7     8     9    10
  w_i   0.15  0.15  0.15  0.07  0.07  0.07  0.07  0.07  0.07  0.07

figure from Schapire & Freund (2014)
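The algorithm box above is pseudo-code; the snippet below is a minimal Python sketch of it, with a hand-rolled decision stump playing the role of the weak classifier $G(\cdot)$. It follows the update in the algorithm box (non-halved $\alpha$). The function names (stump_fit, adaboost_fit, ...) and the toy data are illustrative assumptions, not part of the course material.

# A minimal sketch of the AdaBoost algorithm from the slide, assuming a decision
# stump as the weak classifier G(.). Names and toy data are illustrative only.
import numpy as np

def stump_fit(X, y, w):
    """Fit a one-split decision stump to weighted data (y in {-1, +1})."""
    best = None
    for j in range(X.shape[1]):              # candidate splitting variable
        for thr in np.unique(X[:, j]):       # candidate threshold
            for sign in (+1, -1):            # orientation of the split
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]                          # (variable, threshold, sign)

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost_fit(X, y, m_stop=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                  # step 1: w^[0] = (1/N, ..., 1/N)
    stumps, alphas = [], []
    for m in range(m_stop):                  # step 2
        stump = stump_fit(X, y, w)           # (a) fit weak estimator to weighted data
        pred = stump_predict(stump, X)
        err = np.sum(w * (pred != y))        # (b) weighted misclassification rate
        err = np.clip(err, 1e-10, 1 - 1e-10) # guard against log(0) in degenerate cases
        alpha = np.log((1 - err) / err)      # (c) voting weight
        w = w * np.exp(alpha * (pred != y))  # (d) up-weight misclassified observations
        w = w / w.sum()                      #     and renormalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # step 3: sign of the alpha-weighted vote of the weak classifiers
    votes = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(votes)

# toy usage: two Gaussian clouds labelled -1 / +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]
stumps, alphas = adaboost_fit(X, y, m_stop=20)
print("training error:", np.mean(adaboost_predict(stumps, alphas, X) != y))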
AdaBoost: example

Second iteration:
- apply the classifier $G(\cdot)$ on the re-weighted observations ($w_i / \sum_i w_i$):

  i      1     2     3     4     5     6     7     8     9    10
  w_i   0.17  0.17  0.17  0.07  0.07  0.07  0.07  0.07  0.07  0.07

- observations 6, 7 and 9 are misclassified $\Rightarrow$ $\text{err}^{[2]} \approx 0.21$;
- compute $\alpha^{[2]} = 0.5 \log\big((1 - \text{err}^{[2]})/\text{err}^{[2]}\big) \approx 0.65$;
- set $\tilde{w}_i = w_i \exp\{\alpha^{[2]} \mathbb{1}(y_i \neq \hat{G}^{[2]}(x_i))\}$:

  i      1     2     3     4     5     6     7     8     9    10
  w_i   0.09  0.09  0.09  0.04  0.04  0.14  0.14  0.04  0.14  0.04

figure from Schapire & Freund (2014)

AdaBoost: example

Third iteration:
- apply the classifier $G(\cdot)$ on the re-weighted observations ($w_i / \sum_i w_i$):

  i      1     2     3     4     5     6     7     8     9    10
  w_i   0.11  0.11  0.11  0.05  0.05  0.17  0.17  0.05  0.17  0.05

- observations 4, 5 and 8 are misclassified $\Rightarrow$ $\text{err}^{[3]} \approx 0.14$;
- compute $\alpha^{[3]} = 0.5 \log\big((1 - \text{err}^{[3]})/\text{err}^{[3]}\big) \approx 0.92$;
- set $\tilde{w}_i = w_i \exp\{\alpha^{[3]} \mathbb{1}(y_i \neq \hat{G}^{[3]}(x_i))\}$:

  i      1     2     3     4     5     6     7     8     9    10
  w_i   0.04  0.04  0.04  0.11  0.11  0.07  0.07  0.11  0.07  0.02

figure from Schapire & Freund (2014)
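To make the arithmetic of the three iterations reproducible, here is a small Python check (my own sketch, not from the course material). The lists of misclassified observations are those stated on the slides. Following Schapire & Freund's convention, it multiplies misclassified observations by $e^{+\alpha}$ and correctly classified ones by $e^{-\alpha}$ with the halved $\alpha$; after normalization this is equivalent to the update in the algorithm box (where $\alpha$ is not halved), and it reproduces the weight tables above.

import numpy as np

w = np.full(10, 0.10)                            # initial weights
miss_sets = [[1, 2, 3], [6, 7, 9], [4, 5, 8]]    # misclassified observations per iteration (from the slides)

for m, miss in enumerate(miss_sets, start=1):
    is_miss = np.zeros(10, dtype=bool)
    is_miss[np.array(miss) - 1] = True           # slide numbering is 1-based
    w_norm = w / w.sum()                         # re-weighted observations used by G(.)
    err = w_norm[is_miss].sum()                  # weighted misclassification rate
    alpha = 0.5 * np.log((1 - err) / err)        # voting weight, halved as in the example
    # misclassified * e^{+alpha}, correctly classified * e^{-alpha}
    w = w_norm * np.exp(alpha * np.where(is_miss, 1.0, -1.0))
    print(f"iter {m}: err={err:.2f}, alpha={alpha:.2f}, weights={np.round(w, 2)}")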
AdaBoost: example

figure from Schapire & Freund (2014)

AdaBoost: example

figure from Schapire & Freund (2014)

Statistical Boosting: boosting as forward stagewise additive modelling

The statistical view of boosting is based on the concept of forward stagewise additive modelling (see notes):
- it minimizes a loss function $L(y_i, f(x_i))$;
- using an additive model, $f(x) = \sum_{m=1}^M \beta_m b(x; \gamma_m)$,
  - where $b(x; \gamma_m)$ is the basis, or weak learner;
- at each step,
  $(\beta_m, \gamma_m) = \text{argmin}_{\beta, \gamma} \sum_{i=1}^N L\big(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)\big)$;
- the estimate is updated as $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$;
- e.g., in AdaBoost, $\beta_m = \alpha_m / 2$ and $b(x; \gamma_m) = G(x)$.

(A small code sketch of one concrete instance of this scheme follows below.)
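The sketch below illustrates forward stagewise additive modelling for one concrete choice that I am assuming for illustration: squared-error loss $L(y, f) = (y - f)^2$ with a regression stump as basis $b(x; \gamma)$. With this loss, the stagewise argmin over $(\beta, \gamma)$ amounts to least-squares fitting of the stump to the current residuals $y_i - f_{m-1}(x_i)$; the fitted constants of the stump absorb $\beta_m$. All names are illustrative, not from the course material.

# Minimal sketch of forward stagewise additive modelling with squared-error loss
# and regression stumps as weak learners. Illustrative assumption, not the slides' method.
import numpy as np

def fit_stump(X, r):
    """Least-squares regression stump: split on one variable, constant prediction on each side."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            left = X[:, j] <= thr
            if left.all() or (~left).all():
                continue                                   # skip degenerate splits
            cl, cr = r[left].mean(), r[~left].mean()       # optimal constants for squared error
            rss = np.sum((r[left] - cl) ** 2) + np.sum((r[~left] - cr) ** 2)
            if best is None or rss < best[0]:
                best = (rss, j, thr, cl, cr)
    return best[1:]

def predict_stump(stump, X):
    j, thr, cl, cr = stump
    return np.where(X[:, j] <= thr, cl, cr)

def fsam(X, y, M=50):
    f = np.zeros(len(y))                                   # f_0 = 0 as initial guess
    basis = []
    for m in range(M):
        residuals = y - f                                  # what is left to explain
        stump = fit_stump(X, residuals)                    # stagewise (beta_m, gamma_m) for L2 loss
        f = f + predict_stump(stump, X)                    # f_m = f_{m-1} + beta_m b(x; gamma_m)
        basis.append(stump)
    return basis

def predict_fsam(basis, X):
    return sum(predict_stump(s, X) for s in basis)

# toy usage on a nonlinear regression problem
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)
basis = fsam(X, y, M=50)
print("training MSE:", np.mean((y - predict_fsam(basis, X)) ** 2))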
Statistical Boosting: why exponential loss?

The statistical view of boosting:
- allows us to interpret the results;
- by studying the properties of the exponential loss.

It is easy to show that
$$ f^*(x) = \text{argmin}_{f(x)} \, E_{Y \mid X = x}\big[e^{-Y f(x)}\big] = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)}, $$
i.e.,
$$ \Pr(Y = 1 \mid x) = \frac{1}{1 + e^{-2 f^*(x)}}; $$
therefore AdaBoost estimates one-half the log-odds of $\Pr(Y = 1 \mid x)$ (a short derivation is sketched at the end of this section).

Statistical Boosting: why exponential loss?

Note:
- the exponential loss is not the only possible loss function;
- deviance (cross-entropy): binomial negative log-likelihood,
  $$ -\ell(\pi_x) = -y' \log(\pi_x) - (1 - y') \log(1 - \pi_x), $$
  where:
  - $y' = (y + 1)/2$, i.e., $y' \in \{0, 1\}$;
  - $\pi_x = \Pr(Y = 1 \mid X = x) = \frac{e^{f(x)}}{e^{-f(x)} + e^{f(x)}} = \frac{1}{1 + e^{-2 f(x)}}$;
- equivalently, $-\ell(\pi_x) = \log\big(1 + e^{-2 y f(x)}\big)$;
- same population minimizer for $E[-\ell(\pi_x)]$ and $E[e^{-Y f(x)}]$.

Statistical Boosting: steepest descent

We saw that AdaBoost iteratively minimizes a loss function. In general, consider:
- $L(f) = \sum_{i=1}^N L(y_i, f(x_i))$;
- $\hat{f} = \text{argmin}_f L(f)$;
- the minimization problem can be solved by considering
  $$ f_{m_{stop}} = \sum_{m=0}^{m_{stop}} h_m, $$
  where:
  - $f_0 = h_0$ is the initial guess;
  - each $f_m$ improves the previous $f_{m-1}$ through $h_m$;
  - $h_m$ is called the "step".
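The exponential-loss slide above states that the population minimizer "is easy to show". For completeness, here is a brief derivation sketch, not from the slides but following the standard argument of Friedman et al. (2000), written as a LaTeX fragment (it assumes the amsmath package):

% Sketch of the derivation behind the exponential-loss population minimizer.
% Write p = Pr(Y = 1 | x); at a fixed x the conditional expected loss depends only
% on the scalar value f = f(x).
\begin{align*}
E_{Y \mid X = x}\!\left[ e^{-Y f} \right]
  &= p\, e^{-f} + (1 - p)\, e^{f}. \\
\intertext{Setting the derivative with respect to $f$ to zero,}
-p\, e^{-f} + (1 - p)\, e^{f} &= 0
  \quad\Longleftrightarrow\quad
  e^{2 f} = \frac{p}{1 - p}
  \quad\Longleftrightarrow\quad
  f^*(x) = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)}.
\end{align*}
% Solving f*(x) = (1/2) log(p / (1 - p)) for p gives p = 1 / (1 + e^{-2 f*(x)}),
% i.e. the second display on the slide; the second derivative p e^{-f} + (1 - p) e^{f} > 0
% confirms the stationary point is a minimum.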