
BBM406 Fundamentals of Machine Learning, Lecture 20: AdaBoost



  1. BBM406 Fundamentals of Machine Learning, Lecture 20: AdaBoost. Aykut Erdem // Hacettepe University // Fall 2019. Illustration adapted from Alex Rogozhnikov.

  2. Last time… Bias/Variance Tradeoff. Graphical illustration of bias and variance: http://scott.fortmann-roe.com/docs/BiasVariance.html (slide by David Sontag) 2

  3. Last time… Bagging • Leo Breiman (1994) • Take repeated bootstrap samples from training set D. • Bootstrap sampling: given a set D containing N training examples, create D′ by drawing N examples at random with replacement from D. • Bagging: - Create k bootstrap samples D_1, ..., D_k. - Train a distinct classifier on each D_i. - Classify a new instance by majority vote / average. (slide by David Sontag) 3
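Not from the slides: a minimal sketch of the bagging procedure described on this slide, assuming non-negative integer class labels and using scikit-learn decision trees as an illustrative base classifier.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, random_state=0):
    """Train k classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)          # draw N examples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classify new instances by majority vote over the k classifiers."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)   # shape (k, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```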

  4. Last time… Random Forests. [Figure: example trees t=1, t=2, t=3, from the book of Hastie, Friedman and Tibshirani] (slide by Nando de Freitas) 4

  5. Boosting 5

  6. Boosting Ideas • Main idea: use weak learner to create strong learner. • Ensemble method: combine base classifiers returned by weak learner. • Finding simple relatively accurate base classifiers often not hard. • But, how should base classifiers be combined? slide by Mehryar Mohri 6

  7. Example: “How May I Help You?” [Gorin et al.] • Goal: automatically categorize the type of call requested by a phone customer (Collect, CallingCard, PersonToPerson, etc.): - “yes I’d like to place a collect call long distance please” (Collect) - “operator I need to make a call but I need to bill it to my office” (ThirdNumber) - “yes I’d like to place a call on my master card please” (CallingCard) - “I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill” (BillingCredit) • Observation: - easy to find “rules of thumb” that are “often” correct, e.g. “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’ ” - hard to find a single highly accurate prediction rule (slide by Rob Schapire) 7

  8. Boosting: Intuition • Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space • Output class: (weighted) vote of each classifier - Classifiers that are most “sure” will vote with more conviction - Classifiers will be most “sure” about a particular part of the space - On average, do better than a single classifier! • But how do you… - force classifiers to learn about different parts of the input space? - weigh the votes of different classifiers? (slide by Aarti Singh & Barnabas Poczos) 8

  9. Boosting [Schapire, 1989] • Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote • On each iteration t: - weight each training example by how incorrectly it was classified - learn a hypothesis h_t - assign a strength α_t to this hypothesis • Final classifier: a linear combination of the votes of the different classifiers, weighted by their strength • Practically useful • Theoretically interesting (slide by Aarti Singh & Barnabas Poczos) 9
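Written out explicitly (the standard boosted classifier; the slide describes it only in words), the final “linear combination of votes” is:

```latex
H(x) \;=\; \operatorname{sign}\!\Big(\sum_{t=1}^{T} \alpha_t\, h_t(x)\Big),
\qquad h_t(x) \in \{-1,+1\}.
```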

  10. Boosting: Intuition • Want to pick weak classifiers that contribute something to the ensemble • Greedy algorithm: for m = 1, ..., M - pick a weak classifier h_m - adjust weights: misclassified examples get “heavier” - α_m set according to the weighted error of h_m (slide by Raquel Urtasun) [Source: G. Shakhnarovich] 10

  11.–16. (Same text as slide 10, repeated across slides 11–16.)

  17. First Boosting Algorithms • [Schapire ’89]: - first provable boosting algorithm • [Freund ’90]: - “optimal” algorithm that “boosts by majority” • [Drucker, Schapire & Simard ’92]: - first experiments using boosting - limited by practical drawbacks • [Freund & Schapire ’95]: - introduced “AdaBoost” algorithm - strong practical advantages over previous boosting algorithms (slide by Rob Schapire) 17

  18. The AdaBoost Algorithm 18

  19. Toy Example • Weak hypotheses = vertical or horizontal half-planes • Choose h_t to minimize the weighted error ε_t; for binary h_t, typically use α_t = ½ ln((1 − ε_t)/ε_t) (slide by Rob Schapire) 19

  20. Round 1: h_1, ε_1 = 0.30 (slide by Rob Schapire) 20

  21. Round 1: h_1, ε_1 = 0.30, α_1 = 0.42 (slide by Rob Schapire) 21

  22. Round 1: h_1, updated distribution D_2; ε_1 = 0.30, α_1 = 0.42 (slide by Rob Schapire) 22

  23. Round 2: h_2, ε_2 = 0.21 (slide by Rob Schapire) 23

  24. Round 2: h_2, ε_2 = 0.21, α_2 = 0.65 (slide by Rob Schapire) 24

  25. Round 2: h_2, updated distribution D_3; ε_2 = 0.21, α_2 = 0.65 (slide by Rob Schapire) 25

  26. Round 3: h_3, ε_3 = 0.14 (slide by Rob Schapire) 26

  27. Round 3: h_3, ε_3 = 0.14, α_3 = 0.92 (slide by Rob Schapire) 27

  28. Final Hypothesis: H_final(x) = sign(0.42 h_1(x) + 0.65 h_2(x) + 0.92 h_3(x)) (slide by Rob Schapire) 28
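As a sanity check (not on the slides), the votes follow from the standard AdaBoost rule α_t = ½ ln((1 − ε_t)/ε_t); plugging in the displayed errors gives roughly

```latex
\alpha_1 = \tfrac12\ln\tfrac{0.70}{0.30} \approx 0.42,\qquad
\alpha_2 = \tfrac12\ln\tfrac{0.79}{0.21} \approx 0.66,\qquad
\alpha_3 = \tfrac12\ln\tfrac{0.86}{0.14} \approx 0.91,
```

which agree with the slide's 0.42, 0.65, 0.92 up to rounding of the shown ε_t.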

  29. Voted combination of classifiers • The general problem here is to try to combine many simple “weak” classifiers into a single “strong” classifier • We consider voted combinations of simple binary ±1 component classifiers, h_m(x) = α_1 h(x; θ_1) + ... + α_m h(x; θ_m), where the (non-negative) votes α_i can be used to emphasize component classifiers that are more reliable than others (slide by Tommi S. Jaakkola) 29

  30. Components: Decision stumps • Consider the following simple family of component classifiers generating ±1 labels: h(x; θ) = sign(w_1 x_k − w_0), with parameters θ = {k, w_1, w_0}; these are called decision stumps • Each decision stump pays attention to only a single component of the input vector (slide by Tommi S. Jaakkola) 30
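Not from the slides: a tiny sketch of a weighted decision-stump learner of the kind described above. The helper names (`fit_stump`, `stump_predict`) are hypothetical; the stump thresholds a single input coordinate and is chosen to minimize weighted error.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: h(x) = s * sign(x[k] - t).
    X: (n, d) array, y: labels in {-1, +1}, w: non-negative weights summing to 1.
    Returns (weighted error, (k, t, s)) for the best coordinate/threshold/sign."""
    n, d = X.shape
    best = (np.inf, None)
    for k in range(d):
        for t in np.unique(X[:, k]):
            pred = np.where(X[:, k] > t, 1, -1)
            for s in (+1, -1):
                err = np.sum(w * (s * pred != y))   # weighted misclassification
                if err < best[0]:
                    best = (err, (k, t, s))
    return best

def stump_predict(params, X):
    k, t, s = params
    return s * np.where(X[:, k] > t, 1, -1)
```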

  31. Voted combinations (cont’d.) • We need to define a loss function for the combination so we can determine which new component h(x; θ) to add and how many votes it should receive • While there are many options for the loss function, we consider here only a simple exponential loss (slide by Tommi S. Jaakkola) 31
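Written out (the standard exponential loss, not transcribed from the slide), the empirical loss of the current combination h_m over the training set is:

```latex
J(h_m) \;=\; \sum_{i=1}^{n} \exp\!\big(-y_i\, h_m(x_i)\big),
\qquad \text{Loss}(y, h(x)) = e^{-y\,h(x)}.
```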

  32. Modularity, errors, and loss • Consider adding the m-th component: (slide by Tommi S. Jaakkola) 32

  33. Modularity, errors, and loss • Consider adding the m-th component: (slide by Tommi S. Jaakkola) 33

  34. Modularity, errors, and loss • Consider adding the m-th component: • So at the m-th iteration the new component (and the votes) should optimize a weighted loss (weighted towards mistakes). (slide by Tommi S. Jaakkola) 34
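The derivation on slides 32–34 appears only as images in this transcript; a sketch of the usual argument (my reconstruction, following the standard AdaBoost derivation) is: add the m-th component to h_{m−1} and expand the exponential loss,

```latex
J(\alpha_m,\theta_m)
  = \sum_{i=1}^{n} \exp\!\big(-y_i\,[\,h_{m-1}(x_i) + \alpha_m h(x_i;\theta_m)\,]\big)
  = \sum_{i=1}^{n} W_i^{(m-1)} \exp\!\big(-y_i\,\alpha_m h(x_i;\theta_m)\big),
\qquad
W_i^{(m-1)} = \exp\!\big(-y_i\,h_{m-1}(x_i)\big),
```

so the new component sees each example through a weight W_i^(m−1) that is large exactly when the current ensemble gets that example wrong — a loss weighted towards mistakes.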

  35. Empirical exponential loss (cont’d.) • To increase modularity we’d like to further decouple the optimization of h(x; θ_m) from the associated votes α_m • To this end we select the h(x; θ_m) that optimizes the rate at which the loss would decrease as a function of α_m (slide by Tommi S. Jaakkola) 35
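Concretely (my reconstruction of the step the slide alludes to): differentiate the weighted loss at α_m = 0, so the best new component is the one maximizing the weighted agreement with the labels,

```latex
-\frac{\partial}{\partial \alpha_m} J(\alpha_m,\theta_m)\Big|_{\alpha_m=0}
  \;=\; \sum_{i=1}^{n} W_i^{(m-1)}\, y_i\, h(x_i;\theta_m).
```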

  36. Empirical exponential loss (cont’d.) • We find the component h(x; θ̂_m) that minimizes the weighted training error (equivalently, maximizes the weighted agreement from the previous slide) • We can also normalize the weights: W̃_i^(m−1) = W_i^(m−1) / Σ_j W_j^(m−1), so that Σ_i W̃_i^(m−1) = 1 (slide by Tommi S. Jaakkola) 36

  37. Empirical exponential loss (cont’d.) • We find the component h(x; θ̂_m) that minimizes the weighted error, where the weights are the normalized W̃_i^(m−1) • α_m is subsequently chosen to minimize the resulting weighted exponential loss J(α_m, θ̂_m) (slide by Tommi S. Jaakkola) 37
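Solving that one-dimensional minimization in closed form (standard AdaBoost result, added here for completeness): with the weighted error

```latex
\varepsilon_m = \sum_{i=1}^{n} \tilde W_i^{(m-1)}\,\mathbb{1}\{y_i \neq h(x_i;\hat\theta_m)\},
\qquad
\hat\alpha_m = \tfrac{1}{2} \ln\frac{1-\varepsilon_m}{\varepsilon_m},
```

which reproduces the vote formula used in the toy example above.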

  38. The AdaBoost Algorithm (slide by Jiri Matas and Jan Šochman) 38

  39. The AdaBoost Algorithm. Given: (x_1, y_1), ..., (x_m, y_m); x_i ∈ X, y_i ∈ {−1, +1} (slide by Jiri Matas and Jan Šochman) 39

  40. The AdaBoost Algorithm. Given: (x_1, y_1), ..., (x_m, y_m); x_i ∈ X, y_i ∈ {−1, +1}. Initialise weights D_1(i) = 1/m (slide by Jiri Matas and Jan Šochman) 40
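The remaining steps of the algorithm are not included in this transcript; for reference, here is a compact sketch of the standard AdaBoost loop as described throughout the lecture (my reconstruction, not the slides' code), reusing the hypothetical `fit_stump` / `stump_predict` helpers from the decision-stump sketch above.

```python
import numpy as np

def adaboost_fit(X, y, T=50):
    """AdaBoost with decision stumps. X: (n, d) array, y: labels in {-1, +1}."""
    n = len(X)
    D = np.full(n, 1.0 / n)                   # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        eps, params = fit_stump(X, y, D)      # weak learner on the weighted data
        if eps >= 0.5:                        # no better than chance: stop early
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        pred = stump_predict(params, X)
        D = D * np.exp(-alpha * y * pred)     # misclassified examples get "heavier"
        D /= D.sum()                          # renormalize so D_{t+1} is a distribution
        hypotheses.append(params)
        alphas.append(alpha)
    return hypotheses, alphas

def adaboost_predict(hypotheses, alphas, X):
    # Final classifier: H(x) = sign(sum_t alpha_t h_t(x))
    scores = sum(a * stump_predict(p, X) for p, a in zip(hypotheses, alphas))
    return np.sign(scores)
```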
