
Statistical Machine Learning: A Crash Course, Part III: Boosting (PowerPoint presentation)



  1. Statistical Machine Learning: A Crash Course, Part III: Boosting
     Stefan Roth, 11.05.2012 | Department of Computer Science | GRIS

  2. Combining Classifiers
     ■ Horse race prediction:

  3. Combining Classifiers
     ■ How do we make money from horse racing bets?
     ■ Ask a professional.
     ■ It is very likely that...
       • The professional cannot give a single highly accurate rule.
       • But presented with a set of races, they can always generate better-than-random rules.
     ■ Can you get rich?
     ■ Disclaimer: We are not saying you should actually try this at home :-)

  4. Combining Classifiers
     ■ Idea:
       • Ask an expert for their rule-of-thumb.
       • Assemble the set of cases where the rule-of-thumb fails (hard cases).
       • Ask the expert again for the selected set of hard cases.
       • And so on…
     ■ Combine many rules-of-thumb.

  5. Combining Classifiers
     ■ How do we actually do this?
     ■ How do we choose races on each round?
       • Concentrate on the “hardest” races (those most often misclassified by previous rules of thumb).
     ■ How do we combine rules of thumb into a single prediction rule?
       • Take a (weighted) majority vote of several rules-of-thumb h_t : R^d → {+1, −1}.
       • We take a weighted average of simple rules (models):
         H(x) = sign( Σ_{t=1}^T α_t h_t(x) )
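To make the weighted vote concrete, here is a minimal Python sketch (my own illustration; the function and variable names are placeholders, not from the slides):

import numpy as np

def strong_classify(x, weak_classifiers, alphas):
    # Weighted majority vote H(x) = sign(sum_t alpha_t * h_t(x)).
    # weak_classifiers: list of callables mapping a sample x to +1 or -1.
    # alphas: the corresponding non-negative weights.
    score = sum(a * h(x) for a, h in zip(alphas, weak_classifiers))
    return 1 if score >= 0 else -1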

  6. Boosting
     ■ General method of converting rough rules of thumb into a highly accurate prediction rule.
     ■ More formally:
       • Given a “weak” learning algorithm that can consistently find “weak classifiers” with a (training) error of ≤ 1/2 − γ,
       • a boosting algorithm can provably construct a “strong classifier” with a training error of ≤ ε.
     ■ As long as we have a “weak” learning algorithm that does better than chance, we can convert it into an algorithm that performs arbitrarily well!

  7. AdaBoost: Toy Example
     ■ Training data:

  8. AdaBoost: Toy Example
     ■ Round 1: (figure: 1st weak classifier; reweighted training data)

  9. AdaBoost: Toy Example
     ■ Round 2: (figure: 1st weak classifier; 2nd weak classifier; reweighted training data)

  10. AdaBoost: Toy Example
     ■ Round 3: (figure: 1st weak classifier; 2nd weak classifier; 3rd weak classifier)

  11. AdaBoost: Toy Example
     ■ Weighted combination:

  12. AdaBoost: Toy Example
     ■ Final hypothesis / “strong” classifier:

  13. AdaBoost
     ■ Given: Training data with labels (x_1, y_1), ..., (x_N, y_N), where x_i ∈ R^d and y_i ∈ {+1, −1}.
     ■ Initialize weights for every data point: D_1(i) = 1/N.
     ■ Loop over t = 1, ..., T (T = number of boosting rounds):
       • Train the weak learner h_t : R^d → {+1, −1} on the training data so that the weighted error with weights D_t is minimized.
       • Choose an appropriate weight α_t ∈ R^+ for the weak classifier.
       • Update the data weights as
         D_{t+1}(i) = (1/Z_t) · D_t(i) · exp{−α_t y_i h_t(x_i)},
         where Z_t is chosen such that D_{t+1} sums to 1.
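A minimal NumPy sketch of this loop (my own illustration, not code from the deck; fit_weak_learner is a hypothetical callable that fits a weak classifier to the weighted data and returns an object with a .predict method):

import numpy as np

def adaboost_train(X, y, fit_weak_learner, T):
    # X: (N, d) feature matrix; y: length-N label array with entries in {+1, -1}.
    N = X.shape[0]
    D = np.full(N, 1.0 / N)                       # D_1(i) = 1/N
    classifiers, alphas = [], []
    for _ in range(T):
        h = fit_weak_learner(X, y, D)             # weak learner minimizing the D-weighted error
        pred = h.predict(X)
        eps = float(np.sum(D * (pred != y)))      # weighted training error eps_t
        if eps >= 0.5:                            # no better than chance: stop (cf. slide 24)
            break
        eps = max(eps, 1e-12)                     # guard the log if the weak learner is perfect
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # alpha_t, derived on slide 19
        classifiers.append(h)
        alphas.append(alpha)
        D = D * np.exp(-alpha * y * pred)         # up-weight mistakes, down-weight correct ones
        D = D / D.sum()                           # normalize by Z_t so the weights sum to 1
    return classifiers, alphas

Prediction is then the weighted vote H(x) = sign( Σ_t α_t h_t(x) ) sketched after slide 5.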

  14. AdaBoost
     ■ Given: Training data with labels (x_1, y_1), ..., (x_N, y_N), where x_i ∈ R^d and y_i ∈ {+1, −1}.
     ■ Return the weighted (“strong”, ensemble) classifier:
       H(x) = sign( Σ_{t=1}^T α_t h_t(x) )
     ■ Intuition:
       • Boosting uses weighted training data and adapts the weights every round.
       • The weights make the algorithm focus on the wrongly classified examples:
         exp{−α_t y_i h_t(x_i)} > 1 if y_i ≠ h_t(x_i), and < 1 if y_i = h_t(x_i).

  15. AdaBoost: Weak Learners
     ■ Training the weak learner:
       • Given training data (x_1, y_1), ..., (x_N, y_N)
       • and weights D_t(i) for all data points.
       • Select the weak classifier with the smallest weighted error:
         h_t = arg min_{h ∈ H} ε_t   with   ε_t = Σ_{i=1}^N D_t(i) [y_i ≠ h(x_i)]
       • Prerequisite: weighted training error ε_t ≤ 1/2 − γ_t, with γ_t > 0.
     ■ Examples for H:
       • Weighted least-squares classifier
       • Decision stumps (hold on...)
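As one concrete instantiation of H, here is a brute-force decision stump that minimizes the weighted error; a sketch under the same assumptions as the earlier code, not an optimized implementation:

import numpy as np

class DecisionStump:
    # Thresholds a single feature: predict sign if x[j] > threshold, else -sign.
    def fit(self, X, y, D):
        N, d = X.shape
        best_err = np.inf
        for j in range(d):                                  # try every feature
            for thr in np.unique(X[:, j]):                  # every observed threshold
                for sign in (+1, -1):                       # and both orientations
                    pred = np.where(X[:, j] > thr, sign, -sign)
                    err = np.sum(D * (pred != y))           # weighted 0/1 error
                    if err < best_err:
                        best_err = err
                        self.feature, self.threshold, self.sign = j, thr, sign
        return self

    def predict(self, X):
        return np.where(X[:, self.feature] > self.threshold, self.sign, -self.sign)

# Plugged into the earlier sketch, e.g.:
#   classifiers, alphas = adaboost_train(X, y, lambda X, y, D: DecisionStump().fit(X, y, D), T=50)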

  16. AdaBoost: Weak Learners

  17. AdaBoost: Weak Learners

  18. AdaBoost
     ■ How do we select α_t?
     ■ We want to minimize the empirical error:
       ε_tr(H) = (1/N) Σ_{i=1}^N [y_i ≠ H(x_i)]
     ■ The empirical error can be upper bounded [Freund & Schapire]:
       ε_tr(H) ≤ Π_{t=1}^T Z_t = Π_{t=1}^T ( Σ_{i=1}^N D_t(i) exp{−α_t y_i h_t(x_i)} )
     ■ To minimize the empirical error, we can greedily minimize Z_t in each round.
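For completeness, here is the short argument behind this bound (the standard AdaBoost analysis, which the slide only cites), written in LaTeX:

Unrolling the weight update from slide 13 gives
\[
  D_{T+1}(i) = \frac{1}{N}\,
  \frac{\exp\{-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\}}{\prod_{t=1}^{T} Z_t}.
\]
Whenever $H$ misclassifies $x_i$, the exponent is nonnegative, so
$[y_i \neq H(x_i)] \le \exp\{-y_i \sum_t \alpha_t h_t(x_i)\}$ for every $i$.
Summing over $i$ and using $\sum_i D_{T+1}(i) = 1$:
\[
  \epsilon_{\mathrm{tr}}(H)
  \le \frac{1}{N}\sum_{i=1}^{N} \exp\Big\{-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\Big\}
  = \sum_{i=1}^{N} D_{T+1}(i) \prod_{t=1}^{T} Z_t
  = \prod_{t=1}^{T} Z_t .
\]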

  19. AdaBoost
     ■ Select α_t by greedily minimizing Z_t(α) in each round.
       • Minimizes an upper bound on the empirical error.
     ■ Minimize
       Z_t(α) = Σ_{i=1}^N D_t(i) exp{−α y_i h_t(x_i)}
     ■ We obtain the AdaBoost weighting:
       α_t = (1/2) log( (1 − ε_t) / ε_t )
       with ε_t = Σ_{i=1}^N D_t(i) [y_i ≠ h_t(x_i)]
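The minimization itself is short; sketched here in LaTeX (standard derivation, not spelled out on the slide):

Splitting the sum over correctly and incorrectly classified examples,
\[
  Z_t(\alpha) = \sum_{i:\, y_i = h_t(x_i)} D_t(i)\, e^{-\alpha}
              + \sum_{i:\, y_i \neq h_t(x_i)} D_t(i)\, e^{\alpha}
              = (1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}.
\]
Setting $\frac{dZ_t}{d\alpha} = -(1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha} = 0$
gives $e^{2\alpha} = (1-\epsilon_t)/\epsilon_t$, i.e. $\alpha_t = \tfrac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$.
Plugging back in, $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)} < 1$ whenever $\epsilon_t < 1/2$, so each round shrinks the bound on the training error.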

  20. AdaBoost: Reweighting
     D_{t+1}(i) = (1/Z_t) · D_t(i) · exp{−α_t y_i h_t(x_i)}

  21. AdaBoost: Reweighting
     D_{t+1}(i) = (1/Z_t) · D_t(i) · exp{−α_t y_i h_t(x_i)}
     ■ Increase the weight on incorrectly classified examples.
     ■ Decrease the weight on correctly classified examples.
       exp{−α_t y_i h_t(x_i)} > 1 if y_i ≠ h_t(x_i), and < 1 if y_i = h_t(x_i)
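A quick numeric illustration of the update (my own numbers, not from the deck):

import numpy as np

eps = 0.2                                   # hypothetical weighted error of round t
alpha = 0.5 * np.log((1 - eps) / eps)       # ≈ 0.693
print(np.exp(+alpha))                       # ≈ 2.0: multiplier for misclassified examples
print(np.exp(-alpha))                       # ≈ 0.5: multiplier for correctly classified ones
# After normalizing by Z_t = 2*sqrt(eps*(1-eps)) = 0.8, the total weight on the
# mistakes becomes 0.2 * 2 / 0.8 = 0.5: each round rebalances the weights so that
# the previous weak learner looks no better than chance on the new distribution.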

  22. AdaBoost: Reweighting
     ■ Eventually only the very difficult cases will be focused on:

  23. AdaBoost: More realistic example (t = 0)
     ■ Initialize...

  24. AdaBoost: More realistic example (t = 1)
     ■ Initialize...
     ■ For t = 1, ..., T:
       • Find h_t = arg min_{h ∈ H} ε_t
       • Stop if ε_t > 1/2
       • Set α_t = (1/2) log( (1 − ε_t) / ε_t )
       • Reweight the data:
         D_{t+1}(i) = (1/Z_t) · D_t(i) · exp{−α_t y_i h_t(x_i)}

  25.–30. AdaBoost: More realistic example (t = 2, ..., 7)
     ■ The same per-round steps as on the previous slide are repeated; only the round index t changes from slide to slide.
