Theory and Applications of Boosting

Yoav Freund, UCSD
Santa Cruz Summer School 2012 (Monday, July 16, 2012)
Many slides from Rob Schapire


  1. Theory and Applications of Boosting
     Yoav Freund, UCSD
     Many slides from Rob Schapire
     Santa Cruz Summer School 2012, Monday, July 16, 2012

  2. [title slide continued: Santa Cruz Summer School 2012]

  3. Plan
     • Day 1: Basics
       • Boosting
       • AdaBoost
       • Boosting and Loss minimization
       • Margins theory
       • Confidence-rated boosting
     • Day 2: Applications
       • ADTrees
       • JBoost
       • Viola and Jones
       • Active Learning and Pedestrian Detection
       • Genome Wide association studies
       • Online boosting and tracking
     • Day 3: Advanced Topics
       • Boosting and repeated matrix games
       • Drifting games and Boost By Majority
       • Brownboost and Boosting with High Noise

  4. Example: “How May I Help You?” [Gorin et al.]
     • goal: automatically categorize type of call requested by phone customer
       (Collect, CallingCard, PersonToPerson, etc.)
       • yes I’d like to place a collect call long distance please (Collect)
       • operator I need to make a call but I need to bill it to my office (ThirdNumber)
       • yes I’d like to place a call on my master card please (CallingCard)
       • I just called a number in sioux city and I musta rang the wrong number because I got the
         wrong party and I would like to have that taken off of my bill (BillingCredit)
     • observation:
       • easy to find “rules of thumb” that are “often” correct
         • e.g.: “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’”
       • hard to find single highly accurate prediction rule
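The keyword rule quoted on this slide is easy to make concrete. Below is a minimal sketch, not the Gorin et al. system; the function name and the fallback label are illustrative placeholders.

```python
# Minimal sketch (not the Gorin et al. system) of the keyword rule of thumb
# quoted above.  The function name and the 'Other' fallback label are
# illustrative placeholders.
def card_rule_of_thumb(utterance: str) -> str:
    """Rough rule: predict 'CallingCard' whenever the word 'card' appears."""
    if "card" in utterance.lower().split():
        return "CallingCard"
    return "Other"   # placeholder: the rough rule has no real opinion here

print(card_rule_of_thumb("yes I'd like to place a call on my master card please"))
# -> CallingCard  (often correct, but far from a single highly accurate rule)
```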

  5. The Boosting Approach
     • devise computer program for deriving rough rules of thumb
     • apply procedure to subset of examples
     • obtain rule of thumb
     • apply to 2nd subset of examples
     • obtain 2nd rule of thumb
     • repeat T times

  6. Key Details
     • how to choose examples on each round?
       • concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)
     • how to combine rules of thumb into single prediction rule?
       • take (weighted) majority vote of rules of thumb

  7. Boosting
     • boosting = general method of converting rough rules of thumb into highly accurate prediction rule
     • technically:
       • assume given “weak” learning algorithm that can consistently find classifiers (“rules of thumb”)
         at least slightly better than random, say, accuracy ≥ 55% (in two-class setting)
         [“weak learning assumption”]
       • given sufficient data, a boosting algorithm can provably construct single classifier with
         very high accuracy, say, 99%

  8. Some History
     • How it all began ...

  9. Strong and Weak Learnability
     • boosting’s roots are in “PAC” learning model [Valiant ’84]
       • get random examples from unknown, arbitrary distribution
     • strong PAC learning algorithm:
       • for any distribution, with high probability, given polynomially many examples (and polynomial
         time), can find classifier with arbitrarily small generalization error
     • weak PAC learning algorithm:
       • same, but generalization error only needs to be slightly better than random guessing (1/2 − γ)
     • [Kearns & Valiant ’88]:
       • does weak learnability imply strong learnability?

  10. If Boosting Possible, Then...
      • can use (fairly) wild guesses to produce highly accurate predictions
      • if can learn “part way” then can learn “all the way”
      • should be able to improve any learning algorithm
      • for any learning problem:
        • either can always learn with nearly perfect accuracy
        • or there exist cases where cannot learn even slightly better than random guessing

  11. First Boosting Algorithms
      • [Schapire ’89]:
        • first provable boosting algorithm
      • [Freund ’90]:
        • “optimal” algorithm that “boosts by majority”
      • [Drucker, Schapire & Simard ’92]:
        • first experiments using boosting
        • limited by practical drawbacks
      • [Freund & Schapire ’95]:
        • introduced “AdaBoost” algorithm
        • strong practical advantages over previous boosting algorithms

  12. Basic Algorithm and Core Theory
      • introduction to AdaBoost
      • analysis of training error
      • analysis of test error and the margins theory
      • experiments and applications

  13. A Formal Description of Boosting
      • given training set (x_1, y_1), …, (x_m, y_m)
        • y_i ∈ {−1, +1} correct label of instance x_i ∈ X
      • for t = 1, …, T:
        • construct distribution D_t on {1, …, m}
        • find weak classifier (“rule of thumb”) h_t : X → {−1, +1} with small error ε_t on D_t:
            ε_t = Pr_{i ∼ D_t}[ h_t(x_i) ≠ y_i ]
      • output final classifier H_final
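The loop on this slide translates almost line for line into code. The sketch below is illustrative rather than taken from the course materials; `weak_learner`, `reweight`, and `alpha_for` are placeholder hooks whose AdaBoost instantiations appear on the next slide.

```python
# Minimal sketch of the boosting loop on this slide (illustrative, not the
# course's code).  `weak_learner`, `reweight`, and `alpha_for` are placeholder
# hooks; AdaBoost's concrete choices for them appear on the next slide.
import numpy as np

def boost(X, y, weak_learner, reweight, alpha_for, T):
    """X: (m, d) array of instances; y: (m,) array of labels in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)            # D_1: uniform distribution on {1, ..., m}
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)      # weak classifier h_t : X -> {-1, +1}
        eps = D[h(X) != y].sum()       # eps_t = Pr_{i ~ D_t}[h_t(x_i) != y_i]
        a = alpha_for(eps)             # vote weight assigned to h_t
        D = reweight(D, h, X, y, a)    # construct D_{t+1}
        hypotheses.append(h)
        alphas.append(a)

    def H_final(X_new):                # final classifier: weighted majority vote
        votes = sum(a * h(X_new) for a, h in zip(alphas, hypotheses))
        return np.sign(votes)

    return H_final
```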

  14. AdaBoost [Freund & Schapire ’96]
      • constructing D_t:
        • D_1(i) = 1/m
        • given D_t and h_t:
            D_{t+1}(i) = (D_t(i) / Z_t) × e^{−α_t}  if y_i = h_t(x_i)
                       = (D_t(i) / Z_t) × e^{α_t}   if y_i ≠ h_t(x_i)
                       = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
          where Z_t = normalization factor and
            α_t = (1/2) ln((1 − ε_t) / ε_t) > 0
      • final classifier:
        • H_final(x) = sign( Σ_t α_t h_t(x) )
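Under the same assumptions as the sketch after slide 13, AdaBoost's concrete choices of α_t and of the reweighting rule can be filled in as follows; again an illustrative sketch, not the slides' code, reusing the placeholder hook names `alpha_for` and `reweight`.

```python
# Sketch of AdaBoost's concrete choices, plugging into the generic loop sketched
# after slide 13 (the hook names `alpha_for` / `reweight` are the same
# illustrative placeholders).
import numpy as np

def alpha_for(eps):
    # alpha_t = (1/2) ln((1 - eps_t) / eps_t); positive whenever eps_t < 1/2
    return 0.5 * np.log((1.0 - eps) / eps)

def reweight(D, h, X, y, alpha):
    # D_{t+1}(i) is proportional to D_t(i) * exp(-alpha_t * y_i * h_t(x_i));
    # Z_t is simply the normalization that makes D_{t+1} sum to 1.
    D_next = D * np.exp(-alpha * y * h(X))
    return D_next / D_next.sum()
```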

  15. Toy Example
      • [figure: training points in the plane with initial uniform distribution D_1]
      • weak classifiers = vertical or horizontal half-planes
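The "vertical or horizontal half-planes" are axis-aligned threshold classifiers (decision stumps). A brute-force weak learner for them might look like the sketch below; the exhaustive search strategy is an assumption for illustration, not something specified on the slide.

```python
# Sketch of a weak learner for the toy example's hypothesis class: axis-aligned
# half-planes (decision stumps).  The brute-force search is illustrative only.
import numpy as np

def stump_weak_learner(X, y, D):
    """Return the axis-aligned half-plane h with smallest D-weighted error."""
    m, d = X.shape
    best_err, best = np.inf, None
    for j in range(d):                                    # which coordinate (vertical/horizontal)
        for thresh in np.unique(X[:, j]):                 # candidate threshold positions
            for sign in (+1, -1):                         # which side is labeled +1
                pred = sign * np.where(X[:, j] > thresh, 1, -1)
                err = D[pred != y].sum()                  # weighted error on D_t
                if err < best_err:
                    best_err, best = err, (j, thresh, sign)
    j, thresh, sign = best
    return lambda X_new: sign * np.where(X_new[:, j] > thresh, 1, -1)
```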

  16. Round 1
      • [figure: weak classifier h_1 and resulting distribution D_2]
      • ε_1 = 0.30, α_1 = 0.42

  17. Round 2
      • [figure: weak classifier h_2 and resulting distribution D_3]
      • ε_2 = 0.21, α_2 = 0.65

  18. Round 3
      • [figure: weak classifier h_3]
      • ε_3 = 0.14, α_3 = 0.92

  19. Final Classifier
      • [figure: final decision boundary formed by combining the three half-planes]
      • H_final = sign( 0.42 h_1 + 0.65 h_2 + 0.92 h_3 )
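As a quick sanity check (not part of the slides), the combination weights follow from the round errors via α_t = (1/2) ln((1 − ε_t)/ε_t), up to rounding of the displayed ε_t values.

```python
# Quick check (not from the slides) that the vote weights follow from the round
# errors via alpha_t = (1/2) ln((1 - eps_t) / eps_t).
import numpy as np

eps = np.array([0.30, 0.21, 0.14])
alpha = 0.5 * np.log((1 - eps) / eps)
print(np.round(alpha, 2))   # -> [0.42 0.66 0.91], matching 0.42 / 0.65 / 0.92
                            #    up to rounding of the displayed eps_t values

# The final classifier is then the weighted vote of the three half-planes:
#   H_final(x) = sign(0.42 * h1(x) + 0.65 * h2(x) + 0.92 * h3(x))
```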

  20. http://cseweb.ucsd.edu/~yfreund/adaboost/index.html

  21. Basic Algorithm and Core Theory
      • introduction to AdaBoost
      • analysis of training error
      • analysis of test error and the margins theory
      • experiments and applications

  22. Analyzing the Training Error [Freund & Schapire ’96]
      • Theorem:
        • write ε_t as 1/2 − γ_t      [γ_t = “edge”]
        • then
            training error(H_final) ≤ Π_t [ 2 √(ε_t (1 − ε_t)) ]
                                    = Π_t √(1 − 4 γ_t²)
                                    ≤ exp( −2 Σ_t γ_t² )
      • so: if ∀ t: γ_t ≥ γ > 0, then training error(H_final) ≤ e^{−2 γ² T}
      • AdaBoost is adaptive:
        • does not need to know γ or T a priori
        • can exploit γ_t ≫ γ
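To make the bound concrete, one can plug the toy example's round errors into both expressions; the snippet below is an illustrative check, not part of the lecture.

```python
# Illustrative check (not from the lecture): evaluate both forms of the bound,
#   prod_t 2 sqrt(eps_t (1 - eps_t))   and   exp(-2 sum_t gamma_t^2),
# on the toy example's round errors.
import numpy as np

eps = np.array([0.30, 0.21, 0.14])
gamma = 0.5 - eps                                      # gamma_t = "edge"
product_bound = np.prod(2 * np.sqrt(eps * (1 - eps)))
exp_bound = np.exp(-2 * np.sum(gamma ** 2))
print(product_bound, exp_bound)                        # ~0.52 <= ~0.60: the product
                                                       # form is the tighter bound here
```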
