COMS 4721: Machine Learning for Data Science, Lecture 13, 3/2/2017


  1. COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University

  2. BOOSTING

     Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms, MIT Press, 2012. See this textbook for many more details. (I borrow some figures from that book.)

  3. BAGGING CLASSIFIERS

     Algorithm: Bagging binary classifiers
     Given $(x_1, y_1), \ldots, (x_n, y_n)$, $x \in \mathcal{X}$, $y \in \{-1, +1\}$
     - For $b = 1, \ldots, B$:
       - Sample a bootstrap dataset $\mathcal{B}_b$ of size $n$. For each entry in $\mathcal{B}_b$, select $(x_i, y_i)$ with probability $\frac{1}{n}$. Some $(x_i, y_i)$ will repeat and some won't appear in $\mathcal{B}_b$.
       - Learn a classifier $f_b$ using the data in $\mathcal{B}_b$.
     - Define the classification rule to be
       $$f_{\text{bag}}(x_0) = \text{sign}\left(\sum_{b=1}^{B} f_b(x_0)\right).$$
     - With bagging, we observe that a committee of classifiers votes on a label.
     - Each classifier is learned on a bootstrap sample from the data set.
     - Learning a collection of classifiers is referred to as an ensemble method.
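
     A minimal Python sketch of this procedure, assuming scikit-learn decision trees as the base classifiers; the function names and the choice of base classifier are illustrative, not from the lecture.

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        def bag_classifiers(X, y, B=50):
            """Learn B classifiers, each on a bootstrap sample of (X, y); labels in {-1, +1}."""
            n = len(y)
            ensemble = []
            for _ in range(B):
                idx = np.random.choice(n, size=n, replace=True)   # each (x_i, y_i) picked w.p. 1/n
                ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
            return ensemble

        def predict_bagged(ensemble, X0):
            """Committee vote: sign of the summed predictions (ties map to 0 in this sketch)."""
            votes = sum(f_b.predict(X0) for f_b in ensemble)
            return np.sign(votes)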

  4. BOOSTING

     "How is it that a committee of blockheads can somehow arrive at highly reasoned decisions, despite the weak judgment of the individual members?" - Schapire & Freund, Boosting: Foundations and Algorithms

     Boosting is another powerful method for ensemble learning. It is similar to bagging in that a set of classifiers is combined to make a better one. It works for any classifier, but a "weak" classifier that is easy to learn is usually chosen. (Weak = accuracy a little better than random guessing.)

     Short history
     - 1984: Leslie Valiant and Michael Kearns ask whether "boosting" is possible.
     - 1989: Robert Schapire creates the first boosting algorithm.
     - 1990: Yoav Freund creates an optimal boosting algorithm.
     - 1995: Freund and Schapire create AdaBoost (Adaptive Boosting), the major boosting algorithm.

  5. BAGGING VS BOOSTING (OVERVIEW)

     [Figure: side-by-side diagrams. In bagging, each classifier $f_1(x), f_2(x), f_3(x)$ is learned on its own bootstrap sample of the training data; in boosting, each classifier is learned on a weighted version of the training sample.]

  6. THE ADABOOST ALGORITHM (SAMPLING VERSION)

     [Figure: at each round $t$, a weighted sample $\mathcal{B}_t$ is drawn and classified, producing classifier $f_t(x)$ with weighted error $\epsilon_t$ and weight $\alpha_t$. The final boosted classifier is]
     $$f_{\text{boost}}(x_0) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t f_t(x_0)\right).$$

  7. THE ADABOOST ALGORITHM (SAMPLING VERSION)

     Algorithm: Boosting a binary classifier
     Given $(x_1, y_1), \ldots, (x_n, y_n)$, $x \in \mathcal{X}$, $y \in \{-1, +1\}$, set $w_1(i) = \frac{1}{n}$ for $i = 1, \ldots, n$.
     - For $t = 1, \ldots, T$:
       1. Sample a bootstrap dataset $\mathcal{B}_t$ of size $n$ according to the distribution $w_t$. Notice we pick $(x_i, y_i)$ with probability $w_t(i)$ and not $\frac{1}{n}$.
       2. Learn a classifier $f_t$ using the data in $\mathcal{B}_t$.
       3. Set $\epsilon_t = \sum_{i=1}^{n} w_t(i)\, \mathbf{1}\{y_i \neq f_t(x_i)\}$ and $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.
       4. Scale $\hat{w}_{t+1}(i) = w_t(i)\, e^{-\alpha_t y_i f_t(x_i)}$ and set $w_{t+1}(i) = \frac{\hat{w}_{t+1}(i)}{\sum_j \hat{w}_{t+1}(j)}$.
     - Set the classification rule to be $f_{\text{boost}}(x_0) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t f_t(x_0)\right)$.

     Comment: The description is usually simplified to "learn classifier $f_t$ using distribution $w_t$."
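
     A minimal Python sketch of the sampling version above, assuming a depth-1 decision tree (a stump) as the weak learner; the function and variable names are illustrative.

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        def adaboost_sampling(X, y, T=100):
            """AdaBoost, sampling version. Labels y must be in {-1, +1}."""
            n = len(y)
            w = np.full(n, 1.0 / n)                       # w_1(i) = 1/n
            classifiers, alphas = [], []
            for t in range(T):
                # 1. Bootstrap sample of size n drawn according to the distribution w_t
                idx = np.random.choice(n, size=n, replace=True, p=w)
                # 2. Learn a weak classifier (here a decision stump) on B_t
                f_t = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
                pred = f_t.predict(X)
                # 3. Weighted error and alpha_t (eps clipped away from 0 and 1 for numerical safety)
                eps = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
                alpha = 0.5 * np.log((1 - eps) / eps)
                # 4. Reweight and renormalize
                w = w * np.exp(-alpha * y * pred)
                w = w / w.sum()
                classifiers.append(f_t)
                alphas.append(alpha)
            return classifiers, alphas

        def predict_boosted(classifiers, alphas, X0):
            """f_boost(x0) = sign(sum_t alpha_t * f_t(x0))."""
            scores = sum(a * f.predict(X0) for a, f in zip(alphas, classifiers))
            return np.sign(scores)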

  8. BOOSTING A DECISION STUMP (EXAMPLE 1)

     [Figure: the original data, a 2D toy set of + and − points, shown with the uniform distribution $w_1$.]
     Learn a weak classifier. Here: use a decision stump, e.g. split on $x_1 > 1.7$ and predict $\hat{y} = +1$ on one side and $\hat{y} = -1$ on the other. (A sketch of such a stump learner follows.)
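
     The weak learner itself can be as simple as the following weighted decision stump, written out directly; this is a generic sketch of a stump learner, not code from the lecture.

        import numpy as np

        def fit_stump(X, y, w):
            """Pick the (feature, threshold, sign) stump with the smallest weighted error.

            X: (n, d) array, y: labels in {-1, +1}, w: nonnegative weights summing to 1.
            """
            best_err, best_stump = np.inf, None
            for j in range(X.shape[1]):
                for thr in np.unique(X[:, j]):
                    for sign in (+1, -1):
                        pred = np.where(X[:, j] > thr, sign, -sign)
                        err = np.sum(w * (pred != y))
                        if err < best_err:
                            best_err, best_stump = err, (j, thr, sign)
            return best_stump

        def stump_predict(stump, X):
            j, thr, sign = stump
            return np.where(X[:, j] > thr, sign, -sign)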

  9. BOOSTING A DECISION STUMP (EXAMPLE 1)

     [Figure: the round 1 classifier on the toy data.]
     Weighted error: $\epsilon_1 = 0.3$. Weight update: $\alpha_1 = 0.42$.

  10. BOOSTING A DECISION STUMP (EXAMPLE 1)

     [Figure: the weighted data after round 1; misclassified points now carry larger weights.]

  11. BOOSTING A DECISION STUMP (EXAMPLE 1)

     [Figure: the round 2 classifier on the reweighted data.]
     Weighted error: $\epsilon_2 = 0.21$. Weight update: $\alpha_2 = 0.65$.

  12. BOOSTING A DECISION STUMP (EXAMPLE 1)

     [Figure: the weighted data after round 2.]

  13. BOOSTING A DECISION STUMP (EXAMPLE 1)

     [Figure: the round 3 classifier on the reweighted data.]
     Weighted error: $\epsilon_3 = 0.14$. Weight update: $\alpha_3 = 0.92$.

  14. BOOSTING A DECISION STUMP (EXAMPLE 1)

     [Figure: the classifier after three rounds is the weighted combination of the three stumps,]
     $$f_{\text{boost}}(x) = \text{sign}\big(0.42\, f_1(x) + 0.65\, f_2(x) + 0.92\, f_3(x)\big).$$
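
     As a check, all three weights follow from the formula $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$ applied to the weighted errors quoted above:
     $$\alpha_1 = \tfrac{1}{2}\ln\tfrac{0.7}{0.3} \approx 0.42, \qquad \alpha_2 = \tfrac{1}{2}\ln\tfrac{0.79}{0.21} \approx 0.66, \qquad \alpha_3 = \tfrac{1}{2}\ln\tfrac{0.86}{0.14} \approx 0.91,$$
     matching the slide values 0.42, 0.65, and 0.92 up to rounding.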

  15. BOOSTING A DECISION STUMP (EXAMPLE 2)

     Example problem:
     - Random guessing: 50% error
     - Decision stump: 45.8% error
     - Full decision tree: 24.7% error
     - Boosted stump: 5.8% error

  16. BOOSTING

     [Figure: scatter plot in which each point is one dataset and its location gives the error rate with and without boosting.]
     The boosted version of the same classifier almost always produces better results.

  17. BOOSTING

     (Left) Boosting a bad classifier is often better than not boosting a good one. (Right) Boosting a good classifier is often better, but can take more time.

  18. BOOSTING AND FEATURE MAPS

     Q: What makes boosting work so well?
     A: This is a well-studied question. We will present one analysis later, but we can also give intuition by tying it in with what we've already learned.

     The classification of a new $x_0$ from boosting is
     $$f_{\text{boost}}(x_0) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t f_t(x_0)\right).$$
     Define $\phi(x) = [f_1(x), \ldots, f_T(x)]^\top$, where each $f_t(x) \in \{-1, +1\}$.
     - We can think of $\phi(x)$ as a high-dimensional feature map of $x$.
     - The vector $\alpha = [\alpha_1, \ldots, \alpha_T]^\top$ corresponds to a hyperplane.
     - So the classifier can be written $f_{\text{boost}}(x_0) = \text{sign}(\phi(x_0)^\top \alpha)$.
     - Boosting learns the feature mapping and the hyperplane simultaneously.
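
     The same prediction rule written as this linear classifier, reusing the (illustrative) classifiers and alphas returned by the AdaBoost sketch earlier:

        import numpy as np

        def phi(classifiers, X0):
            """Feature map phi(x) = [f_1(x), ..., f_T(x)], with entries in {-1, +1}."""
            return np.column_stack([f.predict(X0) for f in classifiers])

        def predict_via_feature_map(classifiers, alphas, X0):
            """Equivalent linear view: f_boost(x0) = sign(phi(x0)^T alpha)."""
            return np.sign(phi(classifiers, X0) @ np.asarray(alphas))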

  19. APPLICATION: FACE DETECTION

  20. FACE DETECTION (VIOLA & JONES, 2001)

     Problem: Locate the faces in an image or video.
     Processing: Divide the image into patches of different scales, e.g., 24 × 24, 48 × 48, etc. Extract features from each patch. Classify each patch as face or no face using a boosted decision stump. This can be done in real time, for example by your digital camera (at 15 fps).
     - Take one patch from a larger image and mask it with many "feature extractors."
     - Each pattern gives one number: the sum of all pixels in the black region minus the sum of pixels in the white region (a total of 45,000+ features).
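
     A minimal sketch of one such rectangle feature, computed with an integral image so that any rectangle sum costs four lookups; the specific two-rectangle layout below is an illustrative example, not a feature taken from the lecture.

        import numpy as np

        def integral_image(patch):
            """ii[r, c] = sum of patch[0..r, 0..c]."""
            return patch.cumsum(axis=0).cumsum(axis=1)

        def rect_sum(ii, top, left, h, w):
            """Sum of pixels in the h-by-w rectangle whose top-left corner is (top, left)."""
            def at(r, c):                       # treat out-of-range indices as 0
                return ii[r, c] if r >= 0 and c >= 0 else 0.0
            bottom, right = top + h - 1, left + w - 1
            return at(bottom, right) - at(top - 1, right) - at(bottom, left - 1) + at(top - 1, left - 1)

        def two_rect_feature(patch, top, left, h, w):
            """One number per pattern: sum over the 'black' half minus sum over the 'white' half."""
            ii = integral_image(np.asarray(patch, dtype=float))
            half = w // 2
            return rect_sum(ii, top, left, h, half) - rect_sum(ii, top, left + half, h, half)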

  21. FACE DETECTION (EXAMPLE RESULTS)

     [Figure: example detection results.]

  22. ANALYSIS OF BOOSTING

  23. ANALYSIS OF BOOSTING

     Training error theorem
     We can use analysis to make a statement about the accuracy of boosting on the training data.

     Theorem: Under the AdaBoost framework, if $\epsilon_t$ is the weighted error of classifier $f_t$, then for the classifier $f_{\text{boost}}(x_0) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t f_t(x_0)\right)$,
     $$\text{training error} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{y_i \neq f_{\text{boost}}(x_i)\} \leq \exp\left(-2\sum_{t=1}^{T}\left(\tfrac{1}{2} - \epsilon_t\right)^2\right).$$

     Even if each $\epsilon_t$ is only a little better than random guessing, the sum over $T$ classifiers can lead to a large negative value in the exponent when $T$ is large. For example, setting $\epsilon_t = 0.45$ and $T = 1000$ gives training error $\leq 0.0067$.
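
     Plugging the numbers from the example into the bound:
     $$\exp\left(-2\sum_{t=1}^{1000}\left(\tfrac{1}{2} - 0.45\right)^2\right) = \exp(-2 \cdot 1000 \cdot 0.0025) = e^{-5} \approx 0.0067.$$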

  24. PROOF OF THEOREM

     Setup
     We break the proof into three steps. It is an application of the fact that if $a \leq b$ (Step 2) and $b \leq c$ (Step 3), then $a \leq c$ (conclusion).
     - Step 1 calculates the value of $b$.
     - Steps 2 and 3 prove the two inequalities.

     Also recall the following steps from AdaBoost:
     - Update: $\hat{w}_{t+1}(i) = w_t(i)\, e^{-\alpha_t y_i f_t(x_i)}$.
     - Normalize: $w_{t+1}(i) = \frac{\hat{w}_{t+1}(i)}{\sum_j \hat{w}_{t+1}(j)}$. Define $Z_t = \sum_j \hat{w}_{t+1}(j)$.

  25. PROOF OF THEOREM ($a \leq b \leq c$)

     Step 1
     We first want to expand the equation of the weights to show that
     $$w_{T+1}(i) = \frac{1}{n}\,\frac{e^{-y_i \sum_{t=1}^{T}\alpha_t f_t(x_i)}}{\prod_{t=1}^{T} Z_t} =: \frac{1}{n}\,\frac{e^{-y_i h_T(x_i)}}{\prod_{t=1}^{T} Z_t}, \qquad h_T(x) := \sum_{t=1}^{T}\alpha_t f_t(x).$$

     Derivation of Step 1: Notice the update rule
     $$w_{t+1}(i) = \frac{1}{Z_t}\, w_t(i)\, e^{-\alpha_t y_i f_t(x_i)}.$$
     Do the same expansion for $w_t(i)$ and continue until reaching $w_1(i) = \frac{1}{n}$:
     $$w_{T+1}(i) = w_1(i)\,\frac{e^{-\alpha_1 y_i f_1(x_i)}}{Z_1} \times \cdots \times \frac{e^{-\alpha_T y_i f_T(x_i)}}{Z_T}.$$
     The product $\prod_{t=1}^{T} Z_t$ is the "$b$" above. We use this form of $w_{T+1}(i)$ in Step 2.
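
     A small numerical check of this identity, with random $\pm 1$ values standing in for the labels and weak-classifier outputs (everything here is illustrative):

        import numpy as np

        rng = np.random.default_rng(0)
        n, T = 5, 4
        y = rng.choice([-1, 1], size=n)                 # labels y_i
        F = rng.choice([-1, 1], size=(T, n))            # F[t, i] stands in for f_t(x_i)
        alpha = rng.uniform(0.1, 1.0, size=T)           # arbitrary positive alpha_t

        w = np.full(n, 1.0 / n)                         # w_1(i) = 1/n
        Z = []
        for t in range(T):
            w_hat = w * np.exp(-alpha[t] * y * F[t])    # update
            Z.append(w_hat.sum())                       # Z_t
            w = w_hat / Z[-1]                           # normalize

        h = (alpha[:, None] * F).sum(axis=0)            # h_T(x_i) = sum_t alpha_t f_t(x_i)
        closed_form = (1.0 / n) * np.exp(-y * h) / np.prod(Z)
        assert np.allclose(w, closed_form)              # the Step 1 identity holds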

  26. PROOF OF THEOREM ($a \leq b \leq c$)

     Step 2
     Next, show that the training error of $f^{(T)}_{\text{boost}}$ (boosting after $T$ steps) is $\leq \prod_{t=1}^{T} Z_t$.

     Currently we know
     $$w_{T+1}(i) = \frac{1}{n}\,\frac{e^{-y_i h_T(x_i)}}{\prod_{t=1}^{T} Z_t} \;\Rightarrow\; e^{-y_i h_T(x_i)} = n\, w_{T+1}(i)\prod_{t=1}^{T} Z_t, \qquad f^{(T)}_{\text{boost}}(x) = \text{sign}(h_T(x)).$$

     Derivation of Step 2: Observe that $0 < e^{z_1}$ and $1 < e^{z_2}$ for any $z_1 < 0 < z_2$, so each indicator $\mathbf{1}\{y_i \neq f^{(T)}_{\text{boost}}(x_i)\}$ is bounded above by $e^{-y_i h_T(x_i)}$. Therefore
     $$\underbrace{\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{y_i \neq f^{(T)}_{\text{boost}}(x_i)\}}_{a} \;\leq\; \frac{1}{n}\sum_{i=1}^{n} e^{-y_i h_T(x_i)} \;=\; \sum_{i=1}^{n} w_{T+1}(i)\prod_{t=1}^{T} Z_t \;=\; \underbrace{\prod_{t=1}^{T} Z_t}_{b},$$
     since the weights $w_{T+1}(i)$ sum to one. Here "$a$" is the training error, the quantity we care about.

  27. PROOF OF THEOREM ($a \leq b \leq c$)

     Step 3
     The final step is to calculate an upper bound on $Z_t$, and by extension $\prod_{t=1}^{T} Z_t$.

     Derivation of Step 3: This step is slightly more involved. It also shows why $\alpha_t := \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.
     $$Z_t = \sum_{i=1}^{n} w_t(i)\, e^{-\alpha_t y_i f_t(x_i)} = e^{-\alpha_t}\!\!\sum_{i:\, y_i = f_t(x_i)}\!\! w_t(i) \;+\; e^{\alpha_t}\!\!\sum_{i:\, y_i \neq f_t(x_i)}\!\! w_t(i) = e^{-\alpha_t}(1-\epsilon_t) + e^{\alpha_t}\epsilon_t.$$
     Remember we defined $\epsilon_t = \sum_{i:\, y_i \neq f_t(x_i)} w_t(i)$, the probability of error under $w_t$.
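
     The slide stops at this expression; a sketch of the standard completion of the calculation: substituting the stated $\alpha_t$ gives $e^{-\alpha_t} = \sqrt{\epsilon_t/(1-\epsilon_t)}$ and $e^{\alpha_t} = \sqrt{(1-\epsilon_t)/\epsilon_t}$, so
     $$Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)} = \sqrt{1 - 4\left(\tfrac{1}{2}-\epsilon_t\right)^2} \leq e^{-2\left(\frac{1}{2}-\epsilon_t\right)^2},$$
     using $\sqrt{1-x} \leq e^{-x/2}$. Multiplying over $t = 1, \ldots, T$ bounds $\prod_{t=1}^{T} Z_t$ by $\exp\left(-2\sum_{t=1}^{T}\left(\tfrac{1}{2}-\epsilon_t\right)^2\right)$, which is the "$c$" in the theorem.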
