

  1. Foundations of Machine Learning: Boosting

  2. Weak Learning (Kearns and Valiant, 1994) Definition: a concept class $C$ is weakly PAC-learnable if there exists a (weak) learning algorithm $L$ and $\gamma > 0$ such that:
  • for all $\delta > 0$, for all $c \in C$ and all distributions $D$,
  $$\Pr_{S \sim D^m}\Big[ R(h_S) \le \tfrac{1}{2} - \gamma \Big] \ge 1 - \delta,$$
  • for samples $S$ of size $m = \mathrm{poly}(1/\delta)$ for a fixed polynomial.

  3. Boosting Ideas Finding simple, relatively accurate base classifiers is often not hard: this is the weak learner. Main ideas:
  • use the weak learner to create a strong learner.
  • combine the base classifiers returned by the weak learner (ensemble method).
  But how should the base classifiers be combined?

  4. AdaBoost (Freund and Schapire, 1997) $H \subseteq \{-1, +1\}^X$.
  AdaBoost$(S = ((x_1, y_1), \ldots, (x_m, y_m)))$
      for $i \leftarrow 1$ to $m$ do
          $D_1(i) \leftarrow \frac{1}{m}$
      for $t \leftarrow 1$ to $T$ do
          $h_t \leftarrow$ base classifier in $H$ with small error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \ne y_i]$
          $\alpha_t \leftarrow \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$
          $Z_t \leftarrow 2[\epsilon_t (1 - \epsilon_t)]^{1/2}$    (normalization factor)
          for $i \leftarrow 1$ to $m$ do
              $D_{t+1}(i) \leftarrow D_t(i) \exp(-\alpha_t y_i h_t(x_i)) / Z_t$
      $f_T \leftarrow \sum_{s=1}^{T} \alpha_s h_s$
      return $h = \mathrm{sgn}(f_T)$
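
  For concreteness, here is a minimal NumPy sketch of the pseudocode above (not part of the slides): the `adaboost` function name, the `weak_learner(X, y, D)` interface, and the early stop when the weighted error is no better than 1/2 are assumptions made for illustration.

```python
import numpy as np

def adaboost(X, y, weak_learner, T=100):
    """Sketch of AdaBoost. X: (m, d) array, y: labels in {-1, +1}.
    weak_learner(X, y, D) must return a callable h with h(X) in {-1, +1}^m."""
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)              # base classifier with small weighted error
        pred = h(X)
        eps = D[pred != y].sum()               # eps_t = Pr_{i ~ D_t}[h_t(x_i) != y_i]
        if eps <= 0.0 or eps >= 0.5:           # degenerate round: stop early
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t
        D = D * np.exp(-alpha * y * pred)      # unnormalized D_{t+1}
        D /= D.sum()                           # dividing by the normalization factor Z_t
        hypotheses.append(h)
        alphas.append(alpha)

    def f(Xnew):                               # f_T = sum_t alpha_t h_t
        return sum(a * h(Xnew) for a, h in zip(alphas, hypotheses))

    return lambda Xnew: np.sign(f(Xnew)), f
```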

  5. Notes Distributions $D_t$ over the training sample:
  • originally uniform.
  • at each round, the weight of a misclassified example is increased.
  • observation: $D_{t+1}(i) = \frac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$, since
  $$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t} = \frac{D_{t-1}(i)\, e^{-\alpha_{t-1} y_i h_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t} = \frac{1}{m} \frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{\prod_{s=1}^{t} Z_s}.$$
  Weight assigned to base classifier $h_t$: $\alpha_t$ directly depends on the accuracy of $h_t$ at round $t$.
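
  A quick numerical check of this closed-form expression for $D_{t+1}$ (an illustration, not from the slides), using synthetic $\pm 1$ values in place of real base-classifier predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
m, T = 10, 5
y = rng.choice([-1, 1], size=m)
H = rng.choice([-1, 1], size=(T, m))        # H[t, i] stands in for h_t(x_i)

D = np.full(m, 1.0 / m)                     # D_1 uniform
alphas, Zs = [], []
for t in range(T):
    eps = D[H[t] != y].sum()
    eps = min(max(eps, 1e-12), 1 - 1e-12)   # keep the log finite
    alpha = 0.5 * np.log((1 - eps) / eps)
    w = D * np.exp(-alpha * y * H[t])       # recursive update
    Z = w.sum()
    D = w / Z
    alphas.append(alpha)
    Zs.append(Z)

f = np.array(alphas) @ H                    # f_T(x_i) = sum_t alpha_t h_t(x_i)
closed_form = np.exp(-y * f) / (m * np.prod(Zs))
print(np.allclose(D, closed_form))          # True
```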

  6. Illustration [figure: the weighted sample and the base classifier chosen at rounds $t = 1$ and $t = 2$]

  7. [figure: round $t = 3$ and subsequent rounds]

  8. [figure: the final classifier as the weighted combination $\alpha_1 h_1 + \alpha_2 h_2 + \alpha_3 h_3$ of the three base classifiers]

  9. Bound on Empirical Error (Freund and Schapire, 1997) Theorem: the empirical error of the classifier output by AdaBoost verifies:
  $$\widehat{R}(h) \le \exp\Big(-2 \sum_{t=1}^{T} \Big(\tfrac{1}{2} - \epsilon_t\Big)^2\Big).$$
  • If further for all $t \in [1, T]$, $\gamma \le \big(\tfrac{1}{2} - \epsilon_t\big)$, then
  $$\widehat{R}(h) \le \exp(-2 \gamma^2 T).$$
  • $\gamma$ does not need to be known in advance: adaptive boosting.
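
  As a quick worked example with hypothetical numbers (not from the slides): if every base classifier has edge $\gamma = 0.1$, i.e. $\epsilon_t \le 0.4$ for all $t$, the bound gives $\widehat{R}(h) \le e^{-2 \cdot 0.1^2 \cdot T} = e^{-0.02\,T}$, which drops below $1\%$ once $T \ge 231$, since $e^{-0.02 \cdot 231} \approx 0.0099$.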

  10. • Proof: since, as we saw, $D_{t+1}(i) = \frac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$,
  $$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le 0} \le \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i)) = \frac{1}{m} \sum_{i=1}^{m} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i) = \prod_{t=1}^{T} Z_t.$$
  • Now, since $Z_t$ is a normalization factor,
  $$Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = \sum_{i\colon y_i h_t(x_i) \ge 0} D_t(i)\, e^{-\alpha_t} + \sum_{i\colon y_i h_t(x_i) < 0} D_t(i)\, e^{\alpha_t} = (1 - \epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = (1 - \epsilon_t) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}.$$

  11. • Thus,
  $$\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_{t=1}^{T} \sqrt{1 - 4\Big(\tfrac{1}{2} - \epsilon_t\Big)^2} \le \prod_{t=1}^{T} \exp\Big(-2\Big(\tfrac{1}{2} - \epsilon_t\Big)^2\Big) = \exp\Big(-2 \sum_{t=1}^{T} \Big(\tfrac{1}{2} - \epsilon_t\Big)^2\Big).$$
  • Notes:
  • $\alpha_t$ is the minimizer of $\alpha \mapsto (1 - \epsilon_t) e^{-\alpha} + \epsilon_t e^{\alpha}$.
  • since $(1 - \epsilon_t) e^{-\alpha_t} = \epsilon_t e^{\alpha_t}$, at each round, AdaBoost assigns the same probability mass to correctly classified and misclassified instances.
  • for base classifiers $x \mapsto [-1, +1]$, $\alpha_t$ can be similarly chosen to minimize $Z_t$.
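
  A tiny sanity check of the first note (hypothetical $\epsilon_t$ value, not from the slides), comparing the closed-form $\alpha_t$ against a grid search and verifying that the minimum of $Z_t$ equals $2\sqrt{\epsilon_t(1 - \epsilon_t)}$:

```python
import numpy as np

eps = 0.3                                        # hypothetical epsilon_t
alpha_grid = np.linspace(-3, 3, 200001)
Z = (1 - eps) * np.exp(-alpha_grid) + eps * np.exp(alpha_grid)
print(alpha_grid[np.argmin(Z)])                  # ~ 0.4236 (grid minimizer)
print(0.5 * np.log((1 - eps) / eps))             # ~ 0.4236 (closed form)
print(Z.min(), 2 * np.sqrt(eps * (1 - eps)))     # both ~ 0.9165
```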

  12. AdaBoost = Coordinate Descent Objective function: convex and differentiable,
  $$F(\bar{\alpha}) = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f(x_i)} = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_j h_j(x_i)}.$$
  [figure: the exponential loss $x \mapsto e^{-x}$ upper-bounding the zero-one loss]

  13. • Direction: unit vector $e_k$ with best directional derivative:
  $$F'(\bar{\alpha}_{t-1}, e_k) = \lim_{\eta \to 0} \frac{F(\bar{\alpha}_{t-1} + \eta e_k) - F(\bar{\alpha}_{t-1})}{\eta}.$$
  • Since $F(\bar{\alpha}_{t-1} + \eta e_k) = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1,j} h_j(x_i) - \eta y_i h_k(x_i)}$,
  $$F'(\bar{\alpha}_{t-1}, e_k) = -\frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i)\, e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1,j} h_j(x_i)} = -\frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i)\, \bar{D}_t(i)\, \bar{Z}_t = -\Big[\sum_{i=1}^{m} \bar{D}_t(i) 1_{y_i h_k(x_i) = +1} - \sum_{i=1}^{m} \bar{D}_t(i) 1_{y_i h_k(x_i) = -1}\Big] \frac{\bar{Z}_t}{m} = -\big[(1 - \bar{\epsilon}_{t,k}) - \bar{\epsilon}_{t,k}\big] \frac{\bar{Z}_t}{m} = \big[2\bar{\epsilon}_{t,k} - 1\big] \frac{\bar{Z}_t}{m}.$$
  Thus, the direction corresponds to the base classifier with the smallest error.

  14. • Step size: $\eta$ chosen to minimize $F(\bar{\alpha}_{t-1} + \eta e_k)$;
  $$\frac{dF(\bar{\alpha}_{t-1} + \eta e_k)}{d\eta} = 0 \Leftrightarrow -\sum_{i=1}^{m} y_i h_k(x_i)\, e^{-y_i \sum_{j=1}^{N} \bar{\alpha}_{t-1,j} h_j(x_i)}\, e^{-\eta y_i h_k(x_i)} = 0 \Leftrightarrow -\sum_{i=1}^{m} y_i h_k(x_i)\, \bar{D}_t(i)\, \bar{Z}_t\, e^{-\eta y_i h_k(x_i)} = 0 \Leftrightarrow -\sum_{i=1}^{m} y_i h_k(x_i)\, \bar{D}_t(i)\, e^{-\eta y_i h_k(x_i)} = 0 \Leftrightarrow -\big[(1 - \bar{\epsilon}_{t,k})\, e^{-\eta} - \bar{\epsilon}_{t,k}\, e^{\eta}\big] = 0 \Leftrightarrow \eta = \frac{1}{2} \log \frac{1 - \bar{\epsilon}_{t,k}}{\bar{\epsilon}_{t,k}}.$$
  Thus, the step size matches the base classifier weight of AdaBoost.
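
  Putting slides 12-14 together, here is a small sketch of AdaBoost as greedy coordinate descent on $F$ (the function name and the finite pool of precomputed base-classifier predictions are assumptions for illustration): at each round it picks the coordinate with the smallest $\bar{D}_t$-weighted error and applies the step $\frac{1}{2}\log\frac{1 - \bar{\epsilon}_{t,k}}{\bar{\epsilon}_{t,k}}$.

```python
import numpy as np

def coordinate_descent_boost(Hpred, y, T=50):
    """Greedy coordinate descent on F(a) = (1/m) sum_i exp(-y_i sum_j a_j h_j(x_i)).
    Hpred: (N, m) array of precomputed base-classifier predictions in {-1, +1}."""
    N, m = Hpred.shape
    a = np.zeros(N)
    for t in range(T):
        w = np.exp(-y * (a @ Hpred))                 # exp(-y_i F(x_i)), unnormalized
        D_bar = w / w.sum()                          # distribution D_bar_t
        errs = (D_bar * (Hpred != y)).sum(axis=1)    # D_bar_t-weighted error of each h_j
        k = int(np.argmin(errs))                     # direction e_k: smallest error
        eps = float(np.clip(errs[k], 1e-12, 1 - 1e-12))
        if eps >= 0.5:                               # no descent direction left
            break
        a[k] += 0.5 * np.log((1 - eps) / eps)        # step size = AdaBoost's alpha_t
    return a
```

  Run on the same pool of base classifiers, and assuming AdaBoost picks the minimum-error base classifier at each round, the coordinate chosen at round $t$ and the increment of `a[k]` coincide with the base classifier and weight $\alpha_t$ selected by AdaBoost.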

  15. Alternative Loss Functions
  • boosting loss: $x \mapsto e^{-x}$
  • square loss: $x \mapsto (1 - x)^2\, 1_{x \le 1}$
  • logistic loss: $x \mapsto \log_2(1 + e^{-x})$
  • hinge loss: $x \mapsto \max(1 - x, 0)$
  • zero-one loss: $x \mapsto 1_{x < 0}$
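
  For reference, a direct transcription of these losses as Python functions of the margin $x = y f(x)$ (the function names are mine, not from the slides):

```python
import numpy as np

def boosting_loss(x):  return np.exp(-x)                        # x -> e^{-x}
def square_loss(x):    return (1 - x) ** 2 * (x <= 1)           # x -> (1 - x)^2 1_{x <= 1}
def logistic_loss(x):  return np.log2(1 + np.exp(-x))           # x -> log_2(1 + e^{-x})
def hinge_loss(x):     return np.maximum(1 - x, 0)              # x -> max(1 - x, 0)
def zero_one_loss(x):  return np.where(np.asarray(x) < 0, 1.0, 0.0)  # x -> 1_{x < 0}
```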

  16. Standard Use in Practice Base learners: decision trees, quite often just decision stumps (trees of depth one). Boosting stumps:
  • data in $\mathbb{R}^N$, e.g., $N = 2$, $(\text{height}(x), \text{weight}(x))$.
  • associate a stump to each component.
  • pre-sort each component: $O(N m \log m)$.
  • at each round, find the best component and threshold.
  • total complexity: $O((m \log m) N + m N T)$.
  • stumps are not weak learners: think of the XOR example!
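
  Below is a brute-force decision stump learner that can serve as the `weak_learner` argument of the `adaboost` sketch above (illustrative only; the slide's $O(N m \log m)$ pre-sorting variant would instead sort each component once and sweep thresholds with running weight sums).

```python
import numpy as np

def best_stump(X, y, D):
    """Exhaustive decision stump: for each component j and threshold theta, consider
    h(x) = s * sign(x_j - theta) with s in {-1, +1}, and return the hypothesis with
    smallest D-weighted error (D is assumed to sum to 1)."""
    m, N = X.shape
    best = (None, None, None, np.inf)            # (j, theta, s, error)
    for j in range(N):
        vals = np.unique(X[:, j])
        # candidate thresholds: below all values, and between consecutive sorted values
        thresholds = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2))
        for theta in thresholds:
            pred = np.where(X[:, j] > theta, 1.0, -1.0)
            err = D[pred != y].sum()
            for s, e in ((1.0, err), (-1.0, 1.0 - err)):   # flipping the sign flips the error
                if e < best[3]:
                    best = (j, theta, s, e)
    j, theta, s, _ = best
    return lambda Xnew: s * np.where(Xnew[:, j] > theta, 1.0, -1.0)
```

  Usage, under the same assumptions: `predict, f = adaboost(X, y, best_stump, T=100)`.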

  17. Overfitting? Assume that $\mathrm{VCdim}(H) = d$ and, for a fixed $T$, define
  $$F_T = \Big\{ \mathrm{sgn}\Big(\sum_{t=1}^{T} \alpha_t h_t - b\Big) : \alpha_t, b \in \mathbb{R}, h_t \in H \Big\}.$$
  $F_T$ can form a very rich family of classifiers. It can be shown (Freund and Schapire, 1997) that:
  $$\mathrm{VCdim}(F_T) \le 2(d + 1)(T + 1) \log_2((T + 1) e).$$
  This suggests that AdaBoost could overfit for large values of $T$, and that is in fact observed in some cases, but in various others it is not!

  18. Empirical Observations Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore:
  [figure: training error and test error (%) of boosted C4.5 decision trees versus the number of rounds, 10 to 1000 (Schapire et al., 1998)]

  19. Rademacher Complexity of Convex Hulls Theorem: let $H$ be a set of functions mapping $X$ to $\mathbb{R}$. Let the convex hull of $H$ be defined as
  $$\mathrm{conv}(H) = \Big\{ \sum_{k=1}^{p} \mu_k h_k : p \ge 1, \mu_k \ge 0, \sum_{k=1}^{p} \mu_k \le 1, h_k \in H \Big\}.$$
  Then, for any sample $S$, $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.
  Proof:
  $$\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \frac{1}{m} \mathop{\mathbb{E}}_{\sigma}\Big[\sup_{h_k \in H,\, \mu \ge 0,\, \|\mu\|_1 \le 1} \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{p} \mu_k h_k(x_i)\Big] = \frac{1}{m} \mathop{\mathbb{E}}_{\sigma}\Big[\sup_{h_k \in H}\, \sup_{\mu \ge 0,\, \|\mu\|_1 \le 1} \sum_{k=1}^{p} \mu_k \sum_{i=1}^{m} \sigma_i h_k(x_i)\Big] = \frac{1}{m} \mathop{\mathbb{E}}_{\sigma}\Big[\sup_{h_k \in H}\, \max_{k \in [1, p]} \sum_{i=1}^{m} \sigma_i h_k(x_i)\Big] = \frac{1}{m} \mathop{\mathbb{E}}_{\sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i)\Big] = \widehat{\mathfrak{R}}_S(H).$$

  20. Margin Bound - Ensemble Methods (Koltchinskii and Panchenko, 2002) Corollary: let $H$ be a set of real-valued functions. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:
  $$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho} \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$$
  $$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho} \widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
  Proof: direct consequence of the margin bound of Lecture 4 and $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.

  21. Margin Bound - Ensemble Methods (Koltchinskii and Panchenko, 2002); see also (Schapire et al., 1998) Corollary: let $H$ be a family of functions taking values in $\{-1, +1\}$ with VC dimension $d$. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:
  $$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho} \sqrt{\frac{2 d \log \frac{em}{d}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$
  Proof: follows directly from the previous corollary and the VC dimension bound on the Rademacher complexity (see Lecture 3).

  22. Notes All of these bounds can be generalized to hold uniformly for all $\rho \in (0, 1)$, at the cost of an additional term $\sqrt{\frac{\log \log_2 \frac{2}{\rho}}{m}}$ and other minor constant factor changes (Koltchinskii and Panchenko, 2002).
  For AdaBoost, the bound applies to the functions
  $$x \mapsto \frac{f(x)}{\|\alpha\|_1} = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} \in \mathrm{conv}(H).$$
  Note that $T$ does not appear in the bound.

  23. Margin Distribution Theorem: for any $\rho > 0$, the following holds:
  $$\Pr\Big[\frac{y f(x)}{\|\alpha\|_1} \le \rho\Big] \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}}.$$
  Proof: using the identity $D_{T+1}(i) = \frac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,
  $$\frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) - \rho \|\alpha\|_1 \le 0} \le \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i) + \rho \|\alpha\|_1) = \frac{1}{m} e^{\rho \|\alpha\|_1} \sum_{i=1}^{m} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i) = e^{\rho \|\alpha\|_1} \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} \Big(\frac{1 - \epsilon_t}{\epsilon_t}\Big)^{\rho/2} 2\sqrt{\epsilon_t (1 - \epsilon_t)} = 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}}.$$
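
  For concreteness, a small helper (an illustration, not part of the slides) that computes the empirical quantity bounded by the theorem from the weights $\alpha_t$ and a matrix of base-classifier predictions:

```python
import numpy as np

def empirical_margin_rate(alphas, preds, y, rho):
    """Fraction of training points with normalized margin y_i f(x_i)/||alpha||_1 <= rho.
    preds: (T, m) array with preds[t, i] = h_t(x_i) in {-1, +1}."""
    alphas = np.asarray(alphas, dtype=float)
    f = alphas @ preds                          # f(x_i) = sum_t alpha_t h_t(x_i)
    margins = y * f / np.abs(alphas).sum()      # normalized margins, in [-1, +1]
    return float(np.mean(margins <= rho))
```

  With a uniform edge ($\epsilon_t \le \tfrac{1}{2} - \gamma$) and $\rho$ small enough relative to $\gamma$, each factor $2\sqrt{\epsilon_t^{1-\rho}(1-\epsilon_t)^{1+\rho}}$ is strictly below 1, so the bound on this quantity decreases exponentially with $T$.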
