  1. CS 6316 Machine Learning: Boosting. Yangfeng Ji, Department of Computer Science, University of Virginia

  2. Overview

  3. The Bias-Variance Decomposition. The expected error is decomposed as
       $\mathbb{E}[\epsilon^2] = \underbrace{\mathbb{E}\big[\{h(x, S) - \mathbb{E}[h(x, S)]\}^2\big]}_{\text{variance}} + \underbrace{\{\mathbb{E}[h(x, S)] - f_{\mathcal{D}}(x)\}^2}_{\text{bias}^2}$
     ◮ bias: how far the expected prediction E[h(x, S)] diverges from the optimal predictor f_D(x)
     ◮ variance: how a hypothesis learned from a specific S diverges from the average prediction E[h(x, S)]
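
     The decomposition can be checked numerically: retrain the same learner on many freshly drawn training sets S, then compare the spread of its predictions (variance) with the gap between their average and the target (bias²). Below is a minimal sketch for a toy one-dimensional regression problem; the sine target, the noise level, and the degree-1 polynomial fit are illustrative assumptions, not material from the slides.

      import numpy as np

      rng = np.random.default_rng(0)

      def f_target(x):
          # stands in for the optimal predictor f_D(x) (an assumed toy target)
          return np.sin(2 * np.pi * x)

      def fit_and_predict(x_query, degree=1, m=20):
          # train h(., S) on a freshly sampled S and predict at x_query
          x = rng.uniform(0.0, 1.0, m)
          y = f_target(x) + rng.normal(0.0, 0.1, m)   # noisy labels
          coefs = np.polyfit(x, y, degree)            # a deliberately simple learner
          return np.polyval(coefs, x_query)

      x0 = 0.3                                        # evaluate the decomposition at one point
      preds = np.array([fit_and_predict(x0) for _ in range(2000)])
      variance = preds.var()                          # E[{h(x, S) - E[h(x, S)]}^2]
      bias_sq = (preds.mean() - f_target(x0)) ** 2    # {E[h(x, S)] - f_D(x)}^2
      print(f"variance = {variance:.4f}, bias^2 = {bias_sq:.4f}")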

  4. Motivation. How can we reduce the overall error? E.g.,
     ◮ Reduce the bias
       ◮ Boosting: start with simple classifiers, and gradually make a powerful one
     ◮ Reduce the variance
       ◮ Bagging: create multiple copies of the data, train classifiers on each of them, then combine them together

  5. The Idea of Boosting

  6. Weak Learnability

  7. Weak Learnability
     ◮ A learning algorithm A is a γ-weak learner for a hypothesis space if, under the PAC learning conditions, the algorithm returns a hypothesis h such that, with probability of at least 1 − δ,
       $L_{(\mathcal{D}, f)}(h) \leq \frac{1}{2} - \gamma$   (1)
     ◮ A hypothesis space H is γ-weak-learnable if there exists a γ-weak learner for this class

  8. Strong vs. Weak Learnability
     ◮ Strong learnability:
       $L_{(\mathcal{D}, f)}(h) \leq \epsilon$   (2)
       where ε is arbitrarily small
     ◮ Weak learnability:
       $L_{(\mathcal{D}, f)}(h) \leq \frac{1}{2} - \gamma$   (3)
       where γ > 0. In other words, the error rate under weak learnability only needs to be slightly better than random guessing

  10. Decision Stumps
     ◮ Let X = R^d; the hypothesis space of decision stumps is defined as
       $\mathcal{H}_{DS} = \{ b \cdot \mathrm{sign}(x_{\cdot, j} - \theta) : \theta \in \mathbb{R}, j \in [d] \}$   (4)
       with parameters θ ∈ R, j ∈ [d], and b ∈ {−1, +1}
     ◮ For each h_{θ, j, b} ∈ H_DS with j = 1 and b = +1,
       $h_{\theta, 1, +1}(x) = \begin{cases} +1 & x_{\cdot, 1} > \theta \\ -1 & x_{\cdot, 1} < \theta \end{cases}$   (5)
     [figure omitted: examples plotted with axes x_1 and x_2]
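
     As a concrete reading of Eqs. (4)-(5), a decision stump is just a thresholded single feature; the sketch below is one possible implementation (the function name, the array layout, and the tie-breaking at x_{·, j} = θ are assumptions, since the slide leaves equality undefined).

      import numpy as np

      def stump_predict(X, theta, j, b):
          # h_{theta, j, b}(x) = b * sign(x_{., j} - theta)
          # X has shape (m, d); theta is the threshold; j is the feature index; b is +1 or -1
          s = np.sign(X[:, j] - theta)
          s[s == 0] = 1.0     # assumed convention: points exactly on the threshold go to +1
          return b * s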

  12. Empirical Risk
     ◮ The empirical risk with a training set S = {(x_1, y_1), ..., (x_m, y_m)} is defined as
       $L_{D}(h_{\theta, j, b}) = \sum_{i=1}^{m} D_i \cdot \mathbb{1}[h_{\theta, j, b}(x_i) \neq y_i]$   (6)
       where D_i is the weight on example i, 1[·] is the indicator function, and 1[h(x_i) ≠ y_i] = 1 when h(x_i) ≠ y_i is true
     ◮ In the special case D_i = 1/m,
       $L_{D}(h) = L_{S}(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h(x_i) \neq y_i]$   (7)
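
     Eq. (6) is a weighted 0-1 loss, which takes one line to compute; a sketch under the same assumed conventions as above (h is any function mapping an (m, d) array to m labels in {−1, +1}, and D is a length-m weight vector).

      import numpy as np

      def weighted_risk(h, X, y, D):
          # L_D(h) = sum_i D_i * 1[h(x_i) != y_i]; with D_i = 1/m this reduces to L_S(h) in Eq. (7)
          return float(np.sum(D * (h(X) != y)))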

  16. Learning a Decision Stump
     ◮ For each j ∈ [d]:
       ◮ Sort the training examples such that
         $x_{1, j} \leq x_{2, j} \leq \cdots \leq x_{m, j}$   (8)
       ◮ Define the candidate thresholds
         $\Theta_j = \Big\{ \frac{x_{i, j} + x_{i+1, j}}{2} : i \in [m - 1] \Big\} \cup \{ x_{1, j} - 1,\ x_{m, j} + 1 \}$
       ◮ Try each θ' ∈ Θ_j and find the minimal risk
         $L_{D}(h_{\theta', j, b}) = \sum_{i=1}^{m} D_i \cdot \mathbb{1}[h_{\theta', j, b}(x_i) \neq y_i]$   (9)
     ◮ Find the minimal risk over all j ∈ [d]
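
     This search can be written directly as an exhaustive scan over features, candidate thresholds, and signs. The sketch below favors clarity over the sorted O(dm) bookkeeping (it recomputes the risk from scratch for every threshold) and reuses the hypothetical stump_predict and weighted_risk from the earlier sketches.

      import numpy as np

      def learn_stump(X, y, D):
          # ERM over decision stumps under example weights D; returns (theta, j, b, risk)
          m, d = X.shape
          best = (0.0, 0, 1, np.inf)
          for j in range(d):
              xs = np.sort(X[:, j])
              # Theta_j: midpoints of consecutive sorted values, plus one value below and above all points
              thetas = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
              for theta in thetas:
                  for b in (+1, -1):
                      risk = weighted_risk(lambda Z: stump_predict(Z, theta, j, b), X, y, D)
                      if risk < best[3]:
                          best = (theta, j, b, risk)
          return best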

  20. Example. Build a decision stump for the following classification task, with the assumption that
       $D = (\tfrac{1}{9}, \ldots, \tfrac{1}{9})$   (10)
     [figure omitted: training examples plotted with axes x_1 and x_2]
     The best decision stump is the threshold x_{·, 1} = 0.6

  21. Boosting

  24. Boosting
     Q: Can we boost a set of weak classifiers into a strong classifier?
     A: Yes. It looks like
       $h_S(x) = \mathrm{sign}\Big( \sum_{t=1}^{T} w_t h_t(x) \Big)$   (11)
     Three questions:
     ◮ How do we find each weak classifier h_t(x)?
     ◮ How do we compute w_t?
     ◮ How large should T be?
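
     In code, Eq. (11) is simply a weighted majority vote over the weak classifiers; a minimal sketch, with the weak classifiers represented as callables and ties broken toward +1 (an assumed convention).

      def weighted_vote(x, weak_classifiers, weights):
          # h_S(x) = sign(sum_t w_t * h_t(x))
          score = sum(w * h(x) for w, h in zip(weights, weak_classifiers))
          return 1 if score >= 0 else -1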

  29. AdaBoost
     1: Input: S = {(x_1, y_1), ..., (x_m, y_m)}, weak learner A, number of rounds T
     2: Initialize D^(1) = (1/m, ..., 1/m)
     3: for t = 1, ..., T do
     4:   Learn a weak classifier h_t = A(D^(t), S)
     5:   Compute the error $\epsilon_t = \sum_{i=1}^{m} D_i^{(t)} \mathbb{1}[h_t(x_i) \neq y_i]$
     6:   Let $w_t = \frac{1}{2} \log\big( \frac{1}{\epsilon_t} - 1 \big)$
     7:   Update, for all i = 1, ..., m:
          $D_i^{(t+1)} = \frac{D_i^{(t)} \exp(-w_t y_i h_t(x_i))}{\sum_{j=1}^{m} D_j^{(t)} \exp(-w_t y_j h_t(x_j))}$
     8: end for
     9: Output: the hypothesis $h_S(x) = \mathrm{sign}\big( \sum_{t=1}^{T} w_t h_t(x) \big)$
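
     The loop above translates almost line for line into code. The sketch below uses the hypothetical learn_stump and stump_predict from the earlier sketches as the weak learner A, and adds a small guard against ε_t = 0 (an implementation detail not on the slide).

      import numpy as np

      def adaboost(X, y, T):
          # returns the learned stumps (theta, j, b) and their weights w_t
          m = X.shape[0]
          D = np.full(m, 1.0 / m)                      # line 2: D^(1) = (1/m, ..., 1/m)
          stumps, weights = [], []
          for t in range(T):                           # line 3
              theta, j, b, eps = learn_stump(X, y, D)  # line 4: h_t = A(D^(t), S); eps is the error of line 5
              eps = min(max(eps, 1e-12), 1 - 1e-12)    # keep the log below finite (assumed guard)
              w = 0.5 * np.log(1.0 / eps - 1.0)        # line 6: w_t = (1/2) log(1/eps_t - 1)
              pred = stump_predict(X, theta, j, b)
              D = D * np.exp(-w * y * pred)            # line 7: reweight every example
              D = D / D.sum()                          #         and renormalize
              stumps.append((theta, j, b))
              weights.append(w)
          return stumps, weights

      def adaboost_predict(X, stumps, weights):
          # line 9: h_S(x) = sign(sum_t w_t h_t(x))
          score = sum(w * stump_predict(X, theta, j, b)
                      for w, (theta, j, b) in zip(weights, stumps))
          return np.where(score >= 0, 1, -1)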

  32. Example
     [figures omitted: (a) t = 1, (b) t = 2, (c) t = 3]
     [Mohri et al., 2018, Page 147]

  33. Example (Cont.)
       $h(x) = \mathrm{sign}\Big( \sum_{t=1}^{T} w_t h_t(x) \Big)$   (12)
     [Mohri et al., 2018, Page 147]

  35. Theoretical Analysis
     Let S be a training set and assume that at each iteration of AdaBoost, the weak learner returns a hypothesis for which ε_t ≤ 1/2 − γ. Then the training error of the output hypothesis of AdaBoost is at most
       $L_S(h_S) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h_S(x_i) \neq y_i] \leq \exp(-2\gamma^2 T)$   (13)
     [Shalev-Shwartz and Ben-David, 2014, Pages 135-137]
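
     A direct consequence of (13): to guarantee training error at most ε, it is enough to choose T so that exp(−2γ²T) ≤ ε, i.e.,
       $T \geq \frac{\ln(1/\epsilon)}{2\gamma^2}$
     For example, with edge γ = 0.1 and target ε = 0.01, this gives T ≥ ln(100)/0.02 ≈ 230.3, so about 231 rounds of boosting suffice.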

  37. VC Dimension
     Let
     ◮ B be a base hypothesis space (e.g., decision stumps)
     ◮ L(B, T) be the hypothesis space produced by the AdaBoost algorithm after T rounds
     Assume that both T and VCdim(B) are at least 3. Then
       $\mathrm{VCdim}(L(B, T)) \leq O\big( T \cdot \mathrm{VCdim}(B) \cdot \log(T \cdot \mathrm{VCdim}(B)) \big)$
     [Shalev-Shwartz and Ben-David, 2014, Page 139]

  38. References
     Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of Machine Learning. MIT Press.
     Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
