  1. CS 6316 Machine Learning: Boosting. Yangfeng Ji, Department of Computer Science, University of Virginia

  2. Overview

  3. The Bias-Variance Decomposition. The expected error is decomposed as
       $\mathbb{E}[\epsilon^2] = \underbrace{\mathbb{E}\big[\{h(x, S) - \mathbb{E}[h(x, S)]\}^2\big]}_{\text{variance}} + \underbrace{\{\mathbb{E}[h(x, S)] - f_{\mathcal{D}}(x)\}^2}_{\text{bias}^2}$
     ◮ bias: how far the expected prediction E[h(x, S)] diverges from the optimal predictor f_D(x)
     ◮ variance: how a hypothesis learned from a specific S diverges from the average prediction E[h(x, S)]
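
     The decomposition can be checked numerically: retrain the same learner on many freshly drawn training sets S, then compare the spread of its predictions (variance) with the gap between their average and the target (bias²). Below is a minimal sketch for a toy one-dimensional regression problem; the sine target, the noise level, and the degree-1 polynomial fit are illustrative assumptions, not material from the slides.

      import numpy as np

      rng = np.random.default_rng(0)

      def f_target(x):
          # stands in for the optimal predictor f_D(x) (an assumed toy target)
          return np.sin(2 * np.pi * x)

      def fit_and_predict(x_query, degree=1, m=20):
          # train h(., S) on a freshly sampled S and predict at x_query
          x = rng.uniform(0.0, 1.0, m)
          y = f_target(x) + rng.normal(0.0, 0.1, m)   # noisy labels
          coefs = np.polyfit(x, y, degree)            # a deliberately simple learner
          return np.polyval(coefs, x_query)

      x0 = 0.3                                        # evaluate the decomposition at one point
      preds = np.array([fit_and_predict(x0) for _ in range(2000)])
      variance = preds.var()                          # E[{h(x, S) - E[h(x, S)]}^2]
      bias_sq = (preds.mean() - f_target(x0)) ** 2    # {E[h(x, S)] - f_D(x)}^2
      print(f"variance = {variance:.4f}, bias^2 = {bias_sq:.4f}")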

  4. Motivation. How can we reduce the overall error? E.g.,
     ◮ Reduce the bias
       ◮ Boosting: start with simple classifiers, and gradually make a powerful one
     ◮ Reduce the variance
       ◮ Bagging: create multiple copies of the data, train classifiers on each of them, then combine them together

  5. The Idea of Boosting

  6. Weak Learnability

  7. Weak Learnability
     ◮ A learning algorithm A is a γ-weak learner for a hypothesis space if, under the PAC learning conditions, the algorithm returns a hypothesis h such that, with probability of at least 1 − δ,
       $L_{(\mathcal{D}, f)}(h) \leq \frac{1}{2} - \gamma$   (1)
     ◮ A hypothesis space H is γ-weak-learnable if there exists a γ-weak learner for this class

  8. Strong vs. Weak Learnability
     ◮ Strong learnability:
       $L_{(\mathcal{D}, f)}(h) \leq \epsilon$   (2)
       where ε is arbitrarily small
     ◮ Weak learnability:
       $L_{(\mathcal{D}, f)}(h) \leq \frac{1}{2} - \gamma$   (3)
       where γ > 0. In other words, the error rate under weak learnability only needs to be slightly better than random guessing

  10. Decision Stumps
     ◮ Let X = R^d; the hypothesis space of decision stumps is defined as
       $\mathcal{H}_{DS} = \{ b \cdot \mathrm{sign}(x_{\cdot, j} - \theta) : \theta \in \mathbb{R}, j \in [d] \}$   (4)
       with parameters θ ∈ R, j ∈ [d], and b ∈ {−1, +1}
     ◮ For each h_{θ, j, b} ∈ H_DS with j = 1 and b = +1,
       $h_{\theta, 1, +1}(x) = \begin{cases} +1 & x_{\cdot, 1} > \theta \\ -1 & x_{\cdot, 1} < \theta \end{cases}$   (5)
     [figure omitted: examples plotted with axes x_1 and x_2]
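
     As a concrete reading of Eqs. (4)-(5), a decision stump is just a thresholded single feature; the sketch below is one possible implementation (the function name, the array layout, and the tie-breaking at x_{·, j} = θ are assumptions, since the slide leaves equality undefined).

      import numpy as np

      def stump_predict(X, theta, j, b):
          # h_{theta, j, b}(x) = b * sign(x_{., j} - theta)
          # X has shape (m, d); theta is the threshold; j is the feature index; b is +1 or -1
          s = np.sign(X[:, j] - theta)
          s[s == 0] = 1.0     # assumed convention: points exactly on the threshold go to +1
          return b * s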

  12. Empirical Risk
     ◮ The empirical risk with a training set S = {(x_1, y_1), ..., (x_m, y_m)} is defined as
       $L_{D}(h_{\theta, j, b}) = \sum_{i=1}^{m} D_i \cdot \mathbb{1}[h_{\theta, j, b}(x_i) \neq y_i]$   (6)
       where D_i is the weight on example i, 1[·] is the indicator function, and 1[h(x_i) ≠ y_i] = 1 when h(x_i) ≠ y_i is true
     ◮ In the special case D_i = 1/m,
       $L_{D}(h) = L_{S}(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h(x_i) \neq y_i]$   (7)
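
     Eq. (6) is a weighted 0-1 loss, which takes one line to compute; a sketch under the same assumed conventions as above (h is any function mapping an (m, d) array to m labels in {−1, +1}, and D is a length-m weight vector).

      import numpy as np

      def weighted_risk(h, X, y, D):
          # L_D(h) = sum_i D_i * 1[h(x_i) != y_i]; with D_i = 1/m this reduces to L_S(h) in Eq. (7)
          return float(np.sum(D * (h(X) != y)))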

  16. Learning a Decision Stump
     ◮ For each j ∈ [d]:
       ◮ Sort the training examples such that
         $x_{1, j} \leq x_{2, j} \leq \cdots \leq x_{m, j}$   (8)
       ◮ Define the candidate thresholds
         $\Theta_j = \Big\{ \frac{x_{i, j} + x_{i+1, j}}{2} : i \in [m - 1] \Big\} \cup \{ x_{1, j} - 1,\ x_{m, j} + 1 \}$
       ◮ Try each θ' ∈ Θ_j and find the minimal risk
         $L_{D}(h_{\theta', j, b}) = \sum_{i=1}^{m} D_i \cdot \mathbb{1}[h_{\theta', j, b}(x_i) \neq y_i]$   (9)
     ◮ Find the minimal risk over all j ∈ [d]
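
     This search can be written directly as an exhaustive scan over features, candidate thresholds, and signs. The sketch below favors clarity over the sorted O(dm) bookkeeping (it recomputes the risk from scratch for every threshold) and reuses the hypothetical stump_predict and weighted_risk from the earlier sketches.

      import numpy as np

      def learn_stump(X, y, D):
          # ERM over decision stumps under example weights D; returns (theta, j, b, risk)
          m, d = X.shape
          best = (0.0, 0, 1, np.inf)
          for j in range(d):
              xs = np.sort(X[:, j])
              # Theta_j: midpoints of consecutive sorted values, plus one value below and above all points
              thetas = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
              for theta in thetas:
                  for b in (+1, -1):
                      risk = weighted_risk(lambda Z: stump_predict(Z, theta, j, b), X, y, D)
                      if risk < best[3]:
                          best = (theta, j, b, risk)
          return best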

  20. Example. Build a decision stump for the following classification task, with the assumption that
       $D = (\tfrac{1}{9}, \ldots, \tfrac{1}{9})$   (10)
     [figure omitted: training examples plotted with axes x_1 and x_2]
     The best decision stump is the threshold x_{·, 1} = 0.6

  21. Boosting

  24. Boosting
     Q: Can we boost a set of weak classifiers into a strong classifier?
     A: Yes. It looks like
       $h_S(x) = \mathrm{sign}\Big( \sum_{t=1}^{T} w_t h_t(x) \Big)$   (11)
     Three questions:
     ◮ How do we find each weak classifier h_t(x)?
     ◮ How do we compute w_t?
     ◮ How large should T be?
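
     In code, Eq. (11) is simply a weighted majority vote over the weak classifiers; a minimal sketch, with the weak classifiers represented as callables and ties broken toward +1 (an assumed convention).

      def weighted_vote(x, weak_classifiers, weights):
          # h_S(x) = sign(sum_t w_t * h_t(x))
          score = sum(w * h(x) for w, h in zip(weights, weak_classifiers))
          return 1 if score >= 0 else -1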

  29. AdaBoost
     1: Input: S = {(x_1, y_1), ..., (x_m, y_m)}, weak learner A, number of rounds T
     2: Initialize D^(1) = (1/m, ..., 1/m)
     3: for t = 1, ..., T do
     4:   Learn a weak classifier h_t = A(D^(t), S)
     5:   Compute the error $\epsilon_t = \sum_{i=1}^{m} D_i^{(t)} \mathbb{1}[h_t(x_i) \neq y_i]$
     6:   Let $w_t = \frac{1}{2} \log\big( \frac{1}{\epsilon_t} - 1 \big)$
     7:   Update, for all i = 1, ..., m:
          $D_i^{(t+1)} = \frac{D_i^{(t)} \exp(-w_t y_i h_t(x_i))}{\sum_{j=1}^{m} D_j^{(t)} \exp(-w_t y_j h_t(x_j))}$
     8: end for
     9: Output: the hypothesis $h_S(x) = \mathrm{sign}\big( \sum_{t=1}^{T} w_t h_t(x) \big)$
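
     The loop above translates almost line for line into code. The sketch below uses the hypothetical learn_stump and stump_predict from the earlier sketches as the weak learner A, and adds a small guard against ε_t = 0 (an implementation detail not on the slide).

      import numpy as np

      def adaboost(X, y, T):
          # returns the learned stumps (theta, j, b) and their weights w_t
          m = X.shape[0]
          D = np.full(m, 1.0 / m)                      # line 2: D^(1) = (1/m, ..., 1/m)
          stumps, weights = [], []
          for t in range(T):                           # line 3
              theta, j, b, eps = learn_stump(X, y, D)  # line 4: h_t = A(D^(t), S); eps is the error of line 5
              eps = min(max(eps, 1e-12), 1 - 1e-12)    # keep the log below finite (assumed guard)
              w = 0.5 * np.log(1.0 / eps - 1.0)        # line 6: w_t = (1/2) log(1/eps_t - 1)
              pred = stump_predict(X, theta, j, b)
              D = D * np.exp(-w * y * pred)            # line 7: reweight every example
              D = D / D.sum()                          #         and renormalize
              stumps.append((theta, j, b))
              weights.append(w)
          return stumps, weights

      def adaboost_predict(X, stumps, weights):
          # line 9: h_S(x) = sign(sum_t w_t h_t(x))
          score = sum(w * stump_predict(X, theta, j, b)
                      for w, (theta, j, b) in zip(weights, stumps))
          return np.where(score >= 0, 1, -1)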

  32. Example
     [figures omitted: (a) t = 1, (b) t = 2, (c) t = 3]
     [Mohri et al., 2018, Page 147]

  33. Example (Cont.)
       $h(x) = \mathrm{sign}\Big( \sum_{t=1}^{T} w_t h_t(x) \Big)$   (12)
     [Mohri et al., 2018, Page 147]

  35. Theoretical Analysis
     Let S be a training set and assume that at each iteration of AdaBoost, the weak learner returns a hypothesis for which ε_t ≤ 1/2 − γ. Then the training error of the output hypothesis of AdaBoost is at most
       $L_S(h_S) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h_S(x_i) \neq y_i] \leq \exp(-2\gamma^2 T)$   (13)
     [Shalev-Shwartz and Ben-David, 2014, Pages 135-137]
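
     A direct consequence of (13): to guarantee training error at most ε, it is enough to choose T so that exp(−2γ²T) ≤ ε, i.e.,
       $T \geq \frac{\ln(1/\epsilon)}{2\gamma^2}$
     For example, with edge γ = 0.1 and target ε = 0.01, this gives T ≥ ln(100)/0.02 ≈ 230.3, so about 231 rounds of boosting suffice.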

  37. VC Dimension
     Let
     ◮ B be a base hypothesis space (e.g., decision stumps)
     ◮ L(B, T) be the hypothesis space produced by the AdaBoost algorithm after T rounds
     Assume that both T and VCdim(B) are at least 3. Then
       $\mathrm{VCdim}(L(B, T)) \leq O\big( T \cdot \mathrm{VCdim}(B) \cdot \log(T \cdot \mathrm{VCdim}(B)) \big)$
     [Shalev-Shwartz and Ben-David, 2014, Page 139]

  38. References
     Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of Machine Learning. MIT Press.
     Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
