Lecture 20: AdaBoost
Aykut Erdem · December 2017 · Hacettepe University
Last time… Bias/Variance Tradeoff
Graphical illustration of bias and variance: http://scott.fortmann-roe.com/docs/BiasVariance.html
slide by David Sontag
Last time… Bagging
• Leo Breiman (1994)
• Take repeated bootstrap samples from training set D.
• Bootstrap sampling: given a set D containing N training examples, create D' by drawing N examples at random with replacement from D.
• Bagging:
  - Create k bootstrap samples D_1, ..., D_k.
  - Train a distinct classifier on each D_i.
  - Classify new instances by majority vote / average.
slide by David Sontag
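Purely as an illustration of the bagging recipe above (not from the original slides), here is a minimal Python sketch; the use of decision trees as base learners and all function and parameter names are my own assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Train k classifiers, each on a bootstrap sample D' drawn with replacement from D."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)              # N draws with replacement -> bootstrap sample D'
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classify new instances by majority vote (labels assumed to be -1/+1, k odd)."""
    votes = np.stack([m.predict(X) for m in models])  # shape (k, n_samples)
    return np.sign(votes.sum(axis=0))
```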
Last time… Random Forests
[Figure: an ensemble of decision trees (t = 1, 2, 3); from the book of Hastie, Friedman and Tibshirani]
slide by Nando de Freitas
Last time… Boosting
• Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote.
• On each iteration t:
  - weight each training example by how incorrectly it was classified
  - learn a hypothesis h_t
  - choose a strength (vote) α_t for this hypothesis
• Final classifier: a linear combination of the votes of the different classifiers, weighted by their strength.
• Practically useful
• Theoretically interesting
slide by Aarti Singh & Barnabas Poczos
The AdaBoost Algorithm
Voted combination of classifiers
• The general problem here is to try to combine many simple "weak" classifiers into a single "strong" classifier.
• We consider voted combinations of simple binary ±1 component classifiers, where the (non-negative) votes α_i can be used to emphasize component classifiers that are more reliable than others.
slide by Tommi S. Jaakkola
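The combination formula on this slide appeared as an equation; a reconstruction in the standard notation (f_m for the combination of m components, α_j for the votes, both assumed here) is:

\[
h_m(x) = \operatorname{sign}\big(f_m(x)\big), \qquad
f_m(x) = \sum_{j=1}^{m} \alpha_j\, h(x;\theta_j), \qquad \alpha_j \ge 0 .
\]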
Components: Decision stumps
• Consider the following simple family of component classifiers generating ±1 labels, called decision stumps (the defining equation is sketched below).
• Each decision stump pays attention to only a single component of the input vector.
slide by Tommi S. Jaakkola
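A sketch of the stump definition in the usual parameterisation (the symbols k, w_0, w_1 are assumptions, not taken verbatim from the slide):

\[
h(x;\theta) = \operatorname{sign}\big(w_1 x_k + w_0\big), \qquad \theta = \{\,k,\; w_0,\; w_1\,\},
\]

i.e., the stump thresholds the single input coordinate x_k and outputs ±1.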
Voted combinations (cont'd.)
• We need to define a loss function for the combination so we can determine which new component h(x; θ) to add and how many votes it should receive.
• While there are many options for the loss function, we consider here only a simple exponential loss.
slide by Tommi S. Jaakkola
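For a label y ∈ {−1, +1} and a real-valued combination f(x), the exponential loss referred to here is, in its usual form:

\[
\mathrm{Loss}\big(y, f(x)\big) = \exp\big(-y\, f(x)\big),
\]

which is small when the margin y f(x) is large and positive, and grows exponentially on mistakes.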
Modularity, errors, and loss
• Consider adding the m-th component to the existing voted combination (see the sketch below).
• So at the m-th iteration the new component (and its votes) should optimize a weighted loss, weighted towards mistakes.
slide by Tommi S. Jaakkola
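A sketch of the weighted-loss argument, assuming the notation f_{m−1} for the combination of the first m−1 components and W for the example weights:

\[
J(\alpha_m,\theta_m)
= \sum_{i=1}^{n} \exp\!\big(-y_i \,[\, f_{m-1}(x_i) + \alpha_m h(x_i;\theta_m)\,]\big)
= \sum_{i=1}^{n} W_i^{(m-1)} \exp\!\big(-y_i\, \alpha_m h(x_i;\theta_m)\big),
\qquad
W_i^{(m-1)} = \exp\!\big(-y_i\, f_{m-1}(x_i)\big).
\]

The weights W_i^{(m−1)} are fixed at iteration m and are largest for examples the current combination misclassifies, which is what makes the new loss "weighted towards mistakes".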
Empirical exponential loss (cont'd.)
• To increase modularity we'd like to further decouple the optimization of h(x; θ_m) from the associated votes α_m.
• To this end we select the h(x; θ_m) that optimizes the rate at which the loss would decrease as a function of α_m (see the sketch below).
slide by Tommi S. Jaakkola
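Concretely, a sketch of that criterion (reconstructed in the notation introduced above, not verbatim from the slide): the derivative of the loss at α_m = 0 is

\[
\left.\frac{\partial}{\partial \alpha_m} J(\alpha_m,\theta_m)\right|_{\alpha_m=0}
= -\sum_{i=1}^{n} W_i^{(m-1)}\, y_i\, h(x_i;\theta_m),
\]

and the new component is chosen to make this derivative as negative as possible.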
Empirical exponential loss (cont'd.)
• We find the h(x; θ_m) that minimizes this derivative.
• We can also normalize the weights so that they sum to one.
slide by Tommi S. Jaakkola
Empirical exponential loss (cont'd.)
• We find that the optimal component minimizes the weighted training error, where the weights are the normalized ones defined above.
• The vote α_m is subsequently chosen to minimize the resulting exponential loss (see the sketch below).
slide by Tommi S. Jaakkola
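Under that criterion, with normalized weights W̃, a sketch of the resulting choices (matching the α_t formula that appears in the algorithm on the following slides) is:

\[
\epsilon_m = \sum_{i=1}^{n} \tilde{W}_i^{(m-1)}\, \mathbb{1}\{\, y_i \ne h(x_i;\hat{\theta}_m) \,\},
\qquad
\hat{\alpha}_m = \tfrac{1}{2}\,\log\frac{1-\epsilon_m}{\epsilon_m},
\]

i.e., the best component minimizes the weighted training error, and its vote then has a closed form.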
The AdaBoost Algorithm
slide by Jiri Matas and Jan Šochman
The AdaBoost Algorithm
Given: (x_1, y_1), ..., (x_m, y_m); x_i ∈ X, y_i ∈ {−1, +1}
Initialise weights D_1(i) = 1/m
For t = 1, ..., T:
  • Find h_t = argmin_{h_j ∈ H} ε_j, where ε_j = Σ_{i=1}^m D_t(i) ⟦y_i ≠ h_j(x_i)⟧
  • If ε_t ≥ 1/2 then stop
  • Set α_t = (1/2) log((1 − ε_t) / ε_t)
  • Update D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalisation factor
Output the final classifier: H(x) = sign( Σ_{t=1}^T α_t h_t(x) )
[Plot: training error vs. boosting step, decreasing as t runs from 1 to about 40]
slide by Jiri Matas and Jan Šochman
(The same algorithm slide is repeated for steps t = 2, ..., 7, with the training-error plot updated at each boosting step.)
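For concreteness, here is a minimal NumPy sketch of the algorithm above, using decision stumps as the weak learners; the exhaustive stump search, the function names, and the small-epsilon safeguard are my own assumptions rather than part of the original slides.

```python
import numpy as np

def fit_stump(X, y, D):
    """Pick the decision stump (feature, threshold, polarity) minimising the weighted error."""
    n, d = X.shape
    best = (0, 0.0, 1, np.inf)
    for k in range(d):
        for thr in np.unique(X[:, k]):
            for pol in (+1, -1):
                pred = pol * np.sign(X[:, k] - thr)
                pred[pred == 0] = pol
                err = D[pred != y].sum()          # eps_j = sum_i D_t(i) [y_i != h_j(x_i)]
                if err < best[3]:
                    best = (k, thr, pol, err)
    return best

def adaboost(X, y, T=40):
    """AdaBoost with decision stumps; labels y must be in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                       # D_1(i) = 1/m
    ensemble = []
    for t in range(T):
        k, thr, pol, eps = fit_stump(X, y, D)
        if eps >= 0.5:                            # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # alpha_t = 1/2 log((1 - eps_t)/eps_t)
        pred = pol * np.sign(X[:, k] - thr)
        pred[pred == 0] = pol
        D = D * np.exp(-alpha * y * pred)         # up-weight mistakes, down-weight correct examples
        D = D / D.sum()                           # divide by the normalisation factor Z_t
        ensemble.append((alpha, k, thr, pol))
    return ensemble

def predict(ensemble, X):
    """Final classifier H(x) = sign(sum_t alpha_t h_t(x))."""
    score = np.zeros(len(X))
    for alpha, k, thr, pol in ensemble:
        pred = pol * np.sign(X[:, k] - thr)
        pred[pred == 0] = pol
        score += alpha * pred
    return np.sign(score)
```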