

slide-1
SLIDE 1

Lecture 20:

AdaBoost

Aykut Erdem

December 2017 Hacettepe University

slide-2
SLIDE 2

Last time… Bias/Variance Tradeoff

2

Graphical illustration of bias and variance: http://scott.fortmann-roe.com/docs/BiasVariance.html

slide by David Sontag

slide-3
SLIDE 3

Last time… Bagging

  • Leo Breiman (1994)
  • Take repeated bootstrap samples from training set D.
  • Bootstrap sampling: Given a set D containing N training examples, create D’ by drawing N examples at random with replacement from D.
  • Bagging:
  • Create k bootstrap samples D1 ... Dk.
  • Train a distinct classifier on each Di.
  • Classify new instances by majority vote / average. (A minimal code sketch follows at the end of this slide.)

3

slide by David Sontag
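To make the bagging recipe above concrete, here is a minimal sketch (not from the slides), assuming NumPy arrays, labels in {−1, +1}, and a placeholder train_fn that fits one base classifier on a sample and returns it as a callable model:

    import numpy as np

    def bagging_fit(X, y, train_fn, k=25, seed=0):
        """Train k classifiers, each on a bootstrap sample of (X, y)."""
        rng = np.random.default_rng(seed)
        n = len(X)
        models = []
        for _ in range(k):
            idx = rng.integers(0, n, size=n)        # N draws with replacement
            models.append(train_fn(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        """Classify by majority vote; labels are assumed to be in {-1, +1}."""
        votes = np.sum([m(X) for m in models], axis=0)
        return np.sign(votes)

With k such models, a new instance is classified by the sign of the summed votes, i.e. a simple majority.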

slide-4
SLIDE 4

Last time… Random Forests

4

slide by Nando de Freitas

[From the book of Hastie, Friedman and Tibshirani]

[Figure: three trees of the ensemble, t = 1, 2, 3]

slide-5
SLIDE 5

Last time… Boosting

  • Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote
  • On each iteration t:
  • weight each training example by how incorrectly it was classified
  • learn a hypothesis ht
  • and a strength for this hypothesis αt
  • Final classifier: a linear combination of the votes of the different classifiers, weighted by their strength (written out below)
  • Practically useful
  • Theoretically interesting

5

slide by Aarti Singh & Barnabas Poczos
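Written out in the standard notation (matching the AdaBoost slides that follow), the final classifier in the last bullet is the weighted vote

    H(x) = sign( Σ_{t=1}^{T} αt ht(x) )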

slide-6
SLIDE 6

The AdaBoost Algorithm

6

slide-7
SLIDE 7

Voted combination of classifiers

  • The general problem here is to try to combine many simple “weak” classifiers into a single “strong” classifier
  • We consider voted combinations of simple binary ±1 component classifiers, hm(x) = α1 h(x; θ1) + ... + αm h(x; θm), where the (non-negative) votes αi can be used to emphasize component classifiers that are more reliable than others

7

slide by Tommi S. Jaakkola

slide-8
SLIDE 8

Components: Decision stumps

  • Consider the following simple family of component classifiers generating ±1 labels: h(x; θ) = sign(w1 xk + w0), where θ = {k, w1, w0}. These are called decision stumps.
  • Each decision stump pays attention to only a single component of the input vector (a code sketch follows below)

8

slide by Tommi S. Jaakkola
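As an illustration of such a stump (mine, not the slides'), the sketch below exhaustively searches one feature index k, a threshold, and a sign to minimise the weighted error on examples with labels in {−1, +1}; it is reused as the weak learner in the AdaBoost sketch later in these notes.

    import numpy as np

    def fit_stump(X, y, w):
        """Return (predict_fn, weighted_error) for the decision stump
        h(x) = s * sign(x[k] - thr) that minimises sum(w[h(x_i) != y_i])."""
        n, d = X.shape
        best = (np.inf, 0, 0.0, 1)                  # (error, feature k, threshold, sign s)
        for k in range(d):
            for thr in np.unique(X[:, k]):
                pred = np.where(X[:, k] > thr, 1, -1)
                for s in (1, -1):
                    err = np.sum(w[s * pred != y])  # total weight of misclassified examples
                    if err < best[0]:
                        best = (err, k, thr, s)
        err, k, thr, s = best
        predict = lambda Z, k=k, thr=thr, s=s: s * np.where(Z[:, k] > thr, 1, -1)
        return predict, err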

slide-9
SLIDE 9

Voted combinations (cont’d.)

  • We need to define a loss function for the combination so we can determine which new component h(x; θ) to add and how many votes it should receive
  • While there are many options for the loss function we consider here only a simple exponential loss (written out below)

9

slide by Tommi S. Jaakkola
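Written out in the notation above (a reconstruction of the formula the slide displays), the empirical exponential loss of the current combination hm is

    J(hm) = Σ_{i=1}^{n} exp( −yi hm(xi) )

Since exp(−yi hm(xi)) ≥ 1 whenever the combined vote has the wrong sign, this loss upper-bounds the number of training mistakes.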

slide-10
SLIDE 10

Modularity, errors, and loss

  • Consider adding the mth component:

10

slide by Tommi S. Jaakkola

slide-11
SLIDE 11

Modularity, errors, and loss

  • Consider adding the mth component:

11

slide by Tommi S. Jaakkola

slide-12
SLIDE 12

Modularity, errors, and loss

  • Consider adding the mth component:
  • So at the mth iteration the new component (and the votes) should optimize a weighted loss (weighted towards mistakes), as written out below.

12

slide by Tommi S. Jaakkola
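The step these three slides walk through is the usual decomposition after adding the mth component (reconstructed here; Wi^(m−1) denotes the weight an example carries over from the first m−1 components):

    Σ_{i=1}^{n} exp( −yi hm−1(xi) − yi αm h(xi; θm) ) = Σ_{i=1}^{n} Wi^(m−1) exp( −yi αm h(xi; θm) ),
    where Wi^(m−1) = exp( −yi hm−1(xi) )

The weights Wi^(m−1) are largest for the examples the current combination gets most wrong, which is exactly the "weighted towards mistakes" statement above.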

slide-13
SLIDE 13

Empirical exponential loss (cont’d.)

  • To increase modularity we’d like to further decouple the optimization of h(x; θm) from the associated votes αm
  • To this end we select the h(x; θm) that optimizes the rate at which the loss would decrease as a function of αm

13

slide by Tommi S. Jaakkola

slide-14
SLIDE 14

Empirical exponential loss (cont’d.)

  • We find the component h(x; θm) that minimizes the resulting weighted error
  • We can also normalize the weights so that they sum to one over the training examples (see below)

14

slide by Tommi S. Jaakkola
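The normalization the slide refers to is, in the same notation (a reconstruction; the slide's exact symbols may differ):

    W̃i^(m−1) = Wi^(m−1) / Σ_{j=1}^{n} Wj^(m−1),   so that   Σ_{i=1}^{n} W̃i^(m−1) = 1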

slide-15
SLIDE 15

Empirical exponential loss (cont’d.)

  • We find the h(x; θm) that minimizes the weighted training error, where the weights are the normalized ones from the previous slide (written out below)
  • The vote αm is subsequently chosen to minimize the resulting weighted exponential loss, which gives the closed form below

15

slide by Tommi S. Jaakkola
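Filling in the formulas this slide refers to (a reconstruction in the same notation): the new component minimizes the normalized weighted error, and the vote then has a closed form that matches the αt used in the AdaBoost pseudocode on the following slides:

    ε̂m = Σ_{i=1}^{n} W̃i^(m−1) ⟦yi ≠ h(xi; θm)⟧,     α̂m = (1/2) log( (1 − ε̂m) / ε̂m )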

slide-16
SLIDE 16

The AdaBoost Algorithm

16

slide by Jiri Matas and Jan Šochman

slide-17
SLIDE 17

The AdaBoost Algorithm

17

Given: (x1, y1), . . . , (xm, ym); xi ∈ X, yi ∈ {−1, +1}

slide by Jiri Matas and Jan Šochman

slide-18
SLIDE 18

The AdaBoost Algorithm

18

Given: (x1, y1), . . . , (xm, ym); xi ∈ X, yi ∈ {−1, +1} Initialise weights D1(i) = 1/m

slide by Jiri Matas and Jan Šochman

slide-19
SLIDE 19

The AdaBoost Algorithm

19

Given: (x1, y1), . . . , (xm, ym); xi ∈ X, yi ∈ {−1, +1}
Initialise weights D1(i) = 1/m
For t = 1, ..., T:
  Find ht = arg min over hj ∈ H of the weighted error εj = Σ_{i=1}^{m} Dt(i) ⟦yi ≠ hj(xi)⟧
  If εt ≥ 1/2 then stop

(t = 1)

slide by Jiri Matas and Jan Šochman

slide-20
SLIDE 20

The AdaBoost Algorithm

20

Given: (x1, y1), . . . , (xm, ym); xi ∈ X, yi ∈ {−1, +1}
Initialise weights D1(i) = 1/m
For t = 1, ..., T:
  Find ht = arg min over hj ∈ H of the weighted error εj = Σ_{i=1}^{m} Dt(i) ⟦yi ≠ hj(xi)⟧
  If εt ≥ 1/2 then stop
  Set αt = (1/2) log((1 − εt) / εt)

(t = 1)

slide by Jiri Matas and Jan Šochman

slide-21
SLIDE 21

The AdaBoost Algorithm

21

Given: (x1, y1), . . . , (xm, ym); xi ∈ X, yi ∈ {−1, +1}
Initialise weights D1(i) = 1/m
For t = 1, ..., T:
  Find ht = arg min over hj ∈ H of the weighted error εj = Σ_{i=1}^{m} Dt(i) ⟦yi ≠ hj(xi)⟧
  If εt ≥ 1/2 then stop
  Set αt = (1/2) log((1 − εt) / εt)
  Update Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt, where Zt is a normalisation factor

(t = 1)

slide by Jiri Matas and Jan Šochman

slide-22
SLIDE 22

The AdaBoost Algorithm

22

Given: (x1, y1), . . . , (xm, ym); xi ∈ X, yi ∈ {−1, +1}
Initialise weights D1(i) = 1/m
For t = 1, ..., T:
  Find ht = arg min over hj ∈ H of the weighted error εj = Σ_{i=1}^{m} Dt(i) ⟦yi ≠ hj(xi)⟧
  If εt ≥ 1/2 then stop
  Set αt = (1/2) log((1 − εt) / εt)
  Update Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt, where Zt is a normalisation factor
Output the final classifier: H(x) = sign( Σ_{t=1}^{T} αt ht(x) )

[Plot: training error vs. boosting step (x axis: step 5 to 40; y axis: error 0.05 to 0.35), shown after t = 1 round]

slide by Jiri Matas and Jan Šochman
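A minimal runnable sketch of the pseudocode above (mine, not the slides'), using the fit_stump helper sketched on the decision-stump slide as the weak learner; the sign conventions follow the update rule and the αt formula shown here:

    import numpy as np

    def adaboost(X, y, T=40):
        """AdaBoost with decision stumps (fit_stump above); labels y in {-1, +1}."""
        m = len(y)
        D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
        hs, alphas = [], []
        for t in range(T):
            h, eps = fit_stump(X, y, D)            # weak learner minimising weighted error
            if eps >= 0.5:                         # no better than chance: stop
                break
            eps = max(eps, 1e-12)                  # guard: a perfect stump would give alpha = inf
            alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t = (1/2) log((1 - eps_t)/eps_t)
            D = D * np.exp(-alpha * y * h(X))      # up-weight mistakes, down-weight correct ones
            D = D / D.sum()                        # divide by Z_t so D stays a distribution
            hs.append(h)
            alphas.append(alpha)

        def H(Z):                                  # final classifier: sign of the weighted vote
            return np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))
        return H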

slide-23
SLIDE 23

The AdaBoost Algorithm

23

(Same AdaBoost pseudocode as on slide 22.)

[Plot: training error vs. boosting step, shown after t = 2 rounds]

slide by Jiri Matas and Jan Šochman

slide-24
SLIDE 24

The AdaBoost Algorithm

24

(Same AdaBoost pseudocode as on slide 22.)

[Plot: training error vs. boosting step, shown after t = 3 rounds]

slide by Jiri Matas and Jan Šochman

slide-25
SLIDE 25

The AdaBoost Algorithm

25

(Same AdaBoost pseudocode as on slide 22.)

[Plot: training error vs. boosting step, shown after t = 4 rounds]

slide by Jiri Matas and Jan Šochman

slide-26
SLIDE 26

The AdaBoost Algorithm

26

(Same AdaBoost pseudocode as on slide 22.)

[Plot: training error vs. boosting step, shown after t = 5 rounds]

slide by Jiri Matas and Jan Šochman

slide-27
SLIDE 27

The AdaBoost Algorithm

27

(Same AdaBoost pseudocode as on slide 22.)

[Plot: training error vs. boosting step, shown after t = 6 rounds]

slide by Jiri Matas and Jan Šochman

slide-28
SLIDE 28

The AdaBoost Algorithm

28

(Same AdaBoost pseudocode as on slide 22.)

[Plot: training error vs. boosting step, shown after t = 7 rounds]

slide by Jiri Matas and Jan Šochman

slide-29
SLIDE 29

The AdaBoost Algorithm

29

(Same AdaBoost pseudocode as on slide 22.)

[Plot: training error vs. boosting step, shown after t = 40 rounds]

slide by Jiri Matas and Jan Šochman

slide-30
SLIDE 30

Reweighting

30

slide by Jiri Matas and Jan Šochman
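The reweighting figures on these slides are not recoverable from the text, but the effect they illustrate is standard: the update multiplies the weight of every example ht misclassifies by exp(αt) > 1 and the weight of every correctly classified example by exp(−αt) < 1, and after renormalisation the weighted error of ht on the new distribution Dt+1 is exactly 1/2, so the next weak learner is forced to exploit something new:

    Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt,     Σ_{i=1}^{m} Dt+1(i) ⟦yi ≠ ht(xi)⟧ = 1/2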

slide-31
SLIDE 31

31

Reweighting

slide by Jiri Matas and Jan Šochman

slide-32
SLIDE 32

32

Reweighting

slide by Jiri Matas and Jan Šochman

slide-33
SLIDE 33

Boosting results – Digit recognition

  • Boosting is often (but not always) robust to overfitting
  • Test set error can keep decreasing even after the training error reaches zero

33

[Schapire, 1989]

[Plot: training error and test error (%) vs. number of boosting rounds, 10 to 1000 rounds on a log scale]

slide by Carlos Guestrin

slide-34
SLIDE 34

Application: Detecting Faces

  • Training Data
  • 5000 faces
  • All frontal
  • 300 million non-faces
  • 9500 non-face images

34

[Viola & Jones]

slide by Rob Schapire

slide-35
SLIDE 35

Application: Detecting Faces

  • Problem: find faces in photograph or movie
  • Weak classifiers: detect a light/dark rectangle pattern in the image (a rectangle-feature sketch follows below)


  • Many clever tricks to make extremely fast and accurate

35

[Viola & Jones]

slide by Rob Schapire
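To make the "light/dark rectangle" weak classifiers concrete, here is a minimal sketch of a two-rectangle feature computed from an integral image, in the spirit of Viola & Jones; the coordinates and the threshold in the final comment are illustrative placeholders, not values from the paper.

    import numpy as np

    def integral_image(img):
        """ii[r, c] = sum of img[:r+1, :c+1]; any rectangle sum then costs 4 lookups."""
        return img.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, r0, c0, r1, c1):
        """Sum of img[r0:r1, c0:c1] computed from the integral image ii."""
        total = ii[r1 - 1, c1 - 1]
        if r0 > 0:
            total -= ii[r0 - 1, c1 - 1]
        if c0 > 0:
            total -= ii[r1 - 1, c0 - 1]
        if r0 > 0 and c0 > 0:
            total += ii[r0 - 1, c0 - 1]
        return total

    def two_rect_feature(ii, r0, c0, r1, c1):
        """Left-half minus right-half intensity: responds to a light/dark vertical edge."""
        cm = (c0 + c1) // 2
        return rect_sum(ii, r0, c0, r1, cm) - rect_sum(ii, r0, cm, r1, c1)

    # A weak classifier thresholds one such feature, e.g. (coordinates are placeholders):
    # h(img) = +1 if two_rect_feature(integral_image(img), 4, 4, 20, 20) > threshold else -1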

slide-36
SLIDE 36

Boosting vs. Logistic Regression

36

Logistic regression:

  • Minimize log loss
  • Define f(x) as a weighted sum of predefined features xj (a linear classifier)
  • Jointly optimize over all weights w0, w1, w2, …

Boosting:

  • Minimize exp loss
  • Define f(x) as a weighted sum of weak hypotheses ht(x), defined dynamically to fit the data (not a linear classifier)
  • Weights αt learned incrementally, one per iteration t


slide by Aarti Singh
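For reference, the two losses contrasted above, written for a label y ∈ {−1, +1} and a real-valued score f(x) (standard definitions, not from the slide):

    log loss: log(1 + exp(−y f(x)))        exp loss: exp(−y f(x))

The exponential loss grows much faster for badly misclassified points (large negative y f(x)), which is one reason boosting can be more sensitive to label noise and outliers than logistic regression.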

slide-37
SLIDE 37

Boosting vs. Bagging

Bagging:

  • Resample data points
  • Weight of each classifier is the same
  • Only variance reduction

37

Boosting:

  • Reweights data points (modifies their distribution)
  • Weight is dependent on classifier’s accuracy
  • Both bias and variance reduced – learning rule becomes more complex with iterations

slide by Aarti Singh