MACHINE LEARNING - 2013
Bagging, Boosting and RANSAC
Bootstrap Aggregation (Bagging)
• The Main Idea
• Some Examples
• Why it works
The Main Idea: Aggregation
• Imagine we have m sets of n independent observations S^(1) = {(X_1, Y_1), ..., (X_n, Y_n)}^(1), ..., S^(m) = {(X_1, Y_1), ..., (X_n, Y_n)}^(m), all taken iid from the same underlying distribution P
• Traditional approach: generate some ϕ(x, S) from all the data samples
• Aggregation: learn ϕ(x, S) by averaging ϕ(x, S^(k)) over many k
The Main Idea: Bootstrapping
• Unfortunately, we usually have one single observation set S
• Idea: bootstrap S to form the observation sets S^(k)
• (canonical) Draw samples from S with replacement (so some are duplicated) until a new S^(i) of the same size as S is filled
• (practical) Take a subset of the samples of S (use a smaller set)
• The samples not used by each set are validation samples
The Main Idea: Bagging
• Generate S^(1), ..., S^(m) by bootstrapping
• Compute each ϕ(x, S^(k)) individually
• Compute ϕ(x, S) = E_k(ϕ(x, S^(k))) by aggregation
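Not from the slides: a minimal Python/NumPy sketch of this procedure for regression. It assumes NumPy arrays and that fit_model is any routine that trains a model on a bootstrapped set and returns a prediction function; the name fit_model and the choice m = 50 are illustrative.

```python
import numpy as np

def bagging_predict(X_train, y_train, X_test, fit_model, m=50, rng=None):
    """Minimal bagging sketch: bootstrap m training sets, fit one model
    on each, and average their predictions (regression setting)."""
    rng = np.random.default_rng(rng)
    n = len(X_train)
    predictions = []
    for _ in range(m):
        # Canonical bootstrap: sample n indices with replacement from S
        idx = rng.integers(0, n, size=n)
        model = fit_model(X_train[idx], y_train[idx])   # phi(x, S^(k))
        predictions.append(model(X_test))
    # Aggregation: phi(x, S) = E_k[ phi(x, S^(k)) ]
    return np.mean(predictions, axis=0)
```

A high-variance base learner (e.g. a deep decision tree or a high-degree polynomial fit) is the typical choice for fit_model, which anticipates the "instability is good" remark later in the slides.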
A concrete example
• We select some input samples (x_1^(1), y_1^(1)), ..., (x_n^(1), y_n^(1))
• We learn a regression model f̂^(1)(x) = Y^(1)
• Very sensitive to the input selection
• m training sets = m different models f̂^(1), ..., f̂^(m), i.e. m predictions Y^(1), ..., Y^(m)
Aggregation: combine several models
• Linear combination of simple models: Z = (1/m) Σ_{i=1}^m Y^(i)
• More examples = better model (compare m = 4, m = 10, m = 60)
• We can stop when we're satisfied
Proof of convergence
Hypothesis: the average will converge to something meaningful
• Assumptions
  • Y^(1), ..., Y^(m) are iid
  • E(Y) = y (Y is an unbiased estimator of y)
• Expected error of a single model:
  E((Y − y)^2) = E((Y − E(Y))^2) = σ^2(Y)
• With aggregation, Z = (1/m) Σ_{i=1}^m Y^(i):
  E(Z) = (1/m) Σ_{i=1}^m E(Y^(i)) = (1/m) Σ_{i=1}^m y = y
  E((Z − y)^2) = E((Z − E(Z))^2) = σ^2(Z) = σ^2((1/m) Σ_{i=1}^m Y^(i)) = (1/m^2) Σ_{i=1}^m σ^2(Y^(i)) = (1/m) σ^2(Y)
• Infinite observations = zero error: we have our underlying estimator!
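As a quick numerical sanity check of the σ^2(Z) = σ^2(Y)/m result (illustrative only: the true value, noise level, m and number of trials are arbitrary choices, and Gaussian noise stands in for the estimator's variability):

```python
import numpy as np

rng = np.random.default_rng(0)
y_true, sigma, m, trials = 3.0, 2.0, 50, 100_000

# Y: one noisy unbiased estimate; Z: average of m iid estimates
Y = y_true + sigma * rng.standard_normal(trials)
Z = y_true + sigma * rng.standard_normal((trials, m)).mean(axis=1)

print(np.mean((Y - y_true) ** 2))   # ~ sigma^2     = 4.0
print(np.mean((Z - y_true) ** 2))   # ~ sigma^2 / m = 0.08
```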
In layman's terms
• The expected error (variance) of Y is larger than that of Z
• The variance of Z shrinks with m
Relaxing the assumptions
• We DROP the second assumption (unbiasedness) and keep only
  • Y^(1), ..., Y^(m) are iid
• Add and subtract E(Y), then regroup:
  E((Y − y)^2) = E((Y − E(Y) + E(Y) − y)^2)
               = E((Y − E(Y))^2) + E((E(Y) − y)^2) + E(2(Y − E(Y))(E(Y) − y))
  The first term is σ^2(Y) ≥ 0, and the cross term vanishes because E(Y − E(Y)) = 0, so
  E((Y − y)^2) ≥ E((E(Y) − y)^2)
  Since Z concentrates around E(Y) as m grows, we can substitute Z for E(Y):
  E((Y − y)^2) ≥ E((Z − y)^2)
• The larger σ^2(Y) is, the better for us: using Z gives us a smaller error (even if we can't prove convergence to zero)
Peculiarities
• Instability is good
  • The more variable (unstable) the form of ϕ(x, S) is, the more improvement can potentially be obtained
  • Low-variability methods (e.g. PCA, LDA) improve less than high-variability ones (e.g. LWR, Decision Trees)
• Loads of redundancy
  • Most predictors do roughly "the same thing"
From Bagging to Boosting
• Bagging: each model is trained independently
• Boosting: each model is built on top of the previous ones
Adaptive Boosting (AdaBoost)
• The Main Idea
• The Thousand Flavours of Boost
• Weak Learners and Cascades
The Main Idea: Iterative Approach
• Combine several simple models (weak learners)
• Avoid redundancy
  • Each learner complements the previous ones
  • Keep track of the errors of the previous learners
Weak Learners
• A "simple" classifier that can be generated easily
• As long as it is better than random, we can use it
• Better when tailored to the problem at hand
  • E.g. very fast at retrieval (for images)
AdaBoost: Initialization
• We choose a weak learner model ϕ(x), e.g. f(x, v) = x · v > θ
• Initialization
  • Generate N weak learners ϕ_1(x), ..., ϕ_N(x) (e.g. random directions v_1, ..., v_N, each with a threshold θ)
  • N can be in the hundreds of thousands
  • Assign a weight w_i to each training sample
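A hedged sketch of this initialization step in Python/NumPy, for the thresholded-projection weak learner used as the example above; the median-projection threshold and the pool size N = 1000 are illustrative choices, not the lecture's.

```python
import numpy as np

def make_stump_pool(X, N=1000, rng=None):
    """Generate N weak learners of the form  phi_j(x) = sign(x . v_j - theta_j).
    Each theta_j is set to the median projection (a simple heuristic)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    V = rng.standard_normal((N, d))         # random directions v_1 .. v_N
    thetas = np.median(X @ V.T, axis=0)     # one threshold per direction
    return V, thetas

def stump_predict(X, V, thetas):
    """Predictions of every weak learner on every sample: shape (n, N), values +/-1."""
    return np.where(X @ V.T > thetas, 1, -1)

# Initial sample weights would be uniform over the n training samples:
# w = np.full(len(X), 1.0 / len(X))
```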
AdaBoost: Iterations
• Compute the error e_j for each classifier ϕ_j(x):
  e_j := Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i]
• Select the ϕ_j with the smallest classification error:
  argmin_j ( Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i] )
• Update the weights w_i depending on how they are classified by ϕ_j. Here comes the important part.
Updating the weights
• Evaluate how "well" ϕ_j(x) is performing (how far are we from a perfect classification?):
  α = (1/2) ln((1 − e_j) / e_j)
• Update the weight of each sample:
  w_i^(t+1) = w_i^(t) exp(α^(t))  if ϕ_j(x_i) ≠ y_i  (make it bigger)
  w_i^(t+1) = w_i^(t) exp(−α^(t)) if ϕ_j(x_i) = y_i  (make it smaller)
AdaBoost: Rinse and Repeat
• Recompute the error e_j for each classifier ϕ_j(x) using the updated weights:
  e_j := Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i]
• Select the new ϕ_j with the smallest classification error:
  argmin_j ( Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i] )
• Update the weights w_i (see the sketch of the full loop below)
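Putting the last three slides together, a minimal sketch of the discrete AdaBoost loop over a fixed pool of weak learners (for example the stumps generated above). It assumes labels in {−1, +1} and precomputed weak-learner predictions; the weight normalization and the clipping of tiny errors are common conventions, not shown on the slides.

```python
import numpy as np

def adaboost(preds, y, T=40):
    """Discrete AdaBoost over a fixed pool of weak learners.

    preds : (n, N) array of +/-1 predictions, preds[i, j] = phi_j(x_i)
    y     : (n,)  array of +/-1 labels
    T     : number of boosting rounds
    Returns a list of (selected learner index, alpha) pairs."""
    n, N = preds.shape
    w = np.full(n, 1.0 / n)                      # initial sample weights
    ensemble = []
    for _ in range(T):
        # e_j = sum_i w_i * 1[phi_j(x_i) != y_i]  for every learner j
        errors = (w[:, None] * (preds != y[:, None])).sum(axis=0)
        j = int(np.argmin(errors))               # weak learner with smallest error
        e_j = errors[j]
        if e_j >= 0.5:                           # no better-than-random learner left
            break
        e_j = max(e_j, 1e-12)                    # guard: a perfect learner would make alpha blow up
        alpha = 0.5 * np.log((1 - e_j) / e_j)    # "how well is phi_j doing?"
        # Reweight: misclassified samples get exp(+alpha), correct ones exp(-alpha)
        w *= np.exp(-alpha * y * preds[:, j])
        w /= w.sum()                             # keep the weights normalized
        ensemble.append((j, alpha))
    return ensemble

def boosted_predict(preds, ensemble):
    """Sign of the alpha-weighted vote of the selected weak learners."""
    score = sum(alpha * preds[:, j] for j, alpha in ensemble)
    return np.sign(score)
```

With the earlier helpers, preds = stump_predict(X, V, thetas), ensemble = adaboost(preds, y), and boosted_predict(preds, ensemble) run the whole pipeline on the training set.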
Boosting In Action: The Checkerboard Problem
Boosting In Action: Initialization
• We choose a simple weak learner f(x, v) = x · v > θ
• We generate a thousand random vectors v_1, ..., v_1000 and the corresponding learners f_j(x, v_j)
• For each f_j(x, v_j) we compute a good threshold θ_j
Boosting In Action: ...and we keep going
• We look for the best weak learner
• We adjust the importance (weight) of the errors
• Rinse and repeat
[Plot: classification accuracy (50%–100%) over the first 10 boosting rounds]
Boosting In Action
[Plot: accuracy (50%–100%) and decision boundaries after 20, 40, 80 and 120 weak learners]
Drawbacks of Boosting
• Overfitting!
  • Boost will always overfit with many weak learners
• Training time
  • Training of a face detector takes up to 2 weeks on modern computers
A thousand different flavors
• A couple of new boost variants appear every year
  • Reduce overfitting
  • Increase robustness to noise
  • Tailored to specific problems
• They mainly change two things
  • How the error is represented
  • How the weights are updated
Timeline: 1989 Boosting, 1996 AdaBoost, 1999 Real AdaBoost, 2000 Margin Boost / Modest AdaBoost / Gentle AdaBoost / AnyBoost / LogitBoost, 2001 BrownBoost, 2003 KLBoost / Weight Boost, 2004 FloatBoost / ActiveBoost, 2005 JensenShannonBoost / Infomax Boost, 2006 Emphasis Boost, 2007 Entropy Boost / Reweight Boost, ...
An example
• Instead of counting the errors, we compute the probability of correct classification
Discrete AdaBoost:
  e_j := Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i]
  α = (1/2) ln((1 − e_j) / e_j)
  w_i^(t+1) = w_i^(t) exp(α^(t))  if ϕ_j(x_i) ≠ y_i
  w_i^(t+1) = w_i^(t) exp(−α^(t)) if ϕ_j(x_i) = y_i
Real AdaBoost:
  p_j = Π_{i=1}^n w_i P(y_i = 1 | x_i)
  α = (1/2) ln((1 − p_j) / p_j)
  w_i^(t+1) = w_i^(t) exp(−y_i α^(t))
A celebrated example: Viola-Jones
• Haar-like wavelets: two rectangles of image pixels, one positive (A), one negative (B)
  I(x): pixel of image I at position x
  f(x) = Σ_{x ∈ A} I(x) − Σ_{x ∈ B} I(x)
  ϕ(x) = 1 if f(x) > 0, −1 otherwise
• Millions of possible classifiers ϕ_1(x), ϕ_2(x), ...
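A hedged sketch of evaluating one such classifier; the integral-image (summed-area table) trick that makes Viola-Jones features fast is included, while the rectangle coordinates and the grayscale image are placeholders.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero border: ii[r, c] = sum of img[:r, :c]."""
    ii = np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, from 4 lookups in the integral image."""
    r0, c0, r1, c1 = top, left, top + height, left + width
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def haar_classify(img, rect_A, rect_B):
    """phi(x) = +1 if  sum_A I(x) - sum_B I(x) > 0, else -1.
    rect_A and rect_B are (top, left, height, width) tuples (placeholders)."""
    ii = integral_image(img)
    f = rect_sum(ii, *rect_A) - rect_sum(ii, *rect_B)
    return 1 if f > 0 else -1
```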
Real-Time on HD video
Some simpler examples
• Feature: the distance from a point c
  f(x, c) = (x − c)^T (x − c) > θ
[Plot: accuracy (50%–100%) vs. number of weak learners (1–120) for Random Circles and Random Projections]
Some simpler examples
• Feature: being inside a rectangle R
  f(x, R) = 1[x ∈ R]
[Plot: accuracy (50%–100%) vs. number of weak learners (1–120) for Random Circles, Random Projections and Random Rectangles]
Some simpler examples
• Feature: full-covariance Gaussian
  f(x, μ, Σ) = P(x | μ, Σ)
[Plot: accuracy (50%–100%) vs. number of weak learners (1–120) for Random Gaussians, Random Circles, Random Projections and Random Rectangles]
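For concreteness, hedged one-line sketches of the three extra feature families used in these toy experiments; the circle, rectangle and Gaussian parameters would be drawn at random as on the slides, and SciPy is assumed only for the Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal  # only needed for the Gaussian feature

# Distance from a point c:          f(x, c)        = (x - c)^T (x - c) > theta
circle_feature = lambda x, c, theta: float(np.sum((x - c) ** 2) > theta)

# Inside an axis-aligned rectangle: f(x, R)        = 1[x in R],  R given by (lo, hi) corners
rect_feature   = lambda x, lo, hi:   float(np.all((x >= lo) & (x <= hi)))

# Full-covariance Gaussian:         f(x, mu, Sigma) = p(x | mu, Sigma)
gauss_feature  = lambda x, mu, cov:  multivariate_normal(mu, cov).pdf(x)
```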
Weak learners don't need to be weak!
• 20 boosted SVMs with 5 support vectors each and the RBF kernel
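One hedged way to reproduce this kind of experiment with scikit-learn, if it is available; using SVC inside AdaBoostClassifier is an assumption about the setup, not the lecture's code, the parameter may be called base_estimator in older scikit-learn versions, and the "5 support vectors per SVM" constraint has no direct scikit-learn knob.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

# 20 boosted RBF-kernel SVMs; AdaBoost passes its sample weights to SVC.fit.
clf = AdaBoostClassifier(
    estimator=SVC(kernel="rbf", gamma="scale"),
    n_estimators=20,
    algorithm="SAMME",      # discrete AdaBoost: only class predictions are needed
)
# clf.fit(X_train, y_train); clf.predict(X_test)
```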