MACHINE LEARNING - 2013
Bagging, Boosting and RANSAC
Bootstrap Aggregation (Bagging)
• The Main Idea
• Some Examples
• Why it works
The Main Idea: Aggregation
• Imagine we have m sets of n independent observations S^(1) = {(X_1, Y_1), ..., (X_n, Y_n)}^(1), ..., S^(m) = {(X_1, Y_1), ..., (X_n, Y_n)}^(m), all taken iid from the same underlying distribution P
• Traditional approach: generate some ϕ(x, S) from all the data samples
• Aggregation: learn ϕ(x, S) by averaging ϕ(x, S^(k)) over many k
The Main Idea: Bootstrapping
• Unfortunately, we usually have one single observation set S
• Idea: bootstrap S to form the observation sets S^(k)
• (canonical) Draw samples from S with replacement (so some are duplicated) until a new S^(i) of the same size as S is filled
• (practical) Take a subset of the samples of S (use a smaller set)
• The samples not used by each set are validation samples
The Main Idea: Bagging
• Generate S^(1), ..., S^(m) by bootstrapping
• Compute each ϕ(x, S^(k)) individually
• Compute ϕ(x, S) = E_k(ϕ(x, S^(k))) by aggregation
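Not from the slides: a minimal Python/NumPy sketch of this procedure for regression. It assumes NumPy arrays and that fit_model is any routine that trains a model on a bootstrapped set and returns a prediction function; the name fit_model and the choice m = 50 are illustrative.

```python
import numpy as np

def bagging_predict(X_train, y_train, X_test, fit_model, m=50, rng=None):
    """Minimal bagging sketch: bootstrap m training sets, fit one model
    on each, and average their predictions (regression setting)."""
    rng = np.random.default_rng(rng)
    n = len(X_train)
    predictions = []
    for _ in range(m):
        # Canonical bootstrap: sample n indices with replacement from S
        idx = rng.integers(0, n, size=n)
        model = fit_model(X_train[idx], y_train[idx])   # phi(x, S^(k))
        predictions.append(model(X_test))
    # Aggregation: phi(x, S) = E_k[ phi(x, S^(k)) ]
    return np.mean(predictions, axis=0)
```

A high-variance base learner (e.g. a deep decision tree or a high-degree polynomial fit) is the typical choice for fit_model, which anticipates the "instability is good" remark later in the slides.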
A concrete example
• We select some input samples (x_1^(1), y_1^(1)), ..., (x_n^(1), y_n^(1))
• We learn a regression model f̂^(1)(x) = Y^(1)
• Very sensitive to the input selection
• m training sets = m different models f̂^(1), ..., f̂^(m), i.e. m predictions Y^(1), ..., Y^(m)
Aggregation: combine several models
• Linear combination of simple models: Z = (1/m) Σ_{i=1}^m Y^(i)
• More examples = better model (compare m = 4, m = 10, m = 60)
• We can stop when we're satisfied
Proof of convergence
Hypothesis: the average will converge to something meaningful
• Assumptions
  • Y^(1), ..., Y^(m) are iid
  • E(Y) = y (Y is an unbiased estimator of y)
• Expected error of a single model:
  E((Y − y)^2) = E((Y − E(Y))^2) = σ^2(Y)
• With aggregation, Z = (1/m) Σ_{i=1}^m Y^(i):
  E(Z) = (1/m) Σ_{i=1}^m E(Y^(i)) = (1/m) Σ_{i=1}^m y = y
  E((Z − y)^2) = E((Z − E(Z))^2) = σ^2(Z) = σ^2((1/m) Σ_{i=1}^m Y^(i)) = (1/m^2) Σ_{i=1}^m σ^2(Y^(i)) = (1/m) σ^2(Y)
• Infinite observations = zero error: we have our underlying estimator!
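As a quick numerical sanity check of the σ^2(Z) = σ^2(Y)/m result (illustrative only: the true value, noise level, m and number of trials are arbitrary choices, and Gaussian noise stands in for the estimator's variability):

```python
import numpy as np

rng = np.random.default_rng(0)
y_true, sigma, m, trials = 3.0, 2.0, 50, 100_000

# Y: one noisy unbiased estimate; Z: average of m iid estimates
Y = y_true + sigma * rng.standard_normal(trials)
Z = y_true + sigma * rng.standard_normal((trials, m)).mean(axis=1)

print(np.mean((Y - y_true) ** 2))   # ~ sigma^2     = 4.0
print(np.mean((Z - y_true) ** 2))   # ~ sigma^2 / m = 0.08
```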
In layman's terms
• The expected error (variance) of Y is larger than that of Z
• The variance of Z shrinks with m
Relaxing the assumptions
• We DROP the second assumption (unbiasedness) and keep only
  • Y^(1), ..., Y^(m) are iid
• Add and subtract E(Y), then regroup:
  E((Y − y)^2) = E((Y − E(Y) + E(Y) − y)^2)
               = E((Y − E(Y))^2) + E((E(Y) − y)^2) + E(2(Y − E(Y))(E(Y) − y))
  The first term is σ^2(Y) ≥ 0, and the cross term vanishes because E(Y − E(Y)) = 0, so
  E((Y − y)^2) ≥ E((E(Y) − y)^2)
  Since Z concentrates around E(Y) as m grows, we can substitute Z for E(Y):
  E((Y − y)^2) ≥ E((Z − y)^2)
• The larger σ^2(Y) is, the better for us: using Z gives us a smaller error (even if we can't prove convergence to zero)
Peculiarities
• Instability is good
  • The more variable (unstable) the form of ϕ(x, S) is, the more improvement can potentially be obtained
  • Low-variability methods (e.g. PCA, LDA) improve less than high-variability ones (e.g. LWR, Decision Trees)
• Loads of redundancy
  • Most predictors do roughly "the same thing"
From Bagging to Boosting
• Bagging: each model is trained independently
• Boosting: each model is built on top of the previous ones
Adaptive Boosting (AdaBoost)
• The Main Idea
• The Thousand Flavours of Boost
• Weak Learners and Cascades
The Main Idea: Iterative Approach
• Combine several simple models (weak learners)
• Avoid redundancy
  • Each learner complements the previous ones
  • Keep track of the errors of the previous learners
Weak Learners
• A "simple" classifier that can be generated easily
• As long as it is better than random, we can use it
• Better when tailored to the problem at hand
  • E.g. very fast at retrieval (for images)
AdaBoost: Initialization
• We choose a weak learner model ϕ(x), e.g. f(x, v) = x · v > θ
• Initialization
  • Generate N weak learners ϕ_1(x), ..., ϕ_N(x) (e.g. random directions v_1, ..., v_N, each with a threshold θ)
  • N can be in the hundreds of thousands
  • Assign a weight w_i to each training sample
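A hedged sketch of this initialization step in Python/NumPy, for the thresholded-projection weak learner used as the example above; the median-projection threshold and the pool size N = 1000 are illustrative choices, not the lecture's.

```python
import numpy as np

def make_stump_pool(X, N=1000, rng=None):
    """Generate N weak learners of the form  phi_j(x) = sign(x . v_j - theta_j).
    Each theta_j is set to the median projection (a simple heuristic)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    V = rng.standard_normal((N, d))         # random directions v_1 .. v_N
    thetas = np.median(X @ V.T, axis=0)     # one threshold per direction
    return V, thetas

def stump_predict(X, V, thetas):
    """Predictions of every weak learner on every sample: shape (n, N), values +/-1."""
    return np.where(X @ V.T > thetas, 1, -1)

# Initial sample weights would be uniform over the n training samples:
# w = np.full(len(X), 1.0 / len(X))
```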
AdaBoost: Iterations
• Compute the error e_j for each classifier ϕ_j(x):
  e_j := Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i]
• Select the ϕ_j with the smallest classification error:
  argmin_j ( Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i] )
• Update the weights w_i depending on how they are classified by ϕ_j. Here comes the important part.
Updating the weights
• Evaluate how "well" ϕ_j(x) is performing (how far are we from a perfect classification?):
  α = (1/2) ln((1 − e_j) / e_j)
• Update the weight of each sample:
  w_i^(t+1) = w_i^(t) exp(α^(t))  if ϕ_j(x_i) ≠ y_i  (make it bigger)
  w_i^(t+1) = w_i^(t) exp(−α^(t)) if ϕ_j(x_i) = y_i  (make it smaller)
AdaBoost: Rinse and Repeat
• Recompute the error e_j for each classifier ϕ_j(x) using the updated weights:
  e_j := Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i]
• Select the new ϕ_j with the smallest classification error:
  argmin_j ( Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i] )
• Update the weights w_i (see the sketch of the full loop below)
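Putting the last three slides together, a minimal sketch of the discrete AdaBoost loop over a fixed pool of weak learners (for example the stumps generated above). It assumes labels in {−1, +1} and precomputed weak-learner predictions; the weight normalization and the clipping of tiny errors are common conventions, not shown on the slides.

```python
import numpy as np

def adaboost(preds, y, T=40):
    """Discrete AdaBoost over a fixed pool of weak learners.

    preds : (n, N) array of +/-1 predictions, preds[i, j] = phi_j(x_i)
    y     : (n,)  array of +/-1 labels
    T     : number of boosting rounds
    Returns a list of (selected learner index, alpha) pairs."""
    n, N = preds.shape
    w = np.full(n, 1.0 / n)                      # initial sample weights
    ensemble = []
    for _ in range(T):
        # e_j = sum_i w_i * 1[phi_j(x_i) != y_i]  for every learner j
        errors = (w[:, None] * (preds != y[:, None])).sum(axis=0)
        j = int(np.argmin(errors))               # weak learner with smallest error
        e_j = errors[j]
        if e_j >= 0.5:                           # no better-than-random learner left
            break
        e_j = max(e_j, 1e-12)                    # guard: a perfect learner would make alpha blow up
        alpha = 0.5 * np.log((1 - e_j) / e_j)    # "how well is phi_j doing?"
        # Reweight: misclassified samples get exp(+alpha), correct ones exp(-alpha)
        w *= np.exp(-alpha * y * preds[:, j])
        w /= w.sum()                             # keep the weights normalized
        ensemble.append((j, alpha))
    return ensemble

def boosted_predict(preds, ensemble):
    """Sign of the alpha-weighted vote of the selected weak learners."""
    score = sum(alpha * preds[:, j] for j, alpha in ensemble)
    return np.sign(score)
```

With the earlier helpers, preds = stump_predict(X, V, thetas), ensemble = adaboost(preds, y), and boosted_predict(preds, ensemble) run the whole pipeline on the training set.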
Boosting In Action: The Checkerboard Problem
Boosting In Action: Initialization
• We choose a simple weak learner f(x, v) = x · v > θ
• We generate a thousand random vectors v_1, ..., v_1000 and the corresponding learners f_j(x, v_j)
• For each f_j(x, v_j) we compute a good threshold θ_j
Boosting In Action: ...and we keep going
• We look for the best weak learner
• We adjust the importance (weight) of the errors
• Rinse and repeat
[Plot: classification accuracy (50%–100%) over the first 10 boosting rounds]
Boosting In Action
[Plot: accuracy (50%–100%) and decision boundaries after 20, 40, 80 and 120 weak learners]
Drawbacks of Boosting
• Overfitting!
  • Boost will always overfit with many weak learners
• Training time
  • Training of a face detector takes up to 2 weeks on modern computers
A thousand different flavors
• A couple of new boost variants appear every year
  • Reduce overfitting
  • Increase robustness to noise
  • Tailored to specific problems
• They mainly change two things
  • How the error is represented
  • How the weights are updated
Timeline: 1989 Boosting, 1996 AdaBoost, 1999 Real AdaBoost, 2000 Margin Boost / Modest AdaBoost / Gentle AdaBoost / AnyBoost / LogitBoost, 2001 BrownBoost, 2003 KLBoost / Weight Boost, 2004 FloatBoost / ActiveBoost, 2005 JensenShannonBoost / Infomax Boost, 2006 Emphasis Boost, 2007 Entropy Boost / Reweight Boost, ...
An example
• Instead of counting the errors, we compute the probability of correct classification
Discrete AdaBoost:
  e_j := Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i]
  α = (1/2) ln((1 − e_j) / e_j)
  w_i^(t+1) = w_i^(t) exp(α^(t))  if ϕ_j(x_i) ≠ y_i
  w_i^(t+1) = w_i^(t) exp(−α^(t)) if ϕ_j(x_i) = y_i
Real AdaBoost:
  p_j = Π_{i=1}^n w_i P(y_i = 1 | x_i)
  α = (1/2) ln((1 − p_j) / p_j)
  w_i^(t+1) = w_i^(t) exp(−y_i α^(t))
A celebrated example: Viola-Jones
• Haar-like wavelets: two rectangles of image pixels, one positive (A), one negative (B)
  I(x): pixel of image I at position x
  f(x) = Σ_{x ∈ A} I(x) − Σ_{x ∈ B} I(x)
  ϕ(x) = 1 if f(x) > 0, −1 otherwise
• Millions of possible classifiers ϕ_1(x), ϕ_2(x), ...
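A hedged sketch of evaluating one such classifier; the integral-image (summed-area table) trick that makes Viola-Jones features fast is included, while the rectangle coordinates and the grayscale image are placeholders.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero border: ii[r, c] = sum of img[:r, :c]."""
    ii = np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, from 4 lookups in the integral image."""
    r0, c0, r1, c1 = top, left, top + height, left + width
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def haar_classify(img, rect_A, rect_B):
    """phi(x) = +1 if  sum_A I(x) - sum_B I(x) > 0, else -1.
    rect_A and rect_B are (top, left, height, width) tuples (placeholders)."""
    ii = integral_image(img)
    f = rect_sum(ii, *rect_A) - rect_sum(ii, *rect_B)
    return 1 if f > 0 else -1
```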
Real-Time on HD video
Some simpler examples
• Feature: the distance from a point c
  f(x, c) = (x − c)^T (x − c) > θ
[Plot: accuracy (50%–100%) vs. number of weak learners (1–120) for Random Circles and Random Projections]
Some simpler examples
• Feature: being inside a rectangle R
  f(x, R) = 1[x ∈ R]
[Plot: accuracy (50%–100%) vs. number of weak learners (1–120) for Random Circles, Random Projections and Random Rectangles]
Some simpler examples
• Feature: full-covariance Gaussian
  f(x, μ, Σ) = P(x | μ, Σ)
[Plot: accuracy (50%–100%) vs. number of weak learners (1–120) for Random Gaussians, Random Circles, Random Projections and Random Rectangles]
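For concreteness, hedged one-line sketches of the three extra feature families used in these toy experiments; the circle, rectangle and Gaussian parameters would be drawn at random as on the slides, and SciPy is assumed only for the Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal  # only needed for the Gaussian feature

# Distance from a point c:          f(x, c)        = (x - c)^T (x - c) > theta
circle_feature = lambda x, c, theta: float(np.sum((x - c) ** 2) > theta)

# Inside an axis-aligned rectangle: f(x, R)        = 1[x in R],  R given by (lo, hi) corners
rect_feature   = lambda x, lo, hi:   float(np.all((x >= lo) & (x <= hi)))

# Full-covariance Gaussian:         f(x, mu, Sigma) = p(x | mu, Sigma)
gauss_feature  = lambda x, mu, cov:  multivariate_normal(mu, cov).pdf(x)
```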
Weak learners don't need to be weak!
• 20 boosted SVMs with 5 support vectors each and the RBF kernel
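One hedged way to reproduce this kind of experiment with scikit-learn, if it is available; using SVC inside AdaBoostClassifier is an assumption about the setup, not the lecture's code, the parameter may be called base_estimator in older scikit-learn versions, and the "5 support vectors per SVM" constraint has no direct scikit-learn knob.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

# 20 boosted RBF-kernel SVMs; AdaBoost passes its sample weights to SVC.fit.
clf = AdaBoostClassifier(
    estimator=SVC(kernel="rbf", gamma="scale"),
    n_estimators=20,
    algorithm="SAMME",      # discrete AdaBoost: only class predictions are needed
)
# clf.fit(X_train, y_train); clf.predict(X_test)
```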