  1. Ensembles. Léon Bottou, COS 424 – 4/8/2010

  2. Readings • T. G. Dietterich (2000), “Ensemble Methods in Machine Learning”. • R. E. Schapire (2003), “The Boosting Approach to Machine Learning”, Sections 1, 2, 3, 4, 6.

  3. Summary 1. Why ensembles? 2. Combining outputs. 3. Constructing ensembles. 4. Boosting.

  4. I. Ensembles

  5. Ensemble of classifiers – Consider a set of classifiers h_1, h_2, ..., h_L. – Construct a classifier by combining their individual decisions, for example by voting their outputs. Accuracy – The ensemble only works if the individual classifiers have low error rates. Diversity – No gain if all classifiers make the same mistakes. – What if the classifiers make different mistakes?
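A minimal majority-vote combiner, sketched in Python with NumPy; the list of classifier objects (assumed to expose a scikit-learn-style predict method) and the ±1 label convention are illustrative assumptions, not part of the slides.

    import numpy as np

    def majority_vote(classifiers, X):
        # Combine binary classifiers with outputs in {-1, +1} by unweighted voting.
        # One row of predictions per classifier, shape (L, n_samples).
        votes = np.array([clf.predict(X) for clf in classifiers])
        # Sum the +/-1 votes and return the sign of the tally.
        return np.sign(votes.sum(axis=0))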

  6. Uncorrelated classifiers Assume that for all r ≠ s, Cov[ 1{h_r(x) = y}, 1{h_s(x) = y} ] = 0. The tally of classifier votes then follows a binomial distribution. Example – Twenty-one uncorrelated classifiers with a 30% error rate.
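As a sanity check on the binomial argument, the short Python snippet below (assuming SciPy is available) computes the probability that the majority vote of 21 independent classifiers, each wrong 30% of the time, is itself wrong.

    from scipy.stats import binom

    n, p = 21, 0.3                    # 21 uncorrelated voters, each with 30% error rate
    # The majority vote errs when 11 or more of the 21 voters are wrong.
    p_majority_wrong = 1.0 - binom.cdf(10, n, p)
    print(f"P(majority vote is wrong) = {p_majority_wrong:.3f}")   # roughly 0.026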

  7. Statistical motivation [Figure – blue: classifiers that work well on the training set(s); f: the best classifier.]

  8. Computational motivation [Figure – blue: the classifier search may reach local optima; f: the best classifier.]

  9. Representational motivation [Figure – blue: the classifier space may not contain the best classifier; f: the best classifier.]

  10. Practical success Recommendation system – Netflix “movies you may like”. – Customers sometimes rate the movies they rent. – Input: (movie, customer). – Output: rating. Netflix competition – $1M for the first team to do 10% better than Netflix's own system. Winner: the BellKor team and friends – an ensemble of more than 800 rating systems. Runner-up: everybody else – an ensemble of all the rating systems built by the other teams.

  11. Bayesian ensembles Let D represent the training data. Enumerating all the classifiers: P(y | x, D) = Σ_h P(y, h | x, D) = Σ_h P(h | x, D) P(y | h, x, D) = Σ_h P(h | D) P(y | x, h). P(h | D): how well h matches the training data. P(y | x, h): what h predicts for pattern x. Note that this is a weighted average.
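A minimal sketch of that weighted average, assuming the posterior weights P(h | D) and the per-classifier predictive distributions P(y | x, h) are already available as NumPy arrays; the array names and the toy numbers are illustrative, not from the slides.

    import numpy as np

    def bayesian_ensemble_predict(posterior, predictive):
        # posterior:  shape (H,)   -- P(h | D), sums to 1
        # predictive: shape (H, K) -- P(y | x, h) for one pattern x and K classes
        # returns:    shape (K,)   -- the weighted average P(y | x, D)
        return posterior @ predictive

    posterior = np.array([0.5, 0.3, 0.2])
    predictive = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
    print(bayesian_ensemble_predict(posterior, predictive))   # -> [0.67 0.33]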

  12. II. Combining Outputs

  13. Simple averaging [Diagram: the combined output is the plain, unweighted average of the individual classifier outputs.]

  14. Weighted averaging a priori [Diagram: the combined output is a weighted average of the individual classifier outputs.] Weights derived from the training errors, e.g. exp(−β · TrainingError(h_t)). Approximate Bayesian ensemble.
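A hedged Python sketch of the a-priori weighting; the exp(−β · error) form follows the slide, while the classifier objects, the stored training errors, and the value of β are illustrative assumptions.

    import numpy as np

    def weighted_average_predict(classifiers, train_errors, X, beta=5.0):
        # Weight each classifier by exp(-beta * training_error), normalize,
        # then average the +/-1 outputs and return the sign of the result.
        weights = np.exp(-beta * np.asarray(train_errors))
        weights /= weights.sum()
        votes = np.array([clf.predict(X) for clf in classifiers])   # shape (L, n)
        return np.sign(weights @ votes)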

  15. Weighted averaging with trained weights [Diagram: the combining weights are themselves learned.] Train the weights on the validation set; training the weights on the training set overfits easily. You need another validation set to estimate the performance!

  16. Stacked classifiers [Diagram: a second-tier classifier takes the first-tier classifier outputs as its inputs.] The second-tier classifier is trained on the validation set. You need another validation set to estimate the performance!
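A minimal stacking sketch with scikit-learn-style estimators; the logistic-regression combiner and the particular data splits are assumptions chosen for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_stack(base_classifiers, X_train, y_train, X_valid, y_valid):
        # First tier: fit each base classifier on the training set.
        for clf in base_classifiers:
            clf.fit(X_train, y_train)
        # Second tier: its features are the first-tier outputs on held-out data.
        Z_valid = np.column_stack([clf.predict(X_valid) for clf in base_classifiers])
        combiner = LogisticRegression().fit(Z_valid, y_valid)
        return combiner

    def stacked_predict(base_classifiers, combiner, X):
        Z = np.column_stack([clf.predict(X) for clf in base_classifiers])
        return combiner.predict(Z)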

  17. III. Constructing Ensembles

  18. Diversification
      Cause of the mistake                      Diversification strategy
      The pattern was difficult                 hopeless
      Overfitting (⋆)                           vary the training sets
      Some features were noisy                  vary the set of input features
      Multiclass decisions were inconsistent    vary the class encoding

  19. Manipulating the training examples Bootstrap replication simulates training set selection – Given a training set of size n, construct a new training set by sampling n examples with replacement. – About 37% (roughly 1/e) of the examples are excluded from each replicate. Bagging – Create bootstrap replicates of the training set. – Build a decision tree for each replicate. – Estimate tree performance using the out-of-bootstrap data. – Average the outputs of all decision trees. Boosting – See part IV.
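A bagging sketch following the recipe above; scikit-learn's DecisionTreeClassifier is assumed as the base learner and the number of replicates is arbitrary.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def fit_bagged_trees(X, y, n_replicates=50, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)
        trees = []
        for _ in range(n_replicates):
            idx = rng.integers(0, n, size=n)        # n examples drawn with replacement
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return trees

    def bagged_predict(trees, X):
        # Average the +/-1 tree outputs and take the sign (i.e. a majority vote).
        votes = np.array([tree.predict(X) for tree in trees])
        return np.sign(votes.mean(axis=0))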

  20. Manipulating the features Random forests – Construct decision trees on bootstrap replicates. Restrict the node decisions to a small subset of features picked randomly for each node. – Do not prune the trees. Estimate tree performance using the out-of-bootstrap data. Average the outputs of all decision trees. Multiband speech recognition – Filter the speech to eliminate a random subset of the frequencies. – Train a speech recognizer on the filtered data. – Repeat and combine with a second-tier classifier. – The resulting recognizer is more robust to noise.
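The random-forest recipe above maps directly onto scikit-learn's RandomForestClassifier; the parameter values below are illustrative assumptions, not from the slides.

    from sklearn.ensemble import RandomForestClassifier

    # Bootstrap replicates, a random feature subset at each node (max_features),
    # unpruned trees (no max_depth), and out-of-bootstrap evaluation (oob_score).
    forest = RandomForestClassifier(
        n_estimators=200,
        max_features="sqrt",
        bootstrap=True,
        oob_score=True,
        random_state=0,
    )
    # After forest.fit(X, y), forest.oob_score_ estimates generalization accuracy
    # and forest.predict(X_new) averages the votes of the individual trees.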

  21. Manipulating the output codes Reducing multiclass problems to binary classification – We have seen one versus all. – We have seen all versus all. Error-correcting codes for multiclass problems – Code the class numbers with an error-correcting code. – Construct a binary classifier for each bit of the code. – Run the error-correction algorithm on the binary classifier outputs.
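scikit-learn ships an output-code wrapper that follows this scheme with randomly drawn codewords; the linear-SVM bit classifier and the code length below are assumptions for illustration.

    from sklearn.multiclass import OutputCodeClassifier
    from sklearn.svm import LinearSVC

    # Each class receives a codeword; one binary classifier is trained per bit,
    # and prediction picks the class whose codeword best matches the bit outputs.
    ecoc = OutputCodeClassifier(LinearSVC(), code_size=4.0, random_state=0)
    # Usage: ecoc.fit(X_train, y_train); ecoc.predict(X_test)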

  22. IV. Boosting

  23. Motivation • It is easy to come up with rough rules of thumb for classifying data – the email contains more than 50% capital letters; – the email contains the expression “buy now”. • Each rule alone isn't great, but it is better than random. • Boosting converts rough rules of thumb into an accurate classifier. Boosting was invented by Prof. Schapire.

  24. Adaboost Given examples (x_1, y_1), ..., (x_n, y_n) with y_i = ±1. Let D_1(i) = 1/n for i = 1 ... n. For t = 1 ... T do: • Run the weak learner using the examples weighted by D_t and get a weak classifier h_t. • Compute the weighted error ε_t = Σ_i D_t(i) 1{h_t(x_i) ≠ y_i}. • Compute the magic coefficient α_t = (1/2) log((1 − ε_t)/ε_t). • Update the weights: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t. Output the final classifier f_T(x) = sign( Σ_{t=1}^{T} α_t h_t(x) ).
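A compact Python sketch of the algorithm above, using depth-1 decision trees (stumps) from scikit-learn as the weak learner; the stump choice and the small epsilon guard inside the logarithm are assumptions, not part of the slide.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, T=50):
        # Labels y must be in {-1, +1}. Returns the weak classifiers and their alphas.
        n = len(X)
        D = np.full(n, 1.0 / n)                    # D_1(i) = 1/n
        stumps, alphas = [], []
        for _ in range(T):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=D)       # weak learner on weighted examples
            pred = stump.predict(X)
            eps = D[pred != y].sum()               # weighted error epsilon_t
            alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # the "magic" coefficient
            D *= np.exp(-alpha * y * pred)         # reweight the examples
            D /= D.sum()                           # normalize, i.e. divide by Z_t
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X):
        # Final classifier: f_T(x) = sign( sum_t alpha_t h_t(x) ).
        return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))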

  25. Toy example Weak classifiers: vertical or horizontal half-planes.

  26. Adaboost round 1

  27. Adaboost round 2

  28. Adaboost round 3

  29. Adaboost final classifier

  30. From weak learner to strong classifier (1) In this analysis write f_T(x) = Σ_{t=1}^T α_t h_t(x) for the unthresholded vote. Preliminary – unrolling the weight updates gives D_{T+1}(i) = D_1(i) · e^{−α_1 y_i h_1(x_i)}/Z_1 ⋯ e^{−α_T y_i h_T(x_i)}/Z_T = (1/n) e^{−y_i f_T(x_i)} / Π_t Z_t. Bounding the training error: (1/n) Σ_i 1{sign(f_T(x_i)) ≠ y_i} ≤ (1/n) Σ_i e^{−y_i f_T(x_i)} = Σ_i D_{T+1}(i) Π_t Z_t = Π_t Z_t. Idea: make each Z_t as small as possible. Z_t = Σ_{i=1}^n D_t(i) e^{−α_t y_i h_t(x_i)} = (1 − ε_t) e^{−α_t} + ε_t e^{α_t}. 1. Pick h_t to minimize ε_t. 2. Pick α_t to minimize Z_t.

  31. From weak learner to strong classifier (2) Pick α_t to minimize Z_t (the magic coefficient): setting ∂Z_t/∂α_t = −(1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 0 gives α_t = (1/2) log((1 − ε_t)/ε_t). Weak learner assumption: γ_t = 1/2 − ε_t is positive and small. Then Z_t = (1 − ε_t) √(ε_t/(1 − ε_t)) + ε_t √((1 − ε_t)/ε_t) = 2 √(ε_t (1 − ε_t)) = √(1 − 4γ_t²) ≤ exp(−2γ_t²). Hence TrainingError(f_T) ≤ Π_{t=1}^T Z_t ≤ exp(−2 Σ_{t=1}^T γ_t²). The training error decreases exponentially if inf_t γ_t > 0. But that does not happen beyond a certain point...
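A quick numerical check of the key inequality Z_t = 2√(ε(1 − ε)) = √(1 − 4γ²) ≤ exp(−2γ²), sweeping ε over (0, 1/2); this plain NumPy snippet is an illustration, not something from the slides.

    import numpy as np

    eps = np.linspace(0.01, 0.49, 100)      # weak-learner error rates below 1/2
    gamma = 0.5 - eps                        # the edge over random guessing
    Z = 2 * np.sqrt(eps * (1 - eps))         # normalizer under the optimal alpha_t
    bound = np.exp(-2 * gamma**2)
    assert np.allclose(Z, np.sqrt(1 - 4 * gamma**2))   # the two closed forms agree
    assert np.all(Z <= bound)                          # the exponential bound holds
    print(f"max Z = {Z.max():.4f}, largest slack = {(bound - Z).max():.4f}")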

  32. Boosting and exponential loss Proofs are instructive. We obtained the bound TrainingError(f_T) ≤ (1/n) Σ_i e^{−y_i f_T(x_i)} = Π_{t=1}^T Z_t – without saying how D_t relates to h_t – without using the value of α_t. Conclusion – Round T chooses the h_T and α_T that maximize the exponential loss reduction from f_{T−1} to f_T. Exercise – Tweak Adaboost to minimize the log loss instead of the exp loss.

  33. Boosting and margins margin(x, y) = y f_T(x) / Σ_t |α_t| = ( Σ_t α_t y h_t(x) ) / Σ_t |α_t|. Remember support vector machines?
