  1. Ensembles. Léon Bottou, COS 424 – 4/8/2010

  2. Readings • T. G. Dietterich (2000), “Ensemble Methods in Machine Learning”. • R. E. Schapire (2003), “The Boosting Approach to Machine Learning”, Sections 1, 2, 3, 4, 6.

  3. Summary 1. Why ensembles? 2. Combining outputs. 3. Constructing ensembles. 4. Boosting.

  4. I. Ensembles

  5. Ensemble of classifiers – Consider a set of classifiers h_1, h_2, ..., h_L. – Construct a classifier by combining their individual decisions, for example by voting their outputs. Accuracy – The ensemble only works if the individual classifiers have low error rates. Diversity – No gain if all classifiers make the same mistakes. – What if the classifiers make different mistakes?
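A minimal majority-vote combiner, sketched in Python with NumPy; the list of classifier objects (assumed to expose a scikit-learn-style predict method) and the ±1 label convention are illustrative assumptions, not part of the slides.

    import numpy as np

    def majority_vote(classifiers, X):
        # Combine binary classifiers with outputs in {-1, +1} by unweighted voting.
        # One row of predictions per classifier, shape (L, n_samples).
        votes = np.array([clf.predict(X) for clf in classifiers])
        # Sum the +/-1 votes and return the sign of the tally.
        return np.sign(votes.sum(axis=0))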

  6. Uncorrelated classifiers Assume that for all r ≠ s, Cov[ 1{h_r(x) = y}, 1{h_s(x) = y} ] = 0. The tally of classifier votes then follows a binomial distribution. Example – Twenty-one uncorrelated classifiers with a 30% error rate.
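As a sanity check on the binomial argument, the short Python snippet below (assuming SciPy is available) computes the probability that the majority vote of 21 independent classifiers, each wrong 30% of the time, is itself wrong.

    from scipy.stats import binom

    n, p = 21, 0.3                    # 21 uncorrelated voters, each with 30% error rate
    # The majority vote errs when 11 or more of the 21 voters are wrong.
    p_majority_wrong = 1.0 - binom.cdf(10, n, p)
    print(f"P(majority vote is wrong) = {p_majority_wrong:.3f}")   # roughly 0.026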

  7. Statistical motivation [Figure – blue: classifiers that work well on the training set(s); f: the best classifier.]

  8. Computational motivation [Figure – blue: the classifier search may reach local optima; f: the best classifier.]

  9. Representational motivation [Figure – blue: the classifier space may not contain the best classifier; f: the best classifier.]

  10. Practical success Recommendation system – Netflix “movies you may like”. – Customers sometimes rate the movies they rent. – Input: (movie, customer). – Output: rating. Netflix competition – $1M for the first team to do 10% better than Netflix's own system. Winner: the BellKor team and friends – an ensemble of more than 800 rating systems. Runner-up: everybody else – an ensemble of all the rating systems built by the other teams.

  11. Bayesian ensembles Let D represent the training data. Enumerating all the classifiers: P(y | x, D) = Σ_h P(y, h | x, D) = Σ_h P(h | x, D) P(y | h, x, D) = Σ_h P(h | D) P(y | x, h). P(h | D): how well h matches the training data. P(y | x, h): what h predicts for pattern x. Note that this is a weighted average.
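A minimal sketch of that weighted average, assuming the posterior weights P(h | D) and the per-classifier predictive distributions P(y | x, h) are already available as NumPy arrays; the array names and the toy numbers are illustrative, not from the slides.

    import numpy as np

    def bayesian_ensemble_predict(posterior, predictive):
        # posterior:  shape (H,)   -- P(h | D), sums to 1
        # predictive: shape (H, K) -- P(y | x, h) for one pattern x and K classes
        # returns:    shape (K,)   -- the weighted average P(y | x, D)
        return posterior @ predictive

    posterior = np.array([0.5, 0.3, 0.2])
    predictive = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
    print(bayesian_ensemble_predict(posterior, predictive))   # -> [0.67 0.33]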

  12. II. Combining Outputs

  13. Simple averaging [Diagram: the combined output is the plain, unweighted average of the individual classifier outputs.]

  14. Weighted averaging a priori [Diagram: the combined output is a weighted average of the individual classifier outputs.] Weights derived from the training errors, e.g. exp(−β · TrainingError(h_t)). Approximate Bayesian ensemble.
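A hedged Python sketch of the a-priori weighting; the exp(−β · error) form follows the slide, while the classifier objects, the stored training errors, and the value of β are illustrative assumptions.

    import numpy as np

    def weighted_average_predict(classifiers, train_errors, X, beta=5.0):
        # Weight each classifier by exp(-beta * training_error), normalize,
        # then average the +/-1 outputs and return the sign of the result.
        weights = np.exp(-beta * np.asarray(train_errors))
        weights /= weights.sum()
        votes = np.array([clf.predict(X) for clf in classifiers])   # shape (L, n)
        return np.sign(weights @ votes)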

  15. Weighted averaging with trained weights [Diagram: the combining weights are themselves learned.] Train the weights on the validation set; training the weights on the training set overfits easily. You need another validation set to estimate the performance!

  16. Stacked classifiers [Diagram: a second-tier classifier takes the first-tier classifier outputs as its inputs.] The second-tier classifier is trained on the validation set. You need another validation set to estimate the performance!
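A minimal stacking sketch with scikit-learn-style estimators; the logistic-regression combiner and the particular data splits are assumptions chosen for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_stack(base_classifiers, X_train, y_train, X_valid, y_valid):
        # First tier: fit each base classifier on the training set.
        for clf in base_classifiers:
            clf.fit(X_train, y_train)
        # Second tier: its features are the first-tier outputs on held-out data.
        Z_valid = np.column_stack([clf.predict(X_valid) for clf in base_classifiers])
        combiner = LogisticRegression().fit(Z_valid, y_valid)
        return combiner

    def stacked_predict(base_classifiers, combiner, X):
        Z = np.column_stack([clf.predict(X) for clf in base_classifiers])
        return combiner.predict(Z)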

  17. III. Constructing Ensembles

  18. Diversification
      Cause of the mistake                      Diversification strategy
      The pattern was difficult                 hopeless
      Overfitting (⋆)                           vary the training sets
      Some features were noisy                  vary the set of input features
      Multiclass decisions were inconsistent    vary the class encoding

  19. Manipulating the training examples Bootstrap replication simulates training set selection – Given a training set of size n, construct a new training set by sampling n examples with replacement. – About 37% (roughly 1/e) of the examples are excluded from each replicate. Bagging – Create bootstrap replicates of the training set. – Build a decision tree for each replicate. – Estimate tree performance using the out-of-bootstrap data. – Average the outputs of all decision trees. Boosting – See part IV.
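A bagging sketch following the recipe above; scikit-learn's DecisionTreeClassifier is assumed as the base learner and the number of replicates is arbitrary.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def fit_bagged_trees(X, y, n_replicates=50, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)
        trees = []
        for _ in range(n_replicates):
            idx = rng.integers(0, n, size=n)        # n examples drawn with replacement
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return trees

    def bagged_predict(trees, X):
        # Average the +/-1 tree outputs and take the sign (i.e. a majority vote).
        votes = np.array([tree.predict(X) for tree in trees])
        return np.sign(votes.mean(axis=0))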

  20. Manipulating the features Random forests – Construct decision trees on bootstrap replicates. Restrict the node decisions to a small subset of features picked randomly for each node. – Do not prune the trees. Estimate tree performance using the out-of-bootstrap data. Average the outputs of all decision trees. Multiband speech recognition – Filter the speech to eliminate a random subset of the frequencies. – Train a speech recognizer on the filtered data. – Repeat and combine with a second-tier classifier. – The resulting recognizer is more robust to noise.
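The random-forest recipe above maps directly onto scikit-learn's RandomForestClassifier; the parameter values below are illustrative assumptions, not from the slides.

    from sklearn.ensemble import RandomForestClassifier

    # Bootstrap replicates, a random feature subset at each node (max_features),
    # unpruned trees (no max_depth), and out-of-bootstrap evaluation (oob_score).
    forest = RandomForestClassifier(
        n_estimators=200,
        max_features="sqrt",
        bootstrap=True,
        oob_score=True,
        random_state=0,
    )
    # After forest.fit(X, y), forest.oob_score_ estimates generalization accuracy
    # and forest.predict(X_new) averages the votes of the individual trees.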

  21. Manipulating the output codes Reducing multiclass problems to binary classification – We have seen one versus all. – We have seen all versus all. Error-correcting codes for multiclass problems – Code the class numbers with an error-correcting code. – Construct a binary classifier for each bit of the code. – Run the error-correction algorithm on the binary classifier outputs.
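scikit-learn ships an output-code wrapper that follows this scheme with randomly drawn codewords; the linear-SVM bit classifier and the code length below are assumptions for illustration.

    from sklearn.multiclass import OutputCodeClassifier
    from sklearn.svm import LinearSVC

    # Each class receives a codeword; one binary classifier is trained per bit,
    # and prediction picks the class whose codeword best matches the bit outputs.
    ecoc = OutputCodeClassifier(LinearSVC(), code_size=4.0, random_state=0)
    # Usage: ecoc.fit(X_train, y_train); ecoc.predict(X_test)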

  22. IV. Boosting

  23. Motivation • It is easy to come up with rough rules of thumb for classifying data – the email contains more than 50% capital letters; – the email contains the expression “buy now”. • Each rule alone isn't great, but it is better than random. • Boosting converts rough rules of thumb into an accurate classifier. Boosting was invented by Prof. Schapire.

  24. Adaboost Given examples (x_1, y_1), ..., (x_n, y_n) with y_i = ±1. Let D_1(i) = 1/n for i = 1 ... n. For t = 1 ... T do: • Run the weak learner using the examples weighted by D_t and get a weak classifier h_t. • Compute the weighted error ε_t = Σ_i D_t(i) 1{h_t(x_i) ≠ y_i}. • Compute the magic coefficient α_t = (1/2) log((1 − ε_t)/ε_t). • Update the weights: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t. Output the final classifier f_T(x) = sign( Σ_{t=1}^{T} α_t h_t(x) ).
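A compact Python sketch of the algorithm above, using depth-1 decision trees (stumps) from scikit-learn as the weak learner; the stump choice and the small epsilon guard inside the logarithm are assumptions, not part of the slide.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, T=50):
        # Labels y must be in {-1, +1}. Returns the weak classifiers and their alphas.
        n = len(X)
        D = np.full(n, 1.0 / n)                    # D_1(i) = 1/n
        stumps, alphas = [], []
        for _ in range(T):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=D)       # weak learner on weighted examples
            pred = stump.predict(X)
            eps = D[pred != y].sum()               # weighted error epsilon_t
            alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # the "magic" coefficient
            D *= np.exp(-alpha * y * pred)         # reweight the examples
            D /= D.sum()                           # normalize, i.e. divide by Z_t
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X):
        # Final classifier: f_T(x) = sign( sum_t alpha_t h_t(x) ).
        return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))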

  25. Toy example Weak classifiers: vertical or horizontal half-planes.

  26. Adaboost round 1

  27. Adaboost round 2

  28. Adaboost round 3

  29. Adaboost final classifier

  30. From weak learner to strong classifier (1) In this analysis write f_T(x) = Σ_{t=1}^T α_t h_t(x) for the unthresholded vote. Preliminary – unrolling the weight updates gives D_{T+1}(i) = D_1(i) · e^{−α_1 y_i h_1(x_i)}/Z_1 ⋯ e^{−α_T y_i h_T(x_i)}/Z_T = (1/n) e^{−y_i f_T(x_i)} / Π_t Z_t. Bounding the training error: (1/n) Σ_i 1{sign(f_T(x_i)) ≠ y_i} ≤ (1/n) Σ_i e^{−y_i f_T(x_i)} = Σ_i D_{T+1}(i) Π_t Z_t = Π_t Z_t. Idea: make each Z_t as small as possible. Z_t = Σ_{i=1}^n D_t(i) e^{−α_t y_i h_t(x_i)} = (1 − ε_t) e^{−α_t} + ε_t e^{α_t}. 1. Pick h_t to minimize ε_t. 2. Pick α_t to minimize Z_t.

  31. From weak learner to strong classifier (2) Pick α_t to minimize Z_t (the magic coefficient): setting ∂Z_t/∂α_t = −(1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 0 gives α_t = (1/2) log((1 − ε_t)/ε_t). Weak learner assumption: γ_t = 1/2 − ε_t is positive and small. Then Z_t = (1 − ε_t) √(ε_t/(1 − ε_t)) + ε_t √((1 − ε_t)/ε_t) = 2 √(ε_t (1 − ε_t)) = √(1 − 4γ_t²) ≤ exp(−2γ_t²). Hence TrainingError(f_T) ≤ Π_{t=1}^T Z_t ≤ exp(−2 Σ_{t=1}^T γ_t²). The training error decreases exponentially if inf_t γ_t > 0. But that does not happen beyond a certain point...
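A quick numerical check of the key inequality Z_t = 2√(ε(1 − ε)) = √(1 − 4γ²) ≤ exp(−2γ²), sweeping ε over (0, 1/2); this plain NumPy snippet is an illustration, not something from the slides.

    import numpy as np

    eps = np.linspace(0.01, 0.49, 100)      # weak-learner error rates below 1/2
    gamma = 0.5 - eps                        # the edge over random guessing
    Z = 2 * np.sqrt(eps * (1 - eps))         # normalizer under the optimal alpha_t
    bound = np.exp(-2 * gamma**2)
    assert np.allclose(Z, np.sqrt(1 - 4 * gamma**2))   # the two closed forms agree
    assert np.all(Z <= bound)                          # the exponential bound holds
    print(f"max Z = {Z.max():.4f}, largest slack = {(bound - Z).max():.4f}")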

  32. Boosting and exponential loss Proofs are instructive. We obtained the bound TrainingError(f_T) ≤ (1/n) Σ_i e^{−y_i f_T(x_i)} = Π_{t=1}^T Z_t – without saying how D_t relates to h_t – without using the value of α_t. Conclusion – Round T chooses the h_T and α_T that maximize the exponential loss reduction from f_{T−1} to f_T. Exercise – Tweak Adaboost to minimize the log loss instead of the exp loss.

  33. Boosting and margins margin(x, y) = y f_T(x) / Σ_t |α_t| = ( Σ_t α_t y h_t(x) ) / Σ_t |α_t|. Remember support vector machines?
