Introduction to Machine Learning
25. Multiplicative Updates, Games and Boosting
Alex Smola, Carnegie Mellon University
http://alex.smola.org/teaching/cmu2013-10-701 (10-701)
Multiplicative updates and experts http://www.cs.princeton.edu/~arora/pubs/MWsurvey.pdf
Finding an expert http://xkcd.com/451/
Finding an expert
• Pool of experts E
• At each time step t
  • Each expert e makes a prediction $f_{et}$
  • We observe the event $y_t$
• Goals
  • Find the expert who gets things right
  • Predict well in the meantime
Halving algorithm
• Start with the pool of experts E
• Predict with the majority of the remaining experts
• Observe the outcome
• Discard the experts that got it wrong
• Theorem: the algorithm makes at most $\log_2 |E|$ errors (assuming a perfect expert exists)
• Proof: each time we make an error, the majority was wrong, so at least half of the remaining experts are removed. After more than $\log_2 |E|$ errors no experts would be left.
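A minimal sketch of the halving algorithm in Python, assuming one perfect expert exists and representing experts as fixed ±1 prediction sequences (a hypothetical data layout, not from the slides):

def halving(expert_predictions, outcomes):
    """expert_predictions[e][t] in {-1, +1}; outcomes[t] in {-1, +1}."""
    alive = set(range(len(expert_predictions)))
    mistakes = 0
    for t, y in enumerate(outcomes):
        # predict with the majority of the surviving experts
        votes = sum(expert_predictions[e][t] for e in alive)
        y_hat = 1 if votes >= 0 else -1
        if y_hat != y:
            mistakes += 1
        # discard every expert that got this round wrong
        alive = {e for e in alive if expert_predictions[e][t] == y}
    return mistakes, alive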
Predicting as well as the best expert
• Experts (can) make mistakes, so we shouldn't fire them immediately.
• Can we predict nearly as well as the best expert in the pool?
• Regret: error relative to the best expert
  $L_e(t) := \sum_{\tau=1}^{t} l(y_\tau, f_{e\tau})$ and $R(\hat y, t) := L(\hat y, t) - \min_{e'} L_{e'}(t)$
• Our predictions need not match any expert!
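For concreteness, a small hypothetical helper that computes this regret from per-round losses:

def regret(our_losses, expert_losses):
    """our_losses[t] = l(y_t, yhat_t); expert_losses[e][t] = l(y_t, f_et)."""
    L_hat = sum(our_losses)
    best = min(sum(L_e) for L_e in expert_losses)  # best expert in hindsight
    return L_hat - best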
Weighted majority
• Binary loss (1 if wrong, 0 if correct)
• Experts initially have some weight $w_{e0}$
• For all observations do
  • Predict using the weighted majority
    $\hat y_\tau = \mathrm{sgn}\left(\frac{\sum_e w_{et}\, y_{et}}{\sum_e w_{et}}\right)$
  • Observe the outcome and reweight the wrong experts
    $w_{e,t+1} = \beta\, w_{et}$
• An alternative variant draws an expert at random from the pool, with probability proportional to its weight.
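A sketch of the weighted majority loop above, again with ±1 predictions and a downweighting factor $\beta \in (0,1)$ (here beta):

def weighted_majority(expert_predictions, outcomes, beta=0.5):
    n = len(expert_predictions)
    w = [1.0] * n                        # uniform initial weights w_e0
    mistakes = 0
    for t, y in enumerate(outcomes):
        # predict with the sign of the weighted vote
        vote = sum(w[e] * expert_predictions[e][t] for e in range(n))
        y_hat = 1 if vote >= 0 else -1
        if y_hat != y:
            mistakes += 1
        # multiply the weight of every wrong expert by beta
        for e in range(n):
            if expert_predictions[e][t] != y:
                w[e] *= beta
    return mistakes, w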
Weighted majority analysis
• Update equation for mistakes: $w_{e,t+1} = \beta\, w_{et}$
• Total expert weight: $w_t := \sum_e w_{et}$
• We incur a loss when the majority gets it wrong; in such a round at least half of the weight is multiplied by $\beta$:
  $w_{t+1} \le \frac{1}{2} w_t + \frac{\beta}{2} w_t = \frac{1+\beta}{2} w_t$, hence $w_t \le w_0 \left(\frac{1+\beta}{2}\right)^{\hat L_t}$
• For each expert we have the bound $w_{e,t+1} = w_{e0}\, \beta^{L_e(t)} \le w_{t+1}$
• Solving for the loss yields
  $\hat L_t \le \frac{L_i(t) \log \frac{1}{\beta} + \log w_0 - \log w_{i0}}{\log \frac{2}{1+\beta}}$
Weighted majority analysis
• Solving for the loss yields
  $\hat L_t \le \frac{L_i(t) \log \frac{1}{\beta} + \log w_0 - \log w_{i0}}{\log \frac{2}{1+\beta}}$
• Small downweighting ($\beta$ close to 1) leads to small regret in the long term.
• Initially give uniform weight to all experts (this is where you could hide your prior). With $w_{e0} = 1/n$ this gives
  $\hat L_t \le L_i(t)\, \frac{\log \frac{1}{\beta}}{\log \frac{2}{1+\beta}} + \frac{\log n}{\log \frac{2}{1+\beta}}$
• Converges exponentially fast to the best expert!
Multiplicative Updates
• Multiply by the loss expert e would incur at time t:
  $w_{e,t+1} = w_{et}\, e^{-\eta\, l(f_{et}, y_t)} = w_{e0}\, e^{-\eta L_e(t)}$
• Lower bound, for every expert e: $w_{t+1} > w_{e,t+1}$
• Hoeffding bound (a rather nonstandard variant):
  $\mathbf{E}[\exp(\eta X)] \le \exp(\eta\, \mathbf{E}[X] + \eta^2/8)$
• Upper bound:
  $w_{t+1} = \sum_e w_{et}\, e^{-\eta\, l(f_{et}, y_t)} \le w_t\, e^{-\eta \hat l_t + \eta^2/8} \le w_0\, e^{-\eta L(\hat y, t) + t \eta^2/8}$
• Hence $L(\hat y, t) \le L_e(t) + \frac{\eta t}{8} + \eta^{-1} \log n$; setting $\eta = \sqrt{8 \log n / t}$ yields
  $L(\hat y, t) \le L_e(t) + \sqrt{\tfrac{1}{2}\, t \log n}$
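A sketch of this multiplicative-updates (Hedge) strategy; losses[t][e] is a hypothetical loss table with entries in [0, 1], and the horizon is assumed known in advance so that $\eta$ can be set as on the slide:

import math

def hedge(losses):
    t, n = len(losses), len(losses[0])
    eta = math.sqrt(8.0 * math.log(n) / t)
    w = [1.0] * n
    total = 0.0
    for round_losses in losses:
        Z = sum(w)
        p = [we / Z for we in w]          # current distribution over experts
        # expected loss of playing an expert drawn from p
        total += sum(pe * le for pe, le in zip(p, round_losses))
        # multiplicative update: downweight by exp(-eta * loss)
        w = [we * math.exp(-eta * le) for we, le in zip(w, round_losses)]
    best = min(sum(losses[s][e] for s in range(t)) for e in range(n))
    return total, best   # the bound gives total <= best + sqrt(t log n / 2)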
Application to Boosting
Boosting intuition
• Data set $(x_i, y_i)$
• Weak learners that perform better than random for a weighted distribution of the data $(w_i, x_i, y_i)$:
  $L(w, f) := \frac{1}{m} \sum_{i=1}^{m} w_i\, y_i\, f(x_i) \ge \frac{1}{2} + \gamma$
• Combine weak learners to get a strong learner:
  $f = \frac{1}{t} \sum_{\tau=1}^{t} f_\tau$
• How do we weigh instances to generate good weak learners? Idea: find difficult instances.
Non-adaptive Boosting
• Data set $(w_i, x_i, y_i)$
• For t iterations do
  • Invoke the weak learner:
    $f_t = \mathrm{argmax}_f\, L(w_t, f) = \mathrm{argmax}_f\, \frac{1}{m} \sum_{i=1}^{m} w_{i,t-1}\, y_i\, f(x_i)$
  • Reweight instances; reduce the weight if we got things right:
    $w_{it} = w_{i,t-1}\, e^{-\alpha\, y_i f_t(x_i)}$
• Output the linear combination (see the sketch below)
  $f = \frac{1}{t} \sum_{\tau=1}^{t} f_\tau$
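A sketch of this loop with a fixed stepsize alpha; weak_learner is an assumed black box satisfying the weak-learning guarantee and returning a ±1 classifier:

import math

def boost(X, y, weak_learner, alpha, T):
    m = len(y)
    w = [1.0] * m
    fs = []
    for _ in range(T):
        f = weak_learner(w, X, y)        # assumed: maximizes L(w, f)
        fs.append(f)
        # downweight examples the weak learner got right, upweight mistakes
        w = [wi * math.exp(-alpha * yi * f(xi))
             for wi, yi, xi in zip(w, y, X)]
    # final classifier: sign of the (unweighted) average of the weak learners
    return lambda x: 1 if sum(f(x) for f in fs) >= 0 else -1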
Boosting Analysis
• For mistakes (majority wrong) we have $w_{it} \ge e^{-\alpha t / 2}$, hence
  $w_t \ge |\{i : f(x_i)\, y_i \le 0\}|\, e^{-\alpha t / 2}$
• Upper bound on the weight:
  $w_t = \sum_i w_{i,t-1}\, e^{-\alpha\, y_i f_t(x_i)} \le w_{t-1}\, e^{-\alpha(\gamma + 1/2) + \alpha^2/8} \le n\, e^{-\alpha t (\gamma + 1/2) + t \alpha^2/8}$
• Combining the upper and lower bounds:
  $n_{\mathrm{errors}} \le n\, e^{-t(\alpha \gamma - \alpha^2/8)}$, hence $n_{\mathrm{errors}} \le n\, e^{-2 t \gamma^2}$ for $\alpha = 4\gamma$
• The error vanishes exponentially fast.
AdaBoost
• Refine the algorithm by weighting the functions
• Adaptive in the performance of the weak learner
• Error for weighted observations:
  $\epsilon_t := \frac{1}{n} \sum_{i=1}^{n} w_{it}\, \tfrac{1}{2}\, (1 - y_i f_t(x_i))$
• Stepsize and weights:
  $\alpha_t := \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$
  $f = \sum_t \alpha_t f_t$ and $w_{it} = w_{i0}\, e^{-\sum_{\tau \le t} \alpha_\tau\, y_i f_\tau(x_i)}$
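A sketch of AdaBoost following these updates; weak_learner is again an assumed black box returning a ±1 classifier, and the clamping of $\epsilon_t$ is a numerical guard, not part of the slide:

import math

def adaboost(X, y, weak_learner, T):
    m = len(y)
    w = [1.0 / m] * m                    # weights kept normalized to sum 1
    ensemble = []                        # pairs (alpha_t, f_t)
    for _ in range(T):
        f = weak_learner(w, X, y)
        # weighted error epsilon_t of the current weak learner
        eps = sum(wi for wi, xi, yi in zip(w, X, y) if f(xi) != yi)
        eps = max(min(eps, 1 - 1e-10), 1e-10)   # guard the log
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, f))
        # grow the weight of misclassified examples, shrink the rest
        w = [wi * math.exp(-alpha * yi * f(xi))
             for wi, xi, yi in zip(w, X, y)]
        Z = sum(w)
        w = [wi / Z for wi in w]
    return lambda x: 1 if sum(a * f(x) for a, f in ensemble) >= 0 else -1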
Usage
• Simple classifiers (weak learners)
  • Linear threshold functions
  • Decision trees (simple ones)
  • Neural networks
• Do not boost SVMs. Why? See Boosting the Margin, Schapire et al., http://goo.gl/aLCSO
• Overfitting is possible if you boost noisy data for too long. Fix e.g. by limiting the weight of individual observations.
Application to Game Theory
Games

             rock   scissors   paper
  rock         0        1       -1
  scissors    -1        0        1
  paper        1       -1        0
Games
• Game
  • The row player picks i, the column player picks j
  • They receive outcomes $M_{ijk}$ (payoff to player k)
• A zero-sum game has $M_{ij,1} = -M_{ij,2}$ (my gain is your loss)
• How to play
  • A deterministic strategy is usually not optimal
  • Use a distribution over actions
• Nash equilibrium: players have no incentive to change their policy
Games
• von Neumann minimax theorem
  $\min_{x \in P} \max_j [x^\top M]_j = \max_{y \in P} \min_i [M y]_i$
• Proof
  $\min_{x \in P} \max_j [x^\top M]_j = \min_{x \in P} \max_{y \in P} x^\top M y$
  due to the vertex solution. Apply linear programming duality to get
  $\min_{x \in P} \max_{y \in P} x^\top M y = \max_{y \in P} \min_{x \in P} x^\top M y$
  Apply the vertex property again to complete the proof.
Finding a Nash equilibrium approximately
• Repeated game (initial distribution $p_0$ for the player)
• For t rounds do
  • The opponent picks the best distribution $q_{t+1}$ given $p_t$
  • The player updates the action distribution $p_{t+1}$ using
    $p_{i,t+1} \propto p_{i,0} \exp\left(-\eta \sum_{\tau=1}^{t} [M q_\tau]_i\right)$
• The regret bound tells us that
  $\frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q_\tau \le \min_i \frac{1}{t} \sum_{\tau=1}^{t} [M q_\tau]_i + O(t^{-1/2})$
Finding a Nash equilibrium approximately
• Regret bound
  $\frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q_\tau \le \min_i \frac{1}{t} \sum_{\tau=1}^{t} [M q_\tau]_i + O(t^{-1/2}) = \min_p \frac{1}{t} \sum_{\tau=1}^{t} p^\top M q_\tau + O(t^{-1/2})$
• By construction of the algorithm (the opponent picks the maximizing $q_\tau$) we have
  $\min_p \max_q p^\top M q \le \frac{1}{t} \sum_{\tau=1}^{t} \max_q p_\tau^\top M q \le \frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q_\tau$
• Combining this yields
  $\min_p \max_q p^\top M q \le \max_q \min_p p^\top M q + O(t^{-1/2})$
Simplified algorithm
• Repeated game (initial distribution $p_0$ for the player)
• For t rounds do
  • The opponent picks the best action $q_{t+1}$ given $p_t$
  • The player updates the action distribution $p_{t+1}$ using
    $p_{i,t+1} \propto p_{i,0} \exp\left(-\eta \sum_{\tau=1}^{t} [M q_\tau]_i\right)$
• The regret bound tells us that
  $\frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q_\tau \le \min_i \frac{1}{t} \sum_{\tau=1}^{t} [M q_\tau]_i + O(t^{-1/2})$
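A sketch of this repeated play on the rock-scissors-paper matrix from earlier, interpreting $[Mq]_i$ as the row player's loss as in the update above; by symmetry the averaged strategy should approach the uniform equilibrium (1/3, 1/3, 1/3):

import math

M = [[ 0,  1, -1],
     [-1,  0,  1],
     [ 1, -1,  0]]

def approx_nash(M, T=2000, eta=0.1):
    n, m = len(M), len(M[0])
    p = [1.0 / n] * n
    cum = [0.0] * n                      # cumulative losses sum_tau [M q_tau]_i
    avg_p = [0.0] * n
    for _ in range(T):
        # opponent best-responds with the pure strategy j maximizing p^T M e_j
        payoffs = [sum(p[i] * M[i][j] for i in range(n)) for j in range(m)]
        j = max(range(m), key=lambda k: payoffs[k])
        for i in range(n):
            cum[i] += M[i][j]            # [M q_t]_i with q_t = e_j
        # multiplicative update on the cumulative losses
        Z = sum(math.exp(-eta * c) for c in cum)
        p = [math.exp(-eta * c) / Z for c in cum]
        avg_p = [a + pi / T for a, pi in zip(avg_p, p)]
    return avg_p                         # average strategy over all rounds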
Application to Particle Filtering
Sequential Monte Carlo
• Recall the particle filter idea (simplified)
  • Observe data in sequence
  • At each step approximate the distribution $p(\theta \mid x_{1:n})$ by weighted samples from the posterior
• Bayes rule
  $p(\theta \mid x_{1:n+1}) \propto p(x_{n+1} \mid \theta, x_{1:n})\, p(\theta \mid x_{1:n})$
  Assuming conditional independence $x_i \perp x_j \mid \theta$ this gives the weight update
  $w_{i,n+1} = w_{in}\, p(x_{n+1} \mid \theta_i) = w_{in}\, e^{\log p(x_{n+1} \mid \theta_i)}$
Sequential Monte Carlo
• The expert setting and particle filters line up:
  • Experts ↔ particles
  • Loss ↔ negative log-likelihood
  • Weights ↔ weights
• Convergence, good news: we find the best expert; but the solution is only as good as that best expert.
• Convergence, bad news: eventually only a single sample is left; we need to resample in order to adaptively find better solutions.
Sequential Monte Carlo
• On a chain
  • Observe data in sequence
  • Fill in the latent variables in sequence
• Bayes rule
  $p(x_{n+1}, \theta_{n+1} \mid x_{1:n}, \theta_{1:n}) = p(x_{n+1} \mid x_{1:n}, \theta_{1:n})\, p(\theta_{n+1} \mid x_{1:n+1}, \theta_{1:n})$
• Sample the latent parameter (prediction):
  $\theta_{n+1} \sim p(\theta_{n+1} \mid x_{1:n+1}, \theta_{1:n})$
• Update the particle weight with the "error":
  $w_{i,n+1} = w_{in}\, p(x_{n+1} \mid \theta_{1:n}, x_{1:n})$
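A sketch of one such step; propagate and likelihood are assumed model hooks for sampling $\theta_{n+1}$ and evaluating $p(x_{n+1} \mid \theta)$, and the effective-sample-size resampling rule is a standard heuristic rather than part of the slide:

import random

def smc_step(particles, weights, x_new, propagate, likelihood):
    # sample theta_{n+1} ~ p(theta_{n+1} | x_{1:n+1}, theta_{1:n}) per particle
    particles = [propagate(th, x_new) for th in particles]
    # reweight by the predictive "error" p(x_{n+1} | theta) and renormalize
    weights = [w * likelihood(x_new, th) for w, th in zip(weights, particles)]
    Z = sum(weights)
    weights = [w / Z for w in weights]
    # resample when the effective sample size collapses (weight degeneracy)
    ess = 1.0 / sum(w * w for w in weights)
    if ess < 0.5 * len(particles):
        particles = random.choices(particles, weights=weights,
                                   k=len(particles))
        weights = [1.0 / len(particles)] * len(particles)
    return particles, weights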