Anytime Online-to-Batch, Optimism, and Acceleration

Ashok Cutkosky, Google Research
Stochastic Optimization

First-order stochastic optimization: find the minimum of some convex function F : W → R using a stochastic gradient oracle: given w, we can obtain a random variable g where E[g] = ∇F(w).
Example: Stochastic Gradient Descent

A popular algorithm is gradient descent:

$$w_1 = 0, \qquad w_{t+1} = w_t - \eta_t g_t.$$

How should we analyze its convergence?
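As a concrete illustration, here is a minimal Python sketch of this update (not from the talk). The oracle `stochastic_grad` and the particular step-size schedule are assumptions made for the example:

```python
import numpy as np

def sgd(stochastic_grad, dim, T, c=0.1):
    """Plain SGD sketch: w_1 = 0, w_{t+1} = w_t - eta_t * g_t."""
    w = np.zeros(dim)
    for t in range(1, T + 1):
        g = stochastic_grad(w)      # random g with E[g] = grad F(w)
        eta_t = c / np.sqrt(t)      # one common decaying schedule
        w = w - eta_t * g
    return w
```

For instance, `sgd(lambda w: 2 * w + np.random.randn(*w.shape), dim=5, T=1000)` runs it on a noisy quadratic F(w) = ‖w‖².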
Online Optimization

For t = 1, ..., T, repeat:
1. Learner chooses a point w_t.
2. Environment presents the learner with a gradient g_t (think E[g_t] = ∇F(w_t)).
3. Learner suffers loss ⟨g_t, w_t⟩.

The objective is to minimize regret:

$$R_T(w^\star) = \sum_{t=1}^{T} \big( \underbrace{\langle g_t, w_t \rangle}_{\text{loss suffered}} - \underbrace{\langle g_t, w^\star \rangle}_{\text{benchmark loss}} \big).$$
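A sketch of this protocol as code; the two-method learner interface (`get_point`, `update`) is an assumption of the example, and `grad_oracle` plays the environment:

```python
import numpy as np

def online_protocol(learner, grad_oracle, T):
    """Run the online loop and record the per-round losses <g_t, w_t>."""
    losses = []
    for t in range(T):
        w_t = learner.get_point()              # 1. learner chooses a point
        g_t = grad_oracle(w_t)                 # 2. environment reveals g_t
        losses.append(np.dot(g_t, w_t))        # 3. learner suffers <g_t, w_t>
        learner.update(g_t)
    return losses
```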
Back to Gradient Descent

$$w_{t+1} = w_t - \eta_t g_t$$

The simplest analysis chooses $\eta_t \propto 1/\sqrt{T}$, but we can also do more complicated things like $\eta_t \propto 1/\sqrt{\sum_{i=1}^{t} \|g_i\|^2}$. These yield, respectively,

$$R_T(w^\star) \le \|w^\star\| \sqrt{T} \qquad\text{and}\qquad R_T(w^\star) \le \|w^\star\| \sqrt{\sum_{t=1}^{T} \|g_t\|^2}.$$

We want to use regret bounds to solve stochastic optimization.
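A sketch of the adaptive variant, using the same assumed learner interface as above; the scale `D` stands in for ‖w*‖, which the analysis treats as known but which must be guessed in practice:

```python
import numpy as np

class AdaptiveOGD:
    """Online gradient descent with the adaptive step size from the slide:
    eta_t proportional to 1 / sqrt(sum of squared gradient norms so far)."""
    def __init__(self, dim, D=1.0):
        self.w = np.zeros(dim)
        self.D = D            # stands in for ||w*||
        self.sum_sq = 0.0     # running sum of ||g_i||^2

    def get_point(self):
        return self.w

    def update(self, g):
        self.sum_sq += np.dot(g, g)
        eta = self.D / np.sqrt(self.sum_sq) if self.sum_sq > 0 else 0.0
        self.w = self.w - eta * g
```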
What We Hope Happens
What Could Happen Instead
Online-to-Batch Conversion

◮ Run an online learner for T steps on gradients E[g_t] = ∇F(w_t).
◮ Pick $\hat{w} = \frac{1}{T} \sum_{t=1}^{T} w_t$.
◮ Then

$$E[F(\hat{w}) - F(w^\star)] \le \frac{E[R_T(w^\star)]}{T}.$$

◮ For example: $\|w^\star\| \sqrt{\sum_{t=1}^{T} \|g_t\|^2} \,/\, T = O(1/\sqrt{T})$.
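In code, the classic conversion is just "run, then average"; a minimal sketch, again assuming the learner interface above:

```python
import numpy as np

def online_to_batch(learner, grad_oracle, T):
    """Run the learner on stochastic gradients at its own iterates,
    then return the average iterate w_hat."""
    iterates = []
    for t in range(T):
        w_t = learner.get_point()
        iterates.append(w_t)
        g_t = grad_oracle(w_t)          # E[g_t] = grad F(w_t)
        learner.update(g_t)
    return np.mean(iterates, axis=0)    # w_hat = (1/T) sum_t w_t
```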
Averages Converge
Something That Could Be Better

◮ The conversion is not "anytime": you must stop and average in order to get a convergence guarantee.
◮ The iterates w_t are still not well-behaved. For example, ‖∇F(w_T)‖ may be much larger than ‖∇F(ŵ)‖.
Simple Fix

Just evaluate gradients at running averages!

◮ Let $x_t = \frac{1}{t} \sum_{i=1}^{t} w_i$.
◮ Let g_t be the stochastic gradient at x_t.
◮ Send g_t to the online learner and get w_{t+1}.
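A sketch of this anytime conversion in Python (same assumed learner interface); the only change from the classic conversion is where the gradient is queried:

```python
def anytime_online_to_batch(learner, grad_oracle, T):
    """Query gradients at the running average x_t instead of at the
    learner's iterate w_t. Every x_t carries the convergence guarantee,
    so we can stop at any time."""
    x = learner.get_point()              # x_1 = w_1
    for t in range(1, T + 1):
        g_t = grad_oracle(x)             # gradient at the average x_t
        learner.update(g_t)              # learner still just sees g_t
        w_next = learner.get_point()     # w_{t+1}
        x = x + (w_next - x) / (t + 1)   # x_{t+1} = (t*x_t + w_{t+1})/(t+1)
    return x
```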
Using Running Averages
Notation Recap

◮ x_t: where we evaluate gradients g_t.
◮ w_t: iterate of the online learner (now exists only for analysis).
◮ $R_T(w^\star) = \sum_{t=1}^{T} \langle g_t, w_t - w^\star \rangle$.

It is no longer clear what the relationship is between R_T and the original loss function F, since g_t is no longer a gradient at w_t.
Online-to-Batch Is Unchanged

Theorem. Define

$$R_T(x^\star) = \sum_{t=1}^{T} \langle \alpha_t g_t, w_t - x^\star \rangle, \qquad x_t = \frac{\sum_{i=1}^{t} \alpha_i w_i}{\sum_{i=1}^{t} \alpha_i}.$$

Then for all x^\star and all T,

$$E[F(x_T) - F(x^\star)] \le E\left[ \frac{R_T(x^\star)}{\sum_{t=1}^{T} \alpha_t} \right].$$
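A sketch of the weighted variant; note the learner is fed the scaled gradients α_t g_t, so its regret is measured against those, and `alphas` is any array of positive weights:

```python
def weighted_anytime_otb(learner, grad_oracle, alphas):
    """Anytime online-to-batch with weights alpha_t: query gradients at
    the alpha-weighted running average of the learner's iterates."""
    w = learner.get_point()
    num = alphas[0] * w            # running sum of alpha_i * w_i
    den = alphas[0]                # running sum of alpha_i
    x = num / den
    for t in range(len(alphas)):
        x = num / den              # x_t, the weighted average so far
        g = grad_oracle(x)         # gradient at x_t
        learner.update(alphas[t] * g)
        if t + 1 < len(alphas):
            w = learner.get_point()
            num = num + alphas[t + 1] * w
            den = den + alphas[t + 1]
    return x
```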
Proof Sketch

Suppose α_t = 1 for simplicity. By convexity and E[g_t | x_t] = ∇F(x_t),

$$E\left[ \sum_{t=1}^{T} F(x_t) - F(x^\star) \right] \le E\left[ \sum_{t=1}^{T} \langle g_t, x_t - x^\star \rangle \right] = E\bigg[ \sum_{t=1}^{T} \underbrace{\langle g_t, x_t - w_t \rangle}_{= (t-1)\langle g_t,\, x_{t-1} - x_t \rangle} + \underbrace{\langle g_t, w_t - x^\star \rangle}_{\text{sums to } R_T(x^\star)} \bigg] \le E\left[ R_T(x^\star) + \sum_{t=1}^{T} (t-1)\big( F(x_{t-1}) - F(x_t) \big) \right],$$

where the identity $x_t - w_t = (t-1)(x_{t-1} - x_t)$ follows from $t\,x_t = (t-1)x_{t-1} + w_t$, and the last step uses convexity again. Subtract $\sum_{t=1}^{T} F(x_t)$ from both sides, and telescope.
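Spelling out the telescoping step (not on the original slide):

$$\sum_{t=1}^{T} (t-1)\big( F(x_{t-1}) - F(x_t) \big) = \sum_{t=1}^{T-1} t\,F(x_t) - \sum_{t=1}^{T} (t-1)\,F(x_t) = \sum_{t=1}^{T-1} F(x_t) - (T-1)\,F(x_T),$$

so the display above becomes

$$\sum_{t=1}^{T} F(x_t) - T\,F(x^\star) \le E\big[R_T(x^\star)\big] + \sum_{t=1}^{T-1} F(x_t) - (T-1)\,F(x_T) \;\Longrightarrow\; T\,E\big[F(x_T) - F(x^\star)\big] \le E\big[R_T(x^\star)\big].$$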
Stability

It's clear that F(x_t) → F(x^\star). But (in a bounded domain) we also have

$$x_t - x_{t-1} = \frac{\alpha_t (w_t - x_t)}{\sum_{i=1}^{t-1} \alpha_i} = O(1/t).$$

In contrast, the iterates of the base online learner are less stable: $w_t - w_{t-1} = O(1/\sqrt{t})$ usually (because the learning rate is $\eta_t \propto 1/\sqrt{t}$).
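A short derivation of the stability identity (not on the original slide), writing $A_t = \sum_{i=1}^{t} \alpha_i$ so that $x_{t-1} = (A_t x_t - \alpha_t w_t)/A_{t-1}$:

$$x_t - x_{t-1} = x_t - \frac{A_t x_t - \alpha_t w_t}{A_{t-1}} = \frac{(A_{t-1} - A_t)\,x_t + \alpha_t w_t}{A_{t-1}} = \frac{\alpha_t (w_t - x_t)}{A_{t-1}}.$$

With α_t = 1 and a domain of diameter D, this gives $\|x_t - x_{t-1}\| \le D/(t-1) = O(1/t)$.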
An Algorithm That Likes Stability

Optimistic online learning algorithms can obtain [RS13; HK10; MY16]:

$$R_T(w^\star) \le \sqrt{\sum_{t=1}^{T} \|g_t - g_{t-1}\|^2}$$

◮ This algorithm does better if the gradients are stable.
◮ When F is smooth, gradient stability is implied by iterate stability!
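A minimal sketch of one such learner: unconstrained optimistic gradient descent, using the previous gradient as the "hint" for the next one. The fixed step size is a simplification; the cited algorithms tune it adaptively:

```python
import numpy as np

class OptimisticOGD:
    """Optimistic OGD sketch: play the usual OGD iterate nudged by a hint
    for the next gradient (here, the last observed gradient g_{t-1}).
    Regret then scales with sum_t ||g_t - g_{t-1}||^2 rather than
    sum_t ||g_t||^2. Unconstrained, fixed step size, for simplicity."""
    def __init__(self, dim, eta=0.1):
        self.u = np.zeros(dim)      # un-hinted iterate
        self.hint = np.zeros(dim)   # prediction of the next gradient
        self.eta = eta

    def get_point(self):
        return self.u - self.eta * self.hint   # w_t = u_t - eta * g_{t-1}

    def update(self, g):
        self.u = self.u - self.eta * g         # u_{t+1} = u_t - eta * g_t
        self.hint = g                          # next hint is g_t
```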
Using Optimism with Stability

◮ With the previous conversion, we might hope that $w_t - w_{t-1} = O(1/\sqrt{t})$. This implies

$$E[F(\hat{w}_T) - F(x^\star)] \le O\left( \frac{1}{T} + \frac{\sigma}{\sqrt{T}} \right).$$

◮ In the new conversion, $g_t - g_{t-1} \approx x_t - x_{t-1} = O(1/t)$, so we can do much better.
Faster Rates with Optimism

Theorem. Suppose

$$R_T(x^\star) \le \sqrt{\sum_{t=1}^{T} \alpha_t^2 \|g_t - g_{t-1}\|^2}.$$

Set α_t = t for all t. Suppose each g_t has variance at most σ², and F is L-smooth. Then

$$E[F(x_T) - F(x^\star)] \le O\left( \frac{L}{T^{3/2}} + \frac{\sigma}{\sqrt{T}} \right).$$
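Combining the earlier sketches gives the flavor of this result; this is a sketch built from `weighted_anytime_otb` and `OptimisticOGD` above, not the paper's exact algorithm (in particular, the fixed step size ignores the adaptive tuning the theorem relies on):

```python
import numpy as np

def optimistic_anytime_otb(grad_oracle, dim, T, eta=0.1):
    """Weighted anytime conversion with alpha_t = t, driving an
    optimistic learner so it can exploit the O(1/t) stability of
    the averaged query points x_t."""
    learner = OptimisticOGD(dim, eta=eta)       # from the earlier sketch
    alphas = np.arange(1, T + 1, dtype=float)   # alpha_t = t
    return weighted_anytime_otb(learner, grad_oracle, alphas)
```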
Acceleration

The optimal rate is

$$E[F(x_T) - F(x^\star)] \le \frac{L}{T^2} + \frac{\sigma}{\sqrt{T}}.$$

◮ A small change to the algorithm can get this rate too.
◮ The algorithm does not know L or σ.
◮ Unfortunately, the algebra no longer fits on a slide.
Online-to-Batch Summary

◮ Evaluate gradients at running averages.
◮ Keeps the same convergence guarantee, but is anytime.
◮ Stabilizes the iterates → faster rates on smooth problems.

Thank you!