
Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm - PowerPoint PPT Presentation



  1. Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm
     Jacob Steinhardt and Percy Liang, Stanford University
     {jsteinhardt,pliang}@cs.stanford.edu
     Jun 11, 2013

  2. Setup
     - Setting: learning from experts, with n experts and T rounds.
     - For t = 1, ..., T:
       - The learner chooses a distribution w_t ∈ Δ_n over the experts.
       - Nature reveals the losses z_t ∈ [−1, 1]^n of the experts.
       - The learner suffers loss w_t^⊤ z_t.
     - Goal: minimize Regret := ∑_{t=1}^T w_t^⊤ z_t − ∑_{t=1}^T z_{t,i*}, where i* is the best fixed expert.
     - Typical algorithm: multiplicative weights (aka exponentiated gradient): w_{t+1,i} ∝ w_{t,i} exp(−η z_{t,i}) (see the sketch below).
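To make the setup concrete, here is a minimal NumPy sketch of the experts loop with the multiplicative-weights update; the step size η = 0.1 and the random loss matrix are illustrative assumptions, not values from the slides:

```python
import numpy as np

def multiplicative_weights(losses, eta=0.1):
    """Run the update w_{t+1,i} ∝ w_{t,i} * exp(-eta * z_{t,i}) on a
    T x n matrix of losses in [-1, 1] and return the regret."""
    n = losses.shape[1]
    w = np.full(n, 1.0 / n)        # start uniform over the n experts
    learner_loss = 0.0
    for z in losses:               # z plays the role of z_t
        learner_loss += w @ z      # suffer loss w_t^T z_t
        w *= np.exp(-eta * z)      # multiplicative-weights update
        w /= w.sum()               # renormalize onto the simplex
    best_fixed_expert_loss = losses.sum(axis=0).min()
    return learner_loss - best_fixed_expert_loss

# Illustrative run: T = 1000 rounds, n = 5 experts, random losses.
rng = np.random.default_rng(0)
print(multiplicative_weights(rng.uniform(-1, 1, size=(1000, 5))))
```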

  3. Outline
     - Compare two variants of the multiplicative weights (exponentiated gradient) algorithm.
     - Understand the difference through the lens of adaptive mirror descent (Orabona et al., 2013).
     - Combine with the machinery of optimistic updates (Rakhlin & Sridharan, 2012) to beat the best existing bounds.

  4. Two Types of Updates
     - In the literature, two similar but different updates (Kivinen & Warmuth, 1997; Cesa-Bianchi et al., 2007):
       w_{t+1,i} ∝ w_{t,i} exp(−η z_{t,i})   (MW1)
       w_{t+1,i} ∝ w_{t,i} (1 − η z_{t,i})   (MW2)
     - The regret is bounded as
       Regret ≤ log(n)/η + η ∑_{t=1}^T ‖z_t‖_∞^2   (Regret:MW1)
       Regret ≤ log(n)/η + η ∑_{t=1}^T z_{t,i*}^2   (Regret:MW2)
     - If the best expert i* has loss close to zero, the second bound is better than the first.
     - The gap can be Θ(√T) (in actual performance, not just in the upper bounds). A side-by-side sketch of the two updates follows below.
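The two updates can be compared directly in code. A minimal sketch (same illustrative η; the loss sequence below, in which the best expert has zero loss, is an assumption for illustration and is not adversarial, so it need not exhibit the Θ(√T) gap):

```python
import numpy as np

def run(losses, update, eta=0.1):
    """Generic experts loop; `update` returns the unnormalized new weights."""
    w = np.full(losses.shape[1], 1.0 / losses.shape[1])
    total = 0.0
    for z in losses:
        total += w @ z
        w = update(w, z, eta)
        w /= w.sum()
    return total - losses.sum(axis=0).min()    # regret vs. best fixed expert

mw1 = lambda w, z, eta: w * np.exp(-eta * z)   # (MW1)
mw2 = lambda w, z, eta: w * (1 - eta * z)      # (MW2); requires eta*|z_i| < 1

# Losses in [0, 1]; expert 0 has zero loss, so it is the best fixed expert.
rng = np.random.default_rng(0)
Z = rng.uniform(0, 1, size=(2000, 4))
Z[:, 0] = 0.0
print(run(Z, mw1), run(Z, mw2))
```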

  5. Two Types of Updates
     - In the literature, two similar but different updates (Kivinen & Warmuth, 1997; Cesa-Bianchi et al., 2007):
       w_{t+1,i} ∝ w_{t,i} exp(−η z_{t,i})   (MW1)
       w_{t+1,i} ∝ w_{t,i} (1 − η z_{t,i})   (MW2)
     - Mirror descent is the gold-standard meta-algorithm for online learning. How do (MW1) and (MW2) relate to it?

  6. Two Types of Updates
     - In the literature, two similar but different updates (Kivinen & Warmuth, 1997; Cesa-Bianchi et al., 2007):
       w_{t+1,i} ∝ w_{t,i} exp(−η z_{t,i})   (MW1)
       w_{t+1,i} ∝ w_{t,i} (1 − η z_{t,i})   (MW2)
     - Mirror descent is the gold-standard meta-algorithm for online learning. How do (MW1) and (MW2) relate to it?
     - (MW1) is mirror descent with regularizer (1/η) ∑_{i=1}^n w_i log(w_i).

  7. Two Types of Updates
     - In the literature, two similar but different updates (Kivinen & Warmuth, 1997; Cesa-Bianchi et al., 2007):
       w_{t+1,i} ∝ w_{t,i} exp(−η z_{t,i})   (MW1)
       w_{t+1,i} ∝ w_{t,i} (1 − η z_{t,i})   (MW2)
     - Mirror descent is the gold-standard meta-algorithm for online learning. How do (MW1) and (MW2) relate to it?
     - (MW1) is mirror descent with regularizer (1/η) ∑_{i=1}^n w_i log(w_i).
     - (MW2) is NOT mirror descent for any fixed regularizer.

  8. Two Types of Updates
     - In the literature, two similar but different updates (Kivinen & Warmuth, 1997; Cesa-Bianchi et al., 2007):
       w_{t+1,i} ∝ w_{t,i} exp(−η z_{t,i})   (MW1)
       w_{t+1,i} ∝ w_{t,i} (1 − η z_{t,i})   (MW2)
     - Mirror descent is the gold-standard meta-algorithm for online learning. How do (MW1) and (MW2) relate to it?
     - (MW1) is mirror descent with regularizer (1/η) ∑_{i=1}^n w_i log(w_i).
     - (MW2) is NOT mirror descent for any fixed regularizer.
     - Unsettling: should we abandon mirror descent as a gold standard?

  9. Two Types of Updates
     - In the literature, two similar but different updates (Kivinen & Warmuth, 1997; Cesa-Bianchi et al., 2007):
       w_{t+1,i} ∝ w_{t,i} exp(−η z_{t,i})   (MW1)
       w_{t+1,i} ∝ w_{t,i} (1 − η z_{t,i})   (MW2)
     - Mirror descent is the gold-standard meta-algorithm for online learning. How do (MW1) and (MW2) relate to it?
     - (MW1) is mirror descent with regularizer (1/η) ∑_{i=1}^n w_i log(w_i).
     - (MW2) is NOT mirror descent for any fixed regularizer.
     - Unsettling: should we abandon mirror descent as a gold standard?
     - No: we can cast (MW2) as adaptive mirror descent (Orabona et al., 2013).

  10. Adaptive Mirror Descent to the Rescue
      - Recall that mirror descent is the (meta-)algorithm
        w_t = argmin_w ψ(w) + ∑_{s=1}^{t−1} w^⊤ z_s.
      - For ψ(w) = (1/η) ∑_{i=1}^n w_i log(w_i), we recover (MW1) (see the sketch below).
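Under the entropy regularizer, the argmin over the simplex has the standard closed form w_{t,i} ∝ exp(−η ∑_{s<t} z_{s,i}), which is exactly the sequence of iterates produced by (MW1). A small sketch of that closed form (illustrative code, not from the paper):

```python
import numpy as np

def mirror_descent_entropy(losses, eta=0.1):
    """Closed-form minimizer of (1/eta) * sum_i w_i log(w_i) + sum_{s<t} w^T z_s
    over the simplex: w_t ∝ exp(-eta * sum_{s<t} z_s), i.e. the (MW1) iterates."""
    cum = np.zeros(losses.shape[1])     # cumulative losses sum_{s<t} z_s
    iterates = []
    for z in losses:
        u = np.exp(-eta * cum)
        iterates.append(u / u.sum())    # w_t
        cum += z
    return np.array(iterates)
```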

  11. Adaptive Mirror Descent to the Rescue
      - Recall that mirror descent is the (meta-)algorithm
        w_t = argmin_w ψ(w) + ∑_{s=1}^{t−1} w^⊤ z_s.
      - For ψ(w) = (1/η) ∑_{i=1}^n w_i log(w_i), we recover (MW1).
      - Adaptive mirror descent (Orabona et al., 2013) is the meta-algorithm
        w_t = argmin_w ψ_t(w) + ∑_{s=1}^{t−1} w^⊤ z_s.

  12. Adaptive Mirror Descent to the Rescue
      - Recall that mirror descent is the (meta-)algorithm
        w_t = argmin_w ψ(w) + ∑_{s=1}^{t−1} w^⊤ z_s.
      - For ψ(w) = (1/η) ∑_{i=1}^n w_i log(w_i), we recover (MW1).
      - Adaptive mirror descent (Orabona et al., 2013) is the meta-algorithm
        w_t = argmin_w ψ_t(w) + ∑_{s=1}^{t−1} w^⊤ z_s.
      - For ψ_t(w) = (1/η) ∑_{i=1}^n w_i log(w_i) + η ∑_{i=1}^n ∑_{s=1}^{t−1} w_i z_{s,i}^2, we approximately recover (MW2).
        Update: w_{t+1,i} ∝ w_{t,i} exp(−η z_{t,i} − η^2 z_{t,i}^2) ≈ w_{t,i} (1 − η z_{t,i}) (see the sketch below).
      - This is enough to achieve the better regret bound. (MW2) can be recovered exactly with a more complicated ψ_t.
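A sketch of the resulting update, w_{t+1,i} ∝ w_{t,i} exp(−η z_{t,i} − η^2 z_{t,i}^2); the exponent matches log(1 − η z_{t,i}) = −η z_{t,i} − η^2 z_{t,i}^2/2 − ... up to the constant on the quadratic term, which is why it approximates (MW2). Step size and loss matrix are again illustrative assumptions:

```python
import numpy as np

def adaptive_eg(losses, eta=0.1):
    """Exponentiated gradient with the per-expert quadratic penalty from the
    adaptive regularizer: w_{t+1,i} ∝ w_{t,i} * exp(-eta*z_{t,i} - eta**2 * z_{t,i}**2)."""
    w = np.full(losses.shape[1], 1.0 / losses.shape[1])
    total = 0.0
    for z in losses:
        total += w @ z
        w *= np.exp(-eta * z - eta**2 * z**2)   # approximately w * (1 - eta * z), i.e. (MW2)
        w /= w.sum()
    return total - losses.sum(axis=0).min()     # regret vs. best fixed expert
```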

  13. Advantages of Our Perspective
      - So far, we have cast (MW2) as adaptive mirror descent, with regularizer
        ψ_t(w) = ∑_{i=1}^n w_i ( (1/η) log(w_i) + η ∑_{s=1}^{t−1} z_{s,i}^2 ).
      - This explains the better regret bound while staying within the mirror descent framework, which is nice.
      - Our new perspective also lets us apply lots of modern machinery:
        - optimistic updates (Rakhlin & Sridharan, 2012)
        - matrix multiplicative weights (Tsuda et al., 2005; Arora & Kale, 2007)
      - By "turning the crank", we get results that beat the state of the art!

  14. Beating State of the Art
      [Figure: a chart with axes "Adaptivity" and "Optimism" on which prior regret bounds will be placed.]

  15. Beating State of the Art
      [Figure: the adaptivity/optimism chart with S_∞ (Kivinen & Warmuth, 1997).]
      In the above we let S_∞ := ∑_{t=1}^T ‖z_t‖_∞^2.

  16. Beating State of the Art
      [Figure: the adaptivity/optimism chart, now also showing max_i S_i and S_{i*} (Cesa-Bianchi et al., 2007).]
      In the above we let S_∞ := ∑_{t=1}^T ‖z_t‖_∞^2 and S_i := ∑_{t=1}^T z_{t,i}^2.

  17. Beating State of the Art
      [Figure: the adaptivity/optimism chart, now also showing V_∞ and max_i V_i (Hazan & Kale, 2008).]
      In the above we let
      S_∞ := ∑_{t=1}^T ‖z_t‖_∞^2,   V_∞ := ∑_{t=1}^T ‖z_t − z̄‖_∞^2,
      S_i := ∑_{t=1}^T z_{t,i}^2,   V_i := ∑_{t=1}^T (z_{t,i} − z̄_i)^2.

  18. Beating State of the Art
      [Figure: the adaptivity/optimism chart with all prior bounds: S_∞ (Kivinen & Warmuth, 1997); S_{i*} and max_i S_i (Cesa-Bianchi et al., 2007); V_∞ and max_i V_i (Hazan & Kale, 2008); D_∞ (Chiang et al., 2012).]
      In the above we let
      S_∞ := ∑_{t=1}^T ‖z_t‖_∞^2,   V_∞ := ∑_{t=1}^T ‖z_t − z̄‖_∞^2,   D_∞ := ∑_{t=1}^T ‖z_t − z_{t−1}‖_∞^2,
      S_i := ∑_{t=1}^T z_{t,i}^2,   V_i := ∑_{t=1}^T (z_{t,i} − z̄_i)^2
      (a sketch computing these quantities follows below).
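A small helper (an illustrative sketch, not code from the paper) that computes these quantities from a T × n loss matrix, taking z_0 = 0 in the definition of D_∞:

```python
import numpy as np

def bound_quantities(Z):
    """Compute S_inf, S_i, V_inf, V_i, D_inf for a T x n loss matrix Z,
    where Z[t] plays the role of z_t and zbar is the time-average of the losses."""
    zbar = Z.mean(axis=0)
    S_inf = (np.abs(Z).max(axis=1) ** 2).sum()           # sum_t ||z_t||_inf^2
    S_i = (Z ** 2).sum(axis=0)                           # sum_t z_{t,i}^2, per expert
    V_inf = (np.abs(Z - zbar).max(axis=1) ** 2).sum()    # sum_t ||z_t - zbar||_inf^2
    V_i = ((Z - zbar) ** 2).sum(axis=0)                  # sum_t (z_{t,i} - zbar_i)^2
    diffs = np.diff(Z, axis=0, prepend=np.zeros((1, Z.shape[1])))  # z_t - z_{t-1}, with z_0 = 0
    D_inf = (np.abs(diffs).max(axis=1) ** 2).sum()       # sum_t ||z_t - z_{t-1}||_inf^2
    return S_inf, S_i, V_inf, V_i, D_inf
```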
