On the optimality of anytime Hedge in the stochastic regime
Jaouad Mourtada, Stéphane Gaïffas (CMAP, École polytechnique)
CMStatistics 2018, Pisa, 15/12/18
Reference: "On the optimality of the Hedge algorithm in the stochastic regime", J. Mourtada & S. Gaïffas, arXiv preprint arXiv:1809.01382.
Hedge setting
Experts $i = 1, \dots, M$; they can be thought of as sources of predictions. The aim is to predict almost as well as the best expert in hindsight.
Hedge problem (= online linear optimization on the simplex). At each time step $t = 1, 2, \dots$:
1. The Forecaster chooses a probability distribution $v_t = (v_{i,t})_{1 \le i \le M} \in \Delta_M$ on the experts;
2. The Environment chooses a loss vector $\ell_t = (\ell_{i,t})_{1 \le i \le M} \in [0,1]^M$;
3. The Forecaster incurs the loss $\widehat{\ell}_t := \langle v_t, \ell_t \rangle = \sum_{i=1}^M v_{i,t}\,\ell_{i,t}$.
Goal: control, for every sequence of loss vectors $\ell_t \in [0,1]^M$, the regret
$$R_T = \sum_{t=1}^T \widehat{\ell}_t - \min_{1 \le i \le M} \sum_{t=1}^T \ell_{i,t}.$$
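As a concrete restatement of this protocol, here is a minimal Python sketch (not from the paper) of the interaction loop and of the regret $R_T$; the `forecaster` argument and the `play` helper are hypothetical names introduced here for illustration.

```python
import numpy as np

def play(forecaster, losses):
    """Run the Hedge protocol: losses is a (T, M) array with entries in [0, 1].

    At round t the forecaster only sees the past loss vectors ell_1, ..., ell_{t-1},
    plays a distribution v_t on the M experts, and incurs <v_t, ell_t>.
    Returns the regret R_T against the best expert in hindsight.
    """
    T, M = losses.shape
    total = 0.0
    for t in range(T):
        v_t = forecaster(losses[:t])          # distribution on the experts
        total += v_t @ losses[t]              # incur <v_t, ell_t>
    return total - losses.sum(axis=0).min()   # R_T
```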
Hedge algorithm and regret bound
First observation: Follow the Leader (FTL) / ERM, which plays $v_{i_t,t} = 1$ where $i_t \in \arg\min_i \sum_{s=1}^{t-1} \ell_{i,s}$, has no sublinear regret!
Indeed, with $M = 2$ experts, let $(\ell_{1,1}, \ell_{2,1}), (\ell_{1,2}, \ell_{2,2}), (\ell_{1,3}, \ell_{2,3}), \dots = (1/2, 0), (0, 1), (1, 0), \dots$
Then $\sum_{t=1}^T \langle v_t, \ell_t \rangle \ge T - 1$, but $\sum_{t=1}^T \ell_{2,t} \le T/2$, hence $R_T \ge T/2 - 1 \ne o(T)$.
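The counterexample can be checked numerically; below is a small sketch (reusing the hypothetical `play` helper above) of FTL on the alternating loss sequence, whose regret grows like $T/2$.

```python
import numpy as np

def ftl(past):
    """Follow the Leader: all mass on an expert with smallest cumulative loss
    (ties, including the empty first round, broken towards the first expert)."""
    M = past.shape[1]
    v = np.zeros(M)
    v[np.argmin(past.sum(axis=0))] = 1.0
    return v

T = 1000
losses = np.zeros((T, 2))
losses[0] = [0.5, 0.0]       # round 1: (1/2, 0)
losses[1::2, 1] = 1.0        # rounds 2, 4, ...: (0, 1)
losses[2::2, 0] = 1.0        # rounds 3, 5, ...: (1, 0)
print(play(ftl, losses))     # about T/2, i.e. not o(T)
```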
Hedge algorithm and regret bound
Hedge algorithm (constant learning rate):
$$v_{i,t} = \frac{e^{-\eta L_{i,t-1}}}{\sum_{j=1}^M e^{-\eta L_{j,t-1}}}, \quad \text{where } L_{i,t} = \sum_{s=1}^t \ell_{i,s} \text{ and } \eta \text{ is the learning rate.}$$
Regret bound [Freund & Schapire, 1997; Vovk, 1998]:
$$R_T \le \frac{\log M}{\eta} + \frac{\eta T}{8} \le \sqrt{(T/2) \log M}$$
for $\eta = \sqrt{8 (\log M)/T}$, tuned knowing the fixed time horizon $T$.
The $O(\sqrt{T \log M})$ regret bound is minimax (worst-case) optimal.
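A minimal sketch of this fixed-horizon Hedge update, in the same interface as above (the factory name `make_hedge_constant` is introduced here for illustration, not from the paper):

```python
import numpy as np

def make_hedge_constant(M, T):
    """Hedge with the constant learning rate eta = sqrt(8 log(M) / T)."""
    eta = np.sqrt(8.0 * np.log(M) / T)
    def hedge(past):
        L = past.sum(axis=0)                  # cumulative losses L_{i, t-1}
        w = np.exp(-eta * (L - L.min()))      # subtract L.min() for numerical stability
        return w / w.sum()
    return hedge
```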
Hedge algorithm and regret bound
Hedge algorithm (time-varying learning rate):
$$v_{i,t} = \frac{e^{-\eta_t L_{i,t-1}}}{\sum_{j=1}^M e^{-\eta_t L_{j,t-1}}}, \quad \text{where } L_{i,t} = \sum_{s=1}^t \ell_{i,s} \text{ and } \eta_t \text{ is the learning rate.}$$
Regret bound: if $\eta_t$ is decreasing,
$$R_T \le \frac{\log M}{\eta_T} + \frac{1}{8} \sum_{t=1}^T \eta_t \le \sqrt{T \log M}$$
for $\eta_t = 2\sqrt{(\log M)/t}$, valid for every horizon $T$ (anytime).
The $O(\sqrt{T \log M})$ regret bound is minimax (worst-case) optimal.
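The anytime variant only changes the learning rate; here is a sketch with the choice $\eta_t = 2\sqrt{(\log M)/t}$ used above, again in the same interface:

```python
import numpy as np

def hedge_anytime(past):
    """Hedge with the decreasing learning rate eta_t = 2 sqrt(log(M) / t)."""
    M = past.shape[1]
    t = past.shape[0] + 1                     # index of the current round
    eta_t = 2.0 * np.sqrt(np.log(M) / t)
    L = past.sum(axis=0)                      # cumulative losses L_{i, t-1}
    w = np.exp(-eta_t * (L - L.min()))
    return w / w.sum()
```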
Beyond worst case: adaptivity to easy stochastic instances
Hedge with $\eta \asymp \sqrt{(\log M)/T}$ (constant) or $\eta_t \asymp \sqrt{(\log M)/t}$ (anytime) achieves the optimal worst-case $O(\sqrt{T \log M})$ regret.
However, the worst case is pessimistic and can lead to overly conservative algorithms.
"Easy" problem instance: the stochastic case. If the loss vectors $\ell_1, \ell_2, \dots$ are i.i.d. (e.g., $\ell_{i,t} = \ell(f_i(X_t), Y_t)$), FTL/ERM achieves constant $O(\log M)$ regret ⇒ fast rate.
Recent line of work¹: algorithms that combine worst-case $O(\sqrt{T \log M})$ regret with faster rates on "easier" instances. Example: the AdaHedge algorithm [van Erven et al., 2011, 2015], with a data-dependent learning rate $\eta_t$ (a code sketch follows below):
- Worst case: "safe" $\eta_t \asymp \sqrt{(\log M)/t}$, $O(\sqrt{T \log M})$ regret;
- Stochastic case: $\eta_t \asymp \mathrm{cst}$ ($\approx$ FTL), $O(\log M)$ regret.
¹ E.g., van Erven et al., 2011; Gaillard et al., 2014; Luo & Schapire, 2015.
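For reference, a sketch of the AdaHedge learning-rate rule as I understand it from van Erven et al. (recomputing the state from `past` at every round is a simplification for this interface and is quadratic in T; see the original papers for the exact algorithm and its incremental form):

```python
import numpy as np

def adahedge(past):
    """AdaHedge sketch: eta_t = log(M) / Delta_{t-1}, with Delta_{t-1} the
    cumulative mixability gap; replayed from the past losses for simplicity."""
    M = past.shape[1]
    L, Delta = np.zeros(M), 0.0

    def weights():
        if Delta > 0:
            eta = np.log(M) / Delta
            w = np.exp(-eta * (L - L.min()))
        else:                                  # eta = +infinity: play the current leaders
            eta = np.inf
            w = (L == L.min()).astype(float)
        return w / w.sum(), eta

    for ell in past:
        w, eta = weights()
        h = w @ ell                            # Hedge (dot) loss h_t
        if np.isinf(eta):
            m = ell[w > 0].min()               # limit of the mix loss as eta -> infinity
        else:                                  # mix loss m_t = -(1/eta) log sum_i w_i exp(-eta ell_i)
            m = ell.min() - np.log(w @ np.exp(-eta * (ell - ell.min()))) / eta
        Delta += h - m                         # mixability gap delta_t = h_t - m_t >= 0
        L += ell
    return weights()[0]
```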
Optimality of anytime Hedge in the stochastic regime
Our result: anytime Hedge with the "conservative" $\eta_t \asymp \sqrt{(\log M)/t}$ is actually optimal in the easy stochastic regime!
Stochastic instance: i.i.d. loss vectors $\ell_1, \ell_2, \dots$ such that $\mathbb{E}[\ell_{i,t} - \ell_{i^*,t}] \ge \Delta$ for $i \ne i^*$ (where $i^* = \arg\min_i \mathbb{E}[\ell_{i,t}]$).
Proposition (M., Gaïffas, 2018). On any stochastic instance with sub-optimality gap $\Delta$, anytime Hedge with $\eta_t \asymp \sqrt{(\log M)/t}$ achieves, for every $T \ge 1$:
$$\mathbb{E}[R_T] \lesssim \frac{\log M}{\Delta}.$$
Remark: $\frac{\log M}{\Delta}$ regret is optimal under the gap assumption.
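A simple family of stochastic instances with gap $\Delta$ (an illustrative choice, not necessarily the one used in the paper's experiments): Bernoulli losses where expert $i^*$ has mean $\mu$ and every other expert has mean $\mu + \Delta$. A sketch reusing the helpers above:

```python
import numpy as np

def gap_instance(T, M, Delta, mu=0.3, rng=None):
    """I.i.d. Bernoulli losses: expert 0 plays the role of i*, gap Delta."""
    rng = np.random.default_rng(rng)
    means = np.full(M, mu + Delta)
    means[0] = mu
    return (rng.random((T, M)) < means).astype(float)

losses = gap_instance(T=2000, M=20, Delta=0.1, rng=0)
print(play(hedge_anytime, losses))   # stays of order log(M) / Delta as T grows
```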
Anytime Hedge vs. fixed-horizon Hedge
Theorem (M., Gaïffas, 2018). On any stochastic instance with sub-optimality gap $\Delta$, anytime Hedge with $\eta_t \asymp \sqrt{(\log M)/t}$ achieves, for every $T \ge 1$: $\mathbb{E}[R_T] \lesssim \frac{\log M}{\Delta}$.
Proposition (M., Gaïffas, 2018). If $\ell_{i^*,t} = 0$ and $\ell_{i,t} = 1$ for $i \ne i^*$, $t \ge 1$ (a stochastic instance with gap $\Delta = 1$), then constant Hedge with $\eta \asymp \sqrt{(\log M)/T}$ achieves $R_T \asymp \sqrt{T \log M}$.
Seemingly similar Hedge variants behave very differently on stochastic instances! Even if the horizon $T$ is known, the anytime variant is preferable.
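The contrast can be seen directly on the deterministic instance of the proposition (a quick numerical check, reusing the sketches above):

```python
import numpy as np

T, M = 4000, 20
losses = np.ones((T, M))
losses[:, 0] = 0.0                               # ell_{i*, t} = 0, ell_{i, t} = 1 otherwise
print(play(make_hedge_constant(M, T), losses))   # grows like sqrt(T log M)
print(play(hedge_anytime, losses))               # stays of order log M
```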
Some proof ideas
Divide time into two phases, $[1, \tau]$ (dominated by noise) and $[\tau, T]$ (weights concentrate fast on $i^*$), with $\tau \asymp \frac{\log M}{\Delta^2}$.
Early phase: worst-case regret $R_\tau \lesssim \sqrt{\tau \log M} \lesssim \frac{\log M}{\Delta}$.
At the beginning of the late phase, i.e. $t \approx \tau \approx \frac{\log M}{\Delta^2}$, two things occur simultaneously:
1. $i^*$ linearly dominates the other experts: for every $i \ne i^*$, $L_{i,t} - L_{i^*,t} \ge \frac{1}{2}\Delta t$. By Hoeffding, it suffices that $M e^{-t\Delta^2} \le 1$.
2. Expert $i^*$ receives at least $1/2$ of the weight: under the previous condition, it suffices that $M e^{-\Delta\sqrt{t \log M}} \le 1$.
Condition (2) eliminates a potentially linear dependence on $M$ in the bound. To control the regret in the second phase, we then use (1) and the fact that $\sum_{t \ge 0} e^{-c\sqrt{t}} \lesssim \frac{1}{c^2}$ for $c > 0$.
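As a quick check of the early-phase bound and of the sum used for the late phase (a worked restatement of the two displays above, with the same notation):

```latex
R_\tau \lesssim \sqrt{\tau \log M}
       \asymp \sqrt{\frac{\log M}{\Delta^2}\,\log M}
       = \frac{\log M}{\Delta},
\qquad
\sum_{t \ge 0} e^{-c\sqrt{t}}
  \le 1 + \int_0^{\infty} e^{-c\sqrt{u}}\,\mathrm{d}u
  = 1 + \frac{2}{c^2}.
```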
The advantage of adaptive algorithms
The stochastic regime with gap $\Delta$ is often considered in the literature to show the improvement brought by adaptive algorithms. However, anytime Hedge achieves the optimal $O(\frac{\log M}{\Delta})$ regret in this case. No need to tune $\eta_t$?
$(\beta, B)$-Bernstein condition² ($\beta \in [0,1]$, $B > 0$): for $i \ne i^*$,
$$\mathbb{E}\big[(\ell_{i,t} - \ell_{i^*,t})^2\big] \le B\, \mathbb{E}[\ell_{i,t} - \ell_{i^*,t}]^{\beta}.$$
Proposition (Koolen, Grünwald & van Erven, 2016). Algorithms with so-called "second-order regret bounds" (including AdaHedge) achieve, on $(\beta, B)$-Bernstein stochastic losses:
$$\mathbb{E}[R_T] \lesssim (B \log M)^{\frac{1}{2-\beta}}\, T^{\frac{1-\beta}{2-\beta}} + \log M.$$
For $\beta = 1$, this gives $O(B \log M)$ regret; and we can have $B \ll \frac{1}{\Delta}$!
² Mammen & Tsybakov, 1999; Bartlett & Mendelson, 2006.
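To see how $B \ll 1/\Delta$ can occur, here is a simple illustrative family (an assumption made for the sake of the example, not necessarily an instance from the paper): suppose the excess loss $\ell_{i,t} - \ell_{i^*,t}$ equals $a \in (0,1]$ with probability $p$ and $0$ otherwise. Then

```latex
\mathbb{E}[\ell_{i,t} - \ell_{i^*,t}] = p\,a = \Delta,
\qquad
\mathbb{E}\big[(\ell_{i,t} - \ell_{i^*,t})^2\big] = p\,a^2 = a\,\Delta,
```

so the $(1, B)$-Bernstein condition holds with $B = a$, which is much smaller than $1/\Delta = 1/(pa)$ as soon as $p a^2 \ll 1$, e.g. when the excess losses are small in magnitude rather than rare.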
The advantage of adaptive algorithms
$(1, B)$-Bernstein condition: $\mathbb{E}[(\ell_{i,t} - \ell_{i^*,t})^2] \le B\,\mathbb{E}[\ell_{i,t} - \ell_{i^*,t}]$. In this case, adaptive algorithms achieve $O(B \log M)$ regret. We always have $B \le \frac{1}{\Delta}$, but potentially $B \ll \frac{1}{\Delta}$ (e.g., low noise).
Proposition. There exists a $(1,1)$-Bernstein stochastic instance on which anytime Hedge satisfies
$$\mathbb{E}[R_T] \gtrsim \sqrt{T \log M}.$$
In fact, the gap $\Delta$ (essentially) characterizes anytime Hedge's regret on any stochastic instance: for $T \ge 1/\Delta^2$,
$$\mathbb{E}[R_T] \gtrsim \frac{(\log M)^{1/2}}{\Delta}.$$
Experiments
Figure: Cumulative regret (y-axis: Regret, x-axis: Round) of the Hedge algorithms hedge, hedge_cst, hedge_doubling, adahedge and FTL on two stochastic instances. (a) Stochastic instance with a gap, independent losses across experts (M = 20, Δ = 0.1); (b) Bernstein instance with small Δ but small B (M = 10, Δ = 0.04, B = 4). [Plots omitted.]
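A sketch of a comparison in the spirit of panel (a), reusing the helpers defined above (the doubling-trick variant, AdaHedge-as-forecaster wiring and the plotting code from the figure are omitted; parameters follow the caption):

```python
import numpy as np

T, M, Delta = 2000, 20, 0.1
losses = gap_instance(T, M, Delta, rng=42)
for name, forecaster in [("hedge (anytime)", hedge_anytime),
                         ("hedge (constant eta)", make_hedge_constant(M, T)),
                         ("FTL", ftl)]:
    print(name, play(forecaster, losses))
```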
Conclusion and perspectives
Despite its conservative learning rate (i.e., large penalization), anytime Hedge achieves $O(\frac{\log M}{\Delta})$ regret, adaptively in the gap $\Delta$, in the easy stochastic case.
This is not the case with the fixed-horizon $\eta \asymp \sqrt{(\log M)/T}$ instead of $\eta_t \asymp \sqrt{(\log M)/t}$. Tuning the learning rate does help in some situations.
A result of a similar flavor in stochastic optimization³: SGD with step size $\eta_t \asymp 1/\sqrt{t}$ achieves $O(\frac{1}{\mu T})$ excess risk after averaging on $\mu$-strongly convex problems (adaptively in $\mu$). Not directly related; in fact, an "opposite" phenomenon.
³ Moulines & Bach, 2011.
Thank you!