  1. Easy Data Peter Grünwald Centrum Wiskunde & Informatica – Amsterdam Mathematical Institute – Leiden University Joint work with W. Koolen, T. Van Erven, N. Mehta, T. Sterkenburg

  2. Today: Three Things To Tell You 1. Nifty Reformulation of Conditions for Fast Rates in Statistical Learning – Tsybakov, Bernstein, Exp-Concavity, ... 2. Do this via a new concept: ESI 3. Precise Analogue of the Bernstein Condition for Fast Rates in the Individual Sequence Setting – ...and an algorithm that achieves these rates!

  3. Today: Three Things To Tell You 1. Nifty Reformulation of Conditions for Fast Rates in Statistical Learning 2. Do this via a new concept: ESI 3. Precise Analogue of the Bernstein Condition for Fast Rates in the Individual Sequence Setting – ...and an algorithm that achieves these rates!

  4. Van Erven, Grünwald, Mehta, Reid, Williamson. Fast Rates in Statistical and Online Learning. JMLR Special Issue in Memory of A. Chervonenkis, Oct. 2015 • [Figure: diagram of the conditions from the stochastic-mixability paper] • VC: Vapnik–Chervonenkis (1974!) optimistic (realizability) condition • TM: Tsybakov (2004) margin condition (special case: Massart condition) • u-BC: Audibert, Bousquet (2005), Bartlett, Mendelson (2006) “Bernstein Condition” • Does not require 0/1 or absolute loss • Does not require the Bayes act to be in the model

  5. Decision Problem • A decision problem (DP) is defined as a tuple (P, 𝒵, ℱ, ℓ) where • P is the distribution of a random quantity Z taking values in 𝒵, • ℱ, the model, is a set of predictors f, and for each f ∈ ℱ, ℓ_f(z) indicates the loss f makes on z ∈ 𝒵 • Example: squared error loss, with Z = (X, Y) and ℓ_f(x, y) = (y − f(x))²

  6. Decision Problem • A decision problem (DP) is defined as a tuple (P, 𝒵, ℱ, ℓ) where • P is the distribution of a random quantity Z taking values in 𝒵, • ℱ, the model, is a set of predictors f, and for each f ∈ ℱ, ℓ_f(z) indicates the loss f makes on z ∈ 𝒵 • We assume throughout that the model contains a risk minimizer f*, achieving min_{f ∈ ℱ} E[ℓ_f(Z)] • E[ℓ_f] abbreviates E_{Z∼P}[ℓ_f(Z)]
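
To make the definition concrete, here is a minimal Python sketch of a decision problem: a hypothetical toy source P (with Y = X + Gaussian noise), a finite grid of predictors, and Monte Carlo estimation of the risk and its minimizer f*. The source, the grid, and all names are illustrative, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(n):
    """Draw n i.i.d. copies of Z = (X, Y); a hypothetical toy source P."""
    x = rng.uniform(-1, 1, n)
    y = x + 0.1 * rng.normal(size=n)
    return x, y

# Finite model F: predictors f_c(x) = x + c for a grid of offsets c.
offsets = np.linspace(-0.5, 0.5, 11)

def risk(c, n=100_000):
    """Monte Carlo estimate of E[loss_f(Z)] under squared error loss."""
    x, y = sample_z(n)
    return np.mean((y - (x + c)) ** 2)

risks = np.array([risk(c) for c in offsets])
c_star = offsets[np.argmin(risks)]   # the risk minimizer f* within the grid
print(f"f* offset = {c_star:.2f}, risk = {risks.min():.4f}")
```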

  7. Bernstein Condition • Fix a DP with (for now) bounded loss • The DP satisfies the (B, β)-Bernstein condition if there exist B > 0, β ∈ [0, 1], such that for all f ∈ ℱ: E[X_f²] ≤ B · (E[X_f])^β, where we set X_f := ℓ_f − ℓ_{f*}, and E[X_f] is the ‘regret of f relative to f*’

  8. Bernstein Condition • Fix a DP with (for now) bounded loss • The DP satisfies the (B, β)-Bernstein condition if there exist B > 0, β ∈ [0, 1], such that for all f ∈ ℱ: E[X_f²] ≤ B · (E[X_f])^β, where we set X_f := ℓ_f − ℓ_{f*} • Generalizes the Tsybakov condition: f* does not need to be the Bayes act, and the loss does not need to be 0/1

  9. Bernstein Condition • Fix a DP with (for now) bounded loss • The DP satisfies the (B, β)-Bernstein condition if there exist B > 0, β ∈ [0, 1], such that for all f ∈ ℱ: E[X_f²] ≤ B · (E[X_f])^β, where we set X_f := ℓ_f − ℓ_{f*} • Suppose the data Z₁, Z₂, … are i.i.d. ∼ P and the (B, β)-Bernstein condition holds. Then...
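
The condition can be checked numerically on a toy problem. The sketch below (hypothetical, using the same toy source as above) estimates both sides of the (B, β)-Bernstein inequality with β = 1; for this well-specified squared-loss problem, the Massart-type case β = 1 indeed holds with a finite B.

```python
import numpy as np

# Estimate E[X_f^2] / (E[X_f])^beta over a grid of predictors, where
# X_f = loss_f(Z) - loss_f*(Z). Model and data are illustrative assumptions.
rng = np.random.default_rng(1)
n = 200_000
x = rng.uniform(-1, 1, n)
y = x + 0.1 * rng.normal(size=n)

def excess_loss(c):
    """X_f for predictor f(x) = x + c; f* is the offset-0 predictor."""
    return (y - (x + c)) ** 2 - (y - x) ** 2

beta = 1.0  # well-specified squared loss is a Massart-type (beta = 1) case
ratios = []
for c in np.linspace(-0.5, 0.5, 11):
    if abs(c) < 1e-12:
        continue                      # skip f* itself (zero excess loss)
    xf = excess_loss(c)
    ratios.append(np.mean(xf ** 2) / np.mean(xf) ** beta)
print("smallest B that works on this grid:", max(ratios))
```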

  10. Under Bernstein (B, β) • Empirical risk minimization satisfies, with high prob*, excess risk E[ℓ_f̂] − E[ℓ_{f*}] = O((1/T)^{1/(2−β)}), up to log factors • β = 0: condition trivially satisfied, get minimax rate O(T^{−1/2}) • β = 1: nice case (Massart condition), get ‘log-loss’ rate O(1/T)

  11. Under Bernstein (B, β) • The η-generalized Bayes MAP satisfies, with high prob*, excess risk O((1/T)^{1/(2−β)}) • This requires setting the “learning rate” η in terms of β and T! • β = 0: slow rate; β = 1: fast rate
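
As a rough illustration of the role of η, here is a sketch of generalized-Bayes weighting, in the form posterior ∝ prior × exp(−η × cumulative loss), on the same hypothetical grid model as above; the MAP element is then read off from the posterior weights. The data, grid, and the value of η are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
offsets = np.linspace(-0.5, 0.5, 11)           # hypothetical finite model
prior = np.full(len(offsets), 1.0 / len(offsets))

eta = 0.5  # the learning rate that, per the slide, must be tuned to beta and T
cum_loss = np.zeros(len(offsets))
for _ in range(1000):
    x = rng.uniform(-1, 1)
    y = x + 0.1 * rng.normal()
    cum_loss += (y - (x + offsets)) ** 2       # per-predictor squared loss

# eta-generalized Bayes posterior (computed in log space for stability).
log_post = np.log(prior) - eta * cum_loss
post = np.exp(log_post - log_post.max())
post /= post.sum()
print("eta-Bayes MAP offset:", offsets[np.argmax(post)])
```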

  12. GOAL: Sequential Bernstein • η-Bayes MAP satisfies, with high prob*, excess risk O((1/T)^{1/(2−β)}) • GOAL: design a ‘sequential Bernstein condition’ and an accompanying sequential prediction algorithm s.t. 1. cumulative regret always satisfies, for all f*, all sequences: ∑_{t=1}^T (ℓ_{f_t} − ℓ_{f*}) = O(√T), up to log factors 2. if the condition holds, it also satisfies, with high prob*, cumulative regret O(T^{(1−β)/(2−β)})

  13. GOAL: Sequential Bernstein • GOAL: design a ‘sequential Bernstein condition’ and an accompanying sequential prediction algorithm s.t. 1. cumulative regret always satisfies, for all f*, all sequences: ∑_{t=1}^T (ℓ_{f_t} − ℓ_{f*}) = O(√T), up to log factors 2. if the condition holds, it also satisfies, with high prob*, cumulative regret O(T^{(1−β)/(2−β)})

  14. DREAM • DREAM: design a ‘sequential Bernstein condition’ and an accompanying sequential prediction algorithm s.t. 1. cumulative regret always satisfies, for all f*, all sequences: O(√T) 2. if the condition holds for the given sequence z₁, z₂, …, then cumulative regret satisfies, for that sequence: O(T^{(1−β)/(2−β)})

  15. GOAL: Sequential Bernstein • GOAL: design a ‘sequential Bernstein condition’ s.t. 1. for all f*, all sequences: cumulative regret O(√T) 2. if the condition holds, it also satisfies, with high prob*, O(T^{(1−β)/(2−β)}) • Approach 1: define seq. Bernstein as standard Bernstein + i.i.d. Even then, none of the standard algorithms achieve this... With one (?) exception!

  16. Today: Three Things To Tell You 1. Nifty Reformulation of Fast Rate Conditions in Statistical Learning 2. Do this via a new concept: ESI 3. Precise Analogue of the Bernstein Condition for Fast Rates in the Individual Sequence Setting – ...and an algorithm that achieves these rates!

  17. Exponential Stochastic Inequality (ESI) • For any given η > 0 we write X ⊴_η Y as shorthand for E[exp(η(X − Y))] ≤ 1 • X ⊴_η Y implies, via Jensen, E[X] ≤ E[Y] • X ⊴_η Y implies, via Markov, for all z > 0: P(X ≥ Y + z) ≤ e^{−ηz}
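
In code, an ESI statement is just a bound on an exponential moment, so it can be made operational. A sketch (with finitely many samples this only estimates the expectation; the distribution and constants below are illustrative):

```python
import numpy as np

def esi_estimate(x, y, eta):
    """Monte Carlo estimate of E[exp(eta*(X - Y))]; X ⊴_eta Y asserts <= 1."""
    return np.mean(np.exp(eta * (x - y)))

# Illustrative check: X ~ Uniform[-1, 1] against the constant Y = 0.3, eta = 0.5.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1_000_000)
print(esi_estimate(x, 0.3, 0.5))   # ~0.90 <= 1, so X ⊴_0.5 0.3 holds here
```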

  18. ESI Example • Hoeffding’s Inequality: suppose that X has support [−1, 1] and mean 0. Then, for all η > 0: X ⊴_η η/2
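
A quick numeric sanity check of this ESI form of Hoeffding, using the extreme Rademacher case (support {−1, +1}, mean 0); the grid of η values is an arbitrary choice.

```python
import numpy as np

# Check X ⊴_eta eta/2, i.e. E[exp(eta*X - eta^2/2)] <= 1, for Rademacher X.
rng = np.random.default_rng(3)
x = rng.choice([-1.0, 1.0], size=1_000_000)
for eta in [0.1, 0.5, 1.0, 2.0]:
    est = np.mean(np.exp(eta * x - eta ** 2 / 2))
    print(f"eta={eta}: estimate = {est:.4f}  (should be <= 1)")
```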

  19. ESI – More Properties • For i.i.d. rvs X, X₁, …, X_T we have: X ⊴_η a implies ∑_{t=1}^T X_t ⊴_η T·a • For arbitrary (possibly dependent) rvs X, Y we have: X ⊴_η a and Y ⊴_η b imply X + Y ⊴_{η/2} a + b
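
The i.i.d. rule follows in one line from the definition; a sketch of the reasoning for the sum property (assuming each X_t satisfies the same ESI X_t ⊴_η a):

```latex
\begin{align*}
\mathbb{E}\Bigl[e^{\eta \sum_{t=1}^{T}(X_t - a)}\Bigr]
  = \prod_{t=1}^{T} \mathbb{E}\bigl[e^{\eta (X_t - a)}\bigr] \le 1,
\end{align*}
% using independence for the product step and X_t \trianglelefteq_\eta a for
% each factor; hence \sum_{t=1}^T X_t \trianglelefteq_\eta T a.
```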

  20. Bernstein in ESI Terms • Most general form of the Bernstein condition: for some nondecreasing function w: for all f ∈ ℱ: E[X_f²] ≤ w(E[X_f])

  21. Bernstein in ESI Terms • Most general form of the Bernstein condition: for some nondecreasing function w: for all f ∈ ℱ: E[X_f²] ≤ w(E[X_f]) • Van Erven et al. (2015) show this is equivalent to having, for some nondecreasing function u with u(x) > 0 for all x > 0: for all ε > 0, all f ∈ ℱ: ℓ_{f*} − ℓ_f ⊴_{u(ε)} ε

  22. U-Central Condition • Van Erven et al. (2015) show the Bernstein condition is equivalent to the existence of an increasing function u such that, for some f* ∈ ℱ, for all ε > 0 and all f ∈ ℱ: ℓ_{f*} − ℓ_f ⊴_{u(ε)} ε • They term this the u-central condition
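
For intuition, the ε → 0 endpoint of this family (the η-central condition, E[exp(−η(ℓ_f − ℓ_{f*}))] ≤ 1 for all f) can again be tested by Monte Carlo on the earlier hypothetical toy model; the grid, source, and η below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
x = rng.uniform(-1, 1, n)
y = x + 0.1 * rng.normal(size=n)

eta = 1.0
worst = 0.0
for c in np.linspace(-0.5, 0.5, 11):
    excess = (y - (x + c)) ** 2 - (y - x) ** 2   # loss_f - loss_f*
    worst = max(worst, np.mean(np.exp(-eta * excess)))
print(f"max_f E[exp(-eta*(loss_f - loss_f*))] = {worst:.4f}  (<= 1: eta-central)")
```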

  23. U-Central Condition • Van Erven et al. (2015) show the Bernstein condition is equivalent to the existence of an increasing function u such that, for some f* ∈ ℱ, for all ε > 0 and all f ∈ ℱ: ℓ_{f*} − ℓ_f ⊴_{u(ε)} ε • They term this the u-central condition – it can also be related to mixability, exp-concavity, the JRT condition, and the condition for well-behavedness of Bayesian inference under misspecification

  24. U-Central Condition • Van Erven et al. (2015) show the Bernstein condition is equivalent to the existence of an increasing function u such that, for some f* ∈ ℱ, for all ε > 0 and all f ∈ ℱ: ℓ_{f*} − ℓ_f ⊴_{u(ε)} ε • They term this the u-central condition – it can also be related to mixability, exp-concavity, the JRT condition, and the condition for well-behavedness of Bayesian inference under misspecification – for unbounded losses, it becomes different from (and better than!) the Bernstein condition – it is one-sided

  25. Three Equivalent Notions for Bounded Losses • The u-central condition in terms of the regret X_f = ℓ_f − ℓ_{f*}: for all ε > 0, all f ∈ ℱ: −X_f ⊴_{u(ε)} ε ... or equivalently (extending the notation to function-valued rates): −X_f ⊴_u ε

  26. Three Equivalent Notions for Bounded Losses • The u-central condition in terms of regret: −X_f ⊴_{u(ε)} ε, with u increasing and u(ε) > 0 for ε > 0 • For bounded losses, this turns out to be equivalent to: for some appropriately chosen increasing ū with ū(ε) > 0 for ε > 0: X_f ⊴_{ū(ε)} E[X_f] + ε

  27. Three Equivalent Notions for Bounded Losses • The u-central condition in terms of regret: −X_f ⊴_{u(ε)} ε, with u increasing and u(ε) > 0 for ε > 0 • For bounded losses, this turns out to be equivalent to: for some appropriately chosen increasing ū with ū(ε) > 0 for ε > 0: X_f ⊴_{ū(ε)} E[X_f] + ε • More similar to the original Bernstein condition; however, the condition is now in ‘exponential’ rather than ‘expectation’ form

  28. Today: Three Things To Tell You 1. Nifty Reformulation of Fast Rate Conditions in Statistical Learning 2. Do this via a new concept: ESI 3. Precise Analogue of the Bernstein Condition for Fast Rates in the Individual Sequence Setting – ...and an algorithm that achieves these rates!

  29. T-fold U-Central Condition • Suppose the u-central condition holds (i.e., x/u(x)-Bernstein holds) and the data Z₁, …, Z_T are i.i.d. Then, by a generic property of ESI, with η_ε = C₁ · u(ε): for all f ∈ ℱ: ∑_{t=1}^T (ℓ_{f*}(Z_t) − ℓ_f(Z_t)) ⊴_{η_ε} T·ε, where ℓ_f(Z_t) is the loss f incurs on Z_t

  30. T-fold U-Central Condition • Under the u-central cond. and i.i.d. data, with η_ε = C₁ · u(ε): ∑_{t=1}^T (ℓ_{f*}(Z_t) − ℓ_f(Z_t)) ⊴_{η_ε} T·ε, but also, for every learning algorithm: ∑_{t=1}^T (ℓ_{f*}(Z_t) − ℓ_{f_t}(Z_t)) ⊴_{η_ε} T·ε, with f_t the predictor the algorithm outputs after seeing Z₁, …, Z_{t−1}

  31. Cumulative U-Central Condition • Under the u-central cond. and i.i.d. data, with η_ε = C₁ · u(ε): ∑_{t=1}^T (ℓ_{f*}(Z_t) − ℓ_{f_t}(Z_t)) ⊴_{η_ε} T·ε for every learning algorithm • This condition may of course also hold for non-i.i.d. data. It is the condition we need, so we term it the cumulative u-central condition
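
The “generic property of ESI” invoked here is plausibly a conditional chain rule; a sketch, writing Y_t for ℓ_{f*}(Z_t) − ℓ_{f_t}(Z_t) and assuming Y_t ⊴_{η_ε} ε conditionally on Z₁, …, Z_{t−1}:

```latex
\begin{align*}
\mathbb{E}\Bigl[e^{\eta_\epsilon \sum_{t=1}^{T}(Y_t-\epsilon)}\Bigr]
  = \mathbb{E}\Bigl[e^{\eta_\epsilon \sum_{t=1}^{T-1}(Y_t-\epsilon)}
      \underbrace{\mathbb{E}\bigl[e^{\eta_\epsilon (Y_T-\epsilon)}\,\big|\,
      Z_1,\dots,Z_{T-1}\bigr]}_{\le 1}\Bigr]
  \le \cdots \le 1,
\end{align*}
% so \sum_{t=1}^T Y_t \trianglelefteq_{\eta_\epsilon} T\epsilon even when f_t
% depends on the past, which is why the condition covers learning algorithms.
```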

  32. Hedge with Oracle Learning Rate • Hedge with learning rate η achieves the regret bound, for all f* among M experts, all sequences: ∑_{t=1}^T ℓ_{f_t} − ∑_{t=1}^T ℓ_{f*} ≤ (log M)/η + ηT/8 • We assume the cumulative u-central condition for some u. For simplicity assume u(ε) = ε^{1−β}; then ∑_{t=1}^T (ℓ_{f*} − ℓ_{f_t}) ⊴_{η_ε} T·ε with η_ε = C₁ · ε^{1−β}, and even the analogous bound at learning rate η_ε/C₂, for some other constant C₂
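
For reference, a minimal implementation of Hedge with a fixed learning rate η, for losses assumed to lie in [0, 1]; the classical bound (log M)/η + ηT/8 quoted above applies to this update.

```python
import numpy as np

def hedge(loss_matrix, eta):
    """Exponential weights with fixed eta. loss_matrix: (T, M) losses in [0, 1].
    Returns the learner's regret against the best of the M experts."""
    T, M = loss_matrix.shape
    log_w = np.zeros(M)                    # uniform prior over experts
    total_loss = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())    # stable normalization
        p /= p.sum()
        total_loss += p @ loss_matrix[t]   # learner's (expected) loss
        log_w -= eta * loss_matrix[t]      # exponential-weights update
    return total_loss - loss_matrix.sum(axis=0).min()
```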

  33. Hedge with Oracle Learning Rate • Combining, we get a cumulative-regret bound of order (log M)/η_ε + T·ε (in ESI form) • We can set ε (or equivalently η) as we like. The best possible bound is achieved if we make sure all terms are of the same order, i.e., at time T we set ε_T = ((log M)/T)^{1/(2−β)} • and then η_T ≍ ε_T^{1−β} and cumulative regret = O(T^{(1−β)/(2−β)} (log M)^{1/(2−β)}), with high prob*
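
A back-of-the-envelope version of the balancing, writing K = log M and assuming u(ε) = ε^{1−β} as on the previous slide (constants suppressed):

```latex
\begin{align*}
\frac{K}{\eta_\epsilon} + T\epsilon
  \;\asymp\; \frac{K}{\epsilon^{1-\beta}} + T\epsilon
  \quad\Longrightarrow\quad
  \epsilon_T \asymp \Bigl(\frac{K}{T}\Bigr)^{\frac{1}{2-\beta}},
  \qquad
  T\epsilon_T \asymp T^{\frac{1-\beta}{2-\beta}}\, K^{\frac{1}{2-\beta}}.
\end{align*}
% Sanity check: \beta = 0 gives \sqrt{TK} (the slow/minimax rate) and
% \beta = 1 gives K (the fast 'log-loss' rate), matching the earlier slides.
```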

  34. Squint without Oracle Learning Rate! • Hedge achieves the ESI (!) bound above ...but needs to know f*, β and T to set the learning rate! • Squint (Koolen and Van Erven ’15) achieves the same bound without knowing these! • Gets the bound with β = 0 automatically for individual sequences • What about AdaNormalHedge? (Luo & Schapire ’15)
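
A sketch of Squint with a discrete prior on η; the grid and its range are implementation assumptions (Koolen and Van Erven also give closed-form priors), and losses are assumed in [0, 1].

```python
import numpy as np

def squint(loss_matrix):
    """Squint with a uniform grid prior over learning rates eta in (0, 1/2].
    loss_matrix: (T, M) expert losses in [0, 1]. Returns regret vs best expert."""
    T, M = loss_matrix.shape
    etas = np.geomspace(1e-3, 0.5, 20)     # grid prior over learning rates
    R = np.zeros(M)                        # cumulative instantaneous regrets
    V = np.zeros(M)                        # cumulative squared regrets
    total_loss = 0.0
    for t in range(T):
        # Expert weights: average over eta of eta * exp(eta*R - eta^2*V).
        expo = np.clip(etas[None, :] * R[:, None]
                       - etas[None, :] ** 2 * V[:, None], -700, 700)
        evid = np.mean(etas[None, :] * np.exp(expo), axis=1)
        p = evid / evid.sum()
        h = p @ loss_matrix[t]             # learner's mixed loss this round
        r = h - loss_matrix[t]             # instantaneous regrets
        R += r
        V += r ** 2
        total_loss += h
    return total_loss - loss_matrix.sum(axis=0).min()
```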

  35. Dessert: Easy Data Rather than Distributions • We are working with algorithms such as Hedge and Squint, designed for individual, nonstochastic sequences • Yet the condition is stochastic • Does there exist a nonstochastic analogue? • The answer is yes:
