a chaining algorithm for online nonparametric regression
Pierre Gaillard
December 2, 2015, University of Copenhagen
This is joint work with Sébastien Gerchinovitz
table of contents
1. Online prediction of arbitrary sequences
2. Finite reference class: prediction with expert advice
3. Large reference class
4. Extensions, current (and future) work
online prediction of arbitrary sequences
the framework of this talk

Sequential prediction of arbitrary time series¹
- a time series y_1, ..., y_n ∈ Y = [−B, B] is to be predicted step by step
- covariates x_1, ..., x_n ∈ X are sequentially available

At each forecasting instance t = 1, ..., n
- the environment reveals x_t ∈ X
- the player is asked to form a prediction ŷ_t of y_t based on
  – the past observations y_1, ..., y_{t−1}
  – the current and past covariates x_1, ..., x_t
- the environment reveals y_t

Goal: minimize the cumulative loss L̂_n = ∑_{t=1}^n (ŷ_t − y_t)².

Difficulty: no stochastic assumption on the time series
- neither on the observations (y_t)
- nor on the covariates (x_t)

¹ N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. 2006.
the framework of this talk

Sequential prediction of arbitrary time series:
- a time series y_1, ..., y_n ∈ Y = [−B, B] is to be predicted step by step
- covariates x_1, ..., x_n ∈ X are sequentially available

At each forecasting instance t = 1, ..., n
- the environment reveals x_t ∈ X
- solution: produce the prediction as a function of x_t, namely ŷ_t = f̂_t(x_t)
- the environment reveals y_t

Goal: minimize our regret against a reference function class F ⊂ Y^X, i.e., achieve

  Reg_n(F) := ∑_{t=1}^n (f̂_t(x_t) − y_t)² − inf_{f ∈ F} ∑_{t=1}^n (f(x_t) − y_t)² = o(n),
              (our performance)                (reference performance)
finite reference class: prediction with expert advice
a strategy for finite F

Assumption: F = {f_1, ..., f_K} ⊂ Y^X is finite.

The exponentially weighted average forecaster (EWA)¹
At each forecasting instance t,
- assign to each function f_k the weight

  p̂_{k,t} = exp(−η ∑_{s=1}^{t−1} (y_s − f_k(x_s))²) / ∑_{j=1}^K exp(−η ∑_{s=1}^{t−1} (y_s − f_j(x_s))²)

- form the function f̂_t = ∑_{k=1}^K p̂_{k,t} f_k and predict ŷ_t = f̂_t(x_t)

Performance: if Y = [−B, B] and η = 1/(8B²),

  Reg_n(F) = ∑_{t=1}^n (y_t − f̂_t(x_t))² − inf_{f ∈ F} ∑_{t=1}^n (y_t − f(x_t))² ⩽ 8B² log K

If B is not known in advance, η can be tuned online (doubling trick).

¹ Littlestone and M. K. Warmuth (1994) and Vovk (1990)
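To make the strategy concrete, here is a minimal Python sketch of EWA with square loss (not from the talk; the function name ewa_forecast and the representation of experts as plain callables are illustrative assumptions):

```python
import numpy as np

def ewa_forecast(experts, xs, ys, B):
    """EWA with square loss and eta = 1 / (8 B^2); returns the sequence of predictions."""
    K = len(experts)
    eta = 1.0 / (8.0 * B ** 2)
    cum_losses = np.zeros(K)              # sum_{s < t} (y_s - f_k(x_s))^2, one entry per expert
    predictions = []
    for x_t, y_t in zip(xs, ys):
        shifted = cum_losses - cum_losses.min()      # shift for numerical stability
        weights = np.exp(-eta * shifted)
        p = weights / weights.sum()                  # p_{k,t}
        expert_preds = np.array([f(x_t) for f in experts])
        predictions.append(float(p @ expert_preds))  # y_hat_t = f_hat_t(x_t)
        cum_losses += (y_t - expert_preds) ** 2      # update once y_t is revealed
    return np.array(predictions)

# tiny usage example with two constant experts
if __name__ == "__main__":
    experts = [lambda x: -0.5, lambda x: 0.5]
    xs = np.linspace(0.0, 1.0, 100)
    ys = 0.4 * np.ones(100)
    y_hat = ewa_forecast(experts, xs, ys, B=1.0)
    print(np.sum((ys - y_hat) ** 2))  # should stay within 8 B^2 log K of the best expert's loss
```

The shift by cum_losses.min() changes nothing mathematically (it cancels in the normalization) but avoids numerical underflow when the cumulative losses grow.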
proof

1. Upper bound the instantaneous loss: for η ⩽ 1/(8B²), by exp-concavity of the square loss on [−B, B],

  (y_t − f̂_t(x_t))² = (y_t − ∑_{k=1}^K p̂_{k,t} f_k(x_t))²
    ⩽ −(1/η) log( ∑_{k=1}^K p̂_{k,t} e^{−η (y_t − f_k(x_t))²} )
    = −(1/η) log( p̂_{k,t} e^{−η (y_t − f_k(x_t))²} / p̂_{k,t+1} )   by definition of p̂_{k,t+1}
    = (y_t − f_k(x_t))² + (1/η) log( p̂_{k,t+1} / p̂_{k,t} )

2. Sum over all t; the sum telescopes:

  ∑_{t=1}^n (y_t − f̂_t(x_t))² − (y_t − f_k(x_t))² ⩽ (1/η) log( p̂_{k,n+1} / p̂_{k,1} ) ⩽ (log K)/η = 8B² log K,

using p̂_{k,n+1} ⩽ 1 and p̂_{k,1} = 1/K. Since this holds for every k, Reg_n(F) ⩽ 8B² log K.
large reference class
approximate F by a finite class

Vovk (2001):
1. Approximate F by a finite set F_ε such that
     ∀ f ∈ F, ∃ f_ε ∈ F_ε : ∥f − f_ε∥_∞ ⩽ ε.   (1)
   Such a set F_ε is called an ε-net of F.
2. Run EWA on F_ε.

Definition (metric entropy)
The cardinality of the smallest ε-net F_ε satisfying (1) is denoted N_∞(F, ε). The metric entropy of F is log N_∞(F, ε).

Regret bound of order (forgetting constants):

  Reg_n(F) = Reg_n(F_ε) + inf_{f_ε ∈ F_ε} ∑_{t=1}^n (y_t − f_ε(x_t))² − inf_{f ∈ F} ∑_{t=1}^n (y_t − f(x_t))²
           ≲ log N_∞(F, ε) + εn,

where log N_∞(F, ε) is the regret of EWA on F_ε and εn is the approximation of F by F_ε.
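As an illustration of this net-then-aggregate strategy, the following sketch (my own, not from the talk) enumerates a crude O(ε)-net of the 1-Lipschitz ball on [0, 1] (the first example on the later slides) as step functions on an ε-grid; its size grows like exp(c/ε), consistent with log N_∞(F, ε) ≈ ε^{−1}, so it is only practical for moderate 1/ε:

```python
import numpy as np

def lipschitz_eps_net(eps, B):
    """Enumerate step functions forming an O(eps)-net of the 1-Lipschitz ball on [0, 1]."""
    m = int(np.ceil(1.0 / eps))                    # cells of width eps
    levels = np.arange(-B, B + eps / 2.0, eps)     # admissible values on the eps-grid
    net = []

    def extend(values):
        if len(values) == m:
            net.append(list(values))
            return
        for v in levels:
            if abs(v - values[-1]) <= eps + 1e-12:  # a 1-Lipschitz f moves by <= eps per cell
                extend(values + [v])

    for v0 in levels:
        extend([v0])
    return net                                      # size ~ exp(c / eps), i.e. log N ~ 1 / eps

def as_expert(values, eps):
    """Turn a vector of cell values into a step function on [0, 1]."""
    return lambda x: values[min(int(x / eps), len(values) - 1)]

# experts = [as_expert(v, eps) for v in lipschitz_eps_net(eps, B)]
# can then be fed to the EWA sketch above, which is the strategy described on this slide.
```

The step functions track any 1-Lipschitz f within a constant multiple of ε in sup norm, which is all the approximation argument needs; the εn term in the bound comes from this approximation error accumulated over n rounds.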
examples of reference classes: the parametric case

If N_∞(F, ε) ≲ ε^{−p} for some p > 0 as ε → 0, then

  Reg_n(F) ≲ log N_∞(F, ε) + εn ≈ log(ε^{−p}) + εn ≈ p log(n)   for ε ≈ 1/n   ➝ optimal

Example. Assume you have d ⩾ 1 black-box forecasters φ_1, ..., φ_d ∈ Y^X.
- linear regression in a compact ball:
    F = { ∑_{j=1}^d u_j φ_j : u ∈ Θ }, with Θ ⊂ R^d compact   →   N_∞(F, ε) ≲ ε^{−d}
- sparse linear regression²:
    F = { ∑_{j=1}^d u_j φ_j : u ∈ [0, 1]^d s.t. ∥u∥_1 = 1 and ∥u∥_0 = s }
    log N_∞(F, ε) ≲ log(d choose s) + s log(1 + 1/(ε√s))   →   Reg_n(F) ≲ s log(1 + dn/s)

² F. Gao, C.-K. Ing, and Y. Yang. "Metric entropy and sparse linear approximation of ℓq-hulls for 0 < q ≤ 1". In: Journal of Approximation Theory (2013).
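A quick check of the tuning ε ≈ 1/n (my own computation, constants ignored): minimizing p log(1/ε) + εn over ε > 0, the derivative −p/ε + n vanishes at ε = p/n, for which the bound is of order p log(n/p) + p ≈ p log(n).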
what if F is nonparametric?

If log N_∞(F, ε) ≲ ε^{−p} for some p > 0 as ε → 0, then

  Reg_n(F) ≲ log N_∞(F, ε) + εn ≲ ε^{−p} + εn ≈ n^{p/(p+1)}   for ε = n^{−1/(p+1)}

➝ suboptimal: the optimal rate is n^{p/(p+2)} if p < 2 and n^{1−1/p} if p > 2

Example
- 1-Lipschitz ball on [0, 1]:
    F = { f ∈ Y^X : |f(x) − f(y)| ⩽ ∥x − y∥ ∀ x, y ∈ X ⊂ [0, 1] }
    Then log N_∞(F, ε) ≈ ε^{−1}   →   Reg_n(F) ≲ √n   ➝ suboptimal (optimal: n^{1/3})
- Hölder ball on X ⊂ [0, 1] with regularity β = q + α > 1/2:
    F = { f ∈ Y^X : ∀ x, y ∈ X, |f^{(q)}(x) − f^{(q)}(y)| ⩽ |x − y|^α and ∀ k ⩽ q, ∥f^{(k)}∥_∞ ⩽ B }
    Then³ log N_∞(F, ε) ≈ ε^{−1/β}   →   Reg_n(F) ≲ n^{1/(1+β)}   ➝ suboptimal (optimal: n^{1/(1+2β)})

³ G.G. Lorentz. "Metric Entropy, Widths, and Superpositions of Functions". In: Amer. Math. Monthly 6 (1962).
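Where the tuning ε = n^{−1/(p+1)} comes from (my own computation, constants ignored): minimizing ε^{−p} + εn over ε > 0, the derivative −p ε^{−p−1} + n vanishes at ε = (p/n)^{1/(p+1)} ≈ n^{−1/(p+1)}, and both terms are then of order n^{p/(p+1)}.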
minimax rates

Theorem (Rakhlin and Sridharan 2014⁴)
The minimax rate of the regret is of order

  inf_{γ ⩾ ε ⩾ 0} { log N_seq(F, γ) + √n ∫_ε^γ √(log N_seq(F, τ)) dτ + εn },

where log N_seq(F, ε) ⩽ log N_∞(F, ε) is the sequential entropy of F.

- log N_∞(F, γ): regret of EWA against a γ-net ➝ crude approximation
- εn: approximation error of the ε-net ➝ fine approximation
- √n ∫_ε^γ √(log N_∞(F, τ)) dτ: from the large scale γ to the small scale ε

This term is a Dudley entropy integral, which appears in
- chaining to bound the supremum of a stochastic process (Dudley 1967)
- statistical learning with i.i.d. data to derive risk bounds (e.g., Massart 2007; Rakhlin et al. 2013)
- online learning with arbitrary sequences (Opper and Haussler 1997; Cesa-Bianchi and Lugosi 1999)

⁴ A. Rakhlin and K. Sridharan. "Online Nonparametric Regression". In: COLT (2014).
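As a sanity check (my own computation, constants dropped), plug the Hölder entropy log N_seq(F, τ) ≈ τ^{−1/β} with β > 1/2 into the theorem. Taking ε = 0 (the integral converges since 1/(2β) < 1), the bound becomes ≈ γ^{−1/β} + √n γ^{1−1/(2β)}; balancing the two terms gives γ ≈ n^{−β/(2β+1)} and a regret of order n^{1/(1+2β)}, which matches the optimal rate announced on the previous slide.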