Learnability, Stability and Strong Convexity
Nati Srebro (Toyota Technological Institute — Chicago), Ohad Shamir (Weizmann), Shai Shalev-Shwartz (HUJI), Karthik Sridharan (Cornell), Ambuj Tewari (Michigan)
2008–2011
Outline
• Theme: Role of Stability in Learning
• Story: Necessary and sufficient condition for learnability
• Characterizing (statistical) learnability
  – Stability as the master property
• Convex Problems
  – Strong convexity as the master property
• Stability in online learning
  – From Stability to Online Mirror Descent
The General Learning Setting [Vapnik 95], aka Stochastic Optimization

  minimize over w ∈ W:   F(w) = E_{z∼D}[ f(w, z) ]

given an i.i.d. sample z_1, z_2, …, z_n ∼ D.

• Known objective function f : W × Z → ℝ; unknown distribution D over z ∈ Z
• The problem specified by (W, Z, f) is learnable if there exists a learning rule ŵ(z_1, …, z_n) such that for every ε > 0, for large enough sample size n(ε) and any distribution D:

  E_{z_1,…,z_n∼D}[ F(ŵ) ] ≤ inf_{w∈W} F(w) + ε = F(w*) + ε
General Learning: Examples

Minimize F(w) = E_z[ f(w; z) ] based on a sample z_1, z_2, …, z_n

• Supervised learning: z = (x, y); w specifies a predictor h_w : X → Y;
  f(w; (x, y)) = loss(h_w(x), y); e.g. linear prediction: f(w; (x, y)) = loss(⟨w, x⟩, y)
• Unsupervised learning, e.g. k-means clustering: z = x ∈ ℝ^d;
  w = (μ[1], …, μ[k]) ∈ ℝ^{d×k} specifies k cluster centers;
  f((μ[1], …, μ[k]); x) = min_j ‖μ[j] − x‖²
• Density estimation: w specifies a probability density p_w(x); f(w; x) = −log p_w(x)
• Optimization in an uncertain environment, e.g.:
  z = traffic delays on each road segment;
  w = route chosen (indicator vector over the road segments in the route);
  f(w; z) = ⟨w, z⟩ = total delay along the route
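To make the template concrete, here is a minimal Python sketch (not from the talk) of the general learning setting: a problem is nothing more than a loss function f(w, z), and three of the examples above are instantiated below. All function names and choices (squared loss, row-stacked centers) are illustrative assumptions.

```python
import numpy as np

def linear_prediction_loss(w, z):
    """Supervised learning: z = (x, y), h_w(x) = <w, x>,
    f(w; (x, y)) = loss(h_w(x), y), here with the squared loss as an example."""
    x, y = z
    return (np.dot(w, x) - y) ** 2

def kmeans_loss(centers, x):
    """k-means clustering: w = (mu[1], ..., mu[k]) stacked as rows of `centers`,
    f(w; x) = min_j ||mu[j] - x||^2."""
    return float(np.min(np.sum((centers - x) ** 2, axis=1)))

def route_delay_loss(w, z):
    """Optimization under uncertainty: w is an indicator vector over road
    segments, z is the vector of segment delays, f(w; z) = <w, z>."""
    return float(np.dot(w, z))

def monte_carlo_objective(f, w, sample):
    """Plug-in estimate of F(w) = E_z[f(w; z)] from a sample z_1, ..., z_n."""
    return float(np.mean([f(w, z) for z in sample]))
```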
{ h_w | w ∈ W } has finite fat-shattering dimension
        ⇓
Uniform convergence:   sup_{w∈W} | F(w) − F̂(w) | →_{n→∞} 0
        ⇓
Learnable using ERM:   F(ŵ) − F(w*) →_{n→∞} 0,
  where F̂(w) = (1/n) Σ_{i=1}^n f(w, z_i) and ŵ = argmin_{w∈W} F̂(w)
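As a hedged illustration of the two quantities in this chain, the sketch below computes the empirical objective F̂, an empirical minimizer over a finite candidate set, and a rough Monte-Carlo estimate of the uniform-convergence gap. The finite candidate set, the toy distribution, and all names are simplifications introduced here, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(w, z):
    # squared loss of a 1-d linear predictor, as in the supervised example
    x, y = z
    return (w * x - y) ** 2

def empirical_objective(w, sample):
    # \hat F(w) = (1/n) sum_i f(w, z_i)
    return float(np.mean([f(w, z) for z in sample]))

def erm(candidates, sample):
    # \hat w = argmin over the candidate set of \hat F(w)
    values = [empirical_objective(w, sample) for w in candidates]
    return candidates[int(np.argmin(values))]

def draw(n):
    # toy distribution: y = 0.7 x + noise (illustrative, not from the slides)
    xs = rng.standard_normal(n)
    return [(x, 0.7 * x + 0.1 * rng.standard_normal()) for x in xs]

candidates = np.linspace(-2.0, 2.0, 41)
sample = draw(200)
w_hat = erm(candidates, sample)

# rough Monte-Carlo estimate of sup_w |F(w) - \hat F(w)| over the candidates
fresh = draw(2000)
gap = max(abs(empirical_objective(w, sample) - empirical_objective(w, fresh)) for w in candidates)
print("empirical minimizer:", w_hat, " uniform-convergence gap (approx):", gap)
```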
Supervised Classification, f(w; (x, y)) = loss(h_w(x), y):

{ h_w | w ∈ W } has finite fat-shattering dimension
        ⇕
Uniform convergence:   sup_{w∈W} | F(w) − F̂(w) | →_{n→∞} 0
        ⇕
Learnable using ERM:   F(ŵ) − F(w*) →_{n→∞} 0,   ŵ = argmin_{w∈W} F̂(w)
        ⇕
Learnable (using some rule):   F(w̃) − F(w*) →_{n→∞} 0      [Alon Ben-David Cesa-Bianchi Haussler 93]
Beyond Supervised Learning
• Supervised learning: f(w; (x, y)) = loss(h_w(x), y)
  – A combinatorial condition (finite fat-shattering dimension) is necessary and sufficient for learnability
  – Uniform convergence is necessary and sufficient for learnability
  – ERM is universal (if the problem is learnable at all, it is learnable with ERM)
• General learning / stochastic optimization, f(w, z): ????
Online Learning (Optimization)

Adversary:  f(·; z_1), f(·; z_2), f(·; z_3), …
Learner:    w_1, w_2, w_3, …

• Known function f(·, ·)
• Unknown sequence z_1, z_2, …
• Online learning rule: w_t(z_1, …, z_{t−1})
• Goal: minimize Σ_t f(w_t, z_t)

Differences vs. the stochastic setting:
• Any sequence, not necessarily i.i.d.
• No distinction between "train" and "test"
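A small sketch of this protocol, assuming a generic learner callback; the follow-the-leader learner over a finite candidate set is only a placeholder to make the loop runnable, not an algorithm advocated by the talk.

```python
import numpy as np

def f_linear(w, z):
    # squared loss of a 1-d linear predictor, as in the supervised example
    x, y = z
    return (w * x - y) ** 2

def online_learning(learner, f, z_sequence):
    """Generic online protocol: at round t the learner sees only z_1..z_{t-1},
    plays w_t, then z_t is revealed and the loss f(w_t, z_t) is paid."""
    past, total_loss = [], 0.0
    for z_t in z_sequence:
        w_t = learner(past)            # may depend only on z_1..z_{t-1}
        total_loss += f(w_t, z_t)      # adversary reveals z_t
        past.append(z_t)
    return total_loss

# placeholder learner: follow-the-leader over a small finite candidate set
candidates = np.linspace(-1.0, 1.0, 21)
def follow_the_leader(past):
    if not past:
        return candidates[0]
    sums = [sum(f_linear(w, z) for z in past) for w in candidates]
    return candidates[int(np.argmin(sums))]

sequence = [(1.0, 0.5), (-1.0, -0.5), (2.0, 1.0)]   # arbitrary, not i.i.d.
print(online_learning(follow_the_leader, f_linear, sequence))
```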
Online and Stochastic Regret

• Online regret: for any sequence,
    (1/n) Σ_{t=1}^n f( w_t(z_1, …, z_{t−1}), z_t ) ≤ inf_{w∈W} (1/n) Σ_{t=1}^n f(w, z_t) + Re(n)

• Statistical regret: for any distribution D,
    E_{z_1,…,z_n∼D}[ F( ŵ(z_1, …, z_n) ) ] ≤ inf_{w∈W} F(w) + ε(n) = F(w*) + ε(n)

• Online-to-batch conversion: take ŵ(z_1, …, z_n) = w_t with probability 1/n; then
    E[ F(ŵ) ] ≤ F(w*) + Re(n)
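A hedged sketch of the online-to-batch conversion described above: run an online rule over an i.i.d. sample and output one of its iterates uniformly at random. The toy "running mean" rule is only there to make the snippet self-contained.

```python
import numpy as np

def online_to_batch(online_rule, sample, rng):
    """Online-to-batch conversion as on the slide: run the online rule over an
    i.i.d. sample z_1..z_n and output w_t for a uniformly random round t
    (for convex losses, averaging the w_t gives the same expected guarantee)."""
    iterates, past = [], []
    for z_t in sample:
        iterates.append(online_rule(past))   # w_t depends only on z_1..z_{t-1}
        past.append(z_t)
    return iterates[rng.integers(len(iterates))]

# usage with a toy online rule that plays the running mean of past observations
rng = np.random.default_rng(1)
sample = list(rng.standard_normal(50))
running_mean = lambda past: float(np.mean(past)) if past else 0.0
print(online_to_batch(running_mean, sample, rng))
```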
Supervised Classification, f(w; (x, y)) = loss(h_w(x), y):

{ h_w | w ∈ W } has finite fat-shattering dimension
        ⇕
Uniform convergence:   sup_{w∈W} | F(w) − F̂(w) | →_{n→∞} 0
        ⇕
Learnable using ERM:   F(ŵ) − F(w*) →_{n→∞} 0,   ŵ = argmin_{w∈W} F̂(w)
        ⇕
Learnable (using some rule):   F(w̃) − F(w*) →_{n→∞} 0      [Alon Ben-David Cesa-Bianchi Haussler 93]
        ⇕
Online Learnable
Convex Lipschitz Problems

• W is a convex bounded subset of a Hilbert space (or of ℝ^d):  ∀ w ∈ W, ‖w‖₂ ≤ B
• For each z, f(w, z) is convex and L-Lipschitz w.r.t. w:
    | f(w, z) − f(w′, z) | ≤ L ‖w − w′‖₂
  e.g. f(w; (x, y)) = loss(⟨w, x⟩; y) with |loss′| ≤ 1 and ‖x‖₂ ≤ L
• Online Gradient Descent:  Re(n) ≤ √( B²L² / n )
• Stochastic setting:
  – For generalized linear problems (including supervised learning): matches the ERM rate
  – For general convex Lipschitz problems?
    • Learnable via online-to-batch conversion (SGD)
    • Using ERM?
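A minimal sketch of projected online gradient descent with the standard step size η_t = B/(L√t) for this setting; the absolute-loss example and all parameter choices are illustrative, and constants in the regret bound vary between references.

```python
import numpy as np

def project_ball(w, B):
    """Euclidean projection onto {w : ||w||_2 <= B}."""
    norm = np.linalg.norm(w)
    return w if norm <= B else w * (B / norm)

def online_gradient_descent(grad_f, z_sequence, dim, B, L):
    """Projected online gradient descent with step sizes eta_t = B/(L*sqrt(t)),
    a standard choice giving average regret on the order of B*L/sqrt(n) for
    convex L-Lipschitz losses over a ball of radius B (constants vary)."""
    w = np.zeros(dim)
    iterates = []
    for t, z in enumerate(z_sequence, start=1):
        iterates.append(w.copy())
        eta = B / (L * np.sqrt(t))
        w = project_ball(w - eta * grad_f(w, z), B)
    return iterates

def subgrad_abs_linear(w, z):
    # subgradient of the absolute loss |<w, x> - y|; with ||x|| <= 1 it is 1-Lipschitz in w
    x, y = z
    return np.sign(np.dot(w, x) - y) * x

rng = np.random.default_rng(0)
def draw_z():
    v = rng.standard_normal(5)
    return v / np.linalg.norm(v), rng.uniform(-1.0, 1.0)

zs = [draw_z() for _ in range(200)]
ws = online_gradient_descent(subgrad_abs_linear, zs, dim=5, B=1.0, L=1.0)
print(ws[-1])
```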
Center of Mass with Missing Data

  f(w; (I, x)) = Σ_{j∈I} ( w[j] − x[j] )²,    w ∈ ℝ^d, ‖w‖ ≤ 1,    I ⊆ [d],  x[j] for j ∈ I,  ‖x‖ ≤ 1

Consider P( j ∈ I ) = 1/2 independently for each coordinate j, and x = 0.

If d ≫ 2^n (think of d = ∞), then with high probability there is a coordinate j that is never observed in the sample, i.e. j ∉ I_i for all i = 1, …, n. For such a coordinate:

  F̂(e_j) = 0    while    F(e_j) = 1/2

so e_j is an empirical minimizer with F(e_j) = 1/2, far from F(w*) = F(0) = 0, and

  sup_{w∈W} | F(w) − F̂(w) | ≥ 1/2

No uniform convergence!
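The counterexample is easy to simulate; the sketch below (with illustrative sizes n = 10, d = 10⁵) finds an unobserved coordinate j and checks that F̂(e_j) = 0 while F(e_j) = 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 100_000            # d >> 2^n, so some coordinate is never observed w.h.p.

# each z_i is a random observed set I_i (each coordinate included w.p. 1/2); x = 0
masks = rng.integers(0, 2, size=(n, d)).astype(bool)

def f(w, mask):
    # f(w; (I, 0)) = sum_{j in I} w[j]^2
    return float(np.sum(w[mask] ** 2))

unseen = np.flatnonzero(~masks.any(axis=0))
assert unseen.size > 0, "increase d (or decrease n) so an unseen coordinate exists"
j = unseen[0]
e_j = np.zeros(d)
e_j[j] = 1.0

F_hat = np.mean([f(e_j, m) for m in masks])   # empirical objective at e_j: 0
F_pop = 0.5                                   # population objective: P(j in I) = 1/2
print(F_hat, F_pop)
```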
From supervised learning to the general learning setting: which of the equivalences survive?

{ z ↦ f(w; z) | w ∈ W } has finite fat-shattering dimension
        ⇕?
Uniform convergence:   sup_{w∈W} | F(w) − F̂(w) | →_{n→∞} 0
        ⇕?
Learnable with ERM:   F(ŵ) − F(w*) →_{n→∞} 0
        ⇕?
Learnable (using some rule):   F(w̃) − F(w*) →_{n→∞} 0
        ⇕?
Online Learnable
Stochastic Convex Optimization
• Empirical minimization might not be consistent
• Learnable using a specific procedural rule (online-to-batch conversion of online gradient descent)
• ??????????
Strongly Convex Objectives

f(w, z) is λ-strongly convex in w iff:

  f( (w + w′)/2 , z ) ≤ ( f(w, z) + f(w′, z) ) / 2 − (λ/8) ‖w − w′‖²

(for twice-differentiable f, equivalent to ∇²_w f(w, z) ≽ λ I)

If f(w, z) is λ-strongly convex and L-Lipschitz w.r.t. w:
• Online Gradient Descent [Hazan Kalai Kale Agarwal 2006]:   Re(n) ≤ O( L² log(n) / (λn) )
• Stochastic setting: does ERM achieve   E[ F(ŵ) ] ≤ F(w*) + O( L² / (λn) ) ?
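A hedged sketch of online gradient descent with the step size η_t = 1/(λt) used for strongly convex losses; the regularized squared-loss example is an illustrative λ-strongly-convex instance, not the talk's running example.

```python
import numpy as np

def ogd_strongly_convex(grad_f, z_sequence, dim, lam):
    """Online gradient descent with step sizes eta_t = 1/(lam*t), the standard
    choice for lam-strongly-convex losses; the average regret is
    O(L^2 log(n) / (lam n)) (constants differ across references)."""
    w = np.zeros(dim)
    iterates = []
    for t, z in enumerate(z_sequence, start=1):
        iterates.append(w.copy())
        w = w - grad_f(w, z) / (lam * t)
    return iterates

# illustrative lam-strongly-convex loss: f(w; (x, y)) = (w.x - y)^2 / 2 + (lam/2)||w||^2
lam = 0.1
def grad(w, z):
    x, y = z
    return (np.dot(w, x) - y) * x + lam * w

rng = np.random.default_rng(0)
zs = [(rng.standard_normal(3), rng.standard_normal()) for _ in range(100)]
ws = ogd_strongly_convex(grad, zs, dim=3, lam=lam)
print(ws[-1])
```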
Strong Convexity and Stability

• Definition: a rule ŵ(z_1, …, z_n) is β(n)-stable if
    f( ŵ(z_1, …, z_{n−1}), z_n ) − f( ŵ(z_1, …, z_n), z_n ) ≤ β(n)
• For a symmetric rule, β-stability ⇒ E[ F( ŵ(z_1, …, z_{n−1}) ) ] ≤ E[ F̂( ŵ(z_1, …, z_n) ) ] + β(n)
• For ERM:  E[ F̂(ŵ) ] ≤ E[ F̂(w*) ] = F(w*)
• If f is λ-strongly convex and L-Lipschitz, the ERM satisfies
    f( ŵ(z_1, …, z_{n−1}), z_n ) − f( ŵ(z_1, …, z_n), z_n ) ≤ β(n) = 4L² / (λn)
• Conclusion:  E[ F(ŵ) ] ≤ F(w*) + β(n) = F(w*) + 4L² / (λn)
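To make the stability definition concrete, the sketch below evaluates the quantity f(ŵ(z_1,…,z_{n−1}), z_n) − f(ŵ(z_1,…,z_n), z_n) for a ridge-regularized ERM, a convenient strongly convex empirical minimizer chosen here for illustration because it has a closed form; it is an assumption of this example, not the slide's abstract setting.

```python
import numpy as np

def erm_ridge(sample, lam, dim):
    """Empirical minimizer of (1/n) sum_i (w.x_i - y_i)^2 + lam * ||w||^2
    (closed form); a convenient strongly convex ERM used only for illustration."""
    X = np.array([x for x, _ in sample])
    ys = np.array([y for _, y in sample])
    n = len(sample)
    A = X.T @ X / n + lam * np.eye(dim)
    return np.linalg.solve(A, X.T @ ys / n)

def loo_stability_term(sample, lam, dim):
    """f(w_hat(z_1..z_{n-1}), z_n) - f(w_hat(z_1..z_n), z_n), evaluated on the
    last example z_n, matching the definition above."""
    x_n, y_n = sample[-1]
    loss = lambda w: (np.dot(w, x_n) - y_n) ** 2 + lam * np.dot(w, w)
    w_without = erm_ridge(sample[:-1], lam, dim)
    w_with = erm_ridge(sample, lam, dim)
    return loss(w_without) - loss(w_with)

rng = np.random.default_rng(0)
dim = 5
sample = [(rng.standard_normal(dim), rng.standard_normal()) for _ in range(200)]
print(loo_stability_term(sample, lam=1.0, dim=dim))
```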
Empirical Minimization Consistent, but is there Uniform Convergence?

  f(w; (I, x)) = Σ_{j∈I} ( w[j] − x[j] )² + (λ/2) ‖w‖²,    w ∈ ℝ^d, ‖w‖ ≤ 1,    I ⊆ [d],  x[j] for j ∈ I,  ‖x‖ ≤ 1

Consider P( j ∈ I ) = 1/2 independently for each coordinate j, and x = 0.

For a coordinate j that is never observed in the sample:

  F̂(t·e_j) = λt²/2    while    F(t·e_j) = t²/2 + λt²/2

No uniform convergence:   sup_{w∈W} | F(w) − F̂(w) | ≥ 1/2
In the general learning setting (the equivalences hold for supervised learning, but not in general):

{ z ↦ f(w; z) | w ∈ W } has finite fat-shattering dimension
        ⇕ (supervised learning)
Uniform convergence:   sup_{w∈W} | F(w) − F̂(w) | →_{n→∞} 0    (fails here, not even locally)
        ⇕ (supervised learning)
Empirical minimizer is consistent:   F(ŵ) − F(w*) →_{n→∞} 0
        ⇕ (supervised learning)
Solvable (using some algorithm):   F(w̃) − F(w*) →_{n→∞} 0
        ⇕
Online Learnable
Back to Weak Convexity

f(w, z) L-Lipschitz (and convex) in w,   ‖w‖₂ ≤ B

• Use regularized ERM:
    ŵ_λ = argmin_{w∈W}  F̂(w) + (λ/2) ‖w‖²
• Setting λ = √( L² / (B² n) ):
    E[ F(ŵ_λ) ] ≤ F(w*) + O( √( L²B² / n ) )
• Key: the strongly convex regularizer ensures stability
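A hedged sketch of the regularized-ERM recipe with λ = √(L²/(B²n)); the closed-form solution is specific to the squared loss of a linear predictor (chosen only to keep the example short), and the projection onto ‖w‖ ≤ B is omitted for simplicity.

```python
import numpy as np

def regularized_erm(sample, B, L, dim):
    """Minimize \\hat F(w) + (lam/2)||w||^2 with lam = sqrt(L^2/(B^2 n)), as in
    the slide's recipe. The closed form below is specific to the squared loss of
    a linear predictor, used only to keep the sketch short; the projection onto
    ||w|| <= B is omitted."""
    n = len(sample)
    lam = np.sqrt(L ** 2 / (B ** 2 * n))
    X = np.array([x for x, _ in sample])
    ys = np.array([y for _, y in sample])
    # stationarity condition of (1/n)||Xw - ys||^2 + (lam/2)||w||^2
    A = 2.0 * X.T @ X / n + lam * np.eye(dim)
    return np.linalg.solve(A, 2.0 * X.T @ ys / n)

rng = np.random.default_rng(0)
dim = 4
sample = [(rng.standard_normal(dim), rng.standard_normal()) for _ in range(500)]
w_lam = regularized_erm(sample, B=1.0, L=1.0, dim=dim)
print(np.linalg.norm(w_lam))
```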
The Role of Regularization

• Structural Risk Minimization view:
  – Adding a regularization term effectively constrains the domain to a lower-complexity domain W_r = { w : ‖w‖ ≤ r }
  – Learning guarantees (e.g. for SVMs, LASSO) are actually for empirical minimization inside W_r, and are based on uniform convergence in W_r
• In our case:
  – No uniform convergence in W_r, for any r > 0
  – No uniform convergence even of the regularized loss
  – Cannot solve the stochastic optimization problem by restricting to W_r, for any r
  – What regularization buys is stability