Learnability, Stability and Strong Convexity
Nati Srebro (Toyota Technological Institute — Chicago), Ohad Shamir (Weizmann), Shai Shalev-Shwartz (HUJI), Karthik Sridharan (Cornell), Ambuj Tewari (Michigan)
2008–2011
Outline
• Theme: Role of Stability in Learning
• Story: Necessary and sufficient condition for learnability
• Characterizing (statistical) learnability
  – Stability as the master property
• Convex Problems
  – Strong convexity as the master property
• Stability in online learning
  – From Stability to Online Mirror Descent
The General Learning Setting [Vapnik 95], aka Stochastic Optimization

  minimize over w ∈ W:   F(w) = E_{z∼D}[ f(w, z) ]

given an i.i.d. sample z_1, z_2, …, z_n ∼ D.

• Known objective function f : W × Z → ℝ; unknown distribution D over z ∈ Z
• The problem specified by (W, Z, f) is learnable if there exists a learning rule ŵ(z_1, …, z_n) such that for every ε > 0, for large enough sample size n(ε) and any distribution D:

  E_{z_1,…,z_n∼D}[ F(ŵ) ] ≤ inf_{w∈W} F(w) + ε = F(w*) + ε
General Learning: Examples

Minimize F(w) = E_z[ f(w; z) ] based on a sample z_1, z_2, …, z_n

• Supervised learning: z = (x, y); w specifies a predictor h_w : X → Y;
  f(w; (x, y)) = loss(h_w(x), y); e.g. linear prediction: f(w; (x, y)) = loss(⟨w, x⟩, y)
• Unsupervised learning, e.g. k-means clustering: z = x ∈ ℝ^d;
  w = (μ[1], …, μ[k]) ∈ ℝ^{d×k} specifies k cluster centers;
  f((μ[1], …, μ[k]); x) = min_j ‖μ[j] − x‖²
• Density estimation: w specifies a probability density p_w(x); f(w; x) = −log p_w(x)
• Optimization in an uncertain environment, e.g.:
  z = traffic delays on each road segment;
  w = route chosen (indicator vector over the road segments in the route);
  f(w; z) = ⟨w, z⟩ = total delay along the route
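To make the template concrete, here is a minimal Python sketch (not from the talk) of the general learning setting: a problem is nothing more than a loss function f(w, z), and three of the examples above are instantiated below. All function names and choices (squared loss, row-stacked centers) are illustrative assumptions.

```python
import numpy as np

def linear_prediction_loss(w, z):
    """Supervised learning: z = (x, y), h_w(x) = <w, x>,
    f(w; (x, y)) = loss(h_w(x), y), here with the squared loss as an example."""
    x, y = z
    return (np.dot(w, x) - y) ** 2

def kmeans_loss(centers, x):
    """k-means clustering: w = (mu[1], ..., mu[k]) stacked as rows of `centers`,
    f(w; x) = min_j ||mu[j] - x||^2."""
    return float(np.min(np.sum((centers - x) ** 2, axis=1)))

def route_delay_loss(w, z):
    """Optimization under uncertainty: w is an indicator vector over road
    segments, z is the vector of segment delays, f(w; z) = <w, z>."""
    return float(np.dot(w, z))

def monte_carlo_objective(f, w, sample):
    """Plug-in estimate of F(w) = E_z[f(w; z)] from a sample z_1, ..., z_n."""
    return float(np.mean([f(w, z) for z in sample]))
```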
{ h_w | w ∈ W } has finite fat-shattering dimension
        ⇓
Uniform convergence:   sup_{w∈W} | F(w) − F̂(w) | →_{n→∞} 0
        ⇓
Learnable using ERM:   F(ŵ) − F(w*) →_{n→∞} 0,
  where F̂(w) = (1/n) Σ_{i=1}^n f(w, z_i) and ŵ = argmin_{w∈W} F̂(w)
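As a hedged illustration of the two quantities in this chain, the sketch below computes the empirical objective F̂, an empirical minimizer over a finite candidate set, and a rough Monte-Carlo estimate of the uniform-convergence gap. The finite candidate set, the toy distribution, and all names are simplifications introduced here, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(w, z):
    # squared loss of a 1-d linear predictor, as in the supervised example
    x, y = z
    return (w * x - y) ** 2

def empirical_objective(w, sample):
    # \hat F(w) = (1/n) sum_i f(w, z_i)
    return float(np.mean([f(w, z) for z in sample]))

def erm(candidates, sample):
    # \hat w = argmin over the candidate set of \hat F(w)
    values = [empirical_objective(w, sample) for w in candidates]
    return candidates[int(np.argmin(values))]

def draw(n):
    # toy distribution: y = 0.7 x + noise (illustrative, not from the slides)
    xs = rng.standard_normal(n)
    return [(x, 0.7 * x + 0.1 * rng.standard_normal()) for x in xs]

candidates = np.linspace(-2.0, 2.0, 41)
sample = draw(200)
w_hat = erm(candidates, sample)

# rough Monte-Carlo estimate of sup_w |F(w) - \hat F(w)| over the candidates
fresh = draw(2000)
gap = max(abs(empirical_objective(w, sample) - empirical_objective(w, fresh)) for w in candidates)
print("empirical minimizer:", w_hat, " uniform-convergence gap (approx):", gap)
```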
Supervised Classification, f(w; (x, y)) = loss(h_w(x), y):

{ h_w | w ∈ W } has finite fat-shattering dimension
        ⇕
Uniform convergence:   sup_{w∈W} | F(w) − F̂(w) | →_{n→∞} 0
        ⇕
Learnable using ERM:   F(ŵ) − F(w*) →_{n→∞} 0,   ŵ = argmin_{w∈W} F̂(w)
        ⇕
Learnable (using some rule):   F(w̃) − F(w*) →_{n→∞} 0      [Alon Ben-David Cesa-Bianchi Haussler 93]
Beyond Supervised Learning
• Supervised learning: f(w; (x, y)) = loss(h_w(x), y)
  – A combinatorial condition (finite fat-shattering dimension) is necessary and sufficient for learnability
  – Uniform convergence is necessary and sufficient for learnability
  – ERM is universal (if the problem is learnable at all, it is learnable with ERM)
• General learning / stochastic optimization, f(w, z): ????
Online Learning (Optimization)

Adversary:  f(·; z_1), f(·; z_2), f(·; z_3), …
Learner:    w_1, w_2, w_3, …

• Known function f(·, ·)
• Unknown sequence z_1, z_2, …
• Online learning rule: w_t(z_1, …, z_{t−1})
• Goal: minimize Σ_t f(w_t, z_t)

Differences vs. the stochastic setting:
• Any sequence, not necessarily i.i.d.
• No distinction between "train" and "test"
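A small sketch of this protocol, assuming a generic learner callback; the follow-the-leader learner over a finite candidate set is only a placeholder to make the loop runnable, not an algorithm advocated by the talk.

```python
import numpy as np

def f_linear(w, z):
    # squared loss of a 1-d linear predictor, as in the supervised example
    x, y = z
    return (w * x - y) ** 2

def online_learning(learner, f, z_sequence):
    """Generic online protocol: at round t the learner sees only z_1..z_{t-1},
    plays w_t, then z_t is revealed and the loss f(w_t, z_t) is paid."""
    past, total_loss = [], 0.0
    for z_t in z_sequence:
        w_t = learner(past)            # may depend only on z_1..z_{t-1}
        total_loss += f(w_t, z_t)      # adversary reveals z_t
        past.append(z_t)
    return total_loss

# placeholder learner: follow-the-leader over a small finite candidate set
candidates = np.linspace(-1.0, 1.0, 21)
def follow_the_leader(past):
    if not past:
        return candidates[0]
    sums = [sum(f_linear(w, z) for z in past) for w in candidates]
    return candidates[int(np.argmin(sums))]

sequence = [(1.0, 0.5), (-1.0, -0.5), (2.0, 1.0)]   # arbitrary, not i.i.d.
print(online_learning(follow_the_leader, f_linear, sequence))
```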
Online and Stochastic Regret

• Online regret: for any sequence,
    (1/n) Σ_{t=1}^n f( w_t(z_1, …, z_{t−1}), z_t ) ≤ inf_{w∈W} (1/n) Σ_{t=1}^n f(w, z_t) + Re(n)

• Statistical regret: for any distribution D,
    E_{z_1,…,z_n∼D}[ F( ŵ(z_1, …, z_n) ) ] ≤ inf_{w∈W} F(w) + ε(n) = F(w*) + ε(n)

• Online-to-batch conversion: take ŵ(z_1, …, z_n) = w_t with probability 1/n; then
    E[ F(ŵ) ] ≤ F(w*) + Re(n)
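A hedged sketch of the online-to-batch conversion described above: run an online rule over an i.i.d. sample and output one of its iterates uniformly at random. The toy "running mean" rule is only there to make the snippet self-contained.

```python
import numpy as np

def online_to_batch(online_rule, sample, rng):
    """Online-to-batch conversion as on the slide: run the online rule over an
    i.i.d. sample z_1..z_n and output w_t for a uniformly random round t
    (for convex losses, averaging the w_t gives the same expected guarantee)."""
    iterates, past = [], []
    for z_t in sample:
        iterates.append(online_rule(past))   # w_t depends only on z_1..z_{t-1}
        past.append(z_t)
    return iterates[rng.integers(len(iterates))]

# usage with a toy online rule that plays the running mean of past observations
rng = np.random.default_rng(1)
sample = list(rng.standard_normal(50))
running_mean = lambda past: float(np.mean(past)) if past else 0.0
print(online_to_batch(running_mean, sample, rng))
```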
Supervised Classification, f(w; (x, y)) = loss(h_w(x), y):

{ h_w | w ∈ W } has finite fat-shattering dimension
        ⇕
Uniform convergence:   sup_{w∈W} | F(w) − F̂(w) | →_{n→∞} 0
        ⇕
Learnable using ERM:   F(ŵ) − F(w*) →_{n→∞} 0,   ŵ = argmin_{w∈W} F̂(w)
        ⇕
Learnable (using some rule):   F(w̃) − F(w*) →_{n→∞} 0      [Alon Ben-David Cesa-Bianchi Haussler 93]
        ⇕
Online Learnable
Convex Lipschitz Problems

• W is a convex bounded subset of a Hilbert space (or of ℝ^d):  ∀ w ∈ W, ‖w‖₂ ≤ B
• For each z, f(w, z) is convex and L-Lipschitz w.r.t. w:
    | f(w, z) − f(w′, z) | ≤ L ‖w − w′‖₂
  e.g. f(w; (x, y)) = loss(⟨w, x⟩; y) with |loss′| ≤ 1 and ‖x‖₂ ≤ L
• Online Gradient Descent:  Re(n) ≤ √( B²L² / n )
• Stochastic setting:
  – For generalized linear problems (including supervised learning): matches the ERM rate
  – For general convex Lipschitz problems?
    • Learnable via online-to-batch conversion (SGD)
    • Using ERM?
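A minimal sketch of projected online gradient descent with the standard step size η_t = B/(L√t) for this setting; the absolute-loss example and all parameter choices are illustrative, and constants in the regret bound vary between references.

```python
import numpy as np

def project_ball(w, B):
    """Euclidean projection onto {w : ||w||_2 <= B}."""
    norm = np.linalg.norm(w)
    return w if norm <= B else w * (B / norm)

def online_gradient_descent(grad_f, z_sequence, dim, B, L):
    """Projected online gradient descent with step sizes eta_t = B/(L*sqrt(t)),
    a standard choice giving average regret on the order of B*L/sqrt(n) for
    convex L-Lipschitz losses over a ball of radius B (constants vary)."""
    w = np.zeros(dim)
    iterates = []
    for t, z in enumerate(z_sequence, start=1):
        iterates.append(w.copy())
        eta = B / (L * np.sqrt(t))
        w = project_ball(w - eta * grad_f(w, z), B)
    return iterates

def subgrad_abs_linear(w, z):
    # subgradient of the absolute loss |<w, x> - y|; with ||x|| <= 1 it is 1-Lipschitz in w
    x, y = z
    return np.sign(np.dot(w, x) - y) * x

rng = np.random.default_rng(0)
def draw_z():
    v = rng.standard_normal(5)
    return v / np.linalg.norm(v), rng.uniform(-1.0, 1.0)

zs = [draw_z() for _ in range(200)]
ws = online_gradient_descent(subgrad_abs_linear, zs, dim=5, B=1.0, L=1.0)
print(ws[-1])
```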
Center of Mass with Missing Data

  f(w; (I, x)) = Σ_{j∈I} ( w[j] − x[j] )²,    w ∈ ℝ^d, ‖w‖ ≤ 1,    I ⊆ [d],  x[j] for j ∈ I,  ‖x‖ ≤ 1

Consider P( j ∈ I ) = 1/2 independently for each coordinate j, and x = 0.

If d ≫ 2^n (think of d = ∞), then with high probability there is a coordinate j that is never observed in the sample, i.e. j ∉ I_i for all i = 1, …, n. For such a coordinate:

  F̂(e_j) = 0    while    F(e_j) = 1/2

so e_j is an empirical minimizer with F(e_j) = 1/2, far from F(w*) = F(0) = 0, and

  sup_{w∈W} | F(w) − F̂(w) | ≥ 1/2

No uniform convergence!
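The counterexample is easy to simulate; the sketch below (with illustrative sizes n = 10, d = 10⁵) finds an unobserved coordinate j and checks that F̂(e_j) = 0 while F(e_j) = 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 100_000            # d >> 2^n, so some coordinate is never observed w.h.p.

# each z_i is a random observed set I_i (each coordinate included w.p. 1/2); x = 0
masks = rng.integers(0, 2, size=(n, d)).astype(bool)

def f(w, mask):
    # f(w; (I, 0)) = sum_{j in I} w[j]^2
    return float(np.sum(w[mask] ** 2))

unseen = np.flatnonzero(~masks.any(axis=0))
assert unseen.size > 0, "increase d (or decrease n) so an unseen coordinate exists"
j = unseen[0]
e_j = np.zeros(d)
e_j[j] = 1.0

F_hat = np.mean([f(e_j, m) for m in masks])   # empirical objective at e_j: 0
F_pop = 0.5                                   # population objective: P(j in I) = 1/2
print(F_hat, F_pop)
```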
From supervised learning to the general learning setting: which of the equivalences survive?

{ z ↦ f(w; z) | w ∈ W } has finite fat-shattering dimension
        ⇕?
Uniform convergence:   sup_{w∈W} | F(w) − F̂(w) | →_{n→∞} 0
        ⇕?
Learnable with ERM:   F(ŵ) − F(w*) →_{n→∞} 0
        ⇕?
Learnable (using some rule):   F(w̃) − F(w*) →_{n→∞} 0
        ⇕?
Online Learnable
Stochastic Convex Optimization
• Empirical minimization might not be consistent
• Learnable using a specific procedural rule (online-to-batch conversion of online gradient descent)
• ??????????
Strongly Convex Objectives

f(w, z) is λ-strongly convex in w iff:

  f( (w + w′)/2 , z ) ≤ ( f(w, z) + f(w′, z) ) / 2 − (λ/8) ‖w − w′‖²

(for twice-differentiable f, equivalent to ∇²_w f(w, z) ≽ λ I)

If f(w, z) is λ-strongly convex and L-Lipschitz w.r.t. w:
• Online Gradient Descent [Hazan Kalai Kale Agarwal 2006]:   Re(n) ≤ O( L² log(n) / (λn) )
• Stochastic setting: does ERM achieve   E[ F(ŵ) ] ≤ F(w*) + O( L² / (λn) ) ?
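A hedged sketch of online gradient descent with the step size η_t = 1/(λt) used for strongly convex losses; the regularized squared-loss example is an illustrative λ-strongly-convex instance, not the talk's running example.

```python
import numpy as np

def ogd_strongly_convex(grad_f, z_sequence, dim, lam):
    """Online gradient descent with step sizes eta_t = 1/(lam*t), the standard
    choice for lam-strongly-convex losses; the average regret is
    O(L^2 log(n) / (lam n)) (constants differ across references)."""
    w = np.zeros(dim)
    iterates = []
    for t, z in enumerate(z_sequence, start=1):
        iterates.append(w.copy())
        w = w - grad_f(w, z) / (lam * t)
    return iterates

# illustrative lam-strongly-convex loss: f(w; (x, y)) = (w.x - y)^2 / 2 + (lam/2)||w||^2
lam = 0.1
def grad(w, z):
    x, y = z
    return (np.dot(w, x) - y) * x + lam * w

rng = np.random.default_rng(0)
zs = [(rng.standard_normal(3), rng.standard_normal()) for _ in range(100)]
ws = ogd_strongly_convex(grad, zs, dim=3, lam=lam)
print(ws[-1])
```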
Strong Convexity and Stability

• Definition: a rule ŵ(z_1, …, z_n) is β(n)-stable if
    f( ŵ(z_1, …, z_{n−1}), z_n ) − f( ŵ(z_1, …, z_n), z_n ) ≤ β(n)
• For a symmetric rule, β-stability ⇒ E[ F( ŵ(z_1, …, z_{n−1}) ) ] ≤ E[ F̂( ŵ(z_1, …, z_n) ) ] + β(n)
• For ERM:  E[ F̂(ŵ) ] ≤ E[ F̂(w*) ] = F(w*)
• If f is λ-strongly convex and L-Lipschitz, the ERM satisfies
    f( ŵ(z_1, …, z_{n−1}), z_n ) − f( ŵ(z_1, …, z_n), z_n ) ≤ β(n) = 4L² / (λn)
• Conclusion:  E[ F(ŵ) ] ≤ F(w*) + β(n) = F(w*) + 4L² / (λn)
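To make the stability definition concrete, the sketch below evaluates the quantity f(ŵ(z_1,…,z_{n−1}), z_n) − f(ŵ(z_1,…,z_n), z_n) for a ridge-regularized ERM, a convenient strongly convex empirical minimizer chosen here for illustration because it has a closed form; it is an assumption of this example, not the slide's abstract setting.

```python
import numpy as np

def erm_ridge(sample, lam, dim):
    """Empirical minimizer of (1/n) sum_i (w.x_i - y_i)^2 + lam * ||w||^2
    (closed form); a convenient strongly convex ERM used only for illustration."""
    X = np.array([x for x, _ in sample])
    ys = np.array([y for _, y in sample])
    n = len(sample)
    A = X.T @ X / n + lam * np.eye(dim)
    return np.linalg.solve(A, X.T @ ys / n)

def loo_stability_term(sample, lam, dim):
    """f(w_hat(z_1..z_{n-1}), z_n) - f(w_hat(z_1..z_n), z_n), evaluated on the
    last example z_n, matching the definition above."""
    x_n, y_n = sample[-1]
    loss = lambda w: (np.dot(w, x_n) - y_n) ** 2 + lam * np.dot(w, w)
    w_without = erm_ridge(sample[:-1], lam, dim)
    w_with = erm_ridge(sample, lam, dim)
    return loss(w_without) - loss(w_with)

rng = np.random.default_rng(0)
dim = 5
sample = [(rng.standard_normal(dim), rng.standard_normal()) for _ in range(200)]
print(loo_stability_term(sample, lam=1.0, dim=dim))
```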
Empirical Minimization Consistent, but is there Uniform Convergence?

  f(w; (I, x)) = Σ_{j∈I} ( w[j] − x[j] )² + (λ/2) ‖w‖²,    w ∈ ℝ^d, ‖w‖ ≤ 1,    I ⊆ [d],  x[j] for j ∈ I,  ‖x‖ ≤ 1

Consider P( j ∈ I ) = 1/2 independently for each coordinate j, and x = 0.

For a coordinate j that is never observed in the sample:

  F̂(t·e_j) = λt²/2    while    F(t·e_j) = t²/2 + λt²/2

No uniform convergence:   sup_{w∈W} | F(w) − F̂(w) | ≥ 1/2
In the general learning setting (the equivalences hold for supervised learning, but not in general):

{ z ↦ f(w; z) | w ∈ W } has finite fat-shattering dimension
        ⇕ (supervised learning)
Uniform convergence:   sup_{w∈W} | F(w) − F̂(w) | →_{n→∞} 0    (fails here, not even locally)
        ⇕ (supervised learning)
Empirical minimizer is consistent:   F(ŵ) − F(w*) →_{n→∞} 0
        ⇕ (supervised learning)
Solvable (using some algorithm):   F(w̃) − F(w*) →_{n→∞} 0
        ⇕
Online Learnable
Back to Weak Convexity

f(w, z) L-Lipschitz (and convex) in w,   ‖w‖₂ ≤ B

• Use regularized ERM:
    ŵ_λ = argmin_{w∈W}  F̂(w) + (λ/2) ‖w‖²
• Setting λ = √( L² / (B² n) ):
    E[ F(ŵ_λ) ] ≤ F(w*) + O( √( L²B² / n ) )
• Key: the strongly convex regularizer ensures stability
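A hedged sketch of the regularized-ERM recipe with λ = √(L²/(B²n)); the closed-form solution is specific to the squared loss of a linear predictor (chosen only to keep the example short), and the projection onto ‖w‖ ≤ B is omitted for simplicity.

```python
import numpy as np

def regularized_erm(sample, B, L, dim):
    """Minimize \\hat F(w) + (lam/2)||w||^2 with lam = sqrt(L^2/(B^2 n)), as in
    the slide's recipe. The closed form below is specific to the squared loss of
    a linear predictor, used only to keep the sketch short; the projection onto
    ||w|| <= B is omitted."""
    n = len(sample)
    lam = np.sqrt(L ** 2 / (B ** 2 * n))
    X = np.array([x for x, _ in sample])
    ys = np.array([y for _, y in sample])
    # stationarity condition of (1/n)||Xw - ys||^2 + (lam/2)||w||^2
    A = 2.0 * X.T @ X / n + lam * np.eye(dim)
    return np.linalg.solve(A, 2.0 * X.T @ ys / n)

rng = np.random.default_rng(0)
dim = 4
sample = [(rng.standard_normal(dim), rng.standard_normal()) for _ in range(500)]
w_lam = regularized_erm(sample, B=1.0, L=1.0, dim=dim)
print(np.linalg.norm(w_lam))
```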
The Role of Regularization

• Structural Risk Minimization view:
  – Adding a regularization term effectively constrains the domain to a lower-complexity domain W_r = { w : ‖w‖ ≤ r }
  – Learning guarantees (e.g. for SVMs, LASSO) are actually for empirical minimization inside W_r, and are based on uniform convergence in W_r
• In our case:
  – No uniform convergence in W_r, for any r > 0
  – No uniform convergence even of the regularized loss
  – Cannot solve the stochastic optimization problem by restricting to W_r, for any r
  – What regularization buys is stability