Stochastic Composition Optimization Algorithms and Sample Complexities
Mengdi Wang
Joint works with Ethan X. Fang, Han Liu, and Ji Liu
ORFE@Princeton
ICCOPT, Tokyo, August 8-11, 2016

Collaborators
• M. Wang, E. X. Fang, and H. Liu. Stochastic Compositional Gradient Descent: Algorithms for Minimizing Compositions of Expected-Value Functions. Mathematical Programming, submitted in 2014, to appear in 2016.
• M. Wang and J. Liu. Accelerating Stochastic Composition Optimization. 2016.
• M. Wang and J. Liu. A Stochastic Compositional Subgradient Method Using Markov Samples. 2016.

Outline
1. Background: Why is SGD a good method?
2. A New Problem: Stochastic Composition Optimization
3. Stochastic Composition Algorithms: Convergence and Sample Complexity
4. Acceleration via Smoothing-Extrapolation

Background: Why is SGD a good method?
Background
• Machine learning is optimization
  Learning from batch data: $\min_{x \in \Re^d} \frac{1}{n} \sum_{i=1}^{n} \ell(x; A_i, b_i) + \rho(x)$
  Learning from online data: $\min_{x \in \Re^d} \mathbb{E}_{A,b}[\ell(x; A, b)] + \rho(x)$
• Both problems can be formulated as Stochastic Convex Optimization
  $\min_x \underbrace{\mathbb{E}[f(x, \xi)]}_{\text{expectation over batch data set or unknown distribution}}$
  A general framework that encompasses likelihood estimation, online learning, empirical risk minimization, multi-armed bandits, and online MDPs
• Stochastic gradient descent (SGD) updates by taking sample gradients:
  $x_{k+1} = x_k - \alpha \nabla f(x_k, \xi_k)$
  A special case of stochastic approximation with a long history (Robbins and Monro, Kushner and Yin, Polyak and Juditsky, Benveniste et al., Ruszczyński, Borkar, Bertsekas and Tsitsiklis, and many others)

Background: Stochastic first-order methods
• Stochastic gradient descent (SGD) updates by taking sample gradients:
  $x_{k+1} = x_k - \alpha \nabla f(x_k, \xi_k)$
  (1,410,000 results on Google Scholar Search and 24,400 since 2016!)
Why is SGD a good method in practice?
• When processing either batch or online data, a scalable algorithm needs to update using partial information (a small subset of all data)
• Answer: We have no other choice
Why is SGD a good method beyond practical reasons?
• SGD achieves optimal convergence after processing k samples:
  • $\mathbb{E}[F(x_k) - F^*] = O(1/\sqrt{k})$ for convex minimization
  • $\mathbb{E}[F(x_k) - F^*] = O(1/k)$ for strongly convex minimization
  (Nemirovski and Yudin 1983, Agarwal et al. 2012, Rakhlin et al. 2012, Ghadimi and Lan 2012, 2013, Shamir and Zhang 2013, and many more)
• Beyond convexity: nearly optimal online PCA (Li, Wang, Liu, Zhang 2015)
• Answer: Strong theoretical guarantees for data-driven problems
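As a minimal sketch of the SGD update $x_{k+1} = x_k - \alpha \nabla f(x_k, \xi_k)$, assume the hypothetical one-dimensional loss $f(x, \xi) = (x - \xi)^2 / 2$ with $\xi \sim N(2, 1)$, so that $F(x) = \mathbb{E}[f(x, \xi)]$ is minimized at $x = 2$. The function name `sgd`, the diminishing stepsize $\alpha_k = 1/k$, and the data distribution are illustrative assumptions, not taken from the slides.

```python
import random

def sgd(num_steps, seed=0):
    """Run SGD on F(x) = E[(x - xi)^2 / 2] with xi ~ N(2, 1) (toy problem)."""
    rng = random.Random(seed)
    x = 0.0
    for k in range(1, num_steps + 1):
        xi = rng.gauss(2.0, 1.0)   # one fresh data sample per step
        grad = x - xi              # sample gradient of (x - xi)^2 / 2
        x -= (1.0 / k) * grad      # assumed diminishing stepsize alpha_k = 1/k
    return x

x_final = sgd(10_000)
```

With this particular stepsize the iterate is exactly the running mean of the samples seen so far, which makes the "update using partial information" point easy to check by hand.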

A New Problem: Stochastic Composition Optimization
Stochastic Composition Optimization
Consider the problem
  $\min_{x \in \mathcal{X}} \{ F(x) := (f \circ g)(x) = f(g(x)) \}$,
where the outer and inner functions are $f: \Re^m \to \Re$, $g: \Re^n \to \Re^m$,
  $f(y) = \mathbb{E}[f_v(y)]$, $g(x) = \mathbb{E}[g_w(x)]$,
and $\mathcal{X}$ is a closed and convex set in $\Re^n$.
• We focus on the case where the overall problem is convex (for now)
• No structural assumptions on f, g (nonconvex/nonmonotone/nondifferentiable)
• We may not know the distribution of v, w.

Expectation Minimization vs. Stochastic Composition Optimization
Recall the classical problem:
  $\min_{x \in \mathcal{X}} \underbrace{\mathbb{E}[f(x, \xi)]}_{\text{linear w.r.t. the distribution of } \xi}$
In stochastic composition optimization, the objective is no longer a linear functional of the (v, w) distribution:
  $\min_{x \in \mathcal{X}} \underbrace{\mathbb{E}[f_v(\mathbb{E}[g_w(x)])]}_{\text{nonlinear w.r.t. the distribution of } (w, v)}$
• In the classical problem, nice properties come from linearity w.r.t. the data distribution
• In stochastic composition optimization, they are all lost
A little nonlinearity goes a long way.

Motivating Example: High-Dimensional Nonparametric Estimation
• Sparse Additive Model (SpAM):
  $y_i = \sum_{j=1}^{d} h_j(x_{ij}) + \epsilon_i$.
• High-dimensional feature space with relatively few data samples:
  [Figure: data matrix with sample size n and d features, $n \ll d$]
• Optimization model for SpAM [1]:
  $\min_x \mathbb{E}[f_v(\mathbb{E}[g_w(x)])] \leftrightarrow \min_{h_j \in \mathcal{H}_j} \mathbb{E}\Big[\Big(Y - \sum_{j=1}^{d} h_j(X_j)\Big)^2\Big] + \lambda \sum_{j=1}^{d} \sqrt{\mathbb{E}[h_j^2(X_j)]}$
• The term $\lambda \sum_{j=1}^{d} \sqrt{\mathbb{E}[h_j^2(X_j)]}$ induces sparsity in the feature space.
[1] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B, 71(5):1009-1030, 2009.

Motivating Example: Risk-Averse Learning
Consider the mean-variance minimization problem
  $\min_x \mathbb{E}_{a,b}[\ell(x; a, b)] + \lambda \mathrm{Var}_{a,b}[\ell(x; a, b)]$.
Its batch version is
  $\min_x \frac{1}{N} \sum_{i=1}^{N} \ell(x; a_i, b_i) + \frac{\lambda}{N} \sum_{i=1}^{N} \Big( \ell(x; a_i, b_i) - \frac{1}{N} \sum_{j=1}^{N} \ell(x; a_j, b_j) \Big)^2$.
• The variance $\mathrm{Var}[Z] = \mathbb{E}[(Z - \mathbb{E}[Z])^2]$ is a composition of two functions
• Many other risk functions are equivalent to compositions of multiple expected-value functions (Shapiro, Dentcheva, Ruszczyński 2014)
• A central limit theorem for compositions of multiple smooth functions has been established for risk metrics (Dentcheva, Penev, Ruszczyński 2016)
• No good way to optimize a risk-averse objective while learning from online data
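One way to see the composition structure is sketched below. It assumes an inner map that collects the first two moments of the loss, $g(x) = (\mathbb{E}[\ell], \mathbb{E}[\ell^2])$, and a deterministic outer function $f(y_1, y_2) = y_1 + \lambda (y_2 - y_1^2)$, so that the mean-variance objective equals $f(g(x))$. The squared loss, the four data points, and all function names are hypothetical choices for illustration only.

```python
import statistics

def loss(x, a, b):
    """Hypothetical squared loss for a scalar linear model."""
    return (a * x - b) ** 2

def inner_g(x, data):
    """Inner expected-value map: empirical (E[loss], E[loss^2])."""
    losses = [loss(x, a, b) for a, b in data]
    n = len(losses)
    return sum(losses) / n, sum(l * l for l in losses) / n

def outer_f(y, lam):
    """Outer function: mean + lam * (second moment - mean^2)."""
    mean, second_moment = y
    return mean + lam * (second_moment - mean ** 2)

data = [(1.0, 2.0), (2.0, 1.0), (0.5, 3.0), (1.5, 0.0)]
x, lam = 0.7, 0.5

# Composition f(g(x)) versus the batch mean-variance objective computed directly.
composed = outer_f(inner_g(x, data), lam)
losses = [loss(x, a, b) for a, b in data]
direct = statistics.fmean(losses) + lam * statistics.pvariance(losses)
```

The two values agree up to rounding, because the population variance equals the second moment minus the squared mean.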

Motivating Example: Reinforcement Learning
On-policy reinforcement learning aims to learn the value-per-state of a stochastic system.
• We want to solve a (huge) Bellman equation
  $\gamma P^\pi V^\pi + r^\pi = V^\pi$,
  where $P^\pi$ is the transition probability matrix and $r^\pi$ are the rewards, both unknown.
• On-policy learning aims to solve the Bellman equation via blackbox simulation. It becomes a special stochastic composition optimization problem:
  $\min_x \mathbb{E}[f_v(\mathbb{E}[g_w(x)])] \leftrightarrow \min_{x \in \Re^S} \| \mathbb{E}[A] x - \mathbb{E}[b] \|^2$,
  where $\mathbb{E}[A] = I - \gamma P^\pi$ and $\mathbb{E}[b] = r^\pi$.
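To make the correspondence concrete, here is a small sketch for a hypothetical two-state chain in which $P^\pi$ and $r^\pi$ are known (in the learning setting both would only be available through samples). It solves $(I - \gamma P^\pi) V = r^\pi$ directly and lets one check that the objective $\| \mathbb{E}[A] x - \mathbb{E}[b] \|^2$ vanishes at $x = V^\pi$. The numbers and helper names are assumptions made for illustration.

```python
GAMMA = 0.9
P = [[0.8, 0.2],
     [0.3, 0.7]]   # hypothetical transition matrix P^pi
r = [1.0, 0.0]     # hypothetical reward vector r^pi

# E[A] = I - gamma * P^pi and E[b] = r^pi, as in the formulation above.
A = [[(1.0 if i == j else 0.0) - GAMMA * P[i][j] for j in range(2)]
     for i in range(2)]

def solve_2x2(mat, rhs):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = mat[0][0] * mat[1][1] - mat[0][1] * mat[1][0]
    return [(rhs[0] * mat[1][1] - mat[0][1] * rhs[1]) / det,
            (mat[0][0] * rhs[1] - rhs[0] * mat[1][0]) / det]

def residual_sq(x):
    """Objective ||E[A] x - E[b]||^2 for the toy chain."""
    return sum((sum(A[i][j] * x[j] for j in range(2)) - r[i]) ** 2
               for i in range(2))

V = solve_2x2(A, r)   # value-per-state of the toy chain
```

Evaluating `residual_sq(V)` gives zero up to rounding, while any other `x` gives a strictly positive value because $I - \gamma P^\pi$ is invertible for $\gamma < 1$.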

Stochastic Composition Algorithms: Convergence and Sample Complexity
Problem Formulation
  $\min_{x \in \mathcal{X}} \Big\{ F(x) := \underbrace{\mathbb{E}[f_v(\mathbb{E}[g_w(x)])]}_{\text{nonlinear w.r.t. the distribution of } (w, v)} \Big\}$
Sampling Oracle (SO)
Upon query (x, y), the oracle returns:
• Noisy inner sample $g_w(x)$ and its noisy subgradient $\tilde{\nabla} g_w(x)$;
• Noisy outer gradient $\nabla f_v(y)$
Challenges
• The stochastic gradient descent (SGD) method does not work, since an “unbiased” sample of the gradient $\tilde{\nabla} g(x_k) \nabla f(g(x_k))$ is not available.
• The Fenchel dual approach does not work except under rare conditions
• Sample average approximation (SAA) is subject to the curse of dimensionality.
• Sample complexity is unclear
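A one-dimensional worked example shows why the plug-in sample gradient is biased. The assumptions are chosen purely for illustration and are not from the talk: take a deterministic outer function $f(y) = y^2$ and $g_w(x) = w x$ with $\mathbb{E}[w] = \mu$ and $\mathrm{Var}[w] = \sigma^2 > 0$.

```latex
\begin{align*}
F(x) &= f\big(\mathbb{E}[g_w(x)]\big) = (\mu x)^2,
  \qquad \nabla F(x) = 2\mu^2 x, \\
\mathbb{E}\big[\nabla g_w(x)\,\nabla f(g_w(x))\big]
  &= \mathbb{E}[\,w \cdot 2 w x\,] = 2(\mu^2 + \sigma^2)\,x
  \;\neq\; \nabla F(x) \quad (x \neq 0).
\end{align*}
```

The extra term $2\sigma^2 x$ does not shrink as more iterations are run, which is why the inner value $g(x_k)$ has to be estimated separately rather than replaced by a single sample.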

Basic Idea
Approximate the iteration
  $x_{k+1} = \Pi_{\mathcal{X}} \{ x_k - \alpha_k \tilde{\nabla} g(x_k) \nabla f(g(x_k)) \}$
by a quasi-gradient iteration using estimates of $g(x_k)$.
Algorithm 1: Stochastic Compositional Gradient Descent (SCGD)
Require: $x_0, z_0 \in \Re^n$, $y_0 \in \Re^m$, SO, $K$, stepsizes $\{\alpha_k\}_{k=1}^{K}$ and $\{\beta_k\}_{k=1}^{K}$.
Ensure: $\{x_k\}_{k=1}^{K}$
for $k = 1, \dots, K$ do
  Query the SO and obtain $\tilde{\nabla} g_{w_k}(x_k)$, $g_{w_k}(x_k)$, $\nabla f_{v_k}(y_{k+1})$
  Update by
    $y_{k+1} = (1 - \beta_k) y_k + \beta_k g_{w_k}(x_k)$,
    $x_{k+1} = \Pi_{\mathcal{X}} \{ x_k - \alpha_k \tilde{\nabla} g_{w_k}(x_k) \nabla f_{v_k}(y_{k+1}) \}$
end for
Remarks
• Each iteration makes simple updates by interacting with the SO
• Scalable with large-scale batch data and can process streaming data points online
• First considered by Ermoliev (1976) as a stochastic approximation method, without rate analysis
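The two SCGD updates can be sketched in a few lines of Python on a toy scalar problem. Everything specific here is an assumption made for illustration: the inner samples $g_w(x) = w x$ with $w \sim U(0.5, 1.5)$ (so $g(x) = x$), the outer samples $f_v(y) = (y - v)^2$ with $v \sim N(3, 1)$ (so $F(x) = (x - 3)^2 + 1$ with minimizer $x^* = 3$), the feasible set $\mathcal{X} = [-10, 10]$, and the helper names. The stepsizes $\alpha_k = 1/k$, $\beta_k = k^{-2/3}$ are the ones quoted in the talk for the strongly convex case.

```python
import random

def project(x, lo, hi):
    """Euclidean projection onto the interval X = [lo, hi]."""
    return max(lo, min(hi, x))

def sampling_oracle(x, rng):
    """Toy SO: noisy inner value, noisy inner gradient, noisy outer gradient map."""
    w = rng.uniform(0.5, 1.5)    # E[w] = 1, so g(x) = E[w * x] = x
    v = rng.gauss(3.0, 1.0)      # E[v] = 3
    g_val = w * x                # noisy inner sample g_w(x)
    g_grad = w                   # gradient of g_w at x
    def outer_grad(y):
        return 2.0 * (y - v)     # gradient of f_v(y) = (y - v)^2
    return g_val, g_grad, outer_grad

def scgd(num_steps, x0=0.0, y0=0.0, lo=-10.0, hi=10.0, seed=0):
    rng = random.Random(seed)
    x, y = x0, y0
    for k in range(1, num_steps + 1):
        alpha = 1.0 / k                  # slow timescale: decision variable
        beta = k ** (-2.0 / 3.0)         # fast timescale: running estimate of g(x_k)
        g_val, g_grad, outer_grad = sampling_oracle(x, rng)
        y = (1.0 - beta) * y + beta * g_val                       # y_{k+1}
        x = project(x - alpha * g_grad * outer_grad(y), lo, hi)   # x_{k+1}
    return x

x_hat = scgd(20_000)
```

The auxiliary variable `y` is a weighted running average of inner samples, so the outer gradient is evaluated near $g(x_k)$ instead of at a single noisy sample; on this toy problem `x_hat` should land close to 3.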

Sample Complexity (Wang et al., 2016)
Under suitable conditions (inner function nonsmooth, outer function smooth) and with $\mathcal{X}$ bounded, let the stepsizes be $\alpha_k = k^{-3/4}$, $\beta_k = k^{-1/2}$. Then, if k is large enough,
  $\mathbb{E}\Big[ F\Big( \frac{2}{k} \sum_{t=k/2+1}^{k} x_t \Big) - F^* \Big] = O\Big( \frac{1}{k^{1/4}} \Big)$.
(Optimal rate which matches the lower bound for stochastic programming)
Sample Complexity in Strongly Convex Case (Wang et al., 2016)
Under suitable conditions (inner function nonsmooth, outer function smooth), suppose that the compositional function $F(\cdot)$ is strongly convex, and let the stepsizes be $\alpha_k = \frac{1}{k}$ and $\beta_k = \frac{1}{k^{2/3}}$. Then, if k is sufficiently large,
  $\mathbb{E}\big[ \| x_k - x^* \|^2 \big] = O\Big( \frac{1}{k^{2/3}} \Big)$.
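For reference, the two stepsize schedules and the averaged iterate $\frac{2}{k} \sum_{t=k/2+1}^{k} x_t$ from the first bound can be written as small helpers. The function names and the list-based storage of iterates are assumptions of this sketch, and `tail_average` assumes an even number of iterates.

```python
def stepsizes(k, strongly_convex=False):
    """Return (alpha_k, beta_k) for iteration k >= 1, as quoted in the two results."""
    if strongly_convex:
        return 1.0 / k, k ** (-2.0 / 3.0)
    return k ** (-0.75), k ** (-0.5)

def tail_average(iterates):
    """Average x_t over t = k/2 + 1, ..., k, where iterates = [x_1, ..., x_k]."""
    k = len(iterates)
    tail = iterates[k // 2:]   # 0-based slice holding x_{k/2+1}, ..., x_k
    return sum(tail) / len(tail)

alpha_16, beta_16 = stepsizes(16)
avg = tail_average([1.0, 2.0, 3.0, 4.0])
```

In both schedules $\beta_k$ decays more slowly than $\alpha_k$, so the estimate of the inner function is updated on a faster timescale than the decision variable.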
