Adaptive primal-dual stochastic gradient methods
Yangyang Xu
Mathematical Sciences, Rensselaer Polytechnic Institute
October 26, 2019
Stochastic gradient method

stochastic program:
$$\min_{x \in X} f(x) = \mathbb{E}_\xi [F(x; \xi)]$$
• if $\xi$ is uniform on $\{\xi_1, \ldots, \xi_N\}$, then $f(x) = \frac{1}{N} \sum_{i=1}^N F(x; \xi_i)$
• stochastic gradient (requires samples of $\xi$):
$$x^{k+1} = \mathrm{Proj}_X\left(x^k - \alpha_k g^k\right),$$
where $g^k$ is a stochastic approximation of $\nabla f(x^k)$
• low per-update complexity compared to deterministic gradient descent
• Literature: tons of works (e.g., [Robbins-Monro'51, Polyak-Juditsky'92, Nemirovski et al.'09, Ghadimi-Lan'13, Davis et al.'18])
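As an illustration (not from the talk), a minimal Python sketch of the projected stochastic gradient update on a hypothetical least-squares instance over the unit ball; the names grad_sample, proj, and alpha are stand-ins:

```python
import numpy as np

def projected_sgd(grad_sample, proj, x0, alpha, num_iters, rng):
    # x^{k+1} = Proj_X(x^k - alpha_k * g^k)
    x = x0.copy()
    for k in range(num_iters):
        g = grad_sample(x, rng)        # g^k: unbiased estimate of grad f(x^k)
        x = proj(x - alpha(k) * g)     # project back onto the simple set X
    return x

# toy instance: f(x) = (1/N) sum_i 0.5*(a_i^T x - b_i)^2 over the unit ball
rng = np.random.default_rng(0)
A, b = rng.standard_normal((1000, 5)), rng.standard_normal(1000)

def grad_sample(x, rng):
    i = rng.integers(len(b))           # sample one term, i.e., one xi_i
    return (A[i] @ x - b[i]) * A[i]    # gradient of F(x; xi_i)

def proj(x):                           # Proj onto X = {x : ||x||_2 <= 1}
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

x_hat = projected_sgd(grad_sample, proj, np.zeros(5),
                      lambda k: 0.5 / np.sqrt(k + 1), 5000, rng)
```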
Adaptive learning

• adaptive gradient (AdaGrad) [Duchi-Hazan-Singer'11]:
$$x^{k+1} = \mathrm{Proj}_X^{v^k}\left(x^k - \alpha_k \, g^k \oslash \sqrt{v^k}\right),$$
where $v^k = \sum_{t=0}^{k} (g^t)^2$ (elementwise)
• many other adaptive variants: Adam [Kingma-Ba'14], AMSGrad [Reddi-Kale-Kumar'18], and so on
• extremely popular in training deep neural networks
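A sketch of one AdaGrad-style update in the simplest setting $X = \mathbb{R}^n$ (so the weighted projection is trivial); the eps safeguard is a standard implementation detail not on the slide:

```python
import numpy as np

def adagrad_step(x, g, v, alpha, eps=1e-8):
    # v^k = sum_{t <= k} (g^t)^2, accumulated elementwise
    v = v + g * g
    # coordinate-wise scaled step x^k - alpha * g^k / sqrt(v^k);
    # eps guards against division by zero
    return x - alpha * g / (np.sqrt(v) + eps), v
```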
Adaptiveness improves convergence speed

[Figure: objective value vs. passes of data for AdaGrad, Adam, and a tuned SGD]
• test on training a neural network with one hidden layer

Observation: adaptive methods are much faster, and all methods have similar per-update cost
Take a close look:
$$x^{k+1} = \mathrm{Proj}_X^{v^k}\left(x^k - \alpha_k \, g^k \oslash \sqrt{v^k}\right)$$
• $\mathrm{Proj}_X^{v^k}$ is assumed simple (holds if $X$ is simple)
• not (easily) implementable if $X$ is complicated

This talk: adaptive primal-dual stochastic gradient for problems with complicated constraints
Outline

1. Problem formulation and motivating examples
2. Review of existing methods
3. Proposed primal-dual stochastic gradient method
4. Numerical and convergence results and conclusions
Stochastic program with functional constraints

$$\min_{x \in X} \; f_0(x) = \mathbb{E}_{\xi_0}[F_0(x; \xi_0)] \quad \text{(P)}$$
$$\text{s.t.} \; f_j(x) = \mathbb{E}_{\xi_j}[F_j(x; \xi_j)] \le 0, \quad j = 1, \ldots, m$$

• $X$ is a simple closed convex set (but the feasible set is complicated)
• $f_j$ is convex and possibly nondifferentiable
• $m$ can be very big: expensive to access all $f_j$'s at every update

Goal: design an efficient stochastic method, free of complicated projections, that guarantees (near) optimality and feasibility
Example I: linear programming for a Markov decision process

discounted Markov decision process: $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$
• state space $\mathcal{S} = \{s_1, \ldots, s_m\}$, action space $\mathcal{A} = \{a_1, \ldots, a_n\}$
• transition probabilities $P = [P_a(s, s')]$, rewards $r = [r_a(s, s')]$
• discount factor $\gamma \in (0, 1]$

Bellman optimality equation:
$$v(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} P_a(s, s')\left[r_a(s, s') + \gamma v(s')\right], \quad \forall s \in \mathcal{S}$$

equivalent to the linear program [Puterman'14]:
$$\min_v \; e^\top v, \quad \text{s.t.} \; (I - \gamma P_a)v - r_a \ge 0, \; \forall a \in \mathcal{A}$$
• $r_a(s) = \sum_{s' \in \mathcal{S}} P_a(s, s')\, r_a(s, s')$
• huge number of constraints if $m$ and/or $n$ is big
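To make the reduction concrete, a small sketch that assembles and solves this LP with scipy; the random P and r are hypothetical toy data, with r taken directly as the expected reward $r_a(s)$ defined above:

```python
import numpy as np
from scipy.optimize import linprog

# toy discounted MDP with hypothetical random data (m states, n actions)
rng = np.random.default_rng(0)
m, n, gamma = 20, 5, 0.9
P = rng.dirichlet(np.ones(m), size=(n, m))   # P[a, s, :] = P_a(s, .)
r = rng.standard_normal((n, m))              # r[a, s] = expected reward r_a(s)

# min e^T v  s.t.  (I - gamma * P_a) v >= r_a for every action a;
# linprog wants A_ub @ v <= b_ub, so negate both sides of each block
A_ub = np.vstack([-(np.eye(m) - gamma * P[a]) for a in range(n)])
b_ub = np.concatenate([-r[a] for a in range(n)])
sol = linprog(c=np.ones(m), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * m)     # v is a free variable
v_star = sol.x                               # optimal value function
```

Note the LP has one constraint per (state, action) pair, i.e., $mn$ constraints in total, which is exactly why large $m$ or $n$ makes projection onto the feasible set impractical.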
Example II: robust optimization by sampling

Robust optimization:
$$\min_{x \in X} f_0(x), \quad \text{s.t.} \; g(x; \xi) \le 0, \; \forall \xi \in \Xi$$

Sampled approximation [Calafiore-Campi'05]:
$$\min_{x \in X} f_0(x), \quad \text{s.t.} \; g(x; \xi_i) \le 0, \; \forall i = 1, \ldots, m$$
• $\{\xi_1, \ldots, \xi_m\}$: $m$ independently drawn samples
• the solution of the sampled approximation problem is a $(1 - \tau)$-level robustly feasible solution with probability at least $1 - \varepsilon$ if $m \ge \frac{n}{\tau \varepsilon} - 1$, where $\tau \in (0, 1)$ and $\varepsilon \in (0, 1)$
Literature

Few works address problems with functional constraints:
• penalty method with stochastic approximation [Wang-Ma-Yuan'17]
  • uses exact function/gradient information of all constraint functions
• stochastic mirror-prox descent for saddle-point problems [Baes-Bürgisser-Nemirovski'13]
• cooperative stochastic approximation (CSA) for problems with an expectation constraint [Lan-Zhou'16]
• level-set methods [Lin et al.'18]
Stochastic mirror-prox method [Baes-Bürgisser-Nemirovski'13]

For a saddle-point problem: $\min_{x \in X} \max_{z \in Z} \mathcal{L}(x, z)$

Iterative update scheme:
$$(\hat{x}^k, \hat{z}^k) = \mathrm{Proj}_{X \times Z}\left(x^k - \alpha_k g_x^k, \; z^k + \alpha_k g_z^k\right),$$
$$(x^{k+1}, z^{k+1}) = \mathrm{Proj}_{X \times Z}\left(x^k - \alpha_k \hat{g}_x^k, \; z^k + \alpha_k \hat{g}_z^k\right)$$
• $(g_x^k; g_z^k)$: a stochastic approximation of $\nabla \mathcal{L}(x^k, z^k)$
• $(\hat{g}_x^k; \hat{g}_z^k)$: a stochastic approximation of $\nabla \mathcal{L}(\hat{x}^k, \hat{z}^k)$
• $O(1/\sqrt{k})$ rate in terms of the primal-dual gap
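One mirror-prox iteration in the Euclidean case, as a sketch; grad_L, proj_x, and proj_z are assumed callables returning stochastic gradients and projections:

```python
def mirror_prox_step(x, z, grad_L, proj_x, proj_z, alpha):
    # predictor: stochastic gradients at the current point (x^k, z^k)
    gx, gz = grad_L(x, z)
    x_hat = proj_x(x - alpha * gx)       # descent for the min player
    z_hat = proj_z(z + alpha * gz)       # ascent for the max player
    # corrector: fresh stochastic gradients at the extrapolated point
    gx_hat, gz_hat = grad_L(x_hat, z_hat)
    # step again from (x^k, z^k), now using the extrapolated gradients
    return proj_x(x - alpha * gx_hat), proj_z(z + alpha * gz_hat)
```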
Cooperative stochastic approximation [Lan-Zhou'16]

For the problem with an expectation constraint:
$$\min_{x \in X} f(x) = \mathbb{E}_\xi[F(x, \xi)], \quad \text{s.t.} \; \mathbb{E}_\xi[G(x, \xi)] \le 0$$

For $k = 0, 1, \ldots$, do
1. sample $\xi_k$;
2. if $G(x^k, \xi_k) \le \eta_k$, set $g^k = \tilde{\nabla} F(x^k, \xi_k)$; otherwise, $g^k = \tilde{\nabla} G(x^k, \xi_k)$;
3. update $x$ by
$$x^{k+1} = \operatorname*{arg\,min}_{x \in X} \; \langle g^k, x \rangle + \frac{1}{2\gamma_k}\|x - x^k\|^2$$

• purely primal method
• $O(1/\sqrt{k})$ rate for convex problems
• $O(1/k)$ if both objective and constraint functions are strongly convex
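A sketch of one CSA iteration in the Euclidean setting, where the prox subproblem in step 3 has the closed form $x^{k+1} = \mathrm{Proj}_X(x^k - \gamma_k g^k)$; all callables are hypothetical, and the bookkeeping [Lan-Zhou'16] uses to average only the "objective-step" iterates is omitted:

```python
def csa_step(x, k, sample, grad_F, grad_G, G_val, proj_x, eta, gamma, rng):
    xi = sample(rng)
    # cooperative switch: take an objective step only when the sampled
    # constraint value is within the tolerance eta_k
    if G_val(x, xi) <= eta(k):
        g = grad_F(x, xi)        # work on the objective
    else:
        g = grad_G(x, xi)        # reduce constraint violation
    # argmin_x <g, x> + ||x - x^k||^2 / (2 gamma_k) over X reduces to a
    # projected gradient step in the Euclidean case
    return proj_x(x - gamma(k) * g)
```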
Proposed method via the augmented Lagrangian function
Augmented Lagrangian function

With slack variables $s \ge 0$, (P) is equivalent to
$$\min_{x \in X, \, s \ge 0} f_0(x), \quad \text{s.t.} \; f_i(x) + s_i = 0, \; i = 1, \ldots, m.$$

Adding a quadratic penalty gives the augmented Lagrangian function
$$\tilde{\mathcal{L}}_\beta(x, s, z) = f_0(x) + \sum_{i=1}^m z_i\left(f_i(x) + s_i\right) + \frac{\beta}{2}\sum_{i=1}^m \left(f_i(x) + s_i\right)^2.$$

Fix $(x, z)$ and minimize $\tilde{\mathcal{L}}_\beta$ over $s \ge 0$ (by solving $\nabla_s \tilde{\mathcal{L}}_\beta = 0$ and projecting onto $s \ge 0$):
$$s_i = \left(-\frac{z_i}{\beta} - f_i(x)\right)_+, \quad i = 1, \ldots, m.$$
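For completeness, the one-line derivation behind the closed-form $s_i$ ($\tilde{\mathcal{L}}_\beta$ is separable in $s$, so each coordinate solves a scalar problem):

```latex
% per-coordinate subproblem in s_i:
\min_{s_i \ge 0}\; z_i\bigl(f_i(x) + s_i\bigr) + \tfrac{\beta}{2}\bigl(f_i(x) + s_i\bigr)^2
% stationarity: z_i + \beta\bigl(f_i(x) + s_i\bigr) = 0
%   =>  s_i = -\tfrac{z_i}{\beta} - f_i(x)
% the objective is convex in s_i, so clipping at 0 gives
s_i = \Bigl(-\tfrac{z_i}{\beta} - f_i(x)\Bigr)_+
```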
Augmented Lagrangian function

Eliminating $s$ yields the classic augmented Lagrangian function of (P):
$$\mathcal{L}_\beta(x, z) = f_0(x) + \sum_{i=1}^m \psi_\beta(f_i(x), z_i),$$
where
$$\psi_\beta(u, v) = \begin{cases} uv + \frac{\beta}{2}u^2, & \text{if } \beta u + v \ge 0, \\ -\frac{v^2}{2\beta}, & \text{if } \beta u + v < 0. \end{cases}$$
• $\psi_\beta(f_i(x), z_i)$ is convex in $x$ and concave in $z_i$ for each $i$
• thus $\mathcal{L}_\beta$ is convex in $x$ and concave in $z$
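A direct transcription of $\psi_\beta$ and its partial derivative in $u$, which is what the chain rule needs for $\tilde{\nabla}_x \psi_\beta(f_i(x), z_i) = (\beta f_i(x) + z_i)_+ \tilde{\nabla} f_i(x)$; both branches agree at $\beta u + v = 0$, so $\psi_\beta$ is continuously differentiable:

```python
import numpy as np

def psi_beta(u, v, beta):
    # psi_beta(u, v): u*v + (beta/2)*u^2 if beta*u + v >= 0, else -v^2/(2*beta)
    return np.where(beta * u + v >= 0.0,
                    u * v + 0.5 * beta * u**2,
                    -v**2 / (2.0 * beta))

def dpsi_du(u, v, beta):
    # partial derivative in u: (beta*u + v)_+ ; feeds the chain rule
    # grad_x psi_beta(f_i(x), z_i) = (beta*f_i(x) + z_i)_+ * grad f_i(x)
    return np.maximum(beta * u + v, 0.0)
```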
Augmented Lagrangian method

Choose $(x^1, z^1)$. For $k = 1, 2, \ldots$, iteratively do:
$$x^{k+1} \in \operatorname*{Arg\,min}_{x \in X} \mathcal{L}_\beta(x, z^k),$$
$$z^{k+1} = z^k + \rho \nabla_z \mathcal{L}_\beta(x^{k+1}, z^k)$$
• if $\rho < 2\beta$, globally convergent with rate $O\left(\frac{1}{k\rho}\right)$
• bigger $\rho$ and $\beta$ give faster convergence in terms of iteration count but yield a harder $x$-subproblem
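A sketch of the (deterministic) ALM loop, assuming a hypothetical oracle argmin_x for the $x$-subproblem; the $z$-gradient formula $\nabla_{z_i} \mathcal{L}_\beta(x, z) = \max(f_i(x), -z_i/\beta)$ follows from the slack elimination above:

```python
import numpy as np

def alm(argmin_x, f, z0, beta, rho, num_iters):
    # f(x): vector of constraint values (f_1(x), ..., f_m(x));
    # argmin_x(z): oracle for the subproblem Argmin_{x in X} L_beta(x, z)
    z = z0.copy()
    for _ in range(num_iters):
        x = argmin_x(z)
        # dual gradient ascent with step rho (< 2*beta for convergence)
        z = z + rho * np.maximum(f(x), -z / beta)
    return x, z
```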
Proposed primal-dual stochastic gradient method

Consider the case where:
• exact $f_j$ and $\tilde{\nabla} f_j$ can be obtained for each $j = 1, \ldots, m$
• $m$ is big: expensive to access all $f_j$'s at every update

Examples: MDP, robust optimization by sampling, multi-class SVM

Remarks:
• if the $f_j$'s are stochastic, the AL function takes a compositional (nested) expectation form
• it is then difficult to obtain an unbiased stochastic estimate of $\tilde{\nabla}_x \mathcal{L}_\beta$
• the ordinary Lagrangian function can be used to handle the most general case
Proposed primal-dual stochastic gradient method

For $k = 0, 1, \ldots$, do
1. sample $\xi_k$ and pick $j_k \in [m]$ uniformly at random;
2. let $g^k = \tilde{\nabla} F_0(x^k, \xi_k) + \tilde{\nabla}_x \psi_\beta\left(f_{j_k}(x^k), z_{j_k}^k\right)$;
3. update the primal variable $x$ by
$$x^{k+1} = \mathrm{Proj}_X\left(x^k - D_k^{-1} g^k\right);$$
4. let $z_j^{k+1} = z_j^k$ for $j \ne j_k$, and update $z_{j_k}$ by
$$z_j^{k+1} = z_j^k + \rho_k \cdot \max\left(-\frac{z_j^k}{\beta}, \, f_j(x^k)\right), \quad \text{for } j = j_k.$$

• $g^k$: an unbiased stochastic estimate of $\tilde{\nabla}_x \mathcal{L}_\beta$ at $x^k$
• only $\tilde{\nabla} f_{j_k}(x^k)$ and $f_{j_k}(x^k)$ are needed for the updates
• $D_k = I/\alpha_k + \eta \cdot \mathrm{diag}\left(\sqrt{\sum_{t=0}^k |\tilde{g}^t|^2}\right)$, with $\tilde{g}^k$ a scaled version of $g^k$
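Putting the pieces together, a sketch of one pass of the method; all problem callables are hypothetical, and the accumulator below uses $g^k$ itself where the slide's $D_k$ uses the scaled version $\tilde{g}^k$ (the exact scaling is a detail of the paper, so that substitution is an assumption):

```python
import numpy as np

def pdsg_adp(x0, z0, fs, grad_fs, sample_grad_f0, proj_x,
             beta, alpha, rho, eta, num_iters, rng):
    # fs[j](x) and grad_fs[j](x): exact value and subgradient of f_j;
    # sample_grad_f0(x, rng): stochastic subgradient of f_0
    m = len(fs)
    x, z = x0.copy(), z0.copy()
    v = np.zeros_like(x)                  # accumulator for adaptive scaling
    for k in range(num_iters):
        j = rng.integers(m)               # pick j_k uniformly from [m]
        fj = fs[j](x)                     # only one constraint is touched
        # grad_x psi_beta(f_j(x), z_j) = (beta*f_j(x) + z_j)_+ * grad f_j(x)
        g = sample_grad_f0(x, rng) + max(beta * fj + z[j], 0.0) * grad_fs[j](x)
        # D_k = I/alpha_k + eta * diag(sqrt(sum_{t<=k} |g~^t|^2)); here we
        # accumulate g^k itself in place of the paper's scaled g~^k
        v += g * g
        d = 1.0 / alpha(k) + eta * np.sqrt(v)
        x_new = proj_x(x - g / d)         # primal step: x^k - D_k^{-1} g^k
        z[j] += rho(k) * max(-z[j] / beta, fj)   # dual step uses f_j(x^k)
        x = x_new
    return x, z
```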
How the proposed method performs

Test on convex quadratically constrained quadratic programming:
$$\min_{x \in X} \; \frac{1}{2N}\sum_{i=1}^N \|H_i x - c_i\|^2, \quad \text{s.t.} \; \frac{1}{2}x^\top Q_j x + a_j^\top x \le b_j, \; j = 1, \ldots, m,$$
where $N = m = 10{,}000$.

[Figure: objective distance to optimality and average feasibility residual vs. number of epochs for PDSG-nonadp, PDSG-adp, CSA, and mirror-prox]

Observations:
• proposed methods better than mirror-prox and CSA
• adaptiveness significantly improves convergence speed
• all methods have roughly the same asymptotic convergence rate
Sublinear convergence result

Assumptions:
1. existence of a primal-dual solution $(x^*, z^*)$
2. unbiased estimates with bounded variance
3. bounded constraint functions and subgradients

Theorem: Given $K$, let $\alpha_k = \frac{\alpha}{\sqrt{K}}$, $\rho_k = \frac{\rho}{\sqrt{K}}$, and $\beta \ge \rho$. Then
$$\max\left\{\mathbb{E}\left|f_0(\bar{x}^K) - f_0(x^*)\right|, \; \mathbb{E}\left\|[f(\bar{x}^K)]_+\right\|\right\} = O\left(\frac{1}{K^{1/2}}\right).$$
If $f_0$ is strongly convex, let $\alpha_k = \frac{\alpha}{k}$, $\rho_k = \frac{\rho}{\log(K+1)}$, and $\beta \ge \frac{\rho}{\log 2}$. Then
$$\mathbb{E}\|\bar{x}^K - x^*\|^2 = O\left(\frac{\log(K+1)}{K}\right).$$
• $\bar{x}^K$: a weighted average of $\{x^k\}_{k=1}^{K+1}$

Remark: CSA [Lan-Zhou'16] requires strong convexity of both objective and constraint functions to achieve $O\left(\frac{1}{K}\right)$.