Adaptive primal-dual stochastic gradient methods
Yangyang Xu
Mathematical Sciences, Rensselaer Polytechnic Institute
October 26, 2019
Stochastic gradient method

stochastic program:
$$\min_{x \in X} f(x) = \mathbb{E}_\xi [F(x; \xi)]$$
• if $\xi$ is uniform on $\{\xi_1, \ldots, \xi_N\}$, then $f(x) = \frac{1}{N} \sum_{i=1}^N F(x; \xi_i)$
• stochastic gradient (requires samples of $\xi$):
$$x^{k+1} = \mathrm{Proj}_X\left(x^k - \alpha_k g^k\right),$$
where $g^k$ is a stochastic approximation of $\nabla f(x^k)$
• low per-update complexity compared to deterministic gradient descent
• Literature: tons of works (e.g., [Robbins-Monro'51, Polyak-Juditsky'92, Nemirovski et al.'09, Ghadimi-Lan'13, Davis et al.'18])
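As an illustration (not from the talk), a minimal Python sketch of the projected stochastic gradient update on a hypothetical least-squares instance over the unit ball; the names grad_sample, proj, and alpha are stand-ins:

```python
import numpy as np

def projected_sgd(grad_sample, proj, x0, alpha, num_iters, rng):
    # x^{k+1} = Proj_X(x^k - alpha_k * g^k)
    x = x0.copy()
    for k in range(num_iters):
        g = grad_sample(x, rng)        # g^k: unbiased estimate of grad f(x^k)
        x = proj(x - alpha(k) * g)     # project back onto the simple set X
    return x

# toy instance: f(x) = (1/N) sum_i 0.5*(a_i^T x - b_i)^2 over the unit ball
rng = np.random.default_rng(0)
A, b = rng.standard_normal((1000, 5)), rng.standard_normal(1000)

def grad_sample(x, rng):
    i = rng.integers(len(b))           # sample one term, i.e., one xi_i
    return (A[i] @ x - b[i]) * A[i]    # gradient of F(x; xi_i)

def proj(x):                           # Proj onto X = {x : ||x||_2 <= 1}
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

x_hat = projected_sgd(grad_sample, proj, np.zeros(5),
                      lambda k: 0.5 / np.sqrt(k + 1), 5000, rng)
```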
Adaptive learning

• adaptive gradient (AdaGrad) [Duchi-Hazan-Singer'11]:
$$x^{k+1} = \mathrm{Proj}_X^{v^k}\left(x^k - \alpha_k \, g^k \oslash \sqrt{v^k}\right),$$
where $v^k = \sum_{t=0}^{k} (g^t)^2$ (elementwise)
• many other adaptive variants: Adam [Kingma-Ba'14], AMSGrad [Reddi-Kale-Kumar'18], and so on
• extremely popular in training deep neural networks
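A sketch of one AdaGrad-style update in the simplest setting $X = \mathbb{R}^n$ (so the weighted projection is trivial); the eps safeguard is a standard implementation detail not on the slide:

```python
import numpy as np

def adagrad_step(x, g, v, alpha, eps=1e-8):
    # v^k = sum_{t <= k} (g^t)^2, accumulated elementwise
    v = v + g * g
    # coordinate-wise scaled step x^k - alpha * g^k / sqrt(v^k);
    # eps guards against division by zero
    return x - alpha * g / (np.sqrt(v) + eps), v
```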
Adaptiveness improves convergence speed

[Figure: objective value vs. passes of data for AdaGrad, Adam, and a tuned SGD]
• test on training a neural network with one hidden layer

Observation: adaptive methods are much faster, and all methods have similar per-update cost
Take a close look:
$$x^{k+1} = \mathrm{Proj}_X^{v^k}\left(x^k - \alpha_k \, g^k \oslash \sqrt{v^k}\right)$$
• $\mathrm{Proj}_X^{v^k}$ is assumed simple (holds if $X$ is simple)
• not (easily) implementable if $X$ is complicated

This talk: adaptive primal-dual stochastic gradient for problems with complicated constraints
Outline

1. Problem formulation and motivating examples
2. Review of existing methods
3. Proposed primal-dual stochastic gradient method
4. Numerical and convergence results and conclusions
Stochastic program with functional constraints

$$\min_{x \in X} \; f_0(x) = \mathbb{E}_{\xi_0}[F_0(x; \xi_0)] \quad \text{(P)}$$
$$\text{s.t.} \; f_j(x) = \mathbb{E}_{\xi_j}[F_j(x; \xi_j)] \le 0, \quad j = 1, \ldots, m$$

• $X$ is a simple closed convex set (but the feasible set is complicated)
• $f_j$ is convex and possibly nondifferentiable
• $m$ can be very big: expensive to access all $f_j$'s at every update

Goal: design an efficient stochastic method, free of complicated projections, that guarantees (near) optimality and feasibility
Example I: linear programming for a Markov decision process

discounted Markov decision process: $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$
• state space $\mathcal{S} = \{s_1, \ldots, s_m\}$, action space $\mathcal{A} = \{a_1, \ldots, a_n\}$
• transition probabilities $P = [P_a(s, s')]$, rewards $r = [r_a(s, s')]$
• discount factor $\gamma \in (0, 1]$

Bellman optimality equation:
$$v(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} P_a(s, s')\left[r_a(s, s') + \gamma v(s')\right], \quad \forall s \in \mathcal{S}$$

equivalent to the linear program [Puterman'14]:
$$\min_v \; e^\top v, \quad \text{s.t.} \; (I - \gamma P_a)v - r_a \ge 0, \; \forall a \in \mathcal{A}$$
• $r_a(s) = \sum_{s' \in \mathcal{S}} P_a(s, s')\, r_a(s, s')$
• huge number of constraints if $m$ and/or $n$ is big
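To make the reduction concrete, a small sketch that assembles and solves this LP with scipy; the random P and r are hypothetical toy data, with r taken directly as the expected reward $r_a(s)$ defined above:

```python
import numpy as np
from scipy.optimize import linprog

# toy discounted MDP with hypothetical random data (m states, n actions)
rng = np.random.default_rng(0)
m, n, gamma = 20, 5, 0.9
P = rng.dirichlet(np.ones(m), size=(n, m))   # P[a, s, :] = P_a(s, .)
r = rng.standard_normal((n, m))              # r[a, s] = expected reward r_a(s)

# min e^T v  s.t.  (I - gamma * P_a) v >= r_a for every action a;
# linprog wants A_ub @ v <= b_ub, so negate both sides of each block
A_ub = np.vstack([-(np.eye(m) - gamma * P[a]) for a in range(n)])
b_ub = np.concatenate([-r[a] for a in range(n)])
sol = linprog(c=np.ones(m), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * m)     # v is a free variable
v_star = sol.x                               # optimal value function
```

Note the LP has one constraint per (state, action) pair, i.e., $mn$ constraints in total, which is exactly why large $m$ or $n$ makes projection onto the feasible set impractical.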
Example II: robust optimization by sampling

Robust optimization:
$$\min_{x \in X} f_0(x), \quad \text{s.t.} \; g(x; \xi) \le 0, \; \forall \xi \in \Xi$$

Sampled approximation [Calafiore-Campi'05]:
$$\min_{x \in X} f_0(x), \quad \text{s.t.} \; g(x; \xi_i) \le 0, \; \forall i = 1, \ldots, m$$
• $\{\xi_1, \ldots, \xi_m\}$: $m$ independently drawn samples
• the solution of the sampled approximation problem is a $(1 - \tau)$-level robustly feasible solution with probability at least $1 - \varepsilon$ if $m \ge \frac{n}{\tau \varepsilon} - 1$, where $\tau \in (0, 1)$ and $\varepsilon \in (0, 1)$
Literature

Few works address problems with functional constraints:
• penalty method with stochastic approximation [Wang-Ma-Yuan'17]
  • uses exact function/gradient information of all constraint functions
• stochastic mirror-prox descent for saddle-point problems [Baes-Bürgisser-Nemirovski'13]
• cooperative stochastic approximation (CSA) for problems with an expectation constraint [Lan-Zhou'16]
• level-set methods [Lin et al.'18]
Stochastic mirror-prox method [Baes-Bürgisser-Nemirovski'13]

For a saddle-point problem: $\min_{x \in X} \max_{z \in Z} \mathcal{L}(x, z)$

Iterative update scheme:
$$(\hat{x}^k, \hat{z}^k) = \mathrm{Proj}_{X \times Z}\left(x^k - \alpha_k g_x^k, \; z^k + \alpha_k g_z^k\right),$$
$$(x^{k+1}, z^{k+1}) = \mathrm{Proj}_{X \times Z}\left(x^k - \alpha_k \hat{g}_x^k, \; z^k + \alpha_k \hat{g}_z^k\right)$$
• $(g_x^k; g_z^k)$: a stochastic approximation of $\nabla \mathcal{L}(x^k, z^k)$
• $(\hat{g}_x^k; \hat{g}_z^k)$: a stochastic approximation of $\nabla \mathcal{L}(\hat{x}^k, \hat{z}^k)$
• $O(1/\sqrt{k})$ rate in terms of the primal-dual gap
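One mirror-prox iteration in the Euclidean case, as a sketch; grad_L, proj_x, and proj_z are assumed callables returning stochastic gradients and projections:

```python
def mirror_prox_step(x, z, grad_L, proj_x, proj_z, alpha):
    # predictor: stochastic gradients at the current point (x^k, z^k)
    gx, gz = grad_L(x, z)
    x_hat = proj_x(x - alpha * gx)       # descent for the min player
    z_hat = proj_z(z + alpha * gz)       # ascent for the max player
    # corrector: fresh stochastic gradients at the extrapolated point
    gx_hat, gz_hat = grad_L(x_hat, z_hat)
    # step again from (x^k, z^k), now using the extrapolated gradients
    return proj_x(x - alpha * gx_hat), proj_z(z + alpha * gz_hat)
```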
Cooperative stochastic approximation [Lan-Zhou'16]

For the problem with an expectation constraint:
$$\min_{x \in X} f(x) = \mathbb{E}_\xi[F(x, \xi)], \quad \text{s.t.} \; \mathbb{E}_\xi[G(x, \xi)] \le 0$$

For $k = 0, 1, \ldots$, do
1. sample $\xi_k$;
2. if $G(x^k, \xi_k) \le \eta_k$, set $g^k = \tilde{\nabla} F(x^k, \xi_k)$; otherwise, $g^k = \tilde{\nabla} G(x^k, \xi_k)$;
3. update $x$ by
$$x^{k+1} = \operatorname*{arg\,min}_{x \in X} \; \langle g^k, x \rangle + \frac{1}{2\gamma_k}\|x - x^k\|^2$$

• purely primal method
• $O(1/\sqrt{k})$ rate for convex problems
• $O(1/k)$ if both objective and constraint functions are strongly convex
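A sketch of one CSA iteration in the Euclidean setting, where the prox subproblem in step 3 has the closed form $x^{k+1} = \mathrm{Proj}_X(x^k - \gamma_k g^k)$; all callables are hypothetical, and the bookkeeping [Lan-Zhou'16] uses to average only the "objective-step" iterates is omitted:

```python
def csa_step(x, k, sample, grad_F, grad_G, G_val, proj_x, eta, gamma, rng):
    xi = sample(rng)
    # cooperative switch: take an objective step only when the sampled
    # constraint value is within the tolerance eta_k
    if G_val(x, xi) <= eta(k):
        g = grad_F(x, xi)        # work on the objective
    else:
        g = grad_G(x, xi)        # reduce constraint violation
    # argmin_x <g, x> + ||x - x^k||^2 / (2 gamma_k) over X reduces to a
    # projected gradient step in the Euclidean case
    return proj_x(x - gamma(k) * g)
```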
Proposed method via the augmented Lagrangian function
Augmented Lagrangian function

With slack variables $s \ge 0$, (P) is equivalent to
$$\min_{x \in X, \, s \ge 0} f_0(x), \quad \text{s.t.} \; f_i(x) + s_i = 0, \; i = 1, \ldots, m.$$

Adding a quadratic penalty gives the augmented Lagrangian function
$$\tilde{\mathcal{L}}_\beta(x, s, z) = f_0(x) + \sum_{i=1}^m z_i\left(f_i(x) + s_i\right) + \frac{\beta}{2}\sum_{i=1}^m \left(f_i(x) + s_i\right)^2.$$

Fix $(x, z)$ and minimize $\tilde{\mathcal{L}}_\beta$ over $s \ge 0$ (by solving $\nabla_s \tilde{\mathcal{L}}_\beta = 0$ and projecting onto $s \ge 0$):
$$s_i = \left(-\frac{z_i}{\beta} - f_i(x)\right)_+, \quad i = 1, \ldots, m.$$
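For completeness, the one-line derivation behind the closed-form $s_i$ ($\tilde{\mathcal{L}}_\beta$ is separable in $s$, so each coordinate solves a scalar problem):

```latex
% per-coordinate subproblem in s_i:
\min_{s_i \ge 0}\; z_i\bigl(f_i(x) + s_i\bigr) + \tfrac{\beta}{2}\bigl(f_i(x) + s_i\bigr)^2
% stationarity: z_i + \beta\bigl(f_i(x) + s_i\bigr) = 0
%   =>  s_i = -\tfrac{z_i}{\beta} - f_i(x)
% the objective is convex in s_i, so clipping at 0 gives
s_i = \Bigl(-\tfrac{z_i}{\beta} - f_i(x)\Bigr)_+
```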
Augmented Lagrangian function

Eliminating $s$ yields the classic augmented Lagrangian function of (P):
$$\mathcal{L}_\beta(x, z) = f_0(x) + \sum_{i=1}^m \psi_\beta(f_i(x), z_i),$$
where
$$\psi_\beta(u, v) = \begin{cases} uv + \frac{\beta}{2}u^2, & \text{if } \beta u + v \ge 0, \\ -\frac{v^2}{2\beta}, & \text{if } \beta u + v < 0. \end{cases}$$
• $\psi_\beta(f_i(x), z_i)$ is convex in $x$ and concave in $z_i$ for each $i$
• thus $\mathcal{L}_\beta$ is convex in $x$ and concave in $z$
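A direct transcription of $\psi_\beta$ and its partial derivative in $u$, which is what the chain rule needs for $\tilde{\nabla}_x \psi_\beta(f_i(x), z_i) = (\beta f_i(x) + z_i)_+ \tilde{\nabla} f_i(x)$; both branches agree at $\beta u + v = 0$, so $\psi_\beta$ is continuously differentiable:

```python
import numpy as np

def psi_beta(u, v, beta):
    # psi_beta(u, v): u*v + (beta/2)*u^2 if beta*u + v >= 0, else -v^2/(2*beta)
    return np.where(beta * u + v >= 0.0,
                    u * v + 0.5 * beta * u**2,
                    -v**2 / (2.0 * beta))

def dpsi_du(u, v, beta):
    # partial derivative in u: (beta*u + v)_+ ; feeds the chain rule
    # grad_x psi_beta(f_i(x), z_i) = (beta*f_i(x) + z_i)_+ * grad f_i(x)
    return np.maximum(beta * u + v, 0.0)
```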
Augmented Lagrangian method

Choose $(x^1, z^1)$. For $k = 1, 2, \ldots$, iteratively do:
$$x^{k+1} \in \operatorname*{Arg\,min}_{x \in X} \mathcal{L}_\beta(x, z^k),$$
$$z^{k+1} = z^k + \rho \nabla_z \mathcal{L}_\beta(x^{k+1}, z^k)$$
• if $\rho < 2\beta$, globally convergent with rate $O\left(\frac{1}{k\rho}\right)$
• bigger $\rho$ and $\beta$ give faster convergence in terms of iteration count but yield a harder $x$-subproblem
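A sketch of the (deterministic) ALM loop, assuming a hypothetical oracle argmin_x for the $x$-subproblem; the $z$-gradient formula $\nabla_{z_i} \mathcal{L}_\beta(x, z) = \max(f_i(x), -z_i/\beta)$ follows from the slack elimination above:

```python
import numpy as np

def alm(argmin_x, f, z0, beta, rho, num_iters):
    # f(x): vector of constraint values (f_1(x), ..., f_m(x));
    # argmin_x(z): oracle for the subproblem Argmin_{x in X} L_beta(x, z)
    z = z0.copy()
    for _ in range(num_iters):
        x = argmin_x(z)
        # dual gradient ascent with step rho (< 2*beta for convergence)
        z = z + rho * np.maximum(f(x), -z / beta)
    return x, z
```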
Proposed primal-dual stochastic gradient method

Consider the case where:
• exact $f_j$ and $\tilde{\nabla} f_j$ can be obtained for each $j = 1, \ldots, m$
• $m$ is big: expensive to access all $f_j$'s at every update

Examples: MDP, robust optimization by sampling, multi-class SVM

Remarks:
• if the $f_j$'s are stochastic, the AL function takes a compositional (nested) expectation form
• it is then difficult to obtain an unbiased stochastic estimate of $\tilde{\nabla}_x \mathcal{L}_\beta$
• the ordinary Lagrangian function can be used to handle the most general case
Proposed primal-dual stochastic gradient method

For $k = 0, 1, \ldots$, do
1. sample $\xi_k$ and pick $j_k \in [m]$ uniformly at random;
2. let $g^k = \tilde{\nabla} F_0(x^k, \xi_k) + \tilde{\nabla}_x \psi_\beta\left(f_{j_k}(x^k), z_{j_k}^k\right)$;
3. update the primal variable $x$ by
$$x^{k+1} = \mathrm{Proj}_X\left(x^k - D_k^{-1} g^k\right);$$
4. let $z_j^{k+1} = z_j^k$ for $j \ne j_k$, and update $z_{j_k}$ by
$$z_j^{k+1} = z_j^k + \rho_k \cdot \max\left(-\frac{z_j^k}{\beta}, \, f_j(x^k)\right), \quad \text{for } j = j_k.$$

• $g^k$: an unbiased stochastic estimate of $\tilde{\nabla}_x \mathcal{L}_\beta$ at $x^k$
• only $\tilde{\nabla} f_{j_k}(x^k)$ and $f_{j_k}(x^k)$ are needed for the updates
• $D_k = I/\alpha_k + \eta \cdot \mathrm{diag}\left(\sqrt{\sum_{t=0}^k |\tilde{g}^t|^2}\right)$, with $\tilde{g}^k$ a scaled version of $g^k$
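Putting the pieces together, a sketch of one pass of the method; all problem callables are hypothetical, and the accumulator below uses $g^k$ itself where the slide's $D_k$ uses the scaled version $\tilde{g}^k$ (the exact scaling is a detail of the paper, so that substitution is an assumption):

```python
import numpy as np

def pdsg_adp(x0, z0, fs, grad_fs, sample_grad_f0, proj_x,
             beta, alpha, rho, eta, num_iters, rng):
    # fs[j](x) and grad_fs[j](x): exact value and subgradient of f_j;
    # sample_grad_f0(x, rng): stochastic subgradient of f_0
    m = len(fs)
    x, z = x0.copy(), z0.copy()
    v = np.zeros_like(x)                  # accumulator for adaptive scaling
    for k in range(num_iters):
        j = rng.integers(m)               # pick j_k uniformly from [m]
        fj = fs[j](x)                     # only one constraint is touched
        # grad_x psi_beta(f_j(x), z_j) = (beta*f_j(x) + z_j)_+ * grad f_j(x)
        g = sample_grad_f0(x, rng) + max(beta * fj + z[j], 0.0) * grad_fs[j](x)
        # D_k = I/alpha_k + eta * diag(sqrt(sum_{t<=k} |g~^t|^2)); here we
        # accumulate g^k itself in place of the paper's scaled g~^k
        v += g * g
        d = 1.0 / alpha(k) + eta * np.sqrt(v)
        x_new = proj_x(x - g / d)         # primal step: x^k - D_k^{-1} g^k
        z[j] += rho(k) * max(-z[j] / beta, fj)   # dual step uses f_j(x^k)
        x = x_new
    return x, z
```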
How the proposed method performs

Test on convex quadratically constrained quadratic programming:
$$\min_{x \in X} \; \frac{1}{2N}\sum_{i=1}^N \|H_i x - c_i\|^2, \quad \text{s.t.} \; \frac{1}{2}x^\top Q_j x + a_j^\top x \le b_j, \; j = 1, \ldots, m,$$
where $N = m = 10{,}000$.

[Figure: objective distance to optimality and average feasibility residual vs. number of epochs for PDSG-nonadp, PDSG-adp, CSA, and mirror-prox]

Observations:
• proposed methods better than mirror-prox and CSA
• adaptiveness significantly improves convergence speed
• all methods have roughly the same asymptotic convergence rate
Sublinear convergence result

Assumptions:
1. existence of a primal-dual solution $(x^*, z^*)$
2. unbiased estimates with bounded variance
3. bounded constraint functions and subgradients

Theorem: Given $K$, let $\alpha_k = \frac{\alpha}{\sqrt{K}}$, $\rho_k = \frac{\rho}{\sqrt{K}}$, and $\beta \ge \rho$. Then
$$\max\left\{\mathbb{E}\left|f_0(\bar{x}^K) - f_0(x^*)\right|, \; \mathbb{E}\left\|[f(\bar{x}^K)]_+\right\|\right\} = O\left(\frac{1}{K^{1/2}}\right).$$
If $f_0$ is strongly convex, let $\alpha_k = \frac{\alpha}{k}$, $\rho_k = \frac{\rho}{\log(K+1)}$, and $\beta \ge \frac{\rho}{\log 2}$. Then
$$\mathbb{E}\|\bar{x}^K - x^*\|^2 = O\left(\frac{\log(K+1)}{K}\right).$$
• $\bar{x}^K$: a weighted average of $\{x^k\}_{k=1}^{K+1}$

Remark: CSA [Lan-Zhou'16] requires strong convexity of both objective and constraint functions to achieve $O\left(\frac{1}{K}\right)$.