Accelerated primal-dual methods for linearly constrained convex problems

Yangyang Xu

SIAM Conference on Optimization
May 24, 2017
Accelerated proximal gradient

For the convex composite problem:
  minimize_x  F(x) := f(x) + g(x)
• f: convex and Lipschitz differentiable
• g: closed convex (possibly nondifferentiable) and simple

Proximal gradient:
  x^{k+1} = \arg\min_x \langle \nabla f(x^k), x \rangle + \frac{L_f}{2} \|x - x^k\|^2 + g(x)
• convergence rate: F(x^k) - F(x^*) = O(1/k)

Accelerated proximal gradient [Beck-Teboulle'09, Nesterov'14]:
  x^{k+1} = \arg\min_x \langle \nabla f(\hat{x}^k), x \rangle + \frac{L_f}{2} \|x - \hat{x}^k\|^2 + g(x)
• \hat{x}^k: extrapolated point
• convergence rate (with smart extrapolation): F(x^k) - F(x^*) = O(1/k^2)

This talk: ways to accelerate primal-dual methods
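For concreteness, below is a minimal NumPy sketch of an accelerated proximal gradient loop in the FISTA style of [Beck-Teboulle'09]. The callables grad_f and prox_g and the specific extrapolation weights are illustrative assumptions, not the exact scheme analyzed in this talk.

```python
import numpy as np

def accelerated_prox_grad(grad_f, prox_g, L_f, x0, max_iter=500):
    """Accelerated proximal gradient in the FISTA style (a sketch).

    grad_f : callable, gradient of the smooth part f
    prox_g : callable, prox_g(v, t) = argmin_x g(x) + (1/(2t))||x - v||^2
    L_f    : Lipschitz constant of grad_f
    """
    x_prev = x0.copy()
    x_hat = x0.copy()                 # extrapolated point \hat{x}^k
    t_prev = 1.0
    for _ in range(max_iter):
        # proximal gradient step taken at the extrapolated point
        x = prox_g(x_hat - grad_f(x_hat) / L_f, 1.0 / L_f)
        # FISTA extrapolation weight and new extrapolated point
        t = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2))
        x_hat = x + ((t_prev - 1.0) / t) * (x - x_prev)
        x_prev, t_prev = x, t
    return x_prev
```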
Part I: accelerated linearized augmented Lagrangian
Affinely constrained composite convex problems

  minimize_x  F(x) = f(x) + g(x),  subject to  Ax = b        (LCP)
• f: convex and Lipschitz differentiable
• g: closed convex and simple

Examples
• nonnegative quadratic programming: f(x) = \frac{1}{2} x^\top Q x + c^\top x,  g = \iota_{\mathbb{R}^n_+}
• TV image denoising: \min \{ \frac{1}{2} \|X - B\|_F^2 + \lambda \|Y\|_1,  \text{s.t.}\  D(X) = Y \}
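As an illustration, here is a hypothetical nonnegative QP instance of (LCP) set up in NumPy. All data (Q, c, A, b) are randomly generated, and the helper names grad_f, prox_g, L_f are assumptions reused by the later sketches.

```python
import numpy as np

# Hypothetical nonnegative QP instance of (LCP):
#   min 1/2 x'Qx + c'x   s.t.  Ax = b,  x >= 0
# with f(x) = 1/2 x'Qx + c'x (smooth) and g = indicator of the nonnegative orthant.
rng = np.random.default_rng(0)
n, m = 50, 10
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)                   # positive definite, so f is strongly convex
c = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = A @ np.abs(rng.standard_normal(n))    # guarantees a nonnegative feasible point

grad_f = lambda x: Q @ x + c              # gradient of the smooth part
prox_g = lambda v, t: np.maximum(v, 0.0)  # prox of the indicator = projection onto x >= 0
L_f = np.linalg.norm(Q, 2)                # Lipschitz constant of grad_f (largest eigenvalue)
```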
Augmented Lagrangian method (ALM)

At iteration k,
  x^{k+1} \leftarrow \arg\min_x f(x) + g(x) - \langle \lambda^k, Ax \rangle + \frac{\beta}{2} \|Ax - b\|^2,
  \lambda^{k+1} \leftarrow \lambda^k - \gamma (Ax^{k+1} - b)
• augmented dual gradient ascent with stepsize \gamma
• \beta: penalty parameter; dual gradient Lipschitz constant 1/\beta
• 0 < \gamma < 2\beta: convergence guaranteed
• also popular for (nonlinear, nonconvex) constrained problems

The x-subproblem is as difficult as the original problem.
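A minimal sketch of the ALM loop, assuming an oracle solve_x_subproblem that handles the (possibly hard) x-subproblem:

```python
def augmented_lagrangian(solve_x_subproblem, A, b, beta, gamma, lam0, max_iter=100):
    """Classical ALM sketch.  solve_x_subproblem(lam, beta) is assumed to return
       argmin_x f(x) + g(x) - <lam, Ax> + (beta/2)||Ax - b||^2,
    which in general is as hard as the original problem (LCP)."""
    lam = lam0.copy()
    x = None
    for _ in range(max_iter):
        x = solve_x_subproblem(lam, beta)   # exact (or highly accurate) primal update
        lam = lam - gamma * (A @ x - b)     # dual gradient ascent; take 0 < gamma < 2*beta
    return x, lam
```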
Linearized augmented Lagrangian method

• Linearize the smooth term f:
  x^{k+1} \leftarrow \arg\min_x \langle \nabla f(x^k), x \rangle + g(x) - \langle \lambda^k, Ax \rangle + \frac{\beta}{2} \|Ax - b\|^2 + \frac{\eta}{2} \|x - x^k\|^2.
• Linearize both f and \|Ax - b\|^2:
  x^{k+1} \leftarrow \arg\min_x \langle \nabla f(x^k), x \rangle + g(x) - \langle \lambda^k, Ax \rangle + \langle \beta A^\top r^k, x \rangle + \frac{\eta}{2} \|x - x^k\|^2,
  where r^k = Ax^k - b is the residual.

Easier updates and a nice convergence rate of O(1/k).
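A sketch of the fully linearized variant (both f and the augmented term linearized), where each x-update reduces to a single prox of g. The callables grad_f and prox_g are assumed to be provided, e.g., as in the QP instance above.

```python
def linearized_alm(grad_f, prox_g, A, b, x0, lam0, beta, gamma, eta, max_iter=1000):
    """Fully linearized ALM sketch: both f and the augmented term are linearized,
    so each x-update is a single prox of g (eta plays the role of an inverse stepsize)."""
    x, lam = x0.copy(), lam0.copy()
    for _ in range(max_iter):
        r = A @ x - b                                     # residual r^k
        step = grad_f(x) - A.T @ lam + beta * (A.T @ r)   # linearized smooth + augmented terms
        x = prox_g(x - step / eta, 1.0 / eta)             # prox-linear x-update
        lam = lam - gamma * (A @ x - b)                   # dual update
    return x, lam
```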
Accelerated linearized augmented Lagrangian method

At iteration k,
  \hat{x}^k \leftarrow (1 - \alpha_k) \bar{x}^k + \alpha_k x^k,
  x^{k+1} \leftarrow \arg\min_x \langle \nabla f(\hat{x}^k) - A^\top \lambda^k, x \rangle + g(x) + \frac{\beta_k}{2} \|Ax - b\|^2 + \frac{\eta_k}{2} \|x - x^k\|^2,
  \bar{x}^{k+1} \leftarrow (1 - \alpha_k) \bar{x}^k + \alpha_k x^{k+1},
  \lambda^{k+1} \leftarrow \lambda^k - \gamma_k (Ax^{k+1} - b).
• inspired by [Lan'12] on accelerated stochastic approximation
• reduces to linearized ALM if \alpha_k = 1, \beta_k = \beta, \eta_k = \eta, \gamma_k = \gamma, \forall k
• convergence rate: O(1/k) if \eta \geq L_f and 0 < \gamma < 2\beta
• adaptive parameters give O(1/k^2) (next slides)
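A sketch of the accelerated loop with the parameters passed as callables of k. The x-subproblem, which keeps the augmented term, is delegated to an assumed oracle solve_x_sub.

```python
def accelerated_linearized_alm(solve_x_sub, grad_f, A, b, x0, lam0,
                               alpha, beta, eta, gamma, max_iter=1000):
    """Sketch of the accelerated linearized ALM; alpha, beta, eta, gamma are callables of k.

    solve_x_sub(lin, lam, beta_k, eta_k, x_k) is assumed to return
        argmin_x <lin - A'lam, x> + g(x) + (beta_k/2)||Ax - b||^2 + (eta_k/2)||x - x_k||^2,
    i.e. the x-subproblem with f replaced by its linearization at the extrapolated point.
    """
    x, x_bar, lam = x0.copy(), x0.copy(), lam0.copy()
    for k in range(1, max_iter + 1):
        a_k = alpha(k)
        x_hat = (1.0 - a_k) * x_bar + a_k * x       # extrapolated point \hat{x}^k
        x = solve_x_sub(grad_f(x_hat), lam, beta(k), eta(k), x)
        x_bar = (1.0 - a_k) * x_bar + a_k * x       # averaged output sequence \bar{x}^{k+1}
        lam = lam - gamma(k) * (A @ x - b)          # dual update with step gamma_k
    return x_bar, lam
```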
Better numerical performance

[Figure: objective error |F - F^*| and feasibility violation \|Ax - b\| versus iteration number (log scale), comparing nonaccelerated ALM and accelerated ALM.]

• Tested on quadratic programming (subproblems solved exactly)
• Parameters set according to the theorem (see next slide)
• Accelerated ALM is significantly better
Guaranteed fast convergence

Assumptions:
• There exists a pair of primal-dual solutions (x^*, \lambda^*).
• \nabla f is Lipschitz continuous: \|\nabla f(x) - \nabla f(y)\| \leq L_f \|x - y\|

Convergence rate of order O(1/k^2):
• Set the parameters, for all k, to
  \alpha_k = \frac{2}{k+1},  \gamma_k = k\gamma,  \beta_k \geq \frac{\gamma_k}{2},  \eta_k = \eta,
  where \gamma > 0 and \eta \geq 2 L_f. Then
  |F(\bar{x}^{k+1}) - F(x^*)| \leq \frac{1}{k(k+1)} \left( \eta \|x^1 - x^*\|^2 + \frac{4}{\gamma} \|\lambda^*\|^2 \right),
  \|A \bar{x}^{k+1} - b\| \leq \frac{1}{k(k+1) \max(1, \|\lambda^*\|)} \left( \eta \|x^1 - x^*\|^2 + \frac{4}{\gamma} \|\lambda^*\|^2 \right).
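Under the parameter statement reconstructed above, the schedules could be written as callables and plugged into the sketch on the previous slide. The value of gamma0 and the reuse of L_f from the earlier QP instance are assumptions.

```python
# Schedules matching the reconstructed theorem conditions (gamma0 > 0, eta0 >= 2*L_f):
gamma0 = 1.0                         # assumed base dual stepsize
eta0 = 2.0 * L_f                     # L_f from the QP instance above (an assumption)
alpha = lambda k: 2.0 / (k + 1)
gamma = lambda k: k * gamma0
beta  = lambda k: k * gamma0 / 2.0   # any beta_k >= gamma_k / 2 is allowed
eta   = lambda k: eta0
```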
Sketch of proof

Let \Phi(\bar{x}, x, \lambda) = F(\bar{x}) - F(x) - \langle \lambda, A\bar{x} - b \rangle.

1. Fundamental inequality (for any \lambda):
  \Phi(\bar{x}^{k+1}, x^*, \lambda) - (1 - \alpha_k) \Phi(\bar{x}^k, x^*, \lambda)
  \leq -\frac{\alpha_k \eta_k}{2} \left( \|x^{k+1} - x^*\|^2 - \|x^k - x^*\|^2 + \|x^{k+1} - x^k\|^2 \right) + \frac{\alpha_k^2 L_f}{2} \|x^{k+1} - x^k\|^2
  + \frac{\alpha_k}{2\gamma_k} \left( \|\lambda^k - \lambda\|^2 - \|\lambda^{k+1} - \lambda\|^2 + \|\lambda^{k+1} - \lambda^k\|^2 \right) - \frac{\alpha_k \beta_k}{\gamma_k^2} \|\lambda^{k+1} - \lambda^k\|^2.

2. Take \alpha_k = \frac{2}{k+1}, \gamma_k = k\gamma, \beta_k \geq \frac{\gamma_k}{2}, \eta_k = \eta and multiply the above inequality by k(k+1):
  k(k+1) \Phi(\bar{x}^{k+1}, x^*, \lambda) - k(k-1) \Phi(\bar{x}^k, x^*, \lambda)
  \leq -\eta \left( \|x^{k+1} - x^*\|^2 - \|x^k - x^*\|^2 \right) + \frac{1}{\gamma} \left( \|\lambda^k - \lambda\|^2 - \|\lambda^{k+1} - \lambda\|^2 \right).

3. Set \lambda^1 = 0 and sum the above inequality over k:
  \Phi(\bar{x}^{k+1}, x^*, \lambda) \leq \frac{1}{k(k+1)} \left( \eta \|x^1 - x^*\|^2 + \frac{1}{\gamma} \|\lambda\|^2 \right).

4. Take \lambda = \max(1 + \|\lambda^*\|, 2\|\lambda^*\|) \frac{A\bar{x}^{k+1} - b}{\|A\bar{x}^{k+1} - b\|} and use the optimality condition
  \Phi(\bar{x}, x^*, \lambda^*) \geq 0 \;\Rightarrow\; F(\bar{x}^{k+1}) - F(x^*) \geq -\|\lambda^*\| \cdot \|A\bar{x}^{k+1} - b\|.
Literature

• [He-Yuan'10]: ALM accelerated to O(1/k^2) for smooth problems
• [Kang et al.'13]: ALM accelerated to O(1/k^2) for nonsmooth problems
• [Huang-Ma-Goldfarb'13]: accelerated linearized ALM (with linearization of the augmented term) to O(1/k^2) for strongly convex problems
Part II: accelerated linearized ADMM
Two-block structured problems

The variable is partitioned into two blocks; the smooth part involves one block, and the nonsmooth part is separable:
  minimize_{y,z}  h(y) + f(z) + g(z),  subject to  By + Cz = b        (LCP-2)
• f: convex and Lipschitz differentiable
• g and h: closed convex and simple

Examples:
• total-variation regularized regression: \min_{y,z} \lambda \|y\|_1 + f(z),  \text{s.t.}\  Dz = y  (see the code sketch below)
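A hypothetical 1-D TV-regularized least-squares instance written in the form (LCP-2). The data (X, w), the difference operator D, and the helper names are illustrative assumptions used by the later ADMM sketches.

```python
import numpy as np

# Hypothetical 1-D TV-regularized least-squares instance in form (LCP-2):
#   min_z  lam_tv*||Dz||_1 + (1/2)||Xz - w||^2
# Introduce y = Dz:  h(y) = lam_tv*||y||_1,  f(z) = (1/2)||Xz - w||^2,  g = 0,
# with coupling constraint  -y + Dz = 0, i.e. B = -I, C = D, b = 0.
rng = np.random.default_rng(1)
p, n = 200, 100
X = rng.standard_normal((p, n))
w = rng.standard_normal(p)
lam_tv = 0.1
D = np.eye(n) - np.eye(n, k=1)            # forward differences (last row is just z_n)
B, C, b_vec = -np.eye(n), D, np.zeros(n)

grad_f = lambda z: X.T @ (X @ z - w)      # gradient of the smooth block f
prox_h = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - lam_tv * t, 0.0)  # soft-threshold
prox_g = lambda v, t: v                   # g = 0, so its prox is the identity
```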
Alternating direction method of multipliers (ADMM)

At iteration k,
  y^{k+1} \leftarrow \arg\min_y h(y) - \langle \lambda^k, By \rangle + \frac{\beta}{2} \|By + Cz^k - b\|^2,
  z^{k+1} \leftarrow \arg\min_z f(z) + g(z) - \langle \lambda^k, Cz \rangle + \frac{\beta}{2} \|By^{k+1} + Cz - b\|^2,
  \lambda^{k+1} \leftarrow \lambda^k - \gamma (By^{k+1} + Cz^{k+1} - b)
• 0 < \gamma < \frac{1+\sqrt{5}}{2} \beta: convergence guaranteed [Glowinski-Marrocco'75]
• updating y and z alternately is easier than updating them jointly
• but the z-subproblem can still be difficult
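A minimal two-block ADMM sketch, assuming oracles for both subproblems. In the TV example above the y-subproblem reduces to soft-thresholding since B = -I, while the z-subproblem is a least-squares solve, which motivates the linearized variant on the next slide.

```python
def admm(solve_y_sub, solve_z_sub, B, C, b, z0, lam0, beta, gamma, max_iter=500):
    """Classical two-block ADMM sketch.  The assumed subproblem oracles return
        solve_y_sub(lam, z, beta) = argmin_y h(y) - <lam, By> + (beta/2)||By + Cz - b||^2
        solve_z_sub(lam, y, beta) = argmin_z f(z) + g(z) - <lam, Cz> + (beta/2)||By + Cz - b||^2."""
    z, lam = z0.copy(), lam0.copy()
    y = None
    for _ in range(max_iter):
        y = solve_y_sub(lam, z, beta)
        z = solve_z_sub(lam, y, beta)
        lam = lam - gamma * (B @ y + C @ z - b)   # take 0 < gamma < (1 + sqrt(5))/2 * beta
    return y, z, lam
```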
Accelerated linearized ADMM

At iteration k,
  y^{k+1} \leftarrow \arg\min_y h(y) - \langle \lambda^k, By \rangle + \frac{\beta_k}{2} \|By + Cz^k - b\|^2,
  z^{k+1} \leftarrow \arg\min_z \langle \nabla f(z^k) - C^\top \lambda^k + \beta_k C^\top r^{k+1/2}, z \rangle + g(z) + \frac{\eta_k}{2} \|z - z^k\|^2,
  \lambda^{k+1} \leftarrow \lambda^k - \gamma_k (By^{k+1} + Cz^{k+1} - b),
  where r^{k+1/2} = By^{k+1} + Cz^k - b.
• reduces to linearized ADMM if \beta_k = \beta, \eta_k = \eta, \gamma_k = \gamma, \forall k
• convergence rate: O(1/k) if 0 < \gamma \leq \beta and \eta \geq L_f + \beta \|C\|^2
• O(1/k^2) with adaptive parameters and strong convexity on z (next two slides)
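A sketch of the accelerated linearized ADMM with parameter schedules as callables. The y-subproblem is delegated to an assumed oracle, while the z-update needs only prox_g (g = 0 in the TV example, so its prox is the identity).

```python
def accelerated_linearized_admm(solve_y_sub, grad_f, prox_g, B, C, b, z0, lam0,
                                beta, eta, gamma, max_iter=1000):
    """Sketch of the accelerated linearized ADMM; beta, eta, gamma are callables of k.

    solve_y_sub(lam, z, beta_k) is assumed to return
        argmin_y h(y) - <lam, By> + (beta_k/2)||By + Cz - b||^2.
    The z-update needs only the prox of g because f and the augmented term are linearized.
    """
    z, lam = z0.copy(), lam0.copy()
    y = None
    for k in range(1, max_iter + 1):
        beta_k, eta_k, gamma_k = beta(k), eta(k), gamma(k)
        y = solve_y_sub(lam, z, beta_k)
        r_half = B @ y + C @ z - b                              # r^{k+1/2}
        lin = grad_f(z) - C.T @ lam + beta_k * (C.T @ r_half)   # linearized terms at z^k
        z = prox_g(z - lin / eta_k, 1.0 / eta_k)                # prox-linear z-update
        lam = lam - gamma_k * (B @ y + C @ z - b)               # dual update
    return y, z, lam
```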
Accelerated convergence speed

Assumptions:
• existence of a primal-dual solution (y^*, z^*, \lambda^*)
• \nabla f Lipschitz continuous: \|\nabla f(\hat{z}) - \nabla f(\tilde{z})\| \leq L_f \|\hat{z} - \tilde{z}\|
• f strongly convex with modulus \mu_f (not required for y)

Convergence rate of order O(1/k^2):
• Set the parameters, for all k, to
  \beta_k = \gamma_k = (k+1)\gamma,  \eta_k = (k+1)\eta + L_f,
  with \gamma > 0 and \gamma < \eta \leq \mu_f / 2. Then
  \max \left\{ |F(\bar{y}^k, \bar{z}^k) - F^*|,\; \|B\bar{y}^k + C\bar{z}^k - b\|,\; \|z^k - z^*\|^2 \right\} \leq O(1/k^2),
  where F(y, z) = h(y) + f(z) + g(z) and F^* = F(y^*, z^*).
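Schedules matching the theorem's conditions, written as callables for the sketch above. The constants mu_f, L_f, gamma0, eta0 are placeholder assumptions, not values from the talk.

```python
# Schedules matching the theorem's conditions (gamma0 > 0, gamma0 < eta0 <= mu_f/2):
mu_f, L_f = 1.0, 10.0                 # assumed strong-convexity and Lipschitz constants of f
gamma0 = 0.1                          # assumed base stepsize
eta0 = mu_f / 2.0                     # any eta0 with gamma0 < eta0 <= mu_f/2 works
beta  = lambda k: (k + 1) * gamma0
gamma = lambda k: (k + 1) * gamma0
eta   = lambda k: (k + 1) * eta0 + L_f
```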
Sketch of proof

1. Fundamental inequality from the optimality conditions of each iterate:
  F(y^{k+1}, z^{k+1}) - F(y, z) - \langle \lambda, By^{k+1} + Cz^{k+1} - b \rangle
  \leq -\left\langle \frac{1}{\gamma_k} (\lambda^k - \lambda^{k+1}),\; \lambda - \lambda^k + \frac{\beta_k}{\gamma_k} (\lambda^k - \lambda^{k+1}) - \beta_k C (z^{k+1} - z^k) \right\rangle
  + \frac{L_f}{2} \|z^{k+1} - z^k\|^2 - \frac{\mu_f}{2} \|z^k - z\|^2 - \eta_k \langle z^{k+1} - z, z^{k+1} - z^k \rangle.

2. Plug in the parameters and bound the cross terms:
  F(y^{k+1}, z^{k+1}) - F(y^*, z^*) - \langle \lambda, By^{k+1} + Cz^{k+1} - b \rangle
  + \frac{1}{2} \left( \eta (k+1) \|z^{k+1} - z^*\|^2 + L_f \|z^{k+1} - z^*\|^2 \right) + \frac{1}{2\gamma(k+1)} \|\lambda - \lambda^{k+1}\|^2
  \leq \frac{1}{2} \left( \eta (k+1) \|z^k - z^*\|^2 + (L_f - \mu_f) \|z^k - z^*\|^2 \right) + \frac{1}{2\gamma(k+1)} \|\lambda - \lambda^k\|^2.

3. Multiply by k + k_0 (here k_0 \sim \frac{2 L_f}{\mu_f}) and sum the inequality over k:
  F(\bar{y}^{k+1}, \bar{z}^{k+1}) - F(y^*, z^*) - \langle \lambda, B\bar{y}^{k+1} + C\bar{z}^{k+1} - b \rangle \leq \frac{\phi(y^*, z^*, \lambda)}{k^2}.

4. Take a special \lambda and use the KKT conditions.
Literature

• [Ouyang et al.'15]: O(L_f / k^2 + C_0 / k) with only weak convexity
• [Goldstein et al.'14]: O(1/k^2) with strong convexity on both y and z
• [Chambolle-Pock'11, Chambolle-Pock'16, Dang-Lan'14, Bredies-Sun'16]: accelerated first-order methods for bilinear saddle-point problems

Open question: what are the weakest conditions that yield O(1/k^2)?
Numerical experiments (more results in the paper)