Distributed nonsmooth composite optimization via the proximal augmented Lagrangian
Neil K. Dhingra (neilkdh.com)
joint work with Sei Zhen Khong and Mihailo Jovanović
LCCC Focus Period on Large-Scale and Distributed Optimization
June 9, 2017
1 / 35
Applications

satellite formations · combination drug therapy · power networks · control of buildings
2 / 35
Structure via composite optimization

minimize f(x) + g(Tx)
(f – performance; g(Tx) – structure)

◮ f – possibly nonconvex; continuously differentiable
◮ g – convex; often non-differentiable
◮ Tx – promotes structure in alternate coordinates
◮ g(x) admits an easily computable proximal operator; g(Tx) does not
3 / 35
Outline

I. Proximal augmented Lagrangian
   - centralized approach – method of multipliers

II. Primal-dual method
   - distributable
   - convergence for convex problems
   - linear convergence for strongly convex problems
4 / 35
Proximal gradient method

minimize f(x) + g(x)

Generalizes gradient descent:

x^{k+1} = prox_{α_k g}( x^k − α_k ∇f(x^k) )

- cannot be used for g(Tx) in general

Nesterov '07; Beck & Teboulle '09
5 / 35
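For concreteness, a minimal numpy sketch of this iteration on a small LASSO instance, where g = γ‖·‖₁ so the prox is soft-thresholding; the matrix A, vector b, and parameter values are made up for the example:

```python
import numpy as np

def prox_l1(v, t):
    # soft-thresholding: prox of t*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(grad_f, prox_g, x0, alpha, iters=500):
    # x^{k+1} = prox_{alpha g}( x^k - alpha * grad f(x^k) )
    x = x0
    for _ in range(iters):
        x = prox_g(x - alpha * grad_f(x), alpha)
    return x

# hypothetical LASSO instance: f(x) = 0.5*||Ax - b||^2, g = gamma*||.||_1
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, 1.0])
gamma = 0.1
grad_f = lambda x: A.T @ (A @ x - b)
alpha = 1.0 / np.linalg.norm(A.T @ A, 2)      # step size <= 1/L, L = ||A^T A||
x_hat = proximal_gradient(grad_f, lambda v, a: prox_l1(v, a * gamma),
                          np.zeros(2), alpha)
```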
Proximal operator and Moreau envelope

◮ Proximal operator

  prox_{µg}(v) := argmin_z g(z) + (1/2µ)‖z − v‖²

◮ Moreau envelope

  M_{µg}(v) := inf_z g(z) + (1/2µ)‖z − v‖²

  - continuously differentiable even when g is not

  ∇M_{µg}(v) = (1/µ)( v − prox_{µg}(v) )

Parikh & Boyd, FnT in Optimization '14
6 / 35
Example

◮ Soft-thresholding – proximal operator for the ℓ₁ norm

  minimize_z Σ_i ( γ|z_i| + (1/2µ)(z_i − v_i)² )

  separability ⇒ element-wise analytical solution

[figures: prox operator – soft-thresholding; Moreau envelope – Huber function; ∇M – saturation; threshold a = µγ]
7 / 35
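To make the correspondence concrete, a small sketch (parameter values assumed for illustration) evaluating the prox, the Moreau envelope, and its gradient for g = γ|·|; the envelope is the Huber function and ∇M saturates at ±γ:

```python
import numpy as np

def prox_l1(v, a):
    # soft-thresholding with threshold a = mu*gamma
    return np.sign(v) * np.maximum(np.abs(v) - a, 0.0)

def moreau_l1(v, mu, gamma):
    # M_{mu g}(v) = g(z*) + (1/(2 mu)) (z* - v)^2 with z* = prox_{mu g}(v);
    # for g = gamma|.| this is the Huber function
    z = prox_l1(v, mu * gamma)
    return gamma * np.abs(z) + (z - v) ** 2 / (2 * mu)

def grad_moreau_l1(v, mu, gamma):
    # grad M_{mu g}(v) = (v - prox_{mu g}(v)) / mu: saturation at +/- gamma
    return (v - prox_l1(v, mu * gamma)) / mu

v = np.linspace(-3, 3, 7)
print(grad_moreau_l1(v, mu=1.0, gamma=1.0))   # values clipped to [-1, 1]
```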
Auxiliary variable

minimize_{x,z} f(x) + g(z)
subject to Tx − z = 0

◮ Decouples f and g
◮ Can use methods for constrained optimization

Augmented Lagrangian:

L_µ(x, z; y) = f(x) + g(z) + ⟨y, Tx − z⟩ + (1/2µ)‖Tx − z‖²
8 / 35
Method of multipliers

(x^{k+1}, z^{k+1}) = argmin_{x,z} L_µ(x, z; y^k)
y^{k+1} = y^k + (1/µ)( Tx^{k+1} − z^{k+1} )

◮ Gradient ascent on a strengthened dual problem
◮ Requires joint minimization over x and z
◮ Well-studied: convergence to local minimum, adaptive µ update, inexact subproblems, etc.
9 / 35
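A generic sketch of this loop, with the joint (x, z) subproblem left as a user-supplied oracle `argmin_xz` (a hypothetical name), since that minimization is exactly the difficulty addressed in the rest of the talk; all arguments are assumed to be numpy arrays:

```python
def method_of_multipliers(argmin_xz, T, mu, y0, iters=50):
    # argmin_xz(y) must return a joint minimizer (x, z) of L_mu(x, z; y)
    y = y0
    for _ in range(iters):
        x, z = argmin_xz(y)              # joint primal minimization
        y = y + (T @ x - z) / mu         # dual gradient ascent step
    return x, z, y
```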
MM cartoon
[animation frames: L_µ(x, z; y⁰), L_µ(x, z; y¹), L_µ(x, z; y⋆)]
10 / 35
Alternating direction method of multipliers

x^{k+1} = argmin_x L_µ(x, z^k; y^k)        (differentiable problem)
z^{k+1} = argmin_z L_µ(x^{k+1}, z; y^k)    (via prox_{µg}(·))
y^{k+1} = y^k + (1/µ)( Tx^{k+1} − z^{k+1} )

◮ Convenient for distributed implementation
◮ Convergence speed influenced by µ
◮ Challenge: convergence for nonconvex f

Hong, Luo, Razaviyayn, SIAM J. Optimiz. '16
11 / 35
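For concreteness, a sketch of these updates for an assumed instance (not the slide's application) with convex quadratic f(x) = ½xᵀQx − qᵀx and g = γ‖·‖₁; the x-step then reduces to a linear solve and the z-step to soft-thresholding:

```python
import numpy as np

def admm(Q, q, T, gamma, mu, iters=300):
    n, m = Q.shape[0], T.shape[0]
    x, z, y = np.zeros(n), np.zeros(m), np.zeros(m)
    H = Q + T.T @ T / mu                  # Hessian of the x-subproblem
    for _ in range(iters):
        # x-step: stationarity of L_mu in x
        x = np.linalg.solve(H, q - T.T @ y + T.T @ z / mu)
        # z-step: prox_{mu g} evaluated at Tx + mu*y
        v = T @ x + mu * y
        z = np.sign(v) * np.maximum(np.abs(v) - mu * gamma, 0.0)
        # dual update
        y = y + (T @ x - z) / mu
    return x, z, y
```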
ADMM cartoon
[animation frames: L_µ(x, z; y⁰), L_µ(x, z; y¹), L_µ(x, z; y²)]
12 / 35
Proximal augmented Lagrangian

L_µ(x, z; y) = f(x) + g(z) + (1/2µ)‖z − (Tx + µy)‖² − (µ/2)‖y‖²

Minimize over z:

z⋆_µ(x, y) = prox_{µg}(Tx + µy)

Evaluate L_µ(x, z; y) at z⋆:

L_µ(x; y) := L_µ(x, z⋆_µ(x, y); y) = f(x) + M_{µg}(Tx + µy) − (µ/2)‖y‖²

continuously differentiable in x and y

Dhingra, Khong, Jovanović, arXiv:1610.04514
14 / 35
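A sketch evaluating L_µ(x; y) for the assumed choice g = γ‖·‖₁, using z⋆ = prox_{µg}(Tx + µy) and the Moreau-envelope form above:

```python
import numpy as np

def prox_aug_lagrangian(f, T, mu, gamma, x, y):
    # L_mu(x; y) = f(x) + M_{mu g}(Tx + mu*y) - (mu/2)*||y||^2, g = gamma*||.||_1
    v = T @ x + mu * y
    z = np.sign(v) * np.maximum(np.abs(v) - mu * gamma, 0.0)   # z*(x, y)
    M = gamma * np.abs(z).sum() + ((z - v) ** 2).sum() / (2 * mu)
    return f(x) + M - 0.5 * mu * (y ** 2).sum()
```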
Proximal augmented Lagrangian MM

x^{k+1} = argmin_x L_µ(x; y^k)
y^{k+1} = y^k + (1/µ)( Tx^{k+1} − prox_{µg}(Tx^{k+1} + µy^k) )

◮ Nonconvex f: convergence to local minimum
◮ x-minimization step: differentiable problem

Dhingra, Khong, Jovanović, arXiv:1610.04514
15 / 35
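A sketch of this scheme for g = γ‖·‖₁, with the smooth x-minimization done inexactly by a fixed number of gradient steps (a simplification; any smooth solver would do, and the step sizes are illustrative):

```python
import numpy as np

def prox_al_mm(grad_f, T, gamma, mu, x0, y0, outer=30, inner=200, alpha=1e-2):
    prox = lambda v: np.sign(v) * np.maximum(np.abs(v) - mu * gamma, 0.0)
    x, y = x0.copy(), y0.copy()
    for _ in range(outer):
        # x-step: gradient descent on the smooth L_mu(.; y),
        # using grad_x L_mu = grad f(x) + T^T (v - prox(v)) / mu
        for _ in range(inner):
            v = T @ x + mu * y
            x = x - alpha * (grad_f(x) + T.T @ (v - prox(v)) / mu)
        # y-step: multiplier update from the slide
        Tx = T @ x
        y = y + (Tx - prox(Tx + mu * y)) / mu
    return x, y
```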
Proximal augmented Lagrangian MM cartoon
[animation frames comparing L_µ(x, z; y) and L_µ(x; y) for y⁰, y¹, y⋆]
16 / 35
Edge addition in directed consensus networks

[figure: directed network with nodes x₁, …, x₇]

z are edges; columns of T are a basis for the space of balanced graphs

Identify edges:

x(γ) = argmin_x f₂(x) + γ‖Tx‖₁

Design edge weights:

x⋆(γ) = argmin_x f₂(x) subject to sp(Tx) ⊆ sp(Tx(γ))
17 / 35
Edge addition in directed consensus networks
[figure: percent performance loss vs. number of added edges]
18 / 35
Comparison with ADMM

[figures: computation time (s) vs. m; outer iterations (k) vs. m; panels: outer iterations, computation time per outer iteration]

- guaranteed convergence to a local minimum
- computational savings from reduced outer iterations

Dhingra, Khong, Jovanović, arXiv:1610.04514
19 / 35
Outline

I. Proximal augmented Lagrangian
   - centralized approach – method of multipliers

II. Primal-dual method
   - distributable
   - convergence for convex problems
   - linear convergence for strongly convex problems
20 / 35
Primal-descent dual-ascent

Arrow–Hurwicz–Uzawa type gradient flow:

ẋ = −∇_x L
ẏ = ∇_y L

◮ Existing methods use subgradients or projection
◮ Convenient for distributed implementation

Arrow, Hurwicz, Uzawa '59; Nedic & Ozdaglar, TAC '09; Wang & Elia, CDC '11; Feijer & Paganini, AUT '10; Cherukuri, Gharesifard, Cortés, SCL '15
21 / 35
First-order primal-dual method

ẋ = −∇_x L_µ(x; y)
ẏ = ∇_y L_µ(x; y)

◮ Continuous right-hand side – even for non-differentiable g(Tx)
  - algorithmic implementation via forward Euler discretization
◮ Convex f – asymptotic convergence
  - Lyapunov function & LaSalle's invariance principle
◮ Strongly convex f, Lipschitz continuous gradient – linear convergence
  - integral quadratic constraints
  - extends to discrete time

Dhingra, Khong, Jovanović, arXiv:1610.04514
22 / 35
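A forward-Euler sketch of this flow for the assumed choice g = γ‖·‖₁ (step size α and iteration count are illustrative):

```python
import numpy as np

def primal_dual(grad_f, T, gamma, mu, x0, y0, alpha=1e-2, iters=20000):
    prox = lambda v: np.sign(v) * np.maximum(np.abs(v) - mu * gamma, 0.0)
    x, y = x0.copy(), y0.copy()
    for _ in range(iters):
        v = T @ x + mu * y
        gM = (v - prox(v)) / mu                 # grad M_{mu g}(Tx + mu*y)
        # simultaneous Euler step on (x, y)
        x, y = (x - alpha * (grad_f(x) + T.T @ gM),
                y + alpha * (mu * gM - mu * y))
    return x, y
```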
Method of multipliers cartoon II
[animation frames over L_µ(x; y) and min_x L_µ(x; y):
 x¹ = argmin_x L_µ(x; y⁰);  y¹ = y⁰ + (1/µ)∇_y L_µ(x¹; y⁰);
 x² = argmin_x L_µ(x; y¹);  y⋆ = y¹ + (1/µ)∇_y L_µ(x²; y¹);
 x⋆ = argmin_x L_µ(x; y⋆)]
23 / 35
Primal-dual cartoon
[animation frames over min_x L_µ(x; y):
 (x^{k+1}, y^{k+1}) = (x^k, y^k) − α( ∇_x L_µ(x^k; y^k), −∇_y L_µ(x^k; y^k) ) for k = 0, 1, 2]
24 / 35
Distributed updates

ẋ = −∇f(x) − Tᵀ∇M_{µg}(Tx + µy)
ẏ = µ∇M_{µg}(Tx + µy) − µy

◮ Recall ∇M_{µg}(v) = (1/µ)( v − prox_{µg}(v) )
◮ Distributed implementation if g separable and
  - ∇f: Rⁿ → Rⁿ is a sparse mapping
  - TᵀT is sparse
◮ Each node x_i
  - communicates according to ∇f and TᵀT
  - stores y_i according to Tᵀ
25 / 35
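A sketch of the locality this implies, for a made-up chain of n nodes with separable f and g = γ‖·‖₁; T is bidiagonal so TᵀT is tridiagonal, and every matvec in the update only couples neighboring nodes:

```python
import numpy as np
import scipy.sparse as sp

n = 10
# chain graph: T is the (n-1) x n difference operator,
# so T^T T is tridiagonal and each node only talks to its neighbors
T = sp.diags([np.ones(n - 1), -np.ones(n - 1)], [0, 1], shape=(n - 1, n))

a = np.arange(n, dtype=float)
grad_f = lambda x: x - a          # separable f_i(x_i) = 0.5*(x_i - a_i)^2

gamma, mu, alpha = 0.1, 1.0, 0.05
prox = lambda v: np.sign(v) * np.maximum(np.abs(v) - mu * gamma, 0.0)
x, y = np.zeros(n), np.zeros(n - 1)
for _ in range(20000):
    v = T @ x + mu * y
    gM = (v - prox(v)) / mu
    # each multiplier y_i lives on edge i and is updated from local data only
    x, y = x - alpha * (grad_f(x) + T.T @ gM), y + alpha * (mu * gM - mu * y)
```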
Overlapping group LASSO example

minimize (1/2)‖Ax − b‖² + Σ_i ‖(Tx)_i‖₂

Gradient mapping: ∇f(x) = Aᵀ(Ax − b)
- communicate states x_i according to ∇f and TᵀT
- store y_i corresponding to red edges

[figure: sparsity patterns of A and T over states x₁, …, x₄]
26 / 35
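The new ingredient relative to the ℓ₁ case is the group prox. A sketch, with block indices assumed for illustration; note T replicates shared states, so the blocks of z = Tx do not overlap and g(z) = Σᵢ‖zᵢ‖₂ is separable:

```python
import numpy as np

def prox_group_l2(v, t):
    # block soft-thresholding: prox of t*||.||_2 on one block
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= t else (1.0 - t / nrm) * v

# hypothetical non-overlapping blocks of z = Tx
blocks = [np.array([0, 1]), np.array([2, 3])]

def prox_g(z, t):
    # apply the block prox to each group of z independently
    out = z.copy()
    for b in blocks:
        out[b] = prox_group_l2(z[b], t)
    return out
```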
Reformulation of distributed optimization

minimize_x Σ f_i(x)   ≡   minimize_{x₁,x₂,…} Σ f_i(x_i) subject to Tx = 0

◮ Tᵀ is the Laplacian or incidence matrix of a connected network

≡ minimize_{x₁,x₂,…} Σ f_i(x_i) + I₀(Tx)

Indicator function: I₀(z) := { 0 if z = 0; ∞ if z ≠ 0 }
27 / 35
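A sketch of the primal-dual flow on this reformulation for three nodes on a path (made-up data); since g = I₀ has prox_{µg} ≡ 0, the Moreau-envelope gradient reduces to (Tx + µy)/µ:

```python
import numpy as np

# hypothetical 3-node path; rows of T are signed edge indicators,
# so Tx = 0 exactly when all nodes agree
T = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])

a = np.array([1.0, 2.0, 6.0])
grad_f = lambda x: x - a          # node i owns f_i(x_i) = 0.5*(x_i - a_i)^2

mu, alpha = 1.0, 0.05
x, y = np.zeros(3), np.zeros(2)
for _ in range(5000):
    v = T @ x + mu * y            # prox_{mu I_0}(v) = 0, so grad M = v / mu
    x, y = x - alpha * (grad_f(x) + T.T @ v / mu), y + alpha * (v - mu * y)

print(np.round(x, 3))             # -> approx [3, 3, 3] = mean(a)
```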