Splitting Envelopes: Accelerated Second-Order Proximal Methods
Panos Patrinos (joint work with Lorenzo Stella, Alberto Bemporad)
September 8, 2014
Outline

- forward-backward envelope (FBE)
- forward-backward Newton method (FBN)
- dual FBE and augmented Lagrangian
- alternating minimization Newton method (AMNM)
- Douglas-Rachford envelope (DRE)
- accelerated Douglas-Rachford splitting (ADRS)

based on:
1. P. Patrinos and A. Bemporad. Proximal Newton methods for convex composite optimization. In Proc. 52nd IEEE Conference on Decision and Control (CDC), pages 2358-2363, Florence, Italy, 2013.
2. P. Patrinos, L. Stella, and A. Bemporad. Forward-backward truncated Newton methods for convex composite optimization. Submitted, arXiv:1402.6655, 2014.
3. P. Patrinos, L. Stella, and A. Bemporad. Douglas-Rachford splitting: complexity estimates and accelerated variants. In Proc. 53rd IEEE Conference on Decision and Control (CDC), Los Angeles, CA, arXiv:1407.6723, 2014.
4. L. Stella, P. Patrinos, and A. Bemporad. Alternating minimization Newton method for separable convex optimization. Submitted, 2014.

fixed-point implementation for MPC:
- A. Guiggiani, P. Patrinos, and A. Bemporad. Fixed-point implementation of a proximal Newton method for embedded model predictive control. In 19th IFAC, South Africa, 2014.
Convex composite optimization

minimize  $F(x) = f(x) + g(x)$

- $f : \mathbb{R}^n \to \mathbb{R}$ convex, twice continuously differentiable, with
  $\|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\|$ for all $x, y \in \mathbb{R}^n$
- $g : \mathbb{R}^n \to \overline{\mathbb{R}}$ convex, nonsmooth, with inexpensive proximal mapping
  $\mathrm{prox}_{\gamma g}(x) = \arg\min_{z \in \mathbb{R}^n} \{ g(z) + \tfrac{1}{2\gamma} \|z - x\|^2 \}$
- many problem classes: QPs, cone programs, sparse least-squares, rank minimization, total variation minimization, ...
- applications: control, system identification, signal processing, image analysis, machine learning, ...
Proximal mappings

$\mathrm{prox}_{\gamma g}(x) = \arg\min_{z \in \mathbb{R}^n} \{ g(z) + \tfrac{1}{2\gamma} \|z - x\|^2 \}$,  $\gamma > 0$

- resolvent of the maximal monotone operator $\partial g$: $\mathrm{prox}_{\gamma g}(x) = (I + \gamma \partial g)^{-1}(x)$
- single-valued and (firmly) nonexpansive
- explicitly computable for many functions (see Parikh, Boyd '14; Combettes, Pesquet '10)
- reduces to a projection when $g$ is the indicator of a convex set: $\mathrm{prox}_{\gamma \delta_C}(x) = \Pi_C(x)$
- $z = \mathrm{prox}_{\gamma g}(x)$ is an implicit subgradient step (since $0 \in \partial g(z) + \gamma^{-1}(z - x)$):
  $z = x - \gamma v$, with $v \in \partial g(z)$
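As a concrete illustration (my own sketch, not from the slides), here are two of these proximal mappings in Python/NumPy: soft-thresholding for $g = \|\cdot\|_1$ and projection for $g = \delta_C$ with $C$ a box; the function names are hypothetical.

```python
import numpy as np

def prox_l1(x, gamma):
    """Soft-thresholding: prox of gamma*||.||_1, applied componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_box(x, lo, hi):
    """Projection onto the box [lo, hi]: prox of the indicator delta_C."""
    return np.clip(x, lo, hi)

x = np.array([-2.0, -0.3, 0.1, 1.5])
print(prox_l1(x, gamma=0.5))         # [-1.5  0.   0.   1. ]
print(prox_box(x, lo=-1.0, hi=1.0))  # [-1.  -0.3  0.1  1. ]
```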
Proximal Minimization Algorithm

minimize $g(x)$, with $g : \mathbb{R}^n \to \overline{\mathbb{R}}$ closed proper convex

given $x^0 \in \mathbb{R}^n$, repeat  $x^{k+1} = \mathrm{prox}_{\gamma g}(x^k)$,  $\gamma > 0$

- fixed-point iteration for the optimality condition
  $0 \in \partial g(x^\star) \iff x^\star \in (I + \gamma \partial g)(x^\star) \iff x^\star = \mathrm{prox}_{\gamma g}(x^\star)$
- special case of the proximal point algorithm (Martinet '70, Rockafellar '76)
- converges under very general conditions
- mostly a conceptual algorithm
Moreau envelope

Moreau envelope of a closed proper convex $g : \mathbb{R}^n \to \overline{\mathbb{R}}$:
$g^\gamma(x) = \inf_{z \in \mathbb{R}^n} \{ g(z) + \tfrac{1}{2\gamma} \|z - x\|^2 \}$,  $\gamma > 0$

- $g^\gamma$ is real-valued, convex, differentiable with $1/\gamma$-Lipschitz gradient
  $\nabla g^\gamma(x) = (1/\gamma)(x - \mathrm{prox}_{\gamma g}(x))$
- minimizing the nonsmooth $g$ is equivalent to minimizing the smooth $g^\gamma$
- proximal minimization algorithm = gradient method for $g^\gamma$:
  $x^{k+1} = x^k - \gamma \nabla g^\gamma(x^k)$
- can use any method of unconstrained smooth minimization on $g^\gamma$
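A small numerical sketch (mine, not from the slides) of the gradient identity above for $g = \|\cdot\|_1$: the Moreau envelope is evaluated at $z = \mathrm{prox}_{\gamma g}(x)$ and its closed-form gradient is checked against finite differences.

```python
import numpy as np

def prox_l1(x, gamma):
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def moreau_l1(x, gamma):
    """Moreau envelope of ||.||_1: g(z) + ||z - x||^2/(2*gamma) at z = prox_{gamma g}(x)."""
    z = prox_l1(x, gamma)
    return np.sum(np.abs(z)) + np.sum((z - x) ** 2) / (2 * gamma)

gamma, x = 0.7, np.array([1.3, -0.2, 0.05])
grad = (x - prox_l1(x, gamma)) / gamma        # nabla g^gamma(x) = (x - prox)/gamma
eps = 1e-6
fd = np.array([(moreau_l1(x + eps * e, gamma) - moreau_l1(x - eps * e, gamma)) / (2 * eps)
               for e in np.eye(3)])
print(np.allclose(grad, fd, atol=1e-5))        # True: closed form matches finite differences
```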
Forward-Backward Splitting (FBS)

minimize  $F(x) = f(x) + g(x)$

- optimality condition: $x^\star \in \mathbb{R}^n$ is optimal if and only if
  $x^\star = \mathrm{prox}_{\gamma g}(x^\star - \gamma \nabla f(x^\star))$,  $\gamma > 0$
- forward-backward splitting (aka proximal gradient):
  $x^{k+1} = \mathrm{prox}_{\gamma g}(x^k - \gamma \nabla f(x^k))$,  $\gamma \in (0, 2/L_f)$
- FBS is a fixed-point iteration
- $g = 0$: gradient method; $g = \delta_C$: gradient projection; $f = 0$: proximal minimization
- accelerated versions (Nesterov)
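A minimal forward-backward splitting sketch for the lasso problem $\tfrac{1}{2}\|Ax-b\|^2 + \lambda\|x\|_1$ (my own illustration of the iteration above; the stepsize follows the rule $\gamma \in (0, 2/L_f)$, and the data are random placeholders).

```python
import numpy as np

def prox_l1(x, gamma):
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def fbs_lasso(A, b, lam, n_iter=500):
    """Forward-backward splitting for f(x) = 0.5||Ax-b||^2, g(x) = lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2                    # Lipschitz constant of grad f
    gamma = 1.0 / L                                  # any gamma in (0, 2/L) works
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                     # forward (gradient) step
        x = prox_l1(x - gamma * grad, gamma * lam)   # backward (proximal) step
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((40, 100)), rng.standard_normal(40)
print(np.count_nonzero(fbs_lasso(A, b, lam=0.5)))    # sparse solution
```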
Forward-Backward Envelope

$x - \mathrm{prox}_{\gamma g}(x - \gamma \nabla f(x)) = 0$

- use $\mathrm{prox}_{\gamma g}(y) = y - \gamma \nabla g^\gamma(y)$ with $y = x - \gamma \nabla f(x)$:
  $\gamma \nabla f(x) + \gamma \nabla g^\gamma(x - \gamma \nabla f(x)) = 0$
- multiply by $\gamma^{-1}(I - \gamma \nabla^2 f(x))$ (positive definite for $\gamma \in (0, 1/L_f)$)
- the result is the gradient of the Forward-Backward Envelope (FBE)
  $F^{\mathrm{FB}}_\gamma(x) = f(x) - \tfrac{\gamma}{2} \|\nabla f(x)\|^2 + g^\gamma(x - \gamma \nabla f(x))$
- alternative expression for the FBE (linearize $f$ around $x$):
  $F^{\mathrm{FB}}_\gamma(x) = \inf_{z \in \mathbb{R}^n} \{ f(x) + \langle \nabla f(x), z - x \rangle + g(z) + \tfrac{1}{2\gamma} \|z - x\|^2 \}$
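For concreteness, a sketch (my own, assuming the same lasso setup as in the FBS example above) that evaluates $F^{\mathrm{FB}}_\gamma$ via the first expression: $f(x) - \tfrac{\gamma}{2}\|\nabla f(x)\|^2$ plus the Moreau envelope of $g$ at the forward step.

```python
import numpy as np

def prox_l1(x, gamma):
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def fbe_lasso(x, A, b, lam, gamma):
    """F_gamma^FB(x) for f(x) = 0.5||Ax-b||^2 and g(x) = lam*||x||_1."""
    grad = A.T @ (A @ x - b)
    y = x - gamma * grad                          # forward step
    z = prox_l1(y, gamma * lam)                   # prox_{gamma g}(y)
    g_env = lam * np.sum(np.abs(z)) + np.sum((z - y) ** 2) / (2 * gamma)  # g^gamma(y)
    f = 0.5 * np.sum((A @ x - b) ** 2)
    return f - 0.5 * gamma * np.sum(grad ** 2) + g_env
```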
Properties of FBE

- stationary points of $F^{\mathrm{FB}}_\gamma$ = minimizers of $F$
- reformulates the original nonsmooth problem as a smooth one:
  minimizing $F^{\mathrm{FB}}_\gamma(x)$ over $x \in \mathbb{R}^n$ is equivalent to minimizing $F(x)$ over $x \in \mathbb{R}^n$
- $F^{\mathrm{FB}}_\gamma$ is real-valued and continuously differentiable, with
  $\nabla F^{\mathrm{FB}}_\gamma(x) = \gamma^{-1}(I - \gamma \nabla^2 f(x))(x - \mathrm{prox}_{\gamma g}(x - \gamma \nabla f(x)))$
- FBS is a variable-metric gradient method for the FBE:
  $x^{k+1} = x^k - \gamma D_k^{-1} \nabla F^{\mathrm{FB}}_\gamma(x^k)$
Forward-Backward Newton Method (FBN)

Input: $x^0 \in \mathbb{R}^n$, $\gamma \in (0, 1/L_f)$, $\sigma \in (0, 1/2)$
for $k = 0, 1, 2, \ldots$ do
  Newton direction: choose $H_k \in \hat{\partial}^2 F^{\mathrm{FB}}_\gamma(x^k)$ and compute $d^k$ by solving (approximately)
    $H_k d = -\nabla F^{\mathrm{FB}}_\gamma(x^k)$
  Line search: compute the stepsize $\tau_k$ by backtracking until
    $F^{\mathrm{FB}}_\gamma(x^k + \tau_k d^k) \le F^{\mathrm{FB}}_\gamma(x^k) + \sigma \tau_k \langle \nabla F^{\mathrm{FB}}_\gamma(x^k), d^k \rangle$
  Update: $x^{k+1} = x^k + \tau_k d^k$
end
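The iteration structure can be sketched generically (my own code, not the authors' implementation): the oracles for the FBE, its gradient, and an element of the generalized Hessian are passed in as callables and are made concrete on the next slides for specific $g$.

```python
import numpy as np

def fbn(x0, fbe, grad_fbe, pick_H, sigma=0.1, n_iter=50):
    """Skeleton of FBN: Newton direction on the FBE plus Armijo backtracking."""
    x = x0.copy()
    for _ in range(n_iter):
        g = grad_fbe(x)
        H = pick_H(x)                        # some H_k in the generalized Hessian
        d = np.linalg.solve(H, -g)           # (approximate) Newton direction
        tau, fx = 1.0, fbe(x)
        while fbe(x + tau * d) > fx + sigma * tau * (g @ d):
            tau *= 0.5                       # backtrack until sufficient decrease
        x = x + tau * d
    return x
```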
Linear Newton approximation

The FBE is $C^1$ but not $C^2$. Solve $H d = -\nabla F^{\mathrm{FB}}_\gamma(x)$, where
$\nabla F^{\mathrm{FB}}_\gamma(x) = \gamma^{-1}(I - \gamma \nabla^2 f(x))(x - \mathrm{prox}_{\gamma g}(x - \gamma \nabla f(x)))$
and $\hat{\partial}^2 F^{\mathrm{FB}}_\gamma(x)$ is an approximate generalized Hessian:
$H = \gamma^{-1}(I - \gamma \nabla^2 f(x))(I - P(I - \gamma \nabla^2 f(x))) \in \hat{\partial}^2 F^{\mathrm{FB}}_\gamma(x)$
with $P \in \partial_C(\mathrm{prox}_{\gamma g})(x - \gamma \nabla f(x))$ (Clarke's generalized Jacobian)

- preserves all favorable properties of the Hessian of $C^2$ functions
- "Gauss-Newton" generalized Hessian: third-order terms are omitted
Generalized Jacobians of proximal mappings

- $\partial_C \mathrm{prox}_{\gamma g}(x)$ is the following set of matrices (Clarke, 1983):
  the convex hull of all limits of (ordinary) Jacobians along sequences converging to $x$ that consist of points where $\mathrm{prox}_{\gamma g}$ is differentiable
- $\mathrm{prox}_{\gamma g}(x)$ simple to compute $\implies$ some $P \in \partial_C(\mathrm{prox}_{\gamma g})(x)$ comes for free
- $g$ (block) separable $\implies$ $P \in \partial_C(\mathrm{prox}_{\gamma g})(x)$ is (block) diagonal

Example: the $\ell_1$ norm, $g(x) = \|x\|_1$ (more examples in Patrinos, Stella, Bemporad (2014)):
$(\mathrm{prox}_{\gamma g}(x))_i = x_i + \gamma$ if $x_i \le -\gamma$;  $0$ if $-\gamma \le x_i \le \gamma$;  $x_i - \gamma$ if $x_i \ge \gamma$

The matrices $P \in \partial_C(\mathrm{prox}_{\gamma g})(x)$ are diagonal with
$P_{ii} = 1$ if $|x_i| > \gamma$;  $P_{ii} \in [0, 1]$ if $|x_i| = \gamma$;  $P_{ii} = 0$ if $|x_i| < \gamma$.
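A sketch (mine) of one Clarke Jacobian element $P$ for the $\ell_1$ prox, with ties $|y_i| = \gamma$ resolved to $0$, and of the resulting generalized Hessian $H = \gamma^{-1}(I - \gamma\nabla^2 f)(I - P(I - \gamma\nabla^2 f))$ in the case of a quadratic $f$.

```python
import numpy as np

def clarke_jac_prox_l1(y, gamma):
    """One P in the Clarke Jacobian of prox_{gamma*||.||_1} at y:
    diagonal, 1 where |y_i| > gamma, 0 otherwise (ties resolved to 0)."""
    return np.diag((np.abs(y) > gamma).astype(float))

def generalized_hessian(x, Q, q, lam, gamma):
    """H for f(x) = 0.5 x'Qx + q'x (so grad f = Qx + q, Hessian = Q) and g = lam*||.||_1."""
    n = x.size
    grad = Q @ x + q
    P = clarke_jac_prox_l1(x - gamma * grad, gamma * lam)
    M = np.eye(n) - gamma * Q                 # I - gamma * Hessian of f
    return (M @ (np.eye(n) - P @ M)) / gamma
```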
Convergence of FBN

- every limit point of $\{x^k\}$ is a minimizer of $F$, i.e. belongs to $\arg\min_{x \in \mathbb{R}^n} F(x)$
- all $H \in \hat{\partial}^2 F^{\mathrm{FB}}_\gamma(x^\star)$ nonsingular $\implies$ asymptotic Q-quadratic rate

extension: FBN II
- apply an FB step after each Newton step
- same asymptotic rate, plus global complexity estimates:
  - non-strongly convex $f$: sublinear rate for $F(x^k) - F(x^\star)$
  - strongly convex $f$: linear rate for $F(x^k) - F(x^\star)$ and $\|x^k - x^\star\|^2$
FBN-CG (large problems)

Run conjugate gradient (CG) on the regularized Newton system until the residual satisfies
$\|(H_k + \delta_k I) d^k + \nabla F^{\mathrm{FB}}_\gamma(x^k)\| \le \eta_k \|\nabla F^{\mathrm{FB}}_\gamma(x^k)\|$
with $\eta_k = O(\|\nabla F^{\mathrm{FB}}_\gamma(x^k)\|)$ and $\delta_k = O(\|\nabla F^{\mathrm{FB}}_\gamma(x^k)\|)$

properties:
- no need to form $\nabla^2 f(x)$ and $H_k$ explicitly; only matrix-vector products are needed
- same convergence properties
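A matrix-free sketch of this inexact Newton step (my own; the choices of $\delta_k$ and $\eta_k$ just follow the order-of-magnitude rule above), using SciPy's CG with a LinearOperator so that only matvecs with $H_k$ are required.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def fbn_cg_direction(grad_fbe_x, H_matvec):
    """Solve (H_k + delta_k I) d = -grad F_gamma^FB(x^k) approximately with CG."""
    n = grad_fbe_x.size
    gnorm = np.linalg.norm(grad_fbe_x)
    delta = gnorm                                   # delta_k = O(||grad||)
    eta = min(0.5, gnorm)                           # eta_k  = O(||grad||)
    op = LinearOperator((n, n), matvec=lambda v: H_matvec(v) + delta * v)
    d, _ = cg(op, -grad_fbe_x, atol=eta * gnorm)    # residual tolerance ~ eta*||grad||
    return d
```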
Box-constrained convex programs

minimize $f(x)$ subject to $\ell \le x \le u$

The Newton direction solves
minimize$_d$  $\tfrac{1}{2} \langle d, \nabla^2 f(x^k) d \rangle + \langle \nabla f(x^k), d \rangle$
subject to  $d_i = \ell_i - x^k_i$, $i \in \beta_1$;  $d_i = u_i - x^k_i$, $i \in \beta_2$

where
$\beta_1 = \{ i \mid x^k_i - \gamma \nabla_i f(x^k) \le \ell_i \}$  (estimate of $x^\star_i = \ell_i$)
$\beta_2 = \{ i \mid x^k_i - \gamma \nabla_i f(x^k) \ge u_i \}$  (estimate of $x^\star_i = u_i$)

The Newton system reduces to
$Q_{\delta\delta} d_\delta = -(\nabla_\delta f(x^k) + Q_{\delta\beta} d_\beta)$,  with $Q = \nabla^2 f(x^k)$, $\beta = \beta_1 \cup \beta_2$, $\delta = [n] \setminus \beta$
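A sketch of one such Newton direction for a box-constrained QP (my own illustration, assuming $Q_{\delta\delta}$ is nonsingular): it forms the index estimates $\beta_1$, $\beta_2$ and solves the reduced system above.

```python
import numpy as np

def fbn_direction_box(x, Q, q, lo, hi, gamma):
    """Newton direction for min 0.5 x'Qx + q'x  s.t.  lo <= x <= hi."""
    grad = Q @ x + q
    y = x - gamma * grad                       # forward step
    beta1 = y <= lo                            # estimate of x*_i = lo_i
    beta2 = y >= hi                            # estimate of x*_i = hi_i
    delta = ~(beta1 | beta2)                   # remaining (free) indices
    d = np.empty_like(x)
    d[beta1] = lo[beta1] - x[beta1]
    d[beta2] = hi[beta2] - x[beta2]
    rhs = -(grad[delta] + Q[np.ix_(delta, ~delta)] @ d[~delta])
    d[delta] = np.linalg.solve(Q[np.ix_(delta, delta)], rhs)
    return d
```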
Example

minimize $\tfrac{1}{2} \langle x, Qx \rangle + \langle q, x \rangle$ subject to $\ell \le x \le u$,  $n = 1000$

[Figure: $F(x^\nu) - F^\star$ versus time [sec] for FBN, FBN II, and AFBS (legend also lists PNM, PGNM, FGM), on two instances: $\mathrm{cond}(Q) = 10^4$ and $\mathrm{cond}(Q) = 10^8$.]
$\mathrm{cond}(Q) = 10^4$: GUROBI 4.87 sec, CPLEX 3.73 sec.  $\mathrm{cond}(Q) = 10^8$: GUROBI 5.96 sec, CPLEX 4.83 sec.

FBN: much less sensitive to bad conditioning
Sparse least-squares

minimize  $\tfrac{1}{2} \|Ax - b\|^2 + \lambda \|x\|_1$

- the Newton system becomes
  $d_\beta = -x_\beta$
  $A_{\cdot\delta}^\top A_{\cdot\delta} \, d_\delta = -[A_{\cdot\delta}^\top (A_{\cdot\delta} x_\delta - b) + \lambda \,\mathrm{sign}(x_\delta - \gamma \nabla_\delta f(x))]$
- $\delta$ is an estimate of the nonzero components of $x^\star$:
  $\delta = \{ i \mid |x_i - \gamma \nabla_i f(x)| > \lambda\gamma \}$
- close to the solution, $\delta$ is small
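A sketch of this Newton direction (mine, assuming $A_{\cdot\delta}^\top A_{\cdot\delta}$ is nonsingular, i.e. the estimated support $\delta$ is small enough):

```python
import numpy as np

def fbn_direction_lasso(x, A, b, lam, gamma):
    """Newton direction for 0.5||Ax-b||^2 + lam*||x||_1 using the support estimate delta."""
    grad = A.T @ (A @ x - b)
    y = x - gamma * grad
    delta = np.abs(y) > lam * gamma            # estimated nonzeros of x*
    beta = ~delta                              # estimated zeros
    d = np.empty_like(x)
    d[beta] = -x[beta]                         # d_beta = -x_beta
    Ad = A[:, delta]
    rhs = -(Ad.T @ (Ad @ x[delta] - b) + lam * np.sign(y[delta]))
    d[delta] = np.linalg.solve(Ad.T @ Ad, rhs)
    return d
```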