Semi-smooth Newton Type Methods for Composite Convex Programs
Zaiwen Wen
Beijing International Center for Mathematical Research, Peking University
wenzw@pku.edu.cn
Outline
1. Composite convex programs
2. Semi-smoothness of the proximal mapping
3. Semi-smooth Newton methods based on the primal
   - Approach
   - Numerical results
4. Semi-smooth Newton method based on the dual (SDPNAL)
Composite convex program
Consider the following composite convex program
$$\min_{x \in \mathbb{R}^n} \; f(x) + h(x),$$
where f and h are convex, h is differentiable but f may not be.
Many applications:
- Sparse and low-rank optimization: $f(x) = \|x\|_1$ or $f(X) = \|X\|_*$, and many other forms.
- Regularized risk minimization: $h(x) = \sum_i h_i(x)$ is a loss function of some misfit and f is a regularization term.
- Constrained programs: f is the indicator function of a convex set.
A General Recipe
Goal: study approaches that bridge the gap between first-order and second-order type methods for composite convex programs.
Key observations:
- Many popular first-order methods can be written as fixed-point iterations $x^{k+1} = T(x^k)$. Advantages: easy to implement; converge fast to a solution of moderate accuracy. Disadvantage: slow tail convergence.
- The original problem is equivalent to the nonlinear system $F(x) := (I - T)(x) = 0$.
- A Newton-type method is applicable since $F(x)$ is semi-smooth in many cases.
- Computational costs can be controlled reasonably well.
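In code, the starting point of the recipe looks like this (a generic sketch of ours; T is any user-supplied fixed-point operator):

```python
import numpy as np

def fixed_point(T, x0, tol=1e-6, maxiter=10000):
    # Run x_{k+1} = T(x_k) and monitor the residual F(x) = x - T(x),
    # which is exactly the quantity a Newton-type method drives to zero.
    x = x0
    for k in range(maxiter):
        Tx = T(x)
        if np.linalg.norm(x - Tx) <= tol:  # ||F(x)|| small: near a fixed point
            return Tx, k
        x = Tx
    return x, maxiter
```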
An SDP From Electronic Structure Calculation
System: BeO. [Figure: error (err) versus iteration for the two solvers.]
(a) ADMM, CPU: 2003s. (b) Semi-smooth Newton, CPU: 635s.
Operator splitting and fixed-point algorithms
Examples:
- forward-backward splitting (FBS)
- Douglas-Rachford splitting (DRS)
- Peaceman-Rachford splitting (PRS)
- alternating direction method of multipliers (ADMM)
Advantages: easy to implement; converge fast to a solution of moderate accuracy. Disadvantage: slow tail convergence.
Forward-backward splitting (FBS)
Consider
$$\min_{x \in \mathbb{R}^n} \; f(x) + h(x).$$
The proximal mapping of f is defined by
$$\mathrm{prox}_{tf}(x) := \operatorname*{argmin}_{u \in \mathbb{R}^n} \left\{ f(u) + \frac{1}{2t} \|u - x\|_2^2 \right\}.$$
The proximal gradient method, i.e., the FBS, is the iteration
$$x^{k+1} = \mathrm{prox}_{tf}\big(x^k - t \nabla h(x^k)\big), \quad k = 0, 1, \dots$$
It is equivalent to the fixed-point iteration $x^{k+1} = T_{\mathrm{FBS}}(x^k)$, where
$$T_{\mathrm{FBS}} := \mathrm{prox}_{tf} \circ (I - t \nabla h).$$
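As a concrete instance (our own sketch with hypothetical toy data, not from the slides): take $f = \|\cdot\|_1$ and $h(x) = \frac{1}{2}\|Ax - b\|_2^2$, the LASSO problem; $\mathrm{prox}_{tf}$ is then soft-thresholding at level t.

```python
import numpy as np

def prox_l1(x, t):
    # proximal mapping of f = ||.||_1: componentwise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fbs(A, b, t, iters=500):
    # FBS for min ||x||_1 + 0.5*||Ax - b||^2:
    # forward (gradient) step on h, backward (proximal) step on f
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad_h = A.T @ (A @ x - b)
        x = prox_l1(x - t * grad_h, t)
    return x

# hypothetical data; t <= 1/||A||_2^2 keeps the step within the safe range
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
b = A @ (rng.standard_normal(100) * (rng.random(100) < 0.05))
x = fbs(A, b, t=1.0 / np.linalg.norm(A, 2) ** 2)
```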
Douglas-Rachford splitting (DRS)
DRS is the following update:
$$\begin{aligned} x^{k+1} &= \mathrm{prox}_{th}(z^k), \\ y^{k+1} &= \mathrm{prox}_{tf}(2x^{k+1} - z^k), \\ z^{k+1} &= z^k + y^{k+1} - x^{k+1}. \end{aligned}$$
It is equivalent to the fixed-point iteration $z^{k+1} = T_{\mathrm{DRS}}(z^k)$, where
$$T_{\mathrm{DRS}} := I + \mathrm{prox}_{tf} \circ (2\,\mathrm{prox}_{th} - I) - \mathrm{prox}_{th}.$$
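A generic sketch of the iteration (our own; prox_tf and prox_th are user-supplied proximal operators, and at a fixed point $z^*$ of $T_{\mathrm{DRS}}$ the point $\mathrm{prox}_{th}(z^*)$ solves the problem):

```python
def t_drs(z, prox_tf, prox_th):
    # One application of T_DRS = I + prox_tf o (2 prox_th - I) - prox_th
    x = prox_th(z)
    y = prox_tf(2 * x - z)
    return z + y - x

def drs(z0, prox_tf, prox_th, iters=500):
    z = z0
    for _ in range(iters):
        z = t_drs(z, prox_tf, prox_th)
    return prox_th(z)  # recover the x-iterate from the (near) fixed point
```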
Alternating direction method of multipliers (ADMM)
Consider the linearly constrained program
$$\min_{x_1 \in \mathbb{R}^{n_1},\, x_2 \in \mathbb{R}^{n_2}} \; f_1(x_1) + f_2(x_2) \quad \text{s.t.} \quad A_1 x_1 + A_2 x_2 = b.$$
The dual problem is
$$\min_{w \in \mathbb{R}^m} \; d_1(w) + d_2(w),$$
where $d_1(w) := f_1^*(A_1^T w)$ and $d_2(w) := f_2^*(A_2^T w) - b^T w$.
The ADMM applied to the primal is equivalent to the DRS applied to the dual.
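To see where $d_1$ and $d_2$ come from, here is the standard Lagrangian derivation (not spelled out on the slide). With multiplier $w$ for the linear constraint,
$$\max_{w}\; \min_{x_1, x_2}\; f_1(x_1) + f_2(x_2) + w^T (b - A_1 x_1 - A_2 x_2) = \max_{w}\; b^T w - f_1^*(A_1^T w) - f_2^*(A_2^T w),$$
since $\min_x \{ f(x) - w^T A x \} = -f^*(A^T w)$. Negating the objective turns this maximization into the stated minimization of $d_1(w) + d_2(w)$.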
Outline
1. Composite convex programs
2. Semi-smoothness of the proximal mapping
3. Semi-smooth Newton methods based on the primal
   - Approach
   - Numerical results
4. Semi-smooth Newton method based on the dual (SDPNAL)
Semi-smooth Newton-type method
Solve the system $F(z) = 0$, where $F(z) := z - T(z)$ and $T(z)$ is a fixed-point mapping.
- Fixed-point algorithms suffer from slow tail convergence and may not be suitable when high accuracy is required.
- $F(z)$ fails to be differentiable in many interesting applications,
- but $F(z)$ is (strongly) semi-smooth and monotone.
This motivates a semi-smooth Newton type method.
Semi-smoothness
Let $F : \mathcal{O} \to \mathbb{R}^m$ be locally Lipschitz continuous. The B-subdifferential of F at x is defined by
$$\partial_B F(x) := \left\{ \lim_{k \to \infty} F'(x^k) \;\middle|\; x^k \in D_F,\; x^k \to x \right\},$$
where $D_F$ is the set of points at which F is differentiable. The set $\partial F(x) = \mathrm{co}(\partial_B F(x))$ is called Clarke's generalized Jacobian.
We say that F is semi-smooth at $x \in \mathcal{O}$ if
- F is directionally differentiable at x, and
- for any d and $J \in \partial F(x + d)$,
$$\|F(x + d) - F(x) - Jd\| = o(\|d\|) \quad \text{as } d \to 0.$$
F is said to be strongly semi-smooth at $x \in \mathcal{O}$ if F is semi-smooth and, for any d and $J \in \partial F(x + d)$,
$$\|F(x + d) - F(x) - Jd\| = O(\|d\|^2) \quad \text{as } d \to 0.$$
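A one-dimensional worked example (not on the slide) makes the definition concrete. Take $F(x) = |x|$ on $\mathbb{R}$:
$$\partial_B F(0) = \{-1, +1\}, \qquad \partial F(0) = [-1, 1].$$
For any $d \neq 0$ the only element of $\partial F(0 + d)$ is $J = \mathrm{sign}(d)$, so
$$|F(0 + d) - F(0) - Jd| = \big|\,|d| - \mathrm{sign}(d)\, d\,\big| = 0 = O(\|d\|^2),$$
hence $|x|$ is strongly semi-smooth at 0.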
Semi-smoothness
(Strong) semi-smoothness is closed under scalar multiplication, summation and composition. A vector-valued function is (strongly) semi-smooth if and only if each of its component functions is (strongly) semi-smooth.
Examples:
- Semi-smooth: smooth functions; all convex functions (thus norms); piecewise differentiable functions.
- Strongly semi-smooth: differentiable functions with Lipschitz gradients; the norm $\|\cdot\|_p$ for every $p \in [1, \infty]$; piecewise affine functions.
Semi-smoothness of proximal mappings
Many commonly seen proximal mappings are semi-smooth. Examples:
- The proximal mapping of the $\ell_1$-norm $\|x\|_1$ (or the $\ell_\infty$-norm $\|x\|_\infty$) is strongly semi-smooth.
- The projection¹ onto a polyhedral set is piecewise linear and hence strongly semi-smooth.
- The projections onto symmetric cones are proved to be strongly semi-smooth.
- In many applications, the proximal mapping is shown to be piecewise $C^1$ and hence semi-smooth.
¹ The proximal mapping of the indicator function of a closed set is the metric projection onto this set.
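For the $\ell_1$ case the semi-smooth structure is completely explicit. A minimal sketch of ours: the generalized Jacobian of soft-thresholding is diagonal with entries in $\{0, 1\}$; at the kink $|x_i| = t$ either value is a valid selection, and the code below picks 0.

```python
import numpy as np

def prox_l1(x, t):
    # soft-thresholding: the proximal mapping of t*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def jac_prox_l1(x, t):
    # One element of the generalized Jacobian of prox_l1 at x:
    # diagonal, entry 1 where |x_i| > t and 0 otherwise
    # (both 0 and 1 are admissible at the kink |x_i| = t).
    return np.diag((np.abs(x) > t).astype(float))
```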
Some concepts on monotonicity
- A mapping $F : \mathbb{R}^n \to \mathbb{R}^n$ is said to be monotone if
$$\langle x - y, F(x) - F(y) \rangle \geq 0 \quad \text{for all } x, y \in \mathbb{R}^n.$$
- It is called strongly monotone with modulus $c > 0$ if
$$\langle x - y, F(x) - F(y) \rangle \geq c \|x - y\|_2^2 \quad \text{for all } x, y \in \mathbb{R}^n.$$
- It is said to be cocoercive with modulus $\beta > 0$ if
$$\langle x - y, F(x) - F(y) \rangle \geq \beta \|F(x) - F(y)\|_2^2 \quad \text{for all } x, y \in \mathbb{R}^n.$$
Monotone mappings
Monotonicity properties of $F_{\mathrm{FBS}} = I - T_{\mathrm{FBS}}$ and $F_{\mathrm{DRS}} = I - T_{\mathrm{DRS}}$:
(i) Suppose that $\nabla h$ is cocoercive with modulus $\beta > 0$; then $F_{\mathrm{FBS}}$ is monotone if $0 < t \leq 2\beta$.
(ii) Suppose that $\nabla h$ is strongly monotone with modulus $c > 0$ and Lipschitz with constant $L > 0$; then $F_{\mathrm{FBS}}$ is strongly monotone if $0 < t < 2c/L^2$.
(iii) Suppose that $h \in C^2$, $H(x) := \nabla^2 h(x)$ is positive semidefinite for any $x \in \mathbb{R}^n$, and $\bar{\lambda} = \max_x \lambda_{\max}(H(x)) < \infty$. Then $F_{\mathrm{FBS}}$ is monotone if $0 < t \leq 2/\bar{\lambda}$.
(iv) The fixed-point mapping $F_{\mathrm{DRS}}$ is monotone.
(v) For a monotone and Lipschitz continuous mapping $F : \mathbb{R}^n \to \mathbb{R}^n$ and any $x \in \mathbb{R}^n$, each element of $\partial_B F(x)$ is positive semidefinite.
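A quick numerical sanity check of (i), with hypothetical data of our own: for $h(x) = \frac{1}{2}\|Ax\|_2^2$, $\nabla h$ is cocoercive with $\beta = 1/L$ where $L = \|A^T A\|_2$, so $F_{\mathrm{FBS}}$ should be monotone for any $0 < t \leq 2/L$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 50))
L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad h
t = 1.5 / L                            # inside the range 0 < t <= 2/L

def prox_l1(x, s):
    return np.sign(x) * np.maximum(np.abs(x) - s, 0.0)

def F_fbs(x):
    # F_FBS = I - prox_{tf} o (I - t grad h), with f = ||.||_1
    grad_h = A.T @ (A @ x)
    return x - prox_l1(x - t * grad_h, t)

# <x - y, F(x) - F(y)> should be nonnegative for all pairs
vals = [(x - y) @ (F_fbs(x) - F_fbs(y))
        for x, y in (rng.standard_normal((2, 50)) for _ in range(1000))]
print(min(vals) >= -1e-10)             # expect True
```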
Outline
1. Composite convex programs
2. Semi-smoothness of the proximal mapping
3. Semi-smooth Newton methods based on the primal
   - Approach
   - Numerical results
4. Semi-smooth Newton method based on the dual (SDPNAL)
Semi-smooth Newton system
Take $J_k \in \partial_B F(z^k)$, which is positive semidefinite. The regularized Newton's method solves
$$(J_k + \mu_k I)\, d = -F_k,$$
where $F_k = F(z^k)$, $\mu_k = \lambda_k \|F_k\|$, and $\lambda_k > 0$ is a regularization parameter.
Solve the linear system inexactly: with the residual
$$r_k := (J_k + \mu_k I)\, d^k + F_k,$$
seek a step $d^k$ that solves the system approximately, in the sense that
$$\|r_k\| \leq \tau \min\{1,\; \lambda_k \|F_k\| \cdot \|d^k\|\},$$
where $0 < \tau < 1$ is some positive constant.
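A minimal matrix-free sketch of the inexact solve by conjugate gradients (assuming $J_k$ is symmetric, which holds in many applications; for a nonsymmetric Jacobian a Krylov solver such as BiCGStab would be used instead). The helper name Jv and the parameter values are ours:

```python
import numpy as np

def newton_step_cg(Jv, F, lam, tau=0.1, maxiter=100):
    # Approximately solve (J + mu*I) d = -F by CG, where Jv(v) returns J @ v
    # (matrix-free) and mu = lam * ||F||.
    mu = lam * np.linalg.norm(F)
    Av = lambda v: Jv(v) + mu * v
    d = np.zeros_like(F)
    r = -F - Av(d)                      # residual of (J + mu I) d = -F
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        # stop once ||r_k|| <= tau * min(1, lam*||F||*||d||)
        if np.sqrt(rs) <= tau * min(1.0, mu * np.linalg.norm(d)):
            break
        Ap = Av(p)
        alpha = rs / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```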
Semi-smooth Newton method
Select $0 < \nu < 1$, $0 < \eta_1 \leq \eta_2 < 1$, $1 < \gamma_1 \leq \gamma_2$ and $\lambda > 0$.
- Compute a trial point $u^k = z^k + d^k$.
- Define the ratio
$$\rho_k = -\frac{\langle F(u^k), d^k \rangle}{\|d^k\|_F^2}.$$
- Update the point:
$$z^{k+1} = \begin{cases} u^k, & \text{if } \|F(u^k)\|_F \leq \nu \max_{\max(1,\,k-\zeta+1) \leq j \leq k} \|F(z^j)\|_F, \quad \text{[Newton]} \\ z^k, & \text{otherwise}. \quad \text{[failed]} \end{cases}$$
- Update the regularization parameter:
$$\lambda_{k+1} \in \begin{cases} (\lambda,\, \lambda_k], & \text{if } \rho_k \geq \eta_2, \\ [\lambda_k,\, \gamma_1 \lambda_k], & \text{if } \eta_1 \leq \rho_k < \eta_2, \\ (\gamma_1 \lambda_k,\, \gamma_2 \lambda_k], & \text{otherwise}. \end{cases}$$
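One possible realization of the update rules (a sketch; the concrete selections inside each interval, the default parameter values, and all names are ours):

```python
import numpy as np

def newton_update(z, d, F, lam, Fz_hist, nu=0.8, eta1=0.2, eta2=0.8,
                  gamma1=2.0, gamma2=10.0, lam_lb=1e-6, zeta=5):
    """One acceptance / regularization update of the semi-smooth Newton method.
    F is the residual mapping; Fz_hist holds the past residual norms ||F(z^j)||."""
    u = z + d                                  # trial point u^k = z^k + d^k
    Fu = F(u)
    rho = -(Fu @ d) / (d @ d)                  # the ratio rho_k
    # nonmonotone acceptance over the last zeta residual norms
    if np.linalg.norm(Fu) <= nu * max(Fz_hist[-zeta:]):
        z_next = u                             # [Newton] accept the trial point
    else:
        z_next = z                             # [failed] keep the current point
    # regularization update driven by rho_k (one concrete selection rule)
    if rho >= eta2:
        lam_next = max(lam_lb, 0.5 * lam)      # shrink: a point in (lam, lam_k]
    elif rho >= eta1:
        lam_next = lam                         # keep:   a point in [lam_k, gamma1*lam_k]
    else:
        lam_next = gamma2 * lam                # grow:   a point in (gamma1*lam_k, gamma2*lam_k]
    return z_next, lam_next
```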
Ensuring global convergence I
- If the residual F is not reduced sufficiently, or certain other conditions are not met, switch to a first-order method.
- Note that one evaluation of F is itself one step of a first-order method.
- Can we construct another point from the Newton step?
References:
- X. Xiao, Y. Li, Z. Wen, L. Zhang, A Regularized Semi-Smooth Newton Method with Projection Steps for Composite Convex Programs, Journal of Scientific Computing, 2018, Vol. 76, No. 1, pp. 364-389.
- Y. Li, Z. Wen, C. Yang, Y. Yuan, A Semi-smooth Newton Method for Semidefinite Programs and its Applications in Electronic Structure Calculations, SIAM Journal on Scientific Computing, Vol. 40, No. 6, 2018, pp. A4131-A4157.
Ensuring global convergence II: projection step
- If $d^k = 0$, then $z^k$ is an optimal solution.
- Otherwise, take the trial point $u^k = z^k + d^k$. When $d^k$ is small enough,
$$\langle F(u^k), z^k - u^k \rangle = -\langle F(u^k), d^k \rangle > 0.$$
- By the monotonicity of F, for any optimal solution $z^*$,
$$\langle F(u^k), z^* - u^k \rangle \leq 0.$$
- Therefore the hyperplane
$$H_k := \{ z \in \mathbb{R}^n \mid \langle F(u^k), z - u^k \rangle = 0 \}$$
strictly separates $z^k$ from the solution set $Z^*$.
Ensuring global convergence II: projection step
Define the ratio
$$\rho_k = -\frac{\langle F(u^k), d^k \rangle}{\|d^k\|^2}.$$
- If $\rho_k$ is large enough, set
$$z^{k+1} = z^k - \frac{\langle F(u^k), z^k - u^k \rangle}{\|F(u^k)\|^2}\, F(u^k),$$
which is the projection of $z^k$ onto the hyperplane $H_k$.
- If $\rho_k$ is too small, set $z^{k+1} = z^k$ and increase the regularization parameter.
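The projection itself is one line (a sketch; the names are ours):

```python
import numpy as np

def projection_step(z, u, Fu):
    # Project z^k onto H_k = { x : <F(u^k), x - u^k> = 0 }, with Fu = F(u^k).
    return z - (Fu @ (z - u)) / (Fu @ Fu) * Fu
```

By construction $\langle F(u^k), z^{k+1} - u^k \rangle = 0$, and since $H_k$ separates $z^k$ from $Z^*$, the projected point is strictly closer to every optimal solution, which is what drives the global convergence argument.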