Linear Convergence of Randomized Primal-Dual Coordinate Method for Large-scale Linear Constrained Convex Programming
Daoli Zhu and Lei Zhao
Shanghai Jiao Tong University
ICML 2020
July 16, 2020
PRO PRE CCR LCA NA CCL

Outline
1 Research Problem
2 Preliminaries
3 Convergence and Convergence Rate Analysis of RPDC
4 Linear Convergence of RPDC under Global Strong Metric Subregularity
5 Numerical Analysis
6 Conclusions
1. Research Problem

Linear Constrained Convex Programming (LCCP):

$$\text{(P):}\quad \min\; F(u) = G(u) + J(u) \quad \text{s.t.}\quad Au - b = 0,\quad u \in U \tag{1.1}$$

Assumption 1
(H1) $J$ is a convex, lower semi-continuous function (not necessarily differentiable) such that $\operatorname{dom} J \cap U \neq \emptyset$.
(H2) $G$ is convex and differentiable, and its derivative is Lipschitz with constant $B_G$.
(H3) There exists at least one saddle point for the Lagrangian of (P).

Decomposition for the partially structured problem:
Space decomposition of $U$: $U = U_1 \times U_2 \times \cdots \times U_N$, $U_i \subset \mathbb{R}^{n_i}$, $\sum_{i=1}^{N} n_i = n$.
$J(u) = \sum_{i=1}^{N} J_i(u_i)$ and $A = (A_1, A_2, \ldots, A_N) \in \mathbb{R}^{m \times n}$ is an appropriate partition of $A$, where $A_i$ is an $m \times n_i$ matrix.
1.1 Motivation

Support vector machine (SVM) problem:

$$\text{(SVM)}\quad \min_{u \in [0,c]^n}\; \tfrac{1}{2} u^\top Q u - \mathbf{1}_n^\top u \quad \text{s.t.}\quad y^\top u = 0$$

$Q \in \mathbb{R}^{n \times n}$ is symmetric and positive-definite. $c > 0$, $y \in \{-1, 1\}^n$.

Machine learning portfolio (MLP) problem:

$$\text{(MLP)}\quad \min_{u \in \mathbb{R}^n}\; \tfrac{1}{2} u^\top \Sigma u + \lambda \|u\|_1 \quad \text{s.t.}\quad \mu^\top u = \rho,\quad \mathbf{1}_n^\top u = 1$$

$\Sigma \in \mathbb{R}^{n \times n}$ is the estimated covariance matrix of asset returns.
$\mu \in \mathbb{R}^n$ is the expectation of asset returns.
$\rho$ is a predefined prospective growth rate.
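To make the link to (P) concrete, the following minimal sketch spells out one way the SVM problem could be mapped onto the template: $G(u) = \tfrac12 u^\top Q u - \mathbf{1}_n^\top u$, $J = 0$, $A = y^\top$, $b = 0$, and $U = [0,c]^n$. The function name `svm_as_lccp` and the returned tuple layout are hypothetical and chosen only for illustration; plain Python lists stand in for vectors and matrices.

```python
def svm_as_lccp(Q, y, c):
    """Return the pieces (G, grad_G, project_U, A, b) of (P) for the SVM problem.

    Hypothetical helper for illustration: Q is a symmetric n x n list of lists,
    y a list of +/-1 labels, c > 0 the box bound.
    """
    n = len(y)

    def G(u):
        # G(u) = 0.5 * u^T Q u - 1^T u
        quad = sum(u[i] * Q[i][j] * u[j] for i in range(n) for j in range(n))
        return 0.5 * quad - sum(u)

    def grad_G(u):
        # grad G(u) = Q u - 1  (uses the symmetry of Q)
        return [sum(Q[i][j] * u[j] for j in range(n)) - 1.0 for i in range(n)]

    def project_U(u):
        # Euclidean projection onto the box U = [0, c]^n
        return [min(max(ui, 0.0), c) for ui in u]

    A = [[float(yi) for yi in y]]  # single constraint row: y^T u = 0
    b = [0.0]
    return G, grad_G, project_U, A, b
```

With single-coordinate blocks ($N = n$), each $A_i$ is the scalar $y_i$ and each $U_i$ is the interval $[0, c]$.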
1.1 Motivation

In the big data era, the datasets used for computation are very large and are often distributed across different locations.
It is often impractical to assume that an optimization algorithm can traverse the entire dataset in each iteration, because doing so is either time-consuming or unreliable.
Coordinate-type methods can make progress using distributed information and thus offer considerable flexibility for implementation in distributed environments.
Therefore, we adopt randomized coordinate methods for the constrained optimization problem, with emphasis on convergence and rate-of-convergence properties.
1.2 Related works: augmented Lagrangian decomposition method

The augmented Lagrangian of (P) is
$$L_\gamma(u, p) = F(u) + \langle p, Au - b \rangle + \tfrac{\gamma}{2} \|Au - b\|^2 .$$

Augmented Lagrangian method (ALM) (Hestenes, 1969; Powell, 1969)
$$u^{k+1} = \arg\min_{u \in U} L_\gamma(u, p^k); \qquad \text{(does not preserve separability)}$$
$$p^{k+1} = p^k + \gamma (Au^{k+1} - b).$$

Augmented Lagrangian decomposition method (I)
Alternating Direction Method of Multipliers (ADMM) (Fortin & Glowinski, 1983): a Gauss-Seidel method for ALM.
$$u_1^{k+1} = \arg\min_{u_1 \in U_1} L_\gamma(u_1, u_2^k, u_3^k, \ldots, u_{N-1}^k, u_N^k, p^k);$$
$$u_2^{k+1} = \arg\min_{u_2 \in U_2} L_\gamma(u_1^{k+1}, u_2, u_3^k, \ldots, u_{N-1}^k, u_N^k, p^k);$$
$$\vdots$$
$$u_N^{k+1} = \arg\min_{u_N \in U_N} L_\gamma(u_1^{k+1}, u_2^{k+1}, u_3^{k+1}, \ldots, u_{N-1}^{k+1}, u_N, p^k);$$
$$p^{k+1} = p^k + \gamma (Au^{k+1} - b).$$
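As a small illustration of the ALM iteration, the sketch below applies it to a toy instance that is not from the slides: $\min \tfrac12\|u - c\|^2$ subject to a single constraint $a^\top u = b$, with $J = 0$ and $U = \mathbb{R}^n$. For this instance the primal subproblem has a closed form, so no inner solver is needed. The function name `alm_quadratic` and the default values are assumptions made for the example.

```python
def alm_quadratic(c, a, b, gamma=1.0, iters=50):
    """ALM for the toy problem  min 0.5*||u - c||^2  s.t.  a.u = b.

    Assumptions for illustration: J = 0, U = R^n, one linear constraint.
    """
    aa = sum(ai * ai for ai in a)               # ||a||^2
    ac = sum(ai * ci for ai, ci in zip(a, c))   # a.c
    p = 0.0
    u = list(c)
    for _ in range(iters):
        # Stationarity of L_gamma(., p):  u - c + a*(p + gamma*(a.u - b)) = 0.
        # Taking the inner product with a gives s = a.u in closed form.
        s = (ac - aa * p + gamma * aa * b) / (1.0 + gamma * aa)
        mult = p + gamma * (s - b)
        u = [ci - ai * mult for ci, ai in zip(c, a)]  # u^{k+1}
        p = mult                                      # p^{k+1} = p^k + gamma*(a.u^{k+1} - b)
    return u, p
```

On this instance the dual error shrinks by the factor $1/(1 + \gamma\|a\|^2)$ per iteration, so a few dozen iterations reach machine precision. The primal step couples all coordinates through $a^\top u$, which is the loss of separability noted above.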
1.2 Related works: augmented Lagrangian decomposition method

Augmented Lagrangian decomposition method (II)
Auxiliary Problem Principle of Augmented Lagrangian (APP-AL) (Cohen & Zhu, 1983): linearize the smooth term in the primal problem of ALM and add a regularization term.
$$u^{k+1} = \arg\min_{u \in U}\; \langle \nabla G(u^k), u \rangle + J(u) + \langle p^k + \gamma (Au^k - b), Au \rangle + \tfrac{1}{\epsilon} D(u, u^k);$$
$$p^{k+1} = p^k + \gamma (Au^{k+1} - b),$$
where $D(u, v) = K(u) - K(v) - \langle \nabla K(v), u - v \rangle$ is a Bregman-like function.

Randomized Primal-Dual Coordinate method (RPDC) (this paper): randomly updates one block of variables in the primal subproblem of APP-AL.
Choose $i(k)$ from $\{1, \ldots, N\}$ with equal probability;
$$u^{k+1} = \arg\min_{u \in U}\; \langle \nabla_{i(k)} G(u^k), u_{i(k)} \rangle + J_{i(k)}(u_{i(k)}) + \langle p^k + \gamma (Au^k - b), A_{i(k)} u_{i(k)} \rangle + \tfrac{1}{\epsilon} D(u, u^k);$$
$$p^{k+1} = p^k + \rho (Au^{k+1} - b).$$
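A worked special case may help in reading the RPDC primal step. Under the assumptions (made here for illustration, not stated on the slide) that $K(u) = \tfrac12\|u\|^2$, so that $D(u, u^k) = \tfrac12\|u - u^k\|^2$, and that $U_{i} = \mathbb{R}^{n_{i}}$, the subproblem separates across blocks: the blocks not selected stay where they are, and the selected block takes a proximal gradient step.

```latex
% Assumptions: K(u) = \tfrac12\|u\|^2 and U_i = \mathbb{R}^{n_i} (illustrative special case).
\begin{aligned}
g^k_{i(k)} &= \nabla_{i(k)} G(u^k) + A_{i(k)}^\top \big( p^k + \gamma (Au^k - b) \big), \\
u^{k+1}_{j} &= u^k_{j}, \qquad j \neq i(k), \\
u^{k+1}_{i(k)} &= \arg\min_{u_{i(k)}} \Big\{ \langle g^k_{i(k)}, u_{i(k)} \rangle + J_{i(k)}(u_{i(k)})
   + \tfrac{1}{2\epsilon} \| u_{i(k)} - u^k_{i(k)} \|^2 \Big\}
 = \operatorname{prox}_{\epsilon J_{i(k)}} \big( u^k_{i(k)} - \epsilon\, g^k_{i(k)} \big).
\end{aligned}
```

The unselected blocks stay fixed because they appear in the objective only through $\tfrac{1}{2\epsilon}\|u_j - u^k_j\|^2$, which is minimized at $u_j = u^k_j$.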
1.2 Related works: comparison between RPDC and the Randomized Coordinate Descent algorithm (RCD) of Necoara & Patrascu, 2014

Randomized Primal-Dual Coordinate method (RPDC) (this paper)
Choose $i(k)$ from $\{1, \ldots, N\}$ with equal probability;
$$u^{k+1} = \arg\min_{u \in U}\; \langle \nabla_{i(k)} G(u^k), u_{i(k)} \rangle + J_{i(k)}(u_{i(k)}) + \langle p^k + \gamma (Au^k - b), A_{i(k)} u_{i(k)} \rangle + \tfrac{1}{\epsilon} D(u, u^k);$$
$$p^{k+1} = p^k + \rho (Au^{k+1} - b).$$

Necoara & Patrascu, 2014 consider problem (P) with $A \in \mathbb{R}^{1 \times n}$, $b = 0$, and $U = \mathbb{R}^n$:
$$\text{(P'):}\quad \min_{u \in \mathbb{R}^n}\; G(u) + J(u) \quad \text{s.t.}\quad a^\top u = 0,$$
where $a = (a_1, \ldots, a_n)^\top \in \mathbb{R}^n$.

The randomized coordinate descent algorithm (RCD) of Necoara & Patrascu, 2014 for (P') is:
Choose $i(k)$ and $j(k)$ from $\{1, \ldots, n\}$ with equal probability;
$$u^{k+1} = \arg\min_{a_{i(k)} u_{i(k)} + a_{j(k)} u_{j(k)} = 0}\; \langle \nabla_{i(k)} G(u^k), u_{i(k)} \rangle + \langle \nabla_{j(k)} G(u^k), u_{j(k)} \rangle + J_{i(k)}(u_{i(k)}) + J_{j(k)}(u_{j(k)}) + \tfrac{1}{2\epsilon} \|u - u^k\|^2 .$$

RPDC can handle more general problems than RCD: (P) allows multiple linear constraints ($A \in \mathbb{R}^{m \times n}$), a nonzero $b$, and a constraint set $U$.
1.2 Related works: similar schemes

Paper                  | Problem         | Algorithm | Theoretical results
-----------------------|-----------------|-----------|------------------------------------------------------------
Xu & Zhang, 2018       | similar to (P)  | RPDC      | F is strongly convex: O(1/t^2) rate.
Gao, Xu & Zhang, 2019  | similar to (P)  | RPDC      | F is convex: O(1/t) rate.
This paper             | (P)             | RPDC      | F is convex: (i) almost sure convergence; (ii) O(1/t) rate.
                       |                 |           | Global strong metric subregularity: (iii) linear convergence.
1.3 Contribution

We propose the randomized primal-dual coordinate (RPDC) method, based on the first-order primal-dual method (Cohen & Zhu, 1984; Zhao & Zhu, 2019).
(i) We show that the sequence generated by RPDC converges to an optimal solution with probability 1.
(ii) We show that RPDC has an expected O(1/t) rate for general LCCP.
(iii) We establish the expected linear convergence of RPDC under global strong metric subregularity.
(iv) We show that the SVM and MLP problems satisfy global strong metric subregularity under some reasonable conditions.
2. Preliminaries

Lagrangian of (P):
$$L(u, p) = F(u) + \langle p, Au - b \rangle .$$

Saddle point inequality: $\forall u \in U$, $p \in \mathbb{R}^m$:
$$L(u^*, p) \le L(u^*, p^*) \le L(u, p^*). \tag{2.2}$$

Karush-Kuhn-Tucker (KKT) system of (P): Let $w = (u, p)$ and let $U^* \times P^*$ be the set of saddle points. $\forall w \in U^* \times P^*$,
$$0 \in H(w) = \begin{pmatrix} \partial_u L(u, p) + N_U(u) \\ -\nabla_p L(u, p) \end{pmatrix} = \begin{pmatrix} \nabla G(u) + \partial J(u) + A^\top p + N_U(u) \\ b - Au \end{pmatrix},$$
where $N_U(u) = \{ \xi : \langle \xi, \zeta - u \rangle \le 0,\ \forall \zeta \in U \}$ is the normal cone at $u$ to $U$.
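The KKT map can be evaluated numerically in simple cases. The sketch below covers only the smooth unconstrained case, assuming $J = 0$ and $U = \mathbb{R}^n$, so that $\partial J(u) = \{0\}$ and $N_U(u) = \{0\}$ and $H(w)$ is single-valued. The function name `kkt_residual` is hypothetical, and matrices are plain lists of rows.

```python
import math

def kkt_residual(grad_G, A, b, u, p):
    """Euclidean norm of H(w) = (grad G(u) + A^T p, b - A u).

    Assumes J = 0 and U = R^n (illustrative special case), so the
    subdifferential and the normal cone both reduce to {0}.
    """
    m, n = len(A), len(u)
    g = grad_G(u)
    # First block: grad G(u) + A^T p
    stationarity = [g[j] + sum(A[i][j] * p[i] for i in range(m)) for j in range(n)]
    # Second block: b - A u
    feasibility = [b[i] - sum(A[i][j] * u[j] for j in range(n)) for i in range(m)]
    return math.sqrt(sum(x * x for x in stationarity + feasibility))
```

A residual of zero certifies a saddle point in this special case. For nonsmooth $J$ or a proper subset $U$, $H(w)$ is set-valued and a distance to the set would be needed instead.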
3. Convergence and Convergence Rate Analysis of RPDC: RPDC Algorithm

Algorithm 1: Randomized Primal-Dual Coordinate method (RPDC)
for k = 1 to t
    Choose $i(k)$ from $\{1, \ldots, N\}$ with equal probability;
    $u^{k+1} = \arg\min_{u \in U}\; \langle \nabla_{i(k)} G(u^k), u_{i(k)} \rangle + J_{i(k)}(u_{i(k)}) + \langle q^k, A_{i(k)} u_{i(k)} \rangle + \tfrac{1}{\epsilon} D(u, u^k)$;
    $p^{k+1} = p^k + \rho (Au^{k+1} - b)$.
end for
where $q^k = p^k + \gamma (Au^k - b)$ and $D(u, v) = K(u) - K(v) - \langle \nabla K(v), u - v \rangle$ is a Bregman-like function, with $K$ strongly convex and gradient Lipschitz.

Assumption 2
(i) $K$ is strongly convex with parameter $\beta$ and gradient Lipschitz continuous with parameter $B$.
(ii) The parameters $\epsilon$ and $\rho$ satisfy: $0 < \epsilon < \beta / [B_G + \gamma \lambda_{\max}(A^\top A)]$ and $0 < \rho < \frac{2\gamma}{2N - 1}$.
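Algorithm 1 can be sketched in a few lines for a toy instance. The example below is an illustration under stated assumptions, not the authors' implementation: the problem is $\min \tfrac12\|u - c\|^2$ subject to $a^\top u = b$, with $J = 0$, $U = \mathbb{R}^n$, single-coordinate blocks ($N = n$), and $K(u) = \tfrac12\|u\|^2$, so the block subproblem has the closed form $u_i \leftarrow u_i - \epsilon\,(\nabla_i G(u^k) + A_i^\top q^k)$. It uses $q^k = p^k + \gamma(Au^k - b)$, matching the RPDC scheme given on the related-works slides. The name `rpdc_toy` and the default parameters are hypothetical; for the test instance with $N = 3$, $\gamma = 1$, $\beta = B_G = 1$ and $\lambda_{\max}(A^\top A) = 3$, the defaults keep $\epsilon$ below $1/4$ and $\rho$ below $2\gamma/(2N-1) = 0.4$.

```python
import random

def rpdc_toy(c, a, b, gamma=1.0, eps=0.2, rho=0.3, iters=20000, seed=0):
    """RPDC sketch for  min 0.5*||u - c||^2  s.t.  a.u = b.

    Assumptions (illustrative): J = 0, U = R^n, blocks are single coordinates,
    K(u) = 0.5*||u||^2 so that D(u, v) = 0.5*||u - v||^2.
    """
    rng = random.Random(seed)
    n = len(c)
    u = [0.0] * n
    p = 0.0
    r = -b                               # running value of a.u - b (u starts at 0)
    for _ in range(iters):
        i = rng.randrange(n)             # uniform choice of the block i(k)
        q = p + gamma * r                # q^k = p^k + gamma*(A u^k - b)
        g_i = (u[i] - c[i]) + a[i] * q   # grad_i G(u^k) + A_i^T q^k
        step = -eps * g_i                # closed-form minimizer of the block subproblem
        u[i] += step
        r += a[i] * step                 # update a.u - b after the block move
        p += rho * r                     # p^{k+1} = p^k + rho*(A u^{k+1} - b)
    return u, p
```

Each iteration touches one coordinate of $u$ and one column of $A$, which is the per-iteration saving that motivates the randomized scheme. The residual $Au - b$ is maintained incrementally rather than recomputed.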
3. Convergence and Convergence Rate Analysis of RPDC: Preparation

Filtration: $\mathcal{F}_k \overset{\text{def}}{=} \{ i(0), i(1), \ldots, i(k) \}$, $\mathcal{F}_k \subset \mathcal{F}_{k+1}$.
The conditional expectation with respect to $\mathcal{F}_k$: $\mathbb{E}_{\mathcal{F}_{k+1}} = \mathbb{E}(\cdot \mid \mathcal{F}_k)$.
The conditional expectation in the $i(k)$ term for given $i(0), i(1), \ldots, i(k-1)$: $\mathbb{E}_{i(k)}$.

Reference point (one APP-AL step from $w^k = (u^k, p^k)$): $T(w^k) = \big( T_u(w^k), T_p(w^k) \big)$, with
$$T_u(w^k) = \arg\min_{u \in U}\; \langle \nabla G(u^k), u \rangle + J(u) + \langle q^k, Au \rangle + \tfrac{1}{\epsilon} D(u, u^k);$$
$$T_p(w^k) = p^k + \gamma \big( A T_u(w^k) - b \big).$$

$$\mathbb{E}_{i(k)} u^{k+1} = \tfrac{1}{N} T_u(w^k) + \big( 1 - \tfrac{1}{N} \big) u^k$$
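The identity for $\mathbb{E}_{i(k)} u^{k+1}$ can be checked numerically by averaging over all $N$ equally likely block choices. The sketch below does this on a toy instance under assumptions that are not from the slides: $G(u) = \tfrac12\|u - c\|^2$, $J = 0$, $U = \mathbb{R}^n$, a single constraint $a^\top u = b$, single-coordinate blocks, and $K(u) = \tfrac12\|u\|^2$. With a separable $K$ and $J$, the block-$i$ update moves coordinate $i$ to the value it takes in $T_u(w^k)$ and leaves the rest unchanged. All function names are hypothetical.

```python
def full_step(u, p, c, a, b, gamma, eps):
    """T_u(w): the full APP-AL primal step for the toy problem (closed form)."""
    q = p + gamma * (sum(ai * ui for ai, ui in zip(a, u)) - b)
    return [ui - eps * ((ui - ci) + ai * q) for ui, ci, ai in zip(u, c, a)]

def block_step(u, i, t_u):
    """RPDC update for block i: only coordinate i moves, to its value in T_u(w)."""
    v = list(u)
    v[i] = t_u[i]
    return v

def expected_next(u, t_u):
    """Average of the N equally likely block updates, i.e. E_{i(k)} u^{k+1}."""
    n = len(u)
    candidates = [block_step(u, i, t_u) for i in range(n)]
    return [sum(v[j] for v in candidates) / n for j in range(n)]
```

Comparing `expected_next(u, t_u)` with `t_u[j] / N + (1 - 1 / N) * u[j]` coordinate by coordinate reproduces the formula up to floating-point rounding.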