Selective Linearization Method for Statistical Learning Problems
Yu Du (yu.du@ucdenver.edu)
Joint work with Andrzej Ruszczynski
DIMACS Workshop on ADMM and Proximal Splitting Methods in Optimization, June 2018
Agenda
1. Introduction to multi-block convex non-smooth optimization
   - Motivating examples
   - Problem formulation
2. Review of related existing methods
   - Proximal point and operator splitting methods
   - Bundle method
   - Alternating linearization method (ALIN)
3. Selective linearization (SLIN) for multi-block convex optimization
   - SLIN method for multi-block convex optimization
   - Global convergence
   - Convergence rate
4. Numerical illustration
   - Three-block fused lasso
   - Overlapping group lasso
   - Regularized support vector machine problem
5. Conclusions and ongoing work
Motivating examples for multi-block structured regularization

min_{x ∈ ℝ^n} F(x) = f(x) + Σ_{i=1}^{N} h_i(B_i x)

Figure: [Demiralp et al., 2013]
Figure: [Zhou et al., 2015]
Motivating examples for multi-block structured regularization

min_w (1/2)‖b − Aw‖_2^2 + λ Σ_{j=1}^{K} d_j ‖w_{T_j}‖_2

min_{S ∈ S} (1/2)‖P_Ω(S) − P_Ω(A)‖_F^2 + γ‖RS‖_1 + τ‖S‖_*

Figure: [Zhou et al., 2015]
Figure: [Demiralp et al., 2013]
Problem formulation for multi-block convex optimization

Problem formulation:

min_{x ∈ ℝ^n} F(x) = f_1(x) + f_2(x) + … + f_N(x)

where f_1, f_2, …, f_N : ℝ^n → ℝ are convex functions.

We introduced the selective linearization (SLIN) algorithm for multi-block non-smooth convex optimization:
- Global convergence is guaranteed;
- Almost O(1/k) convergence rate when only 1 out of the N functions is strongly convex, where k is the iteration number (Du, Lin, and Ruszczynski, 2017).
Review of Related Existing Methods

The selective linearization algorithm draws on the ideas of the proximal point algorithm (Rockafellar, 1976); operator splitting methods (Douglas and Rachford, 1956), (Lions and Mercier, 1979), developed further by (Eckstein and Bertsekas, 1992) and (Bauschke and Combettes, 2011); bundle methods (Kiwiel, 1985), (Ruszczynski, 2006); and the alternating linearization method (Kiwiel, Rosa, and Ruszczynski, 1999).

Proximal point method

min_{x ∈ ℝ^n} F(x)

where F : ℝ^n → ℝ is a convex function. To solve this problem, construct the proximal step

prox_F(x^k) = argmin_x { F(x) + (ρ/2)‖x − x^k‖^2 }

and iterate x^{k+1} = prox_F(x^k), k = 1, 2, …. The method is known to converge to a minimizer of F(·) (Rockafellar, 1976).
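As a rough illustration of this iteration, the sketch below repeats the proximal step numerically. It is a minimal sketch under illustrative assumptions: the function `proximal_point`, the choice of inner solver, and the example F are not part of the original slides.

```python
import numpy as np
from scipy.optimize import minimize

def proximal_point(F, x0, rho=1.0, max_iter=100, tol=1e-8):
    """Proximal point iteration: x^{k+1} = argmin_x F(x) + (rho/2)||x - x^k||^2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        prox_obj = lambda z, xk=x: F(z) + 0.5 * rho * np.sum((z - xk) ** 2)
        # Generic derivative-free inner solver, tolerable for nonsmooth F in this toy setting
        x_new = minimize(prox_obj, x, method="Nelder-Mead").x
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: F(x) = |x1| + (x2 - 1)^2, a simple nonsmooth convex function
F = lambda x: abs(x[0]) + (x[1] - 1.0) ** 2
print(proximal_point(F, x0=[2.0, -1.0]))
```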
Bundle method

The main idea of the bundle method (Kiwiel, 1985), (Ruszczynski, 2006) is to replace the problem min_{x ∈ ℝ^n} F(x) with a sequence of approximate problems of the form (cutting-plane approximation):

min_{x ∈ ℝ^n} F̃_k(x) + (ρ/2)‖x − x^k‖^2

Here F̃_k(·) is a piecewise-linear convex lower approximation of the function F(·):

F̃_k(x) = max_{j ∈ J_k} { F(z^j) + ⟨g^j, x − z^j⟩ }

with earlier generated solutions z^j and subgradients g^j ∈ ∂F(z^j), j ∈ J_k, where J_k ⊆ {1, …, k}. The solution of the proximal step is subject to a sufficient improvement test, which decides whether the proximal center is moved to the current solution or kept.
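To make the cutting-plane model concrete, here is a minimal sketch of evaluating F̃_k at a point from a stored bundle of cuts; the bundle contents and the one-dimensional example are hypothetical.

```python
import numpy as np

def cutting_plane_model(x, bundle):
    """F~_k(x) = max_j { F(z_j) + <g_j, x - z_j> } over a bundle of (z_j, F(z_j), g_j) cuts."""
    return max(Fz + g @ (x - z) for z, Fz, g in bundle)

# Example with F(x) = |x| in one dimension: cuts generated at z = -1 and z = 2
bundle = [(np.array([-1.0]), 1.0, np.array([-1.0])),
          (np.array([2.0]), 2.0, np.array([1.0]))]
print(cutting_plane_model(np.array([0.5]), bundle))  # 0.5, matching |0.5| here
```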
Bundle method (cont.)

F̃_k(x) = max_{j ∈ J_k} { F(z^j) + ⟨g^j, x − z^j⟩ }
Bundle method (cont.)

Bundle method with multiple cuts
Step 1: Initialization: set k = 1, J_1 = {1}, z^1 = x^1, and select g^1 ∈ ∂F(z^1). Choose a parameter β ∈ (0, 1) and a stopping precision ε > 0.
Step 2: z^{k+1} ← argmin_x { F̃_k(x) + (ρ/2)‖x − x^k‖^2 }.
Step 3: If F(x^k) − F̃_k(z^{k+1}) ≤ ε, stop; otherwise continue.
Step 4: Update test: if F(z^{k+1}) ≤ F(x^k) − β( F(x^k) − F̃_k(z^{k+1}) ), then set x^{k+1} = z^{k+1} (descent step); otherwise set x^{k+1} = x^k (null step).
Step 5: Select a set J_{k+1} so that
J_k ∪ {k+1} ⊇ J_{k+1} ⊇ {k+1} ∪ { j ∈ J_k : F(z^j) + ⟨g^j, z^{k+1} − z^j⟩ = F̃_k(z^{k+1}) }.
Increase k by 1 and go to Step 2.
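A rough sketch of these steps follows, assuming cvxpy is available to solve the proximal cutting-plane subproblem. The names (`bundle_method`, `subgrad_F`) and parameter choices are illustrative assumptions, and for simplicity the bundle is grown at every iteration rather than pruned as in Step 5.

```python
import numpy as np
import cvxpy as cp

def bundle_method(F, subgrad_F, x0, rho=1.0, beta=0.5, eps=1e-6, max_iter=200):
    """Proximal bundle method with multiple cuts; F and subgrad_F are plain callables."""
    x = np.asarray(x0, dtype=float)
    z_list, g_list = [x.copy()], [subgrad_F(x)]          # bundle of points and subgradients
    for k in range(max_iter):
        v = cp.Variable(x.size)
        cuts = cp.hstack([F(z) + g @ (v - z) for z, g in zip(z_list, g_list)])
        model = cp.max(cuts)                              # piecewise-linear lower model F~_k
        cp.Problem(cp.Minimize(model + 0.5 * rho * cp.sum_squares(v - x))).solve()
        z_new, model_val = v.value, model.value
        if F(x) - model_val <= eps:                       # Step 3: stopping test
            return x
        if F(z_new) <= F(x) - beta * (F(x) - model_val):  # Step 4: descent test
            x = z_new                                     # descent step (else null step)
        z_list.append(z_new)                              # extend the bundle with the new cut
        g_list.append(subgrad_F(z_new))
    return x
```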
Bundle method (cont.)

Bundle method with cut aggregation
Step 1: Initialization: set k = 1, J_1 = {1}, z^1 = x^1, and select g^1 ∈ ∂F(z^1). Choose a parameter β ∈ (0, 1) and a stopping precision ε > 0.
Step 2: z^{k+1} ← argmin_x { F̃_k(x) + (ρ/2)‖x − x^k‖^2 }, where
F̃_k(x) = max { F̄_k(x), F(z^k) + ⟨g^k, x − z^k⟩ }.
Step 3: If F(x^k) − F̃_k(z^{k+1}) ≤ ε, stop; otherwise continue.
Step 4: If F(z^{k+1}) ≤ F(x^k) − β( F(x^k) − F̃_k(z^{k+1}) ), then set x^{k+1} = z^{k+1} (descent step); otherwise set x^{k+1} = x^k (null step).
Step 5: Define F̄_{k+1}(x) = θ_k F̄_k(x) + (1 − θ_k)( F(z^k) + ⟨g^k, x − z^k⟩ ), where θ_k ∈ [0, 1] is chosen so that the gradient of F̄_{k+1}(·) equals the subgradient of F̃_k(·) at z^{k+1} that satisfies the optimality conditions of the subproblem.
Increase k by 1 and go to Step 2.
Bundle method (cont.)

Convergence of the bundle method (in both versions) for convex functions is well known (Kiwiel, 1985), (Ruszczynski, 2006).

Theorem. Suppose Argmin F ≠ ∅ and ε = 0. Then a point x* ∈ Argmin F exists such that
lim_{k→∞} x^k = lim_{k→∞} z^k = x*.

Convergence rate. (Du and Ruszczynski, 2017) proved that the bundle method for nonsmooth optimization achieves solution accuracy ε in at most O( ln(1/ε) / ε ) iterations if the function is strongly convex. The result holds for both versions of the method: with multiple cuts and with cut aggregation.
Operator splitting for two-block convex optimization

Operator splitting

min_{x ∈ ℝ^n} f(x) + h(x)

The solution x̂ satisfies 0 ∈ ∂f(x̂) + ∂h(x̂), where the two subdifferentials can be viewed as two maximal monotone operators M_1 and M_2 on the space ℝ^n: 0 ∈ (M_1 + M_2)(x̂).

Standard ADMM

min_{x ∈ ℝ^n, y ∈ ℝ^m} f(x) + h(y)   s.t. Mx − y = 0

ADMM for solving the above takes the following form, for some scalar parameter c > 0:

x^{k+1} ∈ argmin_{x ∈ ℝ^n} { f(x) + h(y^k) + ⟨λ^k, Mx − y^k⟩ + (c/2)‖Mx − y^k‖^2 }
y^{k+1} ∈ argmin_{y ∈ ℝ^m} { f(x^{k+1}) + h(y) + ⟨λ^k, Mx^{k+1} − y⟩ + (c/2)‖Mx^{k+1} − y‖^2 }
λ^{k+1} = λ^k + c(Mx^{k+1} − y^{k+1}).
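For concreteness, the sketch below applies these ADMM updates to a lasso problem with M = I, so both subproblems have closed forms; the data A, b and the parameter values are illustrative placeholders, not from the slides.

```python
import numpy as np

def admm_lasso(A, b, lam, c=1.0, max_iter=500):
    """ADMM for min_x 0.5||Ax - b||^2 + lam*||x||_1, split as f(x) + h(y) with x - y = 0."""
    n = A.shape[1]
    x, y, lmbd = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    H = AtA + c * np.eye(n)                               # x-step system matrix, fixed over iterations
    for _ in range(max_iter):
        x = np.linalg.solve(H, Atb + c * y - lmbd)        # x-update: quadratic minimization
        u = x + lmbd / c
        y = np.sign(u) * np.maximum(np.abs(u) - lam / c, 0.0)  # y-update: soft-thresholding
        lmbd = lmbd + c * (x - y)                         # multiplier update
    return y

rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
print(admm_lasso(A, b, lam=0.1)[:5])
```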
Alternating Linearization method for two-block convex optimization

The Alternating Linearization Method (ALIN) (Kiwiel, Rosa, and Ruszczynski, 1999) adapted ideas of the operator splitting methods and bundle methods and proved global convergence. ALIN (Lin, Pham, and Ruszczynski, 2014) has been successfully applied to solve two-block structured regularized statistical learning problems.

Algorithm: Alternating Linearization (ALIN) for min_{x ∈ ℝ^n} f(x) + h(x)
1: repeat
2:   x̃_h ← argmin { f̃(x) + h(x) + (1/2)‖x − x̂‖_D^2 }
3:   g_h ← −g_f − D(x̃_h − x̂)
4:   if (update test for x̃_h) then x̂ ← x̃_h end if
5:   x̃_f ← argmin { f(x) + h̃(x) + (1/2)‖x − x̂‖_D^2 }
6:   g_f ← −g_h − D(x̃_f − x̂)
7:   if (update test for x̃_f) then x̂ ← x̃_f end if
8: until (stopping test)

Here f̃(x) and h̃(x) are linear approximations of f(x) and h(x), and D is a positive definite diagonal matrix.
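Below is a rough numpy sketch of ALIN-style iterations for f(x) = (1/2)‖Ax − b‖² and h(x) = λ‖x‖₁ with D = dI. The simplified "accept if the true objective improves" rule is a placeholder for the method's actual sufficient-improvement test, and all names are illustrative.

```python
import numpy as np

def alin_lasso(A, b, lam, d=1.0, max_iter=100):
    n = A.shape[1]
    AtA, Atb = A.T @ A, A.T @ b
    obj = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
    x_hat = np.zeros(n)
    g_f = AtA @ x_hat - Atb                      # gradient of f at the proximal center
    for _ in range(max_iter):
        # h-subproblem: min_x <g_f, x> + lam*||x||_1 + (d/2)||x - x_hat||^2  (soft-thresholding)
        u = x_hat - g_f / d
        x_h = np.sign(u) * np.maximum(np.abs(u) - lam / d, 0.0)
        g_h = -g_f - d * (x_h - x_hat)           # subgradient of h implied by optimality
        if obj(x_h) < obj(x_hat):                # simplified update test (placeholder)
            x_hat = x_h
        # f-subproblem: min_x f(x) + <g_h, x> + (d/2)||x - x_hat||^2  (linear system)
        x_f = np.linalg.solve(AtA + d * np.eye(n), Atb - g_h + d * x_hat)
        g_f = -g_h - d * (x_f - x_hat)           # gradient of f implied by optimality
        if obj(x_f) < obj(x_hat):                # simplified update test (placeholder)
            x_hat = x_f
    return x_hat
```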
Selective Linearization (SLIN) algorithm

Objective: min F(x) = Σ_{i=1}^{N} f_i(x), where f_1, f_2, …, f_N : ℝ^n → ℝ are convex functions. They can be nonsmooth.

Every iteration: we choose an index j according to a selection rule (the largest gap between the function value and its linear approximation) and solve the f_j-subproblem:

min_x  f_j(x) + Σ_{i ≠ j} f̃_i^k(x) + (1/2)‖x − x^k‖_D^2

Each f̃_i^k(x) is a first-order linearization of f_i(x): f̃_i^k(x) = f_i(z_i^k) + ⟨g_i^k, x − z_i^k⟩. Here x^k is the proximal center, which is updated over the iterations. After solving the f_j-subproblem, f_j is linearized using its subgradient at the current solution z_j^k, in preparation for the next subproblem. We denote the approximating function F̃_k(x) = f_j(x) + Σ_{i ≠ j} f̃_i^k(x).
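As a sketch of how the selection rule and the f_j-subproblem might look in code, assuming the blocks are cvxpy-representable: `slin_step`, the D = ρI choice, and the calling convention are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def slin_step(funcs, subgrads, z, g, x_center, rho=1.0):
    """One SLIN-style iteration: pick the block with the largest linearization gap
    and solve its subproblem, keeping the other blocks linearized.
    funcs[i](v) returns a cvxpy expression for f_i; subgrads[i](x) a subgradient at x."""
    n, N = x_center.size, len(funcs)
    # Selection rule: largest gap between f_i and its current linearization at the proximal center
    gaps = [float(funcs[i](x_center).value)
            - (float(funcs[i](z[i]).value) + g[i] @ (x_center - z[i])) for i in range(N)]
    j = int(np.argmax(gaps))
    v = cp.Variable(n)
    # Keep f_j exact, replace every other block by its current linearization
    lin_terms = sum(float(funcs[i](z[i]).value) + g[i] @ (v - z[i]) for i in range(N) if i != j)
    prob = cp.Problem(cp.Minimize(funcs[j](v) + lin_terms
                                  + 0.5 * rho * cp.sum_squares(v - x_center)))
    prob.solve()
    z[j] = v.value                       # new trial point for block j
    g[j] = subgrads[j](z[j])             # refresh the linearization of f_j
    return j, z, g
```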