Selective Linearization Method for Statistical Learning Problems
Yu Du (yu.du@ucdenver.edu)
Joint work with Andrzej Ruszczynski
DIMACS Workshop on ADMM and Proximal Splitting Methods in Optimization, June 2018
Agenda
1. Introduction to multi-block convex non-smooth optimization
   - Motivating examples
   - Problem formulation
2. Review of related existing methods
   - Proximal point and operator splitting methods
   - Bundle method
   - Alternating linearization method (ALIN)
3. Selective linearization (SLIN) for multi-block convex optimization
   - SLIN method for multi-block convex optimization
   - Global convergence
   - Convergence rate
4. Numerical illustration
   - Three-block fused lasso
   - Overlapping group lasso
   - Regularized support vector machine problem
5. Conclusions and ongoing work
Motivating examples for multi-block structured regularization

min_{x ∈ ℝ^n} F(x) = f(x) + Σ_{i=1}^{N} h_i(B_i x)

Figure: [Demiralp et al., 2013]
Figure: [Zhou et al., 2015]
Motivating examples for multi-block structured regularization

min_w (1/2)‖b − Aw‖_2^2 + λ Σ_{j=1}^{K} d_j ‖w_{T_j}‖_2

min_{S ∈ S} (1/2)‖P_Ω(S) − P_Ω(A)‖_F^2 + γ‖RS‖_1 + τ‖S‖_*

Figure: [Zhou et al., 2015]
Figure: [Demiralp et al., 2013]
Problem formulation for multi-block convex optimization

Problem formulation:

min_{x ∈ ℝ^n} F(x) = f_1(x) + f_2(x) + … + f_N(x)

where f_1, f_2, …, f_N : ℝ^n → ℝ are convex functions.

We introduced the selective linearization (SLIN) algorithm for multi-block non-smooth convex optimization:
- Global convergence is guaranteed;
- Almost O(1/k) convergence rate when only 1 out of the N functions is strongly convex, where k is the iteration number (Du, Lin, and Ruszczynski, 2017).
Review of Related Existing Methods

The selective linearization algorithm draws on the ideas of the proximal point algorithm (Rockafellar, 1976); operator splitting methods (Douglas and Rachford, 1956), (Lions and Mercier, 1979), developed further by (Eckstein and Bertsekas, 1992) and (Bauschke and Combettes, 2011); bundle methods (Kiwiel, 1985), (Ruszczynski, 2006); and the alternating linearization method (Kiwiel, Rosa, and Ruszczynski, 1999).

Proximal point method

min_{x ∈ ℝ^n} F(x)

where F : ℝ^n → ℝ is a convex function. To solve this problem, construct the proximal step

prox_F(x^k) = argmin_x { F(x) + (ρ/2)‖x − x^k‖^2 }

and iterate x^{k+1} = prox_F(x^k), k = 1, 2, …. The method is known to converge to a minimizer of F(·) (Rockafellar, 1976).
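As a rough illustration of this iteration, the sketch below repeats the proximal step numerically. It is a minimal sketch under illustrative assumptions: the function `proximal_point`, the choice of inner solver, and the example F are not part of the original slides.

```python
import numpy as np
from scipy.optimize import minimize

def proximal_point(F, x0, rho=1.0, max_iter=100, tol=1e-8):
    """Proximal point iteration: x^{k+1} = argmin_x F(x) + (rho/2)||x - x^k||^2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        prox_obj = lambda z, xk=x: F(z) + 0.5 * rho * np.sum((z - xk) ** 2)
        # Generic derivative-free inner solver, tolerable for nonsmooth F in this toy setting
        x_new = minimize(prox_obj, x, method="Nelder-Mead").x
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: F(x) = |x1| + (x2 - 1)^2, a simple nonsmooth convex function
F = lambda x: abs(x[0]) + (x[1] - 1.0) ** 2
print(proximal_point(F, x0=[2.0, -1.0]))
```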
Bundle method

The main idea of the bundle method (Kiwiel, 1985), (Ruszczynski, 2006) is to replace the problem min_{x ∈ ℝ^n} F(x) with a sequence of approximate problems of the form (cutting-plane approximation):

min_{x ∈ ℝ^n} F̃_k(x) + (ρ/2)‖x − x^k‖^2

Here F̃_k(·) is a piecewise-linear convex lower approximation of the function F(·):

F̃_k(x) = max_{j ∈ J_k} { F(z^j) + ⟨g^j, x − z^j⟩ }

with earlier generated solutions z^j and subgradients g^j ∈ ∂F(z^j), j ∈ J_k, where J_k ⊆ {1, …, k}. The solution of the proximal step is subject to a sufficient improvement test, which decides whether the proximal center is moved to the current solution or kept.
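To make the cutting-plane model concrete, here is a minimal sketch of evaluating F̃_k at a point from a stored bundle of cuts; the bundle contents and the one-dimensional example are hypothetical.

```python
import numpy as np

def cutting_plane_model(x, bundle):
    """F~_k(x) = max_j { F(z_j) + <g_j, x - z_j> } over a bundle of (z_j, F(z_j), g_j) cuts."""
    return max(Fz + g @ (x - z) for z, Fz, g in bundle)

# Example with F(x) = |x| in one dimension: cuts generated at z = -1 and z = 2
bundle = [(np.array([-1.0]), 1.0, np.array([-1.0])),
          (np.array([2.0]), 2.0, np.array([1.0]))]
print(cutting_plane_model(np.array([0.5]), bundle))  # 0.5, matching |0.5| here
```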
Bundle method (cont.)

F̃_k(x) = max_{j ∈ J_k} { F(z^j) + ⟨g^j, x − z^j⟩ }
Bundle method (cont.)

Bundle method with multiple cuts
Step 1: Initialization: set k = 1, J_1 = {1}, z^1 = x^1, and select g^1 ∈ ∂F(z^1). Choose a parameter β ∈ (0, 1) and a stopping precision ε > 0.
Step 2: z^{k+1} ← argmin_x { F̃_k(x) + (ρ/2)‖x − x^k‖^2 }.
Step 3: If F(x^k) − F̃_k(z^{k+1}) ≤ ε, stop; otherwise continue.
Step 4: Update test: if F(z^{k+1}) ≤ F(x^k) − β( F(x^k) − F̃_k(z^{k+1}) ), then set x^{k+1} = z^{k+1} (descent step); otherwise set x^{k+1} = x^k (null step).
Step 5: Select a set J_{k+1} so that
J_k ∪ {k+1} ⊇ J_{k+1} ⊇ {k+1} ∪ { j ∈ J_k : F(z^j) + ⟨g^j, z^{k+1} − z^j⟩ = F̃_k(z^{k+1}) }.
Increase k by 1 and go to Step 2.
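A rough sketch of these steps follows, assuming cvxpy is available to solve the proximal cutting-plane subproblem. The names (`bundle_method`, `subgrad_F`) and parameter choices are illustrative assumptions, and for simplicity the bundle is grown at every iteration rather than pruned as in Step 5.

```python
import numpy as np
import cvxpy as cp

def bundle_method(F, subgrad_F, x0, rho=1.0, beta=0.5, eps=1e-6, max_iter=200):
    """Proximal bundle method with multiple cuts; F and subgrad_F are plain callables."""
    x = np.asarray(x0, dtype=float)
    z_list, g_list = [x.copy()], [subgrad_F(x)]          # bundle of points and subgradients
    for k in range(max_iter):
        v = cp.Variable(x.size)
        cuts = cp.hstack([F(z) + g @ (v - z) for z, g in zip(z_list, g_list)])
        model = cp.max(cuts)                              # piecewise-linear lower model F~_k
        cp.Problem(cp.Minimize(model + 0.5 * rho * cp.sum_squares(v - x))).solve()
        z_new, model_val = v.value, model.value
        if F(x) - model_val <= eps:                       # Step 3: stopping test
            return x
        if F(z_new) <= F(x) - beta * (F(x) - model_val):  # Step 4: descent test
            x = z_new                                     # descent step (else null step)
        z_list.append(z_new)                              # extend the bundle with the new cut
        g_list.append(subgrad_F(z_new))
    return x
```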
Bundle method (cont.)

Bundle method with cut aggregation
Step 1: Initialization: set k = 1, J_1 = {1}, z^1 = x^1, and select g^1 ∈ ∂F(z^1). Choose a parameter β ∈ (0, 1) and a stopping precision ε > 0.
Step 2: z^{k+1} ← argmin_x { F̃_k(x) + (ρ/2)‖x − x^k‖^2 }, where
F̃_k(x) = max { F̄_k(x), F(z^k) + ⟨g^k, x − z^k⟩ }.
Step 3: If F(x^k) − F̃_k(z^{k+1}) ≤ ε, stop; otherwise continue.
Step 4: If F(z^{k+1}) ≤ F(x^k) − β( F(x^k) − F̃_k(z^{k+1}) ), then set x^{k+1} = z^{k+1} (descent step); otherwise set x^{k+1} = x^k (null step).
Step 5: Define F̄_{k+1}(x) = θ_k F̄_k(x) + (1 − θ_k)( F(z^k) + ⟨g^k, x − z^k⟩ ), where θ_k ∈ [0, 1] is chosen so that the gradient of F̄_{k+1}(·) equals the subgradient of F̃_k(·) at z^{k+1} that satisfies the optimality conditions of the subproblem.
Increase k by 1 and go to Step 2.
Bundle method (cont.)

Convergence of the bundle method (in both versions) for convex functions is well known (Kiwiel, 1985), (Ruszczynski, 2006).

Theorem. Suppose Argmin F ≠ ∅ and ε = 0. Then a point x* ∈ Argmin F exists such that
lim_{k→∞} x^k = lim_{k→∞} z^k = x*.

Convergence rate. (Du and Ruszczynski, 2017) proved that the bundle method for nonsmooth optimization achieves solution accuracy ε in at most O( ln(1/ε) / ε ) iterations if the function is strongly convex. The result holds for both versions of the method: with multiple cuts and with cut aggregation.
Operator splitting for two-block convex optimization

Operator splitting

min_{x ∈ ℝ^n} f(x) + h(x)

The solution x̂ satisfies 0 ∈ ∂f(x̂) + ∂h(x̂), where the two subdifferentials can be viewed as two maximal monotone operators M_1 and M_2 on the space ℝ^n: 0 ∈ (M_1 + M_2)(x̂).

Standard ADMM

min_{x ∈ ℝ^n, y ∈ ℝ^m} f(x) + h(y)   s.t. Mx − y = 0

ADMM for solving the above takes the following form, for some scalar parameter c > 0:

x^{k+1} ∈ argmin_{x ∈ ℝ^n} { f(x) + h(y^k) + ⟨λ^k, Mx − y^k⟩ + (c/2)‖Mx − y^k‖^2 }
y^{k+1} ∈ argmin_{y ∈ ℝ^m} { f(x^{k+1}) + h(y) + ⟨λ^k, Mx^{k+1} − y⟩ + (c/2)‖Mx^{k+1} − y‖^2 }
λ^{k+1} = λ^k + c(Mx^{k+1} − y^{k+1}).
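For concreteness, the sketch below applies these ADMM updates to a lasso problem with M = I, so both subproblems have closed forms; the data A, b and the parameter values are illustrative placeholders, not from the slides.

```python
import numpy as np

def admm_lasso(A, b, lam, c=1.0, max_iter=500):
    """ADMM for min_x 0.5||Ax - b||^2 + lam*||x||_1, split as f(x) + h(y) with x - y = 0."""
    n = A.shape[1]
    x, y, lmbd = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    H = AtA + c * np.eye(n)                               # x-step system matrix, fixed over iterations
    for _ in range(max_iter):
        x = np.linalg.solve(H, Atb + c * y - lmbd)        # x-update: quadratic minimization
        u = x + lmbd / c
        y = np.sign(u) * np.maximum(np.abs(u) - lam / c, 0.0)  # y-update: soft-thresholding
        lmbd = lmbd + c * (x - y)                         # multiplier update
    return y

rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
print(admm_lasso(A, b, lam=0.1)[:5])
```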
Alternating Linearization method for two-block convex optimization

The Alternating Linearization Method (ALIN) (Kiwiel, Rosa, and Ruszczynski, 1999) adapted ideas of the operator splitting methods and bundle methods and proved global convergence. ALIN (Lin, Pham, and Ruszczynski, 2014) has been successfully applied to solve two-block structured regularized statistical learning problems.

Algorithm: Alternating Linearization (ALIN) for min_{x ∈ ℝ^n} f(x) + h(x)
1: repeat
2:   x̃_h ← argmin { f̃(x) + h(x) + (1/2)‖x − x̂‖_D^2 }
3:   g_h ← −g_f − D(x̃_h − x̂)
4:   if (update test for x̃_h) then x̂ ← x̃_h end if
5:   x̃_f ← argmin { f(x) + h̃(x) + (1/2)‖x − x̂‖_D^2 }
6:   g_f ← −g_h − D(x̃_f − x̂)
7:   if (update test for x̃_f) then x̂ ← x̃_f end if
8: until (stopping test)

Here f̃(x) and h̃(x) are linear approximations of f(x) and h(x), and D is a positive definite diagonal matrix.
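Below is a rough numpy sketch of ALIN-style iterations for f(x) = (1/2)‖Ax − b‖² and h(x) = λ‖x‖₁ with D = dI. The simplified "accept if the true objective improves" rule is a placeholder for the method's actual sufficient-improvement test, and all names are illustrative.

```python
import numpy as np

def alin_lasso(A, b, lam, d=1.0, max_iter=100):
    n = A.shape[1]
    AtA, Atb = A.T @ A, A.T @ b
    obj = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
    x_hat = np.zeros(n)
    g_f = AtA @ x_hat - Atb                      # gradient of f at the proximal center
    for _ in range(max_iter):
        # h-subproblem: min_x <g_f, x> + lam*||x||_1 + (d/2)||x - x_hat||^2  (soft-thresholding)
        u = x_hat - g_f / d
        x_h = np.sign(u) * np.maximum(np.abs(u) - lam / d, 0.0)
        g_h = -g_f - d * (x_h - x_hat)           # subgradient of h implied by optimality
        if obj(x_h) < obj(x_hat):                # simplified update test (placeholder)
            x_hat = x_h
        # f-subproblem: min_x f(x) + <g_h, x> + (d/2)||x - x_hat||^2  (linear system)
        x_f = np.linalg.solve(AtA + d * np.eye(n), Atb - g_h + d * x_hat)
        g_f = -g_h - d * (x_f - x_hat)           # gradient of f implied by optimality
        if obj(x_f) < obj(x_hat):                # simplified update test (placeholder)
            x_hat = x_f
    return x_hat
```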
Selective Linearization (SLIN) algorithm

Objective: min F(x) = Σ_{i=1}^{N} f_i(x), where f_1, f_2, …, f_N : ℝ^n → ℝ are convex functions. They can be nonsmooth.

Every iteration: we choose an index j according to a selection rule (the largest gap between the function value and its linear approximation) and solve the f_j-subproblem:

min_x  f_j(x) + Σ_{i ≠ j} f̃_i^k(x) + (1/2)‖x − x^k‖_D^2

Each f̃_i^k(x) is a first-order linearization of f_i(x): f̃_i^k(x) = f_i(z_i^k) + ⟨g_i^k, x − z_i^k⟩. Here x^k is the proximal center, which is updated over the iterations. After solving the f_j-subproblem, f_j is linearized using its subgradient at the current solution z_j^k, in preparation for the next subproblem. We denote the approximating function F̃_k(x) = f_j(x) + Σ_{i ≠ j} f̃_i^k(x).
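As a sketch of how the selection rule and the f_j-subproblem might look in code, assuming the blocks are cvxpy-representable: `slin_step`, the D = ρI choice, and the calling convention are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def slin_step(funcs, subgrads, z, g, x_center, rho=1.0):
    """One SLIN-style iteration: pick the block with the largest linearization gap
    and solve its subproblem, keeping the other blocks linearized.
    funcs[i](v) returns a cvxpy expression for f_i; subgrads[i](x) a subgradient at x."""
    n, N = x_center.size, len(funcs)
    # Selection rule: largest gap between f_i and its current linearization at the proximal center
    gaps = [float(funcs[i](x_center).value)
            - (float(funcs[i](z[i]).value) + g[i] @ (x_center - z[i])) for i in range(N)]
    j = int(np.argmax(gaps))
    v = cp.Variable(n)
    # Keep f_j exact, replace every other block by its current linearization
    lin_terms = sum(float(funcs[i](z[i]).value) + g[i] @ (v - z[i]) for i in range(N) if i != j)
    prob = cp.Problem(cp.Minimize(funcs[j](v) + lin_terms
                                  + 0.5 * rho * cp.sum_squares(v - x_center)))
    prob.solve()
    z[j] = v.value                       # new trial point for block j
    g[j] = subgrads[j](z[j])             # refresh the linearization of f_j
    return j, z, g
```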