

  1. Network Flow Algorithms for Structured Sparsity
     Julien Mairal (UC Berkeley), Rodolphe Jenatton (INRIA - SIERRA Project-Team), Guillaume Obozinski (INRIA - SIERRA Project-Team), Francis Bach (INRIA - SIERRA Project-Team)
     Bellevue, ICML Workshop, July 2011

  2. What this work is about
     Sparse and structured linear models.
     Optimization for the group Lasso with overlapping groups.
     Links between sparse regularization and network flow optimization.
     Related publications:
     [1] J. Mairal, R. Jenatton, G. Obozinski and F. Bach. Network Flow Algorithms for Structured Sparsity. NIPS, 2010.
     [2] R. Jenatton, J. Mairal, G. Obozinski and F. Bach. Proximal Methods for Hierarchical Sparse Coding. JMLR, to appear.
     [3] R. Jenatton, J. Mairal, G. Obozinski and F. Bach. Proximal Methods for Sparse Hierarchical Dictionary Learning. ICML, 2010.

  3. Part I: Introduction to Structured Sparsity

  4. Sparse Linear Models: Machine Learning Point of View
     Let (y_i, x_i)_{i=1}^n be a training set, where the vectors x_i are in R^p and are called features. The scalars y_i are
     in {−1, +1} for binary classification problems,
     in R for regression problems.
     We assume there is a relation y ≈ w⊤x, and solve
     min_{w ∈ R^p}  (1/n) Σ_{i=1}^n ℓ(y_i, w⊤x_i)  +  λ Ω(w),
     where the sum is the empirical risk and Ω(w) the regularization.

  5. Sparse Linear Models: Machine Learning Point of View
     A few examples:
     Ridge regression:     min_{w ∈ R^p} (1/n) Σ_{i=1}^n (y_i − w⊤x_i)^2 + λ ‖w‖_2^2.
     Linear SVM:           min_{w ∈ R^p} (1/n) Σ_{i=1}^n max(0, 1 − y_i w⊤x_i) + λ ‖w‖_2^2.
     Logistic regression:  min_{w ∈ R^p} (1/n) Σ_{i=1}^n log(1 + e^{−y_i w⊤x_i}) + λ ‖w‖_2^2.
     The squared ℓ2-norm induces “smoothness” in w. When one knows in advance that w should be sparse, one should use a sparsity-inducing regularization such as the ℓ1-norm [Chen et al., 1999, Tibshirani, 1996].
     How can one add a-priori knowledge in the regularization?
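     A minimal sketch (not from the slides; the synthetic data and function names are illustrative) evaluating these three regularized empirical risks in NumPy:

         import numpy as np

         def ridge_obj(w, X, y, lam):
             # (1/n) * sum of squared errors + lambda * ||w||_2^2
             return np.mean((y - X @ w) ** 2) + lam * np.dot(w, w)

         def svm_obj(w, X, y, lam):
             # (1/n) * sum of hinge losses + lambda * ||w||_2^2
             return np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + lam * np.dot(w, w)

         def logistic_obj(w, X, y, lam):
             # (1/n) * sum of logistic losses + lambda * ||w||_2^2
             return np.mean(np.log1p(np.exp(-y * (X @ w)))) + lam * np.dot(w, w)

         rng = np.random.default_rng(0)
         X = rng.standard_normal((100, 20))          # rows are the samples x_i
         y = np.sign(X @ rng.standard_normal(20))    # binary labels in {-1, +1}
         w = rng.standard_normal(20)
         print(ridge_obj(w, X, y, 0.1), svm_obj(w, X, y, 0.1), logistic_obj(w, X, y, 0.1))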

  6. Sparse Linear Models: Signal Processing Point of View
     Let y in R^n be a signal. Let X = [x^1, . . . , x^p] ∈ R^{n×p} be a set of normalized “basis vectors”. We call it a dictionary.
     X is “adapted” to y if it can represent it with a few basis vectors, that is, there exists a sparse vector w in R^p such that y ≈ Xw. We call w the sparse code.
     [Figure: y ∈ R^n approximated as the product of the dictionary X ∈ R^{n×p} with a sparse code w ∈ R^p.]
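     A small synthetic illustration (mine, not from the slides): a dictionary with normalized columns and a k-sparse code w indeed give y ≈ Xw.

         import numpy as np

         rng = np.random.default_rng(0)
         n, p, k = 64, 256, 5                        # signal size, dictionary size, sparsity
         X = rng.standard_normal((n, p))
         X /= np.linalg.norm(X, axis=0)              # normalized "basis vectors" (columns)
         w = np.zeros(p)
         support = rng.choice(p, size=k, replace=False)
         w[support] = rng.standard_normal(k)         # sparse code: only k nonzeros
         y = X @ w + 0.01 * rng.standard_normal(n)   # signal approximately represented by X w
         print("relative residual:", np.linalg.norm(y - X @ w) / np.linalg.norm(y))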

  7. Sparse Linear Models: the Lasso / Basis Pursuit
     Signal processing: X is a dictionary in R^{n×p},
     min_{w ∈ R^p} (1/2) ‖y − Xw‖_2^2 + λ ‖w‖_1.
     Machine learning:
     min_{w ∈ R^p} (1/(2n)) Σ_{i=1}^n (y_i − x_i⊤w)^2 + λ ‖w‖_1  =  min_{w ∈ R^p} (1/(2n)) ‖y − X⊤w‖_2^2 + λ ‖w‖_1,
     with X ≜ [x_1, . . . , x_n] and y ≜ [y_1, . . . , y_n]⊤.
     Useful tool in signal processing, machine learning, statistics, neuroscience, . . . as long as one wishes to select features.
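     A quick illustration (assuming scikit-learn is available; the data are synthetic): sklearn's Lasso minimizes (1/(2n)) ‖y − Xw‖_2^2 + α ‖w‖_1, i.e. the machine-learning formulation above, and returns a sparse coefficient vector.

         import numpy as np
         from sklearn.linear_model import Lasso

         rng = np.random.default_rng(0)
         n, p = 100, 200
         X = rng.standard_normal((n, p))             # rows are the samples x_i
         w_true = np.zeros(p)
         w_true[:5] = rng.standard_normal(5)         # only 5 relevant features
         y = X @ w_true + 0.1 * rng.standard_normal(n)

         lasso = Lasso(alpha=0.1).fit(X, y)
         print("nonzero coefficients:", np.sum(lasso.coef_ != 0))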

  8. Group Sparsity-Inducing Norms
     min_{w ∈ R^p} f(w) + λ Ω(w),   with f the data-fitting term and Ω a sparsity-inducing norm.
     The most popular choice for Ω: the ℓ1-norm, ‖w‖_1 = Σ_{j=1}^p |w_j|.
     However, the ℓ1-norm encodes poor information, just cardinality!
     Another popular choice for Ω: the ℓ1-ℓq norm [Turlach et al., 2005], with q = 2 or q = ∞,
     Σ_{g ∈ G} ‖w_g‖_q,   with G a partition of {1, . . . , p}.
     The ℓ1-ℓq norm sets to zero groups of non-overlapping variables (as opposed to single variables for the ℓ1-norm).
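     A minimal sketch (illustrative; the group indices are my own) of the ℓ1-ℓq norm for a partition into groups, with q = 2 or q = ∞:

         import numpy as np

         def group_norm(w, groups, q=2):
             # Omega(w) = sum over groups g of ||w_g||_q
             return sum(np.linalg.norm(w[g], ord=q) for g in groups)

         w = np.array([0.0, 0.0, 0.0, 1.0, -2.0, 0.5])
         groups = [[0, 1, 2], [3, 4, 5]]             # a partition of the variables
         print(group_norm(w, groups, q=2))           # 0 + ||(1, -2, 0.5)||_2
         print(group_norm(w, groups, q=np.inf))      # 0 + max(|1|, |-2|, |0.5|)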

  9. Structured Sparsity with Overlapping Groups
     Warning: under the name “structured sparsity” appear in fact significantly different formulations!
     1. non-convex
        zero-tree wavelets [Shapiro, 1993]
        sparsity patterns are in a predefined collection: [Baraniuk et al., 2010]
        select a union of groups: [Huang et al., 2009]
        structure via Markov Random Fields: [Cevher et al., 2008]
     2. convex
        tree-structure: [Zhao et al., 2009]
        non-zero patterns are a union of groups: [Jacob et al., 2009]
        zero patterns are a union of groups: [Jenatton et al., 2009]
        other norms: [Micchelli et al., 2010]

  10. Sparsity-Inducing Norms
     Ω(w) = Σ_{g ∈ G} ‖w_g‖_q
     What happens when the groups overlap? [Jenatton et al., 2009]
     Inside the groups, the ℓ2-norm (or ℓ∞) does not promote sparsity.
     Variables belonging to the same groups are encouraged to be set to zero together.

  11. Examples of sets of groups G [Jenatton et al., 2009]
     Selection of contiguous patterns on a sequence, p = 6. G is the set of blue groups. Any union of blue groups set to zero leads to the selection of a contiguous pattern.

  12. Hierarchical Norms [Zhao et al., 2009]
     A node can be active only if its ancestors are active. The selected patterns are rooted subtrees.

  13. Part II: How do we optimize these cost functions?

  14. Different strategies
     min_{w ∈ R^p} f(w) + λ Σ_{g ∈ G} ‖w_g‖_q
     generic methods: QP, CP, subgradient descent.
     Augmented Lagrangian, ADMM [Mairal et al., 2011, Qi and Goldfarb, 2011].
     Nesterov’s smoothing technique [Chen et al., 2010].
     hierarchical case: proximal methods [Jenatton et al., 2010a].
     for q = ∞: proximal gradient methods with network flow optimization [Mairal et al., 2010]; also proximal gradient methods with an inexact proximal operator [Jenatton et al., 2010a, Liu and Ye, 2010].
     for q = 2: reweighted-ℓ2 [Jenatton et al., 2010b, Micchelli et al., 2010].

  15. First-order/proximal methods
     min_{w ∈ R^p} f(w) + λ Ω(w)
     f is strictly convex and differentiable with a Lipschitz gradient. Generalizes the idea of gradient descent:
     w^{k+1} ← argmin_{w ∈ R^p} f(w^k) + ∇f(w^k)⊤(w − w^k) + (L/2) ‖w − w^k‖_2^2 + λ Ω(w),
     where the first two terms are a linear approximation of f and the third is a quadratic term. Equivalently,
     w^{k+1} ← argmin_{w ∈ R^p} (1/2) ‖w − (w^k − (1/L) ∇f(w^k))‖_2^2 + (λ/L) Ω(w).
     When λ = 0, w^{k+1} ← w^k − (1/L) ∇f(w^k): this is equivalent to a classical gradient descent step.
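     A minimal sketch (illustrative, not the paper's implementation) of the proximal gradient iteration: each step applies the proximal operator of (λ/L) Ω to a gradient step, here with Ω = ℓ1 so that the prox is soft-thresholding.

         import numpy as np

         def proximal_gradient(grad_f, prox, w0, L, lam, n_iter=200):
             """grad_f: gradient of f; prox(u, t) solves min_w 0.5*||u - w||^2 + t*Omega(w)."""
             w = w0.copy()
             for _ in range(n_iter):
                 w = prox(w - grad_f(w) / L, lam / L)   # gradient step, then proximal step
             return w

         # Example: f(w) = 0.5*||y - Xw||^2 and Omega = l1 (the Lasso).
         rng = np.random.default_rng(0)
         X = rng.standard_normal((50, 100))
         y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.standard_normal(50)
         grad_f = lambda w: X.T @ (X @ w - y)
         prox_l1 = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)
         L = np.linalg.norm(X, 2) ** 2                  # Lipschitz constant of grad_f
         w_hat = proximal_gradient(grad_f, prox_l1, np.zeros(100), L, lam=1.0)
         print("nonzeros in w_hat:", np.sum(w_hat != 0))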

  16. First-order/proximal methods
     They require solving efficiently the proximal operator
     min_{w ∈ R^p} (1/2) ‖u − w‖_2^2 + λ Ω(w).
     For the ℓ1-norm, this amounts to a soft-thresholding: w_i⋆ = sign(u_i) (|u_i| − λ)_+.
     There exist accelerated versions based on Nesterov’s optimal first-order method (gradient method with “extrapolation”) [Beck and Teboulle, 2009, Nesterov, 2007, 1983], suited for large-scale experiments.
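     A small check (illustrative) that soft-thresholding solves the one-dimensional proximal problem, comparing it against a brute-force grid search:

         import numpy as np

         def soft_threshold(u, lam):
             return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

         u, lam = 0.7, 0.3
         grid = np.linspace(-2.0, 2.0, 200001)
         objective = 0.5 * (u - grid) ** 2 + lam * np.abs(grid)
         print(soft_threshold(u, lam), grid[np.argmin(objective)])   # both close to 0.4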

  17. Tree-structured groups
     Proposition [Jenatton, Mairal, Obozinski, and Bach, 2010a]
     Suppose G is a tree-structured set of groups, i.e., ∀ g, h ∈ G, g ∩ h = ∅ or g ⊂ h or h ⊂ g. For q = 2 or q = ∞, define Prox^g and Prox_Ω as
     Prox^g: u ↦ argmin_{w ∈ R^p} (1/2) ‖u − w‖_2^2 + λ ‖w_g‖_q,
     Prox_Ω: u ↦ argmin_{w ∈ R^p} (1/2) ‖u − w‖_2^2 + λ Σ_{g ∈ G} ‖w_g‖_q.
     If the groups g_1, . . . , g_m are sorted from the leaves to the root, then
     Prox_Ω = Prox^{g_m} ◦ . . . ◦ Prox^{g_1}.
     → Tree-structured regularization: efficient linear-time algorithm.
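     A minimal sketch (illustrative, q = 2) of this composition: the prox of λ‖w_g‖_2 is a group soft-thresholding of the coordinates in g, and applying the group proxes in leaves-to-root order yields Prox_Ω for a tree-structured G.

         import numpy as np

         def prox_group_l2(w, g, lam):
             # prox of lam * ||w_g||_2: scale (or zero out) the coordinates in g
             v = w.copy()
             norm = np.linalg.norm(v[g])
             v[g] = 0.0 if norm <= lam else v[g] * (1.0 - lam / norm)
             return v

         def prox_tree(u, groups_leaves_to_root, lam):
             # groups must be nested or disjoint (tree-structured) and ordered
             # from the leaves to the root for the composition to be exact
             w = u.copy()
             for g in groups_leaves_to_root:
                 w = prox_group_l2(w, g, lam)
             return w

         u = np.array([0.3, -0.1, 1.5, 0.2, -0.8])
         groups = [[2], [3, 4], [2, 3, 4], [0, 1, 2, 3, 4]]   # leaves first, root last
         print(prox_tree(u, groups, lam=0.2))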

  18. General Overlapping Groups for q = ∞
     Dual formulation [Jenatton, Mairal, Obozinski, and Bach, 2010a]
     The solutions w⋆ and ξ⋆ of the following optimization problems
     min_{w ∈ R^p} (1/2) ‖u − w‖_2^2 + λ Σ_{g ∈ G} ‖w_g‖_∞,   (Primal)
     min_{ξ ∈ R^{p×|G|}} (1/2) ‖u − Σ_{g ∈ G} ξ^g‖_2^2   s.t.  ∀ g ∈ G, ‖ξ^g‖_1 ≤ λ and ξ_j^g = 0 if j ∉ g,   (Dual)
     satisfy
     w⋆ = u − Σ_{g ∈ G} ξ^{g⋆}.   (Primal-dual relation)
     The dual formulation has more variables, but no overlapping constraints.
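     A small numerical illustration (mine, not from the slides) of the primal-dual relation in the special case of a single group covering all coordinates: the dual then reduces to projecting u onto the ℓ1-ball of radius λ, and w⋆ = u − ξ⋆ recovers the prox of λ‖·‖_∞ (Moreau decomposition), which a brute-force 2-D grid search confirms.

         import numpy as np

         def project_l1_ball(v, radius):
             # Euclidean projection onto {x : ||x||_1 <= radius}
             if np.sum(np.abs(v)) <= radius:
                 return v.copy()
             a = np.sort(np.abs(v))[::-1]
             css = np.cumsum(a)
             rho = np.nonzero(a - (css - radius) / (np.arange(len(a)) + 1) > 0)[0][-1]
             theta = (css[rho] - radius) / (rho + 1)
             return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

         u, lam = np.array([0.9, -0.4]), 0.5
         xi = project_l1_ball(u, lam)        # dual solution for the single group
         w = u - xi                          # primal-dual relation

         # brute-force check: minimize 0.5*||u - w||^2 + lam*||w||_inf on a grid
         grid = np.linspace(-1.5, 1.5, 301)
         W1, W2 = np.meshgrid(grid, grid)
         obj = 0.5 * ((u[0] - W1) ** 2 + (u[1] - W2) ** 2) + lam * np.maximum(np.abs(W1), np.abs(W2))
         i, j = np.unravel_index(np.argmin(obj), obj.shape)
         print(w, np.array([W1[i, j], W2[i, j]]))   # both close to [0.4, -0.4]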

  19. General Overlapping Groups for q = ∞ [Mairal, Jenatton, Obozinski, and Bach, 2010]
     First step: flip the signs of u. The dual is then equivalent to a quadratic min-cost flow problem:
     min_{ξ ∈ R_+^{p×|G|}} (1/2) ‖u − Σ_{g ∈ G} ξ^g‖_2^2   s.t.  ∀ g ∈ G, Σ_{j ∈ g} ξ_j^g ≤ λ and ξ_j^g = 0 if j ∉ g.
