Structured sparsity and convex optimization


  1. Structured sparsity and convex optimization
  Francis Bach, INRIA - École Normale Supérieure, Paris, France
  Joint work with R. Jenatton, J. Mairal, G. Obozinski
  December 2015

  3. Outline
  • Structured sparsity
  • Hierarchical dictionary learning
    – Known topology but unknown location/projection
    – Tree: efficient linear-time computations
  • Non-linear variable selection
    – Known topology and location
    – Directed acyclic graph: semi-efficient active-set algorithm

  4. Sparsity in machine learning and statistics
  • Assumption: y = w⊤x + ε, with w ∈ ℝ^p sparse
    – Proxy for interpretability
    – Allows high-dimensional inference: log p = O(n)
  • Sparsity and convexity (ℓ1-norm regularization): min_{w ∈ ℝ^p} L(w) + ‖w‖₁
  [Figure: unit-norm balls in the (w1, w2) plane]

  5. Sparsity in supervised machine learning
  • Observed data (x_i, y_i) ∈ ℝ^p × ℝ, i = 1, ..., n
    – Response vector y = (y_1, ..., y_n)⊤ ∈ ℝ^n
    – Design matrix X = (x_1, ..., x_n)⊤ ∈ ℝ^{n×p}
  • Regularized empirical risk minimization:
    min_{w ∈ ℝ^p} (1/n) ∑_{i=1}^n ℓ(y_i, w⊤x_i) + λ Ω(w) = min_{w ∈ ℝ^p} L(y, Xw) + λ Ω(w)
  • Norm Ω to promote sparsity
    – Main example: ℓ1-norm
    – Square loss ⇒ basis pursuit in signal processing (Chen et al., 2001), Lasso in statistics/machine learning (Tibshirani, 1996)
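
A minimal sketch of the problem above, assuming the square loss (so L(y, Xw) = (1/(2n))‖y − Xw‖₂²) and Ω = ℓ1-norm, solved by a proximal-gradient (ISTA) loop; the step size, iteration count, and synthetic data are illustrative choices, not taken from the slides.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/(2n)) * ||y - Xw||_2^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    w = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the smooth part's gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n    # gradient of the square-loss term
        w = soft_threshold(w - grad / L, lam / L)
    return w

# Toy usage on synthetic sparse data (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
w_true = np.zeros(50)
w_true[:5] = 1.0
y = X @ w_true + 0.01 * rng.standard_normal(100)
w_hat = lasso_ista(X, y, lam=0.1)
```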

  6. Sparsity in unsupervised machine learning and signal processing: dictionary learning
  • Responses y ∈ ℝ^n, design matrix X ∈ ℝ^{n×p}
    – Lasso: min_{w ∈ ℝ^p} L(y, Xw) + λ Ω(w)

  7. Sparsity in unsupervised machine learning and signal processing: dictionary learning
  • Single signal x ∈ ℝ^p, given dictionary D ∈ ℝ^{p×k}
    – Basis pursuit: min_{α ∈ ℝ^k} L(x, Dα) + λ Ω(α)

  8. Sparsity in unsupervised machine learning and signal processing: dictionary learning
  • Single signal x ∈ ℝ^p, given dictionary D ∈ ℝ^{p×k}
    – Basis pursuit: min_{α ∈ ℝ^k} L(x, Dα) + λ Ω(α)
  • Multiple signals x_i ∈ ℝ^p, i = 1, ..., n, given dictionary D ∈ ℝ^{p×k}:
    min_{α_1, ..., α_n ∈ ℝ^k} ∑_{i=1}^n [ L(x_i, Dα_i) + λ Ω(α_i) ]

  9. Sparsity in unsupervised machine learning and signal processing: dictionary learning
  • Single signal x ∈ ℝ^p, given dictionary D ∈ ℝ^{p×k}
    – Basis pursuit: min_{α ∈ ℝ^k} L(x, Dα) + λ Ω(α)
  • Multiple signals x_i ∈ ℝ^p, i = 1, ..., n, given dictionary D ∈ ℝ^{p×k}:
    min_{α_1, ..., α_n ∈ ℝ^k} ∑_{i=1}^n [ L(x_i, Dα_i) + λ Ω(α_i) ]
  • Dictionary learning: D = (d_1, ..., d_k) such that ∀j, ‖d_j‖₂ ≤ 1:
    min_D min_{α_1, ..., α_n ∈ ℝ^k} ∑_{i=1}^n [ L(x_i, Dα_i) + λ Ω(α_i) ]
  • Olshausen and Field (1997); Elad and Aharon (2006)
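
The nested minimization above suggests a simple alternating scheme: with D fixed, each code α_i solves a Lasso-type problem; with the codes fixed, D is updated under the constraints ‖d_j‖₂ ≤ 1. The sketch below follows that idea with the square loss, ISTA for the codes and a projected gradient step for the dictionary; step sizes and iteration counts are rough illustrative choices, and the online algorithms of Mairal et al. are far more efficient in practice.

```python
import numpy as np

def update_codes(Xs, D, lam, n_iter=200):
    """Sparse coding step: ISTA on all codes at once (square loss + l1); columns of Xs are signals."""
    L = np.linalg.norm(D, 2) ** 2                 # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], Xs.shape[1]))
    for _ in range(n_iter):
        V = A - D.T @ (D @ A - Xs) / L            # gradient step on 0.5 * ||Xs - D A||_F^2
        A = np.sign(V) * np.maximum(np.abs(V) - lam / L, 0.0)   # soft-thresholding
    return A

def update_dictionary(Xs, A, D, n_iter=100):
    """Dictionary step: projected gradient on 0.5 * ||Xs - D A||_F^2 with ||d_j||_2 <= 1."""
    step = 1.0 / max(np.linalg.norm(A @ A.T, 2), 1e-8)          # 1 / Lipschitz constant
    for _ in range(n_iter):
        D = D - step * (D @ A - Xs) @ A.T
        D = D / np.maximum(np.linalg.norm(D, axis=0), 1.0)      # project columns onto the unit ball
    return D

def dictionary_learning(Xs, k, lam, n_outer=10, seed=0):
    """Alternate between the sparse coding step and the dictionary step."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Xs.shape[0], k))
    D = D / np.linalg.norm(D, axis=0)
    A = None
    for _ in range(n_outer):
        A = update_codes(Xs, D, lam)
        D = update_dictionary(Xs, A, D)
    return D, A

# Toy usage: 500 signals of dimension 20, dictionary with 30 atoms (illustrative only).
rng = np.random.default_rng(1)
Xs = rng.standard_normal((20, 500))
D, A = dictionary_learning(Xs, k=30, lam=0.1)
```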

  10. Why structured sparsity?
  • Interpretability
    – Structured dictionary elements (Jenatton et al., 2009b)
    – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2011b; Mairal et al., 2010)

  11. Why structured sparsity?
  • Interpretability
    – Structured dictionary elements (Jenatton et al., 2009b)
    – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2011b; Mairal et al., 2010)
  • Stability and identifiability
    – The optimization problem min_{α ∈ ℝ^k} L(x, Dα) + λ ‖α‖₁ is unstable
    – Codes α are often used in later processing (Mairal et al., 2009b)
  • Prediction or estimation performance
    – When prior knowledge matches the data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)

  12. Why structured sparsity?
  • Interpretability
    – Structured dictionary elements (Jenatton et al., 2009b)
    – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2011b; Mairal et al., 2010)
  • Stability and identifiability
    – The optimization problem min_{α ∈ ℝ^k} L(x, Dα) + λ ‖α‖₁ is unstable
    – Codes α are often used in later processing (Mairal et al., 2009b)
  • Prediction or estimation performance
    – When prior knowledge matches the data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)
  • Multi-resolution analysis

  14. Classical approaches to structured sparsity (pre-2011)
  • Many application domains
    – Computer vision (Cevher et al., 2008; Kavukcuoglu et al., 2009; Mairal et al., 2009a)
    – Neuro-imaging (Gramfort and Kowalski, 2009; Jenatton et al., 2011a)
    – Bio-informatics (Rapaport et al., 2008; Kim and Xing, 2010)
    – Audio processing (Lefèvre et al., 2011)
  • Non-convex approaches
    – Haupt and Nowak (2006); Baraniuk et al. (2008); Huang et al. (2009)
  • Convex approaches
    – Design of sparsity-inducing norms

  15. Unit-norm balls: geometric interpretation

  16. Hierarchical dictionary learning (Jenatton, Mairal, Obozinski, and Bach, 2011b)
  • Structure on the codes α (not on the dictionary D)
  • Hierarchical penalization: Ω(α) = ∑_{G ∈ 𝒢} ‖α_G‖₂, where the groups G ∈ 𝒢 are the sets of descendants of the nodes of a tree
  • Variables are selected only after their ancestors (Zhao et al., 2009; Bach, 2008b)
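
A small sketch of how the groups 𝒢 can be built from the tree, assuming the tree is given as a {node: list of children} map; that encoding, and the convention that each group is a node together with all of its descendants, are illustrative choices matching the penalty written above.

```python
def descendant_groups(children):
    """Build the groups of Ω(α) = Σ_G ‖α_G‖₂ from a tree given as {node: [children]}.

    Each group is a node together with all of its descendants, so the zero
    patterns induced by the penalty are unions of complete subtrees and a
    variable can only be selected once its ancestors are."""
    groups = {}

    def collect(node):
        group = [node]
        for child in children.get(node, []):
            group.extend(collect(child))
        groups[node] = group
        return group

    # Roots are the nodes that never appear as someone's child.
    roots = set(children) - {c for kids in children.values() for c in kids}
    for root in roots:
        collect(root)
    return groups

# Example on the tree 0 -> {1, 2}, 1 -> {3, 4}:
print(descendant_groups({0: [1, 2], 1: [3, 4]}))
# {3: [3], 4: [4], 1: [1, 3, 4], 2: [2], 0: [0, 1, 3, 4, 2]}
```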

  17. Hierarchical dictionary learning: efficient optimization
    min_{D ∈ ℝ^{p×k}, A ∈ ℝ^{k×n}} ∑_{i=1}^n [ ‖x_i − Dα_i‖₂² + λ Ω(α_i) ]  s.t. ∀j, ‖d_j‖₂ ≤ 1
  • Minimization with respect to α_i: regularized least-squares
    – Many algorithms dedicated to the ℓ1-norm Ω(α) = ‖α‖₁
  • Proximal methods: first-order methods with the optimal convergence rate (Nesterov, 2007; Beck and Teboulle, 2009)
    – Require solving min_{α ∈ ℝ^p} (1/2) ‖y − α‖₂² + λ Ω(α) many times
  • Tree-structured regularization: efficient linear-time algorithm based on primal-dual decomposition (Jenatton et al., 2011b)

  18. Decomposability of the proximity operator
  • Sum of simple norms: Ω(α) = ∑_{G ∈ 𝒢} ‖α_G‖₂
    – Each proximity operator is simple (soft-thresholding of the ℓ2-norm)
  • In general, the proximity operator of the sum is not the composition of the proximity operators
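
The "simple" per-group operator referred to above is the proximal operator of λ‖·‖₂ on one group, i.e. block soft-thresholding; a minimal sketch (the function name is ours).

```python
import numpy as np

def prox_group_l2(v, lam):
    """Proximal operator of lam * ||.||_2: block soft-thresholding.

    prox(v) = (1 - lam / ||v||_2)_+ * v, i.e. the whole block is shrunk toward 0
    and set exactly to 0 when its norm is below lam."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / norm) * v
```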

  19. Decomposability of the proximity operator
  • Sum of simple norms: Ω(α) = ∑_{G ∈ 𝒢} ‖α_G‖₂
    – Each proximity operator is simple (soft-thresholding of the ℓ2-norm)
  • In general, the proximity operator of the sum is not the composition of the proximity operators
  • In this particular case, it is!
    – In which direction?

  20. Decomposability of the proximity operator
  • Sum of simple norms: Ω(α) = ∑_{G ∈ 𝒢} ‖α_G‖₂
    – Each proximity operator is simple (soft-thresholding of the ℓ2-norm)
  • In general, the proximity operator of the sum is not the composition of the proximity operators
  • In this particular case, it is!
    – From the leaves to the root
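
A sketch of the composition stated on this slide: applying the per-group block soft-thresholding operator once per group, visiting every node only after all of its descendants (leaves first, root last), yields the proximal operator of the tree-structured sum. Using a single weight λ for all groups, and the particular group/order encoding below, are simplifying assumptions for illustration.

```python
import numpy as np

def tree_prox(v, groups, order, lam):
    """Prox of α ↦ Σ_g lam * ||α_{groups[g]}||_2 for tree-structured groups,
    computed by composing the per-group block soft-thresholding operators.

    `order` must list the nodes so that each node comes after all of its
    descendants (i.e. from the leaves to the root)."""
    alpha = v.copy()
    for g in order:
        idx = groups[g]
        norm = np.linalg.norm(alpha[idx])
        scale = max(1.0 - lam / norm, 0.0) if norm > 0 else 0.0
        alpha[idx] = scale * alpha[idx]      # block soft-thresholding on this group
    return alpha

# Toy example on the tree 0 -> {1, 2}, 1 -> {3, 4}, with one variable per node.
groups = {0: [0, 1, 2, 3, 4], 1: [1, 3, 4], 2: [2], 3: [3], 4: [4]}
order = [3, 4, 2, 1, 0]                      # descendants always before ancestors
v = np.array([1.0, 0.5, 0.1, 0.8, 0.05])
print(tree_prox(v, groups, order, lam=0.2))
```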

  21. Application to image denoising - Dictionary tree

  22. Hierarchical dictionary learning: modelling of text corpora
  • Each document is modelled through its word counts
  • Low-rank matrix factorization of the word-document matrix
  • Probabilistic topic models (Blei et al., 2003)
    – Similar structures based on non-parametric Bayesian methods (Blei et al., 2004)
    – Can we achieve similar performance with a simple matrix factorization formulation?
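
As a point of reference for the last question, a plain (non-hierarchical) low-rank factorization of a word-document count matrix can already be obtained with off-the-shelf non-negative matrix factorization; the sketch below uses scikit-learn's NMF on placeholder data and does not include the tree-structured penalty discussed in these slides.

```python
import numpy as np
from sklearn.decomposition import NMF

# Placeholder word-document count matrix: rows = words, columns = documents.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(200, 50)).astype(float)

# Plain low-rank non-negative factorization counts ≈ D @ A:
# D gathers word weights per "topic", A gathers topic activations per document.
model = NMF(n_components=10, init="nndsvda", max_iter=500)
D = model.fit_transform(counts)   # shape (n_words, n_topics)
A = model.components_             # shape (n_topics, n_documents)
```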

  23. Modelling of text corpora - Dictionary tree

  24. Application to neuro-imaging (supervised): structured sparsity for fMRI (Jenatton et al., 2011a)
  • “Brain reading”: prediction of (seen) object size
  • Multi-scale activity levels through hierarchical penalization

  27. Non-linear variable selection
  • Given x = (x_1, ..., x_q) ∈ ℝ^q, find a function f(x_1, ..., x_q) that depends only on a few variables
  • Sparse generalized additive models (Ravikumar et al., 2008; Bach, 2008a):
    – restricted to f(x_1, ..., x_q) = f_1(x_1) + · · · + f_q(x_q)
  • COSSO (Lin and Zhang, 2006):
    – restricted to f(x_1, ..., x_q) = ∑_{J ⊂ {1,...,q}, |J| ≤ 2} f_J(x_J)
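
A sketch of one common way to make the first (additive) restriction sparse in the variables: expand each x_j into a small basis and penalize the coefficients with a group-lasso norm, one group per variable, so that selecting a variable means selecting its whole block. The raw polynomial basis and the plain proximal-gradient loop below are illustrative simplifications, not the smoothing-spline or kernel formulations of the cited papers.

```python
import numpy as np

def basis_expand(X, degree=3):
    """Expand each variable into a small polynomial basis; one block of columns per variable."""
    n, q = X.shape
    blocks = [np.column_stack([X[:, j] ** d for d in range(1, degree + 1)]) for j in range(q)]
    groups = [list(range(j * degree, (j + 1) * degree)) for j in range(q)]
    return np.hstack(blocks), groups

def sparse_additive_model(X, y, lam, degree=3, n_iter=1000):
    """Square loss + group-lasso penalty with one group per variable (proximal gradient)."""
    Phi, groups = basis_expand(X, degree)
    n, P = Phi.shape
    w = np.zeros(P)
    L = np.linalg.norm(Phi, 2) ** 2 / n          # Lipschitz constant of the smooth part
    for _ in range(n_iter):
        w = w - Phi.T @ (Phi @ w - y) / (n * L)  # gradient step on the square loss
        for idx in groups:                       # block soft-thresholding, variable by variable
            norm = np.linalg.norm(w[idx])
            w[idx] = w[idx] * (max(1.0 - lam / (L * norm), 0.0) if norm > 0 else 0.0)
    return w, groups

# Toy usage: only the first two of five variables enter the regression function.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.standard_normal(200)
w, groups = sparse_additive_model(X, y, lam=0.05)
selected = [j for j, idx in enumerate(groups) if np.linalg.norm(w[idx]) > 1e-8]
```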
