Structured sparsity and convex optimization


  1. Structured sparsity and convex optimization
  Francis Bach, INRIA - École Normale Supérieure, Paris, France
  Joint work with R. Jenatton, J. Mairal, G. Obozinski
  December 2015

  3. Outline
  • Structured sparsity
  • Hierarchical dictionary learning
    – Known topology but unknown location/projection
    – Tree: efficient linear-time computations
  • Non-linear variable selection
    – Known topology and location
    – Directed acyclic graph: semi-efficient active-set algorithm

  4. Sparsity in machine learning and statistics
  • Assumption: y = w⊤x + ε, with w ∈ ℝ^p sparse
    – Proxy for interpretability
    – Allows high-dimensional inference: log p = O(n)
  • Sparsity and convexity (ℓ1-norm regularization): min_{w ∈ ℝ^p} L(w) + ‖w‖₁
  [Figure: unit-norm balls in the (w1, w2) plane]

  5. Sparsity in supervised machine learning
  • Observed data (x_i, y_i) ∈ ℝ^p × ℝ, i = 1, ..., n
    – Response vector y = (y_1, ..., y_n)⊤ ∈ ℝ^n
    – Design matrix X = (x_1, ..., x_n)⊤ ∈ ℝ^{n×p}
  • Regularized empirical risk minimization:
    min_{w ∈ ℝ^p} (1/n) ∑_{i=1}^n ℓ(y_i, w⊤x_i) + λ Ω(w) = min_{w ∈ ℝ^p} L(y, Xw) + λ Ω(w)
  • Norm Ω to promote sparsity
    – Main example: ℓ1-norm
    – Square loss ⇒ basis pursuit in signal processing (Chen et al., 2001), Lasso in statistics/machine learning (Tibshirani, 1996)
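
A minimal sketch of the problem above, assuming the square loss (so L(y, Xw) = (1/(2n))‖y − Xw‖₂²) and Ω = ℓ1-norm, solved by a proximal-gradient (ISTA) loop; the step size, iteration count, and synthetic data are illustrative choices, not taken from the slides.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/(2n)) * ||y - Xw||_2^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    w = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the smooth part's gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n    # gradient of the square-loss term
        w = soft_threshold(w - grad / L, lam / L)
    return w

# Toy usage on synthetic sparse data (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
w_true = np.zeros(50)
w_true[:5] = 1.0
y = X @ w_true + 0.01 * rng.standard_normal(100)
w_hat = lasso_ista(X, y, lam=0.1)
```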

  6. Sparsity in unsupervised machine learning and signal processing: dictionary learning
  • Responses y ∈ ℝ^n, design matrix X ∈ ℝ^{n×p}
    – Lasso: min_{w ∈ ℝ^p} L(y, Xw) + λ Ω(w)

  7. Sparsity in unsupervised machine learning and signal processing: dictionary learning
  • Single signal x ∈ ℝ^p, given dictionary D ∈ ℝ^{p×k}
    – Basis pursuit: min_{α ∈ ℝ^k} L(x, Dα) + λ Ω(α)

  8. Sparsity in unsupervised machine learning and signal processing: dictionary learning
  • Single signal x ∈ ℝ^p, given dictionary D ∈ ℝ^{p×k}
    – Basis pursuit: min_{α ∈ ℝ^k} L(x, Dα) + λ Ω(α)
  • Multiple signals x_i ∈ ℝ^p, i = 1, ..., n, given dictionary D ∈ ℝ^{p×k}:
    min_{α_1, ..., α_n ∈ ℝ^k} ∑_{i=1}^n [ L(x_i, Dα_i) + λ Ω(α_i) ]

  9. Sparsity in unsupervised machine learning and signal processing: dictionary learning
  • Single signal x ∈ ℝ^p, given dictionary D ∈ ℝ^{p×k}
    – Basis pursuit: min_{α ∈ ℝ^k} L(x, Dα) + λ Ω(α)
  • Multiple signals x_i ∈ ℝ^p, i = 1, ..., n, given dictionary D ∈ ℝ^{p×k}:
    min_{α_1, ..., α_n ∈ ℝ^k} ∑_{i=1}^n [ L(x_i, Dα_i) + λ Ω(α_i) ]
  • Dictionary learning: D = (d_1, ..., d_k) such that ∀j, ‖d_j‖₂ ≤ 1:
    min_D min_{α_1, ..., α_n ∈ ℝ^k} ∑_{i=1}^n [ L(x_i, Dα_i) + λ Ω(α_i) ]
  • Olshausen and Field (1997); Elad and Aharon (2006)
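
The nested minimization above suggests a simple alternating scheme: with D fixed, each code α_i solves a Lasso-type problem; with the codes fixed, D is updated under the constraints ‖d_j‖₂ ≤ 1. The sketch below follows that idea with the square loss, ISTA for the codes and a projected gradient step for the dictionary; step sizes and iteration counts are rough illustrative choices, and the online algorithms of Mairal et al. are far more efficient in practice.

```python
import numpy as np

def update_codes(Xs, D, lam, n_iter=200):
    """Sparse coding step: ISTA on all codes at once (square loss + l1); columns of Xs are signals."""
    L = np.linalg.norm(D, 2) ** 2                 # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], Xs.shape[1]))
    for _ in range(n_iter):
        V = A - D.T @ (D @ A - Xs) / L            # gradient step on 0.5 * ||Xs - D A||_F^2
        A = np.sign(V) * np.maximum(np.abs(V) - lam / L, 0.0)   # soft-thresholding
    return A

def update_dictionary(Xs, A, D, n_iter=100):
    """Dictionary step: projected gradient on 0.5 * ||Xs - D A||_F^2 with ||d_j||_2 <= 1."""
    step = 1.0 / max(np.linalg.norm(A @ A.T, 2), 1e-8)          # 1 / Lipschitz constant
    for _ in range(n_iter):
        D = D - step * (D @ A - Xs) @ A.T
        D = D / np.maximum(np.linalg.norm(D, axis=0), 1.0)      # project columns onto the unit ball
    return D

def dictionary_learning(Xs, k, lam, n_outer=10, seed=0):
    """Alternate between the sparse coding step and the dictionary step."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Xs.shape[0], k))
    D = D / np.linalg.norm(D, axis=0)
    A = None
    for _ in range(n_outer):
        A = update_codes(Xs, D, lam)
        D = update_dictionary(Xs, A, D)
    return D, A

# Toy usage: 500 signals of dimension 20, dictionary with 30 atoms (illustrative only).
rng = np.random.default_rng(1)
Xs = rng.standard_normal((20, 500))
D, A = dictionary_learning(Xs, k=30, lam=0.1)
```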

  10. Why structured sparsity?
  • Interpretability
    – Structured dictionary elements (Jenatton et al., 2009b)
    – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2011b; Mairal et al., 2010)

  11. Why structured sparsity?
  • Interpretability
    – Structured dictionary elements (Jenatton et al., 2009b)
    – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2011b; Mairal et al., 2010)
  • Stability and identifiability
    – The optimization problem min_{α ∈ ℝ^k} L(x, Dα) + λ ‖α‖₁ is unstable
    – Codes α are often used in later processing (Mairal et al., 2009b)
  • Prediction or estimation performance
    – When prior knowledge matches the data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)

  12. Why structured sparsity?
  • Interpretability
    – Structured dictionary elements (Jenatton et al., 2009b)
    – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2011b; Mairal et al., 2010)
  • Stability and identifiability
    – The optimization problem min_{α ∈ ℝ^k} L(x, Dα) + λ ‖α‖₁ is unstable
    – Codes α are often used in later processing (Mairal et al., 2009b)
  • Prediction or estimation performance
    – When prior knowledge matches the data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)
  • Multi-resolution analysis

  14. Classical approaches to structured sparsity (pre-2011)
  • Many application domains
    – Computer vision (Cevher et al., 2008; Kavukcuoglu et al., 2009; Mairal et al., 2009a)
    – Neuro-imaging (Gramfort and Kowalski, 2009; Jenatton et al., 2011a)
    – Bio-informatics (Rapaport et al., 2008; Kim and Xing, 2010)
    – Audio processing (Lefèvre et al., 2011)
  • Non-convex approaches
    – Haupt and Nowak (2006); Baraniuk et al. (2008); Huang et al. (2009)
  • Convex approaches
    – Design of sparsity-inducing norms

  15. Unit-norm balls: geometric interpretation

  16. Hierarchical dictionary learning (Jenatton, Mairal, Obozinski, and Bach, 2011b)
  • Structure on the codes α (not on the dictionary D)
  • Hierarchical penalization: Ω(α) = ∑_{G ∈ 𝒢} ‖α_G‖₂, where the groups G ∈ 𝒢 are the sets of descendants of the nodes of a tree
  • Variables are selected only after their ancestors (Zhao et al., 2009; Bach, 2008b)
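
A small sketch of how the groups 𝒢 can be built from the tree, assuming the tree is given as a {node: list of children} map; that encoding, and the convention that each group is a node together with all of its descendants, are illustrative choices matching the penalty written above.

```python
def descendant_groups(children):
    """Build the groups of Ω(α) = Σ_G ‖α_G‖₂ from a tree given as {node: [children]}.

    Each group is a node together with all of its descendants, so the zero
    patterns induced by the penalty are unions of complete subtrees and a
    variable can only be selected once its ancestors are."""
    groups = {}

    def collect(node):
        group = [node]
        for child in children.get(node, []):
            group.extend(collect(child))
        groups[node] = group
        return group

    # Roots are the nodes that never appear as someone's child.
    roots = set(children) - {c for kids in children.values() for c in kids}
    for root in roots:
        collect(root)
    return groups

# Example on the tree 0 -> {1, 2}, 1 -> {3, 4}:
print(descendant_groups({0: [1, 2], 1: [3, 4]}))
# {3: [3], 4: [4], 1: [1, 3, 4], 2: [2], 0: [0, 1, 3, 4, 2]}
```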

  17. Hierarchical dictionary learning: efficient optimization
    min_{D ∈ ℝ^{p×k}, A ∈ ℝ^{k×n}} ∑_{i=1}^n [ ‖x_i − Dα_i‖₂² + λ Ω(α_i) ]  s.t. ∀j, ‖d_j‖₂ ≤ 1
  • Minimization with respect to α_i: regularized least-squares
    – Many algorithms dedicated to the ℓ1-norm Ω(α) = ‖α‖₁
  • Proximal methods: first-order methods with the optimal convergence rate (Nesterov, 2007; Beck and Teboulle, 2009)
    – Require solving min_{α ∈ ℝ^p} (1/2) ‖y − α‖₂² + λ Ω(α) many times
  • Tree-structured regularization: efficient linear-time algorithm based on primal-dual decomposition (Jenatton et al., 2011b)

  18. Decomposability of the proximity operator
  • Sum of simple norms: Ω(α) = ∑_{G ∈ 𝒢} ‖α_G‖₂
    – Each proximity operator is simple (soft-thresholding of the ℓ2-norm)
  • In general, the proximity operator of the sum is not the composition of the proximity operators
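
The "simple" per-group operator referred to above is the proximal operator of λ‖·‖₂ on one group, i.e. block soft-thresholding; a minimal sketch (the function name is ours).

```python
import numpy as np

def prox_group_l2(v, lam):
    """Proximal operator of lam * ||.||_2: block soft-thresholding.

    prox(v) = (1 - lam / ||v||_2)_+ * v, i.e. the whole block is shrunk toward 0
    and set exactly to 0 when its norm is below lam."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / norm) * v
```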

  19. Decomposability of the proximity operator
  • Sum of simple norms: Ω(α) = ∑_{G ∈ 𝒢} ‖α_G‖₂
    – Each proximity operator is simple (soft-thresholding of the ℓ2-norm)
  • In general, the proximity operator of the sum is not the composition of the proximity operators
  • In this particular case, it is!
    – In which direction?

  20. Decomposability of the proximity operator
  • Sum of simple norms: Ω(α) = ∑_{G ∈ 𝒢} ‖α_G‖₂
    – Each proximity operator is simple (soft-thresholding of the ℓ2-norm)
  • In general, the proximity operator of the sum is not the composition of the proximity operators
  • In this particular case, it is!
    – From the leaves to the root
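
A sketch of the composition stated on this slide: applying the per-group block soft-thresholding operator once per group, visiting every node only after all of its descendants (leaves first, root last), yields the proximal operator of the tree-structured sum. Using a single weight λ for all groups, and the particular group/order encoding below, are simplifying assumptions for illustration.

```python
import numpy as np

def tree_prox(v, groups, order, lam):
    """Prox of α ↦ Σ_g lam * ||α_{groups[g]}||_2 for tree-structured groups,
    computed by composing the per-group block soft-thresholding operators.

    `order` must list the nodes so that each node comes after all of its
    descendants (i.e. from the leaves to the root)."""
    alpha = v.copy()
    for g in order:
        idx = groups[g]
        norm = np.linalg.norm(alpha[idx])
        scale = max(1.0 - lam / norm, 0.0) if norm > 0 else 0.0
        alpha[idx] = scale * alpha[idx]      # block soft-thresholding on this group
    return alpha

# Toy example on the tree 0 -> {1, 2}, 1 -> {3, 4}, with one variable per node.
groups = {0: [0, 1, 2, 3, 4], 1: [1, 3, 4], 2: [2], 3: [3], 4: [4]}
order = [3, 4, 2, 1, 0]                      # descendants always before ancestors
v = np.array([1.0, 0.5, 0.1, 0.8, 0.05])
print(tree_prox(v, groups, order, lam=0.2))
```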

  21. Application to image denoising - Dictionary tree

  22. Hierarchical dictionary learning: modelling of text corpora
  • Each document is modelled through its word counts
  • Low-rank matrix factorization of the word-document matrix
  • Probabilistic topic models (Blei et al., 2003)
    – Similar structures based on non-parametric Bayesian methods (Blei et al., 2004)
    – Can we achieve similar performance with a simple matrix factorization formulation?
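
As a point of reference for the last question, a plain (non-hierarchical) low-rank factorization of a word-document count matrix can already be obtained with off-the-shelf non-negative matrix factorization; the sketch below uses scikit-learn's NMF on placeholder data and does not include the tree-structured penalty discussed in these slides.

```python
import numpy as np
from sklearn.decomposition import NMF

# Placeholder word-document count matrix: rows = words, columns = documents.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(200, 50)).astype(float)

# Plain low-rank non-negative factorization counts ≈ D @ A:
# D gathers word weights per "topic", A gathers topic activations per document.
model = NMF(n_components=10, init="nndsvda", max_iter=500)
D = model.fit_transform(counts)   # shape (n_words, n_topics)
A = model.components_             # shape (n_topics, n_documents)
```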

  23. Modelling of text corpora - Dictionary tree

  24. Application to neuro-imaging (supervised): structured sparsity for fMRI (Jenatton et al., 2011a)
  • “Brain reading”: prediction of (seen) object size
  • Multi-scale activity levels through hierarchical penalization

  27. Non-linear variable selection
  • Given x = (x_1, ..., x_q) ∈ ℝ^q, find a function f(x_1, ..., x_q) that depends only on a few variables
  • Sparse generalized additive models (Ravikumar et al., 2008; Bach, 2008a):
    – restricted to f(x_1, ..., x_q) = f_1(x_1) + · · · + f_q(x_q)
  • COSSO (Lin and Zhang, 2006):
    – restricted to f(x_1, ..., x_q) = ∑_{J ⊂ {1,...,q}, |J| ≤ 2} f_J(x_J)
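
A sketch of one common way to make the first (additive) restriction sparse in the variables: expand each x_j into a small basis and penalize the coefficients with a group-lasso norm, one group per variable, so that selecting a variable means selecting its whole block. The raw polynomial basis and the plain proximal-gradient loop below are illustrative simplifications, not the smoothing-spline or kernel formulations of the cited papers.

```python
import numpy as np

def basis_expand(X, degree=3):
    """Expand each variable into a small polynomial basis; one block of columns per variable."""
    n, q = X.shape
    blocks = [np.column_stack([X[:, j] ** d for d in range(1, degree + 1)]) for j in range(q)]
    groups = [list(range(j * degree, (j + 1) * degree)) for j in range(q)]
    return np.hstack(blocks), groups

def sparse_additive_model(X, y, lam, degree=3, n_iter=1000):
    """Square loss + group-lasso penalty with one group per variable (proximal gradient)."""
    Phi, groups = basis_expand(X, degree)
    n, P = Phi.shape
    w = np.zeros(P)
    L = np.linalg.norm(Phi, 2) ** 2 / n          # Lipschitz constant of the smooth part
    for _ in range(n_iter):
        w = w - Phi.T @ (Phi @ w - y) / (n * L)  # gradient step on the square loss
        for idx in groups:                       # block soft-thresholding, variable by variable
            norm = np.linalg.norm(w[idx])
            w[idx] = w[idx] * (max(1.0 - lam / (L * norm), 0.0) if norm > 0 else 0.0)
    return w, groups

# Toy usage: only the first two of five variables enter the regression function.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.standard_normal(200)
w, groups = sparse_additive_model(X, y, lam=0.05)
selected = [j for j, idx in enumerate(groups) if np.linalg.norm(w[idx]) > 1e-8]
```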
