

  1. Structured sparsity through convex optimization Francis Bach INRIA - École Normale Supérieure, Paris, France Joint work with R. Jenatton, J. Mairal, G. Obozinski Journées INRIA - Apprentissage - December 2011

  2. Outline • SIERRA project-team • Introduction: Sparse methods for machine learning – Need for structured sparsity: Going beyond the ℓ1-norm • Classical approaches to structured sparsity – Linear combinations of ℓq-norms • Structured sparsity through submodular functions – Relaxation of the penalization of supports – Unified algorithms and analysis

  3. SIERRA - created January 1st, 2011 Composition of the INRIA/ENS/CNRS team • 3 Researchers (Sylvain Arlot, Francis Bach, Guillaume Obozinski) • 4 Post-docs (Simon Lacoste-Julien, Nicolas Le Roux, Ronny Luss, Mark Schmidt) • 9 PhD students (Louise Benoit, Florent Couzinie-Devy, Edouard Grave, Toby Hocking, Armand Joulin, Augustin Lefèvre, Anil Nelakanti, Fabian Pedregosa, Matthieu Solnon)

  4. Machine learning Computer science and applied mathematics • Modeling, prediction and control from training examples • Theory – Analysis of statistical performance • Algorithms – Numerical efficiency and stability • Applications – Computer vision, bioinformatics, neuro-imaging, text, audio

  5. Scientific objectives - SIERRA tenet - Machine learning does not exist in the void - Specific domain knowledge must be exploited

  6. Scientific objectives - SIERRA tenet - Machine learning does not exist in the void - Specific domain knowledge must be exploited • Scientific challenges – Fully automated data processing – Incorporating structure – Large-scale learning

  7. Scientific objectives - SIERRA tenet - Machine learning does not exist in the void - Specific domain knowledge must be exploited • Scientific challenges – Fully automated data processing – Incorporating structure – Large-scale learning • Scientific objectives – Supervised learning – Parsimony – Optimization – Unsupervised learning

  8. Scientific objectives - SIERRA tenet - Machine learning does not exist in the void - Specific domain knowledge must be exploited • Scientific challenges – Fully automated data processing – Incorporating structure – Large-scale learning • Scientific objectives – Supervised learning – Parsimony – Optimization – Unsupervised learning • Interdisciplinary collaborations – Computer vision – Bioinformatics – Neuro-imaging – Text, audio, natural language

  9. Supervised learning • Data (x_i, y_i) ∈ X × Y, i = 1, ..., n • Goal: predict y ∈ Y from x ∈ X, i.e., find f : X → Y • Empirical risk minimization: min_f (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i)) + (λ/2) ‖f‖², i.e., data-fitting + regularization • SIERRA scientific objectives: – Studying generalization error (S. Arlot, M. Solnon, F. Bach) – Improving calibration (S. Arlot, M. Solnon, F. Bach) – Two main types of norms: ℓ2 vs. ℓ1 (G. Obozinski, F. Bach)
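
For concreteness, here is a minimal numpy sketch of this objective for a linear predictor f(x) = w⊤x with the square loss. The synthetic data, the value of λ, and the use of the closed-form ridge solution are my own illustration, not taken from the slides.

```python
import numpy as np

# Hypothetical synthetic data: n observations in dimension p
rng = np.random.default_rng(0)
n, p, lam = 50, 10, 0.1
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p)
y = X @ w_true + 0.1 * rng.standard_normal(n)

def objective(w):
    # (1/n) * sum_i (y_i - w^T x_i)^2 + (lam/2) * ||w||_2^2  (square loss + l2 regularization)
    return np.mean((y - X @ w) ** 2) + 0.5 * lam * w @ w

# With the square loss and l2 penalty the minimizer has a closed form (ridge regression):
#   w_hat = (X^T X / n + (lam/2) I)^{-1} (X^T y / n)
w_hat = np.linalg.solve(X.T @ X / n + 0.5 * lam * np.eye(p), X.T @ y / n)
print(objective(w_hat), objective(np.zeros(p)))
```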

  10. Sparsity in supervised machine learning • Observed data (x_i, y_i) ∈ R^p × R, i = 1, ..., n – Response vector y = (y_1, ..., y_n)⊤ ∈ R^n – Design matrix X = (x_1, ..., x_n)⊤ ∈ R^{n×p} • Regularized empirical risk minimization: min_{w ∈ R^p} (1/n) Σ_{i=1}^n ℓ(y_i, w⊤x_i) + λ Ω(w) = min_{w ∈ R^p} L(y, Xw) + λ Ω(w) • Norm Ω to promote sparsity – square loss + ℓ1-norm ⇒ basis pursuit in signal processing (Chen et al., 2001), Lasso in statistics/machine learning (Tibshirani, 1996) – Proxy for interpretability – Allows high-dimensional inference: log p = O(n)
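
As an illustration (not part of the original slides), a short scikit-learn sketch of the square-loss + ℓ1-norm case, i.e. the Lasso; the synthetic data and the value of the regularization parameter alpha are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic high-dimensional regression problem (hypothetical data, p > n)
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:5] = 1.0                       # only 5 relevant variables
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Square loss + lambda * ||w||_1 (the Lasso); alpha plays the role of lambda
# up to scikit-learn's scaling of the loss.
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)
print("number of nonzero coefficients:", np.sum(lasso.coef_ != 0))
```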

  11. Sparsity in unsupervised machine learning • Multiple responses/signals y = (y_1, ..., y_k) ∈ R^{n×k}: min_{X = (x_1, ..., x_p)} min_{w_1, ..., w_k ∈ R^p} Σ_{j=1}^k [ L(y_j, X w_j) + λ Ω(w_j) ]

  12. Sparsity in unsupervised machine learning • Multiple responses/signals y = (y_1, ..., y_k) ∈ R^{n×k}: min_{X = (x_1, ..., x_p)} min_{w_1, ..., w_k ∈ R^p} Σ_{j=1}^k [ L(y_j, X w_j) + λ Ω(w_j) ] • Only responses are observed ⇒ Dictionary learning – Learn X = (x_1, ..., x_p) ∈ R^{n×p} such that ∀j, ‖x_j‖_2 ≤ 1: min_{X = (x_1, ..., x_p)} min_{w_1, ..., w_k ∈ R^p} Σ_{j=1}^k [ L(y_j, X w_j) + λ Ω(w_j) ] – Olshausen and Field (1997); Elad and Aharon (2006); Mairal et al. (2009a) • sparse PCA: replace ‖x_j‖_2 ≤ 1 by Θ(x_j) ≤ 1
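
A rough sketch of the alternating minimization that this formulation suggests, with the square loss and ℓ1 penalty: sparse-code the signals with the dictionary fixed, then update and renormalize the dictionary. This is my own illustration of the idea, not the authors' algorithm, and all sizes and parameters are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: k signals of dimension n, dictionary with p atoms
rng = np.random.default_rng(0)
n, p, k, lam = 20, 30, 100, 0.1
Y = rng.standard_normal((n, k))                 # columns y_j: the observed signals
D = rng.standard_normal((n, p))
D /= np.linalg.norm(D, axis=0)                  # enforce ||x_j||_2 <= 1 on each atom

for _ in range(10):                             # alternating (block coordinate) minimization
    # 1) Sparse coding: with the dictionary fixed, each column solves a Lasso problem
    W = np.column_stack([
        Lasso(alpha=lam, fit_intercept=False, max_iter=5000).fit(D, Y[:, j]).coef_
        for j in range(k)
    ])                                          # W is p x k
    # 2) Dictionary update: least squares in D, then project atoms back to the unit ball
    D = Y @ np.linalg.pinv(W)
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)

print("average number of nonzeros per code:", np.mean(np.sum(W != 0, axis=0)))
```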

  13. Sparsity in signal processing • Multiple responses/signals x = (x_1, ..., x_k) ∈ R^{n×k}: min_{D = (d_1, ..., d_p)} min_{α_1, ..., α_k ∈ R^p} Σ_{j=1}^k [ L(x_j, D α_j) + λ Ω(α_j) ] • Only responses are observed ⇒ Dictionary learning – Learn D = (d_1, ..., d_p) ∈ R^{n×p} such that ∀j, ‖d_j‖_2 ≤ 1: min_{D = (d_1, ..., d_p)} min_{α_1, ..., α_k ∈ R^p} Σ_{j=1}^k [ L(x_j, D α_j) + λ Ω(α_j) ] – Olshausen and Field (1997); Elad and Aharon (2006); Mairal et al. (2009a) • sparse PCA: replace ‖d_j‖_2 ≤ 1 by Θ(d_j) ≤ 1

  14. Why structured sparsity? • Interpretability – Structured dictionary elements (Jenatton et al., 2009b) – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

  15. Structured sparse PCA (Jenatton et al., 2009b) [Figures: raw data; sparse PCA] • Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability

  16. Structured sparse PCA (Jenatton et al., 2009b) [Figures: raw data; sparse PCA] • Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability

  17. Structured sparse PCA (Jenatton et al., 2009b) [Figures: raw data; structured sparse PCA] • Enforce selection of convex nonzero patterns ⇒ robustness to occlusion in face identification

  18. Structured sparse PCA (Jenatton et al., 2009b) [Figures: raw data; structured sparse PCA] • Enforce selection of convex nonzero patterns ⇒ robustness to occlusion in face identification

  19. Why structured sparsity? • Interpretability – Structured dictionary elements (Jenatton et al., 2009b) – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

  20. Modelling of text corpora (Jenatton et al., 2010)

  21. Why structured sparsity? • Interpretability – Structured dictionary elements (Jenatton et al., 2009b) – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

  22. Why structured sparsity? • Interpretability – Structured dictionary elements (Jenatton et al., 2009b) – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010) • Stability and identifiability – Optimization problem min_{w ∈ R^p} L(y, Xw) + λ ‖w‖_1 is unstable – “Codes” w_j often used in later processing (Mairal et al., 2009c) • Prediction or estimation performance – When prior knowledge matches data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009) • Numerical efficiency – Non-linear variable selection with 2^p subsets (Bach, 2008)

  23. Classical approaches to structured sparsity • Many application domains – Computer vision (Cevher et al., 2008; Mairal et al., 2009b) – Neuro-imaging (Gramfort and Kowalski, 2009; Jenatton et al., 2011) – Bio-informatics (Rapaport et al., 2008; Kim and Xing, 2010) • Non-convex approaches – Haupt and Nowak (2006); Baraniuk et al. (2008); Huang et al. (2009) • Convex approaches – Design of sparsity-inducing norms

  24. Outline • SIERRA project-team • Introduction: Sparse methods for machine learning – Need for structured sparsity: Going beyond the ℓ1-norm • Classical approaches to structured sparsity – Linear combinations of ℓq-norms • Structured sparsity through submodular functions – Relaxation of the penalization of supports – Unified algorithms and analysis

  25. Sparsity-inducing norms • Popular choice for Ω – The ℓ1-ℓ2 norm: Ω(w) = Σ_{G ∈ H} ‖w_G‖_2 = Σ_{G ∈ H} ( Σ_{j ∈ G} w_j² )^{1/2} – with H a partition of {1, ..., p} – The ℓ1-ℓ2 norm sets to zero groups of non-overlapping variables (as opposed to single variables for the ℓ1-norm) – For the square loss, group Lasso (Yuan and Lin, 2006) [Figure: partition of the variables into disjoint groups G1, G2, G3]
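
A small numpy sketch (my own illustration, not from the slides) of this norm for a given partition, together with its proximal operator, block soft-thresholding, for non-overlapping groups.

```python
import numpy as np

def group_l1_l2_norm(w, groups):
    """Omega(w) = sum over groups G in H of ||w_G||_2, for a partition 'groups' of the indices."""
    return sum(np.linalg.norm(w[g]) for g in groups)

def prox_group_l1_l2(v, groups, lam):
    """Proximal operator of lam * Omega for non-overlapping groups: block soft-thresholding."""
    w = np.zeros_like(v)
    for g in groups:
        norm_g = np.linalg.norm(v[g])
        if norm_g > lam:
            w[g] = (1.0 - lam / norm_g) * v[g]   # shrink the whole block; weak blocks become exactly zero
    return w

# Toy example with p = 6 variables split into three groups (arbitrary choice)
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
v = np.array([3.0, 4.0, 0.1, 0.1, -2.0, 0.0])
print(group_l1_l2_norm(v, groups))           # 5.0 + ~0.14 + 2.0
print(prox_group_l1_l2(v, groups, lam=1.0))  # the weak middle group is zeroed entirely
```

With this proximal operator in hand, an objective of the form L(y, Xw) + λ Ω(w) can be handled by standard proximal-gradient methods, which is one common way group-Lasso problems are solved in practice.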

  26. Unit norm balls • Geometric interpretation [Figure: unit balls of ‖w‖_2, of ‖w‖_1, and of the mixed norm (w_1² + w_2²)^{1/2} + |w_3|]

  27. Sparsity-inducing norms • Popular choice for Ω – The ℓ1-ℓ2 norm: Ω(w) = Σ_{G ∈ H} ‖w_G‖_2 = Σ_{G ∈ H} ( Σ_{j ∈ G} w_j² )^{1/2} – with H a partition of {1, ..., p} – The ℓ1-ℓ2 norm sets to zero groups of non-overlapping variables (as opposed to single variables for the ℓ1-norm) – For the square loss, group Lasso (Yuan and Lin, 2006) • However, the ℓ1-ℓ2 norm encodes fixed/static prior information and requires knowing in advance how to group the variables • What happens if the set of groups H is not a partition anymore? [Figure: partition of the variables into disjoint groups G1, G2, G3]

  28. Structured sparsity with overlapping groups (Jenatton, Audibert, and Bach, 2009a) • When penalizing by the ℓ1-ℓ2 norm: Ω(w) = Σ_{G ∈ H} ‖w_G‖_2 = Σ_{G ∈ H} ( Σ_{j ∈ G} w_j² )^{1/2} – The ℓ1 norm induces sparsity at the group level: some w_G's are set to zero – Inside the groups, the ℓ2 norm does not promote sparsity [Figure: overlapping groups G1, G2, G3]
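
The same computation extends directly to overlapping groups; a brief numpy sketch (again my own illustration, with an arbitrary chain of overlapping groups) shows that the norm is still a sum of ℓ2 norms over the groups, now counting shared variables several times.

```python
import numpy as np

def overlapping_group_norm(w, groups):
    """Omega(w) = sum over G in H of ||w_G||_2; H may now contain overlapping groups."""
    return sum(np.linalg.norm(w[g]) for g in groups)

# Overlapping "chain" of groups on p = 5 variables (arbitrary illustration);
# with this penalty, zero patterns tend to be unions of groups (Jenatton et al., 2009a).
groups = [np.array([0, 1, 2]), np.array([1, 2, 3]), np.array([2, 3, 4])]
w = np.array([0.0, 0.0, 0.0, 1.0, 2.0])     # variables 0-2 are zero, so the first group's block vanishes
print(overlapping_group_norm(w, groups))
print([i for i, g in enumerate(groups) if np.all(w[g] == 0)])  # groups whose block is fully zero
```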
