Machine learning and convex optimization with submodular functions
Francis Bach, Sierra project-team, INRIA - École Normale Supérieure
Workshop on combinatorial optimization, Cargese, 2013


  1. Submodular and base polyhedra - Properties
     • Submodular polyhedron: P(F) = { s ∈ R^p, ∀ A ⊂ V, s(A) ≤ F(A) }
     • Base polyhedron: B(F) = P(F) ∩ { s(V) = F(V) }
     • Many facets (up to 2^p), many extreme points (up to p!)

  2. Submodular and base polyhedra - Properties
     • Submodular polyhedron: P(F) = { s ∈ R^p, ∀ A ⊂ V, s(A) ≤ F(A) }
     • Base polyhedron: B(F) = P(F) ∩ { s(V) = F(V) }
     • Many facets (up to 2^p), many extreme points (up to p!)
     • Fundamental property (Edmonds, 1970): if F is submodular, maximizing linear functions may be done by a "greedy algorithm"
        – Let w ∈ R^p_+ be such that w_{j_1} ≥ · · · ≥ w_{j_p}
        – Let s_{j_k} = F({j_1, . . . , j_k}) − F({j_1, . . . , j_{k−1}}) for k ∈ {1, . . . , p}
        – Then f(w) = max_{s ∈ P(F)} w⊤s = max_{s ∈ B(F)} w⊤s
        – Both problems are attained at the s defined above
     • Simple proof by convex duality
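As an illustration, here is a minimal Python/NumPy sketch of this greedy algorithm, returning the base s ∈ B(F) and the value f(w) = w⊤s. The directed-cut function `F_cut` at the end is a hypothetical example added only for the demonstration, not part of the slides.

```python
import numpy as np

def greedy(F, w):
    """Edmonds' greedy algorithm: given a submodular set-function F with
    F(empty) = 0 (called on lists of indices) and w in R^p, return the base
    s in B(F) maximizing w^T s, together with f(w) = w^T s."""
    p = len(w)
    order = np.argsort(-w)           # j_1, ..., j_p with w_{j_1} >= ... >= w_{j_p}
    s = np.zeros(p)
    A, prev = [], 0.0                # running prefix {j_1, ..., j_k} and its value F
    for j in order:
        A.append(int(j))
        val = F(A)
        s[j] = val - prev            # s_{j_k} = F({j_1,...,j_k}) - F({j_1,...,j_{k-1}})
        prev = val
    return s, float(np.dot(w, s))

# Hypothetical example: cut function of a small directed graph on V = {0, 1, 2}
edges = [(0, 1), (1, 2), (0, 2)]
def F_cut(A):
    A = set(A)
    return float(sum(1 for (u, v) in edges if u in A and v not in A))

s, f_w = greedy(F_cut, np.array([0.7, 0.3, 0.1]))   # s in B(F_cut), f_w = f(w)
```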

  3. Submodular functions - Links with convexity
     • Theorem (Lovász, 1982): if F is submodular, then
          min_{A ⊂ V} F(A) = min_{w ∈ {0,1}^p} f(w) = min_{w ∈ [0,1]^p} f(w)
     • Consequence: submodular function minimization may be done in polynomial time (through the ellipsoid algorithm)
     • Representation of f(w) as a support function (Edmonds, 1970): f(w) = max_{s ∈ B(F)} s⊤w
        – The maximizer s may be found efficiently through the greedy algorithm

  4. Outline
     1. Submodular functions
        – Review and examples of submodular functions
        – Links with convexity through the Lovász extension
     2. Submodular minimization
        – Non-smooth convex optimization
        – Parallel algorithm for a special case
     3. Structured sparsity-inducing norms
        – Relaxation of the penalization of supports by submodular functions
        – Extensions (symmetric, ℓq-relaxation)

  5. Submodular function minimization - Dual problem
     • Let F : 2^V → R be a submodular function (such that F(∅) = 0)
     • Convex duality (Edmonds, 1970):
          min_{A ⊂ V} F(A) = min_{w ∈ [0,1]^p} f(w)
                           = min_{w ∈ [0,1]^p} max_{s ∈ B(F)} w⊤s
                           = max_{s ∈ B(F)} min_{w ∈ [0,1]^p} w⊤s = max_{s ∈ B(F)} s_−(V)

  6. Exact submodular function minimization - Combinatorial algorithms
     • Algorithms based on min_{A ⊂ V} F(A) = max_{s ∈ B(F)} s_−(V)
     • Output the subset A and a base s ∈ B(F) as a certificate of optimality
     • Best algorithms have polynomial complexity (Schrijver, 2000; Iwata et al., 2001; Orlin, 2009), typically O(p^6) or more
     • Update a sequence of convex combinations of vertices of B(F) obtained from the greedy algorithm using a specific order
        – Based only on function evaluations
     • Recent algorithms use efficient reformulations in terms of generalized graph cuts (Jegelka et al., 2011)

  7. Approximate submodular function minimization
     • For most machine learning applications, no need to obtain exact minimum
        – For convex optimization, see, e.g., Bottou and Bousquet (2008)
          min_{A ⊂ V} F(A) = min_{w ∈ {0,1}^p} f(w) = min_{w ∈ [0,1]^p} f(w)

  8. Approximate submodular function minimization
     • For most machine learning applications, no need to obtain exact minimum
        – For convex optimization, see, e.g., Bottou and Bousquet (2008)
          min_{A ⊂ V} F(A) = min_{w ∈ {0,1}^p} f(w) = min_{w ∈ [0,1]^p} f(w)
     • Important properties of f for convex optimization
        – Polyhedral function
        – Representation as a maximum of linear functions: f(w) = max_{s ∈ B(F)} w⊤s
     • Stability vs. speed vs. generality vs. ease of implementation

  9. Projected subgradient descent (Shor et al., 1985)
     • A subgradient of f(w) = max_{s ∈ B(F)} s⊤w is obtained through the greedy algorithm
     • Use projected subgradient descent to minimize f on [0,1]^p
        – Iteration: w_t = Π_{[0,1]^p}( w_{t−1} − (C/√t) s_t ) where s_t ∈ ∂f(w_{t−1})
        – Convergence rate: f(w_t) − min_{w ∈ [0,1]^p} f(w) ≲ √p/√t, with primal/dual guarantees (Nesterov, 2003)
     • Fast iterations but slow convergence
        – need O(p/ε²) iterations to reach precision ε
        – need O(p²/ε²) function evaluations to reach precision ε
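A minimal sketch of this scheme, assuming a set-function oracle F with F(∅) = 0. The greedy routine supplies the subgradient; the constant C = 1 and the way the best sup-level set of the iterates is tracked are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def greedy_base(F, w):
    """Edmonds' greedy algorithm: s in B(F) maximizing w^T s (F(empty) = 0)."""
    s, A, prev = np.zeros(len(w)), [], 0.0
    for j in np.argsort(-w):
        A.append(int(j))
        val = F(A)
        s[j] = val - prev
        prev = val
    return s

def projected_subgradient_sfm(F, p, iters=500, C=1.0):
    """Minimize the Lovasz extension f over [0,1]^p by projected subgradient
    descent; return the best sup-level set of the iterates as an approximate
    minimizer of F."""
    w = 0.5 * np.ones(p)
    best_A, best_val = [], 0.0                               # F(empty) = 0
    for t in range(1, iters + 1):
        s_t = greedy_base(F, w)                              # s_t in B(F), subgradient at w
        w = np.clip(w - (C / np.sqrt(t)) * s_t, 0.0, 1.0)    # projection onto [0,1]^p
        for alpha in np.unique(w):                           # sup-level sets {w >= alpha}
            A = [k for k in range(p) if w[k] >= alpha]
            val = F(A)
            if val < best_val:
                best_A, best_val = A, val
    return best_A, best_val
```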

  10. Ellipsoid method (Nemirovski and Yudin, 1983)
     • Build a sequence of minimum-volume ellipsoids E_0, E_1, E_2, . . . that enclose the set of solutions (see figure)
     • Cost of a single iteration: p function evaluations and O(p^3) operations
     • Number of iterations: 2p² log( (max_{A ⊂ V} F(A) − min_{A ⊂ V} F(A)) / ε )
        – in total, O(p^5) operations and O(p^3) function evaluations (up to the logarithmic factor)
     • Slow in practice (the bound is "tight")

  11. Analytic center cutting planes (Goffin and Vial, 1993)
     • Center of gravity method
        – improves the convergence rate of the ellipsoid method
        – cannot be computed easily
     • Analytic center of a polytope defined by a_i⊤w ≤ b_i, i ∈ I:
          min_{w ∈ R^p} − Σ_{i ∈ I} log(b_i − a_i⊤w)
     • Analytic center cutting planes (ACCPM)
        – Each iteration has complexity O(p²|I| + |I|³) using Newton's method
        – No linear convergence rate
        – Good performance in practice

  12. Simplex method for submodular minimization
     • Mentioned by Girlich and Pisaruk (1997); McCormick (2005)
     • Formulation as a linear program: s ∈ B(F) ⇔ s = S⊤η with η ≥ 0, η⊤1_d = 1, where the rows of S ∈ R^{d×p} are the extreme points of B(F):
          max_{s ∈ B(F)} s_−(V) = max_{η ≥ 0, η⊤1_d = 1} Σ_{i=1}^p min{ (S⊤η)_i, 0 }
                                = max_{η ≥ 0, α ≥ 0, β ≥ 0} −β⊤1_p such that S⊤η − α + β = 0, η⊤1_d = 1
     • Column generation for the simplex method: only access the rows of S by maximizing linear functions
        – no complexity bound; may reach the global optimum if given enough iterations

  13. Separable optimization on base polyhedron
     • Optimization of convex functions of the form Ψ(w) + f(w), with f the Lovász extension of F and Ψ(w) = Σ_{k ∈ V} ψ_k(w_k)
     • Structured sparsity
        – Total variation denoising - isotonic regression
        – Regularized risk minimization penalized by the Lovász extension

  14. Total variation denoising (Chambolle, 2005)
     • F(A) = Σ_{k ∈ A, j ∈ V\A} d(k, j)   ⇒   f(w) = Σ_{k,j ∈ V} d(k, j) (w_k − w_j)_+
     • d symmetric ⇒ f = total variation
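For a weight function d stored as a dense matrix, this Lovász extension can be evaluated directly; a small sketch follows (dense storage is only for clarity, a graph-based representation is what one would use on images).

```python
import numpy as np

def total_variation(w, d):
    """f(w) = sum_{k,j} d[k, j] * (w[k] - w[j])_+ for a nonnegative weight matrix d;
    when d is symmetric this is the (anisotropic) total variation of w."""
    diff = np.maximum(w[:, None] - w[None, :], 0.0)   # (w_k - w_j)_+ for all pairs
    return float(np.sum(d * diff))
```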

  15. Isotonic regression
     • Given real numbers x_i, i = 1, . . . , p
        – Find y ∈ R^p that minimizes (1/2) Σ_{i=1}^p (x_i − y_i)² such that ∀i, y_i ≤ y_{i+1}
     • For a directed chain, f(y) = 0 if and only if ∀i, y_i ≤ y_{i+1}
     • Minimize (1/2) Σ_{i=1}^p (x_i − y_i)² + λ f(y) for λ large
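This isotonic regression problem is solved exactly by the pool-adjacent-violators algorithm mentioned later in the talk; a minimal sketch for the non-decreasing constraint and unit weights is given below.

```python
import numpy as np

def pool_adjacent_violators(x):
    """Solve min_y 0.5 * sum_i (x_i - y_i)^2 subject to y_1 <= ... <= y_p
    by pooling adjacent blocks whose means violate the ordering."""
    x = np.asarray(x, dtype=float)
    means, counts = [], []               # one entry per pooled block
    for value in x:
        means.append(value)
        counts.append(1)
        # merge while the last two block means violate monotonicity
        while len(means) > 1 and means[-2] > means[-1]:
            total = means[-1] * counts[-1] + means[-2] * counts[-2]
            counts[-2] += counts[-1]
            means[-2] = total / counts[-2]
            means.pop()
            counts.pop()
    return np.repeat(means, counts)      # expand block means back to length p

y = pool_adjacent_violators([1.0, 3.0, 2.0, 4.0, 3.5])
# y is the closest non-decreasing sequence to x: [1.0, 2.5, 2.5, 3.75, 3.75]
```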

  16. Separable optimization on base polyhedron
     • Optimization of convex functions of the form Ψ(w) + f(w), with f the Lovász extension of F and Ψ(w) = Σ_{k ∈ V} ψ_k(w_k)
     • Structured sparsity
        – Total variation denoising - isotonic regression
        – Regularized risk minimization penalized by the Lovász extension

  17. Separable optimization on base polyhedron
     • Optimization of convex functions of the form Ψ(w) + f(w), with f the Lovász extension of F and Ψ(w) = Σ_{k ∈ V} ψ_k(w_k)
     • Structured sparsity
        – Total variation denoising - isotonic regression
        – Regularized risk minimization penalized by the Lovász extension
     • Proximal methods (see second part)
        – Minimize Ψ(w) + f(w) for smooth Ψ as soon as the following "proximal" problem may be solved efficiently:
             min_{w ∈ R^p} (1/2)||w − z||_2^2 + f(w) = min_{w ∈ R^p} Σ_{k=1}^p (1/2)(w_k − z_k)² + f(w)
     • Submodular function minimization

  18. Separable optimization on base polyhedron - Convex duality
     • Let ψ_k : R → R, k ∈ {1, . . . , p}, be p functions. Assume
        – Each ψ_k is strictly convex
        – sup_{α ∈ R} ψ′_k(α) = +∞ and inf_{α ∈ R} ψ′_k(α) = −∞
        – Denote by ψ*_1, . . . , ψ*_p their Fenchel conjugates (which then have full domain)

  19. Separable optimization on base polyhedron - Convex duality
     • Let ψ_k : R → R, k ∈ {1, . . . , p}, be p functions. Assume
        – Each ψ_k is strictly convex
        – sup_{α ∈ R} ψ′_k(α) = +∞ and inf_{α ∈ R} ψ′_k(α) = −∞
        – Denote by ψ*_1, . . . , ψ*_p their Fenchel conjugates (which then have full domain)
          min_{w ∈ R^p} f(w) + Σ_{j=1}^p ψ_j(w_j) = min_{w ∈ R^p} max_{s ∈ B(F)} w⊤s + Σ_{j=1}^p ψ_j(w_j)
                                                  = max_{s ∈ B(F)} min_{w ∈ R^p} w⊤s + Σ_{j=1}^p ψ_j(w_j)
                                                  = max_{s ∈ B(F)} − Σ_{j=1}^p ψ*_j(−s_j)

  20. Separable optimization on base polyhedron - Equivalence with submodular function minimization
     • For α ∈ R, let A^α ⊂ V be a minimizer of A ↦ F(A) + Σ_{j ∈ A} ψ′_j(α)
     • Let w* be the unique minimizer of w ↦ f(w) + Σ_{j=1}^p ψ_j(w_j)
     • Proposition (Chambolle and Darbon, 2009):
        – Given A^α for all α ∈ R, then ∀j, w*_j = sup({ α ∈ R, j ∈ A^α })
        – Given w*, then A ↦ F(A) + Σ_{j ∈ A} ψ′_j(α) has minimal minimizer {w* > α} and maximal minimizer {w* ≥ α}
     • Separable optimization is equivalent to a sequence of submodular function minimizations
        – NB: extension of known results from parametric max-flow

  21. Equivalence with submodular function minimization - Proof sketch (Bach, 2011b)
     • Duality gap for min_{w ∈ R^p} f(w) + Σ_{j=1}^p ψ_j(w_j) = max_{s ∈ B(F)} − Σ_{j=1}^p ψ*_j(−s_j):
          f(w) + Σ_{j=1}^p ψ_j(w_j) − ( − Σ_{j=1}^p ψ*_j(−s_j) )
             = f(w) − w⊤s + Σ_{j=1}^p [ ψ_j(w_j) + ψ*_j(−s_j) + w_j s_j ]
             = ∫_{−∞}^{+∞} [ (F + ψ′(α))({w ≥ α}) − (s + ψ′(α))_−(V) ] dα
     • Duality gap for convex problems = sum of duality gaps for combinatorial problems

  22. Separable optimization on base polyhedron - Quadratic case
     • Let F be a submodular function and w ∈ R^p the unique minimizer of w ↦ f(w) + (1/2)||w||_2^2. Then:
        (a) s = −w is the point in B(F) with minimum ℓ2-norm
        (b) For all λ ∈ R, the maximal minimizer of A ↦ F(A) + λ|A| is {w ≥ −λ} and the minimal minimizer is {w > −λ}
     • Consequences
        – Thresholding the minimum-norm point of B(F) at 0 minimizes F (Fujishige and Isotani, 2011)
        – Minimizing submodular functions with cardinality constraints (Nagano et al., 2011)
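A short sketch of how consequence (b) is used once the proximal solution w (equivalently, minus the minimum-norm point of B(F)) has been computed by any of the algorithms discussed in this part; the proximal solver itself is left abstract here.

```python
import numpy as np

def minimizers_from_prox(w, lam=0.0):
    """Given w = argmin_v f(v) + 0.5*||v||_2^2, return the maximal and minimal
    minimizers of A -> F(A) + lam*|A| as index arrays (property (b) above).
    With lam = 0 this recovers minimizers of F itself, i.e. thresholding the
    minimum-norm point of B(F) at zero."""
    w = np.asarray(w, dtype=float)
    maximal = np.flatnonzero(w >= -lam)     # {k : w_k >= -lam}
    minimal = np.flatnonzero(w > -lam)      # {k : w_k >  -lam}
    return maximal, minimal
```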

  23. From convex to combinatorial optimization and vice-versa...
     • Solving min_{w ∈ R^p} Σ_{k ∈ V} ψ_k(w_k) + f(w) to solve min_{A ⊂ V} F(A)
        – Thresholding solutions w at zero if ∀k ∈ V, ψ′_k(0) = 0
        – For quadratic functions ψ_k(w_k) = (1/2)w_k², equivalent to projecting 0 onto B(F) (Fujishige, 2005)

  24. From convex to combinatorial optimization and vice-versa...
     • Solving min_{w ∈ R^p} Σ_{k ∈ V} ψ_k(w_k) + f(w) to solve min_{A ⊂ V} F(A)
        – Thresholding solutions w at zero if ∀k ∈ V, ψ′_k(0) = 0
        – For quadratic functions ψ_k(w_k) = (1/2)w_k², equivalent to projecting 0 onto B(F) (Fujishige, 2005)
     • Solving min_{A ⊂ V} F(A) − t(A) to solve min_{w ∈ R^p} Σ_{k ∈ V} ψ_k(w_k) + f(w)
        – General decomposition strategy (Groenevelt, 1991)
        – Efficient only when submodular minimization is efficient

  25. Solving min_{A ⊂ V} F(A) − t(A) to solve min_{w ∈ R^p} Σ_{k ∈ V} ψ_k(w_k) + f(w)
     • General recursive divide-and-conquer algorithm (Groenevelt, 1991)
     • NB: dual version of Fujishige (2005)
     1. Compute the minimizer t ∈ R^p of Σ_{j ∈ V} ψ*_j(−t_j) s.t. t(V) = F(V)
     2. Compute a minimizer A of F(A) − t(A)
     3. If A = V, then t is optimal. Exit.
     4. Compute a minimizer s_A of Σ_{j ∈ A} ψ*_j(−s_j) over s ∈ B(F_A), where F_A : 2^A → R is the restriction of F to A, i.e., F_A(B) = F(B) for B ⊂ A
     5. Compute a minimizer s_{V\A} of Σ_{j ∈ V\A} ψ*_j(−s_j) over s ∈ B(F^A), where F^A(B) = F(A ∪ B) − F(A) for B ⊂ V\A
     6. Concatenate s_A and s_{V\A}. Exit.
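Below is a sketch of this recursion specialized to the quadratic case ψ_k(w_k) = w_k²/2, so that ψ*_k(−s_k) = s_k²/2 and step 1 reduces to the constant vector t = (F(V)/|V|)·1. The submodular minimization of step 2 is done here by brute-force enumeration, so the sketch is only meant for small ground sets; any SFM routine could be substituted.

```python
from itertools import combinations

def brute_force_min(G, V):
    """Minimize a set-function G over all subsets of V by enumeration
    (exponential cost; placeholder for a real SFM routine)."""
    best_A, best_val = [], G([])
    for r in range(1, len(V) + 1):
        for A in combinations(V, r):
            val = G(list(A))
            if val < best_val:
                best_A, best_val = list(A), val
    return best_A, best_val

def decompose_quadratic(F, V):
    """Groenevelt's divide-and-conquer for min_{s in B(F)} sum_j s_j^2 / 2.
    Returns the optimal s as a dict {index: value}; w* = -s then solves the
    proximal problem min_w f(w) + 0.5*||w||^2."""
    if not V:
        return {}
    t = F(V) / len(V)                                          # step 1: t_j = F(V)/|V|
    A, val = brute_force_min(lambda B: F(B) - t * len(B), V)   # step 2
    if val >= -1e-12:                                          # step 3: minimum is 0, so t is optimal
        return {j: t for j in V}
    s = decompose_quadratic(F, A)                              # step 4: restriction F_A
    rest = [j for j in V if j not in A]
    contraction = lambda B: F(A + list(B)) - F(A)              # step 5: contraction F^A
    s.update(decompose_quadratic(contraction, rest))
    return s                                                   # step 6: concatenation
```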

  26. Solving min_{w ∈ R^p} Σ_{k ∈ V} ψ_k(w_k) + f(w) to solve min_{A ⊂ V} F(A)
     • Dual problem: max_{s ∈ B(F)} − Σ_{j=1}^p ψ*_j(−s_j)
     • Constrained optimization when linear functions can be maximized
        – Frank-Wolfe algorithms
     • Two main types for convex functions

  27. Approximate quadratic optimization on B(F)
     • Goal: min_{w ∈ R^p} (1/2)||w||_2^2 + f(w) = max_{s ∈ B(F)} − (1/2)||s||_2^2
     • Can only maximize linear functions on B(F)
     • Two types of "Frank-Wolfe" algorithms
     • 1. Active-set algorithm (⇔ min-norm-point)
        – Sequence of maximizations of linear functions over B(F), plus overheads (affine projections)
        – Finite convergence, but no complexity bounds

  28. Minimum-norm-point algorithm (Wolfe, 1976)
     (figure: panels (a)–(f) illustrating successive iterations of the algorithm)

  29. Approximate quadratic optimization on B(F)
     • Goal: min_{w ∈ R^p} (1/2)||w||_2^2 + f(w) = max_{s ∈ B(F)} − (1/2)||s||_2^2
     • Can only maximize linear functions on B(F)
     • Two types of "Frank-Wolfe" algorithms
     • 1. Active-set algorithm (⇔ min-norm-point)
        – Sequence of maximizations of linear functions over B(F), plus overheads (affine projections)
        – Finite convergence, but no complexity bounds
     • 2. Conditional gradient
        – Sequence of maximizations of linear functions over B(F)
        – Approximate optimality bound
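A sketch of alternative 2: conditional gradient with exact line search on the equivalent problem min_{s ∈ B(F)} (1/2)||s||², where the linear maximization oracle over B(F) is the greedy algorithm applied to −s. The small greedy helper is repeated here so the snippet stands alone; stopping tolerance and iteration count are illustrative.

```python
import numpy as np

def greedy_base(F, w):
    """Edmonds' greedy algorithm: s in B(F) maximizing w^T s (F(empty) = 0)."""
    s, A, prev = np.zeros(len(w)), [], 0.0
    for j in np.argsort(-w):
        A.append(int(j))
        val = F(A)
        s[j] = val - prev
        prev = val
    return s

def min_norm_point_cg(F, p, iters=200):
    """Conditional gradient (Frank-Wolfe) with line search for
    min_{s in B(F)} 0.5*||s||^2; returns s_t and the primal candidate w_t = -s_t."""
    s = greedy_base(F, np.zeros(p))          # any initial base
    for _ in range(iters):
        direction = greedy_base(F, -s)       # argmax_{v in B(F)} <-s, v>
        d = direction - s
        gap = -float(np.dot(s, d))           # Frank-Wolfe duality gap
        if gap <= 1e-12:
            break
        # exact line search for the quadratic 0.5*||s + gamma*d||^2 on [0, 1]
        gamma = min(1.0, max(0.0, gap / float(np.dot(d, d))))
        s = s + gamma * d
    return s, -s
```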

  30. Conditional gradient with line search
     (figure: panels (a)–(i) illustrating successive iterations of the algorithm)

  31. Approximate quadratic optimization on B(F)
     • Proposition: t steps of conditional gradient (with line search) output s_t ∈ B(F) and w_t = −s_t such that
          f(w_t) + (1/2)||w_t||_2^2 − OPT ≤ f(w_t) + (1/2)||w_t||_2^2 + (1/2)||s_t||_2^2 ≤ 2D²/t

  32. Approximate quadratic optimization on B(F)
     • Proposition: t steps of conditional gradient (with line search) output s_t ∈ B(F) and w_t = −s_t such that
          f(w_t) + (1/2)||w_t||_2^2 − OPT ≤ f(w_t) + (1/2)||w_t||_2^2 + (1/2)||s_t||_2^2 ≤ 2D²/t
     • Improved primal candidate through isotonic regression
        – f(w) is linear on any set of w with fixed ordering
        – May be optimized using isotonic regression ("pool-adjacent-violators") in O(n) (see, e.g., Best and Chakravarti, 1990)
        – Given w_t = −s_t, keep the ordering and re-optimize

  33. Approximate quadratic optimization on B(F)
     • Proposition: t steps of conditional gradient (with line search) output s_t ∈ B(F) and w_t = −s_t such that
          f(w_t) + (1/2)||w_t||_2^2 − OPT ≤ f(w_t) + (1/2)||w_t||_2^2 + (1/2)||s_t||_2^2 ≤ 2D²/t
     • Improved primal candidate through isotonic regression
        – f(w) is linear on any set of w with fixed ordering
        – May be optimized using isotonic regression ("pool-adjacent-violators") in O(n) (see, e.g., Best and Chakravarti, 1990)
        – Given w_t = −s_t, keep the ordering and re-optimize
     • Better bound for submodular function minimization?

  34. From quadratic optimization on B(F) to submodular function minimization
     • Proposition: if w is ε-optimal for min_{w ∈ R^p} (1/2)||w||_2^2 + f(w), then at least one level set A of w is (√(εp)/2)-optimal for submodular function minimization
     • If ε = 2D²/t, then √(εp)/2 = D√p/√(2t)  ⇒  no provable gains, but:
        – Bound on the iterates A_t (with additional assumptions)
        – Possible thresholding for acceleration

  35. From quadratic optimization on B(F) to submodular function minimization
     • Proposition: if w is ε-optimal for min_{w ∈ R^p} (1/2)||w||_2^2 + f(w), then at least one level set A of w is (√(εp)/2)-optimal for submodular function minimization
     • If ε = 2D²/t, then √(εp)/2 = D√p/√(2t)  ⇒  no provable gains, but:
        – Bound on the iterates A_t (with additional assumptions)
        – Possible thresholding for acceleration
     • Lower complexity bound for SFM
        – Conjecture: no algorithm based only on a sequence of greedy algorithms obtained from linear combinations of bases can improve on the subgradient bound (after p/2 iterations).

  36. Simulations on standard benchmark - "DIMACS Genrmf-wide", p = 430
     • Submodular function minimization
        – (Left) dual suboptimality log_10(min(F) − s_−(V)); (Right) primal suboptimality log_10(F(A) − min(F))
     (figure: suboptimality vs. iterations for MNP, CG-LS, CG-1/t, SD-1/t^{1/2}, SD-Polyak, Ellipsoid, Simplex, ACCPM, ACCPM-simp.)

  37. Simulations on standard benchmark - "DIMACS Genrmf-long", p = 575
     • Submodular function minimization
        – (Left) dual suboptimality log_10(min(F) − s_−(V)); (Right) primal suboptimality log_10(F(A) − min(F))
     (figure: suboptimality vs. iterations for MNP, CG-LS, CG-1/t, SD-1/t^{1/2}, SD-Polyak, Ellipsoid, Simplex, ACCPM, ACCPM-simp.)

  38. Simulations on standard benchmark
     • Separable quadratic optimization
        – (Left) dual suboptimality log_10(OPT + ||s||²/2); (Right) primal suboptimality log_10(||w||²/2 + f(w) − OPT) (in dashed, before the pool-adjacent-violators correction)
     (figure: suboptimality vs. iterations for MNP, CG-LS, CG-1/t)

  39. Outline
     1. Submodular functions
        – Review and examples of submodular functions
        – Links with convexity through the Lovász extension
     2. Submodular minimization
        – Non-smooth convex optimization
        – Parallel algorithm for a special case
     3. Structured sparsity-inducing norms
        – Relaxation of the penalization of supports by submodular functions
        – Extensions (symmetric, ℓq-relaxation)

  40. From submodular minimization to proximal problems
     • Summary: several optimization problems
        – Discrete problem: min_{A ⊂ V} F(A) = min_{w ∈ {0,1}^p} f(w)
        – Continuous problem: min_{w ∈ [0,1]^p} f(w)
        – Proximal problem (P): min_{w ∈ R^p} (1/2)||w||_2^2 + f(w)
     • Solving (P) is equivalent to minimizing F(A) + λ|A| for all λ:
        – argmin_{A ⊆ V} F(A) + λ|A| = { k, w_k ≥ −λ }
     • Much simpler problem, but no gains in terms of (provable) complexity
        – See Bach (2011a)

  41. Decomposable functions
     • F may often be decomposed as the sum of r "simple" functions: F(A) = Σ_{j=1}^r F_j(A)
        – Each F_j may be minimized efficiently
        – Example: 2D grid = vertical chains + horizontal chains
     • Komodakis et al. (2011); Kolmogorov (2012); Stobbe and Krause (2010); Savchynskyy et al. (2011)
        – Dual decomposition approach, but slow non-smooth problem

  42. Decomposable functions and proximal problems (Jegelka, Bach, and Sra, 2013)
     • Dual problem:
          min_{w ∈ R^p} f_1(w) + f_2(w) + (1/2)||w||_2^2
             = min_{w ∈ R^p} max_{s_1 ∈ B(F_1)} s_1⊤w + max_{s_2 ∈ B(F_2)} s_2⊤w + (1/2)||w||_2^2
             = max_{s_1 ∈ B(F_1), s_2 ∈ B(F_2)} − (1/2)||s_1 + s_2||_2^2
     • Finding the closest point between two polytopes
        – Several alternatives: block coordinate ascent, Douglas-Rachford splitting (Bauschke et al., 2004)
        – (a) no parameters, (b) parallelizable
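A sketch of the block coordinate ascent alternative: alternately minimize (1/2)||s_1 + s_2||² exactly in s_1 and in s_2. Each partial minimization is the Euclidean projection of −s_2 (resp. −s_1) onto a base polytope, i.e. a separable quadratic problem for a single "simple" function; the projection oracles are assumed black boxes here (e.g. a parametric max-flow or min-norm-point routine for each F_j).

```python
import numpy as np

def block_coordinate_ascent(project_B1, project_B2, p, iters=100):
    """Maximize -0.5*||s1 + s2||^2 over s1 in B(F1), s2 in B(F2) by alternating
    exact minimization over each block.  `project_B1(z)` / `project_B2(z)` are
    assumed oracles returning the Euclidean projection of z onto B(F1) / B(F2)."""
    s1, s2 = np.zeros(p), np.zeros(p)
    for _ in range(iters):
        s1 = project_B1(-s2)        # argmin_{s1 in B(F1)} ||s1 + s2||^2
        s2 = project_B2(-s1)        # argmin_{s2 in B(F2)} ||s2 + s1||^2
    w = -(s1 + s2)                  # primal candidate for min f1(w) + f2(w) + 0.5*||w||^2
    return w, s1, s2
```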

  43. Experiments
     • Graph cuts on a 500 × 500 image
     (figure: duality gap vs. iterations, for smooth and non-smooth formulations; methods: grad-accel, dual-sgd-P, dual-sgd-F, BCD, dual-smooth, BCD-para, primal-smooth, primal-sgd, DR, DR-para)
     • Matlab/C implementation 10 times slower than C code for graph cut
        – Easy to code and parallelizable

  44. Parallelization
     • Multiple cores
     (figure: speedup factor of 40 iterations of DR vs. number of cores, up to 8 cores)

  45. Outline
     1. Submodular functions
        – Review and examples of submodular functions
        – Links with convexity through the Lovász extension
     2. Submodular minimization
        – Non-smooth convex optimization
        – Parallel algorithm for a special case
     3. Structured sparsity-inducing norms
        – Relaxation of the penalization of supports by submodular functions
        – Extensions (symmetric, ℓq-relaxation)

  46. Structured sparsity through submodular functions - References and Links
     • References on submodular functions
        – Submodular Functions and Optimization (Fujishige, 2005)
        – Tutorial paper based on convex optimization (Bach, 2011b): www.di.ens.fr/~fbach/submodular_fot.pdf
     • Structured sparsity through convex optimization
        – Algorithms (Bach, Jenatton, Mairal, and Obozinski, 2011): www.di.ens.fr/~fbach/bach_jenatton_mairal_obozinski_FOT.pdf
        – Theory/applications (Bach, Jenatton, Mairal, and Obozinski, 2012): www.di.ens.fr/~fbach/stat_science_structured_sparsity.pdf
        – Matlab/R/Python codes: http://www.di.ens.fr/willow/SPAMS/
     • Slides: www.di.ens.fr/~fbach/fbach_cargese_2013.pdf

  47. Sparsity in supervised machine learning
     • Observed data (x_i, y_i) ∈ R^p × R, i = 1, . . . , n
        – Response vector y = (y_1, . . . , y_n)⊤ ∈ R^n
        – Design matrix X = (x_1, . . . , x_n)⊤ ∈ R^{n×p}
     • Regularized empirical risk minimization:
          min_{w ∈ R^p} (1/n) Σ_{i=1}^n ℓ(y_i, w⊤x_i) + λ Ω(w) = min_{w ∈ R^p} L(y, Xw) + λ Ω(w)
     • Norm Ω to promote sparsity
        – square loss + ℓ1-norm ⇒ basis pursuit in signal processing (Chen et al., 2001), Lasso in statistics/machine learning (Tibshirani, 1996)
        – Proxy for interpretability
        – Allows high-dimensional inference: log p = O(n)
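For the square loss with Ω = ℓ1-norm, a standard solver is the proximal gradient (ISTA) iteration; a minimal sketch with a fixed step size 1/L follows. The 1/(2n) scaling of the loss and the iteration count are conventional choices made for this illustration only.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau*||.||_1 (entrywise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(X, y, lam, iters=500):
    """Proximal gradient (ISTA) for min_w (1/(2n))*||y - Xw||^2 + lam*||w||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz constant of the smooth part
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n       # gradient of the square loss
        w = soft_threshold(w - grad / L, lam / L)
    return w
```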

  48. Sparsity in unsupervised machine learning
     • Multiple responses/signals y = (y^1, . . . , y^k) ∈ R^{n×k}
          min_{X = (x_1, . . . , x_p)} min_{w^1, . . . , w^k ∈ R^p} Σ_{j=1}^k [ L(y^j, Xw^j) + λ Ω(w^j) ]

  49. Sparsity in unsupervised machine learning
     • Multiple responses/signals y = (y^1, . . . , y^k) ∈ R^{n×k}
          min_{X = (x_1, . . . , x_p)} min_{w^1, . . . , w^k ∈ R^p} Σ_{j=1}^k [ L(y^j, Xw^j) + λ Ω(w^j) ]
     • Only responses are observed ⇒ dictionary learning
        – Learn X = (x_1, . . . , x_p) ∈ R^{n×p} such that ∀j, ||x_j||_2 ≤ 1
          min_{X = (x_1, . . . , x_p)} min_{w^1, . . . , w^k ∈ R^p} Σ_{j=1}^k [ L(y^j, Xw^j) + λ Ω(w^j) ]
        – Olshausen and Field (1997); Elad and Aharon (2006); Mairal et al. (2009a)
     • Sparse PCA: replace ||x_j||_2 ≤ 1 by Θ(x_j) ≤ 1

  50. Sparsity in signal processing
     • Multiple responses/signals x = (x^1, . . . , x^k) ∈ R^{n×k}
          min_{D = (d_1, . . . , d_p)} min_{α^1, . . . , α^k ∈ R^p} Σ_{j=1}^k [ L(x^j, Dα^j) + λ Ω(α^j) ]
     • Only responses are observed ⇒ dictionary learning
        – Learn D = (d_1, . . . , d_p) ∈ R^{n×p} such that ∀j, ||d_j||_2 ≤ 1
          min_{D = (d_1, . . . , d_p)} min_{α^1, . . . , α^k ∈ R^p} Σ_{j=1}^k [ L(x^j, Dα^j) + λ Ω(α^j) ]
        – Olshausen and Field (1997); Elad and Aharon (2006); Mairal et al. (2009a)
     • Sparse PCA: replace ||d_j||_2 ≤ 1 by Θ(d_j) ≤ 1

  51. Why structured sparsity?
     • Interpretability
        – Structured dictionary elements (Jenatton et al., 2009b)
        – Dictionary elements "organized" in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

  52. Structured sparse PCA (Jenatton et al., 2009b)
     (figure: raw data vs. sparse PCA dictionary elements)
     • Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability

  53. Structured sparse PCA (Jenatton et al., 2009b)
     (figure: raw data vs. sparse PCA dictionary elements)
     • Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability

  54. Structured sparse PCA (Jenatton et al., 2009b)
     (figure: raw data vs. structured sparse PCA dictionary elements)
     • Enforce selection of convex nonzero patterns ⇒ robustness to occlusion in face identification

  55. Structured sparse PCA (Jenatton et al., 2009b)
     (figure: raw data vs. structured sparse PCA dictionary elements)
     • Enforce selection of convex nonzero patterns ⇒ robustness to occlusion in face identification

  56. Why structured sparsity?
     • Interpretability
        – Structured dictionary elements (Jenatton et al., 2009b)
        – Dictionary elements "organized" in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

  57. Modelling of text corpora (Jenatton et al., 2010)

  58. Why structured sparsity?
     • Interpretability
        – Structured dictionary elements (Jenatton et al., 2009b)
        – Dictionary elements "organized" in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

  59. Why structured sparsity?
     • Interpretability
        – Structured dictionary elements (Jenatton et al., 2009b)
        – Dictionary elements "organized" in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)
     • Stability and identifiability
     • Prediction or estimation performance
        – When prior knowledge matches data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)
     • Numerical efficiency
        – Non-linear variable selection with 2^p subsets (Bach, 2008)

  60. Classical approaches to structured sparsity
     • Many application domains
        – Computer vision (Cevher et al., 2008; Mairal et al., 2009b)
        – Neuro-imaging (Gramfort and Kowalski, 2009; Jenatton et al., 2011)
        – Bio-informatics (Rapaport et al., 2008; Kim and Xing, 2010)
     • Non-convex approaches
        – Haupt and Nowak (2006); Baraniuk et al. (2008); Huang et al. (2009)
     • Convex approaches
        – Design of sparsity-inducing norms

  61. Why do ℓ1-norms lead to sparsity?
     • Example 1: quadratic problem in 1D, i.e., min_{x ∈ R} (1/2)x² − xy + λ|x|
     • Piecewise quadratic function with a kink at zero
        – Derivative at 0+: g_+ = λ − y; derivative at 0−: g_− = −λ − y
        – x = 0 is the solution iff g_+ ≥ 0 and g_− ≤ 0 (i.e., |y| ≤ λ)
        – x ≥ 0 is the solution iff g_+ ≤ 0 (i.e., y ≥ λ) ⇒ x* = y − λ
        – x ≤ 0 is the solution iff g_− ≥ 0 (i.e., y ≤ −λ) ⇒ x* = y + λ
     • Solution x* = sign(y)(|y| − λ)_+ = soft thresholding

  62. Why do ℓ1-norms lead to sparsity?
     • Example 1: quadratic problem in 1D, i.e., min_{x ∈ R} (1/2)x² − xy + λ|x|
     • Piecewise quadratic function with a kink at zero
     • Solution x* = sign(y)(|y| − λ)_+ = soft thresholding
     (figure: soft-thresholding map x*(y), equal to zero on [−λ, λ])

  63. Why do ℓ1-norms lead to sparsity?
     • Example 2: minimize a quadratic function Q(w) subject to ||w||_1 ≤ T
        – coupled soft thresholding
     • Geometric interpretation
        – NB: penalizing is "equivalent" to constraining
     (figure: level sets of Q(w) and the ℓ1-ball in the (w_1, w_2) plane)
     • Non-smooth optimization!

  64. Gaussian hare (ℓ2) vs. Laplacian tortoise (ℓ1)
     • Smooth vs. non-smooth optimization
     • See Bach, Jenatton, Mairal, and Obozinski (2011)

  65. Sparsity-inducing norms
     • Popular choice for Ω
        – The ℓ1-ℓ2 norm: Σ_{G ∈ H} ||w_G||_2 = Σ_{G ∈ H} ( Σ_{j ∈ G} w_j² )^{1/2}
        – with H a partition of {1, . . . , p} (figure: groups G_1, G_2, G_3 partitioning the variables)
        – The ℓ1-ℓ2 norm sets to zero groups of non-overlapping variables (as opposed to single variables for the ℓ1-norm)
        – For the square loss: group Lasso (Yuan and Lin, 2006)
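A sketch of the ℓ1-ℓ2 norm for a partition H and of its proximal operator (blockwise soft-thresholding), which plays for the group Lasso the role that entrywise soft-thresholding plays for the ℓ1-norm; the group indices in the usage line are an illustrative example.

```python
import numpy as np

def group_l1l2_norm(w, groups):
    """Omega(w) = sum over groups G of ||w_G||_2, for a partition `groups`
    given as a list of index lists."""
    return sum(np.linalg.norm(w[G]) for G in groups)

def prox_group_l1l2(z, groups, tau):
    """Proximal operator of tau*Omega: blockwise soft-thresholding, which sets
    a whole group to zero when ||z_G||_2 <= tau."""
    w = np.array(z, dtype=float)
    for G in groups:
        norm_G = np.linalg.norm(w[G])
        w[G] = 0.0 if norm_G <= tau else (1.0 - tau / norm_G) * w[G]
    return w

# Example with H = {{0,1}, {2,3,4}} a partition of {0,...,4}
groups = [[0, 1], [2, 3, 4]]
w = prox_group_l1l2(np.array([0.1, -0.2, 2.0, 0.5, -1.0]), groups, tau=0.5)
```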

  66. Unit norm balls - Geometric interpretation
     (figure: unit balls of ||w||_2, ||w||_1, and √(w_1² + w_2²) + |w_3|)

  67. Sparsity-inducing norms
     • Popular choice for Ω
        – The ℓ1-ℓ2 norm: Σ_{G ∈ H} ||w_G||_2 = Σ_{G ∈ H} ( Σ_{j ∈ G} w_j² )^{1/2}
        – with H a partition of {1, . . . , p}
        – The ℓ1-ℓ2 norm sets to zero groups of non-overlapping variables (as opposed to single variables for the ℓ1-norm)
        – For the square loss: group Lasso (Yuan and Lin, 2006)
     • What if the set of groups H is not a partition anymore?
     • Is there any systematic way?

  68. ℓ1-norm = convex envelope of cardinality of support
     • Let w ∈ R^p. Let V = {1, . . . , p} and Supp(w) = { j ∈ V, w_j ≠ 0 }
     • Cardinality of support: ||w||_0 = Card(Supp(w))
     • Convex envelope = largest convex lower bound (see, e.g., Boyd and Vandenberghe, 2004)
     (figure: ||w||_0 and its envelope ||w||_1 on [−1, 1])
     • ℓ1-norm = convex envelope of the ℓ0-quasi-norm on the ℓ∞-ball [−1, 1]^p

  69. Convex envelopes of general functions of the support (Bach, 2010)
     • Let F : 2^V → R be a set-function
        – Assume F is non-decreasing (i.e., A ⊂ B ⇒ F(A) ≤ F(B))
        – Explicit prior knowledge on supports (Haupt and Nowak, 2006; Baraniuk et al., 2008; Huang et al., 2009)
     • Define Θ(w) = F(Supp(w)): how to get its convex envelope?
        1. Possible if F is also submodular
        2. Allows a unified theory and algorithm
        3. Provides new regularizers

  70. Submodular functions and structured sparsity
     • Let F : 2^V → R be a non-decreasing submodular set-function
     • Proposition: the convex envelope of Θ : w ↦ F(Supp(w)) on the ℓ∞-ball is Ω : w ↦ f(|w|), where f is the Lovász extension of F
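A sketch of how this proposition is used to evaluate the resulting norm: Ω(w) = f(|w|) is computed by running the greedy algorithm on |w|. The concave-of-cardinality function F at the end is an illustrative non-decreasing submodular choice, not taken from the slides.

```python
import numpy as np

def lovasz_extension(F, w):
    """Lovasz extension f(w) of a submodular F with F(empty) = 0,
    evaluated by the greedy algorithm."""
    value, A, prev = 0.0, [], 0.0
    for j in np.argsort(-w):
        A.append(int(j))
        val = F(A)
        value += w[j] * (val - prev)
        prev = val
    return value

def submodular_norm(F, w):
    """Convex envelope of w -> F(Supp(w)) on the l_infinity ball: Omega(w) = f(|w|)."""
    return lovasz_extension(F, np.abs(w))

# Illustrative non-decreasing submodular function: F(A) = sqrt(|A|)
F = lambda A: np.sqrt(len(A))
omega = submodular_norm(F, np.array([0.5, -1.0, 0.0, 2.0]))
```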
