RegML 2016 Class 6: Structured sparsity


  1. RegML 2016, Class 6: Structured sparsity. Lorenzo Rosasco, UNIGE-MIT-IIT. June 30, 2016

  2. Exploiting structure
     The building blocks of a function can be more structured than single variables.

  3. Sparsity
     Variables divided into non-overlapping groups.

  4. Group sparsity
     ◮ $f(x) = \sum_{j=1}^{d} w_j x_j$
     ◮ $w = (w_1, \dots, w_d)$, partitioned into blocks $w(1), \dots, w(G)$
     ◮ each group $\mathcal{G}_g$ has size $|\mathcal{G}_g|$, so $w(g) \in \mathbb{R}^{|\mathcal{G}_g|}$

  5. Group sparsity regularization
     Regularization exploiting structure:
     $$R_{\mathrm{group}}(w) = \sum_{g=1}^{G} \|w(g)\| = \sum_{g=1}^{G} \sqrt{\sum_{j=1}^{|\mathcal{G}_g|} (w(g)_j)^2}$$

  6. Group sparsity regularization
     Regularization exploiting structure:
     $$R_{\mathrm{group}}(w) = \sum_{g=1}^{G} \|w(g)\| = \sum_{g=1}^{G} \sqrt{\sum_{j=1}^{|\mathcal{G}_g|} (w(g)_j)^2}$$
     Compare to
     $$\sum_{g=1}^{G} \|w(g)\|^2 = \sum_{g=1}^{G} \sum_{j=1}^{|\mathcal{G}_g|} (w(g)_j)^2$$

  7. Group sparsity regularization
     Regularization exploiting structure:
     $$R_{\mathrm{group}}(w) = \sum_{g=1}^{G} \|w(g)\| = \sum_{g=1}^{G} \sqrt{\sum_{j=1}^{|\mathcal{G}_g|} (w(g)_j)^2}$$
     Compare to
     $$\sum_{g=1}^{G} \|w(g)\|^2 = \sum_{g=1}^{G} \sum_{j=1}^{|\mathcal{G}_g|} (w(g)_j)^2$$
     or
     $$\sum_{g=1}^{G} \|w(g)\|_1 = \sum_{g=1}^{G} \sum_{j=1}^{|\mathcal{G}_g|} |(w(g))_j|$$

  8. $\ell_1$-$\ell_2$ norm
     We take the $\ell_2$ norm of all the groups, $(\|w(1)\|, \dots, \|w(G)\|)$, and then the $\ell_1$ norm of the above vector:
     $$\sum_{g=1}^{G} \|w(g)\|$$
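
A small worked example (not on the slides): take $w = (3, 4, 0, 0)$ with groups $\mathcal{G}_1 = \{1, 2\}$ and $\mathcal{G}_2 = \{3, 4\}$. Then $R_{\mathrm{group}}(w) = \sqrt{3^2 + 4^2} + \sqrt{0^2 + 0^2} = 5$, the squared version gives $5^2 + 0^2 = 25$, and the plain $\ell_1$ norm gives $3 + 4 = 7$. As with the $\ell_1$ norm on single variables, it is the unsquared group norms that drive entire groups to zero.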

  9. Group lasso
     $$\min_{w} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w(g)\|$$
     ◮ reduces to the Lasso if groups have cardinality one

  10. Computations
     $$\min_{w} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \underbrace{\sum_{g=1}^{G} \|w(g)\|}_{\text{non-differentiable}}$$
     Convex and non-smooth, but with composite structure:
     $$w_{t+1} = \mathrm{Prox}_{\gamma\lambda R_{\mathrm{group}}}\!\left(w_t - \gamma\,\frac{2}{n}\,\hat{X}^\top(\hat{X}w_t - \hat{y})\right)$$

  11. Block thresholding
     It can be shown that
     $$\mathrm{Prox}_{\lambda R_{\mathrm{group}}}(w) = \left(\mathrm{Prox}_{\lambda\|\cdot\|}(w(1)), \dots, \mathrm{Prox}_{\lambda\|\cdot\|}(w(G))\right)$$
     $$\left(\mathrm{Prox}_{\lambda\|\cdot\|}(w(g))\right)_j = \begin{cases} w(g)_j - \lambda\,\dfrac{w(g)_j}{\|w(g)\|} & \|w(g)\| > \lambda \\ 0 & \|w(g)\| \le \lambda \end{cases}$$
     ◮ Entire groups of coefficients are set to zero!
     ◮ Reduces to soft thresholding if groups have cardinality one
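
The last two slides translate almost directly into code. Below is a minimal sketch of the resulting iterative block-thresholding algorithm, assuming non-overlapping groups given as lists of column indices; the names `prox_group` and `group_lasso` and the step-size choice are illustrative, not from the course.

```python
# Minimal sketch of proximal gradient descent for the group lasso.
# Groups are non-overlapping lists of column indices.
import numpy as np

def prox_group(w, groups, thresh):
    # Block soft-thresholding: the prox of  thresh * sum_g ||w(g)||.
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        out[g] = 0.0 if norm <= thresh else (1.0 - thresh / norm) * w[g]
    return out

def group_lasso(X, y, groups, lam, n_iter=500):
    n, d = X.shape
    # Step size 1/L, with L the Lipschitz constant of the gradient
    # of the smooth part w -> (1/n) ||Xw - y||^2.
    gamma = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w = prox_group(w - gamma * grad, groups, gamma * lam)
    return w
```

With all groups of cardinality one, `prox_group` reduces to ordinary soft thresholding and the loop becomes ISTA for the Lasso.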

  12. Other norms
     $\ell_1$-$\ell_p$ norms:
     $$R(w) = \sum_{g=1}^{G} \|w(g)\|_p = \sum_{g=1}^{G} \left(\sum_{j=1}^{|\mathcal{G}_g|} |(w(g))_j|^p\right)^{1/p}$$

  13. Overlapping groups
     Variables divided into possibly overlapping groups.

  14. Regularization with overlapping groups
     Group Lasso:
     $$R_{GL}(w) = \sum_{g=1}^{G} \|w(g)\|$$

  15. Regularization with overlapping groups
     Group Lasso:
     $$R_{GL}(w) = \sum_{g=1}^{G} \|w(g)\|$$
     → Entire overlapping groups are set to zero, so the selected variables form an intersection of group complements

  16. Regularization with overlapping groups
     Let $\bar{w}(g) \in \mathbb{R}^d$ be equal to $w(g)$ on group $\mathcal{G}_g$ and zero otherwise.

  17. Regularization with overlapping groups
     Let $\bar{w}(g) \in \mathbb{R}^d$ be equal to $w(g)$ on group $\mathcal{G}_g$ and zero otherwise.
     Group Lasso with overlap:
     $$R_{GLO}(w) = \inf\left\{\sum_{g=1}^{G} \|w(g)\| \;:\; w(1), \dots, w(G) \text{ s.t. } w = \sum_{g=1}^{G} \bar{w}(g)\right\}$$

  18. Regularization with overlapping groups
     Let $\bar{w}(g) \in \mathbb{R}^d$ be equal to $w(g)$ on group $\mathcal{G}_g$ and zero otherwise.
     Group Lasso with overlap:
     $$R_{GLO}(w) = \inf\left\{\sum_{g=1}^{G} \|w(g)\| \;:\; w(1), \dots, w(G) \text{ s.t. } w = \sum_{g=1}^{G} \bar{w}(g)\right\}$$
     ◮ Multiple ways to write $w = \sum_{g=1}^{G} \bar{w}(g)$

  19. Regularization with overlapping groups
     Let $\bar{w}(g) \in \mathbb{R}^d$ be equal to $w(g)$ on group $\mathcal{G}_g$ and zero otherwise.
     Group Lasso with overlap:
     $$R_{GLO}(w) = \inf\left\{\sum_{g=1}^{G} \|w(g)\| \;:\; w(1), \dots, w(G) \text{ s.t. } w = \sum_{g=1}^{G} \bar{w}(g)\right\}$$
     ◮ Multiple ways to write $w = \sum_{g=1}^{G} \bar{w}(g)$
     ◮ Selected variables are unions of groups!

  20. An equivalence
     It holds that
     $$\min_{w} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda R_{GLO}(w) \;\Longleftrightarrow\; \min_{\tilde{w}} \frac{1}{n}\|\tilde{X}\tilde{w} - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w(g)\|$$
     ◮ $\tilde{X}$ is the matrix obtained by replicating columns/variables
     ◮ $\tilde{w} = (w(1), \dots, w(G))$, a vector with (non-overlapping!) groups
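
A sketch of this reduction in code, reusing the illustrative `group_lasso` solver from the earlier sketch (the function name and interface are assumptions, not course code):

```python
# Sketch of the replication trick: the overlapping group lasso on X becomes
# a non-overlapping group lasso on the column-replicated matrix X_tilde.
import numpy as np

def overlapping_group_lasso(X, y, groups, lam, n_iter=500):
    # Replicate the columns of each (possibly overlapping) group.
    X_tilde = np.hstack([X[:, g] for g in groups])
    # In X_tilde the groups are disjoint blocks of consecutive columns.
    offsets = np.cumsum([0] + [len(g) for g in groups])
    new_groups = [list(range(offsets[k], offsets[k + 1]))
                  for k in range(len(groups))]
    w_tilde = group_lasso(X_tilde, y, new_groups, lam, n_iter)
    # Recover w = sum_g w_bar(g): scatter each block back into R^d and add.
    w = np.zeros(X.shape[1])
    for g, ng in zip(groups, new_groups):
        w[g] += w_tilde[ng]
    return w
```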

  21. An equivalence (cont.)
     Indeed,
     $$\min_{w} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \inf_{\substack{w(1),\dots,w(G)\\ \text{s.t. } \sum_{g=1}^{G}\bar{w}(g)=w}} \sum_{g=1}^{G} \|w(g)\| =$$
     $$\inf_{\substack{w(1),\dots,w(G)\\ \text{s.t. } \sum_{g=1}^{G}\bar{w}(g)=w}} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w(g)\| =$$
     $$\inf_{w(1),\dots,w(G)} \frac{1}{n}\Big\|\hat{X}\Big(\sum_{g=1}^{G}\bar{w}(g)\Big) - \hat{y}\Big\|^2 + \lambda \sum_{g=1}^{G} \|w(g)\| =$$
     $$\inf_{w(1),\dots,w(G)} \frac{1}{n}\Big\|\sum_{g=1}^{G}\hat{X}_{|\mathcal{G}_g} w(g) - \hat{y}\Big\|^2 + \lambda \sum_{g=1}^{G} \|w(g)\| =$$
     $$\min_{\tilde{w}} \frac{1}{n}\|\tilde{X}\tilde{w} - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w(g)\|$$
     where $\hat{X}_{|\mathcal{G}_g}$ is the submatrix of the columns indexed by $\mathcal{G}_g$.

  22. Computations
     ◮ Can use block thresholding with replicated variables ⇒ potentially wasteful
     ◮ The proximal operator of $R_{GLO}$ can be computed efficiently, but not in closed form

  23. More structure
     Structured overlapping groups:
     ◮ trees
     ◮ DAGs
     ◮ ...
     Structure can be exploited in computations...

  24. Beyond linear models
     Consider a dictionary made by the union of distinct dictionaries:
     $$f(x) = \sum_{g=1}^{G} f_g(x) = \sum_{g=1}^{G} \Phi_g(x)^\top w(g),$$
     where each dictionary defines a feature map $\Phi_g(x) = (\varphi^g_1(x), \dots, \varphi^g_{p_g}(x))$.
     Easy extension with the usual change of variable...

  25. Representer theorems
     Let
     $$f(x) = \sum_{g=1}^{G} x^\top \bar{w}(g) = \sum_{g=1}^{G} \bar{x}(g)^\top \bar{w}(g) = \sum_{g=1}^{G} f_g(x).$$

  26. Representer theorems
     Let
     $$f(x) = \sum_{g=1}^{G} x^\top \bar{w}(g) = \sum_{g=1}^{G} \bar{x}(g)^\top \bar{w}(g) = \sum_{g=1}^{G} f_g(x).$$
     Idea: show that
     $$\bar{w}(g) = \sum_{i=1}^{n} \bar{x}(g)_i\, c(g)_i,$$
     i.e.
     $$f_g(x) = \sum_{i=1}^{n} \bar{x}(g)^\top \bar{x}(g)_i\, c(g)_i = \sum_{i=1}^{n} \underbrace{x(g)^\top x(g)_i}_{\Phi_g(x)^\top \Phi_g(x_i) = K_g(x, x_i)} c(g)_i$$

  27. Representer theorems
     Let
     $$f(x) = \sum_{g=1}^{G} x^\top \bar{w}(g) = \sum_{g=1}^{G} \bar{x}(g)^\top \bar{w}(g) = \sum_{g=1}^{G} f_g(x).$$
     Idea: show that
     $$\bar{w}(g) = \sum_{i=1}^{n} \bar{x}(g)_i\, c(g)_i,$$
     i.e.
     $$f_g(x) = \sum_{i=1}^{n} \bar{x}(g)^\top \bar{x}(g)_i\, c(g)_i = \sum_{i=1}^{n} \underbrace{x(g)^\top x(g)_i}_{\Phi_g(x)^\top \Phi_g(x_i) = K_g(x, x_i)} c(g)_i$$
     Note that in this case
     $$\|f_g\|^2 = \|w(g)\|^2 = c(g)^\top \underbrace{\hat{X}(g)\hat{X}(g)^\top}_{\hat{K}(g)} c(g)$$

  28. Coefficients update
     $$c_{t+1} = \mathrm{Prox}_{\gamma\lambda R_{\mathrm{group}}}\!\left(c_t - \gamma\,(\hat{K}c_t - \hat{y})\right)$$
     where $\hat{K} = (\hat{K}(1), \dots, \hat{K}(G))$ and $c_t = (c_t(1), \dots, c_t(G))$.
     Block thresholding: it can be shown that
     $$\left(\mathrm{Prox}_{\lambda\|\cdot\|}(c(g))\right)_j = \begin{cases} c(g)_j - \lambda\,\dfrac{c(g)_j}{\|f_g\|} & \|f_g\| > \lambda \\ 0 & \|f_g\| \le \lambda \end{cases}
     \qquad \|f_g\| = \sqrt{c(g)^\top \hat{K}(g)\, c(g)}$$
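
An illustrative sketch of one such update (my reading of the slide's shorthand: `Ks` holds the kernel matrices $\hat{K}(g)$, `c_blocks` the blocks $c(g)$, and the residual $\hat{K}c_t - \hat{y}$ is shared across blocks):

```python
# One proximal gradient step on the coefficients c = (c(1), ..., c(G)),
# with block thresholding measured in the norm ||f_g|| = sqrt(c(g)^T K(g) c(g)).
import numpy as np

def kernel_group_step(Ks, c_blocks, y, gamma, lam):
    # Residual of the current predictor, f(x_i) - y_i, shared by all blocks.
    residual = sum(K @ c for K, c in zip(Ks, c_blocks)) - y
    new_blocks = []
    for K, c in zip(Ks, c_blocks):
        z = c - gamma * residual          # gradient step for block g
        norm_fg = np.sqrt(z @ K @ z)      # ||f_g|| after the step
        thresh = gamma * lam
        new_blocks.append(np.zeros_like(z) if norm_fg <= thresh
                          else (1.0 - thresh / norm_fg) * z)
    return new_blocks
```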

  29. Non-parametric sparsity
     $$f(x) = \sum_{g=1}^{G} f_g(x)$$
     $$f_g(x) = \sum_{i=1}^{n} x(g)^\top x(g)_i\,(c(g))_i \;\mapsto\; f_g(x) = \sum_{i=1}^{n} K_g(x, x_i)\,(c(g))_i$$
     $(K_1, \dots, K_G)$ a family of kernels:
     $$\sum_{g=1}^{G} \|w(g)\| \;\Longrightarrow\; \sum_{g=1}^{G} \|f_g\|_{K_g}$$

  30. $\ell_1$ MKL
     $$\inf_{\substack{w(1),\dots,w(G)\\ \text{s.t. } \sum_{g=1}^{G}\bar{w}(g)=w}} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w(g)\|$$
     ⇓
     $$\min_{\substack{f_1,\dots,f_G\\ \text{s.t. } \sum_{g=1}^{G} f_g = f}} \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 + \lambda \sum_{g=1}^{G} \|f_g\|_{K_g}$$

  31. $\ell_2$ MKL
     $$\sum_{g=1}^{G} \|w(g)\|^2 \;\Longrightarrow\; \sum_{g=1}^{G} \|f_g\|^2_{K_g}$$
     Corresponds to using the kernel
     $$K(x, x') = \sum_{g=1}^{G} K_g(x, x')$$
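
Since $\ell_2$ MKL is just kernel ridge regression with the summed kernel, it takes a few lines. A minimal sketch, under the usual $\lambda n$ normalization of kernel ridge regression (names illustrative):

```python
# Minimal sketch of ell_2 MKL: kernel ridge regression with K = sum_g K_g.
import numpy as np

def l2_mkl_fit(Ks, y, lam):
    # Ks: list of (n x n) kernel matrices on the training points.
    n = len(y)
    K_sum = sum(Ks)
    # Kernel ridge regression coefficients: c = (K + lam * n * I)^{-1} y.
    return np.linalg.solve(K_sum + lam * n * np.eye(n), y)
```

Prediction at a new point $x$ is then $f(x) = \sum_{i=1}^{n} \big(\sum_{g=1}^{G} K_g(x, x_i)\big)\, c_i$.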

  32. $\ell_1$ or $\ell_2$ MKL?
     ◮ $\ell_2$ is *much* faster
     ◮ $\ell_1$ can be useful if only a few kernels are relevant

  33. Why MKL?
     ◮ Data fusion: different features
     ◮ Model selection, e.g. Gaussian kernels with different widths
     ◮ Richer models: many kernels!

  34. MKL & kernel learning
     It can be shown that
     $$\min_{\substack{f_1,\dots,f_G\\ \text{s.t. } \sum_{g=1}^{G} f_g = f}} \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 + \lambda \sum_{g=1}^{G} \|f_g\|_{K_g}$$
     is equivalent to
     $$\min_{K \in \mathcal{K}}\ \min_{f \in \mathcal{H}_K} \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 + \lambda \|f\|^2_K$$
     where $\mathcal{K} = \{K \mid K = \sum_{g} \alpha_g K_g,\ \alpha_g \ge 0\}$.

  35. Sparsity beyond vectors
     Recall multi-variable regression: $(x_i, y_i)_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}^T$, and $f(x) = x^\top W$ with $W$ a $d \times T$ matrix:
     $$\min_{W} \|\hat{X}W - \hat{Y}\|_F^2 + \lambda\,\mathrm{Tr}(WAW^\top)$$
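
This penalty is smooth, so no proximal step is needed: for symmetric $A$, setting the gradient to zero gives the Sylvester equation $(\hat{X}^\top\hat{X})W + \lambda W A = \hat{X}^\top\hat{Y}$, which standard solvers handle directly. A sketch (function name illustrative):

```python
# Sketch: closed-form solution of  min_W ||XW - Y||_F^2 + lam * Tr(W A W^T)
# via the Sylvester equation  (X^T X) W + W (lam * A) = X^T Y  (A symmetric).
from scipy.linalg import solve_sylvester

def multitask_ridge(X, Y, A, lam):
    return solve_sylvester(X.T @ X, lam * A, X.T @ Y)
```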

  36. Sparse regularization
     ◮ We have seen $\mathrm{Tr}(WW^\top) = \sum_{j=1}^{d}\sum_{t=1}^{T} (W_{t,j})^2$
     ◮ We could now consider $\sum_{j=1}^{d}\sum_{t=1}^{T} |W_{t,j}|$
     ◮ ...

  37. Spectral norms / p-Schatten norms
     ◮ We have seen $\mathrm{Tr}(WW^\top) = \sum_{i=1}^{\min\{d,T\}} \sigma_i^2$
     ◮ We could now consider the nuclear norm $R(W) = \|W\|_* = \sum_{i=1}^{\min\{d,T\}} \sigma_i$
     ◮ or the $p$-Schatten norm $R(W) = \left(\sum_{i=1}^{\min\{d,T\}} (\sigma_i)^p\right)^{1/p}$
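
If the nuclear norm replaces the smooth trace penalty, the proximal map is again a thresholding operation, now on the singular values (the matrix analogue of the block thresholding above). A minimal sketch:

```python
# Prox of  thresh * ||W||_* : soft-threshold the singular values of W.
import numpy as np

def prox_nuclear(W, thresh):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - thresh, 0.0)) @ Vt
```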
