RegML 2016 Class 6 Structured sparsity Lorenzo Rosasco UNIGE-MIT-IIT June 30, 2016
Exploiting structure

Building blocks of a function can be more structured than single variables.
Sparsity

Variables divided into non-overlapping groups.
Group sparsity

- $f(x) = \sum_{j=1}^d w_j x_j$
- $w = (\underbrace{w_1, \dots}_{w(1)}, \dots, \underbrace{\dots, w_d}_{w(G)})$
- each group $\mathcal{G}_g$ has size $|\mathcal{G}_g|$, so $w(g) \in \mathbb{R}^{|\mathcal{G}_g|}$
Group sparsity regularization

Regularization exploiting structure:
$$
R_{\mathrm{group}}(w) = \sum_{g=1}^G \|w(g)\| = \sum_{g=1}^G \sqrt{\sum_{j=1}^{|\mathcal{G}_g|} (w(g))_j^2}
$$
Compare to
$$
\sum_{g=1}^G \|w(g)\|^2 = \sum_{g=1}^G \sum_{j=1}^{|\mathcal{G}_g|} (w(g))_j^2
$$
or
$$
\sum_{g=1}^G \|w(g)\|_1 = \sum_{g=1}^G \sum_{j=1}^{|\mathcal{G}_g|} |(w(g))_j|
$$
$\ell_1$-$\ell_2$ norm

We take the $\ell_2$ norm of each group, forming the vector $(\|w(1)\|, \dots, \|w(G)\|)$, and then the $\ell_1$ norm of this vector:
$$
\sum_{g=1}^G \|w(g)\|
$$
Group lasso

$$
\min_w \frac{1}{n}\|\hat{X} w - \hat{y}\|^2 + \lambda \sum_{g=1}^G \|w(g)\|
$$

- reduces to the lasso if groups have cardinality one
Computations

$$
\min_w \frac{1}{n}\|\hat{X} w - \hat{y}\|^2 + \lambda \underbrace{\sum_{g=1}^G \|w(g)\|}_{\text{non-differentiable}}
$$
Convex and non-smooth, but with composite structure:
$$
w_{t+1} = \mathrm{Prox}_{\gamma \lambda R_{\mathrm{group}}}\Big( w_t - \gamma \frac{2}{n} \hat{X}^\top (\hat{X} w_t - \hat{y}) \Big)
$$
Block thresholding

It can be shown that
$$
\mathrm{Prox}_{\lambda R_{\mathrm{group}}}(w) = \big(\mathrm{Prox}_{\lambda\|\cdot\|}(w(1)), \dots, \mathrm{Prox}_{\lambda\|\cdot\|}(w(G))\big)
$$
$$
\big(\mathrm{Prox}_{\lambda\|\cdot\|}(w(g))\big)_j =
\begin{cases}
(w(g))_j - \lambda \dfrac{(w(g))_j}{\|w(g)\|} & \|w(g)\| > \lambda \\
0 & \|w(g)\| \le \lambda
\end{cases}
$$

- Entire groups of coefficients are set to zero!
- Reduces to soft thresholding if groups have cardinality one
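As a concrete sketch of the two updates above, the block soft-thresholding operator and the resulting proximal gradient iteration for the group lasso fit in a few lines of NumPy (the function names and the Lipschitz step-size choice are illustrative conventions, not from the slides):

```python
import numpy as np

def block_soft_threshold(w, groups, thresh):
    """Prox of the group-lasso penalty: each group w(g) is shrunk toward
    zero by `thresh` in Euclidean norm; groups with norm <= thresh vanish."""
    out = np.zeros_like(w)
    for g in groups:  # each g is an array of indices forming one group
        norm = np.linalg.norm(w[g])
        if norm > thresh:
            out[g] = (1.0 - thresh / norm) * w[g]
    return out

def group_lasso(X, y, groups, lam, n_iter=500):
    """Proximal gradient descent for 1/n ||Xw - y||^2 + lam * sum_g ||w(g)||."""
    n, d = X.shape
    # step size = 1 / Lipschitz constant of the gradient (2/n) X^T X
    gamma = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w = block_soft_threshold(w - gamma * grad, groups, gamma * lam)
    return w
```

With groups of cardinality one this reduces to the usual iterative soft-thresholding algorithm for the lasso.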
Other norms

$\ell_1$-$\ell_p$ norms:
$$
R(w) = \sum_{g=1}^G \|w(g)\|_p = \sum_{g=1}^G \Big( \sum_{j=1}^{|\mathcal{G}_g|} |(w(g))_j|^p \Big)^{1/p}
$$
Overlapping groups

Variables divided into possibly overlapping groups.
Regularization with overlapping groups

Group lasso:
$$
R_{\mathrm{GL}}(w) = \sum_{g=1}^G \|w(g)\|
$$

→ The selected variables form an intersection of group complements (the zeroed variables form a union of groups).
Regularization with overlapping groups

Let $\bar{w}(g) \in \mathbb{R}^d$ be equal to $w(g)$ on group $\mathcal{G}_g$ and zero otherwise.

Group lasso with overlap:
$$
R_{\mathrm{GLO}}(w) = \inf\Big\{ \sum_{g=1}^G \|w(g)\| \;\Big|\; w(1), \dots, w(G) \ \text{s.t.}\ w = \sum_{g=1}^G \bar{w}(g) \Big\}
$$

- Multiple ways to write $w = \sum_{g=1}^G \bar{w}(g)$
- Selected variables are unions of groups!
An equivalence

It holds
$$
\min_w \frac{1}{n}\|\hat{X} w - \hat{y}\|^2 + \lambda R_{\mathrm{GLO}}(w)
\;\Longleftrightarrow\;
\min_{\tilde{w}} \frac{1}{n}\|\tilde{X} \tilde{w} - \hat{y}\|^2 + \lambda \sum_{g=1}^G \|w(g)\|
$$

- $\tilde{X}$ is the matrix obtained by replicating columns/variables
- $\tilde{w} = (w(1), \dots, w(G))$, a vector with (non-overlapping!) groups
An equivalence (cont.)

Indeed,
$$
\begin{aligned}
&\min_w \frac{1}{n}\|\hat{X} w - \hat{y}\|^2 + \lambda \inf_{\substack{w(1),\dots,w(G)\\ \text{s.t. } \sum_{g=1}^G \bar{w}(g) = w}} \sum_{g=1}^G \|w(g)\| \\
&= \inf_{w(1),\dots,w(G)} \frac{1}{n}\Big\|\hat{X}\Big(\sum_{g=1}^G \bar{w}(g)\Big) - \hat{y}\Big\|^2 + \lambda \sum_{g=1}^G \|w(g)\| \\
&= \inf_{w(1),\dots,w(G)} \frac{1}{n}\Big\|\sum_{g=1}^G \hat{X}_{|\mathcal{G}_g} w(g) - \hat{y}\Big\|^2 + \lambda \sum_{g=1}^G \|w(g)\| \\
&= \min_{\tilde{w}} \frac{1}{n}\|\tilde{X} \tilde{w} - \hat{y}\|^2 + \lambda \sum_{g=1}^G \|w(g)\|
\end{aligned}
$$
where $\hat{X}_{|\mathcal{G}_g}$ denotes the columns of $\hat{X}$ indexed by $\mathcal{G}_g$.
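The replication step in this derivation is easy to carry out explicitly. The sketch below (a hypothetical helper, not from the slides) builds $\tilde{X}$ and the corresponding non-overlapping groups from a list of possibly overlapping index sets:

```python
import numpy as np

def replicate_columns(X, groups):
    """Build X_tilde by stacking the column blocks X_{|G_g} of each
    (possibly overlapping) group side by side; shared variables are
    duplicated, and the new groups index disjoint contiguous blocks."""
    X_tilde = np.hstack([X[:, g] for g in groups])
    offsets = np.cumsum([0] + [len(g) for g in groups])
    new_groups = [np.arange(offsets[i], offsets[i + 1])
                  for i in range(len(groups))]
    return X_tilde, new_groups
```

An ordinary group lasso solver run on the replicated pair then solves the overlap problem, at the cost of carrying the duplicated columns.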
Computations

- Can use block thresholding with replicated variables ⟹ potentially wasteful
- The proximal operator for $R_{\mathrm{GLO}}$ can be computed efficiently, but not in closed form
More structure

Structured overlapping groups:
- trees
- DAGs
- ...

Structure can be exploited in computations...
Beyond linear models

Consider a dictionary made by the union of distinct dictionaries:
$$
f(x) = \sum_{g=1}^G \underbrace{f_g(x)}_{=\,\Phi_g(x)^\top w(g)},
$$
where each dictionary defines a feature map $\Phi_g(x) = (\varphi^g_1(x), \dots, \varphi^g_{p_g}(x))$.

Easy extension with the usual change of variable...
Representer theorems

Let $\bar{x}(g) \in \mathbb{R}^d$ be equal to $x$ on group $\mathcal{G}_g$ and zero otherwise, and let
$$
f(x) = \sum_{g=1}^G x^\top \bar{w}(g) = \sum_{g=1}^G \bar{x}(g)^\top \bar{w}(g) = \sum_{g=1}^G f_g(x).
$$

Idea: show that
$$
\bar{w}(g) = \sum_{i=1}^n \bar{x}(g)_i \, c(g)_i,
$$
i.e.
$$
f_g(x) = \sum_{i=1}^n \bar{x}(g)^\top \bar{x}(g)_i \, c(g)_i = \sum_{i=1}^n \underbrace{\Phi_g(x)^\top \Phi_g(x_i)}_{=\,K_g(x, x_i)} c(g)_i
$$

Note that in this case
$$
\|f_g\|^2 = \|w(g)\|^2 = c(g)^\top \underbrace{\hat{X}(g) \hat{X}(g)^\top}_{\hat{K}(g)} c(g)
$$
Coefficients update

$$
c_{t+1} = \mathrm{Prox}_{\gamma \lambda R_{\mathrm{group}}}\big( c_t - \gamma (\hat{K} c_t - \hat{y}) \big)
$$
where $\hat{K} = (\hat{K}(1), \dots, \hat{K}(G))$ and $c_t = (c_t(1), \dots, c_t(G))$.

Block thresholding: it can be shown that
$$
\big(\mathrm{Prox}_{\lambda\|\cdot\|}(c(g))\big)_j =
\begin{cases}
c(g)_j - \lambda \dfrac{c(g)_j}{\|f_g\|} & \|f_g\| > \lambda \\
0 & \|f_g\| \le \lambda
\end{cases}
\qquad
\|f_g\| = \sqrt{c(g)^\top \hat{K}(g)\, c(g)}
$$
Non-parametric sparsity

$$
f(x) = \sum_{g=1}^G f_g(x)
$$
$$
f_g(x) = \sum_{i=1}^n x(g)^\top x(g)_i \,(c(g))_i \;\mapsto\; f_g(x) = \sum_{i=1}^n K_g(x, x_i)\,(c(g))_i
$$
With $(K_1, \dots, K_G)$ a family of kernels,
$$
\sum_{g=1}^G \|w(g)\| \;\Longrightarrow\; \sum_{g=1}^G \|f_g\|_{K_g}
$$
$\ell_1$ MKL

$$
\min_w \frac{1}{n}\|\hat{X} w - \hat{y}\|^2 + \lambda \inf_{\substack{w(1),\dots,w(G)\\ \text{s.t. } \sum_{g=1}^G \bar{w}(g) = w}} \sum_{g=1}^G \|w(g)\|
$$
⇓
$$
\min_{\substack{f_1, \dots, f_G\\ \text{s.t. } \sum_{g=1}^G f_g = f}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \sum_{g=1}^G \|f_g\|_{K_g}
$$
$\ell_2$ MKL

$$
\sum_{g=1}^G \|w(g)\|^2 \;\Longrightarrow\; \sum_{g=1}^G \|f_g\|_{K_g}^2
$$
Corresponds to using the kernel
$$
K(x, x') = \sum_{g=1}^G K_g(x, x')
$$
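Since $\ell_2$ MKL amounts to kernel ridge regression with the sum kernel, fitting takes a single linear solve. A minimal sketch (the function name and the $\lambda n$ scaling of the regularizer are my own conventions):

```python
import numpy as np

def l2_mkl_fit(grams, y, lam):
    """Kernel ridge regression with the sum kernel K = sum_g K_g.
    `grams` is a list of n x n Gram matrices, one per kernel K_g."""
    K = sum(grams)
    n = K.shape[0]
    # standard KRR linear system: (K + lam * n * I) c = y
    return np.linalg.solve(K + lam * n * np.eye(n), y)
```

Predictions at new points use the summed cross-kernel matrix applied to the returned coefficients.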
$\ell_1$ or $\ell_2$ MKL?

- $\ell_2$ is *much* faster
- $\ell_1$ could be useful if only a few kernels are relevant
Why MKL?

- Data fusion: different features
- Model selection, e.g. Gaussian kernels with different widths
- Richer models: many kernels!
MKL & kernel learning

It can be shown that
$$
\min_{\substack{f_1, \dots, f_G\\ \text{s.t. } \sum_{g=1}^G f_g = f}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \sum_{g=1}^G \|f_g\|_{K_g}
\;=\;
\min_{K \in \mathcal{K}} \min_{f \in \mathcal{H}_K} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_K^2
$$
where $\mathcal{K} = \{ K \mid K = \sum_g \alpha_g K_g, \ \alpha_g \ge 0 \}$.
Sparsity beyond vectors

Recall multi-variable regression: $(x_i, y_i)_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}^T$,
$$
f(x) = x^\top \underbrace{W}_{d \times T}
$$
$$
\min_W \|\hat{X} W - \hat{Y}\|_F^2 + \lambda\, \mathrm{Tr}(W A W^\top)
$$
Sparse regularization

- We have seen
$$
\mathrm{Tr}(W W^\top) = \sum_{j=1}^d \sum_{t=1}^T (W_{t,j})^2
$$
- We could now consider
$$
\sum_{j=1}^d \sum_{t=1}^T |W_{t,j}|
$$
- ...
Spectral norms / $p$-Schatten norms

- We have seen
$$
\mathrm{Tr}(W W^\top) = \sum_{i=1}^{\min\{d,T\}} \sigma_i^2
$$
- We could now consider
$$
R(W) = \|W\|_* = \sum_{i=1}^{\min\{d,T\}} \sigma_i, \quad \text{the nuclear norm}
$$
- or
$$
R(W) = \Big( \sum_{i=1}^{\min\{d,T\}} \sigma_i^p \Big)^{1/p}, \quad \text{the $p$-Schatten norm}
$$
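The proximal operator of the nuclear norm has the same soft-thresholding flavor as block thresholding, applied to singular values rather than groups. A minimal NumPy sketch (singular value thresholding; the helper name is illustrative):

```python
import numpy as np

def prox_nuclear(W, thresh):
    """Prox of thresh * ||W||_*: soft-threshold the singular values,
    which can zero some of them out and thus lower the rank of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - thresh, 0.0)) @ Vt
```

Plugged into a proximal gradient loop, this yields low-rank solutions, just as block thresholding yields group-sparse ones.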