RegML 2016, Class 4: Regularization for Multi-Task Learning
Lorenzo Rosasco (UNIGE-MIT-IIT)
June 28, 2016
Supervised learning so far

◮ Regression: $f : X \to Y \subseteq \mathbb{R}$
◮ Classification: $f : X \to Y = \{-1, 1\}$

What next?

◮ Vector-valued: $f : X \to Y \subseteq \mathbb{R}^T$
◮ Multiclass: $f : X \to Y = \{1, 2, \dots, T\}$
◮ ...
Multitask learning

Given $S_1 = (x_i^1, y_i^1)_{i=1}^{n_1}, \dots, S_T = (x_i^T, y_i^T)_{i=1}^{n_T}$, find $f_1 : X_1 \to Y_1, \dots, f_T : X_T \to Y_T$.

Two important special cases:

◮ Vector valued regression (VVR): $S_n = (x_i, y_i)_{i=1}^n$, $x_i \in X$, $y_i \in \mathbb{R}^T$. This is MTL with equal inputs! The output coordinates are the "tasks".
◮ Multiclass: $S_n = (x_i, y_i)_{i=1}^n$, $x_i \in X$, $y_i \in \{1, \dots, T\}$.
Why MTL?

[Figure: two related regression problems, Task 1 and Task 2, plotted over the same input space X.]
Why MTL?

[Figure: four scatter plots of real data, one per task, on a common scale.] Real data!
Why MTL?

Related problems:
◮ conjoint analysis
◮ transfer learning
◮ collaborative filtering
◮ co-kriging

Examples of applications:
◮ geophysics
◮ music recommendation (Dinuzzo 08)
◮ pharmacological data (Pillonetto et al. 08)
◮ binding data (Jacob et al. 08)
◮ movie recommendation (Abernethy et al. 08)
◮ HIV therapy screening (Bickel et al. 08)
Why MTL?

VVR, e.g. vector field estimation.
Why MTL?

[Figure: the two components of a vector-valued function, Component 1 and Component 2, plotted over the same input space X.]
Penalized regularization for MTL

$$\mathrm{err}(w_1, \dots, w_T) + \mathrm{pen}(w_1, \dots, w_T)$$

We start with linear models: $f_1(x) = w_1^\top x, \dots, f_T(x) = w_T^\top x$.
Empirical error

$$\mathcal{E}(w_1, \dots, w_T) = \sum_{i=1}^{T} \frac{1}{n_i} \sum_{j=1}^{n_i} \big(y_j^i - w_i^\top x_j^i\big)^2$$

◮ could consider other losses
◮ could try to "couple" the errors
Least squares error

We focus on vector valued regression (VVR): $S_n = (x_i, y_i)_{i=1}^n$, $x_i \in X$, $y_i \in \mathbb{R}^T$. Then

$$\frac{1}{n} \sum_{t=1}^{T} \sum_{i=1}^{n} \big(y_i^t - w_t^\top x_i\big)^2 = \frac{1}{n} \big\| \hat{X} W - \hat{Y} \big\|_F^2$$

where $\hat{X}$ is $n \times d$, $W = (w_1, \dots, w_T)$ is $d \times T$, $\hat{Y}$ is $n \times T$ with $\hat{Y}_{it} = y_i^t$ for $i = 1, \dots, n$ and $t = 1, \dots, T$, and $\|W\|_F^2 = \mathrm{Tr}(W^\top W)$.
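As a concrete check of the notation, a minimal NumPy sketch (variable names are illustrative, not from the slides) confirming that the double sum and the Frobenius-norm form agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 50, 10, 4                      # samples, input dimension, tasks
X = rng.standard_normal((n, d))          # \hat{X}, n x d
W = rng.standard_normal((d, T))          # W = (w_1, ..., w_T), d x T
Y = rng.standard_normal((n, T))          # \hat{Y}, n x T, Y[i, t] = y_i^t

# Double-sum form: (1/n) sum_t sum_i (y_i^t - w_t' x_i)^2
err_sum = sum((Y[i, t] - W[:, t] @ X[i]) ** 2
              for t in range(T) for i in range(n)) / n

# Matrix form: (1/n) ||X W - Y||_F^2
err_fro = np.linalg.norm(X @ W - Y, 'fro') ** 2 / n

assert np.isclose(err_sum, err_fro)
```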
MTL by regularization

$$\mathrm{pen}(w_1, \dots, w_T)$$

◮ Coupling task solutions by regularization
◮ Borrowing strength
◮ Exploiting structure
Regularizations for MTL

$$\mathrm{pen}(w_1, \dots, w_T) = \sum_{t=1}^{T} \|w_t\|^2$$

This is single-task regularization! The problem decouples:

$$\min_{w_1, \dots, w_T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \big(y_i^t - w_t^\top x_i\big)^2 + \lambda \sum_{t=1}^{T} \|w_t\|^2 = \sum_{t=1}^{T} \Big( \min_{w_t} \frac{1}{n} \sum_{i=1}^{n} \big(y_i^t - w_t^\top x_i\big)^2 + \lambda \|w_t\|^2 \Big)$$

Each task is solved independently; nothing is shared across tasks.
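Since the penalty decouples, each $w_t$ is an ordinary ridge regression solution. A minimal sketch (illustrative names; shared inputs as in VVR):

```python
import numpy as np

def single_task_ridge(X, Y, lam):
    """Solve min_W (1/n)||XW - Y||_F^2 + lam * sum_t ||w_t||^2.

    With this penalty (A = I) the columns of W decouple: each w_t is
    the ridge solution for the t-th output column of Y, so all tasks
    can be solved at once via the shared normal equations.
    """
    n, d = X.shape
    # (X'X + n*lam*I) W = X'Y  <=>  per-task ridge for every column of Y
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)
```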
Regularizations for MTL

◮ Isotropic coupling:
$$(1 - \alpha) \sum_{j=1}^{T} \|w_j\|^2 + \alpha \sum_{j=1}^{T} \Big\| w_j - \frac{1}{T} \sum_{i=1}^{T} w_i \Big\|^2$$

◮ Graph coupling: let $M \in \mathbb{R}^{T \times T}$ be an adjacency matrix with $M_{ts} \ge 0$,
$$\sum_{t=1}^{T} \sum_{s=1}^{T} M_{ts} \|w_t - w_s\|^2 + \gamma \sum_{t=1}^{T} \|w_t\|^2$$
Special case: the outputs are divided into clusters.
A general form of regularization

All the regularizers so far are of the form
$$\sum_{t=1}^{T} \sum_{s=1}^{T} A_{ts}\, w_t^\top w_s$$
for a suitable positive definite matrix $A$.
MTL regularization revisited

◮ Single tasks: $\sum_{j=1}^{T} \|w_j\|^2 \;\Rightarrow\; A = I$

◮ Isotropic coupling:
$$(1 - \alpha) \sum_{j=1}^{T} \|w_j\|^2 + \alpha \sum_{j=1}^{T} \Big\| w_j - \frac{1}{T} \sum_{i=1}^{T} w_i \Big\|^2 \;\Rightarrow\; A = I - \frac{\alpha}{T} \mathbf{1}$$
where $\mathbf{1}$ is the $T \times T$ all-ones matrix.

◮ Graph coupling:
$$\sum_{t=1}^{T} \sum_{s=1}^{T} M_{ts} \|w_t - w_s\|^2 + \gamma \sum_{t=1}^{T} \|w_t\|^2 \;\Rightarrow\; A = L + \gamma I$$
where $L = D - M$ is the graph Laplacian of $M$, with $D = \mathrm{diag}\big(\sum_j M_{1j}, \dots, \sum_j M_{Tj}\big)$.
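A sketch constructing the matrix $A$ for each choice above (illustrative helper names; `M` is any symmetric adjacency matrix with nonnegative entries):

```python
import numpy as np

def A_single(T):
    """Single-task regularization: A = I."""
    return np.eye(T)

def A_isotropic(T, alpha):
    """Isotropic coupling: A = I - (alpha/T) * ones(T, T)."""
    return np.eye(T) - (alpha / T) * np.ones((T, T))

def A_graph(M, gamma):
    """Graph coupling: A = L + gamma*I, L = D - M the graph Laplacian."""
    D = np.diag(M.sum(axis=1))          # diagonal of row sums of M
    return D - M + gamma * np.eye(M.shape[0])
```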
A general form of regularization

Let $W = (w_1, \dots, w_T)$ and $A \in \mathbb{R}^{T \times T}$. Note that
$$\sum_{t=1}^{T} \sum_{s=1}^{T} A_{ts}\, w_t^\top w_s = \mathrm{Tr}(W A W^\top).$$

Indeed,
$$\mathrm{Tr}(W A W^\top) = \sum_{i=1}^{d} \big( W A W^\top \big)_{ii} = \sum_{i=1}^{d} \sum_{t,s=1}^{T} W_{it} A_{ts} W_{is} = \sum_{t,s=1}^{T} A_{ts} \sum_{i=1}^{d} W_{it} W_{is} = \sum_{t,s=1}^{T} A_{ts}\, w_t^\top w_s.$$
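The identity is also easy to verify numerically (a throwaway sanity check, not part of any algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 6, 3
W = rng.standard_normal((d, T))          # W = (w_1, ..., w_T)
A = rng.standard_normal((T, T))

# Double sum: sum_{t,s} A_ts * w_t' w_s
lhs = sum(A[t, s] * (W[:, t] @ W[:, s]) for t in range(T) for s in range(T))
# Trace form: Tr(W A W')
rhs = np.trace(W @ A @ W.T)
assert np.isclose(lhs, rhs)
```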
Computations

$$\frac{1}{n} \big\| \hat{X} W - \hat{Y} \big\|_F^2 + \lambda\, \mathrm{Tr}(W A W^\top)$$

Consider the SVD (the eigendecomposition, since $A$ is symmetric) $A = U \Sigma U^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_T)$, and let $\tilde{W} = W U$, $\tilde{Y} = \hat{Y} U$. Then we can rewrite the above problem as
$$\frac{1}{n} \big\| \hat{X} \tilde{W} - \tilde{Y} \big\|_F^2 + \lambda\, \mathrm{Tr}(\tilde{W} \Sigma \tilde{W}^\top).$$
Computations (cont.)

Finally, rewrite
$$\frac{1}{n} \big\| \hat{X} \tilde{W} - \tilde{Y} \big\|_F^2 + \lambda\, \mathrm{Tr}(\tilde{W} \Sigma \tilde{W}^\top)$$
as
$$\sum_{t=1}^{T} \Big( \frac{1}{n} \sum_{i=1}^{n} \big(\tilde{y}_i^t - \tilde{w}_t^\top x_i\big)^2 + \lambda \sigma_t \|\tilde{w}_t\|^2 \Big),$$
a collection of $T$ independent problems, then recover $W = \tilde{W} U^\top$. Compare to single-task regularization: in the rotated coordinates the coupling reduces to per-task penalties $\lambda \sigma_t$.
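A sketch of this rotation trick (illustrative names; assumes $A$ is symmetric positive definite): diagonalize $A$, solve $T$ independent ridge problems with penalties $\lambda \sigma_t$, then rotate back.

```python
import numpy as np

def mtl_ridge_svd(X, Y, A, lam):
    """Solve min_W (1/n)||XW - Y||_F^2 + lam*Tr(W A W') via A = U S U'."""
    n, d = X.shape
    sigma, U = np.linalg.eigh(A)        # A = U diag(sigma) U'
    Yt = Y @ U                          # rotated targets \tilde{Y} = Y U
    G = X.T @ X
    Wt = np.column_stack([
        # t-th rotated task: ordinary ridge with penalty lam * sigma_t
        np.linalg.solve(G + n * lam * sigma[t] * np.eye(d), X.T @ Yt[:, t])
        for t in range(len(sigma))
    ])
    return Wt @ U.T                     # rotate back: W = \tilde{W} U'
```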
Computations (cont.)

$$\mathcal{E}_\lambda(W) = \frac{1}{n} \big\| \hat{X} W - \hat{Y} \big\|_F^2 + \lambda\, \mathrm{Tr}(W A W^\top)$$

Alternatively, use gradient descent (iteration index $\ell$, to avoid clashing with the task index):
$$\nabla \mathcal{E}_\lambda(W) = \frac{2}{n} \hat{X}^\top \big( \hat{X} W - \hat{Y} \big) + 2 \lambda W A, \qquad W_{\ell+1} = W_\ell - \gamma\, \nabla \mathcal{E}_\lambda(W_\ell).$$

This trivially extends to other loss functions.
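A sketch of the gradient iteration (illustrative names; the default step size uses a standard bound on the gradient's Lipschitz constant, an assumption not stated on the slide):

```python
import numpy as np

def mtl_gradient_descent(X, Y, A, lam, gamma=None, iters=1000):
    """Gradient descent on (1/n)||XW - Y||_F^2 + lam * Tr(W A W')."""
    n, d = X.shape
    T = Y.shape[1]
    if gamma is None:
        # Conservative step: 1 / (Lipschitz bound of the gradient)
        gamma = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 / n
                       + 2 * lam * np.linalg.norm(A, 2))
    W = np.zeros((d, T))
    for _ in range(iters):
        grad = (2 / n) * X.T @ (X @ W - Y) + 2 * lam * W @ A
        W = W - gamma * grad
    return W
```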
Beyond Linearity

$$f_t(x) = w_t^\top \Phi(x), \qquad \Phi(x) = \big( \varphi_1(x), \dots, \varphi_p(x) \big)$$

$$\mathcal{E}_\lambda(W) = \frac{1}{n} \big\| \hat{\Phi} W - \hat{Y} \big\|^2 + \lambda\, \mathrm{Tr}(W A W^\top),$$

with $\hat{\Phi}$ the matrix with rows $\Phi(x_1), \dots, \Phi(x_n)$.
Nonparametrics and kernels

$$f_t(x) = \sum_{i=1}^{n} K(x, x_i)\, C_{it}$$

with the iteration
$$C_{\ell+1} = C_\ell - \gamma \Big( \frac{2}{n} \big( \hat{K} C_\ell - \hat{Y} \big) + 2 \lambda C_\ell A \Big)$$

◮ $C_\ell \in \mathbb{R}^{n \times T}$
◮ $\hat{K} \in \mathbb{R}^{n \times n}$, $\hat{K}_{ij} = K(x_i, x_j)$
◮ $\hat{Y} \in \mathbb{R}^{n \times T}$, $\hat{Y}_{ij} = y_i^j$
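A sketch of the kernel iteration (illustrative names; `K` is the precomputed $n \times n$ kernel matrix for any positive definite kernel):

```python
import numpy as np

def kernel_mtl(K, Y, A, lam, gamma, iters=1000):
    """Iterate C <- C - gamma * ((2/n)(K C - Y) + 2*lam*C A).

    K: n x n kernel matrix, Y: n x T targets, A: T x T task matrix.
    Returns C (n x T); predictions are f_t(x) = sum_i K(x, x_i) C[i, t].
    """
    n, T = Y.shape
    C = np.zeros((n, T))
    for _ in range(iters):
        C = C - gamma * ((2 / n) * (K @ C - Y) + 2 * lam * C @ A)
    return C
```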
Spectral filtering for MTL

Beyond penalization,
$$\min_W \frac{1}{n} \big\| \hat{X} W - \hat{Y} \big\|^2 + \lambda\, \mathrm{Tr}(W A W^\top),$$
other forms of regularization can be considered:
◮ projection
◮ early stopping
Multiclass and MTL

$$Y = \{1, \dots, T\}$$
From Multiclass to MTL

Encoding. For $j = 1, \dots, T$, map $j \mapsto e_j$, the $j$-th canonical basis vector of $\mathbb{R}^T$. The problem then reduces to vector valued regression.

Decoding. For $f(x) \in \mathbb{R}^T$,
$$f(x) \mapsto \operatorname*{argmax}_{t = 1, \dots, T} e_t^\top f(x) = \operatorname*{argmax}_{t = 1, \dots, T} f_t(x).$$
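A sketch of the encoding/decoding step (illustrative; labels are assumed to be in $\{0, \dots, T-1\}$ for zero-based indexing):

```python
import numpy as np

def encode(labels, T):
    """Map label j to the canonical basis vector e_j of R^T (one-hot)."""
    Y = np.zeros((len(labels), T))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def decode(F):
    """Map rows of scores f(x) in R^T back to classes: argmax_t f_t(x)."""
    return np.argmax(F, axis=1)
```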
Single-task MTL and OVA

Write
$$\min_W \frac{1}{n} \big\| \hat{X} W - \hat{Y} \big\|^2 + \lambda\, \mathrm{Tr}(W W^\top)$$
as
$$\sum_{t=1}^{T} \min_{w_t} \frac{1}{n} \sum_{i=1}^{n} \big( w_t^\top x_i - y_i^t \big)^2 + \lambda \|w_t\|^2.$$

This is known as one versus all (OVA).
Beyond OVA

Consider
$$\min_W \frac{1}{n} \big\| \hat{X} W - \hat{Y} \big\|^2 + \lambda\, \mathrm{Tr}(W A W^\top),$$
that is,
$$\sum_{t=1}^{T} \min_{\tilde{w}_t} \Big( \frac{1}{n} \sum_{i=1}^{n} \big( \tilde{y}_i^t - \tilde{w}_t^\top x_i \big)^2 + \lambda \sigma_t \|\tilde{w}_t\|^2 \Big).$$

Class relatedness is encoded in $A$.
Back to MTL

$$\sum_{t=1}^{T} \frac{1}{n_t} \sum_{j=1}^{n_t} \big( y_j^t - w_t^\top x_j^t \big)^2 = \big\| \big( \hat{X} W - Y \big) \odot M \big\|_F^2, \qquad n = \sum_{t=1}^{T} n_t$$

where $\hat{X}$ is $n \times d$, $W$ is $d \times T$, and $Y$, $M$ are $n \times T$.

◮ $\odot$ is the Hadamard (entrywise) product
◮ $M$ is a mask selecting, for each input, the task it belongs to (the $1/n_t$ weights can be absorbed into $M$)
◮ $Y$ has one non-zero value in each row
Computations

$$\min_W \big\| \big( \hat{X} W - Y \big) \odot M \big\|_F^2 + \lambda\, \mathrm{Tr}(W A W^\top)$$

◮ can be rewritten using tensor calculus
◮ the computations for vector valued regression extend easily
◮ the sparsity of $M$ can be exploited
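A sketch extending the gradient iteration to the masked objective (illustrative names; `M` is assumed to be a 0/1 mask of observed entries, possibly with the $1/n_t$ weights folded in):

```python
import numpy as np

def masked_mtl_gd(X, Y, M, A, lam, gamma, iters=1000):
    """Gradient descent on ||(XW - Y) . M||_F^2 + lam * Tr(W A W')."""
    d, T = X.shape[1], Y.shape[1]
    W = np.zeros((d, T))
    for _ in range(iters):
        # The mask zeroes residuals at unobserved entries (M is 0/1)
        grad = 2 * X.T @ (M * (X @ W - Y)) + 2 * lam * W @ A
        W = W - gamma * grad
    return W
```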
From MTL to matrix completion

Special case: take $d = n$ and $\hat{X} = I$. Then
$$\big\| \big( \hat{X} W - Y \big) \odot M \big\|_F^2 = \sum_{t=1}^{T} \sum_{i=1}^{n} M_{it} \big( W_{it} - \bar{y}_{it} \big)^2,$$
a (weighted) matrix completion objective.
Summary so far

A regularization framework for:
◮ VVR
◮ multiclass
◮ MTL
◮ matrix completion

... if the structure of the "tasks" is known. What if it is not?