

  1. RegML 2016 Class 4: Regularization for multi-task learning. Lorenzo Rosasco (UNIGE-MIT-IIT), June 28, 2016

  2. Supervised learning so far
      - Regression: $f : X \to Y \subseteq \mathbb{R}$
      - Classification: $f : X \to Y = \{-1, 1\}$
      What next?
      - Vector-valued: $f : X \to Y \subseteq \mathbb{R}^T$
      - Multiclass: $f : X \to Y = \{1, 2, \dots, T\}$
      - ...

  3. Multitask learning
      Given $S^1 = (x_i^1, y_i^1)_{i=1}^{n_1}, \dots, S^T = (x_i^T, y_i^T)_{i=1}^{n_T}$,
      find $f^1 : X^1 \to Y^1, \dots, f^T : X^T \to Y^T$.

  4. Multitask learning
      Given $S^1 = (x_i^1, y_i^1)_{i=1}^{n_1}, \dots, S^T = (x_i^T, y_i^T)_{i=1}^{n_T}$,
      find $f^1 : X^1 \to Y^1, \dots, f^T : X^T \to Y^T$.
      - Vector-valued regression: $S_n = (x_i, y_i)_{i=1}^{n}$, $x_i \in X$, $y_i \in \mathbb{R}^T$. MTL with equal inputs! Output coordinates are "tasks".
      - Multiclass: $S_n = (x_i, y_i)_{i=1}^{n}$, $x_i \in X$, $y_i \in \{1, \dots, T\}$

  5. Why MTL?
      [Figure: two panels, "Task 1" and "Task 2", each plotting Y against X]

  6. Why MTL?
      [Figure: four panels of related curves, y-axis 0-60, x-axis 0-25] Real data!

  7. Why MTL?
      Related problems:
      - conjoint analysis
      - transfer learning
      - collaborative filtering
      - co-kriging
      Examples of applications:
      - geophysics
      - music recommendation (Dinuzzo 08)
      - pharmacological data (Pillonetto et al. 08)
      - binding data (Jacob et al. 08)
      - movie recommendation (Abernethy et al. 08)
      - HIV therapy screening (Bickel et al. 08)

  8. Why MTL? VVR, e.g. vector field estimation.

  9. Why MTL?
      [Figure: two panels, "Component 1" and "Component 2" of a vector-valued function, each plotting Y against X]

  10. Penalized regularization for MTL
      $$\mathrm{err}(w_1, \dots, w_T) + \mathrm{pen}(w_1, \dots, w_T)$$
      We start with linear models $f^1(x) = w_1^\top x, \dots, f^T(x) = w_T^\top x$.

  11. Empirical error
      $$\mathcal{E}(w_1, \dots, w_T) = \sum_{t=1}^{T} \frac{1}{n_t} \sum_{j=1}^{n_t} (y_j^t - w_t^\top x_j^t)^2$$
      - could consider other losses
      - could try to "couple" errors

  12. Least squares error
      We focus on vector valued regression (VVR):
      $$S_n = (x_i, y_i)_{i=1}^{n}, \quad x_i \in X, \; y_i \in \mathbb{R}^T$$

  13. Least squares error
      We focus on vector valued regression (VVR): $S_n = (x_i, y_i)_{i=1}^{n}$, $x_i \in X$, $y_i \in \mathbb{R}^T$.
      $$\frac{1}{n} \sum_{t=1}^{T} \sum_{i=1}^{n} (y_i^t - w_t^\top x_i)^2 = \frac{1}{n} \| \hat X W - \hat Y \|_F^2$$
      with $\hat X \in \mathbb{R}^{n \times d}$, $W \in \mathbb{R}^{d \times T}$, $\hat Y \in \mathbb{R}^{n \times T}$, and
      $$\| W \|_F^2 = \mathrm{Tr}(W^\top W), \quad W = (w_1, \dots, w_T), \quad \hat Y_{it} = y_i^t, \; i = 1, \dots, n, \; t = 1, \dots, T$$
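
As a quick sanity check on this identity, here is a minimal numpy sketch (an example of ours, not from the slides) verifying that the sum-over-tasks loss equals the Frobenius form:

```python
import numpy as np

# Hypothetical sizes; any n, d, T work.
rng = np.random.default_rng(0)
n, d, T = 20, 4, 3
X = rng.standard_normal((n, d))   # \hat X, rows are the inputs x_i
W = rng.standard_normal((d, T))   # W = (w_1, ..., w_T), one column per task
Y = rng.standard_normal((n, T))   # \hat Y, Y[i, t] = y_i^t

# Sum-over-tasks form of the least squares error
loss_sum = sum((Y[i, t] - X[i] @ W[:, t]) ** 2
               for t in range(T) for i in range(n)) / n

# Matrix (Frobenius norm) form
loss_frob = np.linalg.norm(X @ W - Y, "fro") ** 2 / n

assert np.isclose(loss_sum, loss_frob)
```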

  14. MTL by regularization
      $$\mathrm{pen}(w_1, \dots, w_T)$$
      - Coupling task solutions by regularization
      - Borrowing strength
      - Exploiting structure

  15. Regularizations for MTL
      $$\mathrm{pen}(w_1, \dots, w_T) = \sum_{t=1}^{T} \| w_t \|^2$$

  16. Regularizations for MTL
      $$\mathrm{pen}(w_1, \dots, w_T) = \sum_{t=1}^{T} \| w_t \|^2$$
      Single-task regularization! The problem decouples into $T$ independent problems:
      $$\min_{w_1, \dots, w_T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} (y_i^t - w_t^\top x_i)^2 + \lambda \sum_{t=1}^{T} \| w_t \|^2 = \sum_{t=1}^{T} \left( \min_{w_t} \frac{1}{n} \sum_{i=1}^{n} (y_i^t - w_t^\top x_i)^2 + \lambda \| w_t \|^2 \right)$$

  17. Regularizations for MTL
      - Isotropic coupling
      $$\alpha \sum_{j=1}^{T} \Big\| w_j - \frac{1}{T} \sum_{i=1}^{T} w_i \Big\|^2 + (1 - \alpha) \sum_{j=1}^{T} \| w_j \|^2$$

  18. Regularizations for MTL
      - Isotropic coupling
      $$\alpha \sum_{j=1}^{T} \Big\| w_j - \frac{1}{T} \sum_{i=1}^{T} w_i \Big\|^2 + (1 - \alpha) \sum_{j=1}^{T} \| w_j \|^2$$
      - Graph coupling. Let $M \in \mathbb{R}^{T \times T}$ be an adjacency matrix with $M_{ts} \geq 0$:
      $$\sum_{t=1}^{T} \sum_{s=1}^{T} M_{ts} \| w_t - w_s \|^2 + \gamma \sum_{t=1}^{T} \| w_t \|^2$$
      Special case: outputs divided into clusters.

  19. A general form of regularization
      All the regularizers so far are of the form
      $$\sum_{t=1}^{T} \sum_{s=1}^{T} A_{ts} \, w_t^\top w_s$$
      for a suitable positive definite matrix $A$.

  20. MTL regularization revisited
      - Single tasks: $\sum_{j=1}^{T} \| w_j \|^2 \implies A = I$

  21. MTL regularization revisited
      - Single tasks: $\sum_{j=1}^{T} \| w_j \|^2 \implies A = I$
      - Isotropic coupling:
      $$\alpha \sum_{j=1}^{T} \Big\| w_j - \frac{1}{T} \sum_{i=1}^{T} w_i \Big\|^2 + (1 - \alpha) \sum_{j=1}^{T} \| w_j \|^2 \implies A = I - \frac{\alpha}{T} \mathbf{1}\mathbf{1}^\top$$
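
The isotropic correspondence can be verified numerically. Below is a small numpy sketch of ours (made-up sizes) checking that the penalty above equals $\mathrm{Tr}(WAW^\top)$ with $A = I - \frac{\alpha}{T}\mathbf{1}\mathbf{1}^\top$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, alpha = 5, 3, 0.7
W = rng.standard_normal((d, T))           # columns w_1, ..., w_T
wbar = W.mean(axis=1, keepdims=True)      # (1/T) sum_i w_i

# Penalty in its sum-over-tasks form
pen = (alpha * ((W - wbar) ** 2).sum()
       + (1 - alpha) * (W ** 2).sum())

# Same penalty as Tr(W A W^T), with A = I - (alpha/T) 1 1^T
A = np.eye(T) - (alpha / T) * np.ones((T, T))
assert np.isclose(pen, np.trace(W @ A @ W.T))
```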

  22. MTL regularization revisited
      - Single tasks: $\sum_{j=1}^{T} \| w_j \|^2 \implies A = I$
      - Isotropic coupling:
      $$\alpha \sum_{j=1}^{T} \Big\| w_j - \frac{1}{T} \sum_{i=1}^{T} w_i \Big\|^2 + (1 - \alpha) \sum_{j=1}^{T} \| w_j \|^2 \implies A = I - \frac{\alpha}{T} \mathbf{1}\mathbf{1}^\top$$
      - Graph coupling:
      $$\sum_{t=1}^{T} \sum_{s=1}^{T} M_{ts} \| w_t - w_s \|^2 + \gamma \sum_{t=1}^{T} \| w_t \|^2 \implies A = L + \gamma I$$
      where $L$ is the graph Laplacian of $M$:
      $$L = D - M, \quad D = \mathrm{diag}\Big( \sum_j M_{1j}, \dots, \sum_j M_{Tj} \Big)$$
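
A sketch of how one might build $A$ from a task adjacency matrix $M$ (function name ours). For symmetric $M$ the graph-coupling penalty equals $2\,\mathrm{Tr}(W(L+\gamma I)W^\top)$; the factor 2 can be absorbed into the regularization parameter, as the slide implicitly does:

```python
import numpy as np

def coupling_matrix(M, gamma):
    """A = L + gamma*I, where L = D - M is the graph Laplacian
    of the (symmetric, nonnegative) task adjacency matrix M."""
    D = np.diag(M.sum(axis=1))
    return D - M + gamma * np.eye(M.shape[0])

# Check: for symmetric M,
#   sum_{t,s} M_ts ||w_t - w_s||^2 + 2*gamma*sum_t ||w_t||^2
#     = 2 * Tr(W (L + gamma*I) W^T)
rng = np.random.default_rng(0)
T, d, gamma = 4, 3, 0.5
M = rng.random((T, T))
M = (M + M.T) / 2                 # symmetrize
W = rng.standard_normal((d, T))
A = coupling_matrix(M, gamma)
pen = sum(M[t, s] * np.sum((W[:, t] - W[:, s]) ** 2)
          for t in range(T) for s in range(T)) + 2 * gamma * (W ** 2).sum()
assert np.isclose(pen, 2 * np.trace(W @ A @ W.T))
```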

  23. A general form of regularization
      Let $W = (w_1, \dots, w_T)$, $A \in \mathbb{R}^{T \times T}$. Note that
      $$\sum_{t=1}^{T} \sum_{s=1}^{T} A_{ts} \, w_t^\top w_s = \mathrm{Tr}(W A W^\top)$$

  24. A general form of regularization
      Let $W = (w_1, \dots, w_T)$, $A \in \mathbb{R}^{T \times T}$. Note that
      $$\sum_{t=1}^{T} \sum_{s=1}^{T} A_{ts} \, w_t^\top w_s = \mathrm{Tr}(W A W^\top)$$
      Indeed, writing $W_i \in \mathbb{R}^T$ for the $i$-th row of $W$,
      $$\mathrm{Tr}(W A W^\top) = \sum_{i=1}^{d} W_i^\top A W_i = \sum_{i=1}^{d} \sum_{t,s=1}^{T} A_{ts} W_{it} W_{is} = \sum_{t,s=1}^{T} A_{ts} \sum_{i=1}^{d} W_{it} W_{is} = \sum_{t,s=1}^{T} A_{ts} \, w_t^\top w_s$$
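
This identity is again easy to confirm numerically; a minimal sketch of ours with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 3
W = rng.standard_normal((d, T))   # columns w_1, ..., w_T
A = rng.standard_normal((T, T))
A = A @ A.T                       # make A symmetric positive (semi)definite

lhs = sum(A[t, s] * (W[:, t] @ W[:, s])
          for t in range(T) for s in range(T))
assert np.isclose(lhs, np.trace(W @ A @ W.T))
```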

  25. Computations
      $$\frac{1}{n} \| \hat X W - \hat Y \|_F^2 + \lambda \, \mathrm{Tr}(W A W^\top)$$

  26. Computations
      $$\frac{1}{n} \| \hat X W - \hat Y \|_F^2 + \lambda \, \mathrm{Tr}(W A W^\top)$$
      Consider the SVD
      $$A = U \Sigma U^\top, \quad \Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_T)$$

  27. Computations
      $$\frac{1}{n} \| \hat X W - \hat Y \|_F^2 + \lambda \, \mathrm{Tr}(W A W^\top)$$
      Consider the SVD $A = U \Sigma U^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_T)$, and let
      $$\tilde W = W U, \quad \tilde Y = \hat Y U.$$
      Then we can rewrite the above problem as
      $$\frac{1}{n} \| \hat X \tilde W - \tilde Y \|_F^2 + \lambda \, \mathrm{Tr}(\tilde W \Sigma \tilde W^\top)$$

  28. Computations (cont.)
      Finally, rewrite
      $$\frac{1}{n} \| \hat X \tilde W - \tilde Y \|_F^2 + \lambda \, \mathrm{Tr}(\tilde W \Sigma \tilde W^\top)$$
      as
      $$\sum_{t=1}^{T} \left( \frac{1}{n} \sum_{i=1}^{n} (\tilde y_i^t - \tilde w_t^\top x_i)^2 + \lambda \sigma_t \| \tilde w_t \|^2 \right)$$
      and recover $W = \tilde W U^\top$. Compare to single-task regularization.
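
Putting slides 25 through 28 together, here is a minimal numpy sketch of the full procedure (function name and API are ours): diagonalize $A$, solve $T$ decoupled ridge problems with per-task parameters $\lambda\sigma_t$, and rotate back.

```python
import numpy as np

def mtl_ridge(X, Y, A, lam):
    """Minimize (1/n)||XW - Y||_F^2 + lam * Tr(W A W^T),
    assuming A is symmetric positive definite."""
    n, d = X.shape
    sig, U = np.linalg.eigh(A)        # A = U diag(sig) U^T
    Y_tilde = Y @ U                   # rotated targets
    G = X.T @ X                       # shared Gram matrix
    XtY = X.T @ Y_tilde
    # One ridge problem per rotated task, with parameter lam * sig[t]
    W_tilde = np.column_stack([
        np.linalg.solve(G + n * lam * sig[t] * np.eye(d), XtY[:, t])
        for t in range(Y.shape[1])
    ])
    return W_tilde @ U.T              # rotate back: W = W~ U^T
```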

  29. Computations (cont.)
      $$\mathcal{E}_\lambda(W) = \frac{1}{n} \| \hat X W - \hat Y \|_F^2 + \lambda \, \mathrm{Tr}(W A W^\top)$$
      Alternatively, gradient descent:
      $$\nabla \mathcal{E}_\lambda(W) = \frac{2}{n} \hat X^\top (\hat X W - \hat Y) + 2 \lambda W A, \qquad W_{t+1} = W_t - \gamma \nabla \mathcal{E}_\lambda(W_t)$$
      Trivially extends to other loss functions.
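
A minimal sketch of this iteration (our own implementation; the step size gamma must be chosen small enough for convergence):

```python
import numpy as np

def mtl_gradient_descent(X, Y, A, lam, gamma, iters=1000):
    """Gradient descent on E(W) = (1/n)||XW - Y||_F^2 + lam*Tr(W A W^T)."""
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    for _ in range(iters):
        grad = (2.0 / n) * X.T @ (X @ W - Y) + 2.0 * lam * W @ A
        W = W - gamma * grad
    return W
```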

  30. Beyond linearity
      $$f^t(x) = w_t^\top \Phi(x), \quad \Phi(x) = (\varphi_1(x), \dots, \varphi_p(x))$$
      $$\mathcal{E}_\lambda(W) = \frac{1}{n} \| \hat \Phi W - \hat Y \|^2 + \lambda \, \mathrm{Tr}(W A W^\top),$$
      with $\hat \Phi$ the matrix with rows $\Phi(x_1), \dots, \Phi(x_n)$.

  31. Nonparametrics and kernels
      $$f^t(x) = \sum_{i=1}^{n} K(x, x_i) \, C_{it}$$
      with
      $$C_{\ell+1} = C_\ell - \gamma \left( \frac{2}{n} (\hat K C_\ell - \hat Y) + 2 \lambda C_\ell A \right)$$
      - $C_\ell \in \mathbb{R}^{n \times T}$
      - $\hat K \in \mathbb{R}^{n \times n}$, $\hat K_{ij} = K(x_i, x_j)$
      - $\hat Y \in \mathbb{R}^{n \times T}$, $\hat Y_{ij} = y_i^j$
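
A minimal sketch of the kernelized iteration (the Gaussian kernel is our illustrative choice; the slides leave $K$ generic):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_mtl(X, Y, A, lam, gamma, iters=1000, sigma=1.0):
    """Iterate C_{l+1} = C_l - gamma*((2/n)(K C_l - Y) + 2*lam*C_l*A)."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    C = np.zeros_like(Y, dtype=float)
    for _ in range(iters):
        C = C - gamma * ((2.0 / n) * (K @ C - Y) + 2.0 * lam * C @ A)
    return C    # predict with f_t(x) = sum_i K(x, x_i) * C[i, t]
```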

  32. Spectral filtering for MTL
      Beyond penalization
      $$\min_W \frac{1}{n} \| \hat X W - \hat Y \|^2 + \lambda \, \mathrm{Tr}(W A W^\top),$$
      other forms of regularization can be considered:
      - projection
      - early stopping

  33. Multiclass and MTL: $Y = \{1, \dots, T\}$

  34. From multiclass to MTL
      Encoding: for $j = 1, \dots, T$,
      $$j \mapsto e_j, \quad \text{the canonical basis vector of } \mathbb{R}^T,$$
      and the problem reduces to vector-valued regression.
      Decoding: for $f(x) \in \mathbb{R}^T$,
      $$f(x) \mapsto \operatorname*{argmax}_{t=1,\dots,T} e_t^\top f(x) = \operatorname*{argmax}_{t=1,\dots,T} f^t(x)$$
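
A sketch of the encoding/decoding pair in numpy (0-based class labels, whereas the slide uses 1-based; function names ours):

```python
import numpy as np

def encode(labels, T):
    """Map class j in {0, ..., T-1} to the canonical vector e_j of R^T."""
    Y = np.zeros((len(labels), T))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def decode(F):
    """Map vector-valued predictions f(x) in R^T to argmax_t f^t(x)."""
    return np.argmax(F, axis=1)

# Example: encode([2, 0], 3) gives rows e_2 and e_0; decode inverts it.
assert (decode(encode([2, 0], 3)) == np.array([2, 0])).all()
```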

  35. Single-task MTL and OVA
      Write
      $$\min_W \frac{1}{n} \| \hat X W - \hat Y \|^2 + \lambda \, \mathrm{Tr}(W W^\top)$$
      as
      $$\sum_{t=1}^{T} \left( \min_{w_t} \frac{1}{n} \sum_{i=1}^{n} (w_t^\top x_i - y_i^t)^2 + \lambda \| w_t \|^2 \right)$$
      This is known as one versus all (OVA).

  36. Beyond OVA
      Consider
      $$\min_W \frac{1}{n} \| \hat X W - \hat Y \|^2 + \lambda \, \mathrm{Tr}(W A W^\top),$$
      that is,
      $$\sum_{t=1}^{T} \left( \min_{\tilde w_t} \frac{1}{n} \sum_{i=1}^{n} (\tilde y_i^t - \tilde w_t^\top x_i)^2 + \lambda \sigma_t \| \tilde w_t \|^2 \right)$$
      Class relatedness is encoded in $A$.

  37. Back to MTL
      $$\sum_{t=1}^{T} \frac{1}{n_t} \sum_{j=1}^{n_t} (y_j^t - w_t^\top x_j^t)^2 \quad \Downarrow \quad \| (\hat X W - Y) \odot M \|_F^2$$
      with $n = \sum_{t=1}^{T} n_t$, $\hat X \in \mathbb{R}^{n \times d}$, $W \in \mathbb{R}^{d \times T}$, $Y, M \in \mathbb{R}^{n \times T}$
      - $\odot$: Hadamard (entrywise) product
      - $M$: mask
      - $Y$: one non-zero value for each row

  38. Computations
      $$\min_W \| (\hat X W - Y) \odot M \|_F^2 + \lambda \, \mathrm{Tr}(W A W^\top)$$
      - can be rewritten using tensor calculus
      - computations for vector-valued regression easily extend
      - sparsity of $M$ can be exploited
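
A sketch of gradient descent for the masked objective (our own minimal implementation; the gradient of the masked term is $2\hat X^\top((\hat X W - Y)\odot M \odot M)$):

```python
import numpy as np

def masked_mtl(X, Y, M, A, lam, gamma, iters=1000):
    """Gradient descent on ||(XW - Y) * M||_F^2 + lam*Tr(W A W^T),
    where * is the entrywise (Hadamard) product."""
    d, T = X.shape[1], Y.shape[1]
    W = np.zeros((d, T))
    for _ in range(iters):
        R = (X @ W - Y) * M          # residuals on observed entries only
        W = W - gamma * (2.0 * X.T @ (R * M) + 2.0 * lam * W @ A)
    return W
```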

  39. From MTL to matrix completion
      Special case: take $d = n$ and $\hat X = I$. Then
      $$\| (\hat X W - Y) \odot M \|_F^2 \quad \Downarrow \quad \sum_{t=1}^{T} \sum_{i=1}^{n} M_{it} (W_{it} - \bar y_{it})^2$$
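
A quick check of this reduction (our own sketch; the mask is assumed binary, so $M \odot M = M$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.standard_normal((n, n))                 # d = n: W is the matrix to complete
Y = rng.standard_normal((n, n))                 # \bar y: the partially observed values
M = (rng.random((n, n)) < 0.3).astype(float)    # binary observed-entry mask

lhs = np.linalg.norm((np.eye(n) @ W - Y) * M, "fro") ** 2
rhs = sum(M[i, t] * (W[i, t] - Y[i, t]) ** 2
          for i in range(n) for t in range(n))
assert np.isclose(lhs, rhs)
```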

  40. Summary so far
      A regularization framework for
      - VVR
      - multiclass
      - MTL
      - matrix completion
      if the structure of the "tasks" is known. What if it is not?
