RegML 2016, Class 4: Regularization for Multi-Task Learning
Lorenzo Rosasco (UNIGE-MIT-IIT)
June 28, 2016
Supervised learning so far

◮ Regression: $f : X \to Y \subseteq \mathbb{R}$
◮ Classification: $f : X \to Y = \{-1, 1\}$

What next?

◮ Vector-valued: $f : X \to Y \subseteq \mathbb{R}^T$
◮ Multiclass: $f : X \to Y = \{1, 2, \dots, T\}$
◮ ...
Multitask learning

Given $S_1 = (x_i^1, y_i^1)_{i=1}^{n_1}, \dots, S_T = (x_i^T, y_i^T)_{i=1}^{n_T}$, find $f_1 : X_1 \to Y_1, \dots, f_T : X_T \to Y_T$.

Two important special cases:

◮ Vector valued regression (VVR): $S_n = (x_i, y_i)_{i=1}^n$, $x_i \in X$, $y_i \in \mathbb{R}^T$. This is MTL with equal inputs! The output coordinates are the "tasks".
◮ Multiclass: $S_n = (x_i, y_i)_{i=1}^n$, $x_i \in X$, $y_i \in \{1, \dots, T\}$.
Why MTL?

[Figure: two related regression problems, Task 1 and Task 2, plotted over the same input space X.]
Why MTL?

[Figure: four scatter plots of real data, one per task, on a common scale.] Real data!
Why MTL?

Related problems:
◮ conjoint analysis
◮ transfer learning
◮ collaborative filtering
◮ co-kriging

Examples of applications:
◮ geophysics
◮ music recommendation (Dinuzzo 08)
◮ pharmacological data (Pillonetto et al. 08)
◮ binding data (Jacob et al. 08)
◮ movie recommendation (Abernethy et al. 08)
◮ HIV therapy screening (Bickel et al. 08)
Why MTL?

VVR, e.g. vector field estimation.
Why MTL?

[Figure: the two components of a vector-valued function, Component 1 and Component 2, plotted over the same input space X.]
Penalized regularization for MTL

$$\mathrm{err}(w_1, \dots, w_T) + \mathrm{pen}(w_1, \dots, w_T)$$

We start with linear models: $f_1(x) = w_1^\top x, \dots, f_T(x) = w_T^\top x$.
Empirical error

$$\mathcal{E}(w_1, \dots, w_T) = \sum_{i=1}^{T} \frac{1}{n_i} \sum_{j=1}^{n_i} \big(y_j^i - w_i^\top x_j^i\big)^2$$

◮ could consider other losses
◮ could try to "couple" the errors
Least squares error

We focus on vector valued regression (VVR): $S_n = (x_i, y_i)_{i=1}^n$, $x_i \in X$, $y_i \in \mathbb{R}^T$. Then

$$\frac{1}{n} \sum_{t=1}^{T} \sum_{i=1}^{n} \big(y_i^t - w_t^\top x_i\big)^2 = \frac{1}{n} \big\| \hat{X} W - \hat{Y} \big\|_F^2$$

where $\hat{X}$ is $n \times d$, $W = (w_1, \dots, w_T)$ is $d \times T$, $\hat{Y}$ is $n \times T$ with $\hat{Y}_{it} = y_i^t$ for $i = 1, \dots, n$ and $t = 1, \dots, T$, and $\|W\|_F^2 = \mathrm{Tr}(W^\top W)$.
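As a concrete check of the notation, a minimal NumPy sketch (variable names are illustrative, not from the slides) confirming that the double sum and the Frobenius-norm form agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 50, 10, 4                      # samples, input dimension, tasks
X = rng.standard_normal((n, d))          # \hat{X}, n x d
W = rng.standard_normal((d, T))          # W = (w_1, ..., w_T), d x T
Y = rng.standard_normal((n, T))          # \hat{Y}, n x T, Y[i, t] = y_i^t

# Double-sum form: (1/n) sum_t sum_i (y_i^t - w_t' x_i)^2
err_sum = sum((Y[i, t] - W[:, t] @ X[i]) ** 2
              for t in range(T) for i in range(n)) / n

# Matrix form: (1/n) ||X W - Y||_F^2
err_fro = np.linalg.norm(X @ W - Y, 'fro') ** 2 / n

assert np.isclose(err_sum, err_fro)
```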
MTL by regularization

$$\mathrm{pen}(w_1, \dots, w_T)$$

◮ Coupling task solutions by regularization
◮ Borrowing strength
◮ Exploiting structure
Regularizations for MTL

$$\mathrm{pen}(w_1, \dots, w_T) = \sum_{t=1}^{T} \|w_t\|^2$$

This is single-task regularization! The problem decouples:

$$\min_{w_1, \dots, w_T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \big(y_i^t - w_t^\top x_i\big)^2 + \lambda \sum_{t=1}^{T} \|w_t\|^2 = \sum_{t=1}^{T} \Big( \min_{w_t} \frac{1}{n} \sum_{i=1}^{n} \big(y_i^t - w_t^\top x_i\big)^2 + \lambda \|w_t\|^2 \Big)$$

Each task is solved independently; nothing is shared across tasks.
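Since the penalty decouples, each $w_t$ is an ordinary ridge regression solution. A minimal sketch (illustrative names; shared inputs as in VVR):

```python
import numpy as np

def single_task_ridge(X, Y, lam):
    """Solve min_W (1/n)||XW - Y||_F^2 + lam * sum_t ||w_t||^2.

    With this penalty (A = I) the columns of W decouple: each w_t is
    the ridge solution for the t-th output column of Y, so all tasks
    can be solved at once via the shared normal equations.
    """
    n, d = X.shape
    # (X'X + n*lam*I) W = X'Y  <=>  per-task ridge for every column of Y
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)
```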
Regularizations for MTL

◮ Isotropic coupling:
$$(1 - \alpha) \sum_{j=1}^{T} \|w_j\|^2 + \alpha \sum_{j=1}^{T} \Big\| w_j - \frac{1}{T} \sum_{i=1}^{T} w_i \Big\|^2$$

◮ Graph coupling: let $M \in \mathbb{R}^{T \times T}$ be an adjacency matrix with $M_{ts} \ge 0$,
$$\sum_{t=1}^{T} \sum_{s=1}^{T} M_{ts} \|w_t - w_s\|^2 + \gamma \sum_{t=1}^{T} \|w_t\|^2$$
Special case: the outputs are divided into clusters.
A general form of regularization

All the regularizers so far are of the form
$$\sum_{t=1}^{T} \sum_{s=1}^{T} A_{ts}\, w_t^\top w_s$$
for a suitable positive definite matrix $A$.
MTL regularization revisited

◮ Single tasks: $\sum_{j=1}^{T} \|w_j\|^2 \;\Rightarrow\; A = I$

◮ Isotropic coupling:
$$(1 - \alpha) \sum_{j=1}^{T} \|w_j\|^2 + \alpha \sum_{j=1}^{T} \Big\| w_j - \frac{1}{T} \sum_{i=1}^{T} w_i \Big\|^2 \;\Rightarrow\; A = I - \frac{\alpha}{T} \mathbf{1}$$
where $\mathbf{1}$ is the $T \times T$ all-ones matrix.

◮ Graph coupling:
$$\sum_{t=1}^{T} \sum_{s=1}^{T} M_{ts} \|w_t - w_s\|^2 + \gamma \sum_{t=1}^{T} \|w_t\|^2 \;\Rightarrow\; A = L + \gamma I$$
where $L = D - M$ is the graph Laplacian of $M$, with $D = \mathrm{diag}\big(\sum_j M_{1j}, \dots, \sum_j M_{Tj}\big)$.
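A sketch constructing the matrix $A$ for each choice above (illustrative helper names; `M` is any symmetric adjacency matrix with nonnegative entries):

```python
import numpy as np

def A_single(T):
    """Single-task regularization: A = I."""
    return np.eye(T)

def A_isotropic(T, alpha):
    """Isotropic coupling: A = I - (alpha/T) * ones(T, T)."""
    return np.eye(T) - (alpha / T) * np.ones((T, T))

def A_graph(M, gamma):
    """Graph coupling: A = L + gamma*I, L = D - M the graph Laplacian."""
    D = np.diag(M.sum(axis=1))          # diagonal of row sums of M
    return D - M + gamma * np.eye(M.shape[0])
```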
A general form of regularization

Let $W = (w_1, \dots, w_T)$ and $A \in \mathbb{R}^{T \times T}$. Note that
$$\sum_{t=1}^{T} \sum_{s=1}^{T} A_{ts}\, w_t^\top w_s = \mathrm{Tr}(W A W^\top).$$

Indeed,
$$\mathrm{Tr}(W A W^\top) = \sum_{i=1}^{d} \big( W A W^\top \big)_{ii} = \sum_{i=1}^{d} \sum_{t,s=1}^{T} W_{it} A_{ts} W_{is} = \sum_{t,s=1}^{T} A_{ts} \sum_{i=1}^{d} W_{it} W_{is} = \sum_{t,s=1}^{T} A_{ts}\, w_t^\top w_s.$$
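The identity is also easy to verify numerically (a throwaway sanity check, not part of any algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 6, 3
W = rng.standard_normal((d, T))          # W = (w_1, ..., w_T)
A = rng.standard_normal((T, T))

# Double sum: sum_{t,s} A_ts * w_t' w_s
lhs = sum(A[t, s] * (W[:, t] @ W[:, s]) for t in range(T) for s in range(T))
# Trace form: Tr(W A W')
rhs = np.trace(W @ A @ W.T)
assert np.isclose(lhs, rhs)
```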
Computations

$$\frac{1}{n} \big\| \hat{X} W - \hat{Y} \big\|_F^2 + \lambda\, \mathrm{Tr}(W A W^\top)$$

Consider the SVD (the eigendecomposition, since $A$ is symmetric) $A = U \Sigma U^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_T)$, and let $\tilde{W} = W U$, $\tilde{Y} = \hat{Y} U$. Then we can rewrite the above problem as
$$\frac{1}{n} \big\| \hat{X} \tilde{W} - \tilde{Y} \big\|_F^2 + \lambda\, \mathrm{Tr}(\tilde{W} \Sigma \tilde{W}^\top).$$
Computations (cont.)

Finally, rewrite
$$\frac{1}{n} \big\| \hat{X} \tilde{W} - \tilde{Y} \big\|_F^2 + \lambda\, \mathrm{Tr}(\tilde{W} \Sigma \tilde{W}^\top)$$
as
$$\sum_{t=1}^{T} \Big( \frac{1}{n} \sum_{i=1}^{n} \big(\tilde{y}_i^t - \tilde{w}_t^\top x_i\big)^2 + \lambda \sigma_t \|\tilde{w}_t\|^2 \Big),$$
a collection of $T$ independent problems, then recover $W = \tilde{W} U^\top$. Compare to single-task regularization: in the rotated coordinates the coupling reduces to per-task penalties $\lambda \sigma_t$.
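A sketch of this rotation trick (illustrative names; assumes $A$ is symmetric positive definite): diagonalize $A$, solve $T$ independent ridge problems with penalties $\lambda \sigma_t$, then rotate back.

```python
import numpy as np

def mtl_ridge_svd(X, Y, A, lam):
    """Solve min_W (1/n)||XW - Y||_F^2 + lam*Tr(W A W') via A = U S U'."""
    n, d = X.shape
    sigma, U = np.linalg.eigh(A)        # A = U diag(sigma) U'
    Yt = Y @ U                          # rotated targets \tilde{Y} = Y U
    G = X.T @ X
    Wt = np.column_stack([
        # t-th rotated task: ordinary ridge with penalty lam * sigma_t
        np.linalg.solve(G + n * lam * sigma[t] * np.eye(d), X.T @ Yt[:, t])
        for t in range(len(sigma))
    ])
    return Wt @ U.T                     # rotate back: W = \tilde{W} U'
```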
Computations (cont.)

$$\mathcal{E}_\lambda(W) = \frac{1}{n} \big\| \hat{X} W - \hat{Y} \big\|_F^2 + \lambda\, \mathrm{Tr}(W A W^\top)$$

Alternatively, use gradient descent (iteration index $\ell$, to avoid clashing with the task index):
$$\nabla \mathcal{E}_\lambda(W) = \frac{2}{n} \hat{X}^\top \big( \hat{X} W - \hat{Y} \big) + 2 \lambda W A, \qquad W_{\ell+1} = W_\ell - \gamma\, \nabla \mathcal{E}_\lambda(W_\ell).$$

This trivially extends to other loss functions.
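A sketch of the gradient iteration (illustrative names; the default step size uses a standard bound on the gradient's Lipschitz constant, an assumption not stated on the slide):

```python
import numpy as np

def mtl_gradient_descent(X, Y, A, lam, gamma=None, iters=1000):
    """Gradient descent on (1/n)||XW - Y||_F^2 + lam * Tr(W A W')."""
    n, d = X.shape
    T = Y.shape[1]
    if gamma is None:
        # Conservative step: 1 / (Lipschitz bound of the gradient)
        gamma = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 / n
                       + 2 * lam * np.linalg.norm(A, 2))
    W = np.zeros((d, T))
    for _ in range(iters):
        grad = (2 / n) * X.T @ (X @ W - Y) + 2 * lam * W @ A
        W = W - gamma * grad
    return W
```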
Beyond Linearity

$$f_t(x) = w_t^\top \Phi(x), \qquad \Phi(x) = \big( \varphi_1(x), \dots, \varphi_p(x) \big)$$

$$\mathcal{E}_\lambda(W) = \frac{1}{n} \big\| \hat{\Phi} W - \hat{Y} \big\|^2 + \lambda\, \mathrm{Tr}(W A W^\top),$$

with $\hat{\Phi}$ the matrix with rows $\Phi(x_1), \dots, \Phi(x_n)$.
Nonparametrics and kernels

$$f_t(x) = \sum_{i=1}^{n} K(x, x_i)\, C_{it}$$

with the iteration
$$C_{\ell+1} = C_\ell - \gamma \Big( \frac{2}{n} \big( \hat{K} C_\ell - \hat{Y} \big) + 2 \lambda C_\ell A \Big)$$

◮ $C_\ell \in \mathbb{R}^{n \times T}$
◮ $\hat{K} \in \mathbb{R}^{n \times n}$, $\hat{K}_{ij} = K(x_i, x_j)$
◮ $\hat{Y} \in \mathbb{R}^{n \times T}$, $\hat{Y}_{ij} = y_i^j$
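A sketch of the kernel iteration (illustrative names; `K` is the precomputed $n \times n$ kernel matrix for any positive definite kernel):

```python
import numpy as np

def kernel_mtl(K, Y, A, lam, gamma, iters=1000):
    """Iterate C <- C - gamma * ((2/n)(K C - Y) + 2*lam*C A).

    K: n x n kernel matrix, Y: n x T targets, A: T x T task matrix.
    Returns C (n x T); predictions are f_t(x) = sum_i K(x, x_i) C[i, t].
    """
    n, T = Y.shape
    C = np.zeros((n, T))
    for _ in range(iters):
        C = C - gamma * ((2 / n) * (K @ C - Y) + 2 * lam * C @ A)
    return C
```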
Spectral filtering for MTL

Beyond penalization,
$$\min_W \frac{1}{n} \big\| \hat{X} W - \hat{Y} \big\|^2 + \lambda\, \mathrm{Tr}(W A W^\top),$$
other forms of regularization can be considered:
◮ projection
◮ early stopping
Multiclass and MTL

$$Y = \{1, \dots, T\}$$
From Multiclass to MTL

Encoding. For $j = 1, \dots, T$, map $j \mapsto e_j$, the $j$-th canonical basis vector of $\mathbb{R}^T$. The problem then reduces to vector valued regression.

Decoding. For $f(x) \in \mathbb{R}^T$,
$$f(x) \mapsto \operatorname*{argmax}_{t = 1, \dots, T} e_t^\top f(x) = \operatorname*{argmax}_{t = 1, \dots, T} f_t(x).$$
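A sketch of the encoding/decoding step (illustrative; labels are assumed to be in $\{0, \dots, T-1\}$ for zero-based indexing):

```python
import numpy as np

def encode(labels, T):
    """Map label j to the canonical basis vector e_j of R^T (one-hot)."""
    Y = np.zeros((len(labels), T))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def decode(F):
    """Map rows of scores f(x) in R^T back to classes: argmax_t f_t(x)."""
    return np.argmax(F, axis=1)
```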
Single-task MTL and OVA

Write
$$\min_W \frac{1}{n} \big\| \hat{X} W - \hat{Y} \big\|^2 + \lambda\, \mathrm{Tr}(W W^\top)$$
as
$$\sum_{t=1}^{T} \min_{w_t} \frac{1}{n} \sum_{i=1}^{n} \big( w_t^\top x_i - y_i^t \big)^2 + \lambda \|w_t\|^2.$$

This is known as one versus all (OVA).
Beyond OVA

Consider
$$\min_W \frac{1}{n} \big\| \hat{X} W - \hat{Y} \big\|^2 + \lambda\, \mathrm{Tr}(W A W^\top),$$
that is,
$$\sum_{t=1}^{T} \min_{\tilde{w}_t} \Big( \frac{1}{n} \sum_{i=1}^{n} \big( \tilde{y}_i^t - \tilde{w}_t^\top x_i \big)^2 + \lambda \sigma_t \|\tilde{w}_t\|^2 \Big).$$

Class relatedness is encoded in $A$.
Back to MTL

$$\sum_{t=1}^{T} \frac{1}{n_t} \sum_{j=1}^{n_t} \big( y_j^t - w_t^\top x_j^t \big)^2 = \big\| \big( \hat{X} W - Y \big) \odot M \big\|_F^2, \qquad n = \sum_{t=1}^{T} n_t$$

where $\hat{X}$ is $n \times d$, $W$ is $d \times T$, and $Y$, $M$ are $n \times T$.

◮ $\odot$ is the Hadamard (entrywise) product
◮ $M$ is a mask selecting, for each input, the task it belongs to (the $1/n_t$ weights can be absorbed into $M$)
◮ $Y$ has one non-zero value in each row
Computations

$$\min_W \big\| \big( \hat{X} W - Y \big) \odot M \big\|_F^2 + \lambda\, \mathrm{Tr}(W A W^\top)$$

◮ can be rewritten using tensor calculus
◮ the computations for vector valued regression extend easily
◮ the sparsity of $M$ can be exploited
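A sketch extending the gradient iteration to the masked objective (illustrative names; `M` is assumed to be a 0/1 mask of observed entries, possibly with the $1/n_t$ weights folded in):

```python
import numpy as np

def masked_mtl_gd(X, Y, M, A, lam, gamma, iters=1000):
    """Gradient descent on ||(XW - Y) . M||_F^2 + lam * Tr(W A W')."""
    d, T = X.shape[1], Y.shape[1]
    W = np.zeros((d, T))
    for _ in range(iters):
        # The mask zeroes residuals at unobserved entries (M is 0/1)
        grad = 2 * X.T @ (M * (X @ W - Y)) + 2 * lam * W @ A
        W = W - gamma * grad
    return W
```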
From MTL to matrix completion

Special case: take $d = n$ and $\hat{X} = I$. Then
$$\big\| \big( \hat{X} W - Y \big) \odot M \big\|_F^2 = \sum_{t=1}^{T} \sum_{i=1}^{n} M_{it} \big( W_{it} - \bar{y}_{it} \big)^2,$$
a (weighted) matrix completion objective.
Summary so far

A regularization framework for:
◮ VVR
◮ multiclass
◮ MTL
◮ matrix completion

... if the structure of the "tasks" is known. What if it is not?