Multi-Task Learning and Matrix Regularization
Andreas Argyriou, TTI Chicago
Outline
• Multi-task learning and related problems
• Multi-task feature learning (trace norm, Schatten $L_p$ norms, non-convex regularizers)
• Representer theorems; "kernelization"
Multi-Task Learning
• Tasks $t = 1, \ldots, n$
• $m$ examples per task are given: $(x_{t1}, y_{t1}), \ldots, (x_{tm}, y_{tm}) \in \mathcal{X} \times \mathcal{Y}$ (simplification: sample sizes need not be equal; subsumes the case of common input data)
• Predict using functions $f_t : \mathcal{X} \to \mathcal{Y}$, $t = 1, \ldots, n$
• When the tasks are related, learning the tasks jointly should perform better than learning each task independently
• Especially important when few data points are available per task (small $m$); in such cases, independent learning is not successful
Transfer
• Want good generalization not only on the $n$ given tasks but also on new tasks (transfer learning)
• Given a few examples from a new task $t'$, $\{(x_{t'1}, y_{t'1}), \ldots, (x_{t'\ell}, y_{t'\ell})\}$, want to learn $f_{t'}$
• Do this by "transferring" the common task structure / features learned from the $n$ tasks
• Transfer is an important feature of human intelligence
Multi-Task Applications
• Marketing databases, collaborative filtering, recommendation systems (e.g. Netflix); task = product preferences for each person
Matrix Completion
• Matrix completion:
$$\min_{W \in \mathbb{R}^{d \times n}} \ \mathrm{rank}(W) \quad \text{s.t.} \quad w_{ij} = y_{ij} \ \ \forall (i,j) \in E$$
• Special case of multi-task learning, in which the input vectors are elements of the canonical basis (see the sketch below)
• Each column of the matrix corresponds to the regression vector of a task; in matrix completion the emphasis is on recovery of the matrix, whereas in learning we are also interested in generalization
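To make the reduction concrete, here is a minimal Python sketch (toy data, hypothetical variable names) of how an observed entry $(i, j)$ becomes a training example for task $j$ with the canonical basis vector $e_i$ as input, since $\langle w_j, e_i \rangle = w_{ij}$:

```python
import numpy as np

# Toy instance: a rank-1 ground-truth matrix with a few observed entries.
d, n = 5, 4
rng = np.random.default_rng(0)
W_true = np.outer(rng.normal(size=d), rng.normal(size=n))   # rank-1 ground truth
E = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (0, 2)]        # observed positions

# Each observed entry (i, j) is a training example for task j:
# input x = e_i (canonical basis vector), target y = w_ij, since <w_j, e_i> = w_ij.
examples = {t: [] for t in range(n)}
for i, j in E:
    e_i = np.eye(d)[i]
    examples[j].append((e_i, W_true[i, j]))

for t, pairs in examples.items():
    print(f"task {t}: {len(pairs)} example(s)")
```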
Related Problems
• Domain adaptation / transfer
• Multi-view learning
• Multi-label learning
• Multi-task learning is a broad problem; no single method can solve everything
Learning Multiple Tasks with a Common Kernel
• Learn a common kernel $K(x, x') = \langle x, D x' \rangle$ from a convex set of kernels:
$$\inf_{\substack{w_1, \ldots, w_n \in \mathbb{R}^d \\ D \succ 0,\ \mathrm{tr}(D) \le 1}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma\, \mathrm{tr}(W^\top D^{-1} W) \qquad (MTL)$$
where $W = [\, w_1 \ \cdots \ w_n \,]$ and $\mathrm{tr}(W^\top D^{-1} W) = \sum_{t=1}^{n} \langle w_t, D^{-1} w_t \rangle$
Learning Multiple Tasks with a Common Kernel
• Jointly convex problem in $(W, D)$
• The choice of constraint $\mathrm{tr}(D) \le 1$ is important; intuitively, it penalizes the number of common features (eigenvectors of $D$)
• Once we have learned $\hat{D}$, we can transfer it to the learning of a new task $t'$:
$$\min_{w \in \mathbb{R}^d} \ \sum_{i=1}^{m} E(\langle w, x_{t'i} \rangle, y_{t'i}) + \gamma \langle w, \hat{D}^{-1} w \rangle$$
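For the square loss, the new-task problem above has a closed-form solution, so transfer is cheap. A minimal sketch (hypothetical function name; the small ridge on $\hat{D}$ is an added assumption to keep it invertible, since the learned $\hat{D}$ may be rank-deficient):

```python
import numpy as np

def transfer_to_new_task(X_new, y_new, D_hat, gamma=0.1, eps=1e-6):
    """Learn a new task using the representation D_hat learned on the old tasks.

    Square-loss sketch: minimize ||X w - y||^2 + gamma * <w, D_hat^{-1} w>,
    whose closed-form solution is (X'X + gamma * D_hat^{-1})^{-1} X'y.
    """
    d = X_new.shape[1]
    D_inv = np.linalg.inv(D_hat + eps * np.eye(d))          # regularized inverse
    return np.linalg.solve(X_new.T @ X_new + gamma * D_inv, X_new.T @ y_new)
```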
Alternating Minimization Algorithm
• Alternating minimization over $W$ and $D$

Initialization: given initial $D$, e.g. $D = \frac{1}{d} I_d$
while convergence condition is not true do
    for $t = 1, \ldots, n$
        learn $w_t$ independently by minimizing $\sum_{i=1}^{m} E(\langle w, x_{ti} \rangle, y_{ti}) + \gamma \langle w, D^{-1} w \rangle$
    end for
    set $D = \dfrac{(W W^\top)^{\frac{1}{2}}}{\mathrm{tr}\big((W W^\top)^{\frac{1}{2}}\big)}$
end while
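A minimal NumPy sketch of the alternating scheme, specialized to the square loss so that the $w$-step has a closed form. The eps perturbation of the matrix square root is an assumption added here to keep $D$ invertible; all function and variable names are illustrative:

```python
import numpy as np

def alternating_mtl(X, Y, gamma=0.1, eps=1e-6, n_iter=50):
    """Alternating minimization for (MTL) with the square loss (a sketch).

    X: list of (m_t, d) arrays, Y: list of (m_t,) arrays, one pair per task.
    Returns W (d x n) and D (d x d).
    """
    n, d = len(X), X[0].shape[1]
    D = np.eye(d) / d                       # initialization: D = I_d / d
    W = np.zeros((d, n))
    for _ in range(n_iter):
        D_inv = np.linalg.inv(D)
        # w-step: each task is an independent generalized ridge regression
        for t in range(n):
            A = X[t].T @ X[t] + gamma * D_inv
            W[:, t] = np.linalg.solve(A, X[t].T @ Y[t])
        # D-step: D = (W W^T)^{1/2} / tr((W W^T)^{1/2})
        U, s, _ = np.linalg.svd(W, full_matrices=True)
        s_full = np.concatenate([s, np.zeros(d - len(s))])
        sqrt_WWt = (U * s_full) @ U.T + eps * np.eye(d)
        D = sqrt_WWt / np.trace(sqrt_WWt)
    return W, D
```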
Alternating Minimization (contd.)
[Plots: objective function vs. number of iterations, and running time (seconds) vs. number of tasks, comparing the alternating algorithm with gradient descent on $W$ for learning rates $\eta = 0.01, 0.03, 0.05$]
• Compare computational cost with a gradient descent on $W$ only ($\eta$ := learning rate)
Alternating Minimization (contd.)
• Small number of iterations (typically fewer than 50 in experiments)
• Alternative algorithms: singular value thresholding [Cai et al. 2008], Bregman-type gradient descent [Ma et al. 2009], etc.
• Non-SVD alternatives like [Rennie & Srebro 2005, Maurer 2007] or SOCP methods [Srebro et al. 2005, Liu and Vandenberghe 2008]
Trace Norm Regularization
• Problem $(MTL)$ is equivalent to
$$\min_{W \in \mathbb{R}^{d \times n}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \|W\|_{\mathrm{tr}}^2 \qquad (TR)$$
• The trace norm (or nuclear norm) $\|W\|_{\mathrm{tr}}$ is the sum of the singular values of $W$: if $W = U \Sigma V^\top$, then
$$\|W\|_{\mathrm{tr}} = \sum_i \sigma_i(W)$$
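A small numerical sketch of the trace norm via the SVD (NumPy also exposes it directly as the nuclear norm):

```python
import numpy as np

def trace_norm(W):
    """Trace (nuclear) norm: the sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

W = np.array([[3.0, 0.0], [0.0, 4.0], [0.0, 0.0]])
print(trace_norm(W))                        # 7.0
print(np.linalg.norm(W, ord='nuc'))         # same value via NumPy's built-in
```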
Trace Norm vs. Rank
• Problem $(TR)$ is a convex relaxation of the problem
$$\min_{W \in \mathbb{R}^{d \times n}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma\, \mathrm{rank}(W)$$
• The rank-regularized problem is NP-hard
• Rank and trace norm correspond to $L_0$ and $L_1$ on the vector of singular values
• Hence one (qualified) interpretation: we want the task parameter vectors $w_t$ to lie on a low dimensional subspace
Machine Learning Interpretations
• Learning a common linear kernel for all tasks (discussed already)
• Maximum likelihood (learning a Gaussian covariance with fixed trace)
• Matrix factorization (see the numerical check after this list):
$$\|W\|_{\mathrm{tr}} = \tfrac{1}{2} \min_{F^\top G = W} \left( \|F\|_{Fr}^2 + \|G\|_{Fr}^2 \right)$$
• MAP in a graphical model (as above)
• Gaussian process interpretation
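A quick numerical check of the factorization identity (a sketch, not a proof): the SVD-based factorization $F = \Sigma^{1/2} U^\top$, $G = \Sigma^{1/2} V^\top$ satisfies $F^\top G = W$ and attains the value $\|W\|_{\mathrm{tr}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# One feasible factorization W = F^T G built from the SVD:
# F = Sigma^{1/2} U^T, G = Sigma^{1/2} V^T, so F^T G = U Sigma V^T = W.
F = np.sqrt(s)[:, None] * U.T
G = np.sqrt(s)[:, None] * Vt
assert np.allclose(F.T @ G, W)

trace_norm = s.sum()
bound = 0.5 * (np.linalg.norm(F, 'fro')**2 + np.linalg.norm(G, 'fro')**2)
print(trace_norm, bound)    # equal: this factorization attains the minimum
```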
“Rotation invariant” Group Lasso
• Problem $(MTL)$ is equivalent to
$$\min_{\substack{A \in \mathbb{R}^{d \times n},\ U \in \mathbb{R}^{d \times d} \\ U^\top U = I}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle a_t, U^\top x_{ti} \rangle, y_{ti}) + \gamma \|A\|_{2,1}^2$$
where $\|A\|_{2,1} := \sum_{i=1}^{d} \sqrt{\sum_{t=1}^{n} a_{it}^2}$
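A one-function sketch of the $(2,1)$-norm: the L2 norm of each row of $A$ (one row per feature, across all tasks), summed over rows, so penalizing it drives entire rows to zero, i.e. selects few shared features:

```python
import numpy as np

def norm_2_1(A):
    """||A||_{2,1}: L2 norm of each row of A (a feature across all tasks), summed."""
    return np.sqrt((A ** 2).sum(axis=1)).sum()
```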
Experiment (Computer Survey)
• Consumers' ratings of products [Lenk et al. 1996]
• 180 persons (tasks)
• 8 PC models (training examples)
• 13 binary input variables (RAM, CPU, price etc.) + bias term
• Integer output in $\{0, \ldots, 10\}$ (likelihood of purchase)
• The square loss was used
Experiment (Computer Survey)

Method                              RMSE
Alternating Alg.                    1.93
Hierarchical Bayes [Lenk et al.]    1.90
Independent                         3.88
Aggregate                           2.35
Group Lasso                         2.01

[Bar plot: entries of the most important feature $u_1$ across the input variables TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, PR]
• The most important feature (eigenvector of $D$) weighs technical characteristics (RAM, CPU, CD-ROM) vs. price
Generalizations: Spectral Regularization
• Generalize $(MTL)$:
$$\inf_{W \in \mathbb{R}^{d \times n}} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \|W\|_p^2$$
where $\|W\|_p$ is the Schatten $L_p$ norm, i.e. the $L_p$ norm of the vector of singular values of $W$ (see the sketch below)
• $L_1$ - $L_2$ trade-off
• Can be generalized to a family of spectral functions
• A similar alternating algorithm can be used
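A small sketch of the Schatten $L_p$ norm (illustrative function name):

```python
import numpy as np

def schatten_norm(W, p):
    """Schatten L_p norm: the L_p norm of the vector of singular values of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

# p = 1 recovers the trace norm, p = 2 the Frobenius norm,
# and large p approaches the spectral (operator) norm.
```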
Generalizations: Learning Groups of Tasks
• Assume a heterogeneous environment, i.e. $K$ low dimensional subspaces
• Learn a partition of the tasks into $K$ groups:
$$\inf_{\substack{D_1, \ldots, D_K \succ 0 \\ \mathrm{tr}(D_k) \le 1}} \ \sum_{t=1}^{n} \ \min_{k=1,\ldots,K} \ \min_{w_t \in \mathbb{R}^d} \left[ \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \gamma \langle w_t, D_k^{-1} w_t \rangle \right]$$
• The representation learned is $(\hat{D}_1, \ldots, \hat{D}_K)$; we can transfer this representation to easily learn a new task
• Non-convex problem; we use stochastic gradient descent
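To picture the structure of the objective, the following square-loss sketch evaluates the inner minimization for one task under each candidate group metric $D_k$ and picks the best group. This is only an illustration of the objective, not the stochastic gradient method actually used in the slides; all names and the eps ridge are hypothetical:

```python
import numpy as np

def assign_task_to_group(X_t, y_t, D_list, gamma=0.1, eps=1e-6):
    """Evaluate min over k and w of the regularized loss for a single task
    (square loss) and return the best (group index, weight vector, objective)."""
    best = (None, None, np.inf)
    d = X_t.shape[1]
    for k, D_k in enumerate(D_list):
        D_inv = np.linalg.inv(D_k + eps * np.eye(d))
        w = np.linalg.solve(X_t.T @ X_t + gamma * D_inv, X_t.T @ y_t)
        obj = ((X_t @ w - y_t) ** 2).sum() + gamma * w @ D_inv @ w
        if obj < best[2]:
            best = (k, w, obj)
    return best
```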
Nonlinear Kernels
• An important note: all methods presented satisfy a multi-task representer theorem (a type of necessary optimality condition)
• This fact enables "kernelization", i.e. we may use a given kernel (e.g. polynomial, RBF) via its Gram matrix
• We now expand on this observation
Representer Theorems
• Consider any learning problem of the form
$$\min_{w \in \mathbb{R}^d} \ \sum_{i=1}^{m} E(\langle w, x_i \rangle, y_i) + \Omega(w)$$
• This problem can be "kernelized" if $\Omega$ satisfies the "classical" representer theorem
$$\hat{w} = \sum_{i=1}^{m} c_i x_i$$
(a necessary but not sufficient optimality condition)
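For the square loss with $\Omega(w) = \gamma \langle w, w \rangle$ (ridge regression), the representer theorem can be checked numerically: the solution computed from the Gram matrix $K = X X^\top$ alone coincides with the primal solution, which is exactly what makes kernelization possible. A minimal sketch with toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, gamma = 20, 5, 0.5
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

# Primal ridge solution for the square loss with Omega(w) = gamma * <w, w>.
w_primal = np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ y)

# Representer theorem: w = sum_i c_i x_i, with c computed from the Gram matrix
# K = X X^T only -- so a nonlinear kernel could be substituted for K.
K = X @ X.T
c = np.linalg.solve(K + gamma * np.eye(m), y)
w_dual = X.T @ c

print(np.allclose(w_primal, w_dual))    # True: the solution lies in span{x_i}
```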
Representer Theorems (contd.)
Theorem. The "classical" representer theorem for single-task learning holds if and only if there exists a nondecreasing function $h : \mathbb{R}_+ \to \mathbb{R}$ such that
$$\Omega(w) = h(\langle w, w \rangle) \quad \forall w \in \mathbb{R}^d$$
(under differentiability assumptions)
• Sufficiency of the condition was known [Kimeldorf & Wahba 1970, Schölkopf et al. 2001, etc.]
Representer Theorems (contd.)
• Sketch of the proof: an equivalent condition is
$$\Omega(w + p) \ge \Omega(w) \quad \text{for all } w, p \text{ such that } \langle w, p \rangle = 0$$
[Figure: the vectors $w$ and $w + p$, with $p$ orthogonal to $w$]
Multi-Task Representer Theorems
• Our multi-task formulations satisfy a multi-task representer theorem:
$$\hat{w}_t = \sum_{s=1}^{n} \sum_{i=1}^{m} c^{(t)}_{si} x_{si} \quad \forall t \in \{1, \ldots, n\} \qquad (R.T.)$$
• All tasks are involved in this expression (unlike the single-task representer theorem $\Leftrightarrow$ Frobenius norm regularization)
• Generally, consider any matrix optimization problem of the form
$$\min_{w_1, \ldots, w_n \in \mathbb{R}^d} \ \sum_{t=1}^{n} \sum_{i=1}^{m} E(\langle w_t, x_{ti} \rangle, y_{ti}) + \Omega(W)$$
Multi-Task Representer Theorems (contd.)
• Definitions: $S^n_+$ = the positive semidefinite cone. The function $h : S^n_+ \to \mathbb{R}$ is matrix nondecreasing if $h(A) \le h(B)$ for all $A, B \in S^n_+$ such that $A \preceq B$
Theorem. The representer theorem $(R.T.)$ holds if and only if there exists a matrix nondecreasing function $h : S^n_+ \to \mathbb{R}$ such that
$$\Omega(W) = h(W^\top W) \quad \forall W \in \mathbb{R}^{d \times n}$$
(under differentiability assumptions)
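As a concrete instance (a sketch with illustrative names): the trace norm satisfies the condition of the theorem, since $\|W\|_{\mathrm{tr}} = \mathrm{tr}\big((W^\top W)^{1/2}\big)$, i.e. $\Omega(W) = h(W^\top W)$ with $h(A) = \mathrm{tr}(A^{1/2})$, which is matrix nondecreasing:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))

# The trace norm depends on W only through W^T W:
# ||W||_tr = tr((W^T W)^{1/2}), so Omega(W) = h(W^T W) with h(A) = tr(A^{1/2}).
eigvals = np.linalg.eigvalsh(W.T @ W)
h_of_WtW = np.sqrt(np.clip(eigvals, 0, None)).sum()
trace_norm = np.linalg.svd(W, compute_uv=False).sum()
print(np.isclose(h_of_WtW, trace_norm))    # True
```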
Implications
• The theorem tells us when a matrix learning problem can be "kernelized"
• In single-task learning, the choice of $h$ essentially does not matter
• However, in multi-task learning, the choice of $h$ is important (since $\preceq$ is a partial ordering)
• Many valid regularizers: Schatten $L_p$ norms $\|\cdot\|_p$, rank, orthogonally invariant norms, norms of the form $W \mapsto \|WM\|_p$, etc.