Self-reflective Multi-task Gaussian Process

Kohei Hayashi 1, Takashi Takenouchi 1, Ryota Tomioka 2, Hisashi Kashima 2
1 Graduate School of Information Science, Nara Institute of Science and Technology
2 Department of Mathematical Informatics, The University of Tokyo

July 2nd, 2011
Multi-task learning: problem definition

• Tasks and data points are correlated.
• Goal: predict the missing responses of the data matrix X from the observed responses and additional information.
Gaussian process for multi-task learning

Idea: capture the correlations by measuring similarities between the responses.

Multi-task GP [Bonilla+ AISTATS'07][Yu+ NIPS'07] separately measures task/data-point similarity by using additional information.
Challenges

1 Good similarity measurement
  • additional information may not be enough to capture the correlations ⇒ inaccurate prediction
2 Computational complexity
  • inverting the Gram matrix is not practical for large-scale datasets
Our contributions

Propose a new framework for multi-task learning:
• Self-measuring similarities
  • measures similarities by the observed responses themselves
• Efficient, exact learning algorithm
  • ∼10 min for a 1000 × 1000 matrix

Apply it to a recommender system:
• Outperforms existing methods
Model
Simple linear model

Consider a linear Gaussian model:

x_ik = w^⊤ ξ_ik + ε_ik,  (i, k) ∈ I

• w ∈ R^K: weight parameter
• ξ_ik ∈ R^K: latent feature vector of x_ik
• ε_ik ∼ N(0, σ²): observation noise
• I: index set of the observed elements
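A minimal NumPy sketch of sampling from this linear Gaussian model; the values of K, the number of observations, and the noise variance below are illustrative assumptions, not values from the slides:

```python
# Sample x_ik = w^T xi_ik + eps_ik for M observed entries.
# K, M, and sigma2 are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
K, M, sigma2 = 5, 100, 0.1
w = rng.standard_normal(K)                 # weight parameter w
Xi = rng.standard_normal((M, K))           # latent feature vectors xi_ik
eps = rng.normal(0.0, np.sqrt(sigma2), M)  # noise eps_ik ~ N(0, sigma^2)
x = Xi @ w + eps                           # observed responses x_ik
```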
Bilinear assumption

Assume that ξ_ik decomposes into ψ_i and φ_k:

w^⊤ ξ_ik = w^⊤ (φ_k ⊗ ψ_i) = ψ_i^⊤ W φ_k

• ψ_i ∈ R^K1: i-th row-dependent feature
• φ_k ∈ R^K2: k-th column-dependent feature
• W ∈ R^K1×K2: weight parameter (vec W = w)
• K = K1 K2
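A quick NumPy check of the bilinear identity above, ψ_i^⊤ W φ_k = w^⊤ (φ_k ⊗ ψ_i) with w = vec W; the dimensions and the column-major vec convention are illustrative assumptions:

```python
# Verify psi^T W phi == vec(W)^T (phi kron psi).
import numpy as np

rng = np.random.default_rng(0)
K1, K2 = 3, 4
psi = rng.standard_normal(K1)        # row-dependent feature psi_i
phi = rng.standard_normal(K2)        # column-dependent feature phi_k
W = rng.standard_normal((K1, K2))    # weight matrix

lhs = psi @ W @ phi
w = W.flatten(order="F")             # column-major vec(W)
rhs = w @ np.kron(phi, psi)          # w^T (phi kron psi)
assert np.isclose(lhs, rhs)
```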
Now ψ_i and φ_k are given by feature functions:

ψ_i = ψ(x_i:),  φ_k = φ(x_:k)

• x_i: ∈ R^D1: i-th row vector of X
• x_:k ∈ R^D2: k-th column vector of X
Kernel representation

x̂_ik^pred = w^⊤ (φ(x_:k) ⊗ ψ(x_i:))  (primal)
          = Σ_{(j,l)∈I} β_jl k({x_i:, x_:k}, {x_j:, x_:l})  (dual)

Self-measuring kernel (similarity):

k({x_i:, x_:k}, {x_j:, x_:l}) = k_ψ(x_i:, x_j:) k_φ(x_:k, x_:l)
                              = ⟨ψ(x_i:), ψ(x_j:)⟩ ⟨φ(x_:k), φ(x_:l)⟩

i.e. sim(x_ik, x_jl) = sim(x_i:, x_j:) × sim(x_:k, x_:l)
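A sketch of the self-measuring kernel for a fully observed X (the missing-value case is handled on the next slide); the RBF base kernel and all names here are assumptions, since the slides leave the base kernels abstract until the experiment section:

```python
# Entry-level similarity as a product of row and column similarities,
# both computed from the response matrix X itself.
import numpy as np

def rbf(a, b, lam=1.0):
    """RBF kernel k(a, b) = exp(-lam * ||a - b||^2)."""
    return np.exp(-lam * np.sum((a - b) ** 2))

def self_measuring_kernel(X, i, k, j, l, lam=1.0):
    """k({x_i:, x_:k}, {x_j:, x_:l}) = k_psi(x_i:, x_j:) * k_phi(x_:k, x_:l)."""
    k_psi = rbf(X[i, :], X[j, :], lam)   # row similarity
    k_phi = rbf(X[:, k], X[:, l], lam)   # column similarity
    return k_psi * k_phi
```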
Latent variables for missing values

x̂_ik^pred = Σ_{(j,l)∈I} β_jl k({x_i:, x_:k}, {x_j:, x_:l})

How to compute k(·, ·) with missing values?
• introduce latent variables Z:
  x_i: = (1, ·, 3, 4, ·)^⊤ ⇒ (1, z_i2, 3, 4, z_i5)^⊤

EM-like iterative estimation (sketched below):
• initialize Z^0 by the data mean
• estimate z_ik^t = x̂_ik^pred(Z^{t−1}) for t = 1, 2, ...
• early stopping with a validation set
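A sketch of this EM-like loop, assuming a `predict` function that maps a completed matrix Z to model predictions (e.g. built from the kernel above); the masks, names, and stopping-rule details are illustrative:

```python
import numpy as np

def impute(X, train_mask, val_mask, predict, max_iter=20):
    """Fill missing entries of X by iterating the model's own predictions."""
    Z = X.copy()
    Z[~train_mask] = X[train_mask].mean()      # initialize Z^0 by the data mean
    best_rmse, best_Z = np.inf, Z.copy()
    for t in range(max_iter):
        X_pred = predict(Z)                    # x_pred computed from Z^{t-1}
        Z[~train_mask] = X_pred[~train_mask]   # re-estimate latent entries z^t
        rmse = np.sqrt(np.mean((X_pred[val_mask] - X[val_mask]) ** 2))
        if rmse >= best_rmse:                  # early stopping on validation set
            return best_Z
        best_rmse, best_Z = rmse, Z.copy()
    return Z
```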
Use of additional information

We can exploit additional information S = (s_1, ..., s_D1) and T = (t_1, ..., t_D2) by combining it with the self-measuring similarity, e.g.

k_ψ(·, ·) = k(x_i:, x_j:) k(s_i, s_j),  k_φ(·, ·) = k(x_:k, x_:l) k(t_k, t_l)
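A sketch of these combined kernels, reusing the `rbf` helper from the earlier sketch; treating s_i and t_k as plain feature vectors is an assumption:

```python
# Row/column kernels that multiply self-measuring similarity with
# kernels on the additional row features S and column features T.
def k_psi(X, S, i, j, lam=1.0):
    return rbf(X[i, :], X[j, :], lam) * rbf(S[i], S[j], lam)

def k_phi(X, T, k, l, lam=1.0):
    return rbf(X[:, k], X[:, l], lam) * rbf(T[k], T[l], lam)
```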
Optimization
Strategy

L2-norm regularized least-squares solution: β̂ = K^{-1} x_I

• K = Ω ⊗ Σ + σ² I: Gram matrix
• x_I ∈ R^M: observed elements of X
• M = |I|: # observations

Naïve approach: compute K^{-1} directly
• O(M³) time and O(M²) space: too expensive

Instead, solve x_I = K β by conjugate gradient with the vec-trick:
• O(M²) time and O(M) space
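A sketch of the memory-light solve, shown for the fully observed case (restricting to the observed index set I is omitted): the vec-trick turns the Kronecker matrix-vector product into two small matrix products, and conjugate gradient only ever sees that matvec. The sizes, the SciPy usage, and the fully observed simplification are assumptions:

```python
# Solve (Omega kron Sigma + sigma^2 I) beta = x with CG, never forming K.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, m, sigma2 = 50, 40, 0.1
A = rng.standard_normal((n, n)); Sigma = A @ A.T + np.eye(n)  # row Gram matrix
B = rng.standard_normal((m, m)); Omega = B @ B.T + np.eye(m)  # column Gram matrix

def matvec(v):
    # vec-trick: (Omega kron Sigma) vec(V) = vec(Sigma V Omega^T)
    V = v.reshape((n, m), order="F")
    return (Sigma @ V @ Omega.T + sigma2 * V).ravel(order="F")

K = LinearOperator((n * m, n * m), matvec=matvec)
x = rng.standard_normal(n * m)     # stands in for the observed responses x_I
beta, info = cg(K, x)              # solve x = K beta
assert info == 0                   # 0 means CG converged
```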
Experiment (updated)
Dataset

MovieLens 100k data
• 1682 movies × 943 users
• x_ik ∈ {1, ..., 5}: rating of the i-th movie by the k-th user
• # observations: 100,000
  • 86,040 for training
  • 4,530 for validation (early stopping)
  • 9,430 for testing
• additional information
  • user-specific features: age, gender, ...
  • movie-specific features: genre, release date, ...
Settings

• RBF kernel: k(x, x′) = exp(−λ ||x − x′||²)
• hyper-parameters {σ², λ}: 3-fold CV

Method          k_ψ                        k_φ
Multi-task GP   k(s_i, s_j)                k(t_k, t_l)
Self-measuring  k(x_i:, x_j:)              k(x_:k, x_:l)
Product         k(x_i:, x_j:) k(s_i, s_j)  k(x_:k, x_:l) k(t_k, t_l)
Results

Method                RMSE     time
Matrix Factorization  0.9345   1 m 38 s
Multi-task GP         1.0517   7 m 01 s
Self-measuring        0.9308   16 m 22 s
Product               0.9256   18 m 25 s

• The best score on http://mlcomp.org/datasets/341
Conclusion

1 Proposed a kernel-based method for multi-task learning problems
  • self-measuring similarity
  • efficient algorithm using the CG method
2 Applied it to a recommender system
  • outperformed existing methods on the MovieLens 100k dataset

Questions?