Multi-Task Learning: Models, Optimization and Applications
Linli Xu
University of Science and Technology of China
Outline
• Introduction to multi-task learning (MTL): problem and models
• Multi-task learning with task-feature co-clusters
• Low-rank optimization in multi-task learning
• Multi-task learning applied to trajectory regression
Multiple Tasks
Examination Score Prediction¹ (Argyriou et al. '08)

School 1 - Alverno High School
  Student id | Birth year | Previous score | School ranking | … | Exam score
  72981      | 1985       | 95             | 83%            | … | ?

School 138 - Jefferson Intermediate School
  Student id | Birth year | Previous score | School ranking | … | Exam score
  31256      | 1986       | 87             | 72%            | … | ?

School 139 - Rosemead High School
  Student id | Birth year | Previous score | School ranking | … | Exam score
  12381      | 1986       | 83             | 77%            | … | ?

Some features are student-dependent (birth year, previous score), others are school-dependent (school ranking).

¹ Data from the Inner London Education Authority (ILEA)
Learning Multiple Tasks
Learning each task independently: one model per school.

School 1 - Alverno High School (1st task)
  Student id | Birth year | Previous score | School ranking | … | Exam score
  72981      | 1985       | 95             | 83%            | … | ? → Excellent

School 138 - Jefferson Intermediate School (138th task)
  Student id | Birth year | Previous score | School ranking | … | Exam score
  31256      | 1986       | 87             | 72%            | … | ? → Excellent

School 139 - Rosemead High School (139th task)
  Student id | Birth year | Previous score | School ranking | … | Exam score
  12381      | 1986       | 83             | 77%            | … | ? → Excellent

Each model predicts its own school's exam scores (e.g., "Excellent") from that school's data alone.
Learning Multiple Tasks
Learning multiple tasks simultaneously.

School 1 - Alverno High School (1st task)
  Student id | Birth year | Previous score | School ranking | … | Exam score
  72981      | 1985       | 95             | 83%            | … | ?

School 138 - Jefferson Intermediate School (138th task)
  Student id | Birth year | Previous score | School ranking | … | Exam score
  31256      | 1986       | 87             | 72%            | … | ?

School 139 - Rosemead High School (139th task)
  Student id | Birth year | Previous score | School ranking | … | Exam score
  12381      | 1986       | 83             | 77%            | … | ?

Learn the tasks simultaneously and model the task relationships.
Multi-Task Learning
• Different from single-task learning

Single-Task Learning: each task is trained independently on its own data.
  Task i Training Data → Training → Model i   (for each task i = 1, …, m)

Multi-Task Learning: multiple tasks are trained simultaneously to exploit task relationships.
  Task 1 … Task m Training Data → Joint Training → Model 1 … Model m
Exploiting Task Relationships
Key challenge in multi-task learning: exploiting the (statistical) relationships between tasks so as to improve individual and/or overall predictive accuracy, compared to training individual models.
How Are Tasks Related?
• All tasks are related
  – Models of all tasks are close to each other
  – Models of all tasks share a common set of features
  – Models share the same low-rank subspace
• Structure in tasks: clusters / graphs / trees
• Learning with outlier tasks
Regularization-based Multi-Task Learning

Setup: m tasks; task i has a feature matrix X_i ∈ R^{n_i×d} (n_i samples, d feature dimensions) and a target vector Y_i ∈ R^{n_i×1}. Learning produces a model matrix W.

We focus on linear models: Y_i ≈ X_i w_i, with W = [w_1, w_2, …, w_m] ∈ R^{d×m}.

Generic framework:
  min_W Σ_i Loss(W, X_i, Y_i) + λ Reg(W)

Various types of relations among the tasks are imposed through the regularizer Reg(W).
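A minimal numpy sketch of this generic objective, assuming a squared loss; the function name and data layout are illustrative, not from the slides:

```python
import numpy as np

def mtl_objective(W, Xs, Ys, reg, lam):
    """Generic regularized MTL objective:
    sum_i Loss(W, X_i, Y_i) + lam * Reg(W).

    W   : (d, m) model matrix, one column w_i per task
    Xs  : list of m arrays, X_i with shape (n_i, d)
    Ys  : list of m arrays, Y_i with shape (n_i,)
    reg : callable implementing Reg(W)
    """
    loss = sum(np.sum((X @ W[:, i] - y) ** 2)
               for i, (X, y) in enumerate(zip(Xs, Ys)))
    return loss + lam * reg(W)
```

The regularizers on the following slides can each be plugged in as the reg argument.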
How Are Tasks Related?
• All tasks are related
  – Models of all tasks are close to each other
  – Models of all tasks share a common set of features
  – Models share the same low-rank subspace
• Structure in tasks: clusters / graphs / trees
• Learning with outlier tasks
MTL Methods: Mean-Regularized MTL (Evgeniou & Pontil, 2004 KDD)

Assumption: the model parameters of all tasks are close to each other.
  – Advantage: simple, intuitive, easy to implement
  – Disadvantage: the assumption is often too simple

Regularization: penalize the deviation of each task from the mean model
  min_W Loss(W) + λ Σ_{i=1}^m || w_i − (1/m) Σ_{s=1}^m w_s ||_2^2
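As a sketch, the penalty takes a few lines and can serve as the reg argument of the hypothetical mtl_objective above:

```python
import numpy as np

def mean_reg(W):
    """Mean-regularized MTL penalty:
    sum_i || w_i - (1/m) sum_s w_s ||_2^2, with tasks as columns of W."""
    w_bar = W.mean(axis=1, keepdims=True)  # (d, 1) mean model across tasks
    return np.sum((W - w_bar) ** 2)
```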
MTL Methods: Joint Feature Learning (Argyriou et al. 2006 NIPS; Obozinski et al. 2009 Stat Comput; Liu et al. 2010 Technical Report)

Assumption: the models of all tasks share a common set of features.
  – Use group sparsity: ℓ_{1,q}-norm regularization
  – ||W||_{1,q} = Σ_{j=1}^d ||w^j||_q, where w^j is the j-th row of W (the weights of feature j across tasks 1, …, m)
  – When q > 1 this gives group sparsity: entire rows of W are driven to zero, so each feature is kept or discarded jointly for all tasks

  min_W Loss(W) + λ ||W||_{1,q}
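A sketch of the norm and, for q = 2, its proximal operator (row-wise group soft-thresholding), the usual building block of proximal-gradient solvers for this formulation:

```python
import numpy as np

def l1q_norm(W, q=2):
    """||W||_{1,q}: sum over rows (features) of the l_q norm across tasks."""
    return np.sum(np.linalg.norm(W, ord=q, axis=1))

def prox_l12(W, t):
    """Prox of t * ||.||_{1,2}: shrink each row's norm by t, zeroing
    rows with norm below t, i.e. dropping those features for all tasks."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)       # (d, 1) row norms
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return W * scale
```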
MTL Methods: Low-Rank MTL (Ji et al. 2009 ICML)

Assumption: in a high-dimensional feature space, the linear models of all tasks share the same low-rank subspace.

Regularization: rank minimization
  min_W Loss(W) + λ · rank(W)
  – Rank minimization is NP-hard for general loss functions
• Convex relaxation: nuclear norm minimization
  min_W Loss(W) + λ ||W||_*   (||W||_*: the sum of the singular values of W)
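A sketch of the nuclear norm and its proximal operator (singular value thresholding), again assuming a proximal-gradient style solver:

```python
import numpy as np

def nuclear_norm(W):
    """||W||_*: sum of singular values."""
    return np.linalg.svd(W, compute_uv=False).sum()

def prox_nuclear(W, t):
    """Singular value thresholding, the prox of t * ||.||_*:
    shrink each singular value by t and zero the small ones,
    which pushes W toward low rank."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt
```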
How Are Tasks Related?
• All tasks are related
  – Models of all tasks are close to each other
  – Models of all tasks share a common set of features
  – Models share the same low-rank subspace
• Structure in tasks: clusters / graphs / trees
• Learning with outlier tasks
MTL Methods: Clustered MTL (Zhou et al. 2011 NIPS)

Assumption: the tasks have a cluster structure: the models of tasks in the same group are closer to each other than to models of tasks in a different group.

Regularization: capture the clustered structure

  min_{W, F : F^T F = I_k} Loss(W) + α [ tr(W^T W) − tr(F^T W^T W F) ] + β tr(W^T W)

where F ∈ R^{m×k} is a relaxed indicator of the assignment of the m tasks to k clusters. The first penalty captures the cluster structure; the second improves generalization performance.
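A sketch of evaluating this regularizer for a fixed relaxed assignment F; in the actual method W and F are optimized jointly (e.g., by alternating), which the snippet does not attempt:

```python
import numpy as np

def clustered_penalty(W, F, alpha, beta):
    """Clustered MTL penalty:
    alpha * (tr(W^T W) - tr(F^T W^T W F)) + beta * tr(W^T W),
    with F an (m, k) relaxed cluster indicator satisfying F^T F = I_k."""
    M = W.T @ W  # (m, m) matrix of task inner products
    return alpha * (np.trace(M) - np.trace(F.T @ M @ F)) + beta * np.trace(M)
```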
Regularization-based MTL: Decomposition Framework
• In practice it is too restrictive to constrain all tasks to a single shared structure.
• Assumption: the model is the sum of two components, W = P + Q
  – A shared low-dimensional subspace plus a task-specific component (Ando and Zhang, 2005, JMLR)
  – A group-sparse component plus a task-specific sparse component (Jalali et al., 2010, NIPS)
  – A low-rank structure among relevant tasks plus outlier tasks (Gong et al., 2011, KDD)
How Are Tasks Related?
• All tasks are related
  – Models of all tasks are close to each other
  – Models of all tasks share a common set of features
  – Models share the same low-rank subspace
• Structure in tasks: clusters / graphs / trees
• Learning with outlier tasks
MTL Methods: Robust MTL (Chen et al. 2011 KDD)

Assumption: the models share the same low-rank subspace, plus a few outlier tasks: W = P + Q, where P is low-rank and Q has nonzero columns only for the outlier tasks (rows index features, columns index tasks).

Regularization
  – ||P||_*: nuclear norm (low rank)
  – ||Q||_{2,1} = Σ_{j=1}^m ||q_{:,j}||_2 (column-sparse)

  min_{P,Q} Loss(P + Q) + α ||P||_* + β ||Q||_{2,1}
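A sketch of the column-wise ℓ_{2,1} proximal step, which zeroes out whole task columns; the nuclear-norm prox from the low-rank slide would handle the P component in an alternating proximal scheme:

```python
import numpy as np

def prox_col_l21(Q, t):
    """Prox of t * ||.||_{2,1} over columns: group soft-thresholding.
    Columns with norm below t are zeroed; the surviving nonzero
    columns of Q mark the outlier tasks."""
    norms = np.linalg.norm(Q, axis=0, keepdims=True)  # (1, m) column norms
    return Q * np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
```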
Summary So Far…
• All of the multi-task learning formulations discussed above fit the W = P + Q schema:
  – Component P: the shared structure
  – Component Q: information not captured by the shared structure
Outline
• Introduction to multi-task learning (MTL): problem and models
• Multi-task learning with task-feature co-clusters
• Low-rank optimization in multi-task learning
• Multi-task learning applied to trajectory regression
Recap: How Are Tasks Related?
• All tasks are related
  – Models of all tasks are close to each other
  – Models of all tasks share a common set of features
  – Models share the same low-rank subspace
• Structure in tasks: clusters / graphs / trees
• Learning with outlier tasks
All of these assumptions operate at the task level.
How Tasks Are Related
• Existing methods consider structure at the level of whole tasks.
• This assumption is restrictive in practice:
  – In document classification, different tasks may be relevant to different sets of words
  – In a recommender system, two users with similar tastes on one subset of features may have totally different preferences on another subset
CoCMTL: MTL with Task-Feature Co-Clusters [Xu et al., AAAI 15]
• Motivation: feature-level groups; view tasks and features as the two sides of a bipartite graph and cluster them jointly
• Impose the task-feature co-clustering structure through Reg(W)
CoCMTL: Model
• Decomposition model: W = P + Q
  min_W Loss(W) + λ_1 Ω_1(P) + λ_2 Ω_2(Q)
CoCMTL: Model
• Decomposition model: W = P + Q
  min_W Loss(W) + λ_1 Ω_1(P) + λ_2 Ω_2(Q)
• Ω_2 is non-convex:
  Ω_2(Q) = Σ_{i=k+1}^{min(d,m)} σ_i^2(Q)
  the sum of the squared singular values of Q beyond the k-th, which vanishes when rank(Q) ≤ k
• Instantiated objective:
  min_W Loss(W) + λ_1 tr(P L P^T) + λ_2 Σ_{i=k+1}^{min(d,m)} σ_i^2(Q)
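A sketch of evaluating the non-convex Ω_2 term via the SVD; k, the target rank, is a model hyperparameter:

```python
import numpy as np

def omega2(Q, k):
    """Omega_2(Q) = sum of squared singular values beyond the k-th.
    Equals zero iff rank(Q) <= k, so it acts as a non-convex
    surrogate for a rank-k constraint."""
    s = np.linalg.svd(Q, compute_uv=False)  # singular values, descending
    return np.sum(s[k:] ** 2)
```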