ACL 2020
Multi-source Meta Transfer for Low Resource MCQA
Ming Yan (1), Hao Zhang (1,2), Di Jin (3), Joey Tianyi Zhou (1)
(1) IHPC, A*STAR, Singapore; (2) CSCE, NTU, Singapore; (3) CSAIL, MIT, USA
Background
Low resource MCQA: data size under 100K, with corpora from different domains.

Dataset     Size (K)   Corpus
SearchQA    140        Snippets
NewsQA      120        Newswire
SWAG        113.5      Scenario Text
HotpotQA    113        Wikipedia
SQuAD       108        Wikipedia
RACE        97.6       Exam
SemEval     13.9       Narrative Text
DREAM       6.1        Dialogue
MCTest      2.6        Story

These datasets span extractive/abstractive, multi-hop, and multiple-choice (MCQA) styles; the MCQA datasets sit at the low-resource end of the range.
How does meta learning work?
§ Low resource setting: transfer learning, multi-task learning.
§ Domain discrepancy: fine-tuning on the target domain.

Notation: L is the loss, θ the model parameters, initialized from a pretrained model; source tasks y_s ~ S, target tasks y_t ~ T.

Transfer learning: train f_θ on the source task, then fine-tune the trained model on the target task.
Multi-task learning: train f_θ jointly on source and target tasks with shared parameters.
Meta-learning: for each sampled task, run a feedforward/backpropagation pass (FF_L, BP_L) on its support set for fast adaptation,
    θ_i' = θ − α ∇_θ L_{T_i}(f_θ),
then run FF_m, BP_m on the query set and update the meta parameters,
    θ := θ − β ∇_θ Σ_{T_i ~ p(T)} L_{T_i}(f_{θ_i'}).
FF: feedforward; BP: backpropagation.
[Figure: FF/BP flows for transfer learning, multi-task learning, and meta-learning over Source 1–3 and the Target.]
[Finn et al. 2017] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017.
How does meta learning work?
Notation: θ is initialized from a pretrained model; L is the loss; for each task, a support set and a query set are sampled.
Source domains: Exam (4 choices), Dialogue (3 choices), Story (4 choices).
Meta-learning over tasks from these sources yields parameters θ that fast-adapt, with a few gradient steps, to new tasks — including tasks from the target domain, Narrative Text (2 choices).
Goal: learn a model that can generalize over the task distribution.
[Finn et al. 2017] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017.
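To make the fast-adaptation/meta-update split concrete, here is a minimal first-order sketch of one MAML-style step in PyTorch. This is not the authors' code: `model`, `loss_fn`, and the `.x`/`.y` batch fields are placeholders, and the second-order term through the inner update is dropped, as in first-order MAML.

```python
import torch

def maml_step(model, loss_fn, support_batch, query_batches, alpha=1e-3, beta=1e-5):
    # Keep a copy of the current meta parameters theta.
    theta = [p.detach().clone() for p in model.parameters()]

    # Inner loop (FF_L + BP_L): fast adaptation on the support set.
    loss = loss_fn(model(support_batch.x), support_batch.y)
    grads = torch.autograd.grad(loss, model.parameters())
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p -= alpha * g                      # theta_i' = theta - alpha * grad

    # Outer loop (FF_m + BP_m): evaluate the adapted parameters on the query
    # sets and move the original theta in that direction (first-order MAML).
    meta_loss = sum(loss_fn(model(b.x), b.y) for b in query_batches)
    meta_grads = torch.autograd.grad(meta_loss, model.parameters())
    with torch.no_grad():
        for p, p0, g in zip(model.parameters(), theta, meta_grads):
            p.copy_(p0 - beta * g)              # theta := theta - beta * meta grad
    return float(meta_loss)
```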
Multi-source Meta Transfer
Standard meta learning builds a meta model from tasks of a single source; Multi-source Meta Transfer (MMT) builds the MMT model from tasks of several sources — Dialogue (3 choices), Exam (4 choices), Story (4 choices), Scenario Text (4 choices) — together with the target.
§ Learn knowledge from multiple sources.
§ Reduce the discrepancy between sources and the target.
Multi-source Meta Transfer: Supervised MMT
[Figure: tasks from sources 1–4 and the target in the input space are mapped by the MMT model into a representation space where the MMT representation lies close to the target representation.]
§ Multi-source Meta Learning (MML): learn knowledge from multiple sources and learn a representation close to the target.
§ Multi-source Transfer Learning (MTL): fine-tune the meta model to the target.
How does MMT sample tasks?

Algorithm 1: The procedure of MMT
Input: task distribution over each source p_S(T), data distribution over the target p_T(D), backbone model f_θ, MMT learning rates α, β, γ
Output: optimized parameters θ
Initialize θ
while not done do
    for all sources S do
        Sample a batch of tasks T_i ~ p_S(T)
        for all T_i do
            Evaluate ∇_θ L_{T_i}(f_θ) with respect to k examples
            Compute the gradient for fast adaptation: θ_i' := θ − α ∇_θ L_{T_i}(f_θ)
        end
        Meta model update: θ := θ − β ∇_θ Σ_{T_i ~ p_S(T)} L_{T_i}(f_{θ_i'})
        Get a batch of target data D_j ~ p_T(D)
        for all D_j do
            Evaluate ∇_θ L_{D_j}(f_θ) with respect to k examples
            Gradient for target fine-tuning: θ := θ − γ ∇_θ L_{D_j}(f_θ)
        end
    end
    Get all batches of target data D_j ~ p_T(D)
    for all D_j do
        Evaluate with respect to the batch size
        Gradient for meta transfer learning: θ := θ − γ ∇_θ L_{D_j}(f_θ)
    end
end

[Figure: meta tasks are constructed by sampling question–answer examples from source 1, source 2, source 3, and the target.]
Multi-source Meta Transfer
Annotations on Algorithm 1:
§ MMT is agnostic to the backbone model f_θ.
§ For each source, the support tasks and query tasks are sampled from the same distribution p_S(T).
§ The learner θ_i' is updated on the support task (fast adaptation).
§ The meta model θ is updated on the query tasks.
§ The meta model θ is also updated on target data after each source (the MML stage, interleaving sources S1–S4 with the target).
§ Finally, the meta model is transferred to the target over all target batches (the MTL stage).
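A rough first-order sketch of this training loop, for illustration only: `source_task_samplers`, `target_loader`, and the `.support`/`.query`/`.x`/`.y` fields are assumed data structures, and the meta gradient is the first-order approximation rather than the exact second-order one.

```python
import copy
import torch

def mmt_train(model, loss_fn, source_task_samplers, target_loader,
              alpha=1e-3, beta=1e-5, gamma=1e-5, meta_steps=1000):
    for _ in range(meta_steps):                               # while not done
        # ---- Multi-source Meta Learning (MML) ----
        for sampler in source_task_samplers:                  # for all sources S
            meta_grads = [torch.zeros_like(p) for p in model.parameters()]
            for task in sampler.sample_tasks():               # T_i ~ p_S(T)
                learner = copy.deepcopy(model)                # learner theta_i'
                # Fast adaptation on the support set of T_i (learning rate alpha).
                loss = loss_fn(learner(task.support.x), task.support.y)
                grads = torch.autograd.grad(loss, learner.parameters())
                with torch.no_grad():
                    for p, g in zip(learner.parameters(), grads):
                        p -= alpha * g
                # Query loss of the adapted learner; accumulate its gradient as a
                # first-order estimate of the meta gradient.
                q_loss = loss_fn(learner(task.query.x), task.query.y)
                grads = torch.autograd.grad(q_loss, learner.parameters())
                for acc, g in zip(meta_grads, grads):
                    acc += g
            # Meta model update on the summed query losses (learning rate beta).
            with torch.no_grad():
                for p, g in zip(model.parameters(), meta_grads):
                    p -= beta * g
            # Target fine-tuning after each source (learning rate gamma).
            batch = next(iter(target_loader))
            loss = loss_fn(model(batch.x), batch.y)
            grads = torch.autograd.grad(loss, model.parameters())
            with torch.no_grad():
                for p, g in zip(model.parameters(), grads):
                    p -= gamma * g
        # ---- Multi-source Transfer Learning (MTL) ----
        # Transfer the meta model to the target using all target batches.
        for batch in target_loader:
            loss = loss_fn(model(batch.x), batch.y)
            grads = torch.autograd.grad(loss, model.parameters())
            with torch.no_grad():
                for p, g in zip(model.parameters(), grads):
                    p -= gamma * g
    return model
```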
Results
[Tables: performance of supervised MMT (including MCTest), performance of unsupervised MMT, and an MMT ablation study.]
How to select sources?
[Figure: t-SNE visualization of BERT features, 100 random samples per dataset, for the targets and sources; transferability matrix, with the test on SemEval 2018.]
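A hedged sketch of how such a diagnostic could be reproduced: project a few random samples of BERT features (e.g., the [CLS] vector) per dataset with t-SNE and inspect which sources cluster near the target. The `features_by_name` mapping and the feature-extraction step are assumptions, not the authors' pipeline.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_dataset_features(features_by_name, target_name, n_per_set=100, seed=0):
    # features_by_name: dataset name -> (N, hidden) array of sentence features,
    # e.g. BERT [CLS] vectors; how they are extracted is left to the user.
    rng = np.random.default_rng(seed)
    names, chunks = [], []
    for name, feats in features_by_name.items():
        idx = rng.choice(len(feats), size=min(n_per_set, len(feats)), replace=False)
        names.append(name)
        chunks.append(feats[idx])
    # Joint 2-D embedding of all sampled points.
    points = TSNE(n_components=2, random_state=seed).fit_transform(np.vstack(chunks))

    start = 0
    for name, chunk in zip(names, chunks):
        xy = points[start:start + len(chunk)]
        start += len(chunk)
        marker = "*" if name == target_name else "o"   # highlight the target set
        plt.scatter(xy[:, 0], xy[:, 1], s=14, marker=marker, label=name)
    plt.legend()
    plt.title("t-SNE of per-dataset features")
    plt.show()
```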
Takeaways
§ MMT extends meta learning to multiple sources for the MCQA task.
§ MMT provides an algorithm for both supervised and unsupervised meta training.
§ MMT gives a guideline for source selection.