Multi-source Meta Transfer for Low Resource MCQA




  1. ACL 2020. Multi-source Meta Transfer for Low Resource MCQA. Ming Yan¹, Hao Zhang¹·², Di Jin³, Joey Tianyi Zhou¹. ¹IHPC, A*STAR, Singapore; ²CSCE, NTU, Singapore; ³CSAIL, MIT, USA.

  2. Background. Low resource MCQA: data size under 100K, with corpora from different domains. Datasets shown in the chart (size in K examples): SEARCHQA 140 (Snippets), NEWSQA 120 (Newswire), SWAG 113.5 (Scenario Text), HOTPOTQA 113 (Wikipedia), SQUAD 108 (Wikipedia), RACE 97.6 (Exam), SEMEVAL 13.9 (Narrative Text), DREAM 6.1 (Dialogue), MCTEST 2.6 (Story). The chart also groups the datasets by QA type: extractive/abstractive, multi-hop, and MCQA.

  3. How does meta learning work?
     § Low resource setting: transfer learning, multi-task learning
     § Domain discrepancy: fine-tuning on the target domain
     Transfer learning: init w from a backbone model; train on source tasks x_s ~ X with feedforward (FF) and backpropagation (BP), y_s = model(w, x_s), updating w := w − α ∂L_s/∂w; then copy the model and fine-tune it on target tasks x_t ~ X. (L: cost function.)
     Meta-learning: sample support tasks x_s ~ X and query tasks x_q ~ X from sources 1–3; run fast adaptation on the support set, then update the meta model with the query-set gradient, and finally fast-adapt to the target.
     [Finn et al. 17] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017
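The support/query loop on this slide can be sketched on a toy problem. The code below is a first-order approximation of MAML (the meta step ignores second-order terms), with a one-parameter linear model and synthetic regression tasks standing in for the backbone and the MCQA datasets; all names and the task family are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(w, X, y):
    """Mean squared error and its gradient for the linear model f(x) = w * x."""
    err = w * X - y
    return np.mean(err ** 2), 2.0 * np.mean(err * X)

alpha, beta = 0.1, 0.05  # fast-adaptation (inner) and meta (outer) learning rates
w = 0.0                  # meta parameters; one weight stands in for the model

for step in range(300):
    meta_grad = 0.0
    slopes = rng.uniform(-2.0, 2.0, size=4)           # each task: regress y = a * x
    for a in slopes:
        X_s = rng.normal(size=8)                      # support set for this task
        _, g_s = loss_and_grad(w, X_s, a * X_s)
        w_fast = w - alpha * g_s                      # fast adaptation (FF + BP on support)
        X_q = rng.normal(size=8)                      # query set for the same task
        _, g_q = loss_and_grad(w_fast, X_q, a * X_q)  # first-order meta gradient
        meta_grad += g_q
    w -= beta * meta_grad / len(slopes)               # meta model update
```

After this loop, a single gradient step on a handful of examples from a previously unseen task already lowers that task's loss, which is the "fast adaptation" the slide refers to.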

  4. How does meta learning work?
     Init w from a pretrained model; support tasks x_s ~ X, query tasks x_q ~ X; L: cost function.
     Meta-learning runs over source domains such as Exam (4 choices), Dialogue (3 choices), and Story (4 choices): each source task contributes a gradient ∇_w L_i, and the meta model aggregates them. Fast adaptation then carries w to the target domain, Narrative Text (2 choices), using a few examples from that same domain.
     Goal: learn a model that can generalize over the task distribution.
     [Finn et al. 17] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017

  5. Multi-source Meta Transfer
     Meta learning trains a single meta model on tasks from one source. MMT trains an MMT model on tasks from multiple sources plus the target: Exam (4 choices), Dialogue (3 choices), Story (4 choices), and Scenario Text (4 choices).
     § Learn knowledge from multiple sources
     § Reduce discrepancy between sources and target

  6. Multi-source Meta Transfer: supervised MMT
     The diagram contrasts the input space (tasks from sources 1–4 scattered around the target) with the representation space (the MMT representation pulled close to the target, the plain source representations further away).
     § Multi-source Meta Learning (MML): learn knowledge from multiple sources and learn a representation near to the target.
     § Multi-source Transfer Learning (MTL): fine-tune the meta model to the target.

  7. How does MMT sample the tasks?
     Algorithm 1: The procedure of MMT
     Input: task distribution over sources q_S(τ), data distribution over the target Q_T(τ), backbone model f_θ, learning rates α, β, λ
     Output: optimized parameters θ
     Initialize θ
     while not done do
       for all sources S do
         Sample a batch of tasks τ_i^S ~ q_S(τ)
         for all τ_i^S do
           Evaluate ∇_θ L_{τ_i^S}(f_θ) with respect to k examples
           Compute the gradient for fast adaptation: θ' := θ − α ∇_θ L_{τ_i^S}(f_θ)
         end
         Meta model update: θ := θ − β ∇_θ Σ_{τ_i^S ~ q_S(τ)} L_{τ_i^S}(f_θ')
         Get a batch of target data τ_i^T ~ Q_T(τ)
         for all τ_i^T do
           Evaluate ∇_θ L_{τ_i^T}(f_θ) with respect to k examples
           Gradient for target fine-tuning: θ := θ − λ ∇_θ L_{τ_i^T}(f_θ)
         end
       end
     end
     Get all batches of target data τ_i^T ~ Q_T(τ)
     for all τ_i^T do
       Evaluate with respect to the batch size
       Gradient for meta transfer learning: θ := θ − λ ∇_θ L_{τ_i^T}(f_θ)
     end
     [Figure: meta tasks as batches of (M, Q, A) feature vectors sampled from sources 1–3 and the target, with N_S ≥ N_T.]
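The overall control flow of Algorithm 1 can be sketched with the same toy linear model as before. The three "source domains" below are just slope ranges, the target is one fixed task, and the rates alpha, beta, lam map to fast adaptation, the meta update, and the target updates; everything here is an illustrative stand-in for the paper's BERT-based setup, not a reproduction of it.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad(w, X, y):
    # gradient of mean squared error for the linear model f(x) = w * x
    return 2.0 * np.mean((w * X - y) * X)

def sample_task(a, k=8):
    X = rng.normal(size=k)
    return X, a * X

alpha, beta, lam = 0.1, 0.05, 0.05              # fast-adaptation / meta / target rates
sources = [(0.5, 1.0), (1.0, 1.5), (1.5, 2.0)]  # slope range per "source domain"
X_t, y_t = sample_task(1.2, k=32)               # fixed target-domain data

w = 0.0
for epoch in range(200):                        # while not done
    for lo, hi in sources:                      # for all sources S
        meta_g = 0.0
        for a in rng.uniform(lo, hi, size=4):   # sample a batch of tasks
            X_s, y_s = sample_task(a)
            w_fast = w - alpha * grad(w, X_s, y_s)  # fast adaptation on support
            X_q, y_q = sample_task(a)
            meta_g += grad(w_fast, X_q, y_q)        # gradient on the query task
        w -= beta * meta_g / 4                      # meta model update
        w -= lam * grad(w, X_t[:8], y_t[:8])        # target fine-tuning (MML phase)
for _ in range(50):
    w -= lam * grad(w, X_t, y_t)                    # meta transfer to target (MTL phase)
```

The interleaved target step inside the source loop is what distinguishes this loop from plain MAML: it keeps the meta model drifting toward the target domain during meta training, before the final transfer phase.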

  8. Multi-source Meta Transfer
     Annotations on Algorithm 1:
     § MMT is agnostic to the backbone model.
     § Support tasks and query tasks are sampled from the same distribution.
     § Fast adaptation updates the learner (θ') on the support task.
     § The meta model update revises θ on the query task.
     § Target fine-tuning updates the meta model (θ) on target data.
     § The final step transfers the meta model to the target (MTL).
     [Diagram: sources S1–S4 feed the target through MML; MTL then fine-tunes on the target.]

  9. Results: performance of supervised MMT on MCTEST, performance of unsupervised MMT, and an MMT ablation study.

  10. How to select sources? t-SNE visualization of BERT features (100 random samples) for the targets and sources, plus a transferability matrix; tested on SemEval 2018.
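The slide selects sources by inspecting how BERT features of candidate source datasets cluster around the target. As a toy illustration of that idea, the sketch below ranks synthetic "domains" by the distance between feature centroids; the domain names, feature generator, and centroid-distance score are all made-up stand-ins (real use would embed the 100 samples per dataset with BERT and could use t-SNE only for visualization).

```python
import numpy as np

rng = np.random.default_rng(2)

def fake_features(center, n=100, d=16):
    # stand-in for BERT embeddings of n random samples from one dataset
    return center + 0.5 * rng.normal(size=(n, d))

target = fake_features(np.zeros(16))
sources = {
    "near_domain": fake_features(np.full(16, 0.2)),
    "far_domain": fake_features(np.full(16, 2.0)),
}

# crude transferability proxy: distance between feature centroids
# (a smaller distance suggests a more similar, hence more useful, source)
t_centroid = target.mean(axis=0)
scores = {name: float(np.linalg.norm(feats.mean(axis=0) - t_centroid))
          for name, feats in sources.items()}
ranked = sorted(scores, key=scores.get)
print(ranked)  # most promising source first
```

A centroid distance is only one possible proxy; the talk's transferability matrix, measured by actually transferring between datasets, is the more direct evidence.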

  11. Takeaways
     § MMT extends meta learning to multiple sources for the MCQA task.
     § MMT provides an algorithm for both supervised and unsupervised meta training.
     § MMT gives a guideline for source selection.
