ACL 2020
Multi-source Meta Transfer for Low Resource MCQA
Ming Yan (1), Hao Zhang (1,2), Di Jin (3), Joey Tianyi Zhou (1)
(1) IHPC, A*STAR, Singapore; (2) CSCE, NTU, Singapore; (3) CSAIL, MIT, USA
Background
Low resource MCQA: data size under 100K, with corpora from different domains.

Dataset     Size (K)   Corpus
SearchQA    140        Snippets
NewsQA      120        Newswire
SWAG        113.5      Scenario Text
HotpotQA    113        Wikipedia
SQuAD       108        Wikipedia
RACE        97.6       Exam
SemEval     13.9       Narrative Text
DREAM       6.1        Dialogue
MCTest      2.6        Story

These datasets span extractive/abstractive, multi-hop, and multiple-choice (MCQA) styles; the MCQA datasets sit at the low-resource end of the range.
How does meta learning work?
§ Low resource setting: transfer learning, multi-task learning.
§ Domain discrepancy: fine-tuning on the target domain.

Notation: L is the loss, θ the model parameters, initialized from a pretrained model; source tasks y_s ~ S, target tasks y_t ~ T.

Transfer learning: train f_θ on the source task, then fine-tune the trained model on the target task.
Multi-task learning: train f_θ jointly on source and target tasks with shared parameters.
Meta-learning: for each sampled task, run a feedforward/backpropagation pass (FF_L, BP_L) on its support set for fast adaptation,
    θ_i' = θ − α ∇_θ L_{T_i}(f_θ),
then run FF_m, BP_m on the query set and update the meta parameters,
    θ := θ − β ∇_θ Σ_{T_i ~ p(T)} L_{T_i}(f_{θ_i'}).
FF: feedforward; BP: backpropagation.
[Figure: FF/BP flows for transfer learning, multi-task learning, and meta-learning over Source 1–3 and the Target.]
[Finn et al. 2017] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017.
How does meta learning work?
Notation: θ is initialized from a pretrained model; L is the loss; for each task, a support set and a query set are sampled.
Source domains: Exam (4 choices), Dialogue (3 choices), Story (4 choices).
Meta-learning over tasks from these sources yields parameters θ that fast-adapt, with a few gradient steps, to new tasks — including tasks from the target domain, Narrative Text (2 choices).
Goal: learn a model that can generalize over the task distribution.
[Finn et al. 2017] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017.
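To make the fast-adaptation/meta-update split concrete, here is a minimal first-order sketch of one MAML-style step in PyTorch. This is not the authors' code: `model`, `loss_fn`, and the `.x`/`.y` batch fields are placeholders, and the second-order term through the inner update is dropped, as in first-order MAML.

```python
import torch

def maml_step(model, loss_fn, support_batch, query_batches, alpha=1e-3, beta=1e-5):
    # Keep a copy of the current meta parameters theta.
    theta = [p.detach().clone() for p in model.parameters()]

    # Inner loop (FF_L + BP_L): fast adaptation on the support set.
    loss = loss_fn(model(support_batch.x), support_batch.y)
    grads = torch.autograd.grad(loss, model.parameters())
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p -= alpha * g                      # theta_i' = theta - alpha * grad

    # Outer loop (FF_m + BP_m): evaluate the adapted parameters on the query
    # sets and move the original theta in that direction (first-order MAML).
    meta_loss = sum(loss_fn(model(b.x), b.y) for b in query_batches)
    meta_grads = torch.autograd.grad(meta_loss, model.parameters())
    with torch.no_grad():
        for p, p0, g in zip(model.parameters(), theta, meta_grads):
            p.copy_(p0 - beta * g)              # theta := theta - beta * meta grad
    return float(meta_loss)
```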
Multi-source Meta Transfer
Standard meta learning builds a meta model from tasks of a single source; Multi-source Meta Transfer (MMT) builds the MMT model from tasks of several sources — Dialogue (3 choices), Exam (4 choices), Story (4 choices), Scenario Text (4 choices) — together with the target.
§ Learn knowledge from multiple sources.
§ Reduce the discrepancy between sources and the target.
Multi-source Meta Transfer: Supervised MMT
[Figure: tasks from sources 1–4 and the target in the input space are mapped by the MMT model into a representation space where the MMT representation lies close to the target representation.]
§ Multi-source Meta Learning (MML): learn knowledge from multiple sources and learn a representation close to the target.
§ Multi-source Transfer Learning (MTL): fine-tune the meta model to the target.
How does MMT sample tasks?

Algorithm 1: The procedure of MMT
Input: task distribution over each source p_S(T), data distribution over the target p_T(D), backbone model f_θ, MMT learning rates α, β, γ
Output: optimized parameters θ
Initialize θ
while not done do
    for all sources S do
        Sample a batch of tasks T_i ~ p_S(T)
        for all T_i do
            Evaluate ∇_θ L_{T_i}(f_θ) with respect to k examples
            Compute the gradient for fast adaptation: θ_i' := θ − α ∇_θ L_{T_i}(f_θ)
        end
        Meta model update: θ := θ − β ∇_θ Σ_{T_i ~ p_S(T)} L_{T_i}(f_{θ_i'})
        Get a batch of target data D_j ~ p_T(D)
        for all D_j do
            Evaluate ∇_θ L_{D_j}(f_θ) with respect to k examples
            Gradient for target fine-tuning: θ := θ − γ ∇_θ L_{D_j}(f_θ)
        end
    end
    Get all batches of target data D_j ~ p_T(D)
    for all D_j do
        Evaluate with respect to the batch size
        Gradient for meta transfer learning: θ := θ − γ ∇_θ L_{D_j}(f_θ)
    end
end

[Figure: meta tasks are constructed by sampling question–answer examples from source 1, source 2, source 3, and the target.]
Multi-source Meta Transfer
Annotations on Algorithm 1:
§ MMT is agnostic to the backbone model f_θ.
§ For each source, the support tasks and query tasks are sampled from the same distribution p_S(T).
§ The learner θ_i' is updated on the support task (fast adaptation).
§ The meta model θ is updated on the query tasks.
§ The meta model θ is also updated on target data after each source (the MML stage, interleaving sources S1–S4 with the target).
§ Finally, the meta model is transferred to the target over all target batches (the MTL stage).
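A rough first-order sketch of this training loop, for illustration only: `source_task_samplers`, `target_loader`, and the `.support`/`.query`/`.x`/`.y` fields are assumed data structures, and the meta gradient is the first-order approximation rather than the exact second-order one.

```python
import copy
import torch

def mmt_train(model, loss_fn, source_task_samplers, target_loader,
              alpha=1e-3, beta=1e-5, gamma=1e-5, meta_steps=1000):
    for _ in range(meta_steps):                               # while not done
        # ---- Multi-source Meta Learning (MML) ----
        for sampler in source_task_samplers:                  # for all sources S
            meta_grads = [torch.zeros_like(p) for p in model.parameters()]
            for task in sampler.sample_tasks():               # T_i ~ p_S(T)
                learner = copy.deepcopy(model)                # learner theta_i'
                # Fast adaptation on the support set of T_i (learning rate alpha).
                loss = loss_fn(learner(task.support.x), task.support.y)
                grads = torch.autograd.grad(loss, learner.parameters())
                with torch.no_grad():
                    for p, g in zip(learner.parameters(), grads):
                        p -= alpha * g
                # Query loss of the adapted learner; accumulate its gradient as a
                # first-order estimate of the meta gradient.
                q_loss = loss_fn(learner(task.query.x), task.query.y)
                grads = torch.autograd.grad(q_loss, learner.parameters())
                for acc, g in zip(meta_grads, grads):
                    acc += g
            # Meta model update on the summed query losses (learning rate beta).
            with torch.no_grad():
                for p, g in zip(model.parameters(), meta_grads):
                    p -= beta * g
            # Target fine-tuning after each source (learning rate gamma).
            batch = next(iter(target_loader))
            loss = loss_fn(model(batch.x), batch.y)
            grads = torch.autograd.grad(loss, model.parameters())
            with torch.no_grad():
                for p, g in zip(model.parameters(), grads):
                    p -= gamma * g
        # ---- Multi-source Transfer Learning (MTL) ----
        # Transfer the meta model to the target using all target batches.
        for batch in target_loader:
            loss = loss_fn(model(batch.x), batch.y)
            grads = torch.autograd.grad(loss, model.parameters())
            with torch.no_grad():
                for p, g in zip(model.parameters(), grads):
                    p -= gamma * g
    return model
```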
Results
[Tables: performance of supervised MMT (including MCTest), performance of unsupervised MMT, and an MMT ablation study.]
How to select sources?
[Figure: t-SNE visualization of BERT features, 100 random samples per dataset, for the targets and sources; transferability matrix, with the test on SemEval 2018.]
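A hedged sketch of how such a diagnostic could be reproduced: project a few random samples of BERT features (e.g., the [CLS] vector) per dataset with t-SNE and inspect which sources cluster near the target. The `features_by_name` mapping and the feature-extraction step are assumptions, not the authors' pipeline.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_dataset_features(features_by_name, target_name, n_per_set=100, seed=0):
    # features_by_name: dataset name -> (N, hidden) array of sentence features,
    # e.g. BERT [CLS] vectors; how they are extracted is left to the user.
    rng = np.random.default_rng(seed)
    names, chunks = [], []
    for name, feats in features_by_name.items():
        idx = rng.choice(len(feats), size=min(n_per_set, len(feats)), replace=False)
        names.append(name)
        chunks.append(feats[idx])
    # Joint 2-D embedding of all sampled points.
    points = TSNE(n_components=2, random_state=seed).fit_transform(np.vstack(chunks))

    start = 0
    for name, chunk in zip(names, chunks):
        xy = points[start:start + len(chunk)]
        start += len(chunk)
        marker = "*" if name == target_name else "o"   # highlight the target set
        plt.scatter(xy[:, 0], xy[:, 1], s=14, marker=marker, label=name)
    plt.legend()
    plt.title("t-SNE of per-dataset features")
    plt.show()
```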
Takeaways
§ MMT extends meta learning to multiple sources for the MCQA task.
§ MMT provides an algorithm for both supervised and unsupervised meta training.
§ MMT gives a guideline for source selection.