Intermediate task transfer
CS685 Fall 2020: Advanced Natural Language Processing
Mohit Iyyer
College of Information and Computer Sciences, University of Massachusetts Amherst
many slides from Tu Vu
Stuff from last time • Too many readings! • The mythical HW1 • Extra credit!
What is a task? - a description - a (sample) dataset
Tasks can help each other! • classification: supplementing language model (LM)-style pretraining with further training on intermediate tasks leads to improvements and reduced variance (Phang et al., 2019; arXiv) • sequence labeling: pretraining on a closely related task yields better performance than LM pretraining when the pretraining dataset is fixed (Liu et al., 2019; NAACL) • machine comprehension: pretraining on multiple related datasets leads to robust generalization and transfer (Talmor and Berant, 2019; ACL)
• Discover the space of language tasks - properties of individual tasks - task similarities and beneficial relations among tasks • Practical application - reduce the need for supervision among related tasks - multi-task learning : solve many tasks in one system - transfer learning : select source tasks for a given task
A real-world scenario
[Diagram: an end user submits a new task (a task description plus sample data) to a company's cloud service; the service consults a task bank and returns a structure among tasks, i.e., the end user's related tasks, enabling efficient supervision policies]
There are tons of NLP tasks!
• ~100 tasks/datasets from various classes of problems:
- Single-sentence classification: CoLA, SST-2, 20 Newsgroups, TREC-6, IMDB, Yelp-2, Yelp-full, AG, DBPedia, Sogou News, …
- Sentence-pair classification: MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI, BoolQ, CB, WiC, …
- Machine comprehension: SQuAD, NewsQA, SearchQA, TriviaQA, HotpotQA, CQ, CWQ, ComQA, WikiHop, DROP, …
- Sequence labeling: CCG, POS, Chunk, NER, ST, GED, PS, EF, Parent, Conj, …
- Unsupervised learning: LM, autoencoding, next sentence, real/fake, discourse relations, …
- Probing tasks: SentLen, WC, TreeDepth, TopConst, BShift, Tense, SubjNum, ObjNum, SOMO, CoordInv, …
Taskonomy for vision tasks • Zamir et al. (2018); CVPR: A library of 26 tasks covering common themes in computer vision (2D, 3D, semantics, etc.)
A research question • What criteria can be used to predict which combinations of source/intermediate and target tasks should work well?
Create task embeddings • fixed-length dense vector representations of tasks • the vector space can tell us how closely related two tasks are (e.g., via cosine distance)
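A minimal sketch of the "compare tasks in the vector space" idea. The embedding values below are made up for illustration; only the cosine computation itself is the point:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two task-embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical task embeddings (toy 3-d vectors, not real model output):
# related tasks should point in similar directions.
emb_mnli = [0.9, 0.1, 0.3]
emb_snli = [0.8, 0.2, 0.3]
emb_pos  = [0.1, 0.9, 0.0]

print(cosine_similarity(emb_mnli, emb_snli))  # high: related NLI tasks
print(cosine_similarity(emb_mnli, emb_pos))   # lower: unrelated tasks
```

Cosine distance is just 1 minus this similarity; either form ranks candidate source tasks identically.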
[Figure: 2D visualization of task embeddings for ten tasks: STS-B, MRPC, WNLI, RTE, QNLI, SNLI, MNLI, CoLA, QQP, SST-2]
Previous work on exploring the relations between NLP tasks
• Bingel and Søgaard (2017); EACL: 10 main sequence labeling tasks, 90 task pairs for multi-task learning
• Talmor and Berant (2019); ACL: 10 main reading comprehension tasks
A simple approach
• feed the task description (i.e., a paragraph describing the task) through the base network to obtain a task embedding
• limitation: requires a clear description for each task in the library
[Diagram: task description tokens Tok 1, Tok 2, …, Tok N fed into the base network]
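A toy sketch of the description-only idea, with a hashed bag-of-words standing in for a real base network (the hashing scheme, dimensionality, and example description are all illustrative, not the actual method):

```python
import math
import zlib

def description_embedding(description, dim=16):
    """Toy stand-in for a base network: hashed bag-of-words over the task
    description, L2-normalized so embeddings are comparable by cosine."""
    vec = [0.0] * dim
    for token in description.lower().split():
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

# Hypothetical one-line task description (illustrative only).
emb = description_embedding(
    "classify the sentiment of a movie review as positive or negative"
)
```

The limitation on the slide shows up directly here: a missing or vague description gives an embedding that says little about the task.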
Gradient-based methods
• use a single base network
• add a task-specific classifier layer for a given task
• pass the entire dataset forward through the network only once
• during backpropagation: either use training labels or sample from the model's predictive distribution to compute gradients w.r.t. the model's parameters (weights) or outputs (activations)
[Diagram: input text Tok 1, Tok 2, …, Tok N fed into the base network, topped by a task-specific classifier layer]
What is the base network? • a pre-trained model, e.g., BERT, XLNet, RoBERTa
How to get gradient information?
• use training labels
- original gradients
- use the empirical Fisher
• sample from the model's predictive distribution
- original gradients
- use the theoretical Fisher
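A pure-Python sketch of the empirical-Fisher variant on a toy logistic model (the weights, data, and model are made up; the real method uses BERT's per-layer gradients): the task embedding is the per-parameter mean of squared per-example gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def example_gradient(w, x, y):
    """Gradient of the logistic loss w.r.t. weights w for one example (x, y).
    Using the training label y makes this the *empirical* Fisher."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    err = sigmoid(z) - y
    return [err * xi for xi in x]

def task_embedding(w, dataset):
    """Diagonal empirical Fisher: mean of squared per-example gradients."""
    fisher = [0.0] * len(w)
    for x, y in dataset:
        g = example_gradient(w, x, y)
        fisher = [f + gi * gi for f, gi in zip(fisher, g)]
    return [f / len(dataset) for f in fisher]

# Toy "task": a few labeled examples; w plays the role of pretrained weights.
w = [0.5, -0.2]
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1)]
emb = task_embedding(w, data)
```

Sampling y from the model's predictive distribution instead of reading it from the data would give the theoretical-Fisher variant from the slide.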
Various gradient types
[Diagram: BERT base network annotated with the points where gradients can be taken: the pooled output (pooler layer), per-layer outputs L1 … LN, multi-head attention outputs MH1 … MHN (queries, keys, values), the feed-forward and LayerNorm sublayers inside each of the N encoder layers, and the embedding layer (word, segment, and position embeddings); at each point, gradients can be computed w.r.t. weights or activations]
1. given a target task of interest (here, WikiHop), compute a task embedding from BERT's layer-wise gradients
2. identify the most similar source task embedding from a precomputed library (e.g., MNLI, QNLI, SQuAD, DROP, SST-2, CCG, POS-PTB, WikiHop)
3. fine-tune BERT on the selected source task
4. fine-tune the resulting model on the target task
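Step 2 of the pipeline above reduces to a nearest-neighbor lookup over the precomputed library. A sketch with made-up embedding values (real embeddings would come from BERT's gradients, not these toy vectors):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two task-embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical precomputed library of source-task embeddings (values illustrative).
library = {
    "MNLI":    [0.9, 0.1, 0.2],
    "SQuAD":   [0.2, 0.9, 0.1],
    "POS-PTB": [0.1, 0.1, 0.9],
}

def select_source_task(target_emb, library):
    """Return the library task whose embedding is closest to the target's."""
    return max(library, key=lambda name: cosine_similarity(library[name], target_emb))

# A reading-comprehension-like target should land near SQuAD in this toy setup.
target = [0.3, 0.8, 0.2]
print(select_source_task(target, library))  # SQuAD
```

Steps 3 and 4 then proceed as usual: fine-tune on the selected source, then on the target.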
[Figure: learned task embedding space covering all tasks in the library: classification/regression tasks (STS-B, MRPC, BoolQ, QNLI, QQP, MNLI, WNLI, SNLI, RTE, SciTail, SST-2, CoLA), question answering tasks (ComQA, CQ, DuoRC-s, DuoRC-p, WikiHop, DROP, SQuAD-2.0, SQuAD-1.1, NewsQA, HotpotQA), and sequence labeling tasks (Chunk, Conj, Parent, GParent, GGParent, CCG, GED, NER, ST, POS-PTB, POS-EWT)]
LIMITED → LIMITED
[Figure: bar chart of target-task performance for every target task (CR, QA, and SL), comparing the no-transfer baseline against the source task chosen by TaskEmb; a table above the chart ranks the top source tasks per target, with frequent top sources including HotpotQA, NewsQA, SQuAD-1, POS-PTB, and GGParent]
• SQuAD-2 is no longer the best source task for any QA targets in this regime
• QA tasks are good sources for CR targets