  1. (Even More) Language Modeling: Multi-Task Learning, and Building Blocks of Transformers CMSC 473/673 Frank Ferraro

  2. Outline • Multi-Task Learning • The Attention Mechanism • Transformer Language Models as General Language Encoders

  3. Remember Multi-class Classification from Deck 5 Given input x, predict a discrete label y. • Single output: if y ∈ {0, 1} (or y ∈ {True, False}), then a binary classification task; if y ∈ {0, 1, …, K − 1} (for a finite K), then a multi-class classification task. • Multi-output: if multiple y_m are predicted, then a multi-label classification task; each y_m could be binary or multi-class. Multi-label classification: given input x, predict multiple discrete labels y = (y_1, …, y_L).
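
To make the distinction concrete, here is a minimal PyTorch sketch (the sizes and variable names are illustrative, not from the slides) showing that the three settings differ only in the final layer and how its outputs are normalized:

    import torch
    import torch.nn as nn

    H, K, L = 128, 5, 4          # hidden size, # classes, # labels (illustrative)
    h = torch.randn(1, H)        # pretend encoder output for one input x

    # Single output, binary: one logit, squashed by a sigmoid
    binary_head = nn.Linear(H, 1)
    p_true = torch.sigmoid(binary_head(h))               # P(y = 1 | x)

    # Single output, multi-class: K competing logits, normalized by a softmax
    multiclass_head = nn.Linear(H, K)
    p_class = torch.softmax(multiclass_head(h), dim=-1)  # sums to 1 over the K classes

    # Multi-output, multi-label: L independent logits, one sigmoid each,
    # so several labels can be "on" at once
    multilabel_head = nn.Linear(H, L)
    p_labels = torch.sigmoid(multilabel_head(h))         # each entry independently in (0, 1)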

  4. Multi-Label vs. Multi-Task • These can be considered the same thing but often they’re different • “Task”: a thing of interest to predict

  5. Multi-Label vs. Multi-Task • These can be considered the same thing but often they’re different • “Task”: a thing of interest to predict • Multi-label classification often involves multiple labels for the same task – E.g., sentiment (a tweet could be both “HAPPY” and “EXCITED”)

  6. Multi-Label vs. Multi-Task • These can be considered the same thing but often they’re different • “Task”: a thing of interest to predict • Multi-label classification often involves multiple labels for the same task – E.g., sentiment (a tweet could be both “HAPPY” and “EXCITED”) • Multi-task learning is for different “tasks,” e.g., – Task 1: Category of document (SPORTS, FINANCE, etc.) – Task 2: Sentiment of document – Task 3: Part-of-speech per token – Task 4: Syntactic parsing – …

  7. Multi-Task Learning Single-Task Learning: train a system to “do one thing” (make predictions for one task). [figure: a single pipeline x → h → y]

  8. Multi-Task Learning Single-Task Learning: train a system to “do one thing” (make predictions for one task). If you have multiple (T) tasks, then train multiple systems. [figure: T separate pipelines, x → h_1 → y_1, x → h_2 → y_2, …, x → h_T → y_T]

  9. Multi-Task Learning Single-Task Learning: train a system to “do one thing” (make predictions for one task). If you have multiple (T) tasks, then train multiple systems. [figure: T separate pipelines x → h_t → y_t, with different encoders (h_1, …, h_T) and different decoders]

  10. Multi-Task Learning Single-Task Learning: train a system to “do one thing” (make predictions for one task). Multi-Task Learning: train a system to “do multiple things” (make predictions for T different tasks). Key idea/assumption: if the tasks are somehow related, can we leverage an ability to do task i well into an ability to do task j well? [figure: a single pipeline x → h → y]

  11. Multi-Task Learning Single-Task Learning: train a system to “do one thing” (make predictions for one task). Multi-Task Learning: train a system to “do multiple things” (make predictions for T different tasks). Key idea/assumption: if the tasks are somehow related, can we leverage an ability to do task i well into an ability to do task j well? Example: could features/embeddings useful for language modeling (task i) also be useful for part-of-speech tagging (task j)? [figure: a single pipeline x → h → y]

  12. Multi-Task Learning Single-Task Learning: train a system to “do one thing” (make predictions for one task). Multi-Task Learning: train a system to “do multiple things” (make predictions for T different tasks). Key idea/assumption: if the tasks are somehow related, can we leverage an ability to do task i well into an ability to do task j well? [figure: left, a single-task pipeline x → h → y; right, one shared pipeline x → h → y_1, y_2, …, y_T]

  13. Multi-Task Learning Single-Task Learning: train a system to “do one thing” (make predictions for one task). Multi-Task Learning: train a system to “do multiple things” (make predictions for T different tasks). [figure: left, a single-task pipeline x → h → y; right, one shared pipeline x → h → y_1, y_2, …, y_T]

  14. Multi-Task Learning Single-Task Learning: train a system to “do one thing” (make predictions for one task). Multi-Task Learning: train a system to “do multiple things” (make predictions for T different tasks). [figure: shared pipeline x → h → y_1, y_2, …, y_T; the same encoder learns good, general features/embeddings]

  15. Multi-Task Learning Single-Task Learning: train a system to “do one thing” (make predictions for one task). Multi-Task Learning: train a system to “do multiple things” (make predictions for T different tasks). [figure: shared pipeline x → h → y_1, y_2, …, y_T; the same encoder learns good, general features/embeddings, and different decoders learn how to use those representations for each task]
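
A minimal sketch of this picture in PyTorch (the class, layer choices, and sizes are my illustration, not the course's code): one shared encoder computes h from x, and a separate decoder head per task maps h to that task's output space.

    import torch
    import torch.nn as nn

    class MultiTaskModel(nn.Module):
        # Shared encoder + one decoder ("head") per task.
        def __init__(self, vocab_size, hidden_size, task_output_sizes):
            super().__init__()
            # same encoder for every task: learns good, general features/embeddings
            self.encoder = nn.Sequential(
                nn.Embedding(vocab_size, hidden_size),
                nn.Linear(hidden_size, hidden_size),
                nn.ReLU(),
            )
            # different decoders: learn how to use those representations per task
            self.decoders = nn.ModuleList(
                [nn.Linear(hidden_size, n) for n in task_output_sizes]
            )

        def forward(self, x, task_id):
            # x: (batch, seq) of token ids; mean-pool token vectors into one
            # vector per input (a simplification; any encoder could go here)
            h = self.encoder(x).mean(dim=1)   # (batch, hidden): shared representation
            return self.decoders[task_id](h)  # (batch, n_outputs): task-specific

    # e.g., three tasks with made-up output sizes: document category (5),
    # sentiment (2), and some 45-tag prediction task
    model = MultiTaskModel(vocab_size=10_000, hidden_size=128,
                           task_output_sizes=[5, 2, 45])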

  16. General Multi-Task Training Procedure Given: T different corpora C_1, …, C_T for the T tasks, where C_t = {(x_1^t, y_1^t), …, (x_{N_t}^t, y_{N_t}^t)}; an encoder E; and T different decoders D_1, …, D_T. These have weights (parameters) you need to learn.

  17. General Multi-Task Training Procedure Given: T different corpora C_1, …, C_T, where C_t = {(x_1^t, y_1^t), …, (x_{N_t}^t, y_{N_t}^t)}; an encoder E; and T different decoders D_1, …, D_T. Until converged or done: 1. Select the next task t. 2. Randomly sample an instance (x_j^t, y_j^t) from C_t. 3. Train the encoder E and decoder D_t on (x_j^t, y_j^t).
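
And a sketch of that loop in PyTorch, reusing the MultiTaskModel above. Round-robin task selection and a cross-entropy loss are assumptions on my part; the slide leaves both choices open.

    import random
    import torch

    def train_multitask(model, corpora, steps=10_000, lr=1e-3):
        # corpora[t] is a list of (x, y) tensor pairs for task t,
        # x of shape (1, seq) and y of shape (1,) (classification assumed)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        for step in range(steps):                   # "until converged or done"
            t = step % len(corpora)                 # 1. select the next task t (round-robin)
            x, y = random.choice(corpora[t])        # 2. sample an instance (x_j^t, y_j^t) from C_t
            loss = loss_fn(model(x, task_id=t), y)  # 3. run encoder E and decoder D_t ...
            optimizer.zero_grad()
            loss.backward()                         #    ... and update both on this instance
            optimizer.step()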

  18. WARNING: Multi-task learning did not begin in 2008

  19. Two Well-Known Instances of Multi-Task Learning in NLP Collobert and Weston (2008, ICML) BERT (Devlin et al., 2019, NAACL)

  20. Two Well-Known Instances of Multi-Task Learning in NLP Collobert and Weston (2008, ICML) BERT (Devlin et al., 2019, NAACL); we’ll return to this

  21. Collobert and Weston (2008, ICML) Core task: Semantic Role Labeling Present a unified architecture for doing five other, related NLP tasks • Part-of-Speech Tagging • Chunking • Named Entity Recognition • Language Modeling • Prediction of Semantic Relatedness

  23. Remember Semantic Role Labeling (SRL) from Deck 4 • For each predicate (e.g., verb) 1. find its arguments (e.g., NPs) 2. determine their semantic roles John drove Mary from Austin to Dallas in his Toyota Prius. The hammer broke the window. – agent: Actor of an action – patient: Entity affected by the action – source: Origin of the affected entity – destination: Destination of the affected entity – instrument: Tool used in performing action. – beneficiary: Entity for whom action is performed Slide thanks to Ray Mooney (modified) Slide courtesy Jason Eisner, with mild edits

  24. Remember Uses of Semantic Roles from Deck 4 • Find the answer to a user’s question – “Who” questions usually want Agents – “What” questions usually want Patients – “How” and “with what” questions usually want Instruments – “Where” questions frequently want Sources/Destinations – “For whom” questions usually want Beneficiaries – “To whom” questions usually want Destinations • Generate text – Many languages have specific syntactic constructions that must or should be used for specific semantic roles. • Word sense disambiguation, using selectional restrictions – The bat ate the bug. (what kind of bat? what kind of bug?) • Agents (particularly of “eat”) should be animate – animal bat, not baseball bat • Patients of “eat” should be edible – animal bug, not software bug – John fired the secretary. John fired the rifle. Patients of fire_1 are different than patients of fire_2. Slide thanks to Ray Mooney (modified) Slide courtesy Jason Eisner, with mild edits

  28. Part of Speech Tagging (sequence is probably not right!) [figure: per-token pipeline x_0 … x_5 → h_0 … h_5 → y_0 … y_5, tagging “British Left Waffles on Falkland Islands” as Noun Verb Noun Prep Noun Noun]

  29. Part-of-speech tagging: assign a part-of-speech tag to every word in a sentence
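
As a concrete (toy, untrained) version of the x_i → h_i → y_i pipeline from the figure above, the sketch below tags each token independently; a real tagger would use surrounding context. The vocabulary, tag set, and layer sizes are all illustrative.

    import torch
    import torch.nn as nn

    vocab = {"British": 0, "Left": 1, "Waffles": 2, "on": 3, "Falkland": 4, "Islands": 5}
    tags = ["Noun", "Verb", "Prep"]

    embed = nn.Embedding(len(vocab), 32)    # x_i -> embedding
    hidden = nn.Linear(32, 32)              # embedding -> h_i
    out = nn.Linear(32, len(tags))          # h_i -> tag scores y_i

    x = torch.tensor([vocab[w] for w in "British Left Waffles on Falkland Islands".split()])
    h = torch.relu(hidden(embed(x)))        # (6, 32): one h_i per token
    y = out(h).argmax(dim=-1)               # (6,): highest-scoring tag per position
    print([tags[int(i)] for i in y])        # arbitrary until the layers are trained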

  30. Syntactic Parsing (One Option) [figure: constituency parse of the example sentence, from the Berkeley parser: https://parser.kitaev.io/ (parse is probably not right!)]

  31. Part-of-speech tagging: assign a part-of-speech tag to every word in a sentence Syntactic parsing: produce an analysis of a sentence according to some grammatical rules

  32. Part-of-speech tagging: assign a part-of-speech tag to every word in a sentence Syntactic parsing: produce an analysis of a sentence according to some grammatical rules Chunking: a shallow syntactic parsing
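
To make the contrast concrete (my example, not the slides'): on "John drove Mary from Austin to Dallas", a chunker only marks flat, non-overlapping phrases, e.g. [NP John] [VP drove] [NP Mary] [PP from] [NP Austin] [PP to] [NP Dallas], whereas a full syntactic parse would additionally nest those chunks inside larger constituents (for instance, attaching the PPs under the verb phrase).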
