
Transformers and Pre-trained Language Models (LING572: Advanced Statistical Methods for NLP)

Transformers and Pre-trained Language Models. LING572: Advanced Statistical Methods for NLP. March 10, 2020. Announcements: Thanks for being here! Please be active on Zoom chat! That's the only form of interaction; I won't be able to …


  1. Language Modeling
  ● Recent innovation: use language modeling (a.k.a. next-word prediction), and variants thereof
  ● Linguistic knowledge:
    ● The students were happy because ____ …
    ● The student was happy because ____ …
  ● World knowledge:
    ● The POTUS gave a speech after missiles were fired by _____
    ● The Seattle Sounders are so-named because Seattle lies on the Puget _____

  2. Language Modeling is “Unsupervised”
  ● An example of “unsupervised” or “semi-supervised” learning
  ● NB: I think that “un-annotated” is a better term. Formally, the learning is supervised, but the labels come directly from the data, not from an annotator.
  ● E.g., “Today is the first day of 575.” yields the (context, next word) pairs:
    ● (<s>, Today)
    ● (<s> Today, is)
    ● (<s> Today is, the)
    ● (<s> Today is the, first)
    ● …
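To make the “labels come from the data” point concrete, here is a minimal sketch (mine, not from the slides) of how (context, next-word) pairs can be generated from raw text; whitespace tokenization and the <s> start symbol are simplifying assumptions.

```python
# Minimal sketch: turning raw text into (context, next-word) supervision
# for language modeling. Whitespace tokenization and the "<s>" start
# symbol are simplifying assumptions, not the slides' exact setup.

def lm_pairs(sentence: str):
    tokens = ["<s>"] + sentence.split()
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[:i]   # everything seen so far
        target = tokens[i]     # the "label": the next word
        pairs.append((context, target))
    return pairs

for context, target in lm_pairs("Today is the first day of 575."):
    print(context, "->", target)
```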

  3–5. Data for LM is cheap [figure: “Text”]

  6. Text is abundant
  ● News sites (e.g., Google 1B)
  ● Wikipedia (e.g., WikiText-103)
  ● Reddit
  ● …
  ● General web crawling: https://commoncrawl.org/

  7. The Revolution will not be [Annotated]
  ● Yann LeCun [image]
  ● https://twitter.com/rgblong/status/916062474545319938?lang=en

  8. ULMFiT: Universal Language Model Fine-tuning for Text Classification (ACL ’18)

  9. ULMFiT [figure]

  10. ULMFiT [figure]

  11–13. Deep Contextualized Word Representations (Peters et al., 2018)
  ● NAACL 2018 Best Paper Award
  ● Embeddings from Language Models (ELMo)
    ● [a.k.a. the OG NLP Muppet]

  14. Deep Contextualized Word Representations (Peters et al., 2018)
  ● Comparison to GloVe: nearest neighbors of “play”
    ● Source: GloVe, “play” → playing, game, games, played, players, plays, player, Play, football, multiplayer
    ● Source: biLM, “Chico Ruiz made a spectacular play on Alusik’s grounder…” → “Kieffer, the only junior in the group, was commended for his ability to hit in the clutch, as well as his all-round excellent play.”
    ● Source: biLM, “Olivia De Havilland signed to do a Broadway play for Garson…” → “…they were actors who had been handed fat roles in a successful play, and had talent enough to fill the roles competently, with nice understatement.”
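As an aside (mine, not the slides’), the static-embedding column of a comparison like this is typically produced by ranking the vocabulary by cosine similarity to the query vector. A minimal sketch with toy vectors standing in for real GloVe embeddings:

```python
import numpy as np

# Minimal sketch (toy vectors, not real GloVe): nearest neighbors of a
# query word under cosine similarity.
embeddings = {
    "play":    np.array([0.9, 0.1, 0.3]),
    "playing": np.array([0.8, 0.2, 0.3]),
    "game":    np.array([0.7, 0.1, 0.4]),
    "banana":  np.array([0.0, 0.9, 0.1]),
}

def nearest_neighbors(query: str, k: int = 3):
    q = embeddings[query]
    scores = {}
    for word, vec in embeddings.items():
        if word == query:
            continue
        scores[word] = float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec)))
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(nearest_neighbors("play"))
```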

  15. Deep Contextualized Word Representations (Peters et al., 2018)
  ● Used in place of other embeddings on multiple tasks:
    ● SQuAD = Stanford Question Answering Dataset
    ● SNLI = Stanford Natural Language Inference Corpus
    ● SST-5 = Stanford Sentiment Treebank
  [figure: Matthew Peters]

  16. BERT: Bidirectional Encoder Representations from Transformers (Devlin et al., NAACL 2019)

  17. Overview
  ● Encoder Representations from Transformers: ✓
  ● Bidirectional: …?
    ● BiLSTM (ELMo): left-to-right and right-to-left
    ● Self-attention: every token can see every other
  ● How do you treat the encoder as an LM (i.e., as computing P(w_t | w_{t−1}, w_{t−2}, …, w_1))?
    ● Don’t: modify the task instead
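One way to picture the left-to-right vs. bidirectional distinction (my illustration, not from the slides) is via the attention mask: a causal LM lets position i attend only to positions ≤ i, while a BERT-style encoder lets every position attend to every other.

```python
import numpy as np

# Minimal sketch: attention masks for a sequence of 5 tokens.
# 1 = position j is visible when encoding position i, 0 = hidden.
n = 5

# Left-to-right (causal) mask: token i may only attend to tokens <= i.
causal_mask = np.tril(np.ones((n, n), dtype=int))

# Bidirectional (BERT-style encoder) mask: every token sees every other.
full_mask = np.ones((n, n), dtype=int)

print(causal_mask)
print(full_mask)
```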

  18. Masked Language Modeling
  ● Language modeling: next-word prediction
  ● Masked language modeling (a.k.a. the cloze task): fill-in-the-blank
    ● Nancy Pelosi sent the articles of ____ to the Senate.
    ● Seattle ____ some snow, so UW was delayed due to ____ roads.
  ● I.e., predict P(w_t | w_{t+k}, w_{t+(k−1)}, …, w_{t+1}, w_{t−1}, …, w_{t−(m+1)}, w_{t−m})
    ● (very similar to CBOW, the continuous bag of words model from word2vec)
  ● Auxiliary training task: next sentence prediction
    ● Given sentences A and B, binary classification: did B follow A in the corpus or not?
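As a hedged illustration of the cloze task (not part of the slides): the Hugging Face transformers library ships a fill-mask pipeline around pretrained BERT, which can fill the blank in the slide’s example. The library and model name below are assumptions, and running this downloads the model weights.

```python
# Sketch only: assumes the Hugging Face `transformers` package and a
# network connection to download `bert-base-uncased`; not part of the
# original slides.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT writes the blank as the special [MASK] token.
for pred in unmasker("Nancy Pelosi sent the articles of [MASK] to the Senate."):
    print(f"{pred['token_str']}\t{pred['score']:.3f}")
```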

  19. Schematically [figure]

  20–22. Some details
  ● BASE model:
    ● 12 Transformer blocks
    ● Hidden vector size: 768
    ● Attention heads / layer: 12
    ● Total parameters: 110M
  ● LARGE model:
    ● 24 Transformer blocks
    ● Hidden vector size: 1024
    ● Attention heads / layer: 16
    ● Total parameters: 340M
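For reference (mine, not the slides’): these hyperparameters correspond directly to a config object in the Hugging Face transformers library, which is an assumption here; instantiating untrained models of the two shapes reproduces the rough parameter counts above.

```python
# Sketch assuming the Hugging Face `transformers` package (not part of
# the slides): build BERT-BASE- and BERT-LARGE-shaped models from the
# hyperparameters above and count their parameters.
from transformers import BertConfig, BertModel

base_config = BertConfig(hidden_size=768, num_hidden_layers=12,
                         num_attention_heads=12, intermediate_size=3072)
large_config = BertConfig(hidden_size=1024, num_hidden_layers=24,
                          num_attention_heads=16, intermediate_size=4096)

for name, config in [("BASE", base_config), ("LARGE", large_config)]:
    model = BertModel(config)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: ~{n_params / 1e6:.0f}M parameters")
```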

  23–27. Input Representation
  ● [CLS], [SEP]: special tokens
  ● Segment: is this a token from sentence A or B?
  ● Position embeddings: provide position in the sequence (learned, not fixed, in this case)
  [figure]
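The three pieces are combined by summation. A minimal sketch (mine; the sizes mirror BERT-BASE, and randomly initialized embedding tables and made-up token ids stand in for the pretrained ones):

```python
import torch
import torch.nn as nn

# Minimal sketch of BERT-style input representations: the input at each
# position is the sum of a token embedding, a segment (A/B) embedding,
# and a learned position embedding.
vocab_size, max_len, hidden = 30000, 512, 768

token_emb = nn.Embedding(vocab_size, hidden)
segment_emb = nn.Embedding(2, hidden)         # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)  # learned, not fixed sinusoids

# Toy batch: [CLS] tok tok [SEP] tok tok [SEP] (ids are made up).
token_ids = torch.tensor([[101, 7592, 2088, 102, 2129, 2024, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

inputs = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(inputs.shape)  # (batch=1, seq_len=7, hidden=768)
```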

  28. WordPiece Embeddings
  ● Another solution to the OOV problem, from the NMT context (see Wu et al. 2016)
  ● Main idea:
    ● Fix the vocabulary size |V| in advance [for BERT: 30k]
    ● Choose |V| wordpieces (subwords) such that the total number of wordpieces in the corpus is minimized
    ● Frequent words aren’t split, but rarer ones are
  ● NB: this is a small issue when you transfer to / evaluate on pre-existing tagging datasets with their own vocabularies.
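At inference time, each word is segmented greedily, longest match first, against the fixed vocabulary, with word-internal pieces marked by a “##” prefix. A minimal sketch (mine) with a toy vocabulary standing in for BERT’s real 30k-piece one:

```python
# Minimal sketch of WordPiece-style segmentation at inference time:
# greedy longest-match-first against a fixed vocabulary. The tiny
# vocabulary here is illustrative, not BERT's real one.
VOCAB = {"un", "##aff", "##able", "##ly", "the", "play", "##ing", "[UNK]"}

def wordpiece(word: str):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece    # mark word-internal pieces
            if piece in VOCAB:
                cur = piece
                break
            end -= 1
        if cur is None:                 # no piece matches: whole word unknown
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece("unaffable"))  # ['un', '##aff', '##able']
print(wordpiece("playing"))    # ['play', '##ing']
```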

  29. Training Details
  ● BooksCorpus (800M words) + Wikipedia (2.5B words)
  ● Masking the input text: 15% of all tokens are chosen. Then:
    ● 80% of the time: replaced by the designated ‘[MASK]’ token
    ● 10% of the time: replaced by a random token
    ● 10% of the time: left unchanged
  ● Loss is the cross-entropy of the predictions at the masked positions.
  ● Max sequence length: 512 tokens (for the final 10% of training; 128 for the first 90%)
  ● 1M training steps, batch size 256 ≈ 4 days on 4 TPUs (BASE) or 16 TPUs (LARGE)
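The 80/10/10 corruption rule is easy to misread, so here is a minimal sketch of it (mine, not the authors’ code); the whitespace tokenization and toy vocabulary are stand-ins:

```python
import random

# Minimal sketch of BERT's masking scheme: choose 15% of token positions;
# of those, 80% become [MASK], 10% become a random token, and 10% are
# left unchanged. The model predicts the original token at every chosen
# position.
VOCAB = ["the", "cat", "sat", "on", "mat", "snow", "roads"]

def mask_tokens(tokens, mask_prob=0.15):
    corrupted, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue
        targets[i] = tok                         # loss is computed only here
        r = random.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"              # 80%: designated mask token
        elif r < 0.9:
            corrupted[i] = random.choice(VOCAB)  # 10%: random token
        # else: 10%: leave the token unchanged
    return corrupted, targets

print(mask_tokens("seattle got some snow so uw was delayed".split()))
```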

  30. Initial Results [figure]

  31. Ablations
  ● Not a given (depth doesn’t help ELMo); possibly a difference between fine-tuning and feature extraction
  ● Many more variations to explore
