  1. Contextual Token Representations ULMfit, OpenAI GPT, ELMo, BERT, XLM Noe Casas

  2. Background: Language Modeling
     • Data: monolingual corpus.
     • Task: predict the next token given the previous tokens (causal), i.e. model P(T_i | T_1 … T_{i-1}).
     • Usual models: LSTM, Transformer.
     [Figure: the tokens <s> T_1 … T_N are embedded, fed through the model, then projected and passed through a softmax to predict T_1, T_2, …, </s>.]
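     A minimal sketch of this setup in PyTorch (the class name, sizes, and the LSTM choice are illustrative, not taken from the slides):

```python
import torch
import torch.nn as nn

class CausalLM(nn.Module):
    """Toy causal LM: embed -> LSTM -> project -> softmax over the vocabulary."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(tokens))   # (batch, seq_len, hidden_dim)
        return self.project(hidden)                 # logits for P(T_i | T_1 ... T_{i-1})

# Shift by one position: predict token i+1 from tokens 1..i of a monolingual batch.
model = CausalLM(vocab_size=10000)
batch = torch.randint(0, 10000, (4, 12))            # fake token ids
logits = model(batch[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 10000), batch[:, 1:].reshape(-1))
```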

  3. Contextual embeddings: intuition
     • The same word can have a different meaning depending on the context. Example:
       - Please, type everything in lowercase.
       - What type of flowers do you like most?
     • Classic word embeddings offer the same vector representation regardless of the context.
     • Solution: create word representations that depend on the context.
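     To make the intuition concrete, a hedged sketch using the Hugging Face transformers library (assumes `transformers` and `torch` are installed; the model name and the helper function are illustrative): the word "type" gets different vectors in the two sentences, whereas a static embedding table would return the same row both times.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def contextual_vector(sentence, word):
    """Return the contextual vector of `word` inside `sentence`
    (works here because "type" is a single wordpiece)."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, dim)
    position = enc.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[position]

v1 = contextual_vector("Please, type everything in lowercase.", "type")
v2 = contextual_vector("What type of flowers do you like most?", "type")
print(torch.cosine_similarity(v1, v2, dim=0))  # < 1.0: the context changed the vector
```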

  4. Articles
     • ULMfit (fast.ai): "Universal Language Model Fine-tuning for Text Classification", Howard and Ruder.
     • ELMo (AllenNLP): "Deep contextualized word representations", Peters et al.
     • OpenAI GPT (OpenAI): "Improving Language Understanding by Generative Pre-Training", Radford et al.
     • BERT (Google): "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", Devlin et al.
     • XLM (Facebook): "Cross-lingual Language Model Pretraining", Lample and Conneau.

  5. Overview
     • Train a model on one of several tasks that lead to word representations.
     • Release the pre-trained models.
     • Use the pre-trained models, with two options:
       A. Fine-tune the model on the final task.
       B. Directly encode token representations with the model.

  6. Overview (graphical)
     [Figure: Phase 1, semi-supervised training: a language-modeling architecture with an LM task head (projection + softmax) is trained on a monolingual corpus. Phase 2, downstream task fine-tuning: the contextual representations are transferred to a downstream task head trained on task-specific data, either with a small learning rate or with the pre-trained weights frozen (see the sketch below).]
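     A hedged sketch of the two Phase 2 regimes from the diagram; `encoder` stands in for any pre-trained LM body and `head` for a new task-specific classifier (all names, sizes, and learning rates are made up):

```python
import torch.nn as nn
from torch.optim import Adam

encoder = nn.LSTM(128, 256, batch_first=True)   # stand-in for the pre-trained body
head = nn.Linear(256, 3)                        # e.g. a 3-class sentiment head

# Option 1: freeze the pre-trained weights and train only the task head.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = Adam(head.parameters(), lr=1e-3)

# Option 2: fine-tune everything, but update the pre-trained weights with a
# much smaller learning rate than the freshly initialised head.
for p in encoder.parameters():
    p.requires_grad = True
optimizer = Adam([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])
```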

  7. Differences

     Alias       Model        Token    Tasks                                    Language
     ULMfit      LSTM         word     Causal LM                                English
     ELMo        LSTM         word     Bidirectional LM                         English
     OpenAI GPT  Transformer  subword  Causal LM + Classification               English
     BERT        Transformer  subword  Masked LM + Next sentence prediction     Multilingual
     XLM         Transformer  subword  Causal LM + Masked LM + Translation LM   Multilingual

  8. ULMFiT
     • Task: causal LM.
     • Model: 3-layer LSTM.
     • Tokens: words.
     [Figure: word embeddings E_1 … E_N of <s> T_1 … T_N feed a stacked LSTM; each position is projected and passed through a softmax to predict T_1, T_2, …, </s>.]

  9. ELMo
     • Task: bidirectional LM.
     • Model: 2-layer biLSTM.
     • Tokens: words.
     [Figure: char-CNN representations of <s> C_1 … C_N </s> feed forward and backward LSTM stacks; each position is projected and passed through a softmax to predict T_1, T_2, …, T_N.]
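     ELMo hands the downstream task the outputs of all layers (the char-CNN token layer plus the two biLSTM layers) and, per the ELMo paper, the task learns a softmax-weighted sum of them scaled by a scalar γ. A minimal sketch of that mixing with made-up tensors:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-learned weighted sum of per-layer representations (as in the ELMo paper)."""
    def __init__(self, num_layers):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # s_j before the softmax
        self.gamma = nn.Parameter(torch.ones(()))

    def forward(self, layer_states):                 # (num_layers, seq_len, dim)
        weights = torch.softmax(self.scalars, dim=0)
        return self.gamma * (weights[:, None, None] * layer_states).sum(dim=0)

# Fake states for 3 layers (char-CNN + 2 biLSTM layers), 7 tokens, 1024 dims.
states = torch.randn(3, 7, 1024)
elmo_vectors = ScalarMix(num_layers=3)(states)       # (7, 1024): one vector per token
```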

  10. OpenAI GPT
     • Task: causal LM.
     • Model: self-attention layers.
     • Tokens: subwords.
     [Figure: subword token embeddings of the input sequence are summed with positional embeddings (0, 1, 2, 3, 4), fed through the self-attention layers, then projected and passed through a softmax to predict the output tokens "he will be late </s>".]
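     A tiny, illustrative sketch of this architecture (it uses PyTorch's built-in TransformerEncoder with a causal mask rather than GPT's exact blocks; all names and sizes are made up):

```python
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    """Sketch: token + positional embeddings -> masked self-attention -> vocab logits."""
    def __init__(self, vocab_size, max_len=512, dim=256, heads=4, layers=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)
        self.project = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        x = self.tok(tokens) + self.pos(positions)
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        return self.project(self.blocks(x, mask=mask))

logits = TinyGPT(vocab_size=8000)(torch.randint(0, 8000, (2, 10)))  # (2, 10, 8000)
```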

  11. BERT
     • Tasks: masked LM + next sentence prediction.
     • Model: self-attention layers.
     • Tokens: subwords.
     [Figure: the input "[CLS] he [MASK] be late [SEP] you [MASK] leave now [SEP]" is embedded as the sum of token, positional (0–10), and segment (A/B) embeddings and fed through the self-attention layers; 15% of tokens get masked; the predicted output tokens are "he will be late [SEP] you should leave now [SEP]", and the output at [CLS] is used for classification tasks.]
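     The masked-LM corruption can be sketched as follows (hypothetical helper; special-token handling, the next-sentence pairing, and segment embeddings are omitted, and the BERT paper's 80/10/10 split among masked positions is simplified to always using [MASK]):

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
    """Select ~15% of positions, replace them with [MASK], and keep the
    original ids as prediction targets for the masked-LM loss."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mask_prob
    labels[~selected] = -100                   # ignored by PyTorch's cross_entropy
    corrupted = input_ids.clone()
    corrupted[selected] = mask_token_id
    return corrupted, labels

ids = torch.randint(1000, 2000, (2, 11))       # fake subword ids
corrupted, labels = mask_tokens(ids, mask_token_id=103)  # 103 is [MASK] in BERT's vocab
```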

  12. XLM
     • Tasks: causal LM + masked LM (MLM) + translation LM (TLM, i.e. masked LM over pairs of parallel sentences).
     • Model: self-attention layers.
     • Tokens: subwords.
     [Figure: Masked Language Modeling (MLM): a single English sentence with some positions replaced by [MASK] is embedded as the sum of token, position, and language (en) embeddings and fed to the Transformer, which predicts the masked tokens. Translation Language Modeling (TLM): an English sentence and its French translation are concatenated and both partially masked; position embeddings restart at 0 for the French part and language embeddings distinguish en from fr. Projection and softmax are omitted. Figure from "Cross-lingual Language Model Pretraining".]
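     A hedged sketch of how a TLM input is assembled (illustrative sizes and ids, not the real XLM vocabulary or hyper-parameters): the shared Transformer receives, at each position, the sum of token, position, and language embeddings, with positions reset for the second sentence.

```python
import torch
import torch.nn as nn

VOCAB, DIM, N_LANGS, MAX_POS = 1000, 64, 2, 256
tok_emb = nn.Embedding(VOCAB, DIM)
pos_emb = nn.Embedding(MAX_POS, DIM)
lang_emb = nn.Embedding(N_LANGS, DIM)
EN, FR = 0, 1

en_ids = torch.randint(0, VOCAB, (6,))   # partially masked English sentence
fr_ids = torch.randint(0, VOCAB, (6,))   # partially masked French translation

tokens = torch.cat([en_ids, fr_ids])
# TLM: positions restart at 0 for the translated sentence, and language
# embeddings mark which half each token belongs to.
positions = torch.cat([torch.arange(6), torch.arange(6)])
languages = torch.tensor([EN] * 6 + [FR] * 6)

# Input to the shared Transformer: the sum of the three embeddings per position.
x = tok_emb(tokens) + pos_emb(positions) + lang_emb(languages)   # (12, DIM)
```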

  13. Downstream Tasks
     • Natural Language Inference (NLI) or cross-lingual NLI.
     • Text classification (e.g. sentiment analysis).
     • Next sentence prediction.
     • Supervised and unsupervised Neural Machine Translation (NMT).
     • Question Answering (QA).
     • Named Entity Recognition (NER).

  14. Further reading
     • "Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling", Bowman et al., 2018.
     • "What do you learn from context? Probing for sentence structure in contextualized word representations", Tenney et al., 2018.
     • "Assessing BERT's Syntactic Abilities", Goldberg, 2019.
     • "Learning and Evaluating General Linguistic Intelligence", Yogatama et al., 2019.

  15. Differences with other representations
     Note the differences between contextual token representations and:
     • Contextualized word vectors learned from supervised translation, as in CoVe: "Learned in Translation: Contextualized Word Vectors", McCann et al., 2017 [Salesforce].
     • Fixed-size sentence representations, as in "Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond", Artetxe and Schwenk, 2018 [Facebook].
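     A toy illustration of the second contrast (not the method of either cited paper): contextual token representations keep one vector per token, while a fixed-size sentence embedding collapses them, here by simple mean pooling, into a single vector.

```python
import torch

# Pretend output of a contextual encoder for a 9-token sentence.
token_vectors = torch.randn(9, 768)             # one vector per token (this talk's focus)

sentence_vector = token_vectors.mean(dim=0)     # one fixed-size vector for the sentence
# Token-level tasks (NER, QA spans) need the 9 x 768 matrix; sentence-level
# transfer works from a single fixed-size vector.
```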

  16. Other resources
     • https://nlp.stanford.edu/seminar/details/jdevlin.pdf
     • http://jalammar.github.io/illustrated-bert/
     • https://medium.com/dissecting-bert-part2-335ff2ed9c73
     • https://github.com/huggingface/pytorch-pretrained-BERT

  17. Summary
     [Figure: Phase 1, semi-supervised training of a language model (LM task head: projection + softmax) on a monolingual corpus; Phase 2, transfer learning: the pre-trained model plus a downstream task head is fine-tuned on task-specific data.]

     Alias       Model        Token    Tasks                                    Language
     ULMfit      LSTM         word     Causal LM                                English
     ELMo        LSTM         word     Bidirectional LM                         English
     OpenAI GPT  Transformer  subword  Causal LM + Classification               English
     BERT        Transformer  subword  Masked LM + Next sentence prediction     Multilingual
     XLM         Transformer  subword  Causal LM + Masked LM + Translation LM   Multilingual

  18. Bonus slides

  19. Are these really token representations?
     • They are a linear projection away from token space.
     • A word-level nearest-neighbour search in the corpus finds the same word with the same usage.
     [Figure: the contextual representations of "he will be late" are projected and passed through a softmax back into token space.]
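     A toy sketch of what "a linear projection away from token space" means (random tensors stand in for a trained model): applying the LM output projection to a contextual hidden state gives one score per vocabulary token, from which the nearest / most probable tokens can be read off.

```python
import torch

hidden = torch.randn(512)             # contextual representation of one position
projection = torch.randn(10000, 512)  # LM output projection into the vocabulary

logits = projection @ hidden          # one score per vocabulary token
top5 = logits.topk(5).indices         # the "nearest" tokens in token space
```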
