Contextual Token Representations
ULMFiT, OpenAI GPT, ELMo, BERT, XLM
Noe Casas
Background: Language Modeling
• Data: a monolingual corpus.
• Task: predict the next token given the previous tokens (causal), i.e. model P(T_i | T_1, …, T_{i-1}).
• Usual models: LSTM, Transformer.
[Figure: the model reads <s>, T_1, …, T_N; each position is embedded, processed by the model, then projected and passed through a softmax to predict the next token (T_1, T_2, …, </s>).]
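A minimal PyTorch sketch of this setup (toy vocabulary, sizes and names are illustrative, not from the slides): embed the previous tokens, run an LSTM, project to the vocabulary and take a softmax/cross-entropy to model P(T_i | T_1, …, T_{i-1}).

```python
# Toy causal language model: predict each next token from the previous ones.
import torch
import torch.nn as nn

class CausalLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(tokens))
        return self.project(hidden)              # logits over the vocabulary

vocab_size = 1000                                # assumed toy vocabulary size
model = CausalLM(vocab_size)
inputs = torch.randint(0, vocab_size, (2, 10))   # <s>, T_1, ..., T_{N-1}
targets = torch.randint(0, vocab_size, (2, 10))  # T_1, ..., T_N (inputs shifted by one)
logits = model(inputs)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
```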
Contextual embeddings: intuition
• The same word can have different meanings depending on the context. Example:
  - Please, type everything in lowercase.
  - What type of flowers do you like most?
• Classic word embeddings give the same vector representation regardless of the context.
• Solution: create word representations that depend on the context (see the sketch below).
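A toy sketch of this intuition (assumed mini-vocabulary; an untrained biLSTM stands in for a real contextual encoder): a static embedding table returns the identical vector for "type" in both sentences, while a contextual encoder's output for "type" depends on its neighbours.

```python
# Static vs. contextual representations for the word "type".
import torch
import torch.nn as nn

vocab = {"please": 0, "type": 1, "everything": 2, "what": 3, "of": 4, "flowers": 5}
static = nn.Embedding(len(vocab), 16)

s1 = torch.tensor([[vocab["please"], vocab["type"], vocab["everything"]]])
s2 = torch.tensor([[vocab["what"], vocab["type"], vocab["of"], vocab["flowers"]]])

# Static embeddings: "type" gets exactly the same vector in both sentences.
assert torch.equal(static(s1)[0, 1], static(s2)[0, 1])

# Contextual encoder (untrained biLSTM as a stand-in): the vector for "type"
# now depends on the surrounding words, so the two occurrences differ.
encoder = nn.LSTM(16, 16, batch_first=True, bidirectional=True)
ctx1, _ = encoder(static(s1))
ctx2, _ = encoder(static(s2))
print(torch.allclose(ctx1[0, 1], ctx2[0, 1]))   # False (with overwhelming probability)
```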
Articles
• ULMFiT (fast.ai): "Universal Language Model Fine-tuning for Text Classification", Howard and Ruder.
• ELMo (AllenNLP): "Deep contextualized word representations", Peters et al.
• OpenAI GPT (OpenAI): "Improving Language Understanding by Generative Pre-Training", Radford et al.
• BERT (Google): "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", Devlin et al.
• XLM (Facebook): "Cross-lingual Language Model Pretraining", Lample and Conneau.
Overview
• Train a model on one of several tasks that lead to useful word representations.
• Release the pre-trained models.
• Use a pre-trained model; options:
  A. Fine-tune the model on the final task.
  B. Directly encode token representations with the model (sketched below).
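A sketch of option B using the pytorch-pretrained-BERT package listed later under "Other resources" (the 'bert-base-uncased' checkpoint and the example sentence are illustrative choices): the pre-trained model is run frozen and contextual token representations are read off its hidden layers.

```python
# Option B: use a frozen pre-trained BERT purely as a token encoder.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

tokens = tokenizer.tokenize("[CLS] what type of flowers do you like ? [SEP]")
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    encoded_layers, _ = model(ids)           # one (1, seq_len, 768) tensor per layer
token_representations = encoded_layers[-1]   # contextual vectors from the last layer
```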
Overview (graphical)
[Figure: Phase 1, semi-supervised training: a language-modeling architecture topped with an LM-task head (projection + softmax) is trained on a monolingual corpus. Phase 2, downstream task fine-tuning: the contextual representations are transferred, the LM head is replaced by a downstream-task head, and the pre-trained weights are updated with a small learning rate or directly frozen while training on task-specific data.]
Differences

Alias       Model        Token    Tasks                                   Language
ULMFiT      LSTM         word     Causal LM                               English
ELMo        LSTM         word     Bidirectional LM                        English
OpenAI GPT  Transformer  subword  Causal LM + Classification              English
BERT        Transformer  subword  Masked LM + Next sentence prediction    Multilingual
XLM         Transformer  subword  Causal LM + Masked LM + Translation LM  Multilingual
ULMFiT
• Task: causal LM
• Model: 3-layer LSTM
• Tokens: words
[Figure: a 3-layer LSTM runs over the input word embeddings E_1, …, E_N; each hidden state is projected and passed through a softmax to predict the next token (T_1, T_2, …, </s>).]
ELMo
• Task: bidirectional LM
• Model: 2-layer biLSTM over character-CNN token encodings
• Tokens: words
[Figure: character CNNs encode the tokens <s>, C_1, …, C_N, </s>; forward and backward 2-layer LSTMs run over these encodings, and each position is projected and passed through a softmax to predict T_1, T_2, …, T_N.]
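A rough sketch of the bidirectional-LM idea (toy dimensions; plain word embeddings stand in for the character CNN): a forward LSTM reads left-to-right, a backward LSTM reads right-to-left, and the per-token representation concatenates both directions.

```python
# Bidirectional LM sketch: concatenate forward and backward LSTM states per token.
import torch
import torch.nn as nn

emb_dim, hidden_dim, vocab_size = 64, 128, 1000
embed = nn.Embedding(vocab_size, emb_dim)
forward_lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
backward_lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 8))
emb = embed(tokens)

fwd, _ = forward_lstm(emb)                   # reads left-to-right
bwd, _ = backward_lstm(emb.flip(dims=[1]))   # reads right-to-left
bwd = bwd.flip(dims=[1])                     # re-align to the original token order

contextual = torch.cat([fwd, bwd], dim=-1)   # (1, 8, 2 * hidden_dim) per-token vectors
```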
OpenAI GPT
• Task: causal LM
• Model: self-attention layers
• Tokens: subwords
[Figure: the token embeddings of "</s> he will be late" are summed with positional embeddings (0, 1, 2, 3, 4) and fed to the self-attention layers; projection + softmax at each position predicts the output tokens "he will be late </s>".]
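A small sketch of how self-attention is kept causal (toy scores, illustrative sizes): a lower-triangular mask prevents each position from attending to future tokens, which is what makes a stack of self-attention layers usable as a causal LM.

```python
# Causal (autoregressive) attention mask: position i may only attend to positions <= i.
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)                     # raw attention scores (toy values)
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~causal_mask, float('-inf'))   # block attention to the future
attention = torch.softmax(scores, dim=-1)                  # rows sum to 1 over allowed positions
```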
BERT
• Tasks: masked LM + next sentence prediction
• Model: self-attention layers
• Tokens: subwords
[Figure: the input "[CLS] he [MASK] be late [SEP] you [MASK] leave now [SEP]" (15% of the tokens get masked) is embedded as the sum of token embeddings, positional embeddings (0–10) and segment embeddings (A for the first sentence, B for the second) and fed to the self-attention layers; projection + softmax at each position predicts the output tokens "he will be late [SEP] you should leave now [SEP]", and the output at [CLS] is used for classification tasks.]
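A sketch of the masked-LM corruption shown in the figure (the mask id and sizes are assumptions; BERT's full recipe also keeps or randomly replaces some of the selected tokens, omitted here): about 15% of positions are chosen, replaced by [MASK], and only those positions contribute to the loss.

```python
# Masked-LM input corruption: mask ~15% of tokens, train to recover them.
import torch

vocab_size, mask_id = 30000, 103           # mask_id is an assumed [MASK] index
tokens = torch.randint(1000, vocab_size, (1, 12))

mask = torch.rand(tokens.shape) < 0.15     # choose ~15% of the positions
inputs = tokens.clone()
inputs[mask] = mask_id                     # corrupt the input at the chosen positions

labels = tokens.clone()
labels[~mask] = -100                       # ignored by cross_entropy (ignore_index=-100)
```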
XLM
• Tasks: causal LM + masked LM + Translation LM (a masked LM over parallel sentences)
• Model: self-attention layers
• Tokens: subwords
[Figure (from "Cross-lingual Language Model Pretraining"; projection and softmax are omitted): Masked Language Modeling (MLM): a Transformer processes a partially masked English sentence whose input is the sum of token, position and language (en) embeddings, and predicts the masked tokens (e.g. "take", "drink", "now"). Translation Language Modeling (TLM): an English sentence and its French translation are concatenated and masked in both languages, and the model predicts the masked tokens (e.g. "curtains", "were", "les", "bleus"); position embeddings restart at 0 for the French sentence and language embeddings distinguish en from fr.]
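A sketch of how a TLM training example could be assembled (all token ids below are made up): the two translations are concatenated, position ids restart at 0 for the second sentence, and language ids select the language embedding added to each token.

```python
# Assembling the id sequences for a Translation LM (TLM) example.
import torch

en_ids = torch.tensor([5, 17, 42, 8])     # e.g. "the curtains were blue" (toy ids)
fr_ids = torch.tensor([6, 91, 33, 12])    # e.g. "les rideaux étaient bleus" (toy ids)

token_ids = torch.cat([en_ids, fr_ids])
position_ids = torch.cat([torch.arange(len(en_ids)),
                          torch.arange(len(fr_ids))])  # positions reset for the 2nd sentence
language_ids = torch.cat([torch.zeros(len(en_ids), dtype=torch.long),   # 0 = en
                          torch.ones(len(fr_ids), dtype=torch.long)])   # 1 = fr

# The three id sequences index three embedding tables whose sum is the Transformer input;
# masking then proceeds as in MLM, but the model can attend across both languages.
```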
Downstream Tasks
• Natural Language Inference (NLI) or cross-lingual NLI.
• Text classification (e.g. sentiment analysis).
• Next sentence prediction.
• Supervised and unsupervised Neural Machine Translation (NMT).
• Question Answering (QA).
• Named Entity Recognition (NER).
Further reading
• "Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling", Bowman et al., 2018.
• "What do you learn from context? Probing for sentence structure in contextualized word representations", Tenney et al., 2018.
• "Assessing BERT's Syntactic Abilities", Goldberg, 2019.
• "Learning and Evaluating General Linguistic Intelligence", Yogatama et al., 2019.
Differences with other representations
Note how contextual token representations differ from:
• Word vectors learned from translation rather than language modeling, as in CoVe: "Learned in Translation: Contextualized Word Vectors", McCann et al., 2017 [Salesforce].
• Fixed-size sentence representations, as in "Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond", Artetxe and Schwenk, 2018 [Facebook].
Other resources
• https://nlp.stanford.edu/seminar/details/jdevlin.pdf
• http://jalammar.github.io/illustrated-bert/
• https://medium.com/dissecting-bert/dissecting-bert-part2-335ff2ed9c73
• https://github.com/huggingface/pytorch-pretrained-BERT
Summary
[Figure: Phase 1, semi-supervised training: the model with an LM-task head (projection + softmax) is trained on a monolingual corpus. Phase 2, downstream task fine-tuning: the pre-trained model is transferred and a downstream-task head replaces the LM head, training on task-specific data.]

Alias       Model        Token    Tasks                                   Language
ULMFiT      LSTM         word     Causal LM                               English
ELMo        LSTM         word     Bidirectional LM                        English
OpenAI GPT  Transformer  subword  Causal LM + Classification              English
BERT        Transformer  subword  Masked LM + Next sentence prediction    Multilingual
XLM         Transformer  subword  Causal LM + Masked LM + Translation LM  Multilingual
Bonus slides
Are these really token representations?
• They are a linear projection away from token space.
• Word-level nearest-neighbour search over the corpus finds the same word with the same usage (see the sketch below).
[Figure: the model encodes "he will be late"; each output representation feeds a projection + softmax that predicts a token.]
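A sketch of that nearest-neighbour check (random vectors stand in for real contextual representations): cosine similarity between one query occurrence and one vector per token occurrence in a corpus, returning the closest occurrences.

```python
# Nearest-neighbour search over contextual token vectors with cosine similarity.
import torch

corpus_vectors = torch.randn(10000, 768)   # one contextual vector per token occurrence
query = torch.randn(768)                   # vector for one occurrence of e.g. "type"

similarity = torch.nn.functional.cosine_similarity(corpus_vectors, query.unsqueeze(0))
nearest = similarity.topk(5).indices       # indices of the 5 most similar occurrences
```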