Contextual Token Representations
ULMFiT, OpenAI GPT, ELMo, BERT, XLM
Noe Casas
Background: Language Modeling
• Data: a monolingual corpus.
• Task: predict the next token given the previous tokens (causal), i.e. model P(T_i | T_1, …, T_{i-1}).
• Usual models: LSTM, Transformer.
[Figure: the model reads <s>, T_1, …, T_N; each position is embedded, processed by the model, then projected and passed through a softmax to predict the next token (T_1, T_2, …, </s>).]
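A minimal PyTorch sketch of this setup (toy vocabulary, sizes and names are illustrative, not from the slides): embed the previous tokens, run an LSTM, project to the vocabulary and take a softmax/cross-entropy to model P(T_i | T_1, …, T_{i-1}).

```python
# Toy causal language model: predict each next token from the previous ones.
import torch
import torch.nn as nn

class CausalLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(tokens))
        return self.project(hidden)              # logits over the vocabulary

vocab_size = 1000                                # assumed toy vocabulary size
model = CausalLM(vocab_size)
inputs = torch.randint(0, vocab_size, (2, 10))   # <s>, T_1, ..., T_{N-1}
targets = torch.randint(0, vocab_size, (2, 10))  # T_1, ..., T_N (inputs shifted by one)
logits = model(inputs)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
```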
Contextual embeddings: intuition
• The same word can have different meanings depending on the context. Example:
  - Please, type everything in lowercase.
  - What type of flowers do you like most?
• Classic word embeddings give the same vector representation regardless of the context.
• Solution: create word representations that depend on the context (see the sketch below).
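A toy sketch of this intuition (assumed mini-vocabulary; an untrained biLSTM stands in for a real contextual encoder): a static embedding table returns the identical vector for "type" in both sentences, while a contextual encoder's output for "type" depends on its neighbours.

```python
# Static vs. contextual representations for the word "type".
import torch
import torch.nn as nn

vocab = {"please": 0, "type": 1, "everything": 2, "what": 3, "of": 4, "flowers": 5}
static = nn.Embedding(len(vocab), 16)

s1 = torch.tensor([[vocab["please"], vocab["type"], vocab["everything"]]])
s2 = torch.tensor([[vocab["what"], vocab["type"], vocab["of"], vocab["flowers"]]])

# Static embeddings: "type" gets exactly the same vector in both sentences.
assert torch.equal(static(s1)[0, 1], static(s2)[0, 1])

# Contextual encoder (untrained biLSTM as a stand-in): the vector for "type"
# now depends on the surrounding words, so the two occurrences differ.
encoder = nn.LSTM(16, 16, batch_first=True, bidirectional=True)
ctx1, _ = encoder(static(s1))
ctx2, _ = encoder(static(s2))
print(torch.allclose(ctx1[0, 1], ctx2[0, 1]))   # False (with overwhelming probability)
```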
Articles
• ULMFiT (fast.ai): "Universal Language Model Fine-tuning for Text Classification", Howard and Ruder.
• ELMo (AllenNLP): "Deep contextualized word representations", Peters et al.
• OpenAI GPT (OpenAI): "Improving Language Understanding by Generative Pre-Training", Radford et al.
• BERT (Google): "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", Devlin et al.
• XLM (Facebook): "Cross-lingual Language Model Pretraining", Lample and Conneau.
Overview
• Train a model on one of several tasks that lead to useful word representations.
• Release the pre-trained models.
• Use a pre-trained model; options:
  A. Fine-tune the model on the final task.
  B. Directly encode token representations with the model (sketched below).
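A sketch of option B using the pytorch-pretrained-BERT package listed later under "Other resources" (the 'bert-base-uncased' checkpoint and the example sentence are illustrative choices): the pre-trained model is run frozen and contextual token representations are read off its hidden layers.

```python
# Option B: use a frozen pre-trained BERT purely as a token encoder.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

tokens = tokenizer.tokenize("[CLS] what type of flowers do you like ? [SEP]")
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    encoded_layers, _ = model(ids)           # one (1, seq_len, 768) tensor per layer
token_representations = encoded_layers[-1]   # contextual vectors from the last layer
```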
Overview (graphical)
[Figure: Phase 1, semi-supervised training: a language-modeling architecture topped with an LM-task head (projection + softmax) is trained on a monolingual corpus. Phase 2, downstream task fine-tuning: the contextual representations are transferred, the LM head is replaced by a downstream-task head, and the pre-trained weights are updated with a small learning rate or directly frozen while training on task-specific data.]
Differences

Alias       Model        Token    Tasks                                   Language
ULMFiT      LSTM         word     Causal LM                               English
ELMo        LSTM         word     Bidirectional LM                        English
OpenAI GPT  Transformer  subword  Causal LM + Classification              English
BERT        Transformer  subword  Masked LM + Next sentence prediction    Multilingual
XLM         Transformer  subword  Causal LM + Masked LM + Translation LM  Multilingual
ULMFiT
• Task: causal LM
• Model: 3-layer LSTM
• Tokens: words
[Figure: a 3-layer LSTM runs over the input word embeddings E_1, …, E_N; each hidden state is projected and passed through a softmax to predict the next token (T_1, T_2, …, </s>).]
ELMo
• Task: bidirectional LM
• Model: 2-layer biLSTM over character-CNN token encodings
• Tokens: words
[Figure: character CNNs encode the tokens <s>, C_1, …, C_N, </s>; forward and backward 2-layer LSTMs run over these encodings, and each position is projected and passed through a softmax to predict T_1, T_2, …, T_N.]
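A rough sketch of the bidirectional-LM idea (toy dimensions; plain word embeddings stand in for the character CNN): a forward LSTM reads left-to-right, a backward LSTM reads right-to-left, and the per-token representation concatenates both directions.

```python
# Bidirectional LM sketch: concatenate forward and backward LSTM states per token.
import torch
import torch.nn as nn

emb_dim, hidden_dim, vocab_size = 64, 128, 1000
embed = nn.Embedding(vocab_size, emb_dim)
forward_lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
backward_lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 8))
emb = embed(tokens)

fwd, _ = forward_lstm(emb)                   # reads left-to-right
bwd, _ = backward_lstm(emb.flip(dims=[1]))   # reads right-to-left
bwd = bwd.flip(dims=[1])                     # re-align to the original token order

contextual = torch.cat([fwd, bwd], dim=-1)   # (1, 8, 2 * hidden_dim) per-token vectors
```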
OpenAI GPT
• Task: causal LM
• Model: self-attention layers
• Tokens: subwords
[Figure: the token embeddings of "</s> he will be late" are summed with positional embeddings (0, 1, 2, 3, 4) and fed to the self-attention layers; projection + softmax at each position predicts the output tokens "he will be late </s>".]
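A small sketch of how self-attention is kept causal (toy scores, illustrative sizes): a lower-triangular mask prevents each position from attending to future tokens, which is what makes a stack of self-attention layers usable as a causal LM.

```python
# Causal (autoregressive) attention mask: position i may only attend to positions <= i.
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)                     # raw attention scores (toy values)
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~causal_mask, float('-inf'))   # block attention to the future
attention = torch.softmax(scores, dim=-1)                  # rows sum to 1 over allowed positions
```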
BERT
• Tasks: masked LM + next sentence prediction
• Model: self-attention layers
• Tokens: subwords
[Figure: the input "[CLS] he [MASK] be late [SEP] you [MASK] leave now [SEP]" (15% of the tokens get masked) is embedded as the sum of token embeddings, positional embeddings (0–10) and segment embeddings (A for the first sentence, B for the second) and fed to the self-attention layers; projection + softmax at each position predicts the output tokens "he will be late [SEP] you should leave now [SEP]", and the output at [CLS] is used for classification tasks.]
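A sketch of the masked-LM corruption shown in the figure (the mask id and sizes are assumptions; BERT's full recipe also keeps or randomly replaces some of the selected tokens, omitted here): about 15% of positions are chosen, replaced by [MASK], and only those positions contribute to the loss.

```python
# Masked-LM input corruption: mask ~15% of tokens, train to recover them.
import torch

vocab_size, mask_id = 30000, 103           # mask_id is an assumed [MASK] index
tokens = torch.randint(1000, vocab_size, (1, 12))

mask = torch.rand(tokens.shape) < 0.15     # choose ~15% of the positions
inputs = tokens.clone()
inputs[mask] = mask_id                     # corrupt the input at the chosen positions

labels = tokens.clone()
labels[~mask] = -100                       # ignored by cross_entropy (ignore_index=-100)
```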
XLM
• Tasks: causal LM + masked LM + Translation LM (a masked LM over parallel sentences)
• Model: self-attention layers
• Tokens: subwords
[Figure (from "Cross-lingual Language Model Pretraining"; projection and softmax are omitted): Masked Language Modeling (MLM): a Transformer processes a partially masked English sentence whose input is the sum of token, position and language (en) embeddings, and predicts the masked tokens (e.g. "take", "drink", "now"). Translation Language Modeling (TLM): an English sentence and its French translation are concatenated and masked in both languages, and the model predicts the masked tokens (e.g. "curtains", "were", "les", "bleus"); position embeddings restart at 0 for the French sentence and language embeddings distinguish en from fr.]
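A sketch of how a TLM training example could be assembled (all token ids below are made up): the two translations are concatenated, position ids restart at 0 for the second sentence, and language ids select the language embedding added to each token.

```python
# Assembling the id sequences for a Translation LM (TLM) example.
import torch

en_ids = torch.tensor([5, 17, 42, 8])     # e.g. "the curtains were blue" (toy ids)
fr_ids = torch.tensor([6, 91, 33, 12])    # e.g. "les rideaux étaient bleus" (toy ids)

token_ids = torch.cat([en_ids, fr_ids])
position_ids = torch.cat([torch.arange(len(en_ids)),
                          torch.arange(len(fr_ids))])  # positions reset for the 2nd sentence
language_ids = torch.cat([torch.zeros(len(en_ids), dtype=torch.long),   # 0 = en
                          torch.ones(len(fr_ids), dtype=torch.long)])   # 1 = fr

# The three id sequences index three embedding tables whose sum is the Transformer input;
# masking then proceeds as in MLM, but the model can attend across both languages.
```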
Downstream Tasks
• Natural Language Inference (NLI) or cross-lingual NLI.
• Text classification (e.g. sentiment analysis).
• Next sentence prediction.
• Supervised and unsupervised Neural Machine Translation (NMT).
• Question Answering (QA).
• Named Entity Recognition (NER).
Further reading
• "Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling", Bowman et al., 2018.
• "What do you learn from context? Probing for sentence structure in contextualized word representations", Tenney et al., 2018.
• "Assessing BERT's Syntactic Abilities", Goldberg, 2019.
• "Learning and Evaluating General Linguistic Intelligence", Yogatama et al., 2019.
Differences with other representations
Note how contextual token representations differ from:
• Word vectors learned from translation rather than language modeling, as in CoVe: "Learned in Translation: Contextualized Word Vectors", McCann et al., 2017 [Salesforce].
• Fixed-size sentence representations, as in "Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond", Artetxe and Schwenk, 2018 [Facebook].
Other resources
• https://nlp.stanford.edu/seminar/details/jdevlin.pdf
• http://jalammar.github.io/illustrated-bert/
• https://medium.com/dissecting-bert/dissecting-bert-part2-335ff2ed9c73
• https://github.com/huggingface/pytorch-pretrained-BERT
Summary
[Figure: Phase 1, semi-supervised training: the model with an LM-task head (projection + softmax) is trained on a monolingual corpus. Phase 2, downstream task fine-tuning: the pre-trained model is transferred and a downstream-task head replaces the LM head, training on task-specific data.]

Alias       Model        Token    Tasks                                   Language
ULMFiT      LSTM         word     Causal LM                               English
ELMo        LSTM         word     Bidirectional LM                        English
OpenAI GPT  Transformer  subword  Causal LM + Classification              English
BERT        Transformer  subword  Masked LM + Next sentence prediction    Multilingual
XLM         Transformer  subword  Causal LM + Masked LM + Translation LM  Multilingual
Bonus slides
Are these really token representations?
• They are a linear projection away from token space.
• Word-level nearest-neighbour search over the corpus finds the same word with the same usage (see the sketch below).
[Figure: the model encodes "he will be late"; each output representation feeds a projection + softmax that predicts a token.]
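A sketch of that nearest-neighbour check (random vectors stand in for real contextual representations): cosine similarity between one query occurrence and one vector per token occurrence in a corpus, returning the closest occurrences.

```python
# Nearest-neighbour search over contextual token vectors with cosine similarity.
import torch

corpus_vectors = torch.randn(10000, 768)   # one contextual vector per token occurrence
query = torch.randn(768)                   # vector for one occurrence of e.g. "type"

similarity = torch.nn.functional.cosine_similarity(corpus_vectors, query.unsqueeze(0))
nearest = similarity.topk(5).indices       # indices of the 5 most similar occurrences
```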