Word Embeddings
CS 6956: Deep Learning for NLP
Overview
• Representing meaning
• Word embeddings: Early work
• Word embeddings via language models
• Word2vec and GloVe
• Evaluating embeddings
• Design choices and open questions
Word embeddings via language models
The goal: to find vector embeddings of words
High-level approach:
1. Train a model for a surrogate task (in this case, language modeling)
2. Word embeddings are a byproduct of this process
Neural network language models
• A multi-layer neural network [Bengio et al., 2003] (sketched in the code below)
  – Context = previous words in the sentence
  – Words → embedding layer → hidden layers → softmax
  – Cross-entropy loss
• Instead of producing a probability, just produce a score for the next word (no softmax) [Collobert and Weston, 2008]
  – Ranking loss
  – Intuition: valid word sequences should get a higher score than invalid ones
• No need for a multi-layer network; a shallow network is good enough [Mikolov et al., 2013, word2vec]
  – Context = previous and next words in the sentence
  – Simpler model, fewer parameters
  – Faster to train
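To make the first bullet concrete, here is a minimal numpy sketch of a Bengio-style feedforward language model: concatenated embeddings of the previous k words, one tanh hidden layer, and a softmax over the vocabulary with a cross-entropy loss. Only the forward pass is shown (no training loop), and all sizes, names, and the random initialization are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

# A minimal sketch of a Bengio-style feedforward language model (forward pass only).
# Sizes are illustrative; the context is the k previous words.
rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim, k = 10_000, 300, 500, 4

C = rng.normal(scale=0.1, size=(vocab_size, embed_dim))      # word embedding table
H = rng.normal(scale=0.1, size=(k * embed_dim, hidden_dim))  # hidden layer weights
U = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))     # output layer weights

def next_word_probs(context_ids):
    """context_ids: list of k word indices (the previous words)."""
    x = C[context_ids].reshape(-1)          # concatenate the k embeddings
    h = np.tanh(x @ H)                      # one hidden layer
    scores = h @ U                          # one score per vocabulary word
    scores -= scores.max()                  # numerical stability for the softmax
    p = np.exp(scores) / np.exp(scores).sum()
    return p

def cross_entropy(context_ids, next_id):
    """Negative log-probability of the observed next word."""
    return -np.log(next_word_probs(context_ids)[next_id])

print(cross_entropy([1, 42, 7, 99], next_id=5))
```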
This lecture
• The word2vec models: CBOW and Skipgram
• Connection between word2vec and matrix factorization
• GloVe
Word2Vec [Mikolov et al., ICLR 2013; Mikolov et al., NIPS 2013]
• Two architectures for learning word embeddings
  – Skipgram and CBOW
• Both have two key differences from the older Bengio/C&W approaches:
  1. No hidden layers
  2. Extra context (both left and right context; see the window sketch below)
• Several computational tricks to make things faster
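As a small illustration of the "extra context" point, the sketch below extracts two-sided windows of radius m from a sentence; both CBOW and Skipgram start from the same windows and differ only in the direction of prediction. The function name and window radius are illustrative, not taken from the word2vec code.

```python
# Extract (context, center) training windows from a sentence.
# Both CBOW and Skipgram use the same two-sided window; they differ in
# what is predicted from what.
def windows(tokens, m=2):
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - m):i] + tokens[i + 1:i + 1 + m]
        yield context, center

sentence = "the quick brown fox jumps over the lazy dog".split()
for context, center in windows(sentence, m=2):
    print(center, "<-", context)   # CBOW predicts center from context;
                                   # Skipgram predicts each context word from center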
Continuous Bag of Words (CBOW)

Given a window of words of length $2m + 1$, call them:
$x_{-m}, \dots, x_{-1}, x_0, x_1, \dots, x_m$

Define a probabilistic model for predicting the middle word:
$P(x_0 \mid x_{-m}, \dots, x_{-1}, x_1, \dots, x_m)$
(We still need to define this distribution to complete the model.)

Train the model by minimizing the loss over the dataset:
$L = -\sum \log P(x_0 \mid x_{-m}, \dots, x_{-1}, x_1, \dots, x_m)$
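One way to write the loss with the summation index made explicit, assuming the sum runs over all positions t in the training corpus:

```latex
L = -\sum_{t} \log P\big(x_t \mid x_{t-m}, \dots, x_{t-1},\; x_{t+1}, \dots, x_{t+m}\big)
```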
The CBOW model
• The classification task
  – Input: context words $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$
  – Output: the center word $x_0$
  – These words correspond to one-hot vectors
    • E.g., cat is associated with one dimension; its one-hot vector has a 1 in that dimension and zeros everywhere else (see the small check below)
• Notation:
  – $n$: the embedding dimension (e.g., 300)
  – $V$: the vocabulary of words we want to embed
• Define two matrices:
  1. $\mathcal{V}$: a matrix of size $n \times |V|$
  2. $\mathcal{W}$: a matrix of size $|V| \times n$
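A quick check on the one-hot representation: multiplying the $n \times |V|$ matrix by a word's one-hot vector just selects that word's column, which is why the "map into $n$-dimensional space" step in the next slide is an embedding lookup. Sizes and variable names in the snippet are made up for illustration.

```python
import numpy as np

# Multiplying the n x |V| matrix by a one-hot vector selects the column for
# that word, so "embedding lookup" is matrix-times-one-hot done efficiently.
rng = np.random.default_rng(0)
n, vocab_size = 5, 8
V_mat = rng.normal(size=(n, vocab_size))   # plays the role of the n x |V| matrix

word_id = 3
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

assert np.allclose(V_mat @ one_hot, V_mat[:, word_id])
```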
The CBOW model

Input: context words $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$
Output: the center word $x_0$
Notation: $n$ is the embedding dimension (e.g., 300), $V$ is the vocabulary, $\mathcal{V}$ is a matrix of size $n \times |V|$, and $\mathcal{W}$ is a matrix of size $|V| \times n$

1. Map all the context words into the $n$-dimensional space using $\mathcal{V}$
   – We get $2m$ vectors: $\mathcal{V}x_{-m}, \dots, \mathcal{V}x_{-1}, \mathcal{V}x_1, \dots, \mathcal{V}x_m$
2. Average these vectors to get a context vector
   $\hat{v} = \frac{1}{2m} \sum_{j=-m,\, j \neq 0}^{m} \mathcal{V}x_j$
3. Use this to compute a score vector for the output
   $\mathrm{score} = \mathcal{W}\hat{v}$
4. Use the score to compute a probability via softmax
   $P(x_0 = \cdot \mid \mathrm{context}) = \mathrm{softmax}(\mathcal{W}\hat{v})$

Exercise: Write this as a computation graph

Word embeddings: the rows of the matrix corresponding to the output, that is, the rows of $\mathcal{W}$
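Putting the four steps together, here is a compact numpy sketch of the CBOW forward pass and loss. It assumes illustrative sizes and random matrices in place of learned ones; V_in stands in for the $n \times |V|$ input matrix and W_out for the $|V| \times n$ output matrix, and no gradient or training code is included.

```python
import numpy as np

# A compact sketch of the CBOW forward pass and loss (steps 1-4 above).
# Gradients and training are omitted; sizes are illustrative.
rng = np.random.default_rng(0)
n, vocab_size, m = 50, 1000, 2

V_in = rng.normal(scale=0.1, size=(n, vocab_size))    # input matrix, n x |V|
W_out = rng.normal(scale=0.1, size=(vocab_size, n))   # output matrix, |V| x n

def cbow_loss(context_ids, center_id):
    # 1. map context words to n-dimensional vectors (columns of V_in)
    context_vecs = V_in[:, context_ids]                # n x 2m
    # 2. average them to get the context vector v_hat
    v_hat = context_vecs.mean(axis=1)                  # n
    # 3. score every vocabulary word
    scores = W_out @ v_hat                             # |V|
    # 4. softmax over the scores, then negative log-likelihood of the center word
    scores -= scores.max()                             # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[center_id])

# Example: predict word 17 from a window of 2m = 4 context words
print(cbow_loss(context_ids=[3, 8, 21, 40], center_id=17))

# The learned word embeddings would be read off the rows of W_out (the output matrix).
```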