

  1. Word Embeddings CS 6956: Deep Learning for NLP

  2. Overview
  • Representing meaning
  • Word embeddings: Early work
  • Word embeddings via language models
  • Word2vec and GloVe
  • Evaluating embeddings
  • Design choices and open questions

  3. Word embeddings via language models
  The goal: to find vector embeddings of words
  High-level approach:
  1. Train a model for a surrogate task (in this case, language modeling)
  2. The word embeddings are a byproduct of this process

  4. Neural network language models
  • A multi-layer neural network [Bengio et al., 2003]
    – Context = the previous words in the sentence
    – Words → embedding layer → hidden layers → softmax
    – Cross-entropy loss
  • Instead of producing a probability, just produce a score for the next word (no softmax) [Collobert and Weston, 2008]
    – Ranking loss
    – Intuition: valid word sequences should get a higher score than invalid ones
  • No need for a multi-layer network; a shallow network is good enough [Mikolov et al., 2013: word2vec]
    – Context = the previous and next words in the sentence
    – Simpler model, fewer parameters
    – Faster to train
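The following PyTorch code is a rough, minimal sketch of the Bengio-style setup described above (previous words → embedding layer → hidden layer → softmax, trained with a cross-entropy loss). It is not the architecture from the paper; the class name FeedforwardLM and all hyperparameters are made up for illustration.

```python
import torch
import torch.nn as nn

class FeedforwardLM(nn.Module):
    """Minimal Bengio-style neural language model (illustrative sketch).

    Previous-word ids -> embedding layer -> hidden layer -> scores over the
    vocabulary for the next word (softmax is applied inside the loss).
    """
    def __init__(self, vocab_size, embed_dim=100, context_size=4, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # the word embeddings
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)            # a score for every word

    def forward(self, context):                  # context: (batch, context_size) word ids
        e = self.embed(context).flatten(1)       # concatenate the context embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                       # unnormalized next-word scores

# Cross-entropy couples the softmax with the log-likelihood of the next word.
model = FeedforwardLM(vocab_size=10_000)
context = torch.randint(0, 10_000, (32, 4))      # a fake batch of 4-word contexts
next_word = torch.randint(0, 10_000, (32,))
loss = nn.functional.cross_entropy(model(context), next_word)
```

The Collobert and Weston variant from the slide would keep a similar scoring network but replace the cross-entropy loss with a ranking loss that prefers valid word sequences over corrupted ones.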

  5. This lecture
  • The word2vec models: CBOW and Skipgram
  • Connection between word2vec and matrix factorization
  • GloVe

  6. Word2Vec [Mikolov et al., ICLR 2013; Mikolov et al., NIPS 2013]
  • Two architectures for learning word embeddings
    – Skipgram and CBOW
  • Both have two key differences from the older Bengio and Collobert & Weston approaches:
    1. No hidden layers
    2. Extra context (both left and right context)
  • Several computational tricks to make things faster

  7. Continuous Bag of Words (CBOW)
  Given a window of words of length 2m + 1, call them x_{-m}, ⋯, x_{-1}, x_0, x_1, ⋯, x_m.
  Define a probabilistic model for predicting the middle word:
      P(x_0 | x_{-m}, ⋯, x_{-1}, x_1, ⋯, x_m)
  (We need to define this distribution to complete the model.)
  Train the model by minimizing the loss over the dataset:
      L = − Σ log P(x_0 | x_{-m}, ⋯, x_{-1}, x_1, ⋯, x_m)
  where the sum runs over all windows in the dataset.
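To make the training setup concrete, here is a minimal sketch of how the (context, center word) windows could be enumerated from a corpus. The toy corpus and the choice m = 2 are made up for the example.

```python
# Each training example is a window of 2m + 1 words: the model predicts the
# center word x_0 from the m words on either side of it.
corpus = "the quick brown fox jumps over the lazy dog".split()  # made-up corpus
m = 2                                                           # half window size

windows = []
for i in range(m, len(corpus) - m):
    context = corpus[i - m:i] + corpus[i + 1:i + m + 1]   # x_{-m}..x_{-1}, x_1..x_m
    center = corpus[i]                                    # x_0
    windows.append((context, center))

# e.g. (['the', 'quick', 'fox', 'jumps'], 'brown')
# The loss L sums -log P(center | context) over all such windows.
print(windows[0])
```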

  8. The CBOW model
  • The classification task
    – Input: the context words x_{-m}, ⋯, x_{-1}, x_1, ⋯, x_m
    – Output: the center word x_0
    – These words correspond to one-hot vectors
      • E.g., cat is associated with one dimension; its one-hot vector has a 1 in that dimension and zeros everywhere else
  • Notation:
    – n: the embedding dimension (e.g., 300)
    – V: the vocabulary of words we want to embed
  • Define two matrices:
    1. 𝒱: a matrix of size n × |V|
    2. 𝒲: a matrix of size |V| × n

  9. The CBOW model, continued
  Input: the context words x_{-m}, ⋯, x_{-1}, x_1, ⋯, x_m. Output: the center word x_0.
  Recall the notation: n is the embedding dimension (e.g., 300), V is the vocabulary, 𝒱 is an n × |V| matrix, and 𝒲 is a |V| × n matrix.
  1. Map all the context words into the n-dimensional space using 𝒱. We get 2m vectors: 𝒱x_{-m}, ⋯, 𝒱x_{-1}, 𝒱x_1, ⋯, 𝒱x_m
  2. Average these vectors to get a context vector: v̂ = (1/2m) Σ_{j=−m, j≠0}^{m} 𝒱x_j
  3. Use the context vector to compute a score vector for the output: score = 𝒲v̂
  4. Use the score to compute a probability via softmax: P(x_0 = · | context) = softmax(𝒲v̂)
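To make the four steps concrete, here is a minimal NumPy sketch of a single CBOW forward pass and its loss. The names V_in and W_out stand in for 𝒱 and 𝒲, and the vocabulary size, dimensions, and word ids are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n, m = 10, 4, 2                  # |V|, embedding dimension, half window

V_in = rng.normal(size=(n, vocab_size))      # the n x |V| matrix (𝒱)
W_out = rng.normal(size=(vocab_size, n))     # the |V| x n matrix (𝒲)

context_ids = [1, 3, 5, 7]                   # ids of x_{-m}, ..., x_{-1}, x_1, ..., x_m
center_id = 2                                # id of x_0

# Step 1: map context words into n-dimensional space. Multiplying V_in by a
#         one-hot vector just selects the corresponding column of V_in.
context_vecs = V_in[:, context_ids]          # shape (n, 2m)

# Step 2: average them to get the context vector v_hat.
v_hat = context_vecs.mean(axis=1)            # shape (n,)

# Step 3: compute a score for every word in the vocabulary.
score = W_out @ v_hat                        # shape (|V|,)

# Step 4: softmax turns the scores into P(x_0 = . | context).
probs = np.exp(score - score.max())
probs /= probs.sum()

# The training loss for this window is the negative log-likelihood of x_0.
loss = -np.log(probs[center_id])
print(loss)
```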

  10. The CBOW model: exercise
  Exercise: write the CBOW model above (steps 1 to 4) as a computation graph.
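One possible realization of that computation graph, as a minimal PyTorch sketch; the module name CBOW and the hyperparameters are illustrative assumptions, not taken from the slides. Note that nn.Embedding replaces the explicit multiplication of 𝒱 by one-hot vectors, and the softmax is folded into the cross-entropy loss.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """CBOW as a computation graph: embed context -> average -> score -> softmax."""
    def __init__(self, vocab_size, embed_dim=300):
        super().__init__()
        self.V_in = nn.Embedding(vocab_size, embed_dim)             # plays the role of 𝒱
        self.W_out = nn.Linear(embed_dim, vocab_size, bias=False)   # its weight rows play the role of 𝒲

    def forward(self, context_ids):                  # context_ids: (batch, 2m)
        v_hat = self.V_in(context_ids).mean(dim=1)   # average the context embeddings
        return self.W_out(v_hat)                     # scores; softmax lives in the loss

model = CBOW(vocab_size=10_000)
context = torch.randint(0, 10_000, (8, 4))           # batch of 8 windows, 2m = 4 context words
center = torch.randint(0, 10_000, (8,))
loss = nn.functional.cross_entropy(model(context), center)   # -log P(x_0 | context)
loss.backward()
```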

  11. The CBOW model: the word embeddings
  Word embeddings: the rows of the matrix corresponding to the output, that is, the rows of 𝒲.
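A small illustrative snippet of that lookup: the toy vocabulary, the word2id mapping, and the random W_out standing in for a trained 𝒲 are all assumptions made for the example.

```python
import numpy as np

# Illustrative setup (not from the slides): a tiny vocabulary and an output
# matrix W_out standing in for a trained 𝒲, of size |V| x n.
word2id = {"the": 0, "cat": 1, "sat": 2}
W_out = np.random.default_rng(0).normal(size=(len(word2id), 4))

# The embedding of a word is simply the row of W_out indexed by that word's id.
cat_vec = W_out[word2id["cat"]]      # an n-dimensional vector (here n = 4)
```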
