Word Embedding
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani, Fall 2018
Many slides have been adapted from Socher's lectures, cs224d, Stanford, 2017.
One-hot coding
} Each word is represented by a vector of length V (the vocabulary size) that is all zeros except for a single 1 at that word's index
} Such vectors carry no notion of similarity: every pair of distinct words is equally far apart
Distributed similarity-based representations
} Representing a word by means of its neighbors
} "You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
} One of the most successful ideas of modern statistical NLP
Word embedding
} Store "most" of the important information in a fixed, small number of dimensions: a dense vector
} Usually around 25–1000 dimensions
} Embeddings: distributional models with dimensionality reduction, based on prediction
How to make neighbors represent words?
} Answer: with a co-occurrence matrix X
} Options: full document vs. windows
} Full word-document co-occurrence matrix
  } Will give general topics (all sports terms will have similar entries), leading to "Latent Semantic Analysis"
} Window around each word
  } Captures both syntactic (POS) and semantic information (a minimal sketch of building such a matrix follows)
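A minimal sketch of the window-based option; the toy corpus, window size, and vocabulary handling here are illustrative assumptions, not part of the slides:

```python
import numpy as np

# Toy corpus; in practice this would be a large tokenized collection.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "floor"]]

vocab = sorted({w for sent in corpus for w in sent})
word2id = {w: i for i, w in enumerate(vocab)}
V, window = len(vocab), 2

# X[i, j] = number of times word j appears within `window` positions of word i.
X = np.zeros((V, V))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[word2id[w], word2id[sent[j]]] += 1

print(X[word2id["sat"]])  # co-occurrence counts of "sat" with every vocabulary word
```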
LSA: Dimensionality reduction based on the word-document matrix
} [Figure: SVD of the words × docs matrix X; maintaining only the k largest singular values of X gives the embedded words]
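A sketch of this dimensionality-reduction step, assuming a word-document count matrix X has already been built (the shapes, the random counts, and k are illustrative only):

```python
import numpy as np

# X: V x D word-document count matrix (random here, just to show the shapes).
V, D, k = 1000, 200, 50
X = np.random.poisson(0.1, size=(V, D)).astype(float)

# SVD, then keep only the k largest singular values.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :k] * S[:k]     # each row: a k-dimensional embedded word

print(word_vectors.shape)           # (1000, 50)
```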
Problems with SVD
} Its computational cost scales quadratically for an n × m matrix: O(mn²) flops (when n < m)
  } Bad for millions of words or documents
} Hard to incorporate new words or documents
} Does not consider the order of words in the documents
Directly learn low-dimensional word vectors
} Old idea, relevant for this lecture:
  } Learning representations by back-propagating errors (Rumelhart et al., 1986)
  } NNLM: A neural probabilistic language model (Bengio et al., 2003)
  } NLP (almost) from Scratch (Collobert & Weston, 2008)
} A recent, even simpler and faster model: word2vec (Mikolov et al., 2013) → intro now
word2vec
} Key idea: the word vector can predict surrounding words
} word2vec, as originally described (Mikolov et al., 2013): a NN model using a two-layer network (i.e., not deep!) to perform dimensionality reduction
} Faster, and can easily incorporate a new sentence/document or add a word to the vocabulary
} Very computationally efficient, good all-round model (good hyper-parameters already selected)
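In practice the model is rarely implemented from scratch. A hedged usage sketch with the gensim library; the parameter names assume gensim 4.x and the toy corpus is an assumption:

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "floor"]]

# sg=1 selects skip-gram (sg=0 would be CBOW); negative=5 enables negative sampling.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

print(model.wv["cat"][:5])              # first 5 dimensions of the vector for "cat"
print(model.wv.most_similar("cat"))     # nearest neighbors by cosine similarity
```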
Skip-gram vs. CBOW
} Two possible architectures:
  } Given some context words, predict the center word (CBOW)
    } Predict the center word from the sum of the surrounding word vectors
  } Given a center word, predict the contexts (Skip-gram)
} CBOW uses a window of words to predict the middle word; Skip-gram uses a word to predict the surrounding words
} [Figure: Continuous Bag of Words (CBOW) and Skip-gram architectures]
Continuous Bag of Words: Example
} E.g., "The cat sat on floor"
} Window size = 2
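A small sketch of how (context, center) training pairs could be extracted from this sentence with window size 2; the pairing code is illustrative, not taken from the slides:

```python
sentence = ["the", "cat", "sat", "on", "floor"]
window = 2

# For CBOW: the context words within the window predict the center word.
pairs = []
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    pairs.append((context, center))

for context, center in pairs:
    print(context, "->", center)
# e.g. ['the', 'cat', 'on', 'floor'] -> sat
```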
Continuous Bag of Words: Example
} [Figure: network with one-hot input vectors (V-dim) for the context words "cat" and "on" (the index of "cat" in the vocabulary marks the 1 in its one-hot vector), a hidden layer, and a one-hot output vector for the center word "sat"]
Continuous Bag of Words: Example
} We must learn W and W′
} [Figure: the one-hot context inputs ("cat", "on", each V-dim) are multiplied by W (V×d), the d-dim hidden layer is multiplied by W′ (d×V), and the output is compared to the one-hot vector of "sat" (V-dim)]
} d will be the size of the word vectors
Word embedding matrix
} You get the word vector by left-multiplying a one-hot vector by W
} W is a V×d matrix with one row per vocabulary word (aardvark, …, zebra)
} If y is the one-hot vector with y_k = 1, then h = yᵀW = W_{k,:} = w_k, the k-th row of the matrix W
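A tiny numeric check of this fact, with an arbitrary 5-word vocabulary and 3-dimensional vectors (the values are made up):

```python
import numpy as np

V, d = 5, 3
W = np.arange(V * d, dtype=float).reshape(V, d)   # embedding matrix, one row per word

k = 2                        # index of the word in the vocabulary
x = np.zeros(V); x[k] = 1.0  # one-hot vector y with y_k = 1

h = x @ W                    # left-multiplying by the one-hot vector...
print(np.allclose(h, W[k]))  # ...just selects the k-th row of W: True
```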
Continuous Bag of Words: Example
} Wᵀ × x_on = w_on: multiplying Wᵀ (whose columns contain the word vectors) by the one-hot vector of "on" picks out that word's vector
} ŵ = (w_cat + w_on) / 2
} [Figure: numeric example of the product of the d×V matrix Wᵀ with the one-hot vector of "on"]
Continuous Bag of Words: Example
} Wᵀ × x_cat = w_cat: the one-hot vector of "cat" likewise selects the corresponding column of Wᵀ
} ŵ = (w_cat + w_on) / 2
} [Figure: numeric example of the product of Wᵀ with the one-hot vector of "cat"]
Continuous Bag of Words: Example
} W′ᵀ × ŵ = z and ŷ = softmax(z)
} [Figure: the d-dim hidden vector ŵ is multiplied by W′ (d×V) to produce the V-dim score vector z, which the softmax turns into the output ŷ]
Continuous Bag of Words: Example
} ŷ = softmax(z); we would prefer ŷ to be close to y_sat, the one-hot vector of the true center word
} [Figure: the softmax output ŷ is a V-dim probability vector, e.g. (0.01, 0.02, 0.00, 0.02, 0.01, 0.02, 0.01, 0.7, …, 0.00), ideally putting most of its mass on "sat"]
Continuous Bag of Words: Example
} The rows of W contain the words' vectors
} [Figure: the same network, with W (V×d) on the input side and W′ (d×V) on the output side]
} We can consider either W or W′ as the word's representation, or even take the average (a numpy sketch of the full forward pass follows)
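A numpy sketch of the CBOW forward pass walked through above, for context words "cat" and "on" predicting "sat". The initialization is random, so the probabilities are meaningless until training; the shapes follow the slides:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "floor"]
word2id = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 3

rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))       # input (embedding) matrix, to be learned
W_out = rng.normal(size=(d, V))   # output matrix W', to be learned

# Forward pass: average the context word vectors, score every word, softmax.
context = ["cat", "on"]
h_hat = np.mean([W[word2id[w]] for w in context], axis=0)   # ŵ = (w_cat + w_on)/2
z = W_out.T @ h_hat                                         # V scores
y_hat = np.exp(z) / np.exp(z).sum()                         # ŷ = softmax(z)

loss = -np.log(y_hat[word2id["sat"]])   # cross-entropy against the true center word
print(y_hat, loss)
```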
Skip-gram
} Embeddings that are good at predicting neighboring words are also good at representing similarity
} [Figure: skip-gram network — the one-hot vector of the center word "sat" (V-dim) is multiplied by W (V×d) to give the d-dim hidden vector ŵ, which is multiplied by W′ (d×V) to produce one output distribution per context position ("cat", "on")]
Details of Word2Vec
} Learn to predict surrounding words in a window of length m around every word
} Objective function: maximize the log probability of any context word given the current center word:

  J(θ) = (1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t)

  } T: training set size
  } m: context size (usually 5–10)
  } w_j: vector representation of the j-th word
  } θ: all parameters of the network
} Use a large training corpus to maximize it
Skip-gram
} w_o: context or output (outside) word; w_c: center or input word
} h = Wᵀ x_c = W_{c,:} = w_c (the row of W for the center word), and v_o = W′_{:,o} (the column of W′ for the outside word)
} score(w_o, w_c) = hᵀ W′_{:,o} = v_oᵀ w_c
} P(w_o | w_c) = exp(v_oᵀ w_c) / ∑_{x=1}^{V} exp(v_xᵀ w_c)   (a small numpy sketch of this probability follows)
} Every word has 2 vectors:
  } w_x: when x is the center word
  } v_x: when x is the outside word (context word)
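A sketch of this probability in numpy, using random vectors for the two lookup tables; the names W and V_out stand in for the center-word and outside-word vectors and are assumptions of this sketch:

```python
import numpy as np

V_size, d = 5, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(V_size, d))      # w_x: vectors used when x is the center word
V_out = rng.normal(size=(V_size, d))  # v_x: vectors used when x is the outside word

def p_outside_given_center(o, c):
    """P(w_o | w_c) = exp(v_o . w_c) / sum_x exp(v_x . w_c)."""
    scores = V_out @ W[c]                        # v_x . w_c for every word x
    exp_scores = np.exp(scores - scores.max())   # numerically stabilized softmax
    return exp_scores[o] / exp_scores.sum()

print(p_outside_given_center(o=1, c=3))
```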
Details of Word2Vec
} Predict surrounding words in a window of length m around every word:

  P(w_o | w_c) = exp(v_oᵀ w_c) / ∑_{x=1}^{V} exp(v_xᵀ w_c)

  J(θ) = (1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t)
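Putting the two formulas together, a sketch of evaluating this objective on a toy token stream; the corpus, the random initialization, and the helper log_p are assumptions, and a real trainer would maximize J by gradient ascent:

```python
import numpy as np

tokens = ["the", "cat", "sat", "on", "the", "floor"]
vocab = sorted(set(tokens))
word2id = {w: i for i, w in enumerate(vocab)}
V_size, d, m = len(vocab), 3, 2

rng = np.random.default_rng(0)
W = rng.normal(size=(V_size, d))      # center-word vectors w_x
V_out = rng.normal(size=(V_size, d))  # outside-word vectors v_x

def log_p(o, c):
    """log P(w_o | w_c) under the softmax model."""
    scores = V_out @ W[c]
    return scores[o] - scores.max() - np.log(np.exp(scores - scores.max()).sum())

# J(theta) = (1/T) sum_t sum_{-m<=j<=m, j!=0} log p(w_{t+j} | w_t)
T = len(tokens)
J = 0.0
for t in range(T):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < T:
            J += log_p(word2id[tokens[t + j]], word2id[tokens[t]])
J /= T
print(J)
```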
Parameters
} θ collects all of the word vectors: the center-word vector w_x and the outside-word vector v_x for every word x in the vocabulary
Review: Iterative optimization of an objective function
} Objective function: J(θ)
} Optimization problem: θ̂ = argmax_θ J(θ)
} Steps:
  } Start from θ⁰
  } Repeat
    } Update θᵗ to θᵗ⁺¹ in order to increase J
    } t ← t + 1
  } until we hopefully end up at a maximum
Review: Gradient ascent
} First-order optimization algorithm to find θ̂ = argmax_θ J(θ)
} Also known as "steepest ascent"
} In each step, takes a step proportional to the gradient vector of the function at the current point θᵗ: J(θ) increases fastest if one goes from θᵗ in the direction of ∇_θ J(θᵗ)
} Assumption: J(θ) is defined and differentiable in a neighborhood of the point θᵗ
Review: Gradient ascent
} Maximize J(θ):

  θᵗ⁺¹ = θᵗ + η ∇_θ J(θᵗ)

  } η: step size (learning rate parameter)
  } ∇_θ J(θ) = [∂J(θ)/∂θ_1, ∂J(θ)/∂θ_2, …, ∂J(θ)/∂θ_d]
} If η is small enough, then J(θᵗ⁺¹) ≥ J(θᵗ)
} η can be allowed to change at every iteration, as η_t
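A minimal sketch of this update rule on a simple concave function; the objective, its maximizer, and the step size are illustrative assumptions:

```python
import numpy as np

def J(theta):                      # a simple concave objective with maximum at (1, -2)
    return -np.sum((theta - np.array([1.0, -2.0])) ** 2)

def grad_J(theta):
    return -2.0 * (theta - np.array([1.0, -2.0]))

theta = np.zeros(2)                # theta^0
eta = 0.1                          # step size (learning rate)
for t in range(100):
    theta = theta + eta * grad_J(theta)   # theta^{t+1} = theta^t + eta * grad J(theta^t)

print(theta)                       # converges close to [1, -2]
```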
Gradient
} For the skip-gram softmax:

  ∂ log p(w_o | w_c) / ∂w_c = ∂/∂w_c [ log exp(v_oᵀ w_c) − log ∑_x exp(v_xᵀ w_c) ]
                            = v_o − (1 / ∑_x exp(v_xᵀ w_c)) ∑_x exp(v_xᵀ w_c) v_x
                            = v_o − ∑_x p(w_x | w_c) v_x
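A quick numerical check of this gradient on a small random vocabulary; the finite-difference comparison is only there to confirm the derivation, not part of any training loop:

```python
import numpy as np

V_size, d = 6, 4
rng = np.random.default_rng(1)
W = rng.normal(size=(V_size, d))      # center vectors w_x
V_out = rng.normal(size=(V_size, d))  # outside vectors v_x
o, c = 2, 4                           # an (outside, center) word pair

def log_p(w_c):
    scores = V_out @ w_c
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient: v_o - sum_x p(w_x | w_c) v_x
scores = V_out @ W[c]
p = np.exp(scores) / np.exp(scores).sum()
grad = V_out[o] - p @ V_out

# Finite-difference gradient for comparison.
eps = 1e-6
num = np.array([(log_p(W[c] + eps * e) - log_p(W[c] - eps * e)) / (2 * eps)
                for e in np.eye(d)])
print(np.allclose(grad, num, atol=1e-5))   # True: the derivation checks out
```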
Training difficulties
} With large vocabularies, this gradient is not scalable: the sum runs over all V words

  ∂ log p(w_o | w_c) / ∂w_c = v_o − ∑_{x=1}^{V} p(w_x | w_c) v_x

} Instead, use negative sampling: only sample a few words that do not appear in the context (a sketch follows)
  } Similar to focusing on mostly positive correlations
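A hedged sketch of the negative-sampling objective for one (center, outside) pair. The uniform sampling and the value of k are simplifying assumptions; word2vec actually samples negatives from a unigram distribution raised to the 3/4 power:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V_size, d, k = 10000, 100, 5
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V_size, d))      # center vectors w_x
V_out = rng.normal(scale=0.1, size=(V_size, d))  # outside vectors v_x

c, o = 42, 7                                     # a (center, outside) training pair
neg = rng.integers(0, V_size, size=k)            # k sampled "negative" words
                                                 # (uniform here; word2vec uses unigram^0.75)

# Minimize: -log sigmoid(v_o . w_c) - sum_neg log sigmoid(-v_neg . w_c)
# Only k+1 outside vectors are touched, instead of all V as in the full softmax.
loss = -np.log(sigmoid(V_out[o] @ W[c])) \
       - np.sum(np.log(sigmoid(-V_out[neg] @ W[c])))
print(loss)
```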