Lecture 7: Word Embeddings
Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
- Learning word vectors (cont.)
- Representation learning in NLP
Recap: Latent Semantic Analysis
- Data representation
  - Encode single-relational data in a matrix
    - Co-occurrence (e.g., from a general corpus)
    - Synonyms (e.g., from a thesaurus)
- Factorization
  - Apply SVD to the matrix to find latent components
- Measuring degree of relation
  - Cosine of latent vectors
Recap: Mapping to Latent Space via SVD
- SVD: $C \approx U \Sigma V^T$, with sizes $d \times n \approx (d \times k)(k \times k)(k \times n)$
- SVD generalizes the original data
  - Uncovers relationships not explicit in the thesaurus
  - Term vectors projected to $k$-dim latent space
- Word similarity: cosine of two column vectors in $\Sigma V^T$
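A minimal numpy sketch of the LSA pipeline recapped above: build a small word-by-context co-occurrence matrix, factor it with SVD, keep $k$ latent dimensions, and compare words by cosine similarity. The word list, toy counts, and $k$ are made-up illustrations, not data from the lecture.

```python
# Toy LSA pipeline: co-occurrence matrix -> truncated SVD -> cosine similarity.
import numpy as np

words = ["joy", "gladden", "sorrow", "sadden"]
C = np.array([[4., 3., 0., 1.],
              [3., 4., 1., 0.],
              [0., 1., 5., 4.],
              [1., 0., 4., 5.]])          # made-up word-by-context counts

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]              # rows of U * Sigma: words in the k-dim latent space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(words[0], words[1], cosine(word_vecs[0], word_vecs[1]))   # similar pair: high cosine
print(words[0], words[2], cosine(word_vecs[0], word_vecs[2]))   # dissimilar pair: lower cosine
```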
Low rank approximation
- Frobenius norm: for an $m \times n$ matrix $C$,
  $\|C\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |c_{ij}|^2}$
- Rank of a matrix
  - The number of linearly independent rows (or columns) of the matrix
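A quick numpy check of the two definitions above; the small matrix is arbitrary and chosen so that one row is a multiple of another.

```python
# Check the Frobenius norm and rank definitions with numpy.
import numpy as np

C = np.array([[1., 2., 3.],
              [2., 4., 6.],      # = 2 * first row, so it adds no new direction
              [0., 1., 1.]])

# Frobenius norm: square root of the sum of squared entries
frob_manual = np.sqrt((C ** 2).sum())
frob_numpy  = np.linalg.norm(C, ord="fro")
print(frob_manual, frob_numpy)            # identical

# Rank: number of linearly independent rows/columns
print(np.linalg.matrix_rank(C))           # 2, since row 2 is a multiple of row 1
```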
Low rank approximation
- Low rank approximation problem:
  $\min_{Z} \|C - Z\|_F \quad \text{s.t. } \mathrm{rank}(Z) = k$
- If I can only use $k$ independent vectors to describe the points in the space, what are the best choices?
  - Essentially, we minimize the "reconstruction loss" under a low rank constraint
Low rank approximation
- Assume the rank of $C$ is $r$
- SVD: $C = U \Sigma V^T$, with $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_r, 0, \dots, 0)$ ($r$ non-zero singular values)
- Zero out the $r - k$ trailing values: $\Sigma' = \mathrm{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0)$
- $C' = U \Sigma' V^T$ is the best rank-$k$ approximation:
  $C' = \arg\min_{Z} \|C - Z\|_F \quad \text{s.t. } \mathrm{rank}(Z) = k$
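A small numpy illustration of the claim above: zeroing out the trailing singular values yields a rank-$k$ matrix, and its Frobenius reconstruction error equals the norm of the discarded singular values (the Eckart-Young result). The random matrix and $k$ are arbitrary.

```python
# Truncated SVD as the best rank-k approximation in Frobenius norm.
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((6, 5))

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
s_trunc = s.copy()
s_trunc[k:] = 0.0                          # zero out the trailing singular values
C_k = (U * s_trunc) @ Vt                   # C' = U Sigma' V^T

print(np.linalg.matrix_rank(C_k))          # k
print(np.linalg.norm(C - C_k))             # Frobenius reconstruction error
print(np.sqrt((s[k:] ** 2).sum()))         # equals the error above
```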
Word2Vec
- LSA: a compact representation of the co-occurrence matrix
- Word2Vec: predict surrounding words (skip-gram)
  - Similar to using co-occurrence counts (Levy & Goldberg 2014, Pennington et al. 2014)
- Easy to incorporate new words or sentences
Word2Vec
- Similar to a language model, but predicting the next word is not the goal
- Idea: words that are semantically similar often occur near each other in text
  - Embeddings that are good at predicting neighboring words are also good at representing similarity
Skip-gram vs. Continuous bag-of-words (CBOW)
- What are the differences?
Skip-gram vs. Continuous bag-of-words (CBOW)
- (Architecture diagrams: skip-gram predicts the surrounding context words from the center word; CBOW predicts the center word from its context)
Objective of Word2Vec (Skip-gram)
- Maximize the log likelihood of the context words $x_{t-m}, x_{t-m+1}, \dots, x_{t-1}, x_{t+1}, x_{t+2}, \dots, x_{t+m}$ given the center word $x_t$:
  $\sum_{t} \sum_{-m \le j \le m,\ j \neq 0} \log p(x_{t+j} \mid x_t)$
- $m$ is usually 5 to 10
Objective of Word2Vec (Skip-gram)
- How do we model $\log p(x_{t+j} \mid x_t)$?
  $p(x_{t+j} \mid x_t) = \dfrac{\exp(v_{x_{t+j}} \cdot w_{x_t})}{\sum_{x'} \exp(v_{x'} \cdot w_{x_t})}$
- The softmax function, again!
- Every word has 2 vectors
  - $w_x$: when $x$ is the center word
  - $v_x$: when $x$ is the outside word (context word)
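A tiny numpy sketch of the softmax parameterization above, with separate center vectors ($W$) and outside vectors ($V$) for every word; the vocabulary size, dimensionality, and the function name are made up for illustration.

```python
# Skip-gram softmax: p(o | c) = exp(v_o . w_c) / sum_x exp(v_x . w_c).
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
W = rng.standard_normal((vocab_size, dim))   # center-word vectors w_x
V = rng.standard_normal((vocab_size, dim))   # outside-word vectors v_x

def p_context_given_center(c):
    scores = V @ W[c]                        # v_x . w_c for every word x
    scores -= scores.max()                   # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

probs = p_context_given_center(c=3)
print(probs.sum())                           # 1.0: a proper distribution over the vocabulary
print(probs[7])                              # p(o = 7 | c = 3)
```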
How to update?
  $p(x_{t+j} \mid x_t) = \dfrac{\exp(v_{x_{t+j}} \cdot w_{x_t})}{\sum_{x'} \exp(v_{x'} \cdot w_{x_t})}$
- How do we minimize $J(\theta)$?
  - Gradient descent!
- How do we compute the gradient?
Recap: Calculus
- Gradient: for $\mathbf{y}^T = [y_1 \; y_2 \; y_3]$,
  $\nabla F(\mathbf{y}) = \left[\dfrac{\partial F(\mathbf{y})}{\partial y_1}, \dfrac{\partial F(\mathbf{y})}{\partial y_2}, \dfrac{\partial F(\mathbf{y})}{\partial y_3}\right]^T$
- If $F(\mathbf{y}) = \mathbf{a} \cdot \mathbf{y}$ (also written $\mathbf{a}^T \mathbf{y}$), then $\nabla F(\mathbf{y}) = \mathbf{a}$
Recap: Calculus
- Chain rule: if $z = f(u)$ and $u = g(x)$ (i.e., $z = f(g(x))$), then
  $\dfrac{dz}{dx} = \dfrac{dz}{du} \cdot \dfrac{du}{dx}$
- Exercises: compute $dy/dx$ for, e.g., $y = \ln(x^2 + 5)$ and $y = \exp(x^4 + 3x + 2)$
Other useful formulas
- If $y = \exp(x)$, then $\dfrac{dy}{dx} = \exp(x)$
- If $y = \log x$, then $\dfrac{dy}{dx} = \dfrac{1}{x}$
- When I say log (in this course), I usually mean ln
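A worked application of the chain rule together with the exp/log derivatives above (the functions follow the form of the recap exercises): $\dfrac{d}{dx}\ln(x^2 + 5) = \dfrac{1}{x^2 + 5} \cdot 2x = \dfrac{2x}{x^2 + 5}$, and $\dfrac{d}{dx}\exp(x^4 + 3x + 2) = \exp(x^4 + 3x + 2) \cdot (4x^3 + 3)$.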
Example
- Assume the vocabulary set is $V$. We have one center word $c$ and one context word $o$.
- What is the conditional probability $p(o \mid c)$?
  $p(o \mid c) = \dfrac{\exp(v_o \cdot w_c)}{\sum_{o'} \exp(v_{o'} \cdot w_c)}$
- What is the gradient of the log likelihood w.r.t. $w_c$?
  $\dfrac{\partial \log p(o \mid c)}{\partial w_c} = v_o - \mathbb{E}_{x \sim p(x \mid c)}[v_x]$
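A small numpy sketch that checks the analytic gradient above against a finite-difference estimate; the vocabulary size, dimensionality, and the particular center/context word indices are arbitrary.

```python
# Numerically verify: d/dw_c log p(o | c) = v_o - E_{x ~ p(x|c)}[v_x].
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim = 8, 5
W = rng.standard_normal((vocab_size, dim))    # center vectors w_x
V = rng.standard_normal((vocab_size, dim))    # outside vectors v_x
c, o = 2, 6                                   # arbitrary center / context words

def log_p(o, c, W):
    scores = V @ W[c]
    m = scores.max()                          # stable log-sum-exp
    return scores[o] - m - np.log(np.exp(scores - m).sum())

# Analytic gradient: v_o minus the expected outside vector under p(. | c)
scores = V @ W[c]
probs = np.exp(scores - scores.max()); probs /= probs.sum()
grad_analytic = V[o] - probs @ V

# Finite-difference gradient w.r.t. w_c
eps, grad_numeric = 1e-6, np.zeros(dim)
for i in range(dim):
    Wp, Wm = W.copy(), W.copy()
    Wp[c, i] += eps; Wm[c, i] -= eps
    grad_numeric[i] = (log_p(o, c, Wp) - log_p(o, c, Wm)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))   # True
```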
Gradient Descent
- $\min_{w} J(w)$
- Update $w$: $w \leftarrow w - \eta \nabla J(w)$
Local minimum vs. global minimum
- (Illustration of an objective with multiple local minima and one global minimum)
Stochastic gradient descent
- Let $J(w) = \frac{1}{n} \sum_{i=1}^{n} J_i(w)$
- Gradient descent update rule:
  $w \leftarrow w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla J_i(w)$
- Stochastic gradient descent:
  - Approximate $\frac{1}{n} \sum_{i=1}^{n} \nabla J_i(w)$ by the gradient at a single example $\nabla J_j(w)$ (why?)
  - At each step: randomly pick an example $j$,
    $w \leftarrow w - \eta \nabla J_j(w)$
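A short sketch contrasting the two update rules above on a toy least-squares objective (not the word2vec objective); the data, step size, and iteration count are illustrative choices.

```python
# Compare full-gradient descent with SGD on J(w) = (1/n) sum_i 0.5*(x_i . w - y_i)^2.
import numpy as np

rng = np.random.default_rng(0)
n, dim = 100, 3
X = rng.standard_normal((n, dim))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

def grad_i(w, i):                      # gradient of a single example J_i
    return (X[i] @ w - y[i]) * X[i]

eta, w_gd, w_sgd = 0.05, np.zeros(dim), np.zeros(dim)
for _ in range(1000):
    # Gradient descent: average the per-example gradients over all n examples
    w_gd -= eta * np.mean([grad_i(w_gd, i) for i in range(n)], axis=0)
    # SGD: one randomly chosen example approximates that average (much cheaper)
    j = rng.integers(n)
    w_sgd -= eta * grad_i(w_sgd, j)

print(w_gd)    # close to true_w
print(w_sgd)   # noisier, but also close
```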
Negative sampling
- With a large vocabulary set, stochastic gradient descent is still not enough (why?)
  $\dfrac{\partial \log p(o \mid c)}{\partial w_c} = v_o - \mathbb{E}_{x \sim p(x \mid c)}[v_x]$
- Let's approximate it again!
  - Only sample a few words that do not appear in the context
  - Essentially, put more weight on positive samples
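A minimal sketch of the negative-sampling idea, following the standard skip-gram-with-negative-sampling (SGNS) objective $\log \sigma(v_o \cdot w_c) + \sum_{k} \log \sigma(-v_{n_k} \cdot w_c)$. The uniform noise distribution, learning rate, and number of negatives here are simplifying assumptions (word2vec samples negatives from a smoothed unigram distribution).

```python
# One SGNS update: score the observed context word against a few sampled negatives
# instead of normalizing over the whole vocabulary.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, num_neg, eta = 50, 10, 5, 0.05
W = 0.1 * rng.standard_normal((vocab_size, dim))   # center vectors w_x
V = 0.1 * rng.standard_normal((vocab_size, dim))   # outside vectors v_x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_update(c, o):
    """One SGD step for a (center, context) pair with sampled negatives."""
    # Simplification: uniform noise; a real implementation would also resample if a negative equals o.
    negs = rng.integers(vocab_size, size=num_neg)
    grad_wc = np.zeros(dim)
    # Positive pair: push v_o . w_c up
    g = sigmoid(V[o] @ W[c]) - 1.0
    grad_wc += g * V[o]
    V[o] -= eta * g * W[c]
    # Negative pairs: push v_neg . w_c down
    for n in negs:
        g = sigmoid(V[n] @ W[c])
        grad_wc += g * V[n]
        V[n] -= eta * g * W[c]
    W[c] -= eta * grad_wc

sgns_update(c=3, o=7)   # one training step for an observed (center, context) pair
```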
More about Word2Vec: relation to LSA
- LSA factorizes a matrix of co-occurrence counts
- Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!
- $\mathrm{PMI}(w, c) = \log \dfrac{P(w \mid c)}{P(w)} = \log \dfrac{P(w, c)}{P(w)P(c)} = \log \dfrac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)}$
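A small numpy sketch that computes the PMI matrix from raw co-occurrence counts, matching the count form of the formula above, plus the shift by $\log k$ from Levy and Goldberg; the toy counts, the value of $k$, and the zero-out convention for unseen pairs are illustrative choices.

```python
# PMI(w, c) = log( #(w,c) * |D| / (#(w) * #(c)) ), computed from toy counts.
import numpy as np

counts = np.array([[10.,  2.,  0.],
                   [ 3.,  8.,  1.],
                   [ 0.,  1.,  6.]])    # #(w, c): word-by-context counts (made up)

total = counts.sum()                                  # |D|
word_counts = counts.sum(axis=1, keepdims=True)       # #(w)
ctx_counts  = counts.sum(axis=0, keepdims=True)       # #(c)

with np.errstate(divide="ignore"):
    pmi = np.log(counts * total / (word_counts * ctx_counts))
pmi[np.isneginf(pmi)] = 0.0             # common convention: zero out unseen pairs

k = 5                                   # number of negative samples
shifted_pmi = pmi - np.log(k)           # SGNS implicitly factorizes PMI - log k
print(np.round(pmi, 2))
print(np.round(shifted_pmi, 2))
```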
All problems solved?
Continuous Semantic Representations
- (Illustration: word clusters such as sunny, rainy, cloudy, windy; car, cab, wheel; emotion, sad, joy, feeling)
Semantics Needs More Than Similarity
- Tomorrow will be rainy.
- Tomorrow will be sunny.
- Are "rainy" and "sunny" similar? Are they opposites?
Polarity Inducing LSA [Yih, Zweig, Platt 2012]
- Data representation
  - Encode two opposite relations in a matrix using "polarity"
    - Synonyms & antonyms (e.g., from a thesaurus)
- Factorization
  - Apply SVD to the matrix to find latent components
- Measuring degree of relation
  - Cosine of latent vectors
Encode Synonyms & Antonyms in Matrix
- Joyfulness: joy, gladden (synonyms); sorrow, sadden (antonyms)
- Sad: sorrow, sadden (synonyms); joy, gladden (antonyms)
- Inducing polarity: synonyms of a thesaurus group get +1, antonyms get -1; each target word is represented by its vector in this matrix

                          joy   gladden   sorrow   sadden   goodwill
  Group 1: "joyfulness"     1        1       -1       -1          0
  Group 2: "sad"           -1       -1        1        1          0
  Group 3: "affection"      0        0        0        0          1

- Cosine score: positive for synonyms, negative for antonyms
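A sketch of the polarity-inducing idea: build the signed group-by-word matrix above, factor it with SVD, and check that synonym pairs get positive cosine scores while antonym pairs get negative ones. The matrix mirrors the example table; $k$ and the layout (words as columns) are choices made for this sketch.

```python
# Polarity-inducing LSA sketch: synonyms +1, antonyms -1, then SVD + cosine.
import numpy as np

words = ["joy", "gladden", "sorrow", "sadden", "goodwill"]
M = np.array([[ 1.,  1., -1., -1., 0.],    # group "joyfulness"
              [-1., -1.,  1.,  1., 0.],    # group "sad"
              [ 0.,  0.,  0.,  0., 1.]])   # group "affection"

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
word_vecs = Vt[:k].T * s[:k]               # one k-dim latent vector per word (column of M)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(words[0], words[1], cosine(word_vecs[0], word_vecs[1]))  # joy vs. gladden: positive (synonyms)
print(words[0], words[2], cosine(word_vecs[0], word_vecs[2]))  # joy vs. sorrow: negative (antonyms)
```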
Continuous representations for entities
- (Illustration: embeddings of entities such as Democratic Party, Republican Party, George W. Bush, Laura Bush, and Michelle Obama, with one relation marked "?")
Continuous representations for entities
- Useful resources for NLP applications
  - Semantic Parsing & Question Answering
  - Information Extraction