Lecture 7: Word Embeddings



  1. Lecture 7: Word Embeddings. Kai-Wei Chang, CS @ University of Virginia. kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16

  2. This lecture
     - Learning word vectors (cont.)
     - Representation learning in NLP

  3. Recap: Latent Semantic Analysis
     - Data representation: encode single-relational data in a matrix
       - Co-occurrence (e.g., from a general corpus)
       - Synonyms (e.g., from a thesaurus)
     - Factorization: apply SVD to the matrix to find latent components
     - Measuring degree of relation: cosine of latent vectors

  4. Recap: Mapping to Latent Space via SVD
     - SVD factorization: $\mathbf{D} \approx \mathbf{V}\,\boldsymbol{\Sigma}\,\mathbf{W}^{\top}$, where $\mathbf{D}$ is $d \times n$, $\mathbf{V}$ is $d \times k$, $\boldsymbol{\Sigma}$ is $k \times k$, and $\mathbf{W}^{\top}$ is $k \times n$
     - SVD generalizes the original data
       - Uncovers relationships not explicit in the thesaurus
       - Term vectors are projected into the $k$-dim latent space
     - Word similarity: cosine of two column vectors in $\boldsymbol{\Sigma}\mathbf{W}^{\top}$

  5. Low rank approximation
     - Frobenius norm: for an $m \times n$ matrix $C$, $\|C\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |c_{ij}|^2}$
     - Rank of a matrix: how many row (or column) vectors of the matrix are linearly independent of each other
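
As a quick check of these definitions, here is a minimal numpy sketch (the matrix values are made up for illustration and are not from the lecture):

```python
import numpy as np

# A small co-occurrence-style matrix (toy values, for illustration only).
C = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [4.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])

# Frobenius norm: square root of the sum of squared entries.
fro = np.linalg.norm(C, ord="fro")
print(fro, np.sqrt((C ** 2).sum()))   # the two values agree

# Rank: number of linearly independent rows/columns.
print(np.linalg.matrix_rank(C))       # 2 here, since row 3 = 2 * row 1 and row 4 = (1/3) * row 2
```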

  6. Low rank approximation
     - Low rank approximation problem: $\min_{X} \|C - X\|_F$  s.t.  $\mathrm{rank}(X) = k$
     - If I can only use $k$ independent vectors to describe the points in the space, what are the best choices?
     - Essentially, we minimize the "reconstruction loss" under a low rank constraint

  7. Low rank approximation (cont.)
     - Same formulation as the previous slide: $\min_{X} \|C - X\|_F$  s.t.  $\mathrm{rank}(X) = k$

  8. Low rank approximation
     - Assume the rank of $C$ is $r$
     - SVD: $C = U \Sigma V^{\top}$, with $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_r, 0, \dots, 0)$ ($r$ non-zeros)
     - Zero out the trailing $r - k$ values: $\Sigma' = \mathrm{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0)$
     - $C_k = U \Sigma' V^{\top}$ is the best rank-$k$ approximation: $C_k = \arg\min_{X} \|C - X\|_F$  s.t.  $\mathrm{rank}(X) = k$
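
A minimal numpy sketch of the rank-k truncation described on this slide, using a small random matrix as toy input (the function name and data are illustrative, not from the lecture):

```python
import numpy as np

def best_rank_k(C, k):
    """Best rank-k approximation of C in Frobenius norm (Eckart-Young),
    obtained by keeping only the k largest singular values."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    s_trunc = np.copy(s)
    s_trunc[k:] = 0.0                      # zero out the trailing singular values
    return U @ np.diag(s_trunc) @ Vt

rng = np.random.default_rng(0)
C = rng.normal(size=(6, 5))
C2 = best_rank_k(C, 2)
print(np.linalg.matrix_rank(C2))           # 2
print(np.linalg.norm(C - C2, ord="fro"))   # reconstruction error
```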

  9. Word2Vec
     - LSA: a compact representation of a co-occurrence matrix
     - Word2Vec: predict surrounding words (skip-gram)
       - Similar to using co-occurrence counts (Levy & Goldberg 2014; Pennington et al. 2014)
     - Easy to incorporate new words or sentences

  10. Word2Vec
     - Similar to a language model, but predicting the next word is not the goal
     - Idea: words that are semantically similar often occur near each other in text
     - Embeddings that are good at predicting neighboring words are also good at representing similarity

  11. Skip-gram vs. continuous bag-of-words
     - What are the differences?

  12. Skip-gram vs. continuous bag-of-words

  13. Objective of Word2Vec (Skip-gram)
     - Maximize the log likelihood of the context words $w_{t-m}, w_{t-m+1}, \dots, w_{t-1}, w_{t+1}, w_{t+2}, \dots, w_{t+m}$ given the center word $w_t$
     - $m$ is usually 5~10
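
Written out, the objective described on this slide is the standard skip-gram log likelihood (here $T$ denotes the corpus length, a symbol the slide itself does not introduce):

$$\max_{\theta}\; J(\theta) = \frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p\!\left(w_{t+j} \mid w_t\right)$$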

  14. Objective of Word2Vec (Skip-gram)
     - How to model $\log p(w_{t+j} \mid w_t)$?
       $$p(w_{t+j} \mid w_t) = \frac{\exp(u_{w_{t+j}} \cdot v_{w_t})}{\sum_{w'} \exp(u_{w'} \cdot v_{w_t})}$$
     - The softmax function, again!
     - Every word has 2 vectors:
       - $v_w$: when $w$ is the center word
       - $u_w$: when $w$ is the outside word (context word)
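
A minimal numpy sketch of this softmax, assuming a toy vocabulary and randomly initialized vectors (all names below are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 8

# Two vectors per word, as on the slide: V[w] when w is the center word,
# U[w] when w is the outside (context) word. Random toy values.
V = rng.normal(scale=0.1, size=(len(vocab), dim))   # center-word vectors v_w
U = rng.normal(scale=0.1, size=(len(vocab), dim))   # outside-word vectors u_w

def p_context_given_center(o, c):
    """p(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c): softmax over the vocabulary."""
    scores = U @ V[c]                      # dot product of v_c with every outside vector
    scores -= scores.max()                 # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

c, o = vocab.index("cat"), vocab.index("sat")
print(p_context_given_center(o, c))        # one probability; the full distribution sums to 1
```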

  15. How to update?
       $$p(w_{t+j} \mid w_t) = \frac{\exp(u_{w_{t+j}} \cdot v_{w_t})}{\sum_{w'} \exp(u_{w'} \cdot v_{w_t})}$$
     - How to minimize $J(\theta)$? Gradient descent!
     - How to compute the gradient?

  16. Recap: Calculus
     - Gradient: for $\mathbf{x} = (x_1, x_2, x_3)^{\top}$,
       $$\nabla f(\mathbf{x}) = \left(\frac{\partial f(\mathbf{x})}{\partial x_1},\; \frac{\partial f(\mathbf{x})}{\partial x_2},\; \frac{\partial f(\mathbf{x})}{\partial x_3}\right)^{\top}$$
     - If $f(\mathbf{x}) = \mathbf{a} \cdot \mathbf{x}$ (also written $\mathbf{a}^{\top}\mathbf{x}$), then $\nabla f(\mathbf{x}) = \mathbf{a}$

  17. Recap: Calculus
     - Chain rule: if $y = f(u)$ and $u = g(x)$ (i.e., $y = f(g(x))$), then
       $$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = f'(g(x))\, g'(x)$$
     - Exercises: compute $\frac{dy}{dx}$ for composite functions, e.g., a polynomial, the log of a polynomial, and the exp of a polynomial

  18. Other useful formulas
     - If $y = \exp(x)$, then $\frac{dy}{dx} = \exp(x)$
     - If $y = \log x$, then $\frac{dy}{dx} = \frac{1}{x}$
     - When I say log (in this course), I usually mean ln
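
As a worked illustration combining the chain rule with the two formulas above (the specific functions are my own examples, not the ones on the exercise slide):

$$\frac{d}{dx}\log\!\left(x^2 + 5\right) = \frac{1}{x^2 + 5}\cdot 2x = \frac{2x}{x^2 + 5},
\qquad
\frac{d}{dx}\exp\!\left(x^3 + 3x\right) = \left(3x^2 + 3\right)\exp\!\left(x^3 + 3x\right)$$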

  20. Example
     - Assume the vocabulary set is $W$. We have one center word $c$ and one context word $o$.
     - What is the conditional probability $p(o \mid c)$?
       $$p(o \mid c) = \frac{\exp(u_o \cdot v_c)}{\sum_{w'} \exp(u_{w'} \cdot v_c)}$$
     - What is the gradient of the log likelihood w.r.t. $v_c$?
       $$\frac{\partial \log p(o \mid c)}{\partial v_c} = u_o - \mathbb{E}_{w \sim p(w \mid c)}\!\left[u_w\right]$$
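
For reference, the gradient on this slide follows from standard softmax calculus, using the notation of the reconstruction above:

$$\frac{\partial}{\partial v_c} \log p(o \mid c)
= \frac{\partial}{\partial v_c}\left[\, u_o \cdot v_c - \log \sum_{w'} \exp(u_{w'} \cdot v_c)\right]
= u_o - \sum_{w'} p(w' \mid c)\, u_{w'}
= u_o - \mathbb{E}_{w \sim p(w \mid c)}\!\left[u_w\right]$$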

  21. Gradient Descent
     - Goal: $\min_w J(w)$
     - Update: $w \leftarrow w - \eta \nabla J(w)$

  22. Local minimum vs. global minimum

  23. Stochastic gradient descent
     - Let $J(w) = \frac{1}{n}\sum_{i=1}^{n} J_i(w)$
     - Gradient descent update rule: $w \leftarrow w - \eta\,\frac{1}{n}\sum_{i=1}^{n} \nabla J_i(w)$
     - Stochastic gradient descent: approximate $\frac{1}{n}\sum_{i=1}^{n} \nabla J_i(w)$ by the gradient at a single example, $\nabla J_i(w)$ (why?)
     - At each step: randomly pick an example $i$, then $w \leftarrow w - \eta \nabla J_i(w)$
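
A minimal sketch of the gradient-descent and SGD updates on a toy least-squares objective standing in for $J(w)$ (the objective, data, and names are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: J(w) = (1/n) * sum_i (x_i . w - y_i)^2, where J_i is the loss on example i.
n, d = 200, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.01 * rng.normal(size=n)

def grad_Ji(w, i):
    """Gradient of the per-example loss J_i(w) = (x_i . w - y_i)^2."""
    return 2.0 * (X[i] @ w - y[i]) * X[i]

eta = 0.01
w = np.zeros(d)

# One full gradient-descent step: average the per-example gradients (shown for comparison).
w_gd = w - eta * np.mean([grad_Ji(w, i) for i in range(n)], axis=0)

# Stochastic gradient descent: one randomly chosen example per step.
for _ in range(2000):
    i = rng.integers(n)
    w = w - eta * grad_Ji(w, i)

print(np.linalg.norm(w - true_w))   # small: SGD ends up close to the true weights
```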

  24. Negative sampling
     - With a large vocabulary set, stochastic gradient descent is still not enough (why?)
       $$\frac{\partial \log p(o \mid c)}{\partial v_c} = u_o - \mathbb{E}_{w \sim p(w \mid c)}\!\left[u_w\right]$$
     - Let's approximate it again!
       - Only sample a few words that do not appear in the context
       - Essentially, put more weight on positive samples
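
A sketch of one negative-sampling update in the style of Mikolov et al. (2013), which this slide alludes to; the exact loss form, function names, and toy data below are my additions, not the lecture's:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(V, U, c, o, neg_ids, eta=0.05):
    """One SGD step on the skip-gram negative-sampling loss
    -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)
    for center word c, observed context word o, and sampled negative word ids."""
    v_c = V[c]
    grad_v = np.zeros_like(v_c)

    # Positive (observed) pair: pull u_o and v_c together.
    g = sigmoid(U[o] @ v_c) - 1.0          # d(loss)/d(score) for the positive pair
    grad_v += g * U[o]
    U[o] -= eta * g * v_c

    # Negative (sampled) pairs: push u_k and v_c apart.
    for k in neg_ids:
        g = sigmoid(U[k] @ v_c)            # d(loss)/d(score) for a negative pair
        grad_v += g * U[k]
        U[k] -= eta * g * v_c

    V[c] -= eta * grad_v

# Toy usage: 10-word vocabulary, 8-dim vectors, 3 negative samples per update.
vocab_size, dim = 10, 8
V = 0.1 * rng.normal(size=(vocab_size, dim))   # center-word vectors
U = 0.1 * rng.normal(size=(vocab_size, dim))   # outside-word vectors
negatives = rng.integers(vocab_size, size=3)
sgns_step(V, U, c=1, o=2, neg_ids=negatives)
```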

  25. More about Word2Vec: relation to LSA
     - LSA factorizes a matrix of co-occurrence counts
     - Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!
       $$\mathrm{PMI}(w, c) = \log \frac{P(w \mid c)}{P(w)} = \log \frac{P(w, c)}{P(w)P(c)} = \log \frac{\#(w, c) \cdot |D|}{\#(w)\,\#(c)}$$
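
A minimal numpy sketch that builds a (shifted, positive) PMI matrix from toy co-occurrence counts, following the formula above; the counts and the shift value are illustrative:

```python
import numpy as np

# Toy word-context co-occurrence counts #(w, c); rows are words, columns are contexts.
counts = np.array([[10.0, 2.0, 0.0],
                   [3.0,  8.0, 1.0],
                   [0.0,  1.0, 6.0]])

total = counts.sum()                           # |D|: total number of (w, c) pairs
w_counts = counts.sum(axis=1, keepdims=True)   # #(w)
c_counts = counts.sum(axis=0, keepdims=True)   # #(c)

# PMI(w, c) = log( #(w, c) * |D| / (#(w) * #(c)) ); zero counts give -inf here.
with np.errstate(divide="ignore"):
    pmi = np.log(counts * total / (w_counts * c_counts))

# Shifted positive PMI, the variant Levy & Goldberg relate to skip-gram
# trained with k negative samples: max(PMI - log k, 0).
k = 5
sppmi = np.maximum(pmi - np.log(k), 0.0)
print(sppmi)
```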

  26. All problems solved?

  27. Continuous Semantic Representations
     - (Figure showing example words in the embedding space: sunny, rainy, cloudy, windy, car, emotion, cab, sad, wheel, joy, feeling)

  28. Semantics Needs More Than Similarity
     - Tomorrow will be rainy. / Tomorrow will be sunny.
     - Similar(rainy, sunny)?
     - Antonym(rainy, sunny)?

  29. Polarity Inducing LSA [Yih, Zweig, Platt 2012]
     - Data representation: encode two opposite relations in a matrix using "polarity"
       - Synonyms & antonyms (e.g., from a thesaurus)
     - Factorization: apply SVD to the matrix to find latent components
     - Measuring degree of relation: cosine of latent vectors

  30. Encode Synonyms & Antonyms in Matrix
     - Joyfulness: joy, gladden; sorrow, sadden
     - Sad: sorrow, sadden; joy, gladden
     - Target word: row vector; inducing polarity: antonym entries get -1

       Thesaurus group           joy   gladden   sorrow   sadden   goodwill
       Group 1: "joyfulness"      +1        +1       -1       -1          0
       Group 2: "sad"             -1        -1       +1       +1          0
       Group 3: "affection"        0         0        0        0         +1

     - Cosine score for synonyms: positive (+)

  31. Encode Synonyms & Antonyms in Matrix
     - Same matrix as the previous slide
     - Cosine score for antonyms: negative (-)
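
A small numpy sketch of the polarity-inducing matrix from these two slides, checking the sign of the cosine between word columns (my reading of which vectors the slides compare):

```python
import numpy as np

# Polarity-inducing matrix: rows are thesaurus groups, columns are target words;
# +1 for synonyms of the group, -1 for antonyms, 0 otherwise.
words = ["joy", "gladden", "sorrow", "sadden", "goodwill"]
M = np.array([[ 1,  1, -1, -1, 0],   # group "joyfulness"
              [-1, -1,  1,  1, 0],   # group "sad"
              [ 0,  0,  0,  0, 1]],  # group "affection"
             dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

joy, gladden, sorrow = (M[:, words.index(w)] for w in ["joy", "gladden", "sorrow"])
print(cosine(joy, gladden))   #  1.0 -> synonyms score positive
print(cosine(joy, sorrow))    # -1.0 -> antonyms score negative
```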

  32. Continuous representations for entities
     - (Figure with example entities: Democratic Party, Republican Party, ?, George W Bush, Laura Bush, Michelle Obama)

  33. Continuous representations for entities
     - Useful resources for NLP applications:
       - Semantic Parsing & Question Answering
       - Information Extraction
