

  1. Distributed Representations CMSC 473/673 UMBC Some slides adapted from 3SLP

  2. Outline Recap Maxent models Basic neural language models Continuous representations Motivation Key idea: represent blobs with vectors Two common counting types Evaluation Common continuous representation models

  3. Maxent Objective: Log-Likelihood For training pairs $(h_i, y_i)$ (for an n-gram LM: history $h_i$, next word $y_i$):
     $\log \prod_i p_\theta(y_i \mid h_i) = \sum_i \log p_\theta(y_i \mid h_i) = \sum_i \left[ \theta^\top f(y_i, h_i) - \log Z(h_i) \right] = F(\theta)$
     Differentiating this becomes nicer (even though Z depends on θ). The objective is implicitly defined with respect to (wrt) your data on hand.
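A minimal sketch of this objective in code (not the course's own implementation): `data`, `vocab`, and the feature function `feats` are hypothetical placeholders, and the normalizer $Z(h_i)$ is computed by summing scores over all candidate next words.

```python
import numpy as np

def log_likelihood(theta, data, vocab, feats):
    """Maxent log-likelihood: sum_i [ theta . f(y_i, h_i) - log Z(h_i) ].

    data  : list of (h_i, y_i) pairs (history, next word)
    vocab : all candidate next words y'
    feats : feats(y, h) -> NumPy feature vector, same length as theta
    """
    total = 0.0
    for h, y in data:
        scores = np.array([theta @ feats(y_prime, h) for y_prime in vocab])
        # log Z(h) via the log-sum-exp trick; note Z depends on theta
        log_Z = np.max(scores) + np.log(np.sum(np.exp(scores - np.max(scores))))
        total += theta @ feats(y, h) - log_Z
    return total
```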

  4. Log-Likelihood Gradient Each component k of the gradient is the difference between:
     the total value of feature $f_k$ in the training data, $\sum_i f_k(y_i, h_i)$,
     and the total value the current model $p_\theta$ expects for feature $f_k$, $\sum_i \mathbb{E}_{y' \sim p_\theta(\cdot \mid h_i)}\left[ f_k(y', h_i) \right]$.
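The same idea as a hedged sketch, reusing the hypothetical `data`/`vocab`/`feats` names from above: each gradient component is the observed total of a feature minus its expected total under the current model.

```python
import numpy as np

def log_likelihood_gradient(theta, data, vocab, feats):
    """Component k = total of f_k in the data minus its expected total under p_theta."""
    grad = np.zeros_like(theta, dtype=float)
    for h, y in data:
        grad += feats(y, h)                          # observed feature values
        scores = np.array([theta @ feats(y_prime, h) for y_prime in vocab])
        probs = np.exp(scores - np.max(scores))
        probs /= probs.sum()                         # p_theta(y' | h) for every candidate y'
        for y_prime, p in zip(vocab, probs):
            grad -= p * feats(y_prime, h)            # expected feature values under the model
    return grad
```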

  5. N-gram Language Models given some context… $w_{i-3}, w_{i-2}, w_{i-1}$ … compute beliefs about what is likely… $p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \propto \mathrm{count}(w_{i-3}, w_{i-2}, w_{i-1}, w_i)$ … predict the next word $w_i$
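A toy illustration of this count-based estimate (the function name and corpus are made up for this example; a real LM would also smooth the counts):

```python
from collections import Counter

def ngram_probs(tokens, history):
    """Count-based estimate: p(w | history) proportional to count(history + (w,)).
    `history` is a tuple of the previous three words, matching the slide's 4-gram setup."""
    n = len(history) + 1
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    matching = {gram[-1]: c for gram, c in counts.items() if gram[:-1] == history}
    total = sum(matching.values())
    return {w: c / total for w, c in matching.items()} if total else {}

# ngram_probs("the cat sat on the mat the cat sat on the rug".split(),
#             ("sat", "on", "the"))   ->   {'mat': 0.5, 'rug': 0.5}
```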

  6. Maxent Language Models given some context… $w_{i-3}, w_{i-2}, w_{i-1}$ … compute beliefs about what is likely… $p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \propto \operatorname{softmax}(\theta \cdot f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))$ … predict the next word $w_i$
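The maxent version replaces raw counts with a softmax over feature scores; a small sketch, again with a hypothetical feature function `feats`:

```python
import numpy as np

def maxent_lm_probs(theta, history, vocab, feats):
    """p(w | history) = softmax over candidate words w of theta . f(w, history)."""
    scores = np.array([theta @ feats(w, history) for w in vocab])
    exp_scores = np.exp(scores - np.max(scores))   # numerically stable softmax
    return dict(zip(vocab, exp_scores / exp_scores.sum()))
```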

  7. Neural Language Models given some context… $w_{i-3}, w_{i-2}, w_{i-1}$ … create/use "distributed representations" $e_w$… $e_{i-3}, e_{i-2}, e_{i-1}$ … combine these representations (e.g., a matrix-vector product)… $C = f_\theta$ … compute beliefs about what is likely… $p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \propto \operatorname{softmax}(\theta_{w_i} \cdot g(w_{i-3}, w_{i-2}, w_{i-1}))$ … predict the next word $w_i$
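One possible shape of such a model, sketched in NumPy; the matrices E, C, W and the tanh combiner are illustrative assumptions, not the exact architecture on the slide. It looks up embeddings, combines them into one vector, and scores every word in the vocabulary.

```python
import numpy as np

def neural_lm_probs(history_ids, E, C, W):
    """history_ids : vocab indices of w_{i-3}, w_{i-2}, w_{i-1}
    E : (V, d) embedding matrix;  C : (h, 3*d) combiner;  W : (V, h) output weights."""
    e = np.concatenate([E[i] for i in history_ids])   # look up and concatenate embeddings
    g = np.tanh(C @ e)                                 # combine into a single representation
    scores = W @ g                                     # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()               # softmax over next-word candidates
```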

  9. Outline Recap Maxent models Basic neural language models Continuous representations Motivation Key idea: represent blobs with vectors Two common counting types Evaluation Common continuous representation models

  10. Recall from Deck 2: Representing a Linguistic "Blob"
      1. An array of sub-blobs: word → array of characters; sentence → array of words.
      2. Integer representation / one-hot encoding: let V = vocab size (# types). Represent each word type with a unique integer i, where $0 \le i < V$; or equivalently, assign each word to some index i and represent each word w with a V-dimensional binary vector $e_w$, where $e_{w,i} = 1$ and all other entries are 0.
      3. Dense embedding.

  11. Recall from Deck 2: One-Hot Encoding Example
      Let our vocab be {a, cat, saw, mouse, happy}; V = # types = 5.
      Assign indices: mouse → 0, happy → 1, cat → 2, saw → 3, a → 4.
      How do we represent "cat"? $e_{\text{cat}} = [0, 0, 1, 0, 0]$
      How do we represent "happy"? $e_{\text{happy}} = [0, 1, 0, 0, 0]$
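The same example in code, using the index assignment from the slide:

```python
import numpy as np

vocab = {"mouse": 0, "happy": 1, "cat": 2, "saw": 3, "a": 4}   # indices from the slide
V = len(vocab)

def one_hot(word):
    e = np.zeros(V)
    e[vocab[word]] = 1.0
    return e

print(one_hot("cat"))    # [0. 0. 1. 0. 0.]
print(one_hot("happy"))  # [0. 1. 0. 0. 0.]
```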

  12. Recall from Deck 2: Representing a Linguistic "Blob"
      1. An array of sub-blobs: word → array of characters; sentence → array of words.
      2. Integer representation / one-hot encoding.
      3. Dense embedding: let E be some embedding size (often 100, 200, 300, etc.). Represent each word w with an E-dimensional real-valued vector $e_w$.

  13. Recall from Deck 2: A Dense Representation (E=2)

  14. Maxent Plagiarism Detector? Given two documents $x_1, x_2$, predict y = 1 (plagiarized) or y = 0 (not plagiarized). What is/are the: • Method/steps for predicting? • General formulation? • Features?

  15. Plagiarism Detection: Word Similarity?

  16. Distributional Representations A dense, "low"-dimensional vector representation

  17. How have we represented words? Each word is a distinct item. Bijection between the strings and unique integer ids: "cat" --> 3, "kitten" --> 792, "dog" --> 17394. Are "cat" and "kitten" similar?

  18. How have we represented words? Each word is a distinct item. Bijection between the strings and unique integer ids: "cat" --> 3, "kitten" --> 792, "dog" --> 17394. Are "cat" and "kitten" similar? Equivalently: "one-hot" encoding. Represent each word type w with a vector the size of the vocabulary. This vector has V-1 zero entries, and 1 non-zero (one) entry.
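This is the core problem with one-hot encodings: any two distinct words have zero overlap, so "cat" and "kitten" look no more similar than "cat" and "dog". A tiny illustration (the vocabulary size and the index chosen for "kitten" are made up):

```python
import numpy as np

cat    = np.zeros(5); cat[2] = 1.0        # one-hot for "cat"
kitten = np.zeros(5); kitten[4] = 1.0     # one-hot for "kitten" (toy index)
print(cat @ kitten)                        # 0.0 -- distinct one-hot vectors never overlap
```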

  19. Distributional Representations A dense, "low"-dimensional vector representation An E-dimensional vector, often (but not always) real-valued

  20. Distributional Representations A dense, "low"-dimensional vector representation An E-dimensional vector, often (but not always) real-valued Up till ~2013: E could be any size; 2013-present: E << vocab size

  21. Distributional Representations A dense, "low"-dimensional vector representation An E-dimensional vector, often (but not always) real-valued Up till ~2013: E could be any size; 2013-present: E << vocab size Many values are not 0 (or at least less sparse than one-hot)

  22. Distributional Representations A dense, "low"-dimensional vector representation An E-dimensional vector, often (but not always) real-valued Up till ~2013: E could be any size; 2013-present: E << vocab size Many values are not 0 (or at least less sparse than one-hot) These are also called • embeddings • continuous representations • (word/sentence/…) vectors • vector-space models
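With dense vectors, similarity between words becomes meaningful; a toy example using cosine similarity (the 4-dimensional embeddings below are invented purely for illustration):

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# made-up dense embeddings, purely for illustration
cat    = np.array([0.7, 0.3, -0.2, 0.1])
kitten = np.array([0.6, 0.4, -0.1, 0.2])
dog    = np.array([0.1, -0.5, 0.8, 0.3])
print(cosine(cat, kitten))   # high: similar words get similar vectors
print(cosine(cat, dog))      # much lower
```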

  23. Distributional models of meaning = vector-space models of meaning = vector semantics Zellig Harris (1954): "oculist and eye-doctor … occur in almost the same environments" "If A and B have almost identical environments we say that they are synonyms." Firth (1957): "You shall know a word by the company it keeps!"

  24. Continuous Meaning The paper reflected the truth.

  25. Continuous Meaning The paper reflected the truth. (2-D embedding plot with points for: reflected, paper, truth)

  26. Continuous Meaning The paper reflected the truth. (2-D embedding plot with points for: reflected, paper, glean, hide, truth, falsehood)

  27. (Some) Properties of Embeddings Capture "like" (similar) words Mikolov et al. (2013)

  28. (Some) Properties of Embeddings Capture "like" (similar) words Capture relationships: vector('king') - vector('man') + vector('woman') ≈ vector('queen'); vector('Paris') - vector('France') + vector('Italy') ≈ vector('Rome') Mikolov et al. (2013)
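These relationships are usually recovered by nearest-neighbor search in the embedding space; a sketch of that lookup, assuming `emb` is a hypothetical dict mapping words to vectors loaded from some pretrained word2vec-style model:

```python
import numpy as np

def nearest(query, embeddings, exclude):
    """Return the word whose vector is most cosine-similar to `query`."""
    best, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = (query @ vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# With real pretrained embeddings loaded into `emb`, the slide's relation becomes:
# nearest(emb["king"] - emb["man"] + emb["woman"], emb,
#         exclude={"king", "man", "woman"})   # expected to be close to "queen"
```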

  29. "Embeddings" Did Not Begin In This Century Hinton (1986): "Learning Distributed Representations of Concepts" Deerwester et al. (1990): "Indexing by Latent Semantic Analysis" Brown et al. (1992): "Class-based n-gram models of natural language"

  30. Outline Recap Maxent models Basic neural language models Continuous representations Motivation Key idea: represent blobs with vectors Two common counting types Evaluation Common continuous representation models

  31. Key Ideas 1. Acquire basic contextual statistics (often counts) for each word type w

  32. Key Ideas 1. Acquire basic contextual statistics (often counts) for each word type w 2. Extract a real-valued vector v for each word w from those statistics

  33. Key Ideas 1. Acquire basic contextual statistics (often counts) for each word type w 2. Extract a real-valued vector v for each word w from those statistics 3. Use the vectors to represent each word in later tasks

  34. Key Ideas: Generalizing to β€œblobs” 1. Acquire basic contextual statistics (often counts) for each blob type w 2. Extract a real-valued vector v for each blob w from those statistics 3. Use the vectors to represent each blob in later tasks
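A compact sketch of steps 1 and 2 for word blobs, using window-based co-occurrence counts as the "basic contextual statistics" and raw count vectors as the extracted representation (real systems would reweight the counts and reduce their dimensionality):

```python
from collections import Counter, defaultdict
import numpy as np

def cooccurrence_vectors(tokens, window=2):
    """Step 1: count, for each word type, which words appear within `window` tokens.
    Step 2: turn those counts into a real-valued vector per word (here: raw counts)."""
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[w][tokens[j]] += 1
    vocab = sorted(set(tokens))
    index = {w: k for k, w in enumerate(vocab)}
    vectors = {w: np.zeros(len(vocab)) for w in vocab}
    for w, ctx in counts.items():
        for c, n in ctx.items():
            vectors[w][index[c]] = n
    return vectors   # Step 3: use these vectors to represent words in later tasks
```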

  35. Outline Recap Maxent models Basic neural language models Continuous representations Motivation Key idea: represent blobs with vectors Two common counting types Evaluation Common continuous representation models

  36. "Acquire basic contextual statistics (often counts) for each word type w" • Two basic, initial counting approaches: – Record which words appear in which documents – Record which words appear together • These are good first attempts, but with some large downsides

  37. "You shall know a word by the company it keeps!" Firth (1957) document (↓) - word (→) count matrix:
                        battle  soldier  fool  clown
      As You Like It         1        2    37      6
      Twelfth Night          1        2    58    117
      Julius Caesar          8       12     1      0
      Henry V               15       36     5      0

  38. "You shall know a word by the company it keeps!" Firth (1957) document (↓) - word (→) count matrix (basic bag-of-words counting):
                        battle  soldier  fool  clown
      As You Like It         1        2    37      6
      Twelfth Night          1        2    58    117
      Julius Caesar          8       12     1      0
      Henry V               15       36     5      0
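The same table as a document-word count matrix in code; a column of the matrix then serves as a first, crude vector representation of that word.

```python
import numpy as np

docs  = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
words = ["battle", "soldier", "fool", "clown"]
# counts copied from the slide's table (rows = documents, columns = words)
M = np.array([[ 1,  2, 37,   6],
              [ 1,  2, 58, 117],
              [ 8, 12,  1,   0],
              [15, 36,  5,   0]])

fool = M[:, words.index("fool")]   # [37, 58, 1, 5]: which documents "fool" appears in, and how often
```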
