  1. Distributed Representations CMSC 473/673 UMBC Some slides adapted from 3SLP

  2. Outline: Recap (maxent models; basic neural language models); Continuous representations (motivation; key idea: represent words with vectors; two common counting types; two (four) common continuous representation models; evaluation)

  3. Maxent Objective: Log-Likelihood. F(θ) = Σ_i log p_θ(y_i | x_i). Differentiating this becomes nicer (even though Z depends on θ). The objective is implicitly defined with respect to (wrt) your data on hand.

  4. Log-Likelihood Gradient. Each component k is the difference between: the total value of feature f_k in the training data, and the total value the current model p_θ(y' | x_i) thinks it computes for feature f_k (i.e., the expected total of f_k under the model).
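A minimal numpy sketch of that difference (observed feature totals minus the totals the current model expects). The dense feature array, the shapes, and the variable names are illustrative assumptions, not the course's implementation.

```python
import numpy as np

# feats[i, y, k]: value of feature k for example i paired with label y
# labels[i]: the observed label of example i; theta[k]: model weights
def maxent_log_likelihood_grad(feats, labels, theta):
    n, num_labels, num_feats = feats.shape
    scores = feats @ theta                        # (n, num_labels): theta . f(x_i, y)
    scores -= scores.max(axis=1, keepdims=True)   # stabilize the softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # p_theta(y | x_i)

    # total value of each feature in the training data
    observed = feats[np.arange(n), labels].sum(axis=0)
    # total value the current model expects for each feature
    expected = np.einsum('iy,iyk->k', probs, feats)
    return observed - expected                    # gradient component k
```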

  5. N-gram Language Models. Given some context… w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely… p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i), and predict the next word w_i.
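A minimal sketch of the count-based model this slide describes: the next word is scored in proportion to count(w_{i-3}, w_{i-2}, w_{i-1}, w_i). The tokenized-corpus input and the helper names are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Collect 4-gram counts: counts[(w_{i-3}, w_{i-2}, w_{i-1})][w_i]
def train_ngram_counts(tokens, n=4):
    counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        counts[context][tokens[i]] += 1
    return counts

# Normalize the counts for one context into a next-word distribution
def next_word_probs(counts, context):
    c = counts[tuple(context)]
    total = sum(c.values())
    return {w: k / total for w, k in c.items()} if total else {}
```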

  6. Maxent Language Models. Given some context… w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely… p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ ⋅ f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)), and predict the next word w_i.

  7. Neural Language Models. Given some context… w_{i-3} w_{i-2} w_{i-1}, create/use “distributed representations”… e_{i-3} e_{i-2} e_{i-1}, combine these representations with a matrix-vector product… f = C [e_{i-3}; e_{i-2}; e_{i-1}], compute beliefs about what is likely… p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} ⋅ f(w_{i-3}, w_{i-2}, w_{i-1})), and predict the next word w_i.

  8. Neural Language Models (same diagram as slide 7): p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} ⋅ f(w_{i-3}, w_{i-2}, w_{i-1})).
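A minimal numpy sketch of the forward pass slides 7-8 depict: look up embeddings for the context words, combine them with a matrix-vector product, and score the vocabulary with a softmax. The dimensions, random initialization, and variable names are illustrative assumptions, not the course's model.

```python
import numpy as np

V, d, h = 10_000, 50, 100          # vocab size, embedding size, combined size (assumed)
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))        # word embeddings e_w
C = rng.normal(size=(h, 3 * d))    # combination matrix
Theta = rng.normal(size=(V, h))    # output weights theta_w

def next_word_distribution(context_ids):              # context_ids = [i-3, i-2, i-1]
    e = np.concatenate([E[i] for i in context_ids])    # stack e_{i-3}, e_{i-2}, e_{i-1}
    f = C @ e                                           # combine via matrix-vector product
    scores = Theta @ f                                  # theta_w . f for every word w
    scores -= scores.max()                              # stabilize the softmax
    p = np.exp(scores)
    return p / p.sum()                                  # distribution over the vocabulary
```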

  9. Outline (repeated from slide 2).

  10. How have we represented words? Each word is a distinct item. Bijection between the strings and unique integer ids: "cat" --> 3, "kitten" --> 792, "dog" --> 17394. Are "cat" and "kitten" similar?

  11. How have we represented words? Each word is a distinct item. Bijection between the strings and unique integer ids: "cat" --> 3, "kitten" --> 792, "dog" --> 17394. Are "cat" and "kitten" similar? Equivalently, a "one-hot" encoding: represent each word type w with a vector the size of the vocabulary; this vector has V - 1 zero entries and a single non-zero (one) entry.
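A minimal sketch of that one-hot encoding with a toy (assumed) three-word vocabulary. It also shows why one-hot vectors cannot express that "cat" and "kitten" are similar: any two distinct one-hot vectors are orthogonal.

```python
import numpy as np

vocab = {"cat": 0, "kitten": 1, "dog": 2}   # toy vocabulary (assumed)

def one_hot(word, vocab):
    v = np.zeros(len(vocab))   # |V| - 1 zero entries ...
    v[vocab[word]] = 1.0       # ... and a single 1 at the word's index
    return v

# "cat" looks exactly as (dis)similar to "kitten" as it does to "dog":
print(one_hot("cat", vocab) @ one_hot("kitten", vocab))   # 0.0
print(one_hot("cat", vocab) @ one_hot("dog", vocab))      # 0.0
```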

  12. Word Similarity ➔ Plagiarism Detection

  13. Distributional models of meaning = vector-space models of meaning = vector semantics. Zellig Harris (1954): “oculist and eye-doctor … occur in almost the same environments”; “If A and B have almost identical environments we say that they are synonyms.” Firth (1957): “You shall know a word by the company it keeps!”

  14. Continuous Meaning The paper reflected the truth.

  15. Continuous Meaning. The paper reflected the truth. [Figure: the words “reflected”, “paper”, and “truth” plotted as points in a continuous space.]

  16. Continuous Meaning. The paper reflected the truth. [Figure: the same space, with “glean”, “hide”, and “falsehood” added.]

  17. (Some) Properties of Embeddings Capture “like” (similar) words Mikolov et al. (2013)

  18. (Some) Properties of Embeddings. Capture “like” (similar) words. Capture relationships: vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’); vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’). Mikolov et al. (2013)
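A minimal sketch of that vector arithmetic, assuming a dict emb mapping words to numpy vectors (how those vectors were trained is outside this snippet): form king - man + woman and return the most cosine-similar remaining word.

```python
import numpy as np

def analogy(emb, a, b, c):
    """Return the word whose vector is closest to emb[a] - emb[b] + emb[c]."""
    target = emb[a] - emb[b] + emb[c]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue                      # skip the query words themselves
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# analogy(emb, "king", "man", "woman")    # ideally "queen" with good embeddings
```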

  19. Outline (repeated from slide 2).

  20. Key Idea 1. Acquire basic contextual statistics (counts) for each word type w 2. Extract a real-valued vector v for each word w from those statistics 3. Use the vectors to represent each word in later tasks

  21. Outline (repeated from slide 2).

  22. “You shall know a word by the company it keeps!” (Firth, 1957). Document (rows) by word (columns) count matrix:
                       battle  soldier  fool  clown
      As You Like It:    1       2       37     6
      Twelfth Night:     1       2       58   117
      Julius Caesar:     8      12        1     0
      Henry V:          15      36        5     0

  23. (Same Firth quote and document-word count matrix as slide 22.) This is basic bag-of-words counting.

  24. (Same document-word count matrix.) Assumption: two documents are similar if their vectors are similar.

  25. (Same document-word count matrix.) Assumption: two words are similar if their vectors are similar.

  26. (Same document-word count matrix.) Assumption: two words are similar if their vectors are similar. Issue: count word vectors are very large, sparse, and skewed!
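A minimal sketch of the similarity computation behind that assumption, using the count matrix from slide 22 with words as columns. Cosine similarity is one standard choice of measure (an assumption here; the slide does not name one).

```python
import numpy as np

words = ["battle", "soldier", "fool", "clown"]
counts = np.array([[ 1,  2, 37,   6],    # As You Like It
                   [ 1,  2, 58, 117],    # Twelfth Night
                   [ 8, 12,  1,   0],    # Julius Caesar
                   [15, 36,  5,   0]])   # Henry V

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

fool, clown, battle = counts[:, 2], counts[:, 3], counts[:, 0]
print(cosine(fool, clown))    # high: "fool" and "clown" occur in the same plays
print(cosine(fool, battle))   # low:  "fool" and "battle" mostly do not
```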

  27. “You shall know a word by the company it keeps!” (Firth, 1957). Context (rows) by word (columns) count matrix:
                   apricot  pineapple  digital  information
      aardvark:      0         0         0          0
      computer:      0         0         2          1
      data:          0        10         1          6
      pinch:         1         1         0          0
      result:        0         0         1          4
      sugar:         1         1         0          0
      Context: those other words within a small “window” of a target word.

  28. (Same context-word count matrix as slide 27.) Context: those other words within a small “window” of a target word. Example: “a cloud computer stores digital data on a remote computer”.
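A minimal sketch of this window-based counting. The example sentence comes from the slide; the ±4 window, plain-split tokenization, and data structure are illustrative assumptions.

```python
from collections import Counter, defaultdict

# counts[target][context] = how often `context` appears within +/- `window` of `target`
def cooccurrence_counts(tokens, window=4):
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

sentence = "a cloud computer stores digital data on a remote computer".split()
counts = cooccurrence_counts(sentence, window=4)
print(counts["computer"]["digital"])   # 1: "digital" falls inside one "computer" window
```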

  29. (Same context-word count matrix.) The window size depends on your goals: ±1-3 words gives a more “syntax-y” representation (the shorter the window, the more syntactic), while ±4-10 words gives a more “semantic-y” representation (the longer the window, the more semantic).

  30. (Same context-word count matrix.) Context: those other words within a small “window” of a target word. Assumption: two words are similar if their vectors are similar. Issue: count word vectors are very large, sparse, and skewed!

  31. Outline (repeated from slide 2).

  32. Four kinds of vector models. Sparse vector representations: (1) mutual-information-weighted word co-occurrence matrices. Dense vector representations: (2) singular value decomposition / Latent Semantic Analysis; (3) neural-network-inspired models (skip-gram, CBOW); (4) Brown clusters.
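A minimal sketch (assumed, not from the slides) of the first two models in this list: re-weight a raw word-context count matrix with positive pointwise mutual information (PPMI), then take a truncated SVD to get short, dense vectors, roughly the LSA recipe.

```python
import numpy as np

def ppmi(counts):                      # counts: words x contexts raw co-occurrence counts
    total = counts.sum()
    p_wc = counts / total              # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)
    p_c = p_wc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.nan_to_num(np.maximum(pmi, 0.0), nan=0.0)   # keep only positive associations

def lsa_embeddings(counts, k=2):
    U, s, _ = np.linalg.svd(ppmi(counts), full_matrices=False)
    return U[:, :k] * s[:k]            # k-dimensional dense word vectors
```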

  33. Shared intuition: model the meaning of a word by “embedding” it in a vector space. The meaning of a word is a vector of numbers. Contrast: many computational linguistics applications represent word meaning by a vocabulary index (“word number 545”) or by the string itself.

  34. What’s the Meaning of Life?

  35. What’s the Meaning of Life? LIFE’

  36. What’s the Meaning of Life? LIFE’ (.478, -.289, .897, …)

  37. “Embeddings” Did Not Begin in This Century. Hinton (1986): “Learning Distributed Representations of Concepts”; Deerwester et al. (1990): “Indexing by Latent Semantic Analysis”; Brown et al. (1992): “Class-based n-gram models of natural language”.
