Distributed Representations CMSC 473/673 UMBC Some slides adapted from 3SLP
Outline Recap Maxent models Basic neural language models Continuous representations Motivation Key idea: represent blobs with vectors Two common counting types Evaluation Common continuous representation models
Maxent Objective: Log-Likelihood (n-gram LM over history/word pairs $(h_i, y_i)$)

$\log \prod_i p_\theta(y_i \mid h_i) = \sum_i \log p_\theta(y_i \mid h_i) = \sum_i \left[ \theta \cdot f(y_i, h_i) - \log Z_\theta(h_i) \right] = F(\theta)$

Differentiating this becomes nicer (even though $Z$ depends on $\theta$). The objective is implicitly defined with respect to (wrt) your data on hand.
Log-Likelihood Gradient

$\frac{\partial F(\theta)}{\partial \theta_k} = \sum_i f_k(y_i, h_i) - \sum_i \mathbb{E}_{y' \sim p_\theta(\cdot \mid h_i)}\left[ f_k(y', h_i) \right]$

Each component $k$ is the difference between: the total value of feature $f_k$ in the training data, and the total value the current model $p_\theta$ thinks feature $f_k$ should take (its expected value under the model).
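A minimal NumPy sketch of this objective and gradient, assuming a feature function that maps a (history, candidate word) pair to a feature vector and a small candidate vocabulary; all names here are illustrative, not the course's reference implementation:

```python
import numpy as np

def log_likelihood_and_grad(theta, data, feats, vocab):
    """Maxent LM objective F(theta) and its gradient.

    data : list of (history, observed_word) pairs
    feats: feats(history, word) -> np.ndarray with the same dimension as theta
    vocab: candidate next words (the space the softmax normalizes over)
    """
    ll = 0.0
    grad = np.zeros_like(theta)
    for history, y in data:
        F = np.stack([feats(history, w) for w in vocab])  # (V, K) feature matrix
        scores = F @ theta                                 # theta . f for each candidate
        log_Z = np.logaddexp.reduce(scores)                # log normalizer (depends on theta)
        probs = np.exp(scores - log_Z)                     # p_theta(w | history)

        y_idx = vocab.index(y)
        ll += scores[y_idx] - log_Z                        # log p_theta(y | history)
        # gradient component k: observed feature value minus model-expected value
        grad += F[y_idx] - probs @ F
    return ll, grad
```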
N-gram Language Models: given some context… $w_{i-3}, w_{i-2}, w_{i-1}$ …compute beliefs about what is likely… $p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \propto \mathrm{count}(w_{i-3}, w_{i-2}, w_{i-1}, w_i)$ …and predict the next word $w_i$
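A small count-based sketch of this estimate, assuming the corpus is already tokenized into lists of words (function and variable names are illustrative):

```python
from collections import Counter

def ngram_lm(corpus, n=4):
    """Estimate p(w_i | previous n-1 words) by relative counts."""
    context_counts, full_counts = Counter(), Counter()
    for sent in corpus:                                   # corpus: list of token lists
        padded = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            full_counts[context + (padded[i],)] += 1
            context_counts[context] += 1

    def prob(context, word):
        c = context_counts[tuple(context)]
        return full_counts[tuple(context) + (word,)] / c if c else 0.0

    return prob
```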
Maxent Language Models: given some context… $w_{i-3}, w_{i-2}, w_{i-1}$ …compute beliefs about what is likely… $p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) = \mathrm{softmax}(\theta \cdot f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))$ …and predict the next word $w_i$
Neural Language Models: given some context… $w_{i-3}, w_{i-2}, w_{i-1}$ …create/use "distributed representations" $e_w$… $e_{i-3}, e_{i-2}, e_{i-1}$ …combine these representations (matrix-vector product)… $C = f_\theta$ …compute beliefs about what is likely… $p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) = \mathrm{softmax}(\theta_{w_i} \cdot f(w_{i-3}, w_{i-2}, w_{i-1}))$ …and predict the next word $w_i$
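A minimal NumPy sketch of that forward pass in a simple feed-forward style; the matrix names (E, W, U, b) and the tanh combination are illustrative assumptions, not the exact architecture from the slides:

```python
import numpy as np

def neural_lm_step(context_ids, E, W, U, b):
    """One forward pass of a simple feed-forward neural LM.

    context_ids : indices of the previous words, e.g. [i-3, i-2, i-1]
    E : (V, d) input embedding matrix   -> look up e_{i-3}, e_{i-2}, e_{i-1}
    W : (h, 3*d) combination weights    -> matrix-vector product
    U : (V, h) output word representations (one theta_w per word w)
    b : (h,) bias
    """
    embeds = np.concatenate([E[i] for i in context_ids])  # create/use distributed reps
    hidden = np.tanh(W @ embeds + b)                       # combine them: C = f_theta(context)
    scores = U @ hidden                                    # one score per vocabulary word
    scores -= scores.max()                                 # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()          # softmax over next words
    return probs                                           # p(w_i | context)

# probs.argmax() would then give the model's single best guess for the next word.
```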
Outline Recap Maxent models Basic neural language models Continuous representations Motivation Key idea: represent blobs with vectors Two common counting types Evaluation Common continuous representation models
Recall from Deck 2: Representing a Linguistic "Blob"
1. An array of sub-blobs: word → array of characters; sentence → array of words
2. Integer representation / one-hot encoding: let V = vocab size (# types); represent each word type with a unique integer i, where 0 ≤ i < V. Or, equivalently: assign each word to some index i, where 0 ≤ i < V, and represent each word w with a V-dimensional binary vector $e_w$, where $e_{w,i} = 1$ and 0 otherwise
3. Dense embedding
Recall from Deck 2: One-Hot Encoding Example
• Let our vocab be {a, cat, saw, mouse, happy}; V = # types = 5
• Assign indices: a → 4, cat → 2, saw → 3, mouse → 0, happy → 1
• How do we represent "cat"? $e_{\text{cat}} = (0, 0, 1, 0, 0)$
• How do we represent "happy"? $e_{\text{happy}} = (0, 1, 0, 0, 0)$
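A quick sketch of building these one-hot vectors; the vocabulary and index assignment follow the slide, the rest is illustrative:

```python
import numpy as np

vocab_index = {"mouse": 0, "happy": 1, "cat": 2, "saw": 3, "a": 4}
V = len(vocab_index)   # vocab size (# types)

def one_hot(word):
    """V-dimensional binary vector with a single 1 at the word's index."""
    vec = np.zeros(V)
    vec[vocab_index[word]] = 1.0
    return vec

print(one_hot("cat"))    # [0. 0. 1. 0. 0.]
print(one_hot("happy"))  # [0. 1. 0. 0. 0.]
```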
Recall from Deck 2: Representing a Linguistic "Blob"
1. An array of sub-blobs: word → array of characters; sentence → array of words
2. Integer representation / one-hot encoding
3. Dense embedding: let E be some embedding size (often 100, 200, 300, etc.); represent each word w with an E-dimensional real-valued vector $e_w$
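A minimal sketch of the dense alternative: one E-dimensional row per word type in an embedding matrix. Random initialization stands in here for whatever values training would actually produce:

```python
import numpy as np

vocab_index = {"mouse": 0, "happy": 1, "cat": 2, "saw": 3, "a": 4}
V, E_dim = len(vocab_index), 300              # vocab size V, embedding size E

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(V, E_dim))      # one dense row per word type

def embed(word):
    """E-dimensional real-valued vector for a word (vs. a V-dimensional one-hot)."""
    return embeddings[vocab_index[word]]

print(embed("cat").shape)   # (300,)
```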
Recall from Deck 2: A Dense Representation (E=2)
Maxent Plagiarism Detector? Given two documents $d_1, d_2$, predict y = 1 (plagiarized) or y = 0 (not plagiarized). What is/are the: • method/steps for predicting? • general formulation? • features?
Plagiarism Detection: Word Similarity?
Distributional Representations: a dense, "low"-dimensional vector representation
How have we represented words? Each word is a distinct item Bijection between the strings and unique integer ids: "cat" --> 3, "kitten" --> 792 "dog" --> 17394 Are "cat" and "kitten" similar?
How have we represented words? Each word is a distinct item Bijection between the strings and unique integer ids: "cat" --> 3, "kitten" --> 792 "dog" --> 17394 Are "cat" and "kitten" similar? Equivalently: "One-hot" encoding Represent each word type w with a vector the size of the vocabulary This vector has V-1 zero entries, and 1 non-zero (one) entry
Distributional Representations: a dense, "low"-dimensional vector representation. An E-dimensional vector, often (but not always) real-valued.
Distributional Representations: a dense, "low"-dimensional vector representation. An E-dimensional vector, often (but not always) real-valued. Up till ~2013: E could be any size; 2013-present: E << vocab size.
Distributional Representations: a dense, "low"-dimensional vector representation. An E-dimensional vector, often (but not always) real-valued. Up till ~2013: E could be any size; 2013-present: E << vocab size. Many values are not 0 (or at least less sparse than a one-hot encoding).
Distributional Representations: a dense, "low"-dimensional vector representation. An E-dimensional vector, often (but not always) real-valued. Up till ~2013: E could be any size; 2013-present: E << vocab size. Many values are not 0 (or at least less sparse than a one-hot encoding). These are also called: embeddings, continuous representations, (word/sentence/…) vectors, vector-space models.
Distributional models of meaning = vector-space models of meaning = vector semantics. Zellig Harris (1954): "oculist and eye-doctor … occur in almost the same environments"; "If A and B have almost identical environments we say that they are synonyms." Firth (1957): "You shall know a word by the company it keeps!"
Continuous Meaning The paper reflected the truth.
Continuous Meaning The paper reflected the truth. [Figure: "paper", "reflected", and "truth" plotted as points in a continuous two-dimensional space]
Continuous Meaning The paper reflected the truth. [Figure: the same space, now also showing "glean", "hide", and "falsehood" alongside the original three words]
(Some) Properties of Embeddings: capture "like" (similar) words. Mikolov et al. (2013)
(Some) Properties of Embeddings: capture "like" (similar) words; capture relationships: vector("king") − vector("man") + vector("woman") ≈ vector("queen"); vector("Paris") − vector("France") + vector("Italy") ≈ vector("Rome"). Mikolov et al. (2013)
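A small sketch of how such an analogy query is typically answered: form vector("king") − vector("man") + vector("woman") and return the nearest word by cosine similarity. The embedding table is assumed to already exist (e.g., pretrained word2vec vectors loaded into a dict); the function name is illustrative:

```python
import numpy as np

def analogy(a, b, c, embeddings, exclude_inputs=True):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c).

    embeddings: dict mapping word -> np.ndarray (assumed pretrained)
    Example: analogy("king", "man", "woman", embeddings) should return "queen".
    """
    target = embeddings[a] - embeddings[b] + embeddings[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if exclude_inputs and word in (a, b, c):
            continue
        sim = (vec @ target) / np.linalg.norm(vec)   # cosine similarity
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```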
"Embeddings" Did Not Begin In This Century. Hinton (1986): "Learning Distributed Representations of Concepts". Deerwester et al. (1990): "Indexing by Latent Semantic Analysis". Brown et al. (1992): "Class-based n-gram models of natural language".
Outline Recap Maxent models Basic neural language models Continuous representations Motivation Key idea: represent blobs with vectors Two common counting types Evaluation Common continuous representation models
Key Ideas 1. Acquire basic contextual statistics (often counts) for each word type w
Key Ideas 1. Acquire basic contextual statistics (often counts) for each word type w 2. Extract a real-valued vector v for each word w from those statistics
Key Ideas 1. Acquire basic contextual statistics (often counts) for each word type w 2. Extract a real-valued vector v for each word w from those statistics 3. Use the vectors to represent each word in later tasks
Key Ideas: Generalizing to "blobs" 1. Acquire basic contextual statistics (often counts) for each blob type w 2. Extract a real-valued vector v for each blob w from those statistics 3. Use the vectors to represent each blob in later tasks
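A compact sketch of the three steps for word blobs: count co-occurrences, extract vectors from those counts (truncated SVD is one common choice, used here purely for illustration), then use the vectors downstream, e.g. for similarity. All names are illustrative:

```python
import numpy as np

def cooccurrence_counts(corpus, window=2):
    """Step 1: contextual statistics -- count words appearing within a window."""
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    M[idx[w], idx[sent[j]]] += 1
    return M, idx

def extract_vectors(M, dim=50):
    """Step 2: extract a real-valued vector per word from the count statistics."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)   # truncated SVD of the count matrix
    return U[:, :dim] * S[:dim]

def cosine_similarity(w1, w2, vectors, idx):
    """Step 3: use the vectors in a later task, here word-word similarity."""
    v1, v2 = vectors[idx[w1]], vectors[idx[w2]]
    return (v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
```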
Outline Recap Maxent models Basic neural language models Continuous representations Motivation Key idea: represent blobs with vectors Two common counting types Evaluation Common continuous representation models
"Acquire basic contextual statistics (often counts) for each word type w"
• Two basic, initial counting approaches: record which words appear in which documents; record which words appear together
• These are good first attempts, but with some large downsides
"You shall know a word by the company it keeps!" Firth (1957)
Document (rows) × word (columns) count matrix (basic bag-of-words counting):

                 battle  soldier  fool  clown
As You Like It        1        2    37      6
Twelfth Night         1        2    58    117
Julius Caesar         8       12     1      0
Henry V              15       36     5      0
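A minimal sketch of building such a term-document count matrix with plain Python; the documents below are tiny placeholders, not the actual plays:

```python
from collections import Counter
import numpy as np

def term_document_matrix(documents, target_words):
    """Rows = documents, columns = target words, cells = raw bag-of-words counts."""
    matrix = np.zeros((len(documents), len(target_words)), dtype=int)
    for d, tokens in enumerate(documents):
        counts = Counter(tokens)
        for w, word in enumerate(target_words):
            matrix[d, w] = counts[word]
    return matrix

# Hypothetical usage: in practice, documents would be the tokenized plays.
docs = [["fool", "clown", "fool"], ["battle", "soldier", "battle"]]
print(term_document_matrix(docs, ["battle", "soldier", "fool", "clown"]))
# [[0 0 2 1]
#  [2 1 0 0]]
```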