CS 3750: Word Models
Presented by: Muheng Yan, University of Pittsburgh, Feb. 20, 2020
  1. Are Document Models Enough?
• Recap: previously we used LDA and LSI to learn document representations.
• What if we have very short documents, or even single sentences (e.g. tweets)?
• Can we investigate relationships between words/sentences with the previous models?
• We need to model words individually for better granularity.

  2. Distributional Semantics: from a Linguistic Aspect
Word embedding, distributed representations, semantic vector space... what are they?
A more formal term from linguistics: Distributional Semantic Model.
"… quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data." -- Wikipedia
→ Represent elements of language (here, words) as distributions over other elements (i.e. documents, paragraphs, sentences, and words).
E.g. word 1 = doc 1 + doc 5 + doc 10, or word 1 = 0.5 * word 12 + 0.7 * word 24.

Document-Level Representation
Words as distributions over documents: Latent Semantic Analysis/Indexing (LSA/LSI). A sketch follows below.
1. Build a co-occurrence matrix of words vs. documents (n by d).
2. Decompose the word-document matrix via SVD.
3. Keep the highest singular values to get a lower-rank approximation of the word-document matrix; its rows serve as the word representations.
Picture Credit: https://en.wikipedia.org/wiki/Latent_semantic_analysis
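A minimal LSA/LSI sketch of the three steps above, using an assumed toy corpus, rank k, and variable names (none of which come from the lecture): build the word-document count matrix, decompose it with SVD, and keep the top-k singular values; the rows of U_k S_k are the word representations.

```python
import numpy as np

# Assumed toy corpus; each string is one "document".
docs = [
    "cats chase mice",
    "dogs chase cats",
    "stocks rise as markets rally",
    "markets fall as stocks drop",
]
vocab = sorted({w for d in docs for w in d.split()})
w2i = {w: i for i, w in enumerate(vocab)}

# n-by-d word-document count matrix
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[w2i[w], j] += 1

# SVD and rank-k truncation: rows of U_k * S_k are the word representations
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
word_vecs = U[:, :k] * S[:k]          # one k-dimensional vector per word

print(word_vecs[w2i["cats"]], word_vecs[w2i["stocks"]])
```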

  3. Word-Level Representation
I. Counting and Matrix Factorization
II. Latent Representation
   I. Neural Networks for Language Models
   II. CBOW
   III. Skip-gram
   IV. Other Models
III. Graph-based Models
   I. Node2Vec

Counting and Matrix Factorization
• Counting methods start with constructing a matrix of co-occurrences between words and words (this can be expanded to other levels; e.g. at the document level it becomes LSA).
• Due to the high dimensionality and sparsity, the matrix is usually combined with a dimensionality-reduction algorithm (PCA, SVD, etc.).
• The rows of the matrix approximate the distribution of co-occurring words for every word we are trying to model (see the sketch below).
• Example models include LSA, Explicit Semantic Analysis (ESA), and Global Vectors for Word Representation (GloVe).
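A small counting-and-factorization sketch under assumed toy sentences, window size, and rank: count word-word co-occurrences in a sliding window to get the |V| x |V| matrix, then reduce its sparse, high-dimensional rows with a truncated SVD.

```python
import numpy as np

# Assumed toy corpus (two short sentences).
corpus = ["i learn machine learning in cs 3750".split(),
          "i learn deep learning in cs 3750".split()]
vocab = sorted({w for sent in corpus for w in sent})
w2i = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

window = 2
C = np.zeros((V, V))                      # |V| x |V| co-occurrence counts
for sent in corpus:
    for pos, w in enumerate(sent):
        lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
        for ctx in sent[lo:pos] + sent[pos + 1:hi]:
            C[w2i[w], w2i[ctx]] += 1

# Dimensionality reduction: keep the top-k components as dense word vectors
U, S, Vt = np.linalg.svd(C, full_matrices=False)
k = 3
emb = U[:, :k] * S[:k]
print(emb[w2i["machine"]])
```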

  4. Explicit Semantic Analysis
• Similar words most likely appear with the same distribution of topics.
• ESA represents topics by Wikipedia concepts (pages); it uses Wikipedia concepts as the dimensions of the space in which words are projected.
• For each dimension (concept), the words in that concept's article are counted.
• An inverted index is then constructed to convert each word into a vector of concepts.
• The vector constructed for each word represents the frequency of its occurrences within each concept.
Picture and Content Credit: Ahmed Magooda

Global Vectors for Word Representation (GloVe)
1. Build a word-word co-occurrence matrix X (|V| by |V|) with a sliding window (the counts can also be normalized into probabilities, as used in the derivation on the next slide).
   Example: "I learn machine learning in CS 3750", window = 2:

               I   learn   machine   learning
   I           0     1        1         0
   learn       1     0        1         1
   machine     1     1        0         2

2. Construct the cost:
   J = \sum_{i,j} f(X_{i,j}) (v_i^T v_j + b_i + b_j - \log X_{i,j})^2
3. Use gradient descent to solve the optimization. A sketch of the cost follows below.
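A minimal numpy sketch of the cost in step 2. Following the slide, it uses a single set of vectors v and biases b (the published GloVe model uses separate word and context vectors). The toy matrix X reuses the windowed counts from the example above, with the missing "learning" row filled in by symmetry as an assumption; the weighting function f follows the published GloVe recipe (x_max = 100, alpha = 0.75), which the slide leaves unspecified, and the embedding size and initialization are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 4, 5                      # vocabulary size, embedding size (assumed)
X = np.array([[0, 1, 1, 0],        # toy word-word co-occurrence counts
              [1, 0, 1, 1],
              [1, 1, 0, 2],
              [0, 1, 2, 0]], dtype=float)

v = rng.normal(scale=0.1, size=(V, dim))   # word vectors
b = np.zeros(V)                            # biases

def f(x, x_max=100.0, alpha=0.75):
    """GloVe-style weighting: down-weights rare pairs, caps frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(X, v, b):
    J = 0.0
    for i, j in zip(*np.nonzero(X)):       # sum only over observed pairs
        diff = v[i] @ v[j] + b[i] + b[j] - np.log(X[i, j])
        J += f(X[i, j]) * diff ** 2
    return J

print(glove_cost(X, v, b))
```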

  5. GloVe Cont.
How is the cost derived?
• Probability of words i and k appearing together: P_{i,k} = X_{i,k} / X_i
• Using word k as a probe, the "ratio" of two word pairs: ratio_{i,j,k} = P_{i,k} / P_{j,k}
• To model the ratio with embeddings v: J = \sum_{i,j,k} (ratio_{i,j,k} - g(v_i, v_j, v_k))^2, which is O(N^3).
• Simplify the computation by designing g(\cdot) = \exp((v_i - v_j)^T v_k).

   Value of the ratio      j and k related   j and k not related
   i and k related               1                  Inf
   i and k not related           0                   1

• Thus we are trying to make P_{i,k} / P_{j,k} = \exp(v_i^T v_k) / \exp(v_j^T v_k).
• Thus we have J = \sum (\log P_{i,j} - v_i^T v_j)^2. Expanding the objective, \log X_{i,j} - \log X_i = v_i^T v_j, which we rewrite as \log X_{i,j} = v_i^T v_j + b_i + b_j (the bias b_i absorbs \log X_i). This resolves the problem that P_{i,j} \neq P_{j,i} while v_i^T v_j = v_j^T v_i.
• Then we arrive at the final cost function J = \sum_{i,j} f(X_{i,j}) (v_i^T v_j + b_i + b_j - \log X_{i,j})^2, where f(\cdot) is a weighting function. A numeric check of the ratio intuition follows below.
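As a quick numerical illustration of the ratio argument above, the sketch below uses a made-up co-occurrence table for the words ice/steam/solid/gas/water (the numbers are illustrative assumptions, not data from the lecture or the GloVe paper) and prints the ratio P_{i,k} / P_{j,k} for i = "ice", j = "steam", and each probe word k: it comes out large, small, and near 1, matching the table.

```python
import numpy as np

words = ["ice", "steam", "solid", "gas", "water"]
idx = {w: i for i, w in enumerate(words)}

# Assumed co-occurrence counts X[i, k] (illustrative only)
X = np.array([
    [0,  4, 80,  2, 60],    # ice
    [4,  0,  3, 70, 55],    # steam
    [80, 3,  0,  1, 10],    # solid
    [2, 70,  1,  0, 12],    # gas
    [60, 55, 10, 12, 0],    # water
], dtype=float)

P = X / X.sum(axis=1, keepdims=True)     # P[i, k] = X[i, k] / X_i

i, j = idx["ice"], idx["steam"]
for probe in ["solid", "gas", "water"]:
    k = idx[probe]
    print(probe, P[i, k] / P[j, k])      # large, small, ~1
```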

  6. Neural Network for Language Models
Learning objective (predicting the next word w_t): find the parameter set \theta that minimizes
   L(\theta) = -\frac{1}{T} \sum_t \log P(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) + R(\theta)
where P(\cdot) is a softmax over the output scores, P(w_t \mid \cdot) = e^{y_{w_t}} / \sum_i e^{y_{w_i}}, with
   y = b + W_{out} \tanh(d + W_{in} x)
and x is the lookup result for the context words: x = [C(w_{t-1}), \ldots, C(w_{t-n+1})].
* (W_{out}, b) is the parameter set of the output layer; (W_{in}, d) is the parameter set of the hidden layer.
In this model we learn the parameters in C (|V| × N, where N is the embedding size), W_{in} (hidden_size × the concatenated embedding length), and W_{out} (|V| × hidden_size). A forward-pass sketch follows below.
Content Credit: Ahmed Magooda

RNN for Language Models
Learning objective: similar to the NN for LM.
Change from the NN LM:
◦ The hidden state is now computed from the input at the current step t and the hidden state of the previous step t-1:
   s_t = f(U w_t + W s_{t-1})
  where f(\cdot) is the activation function.
Content Credit: Ahmed Magooda
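A minimal numpy forward pass for this style of feed-forward language model: look up the context words in C, concatenate their embeddings, apply the tanh hidden layer, and take a softmax to get -log P(w_t | context) for one position. The vocabulary size, embedding size, hidden size, context length, and random initialization are all assumptions for illustration, not the lecture's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, H, n_ctx = 10, 8, 16, 3        # vocab, embedding, hidden sizes, context length (assumed)

C     = rng.normal(scale=0.1, size=(V, N))            # word lookup table
W_in  = rng.normal(scale=0.1, size=(H, n_ctx * N))    # hidden-layer weights
d     = np.zeros(H)
W_out = rng.normal(scale=0.1, size=(V, H))            # output-layer weights
b     = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nnlm_loss(context_ids, next_id):
    """-log P(w_t | previous context words) for a single position."""
    x = C[context_ids].reshape(-1)                    # concatenated context embeddings
    h = np.tanh(d + W_in @ x)                         # hidden layer
    y = b + W_out @ h                                 # unnormalized scores over the vocabulary
    p = softmax(y)
    return -np.log(p[next_id])

print(nnlm_loss(context_ids=[4, 7, 1], next_id=3))
```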

  7. Continuous Bag-of-Words (CBOW) Model
Learning objective: maximize the likelihood of P(word | context) for every word in a corpus.
Similar to the NN for LM, the inputs are one-hot vectors and the matrix W acts as a look-up matrix.
Differences compared to the NN for LM:
◦ Bi-directional: instead of predicting the "next" word, it predicts the center word inside a window, with words from both directions as input.
◦ Significantly reduced complexity: only 2 × |V| × n parameters are learned.
Picture Credit: Francois Chaubard, Rohit Mundra, Richard Socher, from https://cs224d.stanford.edu/lecture_notes/notes1.pdf

CBOW Cont.
Notation:
◦ m: half window size
◦ c: center word index
◦ w_i: word i from vocabulary V
◦ x_i: one-hot input of word i
◦ W ∈ R^{n × |V|}: the context (input) lookup matrix
◦ W' ∈ R^{|V| × n}: the center (output) matrix

Steps breakdown (a sketch follows below):
1. Generate the one-hot vectors for the context, (x_{c-m}, ..., x_{c-1}, x_{c+1}, ..., x_{c+m}) with each x ∈ R^{|V|}, and look up the word vectors v_i = W x_i.
2. Average the vectors over the context: h_c = (v_{c-m} + ... + v_{c+m}) / 2m.
3. Generate the scores z_c = W' h_c and turn them into probabilities \hat{y}_c = softmax(z_c).
4. Calculate the loss as the cross-entropy -\sum_{i=1}^{|V|} y_i \log(\hat{y}_i), which for a one-hot target y equals -\log P(w_c \mid w_{c-m}, \ldots, w_{c+m}).
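The following numpy sketch walks through steps 1-4 on a toy sentence. The sentence, window size, embedding size, and random initialization are all illustrative assumptions, and at sentence edges it averages over however many context words exist rather than exactly 2m.

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = "i learn machine learning in cs 3750".split()   # assumed toy sentence
vocab = sorted(set(sentence))
w2i = {w: i for i, w in enumerate(vocab)}
V, n = len(vocab), 5                     # vocab size, embedding size (assumed)

W  = rng.normal(scale=0.1, size=(n, V))  # context (input) lookup matrix
Wp = rng.normal(scale=0.1, size=(V, n))  # center (output) matrix W'

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_loss(center_pos, m=2):
    lo, hi = max(0, center_pos - m), min(len(sentence), center_pos + m + 1)
    ctx_ids = [w2i[sentence[p]] for p in range(lo, hi) if p != center_pos]
    h = W[:, ctx_ids].mean(axis=1)       # average of the context word vectors
    y_hat = softmax(Wp @ h)              # predicted distribution over the vocabulary
    return -np.log(y_hat[w2i[sentence[center_pos]]])   # cross-entropy vs. one-hot target

print(cbow_loss(center_pos=2))           # predict "machine" from its window
```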

  8. CBOW Cont.
Loss function:
For all w_c ∈ V, minimize J(\cdot) = -\log P(w_c \mid w_{c-m}, \ldots, w_{c+m})
   ⇒ J = -\frac{1}{|V|} \sum_c \log P(w'_c \mid h_c)
       = -\frac{1}{|V|} \sum_c \log \frac{\exp(w_c'^T h_c)}{\sum_{j=1}^{|V|} \exp(w_j'^T h_c)}
       = \frac{1}{|V|} \sum_c \left[ -w_c'^T h_c + \log \sum_{j=1}^{|V|} \exp(w_j'^T h_c) \right]
Optimization: use SGD to update all relevant vectors, i.e. the output vectors w'_c and the context vectors v.

Skip-gram Model
Learning objective: maximize the likelihood of P(context | word) for every word in a corpus.
Steps breakdown (a sketch follows below):
1. Generate the one-hot vector for the center word, x ∈ R^{|V|}, and compute the embedded vector h_c = W x ∈ R^n.
2. Calculate the scores z_c = W' h_c.
3. For each word j in the context of the center word, calculate the probabilities \hat{y}_c = softmax(z_c).
4. We want the probabilities \hat{y}_{c,j} in \hat{y}_c to match the true distributions of the context positions, which are the one-hot vectors y_{c-m}, ..., y_{c+m}.
The cost function is constructed similarly to the CBOW model.
Picture Credit: Francois Chaubard, Rohit Mundra, Richard Socher, from https://cs224d.stanford.edu/lecture_notes/notes1.pdf
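A matching numpy sketch of the skip-gram steps: embed the center word with W, score every vocabulary word with W', and sum the cross-entropy over the context positions. The toy sentence, window size, embedding size, and initialization are again illustrative assumptions mirroring the CBOW sketch above.

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = "i learn machine learning in cs 3750".split()   # assumed toy sentence
vocab = sorted(set(sentence))
w2i = {w: i for i, w in enumerate(vocab)}
V, n = len(vocab), 5                      # vocab size, embedding size (assumed)

W  = rng.normal(scale=0.1, size=(n, V))   # lookup matrix for the center word
Wp = rng.normal(scale=0.1, size=(V, n))   # output matrix W' scoring context words

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_loss(center_pos, m=2):
    h = W[:, w2i[sentence[center_pos]]]   # embedded center word
    y_hat = softmax(Wp @ h)               # one distribution shared by all context positions
    lo, hi = max(0, center_pos - m), min(len(sentence), center_pos + m + 1)
    return sum(-np.log(y_hat[w2i[sentence[p]]])        # cross-entropy per context word
               for p in range(lo, hi) if p != center_pos)

print(skipgram_loss(center_pos=2))
```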
