  1. Word Embedding. Praveen Krishnan, CVIT, IIIT Hyderabad. June 22, 2017

  2. Outline: Introduction; Classical Methods; Language Modeling; Neural Language Model; Challenges of Softmax (Hierarchical Softmax, Margin Based Hinge Loss); Sampling Based Approaches (Word2Vec, Noise Contrastive Estimation, Negative Sampling)

  3. Philosophy of Language. "(...) the meaning of a word is its use in the language." - Ludwig Wittgenstein, Philosophical Investigations (1953). Slide Credit: Christian Perone, Word Embeddings - Introduction

  4. Word Embedding ◮ Word embeddings refer to dense representations of words in a low-dimensional vector space that encode the associated semantics. ◮ Introduced by Bengio et al., NIPS'01. ◮ The silver bullet for many NLP tasks.

  5. Word Embedding. The Syntactic and Semantic Phenomenon ◮ Morphology, Tense etc. ◮ Vocabulary Mismatch [Synonymy, Polysemy] ◮ Topic vs. Word Distribution. Reasoning vs. Analogy: w(athens) − w(greece) ≈ w(oslo) − ?; w(apples) − w(apple) ≈ w(oranges) − ?; w(walking) − w(walked) ≈ w(swimming) − ?

  6. Word Embedding. The Syntactic and Semantic Phenomenon ◮ Morphology, Tense etc. ◮ Vocabulary Mismatch [Synonymy, Polysemy] ◮ Topic vs. Word Distribution. Reasoning vs. Analogy: w(athens) − w(greece) ≈ w(oslo) − w(norway); w(apples) − w(apple) ≈ w(oranges) − w(orange); w(walking) − w(walked) ≈ w(swimming) − w(swam)
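
  As an illustration, here is a minimal numpy sketch of how such analogies are typically resolved: the missing word is taken to be the nearest neighbour (by cosine similarity) of w(oslo) − w(athens) + w(greece). The toy vectors below are made up purely for demonstration.

  ```python
  import numpy as np

  # Hypothetical toy embeddings; in practice these come from a trained model.
  emb = {
      "athens": np.array([0.9, 0.1, 0.0]),
      "greece": np.array([0.8, 0.0, 0.1]),
      "oslo":   np.array([0.1, 0.9, 0.0]),
      "norway": np.array([0.0, 0.8, 0.1]),
  }

  def cosine(u, v):
      return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

  def analogy(x1, x2, y1, emb):
      """Solve w(x1) - w(x2) ≈ w(y1) - w(?) by nearest-neighbour search."""
      target = emb[y1] - emb[x1] + emb[x2]
      candidates = [w for w in emb if w not in (x1, x2, y1)]
      return max(candidates, key=lambda w: cosine(emb[w], target))

  print(analogy("athens", "greece", "oslo", emb))  # expected: "norway"
  ```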

  7. Word Embedding ◮ Sparse-to-Dense. ◮ Unsupervised learning. ◮ Typically learned as a by-product* of the language modeling problem.

  8. Classical Methods - Topic Modeling. Latent Semantic Analysis [Deerwester et al., 1990]: project terms and documents into a topic space using SVD on the term-document (co-occurrence) matrix.
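
  A minimal numpy sketch of the LSA idea, assuming a small toy term-document count matrix; real systems usually apply TF-IDF weighting and a truncated SVD on much larger matrices.

  ```python
  import numpy as np

  # Toy term-document count matrix (rows = terms, columns = documents).
  # The counts are illustrative only.
  X = np.array([
      [3, 2, 0, 0],   # "car"
      [2, 3, 0, 1],   # "engine"
      [0, 0, 3, 2],   # "apple"
      [0, 1, 2, 3],   # "fruit"
  ], dtype=float)

  # Truncated SVD: X ≈ U_k S_k V_k^T with k latent "topics".
  U, s, Vt = np.linalg.svd(X, full_matrices=False)
  k = 2
  term_vecs = U[:, :k] * s[:k]       # terms represented in topic space
  doc_vecs = Vt[:k, :].T * s[:k]     # documents represented in topic space

  print(term_vecs.shape, doc_vecs.shape)  # (4, 2) (4, 2)
  ```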

  9. Classical Methods - Topic Modeling. Latent Dirichlet Allocation [Blei et al., 2003] ◮ Assumes a generative probabilistic model of the corpus. ◮ Documents are represented as distributions over latent topics, where each topic is characterized by a distribution over words. Figure 1: Plate notation representing the LDA model. Source: Wikipedia
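
  A minimal numpy sketch of the generative process that the plate diagram encodes (the symbols alpha, beta, theta, z, w follow standard LDA notation; the sizes and priors below are illustrative only):

  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  V, K, n_docs, doc_len = 6, 2, 3, 8   # vocab size, topics, documents, words/doc
  alpha = np.full(K, 0.5)              # Dirichlet prior over topic proportions
  beta = np.full(V, 0.1)               # Dirichlet prior over topic-word dists

  phi = rng.dirichlet(beta, size=K)    # one word distribution per topic

  for d in range(n_docs):
      theta = rng.dirichlet(alpha)     # per-document topic proportions
      doc = []
      for _ in range(doc_len):
          z = rng.choice(K, p=theta)   # draw a topic for this word position
          w = rng.choice(V, p=phi[z])  # draw a word from that topic
          doc.append(w)
      print(f"doc {d}: topic proportions {theta.round(2)}, word ids {doc}")
  ```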

  10. Language Modeling. Probabilistic Language Modeling: given a sequence of words, does it form a valid construct in the language? $p(w_1, \dots, w_T) = \prod_i p(w_i \mid w_1, \dots, w_{i-1})$, e.g. $p(\text{'high'}, \text{'winds'}, \text{'tonight'}) > p(\text{'large'}, \text{'winds'}, \text{'tonight'})$. Using the Markov assumption: $p(w_1, \dots, w_T) = \prod_i p(w_i \mid w_{i-1}, \dots, w_{i-n+1})$. Applications: spell correction, machine translation, speech recognition, OCR etc.

  11. Language Modeling. Probabilistic Language Modeling ◮ n-gram based model: $p(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \dfrac{\text{count}(w_{t-n+1}, \dots, w_{t-1}, w_t)}{\text{count}(w_{t-n+1}, \dots, w_{t-1})}$
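
  A minimal sketch of such a count-based model for n = 2 (bigrams) on a toy corpus, with no smoothing:

  ```python
  from collections import Counter

  corpus = "high winds tonight large crowds tonight high winds expected".split()

  bigrams = Counter(zip(corpus, corpus[1:]))
  unigrams = Counter(corpus)

  def p_bigram(w, prev):
      """p(w_t | w_{t-1}) = count(w_{t-1}, w_t) / count(w_{t-1})."""
      return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

  def p_sentence(words):
      """Chain rule under the first-order Markov assumption."""
      p = 1.0
      for prev, w in zip(words, words[1:]):
          p *= p_bigram(w, prev)
      return p

  print(p_sentence(["high", "winds", "tonight"]))   # 1.0 * 0.5 = 0.5
  ```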

  12. Language Model. Neural Probabilistic Language Model. Bengio et al., JMLR'03

  13. Language Model. Neural Probabilistic Language Model: $p(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \dfrac{\exp(h^\top v_{w_t})}{\sum_{w_i \in V} \exp(h^\top v_{w_i})}$ ⇒ Softmax layer. Here $h$ is the hidden representation of the input, $v_{w_i}$ is the output word embedding of word $i$ and $V$ is the vocabulary. Bengio et al., JMLR'03
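
  A minimal numpy sketch of that softmax layer; $h$ and the output embedding matrix are random stand-ins, the point being that every prediction costs $O(|V|)$ dot products and exponentials:

  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  V, d = 10_000, 128                    # vocabulary size, hidden size
  h = rng.standard_normal(d)            # hidden representation of the context
  V_out = rng.standard_normal((V, d))   # output embeddings, one row per word

  logits = V_out @ h                    # h^T v_w for every word in the vocabulary
  logits -= logits.max()                # numerical stability
  probs = np.exp(logits) / np.exp(logits).sum()

  print(probs.shape, probs.sum())       # (10000,) 1.0 -- O(|V|) work per prediction
  ```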

  14. Neural Probabilistic Language Model ◮ Associate with each word in the vocabulary a distributed feature vector. ◮ Learn both the embeddings and the parameters of the probability function jointly. Bengio et al., JMLR'03

  15. Softmax Classifier. Figure 2: Predicting the next word with softmax. Slide Credit: Sebastian Ruder, blog post "On word embeddings - Part 2: Approximating the Softmax".

  16. Challenges of Softmax. One of the major challenges of the previous formulation is the cost of computing the softmax, which is $O(|V|)$ (typically $|V| > 100K$). Major works: ◮ Softmax-based approaches [bringing more efficiency]: Hierarchical Softmax, Differentiated Softmax, CNN-Softmax. ◮ Sampling-based approaches [approximating the softmax using a different loss function]: Importance Sampling, Margin based Hinge Loss, Noise Contrastive Estimation, Negative Sampling, ...

  17. Hierarchical Softmax ◮ Uses a binary tree representation of the output layer, with one leaf per word in the vocabulary. ◮ Evaluates at most $\log_2(|V|)$ nodes instead of $|V|$ nodes. ◮ Parameters are stored only at internal nodes, hence the total number of parameters is about the same as for the regular softmax. $p(\text{right} \mid n, c) = \sigma(h^\top v_n)$, $p(\text{left} \mid n, c) = 1 - p(\text{right} \mid n, c)$. Figure 3: Hierarchical Softmax: Morin and Bengio, 2005
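
  A minimal numpy sketch of how a word's probability is computed under hierarchical softmax, assuming each word comes with its path of (internal node, direction) decisions from the root; the tree, vectors and path below are made up for illustration:

  ```python
  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def hs_word_prob(h, node_vecs, path):
      """p(w | context) as a product of binary decisions along the tree path.

      path: list of (node_id, go_right) pairs from the root to the word's leaf.
      node_vecs: one vector v_n per internal node (the only output parameters).
      """
      p = 1.0
      for node_id, go_right in path:
          p_right = sigmoid(h @ node_vecs[node_id])
          p *= p_right if go_right else (1.0 - p_right)
      return p

  # Toy example: 4-word vocabulary => depth-2 binary tree with 3 internal nodes,
  # so only ~log2(|V|) sigmoid evaluations per word instead of a |V|-way softmax.
  rng = np.random.default_rng(0)
  d = 8
  h = rng.standard_normal(d)
  node_vecs = rng.standard_normal((3, d))
  path_to_w = [(0, True), (2, False)]   # root -> right child -> left leaf
  print(hs_word_prob(h, node_vecs, path_to_w))
  ```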

  18. Hierarchical Softmax. Figure 4: Hierarchical Softmax: Hugo Larochelle's YouTube lectures ◮ The structure of the tree is important for further efficiency in computation and performance. ◮ Examples: ◮ Morin and Bengio: use synsets in WordNet as the clusters of the tree. ◮ Mikolov et al.: use a Huffman tree, which takes into account the frequency of words.

  19. Margin Based Hinge Loss. C&W Model ◮ Avoids computing the expensive softmax by reformulating the objective. ◮ Train a network to produce higher scores for correct word windows than for incorrect ones. ◮ The pairwise ranking criterion is given as: $J_\theta = \sum_{x \in X} \sum_{w \in V} \max\{0, 1 - f_\theta(x) + f_\theta(x^{(w)})\}$. Here $x$ is a correct window, $x^{(w)}$ is an incorrect window created by replacing the center word with $w$, and $f_\theta(x)$ is the score output by the model.
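
  A minimal sketch of this ranking loss for a single correct window, with a made-up linear scoring function standing in for the C&W network:

  ```python
  import numpy as np

  def cw_ranking_loss(score_fn, window, center_idx, vocab):
      """Pairwise ranking loss for one correct window:
      sum over corrupted windows of max(0, 1 - f(x) + f(x^(w)))."""
      pos = score_fn(window)
      loss = 0.0
      for w in vocab:
          if w == window[center_idx]:
              continue                       # skip the uncorrupted window
          corrupted = list(window)
          corrupted[center_idx] = w          # replace the center word
          loss += max(0.0, 1.0 - pos + score_fn(corrupted))
      return loss

  # Toy scorer: a random linear model over word ids (illustrative only).
  rng = np.random.default_rng(0)
  vocab = list(range(100))
  W = rng.standard_normal((100, 5))

  def score_fn(window):                      # window = 5 word ids
      return float(sum(W[w, i] for i, w in enumerate(window)))

  print(cw_ranking_loss(score_fn, [3, 14, 15, 92, 65], center_idx=2, vocab=vocab))
  ```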

  20. Sampling Based Approaches. Sampling-based approaches approximate the softmax by an alternative loss function which is cheaper to compute. Interpreting the logistic loss function: $J_\theta = -\log \dfrac{\exp(h^\top v'_w)}{\sum_{w_i \in V} \exp(h^\top v'_{w_i})}$, i.e. $J_\theta = -h^\top v'_w + \log \sum_{w_i \in V} \exp(h^\top v'_{w_i})$. Computing the gradient w.r.t. the model parameters, we get $\nabla_\theta J_\theta = \nabla_\theta E(w) - \sum_{w_i \in V} P(w_i)\, \nabla_\theta E(w_i)$, where $-E(w) = h^\top v'_w$.

  21. Sampling Based Approaches. $\nabla_\theta J_\theta = \nabla_\theta E(w) - \sum_{w_i \in V} P(w_i)\, \nabla_\theta E(w_i)$. The gradient has two parts: ◮ Positive reinforcement for the target word. ◮ Negative reinforcement for all other words, weighted by their probability: $\sum_{w_i \in V} P(w_i)\, \nabla_\theta E(w_i) = \mathbb{E}_{w_i \sim P}[\nabla_\theta E(w_i)]$. To avoid computing this expectation exactly, all sampling-based approaches approximate the negative reinforcement term.
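
  A minimal numpy sketch of this decomposition for the gradient with respect to $h$ (so $\nabla E(w) = -v'_w$): the exact negative term is the expectation of $v'_{w_i}$ under the model distribution, which sampling-based methods replace by an average over a few samples. Sampling from $P$ itself below is only for illustration; it is as expensive as the softmax, which is why practical methods sample from a cheap proposal distribution instead.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  V, d = 1_000, 64
  h = rng.standard_normal(d)
  V_out = rng.standard_normal((V, d))    # output vectors v'_w (random stand-ins)
  target = 42                            # index of the correct word w

  logits = V_out @ h
  probs = np.exp(logits - logits.max())
  probs /= probs.sum()                   # P(w_i) under the model

  # Gradient of J w.r.t. h: positive term for the target word,
  # negative term = expectation of v'_{w_i} under P.
  grad_exact = -V_out[target] + probs @ V_out

  # Crude Monte Carlo approximation of the expectation with k samples.
  k = 10
  samples = rng.choice(V, size=k, p=probs)
  grad_approx = -V_out[target] + V_out[samples].mean(axis=0)

  print(np.linalg.norm(grad_exact - grad_approx))
  ```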

  22. Word2Vec ◮ Proposed by Mikolov et al. and widely used for many NLP applications. ◮ Key features: ◮ Removed the hidden layer. ◮ Uses additional context for training LMs. ◮ Introduced newer training strategies that exploit huge corpora efficiently. Word Analogies. Mikolov et al., 2013

  23. Word2Vec - Model Architectures. Continuous Bag-of-Words ◮ All words in the context get projected to the same position. ◮ The context is defined using both history and future words. ◮ The order of words in the context does not matter. ◮ Uses a log-linear classifier model. $J_\theta = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-n}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+n})$. Mikolov et al., 2013
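
  A minimal numpy sketch of the CBOW objective at one position: the context input vectors are averaged (no hidden layer, order ignored) and scored against every output embedding with a full softmax. All matrices and indices are random stand-ins.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  V, d = 5_000, 100
  W_in = rng.standard_normal((V, d)) * 0.01   # input embeddings v_w
  W_out = rng.standard_normal((V, d)) * 0.01  # output embeddings v'_w

  def cbow_loss(context_ids, center_id):
      """-log p(w_t | context): average the context vectors, then full softmax."""
      h = W_in[context_ids].mean(axis=0)      # shared projection, no hidden layer
      logits = W_out @ h
      logits -= logits.max()
      log_probs = logits - np.log(np.exp(logits).sum())
      return -log_probs[center_id]

  print(cbow_loss(context_ids=[10, 25, 7, 99], center_id=3))
  ```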

  24. Word2Vec - Model Architectures. Continuous Skip-gram Model ◮ Given the current word, predict the words in the context within a certain range. ◮ The rest of the ideas follow CBOW. $J_\theta = \frac{1}{T} \sum_{t=1}^{T} \sum_{-n \le j \le n,\, j \ne 0} \log p(w_{t+j} \mid w_t)$. Here, $p(w_{t+j} \mid w_t) = \dfrac{\exp(v_{w_t}^\top v'_{w_{t+j}})}{\sum_{w_i \in V} \exp(v_{w_t}^\top v'_{w_i})}$. Mikolov et al., 2013
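
  The analogous sketch for skip-gram, scoring every context word against the center word's input vector (again with the full softmax that the later slides approximate):

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  V, d = 5_000, 100
  W_in = rng.standard_normal((V, d)) * 0.01   # input embeddings v_w
  W_out = rng.standard_normal((V, d)) * 0.01  # output embeddings v'_w

  def skipgram_loss(center_id, context_ids):
      """Sum of -log p(w_{t+j} | w_t) over the context window (full softmax)."""
      v_c = W_in[center_id]
      logits = W_out @ v_c
      logits -= logits.max()
      log_probs = logits - np.log(np.exp(logits).sum())
      return -sum(log_probs[j] for j in context_ids)

  print(skipgram_loss(center_id=3, context_ids=[10, 25, 7, 99]))
  ```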

  25. Noise Contrastive Estimation. Key Idea: similar to the margin based hinge loss, learn a noise classifier which differentiates between a target word and noise. ◮ Formulated as a binary classification problem. ◮ Minimizes the cross-entropy (logistic) loss. ◮ Draws k noise samples from a noise distribution for each correct word. ◮ Approximates the softmax as k increases. Gutmann et al., 2010

  26. Noise Contrastive Estimation. Distributions ◮ Empirical ($P_{\text{train}}$): the actual distribution given by the training samples. ◮ Noise ($Q$): easy to sample from; allows an analytical expression for the log pdf; close to the actual data distribution, e.g. uniform or the empirical unigram distribution. ◮ Model ($P$): approximation to the empirical distribution. Notation: for every correct word $w_i$ along with its context $c_i$, we generate $k$ noise samples $\tilde{w}_{ij}$ from the noise distribution $Q$. The label is $y = 1$ for correct words and $y = 0$ for noise samples. Gutmann et al., 2010

  27. Noise Contrastive Estimation. Objective Function: $J_\theta = -\sum_{w_i \in V} \big[\log P(y=1 \mid w_i, c_i) + k\, \mathbb{E}_{\tilde{w}_{ij} \sim Q}[\log P(y=0 \mid \tilde{w}_{ij}, c_i)]\big]$. Using a Monte Carlo approximation: $J_\theta = -\sum_{w_i \in V} \big[\log P(y=1 \mid w_i, c_i) + k \cdot \frac{1}{k} \sum_{j=1}^{k} \log P(y=0 \mid \tilde{w}_{ij}, c_i)\big]$. Gutmann et al., 2010

  28. Noise Contrastive Estimation. The conditional distribution is given as: $P(y=1 \mid w, c) = \dfrac{\frac{1}{k+1} P_{\text{train}}(w \mid c)}{\frac{1}{k+1} P_{\text{train}}(w \mid c) + \frac{k}{k+1} Q(w)} = \dfrac{P_{\text{train}}(w \mid c)}{P_{\text{train}}(w \mid c) + k\, Q(w)}$. Using the model distribution: $P(y=1 \mid w, c) = \dfrac{P(w \mid c)}{P(w \mid c) + k\, Q(w)}$. Here $P(w \mid c) = \dfrac{\exp(h^\top v'_w)}{\sum_{w_i \in V} \exp(h^\top v'_{w_i})}$ corresponds to the softmax function. Gutmann et al., 2010
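
  A minimal numpy sketch of the NCE loss for one (word, context) pair with $k$ noise samples drawn from a uniform $Q$, using the common NCE simplification of treating $\exp(h^\top v'_w)$ as a self-normalized model probability so the $O(|V|)$ partition function never appears. All tensors and indices are made-up stand-ins.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  V, d, k = 5_000, 100, 10
  h = rng.standard_normal(d) * 0.1            # context representation
  V_out = rng.standard_normal((V, d)) * 0.1   # output embeddings v'_w

  def q_uniform(w):                            # noise distribution Q (uniform here)
      return 1.0 / V

  def p_model(w):
      # Treat exp(h^T v'_w) as the (self-normalized) model probability,
      # avoiding the O(|V|) partition function.
      return np.exp(h @ V_out[w])

  def nce_loss(target, noise_samples):
      """Binary cross-entropy: the target word vs. k noise words drawn from Q."""
      def p_true(w):                           # P(y=1 | w, c) = P(w|c) / (P(w|c) + k Q(w))
          pm = p_model(w)
          return pm / (pm + k * q_uniform(w))
      loss = -np.log(p_true(target))
      for w in noise_samples:
          loss -= np.log(1.0 - p_true(w))
      return loss

  noise = rng.integers(0, V, size=k)
  print(nce_loss(target=42, noise_samples=noise))
  ```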
