Word Embedding
Praveen Krishnan
CVIT, IIIT Hyderabad
June 22, 2017
Outline
◮ Introduction
◮ Classical Methods
◮ Language Modeling
◮ Neural Language Model
◮ Challenges of SoftMax
◮ Hierarchical Softmax
◮ Margin Based Hinge Loss
◮ Sampling Based Approaches
◮ Word2Vec
◮ Noise Contrastive Estimation
◮ Negative Sampling
Philosophy of Language
“(...) the meaning of a word is its use in the language.”
- Ludwig Wittgenstein, Philosophical Investigations, 1953
Slide Credit: Christian Perone, Word Embeddings - Introduction
Word Embedding
◮ Word embeddings are dense representations of words in a low-dimensional vector space that encode the associated semantics.
◮ Introduced by Bengio et al., NIPS'01.
◮ The silver bullet for many NLP tasks.
Word Embedding
The Syntactic and Semantic Phenomenon
◮ Morphology, Tense etc.
◮ Vocabulary Mismatch [Synonymy, Polysemy]
◮ Topic vs. Word Distribution
Reasoning via Analogy
w(athens) − w(greece) ≈ w(oslo) − ?
w(apples) − w(apple) ≈ w(oranges) − ?
w(walking) − w(walked) ≈ w(swimming) − ?
Word Embedding
The Syntactic and Semantic Phenomenon
◮ Morphology, Tense etc.
◮ Vocabulary Mismatch [Synonymy, Polysemy]
◮ Topic vs. Word Distribution
Reasoning via Analogy
w(athens) − w(greece) ≈ w(oslo) − w(norway)
w(apples) − w(apple) ≈ w(oranges) − w(orange)
w(walking) − w(walked) ≈ w(swimming) − w(swam)
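Such analogies are typically resolved by a nearest-neighbour search around the vector a − b + c. A minimal numpy sketch, assuming a toy embedding dictionary emb (random here purely for illustration; meaningful answers require trained vectors):

    import numpy as np

    # Toy embedding table; in practice these would be learned vectors.
    rng = np.random.default_rng(0)
    words = ['athens', 'greece', 'oslo', 'norway', 'apple', 'apples']
    emb = {w: rng.normal(size=8) for w in words}

    def analogy(emb, a, b, c, topn=1):
        """Return the word(s) closest in cosine similarity to emb[a] - emb[b] + emb[c]."""
        query = emb[a] - emb[b] + emb[c]
        query = query / np.linalg.norm(query)
        scored = []
        for word, vec in emb.items():
            if word in (a, b, c):
                continue  # exclude the query words themselves
            scored.append((query @ (vec / np.linalg.norm(vec)), word))
        return [w for _, w in sorted(scored, reverse=True)[:topn]]

    # With trained embeddings this should return ['norway'].
    print(analogy(emb, 'athens', 'greece', 'oslo'))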
Word Embedding
◮ Sparse-to-Dense representations.
◮ Unsupervised learning.
◮ Typically learned as a by-product of a language modeling problem.
Classical Methods - Topic Modeling
Latent Semantic Analysis [Deerwester et al., 1990]
Project terms and documents into a topic space using SVD on the term-document (co-occurrence) matrix.
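A minimal sketch of LSA on a toy term-document count matrix using plain numpy SVD; the corpus, the matrix values, and the choice k = 2 are illustrative assumptions, not from the slides:

    import numpy as np

    # Toy term-document matrix: rows = terms, columns = documents.
    terms = ['ship', 'boat', 'ocean', 'vote', 'election']
    X = np.array([[1, 0, 1, 0, 0],
                  [0, 1, 1, 0, 0],
                  [1, 1, 1, 0, 0],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1]], dtype=float)

    # Rank-k SVD: X ~ U_k S_k V_k^T; rows of U_k * S_k are k-dim term vectors.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2
    term_vectors = U[:, :k] * S[:k]       # terms in the latent "topic" space
    doc_vectors = Vt[:k, :].T * S[:k]     # documents in the same space

    print(dict(zip(terms, term_vectors.round(2))))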
Classical Methods - Topic Modeling
Latent Dirichlet Allocation [Blei et al., 2003]
◮ Assumes a generative probabilistic model of a corpus.
◮ Documents are represented as distributions over latent topics, where each topic is characterized by a distribution over words.
Figure 1: Plate notation representing the LDA model. Source: Wikipedia
Language Modeling
Probabilistic Language Modeling
Given a sequence of words, do they form a valid construct in the language?
p(w_1, \dots, w_T) = \prod_i p(w_i \mid w_1, \dots, w_{i-1})
p('high', 'winds', 'tonight') > p('large', 'winds', 'tonight')
Using the Markov assumption:
p(w_1, \dots, w_T) = \prod_i p(w_i \mid w_{i-1}, \dots, w_{i-n+1})
Applications
Spell correction, Machine translation, Speech recognition, OCR etc.
Language Modeling
Probabilistic Language Modeling
◮ n-gram based model:
p(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{\mathrm{count}(w_{t-n+1}, \dots, w_{t-1}, w_t)}{\mathrm{count}(w_{t-n+1}, \dots, w_{t-1})}
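A minimal sketch of this count-based estimate, assuming a toy corpus and a bigram model (n = 2); both choices are illustrative:

    from collections import Counter

    corpus = "high winds tonight large winds tonight high winds today".split()
    n = 2  # bigram model for illustration

    ngrams = Counter(tuple(corpus[i:i+n]) for i in range(len(corpus) - n + 1))
    contexts = Counter(tuple(corpus[i:i+n-1]) for i in range(len(corpus) - n + 2))

    def ngram_prob(word, history):
        """p(w_t | w_{t-n+1}, ..., w_{t-1}) = count(history, w_t) / count(history)."""
        history = tuple(history[-(n - 1):])
        return ngrams[history + (word,)] / contexts[history]

    print(ngram_prob('tonight', ['winds']))  # count('winds tonight') / count('winds') = 2/3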
Language Model
Neural Probabilistic Language Model
Bengio et al., JMLR'03
Language Model
Neural Probabilistic Language Model
p(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{\exp(h^\top v_{w_t})}{\sum_{w_i \in V} \exp(h^\top v_{w_i})} \Rightarrow Softmax Layer
Here h is the hidden representation of the input, v_{w_i} is the output word embedding of word i, and V is the vocabulary.
Bengio et al., JMLR'03
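A minimal numpy sketch of this softmax layer over a toy vocabulary, which makes the O(|V|) normalization explicit; the sizes and random parameters are illustrative assumptions:

    import numpy as np

    V, d = 10000, 128                           # vocabulary size and hidden dimension (toy values)
    rng = np.random.default_rng(0)
    W_out = rng.normal(scale=0.1, size=(V, d))  # output embeddings v_{w_i}, one row per word
    h = rng.normal(size=d)                      # hidden representation of the context

    # p(w_t | context) = exp(h^T v_{w_t}) / sum_i exp(h^T v_{w_i})
    logits = W_out @ h                          # one score per vocabulary word: O(|V|)
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()

    print(probs.shape, probs.sum())             # (10000,) 1.0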
Neural Probabilistic Language Model
◮ Associate with each word in the vocabulary a distributed feature vector.
◮ Learn both the embeddings and the parameters of the probability function jointly.
Bengio et al., JMLR'03
Softmax Classifier
Figure 2: Predicting the next word with softmax
Slide Credit: Sebastian Ruder. Blog: On word embeddings - Part 2: Approximating the Softmax.
Challenges of SoftMax
One of the major challenges of the previous formulation is the cost of computing the softmax, which is O(|V|) (typically |V| > 100K).
Major works:
◮ Softmax-based approaches [bringing more efficiency]
  ◮ Hierarchical Softmax
  ◮ Differentiated Softmax
  ◮ CNN-Softmax
◮ Sampling-based approaches [approximating the softmax using a different loss function]
  ◮ Importance Sampling
  ◮ Margin Based Hinge Loss
  ◮ Noise Contrastive Estimation
  ◮ Negative Sampling
  ◮ ...
Hierarchical Softmax
◮ Uses a binary tree representation of the output layer, with one leaf per word in the vocabulary.
◮ Evaluates at most log_2(|V|) nodes instead of |V| nodes.
◮ Parameters are stored only at the internal nodes, hence the total number of parameters is the same as for the regular softmax.
p(\mathrm{right} \mid n, c) = \sigma(h^\top v_n)
p(\mathrm{left} \mid n, c) = 1 - p(\mathrm{right} \mid n, c)
Figure 3: Hierarchical Softmax: Morin and Bengio, 2005
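A minimal sketch of the path-probability computation, assuming a hypothetical encoding where each word's path is a list of (internal-node id, branch direction) pairs; this is not the exact data structure of any particular implementation:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hsoftmax_prob(word, h, node_vectors, paths):
        """p(word | context) as a product of binary decisions along the tree path.

        paths[word] is a list of (node_id, direction) with direction +1 for "right"
        and -1 for "left"; node_vectors[node_id] is the vector v_n of that node.
        """
        prob = 1.0
        for node_id, direction in paths[word]:
            # p(right | n, c) = sigma(h^T v_n); p(left | n, c) = 1 - p(right | n, c)
            prob *= sigmoid(direction * np.dot(h, node_vectors[node_id]))
        return prob

    # Toy example: 4-word vocabulary, full binary tree with 3 internal nodes.
    rng = np.random.default_rng(0)
    node_vectors = rng.normal(size=(3, 8))
    paths = {'a': [(0, -1), (1, -1)], 'b': [(0, -1), (1, +1)],
             'c': [(0, +1), (2, -1)], 'd': [(0, +1), (2, +1)]}
    h = rng.normal(size=8)
    print(sum(hsoftmax_prob(w, h, node_vectors, paths) for w in paths))  # sums to 1.0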
Hierarchical Softmax
Figure 4: Hierarchical Softmax: Hugo Larochelle's YouTube lectures
◮ The structure of the tree is important for further gains in computational efficiency and performance.
◮ Examples:
  ◮ Morin and Bengio: use synsets in WordNet as the clusters of the tree.
  ◮ Mikolov et al.: use a Huffman tree, which takes into account the frequency of words.
Margin Based Hinge Loss
C&W Model
◮ Avoids computing the expensive softmax by reformulating the objective.
◮ Trains a network to produce higher scores for correct word windows than for incorrect ones.
◮ The pairwise ranking criterion is given as:
J_\theta = \sum_{x \in X} \sum_{w \in V} \max\{0, 1 - f_\theta(x) + f_\theta(x^{(w)})\}
Here x is a correct window, x^{(w)} is an incorrect window created by replacing the center word with w, and f_\theta(x) is the score output by the model.
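A minimal sketch of the pairwise ranking loss for a single window, with a stand-in linear scorer in place of f_theta and a toy vocabulary (both are illustrative assumptions, not the actual C&W network):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'ran']
    emb = {w: rng.normal(size=16) for w in vocab}
    w_score = rng.normal(size=5 * 16)  # linear scorer over a concatenated 5-word window

    def score(window):
        """f_theta(x): score of a 5-word window (a stand-in for the C&W network)."""
        return float(w_score @ np.concatenate([emb[w] for w in window]))

    def window_hinge_loss(window, center=2):
        """sum over w of max(0, 1 - f(x) + f(x^(w))) for corrupted center words."""
        loss = 0.0
        for w in vocab:
            if w == window[center]:
                continue
            corrupted = list(window)
            corrupted[center] = w  # replace the center word with an incorrect one
            loss += max(0.0, 1.0 - score(window) + score(corrupted))
        return loss

    print(window_hinge_loss(['the', 'cat', 'sat', 'on', 'mat']))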
Sampling Based Approaches
Sampling based approaches approximate the softmax by an alternative loss function which is cheaper to compute.
Interpreting the logistic loss function:
J_\theta = -\log \frac{\exp(h^\top v'_w)}{\sum_{w_i \in V} \exp(h^\top v'_{w_i})}
J_\theta = -h^\top v'_w + \log \sum_{w_i \in V} \exp(h^\top v'_{w_i})
Computing the gradient w.r.t. the model parameters, we get
\nabla_\theta J_\theta = \nabla_\theta E(w) - \sum_{w_i \in V} P(w_i) \nabla_\theta E(w_i)
where E(w) = -h^\top v'_w.
Sampling Based Approaches
\nabla_\theta J_\theta = \nabla_\theta E(w) - \sum_{w_i \in V} P(w_i) \nabla_\theta E(w_i)
The gradient has two parts:
◮ Positive reinforcement for the target word.
◮ Negative reinforcement for all other words, weighted by their probability:
\sum_{w_i \in V} P(w_i) \nabla_\theta E(w_i) = \mathbb{E}_{w_i \sim P}[\nabla_\theta E(w_i)]
To avoid summing over the entire vocabulary, all sampling based approaches approximate this negative reinforcement term.
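A small numpy check of this decomposition for the gradient with respect to h, where ∇_h E(w) = −v'_w; the sizes are toy values chosen purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 50, 8
    Vout = rng.normal(size=(V, d))   # output embeddings v'_{w_i}
    h = rng.normal(size=d)
    target = 3                       # index of the correct word w

    probs = np.exp(Vout @ h)
    probs /= probs.sum()             # P(w_i), the softmax distribution

    grad_pos = -Vout[target]                     # grad E(w): positive reinforcement for the target
    grad_neg = -(probs[:, None] * Vout).sum(0)   # E_{w_i ~ P}[grad E(w_i)]
    grad_J = grad_pos - grad_neg                 # nabla J = grad E(w) - E_P[grad E(w_i)]

    # Compare with the direct gradient of J = -log softmax(target) w.r.t. h
    onehot = np.zeros(V)
    onehot[target] = 1.0
    grad_direct = -(Vout.T @ (onehot - probs))
    print(np.allclose(grad_J, grad_direct))      # True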
Word2Vec
◮ Proposed by Mikolov et al. and widely used for many NLP applications.
◮ Key features:
  ◮ Removed the hidden layer.
  ◮ Uses additional context for training LMs.
  ◮ Introduced new training strategies that use huge databases of words efficiently.
Word Analogies
Mikolov et al., 2013
Word2Vec - Model Architectures
Continuous Bag-of-Words
◮ All words in the context get projected to the same position.
◮ Context is defined using both history and future words.
◮ The order of words in the context does not matter.
◮ Uses a log-linear classifier model.
J_\theta = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-n}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+n})
Mikolov et al., 2013
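A minimal sketch of the CBOW probability for one position: the context embeddings are averaged (so word order is ignored) and scored against the output embeddings with a softmax; the matrices and indices are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 1000, 64
    Vin = rng.normal(scale=0.1, size=(V, d))   # input embeddings
    Vout = rng.normal(scale=0.1, size=(V, d))  # output embeddings

    def cbow_log_prob(context_ids, target_id):
        """log p(w_t | context): all context words share one projection (their average)."""
        h = Vin[context_ids].mean(axis=0)      # order of context words does not matter
        logits = Vout @ h
        logits -= logits.max()                 # numerical stability
        return logits[target_id] - np.log(np.exp(logits).sum())

    # context = (w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}), target = w_t
    print(cbow_log_prob([4, 17, 99, 256], target_id=42))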
Word2Vec - Model Architectures
Continuous Skip-gram Model
◮ Given the current word, predict the words in the context within a certain range.
◮ The rest of the ideas follow CBOW.
J_\theta = \frac{1}{T} \sum_{t=1}^{T} \sum_{-n \le j \le n, \, j \ne 0} \log p(w_{t+j} \mid w_t)
Here,
p(w_{t+j} \mid w_t) = \frac{\exp(v_{w_t}^\top v'_{w_{t+j}})}{\sum_{w_i \in V} \exp(v_{w_t}^\top v'_{w_i})}
Mikolov et al., 2013
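A complementary sketch for skip-gram: the center word's input embedding plays the role of h and predicts each context word independently with the softmax above; again, the matrices and indices are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 1000, 64
    Vin = rng.normal(scale=0.1, size=(V, d))   # v_w: input (center-word) embeddings
    Vout = rng.normal(scale=0.1, size=(V, d))  # v'_w: output (context-word) embeddings

    def log_p(context_id, center_id):
        """log p(w_{t+j} | w_t) with h = v_{w_t}."""
        logits = Vout @ Vin[center_id]
        logits -= logits.max()                 # numerical stability
        return logits[context_id] - np.log(np.exp(logits).sum())

    def skipgram_window_objective(sentence, t, n=2):
        """sum over -n <= j <= n, j != 0 of log p(w_{t+j} | w_t) for one position t."""
        total = 0.0
        for j in range(-n, n + 1):
            if j != 0 and 0 <= t + j < len(sentence):
                total += log_p(sentence[t + j], sentence[t])
        return total

    print(skipgram_window_objective([4, 17, 42, 99, 256], t=2))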
Noise Contrastive Estimation
Key Idea
Similar to the margin based hinge loss, learn a classifier that differentiates between a target word and noise samples.
◮ Formulated as a binary classification problem.
◮ Minimizes the cross-entropy (logistic) loss.
◮ Draws k noise samples from a noise distribution for each correct word.
◮ Approximates the softmax as k increases.
Gutmann et al., 2010
Noise Contrastive Estimation
Distributions
◮ Empirical (P_train): the actual distribution given by the training samples.
◮ Noise (Q):
  ◮ Easy to sample from.
  ◮ Allows an analytical expression for the log pdf.
  ◮ Close to the actual data distribution, e.g. the uniform or empirical unigram distribution.
◮ Model (P): approximation to the empirical distribution.
Notation
For every correct word w_i along with its context c_i, we generate k noise samples \tilde{w}_{ij} from the noise distribution Q. The label is y = 1 for all correct words and y = 0 for noise samples.
Gutmann et al., 2010
Noise Contrastive Estimation
Objective Function
J_\theta = -\sum_{w_i \in V} \left[ \log P(y=1 \mid w_i, c_i) + k \, \mathbb{E}_{\tilde{w}_{ij} \sim Q} [\log P(y=0 \mid \tilde{w}_{ij}, c_i)] \right]
Using a Monte Carlo approximation:
J_\theta = -\sum_{w_i \in V} \left[ \log P(y=1 \mid w_i, c_i) + k \cdot \frac{1}{k} \sum_{j=1}^{k} \log P(y=0 \mid \tilde{w}_{ij}, c_i) \right]
Gutmann et al., 2010
Noise Contrastive Estimation
The conditional distribution is given as:
P(y=1 \mid w, c) = \frac{\frac{1}{k+1} P_{train}(w \mid c)}{\frac{1}{k+1} P_{train}(w \mid c) + \frac{k}{k+1} Q(w)}
P(y=1 \mid w, c) = \frac{P_{train}(w \mid c)}{P_{train}(w \mid c) + k \, Q(w)}
Using the model distribution:
P(y=1 \mid w, c) = \frac{P(w \mid c)}{P(w \mid c) + k \, Q(w)}
Here P(w \mid c) = \frac{\exp(h^\top v'_w)}{\sum_{w_i \in V} \exp(h^\top v'_{w_i})} corresponds to the softmax function.
Gutmann et al., 2010
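A minimal sketch of the NCE loss for one (word, context) pair with a uniform noise distribution Q, treating the unnormalized score exp(h^T v'_w) as the model probability (the normalizer is assumed to be 1, as is common in NCE implementations); all sizes are toy values chosen for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    V, d, k = 1000, 64, 5
    Vout = rng.normal(scale=0.1, size=(V, d))   # output embeddings v'_w
    Q = np.full(V, 1.0 / V)                     # noise distribution (uniform here)

    def nce_loss(h, target_id):
        """-[log P(y=1 | w, c) + sum_j log P(y=0 | w~_j, c)] with P(y=1|w,c) = p / (p + k Q(w))."""
        def p_model(w):                         # unnormalized model score, Z assumed to be 1
            return np.exp(h @ Vout[w])
        noise_ids = rng.choice(V, size=k, p=Q)  # k noise samples drawn from Q
        pos = p_model(target_id) / (p_model(target_id) + k * Q[target_id])
        loss = -np.log(pos)
        for w in noise_ids:
            neg = (k * Q[w]) / (p_model(w) + k * Q[w])   # P(y=0 | w~, c)
            loss -= np.log(neg)
        return loss

    print(nce_loss(rng.normal(size=d), target_id=42))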