Word Embedding
Praveen Krishnan
CVIT, IIIT Hyderabad
June 22, 2017
Outline
◮ Introduction
◮ Classical Methods
◮ Language Modeling
◮ Neural Language Model
◮ Challenges of SoftMax
◮ Hierarchical Softmax
◮ Margin Based Hinge Loss
◮ Sampling Based Approaches
◮ Word2Vec
◮ Noise Contrastive Estimation
◮ Negative Sampling
Philosophy of Language
“(...) the meaning of a word is its use in the language.”
- Ludwig Wittgenstein, Philosophical Investigations, 1953
Slide Credit: Christian Perone, Word Embeddings - Introduction
Word Embedding
◮ Word embeddings are dense representations of words in a low-dimensional vector space that encode the associated semantics.
◮ Introduced by Bengio et al., NIPS'01.
◮ The silver bullet for many NLP tasks.
Word Embedding
The Syntactic and Semantic Phenomenon
◮ Morphology, Tense etc.
◮ Vocabulary Mismatch [Synonymy, Polysemy]
◮ Topic vs. Word Distribution
Reasoning via Analogy
w(athens) − w(greece) ≈ w(oslo) − ?
w(apples) − w(apple) ≈ w(oranges) − ?
w(walking) − w(walked) ≈ w(swimming) − ?
Word Embedding
The Syntactic and Semantic Phenomenon
◮ Morphology, Tense etc.
◮ Vocabulary Mismatch [Synonymy, Polysemy]
◮ Topic vs. Word Distribution
Reasoning via Analogy
w(athens) − w(greece) ≈ w(oslo) − w(norway)
w(apples) − w(apple) ≈ w(oranges) − w(orange)
w(walking) − w(walked) ≈ w(swimming) − w(swam)
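Such analogies are typically resolved by a nearest-neighbour search around the vector a − b + c. A minimal numpy sketch, assuming a toy embedding dictionary emb (random here purely for illustration; meaningful answers require trained vectors):

    import numpy as np

    # Toy embedding table; in practice these would be learned vectors.
    rng = np.random.default_rng(0)
    words = ['athens', 'greece', 'oslo', 'norway', 'apple', 'apples']
    emb = {w: rng.normal(size=8) for w in words}

    def analogy(emb, a, b, c, topn=1):
        """Return the word(s) closest in cosine similarity to emb[a] - emb[b] + emb[c]."""
        query = emb[a] - emb[b] + emb[c]
        query = query / np.linalg.norm(query)
        scored = []
        for word, vec in emb.items():
            if word in (a, b, c):
                continue  # exclude the query words themselves
            scored.append((query @ (vec / np.linalg.norm(vec)), word))
        return [w for _, w in sorted(scored, reverse=True)[:topn]]

    # With trained embeddings this should return ['norway'].
    print(analogy(emb, 'athens', 'greece', 'oslo'))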
Word Embedding
◮ Sparse-to-Dense representations.
◮ Unsupervised learning.
◮ Typically learned as a by-product of a language modeling problem.
Classical Methods - Topic Modeling
Latent Semantic Analysis [Deerwester et al., 1990]
Project terms and documents into a topic space using SVD on the term-document (co-occurrence) matrix.
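A minimal sketch of LSA on a toy term-document count matrix using plain numpy SVD; the corpus, the matrix values, and the choice k = 2 are illustrative assumptions, not from the slides:

    import numpy as np

    # Toy term-document matrix: rows = terms, columns = documents.
    terms = ['ship', 'boat', 'ocean', 'vote', 'election']
    X = np.array([[1, 0, 1, 0, 0],
                  [0, 1, 1, 0, 0],
                  [1, 1, 1, 0, 0],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1]], dtype=float)

    # Rank-k SVD: X ~ U_k S_k V_k^T; rows of U_k * S_k are k-dim term vectors.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2
    term_vectors = U[:, :k] * S[:k]       # terms in the latent "topic" space
    doc_vectors = Vt[:k, :].T * S[:k]     # documents in the same space

    print(dict(zip(terms, term_vectors.round(2))))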
Classical Methods - Topic Modeling
Latent Dirichlet Allocation [Blei et al., 2003]
◮ Assumes a generative probabilistic model of a corpus.
◮ Documents are represented as distributions over latent topics, where each topic is characterized by a distribution over words.
Figure 1: Plate notation representing the LDA model. Source: Wikipedia
Language Modeling
Probabilistic Language Modeling
Given a sequence of words, do they form a valid construct in the language?
p(w_1, \dots, w_T) = \prod_i p(w_i \mid w_1, \dots, w_{i-1})
p('high', 'winds', 'tonight') > p('large', 'winds', 'tonight')
Using the Markov assumption:
p(w_1, \dots, w_T) = \prod_i p(w_i \mid w_{i-1}, \dots, w_{i-n+1})
Applications
Spell correction, Machine translation, Speech recognition, OCR etc.
Language Modeling
Probabilistic Language Modeling
◮ n-gram based model:
p(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{\mathrm{count}(w_{t-n+1}, \dots, w_{t-1}, w_t)}{\mathrm{count}(w_{t-n+1}, \dots, w_{t-1})}
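A minimal sketch of this count-based estimate, assuming a toy corpus and a bigram model (n = 2); both choices are illustrative:

    from collections import Counter

    corpus = "high winds tonight large winds tonight high winds today".split()
    n = 2  # bigram model for illustration

    ngrams = Counter(tuple(corpus[i:i+n]) for i in range(len(corpus) - n + 1))
    contexts = Counter(tuple(corpus[i:i+n-1]) for i in range(len(corpus) - n + 2))

    def ngram_prob(word, history):
        """p(w_t | w_{t-n+1}, ..., w_{t-1}) = count(history, w_t) / count(history)."""
        history = tuple(history[-(n - 1):])
        return ngrams[history + (word,)] / contexts[history]

    print(ngram_prob('tonight', ['winds']))  # count('winds tonight') / count('winds') = 2/3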
Language Model
Neural Probabilistic Language Model
Bengio et al., JMLR'03
Language Model
Neural Probabilistic Language Model
p(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{\exp(h^\top v_{w_t})}{\sum_{w_i \in V} \exp(h^\top v_{w_i})} \Rightarrow Softmax Layer
Here h is the hidden representation of the input, v_{w_i} is the output word embedding of word i, and V is the vocabulary.
Bengio et al., JMLR'03
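A minimal numpy sketch of this softmax layer over a toy vocabulary, which makes the O(|V|) normalization explicit; the sizes and random parameters are illustrative assumptions:

    import numpy as np

    V, d = 10000, 128                           # vocabulary size and hidden dimension (toy values)
    rng = np.random.default_rng(0)
    W_out = rng.normal(scale=0.1, size=(V, d))  # output embeddings v_{w_i}, one row per word
    h = rng.normal(size=d)                      # hidden representation of the context

    # p(w_t | context) = exp(h^T v_{w_t}) / sum_i exp(h^T v_{w_i})
    logits = W_out @ h                          # one score per vocabulary word: O(|V|)
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()

    print(probs.shape, probs.sum())             # (10000,) 1.0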
Neural Probabilistic Language Model
◮ Associate with each word in the vocabulary a distributed feature vector.
◮ Learn both the embeddings and the parameters of the probability function jointly.
Bengio et al., JMLR'03
Softmax Classifier
Figure 2: Predicting the next word with softmax
Slide Credit: Sebastian Ruder. Blog: On word embeddings - Part 2: Approximating the Softmax.
Challenges of SoftMax
One of the major challenges of the previous formulation is the cost of computing the softmax, which is O(|V|) (typically |V| > 100K).
Major works:
◮ Softmax-based approaches [bringing more efficiency]
  ◮ Hierarchical Softmax
  ◮ Differentiated Softmax
  ◮ CNN-Softmax
◮ Sampling-based approaches [approximating the softmax using a different loss function]
  ◮ Importance Sampling
  ◮ Margin Based Hinge Loss
  ◮ Noise Contrastive Estimation
  ◮ Negative Sampling
  ◮ ...
Hierarchical Softmax
◮ Uses a binary tree representation of the output layer, with one leaf per word in the vocabulary.
◮ Evaluates at most log_2(|V|) nodes instead of |V| nodes.
◮ Parameters are stored only at the internal nodes, hence the total number of parameters is the same as for the regular softmax.
p(\mathrm{right} \mid n, c) = \sigma(h^\top v_n)
p(\mathrm{left} \mid n, c) = 1 - p(\mathrm{right} \mid n, c)
Figure 3: Hierarchical Softmax: Morin and Bengio, 2005
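A minimal sketch of the path-probability computation, assuming a hypothetical encoding where each word's path is a list of (internal-node id, branch direction) pairs; this is not the exact data structure of any particular implementation:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hsoftmax_prob(word, h, node_vectors, paths):
        """p(word | context) as a product of binary decisions along the tree path.

        paths[word] is a list of (node_id, direction) with direction +1 for "right"
        and -1 for "left"; node_vectors[node_id] is the vector v_n of that node.
        """
        prob = 1.0
        for node_id, direction in paths[word]:
            # p(right | n, c) = sigma(h^T v_n); p(left | n, c) = 1 - p(right | n, c)
            prob *= sigmoid(direction * np.dot(h, node_vectors[node_id]))
        return prob

    # Toy example: 4-word vocabulary, full binary tree with 3 internal nodes.
    rng = np.random.default_rng(0)
    node_vectors = rng.normal(size=(3, 8))
    paths = {'a': [(0, -1), (1, -1)], 'b': [(0, -1), (1, +1)],
             'c': [(0, +1), (2, -1)], 'd': [(0, +1), (2, +1)]}
    h = rng.normal(size=8)
    print(sum(hsoftmax_prob(w, h, node_vectors, paths) for w in paths))  # sums to 1.0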
Hierarchical Softmax
Figure 4: Hierarchical Softmax: Hugo Larochelle's YouTube lectures
◮ The structure of the tree is important for further gains in computational efficiency and performance.
◮ Examples:
  ◮ Morin and Bengio: use synsets in WordNet as the clusters of the tree.
  ◮ Mikolov et al.: use a Huffman tree, which takes into account the frequency of words.
Margin Based Hinge Loss
C&W Model
◮ Avoids computing the expensive softmax by reformulating the objective.
◮ Trains a network to produce higher scores for correct word windows than for incorrect ones.
◮ The pairwise ranking criterion is given as:
J_\theta = \sum_{x \in X} \sum_{w \in V} \max\{0, 1 - f_\theta(x) + f_\theta(x^{(w)})\}
Here x is a correct window, x^{(w)} is an incorrect window created by replacing the center word with w, and f_\theta(x) is the score output by the model.
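A minimal sketch of the pairwise ranking loss for a single window, with a stand-in linear scorer in place of f_theta and a toy vocabulary (both are illustrative assumptions, not the actual C&W network):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'ran']
    emb = {w: rng.normal(size=16) for w in vocab}
    w_score = rng.normal(size=5 * 16)  # linear scorer over a concatenated 5-word window

    def score(window):
        """f_theta(x): score of a 5-word window (a stand-in for the C&W network)."""
        return float(w_score @ np.concatenate([emb[w] for w in window]))

    def window_hinge_loss(window, center=2):
        """sum over w of max(0, 1 - f(x) + f(x^(w))) for corrupted center words."""
        loss = 0.0
        for w in vocab:
            if w == window[center]:
                continue
            corrupted = list(window)
            corrupted[center] = w  # replace the center word with an incorrect one
            loss += max(0.0, 1.0 - score(window) + score(corrupted))
        return loss

    print(window_hinge_loss(['the', 'cat', 'sat', 'on', 'mat']))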
Sampling Based Approaches
Sampling based approaches approximate the softmax by an alternative loss function which is cheaper to compute.
Interpreting the logistic loss function:
J_\theta = -\log \frac{\exp(h^\top v'_w)}{\sum_{w_i \in V} \exp(h^\top v'_{w_i})}
J_\theta = -h^\top v'_w + \log \sum_{w_i \in V} \exp(h^\top v'_{w_i})
Computing the gradient w.r.t. the model parameters, we get
\nabla_\theta J_\theta = \nabla_\theta E(w) - \sum_{w_i \in V} P(w_i) \nabla_\theta E(w_i)
where E(w) = -h^\top v'_w.
Sampling Based Approaches
\nabla_\theta J_\theta = \nabla_\theta E(w) - \sum_{w_i \in V} P(w_i) \nabla_\theta E(w_i)
The gradient has two parts:
◮ Positive reinforcement for the target word.
◮ Negative reinforcement for all other words, weighted by their probability:
\sum_{w_i \in V} P(w_i) \nabla_\theta E(w_i) = \mathbb{E}_{w_i \sim P}[\nabla_\theta E(w_i)]
To avoid summing over the entire vocabulary, all sampling based approaches approximate this negative reinforcement term.
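A small numpy check of this decomposition for the gradient with respect to h, where ∇_h E(w) = −v'_w; the sizes are toy values chosen purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 50, 8
    Vout = rng.normal(size=(V, d))   # output embeddings v'_{w_i}
    h = rng.normal(size=d)
    target = 3                       # index of the correct word w

    probs = np.exp(Vout @ h)
    probs /= probs.sum()             # P(w_i), the softmax distribution

    grad_pos = -Vout[target]                     # grad E(w): positive reinforcement for the target
    grad_neg = -(probs[:, None] * Vout).sum(0)   # E_{w_i ~ P}[grad E(w_i)]
    grad_J = grad_pos - grad_neg                 # nabla J = grad E(w) - E_P[grad E(w_i)]

    # Compare with the direct gradient of J = -log softmax(target) w.r.t. h
    onehot = np.zeros(V)
    onehot[target] = 1.0
    grad_direct = -(Vout.T @ (onehot - probs))
    print(np.allclose(grad_J, grad_direct))      # True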
Word2Vec
◮ Proposed by Mikolov et al. and widely used for many NLP applications.
◮ Key features:
  ◮ Removed the hidden layer.
  ◮ Uses additional context for training LMs.
  ◮ Introduced new training strategies that use huge databases of words efficiently.
Word Analogies
Mikolov et al., 2013
Word2Vec - Model Architectures
Continuous Bag-of-Words
◮ All words in the context get projected to the same position.
◮ Context is defined using both history and future words.
◮ The order of words in the context does not matter.
◮ Uses a log-linear classifier model.
J_\theta = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-n}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+n})
Mikolov et al., 2013
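A minimal sketch of the CBOW probability for one position: the context embeddings are averaged (so word order is ignored) and scored against the output embeddings with a softmax; the matrices and indices are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 1000, 64
    Vin = rng.normal(scale=0.1, size=(V, d))   # input embeddings
    Vout = rng.normal(scale=0.1, size=(V, d))  # output embeddings

    def cbow_log_prob(context_ids, target_id):
        """log p(w_t | context): all context words share one projection (their average)."""
        h = Vin[context_ids].mean(axis=0)      # order of context words does not matter
        logits = Vout @ h
        logits -= logits.max()                 # numerical stability
        return logits[target_id] - np.log(np.exp(logits).sum())

    # context = (w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}), target = w_t
    print(cbow_log_prob([4, 17, 99, 256], target_id=42))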
Word2Vec - Model Architectures
Continuous Skip-gram Model
◮ Given the current word, predict the words in the context within a certain range.
◮ The rest of the ideas follow CBOW.
J_\theta = \frac{1}{T} \sum_{t=1}^{T} \sum_{-n \le j \le n, \, j \ne 0} \log p(w_{t+j} \mid w_t)
Here,
p(w_{t+j} \mid w_t) = \frac{\exp(v_{w_t}^\top v'_{w_{t+j}})}{\sum_{w_i \in V} \exp(v_{w_t}^\top v'_{w_i})}
Mikolov et al., 2013
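A complementary sketch for skip-gram: the center word's input embedding plays the role of h and predicts each context word independently with the softmax above; again, the matrices and indices are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 1000, 64
    Vin = rng.normal(scale=0.1, size=(V, d))   # v_w: input (center-word) embeddings
    Vout = rng.normal(scale=0.1, size=(V, d))  # v'_w: output (context-word) embeddings

    def log_p(context_id, center_id):
        """log p(w_{t+j} | w_t) with h = v_{w_t}."""
        logits = Vout @ Vin[center_id]
        logits -= logits.max()                 # numerical stability
        return logits[context_id] - np.log(np.exp(logits).sum())

    def skipgram_window_objective(sentence, t, n=2):
        """sum over -n <= j <= n, j != 0 of log p(w_{t+j} | w_t) for one position t."""
        total = 0.0
        for j in range(-n, n + 1):
            if j != 0 and 0 <= t + j < len(sentence):
                total += log_p(sentence[t + j], sentence[t])
        return total

    print(skipgram_window_objective([4, 17, 42, 99, 256], t=2))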
Noise Contrastive Estimation
Key Idea
Similar to the margin based hinge loss, learn a classifier that differentiates between a target word and noise samples.
◮ Formulated as a binary classification problem.
◮ Minimizes the cross-entropy (logistic) loss.
◮ Draws k noise samples from a noise distribution for each correct word.
◮ Approximates the softmax as k increases.
Gutmann et al., 2010
Noise Contrastive Estimation
Distributions
◮ Empirical (P_train): the actual distribution given by the training samples.
◮ Noise (Q):
  ◮ Easy to sample from.
  ◮ Allows an analytical expression for the log pdf.
  ◮ Close to the actual data distribution, e.g. the uniform or empirical unigram distribution.
◮ Model (P): approximation to the empirical distribution.
Notation
For every correct word w_i along with its context c_i, we generate k noise samples \tilde{w}_{ij} from the noise distribution Q. The label is y = 1 for all correct words and y = 0 for noise samples.
Gutmann et al., 2010
Noise Contrastive Estimation
Objective Function
J_\theta = -\sum_{w_i \in V} \left[ \log P(y=1 \mid w_i, c_i) + k \, \mathbb{E}_{\tilde{w}_{ij} \sim Q} [\log P(y=0 \mid \tilde{w}_{ij}, c_i)] \right]
Using a Monte Carlo approximation:
J_\theta = -\sum_{w_i \in V} \left[ \log P(y=1 \mid w_i, c_i) + k \cdot \frac{1}{k} \sum_{j=1}^{k} \log P(y=0 \mid \tilde{w}_{ij}, c_i) \right]
Gutmann et al., 2010
Noise Contrastive Estimation
The conditional distribution is given as:
P(y=1 \mid w, c) = \frac{\frac{1}{k+1} P_{train}(w \mid c)}{\frac{1}{k+1} P_{train}(w \mid c) + \frac{k}{k+1} Q(w)}
P(y=1 \mid w, c) = \frac{P_{train}(w \mid c)}{P_{train}(w \mid c) + k \, Q(w)}
Using the model distribution:
P(y=1 \mid w, c) = \frac{P(w \mid c)}{P(w \mid c) + k \, Q(w)}
Here P(w \mid c) = \frac{\exp(h^\top v'_w)}{\sum_{w_i \in V} \exp(h^\top v'_{w_i})} corresponds to the softmax function.
Gutmann et al., 2010
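A minimal sketch of the NCE loss for one (word, context) pair with a uniform noise distribution Q, treating the unnormalized score exp(h^T v'_w) as the model probability (the normalizer is assumed to be 1, as is common in NCE implementations); all sizes are toy values chosen for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    V, d, k = 1000, 64, 5
    Vout = rng.normal(scale=0.1, size=(V, d))   # output embeddings v'_w
    Q = np.full(V, 1.0 / V)                     # noise distribution (uniform here)

    def nce_loss(h, target_id):
        """-[log P(y=1 | w, c) + sum_j log P(y=0 | w~_j, c)] with P(y=1|w,c) = p / (p + k Q(w))."""
        def p_model(w):                         # unnormalized model score, Z assumed to be 1
            return np.exp(h @ Vout[w])
        noise_ids = rng.choice(V, size=k, p=Q)  # k noise samples drawn from Q
        pos = p_model(target_id) / (p_model(target_id) + k * Q[target_id])
        loss = -np.log(pos)
        for w in noise_ids:
            neg = (k * Q[w]) / (p_model(w) + k * Q[w])   # P(y=0 | w~, c)
            loss -= np.log(neg)
        return loss

    print(nce_loss(rng.normal(size=d), target_id=42))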