Efficient Estimation of Word Representation in Vector Space

Topics
o Language Models in NLP
o Markov Models (n-gram model)
o Distributed Representation of words
o Motivation for word vector model of data
o Feedforward Neural Network Language Model (Feedforward NNLM)
o Recurrent Neural Network Language Model (Recurrent NNLM)
o Continuous Bag of Words model
o Continuous Skip-gram model
o Results
o References

$n$-gram model for NLP
o Traditional NLP models predict the next word given the previous $n-1$ words; this is known as the $n$-gram model.
o An $n$-gram model defines the probability of a word $w$ given the previous words $x_1, x_2, \dots, x_{n-1}$, using an $(n-1)$th-order Markov assumption.
o Mathematically, the parameter is $q(w \mid x_1, x_2, \dots, x_{n-1}) = \dfrac{\mathrm{count}(w, x_1, x_2, \dots, x_{n-1})}{\mathrm{count}(x_1, x_2, \dots, x_{n-1})}$, where $w, x_1, x_2, \dots, x_{n-1} \in V$ and $V$ is a vocabulary of fixed, finite size.
o The above model is based on maximum likelihood estimation.
o The probability of any sentence can be obtained by multiplying the $n$-gram probabilities of its words.
o Estimation can be done using linear interpolation or discounting methods.
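The count-ratio estimate on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function name `train_ngram_mle` and the toy corpus are assumptions, and no smoothing (interpolation or discounting) is applied.

```python
from collections import Counter

def train_ngram_mle(tokens, n):
    """Return q(word, history): the maximum likelihood n-gram estimate."""
    # Count every n-gram that occurs in the token sequence.
    ngram_counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    # Count each (n-1)-word history as the prefix of an observed n-gram,
    # so the estimates for a given history sum to 1.
    history_counts = Counter()
    for ngram, count in ngram_counts.items():
        history_counts[ngram[:-1]] += count

    def q(word, history):
        history = tuple(history)
        denominator = history_counts[history]
        if denominator == 0:
            return 0.0  # unseen history: the ratio is undefined, return 0 here
        return ngram_counts[history + (word,)] / denominator

    return q

# Hypothetical toy corpus, bigram model (n = 2).
corpus = "the cat sat on the mat the cat ate".split()
q = train_ngram_mle(corpus, n=2)
print(q("cat", ["the"]))  # "the cat" occurs 2 times out of 3 uses of "the"
print(q("dog", ["the"]))  # an unseen bigram gets probability 0 without smoothing
```

The zero returned for unseen word sequences is the reason smoothing methods are needed in practice.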

Drawbacks associated with $n$-gram models
o Curse of dimensionality: a large number of parameters must be learned even for a small vocabulary.
o An $n$-gram model works in a discrete space, so it is difficult to generalize its parameters. Generalization is easier when the model works in a continuous space.
o Simply scaling up $n$-gram models does not give the expected performance improvement for vocabularies with limited data.
o $n$-gram models do not perform well on word similarity tasks.

Distributed representation of words as vectors
o Associate with each word in the vocabulary a distributed word feature vector in $\mathbb{R}^m$, for example: genesis → $[0.537, 0.299, 0.098, \dots, 0.624]$ (a vector with $m$ components).
o A vocabulary $V$ of size $|V|$ will therefore have $|V| \times m$ free parameters, which need to be learned using some learning algorithm.
o These distributed feature vectors can be learned either in an unsupervised fashion as part of a pre-training procedure or in a supervised way.
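As a rough illustration of the lookup table and its parameter count, the sketch below builds one randomly initialised vector per word. The vocabulary, the dimension, and the helper name `build_lookup_table` are assumptions chosen for the example; real vectors would be learned, not left random.

```python
import random

def build_lookup_table(vocab, m, seed=0):
    """Map each word to a randomly initialised feature vector of length m."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-0.5, 0.5) for _ in range(m)] for word in vocab}

# Hypothetical vocabulary and vector dimension.
vocab = ["genesis", "king", "queen", "man", "woman"]
m = 4
table = build_lookup_table(vocab, m)

# One m-dimensional vector per word gives |V| x m free parameters.
free_parameters = len(table) * m
print(free_parameters)  # 20
```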

Why word vector model?
o This model is based on continuous real-valued variables, hence the probability distributions learned by generative models are smooth functions.
o Therefore, unlike with $n$-gram models, a sequence of words that is not present in the data corpus is not a big issue; generalization is better with this approach.
o Multiple degrees of similarity: similarity between words goes beyond basic syntactic and semantic regularities. For example:
  $\mathrm{vector}(\text{King}) - \mathrm{vector}(\text{Man}) + \mathrm{vector}(\text{Woman}) \approx \mathrm{vector}(\text{Queen})$
  $\mathrm{vector}(\text{Paris}) - \mathrm{vector}(\text{France}) + \mathrm{vector}(\text{Italy}) \approx \mathrm{vector}(\text{Rome})$
o It is easier to train vector models on unsupervised data.
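The vector arithmetic in the examples above can be tried on hand-made vectors. The sketch below is only an illustration: the two-dimensional toy vectors are invented so that the analogy holds exactly, and ranking candidates by cosine similarity is a common choice assumed here, not something the slides specify.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(vectors, a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, as is usual in analogy tests.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}
print(analogy(toy_vectors, "king", "man", "woman"))
```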

Learning distributed word vector representations
o Feedforward Neural Network Language Model: the joint probability distribution of word sequences is learned along with the word feature vectors using a feedforward neural network.
o Recurrent Neural Network Language Model: these NNLMs are based on recurrent neural networks.
o Continuous Bag of Words: it is based on a log-linear classifier, but the input is the average of past and future word vectors. In short, the goal is to predict the word surrounded by a given context.
o Continuous Skip-gram Model: it is also based on a log-linear classifier, but it tries to predict the past and future words surrounding a given word.

Feedforward Neural Network Language Model
o Initially proposed by Yoshua Bengio et al.
o It is related to the $n$-gram language model, as it aims to learn the probability function of word sequences of length $n$.
o The input is a concatenated feature vector of the words $w_{n-1}, w_{n-2}, \dots, w_2, w_1$, and the training criterion is to predict the word $w_n$.
o The output of the model gives the estimated probability of a given sequence of $n$ words.
o The neural network architecture consists of a projection layer, a hidden layer of neurons, an output layer, and a softmax function to evaluate the joint probability distribution of words.

Feedforward NNLM (figure: training corpus words $w_1, w_2, \dots, w_{n-2}, w_{n-1}$ → lookup table of word vectors → input projection layer (concatenated input vector) → hidden layer → output layer → softmax function → $P(w_n \mid w_{n-1}, \dots, w_2, w_1)$)
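The data flow of the feedforward NNLM (lookup, concatenation, hidden layer, softmax) can be sketched as a single forward pass. This is a simplified illustration under stated assumptions: the weights are random and untrained, bias terms are omitted, the `tanh` hidden activation is an assumption, and all sizes and names (`nnlm_forward`, `E`, `W_hidden`, `W_out`) are hypothetical.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def nnlm_forward(context_ids, E, W_hidden, W_out):
    """Return P(w_n | context) for every word in the vocabulary."""
    # Projection layer: concatenate the n-1 context word vectors.
    x = [value for word_id in context_ids for value in E[word_id]]
    # Hidden layer (tanh is an assumption; biases are omitted).
    h = [math.tanh(a) for a in matvec(W_hidden, x)]
    # Output layer followed by softmax over the vocabulary.
    return softmax(matvec(W_out, h))

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy sizes: vocabulary 6, vector size 3, hidden size 4, n = 3.
rng = random.Random(0)
vocab_size, m, hidden_size, n = 6, 3, 4, 3
E = random_matrix(rng, vocab_size, m)                  # lookup table
W_hidden = random_matrix(rng, hidden_size, (n - 1) * m)
W_out = random_matrix(rng, vocab_size, hidden_size)

probs = nnlm_forward([2, 5], E, W_hidden, W_out)       # two context word ids
print(len(probs), round(sum(probs), 6))
```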

Feedforward NNLM
o A fairly large model in terms of free parameters.
o The neural network has $(n-1) \times m \times H + H \times |V|$ parameters, where $m$ is the word vector size, $H$ is the hidden layer size, and $|V|$ is the vocabulary size.
o The training criterion is to predict the $n$th word.
o Training uses forward propagation and backpropagation with mini-batch gradient descent.
o The number of output units to evaluate can be reduced to about $\log_2 |V|$ using a hierarchical softmax layer, which significantly reduces the training time of the model.
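To get a feel for the size of the model, the parameter formula $(n-1) \times m \times H + H \times |V|$ can be evaluated for some example sizes. The numbers below are hypothetical and chosen only for illustration; the formula, as on the slide, does not count the word lookup table itself.

```python
import math

def nnlm_parameter_count(n, m, hidden_size, vocab_size):
    """Evaluate (n - 1) * m * H + H * |V| for the feedforward NNLM."""
    projection_to_hidden = (n - 1) * m * hidden_size
    hidden_to_output = hidden_size * vocab_size
    return projection_to_hidden + hidden_to_output

def hierarchical_softmax_evaluations(vocab_size):
    """Approximate number of output evaluations with a balanced binary tree."""
    return math.ceil(math.log2(vocab_size))

# Hypothetical sizes: 5-gram context, 100-d vectors, 500 hidden units, 10,000 words.
total = nnlm_parameter_count(n=5, m=100, hidden_size=500, vocab_size=10_000)
print(total)                                     # 5200000
print(hierarchical_softmax_evaluations(10_000))  # 14, instead of 10000 outputs
```

With these sizes the hidden-to-output term dominates, which is why reducing the output computation has such a large effect on training time.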

Recurrent Neural Network Language Model
o Initially implemented by Tomas Mikolov, but probably inspired by Yoshua Bengio's seminal work on NNLM.
o Uses a recurrent neural network, where the input layer consists of the current word vector and the hidden neuron values from the previous word.
o The training objective is to predict the current word.
o Contrary to the feedforward NNLM, it keeps building a kind of history of the previous words seen by the model. Therefore the context window of analysis is variable here.

Recurrent NNLM (figure: training corpus word $w_t$ → lookup table of word vectors → input; input and hidden $\mathrm{context}(t)$ → hidden layer → output layer → softmax function)
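One step of the recurrence can be sketched as follows: the hidden state is computed from the current word vector and the previous hidden state, and is then fed to the output softmax. This is a minimal illustration with assumed details: the sigmoid activation, the absence of biases, the weight names `U`, `W`, `V_out`, and all numeric values are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rnn_step(word_vec, h_prev, U, W, V_out):
    """One time step: new hidden state and output distribution."""
    # Hidden state mixes the current word vector with the previous hidden state.
    h = [
        1.0 / (1.0 + math.exp(-(dot(U[i], word_vec) + dot(W[i], h_prev))))
        for i in range(len(h_prev))
    ]
    probs = softmax([dot(row, h) for row in V_out])
    return h, probs

# Hypothetical toy weights: 2-d word vectors, 2 hidden units, 3-word vocabulary.
U = [[0.1, 0.2], [0.3, 0.4]]                    # input -> hidden
W = [[0.0, 0.1], [0.1, 0.0]]                    # previous hidden -> hidden
V_out = [[0.1, 0.2], [0.0, 0.1], [0.3, -0.1]]   # hidden -> output

x = [1.0, 0.5]
h0 = [0.0, 0.0]
h1, p1 = rnn_step(x, h0, U, W, V_out)
h2, p2 = rnn_step(x, h1, U, W, V_out)  # same input word, different history
print(h1 != h2)  # the hidden state depends on what came before
```

Feeding the same word twice yields different hidden states, which is the sense in which the context window is variable.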

Recurrent NNLM
o Requires fewer hidden units than the feedforward NNLM, though the number may have to be increased as the vocabulary size grows.
o Stochastic gradient descent is used along with the backpropagation algorithm to train the model over several epochs.
o The number of output units to evaluate can be reduced to about $\log_2 |V|$ using a hierarchical softmax layer.
o Recurrent NNLM models achieve as much as a twofold reduction in perplexity compared to $n$-gram models.
o In practice, recurrent NNLM models are much faster to train than feedforward NNLM models.

Continuous Bag of Words
o It is similar to the feedforward NNLM with no hidden layer. This model consists only of an input and an output layer.
o In this model, words from the past and the future of the sequence are the input, and the model is trained to predict the current word.
o Owing to its simplicity, this model can be trained on a huge amount of data in a short time compared to other neural network models.
o This model estimates the current word given its context or sentence.

Continuous Bag of Words (figure: training corpus context words $w_{n-4}, \dots, w_{n-1}$ and $w_{n+1}, \dots, w_{n+4}$ → lookup table of word vectors → input projection layer (average of input vectors) → output layer with softmax function → $w_n$)
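The Continuous Bag of Words forward pass (average the context word vectors, then apply the output layer and softmax) can be sketched as below. This is an illustration only: the toy lookup table `E`, the output weights `W_out`, and the function name `cbow_forward` are assumptions, and no training step is shown.

```python
import math

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_ids, E, W_out):
    """Average the context word vectors, then score every vocabulary word."""
    m = len(E[0])
    # Projection layer: element-wise average, so word order does not matter.
    average = [
        sum(E[word_id][d] for word_id in context_ids) / len(context_ids)
        for d in range(m)
    ]
    scores = [sum(w * a for w, a in zip(row, average)) for row in W_out]
    return average, softmax(scores)

# Hypothetical toy lookup table (4 words, 2-d vectors) and output weights.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
W_out = [[0.5, -0.5], [0.2, 0.1], [-0.3, 0.4], [0.0, 0.0]]

average, probs = cbow_forward([0, 1], E, W_out)
print(average)  # [0.5, 0.5]
```

Because the input is an average, swapping the order of the context words gives the same prediction, which is where the "bag of words" name comes from.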

Continuous Skip-gram Model
o This model is similar to the continuous bag of words model; the roles of input and output are simply reversed.
o Here the model attempts to predict the words around the current word.
o The input layer consists of the word vector of a single word, while multiple output layers are connected to the input layer.

Continuous Skip-gram Model (figure: training corpus word $w_n$ → lookup table of word vectors → input projection layer (single word vector) → output layer with one softmax per context word $w_{n-4}, \dots, w_{n+1}, \dots, w_{n+4}$)
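One way to picture the skip-gram setup is to list the (current word, context word) training pairs it produces. The sketch below assumes a fixed symmetric window; the function name `skipgram_pairs` and the example sentence are hypothetical.

```python
def skipgram_pairs(tokens, window):
    """List (current word, context word) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # the current word is the input, never its own target
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical example with a window of one word on each side.
print(skipgram_pairs(["a", "b", "c"], window=1))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```

Each pair is one prediction task: given the first word as input, predict the second.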

Analyzing language models
o Perplexity: a measurement of how well a language model fits the underlying probability distribution of the data.
o Word error rate: the percentage of words misrecognized by the language model.
o Semantic analysis: deriving semantic analogies of word pairs, filling in a sentence with the most logical word choice, etc. These kinds of tests are especially used for measuring the performance of word vectors. For example: Berlin : Germany :: Toronto : Canada
o Syntactic analysis: for a language model, this might be the construction of a syntactically correct parse tree; for testing word vectors, one might look at predicting syntactic analogies such as: possibly : impossibly :: ethical : unethical
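Perplexity is commonly computed as 2 raised to the negative average log2 probability that the model assigns to each word of a test text; lower values mean a better fit. The sketch below uses that standard definition, with made-up per-word probabilities as input.

```python
import math

def perplexity(word_probabilities):
    """2 ** (negative mean log2 probability) over the words of a test text."""
    log_sum = sum(math.log2(p) for p in word_probabilities)
    return 2 ** (-log_sum / len(word_probabilities))

# Hypothetical model that assigns probability 1/4 to every test word:
# it is as uncertain as a uniform choice among 4 words.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# A model that is more confident about the same words has lower perplexity.
print(perplexity([0.5, 0.5, 0.25, 0.25]))
```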

Perplexity Comparison: perplexity of different models tested on the Brown Corpus

Perplexity Comparison: perplexity of different models on the Penn Treebank

Sentence Completion Task WSJ Kaldi Rescoring

Semantic Syntactic Tests

Results

Results
o Different models with 640-dimensional word vectors
o Training time comparison of different models

Microsoft Research Sentence Completion Challenge

Complex Learned Relationships
