OPTIMIZATION OF SKIP-GRAM MODEL Chenxi Wu Final Presentation for STA 790
Word Embedding ■ Map words to vectors of real numbers ■ The earliest word representation is the "one-hot" representation
Word Embedding ■ Distributed representation
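As an illustration that is not part of the original slides, a minimal Python sketch of the difference between the two representations; the vocabulary and the embedding dimension are made up.

import numpy as np

vocab = ["king", "queen", "apple"]

# One-hot representation: a vector as long as the vocabulary with a single 1.
one_hot_king = np.zeros(len(vocab))
one_hot_king[vocab.index("king")] = 1.0

# Distributed representation: a short, dense vector of real numbers, learned so
# that related words end up close to each other (random here, for illustration).
embedding = {w: np.random.randn(5) for w in vocab}

print(one_hot_king)        # [1. 0. 0.]
print(embedding["king"])   # e.g. [ 0.12 -1.03  0.55  0.87 -0.44]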
Word2Vec ■ An unsupervised NLP method developed by Google in 2013 ■ Quantify the relationships between words ■ Two model architectures, Skip-gram and CBOW, each trainable with either Hierarchical Softmax or Negative Sampling
Skip-gram ■ Input: the vector representation of a specific word ■ Output: the vectors of the context words surrounding this word
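A minimal sketch, not taken from the repository linked at the end, of how Skip-gram training pairs could be generated from a tokenized sentence; the window size and function name are illustrative assumptions.

def skipgram_pairs(tokens, window=2):
    # For each center word, pair it with every word inside the context window.
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox"], window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ...]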
DNN (Deep Neural Network)
Huffman Tree ■ Leaf nodes denote all words in the vocabulary ■ The leaf nodes act as neurons in the output layer, and the internal nodes act as hidden neurons ■ Input: n weights f1, f2, ..., fn (the frequency of each word in the corpus) ■ Output: the corresponding Huffman tree ■ Benefit: common words have shorter Huffman codes
■ (1) Treat f1, f2, ..., fn as a forest with n trees (each tree has only one node); ■ (2) In the forest, select the two trees with the smallest weights and merge them as the left and right subtrees of a new tree; the weight of the root of this new tree is the sum of the weights of its left and right children; ■ (3) Delete the two selected trees from the forest and add the new tree to the forest; ■ (4) Repeat steps (2) and (3) until only one tree is left in the forest (see the sketch below)
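A minimal Python sketch of the merge procedure above, assuming word frequencies are given as a dict; the nested-tuple tree representation is an illustrative choice, not the one used in the actual implementation.

import heapq
import itertools

def build_huffman(freqs):
    # freqs: dict mapping word -> frequency in the corpus
    tie = itertools.count()  # tie-breaker so heapq never compares subtrees
    forest = [(f, next(tie), w) for w, f in freqs.items()]
    heapq.heapify(forest)
    while len(forest) > 1:
        f1, _, left = heapq.heappop(forest)    # two smallest-weight trees
        f2, _, right = heapq.heappop(forest)
        heapq.heappush(forest, (f1 + f2, next(tie), (left, right)))  # merged tree
    return forest[0][2]  # root of the single remaining tree

print(build_huffman({"the": 50, "model": 10, "softmax": 3, "huffman": 2}))
# ((('huffman', 'softmax'), 'model'), 'the')  -- rarer words sit deeper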
Hierarchical Softmax
HS Details
HS Details ■ Use the sigmoid function at each internal node to decide whether to go left (+) or right (−) ■ In the example above, w is "hierarchical"
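To make the left/right decision concrete, here is a small sketch with an assumed convention (a Huffman code bit of 0 is taken to mean the "+" branch) of how the probability of reaching a target word is accumulated along its Huffman path.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def hs_probability(x, path_vectors, code):
    # x: input word vector; path_vectors: vectors theta of the internal nodes on
    # the target word's Huffman path; code: the target word's Huffman code bits.
    # At each node, sigmoid(x . theta) is the probability of the "+" branch and
    # 1 - sigmoid(x . theta) the probability of the "-" branch.
    p = 1.0
    for theta, d in zip(path_vectors, code):
        s = sigmoid(x @ theta)
        p *= s if d == 0 else 1.0 - s
    return p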
HS Target Function
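The target function itself did not survive extraction from this slide. In the usual notation, which is assumed here (x_w the input word vector, l^w the length of the target word's Huffman path, d_j^w its j-th code bit, \theta_{j-1}^w the vector of the j-th internal node on the path, \sigma the sigmoid), it reads

P(w \mid x_w) = \prod_{j=2}^{l^w} \big[\sigma(x_w^\top \theta_{j-1}^w)\big]^{1-d_j^w} \big[1-\sigma(x_w^\top \theta_{j-1}^w)\big]^{d_j^w},

and the quantity maximized over the corpus C is the log-likelihood

\mathcal{L} = \sum_{w \in C} \sum_{j=2}^{l^w} \Big\{(1-d_j^w)\log\sigma(x_w^\top\theta_{j-1}^w) + d_j^w \log\big(1-\sigma(x_w^\top\theta_{j-1}^w)\big)\Big\}.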
HS Gradient
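The gradient formulas are likewise missing from the extracted text. Differentiating one term \mathcal{L}_{w,j} of the log-likelihood above gives, under the same assumed notation,

\frac{\partial \mathcal{L}_{w,j}}{\partial \theta_{j-1}^w} = \big(1 - d_j^w - \sigma(x_w^\top\theta_{j-1}^w)\big)\,x_w, \qquad \frac{\partial \mathcal{L}_{w,j}}{\partial x_w} = \big(1 - d_j^w - \sigma(x_w^\top\theta_{j-1}^w)\big)\,\theta_{j-1}^w,

so each SGD step moves \theta_{j-1}^w and x_w in these directions, scaled by the learning rate \eta.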
Negative Sampling ■ Alternative method for training the Skip-gram model ■ Subsample frequent words to decrease the number of training examples ■ Let each training sample update only a small percentage of the model's weights
Negative sample ■ Randomly select one word u from the words surrounding w, so u and w compose one "positive sample" ■ For a negative sample, keep this same u and randomly choose a word from the dictionary that is not w
Sampling method ■ The unigram distribution is used to select negative words ■ The probability of a word being selected as a negative sample is related to its frequency of occurrence: the more frequent the word, the more likely it is to be chosen as a negative word
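A minimal Python sketch of sampling negatives from the unigram distribution; the 3/4 smoothing power is the value used in the original word2vec paper, and the function names here are illustrative, not taken from the linked repository.

import random

def make_negative_sampler(freqs, power=0.75):
    # Selection probability is proportional to the (smoothed) corpus frequency,
    # so frequent words are drawn as negatives more often.
    words = list(freqs)
    weights = [freqs[w] ** power for w in words]
    return lambda k: random.choices(words, weights=weights, k=k)

sample_negatives = make_negative_sampler({"the": 50, "model": 10, "softmax": 3, "huffman": 2})
print(sample_negatives(5))  # e.g. ['the', 'the', 'model', 'the', 'softmax']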
NS Details ■ Still use the sigmoid function to train the model ■ Suppose that through negative sampling we get neg negative samples (context(w), w_i), i = 1, 2, …, neg, so each training sample is (context(w), w, w_1, …, w_neg) ■ We expect our positive sample to satisfy: ■ Expect negative samples to satisfy:
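The two conditions referred to above are not reproduced in the extracted slide. Writing x for the input vector built from context(w) and \theta^u for the auxiliary vector of a target word u (notation assumed), they are usually stated as

\sigma(x^\top \theta^{w}) \approx 1 \quad \text{for the positive sample}, \qquad \sigma(x^\top \theta^{w_i}) \approx 0, \; i = 1, \dots, neg \quad \text{for the negative samples}.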
NS Details ■ Want to maximize the following log-likelihood: ■ Similarly, compute gradient to update parameters.
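The log-likelihood itself is missing from the extraction; with the same assumed notation it is commonly written

\mathcal{L} = \log\sigma(x^\top\theta^{w}) + \sum_{i=1}^{neg} \log\big(1-\sigma(x^\top\theta^{w_i})\big),

whose gradient with respect to \theta^{u} has the same form as in hierarchical softmax, \big(L^{u} - \sigma(x^\top\theta^{u})\big)\,x, where the label L^{u} is 1 for the positive word w and 0 for each negative word w_i.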
Reference Mikolov et al., 2013, Distributed Representations of Words and Phrases and their Compositionality. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality The code for this implementation can be found on my GitHub repo: https://github.com/cassie1102