  1. SFU NatLangLab, CMPT 825: Natural Language Processing. Word Embeddings - Word2Vec, Fall 2020 (2020-09-30). Adapted from slides from Dan Jurafsky, Chris Manning, Danqi Chen and Karthik Narasimhan.

  2. Announcements • Homework 1 due today • Both parts are due • Programming component has 2 grace days, but something must be turned in by tonight • Single-person groups: highly encouraged to team up with each other • Video Lectures • Summary of logistic regression (optional) • Word vectors (required) - covers PPMI • Word vectors TF-IDF (required, not yet posted) - covers TF-IDF • Word vectors Summary (optional, not yet posted) - using SVD to get dense word vectors, and connections to word2vec • TA video summarizing key points about word vectors

  3. Representing words by their context • Distributional hypothesis: words that occur in similar contexts tend to have similar meanings • J.R. Firth (1957): "You shall know a word by the company it keeps" • One of the most successful ideas of modern statistical NLP! • These context words will represent banking.

  4. Word Vectors • One-hot vectors: hotel = [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0], motel = [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0] • Represent words by their context: a |V| × |V| word-word (term-context) co-occurrence matrix, counting the other words in the span around the target word • [Figure: example context windows around the target words apricot, pineapple, digital, and information]
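
A minimal sketch (not from the slides) of how such a word-word co-occurrence matrix could be built with a symmetric context window; the toy corpus, window size, and function name are illustrative assumptions.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Build a |V| x |V| word-word co-occurrence count matrix from tokenized sentences."""
    vocab = sorted({w for sent in sentences for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        for i, target in enumerate(sent):
            # count every word within `window` positions to the left or right of the target
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[idx[target], idx[sent[j]]] += 1
    return counts, vocab

# toy corpus, echoing the apricot/pineapple examples on the slide
sents = [["a", "tablespoonful", "of", "apricot", "jam"],
         ["she", "sampled", "her", "first", "pineapple"]]
C, vocab = cooccurrence_matrix(sents, window=2)
```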

  5. Sparse vs dense vectors • Vectors we get from the word-word (term-context) co-occurrence matrix are long (length |V| = 20,000 to 50,000) and sparse (most elements are zero) • True for one-hot, tf-idf, and PPMI vectors • Alternative: we want to represent words as short (50-300 dimensional), dense (real-valued) vectors • This is the focus of this lecture and the basis of all modern NLP systems
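
For reference, the PPMI weighting mentioned above can be computed from raw co-occurrence counts as below. This sketch follows the standard PPMI definition, PPMI(w, c) = max(0, log2 P(w, c) / (P(w) P(c))), rather than any code shown in the lecture; the resulting vectors are still |V|-dimensional and sparse.

```python
import numpy as np

def ppmi(C, eps=1e-12):
    """Positive PMI weighting of a co-occurrence count matrix C (rows = targets, cols = contexts)."""
    total = C.sum()
    p_wc = C / total                          # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)     # target marginals P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)     # context marginals P(c)
    pmi = np.log2((p_wc + eps) / (p_w @ p_c + eps))
    return np.maximum(pmi, 0.0)               # clip negative PMI values to zero
```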

  6. Dense vectors • Example: employees = [0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271, 0.487] • short + dense

  7. Why dense vectors? • Short vectors are easier to use as features in ML systems • Dense vectors may generalize better than storing explicit counts • They do better at capturing synonymy • e.g., w1 co-occurs with "car", w2 co-occurs with "automobile" • Different methods for getting dense vectors: • Singular value decomposition (SVD) • word2vec and friends: "learn" the vectors!
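
As a rough illustration of the SVD route (not the slides' own code), a truncated SVD of the co-occurrence or PPMI matrix keeps the top k dimensions as short dense word vectors; the choice k = 50 and the scaling by singular values are assumptions, one common variant among several.

```python
import numpy as np

def svd_embeddings(C, k=50):
    """Truncated SVD of a |V| x |V| co-occurrence (or PPMI) matrix.

    Returns one k-dimensional dense vector per vocabulary word.
    """
    # U: |V| x |V|, s: singular values in decreasing order
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    # keep the top-k dimensions; scaling columns by the singular values is one common choice
    return U[:, :k] * s[:k]

# dense = svd_embeddings(C, k=50)   # C from the co-occurrence sketch above
```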

  8. Word2vec and friends

  9. Download pretrained word embeddings • Word2vec (Mikolov et al.): https://code.google.com/archive/p/word2vec/ • fastText: http://www.fasttext.cc/ • GloVe (Pennington, Socher, Manning): http://nlp.stanford.edu/projects/glove/
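
One common way to use such downloads in Python is gensim's KeyedVectors loader; this is a sketch, and the file name below is an assumption standing in for whichever pretrained file you download.

```python
from gensim.models import KeyedVectors

# load a pretrained word2vec binary (file name assumed; substitute your download)
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

print(wv["apricot"].shape)                  # a 300-dimensional dense vector
print(wv.most_similar("apricot", topn=5))   # nearest neighbours by cosine similarity
```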

  10. Word2Vec • Popular embedding method • Very fast to train • Idea: predict rather than count • (Mikolov et al., 2013): Distributed Representations of Words and Phrases and their Compositionality

  11. Word2Vec • Instead of counting how often each word w occurs near "apricot" • Train a classifier on a binary prediction task: • Is w likely to show up near "apricot"? • We don't actually care about this task • But we'll take the learned classifier weights as the word embeddings
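
A minimal sketch of the kind of binary classifier involved, in the spirit of skip-gram with negative sampling rather than the slides' exact formulation: the probability that context word c shows up near target word w is modeled as the sigmoid of the dot product of their embeddings. The vocabulary size, dimensionality, and random initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 100                                 # vocabulary size and embedding dimension (assumed)
target_emb = rng.normal(scale=0.1, size=(V, d))    # embeddings used when a word is the target
context_emb = rng.normal(scale=0.1, size=(V, d))   # embeddings used when a word is the context

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_near(target_id, context_id):
    """P(context word appears near target word) = sigma(c . w)."""
    return sigmoid(context_emb[context_id] @ target_emb[target_id])
```

Training pushes this probability toward 1 for observed (target, context) pairs and toward 0 for sampled negatives; the learned target_emb rows are then used as the word embeddings.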

  12. Word2Vec • Insight: use running text as implicitly supervised training data! • A word s that occurs near apricot acts as the gold "correct answer" to the question "Is word w likely to show up near apricot?" • No need for hand-labeled supervision • The idea comes from neural language modeling: Bengio et al. (2003), Collobert et al. (2011) • (Bengio et al., 2003): A Neural Probabilistic Language Model
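
To make the implicit supervision concrete, here is a small sketch (an assumed setup, not the slides' code) that turns running text into positive (target, context) pairs from a context window plus randomly sampled negative pairs.

```python
import random

def training_pairs(tokens, window=2, negatives_per_positive=2, seed=0):
    """Yield (target, context, label) triples from running text.

    Positives come from the context window; negatives are random vocabulary
    words (a simplification of word2vec's frequency-based negative sampling).
    """
    rng = random.Random(seed)
    vocab = list(set(tokens))
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            yield target, tokens[j], 1                 # observed pair: label 1
            for _ in range(negatives_per_positive):
                yield target, rng.choice(vocab), 0     # sampled negative: label 0

pairs = list(training_pairs("a tablespoonful of apricot jam".split()))
```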
