CS7015 (Deep Learning) : Lecture 10 Learning Vectorial Representations Of Words
Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras


1. CS7015 (Deep Learning) : Lecture 10 Learning Vectorial Representations Of Words. Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras.

2. Acknowledgments: 'word2vec Parameter Learning Explained' by Xin Rong; 'word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method' by Yoav Goldberg and Omer Levy; Sebastian Ruder's blogs on word embeddings (Blog1, Blog2, Blog3).

3. Module 10.1: One-hot representations of words

4. Let us start with a very simple motivation for why we are interested in vectorial representations of words. Suppose we are given an input stream of words (a sentence, document, etc.) and we are interested in learning some function of it (say, ŷ = sentiments(words)). Say, we employ a machine learning algorithm (some mathematical model) for learning such a function (ŷ = f(x)). We first need a way of converting the input stream (or each word in the stream) to a vector x (a mathematical quantity). (Figure: the review 'This is by far AAMIR KHAN's best one. Finest casting and terrific acting by all.' is mapped to a vector such as [5.7, 1.2, 2.3, -10.2, 4.5, ..., 11.9, 20.1, -0.5, 40.7] and then fed to the model.)

5. Given a corpus, consider the set V of all unique words across all input streams (i.e., all sentences or documents). V is called the vocabulary of the corpus. We need a representation for every word in V. One very simple way of doing this is to use one-hot vectors of size |V|. The representation of the i-th word will have a 1 in the i-th position and a 0 in the remaining |V| - 1 positions.
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time
V = [human, machine, interface, for, computer, applications, user, opinion, of, system, response, time, management, engineering, improved]
machine: 0 1 0 ... 0 0 0
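
To make the one-hot idea concrete, here is a minimal Python sketch (not part of the slides; the function name and vocabulary ordering are ours) that builds such a vector for a word in the toy vocabulary above.

```python
import numpy as np

# Toy vocabulary from the slide (the ordering here is only for illustration).
V = ["human", "machine", "interface", "for", "computer", "applications",
     "user", "opinion", "of", "system", "response", "time",
     "management", "engineering", "improved"]

def one_hot(word, vocab):
    """Return a |V|-dimensional vector with a 1 at the word's index and 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("machine", V))   # 1 in the second position, 0 in the remaining |V| - 1 positions
```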

6. Problems: V tends to be very large (for example, 50K for PTB, 13M for the Google 1T corpus). These representations do not capture any notion of similarity. Ideally, we would want the representations of cat and dog (both domestic animals) to be closer to each other than the representations of cat and truck. However, with 1-hot representations, the Euclidean distance between any two words in the vocabulary is √2 and the cosine similarity between any two words in the vocabulary is 0.
cat:   0 0 0 0 0 1 0
dog:   0 1 0 0 0 0 0
truck: 0 0 0 1 0 0 0
euclid_dist(cat, dog) = √2        euclid_dist(dog, truck) = √2
cosine_sim(cat, dog) = 0          cosine_sim(dog, truck) = 0
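
The distances and similarities quoted above are easy to verify numerically. Below is a small check (again not from the slides) using the same 7-dimensional one-hot vectors for cat, dog and truck.

```python
import numpy as np

def one_hot(i, size):
    v = np.zeros(size)
    v[i] = 1.0
    return v

# Indices chosen to match the 7-dimensional vectors shown on the slide.
cat, dog, truck = one_hot(5, 7), one_hot(1, 7), one_hot(3, 7)

def euclid_dist(a, b):
    return np.linalg.norm(a - b)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclid_dist(cat, dog), euclid_dist(dog, truck))   # both sqrt(2) ~ 1.414
print(cosine_sim(cat, dog), cosine_sim(dog, truck))     # both 0.0
```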

7. Module 10.2: Distributed Representations of words

8. "You shall know a word by the company it keeps" - Firth, J. R. 1957:11. This motivates distributional-similarity-based representations and leads us to the idea of a co-occurrence matrix. For example, in "A bank is a financial institution that accepts deposits from the public and creates credit.", the idea is to use the accompanying words (financial, deposits, credit) to represent bank.

9. A co-occurrence matrix is a terms × terms matrix which captures the number of times a term appears in the context of another term. The context is defined as a window of k words around the terms. This is also known as a word × context matrix. You could choose the set of words and contexts to be the same or different. Each row (column) of the co-occurrence matrix gives a vectorial representation of the corresponding word (context). Let us build a co-occurrence matrix for this toy corpus with k = 2.
Corpus:
Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time
Co-occurrence matrix:
          human  machine  system  for  ...  user
human       0       1       0      1   ...   0
machine     1       0       0      1   ...   0
system      0       0       0      1   ...   2
for         1       1       1      0   ...   0
...
user        0       0       2      0   ...   0
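
The slides do not give code for this step, but a rough sketch of building such a word × context count matrix with a symmetric window of k = 2 could look as follows (the exact counts depend on details such as how sentence boundaries are treated).

```python
from collections import defaultdict

# Toy corpus from the slide, lower-cased so that "Human" and "human" match.
corpus = [
    "Human machine interface for computer applications",
    "User opinion of computer system response time",
    "User interface management system",
    "System engineering for improved response time",
]
k = 2  # window size

counts = defaultdict(int)
for sentence in corpus:
    words = sentence.lower().split()
    for i, w in enumerate(words):
        # every word within k positions of w (in the same sentence) is a context of w
        for j in range(max(0, i - k), min(len(words), i + k + 1)):
            if j != i:
                counts[(w, words[j])] += 1

vocab = sorted({w for s in corpus for w in s.lower().split()})
X = [[counts[(w, c)] for c in vocab] for w in vocab]  # word x context count matrix
```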

10. Some (fixable) problems: Stop words (a, the, for, etc.) are very frequent → these counts will be very high.
          human  machine  system  for  ...  user
human       0       1       0      1   ...   0
machine     1       0       0      1   ...   0
system      0       0       0      1   ...   2
for         1       1       1      0   ...   0
...
user        0       0       2      0   ...   0

11. Some (fixable) problems. Solution 1: Ignore very frequent words.
          human  machine  system  ...  user
human       0       1       0     ...   0
machine     1       0       0     ...   0
system      0       0       0     ...   2
...
user        0       0       2     ...   0

12. Some (fixable) problems. Solution 2: Use a threshold t (say, t = 100):
X_ij = min(count(w_i, c_j), t), where w is a word and c is a context.
          human  machine  system  for  ...  user
human       0       1       0      x   ...   0
machine     1       0       0      x   ...   0
system      0       0       0      x   ...   2
for         x       x       x      x   ...   x
...
user        0       0       2      x   ...   0
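
A one-line sketch of this capping step, assuming the raw counts are already in a NumPy array (the numbers below are made up):

```python
import numpy as np

t = 100
X = np.array([[  0, 250,   3],
              [250,   0,   7],
              [  3,   7,   0]])      # made-up co-occurrence counts
X_capped = np.minimum(X, t)          # X_ij = min(count(w_i, c_j), t)
print(X_capped)
```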

13. Some (fixable) problems. Solution 3: Instead of count(w, c), use PMI(w, c):
PMI(w, c) = log( p(c|w) / p(c) ) = log( (count(w, c) * N) / (count(c) * count(w)) )
where N is the total number of words.
If count(w, c) = 0, then PMI(w, c) = −∞. Instead use
PMI_0(w, c) = PMI(w, c) if count(w, c) > 0, and 0 otherwise
or
PPMI(w, c) = PMI(w, c) if PMI(w, c) > 0, and 0 otherwise.
          human  machine  system   for   ...  user
human       0     2.944     0      2.25  ...   0
machine   2.944     0       0      2.25  ...   0
system      0       0       0      1.15  ...  1.84
for        2.25    2.25    1.15     0    ...   0
...
user        0       0      1.84     0    ...   0
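
The PMI/PPMI computation can be sketched as follows. This is our own illustration, not the lecture's code: the tiny count matrix is made up, and the normaliser N is taken to be the sum of all counts, standing in for the slide's "total number of words".

```python
import numpy as np

X = np.array([[0., 4., 1.],
              [4., 0., 2.],
              [1., 2., 0.]])            # made-up count(w, c) matrix

N = X.sum()                             # stands in for N in the slide's formula
count_w = X.sum(axis=1, keepdims=True)  # count(w): row sums
count_c = X.sum(axis=0, keepdims=True)  # count(c): column sums

with np.errstate(divide="ignore"):      # log(0) = -inf where count(w, c) = 0
    pmi = np.log(X * N / (count_w * count_c))

ppmi = np.where(pmi > 0, pmi, 0.0)      # PPMI: keep only the positive PMI values
print(np.round(ppmi, 3))
```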

14. Some (severe) problems: these representations are very high dimensional (|V|), very sparse, and they grow with the size of the vocabulary. Solution: Use dimensionality reduction (SVD).
          human  machine  system   for   ...  user
human       0     2.944     0      2.25  ...   0
machine   2.944     0       0      2.25  ...   0
system      0       0       0      1.15  ...  1.84
for        2.25    2.25    1.15     0    ...   0
...
user        0       0      1.84     0    ...   0

15. Module 10.3: SVD for learning word representations

16. Singular Value Decomposition gives a rank-k approximation of the original matrix:
X = X_PPMI (m × n) = U (m × k) Σ (k × k) V^T (k × n)
where the columns of U are u_1, ..., u_k, the rows of V^T are v_1^T, ..., v_k^T, and Σ is a diagonal matrix with the singular values σ_1, ..., σ_k. X_PPMI (simplifying notation to X) is the co-occurrence matrix with PPMI values. SVD gives the best rank-k approximation of the original data (X) and discovers latent semantics in the corpus (let us examine this with the help of an example).
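
A minimal sketch (our illustration, not the lecture's code) of obtaining k-dimensional word representations from the PPMI matrix via a truncated SVD; the random placeholder matrix stands in for a real PPMI matrix such as the one above.

```python
import numpy as np

rng = np.random.default_rng(0)
ppmi = rng.random((15, 15))    # placeholder for the m x n word-context PPMI matrix
k = 2

U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
word_vectors = U[:, :k] * S[:k]                 # each row: a k-dimensional word representation
X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]     # best rank-k approximation of the PPMI matrix
```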

17. Notice that the product can be written as a sum of k rank-1 matrices:
X = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + ... + σ_k u_k v_k^T
Each σ_i u_i v_i^T ∈ R^(m × n) because it is the product of an m × 1 vector with a 1 × n vector. If we truncate the sum at σ_1 u_1 v_1^T, then we get the best rank-1 approximation of X (by the SVD theorem! But what does this mean? We will see on the next slide). If we truncate the sum at σ_1 u_1 v_1^T + σ_2 u_2 v_2^T, then we get the best rank-2 approximation of X, and so on.

18. What do we mean by approximation here? Notice that X has m × n entries. When we use the rank-1 approximation, we are using only m + n + 1 entries to reconstruct it [u ∈ R^m, v ∈ R^n, σ ∈ R]. But the SVD theorem tells us that u_1, v_1 and σ_1 store the most information in X (akin to the principal components in X). Each subsequent term (σ_2 u_2 v_2^T, σ_3 u_3 v_3^T, ...) stores less and less important information.
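
A small numerical check (not from the slides) that adding the rank-1 terms σ_i u_i v_i^T one at a time steadily reduces the reconstruction error, which is why the leading terms, each stored with only m + n + 1 numbers, carry most of the information.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))                  # stand-in for the m x n PPMI matrix

U, S, Vt = np.linalg.svd(X, full_matrices=False)

X_hat = np.zeros_like(X)
for i in range(len(S)):
    X_hat += S[i] * np.outer(U[:, i], Vt[i, :])   # add the rank-1 term sigma_i u_i v_i^T
    err = np.linalg.norm(X - X_hat)               # Frobenius norm of the residual
    print(f"rank {i + 1}: reconstruction error {err:.4f}")   # shrinks as the rank grows
```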
