Analogies Explained: Towards Understanding Word Embeddings
Carl Allen, Tim Hospedales
School of Informatics, University of Edinburgh
June 13, 2019
The Problem: linking semantics to geometry

from: "man is to king as woman is to queen"

explain: why the corresponding word embeddings satisfy
$$ w_{king} - w_{man} + w_{woman} \approx w_{queen} $$

[Figure: 2-D plot of word embeddings (king, queen, man, woman, prince, princess, royal, crown, reign, lord, ...); adding the offset $w_{K} - w_{M}$ to $w_{woman}$ lands near $w_{queen}$.]
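As a concrete illustration of the relation to be explained, a minimal sketch of the standard analogy test using the gensim library and its pre-packaged GloVe vectors (the library and model name are an assumption for illustration; any pre-trained embeddings would do):

import gensim.downloader as api

# Small pre-trained GloVe vectors packaged with gensim.
model = api.load("glove-wiki-gigaword-50")

# "man is to king as woman is to ?": nearest neighbours of w_king - w_man + w_woman,
# excluding the three query words themselves.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is typically the top result.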
Word2Vec: SkipGram with Negative Sampling (Mikolov et al., 2013a,b)

[Figure: skip-gram architecture; target words $w_1, \dots, w_n$ (vocabulary $\mathcal{E}$) with embedding matrix $W$, context words $c_1, \dots, c_n$ (vocabulary $\mathcal{E}$) with embedding matrix $C$.]

• Computing $p(c_j \mid w_i)$ by a softmax over all words ($\mathcal{E}$) is expensive.
• Instead, use a sigmoid with negative sampling ($k$ negative samples per positive pair).
• Levy and Goldberg (2014): at the loss optimum,
$$ w_i^\top c_j \approx \log \frac{p(w_i, c_j)}{p(w_i)\, p(c_j)} - \log k = \mathrm{PMI}(w_i, c_j) - \log k, $$
i.e. the embeddings factorise the shifted PMI matrix: $W^\top C \approx \mathrm{PMI} - \log k$.
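For concreteness, a minimal numpy sketch of the matrix being factorised: constructing the shifted PMI matrix from a word–context co-occurrence count matrix (the count matrix is a hypothetical input, accumulated in practice from a corpus with a sliding window):

import numpy as np

def shifted_pmi(counts, k=5, eps=1e-12):
    # counts[i, j] = number of times context word c_j occurs near target word w_i
    total = counts.sum()
    p_wc = counts / total                    # joint p(w_i, c_j)
    p_w = p_wc.sum(axis=1, keepdims=True)    # marginal p(w_i)
    p_c = p_wc.sum(axis=0, keepdims=True)    # marginal p(c_j)
    pmi = np.log((p_wc + eps) / (p_w @ p_c + eps))
    return pmi - np.log(k)                   # SGNS target: W^T C ≈ PMI - log k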
Routemap

"man is to king as woman is to queen"   (semantic)
⇕
{man, queen} paraphrases {woman, king}
⇕
woman transforms to queen as man transforms to king
⇓
$\mathrm{PMI}_{king} - \mathrm{PMI}_{man} + \mathrm{PMI}_{woman} \approx \mathrm{PMI}_{queen}$
⇓   (using $\mathrm{PMI}_i \approx w_i^\top C$)
$w_{king} - w_{man} + w_{woman} \approx w_{queen}$   (geometric)
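Spelling out the last arrow of the routemap (a sketch, assuming $\mathrm{PMI}_i \approx w_i^\top C$ holds for every word and that the context vectors, the columns of $C \in \mathbb{R}^{d \times n}$, span $\mathbb{R}^d$):

$$ \mathrm{PMI}_{king} - \mathrm{PMI}_{man} + \mathrm{PMI}_{woman} - \mathrm{PMI}_{queen} \approx 0 \;\Longrightarrow\; \big( w_{king} - w_{man} + w_{woman} - w_{queen} \big)^\top C \approx \mathbf{0}^\top, $$

and a vector whose inner product with every context vector is near zero must itself be near zero when the context vectors span $\mathbb{R}^d$, giving $w_{king} - w_{man} + w_{woman} \approx w_{queen}$.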
Paraphrase† of $\mathcal{W}$ by $w_*$

Intuition: word $w_* \in \mathcal{E}$ paraphrases word set $\mathcal{W} = \{w_1, \dots, w_m\} \subseteq \mathcal{E}$ if $w_*$ and $\mathcal{W}$ are semantically interchangeable:
$$ p(\mathcal{E} \mid \mathcal{W}) \approx p(\mathcal{E} \mid w_*). $$

Definition (D1): $w_* \in \mathcal{E}$ paraphrases $\mathcal{W} \subseteq \mathcal{E}$, $|\mathcal{W}| < l$, if the paraphrase error $\rho^{\mathcal{W}, w_*} \in \mathbb{R}^n$ is (element-wise) small:
$$ \rho_j^{\mathcal{W}, w_*} = \log \frac{p(c_j \mid w_*)}{p(c_j \mid \mathcal{W})}, \qquad c_j \in \mathcal{E}. $$

† Inspired by Gittens et al. (2017).
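A minimal sketch of how this error could be estimated from corpus statistics (the conditional probability estimates below are assumptions for illustration; the definition itself works with the probabilities directly):

import numpy as np

def paraphrase_error(p_c_given_wstar, p_c_given_W, eps=1e-12):
    # rho_j = log p(c_j | w*) / p(c_j | W), one component per context word c_j.
    # p_c_given_wstar: p(c_j | w*), e.g. normalised co-occurrence counts of w*.
    # p_c_given_W: p(c_j | W), counts of c_j in windows containing every word of W, normalised.
    return np.log((np.asarray(p_c_given_wstar) + eps) / (np.asarray(p_c_given_W) + eps))

# w* paraphrases W (in the sense of D1) when this vector is small element-wise,
# e.g. np.abs(rho).max() below some tolerance.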
Summing PMI vectors of a paraphrase

Is $\mathrm{PMI}_1 + \mathrm{PMI}_2 \approx \mathrm{PMI}_*$? For $\mathcal{W} = \{w_1, w_2\}$ paraphrased by $w_*$ and any $c_j \in \mathcal{E}$:

$$ \mathrm{PMI}(w_*, c_j) - \big( \mathrm{PMI}(w_1, c_j) + \mathrm{PMI}(w_2, c_j) \big) = \log \frac{p(w_* \mid c_j)}{p(w_*)} - \log \frac{p(w_1 \mid c_j)\, p(w_2 \mid c_j)}{p(w_1)\, p(w_2)} + \log \frac{p(\mathcal{W} \mid c_j)}{p(\mathcal{W} \mid c_j)} + \log \frac{p(\mathcal{W})}{p(\mathcal{W})} $$

$$ = \underbrace{\log \frac{p(c_j \mid w_*)}{p(c_j \mid \mathcal{W})}}_{\rho_j^{\mathcal{W}, w_*} \text{ (paraphrase error)}} + \underbrace{\log \frac{p(\mathcal{W} \mid c_j)}{p(w_1 \mid c_j)\, p(w_2 \mid c_j)}}_{\sigma_j^{\mathcal{W}} \text{ (conditional independence error)}} - \underbrace{\log \frac{p(\mathcal{W})}{p(w_1)\, p(w_2)}}_{\tau^{\mathcal{W}} \text{ (independence error)}} $$

Lemma 1: For any word $w_* \in \mathcal{E}$ and word set $\mathcal{W} \subseteq \mathcal{E}$, $|\mathcal{W}| < l$:
$$ \mathrm{PMI}_* = \sum_{w_i \in \mathcal{W}} \mathrm{PMI}_i + \rho^{\mathcal{W}, w_*} + \sigma^{\mathcal{W}} - \tau^{\mathcal{W}} \mathbf{1}. $$
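Since the decomposition above is an exact identity before anything is assumed small, it can be checked numerically on an arbitrary toy distribution; a sketch (the joint table over occurrence indicators is made up purely for the check):

import numpy as np

rng = np.random.default_rng(0)
m = 4                                   # number of context words c_1, ..., c_m

# Arbitrary joint P(x1, x2, x*, c) over binary occurrence indicators for
# w_1, w_2, w_* and the identity of the context word c.
P = rng.random((2, 2, 2, m))
P /= P.sum()

p_c    = P.sum(axis=(0, 1, 2))                 # p(c_j)
p_w1   = P[1].sum();        p_w1_c = P[1].sum(axis=(0, 1)) / p_c          # p(w1), p(w1|c_j)
p_w2   = P[:, 1].sum();     p_w2_c = P[:, 1].sum(axis=(0, 1)) / p_c       # p(w2), p(w2|c_j)
p_ws   = P[:, :, 1].sum();  p_ws_c = P[:, :, 1].sum(axis=(0, 1)) / p_c    # p(w*), p(w*|c_j)
p_W    = P[1, 1].sum();     p_W_c  = P[1, 1].sum(axis=0) / p_c            # p(W),  p(W|c_j)
p_c_ws = P[:, :, 1].sum(axis=(0, 1)) / p_ws    # p(c_j | w*)
p_c_W  = P[1, 1].sum(axis=0) / p_W             # p(c_j | W)

PMI_1, PMI_2, PMI_s = np.log(p_w1_c / p_w1), np.log(p_w2_c / p_w2), np.log(p_ws_c / p_ws)
rho   = np.log(p_c_ws / p_c_W)                 # paraphrase error
sigma = np.log(p_W_c / (p_w1_c * p_w2_c))      # conditional independence error
tau   = np.log(p_W / (p_w1 * p_w2))            # independence error (scalar)

# Lemma 1 with W = {w_1, w_2}:  PMI_* = PMI_1 + PMI_2 + rho + sigma - tau
assert np.allclose(PMI_s, PMI_1 + PMI_2 + rho + sigma - tau)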
Generalised Paraphrase (of $\mathcal{W}$ by $\mathcal{W}_*$)

Lemma 1: For any word $w_* \in \mathcal{E}$ and word set $\mathcal{W} \subseteq \mathcal{E}$, $|\mathcal{W}| < l$:
$$ \mathrm{PMI}_* = \sum_{w_i \in \mathcal{W}} \mathrm{PMI}_i + \rho^{\mathcal{W}, w_*} + \sigma^{\mathcal{W}} - \tau^{\mathcal{W}} \mathbf{1}. $$

Replace the word $w_*$ with a word set $\mathcal{W}_* \subseteq \mathcal{E}$:
$$ p(\mathcal{E} \mid \mathcal{W}) \approx p(\mathcal{E} \mid \mathcal{W}_*). $$

Lemma 2: For any word sets $\mathcal{W}, \mathcal{W}_* \subseteq \mathcal{E}$, $|\mathcal{W}|, |\mathcal{W}_*| < l$:
$$ \sum_{w_i \in \mathcal{W}_*} \mathrm{PMI}_i = \sum_{w_i \in \mathcal{W}} \mathrm{PMI}_i + \rho^{\mathcal{W}, \mathcal{W}_*} + \sigma^{\mathcal{W}} - \tau^{\mathcal{W}} \mathbf{1} - \big( \sigma^{\mathcal{W}_*} - \tau^{\mathcal{W}_*} \mathbf{1} \big). $$
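One way to see where Lemma 2 comes from (a sketch; the set-level quantity $\mathrm{PMI}_{\mathcal{W}}(c_j) := \log \frac{p(c_j \mid \mathcal{W})}{p(c_j)}$ is notation introduced here, not on the slide): the derivation behind Lemma 1 gives $\sum_{w_i \in \mathcal{W}} \mathrm{PMI}_i = \mathrm{PMI}_{\mathcal{W}} - \sigma^{\mathcal{W}} + \tau^{\mathcal{W}}\mathbf{1}$ for any word set, while the generalised paraphrase error is exactly the gap between the two set-level vectors, $\rho^{\mathcal{W}, \mathcal{W}_*} = \mathrm{PMI}_{\mathcal{W}_*} - \mathrm{PMI}_{\mathcal{W}}$. Combining:

$$ \sum_{w_i \in \mathcal{W}_*} \mathrm{PMI}_i = \mathrm{PMI}_{\mathcal{W}_*} - \sigma^{\mathcal{W}_*} + \tau^{\mathcal{W}_*}\mathbf{1} = \mathrm{PMI}_{\mathcal{W}} + \rho^{\mathcal{W}, \mathcal{W}_*} - \sigma^{\mathcal{W}_*} + \tau^{\mathcal{W}_*}\mathbf{1} = \sum_{w_i \in \mathcal{W}} \mathrm{PMI}_i + \rho^{\mathcal{W}, \mathcal{W}_*} + \sigma^{\mathcal{W}} - \tau^{\mathcal{W}}\mathbf{1} - \big( \sigma^{\mathcal{W}_*} - \tau^{\mathcal{W}_*}\mathbf{1} \big), $$

which is Lemma 2.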