

  1. Tengyu Ma. Joint work with Sanjeev Arora, Yuanzhi Li, Yingyu Liang, and Andrej Risteski. Princeton University.

  2. Map objects x ∈ 𝒳 (a complicated space) to vectors v_x ∈ ℝ^d (a Euclidean space with meaningful inner products):
     Ø Kernel methods → linearly separable
     Ø Neural nets → multi-class linear classifier

  3. Vocabulary = { 60k most frequent words }; embedding: vocabulary → ℝ^300.
     Goal: the embedding captures semantic information (via linear-algebraic operations)
     Ø inner products characterize similarity: similar words have large inner products
     Ø differences characterize relationships: analogous pairs have similar differences
     Ø more?
     (picture: Chris Olah's blog)
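As a concrete, if toy, illustration of the two bullet points above, the snippet below uses tiny made-up vectors rather than trained embeddings; all numbers are purely illustrative.

```python
import numpy as np

# Made-up 3-dimensional "embeddings" (illustrative only, not trained vectors).
v = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.2, 0.1]),
    "man":    np.array([0.5, 0.8, 0.0]),
    "woman":  np.array([0.5, 0.2, 0.0]),
    "banana": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner products / cosines as similarity: king is closer to queen than to banana.
print(cosine(v["king"], v["queen"]), cosine(v["king"], v["banana"]))

# Differences as relationships: king - man and queen - woman nearly coincide here.
print(v["king"] - v["man"], v["queen"] - v["woman"])
```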

  4. Meaning of a word is determined by the words it co-occurs with (distributional hypothesis of meaning, [Harris'54], [Firth'57]).
     Ø Pr[x, y] ≜ probability that words x, y co-occur in a window of size 5; these form the co-occurrence matrix Pr[·,·] (row word x, column word y)
     Ø ⟨v_x, v_y⟩ - a good measure of the similarity of (x, y) [Lund-Burgess'96]
     Ø v_x = row x of the entry-wise square root of the co-occurrence matrix [Rohde et al.'05]
     Ø v_x = row x of the PMI matrix, where PMI(x, y) = log( Pr[x, y] / (Pr[x] Pr[y]) ) [Church-Hanks'90]
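A rough sketch (not the authors' code) of how a windowed co-occurrence matrix and its PMI could be computed from a tokenized corpus; the toy corpus and the bookkeeping choices are mine.

```python
import numpy as np
from itertools import combinations

corpus = "the king spoke to the queen and the queen spoke to the king".split()
window = 5

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

# Count co-occurrences of word pairs inside a sliding window of size `window`.
counts = np.zeros((n, n))
for start in range(len(corpus) - window + 1):
    for a, b in combinations(corpus[start:start + window], 2):
        counts[idx[a], idx[b]] += 1
        counts[idx[b], idx[a]] += 1

joint = counts / counts.sum()            # empirical Pr[x, y]
marginal = joint.sum(axis=1)             # empirical Pr[x]
with np.errstate(divide="ignore"):
    pmi = np.log(joint / np.outer(marginal, marginal))   # PMI(x, y)
pmi[np.isneginf(pmi)] = 0.0              # pairs never seen together: set PMI to 0

print(pmi[idx["king"], idx["queen"]])
```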

  5. Algorithm [Levy-Goldberg'14] (a dimension-reduction version of [Church-Hanks'90]):
     Ø Compute PMI(x, y) = log( Pr[x, y] / (Pr[x] Pr[y]) )
     Ø Take the rank-300 SVD (best rank-300 approximation) of the PMI matrix
     Ø ⇔ fit PMI(x, y) ≈ ⟨v_x, v_y⟩ (with squared loss), where v_x ∈ ℝ^300
     Ø "Linear structure" in the found v_x's:  v_woman − v_man ≈ v_queen − v_king ≈ v_aunt − v_uncle ≈ ⋯
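A minimal sketch of the SVD step, continuing from the hypothetical `pmi` matrix and `idx` above; rank 2 stands in for rank 300 because the toy vocabulary is tiny, and splitting √S symmetrically is one common convention, not necessarily the one used in the talk.

```python
k = 2                                   # 300 in the real setting
U, S, Vt = np.linalg.svd(pmi)
V_emb = U[:, :k] * np.sqrt(S[:k])       # word vectors v_x = rows of V_emb

# Check the low-rank fit PMI(x, y) ≈ ⟨v_x, v_y⟩ on one pair.
print(V_emb[idx["king"]] @ V_emb[idx["queen"]], pmi[idx["king"], idx["queen"]])
```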

  6. Ø Questions:  woman : man :: queen : ?   and   woman : man :: aunt : ?
     Ø Answers:
        king  = argmin_x ‖ v_queen − v_x − (v_woman − v_man) ‖
        uncle = argmin_x ‖ v_aunt − v_x − (v_woman − v_man) ‖
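A hedged sketch of this argmin over the vocabulary (excluding the query words, as is common in analogy evaluations); `V_emb`, `idx`, and `vocab` are the hypothetical objects from the SVD sketch above.

```python
def analogy(a, b, c, V_emb, idx, vocab):
    """Solve a : b :: c : ?   via   argmin_x || v_c - v_x - (v_a - v_b) ||."""
    target = V_emb[idx[c]] - (V_emb[idx[a]] - V_emb[idx[b]])
    best, best_dist = None, np.inf
    for x in vocab:
        if x in (a, b, c):               # exclude the query words themselves
            continue
        dist = np.linalg.norm(target - V_emb[idx[x]])
        if dist < best_dist:
            best, best_dist = x, dist
    return best

# woman : man :: queen : ?  (with good embeddings the answer should be "king")
print(analogy("woman", "man", "queen", V_emb, idx, vocab))
```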

  7. Ø Recurrent-neural-network-based model [Mikolov et al.'12]
     Ø word2vec [Mikolov et al.'13]:  Pr[ w_{t+6} ∣ w_{t+1}, …, w_{t+5} ] ∝ exp ⟨ v_{w_{t+6}}, (1/5)( v_{w_{t+1}} + ⋯ + v_{w_{t+5}} ) ⟩
     Ø GloVe [Pennington et al.'14]:  log Pr[x, y] ≈ ⟨v_x, v_y⟩ + s_x + s_y + C
     Ø [Levy-Goldberg'14] (previous slide):  PMI(x, y) = log( Pr[x, y] / (Pr[x] Pr[y]) ) ≈ ⟨v_x, v_y⟩ + C
     The logarithm (or exponential) seems to exclude linear algebra!
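The GloVe line above is a weighted least-squares fit; the sketch below is a drastically simplified, unweighted stand-in (plain SGD over observed pairs, with my own choices of rank, learning rate, and iteration count), reusing the hypothetical `joint` matrix and `n` from the co-occurrence sketch.

```python
# Fit  log Pr[x, y] ≈ ⟨v_x, v_y⟩ + s_x + s_y  by stochastic gradient descent on squared loss,
# over the pairs that actually co-occur (a simplified, unweighted GloVe-like objective).
rng = np.random.default_rng(0)
k = 2
V_fit = 0.1 * rng.standard_normal((n, k))
s = np.zeros(n)
pairs = [(i, j) for i in range(n) for j in range(n) if joint[i, j] > 0]

lr = 0.05
for _ in range(2000):
    for i, j in pairs:
        err = V_fit[i] @ V_fit[j] + s[i] + s[j] - np.log(joint[i, j])
        gi, gj = err * V_fit[j], err * V_fit[i]
        V_fit[i] -= lr * gi
        V_fit[j] -= lr * gj
        s[i] -= lr * err
        s[j] -= lr * err

# Mean squared error of the fit on the observed pairs.
print(np.mean([(V_fit[i] @ V_fit[j] + s[i] + s[j] - np.log(joint[i, j])) ** 2 for i, j in pairs]))
```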

  8. Why does "co-occurrence statistics + log" give linear structure? [Levy-Goldberg'14, Pennington et al.'14, rephrased]
     Ø For most words χ:   Pr[χ ∣ king] / Pr[χ ∣ queen] ≈ Pr[χ ∣ man] / Pr[χ ∣ woman]
        § for χ unrelated to gender: LHS, RHS ≈ 1
        § for χ = dress: LHS, RHS ≪ 1;  for χ = John: LHS, RHS ≫ 1
     Ø This suggests
        log Pr[χ ∣ king] − log Pr[χ ∣ queen] − ( log Pr[χ ∣ man] − log Pr[χ ∣ woman] ) ≈ 0
        ⇔ PMI(χ, king) − PMI(χ, queen) − ( PMI(χ, man) − PMI(χ, woman) ) ≈ 0
     Ø So the rows of the PMI matrix have "linear structure"
     Ø Empirically one can find v_x's s.t. PMI(χ, x) ≈ ⟨v_χ, v_x⟩
     Ø Suggestion: the v_x's also have linear structure
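A tiny numeric check that the ratio condition transfers to PMI differences (the marginal Pr[χ] cancels); the conditional probabilities below are made up so that the two ratios match.

```python
import numpy as np

# Made-up Pr[chi | word] for chi = "dress"; the ratios king/queen and man/woman are equal.
p_chi = 0.01                                                  # marginal Pr[chi]
cond = {"king": 0.001, "queen": 0.010, "man": 0.002, "woman": 0.020}

# PMI(chi, w) = log( Pr[chi, w] / (Pr[chi] Pr[w]) ) = log( Pr[chi | w] / Pr[chi] )
pmi = {w: np.log(cond[w] / p_chi) for w in cond}

ratio_gap = np.log(cond["king"] / cond["queen"]) - np.log(cond["man"] / cond["woman"])
pmi_gap = (pmi["king"] - pmi["queen"]) - (pmi["man"] - pmi["woman"])
print(ratio_gap, pmi_gap)    # both are 0 here: the linear structure shows up in the PMI rows
```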

  9. M1: Why do low-dimensional vectors capture the essence of huge co-occurrence statistics? That is, why is a low-dimensional fit of the PMI matrix even possible?
        PMI(x, y) ≈ ⟨v_x, v_y⟩   (∗)
     Ø NB: the PMI matrix is not necessarily PSD.
     M2: Why do low-dimensional vectors solve analogies when (∗) is only roughly true? (the empirical fit has ~17% error)
     Ø NB: solving an analogy requires inner products of 6 pairs of word vectors, and "king" has to survive against all other words; noise is potentially an issue!
        king = argmin_x ‖ v_queen − v_x − (v_woman − v_man) ‖²
     Ø Fact: low-dimensional word vectors have more accurate linear structure than the rows of PMI (and therefore better analogy-task performance).

  10. M1: Why do low-dimensional vectors capture the essence of huge co-occurrence statistics? That is, why is a low-dimensional fit of the PMI matrix even possible?
        PMI(x, y) ≈ ⟨v_x, v_y⟩   (∗)
      A1: Under a generative model (named RAND-WALK), (∗) provably holds.
      M2: Why do low-dimensional vectors solve analogies when (∗) is only roughly true?
      A2: (∗) + isotropy of the word vectors ⇒ low-dimensional fitting reduces noise.
        (Quite intuitive, though it does not follow from Occam's bound for PAC learning.)

  11. ๐‘‘ _ ๐‘‘ _pr ๐‘‘ _pโ€ข ๐‘‘ _p7 ๐‘‘ _pโ€“ ๐‘ฅ _pr ๐‘ฅ _pโ€ข ๐‘ฅ _pโ€“ ๐‘ฅ _ ๐‘ฅ _p7 ร˜ Hidden Markov Model: ยง discourse vector ๐‘‘ _ โˆˆ โ„ ( governs the discourse/theme/context of time ๐‘ข ยง words ๐‘ฅ _ (observable); embedding ๐‘ค P โ€ข โˆˆ โ„ ( (parameters to learn) ยง log-linear observation model Pr[๐‘ฅ _ โˆฃ ๐‘‘ _ ] โˆ expโŒฉ๐‘ค P โ€ข ,๐‘‘ _ โŒช ร˜ Closely related to [Mnih-Hintonโ€™07]

  12. ๐‘‘ _ ๐‘‘ _pr ๐‘‘ _pโ€ข ๐‘‘ _p7 ๐‘‘ _pโ€“ ๐‘ฅ _pr ๐‘ฅ _pโ€ข ๐‘ฅ _pโ€“ ๐‘ฅ _ ๐‘ฅ _p7 ร˜ Ideally, ๐‘‘ _ ,๐‘ค P โˆˆ โ„ ( should contain semantic information in its coordinates ยง E.g. (0.5, -0.3, โ€ฆ) could mean โ€œ0.5 gender, -0.3 age,..โ€ ร˜ But, the whole system is rotational invariant: ๐‘‘ _ ,๐‘ค P = โŒฉ๐‘†๐‘‘ _ ,๐‘†๐‘ค P โŒช ร˜ There should exist a rotation so that the coordinates are meaningful (back to this later)

  13. ๐‘‘ _ ๐‘‘ _pr ๐‘‘ _pโ€ข ๐‘‘ _p7 ๐‘‘ _pโ€“ ๐‘ฅ _ ๐‘ฅ _pr ๐‘ฅ _pโ€ข ๐‘ฅ _p7 ๐‘ฅ _pโ€“ ร˜ Assumptions: ยง { ๐‘ค P } consists of vectors drawn from ๐‘ก โ‹… ๐’ช(0,Id) ; ๐‘ก is bounded scalar r.v. ยง ๐‘‘ _ does a slow random walk (doesnโ€™t change much in a window of 5) ยง log-linear observation model: Pr[๐‘ฅ _ โˆฃ ๐‘‘ _ ] โˆ expโŒฉ๐‘ค P โ€ข ,๐‘‘ _ โŒช ร˜ Main Theorem: ๐‘ค P + ๐‘ค Pโ€บ โ€ข /๐‘’ โˆ’ 2 log ๐‘Ž ยฑ ๐œ— (1) log Pr ๐‘ฅ,๐‘ฅโ€ฒ = ๐‘ค P โ€ข /๐‘’ โˆ’ log ๐‘Ž ยฑ ๐œ— (2) log Pr ๐‘ฅ = Fact: (2) implies that the words have power PMI ๐‘ฅ,๐‘ฅ โ€บ = ๐‘ค P ,๐‘ค P ยข /๐‘’ ยฑ ๐œ— (3) law dist. ร˜ Norm determines frequency; spatial orientation determines โ€œmeaningโ€

  14. Ø word2vec [Mikolov et al.'13]:  Pr[ w_{t+6} ∣ w_{t+1}, …, w_{t+5} ] ∝ exp ⟨ v_{w_{t+6}}, (1/5)( v_{w_{t+1}} + ⋯ + v_{w_{t+5}} ) ⟩
      Ø GloVe [Pennington et al.'14]:  log Pr[w, w′] ≈ ⟨v_w, v_{w′}⟩ + s_w + s_{w′} + C
         ↔ Eq. (1):  log Pr[w, w′] = ‖v_w + v_{w′}‖² / (2d) − 2 log Z ± ε
      Ø [Levy-Goldberg'14]:  PMI(w, w′) ≈ ⟨v_w, v_{w′}⟩ + C
         ↔ Eq. (3):  PMI(w, w′) = ⟨v_w, v_{w′}⟩ / d ± ε

  15. Ø word2vec [Mikolov et al.'13]:  Pr[ w_{t+6} ∣ w_{t+1}, …, w_{t+5} ] ∝ exp ⟨ v_{w_{t+6}}, (1/5)( v_{w_{t+1}} + ⋯ + v_{w_{t+5}} ) ⟩
         ↑ the context average is a maximum-likelihood estimate of the discourse c_{t+6}
      Ø Under our model:
         § the random walk is slow:  c_{t+1} ≈ c_{t+2} ≈ ⋯ ≈ c_{t+6} ≈ c
         § best estimate for the current discourse c_{t+6}:  argmax_{c, ‖c‖≤1} Pr[ c ∣ w_{t+1}, …, w_{t+5} ] = α ( v_{w_{t+1}} + ⋯ + v_{w_{t+5}} )
         § probability distribution of the next word given the best guess c:
           Pr[ w_{t+6} ∣ c_{t+6} = α ( v_{w_{t+1}} + ⋯ + v_{w_{t+5}} ) ] ∝ exp ⟨ v_{w_{t+6}}, α ( v_{w_{t+1}} + ⋯ + v_{w_{t+5}} ) ⟩
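A sketch of this two-step recipe (estimate the discourse as a scaled sum of the context word vectors, then plug it into the log-linear model), reusing the synthetic `V` and sampled `words` from the simulation sketches above; normalizing to unit norm plays the role of the scaling α, which is my choice.

```python
def next_word_distribution(context_ids, V):
    """MAP-style discourse estimate = normalized sum of context vectors,
    then the log-linear next-word distribution Pr[w | c] ∝ exp⟨v_w, c⟩."""
    c_hat = V[context_ids].sum(axis=0)
    c_hat /= np.linalg.norm(c_hat)            # plays the role of alpha (enforce ‖c‖ ≤ 1)
    logits = V @ c_hat
    p = np.exp(logits - logits.max())
    return p / p.sum()

p_next = next_word_distribution(words[-5:], V)    # last 5 sampled words as the context
print(p_next.argmax())
```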

  16. This talk: window of size 2.
      Ø Pr[ w ∣ c ] = (1 / Z_c) · exp ⟨v_w, c⟩ ;   Pr[ w′ ∣ c′ ] ∝ exp ⟨v_{w′}, c′⟩
      Ø Z_c = Σ_w exp ⟨v_w, c⟩   (partition function)
      Ø Pr[w, w′] = ∫ Pr[w ∣ c] Pr[w′ ∣ c′] p(c, c′) dc dc′
                  = ∫ (1 / (Z_c Z_{c′})) · exp⟨v_w, c⟩ exp⟨v_{w′}, c′⟩ p(c, c′) dc dc′
      Ø For a spherical Gaussian vector c:  𝔼[ exp⟨v, c⟩ ] = exp( ‖v‖² / (2d) )
      Ø Assume c = c′ with probability 1, and (for now) Z_c ≈ Z for typical c (the "??", handled by Lemma 1 on the next slide):
           Pr[w, w′] ≈ (1/Z²) ∫ exp⟨v_w + v_{w′}, c⟩ p(c) dc = (1/Z²) exp( ‖v_w + v_{w′}‖² / (2d) )
      Ø Eq. (1):  log Pr[w, w′] = ‖v_w + v_{w′}‖² / (2d) − 2 log Z ± ε
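A quick Monte Carlo check of the Gaussian moment identity used in this derivation, 𝔼[exp⟨v, c⟩] = exp(‖v‖²/(2d)), with c having i.i.d. 𝒩(0, 1/d) coordinates (my concrete choice of "spherical Gaussian" with ‖c‖ ≈ 1).

```python
import numpy as np

rng = np.random.default_rng(4)
d, samples = 50, 100_000

v = rng.standard_normal(d)                              # a fixed word vector
C = rng.standard_normal((samples, d)) / np.sqrt(d)      # discourse samples c ~ N(0, Id/d)

empirical = np.exp(C @ v).mean()
closed_form = np.exp(np.linalg.norm(v) ** 2 / (2 * d))
print(empirical, closed_form)     # should agree to within a few percent
```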

  17. This talk: window of size 2.
      Ø Pr[ w ∣ c ] = (1 / Z_c) · exp ⟨v_w, c⟩ ;   Pr[ w′ ∣ c′ ] ∝ exp ⟨v_{w′}, c′⟩
      Ø Z_c = Σ_w exp ⟨v_w, c⟩   (partition function)
      Lemma 1: for almost all c and almost all choices of the v_w's,  Z_c = (1 + o(1)) Z.
      Ø Proof (sketch):
         § for most c, Z_c concentrates around its mean
         § the mean of Z_c is determined by ‖c‖, which in turn concentrates
         § caveat: exp⟨v, c⟩ for v ∼ 𝒩(0, Id) is neither sub-Gaussian nor sub-exponential (its α-Orlicz norm is unbounded for every α > 0)
      Ø Eq. (1):  log Pr[w, w′] = ‖v_w + v_{w′}‖² / (2d) − 2 log Z ± ε
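A small simulation of the concentration behind Lemma 1, under the same assumed scalings as the earlier sketches: draw many discourse vectors c, compute Z_c over one fixed random vocabulary, and look at the relative spread.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, trials = 20_000, 100, 200

V = rng.standard_normal((n, d))                       # fixed word vectors ~ N(0, Id)
C = rng.standard_normal((trials, d)) / np.sqrt(d)     # random discourse vectors, norm ≈ 1

Zc = np.exp(C @ V.T).sum(axis=1)                      # partition function Z_c for each c
print(Zc.std() / Zc.mean())     # small relative fluctuation; shrinks as n and d grow
```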

  18. Lemma 1: for almost all c and almost all choices of the v_w's,  Z_c = (1 + o(1)) Z.
      Ø Proof sketch:
      Ø Fixing c, show that with high probability over the choice of the v_w's,
           Z_c = Σ_w exp ⟨v_w, c⟩ = (1 + o(1)) · 𝔼[Z_c]
      Ø z_w = ⟨v_w, c⟩ is a scalar Gaussian random variable, z_w ∼ 𝒩(0, ‖c‖²)
      Ø so ‖c‖ governs the mean and variance of each term exp(z_w)
      Ø ‖c‖ in turn is concentrated
