Tengyu Ma
Joint works with Sanjeev Arora, Yuanzhi Li, Yingyu Liang, and Andrej Risteski
Princeton University

$x \in \mathcal{X}$ (complicated space) ↦ $v_x \in \mathbb{R}^d$ (Euclidean space with meaningful inner products)
- Kernel methods: handcrafted lifting
- Neural nets: trained neural nets
Linearly separable → multi-class linear classifier

Vocabulary = { 60k most frequent words } → $\mathbb{R}^{300}$
Goal: Embedding captures semantic information (via linear algebraic operations)
- inner products characterize similarity
- similar words have large inner products
- differences characterize relationship
- analogous pairs have similar differences
- more?
(picture: Chris Olah’s blog)

Meaning of a word is determined by words it co-occurs with. (Distributional hypothesis of meaning, [Harris’54], [Firth’57])
- $\Pr[x, y] \triangleq$ prob. of co-occurrences of $x, y$ in a window of size 5
- Co-occurrence matrix $\Pr[\cdot,\cdot]$ (rows indexed by word $x$, columns by word $y$): $v_x$ = row of word $x$; $\langle v_x, v_y \rangle$ is a good measure of similarity of $(x, y)$ [Lund-Burgess’96]
- $v_x$ = row of entry-wise square-root of co-occurrence matrix [Rohde et al’05]
- $v_x$ = row of PMI matrix, where $\mathrm{PMI}(x, y) = \log \frac{\Pr[x, y]}{\Pr[x]\Pr[y]}$ [Church-Hanks’90]
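The PMI definition above can be sketched on a toy token list. This is a minimal illustration, not the procedure used in the cited papers: the function name `pmi_table`, the symmetric counting of pairs, and the reading of "window of size 5" as "at most 4 tokens apart" are all assumptions made for the example.

```python
import math
from collections import Counter

def pmi_table(tokens, window=5):
    """PMI(x, y) = log(Pr[x, y] / (Pr[x] Pr[y])) for every observed pair.

    Assumption: two tokens co-occur when they are fewer than `window`
    positions apart, and each co-occurrence is counted in both orders.
    Pairs that never co-occur are omitted (their PMI would be -inf).
    """
    pair_counts = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + window]:
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1
    total = sum(pair_counts.values())
    # Marginal counts: Pr[x] is the row sum of the joint table.
    marginal = Counter()
    for (x, _), n in pair_counts.items():
        marginal[x] += n
    return {
        (x, y): math.log(n * total / (marginal[x] * marginal[y]))
        for (x, y), n in pair_counts.items()
    }

tokens = "the king and the queen greet the man and the woman".split()
pmi = pmi_table(tokens)
```

Because pairs are counted in both orders, the resulting table is symmetric, which matches the symmetric inner product $\langle v_x, v_y \rangle$ it is later fitted with.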

Algorithm [Levy-Goldberg]: (dimension-reduction version of [Church-Hanks’90])
- Compute $\mathrm{PMI}(x, y) = \log \frac{\Pr[x, y]}{\Pr[x]\Pr[y]}$
- Take rank-300 SVD (best rank-300 approximation) of PMI
- ⇔ Fit $\mathrm{PMI}(x, y) \approx \langle v_x, v_y \rangle$ (with squared loss), where $v_x \in \mathbb{R}^{300}$
- “Linear structure” in the found $v_x$’s: $v_{\text{woman}} - v_{\text{man}} \approx v_{\text{queen}} - v_{\text{king}} \approx v_{\text{aunt}} - v_{\text{uncle}} \approx \cdots$
[Figure: word vectors for king, queen, man, woman, uncle, aunt]
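A small NumPy sketch of the rank-$k$ SVD step, run on a synthetic symmetric matrix rather than a real PMI matrix. The helper name `svd_embeddings`, the sizes, and the noise level are illustrative assumptions. Reading embeddings off as $U_k \sqrt{S_k}$ reproduces the rank-$k$ approximation as $W W^\top$ only when the top $k$ singular values come from positive eigenvalues, which holds for this synthetic matrix but is not guaranteed for PMI in general.

```python
import numpy as np

def svd_embeddings(M, k):
    """Rank-k truncated SVD of a symmetric matrix M.

    Returns the best rank-k approximation M_k (in Frobenius norm),
    the embedding matrix W = U_k * sqrt(S_k), and all singular values.
    """
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    M_k = (U[:, :k] * S[:k]) @ Vt[:k]
    W = U[:, :k] * np.sqrt(S[:k])
    return M_k, W, S

rng = np.random.default_rng(0)
n, k = 50, 3
V_true = rng.normal(size=(n, k))      # hypothetical ground-truth vectors
A = rng.normal(size=(n, n))
noise = 0.01 * (A + A.T) / 2          # small symmetric perturbation
M = V_true @ V_true.T + noise         # synthetic stand-in for a PMI matrix

M_k, W, S = svd_embeddings(M, k)
fit_error = np.linalg.norm(M - M_k)   # Frobenius norm of the residual
```

The residual norm equals the root of the sum of squared discarded singular values, which is what "best rank-$k$ approximation" means here.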

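- Questions: woman : man :: queen : ?, aunt : ?
- Answers:
  king $= \arg\min_w \| v_{\text{queen}} - v_w - (v_{\text{woman}} - v_{\text{man}}) \|$
  uncle $= \arg\min_w \| v_{\text{aunt}} - v_w - (v_{\text{woman}} - v_{\text{man}}) \|$
[Figure: word vectors for king, queen, man, woman, uncle, aunt]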
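The argmin rule above can be sketched on hand-made 2-d vectors. The vectors below are invented for illustration (first coordinate is a "gender" offset, second separates the word pairs); they are not learned embeddings, and `solve_analogy` is a hypothetical helper name.

```python
import numpy as np

def solve_analogy(vecs, a, b, c):
    """Answer 'a : b :: c : ?' by minimizing ||v_c - v_w - (v_a - v_b)|| over w.

    The query words themselves are excluded from the candidates.
    """
    target = vecs[c] - (vecs[a] - vecs[b])
    candidates = [w for w in vecs if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(vecs[w] - target))

# Toy vectors with an exact shared offset between the paired words.
vecs = {
    "man": np.array([0.0, 0.0]),
    "woman": np.array([1.0, 0.0]),
    "king": np.array([0.0, 1.0]),
    "queen": np.array([1.0, 1.0]),
    "uncle": np.array([0.0, 2.0]),
    "aunt": np.array([1.0, 2.0]),
}

answer_queen = solve_analogy(vecs, "woman", "man", "queen")
answer_aunt = solve_analogy(vecs, "woman", "man", "aunt")
```

With real embeddings the offsets only agree approximately, so the minimizer has to win against every other word in the vocabulary.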

- recurrent neural network based model [Mikolov et al’12]
- word2vec [Mikolov et al’13]: $\Pr[x_{i+6} \mid x_{i+1}, \dots, x_{i+5}] \propto \exp\langle v_{x_{i+6}}, \tfrac{1}{5}(v_{x_{i+1}} + \cdots + v_{x_{i+5}}) \rangle$
- GloVe [Pennington et al’14]: $\log \Pr[x, y] \approx \langle v_x, v_y \rangle + s_x + s_y + C$
- [Levy-Goldberg’14] (previous slide): $\mathrm{PMI}(x, y) = \log \frac{\Pr[x, y]}{\Pr[x]\Pr[y]} \approx \langle v_x, v_y \rangle + C$
Logarithm (or exponential) seems to exclude linear algebra!

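Why co-occurrence statistics + log → linear structure [Levy-Goldberg’13, Pennington et al’14, rephrased]
- For most of the words $\chi$: $\frac{\Pr[\chi \mid \text{king}]}{\Pr[\chi \mid \text{queen}]} \approx \frac{\Pr[\chi \mid \text{man}]}{\Pr[\chi \mid \text{woman}]}$
  - for $\chi$ unrelated to gender: LHS, RHS ≈ 1
  - for $\chi$ = dress: LHS, RHS ≪ 1; for $\chi$ = John: LHS, RHS ≫ 1
- It suggests
  $\sum_\chi \left( \log \frac{\Pr[\chi \mid \text{king}]}{\Pr[\chi \mid \text{queen}]} - \log \frac{\Pr[\chi \mid \text{man}]}{\Pr[\chi \mid \text{woman}]} \right)^2 \approx 0$
  $= \sum_\chi \left( \mathrm{PMI}(\chi, \text{king}) - \mathrm{PMI}(\chi, \text{queen}) - (\mathrm{PMI}(\chi, \text{man}) - \mathrm{PMI}(\chi, \text{woman})) \right)^2 \approx 0$
- Rows of the PMI matrix have “linear structure”
- Empirically one can find $v_w$’s s.t. $\mathrm{PMI}(\chi, w) \approx \langle v_\chi, v_w \rangle$
- Suggestion: the $v_w$’s also have linear structure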
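The step from ratios of conditional probabilities to differences of PMI values uses only the definition of PMI. A short derivation, written for a generic pair of words $w_1, w_2$ (these two symbols are introduced here for illustration):

```latex
\begin{align*}
\mathrm{PMI}(\chi, w)
  &= \log \frac{\Pr[\chi, w]}{\Pr[\chi]\,\Pr[w]}
   = \log \frac{\Pr[\chi \mid w]}{\Pr[\chi]}, \\
\log \frac{\Pr[\chi \mid w_1]}{\Pr[\chi \mid w_2]}
  &= \log \frac{\Pr[\chi \mid w_1]}{\Pr[\chi]} - \log \frac{\Pr[\chi \mid w_2]}{\Pr[\chi]}
   = \mathrm{PMI}(\chi, w_1) - \mathrm{PMI}(\chi, w_2).
\end{align*}
```

Applying this with $(w_1, w_2)$ = (king, queen) and (man, woman) turns the ratio condition into the statement that the corresponding rows of the PMI matrix have nearly equal differences.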

M1: Why do low-dim vectors capture the essence of huge co-occurrence statistics? That is, why is a low-dim fit of the PMI matrix even possible?
  $\mathrm{PMI}(x, y) \approx \langle v_x, v_y \rangle$ (∗)
- NB: PMI matrix is not necessarily PSD.
M2: Why do low-dim vectors solve analogies when (∗) is only roughly true? (The empirical fit has 17% error.)
- NB: solving the analogy task requires inner products of 6 pairs of word vectors, and that “king” survives against all other words, so noise is potentially an issue!
  king $= \arg\min_w \| v_{\text{queen}} - v_w - (v_{\text{woman}} - v_{\text{man}}) \|^2$
- Fact: low-dim word vectors have more accurate linear structure than the rows of PMI (therefore better analogy task performance).

M1: Why do low-dim vectors capture the essence of huge co-occurrence statistics? That is, why is a low-dim fit of the PMI matrix even possible?
  $\mathrm{PMI}(x, y) \approx \langle v_x, v_y \rangle$ (∗)
A1: Under a generative model (named RAND-WALK), (∗) provably holds.
M2: Why do low-dim vectors solve analogies when (∗) is only roughly true?
A2: (∗) + isotropy of word vectors ⇒ low-dim fitting reduces noise (quite intuitive, though it doesn’t follow Occam’s bound for PAC-learning)
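The intuition behind A2 can be illustrated with a generic low-rank denoising experiment. This is not the argument from the talk (it does not use isotropy of word vectors); it is a synthetic sketch, with sizes and noise model chosen for illustration, of how a rank-$k$ fit discards most of the noise added to a low-rank matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5

G = rng.normal(size=(n, k))
L = G @ G.T                            # hypothetical "true" rank-k matrix
A = rng.normal(size=(n, n))
E = (A + A.T) / np.sqrt(2)             # symmetric noise, unit-variance off-diagonal entries
M = L + E                              # what we actually observe

# Rank-k truncated SVD of the noisy observation.
U, S, Vt = np.linalg.svd(M)
M_k = (U[:, :k] * S[:k]) @ Vt[:k]

err_raw = np.linalg.norm(M - L)        # error of the noisy matrix itself
err_low_rank = np.linalg.norm(M_k - L) # error after the rank-k fit
```

In this setup the noise is spread over all $n$ directions while the signal lives in $k$ of them, so the truncated matrix ends up closer to `L` than the raw observation `M` is.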

[Diagram: chain of discourse vectors $c_t \to c_{t+1} \to c_{t+2} \to c_{t+3} \to c_{t+4}$, each emitting a word $w_t, w_{t+1}, w_{t+2}, w_{t+3}, w_{t+4}$]
- Hidden Markov Model:
  - discourse vector $c_t \in \mathbb{R}^d$ governs the discourse/theme/context at time $t$
  - words $w_t$ (observable); embedding $v_{w_t} \in \mathbb{R}^d$ (parameters to learn)
  - log-linear observation model $\Pr[w_t \mid c_t] \propto \exp\langle v_{w_t}, c_t \rangle$
- Closely related to [Mnih-Hinton’07]
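A minimal sampler for this kind of model, to make the log-linear emission concrete. Several details are assumptions made for the sketch and are not specified on the slide: the discourse vector is kept at unit norm, its drift is a small Gaussian step of size `step`, and the word vectors are random Gaussians. The function names are hypothetical.

```python
import numpy as np

def word_distribution(V, c):
    """Pr[w | c] proportional to exp(<v_w, c>), as a numerically stable softmax."""
    logits = V @ c
    logits = logits - logits.max()     # shift for stability; does not change the result
    p = np.exp(logits)
    return p / p.sum()

def sample_words(V, length, step=0.05, seed=0):
    """Emit `length` word indices while the discourse vector drifts slowly."""
    rng = np.random.default_rng(seed)
    n, d = V.shape
    c = rng.normal(size=d)
    c /= np.linalg.norm(c)             # assumption: unit-norm discourse vector
    words = []
    for _ in range(length):
        words.append(int(rng.choice(n, p=word_distribution(V, c))))
        c = c + step * rng.normal(size=d)   # assumption: small Gaussian drift
        c /= np.linalg.norm(c)
    return words

V = np.random.default_rng(1).normal(size=(1000, 10))  # toy word vectors
corpus = sample_words(V, length=20)
```

Only the words are observed; the sequence of discourse vectors stays hidden, which is what makes this a hidden Markov model.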

[Diagram: chain of discourse vectors $c_t, \dots, c_{t+4}$ emitting words $w_t, \dots, w_{t+4}$]
- Ideally, $c_t, v_w \in \mathbb{R}^d$ should contain semantic information in their coordinates
  - E.g. (0.5, -0.3, …) could mean “0.5 gender, -0.3 age, ...”
- But the whole system is rotationally invariant: $\langle c_t, v_w \rangle = \langle R c_t, R v_w \rangle$
- There should exist a rotation so that the coordinates are meaningful (back to this later)
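The invariance can be checked numerically. The sketch below draws a random orthogonal matrix from a QR factorization (it may be a rotation or a reflection; the identity holds for both) and compares inner products before and after applying it. Dimension and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Q factor of a random Gaussian matrix is orthogonal: R.T @ R = I.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

c = rng.normal(size=d)   # stand-in for a discourse vector
v = rng.normal(size=d)   # stand-in for a word vector

before = c @ v
after = (R @ c) @ (R @ v)
```

Since the model only ever sees $\langle v_w, c_t \rangle$, no training objective built on it can single out a preferred coordinate system.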

[Diagram: chain of discourse vectors $c_t, \dots, c_{t+4}$ emitting words $w_t, \dots, w_{t+4}$]
- Assumptions:
  - $\{v_w\}$ consists of vectors drawn from $s \cdot \mathcal{N}(0, \mathrm{Id})$; $s$ is a bounded scalar r.v.
  - $c_t$ does a slow random walk (doesn’t change much in a window of 5)
  - log-linear observation model: $\Pr[w_t \mid c_t] \propto \exp\langle v_{w_t}, c_t \rangle$
- Main Theorem:
  $\log \Pr[w, w'] = \|v_w + v_{w'}\|^2/(2d) - 2\log Z \pm \epsilon$  (1)
  $\log \Pr[w] = \|v_w\|^2/(2d) - \log Z \pm \epsilon$  (2)
  $\mathrm{PMI}(w, w') = \langle v_w, v_{w'} \rangle/d \pm \epsilon$  (3)
- Fact: (2) implies that the words have a power law distribution.
- Norm determines frequency; spatial orientation determines “meaning”
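Eq. (3) follows from (1) and (2) by subtraction. A short check, using the $1/(2d)$ normalization of the squared norms and absorbing the three $\pm\epsilon$ terms into $O(\epsilon)$:

```latex
\begin{align*}
\mathrm{PMI}(w, w')
  &= \log \Pr[w, w'] - \log \Pr[w] - \log \Pr[w'] \\
  &= \frac{\|v_w + v_{w'}\|^2 - \|v_w\|^2 - \|v_{w'}\|^2}{2d}
     - 2\log Z + \log Z + \log Z \pm O(\epsilon) \\
  &= \frac{2\langle v_w, v_{w'} \rangle}{2d} \pm O(\epsilon)
   = \frac{\langle v_w, v_{w'} \rangle}{d} \pm O(\epsilon).
\end{align*}
```

The $\log Z$ terms cancel exactly, which is why PMI, unlike the raw log co-occurrence probability, needs no additive offset.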

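- word2vec [Mikolov et al’13]: $\Pr[w_{i+6} \mid w_{i+1}, \dots, w_{i+5}] \propto \exp\langle v_{w_{i+6}}, \tfrac{1}{5}(v_{w_{i+1}} + \cdots + v_{w_{i+5}}) \rangle$
- GloVe [Pennington et al’14]: $\log \Pr[w, w'] \approx \langle v_w, v_{w'} \rangle + s_w + s_{w'} + C$
  Compare Eq. (1): $\log \Pr[w, w'] = \|v_w + v_{w'}\|^2/(2d) - 2\log Z \pm \epsilon$
- [Levy-Goldberg’14]: $\mathrm{PMI}(w, w') \approx \langle v_w, v_{w'} \rangle + C$
  Compare Eq. (3): $\mathrm{PMI}(w, w') = \langle v_w, v_{w'} \rangle/d \pm \epsilon$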

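- word2vec [Mikolov et al’13]: $\Pr[w_{i+6} \mid w_{i+1}, \dots, w_{i+5}] \propto \exp\langle v_{w_{i+6}}, \tfrac{1}{5}(v_{w_{i+1}} + \cdots + v_{w_{i+5}}) \rangle$, where the averaged term is the max-likelihood estimate of $c_{i+6}$
[Diagram: discourse vectors $c_{i+4}, c_{i+5}, c_{i+6}$ emitting words $w_{i+4}, w_{i+5}, w_{i+6}$]
- Under our model,
  - Random walk is slow: $c_{i+1} \approx c_{i+2} \approx \cdots \approx c_{i+6} \approx c$
  - Best estimate for current discourse $c_{i+6}$: $\arg\max_{c, \|c\| \le 1} \Pr[c \mid w_{i+1}, \dots, w_{i+5}] = \alpha (v_{w_{i+1}} + \cdots + v_{w_{i+5}})$
  - Prob. distribution of next word given the best guess $c$: $\Pr[w_{i+6} \mid c_{i+6} = \alpha (v_{w_{i+1}} + \cdots + v_{w_{i+5}})] \propto \exp\langle v_{w_{i+6}}, \alpha (v_{w_{i+1}} + \cdots + v_{w_{i+5}}) \rangle$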

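This talk: window of size 2
[Diagram: discourse vectors $c \to c'$ emitting words $w$ and $w'$, with $\Pr[w \mid c] \propto \exp\langle v_w, c \rangle$ and $\Pr[w' \mid c'] \propto \exp\langle v_{w'}, c' \rangle$]
- $\Pr[w \mid c] = \frac{1}{Z_c} \cdot \exp\langle v_w, c \rangle$
- $Z_c = \sum_w \exp\langle v_w, c \rangle$ (partition function)
$\Pr[w, w'] = \int \Pr[w \mid c] \Pr[w' \mid c'] \, p(c, c') \, dc \, dc' = \int \frac{1}{Z_c Z_{c'}} \cdot \exp\langle v_w, c \rangle \exp\langle v_{w'}, c' \rangle \, p(c, c') \, dc \, dc'$
- For a spherical Gaussian vector $c$ (covariance $\mathrm{Id}/d$): $\mathbb{E} \exp\langle v, c \rangle = \exp(\|v\|^2/(2d))$
- Assume $c = c'$ with probability 1; how to handle the factor $\frac{1}{Z_c Z_{c'}}$ is still open (??). Without it, $\int \exp\langle v_w + v_{w'}, c \rangle \, p(c) \, dc = \exp(\|v_w + v_{w'}\|^2/(2d))$
Eq. (1): $\log \Pr[w, w'] = \|v_w + v_{w'}\|^2/(2d) - 2\log Z \pm \epsilon$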
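The Gaussian identity used in this calculation can be checked by Monte Carlo. The sketch assumes $c \sim \mathcal{N}(0, \mathrm{Id}/d)$, so that $\langle v, c \rangle$ is a scalar Gaussian with variance $\|v\|^2/d$ and its exponential has mean $\exp(\|v\|^2/(2d))$. The dimension, sample size, and the rescaling of `v` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10

v = rng.normal(size=d)
v *= np.sqrt(d) / np.linalg.norm(v)             # rescale so that ||v||^2 = d

# Draw many discourse vectors c ~ N(0, Id/d); each row is one sample.
c = rng.normal(size=(200_000, d)) / np.sqrt(d)

monte_carlo = np.exp(c @ v).mean()              # empirical E[exp(<v, c>)]
closed_form = np.exp(v @ v / (2 * d))           # exp(||v||^2 / (2d))
```

With $\|v\|^2 = d$ the closed form is $e^{1/2}$, and the empirical mean should agree with it up to sampling error.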

This talk: window of size 2 (same model: $\Pr[w \mid c] = \frac{1}{Z_c} \cdot \exp\langle v_w, c \rangle$, with partition function $Z_c = \sum_w \exp\langle v_w, c \rangle$)
Lemma 1: for almost all $c$, almost all $v_w$, $Z_c = (1 + o(1)) Z$
- Proof (sketch):
  - for most $c$, $Z_c$ concentrates around its mean
  - the mean of $Z_c$ is determined by $\|c\|$, which in turn concentrates
  - caveat: $\exp\langle v, c \rangle$ for $v \sim \mathcal{N}(0, \mathrm{Id})$ is not subgaussian, nor sub-exponential ($\alpha$-Orlicz norm is not bounded for any $\alpha > 0$)
Eq. (1): $\log \Pr[w, w'] = \|v_w + v_{w'}\|^2/(2d) - 2\log Z \pm \epsilon$

Lemma 1: for almost all $c$, almost all $v_w$, $Z_c = (1 + o(1)) Z$
- Proof sketch:
  - Fixing $c$, show that with high probability over the choices of the $v_w$’s: $Z_c = \sum_w \exp\langle v_w, c \rangle = (1 + o(1)) \, \mathbb{E}[Z_c]$
  - $z_w = \langle v_w, c \rangle$ is a scalar Gaussian random variable
  - $\|c\|$ governs the mean and variance of $z_w$
  - $\|c\|$ in turn is concentrated
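The concentration claimed in Lemma 1 can be observed in a small simulation. This is an empirical sketch, not a proof: it assumes word vectors drawn from $\mathcal{N}(0, \mathrm{Id})$ (that is, $s = 1$), unit-norm discourse vectors, and arbitrary sizes. Under these assumptions each $z_w = \langle v_w, c \rangle$ is standard Gaussian, so $\mathbb{E}[Z_c] = n \cdot e^{1/2}$ for every such $c$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20_000, 20

V = rng.normal(size=(n, d))                     # word vectors, taking s = 1

C = rng.normal(size=(100, d))
C /= np.linalg.norm(C, axis=1, keepdims=True)   # 100 unit-norm discourse vectors

Z = np.exp(V @ C.T).sum(axis=0)                 # partition function Z_c for each c
expected = n * np.exp(0.5)                      # E[Z_c] when ||c|| = 1
ratios = Z / expected                           # should all be close to 1
```

The same set of word vectors is reused for every $c$, so the experiment shows $Z_c$ staying close to one common value $Z$ across discourse vectors, which is what lets the $\frac{1}{Z_c Z_{c'}}$ factor be pulled out as $1/Z^2$.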
