June 6th 2019, Minneapolis, USA, RepEval 2019
The Influence of Down-Sampling Strategies on SVD Word Embedding Stability
Johannes Hellrich, Bernd Kampe & Udo Hahn
Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, Jena, Germany
www.julielab.de
Typical Word Embeddings are Unstable
[Figure: corpus (lots of text) → random embeddings → random processing → final embeddings, with tiger, cat and dog as example words]
Typical Word Embeddings are Unstable (continued)
[Figure: the same pipeline, with tiger, cat and dog arranged differently in the final embeddings]
Measuring Stability
[Figure: corpus (lots of text) → three sets of embeddings, each with the example words tiger, cat and dog]
j@n := \frac{1}{|A|} \sum_{a \in A} \frac{\left| \bigcap_{m \in M} \mathrm{msw}(a, n, m) \right|}{\left| \bigcup_{m \in M} \mathrm{msw}(a, n, m) \right|}
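The j@n measure can be illustrated with a small sketch. It assumes that A is a set of anchor words, M a set of independently trained models, and msw(a, n, m) the n words most similar to a in model m. The neighbour lists and the function name `j_at_n` are hypothetical toy material, not results or code from the talk.

```python
from typing import Dict, List

# Hypothetical layout: model name -> anchor word -> neighbours ranked by similarity.
Neighbours = Dict[str, Dict[str, List[str]]]


def j_at_n(models: Neighbours, anchors: List[str], n: int) -> float:
    """Average Jaccard overlap of the top-n neighbour sets across all models."""
    total = 0.0
    for anchor in anchors:
        top_n = [set(ranked[anchor][:n]) for ranked in models.values()]
        shared = set.intersection(*top_n)  # words every model agrees on
        seen = set.union(*top_n)           # words any model proposes
        total += len(shared) / len(seen)
    return total / len(anchors)


# Toy neighbour lists for two training runs (assumed data).
toy_models: Neighbours = {
    "run1": {"cat": ["dog", "tiger"], "dog": ["cat", "wolf"]},
    "run2": {"cat": ["tiger", "dog"], "dog": ["cat", "fox"]},
}
stability = j_at_n(toy_models, ["cat", "dog"], n=2)
print(stability)
```

A value of 1 means that all models return the same n neighbours for every anchor word; the order of the neighbours within the top n does not matter.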
Why SVD Embeddings?
[Figure: corpus (lots of text) → counting → co-occurrence counts for tiger, cat and dog with the context words roar and food (example values: 475, 156; 823, 492; 51, 19) → SVD → final embeddings]
Why SVD Embeddings?
[Figure: corpus (lots of text) → counting & down-sampling → association values for tiger, cat and dog with the context words roar and food (example values: 0.02, 0.01; 0.5, 0.4; 0.01, 0.19) → SVD → final embeddings]
Counts are replaced with association values in SVD PPMI (Levy et al., TACL 2015)
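As a rough illustration of replacing counts with association values, the sketch below computes positive pointwise mutual information (PPMI) for a small count matrix in plain Python. The counts and the helper name `ppmi` are hypothetical. In SVD PPMI the resulting matrix would then be factorized with SVD, a step omitted here.

```python
import math
from typing import List


def ppmi(counts: List[List[float]]) -> List[List[float]]:
    """PPMI(w, c) = max(0, log(P(w, c) / (P(w) * P(c)))) from raw counts."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, n_wc in enumerate(row):
            if n_wc == 0:
                out_row.append(0.0)  # unseen pairs get no association
                continue
            # P(w, c) / (P(w) * P(c)) simplifies to n_wc * total / (row * col).
            pmi = math.log(n_wc * total / (row_sums[i] * col_sums[j]))
            out_row.append(max(0.0, pmi))
        result.append(out_row)
    return result


# Hypothetical word-by-context counts (rows: words, columns: context words).
toy_counts = [[10.0, 0.0], [0.0, 10.0]]
toy_ppmi = ppmi(toy_counts)
print(toy_ppmi)
```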
Why Down-Sampling?
• Avoids over-representing frequent words
• Closer context words are more salient than distant ones
→ Increased performance (Mikolov et al., NIPS 2013)
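The first point can be sketched with the frequent-word subsampling rule described by Mikolov et al. (2013), where one occurrence of a word w is discarded with probability 1 - sqrt(t / f(w)) for relative frequency f(w) and threshold t. The threshold and the frequencies below are illustrative assumptions, not the settings used in this work.

```python
import math


def discard_probability(rel_freq: float, threshold: float = 1e-5) -> float:
    """Probability of dropping one occurrence of a word with relative frequency rel_freq."""
    if rel_freq <= threshold:
        return 0.0  # words at or below the threshold are always kept
    return 1.0 - math.sqrt(threshold / rel_freq)


# Illustrative relative frequencies (assumed, not measured on any corpus).
for word, freq in [("the", 0.05), ("tiger", 1e-5), ("roar", 1e-6)]:
    print(word, round(discard_probability(freq), 4))
```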
Down-Sampling Mechanism
• Probabilistic: word2vec, SVD PPMI
• Weighting: GloVe, SVD wPPMI (new)
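The difference between the two mechanisms can be sketched for distance-based down-sampling. The example assumes a word2vec-style scheme in which a context word at distance d within a maximum window of size L is kept with probability (L - d + 1) / L. The probabilistic variant draws a random number per pair, while the weighting variant adds that probability as a fractional count. Function names and the toy sentence are hypothetical, and this is not the authors' implementation.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]


def keep_probability(distance: int, window: int) -> float:
    """Chance that a context word at this distance falls inside a sampled window."""
    return (window - distance + 1) / window


def cooccurrences(tokens: List[str], window: int,
                  rng: Optional[random.Random] = None) -> Dict[Pair, float]:
    """Count (word, context) pairs; sample if rng is given, otherwise weight."""
    counts: Dict[Pair, float] = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            p = keep_probability(d, window)
            for j in (i - d, i + d):
                if not 0 <= j < len(tokens):
                    continue
                if rng is None:
                    counts[(word, tokens[j])] += p    # deterministic weighting
                elif rng.random() < p:
                    counts[(word, tokens[j])] += 1.0  # probabilistic sampling
    return counts


toy_sentence = ["tiger", "cat", "dog"]
weighted = cooccurrences(toy_sentence, window=2)
sampled = cooccurrences(toy_sentence, window=2, rng=random.Random(1))
print(dict(weighted))
print(dict(sampled))
```

Because the weighting variant involves no random draws, repeated runs over the same text give identical counts, whereas the sampled counts depend on the random seed.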
Experimental Design I/II
• Three corpora:
  • Corpus of Historical American English, 2000s decade (COHA; 28M tokens)
  • English News Crawl Corpus (NEWS; 550M tokens)
  • Wikipedia (WIKI; 1.7G tokens)
→ Other studies mostly used COHA-sized corpora!
Experimental Design II/II
• Train 10 models each with SGNS, GloVe, SVD PPMI (without or with probabilistic down-sampling) and SVD wPPMI
• Evaluate intrinsically with four word similarity and two analogy test sets
• Measure stability with j@10 for the 1k most frequent words
Stability Results
GloVe's high stability (Antoniak & Mimno, TACL 2018; Wendlandt et al., NAACL 2018) holds only for small corpora
Exemplary Accuracy Results
A Wilcoxon rank-sum test shows SVD wPPMI and SGNS to be indistinguishable in accuracy over all test sets and corpora
Conclusion
• Typical word embeddings are unstable
• Down-sampling details greatly affect stability
• GloVe's stability is worse than claimed in the literature
• SVD wPPMI embeddings provide SGNS-like performance and perfect stability
• See paper for additional results (and bootstrapping)