DH 2017, August 11, 2017, Montreal, Canada

Don’t Get Fooled by Word Embeddings — Better Watch Their Neighborhood

Johannes Hellrich (1,2) & Udo Hahn (2)

1: Graduate School 'The Romantic Model', Friedrich Schiller University Jena, Jena, Germany (http://www.modellromantik.uni-jena.de)
2: Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, Jena, Germany (http://www.julielab.de)
"You shall know a word by the company it keeps!" (Firth, 1957)

He reads a poem. She reads a novel.
The novel has 312 pages. The poem fits on two pages.
She listens to an opera. He listens to jazz.
Counting Cooccurrences

        read  pages  hate  enjoy  listen  …
novel     98     60     3     56       2
poem      67     10     1     47       8
opera      4      8     0     42      38
jazz       2      1     2     61      47
…
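A table like the one above can be built by counting which words appear near each other in a corpus. The following is a minimal sketch of window-based cooccurrence counting; the function name, the window size, and the toy sentences are illustrative, not from the talk:

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context word) pair occurs within a window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Toy corpus in the spirit of the example sentences above.
sentences = [
    ["he", "reads", "a", "poem"],
    ["she", "reads", "a", "novel"],
]
counts = cooccurrence_counts(sentences)
```

Each row of the cooccurrence table then corresponds to all counts sharing the same first word.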
Vector Representation

[figure: the count rows for novel, poem, opera, and jazz plotted as vectors in a space whose axes are context words such as read and listen]
Distance and Similarity

[figure: the same word vectors; words with similar contexts, such as novel and poem, point in similar directions, so the angle between vectors measures similarity]
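The angle-based similarity illustrated above is usually computed as cosine similarity. A minimal sketch, using the count rows from the table two slides back (the variable names are mine):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product of u and v divided by their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Count vectors over the contexts (read, pages, hate, enjoy, listen).
novel = [98, 60, 3, 56, 2]
poem = [67, 10, 1, 47, 8]
jazz = [2, 1, 2, 61, 47]

# novel should be more similar to poem than to jazz.
sim_poem = cosine(novel, poem)
sim_jazz = cosine(novel, jazz)
```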
Dimensionality Problem
• One dimension per word
• 50k to 100k dimensions → large files and slow operations
• What about synonyms? It shouldn't matter whether I buy or purchase a novel
Word Embeddings
• Represent words as dense vectors with 200–500 instead of 50k–100k dimensions
• Very popular in computational linguistics and the digital humanities
• Better at judging word similarity
Application in DH: Semantic Development of Herz 'heart'

[figure: cosine similarity of Herz to Gehirn 'brain', Lunge 'lung', Gemüth 'mind', and erschrecke 'frighten' over the years 1799–2009]

• Hellrich & Hahn, DH 2016
• First applied by Kim et al., ACL 2014 Workshop on Language Technologies and Computational Social Science
Types of Word Embeddings

[figure: two pipelines from raw text to low-dimensional vectors for novel, poem, and opera. Singular value decomposition goes through an explicit word–context count matrix; neural word embeddings are trained directly on the text]
Neural Word Embeddings
• Extremely popular skip-gram negative sampling algorithm, SGNS/word2vec (Mikolov et al., NIPS 2013)
• Alternative neural embeddings using an explicit cooccurrence matrix: GloVe (Pennington et al., EMNLP 2014)

[figure: SGNS architecture, predicting the context words w(t-2), w(t-1), w(t+1), w(t+2) from the input word w(t) via a projection layer]
Training Neural Word Embeddings
• Word embeddings are updated incrementally while reading through the text
• Training tries to minimize false predictions (cost function)
• This leads to a local, yet rarely the global, minimum of the cost function
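The "minimize false predictions" idea can be made concrete with a single SGNS-style gradient step: push the score of an observed word–context pair up and the score of a randomly sampled negative pair down. This is a minimal sketch of the update rule, not word2vec itself; the vector dimensionality, learning rate, and random initialization are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(w, c_pos, c_neg):
    """SGNS cost for one positive and one negative context pair."""
    return -np.log(sigmoid(w @ c_pos)) - np.log(sigmoid(-(w @ c_neg)))

dim, lr = 5, 0.1
w = rng.normal(scale=0.1, size=dim)      # target word vector
c_pos = rng.normal(scale=0.1, size=dim)  # observed context word vector
c_neg = rng.normal(scale=0.1, size=dim)  # sampled "negative" context vector

before = loss(w, c_pos, c_neg)
for _ in range(100):
    g_pos = sigmoid(w @ c_pos) - 1.0     # gradient wrt the positive score
    g_neg = sigmoid(w @ c_neg)           # gradient wrt the negative score
    w_grad = g_pos * c_pos + g_neg * c_neg
    c_pos -= lr * g_pos * w
    c_neg -= lr * g_neg * w
    w -= lr * w_grad
after = loss(w, c_pos, c_neg)
```

Because the updates follow the gradient from a random starting point, repeated training runs end up in different local minima, which is exactly the reliability issue this talk examines.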
Singular Value Decomposition
• Express the cooccurrence matrix as U Σ Vᵀ
• U represents words, Vᵀ context words
• Σ measures the importance of each dimension
Singular Value Decomposition (cont.)
• Classical SVD embeddings: truncate to d dimensions, keeping only the d most important dimensions according to Σ
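The truncation step can be sketched with NumPy; the toy matrix reuses the cooccurrence counts from the earlier table, and keeping the rows of U scaled by Σ as word vectors is one common convention (the exact weighting varies between SVD embedding variants):

```python
import numpy as np

# Toy word-context count matrix (rows: novel, poem, opera, jazz;
# columns: read, pages, hate, enjoy, listen).
M = np.array([[98., 60., 3., 56., 2.],
              [67., 10., 1., 47., 8.],
              [4., 8., 0., 42., 38.],
              [2., 1., 2., 61., 47.]])

# M = U @ diag(S) @ Vt; S holds the singular values in decreasing order.
U, S, Vt = np.linalg.svd(M, full_matrices=False)

d = 2  # keep only the d most important dimensions
word_vectors = U[:, :d] * S[:d]  # dense d-dimensional word embeddings
```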
SVD PPMI
• Levy et al., TACL 2015
• Uses positive pointwise mutual information (PPMI) instead of raw frequency
• Pre- and postprocessing inspired by SGNS and GloVe
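PPMI replaces each raw count with how much more often the word–context pair occurs than chance would predict, clipped at zero. A minimal sketch of the transform (the function name is mine):

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information from a word-context count matrix."""
    total = counts.sum()
    p_word = counts.sum(axis=1, keepdims=True) / total     # word marginals
    p_context = counts.sum(axis=0, keepdims=True) / total  # context marginals
    p_joint = counts / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_joint / (p_word * p_context))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts give -inf; treat as no association
    return np.maximum(pmi, 0.0)   # "positive": clip negative associations to 0
```

The SVD step from the previous slide is then applied to this PPMI matrix instead of the raw counts.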
Measuring Reliability
• Train multiple models with identical parameters on one corpus
• Measure the percentage of identical neighborhoods for each word between models
• Hellrich & Hahn, COLING 2016
Measuring Reliability (cont.)
• Example: no agreement at neighborhood size 1 for poem

[figure: three models in which the single nearest neighbor of poem differs]
Measuring Reliability (cont.)
• Example: agreement at neighborhood size 2 for poem

[figure: across the three models, the two nearest neighbors of poem form the same set]
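The reliability measure sketched on these slides compares nearest-neighbor sets across repeated training runs. A minimal sketch, assuming each "model" is a matrix of word vectors with a shared vocabulary ordering; the helper names and toy models are mine:

```python
import numpy as np

def nearest_neighbors(vectors, word_index, n):
    """Set of indices of the n nearest neighbors of a word, by cosine similarity."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed[word_index]
    order = np.argsort(-sims)
    return set(order[order != word_index][:n])  # drop the word itself

def reliability(models, word_index, n):
    """Fraction of model pairs whose n-neighborhoods for the word are identical."""
    hoods = [nearest_neighbors(m, word_index, n) for m in models]
    pairs = [(a, b) for i, a in enumerate(hoods) for b in hoods[i + 1:]]
    return sum(a == b for a, b in pairs) / len(pairs)

# Two toy "models" that agree on word 0's nearest neighbor, and one that does not.
m1 = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
m2 = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]])
m3 = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
rel = reliability([m1, m2, m3], word_index=0, n=1)
```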
Experiment
• 3 models each for SGNS, GloVe, and SVD PPMI
• Trained on a corpus of 645 German texts from the 19th century, a subset of the Deutsches Textarchiv 'German Text Archive'
• Technical details:
  • window size 5
  • 300 dimensions
  • hyperwords toolkit
Reliability for Herz 'heart'

Model         | First neighbor     | Second neighbor     | Third neighbor     | Fourth neighbor      | Fifth neighbor
SGNS 1        | schmerzen 'pain'   | beklommen 'anxious' | busen 'bosom'      | bluten 'to bleed'    | herzen 'to caress'
SGNS 2        | bluten 'to bleed'  | klopfend 'beating'  | busen 'bosom'      | beklommen 'anxious'  | herzen 'to caress'
SGNS 3        | herzen 'to caress' | busen 'bosom'       | klopfend 'beating' | beklommen 'anxious'  | bluten 'to bleed'
GloVe 1       | gemüt 'mind'       | mein 'my'           | seele 'soul'       | liebe 'love'         | brust 'chest'
GloVe 2       | gemüt 'mind'       | mein 'my'           | seele 'soul'       | brust 'chest'        | liebe 'love'
GloVe 3       | gemüt 'mind'       | mein 'my'           | seele 'soul'       | brust 'chest'        | liebe 'love'
SVD PPMI, all | busen 'bosom'      | fühlen 'to feel'    | liebe 'love'       | schmerzen 'pain'     | menschenherz 'human heart'