DH 2017, August 11, 2017, Montreal, Canada

Don’t Get Fooled by Word Embeddings — Better Watch Their Neighborhood

Johannes Hellrich (1,2) & Udo Hahn (2)

1: Graduate School 'The Romantic Model', Friedrich Schiller University Jena, Jena, Germany (http://www.modellromantik.uni-jena.de)
2: Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, Jena, Germany (http://www.julielab.de)
"You shall know a word by the company it keeps!" (Firth, 1957)

He reads a poem. She reads a novel.
The novel has 312 pages. The poem fits on two pages.
She listens to an opera. He listens to jazz.
Counting Cooccurrences

        read  pages  hate  enjoy  listen  …
novel     98     60     3     56       2
poem      67     10     1     47       8
opera      4      8     0     42      38
jazz       2      1     2     61      47
…
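A table like the one above can be built by counting which words appear near each other in a corpus. The following is a minimal sketch of window-based cooccurrence counting; the function name, the window size, and the toy sentences are illustrative, not from the talk:

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context word) pair occurs within a window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Toy corpus in the spirit of the example sentences above.
sentences = [
    ["he", "reads", "a", "poem"],
    ["she", "reads", "a", "novel"],
]
counts = cooccurrence_counts(sentences)
```

Each row of the cooccurrence table then corresponds to all counts sharing the same first word.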
Vector Representation

[figure: the count rows for novel, poem, opera, and jazz plotted as vectors in a space whose axes are context words such as read and listen]
Distance and Similarity

[figure: the same word vectors; words with similar contexts, such as novel and poem, point in similar directions, so the angle between vectors measures similarity]
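The angle-based similarity illustrated above is usually computed as cosine similarity. A minimal sketch, using the count rows from the table two slides back (the variable names are mine):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product of u and v divided by their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Count vectors over the contexts (read, pages, hate, enjoy, listen).
novel = [98, 60, 3, 56, 2]
poem = [67, 10, 1, 47, 8]
jazz = [2, 1, 2, 61, 47]

# novel should be more similar to poem than to jazz.
sim_poem = cosine(novel, poem)
sim_jazz = cosine(novel, jazz)
```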
Dimensionality Problem
• One dimension per word
• 50k to 100k dimensions → large files and slow operations
• What about synonyms? It shouldn't matter whether I buy or purchase a novel
Word Embeddings
• Represent words as dense vectors with 200–500 instead of 50k–100k dimensions
• Very popular in computational linguistics and the digital humanities
• Better at judging word similarity
Application in DH: Semantic Development of Herz 'heart'

[figure: cosine similarity of Herz to Gehirn 'brain', Lunge 'lung', Gemüth 'mind', and erschrecke 'frighten' over the years 1799–2009]

• Hellrich & Hahn, DH 2016
• First applied by Kim et al., ACL 2014 Workshop on Language Technologies and Computational Social Science
Types of Word Embeddings

[figure: two pipelines from raw text to low-dimensional vectors for novel, poem, and opera. Singular value decomposition goes through an explicit word–context count matrix; neural word embeddings are trained directly on the text]
Neural Word Embeddings
• Extremely popular skip-gram negative sampling algorithm, SGNS/word2vec (Mikolov et al., NIPS 2013)
• Alternative neural embeddings using an explicit cooccurrence matrix: GloVe (Pennington et al., EMNLP 2014)

[figure: SGNS architecture, predicting the context words w(t-2), w(t-1), w(t+1), w(t+2) from the input word w(t) via a projection layer]
Training Neural Word Embeddings
• Word embeddings are updated incrementally while reading through the text
• Training tries to minimize false predictions (cost function)
• This leads to a local, yet rarely the global, minimum of the cost function
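The "minimize false predictions" idea can be made concrete with a single SGNS-style gradient step: push the score of an observed word–context pair up and the score of a randomly sampled negative pair down. This is a minimal sketch of the update rule, not word2vec itself; the vector dimensionality, learning rate, and random initialization are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(w, c_pos, c_neg):
    """SGNS cost for one positive and one negative context pair."""
    return -np.log(sigmoid(w @ c_pos)) - np.log(sigmoid(-(w @ c_neg)))

dim, lr = 5, 0.1
w = rng.normal(scale=0.1, size=dim)      # target word vector
c_pos = rng.normal(scale=0.1, size=dim)  # observed context word vector
c_neg = rng.normal(scale=0.1, size=dim)  # sampled "negative" context vector

before = loss(w, c_pos, c_neg)
for _ in range(100):
    g_pos = sigmoid(w @ c_pos) - 1.0     # gradient wrt the positive score
    g_neg = sigmoid(w @ c_neg)           # gradient wrt the negative score
    w_grad = g_pos * c_pos + g_neg * c_neg
    c_pos -= lr * g_pos * w
    c_neg -= lr * g_neg * w
    w -= lr * w_grad
after = loss(w, c_pos, c_neg)
```

Because the updates follow the gradient from a random starting point, repeated training runs end up in different local minima, which is exactly the reliability issue this talk examines.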
Singular Value Decomposition
• Express the cooccurrence matrix as U Σ Vᵀ
• U represents words, Vᵀ context words
• Σ measures the importance of each dimension
Singular Value Decomposition (cont.)
• Classical SVD embeddings: truncate to d dimensions, keeping only the d most important dimensions according to Σ
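The truncation step can be sketched with NumPy; the toy matrix reuses the cooccurrence counts from the earlier table, and keeping the rows of U scaled by Σ as word vectors is one common convention (the exact weighting varies between SVD embedding variants):

```python
import numpy as np

# Toy word-context count matrix (rows: novel, poem, opera, jazz;
# columns: read, pages, hate, enjoy, listen).
M = np.array([[98., 60., 3., 56., 2.],
              [67., 10., 1., 47., 8.],
              [4., 8., 0., 42., 38.],
              [2., 1., 2., 61., 47.]])

# M = U @ diag(S) @ Vt; S holds the singular values in decreasing order.
U, S, Vt = np.linalg.svd(M, full_matrices=False)

d = 2  # keep only the d most important dimensions
word_vectors = U[:, :d] * S[:d]  # dense d-dimensional word embeddings
```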
SVD PPMI
• Levy et al., TACL 2015
• Uses positive pointwise mutual information (PPMI) instead of raw frequency
• Pre- and postprocessing inspired by SGNS and GloVe
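PPMI replaces each raw count with how much more often the word–context pair occurs than chance would predict, clipped at zero. A minimal sketch of the transform (the function name is mine):

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information from a word-context count matrix."""
    total = counts.sum()
    p_word = counts.sum(axis=1, keepdims=True) / total     # word marginals
    p_context = counts.sum(axis=0, keepdims=True) / total  # context marginals
    p_joint = counts / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_joint / (p_word * p_context))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts give -inf; treat as no association
    return np.maximum(pmi, 0.0)   # "positive": clip negative associations to 0
```

The SVD step from the previous slide is then applied to this PPMI matrix instead of the raw counts.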
Measuring Reliability
• Train multiple models with identical parameters on one corpus
• Measure the percentage of identical neighborhoods for each word between models
• Hellrich & Hahn, COLING 2016
Measuring Reliability (cont.)
• Example: no agreement at neighborhood size 1 for poem

[figure: three models in which the single nearest neighbor of poem differs]
Measuring Reliability (cont.)
• Example: agreement at neighborhood size 2 for poem

[figure: across the three models, the two nearest neighbors of poem form the same set]
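The reliability measure sketched on these slides compares nearest-neighbor sets across repeated training runs. A minimal sketch, assuming each "model" is a matrix of word vectors with a shared vocabulary ordering; the helper names and toy models are mine:

```python
import numpy as np

def nearest_neighbors(vectors, word_index, n):
    """Set of indices of the n nearest neighbors of a word, by cosine similarity."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed[word_index]
    order = np.argsort(-sims)
    return set(order[order != word_index][:n])  # drop the word itself

def reliability(models, word_index, n):
    """Fraction of model pairs whose n-neighborhoods for the word are identical."""
    hoods = [nearest_neighbors(m, word_index, n) for m in models]
    pairs = [(a, b) for i, a in enumerate(hoods) for b in hoods[i + 1:]]
    return sum(a == b for a, b in pairs) / len(pairs)

# Two toy "models" that agree on word 0's nearest neighbor, and one that does not.
m1 = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
m2 = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]])
m3 = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
rel = reliability([m1, m2, m3], word_index=0, n=1)
```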
Experiment
• 3 models each for SGNS, GloVe, and SVD PPMI
• Trained on a corpus of 645 German texts from the 19th century, a subset of the Deutsches Textarchiv 'German Text Archive'
• Technical details:
  • window size 5
  • 300 dimensions
  • hyperwords toolkit
Reliability for Herz 'heart'

Model         | First neighbor     | Second neighbor     | Third neighbor     | Fourth neighbor      | Fifth neighbor
SGNS 1        | schmerzen 'pain'   | beklommen 'anxious' | busen 'bosom'      | bluten 'to bleed'    | herzen 'to caress'
SGNS 2        | bluten 'to bleed'  | klopfend 'beating'  | busen 'bosom'      | beklommen 'anxious'  | herzen 'to caress'
SGNS 3        | herzen 'to caress' | busen 'bosom'       | klopfend 'beating' | beklommen 'anxious'  | bluten 'to bleed'
GloVe 1       | gemüt 'mind'       | mein 'my'           | seele 'soul'       | liebe 'love'         | brust 'chest'
GloVe 2       | gemüt 'mind'       | mein 'my'           | seele 'soul'       | brust 'chest'        | liebe 'love'
GloVe 3       | gemüt 'mind'       | mein 'my'           | seele 'soul'       | brust 'chest'        | liebe 'love'
SVD PPMI, all | busen 'bosom'      | fühlen 'to feel'    | liebe 'love'       | schmerzen 'pain'     | menschenherz 'human heart'