Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert - PowerPoint PPT Presentation

Enter MIMICK ● What data do we have, post-unlabeled corpus? ○ Vector dictionary ○ Orthography (the way words are spelled) ● Use the former as training objective, latter as input ● Pre-trained vectors as target ○ No need to access original unlabeled corpus ○ Many training examples Unlabeled Supervised ○ (No context) Unlabeled corpus task corpus Unlabeled corpus Unlabeled corpus corpus OOV OOV 11

Enter MIMICK ● What data do we have, post-unlabeled corpus? ○ Vector dictionary ○ Orthography (the way words are spelled) ● Use the former as training objective, latter as input ● Pre-trained vectors as target ○ No need to access original unlabeled corpus ○ Many training examples Unlabeled Supervised ○ (No context) Unlabeled corpus task corpus Unlabeled corpus Unlabeled ● Subword units as inputs corpus corpus ○ Very extensible ○ (Character inventory changes?) OOV OOV 11

MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make m a k e 12

MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make Character embeddings m a k e 12

MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make Forward LSTM Character embeddings m a k e 12

MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make Backward LSTM Forward LSTM Character embeddings m a k e 12

MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Multilayered Perceptron Unlabeled text make Backward LSTM Forward LSTM Character embeddings m a k e 12

MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) Loss (L 2 ) All possible text Mimicked Embedding Multilayered Perceptron Unlabeled text make Backward LSTM Forward LSTM Character embeddings m a k e 12

MIMICK Inference All possible text Mimicked Embedding Multilayered Perceptron Unlabeled text blah Backward LSTM Forward LSTM Character embeddings b l a h 13

Observation – Nearest Neighbors 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve ○ םיירט → םיירטמואיג geometric (m.pl., nontrad. spelling) geometric (m.pl.) 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve ○ םיירט → םיירטמואיג geometric (m.pl., nontrad. spelling) geometric (m.pl.) ○ ךרא → ציר ’ ןוסדר Richardson Eustrach 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve ○ םיירט → םיירטמואיג geometric (m.pl., nontrad. spelling) geometric (m.pl.) ○ ךרא → ציר ’ ןוסדר Richardson Eustrach ● ✔ Surface form ✔ Syntactic properties ✘ Semantics 14

Intrinsic Evaluation – RareWords 15

Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words 15

Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words Names ● Domain-specific jargon ● Foreign words ● Rare(-ish) morphological ● derivations Nonce words ● Nonstandard orthography ● Typos and other errors ● ... ● 15

Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words NN FUN!!! Nearest: programmatic transformational Names ● mechanistic Domain-specific jargon ● transactional Foreign words ● contextual Rare(-ish) morphological ● derivations Nonce words ● Nonstandard orthography ● Typos and other errors ● ... ● 15

Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● Names ● Domain-specific jargon ● Foreign words ● Rare(-ish) morphological derivations Nonce words ● ● Nonstandard orthography ● Typos and other errors ● ... 16

Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) DT NN VBZ VBG ● Names ● Domain-specific jargon Backward LSTM ● Foreign words ● Rare(-ish) morphological Forward derivations LSTM Nonce words ● ● Nonstandard Word orthography embeddings ● Typos and other errors the cat is sitting ● ... 16

Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) Tense -- -- pres -- Number -- sing -- -- ● Attributes - same as POS layer POS DT NN VBZ VBG Backward LSTM Forward LSTM Word embeddings the cat is sitting 17

Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) Tense -- -- pres -- Number -- sing -- -- ● Attributes - same as POS layer POS DT NN VBZ VBG ● Negative effect on POS Backward LSTM Forward LSTM Word embeddings the cat is sitting 17

Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) Tense -- -- pres -- Number -- sing -- -- ● Attributes - same as POS layer POS DT NN VBZ VBG ● Negative effect on POS ● Attribute evaluation metric Backward LSTM ○ Micro F1 Forward LSTM Word embeddings the cat is sitting 17

Language Selection Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches MRLs (e.g. Slavic languages) ● Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches MRLs (e.g. Slavic languages) ● ○ Much word-level data Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches MRLs (e.g. Slavic languages) ● ○ Much word-level data ○ Relatively free word order Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative Institutional ● Geneological diversity Entrepreneurial Linguistic ○ 13 Indo-European (7 different branches) Anatomical ○ 10 from 8 non-IE branches Ideological MRLs (e.g. Slavic languages) ● ○ Much word-level data ○ Relatively free word order Minna Sundberg 18

Language Selection (contd.) 19

Language Selection (contd.) ● Script type 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) ● OOV rate (UD against Polyglot vocabulary) 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) ● OOV rate (UD against Polyglot vocabulary) ○ 16.9%-70.8% type-level (median 29.1%) 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) ● OOV rate (UD against Polyglot vocabulary) ○ 16.9%-70.8% type-level (median 29.1%) ○ 2.2%-33.1% token-level (median 9.2%) 19

Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding the flatfish is sitting 20

Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK the flatfish is sitting 20

Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK ● CHAR2TAG - additional RNN layer ○ 3x Training time the flatfish is sitting Char- Char- Char- Char- LSTM LSTM LSTM LSTM 20

Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK ● CHAR2TAG - additional RNN layer ○ 3x Training time ● BOTH: MIMICK + CHAR2TAG the flatfish is sitting Char- Char- Char- Char- LSTM LSTM LSTM LSTM 20

Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK POINT ● CHAR2TAG - additional RNN layer UNION ○ 3x Training time ROAD LIGHT ● BOTH: MIMICK + CHAR2TAG LONG the flatfish is sitting Char- Char- Char- Char- LSTM LSTM LSTM LSTM 20

Results - Full Data NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Morpho. Attributes (micro F1) POS tags (accuracy) 21

Results - 5,000 training tokens NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Morpho. Attributes (micro F1) POS tags (accuracy) 22

Results - Language Types (5,000 tokens) NONE MIMICK CHAR2TAG BOTH Slavic languages POS 23

Results - Language Types (5,000 tokens) NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Agglutinative languages morpho. attribute F1 Slavic languages POS 23

Results - Chinese NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Morpho. Attributes (micro F1) POS tags (accuracy) 24

Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert - PowerPoint PPT Presentation

Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert Guthrie, Jacob Eisenstein @yuvalpi Presented at EMNLP September 2017, Copenhagen The Word Embedding Pipeline Unlabeled Unlabeled corpus Unlabeled corpus Unlabeled corpus

Word Embeddings Natural Language Processing VU (706.230) - Andi Rexha 02/04/2020 Word Embeddings

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word embeddings Rappel Embeddings ( pas Word Embeddings ) Est une lookup table Formalisme:

Mixed membership word embeddings: Corpus-specific embeddings without big data James Foulds

Word Embeddings through Hellinger PCA Rmi Lebret and Ronan Collobert Idiap Research Institute /

Word Embeddings Tutorial HILA GONEN PHD STUDENT AT YOAV GOLDBERGS LAB BAR ILAN UNIVERSITY

Word Embeddings Revisited: Contextual Embeddings CS 6956: Deep Learning for NLP Overview

Symmetric Pattern Based Word Embeddings for Improved Word Similarity Prediction Roy Schwartz + ,

Searching for the X-Factor: Exploring Corpus Subjectivity for Word Embeddings Maksim Tkachenko

Pattern-based Solutions to Limitations of Leading Word Embeddings Roy Schwartz University of

Word, Sense and Contextualized Embeddings: Vector Representations of Meaning in NLP Jose

Dense Word Embeddings CMSC 470 Marine Carpuat Slides credit: Jurasky & Martin How to

Dense Word Embeddings CMSC 470 Marine Carpuat Slides credit: Jurasky & Martin How to

Learning Word Embeddings for Low-resource Languages by PU Learning Chao Jiang, Hsiang-Fu Yu,

Humor in Word Embeddings: Cockamamie Gobbledegook for Nincompoops Limor Gultchin, Genevieve

Lecture 8: NLP and Word Embeddings Alireza Akhavan Pour CLASS.VISION

Motivation Current Scenario : Rising interest in vector space word embeddings and their use, given

USING PSEUDO-SENSES FOR IMPROVING THE EXTRACTION OF SYNONYMS FROM WORD EMBEDDINGS Olivier Ferret

Generalizing Word Embeddings using Bag of Subwords Jinman Zhao , Sidharth Mudgal, Yingyu Liang

Deep convolutional acoustic word embeddings using word-pair side information Herman Kamper 1 ,

word2vec Kuan-Ting Lai 2020/5/28 Word2vec (Word Embeddings) Embed one-hot encoded word

Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert - PowerPoint PPT Presentation

Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert Guthrie, Jacob Eisenstein @yuvalpi Presented at EMNLP September 2017, Copenhagen The Word Embedding Pipeline Unlabeled Unlabeled corpus Unlabeled corpus Unlabeled corpus

Word Embeddings Natural Language Processing VU (706.230) - Andi Rexha 02/04/2020 Word Embeddings

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word embeddings Rappel Embeddings ( pas Word Embeddings ) Est une lookup table Formalisme:

Mixed membership word embeddings: Corpus-specific embeddings without big data James Foulds

Word Embeddings through Hellinger PCA Rmi Lebret and Ronan Collobert Idiap Research Institute /

Word Embeddings Tutorial HILA GONEN PHD STUDENT AT YOAV GOLDBERGS LAB BAR ILAN UNIVERSITY

Word Embeddings Revisited: Contextual Embeddings CS 6956: Deep Learning for NLP Overview

Symmetric Pattern Based Word Embeddings for Improved Word Similarity Prediction Roy Schwartz + ,

Searching for the X-Factor: Exploring Corpus Subjectivity for Word Embeddings Maksim Tkachenko

Pattern-based Solutions to Limitations of Leading Word Embeddings Roy Schwartz University of

Word, Sense and Contextualized Embeddings: Vector Representations of Meaning in NLP Jose

Dense Word Embeddings CMSC 470 Marine Carpuat Slides credit: Jurasky &amp; Martin How to

Dense Word Embeddings CMSC 470 Marine Carpuat Slides credit: Jurasky &amp; Martin How to

Learning Word Embeddings for Low-resource Languages by PU Learning Chao Jiang, Hsiang-Fu Yu,

Humor in Word Embeddings: Cockamamie Gobbledegook for Nincompoops Limor Gultchin, Genevieve

Lecture 8: NLP and Word Embeddings Alireza Akhavan Pour CLASS.VISION

Motivation Current Scenario : Rising interest in vector space word embeddings and their use, given

USING PSEUDO-SENSES FOR IMPROVING THE EXTRACTION OF SYNONYMS FROM WORD EMBEDDINGS Olivier Ferret

Generalizing Word Embeddings using Bag of Subwords Jinman Zhao , Sidharth Mudgal, Yingyu Liang

Deep convolutional acoustic word embeddings using word-pair side information Herman Kamper 1 ,

word2vec Kuan-Ting Lai 2020/5/28 Word2vec (Word Embeddings) Embed one-hot encoded word

Dense Word Embeddings CMSC 470 Marine Carpuat Slides credit: Jurasky & Martin How to

Dense Word Embeddings CMSC 470 Marine Carpuat Slides credit: Jurasky & Martin How to