mimicking word embeddings
play

Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert - PowerPoint PPT Presentation

Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert Guthrie, Jacob Eisenstein @yuvalpi Presented at EMNLP September 2017, Copenhagen The Word Embedding Pipeline Unlabeled Unlabeled corpus Unlabeled corpus Unlabeled corpus


  1. Enter MIMICK ● What data do we have, post-unlabeled corpus? ○ Vector dictionary ○ Orthography (the way words are spelled) ● Use the former as training objective, latter as input ● Pre-trained vectors as target ○ No need to access original unlabeled corpus ○ Many training examples Unlabeled Supervised ○ (No context) Unlabeled corpus task corpus Unlabeled corpus Unlabeled corpus corpus OOV OOV 11

  2. Enter MIMICK ● What data do we have, post-unlabeled corpus? ○ Vector dictionary ○ Orthography (the way words are spelled) ● Use the former as training objective, latter as input ● Pre-trained vectors as target ○ No need to access original unlabeled corpus ○ Many training examples Unlabeled Supervised ○ (No context) Unlabeled corpus task corpus Unlabeled corpus Unlabeled ● Subword units as inputs corpus corpus ○ Very extensible ○ (Character inventory changes?) OOV OOV 11

  3. MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make m a k e 12

  4. MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make Character embeddings m a k e 12

  5. MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make Forward LSTM Character embeddings m a k e 12

  6. MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make Backward LSTM Forward LSTM Character embeddings m a k e 12

  7. MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Multilayered Perceptron Unlabeled text make Backward LSTM Forward LSTM Character embeddings m a k e 12

  8. MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) Loss (L 2 ) All possible text Mimicked Embedding Multilayered Perceptron Unlabeled text make Backward LSTM Forward LSTM Character embeddings m a k e 12

  9. MIMICK Inference All possible text Mimicked Embedding Multilayered Perceptron Unlabeled text blah Backward LSTM Forward LSTM Character embeddings b l a h 13

  10. Observation – Nearest Neighbors 14

  11. Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) 14

  12. Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM 14

  13. Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly 14

  14. Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser 14

  15. Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew 14

  16. Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve 14

  17. Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve ○ םיירט → םיירטמואיג geometric (m.pl., nontrad. spelling) geometric (m.pl.) 14

  18. Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve ○ םיירט → םיירטמואיג geometric (m.pl., nontrad. spelling) geometric (m.pl.) ○ ךרא → ציר ’ ןוסדר Richardson Eustrach 14

  19. Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve ○ םיירט → םיירטמואיג geometric (m.pl., nontrad. spelling) geometric (m.pl.) ○ ךרא → ציר ’ ןוסדר Richardson Eustrach ● ✔ Surface form ✔ Syntactic properties ✘ Semantics 14

  20. Intrinsic Evaluation – RareWords 15

  21. Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words 15

  22. Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words 15

  23. Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words Names ● Domain-specific jargon ● Foreign words ● Rare(-ish) morphological ● derivations Nonce words ● Nonstandard orthography ● Typos and other errors ● ... ● 15

  24. Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words NN FUN!!! Nearest: programmatic transformational Names ● mechanistic Domain-specific jargon ● transactional Foreign words ● contextual Rare(-ish) morphological ● derivations Nonce words ● Nonstandard orthography ● Typos and other errors ● ... ● 15

  25. Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● Names ● Domain-specific jargon ● Foreign words ● Rare(-ish) morphological derivations Nonce words ● ● Nonstandard orthography ● Typos and other errors ● ... 16

  26. Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) DT NN VBZ VBG ● Names ● Domain-specific jargon Backward LSTM ● Foreign words ● Rare(-ish) morphological Forward derivations LSTM Nonce words ● ● Nonstandard Word orthography embeddings ● Typos and other errors the cat is sitting ● ... 16

  27. Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) Tense -- -- pres -- Number -- sing -- -- ● Attributes - same as POS layer POS DT NN VBZ VBG Backward LSTM Forward LSTM Word embeddings the cat is sitting 17

  28. Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) Tense -- -- pres -- Number -- sing -- -- ● Attributes - same as POS layer POS DT NN VBZ VBG ● Negative effect on POS Backward LSTM Forward LSTM Word embeddings the cat is sitting 17

  29. Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) Tense -- -- pres -- Number -- sing -- -- ● Attributes - same as POS layer POS DT NN VBZ VBG ● Negative effect on POS ● Attribute evaluation metric Backward LSTM ○ Micro F1 Forward LSTM Word embeddings the cat is sitting 17

  30. Language Selection Minna Sundberg 18

  31. Language Selection ● |UD ∩ Polyglot| = 44, we took 23 Minna Sundberg 18

  32. Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure Minna Sundberg 18

  33. Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional Minna Sundberg 18

  34. Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic Minna Sundberg 18

  35. Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating Minna Sundberg 18

  36. Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative Minna Sundberg 18

  37. Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity Minna Sundberg 18

  38. Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) Minna Sundberg 18

  39. Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches Minna Sundberg 18

  40. Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches MRLs (e.g. Slavic languages) ● Minna Sundberg 18

  41. Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches MRLs (e.g. Slavic languages) ● ○ Much word-level data Minna Sundberg 18

  42. Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches MRLs (e.g. Slavic languages) ● ○ Much word-level data ○ Relatively free word order Minna Sundberg 18

  43. Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative Institutional ● Geneological diversity Entrepreneurial Linguistic ○ 13 Indo-European (7 different branches) Anatomical ○ 10 from 8 non-IE branches Ideological MRLs (e.g. Slavic languages) ● ○ Much word-level data ○ Relatively free word order Minna Sundberg 18

  44. Language Selection (contd.) 19

  45. Language Selection (contd.) ● Script type 19

  46. Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts 19

  47. Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters 19

  48. Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion 19

  49. Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables 19

  50. Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens 19

  51. Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) 19

  52. Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) ● OOV rate (UD against Polyglot vocabulary) 19

  53. Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) ● OOV rate (UD against Polyglot vocabulary) ○ 16.9%-70.8% type-level (median 29.1%) 19

  54. Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) ● OOV rate (UD against Polyglot vocabulary) ○ 16.9%-70.8% type-level (median 29.1%) ○ 2.2%-33.1% token-level (median 9.2%) 19

  55. Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding the flatfish is sitting 20

  56. Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK the flatfish is sitting 20

  57. Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK ● CHAR2TAG - additional RNN layer ○ 3x Training time the flatfish is sitting Char- Char- Char- Char- LSTM LSTM LSTM LSTM 20

  58. Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK ● CHAR2TAG - additional RNN layer ○ 3x Training time ● BOTH: MIMICK + CHAR2TAG the flatfish is sitting Char- Char- Char- Char- LSTM LSTM LSTM LSTM 20

  59. Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK POINT ● CHAR2TAG - additional RNN layer UNION ○ 3x Training time ROAD LIGHT ● BOTH: MIMICK + CHAR2TAG LONG the flatfish is sitting Char- Char- Char- Char- LSTM LSTM LSTM LSTM 20

  60. Results - Full Data NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Morpho. Attributes (micro F1) POS tags (accuracy) 21

  61. Results - 5,000 training tokens NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Morpho. Attributes (micro F1) POS tags (accuracy) 22

  62. Results - Language Types (5,000 tokens) NONE MIMICK CHAR2TAG BOTH Slavic languages POS 23

  63. Results - Language Types (5,000 tokens) NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Agglutinative languages morpho. attribute F1 Slavic languages POS 23

  64. Results - Chinese NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Morpho. Attributes (micro F1) POS tags (accuracy) 24

Recommend


More recommend