
Character Eyes: Seeing Language through Character-Level Taggers

Character Eyes: Seeing Language through Character-Level Taggers. Yuval Pinter (@yuvalpi), Marc Marone (@ruyimarone), Jacob Eisenstein (@jacobeisenstein). Blackbox NLP 2019. https://github.com/ruyimarone/character-eyes


  1. Character Eyes: Seeing Language through Character-Level Taggers. Yuval Pinter (@yuvalpi), Marc Marone (@ruyimarone), Jacob Eisenstein (@jacobeisenstein). Blackbox NLP 2019. https://github.com/ruyimarone/character-eyes

  2. Taggers [Diagram: The/DET cat/N sg walked/V past fast/RB]

  3. Neural Taggers [Diagram: The/DET cat/N sg walked/V past fast/RB]

  4. Character-level Neural Taggers [Diagram: The/DET cat/N sg walked/V past fast/RB, spelled out as T h e | c a t | w a l k e d | f a s t]

  5. Character-level Recurrent Neural Taggers [Diagram: The/DET cat/N sg walked/V past fast/RB, spelled out as T h e | c a t | w a l k e d | f a s t]

  6. Recurrent Taggers – Good at Finding Morphemes? [Diagram: The/DET cat/N sg walked/V past fast/RB, spelled out as T h e | c a t | w a l k e d | f a s t]

  7. Recurrent Taggers – Good at Finding Morphemes? Agglutination [Diagram: The/DET cat/N sg walked/V past fast/RB, spelled out as T h e | c a t | w a l k e d | f a s t]

  8. Recurrent Taggers – Good at Prefixes and Suffixes? [Diagram: thecat/N sg;def walked/V past fast/RB, spelled out as t h e c a t | w a l k e d | f a s t]

  9. Recurrent Taggers – Good at Prefixes and Suffixes? Prefixing morphology (e.g. Coptic) [Diagram: thecat/N sg;def walked/V past fast/RB, spelled out as t h e c a t | w a l k e d | f a s t]

  10. Recurrent Taggers – Can They Handle diSCoNtinUiTY? [Diagram: The/DET cat/N sg waeldk/V past fast/RB, spelled out as T h e | c a t | w a e l d k | f a s t]

  11. Recurrent Taggers – Can They Handle diSCoNtinUiTY? Introflexive morphology (Hebrew, Arabic) [Diagram: The/DET cat/N sg waeldk/V past fast/RB, spelled out as T h e | c a t | w a e l d k | f a s t]
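
The three made-up word shapes used on the slides so far ("walked", "thecat", "waeldk") can be reproduced with a toy sketch. The function names and the insertion positions `[2, 3]` are illustrative assumptions chosen only to recreate the slide's forms; they are not a model of any real language.

```python
def suffix(stem, affix):
    """Suffixing: the affix follows the stem, as in walk + ed."""
    return stem + affix


def prefix(stem, affix):
    """Prefixing: the affix precedes the stem, as in the + cat."""
    return affix + stem


def interleave(stem, affix, positions):
    """Introflexion-like toy: insert affix[i] before stem index positions[i].

    positions must be ascending; leftover affix characters go at the end.
    """
    pending = list(zip(positions, affix))
    out = []
    for i, ch in enumerate(stem):
        while pending and pending[0][0] == i:
            out.append(pending.pop(0)[1])
        out.append(ch)
    out.extend(a for _, a in pending)
    return "".join(out)


walked = suffix("walk", "ed")
thecat = prefix("cat", "the")
waeldk = interleave("walk", "ed", [2, 3])
```

In the first two cases the affix is one contiguous character span at a word edge; in the third it is split across the stem, which is the discontinuity the slide title alludes to.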

  12. Main Idea(s) [Diagram: Model and Language, with the example strings w a l k e d, t h e c a t, w a e l d k]

  13. Main Idea(s)
      ● Measure how models encode different linguistic patterns
      [Diagram: Model and Language, with the example strings w a l k e d, t h e c a t, w a e l d k]

  14. Main Idea(s)
      ● Characterize languages based on model analysis; help engineer language-aware systems
      [Diagram: Model and Language, with the example strings w a l k e d, t h e c a t, w a e l d k]

  15. Analysis Primitive – Unit Decomposition [Diagram: The/DET cat/N sg walked/V past fast/RB, spelled out as T h e | c a t | w a l k e d | f a s t]

  16. Analysis Primitive – Unit Decomposition
      ● Assumption: units are “in charge” of tracking morphemes that help predict POS
      [Diagram: the tagged sentence above, with hidden unit #n highlighted]

  17. Analysis Primitive – Unit Decomposition
      ● Assumption: units are “in charge” of tracking morphemes that help predict POS
      ● Hypothesis: easy for agglutinations, difficult for introflexions
      [Diagram: the tagged sentence above, with hidden unit #n highlighted]

  18. Analysis Primitive – Unit Decomposition
      ● Assumption: units are “in charge” of tracking morphemes that help predict POS
      ● Hypothesis: easy for agglutinations, difficult for introflexions
      ● Hypothesis: unit’s direction affects ease of tracking suffixes vs. prefixes
      [Diagram: the tagged sentence above, with hidden units #n and #m highlighted]

  19. Evidence?
      ● Turkish is an agglutinative language
          ○ ev ‘house’; evler ‘houses’; evleriniz ‘your houses’; evlerinizden ‘from your houses’

  20. Evidence?
      ● Turkish is an agglutinative language
          ○ ev ‘house’; evler ‘houses’; evleriniz ‘your houses’; evlerinizden ‘from your houses’
      [Figure: activations of Unit 3 (→)]

  21. Evidence?
      ● Turkish is an agglutinative language
          ○ ev ‘house’; evler ‘houses’; evleriniz ‘your houses’; evlerinizden ‘from your houses’
      [Figure: activations of Unit 3 (→) and Unit 124 (←)]
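
The Turkish example stacks one suffix at a time, which a short sketch can make explicit. The segmentation ev + ler + iniz + den and the gloss labels below are standard Turkish grammar added here for illustration; they do not appear on the slides, and the variable names are hypothetical.

```python
from itertools import accumulate

# (morpheme, gloss) pairs; glosses are an added annotation, not slide content.
MORPHEMES = [
    ("ev", "house"),
    ("ler", "plural"),
    ("iniz", "your (2nd person plural possessive)"),
    ("den", "from (ablative)"),
]

# Each form is the previous one plus a single suffix.
forms = list(accumulate(m for m, _ in MORPHEMES))

# Character offset at which each morpheme ends, counted from the left.
boundaries = list(accumulate(len(m) for m, _ in MORPHEMES))

for form, end in zip(forms, boundaries):
    print(f"{form:<14} ends at character {end}")
```

Because every morpheme is a contiguous span with a clean boundary, a unit reading in one direction crosses the boundaries one by one, which is the kind of pattern the per-unit figures on this slide are meant to show.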

  22. Model & Data [Diagram: The/DET cat/N sg walked/V past fast/RB, spelled out as T h e | c a t | w a l k e d | f a s t]

  23. Model & Data
      ● Universal Dependencies (n=24)
          ○ POS tags + Morphosyntactic Descriptions

  24. Model & Data
      ● Universal Dependencies (n=24)
          ○ POS tags + Morphosyntactic Descriptions
      ● Linguistic diversity – morph. synthesis:
          ○ 5 agglutinative languages
          ○ 2 introflexive languages
          ○ 3 isolating, 14 fusional
      Source for language classes: WALS

  25. Model & Data
      ● Universal Dependencies (n=24)
          ○ POS tags + Morphosyntactic Descriptions
      ● Linguistic diversity – morph. synthesis:
          ○ 5 agglutinative languages
          ○ 2 introflexive languages
          ○ 3 isolating, 14 fusional
      ● Linguistic diversity – affixation:
          ○ (All) 1 prefixing language
          ○ 2 non-affixing
          ○ 2 equally pre- and suffixing
          ○ 19 suffixing
      Source for language classes: WALS

  26. Model & Data
      ● Universal Dependencies (n=24)
          ○ POS tags + Morphosyntactic Descriptions
          ○ Linguistic diversity (synthesis + affixation)
      ● Word → Tag: Bidirectional LSTM + MLP
          ○ (Not analyzed)
          ○ No word embeddings

  27. Model & Data
      ● Universal Dependencies (n=24)
          ○ POS tags + Morphosyntactic Descriptions
          ○ Linguistic diversity (synthesis + affixation)
      ● Word → Tag: Bidirectional LSTM + MLP
          ○ (Not analyzed)
          ○ No word embeddings
      ● Char → Word: Bidirectional LSTM
          ○ Char embedding size: 256
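
To make the Char → Word step concrete, here is a minimal pure-Python sketch of a bidirectional recurrent character encoder. It deliberately swaps the LSTM cell for a plain tanh recurrent cell to stay short, uses random untrained weights, and uses toy sizes (`EMB = 8`, `HID = 4`) rather than the 256-dimensional character embeddings on the slide. All names are hypothetical, and concatenating the two final states is one common choice, assumed here rather than taken from the slides.

```python
import math
import random


def make_cell(n_in, n_hidden, rng):
    """Random weights for a plain tanh recurrent cell (a stand-in for an LSTM cell)."""
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_hidden)]
    return w_in, w_rec


def run_direction(cell, inputs):
    """Return the hidden state after each input vector, in reading order."""
    w_in, w_rec = cell
    h = [0.0] * len(w_rec)
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(wi * xi for wi, xi in zip(w_in[j], x))
                + sum(wr * hk for wr, hk in zip(w_rec[j], h))
            )
            for j in range(len(h))
        ]
        states.append(h)
    return states


def encode_word(word, char_emb, fwd_cell, bwd_cell):
    """Encode a word from its characters in both directions."""
    xs = [char_emb[c] for c in word]
    fwd = run_direction(fwd_cell, xs)               # reads left to right
    bwd = run_direction(bwd_cell, xs[::-1])[::-1]   # reads right to left, realigned
    word_vec = fwd[-1] + bwd[0]                     # final state of each direction
    per_char = [f + b for f, b in zip(fwd, bwd)]    # forward units first, then backward
    return word_vec, per_char


rng = random.Random(0)
EMB, HID = 8, 4  # toy sizes, assumed for illustration
char_emb = {c: [rng.uniform(-1, 1) for _ in range(EMB)]
            for c in "abcdefghijklmnopqrstuvwxyz"}
fwd_cell = make_cell(EMB, HID, rng)
bwd_cell = make_cell(EMB, HID, rng)

word_vec, per_char = encode_word("walked", char_emb, fwd_cell, bwd_cell)
```

The `per_char` list is what a unit-level analysis would inspect: at each character, a forward unit has seen only the characters to its left, and a backward unit only those to its right.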

  28. Analysis Metrics

  29. Analysis Metrics
      ● Run model on training data words

  30. Analysis Metrics
      ● Run model on training data words
      ● Collect activation levels for each unit

  31. Analysis Metrics
      ● Run model on training data words
      ● Collect activation levels for each unit
      ● Aggregate to single measure (e.g. average absolute or max-delta); example value: 0.42

  32. Analysis Metrics
      ● Run model on training data words
      ● Collect activation levels for each unit
      ● Aggregate to single measure (e.g. average absolute or max-delta); example value: 0.42
      ● Bin per unit over parts of speech

      Unit 42   [0.0,0.1)   [0.1,0.2)   …   [0.9,1.0)
      NOUN      8           2           …   40
      VERB      20          0           …   4
      …         …           …           …   …
      ADJ       10          10          …   10
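
The aggregation and binning steps listed above can be sketched as follows. `average_absolute` follows the slide's wording directly; `max_delta` is an assumed reading of "max-delta" (largest absolute change between consecutive characters), since the slides do not define it. The ten equal-width bins mirror the table header, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def average_absolute(activations):
    """Mean absolute activation of one unit over the characters of one word."""
    return sum(abs(a) for a in activations) / len(activations)


def max_delta(activations):
    """Assumed reading: largest absolute change between consecutive characters."""
    if len(activations) < 2:
        return 0.0
    return max(abs(b - a) for a, b in zip(activations, activations[1:]))


def bin_index(value, n_bins=10):
    """Map a value in [0, 1] to one of n_bins equal-width bins."""
    return min(int(value * n_bins), n_bins - 1)


def bin_counts(samples, aggregate=average_absolute, n_bins=10):
    """samples: (pos_tag, activations) pairs for a single unit.

    Returns {pos_tag: Counter({bin_index: count})}.
    """
    table = defaultdict(Counter)
    for pos, activations in samples:
        table[pos][bin_index(aggregate(activations), n_bins)] += 1
    return table


# Made-up activations for one unit, only to exercise the functions.
samples = [
    ("NOUN", [0.9, -0.95, 0.98]),
    ("NOUN", [0.05, 0.02]),
    ("VERB", [0.01, -0.08]),
]
counts = bin_counts(samples)
```

The [0, 1] bin range fits `average_absolute` for activations bounded by 1 in magnitude; under the reading used here, `max_delta` can reach 2 and would need a wider range.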

  33. Analysis Metrics
      ● Run model on training data words
      ● Collect activation levels for each unit
      ● Aggregate to single measure (e.g. average absolute or max-delta); example value: 0.42
      ● Bin per unit over parts of speech
      ● Mutual Information metric – POS Discrimination Index, or PDI
          ○ (Higher PDI = better discriminator)

      Unit 42   [0.0,0.1)   [0.1,0.2)   …   [0.9,1.0)
      NOUN      8           2           …   40
      VERB      20          0           …   4
      …         …           …           …   …
      ADJ       10          10          …   10
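
As a rough illustration of a mutual-information score over such a table, the sketch below computes plain mutual information (in bits) between POS tag and activation bin from raw counts. Whether PDI applies any further normalization is not stated on the slide, so treat this as an assumption. The toy table uses only the three bins and three tags visible on the slide; the elided rows and columns are left out, not guessed.

```python
import math


def mutual_information(table):
    """table: {pos_tag: [count per activation bin]}; returns MI in bits."""
    total = sum(sum(row) for row in table.values())
    n_bins = len(next(iter(table.values())))
    bin_totals = [sum(row[b] for row in table.values()) for b in range(n_bins)]
    mi = 0.0
    for row in table.values():
        row_total = sum(row)
        for b, count in enumerate(row):
            if count:  # empty cells contribute nothing
                mi += (count / total) * math.log2(
                    count * total / (row_total * bin_totals[b])
                )
    return mi


# Visible cells of the Unit 42 table; columns are [0.0,0.1), [0.1,0.2), [0.9,1.0).
unit_42 = {"NOUN": [8, 2, 40], "VERB": [20, 0, 4], "ADJ": [10, 10, 10]}
score = mutual_information(unit_42)

# Two sanity cases: a unit that ignores POS, and one that separates it perfectly.
uninformative = mutual_information({"A": [5, 5], "B": [5, 5]})
perfect = mutual_information({"A": [10, 0], "B": [0, 10]})
```

A unit whose activation bin is independent of the tag scores 0, and a unit that splits two equally frequent tags perfectly scores 1 bit, which matches the slide's "higher = better discriminator" reading.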

  34. Analysis Metrics
      ● Run model on training data words
      ● Collect activation levels for each unit
      ● Aggregate to single measure (e.g. average absolute or max-delta)
      ● Bin per unit over parts of speech
      ● Mutual Information metric – POS Discrimination Index, or PDI
          ○ (Higher PDI = better discriminator)
      ● Aggregate across units by
      [Diagram: the binned count table (NOUN 8, 2, …, 40; VERB 20, 0, …, 4; ADJ 10, 10, …, 10) repeated for Unit 40, Unit 41 and Unit 42]

  35. Analysis Metrics
      ● Run model on training data words
      ● Collect activation levels for each unit
      ● Aggregate to single measure (e.g. average absolute or max-delta)
      ● Bin per unit over parts of speech
      ● Mutual Information metric – POS Discrimination Index, or PDI
          ○ (Higher PDI = better discriminator)
      ● Aggregate across units by
          ○ Summing total mass
          ○ Reporting % of forward units before mass median
      [Diagram: the binned count table (NOUN 8, 2, …, 40; VERB 20, 0, …, 4; ADJ 10, 10, …, 10) repeated for Unit 40, Unit 41 and Unit 42, with the mass median marked]
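
The two cross-unit aggregates can be sketched as below. The reading of "before mass median" is an assumption: units are ranked by PDI from highest to lowest, and the units counted are those needed to reach half of the total PDI mass. The function name, the `"fwd"`/`"bwd"` labels and the example values are all hypothetical.

```python
def summarize_units(units):
    """units: list of (direction, pdi) pairs, direction being "fwd" or "bwd".

    Returns (total PDI mass, % of forward units before the mass median).
    """
    total_mass = sum(pdi for _, pdi in units)
    ranked = sorted(units, key=lambda u: u[1], reverse=True)

    running = 0.0
    before_median = []
    for direction, pdi in ranked:
        if running >= total_mass / 2:
            break  # half of the mass is already covered
        before_median.append(direction)
        running += pdi

    if not before_median:
        return total_mass, 0.0
    forward_share = 100.0 * before_median.count("fwd") / len(before_median)
    return total_mass, forward_share


# Made-up PDI values for four units.
units = [("fwd", 0.4), ("bwd", 0.3), ("fwd", 0.2), ("bwd", 0.1)]
total_mass, forward_share = summarize_units(units)
```

Under this reading, total mass says how much POS information the character units carry overall, and the forward share says which reading direction holds the most discriminative units.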

  36. Findings (Cherry Pick)
      ● English: fusional, suffixing
          ○ Small mass (hard to capture POS)
          ○ Backward-heavy (80%)
      ● Coptic: agglutinative, prefixing
          ○ Large mass (easy to distinguish POS based on char sequence)
          ○ Forward-heavy (71%)

  37. Findings (General Trends) [Chart: total PDI mass per language]

  38. Findings (General Trends) [Chart: total PDI mass per language]
      ● 4 of the 5 agglutinative languages hold 4 of the top 6 total-mass positions

  39. Findings (General Trends) [Chart: total PDI mass per language]
      ● 4 of the 5 agglutinative languages hold 4 of the top 6 total-mass positions
      ● Both introflexive languages sit in 2 of the bottom 4 spots (Persian and Hindi rank below them; both are fusional with non-Latin character sets)

  40. Direction Balance Study

  41. Direction Balance Study
      ● Some languages might not need two equal LSTM directions

  42. Direction Balance Study
      ● Some languages might not need two equal LSTM directions
      ● What if… they don’t need one of them at all?

  43. Direction Balance Study
      ● Some languages might not need two equal LSTM directions
      ● What if they need them in a different balance? Somewhere in the middle?
      ● What if… they don’t need one of them at all?
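
One way to picture the study's search space is to keep the total number of character-level hidden units fixed and vary how many go to each direction. The budget of 128 units and the step of 32 below are hypothetical values chosen for illustration; the slides do not state the configurations that were actually trained.

```python
def direction_splits(total_units, step):
    """Enumerate (forward, backward) unit counts that share a fixed budget."""
    return [(fwd, total_units - fwd) for fwd in range(0, total_units + 1, step)]


# Hypothetical budget and step size.
splits = direction_splits(128, 32)
for fwd, bwd in splits:
    print(f"forward={fwd:>3}  backward={bwd:>3}")
```

The two end points correspond to the "don't need one of them at all" question, the middle entry to the usual equal split, and the rest to the "somewhere in the middle" balances.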

  44. Balance Study – Results
