Character Eyes: Seeing Language through Character-Level Taggers
Yuval Pinter, Marc Marone, Jacob Eisenstein
@yuvalpi @ruyimarone @jacobeisenstein
BlackboxNLP 2019
https://github.com/ruyimarone/character-eyes
Taggers
Neural Taggers
Character-level Neural Taggers
Character-level Recurrent Neural Taggers
[Running example throughout: "The cat walked fast", tagged DET  N;sg  V;past  RB; in the character-level setting each word is read as a character sequence: T h e / c a t / w a l k e d / f a s t]
Recurrent Taggers – Good at Finding Morphemes?
● Agglutination
[Same example: "The cat walked fast"; morphemes such as the -ed of "walked" appear as contiguous character spans]
Recurrent Taggers – Good at Prefixes and Suffixes?
● Prefixing morphology (e.g. Coptic)
[Modified example: "thecat walked fast", tagged N;sg;def  V;past  RB – the article is attached to the noun as a prefix: t h e c a t]
Recurrent Taggers – Can They Handle diSCoNtinUiTY?
● Introflexive morphology (Hebrew, Arabic)
[Modified example: "The cat waeldk fast" – the characters of "walked" are interleaved (w a e l d k), mimicking discontinuous root-and-pattern morphology]
Main Idea(s)
Model ↔ Language
● Measure how models encode different linguistic patterns
● Characterize languages based on model analysis; help engineer language-aware systems
Analysis Primitive – Unit Decomposition
● Assumption: units are "in charge" of tracking morphemes that help predict POS
● Hypothesis: this is easy for agglutinations, difficult for introflexions
● Hypothesis: a unit's direction affects the ease of tracking suffixes vs. prefixes
[Figure: individual hidden units (#n, #m) of the character BiLSTM over "The cat walked fast"]
Evidence?
● Turkish is an agglutinative language
  ○ ev 'house'; evler 'houses'; evleriniz 'your houses'; evlerinizden 'from your houses'
[Plots: activation traces of Unit 3 (a forward unit, →) and Unit 124 over the characters of these words – see the sketch below]
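A minimal sketch (not the authors' released code) of how such a per-unit trace can be produced. `char_emb`, `char_lstm`, and `char_to_id` are assumed stand-ins for a trained character embedding table, a batch-first character BiLSTM, and a character vocabulary.

```python
# Hypothetical helper for plotting one hidden unit's activation over a word's
# characters, as in the Unit 3 / Unit 124 plots above.
import torch
import matplotlib.pyplot as plt

def plot_unit(char_emb, char_lstm, char_to_id, word, unit):
    ids = torch.tensor([[char_to_id[c] for c in word]])    # (1, len(word))
    with torch.no_grad():
        states, _ = char_lstm(char_emb(ids))                # (1, len(word), 2 * hidden)
    plt.plot(states[0, :, unit].numpy(), marker="o")
    plt.xticks(range(len(word)), list(word))                # label each step with its character
    plt.title(f"Unit {unit} activation over '{word}'")
    plt.show()

# e.g. plot_unit(char_emb, char_lstm, char_to_id, "evlerinizden", 3)
```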
Model & Data
● Universal Dependencies (n=24)
  ○ POS tags + Morphosyntactic Descriptions
● Linguistic diversity – morphological synthesis:
  ○ 5 agglutinative languages
  ○ 2 introflexive
  ○ 3 isolating, 14 fusional
● Linguistic diversity – affixation:
  ○ 1 prefixing language
  ○ 2 non-affixing
  ○ 2 equally pre- and suffixing
  ○ 19 suffixing
● Word → Tag: bidirectional LSTM + MLP
  ○ Not analyzed
  ○ No word embeddings
● Char → Word: bidirectional LSTM (sketched below)
  ○ Char embedding size: 256
Source for language classes: WALS
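A minimal PyTorch sketch of this architecture, for orientation only: apart from the 256-dimensional character embeddings, all sizes and names here are assumptions, and the released repository is the authoritative implementation.

```python
import torch
import torch.nn as nn

class CharTagger(nn.Module):
    def __init__(self, n_chars, n_tags, char_dim=256, char_hidden=128, word_hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Char -> Word: bidirectional LSTM over each word's characters (no word embeddings).
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)
        # Word -> Tag: bidirectional LSTM over word vectors, followed by an MLP (not analyzed).
        self.word_lstm = nn.LSTM(2 * char_hidden, word_hidden, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * word_hidden, word_hidden), nn.Tanh(),
                                 nn.Linear(word_hidden, n_tags))

    def embed_word(self, char_ids):
        # char_ids: (1, word_len) tensor of character indices for a single word.
        states, _ = self.char_lstm(self.char_emb(char_ids))  # (1, word_len, 2 * char_hidden)
        h = states.size(-1) // 2
        # Word vector: final forward state concatenated with final backward state.
        word_vec = torch.cat([states[0, -1, :h], states[0, 0, h:]])
        return word_vec, states                              # states are what the unit analysis inspects

    def forward(self, sentence):
        # sentence: list of (1, word_len) character-index tensors, one per word.
        word_vecs = torch.stack([self.embed_word(w)[0] for w in sentence]).unsqueeze(0)
        word_states, _ = self.word_lstm(word_vecs)
        return self.mlp(word_states).squeeze(0)              # (n_words, n_tags) tag scores
```

The per-character `states` returned by `embed_word` are the activations that the analysis metrics below operate on.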
Analysis Metrics
● Run model on training data words
● Collect activation levels for each unit (e.g. 0.42)
● Aggregate to a single measure per word (e.g. average absolute or max-delta)
● Bin per unit over parts of speech:

  Unit 42   [0.0,0.1)  [0.1,0.2)  …  [0.9,1.0)
  NOUN          8          2      …     40
  VERB         20          0      …      4
  …             …          …      …      …
  ADJ          10         10      …     10

● Mutual information metric – POS Discrimination Index, or PDI
  ○ Higher PDI = better discriminator
● Aggregate across units (one such table per unit: Unit 40, Unit 41, Unit 42, …) by:
  ○ Summing total mass
  ○ Reporting the % of forward units before the mass median
(Sketched below.)
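A hypothetical sketch of this metric pipeline: per-unit PDI as mutual information between binned activations and POS tags, plus the two cross-unit aggregates. The binning scheme, the use of scikit-learn's `mutual_info_score`, and the exact "mass median" cut are assumptions for illustration; the paper and repository define the metrics authoritatively.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def unit_pdi(activations, pos_tags, n_bins=10):
    """activations: (n_words,) aggregated activation of one unit, e.g. average
    absolute or max-delta over each word's characters; pos_tags: (n_words,) labels."""
    edges = np.linspace(activations.min(), activations.max(), n_bins)
    # Higher PDI = the unit's activation level discriminates POS better.
    return mutual_info_score(pos_tags, np.digitize(activations, edges))

def aggregate(pdis_forward, pdis_backward):
    """Total PDI mass, and the share of forward units among the highest-PDI units
    that together account for half of the total mass ("before the mass median")."""
    pdis = np.concatenate([pdis_forward, pdis_backward])
    is_forward = np.concatenate([np.ones(len(pdis_forward), bool),
                                 np.zeros(len(pdis_backward), bool)])
    order = np.argsort(-pdis)                                # units sorted by descending PDI
    cutoff = np.searchsorted(np.cumsum(pdis[order]), pdis.sum() / 2) + 1
    return pdis.sum(), is_forward[order[:cutoff]].mean()
```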
Findings (Cherry Pick)
● English: fusional, suffixing
  ○ Small total mass (hard to capture POS from characters)
  ○ Backward-heavy (80%)
● Coptic: agglutinative, prefixing
  ○ Large total mass (easy to distinguish POS based on the character sequence)
  ○ Forward-heavy (71%)
Findings (General Trends)
[Chart: total PDI mass per language]
● 4 of the 5 agglutinative languages occupy 4 of the top 6 positions by total PDI mass
● Both introflexive languages (2/2) sit in 2 of the bottom 4 spots; below them are Persian and Hindi, both fusional with non-Latin character sets
Direction Balance Study
● Some languages might not need two equal LSTM directions
● What if they need them in a different balance? Somewhere in the middle?
● What if… they don't need one of them at all?
(Sketch of an unbalanced encoder below.)
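One way such an unbalanced setup could be realized, as a hypothetical sketch only (the actual experimental configuration is defined in the paper and repository): separate forward and backward character LSTMs whose hidden sizes need not match, with one direction dropped entirely for the single-direction case.

```python
import torch
import torch.nn as nn

class UnbalancedCharEncoder(nn.Module):
    # Separate forward/backward LSTMs; pass 0 for a hidden size to drop that direction.
    def __init__(self, char_dim=256, fwd_hidden=192, bwd_hidden=64):
        super().__init__()
        self.fwd = nn.LSTM(char_dim, fwd_hidden, batch_first=True) if fwd_hidden else None
        self.bwd = nn.LSTM(char_dim, bwd_hidden, batch_first=True) if bwd_hidden else None

    def forward(self, char_embs):
        # char_embs: (1, word_len, char_dim) embedded characters of one word.
        parts = []
        if self.fwd is not None:
            parts.append(self.fwd(char_embs)[0][:, -1])                        # last left-to-right state
        if self.bwd is not None:
            parts.append(self.bwd(torch.flip(char_embs, dims=[1]))[0][:, -1])  # last right-to-left state
        return torch.cat(parts, dim=-1)   # word vector; size = fwd_hidden + bwd_hidden
```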
Balance Study – Results