Finite state morphology and phonology
Natural Language Processing LING/CSCI 5832
Mans Hulden
Dept. of Linguistics
mans.hulden@colorado.edu
Jan 22 2014
Composition
[Figure: a lexicon transducer followed by rule transducers, composed into a single transducer. Forms shown at each stage: in+possible+ity, im+possible+ity, im+possibility, impossibility. Arc labels visible in the rule transducers: <n:m>; <l:i> <e:l> <+:i> <i:t> <t:y> <y:0>; <+:0>.]
Composition
[Figure: the same cascade starting from a tagged lexical form. Forms shown at each stage: NEG+possible+ity+NOUN+PLURAL, in+possible+ity+s, im+possible+ity+s, im+possibility+s, impossibilities.]
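The cascade on the composition slides can be imitated with ordinary function composition. This is only a sketch: the three rewrite steps are assumptions read off the intermediate forms in the figure (n to m before "+p", "le+ity" to "ility", then boundary deletion), and the function names are hypothetical. A finite-state compiler would instead build one transducer for the whole cascade.

```python
import re
from functools import reduce


def nasal_assimilation(form):
    # Assumed rule: n becomes m before a morpheme boundary followed by p.
    return re.sub(r"n(?=\+p)", "m", form)


def ble_ity(form):
    # Assumed rule: "le+ity" is rewritten as "ility" (possible+ity > possibility).
    return form.replace("le+ity", "ility")


def delete_boundary(form):
    # Remove the remaining morpheme boundaries.
    return form.replace("+", "")


def compose(*steps):
    """Return a function that applies the given steps from left to right."""
    return lambda form: reduce(lambda acc, step: step(acc), steps, form)


cascade = compose(nasal_assimilation, ble_ity, delete_boundary)
print(cascade("in+possible+ity"))
```

Composing functions only runs in one direction. A composed transducer can also be applied in reverse, which is what makes analysis possible with the same grammar.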
Compilers
Several finite-state compilers are available to do the hard work:
- Xerox xfst (http://www.fsmbook.com)
- SFST (https://code.google.com/p/cistern/wiki/SFST)
- HFST (http://hfst.sf.net)
- OpenFST (http://www.openfst.org)
- Foma (http://foma.googlecode.com)
Demo with foma
Toy grammar of English
Toy lexicon: kiss, hire, spy
Possible suffixes: ed, ing, s
Generate kiss+s/kisses, spy+ed/spied, hire+ing/hiring, hire+ed/hired, etc.
A more advanced version of this, in tutorial form, is at: https://code.google.com/p/foma/wiki/MorphologicalAnalysisTutorial
Some derivations

             hire+ing    hire+ed    kiss+s
Edelete      hir+ing     hir+ed     kiss+s
EInsert      hir+ing     hir+ed     kisses
Delete +     hiring      hired      kisses
Code
analyzer1.foma

def Stems s p y | k i s s | h i r e ;
def Suffixes "+" [ 0 | s | e d | i n g ];
def Lexicon Stems Suffixes ;
def YRule1 y -> i e || _ "+" s ;       # spy+s > spie+s
def YRule2 y -> i || _ "+" e d ;       # spy+ed > spi+ed
def Einsert "+" -> e || s _ s ;        # kiss+s > kisses
def Edelete e -> 0 || _ "+" [e|i];     # hire+ed > hir+ed, hire+ing > hir+ing
def Cleanup "+" -> 0 ;                 # hir+ing > hiring, etc.
def Grammar Lexicon .o. YRule1 .o. YRule2 .o. Einsert .o. Edelete .o. Cleanup;
regex Grammar;
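To see what the rule cascade in analyzer1.foma does to each lexical form, the rules can be approximated with ordered regular-expression substitutions. This is a rough sketch, not a replacement for the compiled transducer: the `generate` helper and the regex translations of the rules are assumptions, and they only match foma's rewrite semantics for simple inputs like these.

```python
import re

# Each foma rule approximated as (name, regex pattern, replacement),
# applied in the same order as the composition in analyzer1.foma.
RULES = [
    ("YRule1", r"y(?=\+s)", "ie"),        # spy+s > spie+s
    ("YRule2", r"y(?=\+ed)", "i"),        # spy+ed > spi+ed
    ("Einsert", r"(?<=s)\+(?=s)", "e"),   # kiss+s > kisses
    ("Edelete", r"e(?=\+[ei])", ""),      # hire+ed > hir+ed
    ("Cleanup", r"\+", ""),               # hir+ing > hiring
]


def generate(lexical_form):
    """Run a lexical form such as 'kiss+s' through the rule cascade."""
    form = lexical_form
    for _name, pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form


for lexical in ["kiss+s", "spy+ed", "spy+s", "hire+ing", "hire+ed"]:
    print(lexical, "->", generate(lexical))
```

The order of the list matters in the same way the order of `.o.` matters in the foma script: each rule sees the output of the one before it.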
Code
analyzer2.foma

def Stems s p y | k i s s | h i r e ;
def Suffixes 0:"+" [ "[INF]":0 | "[NOUN][SINGULAR]":0 | "[PRES]":s | "[NOUN][PLURAL]":s | "[PASTPART]":[e d] | "[PRESPART]":[i n g] ];
def Lexicon Stems Suffixes ;
def YRule1 y -> i e || _ "+" s ;
def YRule2 y -> i || _ "+" e d ;
def Einsert "+" -> "+" e || s _ s ;
def Edelete e -> 0 || _ "+" [e|i];
def Cleanup "+" -> 0;
def Grammar Lexicon .o. YRule1 .o. YRule2 .o. Einsert .o. Edelete .o. Cleanup;
regex Grammar;
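With tags on the upper side, the same grammar gives analyses when applied in the other direction. The sketch below imitates that by generating every (analysis, surface form) pair and inverting the table. It assumes the regex approximations of the rules shown in its comments, and the helper names `surface_form` and `analyze` are hypothetical. A transducer does this inversion without enumerating the lexicon.

```python
import re
from collections import defaultdict

STEMS = ["spy", "kiss", "hire"]

# Tag sequence (upper side) paired with suffix material (lower side).
TAGGED_SUFFIXES = {
    "[INF]": "",
    "[NOUN][SINGULAR]": "",
    "[PRES]": "s",
    "[NOUN][PLURAL]": "s",
    "[PASTPART]": "ed",
    "[PRESPART]": "ing",
}

# Regex approximations of the analyzer2.foma rules, in composition order.
RULES = [
    (r"y(?=\+s)", "ie"),         # YRule1
    (r"y(?=\+ed)", "i"),         # YRule2
    (r"(?<=s)\+(?=s)", "+e"),    # Einsert keeps the boundary
    (r"e(?=\+[ei])", ""),        # Edelete
    (r"\+", ""),                 # Cleanup
]


def surface_form(stem, suffix):
    form = stem + "+" + suffix
    for pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form


# Invert generation: map each surface form to all of its analyses.
ANALYSES = defaultdict(set)
for stem in STEMS:
    for tags, suffix in TAGGED_SUFFIXES.items():
        ANALYSES[surface_form(stem, suffix)].add(stem + tags)


def analyze(word):
    """Return the sorted analyses of a surface word (empty if unknown)."""
    return sorted(ANALYSES.get(word, set()))


print(analyze("kisses"))
print(analyze("hire"))
```

Ambiguity falls out naturally: a surface form such as kisses maps to both a verbal and a nominal analysis.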
The 2 second spell checker
(1) Extract the possible outputs of the “Grammar” transducer and convert to an automaton (output-side projection)
(2) Test a word against the automaton
[Figure: transducer diagram labeled NEG+possible+ity+NOUN+PLURAL and impossibility]
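The two steps of the spell checker can be illustrated with a finite relation stored as pairs. In this sketch the output-side projection is simply the set of second components, and testing a word is a membership check. The pair list is a small hand-written excerpt chosen for illustration, and `output_projection` and `is_known_word` are hypothetical names.

```python
# Illustrative excerpt of (analysis, surface form) pairs; a real grammar
# transducer would encode the full, possibly infinite, relation.
GRAMMAR_PAIRS = {
    ("kiss[PRES]", "kisses"),
    ("kiss[NOUN][PLURAL]", "kisses"),
    ("spy[PASTPART]", "spied"),
    ("hire[PRESPART]", "hiring"),
    ("hire[INF]", "hire"),
}


def output_projection(pairs):
    """Step 1: keep only the output (surface) side of the relation."""
    return {surface for _analysis, surface in pairs}


ACCEPTED = output_projection(GRAMMAR_PAIRS)


def is_known_word(word):
    """Step 2: test a word against the projected language."""
    return word in ACCEPTED


print(is_known_word("kisses"))
print(is_known_word("kisss"))
```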
The 5 second spelling corrector [med]
Assume we have a list of words as a repeating FST W, as before: W maps each word in the list to itself.
[Figure: hired → W → hired]
Now create a transducer C1 that makes exactly one change in a word (one deletion, one substitution, or one insertion).
[Figure: two-state transducer C1 (states 0 and 1) with arc labels @, <?:0>, <0:?>, <?:?>]
Example: abc → ab, bc, ac, aba, aac, abca, ...
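The relation that C1 encodes can be spelled out by brute force for a single input word. The following is a sketch under the assumption of a small, explicitly given alphabet (a transducer handles any symbol), and `one_change` is a hypothetical helper name. Substitutions are restricted to a different symbol so that every output differs from the input.

```python
def one_change(word, alphabet):
    """Return all strings obtained from word by exactly one edit."""
    results = set()
    # Insertions: put any symbol at any position, including both ends.
    for i in range(len(word) + 1):
        for symbol in alphabet:
            results.add(word[:i] + symbol + word[i:])
    for i in range(len(word)):
        # Deletions: drop the symbol at position i.
        results.add(word[:i] + word[i + 1:])
        # Substitutions: replace it with a different symbol.
        for symbol in alphabet:
            if symbol != word[i]:
                results.add(word[:i] + symbol + word[i + 1:])
    return results


variants = one_change("abc", "abc")
print(sorted(variants))
```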
Compose
[Figure: hired → W → hired → C1 → xired, hird, hiredx, ired, hied, ...]
[Figure: hired → W .o. C1 → xired, hird, hiredx, ired, hied, ...]
Code
analyzer3.foma

def Stems s p y | k i s s | h i r e ;
def Suffixes 0:"+" [ "[INF]":0 | "[NOUN][SINGULAR]":0 | "[PRES]":s | "[NOUN][PLURAL]":s | "[PASTPART]":[e d] | "[PRESPART]":[i n g] ];
def Lexicon Stems Suffixes ;
def YRule1 y -> i e || _ "+" s ;
def YRule2 y -> i || _ "+" e d ;
def Epenthesis "+" -> "+" e || s _ s ;
def Erule e -> 0 || _ "+" [e|i];
def Cleanup "+" -> 0;
def Grammar Lexicon .o. YRule1 .o. YRule2 .o. Epenthesis .o. Erule .o. Cleanup;
# C1: exactly one deletion, insertion, or substitution by a different symbol
def C1 ?* [?:0|0:?|?:?-?] ?* ;
# Output side of Grammar (the known words) composed with C1
regex Grammar.2 .o. C1;
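The effect of composing the word list with C1 can be sketched as a lookup: given a misspelling, return every known word that is exactly one change away. The word set below lists the surface forms of the toy grammar by hand, and `within_one_change` and `suggest` are hypothetical helpers. The composed transducer answers the same question without scanning the word list.

```python
# Surface forms produced by the toy grammar (stems spy, kiss, hire).
WORDS = {
    "spy", "spies", "spied", "spying",
    "kiss", "kisses", "kissed", "kissing",
    "hire", "hires", "hired", "hiring",
}


def within_one_change(a, b):
    """True if b differs from a by exactly one deletion, insertion,
    or substitution."""
    if a == b:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    # Lengths differ by one: deleting one symbol from the longer string
    # must give the shorter one (covers both insertion and deletion).
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))


def suggest(misspelled, words=WORDS):
    """Return known words one change away from the input, sorted."""
    return sorted(w for w in words if within_one_change(w, misspelled))


print(suggest("hird"))
print(suggest("kised"))
```

Because C1 makes exactly one change, a correctly spelled input is not returned as its own suggestion. Allowing zero changes as well would need an extra identity path.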
Entirely non-orthographic grammar

def Stems s p a ɪ | k ɪ s | h a ɪ r ;
def Suffixes 0:"+" [ "[INF]":0 | "[PRES]":z | "[PASTPART]":[d] | "[PRESPART]":[ ɪ ŋ ] ];
def Sib [s|z];       # Sibilants
def Unvoiced [h|s];  # Unvoiced phonemes
define ObsAssimilation d -> t || Unvoiced "+" _ ;
define Epenthesis [..] -> ɪ || Sib "+" _ Sib ;
define Cleanup "+" -> 0;
def Lexicon Stems Suffixes ;
def Grammar Lexicon .o. ObsAssimilation .o. Epenthesis .o. Cleanup;
regex Grammar;
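The phonological version can be traced the same way as the orthographic one. The sketch below assumes the stems are the phonemic forms /spaɪ/, /kɪs/, /haɪr/ and that the participle suffix is /ɪŋ/. Those transcriptions, the regex approximations of the two rules, and the `pronounce` helper are all assumptions made for illustration.

```python
import re

# Assumed phonemic stems, keyed by their orthographic form.
STEMS = {"spy": "spaɪ", "kiss": "kɪs", "hire": "haɪr"}
SUFFIXES = {"[INF]": "", "[PRES]": "z", "[PASTPART]": "d", "[PRESPART]": "ɪŋ"}


def pronounce(stem, tag):
    """Build stem+suffix and apply the rules in grammar order."""
    form = STEMS[stem] + "+" + SUFFIXES[tag]
    # ObsAssimilation: d becomes t after an unvoiced phoneme and boundary.
    form = re.sub(r"(?<=[hs]\+)d", "t", form)
    # Epenthesis: insert ɪ between a sibilant + boundary and a sibilant.
    form = re.sub(r"(?<=[sz]\+)(?=[sz])", "ɪ", form)
    # Cleanup: remove the morpheme boundary.
    return form.replace("+", "")


for stem, tag in [("kiss", "[PRES]"), ("kiss", "[PASTPART]"),
                  ("spy", "[PRES]"), ("hire", "[PRESPART]")]:
    print(stem + tag, "->", pronounce(stem, tag))
```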
Applications
Tokenization
POS tagging
Shallow parsing (chunking)
Syntactic parsing
Information extraction
Text-to-speech
Spell checking/correction
Electronic dictionaries
Machine translation
…
Syntactic parsing
Wrapup
• The above are standard techniques: morphological/phonological grammars have been written for hundreds of languages in this way
• The calculus is crucial: thinking about states and transitions is counterproductive
• A well-designed grammar should be very accurate, barring misspellings (easily >99% recall)
• There are also probabilistic extensions to all of the above (to combine with language models, to handle noisy data, etc.)
• These grammars are also used to improve POS taggers, parsers, chunkers, named entity recognition, etc.