Example Applications of Finite State Machines Data structures and algorithms for Computational Linguistics III Çağrı Çöltekin ccoltekin@sfs.uni-tuebingen.de University of Tübingen Seminar für Sprachwissenschaft Winter Semester 2018–2019

Applications of finite-state methods
• Finite state methods are attractive for formal and computational reasons
• They are applied in a vast diversity of fields
  – Electronic circuit design
  – Workflow management
  – Games
  – Pattern matching
  – Tokenization, stemming
  – Morphological analysis
  – Chunking
  – …
• This lecture
  – FSA for pattern matching
  – FSA for storing a lexicon
  – Finite-state morphology
Outline: Introduction, Pattern matching, Finite-state lexicons, Morphology, Finite-state morphology, XFST: a quick introduction
Ç. Çöltekin, SfS / University of Tübingen, WS 18–19

Finite state automata: a refresher
• An FSA recognizes and generates a regular language; FSA are also equivalent to regular expressions
• FSA are closed under
  – Concatenation
  – Kleene star
  – Union
  – Complement
  – Reversal
  – Intersection
• Two types:
  DFA: a single transition from each state on each input symbol
  NFA: transitions to possibly multiple states on a single input symbol, or without consuming an input symbol (ϵ-NFA)
  – For every NFA there is a DFA that accepts the same regular language (determinization)
  – A DFA can be minimized to an equivalent DFA with the minimum number of nodes (minimization)
• Every FSA has a unique minimal equivalent DFA

Finite state transducers: a refresher
• FST transitions are defined on a pair of input–output symbols
• An FST moves between the states on the input symbol, while outputting the output symbol
• FSTs define a regular relation
• FSTs are closed under
  – Concatenation
  – Kleene star
  – Union
  – Reversal
  – Inversion
  – Composition
• Not all FSTs can be determinized

Naive string match
Example: searching ‘abab’ in ‘abbabbbabababbab’
[Figure: the pattern aligned at successive positions of the text; × marks each failed comparison]
Note the wasted effort after a partial match.
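
The naive search on this slide can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the function name `naive_find_all` and the comparison counter are assumptions added to make the wasted effort visible.

```python
def naive_find_all(pattern, text):
    """Return (match start positions, number of character comparisons)."""
    matches = []
    comparisons = 0
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        i = 0
        # Compare the pattern against the text at this alignment.
        while i < m:
            comparisons += 1
            if text[start + i] != pattern[i]:
                break  # mismatch: all progress at this alignment is thrown away
            i += 1
        if i == m:
            matches.append(start)
    return matches, comparisons


matches, comparisons = naive_find_all("abab", "abbabbbabababbab")
print(matches)  # → [7, 9]
```

After every mismatch the search restarts one position to the right and re-reads characters it has already seen, so the comparison count exceeds the length of the text.
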

Another solution: string matching with an NFA
Consider running the following NFA over the string.
[Figure: NFA with states 0–4; state 0 loops on a and b, and the path 0 → 1 → 2 → 3 → 4 is labeled a, b, a, b]
• The NFA will be in the accepting state when the last four letters processed match abab (including overlapping matches)
• Is this faster than the naive algorithm?
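
One way to run such an NFA is to track the set of states it could be in after each input symbol. The sketch below assumes the NFA has a self-loop on state 0 for every symbol and a chain of states 0..m that spells out the pattern; the function name `nfa_find_all` is illustrative only.

```python
def nfa_find_all(pattern, text):
    """Simulate the pattern NFA; return the end position of every match."""
    m = len(pattern)
    active = {0}          # set of states the NFA may currently be in
    ends = []
    for pos, ch in enumerate(text):
        nxt = {0}         # state 0 loops on every input symbol
        for q in active:
            # State q advances to q + 1 when the next pattern symbol is read.
            if q < m and pattern[q] == ch:
                nxt.add(q + 1)
        active = nxt
        if m in active:   # accepting state reached: a match ends here
            ends.append(pos)
    return ends


print(nfa_find_all("abab", "abbabbbabababbab"))  # → [10, 12]
```

Each input symbol is read once, but every step has to update up to m + 1 active states, so the worst-case amount of work is comparable to the naive search.
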

DFA version
[Figure: DFA with states 0, 01, 02, 013, 024 and transitions on a and b]
Knuth-Morris-Pratt (KMP) algorithm
• DFA processes every input symbol only once
• The resulting DFA has the same number of states as the NFA (generally, not much larger than the NFA)
• Approach generalizes to arbitrary regular expressions without additional computational cost
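
The DFA can be obtained from the pattern NFA by the subset construction, where each DFA state is a set of NFA states. The sketch below is an illustration under that assumption, not the failure-function formulation usually given for KMP. The names `build_dfa` and `dfa_find_all` are hypothetical, and every character of the text is assumed to be in the given alphabet.

```python
def build_dfa(pattern, alphabet):
    """Subset construction for the NFA that accepts strings ending in pattern."""
    m = len(pattern)

    def step(states, ch):
        nxt = {0}  # NFA state 0 loops on every symbol
        nxt.update(q + 1 for q in states if q < m and pattern[q] == ch)
        return frozenset(nxt)

    start = frozenset({0})
    delta, seen, todo = {}, {start}, [start]
    while todo:
        s = todo.pop()
        for ch in alphabet:
            t = step(s, ch)
            delta[(s, ch)] = t
            if t not in seen:
                seen.add(t)
                todo.append(t)
    accepting = {s for s in seen if m in s}
    return start, delta, accepting


def dfa_find_all(pattern, text, alphabet="ab"):
    """Run the DFA over text; return the end position of every match."""
    start, delta, accepting = build_dfa(pattern, alphabet)
    state, ends = start, []
    for pos, ch in enumerate(text):
        state = delta[(state, ch)]  # exactly one table lookup per symbol
        if state in accepting:
            ends.append(pos)
    return ends


start, delta, accepting = build_dfa("abab", "ab")
names = sorted("".join(str(q) for q in sorted(s)) for s, _ in delta)
print(sorted(set(names)))  # → ['0', '01', '013', '02', '024']
print(dfa_find_all("abab", "abbabbbabababbab"))  # → [10, 12]
```

The five reachable subsets correspond to the state labels in the figure, and matching costs one transition per input symbol once the table is built.
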

Finite state lexicons
• FSA are an efficient way to store lexicons
• One can start from NFA for individual words, and minimize/determinize the union of them
• Or there are algorithms for constructing finite-state lexicons incrementally
[Figure: lexicon FSA with states 0–7 and transitions labeled with the letters a, b, c, d, g, o, t, w]
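
As a minimal sketch of storing a lexicon in an FSA, the code below builds a trie, which is a deterministic automaton for the union of the words. It is not minimized, so shared suffixes are not merged as they would be in a minimal lexicon FSA. The word list and the names `build_trie` and `accepts` are hypothetical and are not taken from the figure.

```python
def build_trie(words):
    """Build a deterministic acyclic FSA (a trie) for a list of words."""
    transitions = [{}]   # transitions[state][symbol] -> next state; state 0 is the start
    final = set()
    for word in words:
        state = 0
        for ch in word:
            if ch not in transitions[state]:
                transitions.append({})                      # allocate a new state
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        final.add(state)                                    # the word ends here
    return transitions, final


def accepts(transitions, final, word):
    """Follow one transition per symbol; accept if we stop in a final state."""
    state = 0
    for ch in word:
        state = transitions[state].get(ch)
        if state is None:
            return False
    return state in final


transitions, final = build_trie(["bat", "bad", "bag", "cat", "cow"])
print(accepts(transitions, final, "bat"))  # → True
print(accepts(transitions, final, "ba"))   # → False
```

Lookup time depends only on the length of the word, not on the size of the lexicon. Common prefixes such as "ba" are stored once.
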

Morphology: some definitions
Morpheme is an abstract linguistic unit, often defined as the smallest meaningful or grammatical unit. Morphemes make up words.
Root of a word is a free morpheme, often carrying the semantic information
Derivational morphemes change the meaning of a word, sometimes changing the POS
Inflectional morphemes change the syntactic properties of words
Lemma of a word is its ‘citation’ form, what you look up in a lexicon
Stem of a (possibly derived) word is the common string shared by all morphologically related forms

Morphological typology
Languages of the world behave differently with respect to how words are formed.
• Isolating languages have little or no morphology, all words are simple (e.g., Vietnamese, Chinese)
• Analytic languages have little or no inflectional morphology (e.g., English)
• Synthetic languages have a rich morphological system
  – In agglutinative languages each morpheme has a single function (e.g., Finnish, Turkish)
  – In inflecting/fusional languages a single morpheme indicates multiple functions (e.g., Latin, Russian)
  – Polysynthetic languages may pack multiple ‘words’ in a single word (e.g., Ainu, Chukchi)
Note that these are tendencies.

Where do morphemes go
• Affixation: attach → un-attach-ed
• Infixes: aussteigen → aus-zu-steigen
• Circumfixation: spiel → ge-spiel-t
• Root-pattern morphology (Arabic): ktb → kitāb ‘book’, ktb → kātib ‘writer’
• Reduplication (some Austronesian languages): orang ‘person’ → orang-orang ‘people’

Interaction of morphology and phonology (or morphology and orthography)
Morphology and phonology/orthography interact. A few examples:
• dog-s, but fox-es
• city → citi-es
• stop → stopp-ing
• panic → panick-ed
• goose → geese
• Vowel harmony (Turkish): ev ‘house’ → ev-ler ‘houses’, oda ‘room’ → oda-lar ‘rooms’

Two-level morphology
• We assume that there are two ‘levels’ of representation
  – A surface representation, which is what we hear or see
  – An underlying representation, an abstract representation for the word
  Underlying: cat⟨PL⟩
  Surface: cats
• An FST is used to map the underlying representation to the surface representation (generation)
• If we run the FST in the inverse direction, we get an analysis
• Often the FST is a complex combination of many small FSA or FSTs

Two-level morphology: a typical architecture
• Typically, the lexicon is converted to an FSA
• Concatenated (or composed) with morphological rules (affixation, applying templates, …)
• The result is composed with phonological/orthographic alternations
• The phonological/orthographic rules can be designed as cascades (composition), or can be applied in parallel

Two-level morphology: a (simplified) example
[Figure: three machines. L: lexicon FSA over letters; M: transducer with the transition ⟨PL⟩:⟨S⟩; P: transducer that distinguishes x from not x, with the transitions ϵ:e and ⟨S⟩:s]
Analyzer: (LM ◦ P)⁻¹
Generator: LM ◦ P
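
The generator and analyzer can be sketched with a tiny nondeterministic FST interpreter. Everything below is an assumption made for illustration and does not reproduce the machines in the figure: the symbol names `<PL>` and `<S>`, the four-stem lexicon, and a rule P that inserts e only after x. For brevity the lexicon is a Python set rather than an FSA, and the machines are applied as a cascade instead of being composed into one FST.

```python
import string

EPS = ""                                  # empty symbol on an arc
PL, S = "<PL>", "<S>"                     # multi-character symbols (assumed names)
LETTERS = string.ascii_lowercase
LEXICON = {"cat", "fox", "bat", "cow"}    # hypothetical stems standing in for L


def apply(fst, symbols):
    """Return every output sequence the FST can produce for the input sequence."""
    start, finals, arcs = fst
    results, stack = set(), [(start, 0, ())]
    while stack:
        state, i, out = stack.pop()
        if i == len(symbols) and state in finals:
            results.add(out)
        for a_in, a_out, nxt in arcs.get(state, []):
            new_out = out if a_out == EPS else out + (a_out,)
            if a_in == EPS:                                   # no input consumed
                stack.append((nxt, i, new_out))
            elif i < len(symbols) and symbols[i] == a_in:
                stack.append((nxt, i + 1, new_out))
    return results


def invert(fst):
    """Swap the input and output label on every arc."""
    start, finals, arcs = fst
    return start, finals, {q: [(o, i, n) for i, o, n in lst] for q, lst in arcs.items()}


# M: copy the stem letters, then optionally rewrite <PL> as <S>.
M = (0, {0, 1}, {0: [(c, c, 0) for c in LETTERS] + [(PL, S, 1)]})

# P: realize <S> as "s"; after x, an "e" must be inserted first.
not_x = [c for c in LETTERS if c != "x"]
P = (0, {0, 1, 3}, {
    0: [(c, c, 0) for c in not_x] + [("x", "x", 1), (S, "s", 3)],
    1: [(c, c, 0) for c in not_x] + [("x", "x", 1), (EPS, "e", 2)],
    2: [(S, "s", 3)],
})


def generate(stem, plural=False):
    """Underlying -> surface: look up the stem, then apply M and P in sequence."""
    if stem not in LEXICON:
        return set()
    underlying = tuple(stem) + ((PL,) if plural else ())
    return {"".join(s) for mid in apply(M, underlying) for s in apply(P, mid)}


def analyze(surface):
    """Surface -> underlying: run P and M inverted, keep stems found in the lexicon."""
    analyses = set()
    for mid in apply(invert(P), tuple(surface)):
        for und in apply(invert(M), mid):
            stem = "".join(sym for sym in und if sym != PL)
            if stem in LEXICON:
                analyses.add("".join(und))
    return analyses


print(generate("fox", plural=True))  # → {'foxes'}
print(analyze("cats"))               # → {'cat<PL>'}
```

Running P in the inverse direction overgenerates: for "cats" it also proposes the plain letter sequence c-a-t-s. The lexicon check filters this out, which is the role L plays in the composed machine.
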