CroMo - Morphological Analysis for Croatian

Damir Ćavar (1), Ivo-Pavao Jazbec (2) and Tomislav Stojanov (2)
(1) Linguistics Department, University of Zadar
(2) Institute of Croatian Language and Linguistics

FSMNLP 2008

Outline

1 Introduction
2 Model
3 Evaluation
4 Comments

Scenario

- Synchronic and diachronic study of language change and acquisition models
- Language data from a long period of time, and three major dialects in Croatia, implying:
  - Variation wrt. e.g. string-based morphology or feature bundles
  - Ongoing discovery wrt. string combinatorics and features
- Research questions require quantitative and qualitative information:
  - of phonological, morphological, syntactic and semantic tokens and feature bundles
  - and their correlation and variation at various stages over time

Morphological segmentation and annotation and lemmatization, and ...

- Segmenting words:
  isponapijali su se “they got drunk a little bit to satisfaction”
  is – po – napija – li
- Annotating segments:
  aspect prefix – aspect prefix – from stem-lemma napiti – plural participle
- Extending the annotation:
  to a certain saturation – a little bit – “get drunk” from root-lemma piti – past event

FSA Architecture

Mapping of morph-groups to DFSAs (Mealy or Moore machine):

[Figure: three DFSAs, one each for verbal aspect prefixes, verb roots and verbal inflectional suffixes. Emission labels shown: "v pref (-index" and "v pref )-index asp" for prefixes; "v root (-index" and "v root )-index" for roots; "v suf (-index" and, for suffixes, "v suf )-index" with pres 1st sg, pres 2nd sg, pres 3rd sg, pres 1st pl, pres 2nd pl, pres 3rd pl and 2nd sg imper.]

FSA Architecture

- Mapping ambiguity on emission: emission tuple 1 to n
- Label DFSAs with variable names
- Use rules referring to variable names for modeling of morphotactic regularities:
  verbAspectPrefs* . verbAtiRoots . verbInflSuf
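
The rule verbAspectPrefs* . verbAtiRoots . verbInflSuf can be read as "any number of aspect prefixes, then one root, then one inflectional suffix". The following is a minimal sketch of that reading, not the Ragel-based implementation: the three morpheme lists are small illustrative samples chosen for this example, and the function name `segment` is hypothetical.

```python
# Toy morph-groups, named after the variables in the rule.
# The entries are illustrative assumptions, not the CroMo morpheme DBs.
VERB_ASPECT_PREFS = ("is", "po", "na", "pro")
VERB_ATI_ROOTS = ("čita", "pita")
VERB_INFL_SUF = ("m", "mo", "š", "te", "ju", "")  # "" stands for the empty suffix


def segment(word):
    """Return every analysis licensed by prefs* . root . suffix."""
    results = []

    def walk(rest, analysis):
        # Option 1: stop taking prefixes; the rest must be root + suffix.
        for root in VERB_ATI_ROOTS:
            if rest.startswith(root):
                tail = rest[len(root):]
                if tail in VERB_INFL_SUF:
                    results.append(
                        analysis
                        + [("verbAtiRoots", root), ("verbInflSuf", tail)]
                    )
        # Option 2: consume one more prefix (the Kleene star in the rule).
        for pref in VERB_ASPECT_PREFS:
            if rest.startswith(pref):
                walk(rest[len(pref):], analysis + [("verbAspectPrefs", pref)])

    walk(word, [])
    return results
```

Calling `segment("ispitamo")` yields one analysis, is + pita + mo, each segment tagged with the name of the morph-group it came from. A word that fits no path through the rule yields an empty list.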

FSA Architecture

Generating potentially cyclic DFSAs:

[Figure: the prefix, root and suffix automata combined into a single automaton (states 0 to 19) connected by ε-transitions. Emission labels shown: "v pref (-index", "v pref )-index asp", "v root (-index", "v root )-index", "v suf (-index" and "v suf )-index" with pres 1st sg, pres 2nd sg, pres 3rd sg, pres 1st pl, pres 2nd pl, pres 3rd pl and 2nd sg imper.]

FSA Architecture

Ambiguity mapped on emission tuple:

[Figure: a DFSA with states 0 to 4 reading g, o, r, e. The emissions are tuples: [ v root (-index; adj root (-index ], [ v root )-index; adj root )-index ], [ v suff (-index; adj suff (-index ], [ v suff )-index; adj suff )-index ].]
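
One way to picture the emission tuples for "gore" is a deterministic automaton whose transitions carry a list of tuples, with position i of every tuple belonging to reading i. The sketch below makes assumptions: where exactly each tuple is attached (state or transition) cannot be recovered from the slide, and the names `TRANSITIONS`, `run` and `readings` are hypothetical.

```python
# (state, input symbol) -> (next state, list of emission tuples).
# Tuple position 0 is the verb reading, position 1 the adjective reading.
# Attaching emissions to transitions is an assumption of this sketch.
TRANSITIONS = {
    (0, "g"): (1, [("v root (", "adj root (")]),
    (1, "o"): (2, []),
    (2, "r"): (3, [("v root )", "adj root )")]),
    (3, "e"): (4, [("v suff (", "adj suff ("), ("v suff )", "adj suff )")]),
}
FINAL_STATES = {4}


def run(word):
    """Walk the automaton once; return the emitted tuples or None."""
    state, emitted = 0, []
    for symbol in word:
        step = TRANSITIONS.get((state, symbol))
        if step is None:
            return None  # no transition: the word is rejected
        state, tuples = step
        emitted.extend(tuples)
    return emitted if state in FINAL_STATES else None


def readings(word):
    """Split the emitted tuples into one emission sequence per reading."""
    emitted = run(word)
    if emitted is None:
        return []
    return [list(reading) for reading in zip(*emitted)]
```

The automaton stays deterministic: the input is traversed once, and the ambiguity between the verbal and the adjectival analysis only shows up when the tuples are unpacked by `readings("gore")`.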

FSA Architecture

Lemmatization as a rule:

- Rightmost root is the semantic head
- Root-lemma: generate canonical word-form from the right-most root
  neprijatelja → ne + prijatelj + a → NEG + N-root + ACC
  “not friend” = “enemy” ⇏ ¬friend
  not compositional! but useful for semantic field analysis!
  root-lemma: neprijatelja → prijatelj
- Stem/base-lemma: generate canonical word-form from the stem without inflectional suffixes
  base-lemma: neprijatelja → neprijatelj
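
The two lemmatization rules can be sketched over an already segmented word. This is a simplified illustration under stated assumptions: the segment kinds "prefix", "root" and "infl" and both function names are hypothetical, and the sketch returns the bare root or stem, which happens to be the canonical form for prijatelj. Forms that need a citation ending (such as napija → napiti) are not handled.

```python
# A segmented word as (morph, kind) pairs; the kind labels are assumptions.
NEPRIJATELJA = [("ne", "prefix"), ("prijatelj", "root"), ("a", "infl")]


def root_lemma(segments):
    """Rightmost root is the semantic head; return it, or None if absent."""
    roots = [morph for morph, kind in segments if kind == "root"]
    return roots[-1] if roots else None


def base_lemma(segments):
    """Stem without inflectional suffixes: keep everything but 'infl'."""
    return "".join(morph for morph, kind in segments if kind != "infl")
```

Here `root_lemma(NEPRIJATELJA)` gives "prijatelj" and `base_lemma(NEPRIJATELJA)` gives "neprijatelj", matching the two lemmas on the slide.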

FSA Architecture

Lemmatization (Hack):

- emission of byte-offset for suffix-elimination
- pointer to suffix string

Clean solution:

[Figure: a transducer with states 0 to 4 and arcs g:(g, (FV, ...)), o:(o, ()), r:(r, ()), e:(a, (FV, ...)).]
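
Both variants can be sketched briefly. In the first function the automaton is assumed to have emitted the byte offset at which the inflectional suffix starts, and the "pointer to suffix string" is interpreted here as the replacement ending of the canonical form; that interpretation is an assumption. The second function follows the transducer arcs shown on the slide, with the feature bundle reduced to the placeholder "FV" because the rest is elided in the source. All names are hypothetical.

```python
def lemma_by_offset(word, suffix_offset, citation_suffix=""):
    """Hack: cut the word at a byte offset and append a canonical ending.

    The offset counts UTF-8 bytes and is assumed to fall on a character
    boundary; otherwise decoding would fail.
    """
    raw = word.encode("utf-8")
    return raw[:suffix_offset].decode("utf-8") + citation_suffix


# Clean solution: (state, input) -> (next state, output symbol, features).
# "FV" is a placeholder; the remaining features are elided on the slide.
ARCS = {
    (0, "g"): (1, "g", ("FV",)),
    (1, "o"): (2, "o", ()),
    (2, "r"): (3, "r", ()),
    (3, "e"): (4, "a", ("FV",)),
}


def transduce(word, final_state=4):
    """Read the word, write the lemma, and collect the emitted features."""
    state, output, features = 0, [], []
    for symbol in word:
        arc = ARCS.get((state, symbol))
        if arc is None:
            return None
        state, out_symbol, emitted = arc
        output.append(out_symbol)
        features.extend(emitted)
    if state != final_state:
        return None
    return "".join(output), features
```

With these definitions `lemma_by_offset("gore", 3, "a")` and `transduce("gore")[0]` both produce "gora". The transducer variant needs no post-processing step because the lemma is written while the input is read.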

Implementation

- C++ wrapper for final application
- Ragel code (automaton definition) generated from morpheme DBs and rules, with associated feature bundles (extended version of Ragel, ≥ V. 6.1, for handling ambiguity via introduction of multiple emission symbols = emission tuples)
- Ragel generated C code (jump-code)

[Figure: build pipeline with the boxes Morpheme tables, Rules, Ragel code, Code, DOT and Binary.]

Implementation

- Emission (feature bundles): as one bit-vector
- Features mapped from the General Ontology for Linguistic Description (upper ontology)
  - possibility: reasoning over linguistic concepts and features
- Optimization: mapping of concepts and their relations on a compressed bit-vector, maintaining inheritance and implicatures

[Figure: bit-vector layout with the fields top-node concept, sub-class and terminal-classes.]
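
How a bit-vector can preserve inheritance may be easier to see in a small example. The sketch below uses the simplest possible scheme, one bit per concept with each concept inheriting the bits of its parent; it is not the compressed layout used in CroMo, which the slides do not specify. The class name `ConceptCodes` and the concept names are illustrative and not taken from GOLD.

```python
class ConceptCodes:
    """Assign each concept a bit-vector that includes its ancestors' bits."""

    def __init__(self):
        self.code = {}

    def add(self, name, parent=None):
        own_bit = 1 << len(self.code)  # next unused bit
        inherited = self.code[parent] if parent is not None else 0
        self.code[name] = own_bit | inherited

    def is_a(self, name, ancestor):
        """Subsumption test: all bits of the ancestor must be set."""
        target = self.code[ancestor]
        return (self.code[name] & target) == target

    def bundle(self, *names):
        """A feature bundle is the bitwise OR of its concepts."""
        result = 0
        for name in names:
            result |= self.code[name]
        return result


# Illustrative hierarchy (hypothetical names).
codes = ConceptCodes()
codes.add("TenseProperty")
codes.add("Present", parent="TenseProperty")
codes.add("Past", parent="TenseProperty")
```

Because "Present" carries the bit of "TenseProperty", a bundle containing "Present" answers a query for "TenseProperty" with a single AND and compare, which is the kind of cheap reasoning over concepts the slide points to.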

Evaluation

- Hardware: dual core 2.4 GHz
- Lexical base: 120,000 morphemes (and allomorphs)
- Speed: approx. 50,000 tokens per second with average morpheme count of 2.5 per token
- Size: binary footprint approx. 5 MB
- Compilation (tables → Ragel + C; Ragel → C + DOT; gcc → bin): approx. 5 minutes, min. 4 GB RAM for monolithic architecture

Comments

- Interoperability issues addressed:
  - GOLD
  - platform independent code
  - code-page independence
- Extensible (turnaround time of some minutes)
- Minimally invasive and minimalistic
- Open-source