UNSUPERVISED MORPHOLOGICAL SEGMENTATION & CLUSTERING
ICL, Uni Heidelberg - HS CL4LRL - Katharina Allgaier - 08.06.2016
OVERVIEW
- Introduction
- Morphological Segmentation (Creutz & Lagus 2005): Aims, Models, Evaluation, Results
- Affix Clustering (Moon et al. 2009): Idea, Model, Results
- Conclusion
WHAT ARE WE DOING?
Morpheme Segmentation
- Morphemes = smallest meaning-bearing units = smallest elements of syntax
- reads = read + s; machines = machine + s; translation = translate + ion
- Meaning vs. Form: goalkeeper = goal + keeper; joystick = joy + stick
- Composition vs. Perturbation
WHAT ARE WE DOING?
- Stem vs. Affixes (Prefixes + Suffixes)
- Inflectional vs. Derivational
Affix Clustering
WHY ARE WE DOING IT?
- Important information, especially for highly inflected languages (e.g. agglutinative languages like Turkish, Finnish, Nahuatl, Japanese)
- Used in other CL applications (language production, speech recognition, machine translation etc.)
INDUCING THE MORPHOLOGICAL LEXICON OF A NATURAL LANGUAGE FROM UNANNOTATED TEXT (CREUTZ & LAGUS 2005)
- "algorithm for the unsupervised learning […] of a simple morphology of a natural language"
- Unsupervised morpheme segmentation with hierarchical representation
- English and Finnish
AIMS
- Most accurate segmentation possible
- Learn a representation of the language in the data + store it in a lexicon
- Based on several models: Linguistica, Morfessor Baseline, Morfessor Categories-ML, Morfessor Categories-MAP
BASELINE
Morfessor Baseline algorithm (Creutz & Lagus)
- Similar to some unsupervised word segmentation algorithms
- Construct a lexicon of morphs; each word can be constructed out of those morphs
- Example morph lexicon: talk, teach, es, ed, ing, word, s, morf, sor (e.g. words = word + s; morfessor = morf + es + sor)
- AIM: find an optimal + concise segmentation and lexicon
- PROBLEM: frequent words stored as a whole, rare words excessively split + stored in parts; no representation of a morph's inner structure
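The Baseline trade-off between lexicon size and corpus coding cost can be sketched as a toy MDL-style search. Everything here (the two-part bit cost, 5 bits per letter, single binary splits, the word list) is a simplified illustration, not the actual Morfessor implementation:

```python
import math
from collections import Counter

def corpus_cost(segs):
    """Code length (bits) of all morph tokens under ML unigram probabilities."""
    counts = Counter(m for morphs in segs.values() for m in morphs)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())

def lexicon_cost(segs, bits_per_letter=5.0):
    """Cost of spelling out each distinct morph once in the lexicon."""
    lexicon = {m for morphs in segs.values() for m in morphs}
    return sum(bits_per_letter * (len(m) + 1) for m in lexicon)

def total_cost(segs):
    return corpus_cost(segs) + lexicon_cost(segs)

def best_split(word, segs):
    """Try every two-way split of `word`; keep the analysis with the lowest cost."""
    best = [word]
    for i in range(1, len(word)):
        candidate = [word[:i], word[i:]]
        if total_cost({**segs, word: candidate}) < total_cost({**segs, word: best}):
            best = candidate
    return best

words = ["talks", "talked", "walks", "walked", "walk", "talk"]
segs = {w: [w] for w in words}
for w in words:
    segs[w] = best_split(w, segs)
```

Even this toy version recovers talk/walk stems plus -s and -ed, because reusing a shared stem shrinks the lexicon while barely raising the corpus cost.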
LINGUISTICA (Goldsmith 2001)
- Splits a word into a stem + one (possibly empty) prefix / suffix
- ADVANTAGE: models simple word-internal syntax (morphotactics: rules on the ordering of morphemes) by grouping sets of stems & suffixes into inflectional paradigms, e.g. talk + ed, talk + s, walk + ed, walk + s; word + s, dog + s
- DISADVANTAGE: handles highly inflecting + compounding languages poorly (alternating stems + affixes)
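The paradigm grouping can be illustrated with a small sketch: given a closed list of candidate suffixes (chosen by hand here, whereas Goldsmith's actual system induces them), stems are grouped by the exact suffix set, their "signature":

```python
from collections import defaultdict

def signatures(words, suffixes=("s", "ed", "ing")):
    """Group stems by the exact set of suffixes they take (their 'signature')."""
    vocab = set(words)
    stem_to_suffs = defaultdict(set)
    for w in vocab:
        stem_to_suffs[w].add("")                      # w = w + empty suffix
        for suf in suffixes:
            if w.endswith(suf) and len(w) > len(suf):
                stem_to_suffs[w[: -len(suf)]].add(suf)
    sig_to_stems = defaultdict(set)
    for stem, suffs in stem_to_suffs.items():
        if len(suffs) > 1:                            # singleton signatures explain nothing
            sig_to_stems[frozenset(suffs)].add(stem)
    return dict(sig_to_stems)

sigs = signatures(["talk", "talks", "talked", "walk", "walks", "walked", "dog", "dogs"])
```

Here {talk, walk} share the signature {∅, -s, -ed} while {dog} has {∅, -s}, mirroring the paradigm table on the slide.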
IMPROVED MODEL: Morfessor Categories-ML (Creutz & Lagus)
- Maximum Likelihood model; reanalyzes the segmentation of Morfessor Baseline
- Words represented as HMMs: hidden states = categories (PRE, STM, SUFF, "noise"), observable states = morphs
- Stems, prefixes + suffixes can alternate (with some restrictions)
- Split words whose morphs are present in the lexicon; join "noise" morphs with their neighbours to form proper morphs
- CRITICISM: too ad hoc + information on word frequency is lost
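The HMM view can be made concrete with a toy Viterbi decoder over morph categories. All probabilities below are invented for illustration, and the noise category is omitted; the real model estimates everything from data:

```python
import math

NEG = float("-inf")
CATS = ["PRE", "STM", "SUF"]

# Invented log-probabilities ('#' marks the word boundary) -- purely illustrative
trans = {
    "#":   {"PRE": math.log(0.3), "STM": math.log(0.7)},
    "PRE": {"STM": math.log(1.0)},
    "STM": {"STM": math.log(0.2), "SUF": math.log(0.4), "#": math.log(0.4)},
    "SUF": {"SUF": math.log(0.3), "#": math.log(0.7)},
}
emit = {
    "PRE": {"re": math.log(0.9), "un": math.log(0.1)},
    "STM": {"present": math.log(0.5), "play": math.log(0.5)},
    "SUF": {"s": math.log(0.5), "ed": math.log(0.5)},
}

def viterbi(morphs):
    """Most probable category sequence for a fixed morph segmentation."""
    best = {c: trans["#"].get(c, NEG) + emit[c].get(morphs[0], NEG) for c in CATS}
    backptrs = []
    for m in morphs[1:]:
        new, ptr = {}, {}
        for c in CATS:
            prev = max(CATS, key=lambda p: best[p] + trans[p].get(c, NEG))
            new[c] = best[prev] + trans[prev].get(c, NEG) + emit[c].get(m, NEG)
            ptr[c] = prev
        best, backptrs = new, backptrs + [ptr]
    last = max(CATS, key=lambda c: best[c] + trans[c].get("#", NEG))
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The transition table is where the "with some restrictions" lives: e.g. a suffix can never be followed by a prefix because that transition has no entry (probability zero).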
NEW MODEL: Morfessor Categories-MAP (Creutz & Lagus)
- Induces a binary hierarchical lexicon
- Retains the inner structure of words: morphs represented as concatenations of (sub)morphs of the lexicon
- Word frequency decides: own entry vs. split into morphs
- Categories: Prefix, Stem, Suffix, Non-morpheme
- Maximum a posteriori framework
- Words represented as HMMs
- Desired level of segmentation: "finest resolution that does not contain non-morphemes"
SEARCH ALGORITHM (GREEDY SEARCH)
1. Initialisation of segmentation: Representativ+ness (stem+SUFF)
2. Splitting of morphs: [Re+[present+ativ]]+[n+ess] (PRE+stem+SUFF+non+SUFF)
3. Joining of morphs: [Re+[present+ativ]]+ness (PRE+stem+SUFF+SUFF)
4. Splitting of morphs: [Re+[[pre+sent]+ativ]]+ness (PRE+non+stem+SUFF+SUFF)
5. Resegmentation of corpus + re-estimation of probabilities: [Re+[[pre+sent]+ativ]]+ness (PRE+non+stem+SUFF+SUFF)
6. Expansion to finest resolution: [Re+[present+ativ]]+ness (PRE+stem+SUFF+SUFF)
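The final expansion step can be sketched as a recursion over the binary hierarchical analysis: descend as deep as possible, but back off to the whole constituent whenever a finer split would expose a non-morpheme. The tree below hand-encodes the slide's running example; the per-node category labels are assumptions of this sketch:

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Leaf:
    text: str
    cat: str              # PRE, STM, SUF, or NON (non-morpheme)

@dataclass
class Node:
    left: "Tree"
    right: "Tree"
    cat: str              # category of the whole constituent

Tree = Union[Leaf, Node]

def surface(t: Tree) -> str:
    return t.text if isinstance(t, Leaf) else surface(t.left) + surface(t.right)

def expand(t: Tree) -> List[Leaf]:
    """Finest segmentation whose exposed parts contain no non-morpheme."""
    if isinstance(t, Leaf):
        return [t]
    parts = expand(t.left) + expand(t.right)
    if any(p.cat == "NON" for p in parts):
        return [Leaf(surface(t), t.cat)]     # back off: keep the constituent whole
    return parts

# [re + [present + ativ]] + [n + ess], as in the slide's example
word = Node(
    Node(Leaf("re", "PRE"), Node(Leaf("present", "STM"), Leaf("ativ", "SUF"), "STM"), "STM"),
    Node(Leaf("n", "NON"), Leaf("ess", "SUF"), "SUF"),
    "STM",
)
```

Expanding `word` yields re + present + ativ + ness: the left branch opens fully, while n+ess stays joined because splitting it would expose the non-morpheme "n".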
MODEL
AIM: finding the optimal lexicon + segmentation
Maximum a posteriori estimate to be maximized: argmax_lexicon P(lexicon | corpus) ∝ P(corpus | lexicon) · P(lexicon)
- P(lexicon): Form (string of letters vs. submorphs) + Meaning (frequency, length, right + left perplexity)
- P(corpus | lexicon): morph emission probabilities + transition probabilities
MORPH EMISSION PROBABILITIES
- Probability that a morph is emitted by a category
- Depend on the frequency of the morph in the training data
- Prefix-/suffix-likeness: right + left perplexity
- Stem-likeness: length
- Non-morpheme probability
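The perplexity- and length-based likeness measures can be sketched as follows. The sigmoid shape follows the general idea of graded category membership; the steepness and threshold values (`a`, `b`) and the tiny corpus are made up for illustration:

```python
import math
from collections import defaultdict

def right_perplexity(morph, segmented_corpus):
    """Perplexity of the morphs seen immediately to the right of `morph`."""
    counts = defaultdict(int)
    for morphs in segmented_corpus:
        for i, m in enumerate(morphs[:-1]):
            if m == morph:
                counts[morphs[i + 1]] += 1
    total = sum(counts.values())
    if total == 0:
        return 1.0
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return 2.0 ** entropy

def prefix_likeness(morph, corpus, a=1.0, b=2.0):
    """Graded likeness via a sigmoid; steepness a and threshold b are made up here."""
    return 1.0 / (1.0 + math.exp(-a * (right_perplexity(morph, corpus) - b)))

def stem_likeness(morph, a=1.0, b=3.0):
    """Longer strings are more stem-like (length-based sigmoid)."""
    return 1.0 / (1.0 + math.exp(-a * (len(morph) - b)))

corpus = [["re", "play"], ["re", "send"], ["re", "play"], ["play", "ed"]]
```

The intuition: a prefix like "re" is followed by many different morphs (high right perplexity), while a suffix like "ed" is not, and long strings such as "present" are more plausible stems than "s".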
EVALUATION
- Gold standard: Hutmegs linguistic morpheme segmentations (1.4 million Finnish + 120 000 English word forms)
- English data: prose + news + scientific text (Gutenberg Project, Gigaword Corpus, Brown Corpus)
- Finnish data: prose + news text (Finnish IT Centre of Science, Finnish National News Agency)
- Evaluation on 10 000, 50 000, 250 000 and 12/16 million words
RESULTS
UNSUPERVISED MORPHOLOGICAL SEGMENTATION AND CLUSTERING WITH DOCUMENT BOUNDARIES (MOON ET AL. 2009)
- Simple model without heuristics / thresholds / trained parameters
- Word segmentation: constrain candidate stems + affixes by document boundaries
- Cluster affixes of certain stems → morphologically related words
- USE: interlinearised glossed texts for low-resource languages (LRL); English + Uspanteko
IDEA
- Two words in the same document that are very similar in orthography are likely to be related morphologically
- Use document boundaries to filter out noise + constrain potential membership of word clusters
- Example contexts: "He suddenly drew a sharp sword …" / "The documentation of …"
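A minimal sketch of the document-boundary constraint: only word pairs that co-occur in the same document and share a sufficiently long common prefix become candidates. The prefix-length threshold and the toy documents are assumptions of this sketch:

```python
from itertools import combinations

def common_prefix_len(a, b):
    """Length of the longest shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def candidate_pairs(documents, min_stem=4):
    """Word-type pairs sharing a document and a common prefix >= min_stem chars."""
    pairs = set()
    for doc in documents:
        for a, b in combinations(sorted(set(doc)), 2):
            if common_prefix_len(a, b) >= min_stem:
                pairs.add((a, b))
    return pairs

docs = [["the", "document", "documents", "documented"],
        ["he", "suddenly", "drew", "a", "sharp", "sword"]]
pairs = candidate_pairs(docs)
```

"document"/"documents" survive because they co-occur, while a spurious match such as "sword" against "documents" is never even considered, since the two words never share a document.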
MODEL
- Candidate Generation
- Conflation set: "set of word types that are related through either inflectional or derivational morphology"
CANDIDATE TRIE
- Stems = trunks, affixes = branches
- Example: stem "likeli" with affix branches "-ness" and "-hood" (likeliness, likelihood)
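The trie can be sketched as nested dicts: walking it, any prefix at which the paths branch is a candidate stem, and the continuations are its candidate affixes. The minimum-stem-length cutoff is an assumption of this sketch:

```python
def build_trie(words):
    """Character trie as nested dicts; '$' marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root

def stems_and_affixes(words, min_stem=4):
    """Branching prefixes = candidate stems; their continuations = candidate affixes."""
    root = build_trie(words)
    results = {}
    def walk(node, prefix):
        if len(node) >= 2 and len(prefix) >= min_stem:   # '$' counts as the empty affix
            results[prefix] = {w[len(prefix):] for w in words if w.startswith(prefix)}
        for ch, child in node.items():
            if ch != "$":
                walk(child, prefix + ch)
    walk(root, "")
    return results

r = stems_and_affixes(["likeliness", "likelihood", "likely"])
```

For the slide's example, the node after "likeli" branches into "ness" and "hood", so "likeli" is proposed as a stem; non-branching prefixes such as "like" are not.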
MODEL
- Candidate Generation (D vs. G)
- Candidate Filtering: χ² testing (correlation between affixes)
- Affix Clustering
- Word Clustering (D vs. G)
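The χ²-based filtering step can be sketched as a standard 2×2 independence test on whether two affixes attach to the same stems. The exact statistic and cutoff used by Moon et al. are not spelled out on the slide, so this is a generic version with made-up stem sets (3.84 is the usual 0.05 critical value for one degree of freedom):

```python
def chi_squared(a_stems, b_stems, all_stems):
    """2x2 chi-squared statistic for the association of two affixes over stems."""
    n = len(all_stems)
    a = len(a_stems & b_stems)      # stems taking both affixes
    b = len(a_stems - b_stems)      # only the first affix
    c = len(b_stems - a_stems)      # only the second affix
    d = n - a - b - c               # neither affix
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

stems = {f"s{i}" for i in range(10)}
ed_stems = {"s0", "s1", "s2", "s3"}     # hypothetical stems taking -ed
s_stems  = {"s0", "s1", "s2", "s5"}     # hypothetical stems taking -s
```

Affix pairs whose statistic exceeds the critical value are kept as correlated (likely members of one paradigm); uncorrelated pairs are filtered out.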
RESULTS
Thank you for your attention!