  1. UNSUPERVISED MORPHOLOGICAL SEGMENTATION & CLUSTERING
  ICL UNI HEIDELBERG - HS CL4LRL - KATHARINA ALLGAIER - 08.06.2016

  2. OVERVIEW
  - Introduction
  - Morphological Segmentation (Creutz & Lagus 2005): Aims, Models, Evaluation, Results
  - Affix Clustering (Moon et al. 2009): Idea, Model, Results
  - Conclusion

  3. WHAT ARE WE DOING?
  - Morpheme segmentation: reads = read + s, machines = machine + s, translation = translate + ion
  - Morphemes = smallest meaning-bearing units = smallest elements of syntax
  - Meaning vs. form: goalkeeper = goal + keeper, joystick = joy + stick
  - Composition vs. perturbation
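The surface splits in the examples above can be sketched with a hand-written suffix list (a toy illustration; the suffix list and the `segment` helper are assumptions of this sketch, whereas the methods in this talk learn such units from raw text):

```python
# Toy suffix stripper: illustrates surface-level segmentation only.
# The suffix list is hand-picked, not learned.
SUFFIXES = ["ing", "ion", "ed", "s"]

def segment(word):
    """Split off one known suffix if a plausibly long stem remains."""
    for suf in SUFFIXES:
        stem = word[: len(word) - len(suf)]
        if word.endswith(suf) and len(stem) >= 3:
            # Note: this recovers the surface form ("translat" + "ion"),
            # while the meaning-level analysis is translate + ion.
            return [stem, suf]
    return [word]

print(segment("reads"))     # ['read', 's']
print(segment("machines"))  # ['machine', 's']
```

The gap between `segment("translation")` returning the surface split "translat + ion" and the intended analysis "translate + ion" is exactly the meaning-vs.-form distinction on this slide.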

  4. WHAT ARE WE DOING?
  - Stem vs. affixes (prefixes + suffixes)
  - Inflectional vs. derivational
  - Affix clustering

  5. WHY ARE WE DOING IT?
  - Morphemes carry important information, especially for highly inflected, agglutinative languages (like Turkish, Finnish, Nahuatl, Japanese)
  - Used in other CL applications (language production, speech recognition, machine translation, etc.)

  6. INDUCING THE MORPHOLOGICAL LEXICON OF A NATURAL LANGUAGE FROM UNANNOTATED TEXT - CREUTZ & LAGUS 2005
  - "algorithm for the unsupervised learning [...] of a simple morphology of a natural language"
  - Unsupervised morpheme segmentation with hierarchical representation
  - English and Finnish

  7. AIMS
  - Most accurate segmentation possible
  - Learn a representation of the language in the data + store it in a lexicon
  - Based on several models: Linguistica, Morfessor Baseline, Morfessor Categories-ML, Morfessor Categories-MAP

  8. BASELINE
  - Morfessor Baseline algorithm (Creutz & Lagus), similar to some unsupervised word segmentation algorithms
  - Construct a lexicon of morphs; each word can be constructed out of those morphs
    (example morph lexicon: talk, teach, es, ed, ing, word, words, morf, es, sor)
  - AIM: find the optimal + concise segmentation and lexicon
  - PROBLEM: frequent words stored as a whole - rare words excessively split + stored in parts; no representation of a morph's inner structure
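The "optimal + concise" trade-off can be illustrated with a rough two-part code length in the spirit of Morfessor Baseline. This is a minimal sketch under simplifying assumptions (uniform per-letter cost, no priors over morph frequencies or lengths); `description_length` is an invented name, not the published formula:

```python
import math
from collections import Counter

def description_length(segmented_corpus):
    """Two-part cost: encode the corpus given the lexicon, plus the lexicon
    itself.  segmented_corpus is a list of words, each a list of morphs."""
    morph_counts = Counter(m for word in segmented_corpus for m in word)
    total = sum(morph_counts.values())
    # Corpus cost: -log2 P(morph) for every morph token.
    corpus_cost = -sum(c * math.log2(c / total) for c in morph_counts.values())
    # Lexicon cost: spell out each distinct morph letter by letter
    # (~log2(27) bits per letter over a 26-letter alphabet plus a delimiter).
    lexicon_cost = sum((len(m) + 1) * math.log2(27) for m in morph_counts)
    return corpus_cost + lexicon_cost

# Splitting off a shared suffix pays once the suffix recurs often enough:
whole = [["talks"], ["talked"], ["walks"], ["walked"]]
split = [["talk", "s"], ["talk", "ed"], ["walk", "s"], ["walk", "ed"]]
print(description_length(split) < description_length(whole))  # True
```

The same cost also exposes the PROBLEM above: a frequent word is cheapest stored whole, while a rare word is cheapest shredded into whatever short morphs already exist.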

  9. LINGUISTICA (Goldsmith 2001)
  - Splits a word into a stem + one (possibly empty) prefix / suffix
  - ADVANTAGE: models simple word-internal syntax (morphotactics - rules on the ordering of morphemes) by grouping sets of stems & suffixes into inflectional paradigms:
    word + s, dog + s; talk + ed, talk + s, walk + ed, walk + s
  - DISADVANTAGE: handles highly inflecting + compounding languages poorly (alternating stems + affixes)
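The paradigm grouping can be sketched as signature extraction: group stems by the exact set of suffixes they take. This is a toy version with a fixed candidate suffix list (an assumption of the sketch); real Linguistica selects among candidate signatures with an MDL criterion:

```python
from collections import defaultdict

def signatures(words, suffixes=("", "s", "ed")):
    """Map each frozenset of suffixes (a 'signature') to the stems taking it."""
    stem_sufs = defaultdict(set)
    for w in words:
        for suf in suffixes:
            stem = w[: len(w) - len(suf)] if suf else w
            if w.endswith(suf) and len(stem) >= 3:
                stem_sufs[stem].add(suf)
    sig = defaultdict(set)
    for stem, sufs in stem_sufs.items():
        sig[frozenset(sufs)].add(stem)
    return sig

sigs = signatures(["talk", "talks", "talked", "walk", "walks", "walked", "dog", "dogs"])
# 'talk' and 'walk' share the signature {"", "s", "ed"}; 'dog' takes {"", "s"}.
```

Stems landing in the same signature form one inflectional paradigm, mirroring the word/dog vs. talk/walk grouping on the slide.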

  10. IMPROVED MODEL
  - Morfessor Categories-ML (Creutz & Lagus): reanalyzes the segmentation of Morfessor Baseline
  - Maximum likelihood model; words represented as HMMs
    - hidden states: categories (PRE, STEM, SUFF, ...)
    - observable states: morphs
  - Stems, prefixes + suffixes can alternate (with some restrictions)
  - "Noise" category:
    - split words whose morphs are present in the lexicon
    - join "noise" morphs with their neighbours to form proper morphs
  - CRITICISM: too ad hoc + information on word frequency is lost

  11. NEW MODEL
  - Morfessor Categories-MAP (Creutz & Lagus)
  - Induces a binary hierarchical lexicon; retains the inner structure of words
  - Morphs represented as concatenations of (sub)morphs of the lexicon
  - Word frequency (own entry vs. split into morphs)
  - Categories: Prefix - Stem - Suffix - Non-morpheme

  12.
  - Maximum a posteriori framework
  - Words represented as HMMs
  - Desired level of segmentation: "finest resolution that does not contain non-morphemes"

  13. SEARCH ALGORITHM (GREEDY SEARCH)
  - Initialisation of segmentation:  Representativ+ness  (stem+SUFF)
  - Splitting of morphs:  [Re+[present+ativ]]+[n+ess]  (PRE+stem+SUFF+non+SUFF)
  - Joining of morphs:  [Re+[present+ativ]]+ness  (PRE+stem+SUFF+SUFF)
  - Splitting of morphs:  [Re+[[pre+sent]+ativ]]+ness  (PRE+non+stem+SUFF+SUFF)
  - Resegmentation of corpus + re-estimation of probabilities:  [Re+[[pre+sent]+ativ]]+ness  (PRE+non+stem+SUFF+SUFF)
  - Expansion to finest resolution:  [Re+[present+ativ]]+ness  (PRE+stem+SUFF+SUFF)

  14. MODEL
  - AIM: find the optimal lexicon + segmentation
  - Maximum a posteriori estimate to be maximized: arg max P(lexicon | corpus) = arg max P(corpus | lexicon) · P(lexicon)
  - Lexicon: each morph has a form (string of letters vs. submorphs) and a meaning (frequency, length, right + left perplexity)
  - Corpus: morph emission probability + transition probability

  15.
  - Morph emission probabilities: probability that a morph is emitted by a category
  - Depend on:
    - frequency of the morph in the training data
    - prefix-/suffix-likeness (right + left perplexity)
    - stem-likeness (length)
    - non-morpheme probability
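The perplexity cue can be sketched as follows. This is a simplification (the paper passes such values through graded sigmoids to obtain category probabilities, which is omitted here); `left_perplexity` is an illustrative name. A morph preceded by many different neighbours has high left perplexity and is suffix-like; the mirror-image right perplexity signals prefix-likeness:

```python
import math
from collections import Counter

def left_perplexity(target, segmented_words):
    """Perplexity of the morphs appearing immediately left of `target`.
    High value = many different left neighbours = suffix-like."""
    left = Counter()
    for morphs in segmented_words:
        for i, m in enumerate(morphs):
            if m == target and i > 0:
                left[morphs[i - 1]] += 1
    if not left:
        return 1.0
    total = sum(left.values())
    entropy = -sum(c / total * math.log2(c / total) for c in left.values())
    return 2 ** entropy

corpus = [["talk", "ed"], ["walk", "ed"], ["jump", "ed"], ["talk", "s"]]
# "ed" follows three different stems, so its left perplexity is ~3: suffix-like.
print(left_perplexity("ed", corpus))
```

A stem like "talk", by contrast, has few distinct left neighbours (here none), while its length feeds the separate stem-likeness cue.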

  16. EVALUATION
  - Gold standard: Hutmegs - linguistic morpheme segmentations (1.4 million Finnish + 120 000 English word forms)
  - English data: prose + news text (Gutenberg Project, Gigaword Corpus, Brown Corpus)
  - Finnish data: prose + news + scientific text (Finnish IT Centre of Science, Finnish National News Agency)
  - Evaluation on 10 000, 50 000, 250 000, and 12/16 million words

  17. RESULTS

  18. UNSUPERVISED MORPHOLOGICAL SEGMENTATION AND CLUSTERING WITH DOCUMENT BOUNDARIES - MOON ET AL. 2009
  - Simple model without heuristics / thresholds / trained parameters
  - Word segmentation: constrain candidate stems + affixes by document boundaries
  - Cluster affixes of certain stems -> morphologically related words
  - USE: interlinearised glossed texts for low-resource languages (LRL)
  - English + Uspanteko

  19. IDEA
  - Two words in the same document that are very similar in orthography are likely to be related morphologically
  - Use document boundaries to filter out noise + constrain potential membership of word clusters
  - Example: "He suddenly drew a sharp sword ..." vs. "The documentation of ..."
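A minimal sketch of the document-boundary filter, with shared-prefix length standing in for the paper's orthographic-similarity criterion (the function name, threshold, and example documents are assumptions of this sketch):

```python
from itertools import combinations

def candidate_pairs(documents, min_prefix=4):
    """Keep only word pairs that co-occur in the SAME document and share a
    prefix of at least min_prefix letters."""
    pairs = set()
    for doc in documents:
        words = set(doc.lower().split())
        for a, b in combinations(sorted(words), 2):
            if a[:min_prefix] == b[:min_prefix]:
                pairs.add((a, b))
    return pairs

docs = [
    "the document describes the documentation process",
    "he suddenly drew a sharp sword",
]
print(candidate_pairs(docs))
# ("document", "documentation") survives; words from different documents are
# never paired, so no spurious candidates cross the boundary.
```

Without the boundary constraint, a corpus-wide comparison would also have to consider pairs like "sword"/"suddenly" against every similar word anywhere in the corpus; restricting to within-document pairs prunes that noise cheaply.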

  20. MODEL
  - Candidate generation
  - Conflation set: "set of word types that are related through either inflectional or derivational morphology"

  21. CANDIDATE TRIE
  - Stems = trunks, affixes = branches
  - Example: like + li + {ness, hood} (likeliness, likelihood)
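The trie on this slide can be sketched as nested dictionaries. `branch_point` is an illustrative helper (an assumption of this sketch, not the paper's procedure) that uses the last branching depth along a word's path as a naive stem/affix boundary candidate:

```python
def build_trie(words):
    """Character trie as nested dicts; '$' marks end of word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def branch_point(trie, word):
    """Depth of the last branching node on the word's path: where the shared
    trunk (stem candidate) ends and the branches (affix candidates) begin."""
    node, depth = trie, 0
    for i, ch in enumerate(word):
        if len(node) > 1:
            depth = i
        node = node[ch]
    return depth

trie = build_trie(["likeliness", "likelihood", "likely"])
print(branch_point(trie, "likeliness"))  # 6 -> "likeli" + "ness"
```

Here "likeliness" and "likelihood" share the trunk "likeli" before branching into "ness" and "hood", matching the picture on the slide.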

  22. MODEL
  - Candidate generation (D vs. G)
  - Candidate filtering: χ² testing - correlation between affixes
  - Conflation set: "set of word types that are related through either inflectional or derivational morphology"
  - Affix clustering
  - Word clustering (D vs. G)
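The χ² filtering step can be sketched with a hand-rolled 2×2 test of affix association over stems (illustrative names and toy data; no significance threshold is applied, and the real pipeline tests affix correlations over its generated candidates):

```python
def chi_square(stems_with_a, stems_with_b, all_stems):
    """2x2 chi-square statistic: do two affixes attach to the same stems
    more often than chance would predict?"""
    a, b = set(stems_with_a), set(stems_with_b)
    n = len(set(all_stems))
    both, only_a, only_b = len(a & b), len(a - b), len(b - a)
    table = [[both, only_a], [only_b, n - both - only_a - only_b]]
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            if expected:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

stems = ["talk", "walk", "jump", "dog", "cat", "run"]
takes_s = ["talk", "walk", "jump", "dog", "cat"]
takes_ed = ["talk", "walk", "jump"]
score = chi_square(takes_s, takes_ed, stems)  # higher = stronger association
```

Affix pairs whose statistic clears a chosen threshold would be kept as correlated (e.g. "-s" and "-ed" forming a verbal paradigm), feeding the affix-clustering step.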

  23. RESULTS

  24. Thank you for your attention!
