UNSUPERVISED MORPHOLOGICAL SEGMENTATION & CLUSTERING
ICL, Uni Heidelberg - HS CL4LRL - Katharina Allgaier - 08.06.2016
OVERVIEW
- Introduction
- Morphological Segmentation (Creutz & Lagus 2005): Aims, Models, Evaluation, Results
- Affix Clustering (Moon et al. 2009): Idea, Model, Results
- Conclusion
WHAT ARE WE DOING?
Morpheme Segmentation
- Morphemes = smallest meaning-bearing units = smallest elements of syntax
- reads = read + s; machines = machine + s; translation = translate + ion
- Meaning vs. Form: goalkeeper = goal + keeper; joystick = joy + stick
- Composition vs. Perturbation
WHAT ARE WE DOING?
- Stem vs. Affixes (Prefixes + Suffixes)
- Inflectional vs. Derivational
Affix Clustering
WHY ARE WE DOING IT?
- Important information, especially for highly inflected languages (e.g. agglutinative languages like Turkish, Finnish, Nahuatl, Japanese)
- Used in other CL applications (language production, speech recognition, machine translation etc.)
INDUCING THE MORPHOLOGICAL LEXICON OF A NATURAL LANGUAGE FROM UNANNOTATED TEXT (CREUTZ & LAGUS 2005)
- "algorithm for the unsupervised learning […] of a simple morphology of a natural language"
- Unsupervised morpheme segmentation with hierarchical representation
- English and Finnish
AIMS
- Most accurate segmentation possible
- Learn a representation of the language in the data + store it in a lexicon
- Based on several models: Linguistica, Morfessor Baseline, Morfessor Categories-ML, Morfessor Categories-MAP
BASELINE
Morfessor Baseline algorithm (Creutz & Lagus)
- Similar to some unsupervised word segmentation algorithms
- Construct a lexicon of morphs; each word can be constructed out of those morphs
- Example morph lexicon: talk, teach, es, ed, ing, word, s, morf, sor (e.g. words = word + s; morfessor = morf + es + sor)
- AIM: find an optimal + concise segmentation and lexicon
- PROBLEM: frequent words stored as a whole, rare words excessively split + stored in parts; no representation of a morph's inner structure
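The Baseline trade-off between lexicon size and corpus coding cost can be sketched as a toy MDL-style search. Everything here (the two-part bit cost, 5 bits per letter, single binary splits, the word list) is a simplified illustration, not the actual Morfessor implementation:

```python
import math
from collections import Counter

def corpus_cost(segs):
    """Code length (bits) of all morph tokens under ML unigram probabilities."""
    counts = Counter(m for morphs in segs.values() for m in morphs)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())

def lexicon_cost(segs, bits_per_letter=5.0):
    """Cost of spelling out each distinct morph once in the lexicon."""
    lexicon = {m for morphs in segs.values() for m in morphs}
    return sum(bits_per_letter * (len(m) + 1) for m in lexicon)

def total_cost(segs):
    return corpus_cost(segs) + lexicon_cost(segs)

def best_split(word, segs):
    """Try every two-way split of `word`; keep the analysis with the lowest cost."""
    best = [word]
    for i in range(1, len(word)):
        candidate = [word[:i], word[i:]]
        if total_cost({**segs, word: candidate}) < total_cost({**segs, word: best}):
            best = candidate
    return best

words = ["talks", "talked", "walks", "walked", "walk", "talk"]
segs = {w: [w] for w in words}
for w in words:
    segs[w] = best_split(w, segs)
```

Even this toy version recovers talk/walk stems plus -s and -ed, because reusing a shared stem shrinks the lexicon while barely raising the corpus cost.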
LINGUISTICA (Goldsmith 2001)
- Splits a word into a stem + one (possibly empty) prefix / suffix
- ADVANTAGE: models simple word-internal syntax (morphotactics: rules on the ordering of morphemes) by grouping sets of stems & suffixes into inflectional paradigms, e.g. talk + ed, talk + s, walk + ed, walk + s; word + s, dog + s
- DISADVANTAGE: handles highly inflecting + compounding languages poorly (alternating stems + affixes)
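The paradigm grouping can be illustrated with a small sketch: given a closed list of candidate suffixes (chosen by hand here, whereas Goldsmith's actual system induces them), stems are grouped by the exact suffix set, their "signature":

```python
from collections import defaultdict

def signatures(words, suffixes=("s", "ed", "ing")):
    """Group stems by the exact set of suffixes they take (their 'signature')."""
    vocab = set(words)
    stem_to_suffs = defaultdict(set)
    for w in vocab:
        stem_to_suffs[w].add("")                      # w = w + empty suffix
        for suf in suffixes:
            if w.endswith(suf) and len(w) > len(suf):
                stem_to_suffs[w[: -len(suf)]].add(suf)
    sig_to_stems = defaultdict(set)
    for stem, suffs in stem_to_suffs.items():
        if len(suffs) > 1:                            # singleton signatures explain nothing
            sig_to_stems[frozenset(suffs)].add(stem)
    return dict(sig_to_stems)

sigs = signatures(["talk", "talks", "talked", "walk", "walks", "walked", "dog", "dogs"])
```

Here {talk, walk} share the signature {∅, -s, -ed} while {dog} has {∅, -s}, mirroring the paradigm table on the slide.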
IMPROVED MODEL: Morfessor Categories-ML (Creutz & Lagus)
- Maximum Likelihood model; reanalyzes the segmentation of Morfessor Baseline
- Words represented as HMMs: hidden states = categories (PRE, STM, SUFF, "noise"), observable states = morphs
- Stems, prefixes + suffixes can alternate (with some restrictions)
- Split words whose morphs are present in the lexicon; join "noise" morphs with their neighbours to form proper morphs
- CRITICISM: too ad hoc + information on word frequency is lost
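The HMM view can be made concrete with a toy Viterbi decoder over morph categories. All probabilities below are invented for illustration, and the noise category is omitted; the real model estimates everything from data:

```python
import math

NEG = float("-inf")
CATS = ["PRE", "STM", "SUF"]

# Invented log-probabilities ('#' marks the word boundary) -- purely illustrative
trans = {
    "#":   {"PRE": math.log(0.3), "STM": math.log(0.7)},
    "PRE": {"STM": math.log(1.0)},
    "STM": {"STM": math.log(0.2), "SUF": math.log(0.4), "#": math.log(0.4)},
    "SUF": {"SUF": math.log(0.3), "#": math.log(0.7)},
}
emit = {
    "PRE": {"re": math.log(0.9), "un": math.log(0.1)},
    "STM": {"present": math.log(0.5), "play": math.log(0.5)},
    "SUF": {"s": math.log(0.5), "ed": math.log(0.5)},
}

def viterbi(morphs):
    """Most probable category sequence for a fixed morph segmentation."""
    best = {c: trans["#"].get(c, NEG) + emit[c].get(morphs[0], NEG) for c in CATS}
    backptrs = []
    for m in morphs[1:]:
        new, ptr = {}, {}
        for c in CATS:
            prev = max(CATS, key=lambda p: best[p] + trans[p].get(c, NEG))
            new[c] = best[prev] + trans[prev].get(c, NEG) + emit[c].get(m, NEG)
            ptr[c] = prev
        best, backptrs = new, backptrs + [ptr]
    last = max(CATS, key=lambda c: best[c] + trans[c].get("#", NEG))
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The transition table is where the "with some restrictions" lives: e.g. a suffix can never be followed by a prefix because that transition has no entry (probability zero).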
NEW MODEL: Morfessor Categories-MAP (Creutz & Lagus)
- Induces a binary hierarchical lexicon
- Retains the inner structure of words: morphs represented as concatenations of (sub)morphs of the lexicon
- Word frequency decides: own entry vs. split into morphs
- Categories: Prefix, Stem, Suffix, Non-morpheme
- Maximum a posteriori framework
- Words represented as HMMs
- Desired level of segmentation: "finest resolution that does not contain non-morphemes"
SEARCH ALGORITHM (GREEDY SEARCH)
1. Initialisation of segmentation: Representativ+ness (stem+SUFF)
2. Splitting of morphs: [Re+[present+ativ]]+[n+ess] (PRE+stem+SUFF+non+SUFF)
3. Joining of morphs: [Re+[present+ativ]]+ness (PRE+stem+SUFF+SUFF)
4. Splitting of morphs: [Re+[[pre+sent]+ativ]]+ness (PRE+non+stem+SUFF+SUFF)
5. Resegmentation of corpus + re-estimation of probabilities: [Re+[[pre+sent]+ativ]]+ness (PRE+non+stem+SUFF+SUFF)
6. Expansion to finest resolution: [Re+[present+ativ]]+ness (PRE+stem+SUFF+SUFF)
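The final expansion step can be sketched as a recursion over the binary hierarchical analysis: descend as deep as possible, but back off to the whole constituent whenever a finer split would expose a non-morpheme. The tree below hand-encodes the slide's running example; the per-node category labels are assumptions of this sketch:

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Leaf:
    text: str
    cat: str              # PRE, STM, SUF, or NON (non-morpheme)

@dataclass
class Node:
    left: "Tree"
    right: "Tree"
    cat: str              # category of the whole constituent

Tree = Union[Leaf, Node]

def surface(t: Tree) -> str:
    return t.text if isinstance(t, Leaf) else surface(t.left) + surface(t.right)

def expand(t: Tree) -> List[Leaf]:
    """Finest segmentation whose exposed parts contain no non-morpheme."""
    if isinstance(t, Leaf):
        return [t]
    parts = expand(t.left) + expand(t.right)
    if any(p.cat == "NON" for p in parts):
        return [Leaf(surface(t), t.cat)]     # back off: keep the constituent whole
    return parts

# [re + [present + ativ]] + [n + ess], as in the slide's example
word = Node(
    Node(Leaf("re", "PRE"), Node(Leaf("present", "STM"), Leaf("ativ", "SUF"), "STM"), "STM"),
    Node(Leaf("n", "NON"), Leaf("ess", "SUF"), "SUF"),
    "STM",
)
```

Expanding `word` yields re + present + ativ + ness: the left branch opens fully, while n+ess stays joined because splitting it would expose the non-morpheme "n".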
MODEL
AIM: finding the optimal lexicon + segmentation
Maximum a posteriori estimate to be maximized: argmax_lexicon P(lexicon | corpus) ∝ P(corpus | lexicon) · P(lexicon)
- P(lexicon): Form (string of letters vs. submorphs) + Meaning (frequency, length, right + left perplexity)
- P(corpus | lexicon): morph emission probabilities + transition probabilities
MORPH EMISSION PROBABILITIES
- Probability that a morph is emitted by a category
- Depend on the frequency of the morph in the training data
- Prefix-/suffix-likeness: right + left perplexity
- Stem-likeness: length
- Non-morpheme probability
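The perplexity- and length-based likeness measures can be sketched as follows. The sigmoid shape follows the general idea of graded category membership; the steepness and threshold values (`a`, `b`) and the tiny corpus are made up for illustration:

```python
import math
from collections import defaultdict

def right_perplexity(morph, segmented_corpus):
    """Perplexity of the morphs seen immediately to the right of `morph`."""
    counts = defaultdict(int)
    for morphs in segmented_corpus:
        for i, m in enumerate(morphs[:-1]):
            if m == morph:
                counts[morphs[i + 1]] += 1
    total = sum(counts.values())
    if total == 0:
        return 1.0
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return 2.0 ** entropy

def prefix_likeness(morph, corpus, a=1.0, b=2.0):
    """Graded likeness via a sigmoid; steepness a and threshold b are made up here."""
    return 1.0 / (1.0 + math.exp(-a * (right_perplexity(morph, corpus) - b)))

def stem_likeness(morph, a=1.0, b=3.0):
    """Longer strings are more stem-like (length-based sigmoid)."""
    return 1.0 / (1.0 + math.exp(-a * (len(morph) - b)))

corpus = [["re", "play"], ["re", "send"], ["re", "play"], ["play", "ed"]]
```

The intuition: a prefix like "re" is followed by many different morphs (high right perplexity), while a suffix like "ed" is not, and long strings such as "present" are more plausible stems than "s".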
EVALUATION
- Gold standard: Hutmegs linguistic morpheme segmentations (1.4 million Finnish + 120 000 English word forms)
- English data: prose + news + scientific text (Gutenberg Project, Gigaword Corpus, Brown Corpus)
- Finnish data: prose + news text (Finnish IT Centre of Science, Finnish National News Agency)
- Evaluation on 10 000, 50 000, 250 000 and 12/16 million words
RESULTS
UNSUPERVISED MORPHOLOGICAL SEGMENTATION AND CLUSTERING WITH DOCUMENT BOUNDARIES (MOON ET AL. 2009)
- Simple model without heuristics / thresholds / trained parameters
- Word segmentation: constrain candidate stems + affixes by document boundaries
- Cluster affixes of certain stems → morphologically related words
- USE: interlinearised glossed texts for low-resource languages (LRL); English + Uspanteko
IDEA
- Two words in the same document that are very similar in orthography are likely to be related morphologically
- Use document boundaries to filter out noise + constrain potential membership of word clusters
- Example contexts: "He suddenly drew a sharp sword …" / "The documentation of …"
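A minimal sketch of the document-boundary constraint: only word pairs that co-occur in the same document and share a sufficiently long common prefix become candidates. The prefix-length threshold and the toy documents are assumptions of this sketch:

```python
from itertools import combinations

def common_prefix_len(a, b):
    """Length of the longest shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def candidate_pairs(documents, min_stem=4):
    """Word-type pairs sharing a document and a common prefix >= min_stem chars."""
    pairs = set()
    for doc in documents:
        for a, b in combinations(sorted(set(doc)), 2):
            if common_prefix_len(a, b) >= min_stem:
                pairs.add((a, b))
    return pairs

docs = [["the", "document", "documents", "documented"],
        ["he", "suddenly", "drew", "a", "sharp", "sword"]]
pairs = candidate_pairs(docs)
```

"document"/"documents" survive because they co-occur, while a spurious match such as "sword" against "documents" is never even considered, since the two words never share a document.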
MODEL
- Candidate Generation
- Conflation set: "set of word types that are related through either inflectional or derivational morphology"
CANDIDATE TRIE
- Stems = trunks, affixes = branches
- Example: stem "likeli" with affix branches "-ness" and "-hood" (likeliness, likelihood)
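The trie can be sketched as nested dicts: walking it, any prefix at which the paths branch is a candidate stem, and the continuations are its candidate affixes. The minimum-stem-length cutoff is an assumption of this sketch:

```python
def build_trie(words):
    """Character trie as nested dicts; '$' marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root

def stems_and_affixes(words, min_stem=4):
    """Branching prefixes = candidate stems; their continuations = candidate affixes."""
    root = build_trie(words)
    results = {}
    def walk(node, prefix):
        if len(node) >= 2 and len(prefix) >= min_stem:   # '$' counts as the empty affix
            results[prefix] = {w[len(prefix):] for w in words if w.startswith(prefix)}
        for ch, child in node.items():
            if ch != "$":
                walk(child, prefix + ch)
    walk(root, "")
    return results

r = stems_and_affixes(["likeliness", "likelihood", "likely"])
```

For the slide's example, the node after "likeli" branches into "ness" and "hood", so "likeli" is proposed as a stem; non-branching prefixes such as "like" are not.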
MODEL
- Candidate Generation (D vs. G)
- Candidate Filtering: χ² testing (correlation between affixes)
- Affix Clustering
- Word Clustering (D vs. G)
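The χ²-based filtering step can be sketched as a standard 2×2 independence test on whether two affixes attach to the same stems. The exact statistic and cutoff used by Moon et al. are not spelled out on the slide, so this is a generic version with made-up stem sets (3.84 is the usual 0.05 critical value for one degree of freedom):

```python
def chi_squared(a_stems, b_stems, all_stems):
    """2x2 chi-squared statistic for the association of two affixes over stems."""
    n = len(all_stems)
    a = len(a_stems & b_stems)      # stems taking both affixes
    b = len(a_stems - b_stems)      # only the first affix
    c = len(b_stems - a_stems)      # only the second affix
    d = n - a - b - c               # neither affix
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

stems = {f"s{i}" for i in range(10)}
ed_stems = {"s0", "s1", "s2", "s3"}     # hypothetical stems taking -ed
s_stems  = {"s0", "s1", "s2", "s5"}     # hypothetical stems taking -s
```

Affix pairs whose statistic exceeds the critical value are kept as correlated (likely members of one paradigm); uncorrelated pairs are filtered out.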
RESULTS
Thank you for your attention!