Simple Morpheme Labelling in Unsupervised Morpheme Analysis

Delphine Bernhard, Ubiquitous Knowledge Processing Lab, Darmstadt, Germany
Morpho Challenge 2007, September 19, 2007

Main features of the method

◮ Algorithm already presented at Morpho Challenge 2005
◮ Only input: plain list of words ⇒ no use of corpora or token frequency information
◮ Output: list of labelled morphemic segments for each word:
    ◮ prefix: dis arm ed
    ◮ suffix: sulk ing
    ◮ stem: grow
    ◮ linking element: oil - painting s

Overview of the method

Input: list of word forms
Step 1: Extraction of prefixes and suffixes ⇒ list of prefixes and suffixes
Step 2: Acquisition of stems ⇒ list of stems
Step 3: Segmentation of words ⇒ potential segmentations for each word
Step 4: Selection of the best segmentation ⇒ morphemic segments
Step 5 (optional): Application of the segments to a new data set (additional word forms) ⇒ segmented words

Step 1: Extraction of prefixes and suffixes

◮ Input: longest words
◮ Locate positions with low segment predictability
◮ Output: segments

[Figure: variations of the average maximum transition probabilities across the letters of "hyperventilating"; x-axis: letters, y-axis: 0.0 to 1.0]
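The idea of cutting a word where the next letters become hard to predict can be illustrated with a much simpler measure than the one plotted on the slide. The sketch below is an assumption for illustration only: it uses forward prefix-to-prefix transition probabilities estimated from the word list, not the average maximum transition probabilities of the actual method, and the function names and toy word list are hypothetical. It also assumes the analysed word is itself in the word list.

```python
from collections import Counter


def prefix_counts(words):
    """Count how many words start with each prefix (including the empty one)."""
    counts = Counter()
    for word in words:
        for end in range(len(word) + 1):
            counts[word[:end]] += 1
    return counts


def forward_probabilities(word, counts):
    """P(word[:i + 1] | word[:i]) for i = 1 .. len(word) - 1."""
    return [counts[word[: i + 1]] / counts[word[:i]] for i in range(1, len(word))]


def low_predictability_positions(word, counts):
    """Return cut positions where the forward probability is a local minimum."""
    probs = forward_probabilities(word, counts)
    cuts = []
    for k in range(1, len(probs) - 1):
        if probs[k] < probs[k - 1] and probs[k] <= probs[k + 1]:
            cuts.append(k + 1)  # probs[k] describes the transition after k + 1 letters
    return cuts


# Hypothetical toy word list, not data from the presentation.
words = ["walking", "walked", "walks", "talking", "talked"]
counts = prefix_counts(words)
cuts = low_predictability_positions("walking", counts)
segments = ["walking"[:cuts[0]], "walking"[cuts[0]:]]
```

In this toy list every word starting with "w" continues as "walk", so predictability only drops after the fourth letter, which is where the sketch places the cut.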

Step 1: Extraction of prefixes and suffixes

◮ Identification of a stem among the segments:

    segment     hyper     ventilat     ing
    frequency   123    >  16        <  13,768
    length      5      <  8         >  3

◮ Prefixes and suffixes found around the stem ventilat:
    ◮ prefixes: hyper, un, badly-
    ◮ suffixes: ing, ion, or, ors, ed
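The comparison on this slide suggests that the stem is the segment that is both longer and less frequent than its neighbours. A minimal sketch of that reading follows; the rule "longest and rarest segment" is an assumption drawn from the comparison signs, and the helper names are hypothetical.

```python
def identify_stem(segments, freq):
    """Return the segment that is both the longest and the least frequent, if any."""
    longest = max(segments, key=len)
    rarest = min(segments, key=lambda s: freq[s])
    return longest if longest == rarest else None


def split_affixes(segments, stem):
    """Segments before the stem are treated as prefixes, those after it as suffixes."""
    index = segments.index(stem)
    return segments[:index], segments[index + 1:]


# Segments and frequencies taken from the hyperventilating example.
segments = ["hyper", "ventilat", "ing"]
freq = {"hyper": 123, "ventilat": 16, "ing": 13768}

stem = identify_stem(segments, freq)
prefixes, suffixes = split_affixes(segments, stem)
```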

Step 2: Acquisition of stems

◮ Subtract prefixes and suffixes from all words
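Subtracting affixes can be sketched as follows. This is only an illustration under stated assumptions: the slide does not say how competing affixes are handled, so the sketch strips the longest matching prefix and the longest matching suffix, and the `min_stem_len` threshold is a hypothetical parameter.

```python
def longest_match(candidates, predicate):
    """Return the longest candidate accepted by the predicate, or '' if none."""
    matches = [c for c in candidates if predicate(c)]
    return max(matches, key=len, default="")


def acquire_stems(words, prefixes, suffixes, min_stem_len=3):
    """Strip one known prefix and one known suffix from each word."""
    stems = set()
    for word in words:
        prefix = longest_match(prefixes, word.startswith)
        rest = word[len(prefix):]
        suffix = longest_match(suffixes, rest.endswith)
        stem = rest[: len(rest) - len(suffix)]
        if len(stem) >= min_stem_len:  # hypothetical length threshold
            stems.add(stem)
    return stems


# Affixes from the ventilat example; the word list is illustrative.
words = ["hyperventilating", "ventilation", "unventilated", "ventilators"]
prefixes = ["hyper", "un", "badly-"]
suffixes = ["ing", "ion", "or", "ors", "ed"]
stems = acquire_stems(words, prefixes, suffixes)
```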

Step 3: Segmentation of words

◮ Alignment of words containing the same stem in order to discover similar and dissimilar parts

[Figure: alignment graph for words containing the stem integrat, with the preceding segments dis, re, well- and fully-, the following segments e, d, ing, ion, ist, or and s, and # boundary markers]

Step 3: Segmentation of words

◮ Validation of new prefixes and suffixes: the segments found before a stem are sorted into known prefixes (A_1), potential stems (A_2) and new prefixes (A_3)

    Word               Preceding segment
    fully-integrated   fully-
    well-integrated    well-
    reintegrated       re
    disintegrated      dis
    integrated         ε

◮ Validation condition, with thresholds a and b:

\[ \frac{|A_1| + |A_2|}{|A_1| + |A_2| + |A_3|} \geq a \quad \text{and} \quad \frac{|A_1|}{|A_1| + |A_2|} \geq b \]
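The two ratios of the validation condition are easy to check in code. In the sketch below the split of the example segments into A_1, A_2 and A_3 and the threshold values are hypothetical, since the slide text gives neither; only the two inequalities come from the presentation.

```python
def is_valid(a1, a2, a3, a, b):
    """Check (|A1|+|A2|)/(|A1|+|A2|+|A3|) >= a and |A1|/(|A1|+|A2|) >= b."""
    known_or_stem = len(a1) + len(a2)
    total = known_or_stem + len(a3)
    if known_or_stem == 0:  # both ratios are undefined or zero
        return False
    return known_or_stem / total >= a and len(a1) / known_or_stem >= b


# Hypothetical assignment of the segments preceding "integrated".
known_prefixes = {"re", "dis"}         # A1
potential_stems = {""}                 # A2, the empty segment
new_prefixes = {"fully-", "well-"}     # A3

lenient = is_valid(known_prefixes, potential_stems, new_prefixes, a=0.5, b=0.5)
strict = is_valid(known_prefixes, potential_stems, new_prefixes, a=0.8, b=0.5)
```

With this assignment the first ratio is 3/5 and the second is 2/3, so the new prefixes pass the lenient thresholds and fail the strict ones.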

Step 4: Selection of the best segmentation

Potential segmentations of auto-transplantation (segment frequencies in parentheses):

    auto (41)   - (12,194)   transplantation (12)
    auto (41)   - (12,194)   transplant (40)    ation (737)
    auto (41)   - (12,194)   transplanta (16)   tion (103)

◮ The most frequent segment is chosen when given a choice
◮ Some frequency and morphotactic constraints are verified
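One way to read "the most frequent segment is chosen when given a choice" is a left-to-right walk over the candidate segmentations that keeps, at each branching point, the candidates whose next segment is most frequent. The sketch below implements that reading only; it is an assumption about the procedure and leaves out the frequency and morphotactic constraints mentioned on the slide.

```python
def select_best(candidates, freq):
    """Keep, position by position, the candidates with the most frequent segment."""
    remaining = [list(c) for c in candidates]
    position = 0
    while len(remaining) > 1 and all(position < len(c) for c in remaining):
        best = max(freq.get(c[position], 0) for c in remaining)
        remaining = [c for c in remaining if freq.get(c[position], 0) == best]
        position += 1
    return remaining[0]


# Frequencies and candidates from the auto-transplantation example.
freq = {
    "auto": 41, "-": 12194, "transplantation": 12,
    "transplant": 40, "ation": 737, "transplanta": 16, "tion": 103,
}
candidates = [
    ["auto", "-", "transplantation"],
    ["auto", "-", "transplant", "ation"],
    ["auto", "-", "transplanta", "tion"],
]
best_segmentation = select_best(candidates, freq)
```

Under this reading the choice falls on transplant (40) rather than transplanta (16) or transplantation (12).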

Step 5 (optional): Application of the morphemic segments to a new data set

◮ For each word, select segments so that the total cost is minimal
◮ Cost functions used:
    ◮ Method 1: \( \mathrm{cost}_1(s_i) = -\log \frac{f(s_i)}{\sum_i f(s_i)} \)
    ◮ Method 2: \( \mathrm{cost}_2(s_i) = -\log \frac{f(s_i)}{\max_i [f(s_i)]} \)
  where:
    ◮ s_i = morphemic segment
    ◮ f(s_i) = frequency of segment s_i
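A minimal sketch of minimum-cost segmentation with these two cost functions is given below. The dynamic-programming search and the toy segment frequencies are assumptions made for illustration; the slide only states the cost functions and the minimality criterion.

```python
import math


def make_costs(freq):
    """Build cost_1 (normalised by the sum) and cost_2 (normalised by the maximum)."""
    total = sum(freq.values())
    top = max(freq.values())
    cost1 = {s: -math.log(f / total) for s, f in freq.items()}
    cost2 = {s: -math.log(f / top) for s, f in freq.items()}
    return cost1, cost2


def segment(word, cost):
    """Return the known-segment split of the word with minimal total cost, or None."""
    # best[i] holds (minimal cost, segments) for the first i letters of the word.
    best = [(0.0, [])] + [(math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in cost and best[start][1] is not None:
                candidate = best[start][0] + cost[piece]
                if candidate < best[end][0]:
                    best[end] = (candidate, best[start][1] + [piece])
    return best[len(word)][1]


# Hypothetical segment frequencies, not data from the presentation.
freq = {"un": 50, "lock": 20, "ed": 120, "unlock": 3, "locked": 4}
cost1, cost2 = make_costs(freq)
result1 = segment("unlocked", cost1)
result2 = segment("unlocked", cost2)
```

Note that under cost_2 the most frequent segment costs nothing, so adding it to a segmentation is free, whereas under cost_1 every segment has a strictly positive cost.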

Results for competition 1: precision, recall and F-measure (%)

    Language   Method     Precision   Recall   F-measure
    English    Method 1   72.0        52.5     60.7
    English    Method 2   61.6        60.0     60.8
    Finnish    Method 1   76.0        25.0     37.6
    Finnish    Method 2   59.6        40.4     48.2
    German     Method 1   63.2        37.7     47.2
    German     Method 2   49.1        57.4     52.9
    Turkish    Method 1   78.2        10.9     19.2
    Turkish    Method 2   73.7        14.8     24.6

◮ Precision: Method 1 > Method 2
◮ Recall: Method 2 > Method 1; low recall in Turkish
◮ F-measure: Method 2 > Method 1; low F-measure in Turkish
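The F-measure is the harmonic mean of precision and recall, which makes it possible to cross-check the chart values. The short sketch below does this for the English figures, assuming that 72.0 / 52.5 and 61.6 / 60.0 are the precision / recall pairs of Method 1 and Method 2; the function name is illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


english_method_1 = round(f_measure(72.0, 52.5), 1)
english_method_2 = round(f_measure(61.6, 60.0), 1)
```

The check also shows why the two methods end up so close in English: Method 1 trades about 7.5 points of recall for about 10 points of precision.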

Results for competition 2: Tfidf weighting

[Bar chart: average precision (AP × 100) with Tfidf weighting for English, Finnish and German. Series: Dummy, Method 1 without new, Method 1 with new, Method 2 without new, Method 2 with new. Labelled values range from 26.7 to 40.2.]

Results for competition 2: Okapi BM 25 weighting

[Bar chart: average precision (AP × 100) with Okapi BM 25 weighting for English, Finnish and German. Series: Dummy, Method 1 without new, Method 1 with new, Method 2 without new, Method 2 with new. Labelled values range from 31.2 to 49.1.]

Challenges in unsupervised morpheme analysis

◮ Objectives of Morpho Challenge 2007: unsupervised morpheme analysis ⇒ more complex than segmentation of words into sub-units
◮ Problems to be solved:
    ◮ allomorphy: different forms for the same morpheme, e.g. oxen = ox +PL and flies = fly_N +PL
    ◮ homography: same form for different morphemes, e.g. fly (noun = insect) vs. fly (verb)
◮ What can be solved by the system in its current state?

How well does the system disambiguate cross-category homography?

Examples in English: ship as a suffix vs. ship as a stem
◮ censor ship
◮ ship wreck
◮ !!!! space ship s !!!!

Analysis of the results
+ Morphotactic constraints prevent a suffix from occurring at the beginning of a word
- The most frequent segments are privileged when several morpheme categories are morphotactically plausible
