
Simple Morpheme Labelling in Unsupervised Morpheme Analysis

Delphine Bernhard, Ubiquitous Knowledge Processing Lab, Darmstadt, Germany. Morpho Challenge 2007, September 19, 2007.


  1. Simple Morpheme Labelling in Unsupervised Morpheme Analysis. Delphine Bernhard, Ubiquitous Knowledge Processing Lab, Darmstadt, Germany. Morpho Challenge 2007, September 19, 2007.

  2. Main features of the method
     ◮ Algorithm already presented at Morpho Challenge 2005
     ◮ Only input: a plain list of words ⇒ no use of corpora or token frequency information
     ◮ Output: a list of labelled morphemic segments for each word:
        ◮ prefix: dis in "dis arm ed"
        ◮ suffix: ing in "sulk ing"
        ◮ stem: grow
        ◮ linking element: the hyphen in "oil - painting s"

  3. Overview of the method
     ◮ Input: list of word forms
     ◮ Step 1: Extraction of prefixes and suffixes ⇒ list of prefixes and suffixes
     ◮ Step 2: Acquisition of stems ⇒ list of stems
     ◮ Step 3: Segmentation of words ⇒ potential segmentations for each word
     ◮ Step 4: Selection of the best segmentation ⇒ morphemic segments
     ◮ Step 5 (optional): Application of the segments to a new data set (additional word forms) ⇒ segmented words

  4. Step 1: Extraction of prefixes and suffixes

  5. Step 1: Extraction of prefixes and suffixes
     ◮ Input: longest words
     [Figure: plot for the word "hyperventilating"; x-axis: letters (H Y P E R V E N T I L A T I N G); y-axis: 0.0 to 1.0]

  6. Step 1: Extraction of prefixes and suffixes
     ◮ Locate positions with low segment predictability
     ◮ Input: longest words
     [Figure: variations of the average maximum transition probabilities for the word "hyperventilating"; x-axis: letters (H Y P E R V E N T I L A T I N G); y-axis: 0.0 to 1.0]

  7. Step 1: Extraction of prefixes and suffixes
     ◮ Locate positions with low segment predictability
     ◮ Input: longest words
     ◮ Output: segments
     [Figure: variations of the average maximum transition probabilities for the word "hyperventilating"; x-axis: letters (H Y P E R V E N T I L A T I N G); y-axis: 0.0 to 1.0]
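The slide only names the quantity being plotted, so the following is a minimal sketch of one way to compute an "average maximum transition probability" at each letter boundary, not the author's exact procedure. The assumptions are mine: substring frequencies are counted as the number of words in the list that contain the substring, the transition probability between two adjacent substrings is estimated as f(left+right) / f(left) or f(left+right) / f(right), and a candidate boundary is a strict local minimum of the averaged score. The function names are hypothetical.

```python
from collections import Counter


def substring_counts(words):
    """For every substring, count how many words in the list contain it."""
    counts = Counter()
    for w in words:
        subs = {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
        counts.update(subs)
    return counts


def boundary_scores(word, counts):
    """Average, over all substring pairs meeting at position k, of the
    larger of the two estimated transition probabilities (assumed formula)."""
    if counts[word] == 0:
        raise ValueError("word must be part of the counted word list")
    scores = {}
    for k in range(1, len(word)):
        values = []
        for i in range(k):
            for j in range(k + 1, len(word) + 1):
                left, right, whole = word[i:k], word[k:j], word[i:j]
                p_forward = counts[whole] / counts[left]
                p_backward = counts[whole] / counts[right]
                values.append(max(p_forward, p_backward))
        scores[k] = sum(values) / len(values)
    return scores


def local_minima(scores):
    """Positions whose score is strictly lower than both neighbours."""
    ks = sorted(scores)
    return [k for prev, k, nxt in zip(ks, ks[1:], ks[2:])
            if scores[k] < scores[prev] and scores[k] < scores[nxt]]
```

With a realistic word list, `local_minima(boundary_scores("hyperventilating", counts))` would give candidate cut positions. A toy list is too small to reproduce the curve on the slide.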

  8. Step 1: Extraction of prefixes and suffixes
     ◮ Identification of a stem among the segments:

       segment     hyper      ventilat      ing
       frequency   123     >  16         <  13,768
       length      5       <  8          >  3

     ◮ Prefixes and suffixes shown around the stem ventilat: hyper + ing, ion, or, ors, hyper + ion, un + ed, badly- + ed
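The comparison on the slide suggests that the stem is the segment that is both longer and less frequent than its neighbours. Below is a small sketch under that reading. The exact criteria used by the system are not stated here, so the rule "longest and least frequent segment" and the names `pick_stem` and `label_segments` are assumptions.

```python
def pick_stem(segments, freq):
    """Return the segment that is both the longest and the least frequent,
    or None when no single segment satisfies both conditions."""
    longest = max(segments, key=len)
    rarest = min(segments, key=lambda s: freq[s])
    return longest if longest == rarest else None


def label_segments(segments, freq):
    """Treat everything before the stem as prefixes, everything after as suffixes."""
    stem = pick_stem(segments, freq)
    if stem is None:
        return None
    i = segments.index(stem)
    return {"prefixes": segments[:i], "stem": stem, "suffixes": segments[i + 1:]}


# Frequencies taken from the slide.
freq = {"hyper": 123, "ventilat": 16, "ing": 13768}
print(label_segments(["hyper", "ventilat", "ing"], freq))
```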

  9. Step 2: Acquisition of stems

  10. Step 2: Acquisition of stems
      ◮ Subtract prefixes and suffixes from all words
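A minimal sketch of what "subtracting" affixes could look like: strip every known prefix and/or suffix from every word and keep what remains as a candidate stem. The minimum stem length `min_len` and the rule that at least one affix must be removed are assumptions added for the illustration. The slide does not say which filters the system applies.

```python
def acquire_stems(words, prefixes, suffixes, min_len=3):
    """Collect candidate stems by removing a known prefix, a known suffix,
    or both, from each word (min_len is a hypothetical filter)."""
    stems = set()
    for w in words:
        for p in prefixes | {""}:
            if not w.startswith(p):
                continue
            rest = w[len(p):]
            for s in suffixes | {""}:
                if not p and not s:
                    continue  # nothing was subtracted
                if not rest.endswith(s):
                    continue
                stem = rest[:len(rest) - len(s)]
                if len(stem) >= min_len:
                    stems.add(stem)
    return stems


print(sorted(acquire_stems({"disarmed", "sulking", "grow"},
                           {"dis"}, {"ed", "ing"})))
```

Note that this naive version over-generates (for example it keeps both "arm" and "armed"); later steps would have to decide between such candidates.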

  11. Step 3: Segmentation of words

  12. Step 3: Segmentation of words
      ◮ Alignment of words containing the same stem in order to discover similar and dissimilar parts
      [Figure: alignment diagram around the stem integrat; other segments shown: dis, re, well, fully, -, ing, e, d, ion, ist, s, or, #]

  13. Step 3: Segmentation of words
      ◮ Validation of new prefixes and suffixes
      ◮ Example words and the segment found before the shared part:

        fully-integrated    fully-
        well-integrated     well-
        reintegrated        re
        disintegrated       dis
        integrated          ε

      ◮ Each such segment falls into one of three sets: known prefixes ($A_1$), potential stems ($A_2$), new prefixes ($A_3$)
      ◮ Condition:
        $$\frac{|A_1| + |A_2|}{|A_1| + |A_2| + |A_3|} \geq a \quad \text{and} \quad \frac{|A_1|}{|A_1| + |A_2|} \geq b$$
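The two-part condition on this slide translates directly into a small check. The sketch below assumes the three sets are already available. The thresholds `a` and `b` are not given on the slide, so the values in the usage lines are arbitrary placeholders, as are the set contents.

```python
def affixes_are_valid(a1, a2, a3, a, b):
    """Check (|A1|+|A2|) / (|A1|+|A2|+|A3|) >= a  and  |A1| / (|A1|+|A2|) >= b.

    a1: known affixes, a2: potential stems, a3: new affixes.
    """
    n1, n2, n3 = len(a1), len(a2), len(a3)
    if n1 + n2 == 0:
        return False  # both ratios are undefined or zero
    return (n1 + n2) / (n1 + n2 + n3) >= a and n1 / (n1 + n2) >= b


# Placeholder sets and thresholds, for illustration only.
known = {"p1", "p2", "p3"}
stems = {"s1"}
new = {"n1"}
print(affixes_are_valid(known, stems, new, a=0.7, b=0.5))
```

The first ratio limits the share of previously unseen affixes. The second requires that known affixes outnumber potential stems by a sufficient margin.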

  14. Step 4: Selection of the best segmentation

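  15. Step 4: Selection of the best segmentation
      ◮ Potential segmentations of auto-transplantation (segment frequencies in parentheses):
         ◮ auto (41) | - (12,194) | transplant (40) | ation (737)
         ◮ auto (41) | - (12,194) | transplantation (12)
         ◮ auto (41) | - (12,194) | transplanta (16) | tion (103)
      ◮ The most frequent segment is chosen whenever there is a choice
      ◮ Some frequency and morphotactic constraints are verified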
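One way to read "the most frequent segment is chosen" is as a left-to-right greedy choice among the candidate segmentations. The sketch below implements only that reading. The frequency and morphotactic constraints mentioned on the slide are not modelled, and the tie-breaking rule (alphabetical) is an assumption.

```python
def select_segmentation(candidates, freq):
    """Greedy choice: at the first position where candidates differ,
    keep only those that continue with the most frequent segment."""
    remaining = sorted({tuple(c) for c in candidates})
    if len({"".join(c) for c in remaining}) != 1:
        raise ValueError("all candidates must segment the same word")
    pos = 0
    while len(remaining) > 1:
        # Ties on frequency are broken alphabetically (assumption).
        best = max({c[pos] for c in remaining},
                   key=lambda s: (freq.get(s, 0), s))
        remaining = [c for c in remaining if c[pos] == best]
        pos += 1
    return list(remaining[0])


# Frequencies taken from the slide.
freq = {"auto": 41, "-": 12194, "transplant": 40, "ation": 737,
        "transplantation": 12, "transplanta": 16, "tion": 103}
candidates = [
    ["auto", "-", "transplant", "ation"],
    ["auto", "-", "transplantation"],
    ["auto", "-", "transplanta", "tion"],
]
print(select_segmentation(candidates, freq))
```

On the slide's example, the three candidates first differ at the third segment, where transplant (40) is more frequent than transplanta (16) and transplantation (12).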

  16. Step 5 (optional): Application of the morphemic segments to a new data set

  17. Step 5 (optional): Application of the morphemic segments to a new data set
      ◮ For each word, select segments so that the total cost is minimal
      ◮ Cost functions used:
         ◮ Method 1: $\mathrm{cost}_1(s_i) = -\log \frac{f(s_i)}{\sum_i f(s_i)}$
         ◮ Method 2: $\mathrm{cost}_2(s_i) = -\log \frac{f(s_i)}{\max_i [f(s_i)]}$
      ◮ where:
         ◮ $s_i$ = morphemic segment
         ◮ $f(s_i)$ = frequency of segment $s_i$
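Selecting segments so that the total cost is minimal can be done with a standard dynamic program over word positions. The sketch below is one such implementation, assuming a word must be covered entirely by known segments. The search strategy and the function names are assumptions, and the frequencies in the usage lines are invented for illustration.

```python
import math


def make_costs(freq, method=1):
    """cost_1 normalises by the sum of frequencies, cost_2 by the maximum."""
    denom = sum(freq.values()) if method == 1 else max(freq.values())
    return {seg: -math.log(f / denom) for seg, f in freq.items()}


def segment(word, costs):
    """Return the minimum-cost split of word into known segments, or None."""
    n = len(word)
    best = [math.inf] * (n + 1)   # best[k]: minimal cost of covering word[:k]
    back = [None] * (n + 1)       # back[k]: start of the last segment used
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            seg = word[start:end]
            if seg in costs and best[start] + costs[seg] < best[end]:
                best[end] = best[start] + costs[seg]
                back[end] = start
    if math.isinf(best[n]):
        return None
    segments, end = [], n
    while end > 0:
        start = back[end]
        segments.append(word[start:end])
        end = start
    return segments[::-1]


# Invented frequencies, for illustration only.
freq = {"work": 30, "s": 100, "works": 20, "ing": 50}
print(segment("works", make_costs(freq, method=1)))  # ['works']
print(segment("works", make_costs(freq, method=2)))  # ['work', 's']
```

Since the maximum frequency never exceeds the sum, cost_2 is never larger than cost_1 for a given segment, so each additional segment is penalised less under Method 2. In this toy example that is enough to make Method 2 split a word that Method 1 leaves whole.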

  18. Results for competition 1: Precision
      [Bar chart: precision (%) of Method 1 and Method 2 for English, Finnish, German and Turkish; values shown on the bars: 78.2, 76.0, 73.7, 72.0, 63.2, 61.6, 59.6, 49.1]
      ◮ Method 1 > Method 2

  19. Results for competition 1: Recall
      [Bar chart: recall (%) of Method 1 and Method 2 for English, Finnish, German and Turkish; values shown on the bars: 60.0, 57.4, 52.5, 40.4, 37.7, 25.0, 14.8, 10.9]
      ◮ Method 2 > Method 1
      ◮ Low recall in Turkish

  20. Results for competition 1: F-measure
      [Bar chart: F-measure (%) of Method 1 and Method 2 for English, Finnish, German and Turkish; values shown on the bars: 60.8, 60.7, 52.9, 48.2, 47.2, 37.6, 24.6, 19.2]
      ◮ Method 2 > Method 1
      ◮ Low F-measure in Turkish

  21. Results for competition 2: Tfidf weighting
      [Bar chart: Tfidf, AP × 100, for English, Finnish and German; series: Dummy, Method 1 without new, Method 1 with new, Method 2 without new, Method 2 with new; values shown on the bars: 40.2, 39.8, 39.0, 38.1, 37.8, 37.3, 37.2, 37.0, 35.6, 35.0, 27.8, 27.8, 27.8, 26.7, 26.8]

  22. Results for competition 2: Okapi BM25 weighting
      [Bar chart: Okapi, AP × 100, for English, Finnish and German; series: Dummy, Method 1 without new, Method 1 with new, Method 2 without new, Method 2 with new; values shown on the bars: 49.1, 47.3, 46.8, 46.8, 46.1, 46.2, 44.2, 41.8, 39.4, 39.0, 39.2, 38.8, 32.7, 32.3, 31.2]

  23. Challenges in unsupervised morpheme analysis
      ◮ Objectives of Morpho Challenge 2007: unsupervised morpheme analysis ⇒ more complex than segmentation of words into sub-units

  24. Challenges in unsupervised morpheme analysis
      ◮ Objectives of Morpho Challenge 2007: unsupervised morpheme analysis ⇒ more complex than segmentation of words into sub-units
      ◮ Problems to be solved:
         ◮ allomorphy: different forms for the same morpheme, e.g. oxen = ox +PL and flies = fly_N +PL
         ◮ homography: same form for different morphemes, e.g. fly (noun = insect) vs. fly (verb)

  25. Challenges in unsupervised morpheme analysis
      ◮ Objectives of Morpho Challenge 2007: unsupervised morpheme analysis ⇒ more complex than segmentation of words into sub-units
      ◮ Problems to be solved:
         ◮ allomorphy: different forms for the same morpheme, e.g. oxen = ox +PL and flies = fly_N +PL
         ◮ homography: same form for different morphemes, e.g. fly (noun = insect) vs. fly (verb)
      ◮ What can be solved by the system in its current state?


  28. How well does the system disambiguate cross-category homography?
      ◮ Examples in English: ship as a suffix vs. ship as a stem
         ◮ censor ship
         ◮ ship wreck
         ◮ !!!! space ship s !!!!
      ◮ Analysis of the results:
         + Morphotactic constraints prevent a suffix from occurring at the beginning of a word
         – The most frequent segments are favoured when several morpheme categories are morphotactically plausible
