
DCU meets MET: Bengali and Hindi Morpheme Extraction

  1. DCU meets MET: Bengali and Hindi Morpheme Extraction
     Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones
     CNGL, School of Computing, Dublin City University, Ireland

  2. Outline
     - Motivation
     - Task Description
     - Bengali Stemming Approach
     - Hindi Stemming Approach
     - Results
     - Conclusions and Future Work

  3. Motivation
     - Some languages have complex inflectional and derivational morphology, i.e. the same base form can correspond to multiple surface word forms
     - Example: company, companies → company; hopeful → hope
     - For information retrieval, indexing surface forms would lead to many mismatches between query terms and index terms extracted from documents
     - Index base forms/stems: reduce different surface forms to the same index form (stem, lemma) to increase the chance of matching query terms with document terms

  4. Task Description
     - Morpheme Extraction Task: investigate the effect of morphological analysis/lemmatization/stemming on information retrieval (IR) performance for Indian languages
     - Subtasks:
       - Subtask 1: manual evaluation of morpheme extraction
       - Subtask 2: IR evaluation using the proposed morpheme representation as index terms. Evaluation metric is mean average precision (MAP)
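As a rough illustration of the Subtask 2 metric, the sketch below computes mean average precision over a set of ranked result lists. The function names and the toy document IDs are assumptions made for this example; the official evaluation tooling used in the task is not described in the slides.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision for one query.

    Averages precision@k over the ranks k at which a relevant document
    is retrieved, normalised by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-query average precision over all queries in `runs`."""
    if not runs:
        return 0.0
    total = sum(average_precision(ranked, qrels.get(query, set()))
                for query, ranked in runs.items())
    return total / len(runs)


# Toy data (hypothetical): two queries with ranked document IDs.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}
map_score = mean_average_precision(runs, qrels)
print(round(map_score, 3))  # 0.667
```

Here q1 scores (1/1 + 2/3) / 2 and q2 scores 1/2, so the mean is 2/3.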

  5. Stemming Approaches
     - Light vs. aggressive stemming
     - Rule-based vs. corpus-based stemming: manually created rules vs. clusters of related words
     - Iteratively remove word suffixes
     - Problems:
       - overstemming, i.e. the removed suffix is too long, e.g. international/intern; news/new
       - understemming, i.e. the removed suffix is too short, e.g. forgetfulness/forgetful
       - irregular forms, e.g. feet/foot; women/woman
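The over- and understemming examples on this slide can be reproduced with a naive iterative suffix stripper. This is only a sketch: the two suffix lists and the minimum stem length are hypothetical choices made to trigger the slide's examples, not rules from any real stemmer.

```python
from typing import Sequence

# Hypothetical suffix lists, ordered longest first.
LIGHT_SUFFIXES = ["ness", "s"]
AGGRESSIVE_SUFFIXES = ["ational", "ness", "ful", "s"]


def strip_suffixes(word: str, suffixes: Sequence[str], min_stem: int = 3) -> str:
    """Repeatedly remove the first matching suffix until none applies.

    A suffix is only removed if at least `min_stem` characters remain.
    """
    changed = True
    while changed:
        changed = False
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[:-len(suffix)]
                changed = True
                break
    return word


# Overstemming: the removed suffix is too long.
print(strip_suffixes("international", AGGRESSIVE_SUFFIXES))  # intern
print(strip_suffixes("news", AGGRESSIVE_SUFFIXES))           # new

# Understemming: the light rule set stops at "forgetful".
print(strip_suffixes("forgetfulness", LIGHT_SUFFIXES))       # forgetful
print(strip_suffixes("forgetfulness", AGGRESSIVE_SUFFIXES))  # forget
```

Irregular forms such as feet/foot are untouched by either list, since no suffix rule relates them.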

  6. Our Bengali Stemming Approach
     - Rule-based stemmer created by a native speaker
     - Focus on nouns (most important for IR)
     - Four categories [Bhattacharya et al. 2005]:
       - Title markers added as suffixes to proper nouns, e.g. “দেবী” (Mrs.), “বাবু” (sir)
       - Classifiers for plurality and specificity/gender of a noun, e.g. ছবিগুলো (pictures), ছবিটা (the picture), ছাত্রী (female student)
       - Case markers for possessive or accusative relations, e.g. পরিবারের (family’s)
       - Emphasizers to emphasize the current word, e.g. ছবিই (only a picture), ছবিটাই (only this picture)

  7. Bengali Stemmer
     - Drop emphasizers (iteratively), e.g. আবি্যই → আবি্য
     - Drop classifiers and case markers, e.g. �র�িাও → �র�, �ািলেি → �ািে
     - Drop title markers, e.g. ��োলে�� → ��ো
     - Drop plural suffixes, e.g. �ািে�য়লেি → �ািে�য়
     - Drop derivational suffixes, e.g. বিে�শ� → বিে�
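The staged design of this stemmer (one suffix class per step, with only the emphasizer step applied iteratively) might be sketched as follows. Only the first two stages are shown. The suffix lists and the minimum stem length are illustrative assumptions, not the authors' actual rule set, which was implemented in C and is not reproduced in the slides.

```python
from typing import List, Tuple

# Each stage: (suffixes, apply_iteratively). Suffix lists are hypothetical.
STAGES: List[Tuple[Tuple[str, ...], bool]] = [
    (("ই", "ও"), True),                                  # emphasizers
    (("গুলো", "গুলি", "টা", "টি", "ের", "কে"), False),  # classifiers, case markers
]

MIN_STEM = 2  # assumed minimum number of code points left after stripping


def drop_suffix(word: str, suffixes: Tuple[str, ...]) -> str:
    """Remove the longest matching suffix, or return the word unchanged."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[:-len(suffix)]
    return word


def stem(word: str) -> str:
    """Run the stages in order; iterative stages repeat until no change."""
    for suffixes, iterative in STAGES:
        while True:
            shorter = drop_suffix(word, suffixes)
            if shorter == word:
                break
            word = shorter
            if not iterative:
                break
    return word


# ছবিটাই (only this picture): emphasizer ই is dropped, then classifier টা.
print(stem("ছবিটাই") == stem("ছবিটা") == stem("ছবিই"))  # True
```

Keeping the stages in a fixed order matters: the emphasizer has to be removed before the classifier underneath it becomes visible as a suffix.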

  8. Our Hindi Stemming Approach
     - Hindi has less complex inflectional morphology, so fewer stemming rules are needed
     - Rule-based stemmer
     - Stemming rules manually created by a native Hindi speaker

  9. Hindi Stemmer
     - Iteratively remove Hindi vowels, matras, anusvara, and “य” (character ya) from the right of a string until the first consonant is encountered
     - Drop derivational suffixes, e.g. लड़कों (to boys) → लड़का (boy); लड़कियों (to girls) → लड़की (girl)
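The first rule on this slide can be sketched with Unicode code point ranges from the Devanagari block. The ranges chosen for "vowels" and "matras", the function names, and the guard that keeps at least one character are assumptions for this example; the derivational suffix step is not covered.

```python
ANUSVARA = "\u0902"  # DEVANAGARI SIGN ANUSVARA
YA = "\u092F"        # DEVANAGARI LETTER YA


def is_strippable(ch: str) -> bool:
    """True for independent vowels, matras, anusvara, and the letter ya."""
    code = ord(ch)
    return (
        0x0905 <= code <= 0x0914      # independent vowels (assumed range)
        or 0x093E <= code <= 0x094C   # dependent vowel signs (assumed range)
        or ch == ANUSVARA
        or ch == YA
    )


def strip_right(word: str) -> str:
    """Remove strippable characters from the right until a consonant is hit.

    At least one character is always kept (an assumption of this sketch).
    """
    end = len(word)
    while end > 1 and is_strippable(word[end - 1]):
        end -= 1
    return word[:end]


forms = ["लड़कों", "लड़का", "लड़कियों", "लड़की"]
stems = {strip_right(form) for form in forms}
print(len(stems))  # 1
```

Under this rule alone, all four example forms end at the consonant क and therefore share one index term.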

  10. MET Experiments
      - Experiments for Bengali and Hindi
      - Stemmers implemented in C
      - Submission as source code
      - Stemmed forms are used for retrieval with Terrier

  11. Results

      | Team      | Language | MAP              |
      |-----------|----------|------------------|
      | Baseline  | Bengali  | 0.2740           |
      | JU        | Bengali  | 0.3307 (+20.69%) |
      | DCU       | Bengali  | 0.3300 (+20.44%) |
      | IIT-KGP   | Bengali  | 0.3225 (+17.70%) |
      | CVPR-Team | Bengali  | 0.3159 (+15.29%) |
      | ISM       | Bengali  | 0.3103 (+13.25%) |
      | Baseline  | Hindi    | 0.2821           |
      | DCU       | Hindi    | 0.2963 (+5.03%)  |
      | ISM       | Hindi    | 0.2793 (-0.99%)  |
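The percentages in parentheses are relative changes in MAP over the unstemmed baseline for the same language. A short check, using the MAP values from this slide (the helper name is made up for this example):

```python
def relative_change(score: float, baseline: float) -> float:
    """Relative change of `score` over `baseline`, in percent."""
    return 100.0 * (score - baseline) / baseline


BENGALI_BASELINE = 0.2740
HINDI_BASELINE = 0.2821

print(f"{relative_change(0.3300, BENGALI_BASELINE):+.2f}%")  # +20.44%
print(f"{relative_change(0.2963, HINDI_BASELINE):+.2f}%")    # +5.03%
print(f"{relative_change(0.2793, HINDI_BASELINE):+.2f}%")    # -0.99%
```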

  12. Conclusions
      - Bengali stemmer: 2nd best performance
      - Hindi stemmer: best performance
      - Both have also been used successfully in previous ad-hoc IR experiments for FIRE

  13. Future work
      - Explore use of exclusion lists for irregular cases
      - Extend rule set (i.e. handle verbs)
      - Compare to other stemmers for Bengali/Hindi, e.g. the Indian language stemmers in version 4 of Lucene; stemmers from Jacques Savoy’s web page on cross-language IR
      - Investigate morphology of named entities
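One possible reading of the exclusion list idea is a lookup table that is consulted before any suffix rule fires. The sketch below uses the English irregular forms from slide 5 as entries; the table, the toy rule stemmer, and the function names are all hypothetical, and the authors may have a different design in mind.

```python
from typing import Callable, Dict

# Irregular forms mapped directly to their base form (examples from slide 5).
EXCEPTIONS: Dict[str, str] = {"feet": "foot", "women": "woman"}


def toy_rule_stemmer(word: str) -> str:
    """Stand-in for a rule-based stemmer: strips a final 's' from longer words."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def stem_with_exceptions(word: str,
                         rule_stemmer: Callable[[str], str],
                         exceptions: Dict[str, str] = EXCEPTIONS) -> str:
    """Return the listed base form if present, else fall back to the rules."""
    if word in exceptions:
        return exceptions[word]
    return rule_stemmer(word)


print(stem_with_exceptions("feet", toy_rule_stemmer))   # foot
print(stem_with_exceptions("stems", toy_rule_stemmer))  # stem
```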

  14. Thanks for your attention. Any questions?
