Arabic POS Tagging

  1. Arabic POS Tagging
     Emad Mohamed, Sandra Kübler, Indiana University
     Outline: Arabic + POS Tagging, Data + Experiments, Segmentation, POS Tagging, Results, Error Analysis, Conclusion

  3. The Structure of Arabic Words
     ◮ An Arabic word may consist of several segments.
     ◮ Possible segments: inflectional affixes, the stem, clitics
     ◮ example: wsyktbwnhA (Engl.: "and they will write it"):
       ◮ conjunction: w
       ◮ future particle: s
       ◮ 3rd person imperfect verb prefix: y
       ◮ imperfect verb: ktb
       ◮ 3rd person masculine plural imperfect verb suffix: wn
       ◮ 3rd person feminine singular object pronoun: hA
     ◮ POS tag: [CONJ+FUTURE PARTICLE+IMPERFECT VERB PREFIX+IMPERFECT VERB+IMPERFECT VERB SUFFIX MASC PLURAL 3RD PERSON+OBJECT PRONOUN FEM SINGULAR]

  6. Tagging Approaches
     ◮ whole word tagging: assign complex tag to complete word (993 tags)
       wsyktbwnhA: CONJ+FUT+IV3MS+IV+IVSUFF_SUBJ:MP_MOOD:I+IVSUFF_DO:3FS
     ◮ segment-based tagging: segment first; then assign tags to segments (139 tags)
       ◮ w: CONJ
       ◮ s: FUT
       ◮ y: IV3MS
       ◮ ktb: IV
       ◮ wn: IVSUFF_SUBJ:MP_MOOD:I
       ◮ hA: IVSUFF_DO:3FS
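The relation between the two tagging schemes can be illustrated with a minimal sketch: joining the segments gives the written word, and joining the segment tags with `+` gives the whole-word tag. The function name `whole_word` is hypothetical, and the underscore spelling of the tags is an assumption following Penn Arabic Treebank conventions.

```python
# Segments and tags of the example word from the slide.
# Assumption: tags are spelled with underscores (Penn Arabic Treebank style).
SEGMENTS = [
    ("w", "CONJ"),
    ("s", "FUT"),
    ("y", "IV3MS"),
    ("ktb", "IV"),
    ("wn", "IVSUFF_SUBJ:MP_MOOD:I"),
    ("hA", "IVSUFF_DO:3FS"),
]


def whole_word(segments):
    """Hypothetical helper: build the written word and its complex tag."""
    word = "".join(seg for seg, _ in segments)
    tag = "+".join(tag for _, tag in segments)
    return word, tag


word, tag = whole_word(SEGMENTS)
print(word, tag)
```

A whole-word tagger predicts the joined tag in one step, which is why its tagset is much larger than the segment tagset.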

  7. Data Set & Experimental Setup
     ◮ Penn Arabic Treebank (after-treebank POS files)
     ◮ P1V3 + P3V1: ca. 500 000 words
     ◮ non-vocalized version
     ◮ reattached conjunctions, prepositions, pronouns, etc. to get text as written
     ◮ remove null elements: {i$otaraY+(null) / PV+PVSUFF_SUBJ:3MS ⇒ {i$otaraY / PV
     ◮ 5-fold cross validation
     ◮ evaluation: per-segment accuracy (SAR) + per-word accuracy (WAR)
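The two evaluation measures can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation script: each word is a list of segment tags, a word counts as correct only if every segment tag matches and the number of segments agrees, and the name `accuracy_rates` is hypothetical.

```python
from typing import List, Tuple

Word = List[str]  # one tag per segment of a word


def accuracy_rates(gold: List[Word], pred: List[Word]) -> Tuple[float, float]:
    """Return (SAR, WAR) for non-empty, word-aligned gold and predicted tags.

    Assumption: predicted segments beyond or short of the gold
    segmentation count as errors for the whole word.
    """
    seg_total = seg_correct = word_correct = 0
    for g, p in zip(gold, pred):
        seg_total += len(g)
        matches = sum(a == b for a, b in zip(g, p))
        seg_correct += matches
        if len(g) == len(p) and matches == len(g):
            word_correct += 1
    return seg_correct / seg_total, word_correct / len(gold)


gold = [["CONJ", "FUT", "IV"], ["NOUN"]]
pred = [["CONJ", "FUT", "PV"], ["NOUN"]]
sar, war = accuracy_rates(gold, pred)
print(sar, war)
```

Because a single wrong segment makes the whole word wrong, WAR is never higher than SAR when segmentations agree.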

  8. Memory-Based Segmentation
     ◮ per character classification: segment-end, no-segment-end
     ◮ memory-based learning: TiMBL
     ◮ features: focus character, previous 5 characters, and following 5 characters, POS tag for word based on whole word tagging
     ◮ TiMBL parameters: IB, overlap metric, gain ratio weighting, nearest neighbors k = 1
     ◮ two rounds: in second round include class from first round
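The per-character instances described above might be built like this. The sketch only covers feature extraction, not TiMBL itself; the padding symbol `PAD`, the function name `char_instances`, and the feature order (left context, focus, right context, word tag) are assumptions made for illustration.

```python
from typing import List, Tuple

PAD = "_"  # hypothetical filler for positions outside the word


def char_instances(
    segments: List[str], word_tag: str, window: int = 5
) -> List[Tuple[List[str], str]]:
    """One (features, class) pair per character of the word.

    Assumes every segment is non-empty.
    """
    word = "".join(segments)
    # Character positions that close a segment.
    ends = set()
    pos = 0
    for seg in segments:
        pos += len(seg)
        ends.add(pos - 1)
    padded = PAD * window + word + PAD * window
    instances = []
    for i, ch in enumerate(word):
        j = i + window  # index of the focus character in the padded string
        left = list(padded[j - window:j])
        right = list(padded[j + 1:j + 1 + window])
        label = "segment-end" if i in ends else "no-segment-end"
        instances.append((left + [ch] + right + [word_tag], label))
    return instances


insts = char_instances(["w", "s", "y", "ktb", "wn", "hA"], "WORD_TAG")
for features, label in insts:
    print(" ".join(features), label)
```

In the second round, the class predicted in the first round would be appended to each feature vector as an extra feature.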

  10. Segmentation Results

      | Words         | Accuracy |
      |---------------|----------|
      | all words     | 98.23%   |
      | known words   | 99.75%   |
      | unknown words | 82.22%   |

      ◮ proper noun errors: 33.87% of all errors
      ◮ % unknown words in data: 8.5%

  11. POS Tagging
      ◮ memory-based tagger: MBT
      ◮ parameters: Modified Value Difference metric, k = 25
      ◮ for known words: IGTree, 2 words to left, their POS tags, focus word, its ambitag, 1 right context word, its ambitag
      ◮ for unknown words: IB1, focus word, first 5 + last 3 characters, 1 left context word + its POS tag, 1 right context word + its ambitag
      ◮ previous decisions are included
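The unknown-word features listed above could be assembled roughly as follows. This is a simplified sketch and not MBT's own implementation: an ambitag is taken here to be the sorted set of tags a word form was seen with in training, joined by `-`, and the names `build_ambitags`, `unknown_word_features`, and the `UNKNOWN` and `_` placeholders are hypothetical.

```python
from collections import defaultdict


def build_ambitags(tagged_words):
    """Map each training word form to a symbol for all tags it occurred with."""
    seen = defaultdict(set)
    for word, tag in tagged_words:
        seen[word].add(tag)
    return {w: "-".join(sorted(tags)) for w, tags in seen.items()}


def unknown_word_features(words, i, left_tag, ambitags, unk="UNKNOWN"):
    """Features for the unknown word at position i.

    left_tag is the tag already assigned to the previous word,
    which is how previous decisions enter the instance.
    """
    word = words[i]
    prefix = list(word[:5].ljust(5, "_"))   # first 5 characters, padded
    suffix = list(word[-3:].rjust(3, "_"))  # last 3 characters, padded
    left_word = words[i - 1] if i > 0 else "_"
    right_word = words[i + 1] if i + 1 < len(words) else "_"
    right_ambi = ambitags.get(right_word, unk)
    return [word] + prefix + suffix + [left_word, left_tag, right_word, right_ambi]


ambitags = build_ambitags([("ktb", "NOUN"), ("ktb", "IV"), ("fy", "PREP")])
feats = unknown_word_features(["fy", "xyzw", "ktb"], 1, "PREP", ambitags)
print(feats)
```

The right context uses an ambitag rather than a tag because, when tagging left to right, the word to the right has not been disambiguated yet.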

  12. POS Tagging Results

      | Setting            | SAR    | WAR    |
      |--------------------|--------|--------|
      | gold standard seg. | 96.72% | 94.91% |
      | segmentation-based | 94.70% | 93.47% |
      | whole words        | –      | 94.74% |

  15. Discussion
      ◮ gold standard segmentation: upper bound
      ◮ gives best results
      ◮ no gold standard segmentation available: whole words better than automatic segmentation
      ◮ segmentation → more ambiguity per segment
      ◮ small percentage of unknown words
      ◮ in segmentation-based tagging, 28% of all errors are results of wrong segmentation

  16. Known vs. Unknown Words

      |               | gold std. seg. | seg.-based | whole words |
      |---------------|----------------|------------|-------------|
      | known words   | 95.90%         | 95.57%     | 96.61%      |
      | unknown words | 84.25%         | 71.06%     | 74.64%      |

  21. Error Analysis
      confusion sets:

      | gold        | tagger      | % of errors |
      |-------------|-------------|-------------|
      | noun        | adjective   | 7.88%       |
      | adjective   | noun        | 7.75%       |
      | proper noun | noun        | 9.10%       |
      | noun        | proper noun | 2.51%       |

      ◮ no clear distinction between nouns and adjectives in Arabic: adjectives behave morphologically like nouns and can be used as nouns
      ◮ proper nouns are normally standard nouns, and are not marked specifically

  22. Comparison to Habash & Rambow
      ◮ whole word tagging
      ◮ then convert to Habash & Rambow tokenization + reduced tagset: 15 tags

      |             | H&R ATB1 | H&R ATB2 | whole word tagger |
      |-------------|----------|----------|-------------------|
      | Token. acc. | 99.1     | –        | 99.33             |
      | POS acc.    | 98.1     | 96.5     | 96.41             |

  23. Conclusion & Future Work
      ◮ whole word tagging has higher accuracy than segmentation-based tagging
      ◮ no preprocessing necessary
      ◮ but Penn Arabic Treebank has low percentage of unknown words
      ◮ segmentation quality is bottleneck for improving segmentation-based tagger
      ◮ need to find more reliable segmentation
      ◮ will integrate vocalization with segmentation
