Quality Estimation and Evaluation of Machine Translation into Arabic


  1. 1 QUALITY ESTIMATION AND EVALUATION OF MACHINE TRANSLATION INTO ARABIC Houda Bouamor, Carnegie Mellon University-Qatar email: hbouamor@qatar.cmu.edu


  3. Outline 3: 1. SuMT: A Framework of Summarization and MT; 2. AL-BLEU: A metric and a dataset for Arabic MT evaluation; 3. Conclusions

  4. Part 1: SuMT, a Framework of Summarization and Machine Translation

  5. Motivation 5: MT quality is far from ideal for many languages and text genres; it provides incorrect context and confuses readers. Some sentences are not as informative, and the document could be summarized to make it more cohesive. Goal: keep the informative sentences while maintaining decent MT quality.

  6. Questions 6: 1. How can we estimate the MT quality of a sentence without human references? 2. How can we find the most informative part of a document? 3. How can we find a middle point between informativeness and MT quality? 4. How can we evaluate the quality of our system?

  7. Part 1 outline 7: 1. MT quality estimation; 2. MT-aware summarization system; 3. Experiments and results; 4. Conclusion

  8. SuMT [Bouamor et al., 2013] 8

  9. SuMT: Translation 9

  10. SuMT: MT quality estimation 10

  11. SuMT: MT quality estimation, data labeling procedure 11: [Diagram: each English source sentence Sent_EN_n, paired with its automatically obtained Arabic translation Sent_AR_n, is labeled with an MT quality score Q_Score_n; the labeled pairs are used to train a classifier.]

  12. Quality estimation: MT quality classifier 12: We use an SVM classifier and adapt the QuEst framework [Specia et al., 2013] to our EN-AR translation setup. Each sentence pair is characterized by general features (length, source-target length ratio, source and target punctuation), 5-gram LM scores, MT-based scores, morphosyntactic features, and more.
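
Below is a minimal sketch, not from the talk, of what such a reference-free quality classifier can look like: an SVM over a handful of surface features in scikit-learn. The feature extraction, training pairs and labels are all hypothetical stand-ins; the actual system uses the full QuEst feature set (LM scores, MT-based scores, morphosyntactic features).

```python
# Minimal sketch of a reference-free quality-estimation classifier: an SVM over
# a few QuEst-style surface features. Features, data and labels are illustrative.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def surface_features(src, hyp):
    """A few 'general' features named on the slide: lengths, length ratio,
    punctuation counts. The real setup adds LM, MT-based and morphosyntactic ones."""
    src_toks, hyp_toks = src.split(), hyp.split()
    punct = set(".,;:!?")
    return [
        len(src_toks),                          # source length
        len(hyp_toks),                          # target length
        len(src_toks) / max(len(hyp_toks), 1),  # source/target length ratio
        sum(ch in punct for ch in src),         # source punctuation count
        sum(ch in punct for ch in hyp),         # target punctuation count
    ]

# Hypothetical training data: (English source, Arabic MT output, quality label).
train = [
    ("the minister arrived in Doha today .", "وصل الوزير إلى الدوحة اليوم .", 1),  # acceptable MT
    ("the weapons will be handed over soon .", "سوف أسلحة قريبا تسليم .", 0),      # garbled MT
]
X = [surface_features(src, hyp) for src, hyp, _ in train]
y = [label for _, _, label in train]

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)

# Predict the quality class of a new sentence pair, with no reference translation.
print(clf.predict([surface_features("two weeks remain .", "أسبوعان يتبقى .")]))
```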

  13. SuMT: MT-aware summarization 13

  14. SuMT: MT-aware summarization MEAD as a ranker 14

  15. SuMT: MT-aware summarization Our adaptation of MEAD 15

  16. Evaluating quality estimation 16: How do we evaluate the quality of the estimation? Intrinsically, this is very hard to trust, since it needs references, which amounts to MT evaluation (more on this next). Extrinsically, in an application: in the context of MT of Wikipedia, we compare using QE against a simple baseline.

  17. SuMT: experimental settings 17: MT setup: the baseline MT system is Moses trained on a standard English-Arabic corpus, with standard preprocessing and tokenization for both English and Arabic, and word alignment with GIZA++. Summarization and test data: English-Arabic NIST corpora; NIST 2008 and 2009 for training and development (259 documents), NIST 2005 for testing (100 documents).

  18. SuMT: experimental settings 18: Summarization setup: bilingual summarization of the test data, with two native speakers choosing half of the sentences. Guidelines for sentence selection: be informative with respect to the main story and preserve key information (named entities, dates, etc.). Inter-annotator agreement was moderate (κ = 0.61).

  19. SuMT: experimental settings 19: We produce summaries for each document using: length-based selection, choosing the shortest sentences (Length); the state-of-the-art MEAD summarizer (MEAD); the MT quality estimation classifier (Classifier); the MT-aware summarizer (SuMT); and an oracle classifier that chooses the sentences with the highest translation quality (Oracle).

  20. SuMT: Results 20: MT results. [Bar chart of BLEU scores for Baseline, Length, MEAD, Classifier, Interpol, SuMT and Oracle; the reported values are 34.75%, 32.12%, 31.36%, 28.45%, 28.42%, 27.52% and 26.33%.]

  21. SuMT: Results 21: Arabic summary quality. [Bar chart of ROUGE-SU4 scores for Length, MEAD, Classifier, Interpol and SuMT; the reported values are 24.07%, 23.56%, 23.09%, 20.33% and 15.81%.]

  22. Conclusions 22: We presented a framework for pairing MT with summarization. We extended a classification framework for reference-free prediction of translation quality at the sentence level, and incorporated this MT knowledge into a summarization system, which results in high-quality translation summaries. Quality estimation is thus shown to be useful in the context of text summarization.

  23. Automatic MT quality evaluation 23

  24. The BLEU metric 24: the de facto metric, proposed by IBM [Papineni et al., 2002]. Main ideas: exact matches of words; matching against a set of reference translations to allow a greater variety of expressions; accounting for adequacy through word precision; accounting for fluency by calculating n-gram precisions for n = 1, 2, 3, 4. There is no recall, which is difficult to compute with multiple references; a "brevity penalty" is used instead. The final score is a weighted geometric average of the n-gram scores.

  25. The BLEU metric: Example 25 (adequacy). Reference: "the Iraqi weapons are to be handed over to the army within two weeks". MT output: "in two weeks Iraq's weapons will give army". BLEU: 1-gram precision 4/8, 2-gram precision 1/7, 3-gram precision 0/6, 4-gram precision 0/5; BLEU score = 0 (weighted geometric average). Example taken from Alon Lavie's AMTA 2010 MT evaluation tutorial.

  26. The BLEU metric: Example 26 (fluency). Same example as above; the higher-order n-gram precisions (2-, 3- and 4-gram) are what account for fluency.

  27. The BLEU metric: Example 27. Putting the pieces together for the same example: BLEU ≈ smoothing(BP × (∏ p_n)^(1/N)), where p_n are the n-gram precisions, N = 4 and BP is the brevity penalty; smoothing keeps a single zero precision from zeroing out the whole score. Short story: the score is in [0, 1], and higher is better.
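
As a concrete companion to the example above, here is a small Python sketch (not from the talk) that reproduces the slide's n-gram precisions and applies the brevity penalty; a production implementation such as sacrebleu additionally handles multiple references and smoothing.

```python
# A small sketch of BLEU's modified n-gram precisions and brevity penalty for
# the slide's example; a toy illustration, not an official BLEU implementation.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(ref, hyp, n):
    ref_counts, hyp_counts = Counter(ngrams(ref, n)), Counter(ngrams(hyp, n))
    overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return overlap, max(sum(hyp_counts.values()), 1)

ref = "the Iraqi weapons are to be handed over to the army within two weeks".split()
hyp = "in two weeks Iraq's weapons will give army".split()

precisions = [modified_precision(ref, hyp, n) for n in range(1, 5)]
for n, (hit, total) in enumerate(precisions, 1):
    print(f"{n}-gram precision: {hit}/{total}")   # 4/8, 1/7, 0/6, 0/5 as on the slide

bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))  # brevity penalty
if any(hit == 0 for hit, _ in precisions):
    bleu = 0.0   # the unsmoothed geometric mean collapses to zero, as on the slide
else:
    bleu = bp * math.exp(sum(math.log(hit / total) for hit, total in precisions) / 4)
print(f"BLEU = {bleu}")
```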

  28. BLEU and Arabic 28: BLEU heavily penalizes Arabic. Question: how can we adapt BLEU to support Arabic morphology?

  29. CMU-Q's AL-BLEU [Bouamor et al., 2014] 29: For our experiments: 1. the AL-BLEU metric, 2. data and systems.

  30. AL-BLEU: Arabic Language BLEU 30: We extend BLEU to deal with Arabic's rich morphology by updating the n-gram scores with partial credit for partial matches: morphological matches (POS, gender, number, person, definiteness) as well as stem and lexical matches. The new matching score is computed as follows (formula on the next slide; a rough sketch of the idea is given after it).

  31. AL-BLEU: Arabic Language BLEU 31
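
The exact matching formula and its tuned weights are those of Bouamor et al. (2014); the sketch below (not from the paper) only illustrates the partial-credit idea behind it, using placeholder weights and hand-written MADA-style analyses.

```python
# A rough sketch of the partial-credit idea behind AL-BLEU: an n-gram token that
# is not an exact match can still earn partial credit from stem and morphological
# feature matches. Weights and feature names here are hypothetical placeholders;
# the actual formula and optimized weights are those of Bouamor et al. (2014),
# with analyses produced by MADA.

W_STEM = 0.4                                              # placeholder stem weight
MORPH_FEATURES = ["pos", "gender", "number", "person", "definiteness"]
W_MORPH = {f: 0.12 for f in MORPH_FEATURES}               # placeholder per-feature weights

def token_match(hyp_tok, ref_tok):
    """Return 1.0 for an exact surface match, otherwise a weighted partial score
    based on stem and morphological feature agreement."""
    if hyp_tok["surface"] == ref_tok["surface"]:
        return 1.0
    score = W_STEM if hyp_tok["stem"] == ref_tok["stem"] else 0.0
    for f in MORPH_FEATURES:
        if hyp_tok[f] == ref_tok[f]:
            score += W_MORPH[f]
    return min(score, 1.0)

# Toy tokens with made-up analyses (in practice these come from MADA).
ref_tok = {"surface": "الكتاب", "stem": "كتاب", "pos": "noun",
           "gender": "m", "number": "sg", "person": "3", "definiteness": "def"}
hyp_tok = {"surface": "كتاب", "stem": "كتاب", "pos": "noun",
           "gender": "m", "number": "sg", "person": "3", "definiteness": "indef"}

print(token_match(hyp_tok, ref_tok))   # partial credit instead of BLEU's hard 0
```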

  32. AL-BLEU: Arabic Language BLEU 32: MADA [Habash et al., 2009] provides the stem and morphological features. The weights are optimized to improve correlation with human judgments, using hill climbing on a development set (a generic sketch follows). AL-BLEU is a geometric mean over the matched n-grams.
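
For the tuning step, a generic hill-climbing loop like the following can adjust the match weights on the development set; the objective, step size and iteration count here are illustrative placeholders, not the talk's exact procedure.

```python
# A generic hill-climbing sketch for tuning the partial-match weights on a
# development set; step size, iteration count and objective are illustrative.
import random

def hill_climb(objective, weights, steps=1000, delta=0.02):
    """Greedy hill climbing: perturb one weight at a time, keep the change
    only if the objective (e.g. correlation with human rankings) improves."""
    best = objective(weights)
    for _ in range(steps):
        i = random.randrange(len(weights))
        candidate = list(weights)
        candidate[i] = min(1.0, max(0.0, candidate[i] + random.choice([-delta, delta])))
        score = objective(candidate)
        if score > best:
            weights, best = candidate, score
    return weights, best

# Usage (hypothetical): `correlation_on_dev` would score AL-BLEU's correlation
# with the human judgments on the development set.
# tuned_weights, tau = hill_climb(correlation_on_dev, initial_weights)
```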

  33. AL-BLEU: Evaluation and Results 33: A good MT metric should correlate well with human judgments, so we measure the correlation of BLEU and AL-BLEU with human judgments at the sentence level.

  34. AL-BLEU: Data and Systems 34: Problem: there is no human judgment dataset for Arabic. Data: we annotate a corpus composed of different text genres (news, climate change, Wikipedia). Systems: six state-of-the-art EN-AR MT systems, 4 research-oriented and 2 commercial off-the-shelf.

  35. Data: Judgment collection 35: Rank the sentences relative to each other, from best to worst.

  36. Data: Judgment collection 36: Rank the sentences relative to each other, from best to worst. We adapt a framework commonly used for evaluating MT for European languages [Callison-Burch et al., 2011]; 10 bilingual annotators were hired to assess the quality of each system. Agreement: EN-AR κ_inter = 0.57, κ_intra = 0.62, compared to averages of κ_inter = 0.41, κ_intra = 0.57 for EN-EU and κ_inter = 0.40, κ_intra = 0.54 for EN-CZ.

  37. AL-BLEU: Evaluation and Results 37: We use 900 sentences extracted from the dataset: 600 for development and 300 for test. AL-BLEU correlates better with human judgments: Kendall's τ is 0.3361 (dev) and 0.3162 (test) for BLEU, versus 0.3759 (dev) and 0.3521 (test) for AL-BLEU, where τ = (# concordant pairs − # discordant pairs) / total pairs.
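
The τ here is Kendall's tau over pairwise comparisons of system outputs. A minimal sketch of that count-based computation, with made-up metric scores and human ranks, assuming no ties:

```python
# A minimal sketch of the Kendall's tau computation shown on the slide: count
# pairs of system outputs where the metric and the human ranking agree
# (concordant) vs. disagree (discordant). The inputs below are made up.
from itertools import combinations

def kendall_tau(metric_scores, human_ranks):
    """tau = (#concordant - #discordant) / total pairs; human_ranks: 1 = best."""
    concordant = discordant = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        metric_pref = metric_scores[i] - metric_scores[j]   # > 0: metric prefers i
        human_pref = human_ranks[j] - human_ranks[i]        # > 0: humans prefer i
        if metric_pref * human_pref > 0:
            concordant += 1
        elif metric_pref * human_pref < 0:
            discordant += 1
    total = concordant + discordant   # equals all pairs when there are no ties
    return (concordant - discordant) / total if total else 0.0

# Hypothetical: metric scores and human ranks for five MT outputs of one sentence.
print(kendall_tau([0.31, 0.28, 0.35, 0.22, 0.30], [2, 3, 1, 5, 4]))
```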

  38. AL-BLEU: Conclusion 38: We provide an annotated corpus of human judgments for the evaluation of Arabic MT, and we adapt BLEU into AL-BLEU, which uses morphological, syntactic and lexical matching and correlates better with human judgments. http://nlp.qatar.cmu.edu/resources/AL-BLEU

  39. 39 Thank you for your attention

  40. Collaborators 40 Prof. Kemal Oflazer Dr. Behrang Mohit Hanan Mohammed
