Machine Translation for Human Translators

Carnegie Mellon Ph.D. Thesis. Michael Denkowski, Language Technologies Institute, School of Computer Science, Carnegie Mellon University. April 20, 2015. Thesis Committee: Alon Lavie (chair), Carnegie ...


  1. Hierarchical Phrase-Based Translation Example. [Derivation diagram: phrase rules cover "Pourtant ," → "Yet", "la vérité" → "the truth", and "selon moi" → "in my view"] F: Pourtant , la vérité est ailleurs selon moi . E (partial): Yet in my view the truth

  2. Hierarchical Phrase-Based Translation Example. [Derivation diagram: a top-level rule "Pourtant , X est ailleurs X ." → "Yet X , X lies elsewhere ." combines with X → ⟨la vérité / the truth⟩ and X → ⟨selon moi / in my view⟩] F: Pourtant , la vérité est ailleurs selon moi . E: Yet in my view , the truth lies elsewhere .

  3. Model Parameterization. Ambiguity: many ways to translate the same source phrase. Add feature scores that encode properties of translation:
     X → ⟨devis / quote⟩          0.5  10  -137  ...
     X → ⟨devis / estimate⟩       0.4  13  -261  ...
     X → ⟨devis / specifications⟩ 0.2   5  -407  ...
     Decoder uses feature scores and weights to select the most likely translation derivation.

  4. Linear Translation Models. Single feature score for a translation derivation D with rule-local features $h_i \in H_i$:
     $H_i(D) = \sum_{X \to \langle \bar{f}, \bar{e} \rangle \in D} h_i(X \to \langle \bar{f}, \bar{e} \rangle)$
     Score for a derivation using several features $H_i \in H$ with weight vector $w_i \in W$:
     $S(D) = \sum_{i=1}^{|H|} w_i H_i(D)$
     Decoder selects the translation with the largest product $W \cdot H$.

  5. Linear Translation Models. (Same equations as the previous slide.) ✓ sentence-level prediction step
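
To make the linear model concrete, here is a minimal sketch of derivation scoring; the rule contents, feature names, and weights are illustrative, not taken from the thesis. It sums rule-local features into $H_i(D)$ and then takes the weighted sum $S(D) = \sum_i w_i H_i(D)$.

```python
# Minimal sketch of linear model scoring for one derivation.
# Rules, feature names, and weights are illustrative, not from the thesis.
from typing import Dict, List

Rule = Dict[str, float]  # local feature values h_i for one grammar rule

def derivation_features(derivation: List[Rule]) -> Dict[str, float]:
    """H_i(D): sum each rule-local feature h_i over all rules in D."""
    totals: Dict[str, float] = {}
    for rule in derivation:
        for name, value in rule.items():
            totals[name] = totals.get(name, 0.0) + value
    return totals

def derivation_score(derivation: List[Rule], weights: Dict[str, float]) -> float:
    """S(D) = sum_i w_i * H_i(D): dot product of weights and summed features."""
    features = derivation_features(derivation)
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Example: two rules used in one derivation (feature values are made up).
derivation = [
    {"log_p_e_given_f": -0.7, "word_penalty": -2.0},
    {"log_p_e_given_f": -1.2, "word_penalty": -3.0},
]
weights = {"log_p_e_given_f": 1.0, "word_penalty": 0.3}
print(derivation_score(derivation, weights))
```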

  6. Learning Translations

  7. Translation Model Estimation. Sentence-parallel bilingual text:
     F: Devis de garage en quatre étapes. Avec l'outil Auda-Taller, l'entreprise Audatex garantit que l'usager obtient un devis en seulement quatre étapes : identifier le véhicule, chercher la pièce de rechange, créer un devis et le générer. La facilité d'utilisation est un élément essentiel de ces systèmes, surtout pour convaincre les professionnels les plus âgés qui, dans une plus ou moins grande mesure, sont rétifs à l'utilisation de nouvelles techniques de gestion.
     E: A shop's estimate in four steps. With the AudaTaller tool, Audatex guarantees that the user gets an estimate in only 4 steps: identify the vehicle, look for the spare part, create an estimate and generate an estimate. User friendliness is an essential condition for these systems, especially to convincing older technicians, who, to varying degrees, are usually more reluctant to use new management techniques.

  8. Translation Model Estimation. Sentence-parallel bilingual text (same example as above). Each sentence is a training instance.

  9. Model Estimation: Word Alignment. Brown et al. (1993), Dyer et al. (2013). [Alignment diagram: F: Devis de garage en quatre étapes / E: A shop 's estimate in four steps]

  10. Model Estimation: Word Alignment. Brown et al. (1993), Dyer et al. (2013). [Same sentence pair, now with word alignment links drawn]
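
The slides cite Brown et al. (1993) and Dyer et al. (2013) for word alignment. As a rough illustration of how lexical translation probabilities are learned from sentence-parallel text, here is a minimal IBM Model 1 EM sketch; the toy data and the omission of the NULL word are simplifications, and the thesis pipeline itself uses a more refined aligner.

```python
# Minimal IBM Model 1 EM sketch: learns word translation probabilities t(e|f)
# from sentence-parallel text. Illustrative only; NULL alignment is omitted and
# real pipelines use more refined aligners (e.g. fast_align, Dyer et al. 2013).
from collections import defaultdict

pairs = [
    ("devis de garage".split(), "a shop 's estimate".split()),
    ("devis en quatre étapes".split(), "estimate in four steps".split()),
]

# Uniform initialization of t(e|f)
f_vocab = {f for fs, _ in pairs for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):  # EM iterations
    count = defaultdict(float)   # expected counts c(e, f)
    total = defaultdict(float)   # expected counts c(f)
    for fs, es in pairs:
        for e in es:
            norm = sum(t[(e, f)] for f in fs)
            for f in fs:
                delta = t[(e, f)] / norm
                count[(e, f)] += delta
                total[f] += delta
    for (e, f), c in count.items():
        t[(e, f)] = c / total[f]

print(round(t[("estimate", "devis")], 3))  # rises toward 1 over iterations
```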

  11. Model Estimation: Phrase Extraction. Koehn et al. (2003), Och and Ney (2004), Och et al. (1999). [Alignment grid for "Devis de garage en quatre étapes" / "A shop 's estimate in four steps"]

  12. Model Estimation: Phrase Extraction. Koehn et al. (2003), Och and Ney (2004), Och et al. (1999). [Same grid with extracted phrase pairs highlighted, e.g. ⟨de garage / a shop 's⟩ and ⟨en quatre étapes / in four steps⟩]
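
A minimal sketch of consistency-based phrase extraction in the spirit of Koehn et al. (2003): a source span and a target span form a phrase pair when no alignment link crosses the box. The toy alignment below is hypothetical, and the sketch skips the usual extension to unaligned boundary words.

```python
# Minimal sketch of consistency-based phrase extraction (Koehn et al., 2003).
# A phrase pair is extracted when no alignment link leaves the box.
def extract_phrases(f_len, e_len, alignment, max_len=4):
    """alignment: set of (f_index, e_index) links."""
    phrases = []
    for f1 in range(f_len):
        for f2 in range(f1, min(f_len, f1 + max_len)):
            # target positions linked to the source span
            e_points = [e for (f, e) in alignment if f1 <= f <= f2]
            if not e_points:
                continue
            e1, e2 = min(e_points), max(e_points)
            if e2 - e1 >= max_len:
                continue
            # consistency: every link touching the target span stays inside the source span
            consistent = all(f1 <= f <= f2
                             for (f, e) in alignment if e1 <= e <= e2)
            if consistent:
                phrases.append(((f1, f2), (e1, e2)))
    return phrases

# Toy example: Devis de garage / A shop 's estimate
f_words = "Devis de garage".split()
e_words = "A shop 's estimate".split()
links = {(0, 3), (2, 1)}  # Devis-estimate, garage-shop (hypothetical alignment)
for (f1, f2), (e1, e2) in extract_phrases(len(f_words), len(e_words), links):
    print(" ".join(f_words[f1:f2 + 1]), "|||", " ".join(e_words[e1:e2 + 1]))
```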

  13. Model Estimation: Hierarchical Phrase Extraction. Chiang (2007). [Alignment grid for "Pourtant , la vérité est ailleurs selon moi ." / "Yet in my view , the truth lies elsewhere ."]

  14. Model Estimation: Hierarchical Phrase Extraction. Chiang (2007). [Same grid; extracted phrase pair:] la vérité est ailleurs selon moi . → in my view , the truth lies elsewhere .

  15. Model Estimation: Hierarchical Phrase Extraction. Chiang (2007). [Same grid with sub-phrases X1 = "la vérité" and X2 = "selon moi" generalized:] la vérité est ailleurs selon moi . → in my view , the truth lies elsewhere . yields the rule X1 est ailleurs X2 . → X2 , X1 lies elsewhere .

  16. Model Estimation: Hierarchical Phrase Extraction. Chiang (2007). (Same extraction as above.) ✓ sentence-level rule learning

  17. Parameterization: Feature Scoring. Add feature functions to rules $X \to \bar{f}/\bar{e}$: $\{ h_i(X \to \bar{f}/\bar{e}) \}_{i=1}^{N}$. [Pipeline diagram: Training Data → Corpus Stats (global) → Scored Grammar (static) → Translate Sentence, with the Input Sentence entering at the Translate step]

  18. Parameterization: Feature Scoring. (Same pipeline as above.) × corpus-level rule scoring

  19. Suffix Array Grammar Extraction. Brown (1996), Callison-Burch et al. (2005), Lopez (2008). [Pipeline diagram: static Training Data → Suffix Array; for each Input Sentence, an SA Sample → Sample Stats → per-sentence Grammar of rules $X \to \bar{f}/\bar{e}$ with features $\{ h_i \}_{i=1}^{N}$ → Translate → Sentence]

  20. Scoring via Sampling. Suffix array statistics available in sample S for each source phrase $\bar{f}$:
     $c_S(\bar{f}, \bar{e})$: count of instances where $\bar{f}$ is aligned to $\bar{e}$ (co-occurrence count)
     $c_S(\bar{f})$: count of instances where $\bar{f}$ is aligned to any target
     $|S|$: total number of instances (equal to occurrences of $\bar{f}$ in the training data, up to the sample size)
     Used to calculate feature scores for each rule at the time of extraction.

  21. Scoring via Sampling. (Same statistics as above.) × sentence-level grammar extraction, but static training data

  22. Overview: Online learning for statistical MT · Translation model review · Real time model adaptation · Simulated post-editing · Post-editing software and experiments · Kent State live post-editing · Automatic metrics for post-editing · Meteor automatic metric · Evaluation and optimization for post-editing · Conclusion and Future Work

  23. Online Grammar Extraction. Denkowski et al. (EACL 2014). [Pipeline diagram as before: static Training Data → Suffix Array → Sample → Sample Stats → per-sentence Grammar → Translate Sentence]

  24. Online Grammar Extraction. Denkowski et al. (EACL 2014). [Pipeline diagram extended with a Dynamic Lookup Table fed by each Post-Edit Sentence and queried alongside the suffix array sample]

  25. Online Grammar Extraction. Denkowski et al. (EACL 2014). Maintain a dynamic lookup table for post-edit data. Pair each sample S from the suffix array with an exhaustive lookup L from the lookup table. Parallel statistics available at grammar scoring time:
     $c_L(\bar{f}, \bar{e})$: count of instances where $\bar{f}$ is aligned to $\bar{e}$ (co-occurrence count)
     $c_L(\bar{f})$: count of instances where $\bar{f}$ is aligned to any target
     $|L|$: total number of instances (equal to occurrences of $\bar{f}$ in the post-edit data, no limit)
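
A minimal sketch of how such a post-edit lookup table might be maintained; the class and method names are hypothetical, not the cdec implementation. Each new post-edited sentence contributes phrase-pair co-occurrence counts that can later be queried exhaustively as $c_L(\bar{f}, \bar{e})$, $c_L(\bar{f})$, and $|L|$.

```python
# Minimal sketch of a dynamic post-edit lookup table (structure is illustrative).
# Counts mirror the statistics used at grammar scoring time:
#   c_L(f, e), c_L(f), and |L| per source phrase f.
from collections import defaultdict

class PostEditLookup:
    def __init__(self):
        self.cooc = defaultdict(int)       # c_L(f, e)
        self.aligned = defaultdict(int)    # c_L(f): f aligned to any target
        self.instances = defaultdict(int)  # |L|: occurrences of f in post-edit data

    def add_sentence(self, phrase_pairs, source_phrases):
        """phrase_pairs: extracted (f, e) pairs; source_phrases: all f spans seen."""
        for f in source_phrases:
            self.instances[f] += 1
        for f, e in phrase_pairs:
            self.cooc[(f, e)] += 1
            self.aligned[f] += 1

    def stats(self, f):
        """Return (c_L(f), |L|); c_L(f, e) is available via self.cooc."""
        return self.aligned[f], self.instances[f]

# After the translator post-edits one sentence:
lookup = PostEditLookup()
lookup.add_sentence(
    phrase_pairs=[("devis", "estimate"), ("en quatre étapes", "in four steps")],
    source_phrases=["devis", "en quatre étapes"],
)
print(lookup.cooc[("devis", "estimate")], lookup.stats("devis"))
```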

  26. Rule Scoring. Denkowski et al. (EACL 2014). Suffix array feature set (Lopez 2008). Phrase features encode likelihood of translation rule given training data. Features scored with S:
     $\text{CoherentP}(e|f) = \frac{c_S(\bar{f}, \bar{e})}{|S|}$
     $\text{Count}(f,e) = c_S(\bar{f}, \bar{e})$
     $\text{SampleCount}(f) = |S|$

  27. Rule Scoring. Denkowski et al. (EACL 2014). Features scored with S and L:
     $\text{CoherentP}(e|f) = \frac{c_S(\bar{f}, \bar{e}) + c_L(\bar{f}, \bar{e})}{|S| + |L|}$
     $\text{Count}(f,e) = c_S(\bar{f}, \bar{e}) + c_L(\bar{f}, \bar{e})$
     $\text{SampleCount}(f) = |S| + |L|$

  28. Rule Scoring. Denkowski et al. (EACL 2014). Indicator features identify certain classes of rules. Features scored with S:
     $\text{Singleton}(f) = 1$ if $c_S(\bar{f}) = 1$, 0 otherwise
     $\text{Singleton}(f,e) = 1$ if $c_S(\bar{f}, \bar{e}) = 1$, 0 otherwise

  29. Rule Scoring. Denkowski et al. (EACL 2014). Features scored with S and L:
     $\text{Singleton}(f) = 1$ if $c_S(\bar{f}) + c_L(\bar{f}) = 1$, 0 otherwise
     $\text{Singleton}(f,e) = 1$ if $c_S(\bar{f}, \bar{e}) + c_L(\bar{f}, \bar{e}) = 1$, 0 otherwise
     $\text{PostEditSupport}(f,e) = 1$ if $c_L(\bar{f}, \bar{e}) > 0$, 0 otherwise
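
Putting the two preceding slides together, a small sketch of how the combined feature values could be computed from the sample statistics ($c_S$, $|S|$) and lookup statistics ($c_L$, $|L|$); the counts in the example are made up.

```python
# Sketch of the combined rule features from the preceding slides, computed
# from sample statistics (c_S, |S|) and post-edit lookup statistics (c_L, |L|).
def score_rule(c_S_fe, c_S_f, size_S, c_L_fe, c_L_f, size_L):
    feats = {}
    feats["CoherentP(e|f)"] = (c_S_fe + c_L_fe) / (size_S + size_L)
    feats["Count(f,e)"] = c_S_fe + c_L_fe
    feats["SampleCount(f)"] = size_S + size_L
    feats["Singleton(f)"] = 1.0 if (c_S_f + c_L_f) == 1 else 0.0
    feats["Singleton(f,e)"] = 1.0 if (c_S_fe + c_L_fe) == 1 else 0.0
    feats["PostEditSupport(f,e)"] = 1.0 if c_L_fe > 0 else 0.0
    return feats

# Example: "devis -> estimate" seen 13 times in a 30-instance sample and
# twice in the post-edit data (numbers are illustrative).
print(score_rule(c_S_fe=13, c_S_f=28, size_S=30, c_L_fe=2, c_L_f=2, size_L=2))
```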

  30. Parameter Optimization. Denkowski et al. (EACL 2014). Choose feature weights that maximize an objective function (BLEU score) on a development corpus. Minimum error rate training (MERT) (Och, 2003): [diagram: Translate ↔ Optimize loop]

  31. Parameter Optimization. Denkowski et al. (EACL 2014). MERT (Och, 2003): [diagram: Translate ↔ Optimize loop]. Margin infused relaxed algorithm (MIRA) (Chiang, 2012): [diagram: per-sentence Translate → Update loop against the Truth (reference)]
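
A schematic MIRA-style update, simplified to a single hope/fear pair with a clipped step size; this illustrates the idea rather than the exact cdec or Moses implementation. The weights move toward the features of a high-metric "hope" derivation and away from a high-scoring but low-metric "fear" derivation.

```python
# Schematic MIRA-style weight update (simplified; not the exact cdec/Moses code).
# hope: high metric score; fear: high model score but low metric score.
def mira_update(weights, hope_feats, fear_feats, hope_bleu, fear_bleu, C=0.01):
    """One clipped update toward the hope derivation and away from the fear derivation."""
    keys = set(weights) | set(hope_feats) | set(fear_feats)
    delta = {k: hope_feats.get(k, 0.0) - fear_feats.get(k, 0.0) for k in keys}
    loss = hope_bleu - fear_bleu                       # metric difference (>= 0)
    margin = sum(weights.get(k, 0.0) * delta[k] for k in keys)
    violation = loss - margin                          # hinge: want margin >= loss
    if violation <= 0:
        return weights                                 # no update needed
    norm = sum(v * v for v in delta.values()) or 1e-9
    step = min(C, violation / norm)                    # clipped step size
    return {k: weights.get(k, 0.0) + step * delta[k] for k in keys}

# Toy usage with made-up feature values
w = {"lm": 0.5, "tm": 0.5}
w = mira_update(w, hope_feats={"lm": -2.0, "tm": -1.0},
                fear_feats={"lm": -1.0, "tm": -1.0},
                hope_bleu=0.35, fear_bleu=0.20)
print(w)
```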

  32. Post-Editing with Standard MT. Denkowski et al. (EACL 2014). [System diagram: a static grammar of rules $X \to \bar{f}/\bar{e}$, a large LM, and weights $w_i \ldots w_n$ feed the decoder; each Input Sentence is decoded and sent to Post-Editing]

  33. Post-Editing with Adaptive MT. Denkowski et al. (EACL 2014). [System diagram: the static large bitext plus dynamic post-edit (PE) data feed the TM; the LM, weights $w_i \ldots w_n$, and grammar $X \to \bar{f}/\bar{e}$ feed the decoder; each Input Sentence is decoded, post-edited, and the PE data flows back into the model]

  34. Overview. How can we build systems without translators in the loop?

  35. Overview: Online learning for statistical MT · Translation model review · Real time model adaptation · Simulated post-editing · Post-editing software and experiments · Kent State live post-editing · Automatic metrics for post-editing · Meteor automatic metric · Evaluation and optimization for post-editing · Conclusion and Future Work

  36. Simulated Post-Editing. Denkowski et al. (EACL 2014). [Table: incremental training data, Spanish source sentences paired with English target (reference) sentences, e.g. "Hola contestadora ..." / "Hello voicemail, my old ...", "He llamado a servicio ..." / "I've called for tech ..."]. Use pre-generated references in place of post-editing (Hardt and Elming, 2010). Build, evaluate, and deploy adaptive systems using only standard training data.
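
A minimal sketch of the simulated post-editing loop, with hypothetical decode, update_grammar, and update_weights hooks standing in for the real system components: translate each sentence with the current model, then feed the pre-generated reference back as if it were the translator's post-edit before moving to the next sentence.

```python
# Minimal sketch of simulated post-editing (hypothetical component names).
# The reference translation stands in for the translator's post-edit.
def simulated_post_editing(sentence_pairs, decode, update_grammar, update_weights):
    hypotheses = []
    for source, reference in sentence_pairs:
        hypothesis = decode(source)           # translate with the current model
        hypotheses.append(hypothesis)
        update_grammar(source, reference)     # add the "post-edit" to the lookup table
        update_weights(source, reference, hypothesis)  # e.g. one MIRA step
    return hypotheses
```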

  37. Simulated Post-Editing Experiments. Denkowski et al. (EACL 2014). MT system (cdec): hierarchical phrase-based model using suffix arrays, large 4-gram language model, MIRA optimization. Model adaptation: update TM and weights independently and in conjunction. Training data: WMT12 Spanish–English and NIST 2012 Arabic–English. Evaluation data: WMT/NIST news (standard test sets) and TED talks (totally blind out-of-domain test).

  38. Simulated Post-Editing Experiments. Denkowski et al. (EACL 2014). [Bar charts: BLEU scores for Baseline, Grammar, MIRA, and Both systems on Spanish–English (WMT, TED1, TED2) and Arabic–English (NIST, TED1, TED2) test sets]

  39. Simulated Post-Editing Experiments. Denkowski et al. (EACL 2014). [Same bar charts.] Up to 1.7 BLEU improvement over the static baseline.

  40. Recent Work. How can we better leverage incremental data?

  41. Translation Model Combination. Denkowski (AMTA 2014 Workshop on Interactive and Adaptive MT).
     cdec (Dyer et al., 2010): single translation model updated with new data; single feature set that changes over time (summation).
     Moses (Koehn et al., 2007): multiple translation models (background and post-editing); per-feature linear interpolation in the context of the full system.
     Recent additions to the Moses toolkit: dynamic suffix array phrase tables (Germann, 2014); fast MIRA implementation (Cherry and Foster, 2012); multiple phrase tables with runtime weight updates (Denkowski, 2014).

  42. Translation Model Combination. Denkowski (AMTA 2014 Workshop on Interactive and Adaptive MT). [Bar charts: BLEU scores for Baseline, PE Support, Multi Model, and +MIRA systems on Spanish–English (WMT, TED1, TED2) and Arabic–English (NIST, TED1, TED2) test sets]

  43. Translation Model Combination. Denkowski (AMTA 2014 Workshop on Interactive and Adaptive MT). [Same bar charts.] Up to 4.9 BLEU improvement over the static baseline.

  44. Related Work: Learning from Post-Editing.
     Updating translation grammars with post-editing data: cache-based translation and language models (Nepveu et al., 2004; Bertoldi et al., 2013); storing sufficient statistics in the grammar (Ortiz-Martínez et al., 2010); distinguishing between background and post-editing data (Hardt and Elming, 2010).
     Updating feature weights during decoding: various online learning algorithms to update MERT weights (Martínez-Gómez et al., 2012; López-Salcedo et al., 2012); an algorithm for learning from binary classification examples (Saluja et al., 2012).

  45. Overview: Online learning for statistical MT · Translation model review · Real time model adaptation · Simulated post-editing · Post-editing software and experiments · Kent State live post-editing · Automatic metrics for post-editing · Meteor automatic metric · Evaluation and optimization for post-editing · Conclusion and Future Work

  46. Tools for Human Translators

  47. TransCenter Post-Editing Interface. Denkowski and Lavie (AMTA 2012), Denkowski et al. (HaCat 2014). [Interface screenshot]

  48. TransCenter Post-Editing Interface. Denkowski and Lavie (AMTA 2012), Denkowski et al. (HaCat 2014). [Interface screenshot]

  49. TransCenter Post-Editing Interface. Denkowski and Lavie (AMTA 2012), Denkowski et al. (HaCat 2014). [Interface screenshot]

  50. TransCenter Post-Editing Interface. Denkowski and Lavie (AMTA 2012), Denkowski et al. (HaCat 2014). [Interface screenshot]

  51. Overview: Online learning for statistical MT · Translation model review · Real time model adaptation · Simulated post-editing · Post-editing software and experiments · Kent State live post-editing · Automatic metrics for post-editing · Meteor automatic metric · Evaluation and optimization for post-editing · Conclusion and Future Work

  52. Post-Editing Field Test. Denkowski et al. (HaCat 2014). Experimental setup: six translation studies students from Kent State University post-edited MT output. Text: 4 excerpts from TED talks translated from Spanish into English (100 sentences total). Two excerpts translated by the static system, two by the adaptive system (shuffled by user). Record post-editing effort (HTER) and translator rating.

  53. Post-Editing Field Test. Denkowski et al. (HaCat 2014). Results: the adaptive system significantly outperforms the static baseline; a small improvement in the simulated scenario leads to a significant improvement in production.
     System    HTER ↓   Rating ↑   Sim PE BLEU ↑
     Baseline  19.26    4.19       34.50
     Adaptive  17.01    4.31       34.95

  54. Related Work: Computer-Aided Translation Tools.
     Translation software suites: CASMACAT project, a full-featured open source translator's workbench (Ortiz-Martínez et al., 2012); MateCat project, an enterprise-grade workbench with MT integration and project management (Federico, 2014; Cattelan, 2014).
     Novel CAT approaches: a streamlined interface with both phrase prediction and post-editing (Green, 2014); effectiveness of monolingual post-editing assisted by word alignments (Schwartz, 2014).

  55. Overview: Online learning for statistical MT · Translation model review · Real time model adaptation · Simulated post-editing · Post-editing software and experiments · Kent State live post-editing · Automatic metrics for post-editing · Meteor automatic metric · Evaluation and optimization for post-editing · Conclusion and Future Work

  56. System Optimization. Parameter optimization (MIRA): choose feature weights W that maximize the objective on the tuning set. Automatic metrics approximate human evaluation of MT output against reference translations. Adequacy-based evaluation: good translations should be semantically similar to references. Several adequacy-driven research efforts: ACL WMT (Callison-Burch et al., 2011), NIST OpenMT (Przybocki et al., 2009).

  57. Standard MT Evaluation. Standard BLEU metric based on N-gram precision (P) (Papineni et al., 2002). Matches spans of hypothesis E′ against reference E. Surface forms only; depends on multiple references to capture translation variation (expensive). Jointly measures word choice and order.
     $\text{BLEU} = \text{BP} \times \exp\left( \frac{1}{N} \sum_{n=1}^{N} \log P_n \right)$, where $\text{BP} = \begin{cases} 1 & |E'| > |E| \\ e^{1 - |E|/|E'|} & |E'| \le |E| \end{cases}$
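
For reference, a minimal single-reference BLEU sketch that follows the formula above (clipped n-gram precision, geometric mean over n = 1..N, brevity penalty); real evaluations compute the statistics at the corpus level and usually apply smoothing.

```python
# Minimal single-reference BLEU sketch following the slide's formula.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, N=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, N + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if matches == 0:
            return 0.0   # unsmoothed: any zero precision gives BLEU = 0
        log_precisions.append(math.log(matches / total))
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1.0 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / N)

print(bleu("the truth lies elsewhere", "the truth lies elsewhere"))  # 1.0
print(bleu("a big house", "the large home"))                         # 0.0 (no matches)
```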

  58. Standard MT Evaluation. Shortcomings of the BLEU metric (Banerjee and Lavie, 2005; Callison-Burch et al., 2007): evaluating surface forms misses correct translations; N-grams have no notion of global coherence.

  59. Standard MT Evaluation. Shortcomings of the BLEU metric (Banerjee and Lavie, 2005; Callison-Burch et al., 2007): evaluating surface forms misses correct translations; N-grams have no notion of global coherence. Example reference E: The large home

  60. Standard MT Evaluation. Shortcomings of the BLEU metric (Banerjee and Lavie, 2005; Callison-Burch et al., 2007): evaluating surface forms misses correct translations; N-grams have no notion of global coherence. E: The large home. E′1: A big house (BLEU = 0). E′2: I am a dinosaur (BLEU = 0).

  61. Post-Editing. Final translations must be human quality (editing required). Good MT output should require less work for humans to edit. Human-targeted translation edit rate (HTER, Snover et al., 2006): (1) human translators correct the MT output, (2) automatically calculate the number of edits using TER.
     $\text{TER} = \frac{\#\,\text{edits}}{|E|}$
     Edits: insertion, deletion, substitution, block shift. "Better" translations are not always easier to post-edit.
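
A simplified TER sketch: word-level edit distance (insertions, deletions, substitutions) divided by reference length. Real TER (and hence HTER) also searches for block shifts, which this sketch omits.

```python
# Simplified TER sketch: word-level edit distance (insert/delete/substitute)
# divided by reference length. Real TER (Snover et al., 2006) also counts
# block shifts, which this sketch omits.
def ter(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j]: edits to turn the first i hypothesis words into the first j reference words
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(hyp)][len(ref)] / max(len(ref), 1)

print(ter("the truth lies elsewhere", "yet the truth lies elsewhere"))  # 1 edit / 5 words = 0.2
```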
