Translation Quality Estimation: Past, Present, and Future
André Martins (Unbabel)
MT Marathon (MTM), Lisbon, August 31st, 2017

This Talk
- First part: largely based on Lucia Specia’s MTM16 slides
- Second part: joint work with Marcin, Fabio, Ramon, Chris, Roman
- Third part: my thoughts on the future of QE

Outline
1. MT Evaluation & Quality Estimation
2. Pushing the Limits of Quality Estimation
3. The Future

Why Do We Care About Evaluation?
In the business of developing MT, we need to:
- measure progress over new/alternative versions
- compare different MT systems
- decide whether a translation is good enough for something
- optimize parameters of MT systems
- understand where systems go wrong (diagnosis)
- ...
Remember Yvette’s lecture on Monday:

Why Do We Care About Evaluation?
One should optimize a system using the same metric that will be used to evaluate it.
Issue: how to choose a metric?
- Choice should be related to the system’s purpose (not the case in practice)
- Other aspects are important for tuning (sentence/corpus-level, fast, cheap, differentiable, ...)

Complex Problem
What does quality mean?
- Fluent?
- Adequate?
- Both?
- Easy to post-edit?
- System A better than system B?
- ...
Quality for whom / what?
- End-user (gisting vs dissemination)
- Post-editor (light vs heavy post-editing)
- Other applications (e.g. CLIR)
- MT-system (tuning or diagnosis for improvement)
- ...

Complex Problem
MT: Do buy this product, it’s their craziest invention!
HT: Do not buy this product, it’s their craziest invention!
- Severe if the end-user does not speak the source language
- Trivial for translators to post-edit

Complex Problem
MT: Six-hours battery, 30 minutes to full charge last.
HT: The battery lasts 6 hours and it can be fully recharged in 30 minutes.
- OK for gisting: meaning preserved
- Very costly for post-editing if style is to be preserved

A Taxonomy of MT Evaluation Methods
- Manual
  - Direct assessment
    - Scoring
- Automatic

Manual Assessment: Scoring
Is this translation correct?

Manual Assessment: Ranking

MQM (Multidimensional Quality Metrics)

Amount of Post-Editing
HTER

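HTER (human-targeted translation edit rate) counts the edits needed to turn the MT output into its post-edited version, normalized by the length of the post-edit. Below is a minimal sketch, assuming whitespace tokenization and plain word-level Levenshtein distance. Real TER also allows block shifts, which this sketch omits, so it can overestimate the score. The function names are illustrative, and the example treats the HT from the "Do buy this product" slide as if it were the post-edit.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insert, delete, substitute; no shifts)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_hter(mt, post_edit):
    """Edits from MT to its post-edit, divided by the post-edit length."""
    mt_tokens = mt.split()
    pe_tokens = post_edit.split()
    if not pe_tokens:
        raise ValueError("post-edit must not be empty")
    return edit_distance(mt_tokens, pe_tokens) / len(pe_tokens)


mt = "Do buy this product, it's their craziest invention!"
post_edit = "Do not buy this product, it's their craziest invention!"
score = approx_hter(mt, post_edit)
print(round(score, 3))  # 0.111: one inserted word over nine post-edit tokens
```

The low score matches the slide's point that this error is trivial to post-edit, even though it is severe for an end-user.
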
Reading Comprehension

Eye-Tracking

A Taxonomy of MT Evaluation Methods
- Manual
  - Direct assessment
    - Scoring
    - Ranking
    - Error annotation
  - Task-based
    - Post-editing
    - Reading comprehension
    - Eye-tracking
- Automatic
  - Reference-based: BLEU, Meteor, NIST, TER, WER, PER, CDER, BEER, CiDER, Cobalt, RATATOUILLE, RED, AMBER, PARMESAN, ...
  - Quality estimation

Reference-Based Evaluation
- Reference(s): subset of good translations, usually one
- Some metrics expand matching, e.g. synonyms in Meteor
- Huge variation in reference translations. E.g.:

Source: 不过这一切都由不得你 (However these all totally beyond the control of you.)
MT: But all this is beyond the control of you.

      Translation                                      Human score  BLEU score
HT 1  But all this is beyond your control.             3.4          0.427
HT 2  However, you cannot choose yourself.             2            0.049
HT 3  However, not everything is up to you to decide.  2            0.050
HT 4  But you can’t choose that.                       2.8          0.055

- Metrics completely disregard the source segment
- Main problem: cannot be applied to MT systems in use (no reference translation is available there)

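To see how strongly a reference-based score depends on which reference is picked, the sketch below scores the slide's MT output against each of the four HTs. It is a simplified sentence-level BLEU, assuming lowercased whitespace tokens, add-one smoothing on every n-gram order, and the standard brevity penalty. Because tokenization and smoothing differ from whatever produced the slide's numbers, the absolute values will not match the table; only the ordering is of interest.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hyp, ref, max_n=4):
    """Simplified single-reference BLEU with add-one smoothing (an assumption)."""
    hyp_toks, ref_toks = hyp.lower().split(), ref.lower().split()
    if not hyp_toks:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp_toks, n), ngrams(ref_toks, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref_ng[gram]) for gram, count in hyp_ng.items())
        total = sum(hyp_ng.values())
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference.
    if len(hyp_toks) >= len(ref_toks):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_toks) / len(hyp_toks))
    return bp * math.exp(log_prec)


mt = "But all this is beyond the control of you."
references = {
    "HT 1": "But all this is beyond your control.",
    "HT 2": "However, you cannot choose yourself.",
    "HT 3": "However, not everything is up to you to decide.",
    "HT 4": "But you can't choose that.",
}
scores = {name: sentence_bleu(mt, ref) for name, ref in references.items()}
for name, value in scores.items():
    print(name, round(value, 3))
```

The same MT output gets a much higher score against HT 1 than against the other three references, although all four are acceptable human translations.
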
Quality Estimation (Specia et al., 2013)
- Quality Estimation (QE): metrics that provide an estimate of the quality of translations on the fly
- Quality defined by the data: purpose is clear, no comparison to references, source considered
- Quality = Can we publish it as is?

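The key interface difference from reference-based metrics is that a QE system sees only the source and the MT output. The sketch below illustrates just that input side with a hypothetical feature extractor; the feature set is an illustrative assumption, not the one used in the talk. A model trained on quality-labelled data (not shown) would map such features to a score. The Portuguese source sentence is invented for the example, since the slides do not give the source of the battery sentence.

```python
import re


def qe_features(source, mt):
    """Hypothetical reference-free features computed from (source, MT) only."""
    src_tokens = source.split()
    mt_tokens = mt.split()
    src_numbers = set(re.findall(r"\d+", source))
    mt_numbers = set(re.findall(r"\d+", mt))
    return {
        "src_len": len(src_tokens),
        "mt_len": len(mt_tokens),
        "len_ratio": len(mt_tokens) / max(len(src_tokens), 1),
        # Numbers present in the source but absent from the MT output.
        "missing_numbers": len(src_numbers - mt_numbers),
    }


# Invented source sentence (assumption); MT output taken from the slides.
source = "A bateria dura 6 horas e recarrega em 30 minutos."
mt = "Six-hours battery, 30 minutes to full charge last."
features = qe_features(source, mt)
print(features)
```

No reference translation appears anywhere in the function signature, which is what allows this kind of estimate to run on MT systems in use.
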