Estimating post-editing effort: State-of-the-art systems and open issues

Lucia Specia, University of Sheffield
l.specia@sheffield.ac.uk
17 August 2012



Datasets
English source sentences; Spanish MT outputs (PBSMT, Moses)
Effort scores by 3 human judges, scale 1-5, averaged
Post-edited output; human Spanish translation (original references)
Training set: 1,832 sentences; blind test set: 422 sentences

Datasets: annotation guidelines
3 human judges assess PE effort, assigning 1-5 scores given the source, the MT output and the post-edited output:
[1] The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited; it needs to be translated from scratch.
[2] About 50-70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.
[3] About 25-50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.
[4] About 10-25% of the MT output needs to be edited. It is generally clear and intelligible.
[5] The MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.

Resources
SMT resources for the training and test sets:
SMT training corpus (Europarl and News Commentary)
LMs: 5-gram LM; 3-gram LM and 1-3-gram counts
IBM Model 1 table (GIZA++)
Word-alignment file as produced by grow-diag-final
Phrase table with word alignment information
Moses configuration file used for decoding
Moses run-time log: model component values, word graph, etc.

Evaluation metrics
Scoring metrics: standard MAE and RMSE

MAE = \frac{\sum_{i=1}^{N} |H(s_i) - V(s_i)|}{N}
RMSE = \sqrt{\frac{\sum_{i=1}^{N} (H(s_i) - V(s_i))^2}{N}}

where N = |S|, H(s_i) is the predicted score for s_i and V(s_i) is the human score for s_i.
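As a concrete reference, a minimal sketch (not from the slides; the function name mae_rmse is illustrative) computing both scoring metrics for predicted and gold effort scores:

```python
import math

def mae_rmse(predicted, gold):
    """Compute MAE and RMSE between predicted scores H(s_i) and human scores V(s_i)."""
    assert len(predicted) == len(gold)
    n = len(gold)
    abs_errors = [abs(h - v) for h, v in zip(predicted, gold)]
    sq_errors = [(h - v) ** 2 for h, v in zip(predicted, gold)]
    mae = sum(abs_errors) / n
    rmse = math.sqrt(sum(sq_errors) / n)
    return mae, rmse

# Example: 5 sentences with 1-5 effort scores
print(mae_rmse([3.2, 4.1, 2.0, 4.8, 3.5], [3, 4, 2, 5, 4]))
```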


Evaluation metrics
Ranking metrics: Spearman's rank correlation and a new metric, DeltaAvg.
For quantiles S_1, S_2, ..., S_n:

DeltaAvg_V[n] = \frac{\sum_{k=1}^{n-1} V(S_{1,k})}{n-1} - V(S)

V(S): extrinsic function measuring the "quality" of set S, here the average human score (1-5) of the sentences in S; S_{1,k} is the union of the top k quantiles.


Evaluation metrics
Example 1: n = 2, quantiles S_1, S_2

DeltaAvg_V[2] = V(S_1) - V(S)

"Quality of the top half compared to the overall quality": the average human score of the top half compared to the average human score of the complete set.

Example 2: n = 3, quantiles S_1, S_2, S_3

DeltaAvg_V[3] = \frac{(V(S_1) - V(S)) + (V(S_{1,2}) - V(S))}{2}

The average human score of the top third and of the top two thirds, each compared to the average human score of the complete set, then averaged.


Evaluation metrics
Final DeltaAvg metric:

DeltaAvg_V = \frac{\sum_{n=2}^{N} DeltaAvg_V[n]}{N - 1}, \quad N = |S| / 2

i.e. the average of DeltaAvg_V[n] over all n, 2 ≤ n ≤ |S|/2.
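To make the definition concrete, here is a minimal sketch (my illustration, not part of the slides): delta_avg takes the human scores V ordered by a system's predicted ranking; quantile boundaries are handled by simple rounding, which may differ slightly from the official evaluation script.

```python
def delta_avg(scores_by_rank):
    """DeltaAvg: for each n, mean quality of the top k/n quantile unions minus the overall mean."""
    s = list(scores_by_rank)
    total_mean = sum(s) / len(s)
    big_n = len(s) // 2          # N = |S| / 2
    deltas = []
    for n in range(2, big_n + 1):
        quantile_size = len(s) / n
        head_means = []
        for k in range(1, n):    # V(S_{1,k}) for k = 1 .. n-1
            head = s[: int(round(k * quantile_size))]
            head_means.append(sum(head) / len(head))
        deltas.append(sum(head_means) / (n - 1) - total_mean)
    return sum(deltas) / len(deltas)

# Example: human scores ordered by a system's predicted ranking (best first)
print(delta_avg([5, 5, 4, 4, 4, 3, 3, 2, 2, 1]))
```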

Participants
ID          Participating team
PRHLT-UPV   Universitat Politecnica de Valencia, Spain
UU          Uppsala University, Sweden
SDLLW       SDL Language Weaver, USA
Loria       LORIA Institute, France
UPC         Universitat Politecnica de Catalunya, Spain
DFKI        DFKI, Germany
WLV-SHEF    Univ of Wolverhampton & Univ of Sheffield, UK
SJTU        Shanghai Jiao Tong University, China
DCU-SYMC    Dublin City University, Ireland & Symantec, Ireland
UEdin       University of Edinburgh, UK
TCD         Trinity College Dublin, Ireland

One or two systems per team; most teams submitted to both the ranking and scoring sub-tasks.


Baseline system
Feature extraction software with system-independent features:
number of tokens in the source and target sentences
average source token length
average number of occurrences of words in the target
number of punctuation marks in source and target sentences
LM probability of source and target sentences
average number of translations per source word
% of source 1-grams, 2-grams and 3-grams in frequency quartiles 1 and 4
% of seen source unigrams

SVM regression with an RBF kernel, with the parameters γ, ε and C optimised via grid search and 5-fold cross-validation on the training set. A minimal sketch of this setup follows.
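A rough sketch of such a baseline, assuming scikit-learn and a precomputed feature matrix; the parameter grid and placeholder data are illustrative, not the shared-task settings:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# X: one row of baseline features per sentence pair; y: averaged 1-5 effort scores
X = np.random.rand(1832, 17)            # placeholder for the real feature matrix
y = np.random.uniform(1, 5, size=1832)  # placeholder for the human scores

param_grid = {
    "C": [1, 10, 100],
    "gamma": [0.01, 0.1, 1.0],
    "epsilon": [0.1, 0.2, 0.5],
}
# Grid search over (C, gamma, epsilon) with 5-fold cross-validation, as in the baseline
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```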

Results - ranking sub-task
System ID                    DeltaAvg   Spearman Corr
• SDLLW M5PbestDeltaAvg      0.63       0.64
• SDLLW SVM                  0.61       0.60
UU bltk                      0.58       0.61
UU best                      0.56       0.62
TCD M5P-resources-only*      0.56       0.56
Baseline (17FFs SVM)         0.55       0.58
PRHLT-UPV                    0.55       0.55
UEdin                        0.54       0.58
SJTU                         0.53       0.53
WLV-SHEF FS                  0.51       0.52
WLV-SHEF BL                  0.50       0.49
DFKI morphPOSibm1LM          0.46       0.46
DCU-SYMC unconstrained       0.44       0.41
DCU-SYMC constrained         0.43       0.41
TCD M5P-all*                 0.42       0.41
UPC 1                        0.22       0.26
UPC 2                        0.15       0.19

• = winning submissions; gray area = not different from the baseline; * = bug fix applied after submission

Results - ranking sub-task
Oracle methods: associate various metrics, in an oracle manner, with the test input:
Oracle Effort: the gold-label effort score
Oracle HTER: the HTER metric computed against the post-edited translations as reference

System ID       DeltaAvg   Spearman Corr
Oracle Effort   0.95       1.00
Oracle HTER     0.77       0.70

Results - scoring sub-task
System ID                    MAE    RMSE
• SDLLW M5PbestDeltaAvg      0.61   0.75
UU best                      0.64   0.79
SDLLW SVM                    0.64   0.78
UU bltk                      0.64   0.79
Loria SVMlinear              0.68   0.82
UEdin                        0.68   0.82
TCD M5P-resources-only*      0.68   0.82
Baseline (17FFs SVM)         0.69   0.82
Loria SVMrbf                 0.69   0.83
SJTU                         0.69   0.83
WLV-SHEF FS                  0.69   0.85
PRHLT-UPV                    0.70   0.85
WLV-SHEF BL                  0.72   0.86
DCU-SYMC unconstrained       0.75   0.97
DFKI grcfs-mars              0.82   0.98
DFKI cfs-plsreg              0.82   0.99
UPC 1                        0.84   1.01
DCU-SYMC constrained         0.86   1.12
UPC 2                        0.87   1.04
TCD M5P-all                  2.09   2.32


Discussion: new and effective quality indicators (features)
Most participating systems use external resources (parsers, POS taggers, NER, etc.), leading to a wide variety of features
Many tried to exploit linguistically oriented features:
none or modest improvements (e.g. WLV-SHEF)
high performance (e.g. UU, with constituency and dependency trees)
Previously overlooked features: SMT decoder feature values (e.g. SDLLW)
A powerful single feature: agreement between two different SMT systems (e.g. SDLLW)


Discussion: machine learning techniques
Best performing: regression trees (M5P) and SVR
M5P regression trees: compact models, less overfitting, "readable"
SVRs: easily overfit with small training data and a large feature set
Feature selection is crucial in this setup (see the sketch below)
Structured learning techniques: the UU submissions (tree kernels)
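To illustrate the feature-selection point, a sketch under the assumption that scikit-learn's greedy forward selection is a reasonable stand-in for the selection strategies actually used by participants; data and feature counts are placeholders:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X = np.random.rand(500, 30)            # placeholder: many candidate features
y = np.random.uniform(1, 5, size=500)  # placeholder: averaged effort scores

svr = SVR(kernel="rbf", C=10, gamma=0.1, epsilon=0.2)
# Greedy forward selection of 17 features, judged by cross-validated MAE
selector = SequentialFeatureSelector(svr, n_features_to_select=17,
                                     scoring="neg_mean_absolute_error", cv=5)
selector.fit(X, y)
X_selected = selector.transform(X)
print("kept feature indices:", np.where(selector.get_support())[0])
print("CV MAE with selected features:",
      -cross_val_score(svr, X_selected, y, cv=5,
                       scoring="neg_mean_absolute_error").mean())
```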


Discussion: evaluation metrics
DeltaAvg → suitable for the ranking task:
automatic and deterministic (and therefore consistent)
extrinsically interpretable, e.g.: average quality in [1-5] = 2.5; quality of top 25% = 3.1; Delta [1-5] = 0.6
versatile: the valuation function V can change
high correlation with Spearman, but less strict
MAE, RMSE → a difficult task; values remain stubbornly high
Regression vs ranking: most submissions used regression results to infer the ranking; the ranking approach is simpler and directly useful in many applications


Discussion: establishing state-of-the-art performance
The "baseline" is hard to beat: it was the previous state of the art
Metrics, datasets and performance points are now available
Known values for oracle-based upper bounds

Outline
1. Overview
2. Quality Estimation
3. Shared Task
4. Open issues
5. Conclusions


Agreement between translators
Absolute value judgements: difficult to achieve consistency across annotators, even in a highly controlled setup
30% of the initial dataset was discarded because annotators disagreed by more than one category; the remaining annotations had to be scaled


Agreement between translators
en-pt subtitles of TV series: 3 non-professional annotators, 1-4 scores
351 cases (41%): full agreement
445 cases (52%): partial agreement
54 cases (7%): null agreement

Agreement by score:
Score   Full   Partial/Null
4       59%    41%
3       35%    65%
2       23%    77%
1       50%    50%
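A small counting sketch (my illustration; the agreement definitions are assumed to be: full = all three judges agree, partial = exactly two agree, null = all three differ):

```python
from collections import Counter

def agreement_counts(annotations):
    """Tally full / partial / null agreement for triples of 1-4 scores."""
    counts = Counter()
    for a, b, c in annotations:
        distinct = len({a, b, c})
        if distinct == 1:
            counts["full"] += 1
        elif distinct == 2:
            counts["partial"] += 1
        else:
            counts["null"] += 1
    return counts

# Example: each tuple holds the three judges' scores for one subtitle segment
print(agreement_counts([(4, 4, 4), (3, 2, 3), (1, 2, 4), (2, 2, 2)]))
```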


More objective ways of generating absolute scores
TIME: varies considerably across translators (as expected), e.g. in seconds per word
Can we normalise this variation? (one possible normalisation is sketched below)
Dedicated QE systems?
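One candidate normalisation, offered only as an illustration of the question above: z-score each translator's seconds-per-word against that translator's own mean and standard deviation, so that times become comparable across post-editors.

```python
from statistics import mean, stdev

def normalise_times(times_by_translator):
    """Z-score seconds-per-word within each translator to remove speed differences."""
    normalised = {}
    for translator, times in times_by_translator.items():
        m, s = mean(times), stdev(times)
        normalised[translator] = [(t - m) / s for t in times]
    return normalised

# Example: seconds per word for two (hypothetical) post-editors
times = {"T1": [2.0, 3.5, 1.8, 4.0], "T2": [5.0, 7.5, 4.2, 9.0]}
print(normalise_times(times))
```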


More objective ways of generating absolute scores
HTER: edit distance between the MT output and its minimally post-edited version

HTER = \frac{\text{\# edits}}{\text{\# words in the post-edited version}}

Edits: substitutions, deletions, insertions, shifts
Analysis by Maarit Koponen (WMT-12) on post-edited translations with HTER and 1-5 scores: a number of cases where translations with low HTER (few edits) were assigned low quality scores (high post-editing effort), and vice versa
Certain edits seem to require more cognitive effort than others, which is not captured by HTER
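For reference, a simplified sketch of the HTER computation (my approximation: word-level Levenshtein edits only; real HTER also allows block shifts and is usually computed with a TER tool such as tercom):

```python
def approx_hter(mt_output, post_edited):
    """Approximate HTER: word-level Levenshtein edits (no shifts) over post-edited length."""
    hyp, ref = mt_output.split(), post_edited.split()
    # Standard dynamic-programming edit distance over words
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(hyp)][len(ref)] / len(ref)

# Gives 0.4 here; true TER/HTER would count the reordering as one shift (0.2)
print(approx_hter("the house blue is big", "the blue house is big"))
```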


More objective ways of generating absolute scores
Keystrokes: different PE strategies; data from 8 translators (joint work with Maarit Koponen and Wilker Aziz)


Use of relative scores
Ranking of translations: suitable if the final application is to compare alternative translations of the same source sentence, e.g.:
N-best list re-ranking
System combination
MT system evaluation


What is the best metric to estimate PE effort?
Effort scores / HTER seem to lack a "cognitive load" component
Time varies too much across post-editors
Keystrokes seem to capture PE strategies, but do not correlate well with PE effort
Eye-tracking data can be useful, but are not always feasible to collect


How to use estimated PE effort scores?
Should (supposedly) bad-quality translations be filtered out, or shown to translators (with different scores/colour codes, as in TMs)? Trade-off: time wasted reading scores and poor translations vs "gisting" information wasted.
How to define a threshold on the estimated translation quality to decide what should be filtered out? Translator dependent; task dependent (SDL).
Do translators prefer detailed estimates (sub-sentence level) or an overall estimate for the complete sentence? Too much information vs hard-to-interpret scores.

Outline
1. Overview
2. Quality Estimation
3. Shared Task
4. Open issues
5. Conclusions


Conclusions
It is possible to estimate at least certain aspects of PE effort
PE effort estimates can be used in real applications:
Ranking translations: filtering out bad-quality translations
Selecting translations from multiple MT systems
A number of open issues remain to be investigated...
My vision: sub-sentence level QE (error detection), highlighting errors but also giving an overall estimate for the sentence

Journal of MT - Special issue
15-06-12 - 1st CFP
15-08-12 - 2nd CFP
15-09-12 - submission deadline
15-10-12 - reviews due
End of December 2012 - camera-ready due (tentative)

WMT-12 QE Shared Task: all feature sets available
