Results of the WMT16 Tuning Shared Task
Bushra Jawaid, Amir Kamran, Miloš Stojanović, Ondřej Bojar
ILLC, University of Amsterdam · MFF ÚFAL, Charles University in Prague
WMT16, Aug 11, 2016
Overview
• Summary of the Tuning Task
• Updates in the 2016 edition
• Results
Tuning Task
Tuning Task
[Diagram: the model combines several components — Lexical Choice, Fluency, Adequacy, Length, … — each with a weight λ = ?]
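The diagram boils down to a weighted combination of feature functions, and tuning is the search for the λ values. As a minimal sketch of that idea (the feature names and numbers below are invented for illustration, not taken from the task):

```python
# Minimal sketch of the linear model whose weights (lambdas) tuning adjusts:
#   score(candidate) = sum_i  lambda_i * h_i(candidate)
# Feature names and values are illustrative only.

def score(features, weights):
    """Combine feature values h_i with their weights lambda_i."""
    return sum(weights[name] * value for name, value in features.items())

weights = {"lexical_choice": 0.3, "fluency": 0.5, "adequacy": 0.4, "length": -0.1}

candidate_a = {"lexical_choice": -2.1, "fluency": -4.0, "adequacy": -1.8, "length": 12}
candidate_b = {"lexical_choice": -1.7, "fluency": -5.2, "adequacy": -2.0, "length": 10}

# The decoder prefers the higher-scoring candidate; tuning looks for the
# lambdas under which these preferences agree with a metric such as BLEU.
best = max([candidate_a, candidate_b], key=lambda c: score(c, weights))
print(best)
```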
Tuning Task
So many things to choose in tuning:
• Algorithm
• Metric
• Features
• Data
• …
This task is organized to explore these tuning options in a controlled setting.
System for Tuning
• Moses phrase-based models trained for both English-Czech and Czech-English.
• This year we used a larger dataset to train the models and aligned the data using fast-align.
• In the constrained version, 2.5K sentence pairs were available for tuning.
• The constrained version allowed only dense features.
• Any tuning algorithm or metric was allowed (even manually setting the weights); see the toy sketch below.
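Since any tuning algorithm or metric was allowed, the following self-contained toy sketch shows what metric-driven weight search means in principle. All data and the "metric" values are synthetic, and naive random search stands in for a real optimizer such as MERT or KB-MIRA; it only illustrates the idea of picking the weights that maximize a dev-set metric over fixed n-best lists.

```python
import random

# Toy illustration of metric-driven weight search over fixed n-best lists.
# All data below is synthetic; a real setup would decode with Moses and
# score with BLEU, but the principle is the same: choose the lambdas that
# make the model's 1-best choices score highest on the dev set.

# Each n-best entry: (feature vector h, metric score of that candidate).
nbest_lists = [
    [((-2.1, -4.0, 12), 0.31), ((-1.7, -5.2, 10), 0.42), ((-3.0, -3.5, 11), 0.18)],
    [((-1.2, -2.9, 9), 0.55), ((-2.4, -3.1, 10), 0.40), ((-1.9, -4.4, 8), 0.27)],
]

def model_score(features, weights):
    return sum(w * h for w, h in zip(weights, features))

def dev_metric(weights):
    """Average metric score of the candidates the model would pick."""
    picks = [max(nbest, key=lambda c: model_score(c[0], weights)) for nbest in nbest_lists]
    return sum(score for _, score in picks) / len(picks)

# Naive random search over the weights -- a stand-in for MERT / KB-MIRA.
random.seed(0)
best_weights, best_metric = None, -1.0
for _ in range(1000):
    w = tuple(random.uniform(-1.0, 1.0) for _ in range(3))
    m = dev_metric(w)
    if m > best_metric:
        best_weights, best_metric = w, m

print("best weights:", best_weights, "dev metric:", best_metric)
```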
Data used for training

            Source                                     Sentences     Tokens (cs / en)   Types (cs / en)
LM Corpora  Europarl v7, News Commentary v11,          54M / 206M    900M / 4409M       2.1M / 3.2M
            News Crawl (2007-15), News Discussion v1
TM Corpora  CzEng 1.6pre for WMT16                     44M           501M / 20.8M       1.8M / 1.2M
Dev Set     newstest2015                               2656          51K / 60K          19K / 13K
Test Set    newstest2016                               2999          56.9K / 65.3K      15.1K / 8.8K
Data used for training
[Bar charts: Language Model data (en, cs) and Translation Model data — comparison of data sizes (# of sentence pairs), 2015 vs 2016.]
Participants
• From 6 research groups we received 4 submissions for Czech-English and 8 submissions for English-Czech.
• 2 baselines.

System                    Participant
bleu-MIRA, bleu-MERT      Baselines
AFRL                      United States Air Force Research Laboratory
DCU                       Dublin City University
FJFI-PSO                  Czech Technical University in Prague
ILLC-UvA-BEER             ILLC, University of Amsterdam
NRC-MEANT, NRC-NNBLEU     National Research Council Canada
USAAR                     Saarland University
Czech-English Results

System Name        TrueSkill Score    BLEU
BLEU-MIRA          0.114              22.73
AFRL               0.095              22.90
NRC-NNBLEU         0.090              23.10
NRC-MEANT          0.073              22.60
ILLC-UvA-BEER      0.032              22.46
BLEU-MERT          0.000              22.51

• Manual evaluation of tuning systems can draw only very few clear division lines.
• KBMIRA turns out to be consistently better than MERT.
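The BLEU column is a corpus-level score on the test set. For reference, a score of this kind can be computed, for instance, with the sacrebleu library; this is just one common way to obtain such numbers and not necessarily the exact scorer used for the task.

```python
# One way to compute a corpus-level BLEU score like those in the table,
# using the sacrebleu library (the task's own scoring setup may differ).
import sacrebleu

hypotheses = ["the cat sat on the mat .", "there is a dog in the garden ."]
references = [["the cat sat on the mat .", "a dog is in the garden ."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # BLEU on the 0-100 scale, as reported in the results tables
```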
English-Czech Results

System Name        TrueSkill Score    BLEU
BLEU-MIRA           0.160             15.12
ILLC-UvA-BEER       0.152             14.69
BLEU-MERT           0.151             14.93
AFRL2               0.139             14.84
AFRL1               0.136             15.02
DCU                 0.134             14.34
FJFI-PSO            0.127             14.68
USAAR-HMM-MERT     -0.433              7.95
USAAR-HMM-MIRA     -1.133              0.82
USAAR-HMM          -1.327              0.20
Comparison with the main translation task
[Plots comparing the tuning task systems with the main translation task systems, 2015 vs 2016.]
Conclusion
• The task was much larger this year.
• The task attracted good participation, like last year.
• The quality of most submitted systems is hard to distinguish manually.
• With large models, the few parameters are most likely not powerful enough (and sadly nobody tried discriminative features).
• The results confirm that KBMIRA with the standard features optimized towards BLEU should be preferred over MERT.