Human Ranking of Machine Translation
Matt Post
Johns Hopkins University
University of Pennsylvania
April 9, 2015
Some slides and ideas borrowed from Adam Lopez (Edinburgh)
Review
• In translation, human evaluations are what matter
  – but they are expensive to run
  – this holds up science!
• The solution is automatic metrics
  – fast, cheap, (usually) easy to compute
  – deterministic
Review
• Automatic metrics produce a ranking
• They are evaluated using correlation statistics against human judgments
[Figure: system outputs (System A–D) are ranked both by metrics (e.g., BLEU) and by humans, and the two rankings are compared]
Review
• The human judgments are the “gold standard”
• Questions:
  1. How do we get this gold standard?
  2. How do we know it’s correct?
Today
• How we produce the gold-standard ranking
• How we know it’s correct
At the end of this lecture…
• You should understand
  – how to rank with incomplete data
  – how to evaluate truth claims in science
• You might come away with
  – a desire to submit your metric to the WMT metrics task (deadline: May 25, 2015)
  – a desire to buy an Xbox
  – a preference for simplicity
Producing a ranking
• Then, we take this data and produce a ranking
• Outline of the rest of the talk:
  – Human ranking methods
  – Model selection
  – Clustering
Goal
[Figure: a set of systems {A, B, C, D, E, F, G} is mapped to a ranking: 1. system C, 2. system D, 3. system A, 4. system B, 5. system G, 6. system F, 7. system E]
Slide from Adam Lopez
Goal
• Produce a ranking of systems
• There are many ways to do this:
  – Reading comprehension tests
  – Time spent on human post-editing
  – Aggregating sentence-level judgments
• This last one is what is used by the Workshop on Statistical Machine Translation (statmt.org/wmt15)
Inherent problems
• Translation is used for a range of tasks: understanding the past, technical manuals, conversing, information
• What “best” (or “sufficient”) means likely varies by person and situation
Collecting data
• Data: K systems translate an N-sentence document
• We use human judges to compare translations of an input sentence and select whether the first is better, worse, or equivalent to the second
• We use a large pool of judges
Collecting data
• A judge ranks five translations, e.g. C > A > B > D > E
• That single ranking implies ten pairwise judgments:
  C > A, C > B, C > D, C > E, A > B, A > D, A > E, B > D, B > E, D > E
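A minimal sketch of how a single five-way ranking expands into pairwise judgments; the ranking is the example from the slide, and the function name is my own (ties within a ranking are not handled here):

  from itertools import combinations

  def ranking_to_pairs(ranked_systems):
      """Expand a best-to-worst ranking into all implied pairwise judgments."""
      # Every pair is emitted with the better-ranked system listed first.
      return list(combinations(ranked_systems, 2))

  pairs = ranking_to_pairs(["C", "A", "B", "D", "E"])
  print(len(pairs))  # 10
  print(pairs)       # [('C', 'A'), ('C', 'B'), ..., ('D', 'E')]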
Dataset
• This yields ternary-valued pairwise judgments of the following form:
  judge “dredd” ranked onlineB > JHU on sent #74
  judge “judy” ranked uedin > UU on sent #1734
  judge “reinhold” ranked JHU > UU on sent #1
  judge “jay” ranked onlineA = uedin on sent #953
  …
The sample space
• How much data is there to collect?
  (number of ways to pick two systems) × (number of sentences) × (number of judges)
  – For 10 systems there are 135k comparisons
  – For 20 systems, 570k
  – More with multiple judges
• Too much to collect, and also wasteful; instead we sample
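The counts above follow directly from C(K, 2) × N. The test-set size of 3,000 sentences below is an assumption of mine that happens to reproduce the figures on the slide:

  from math import comb

  num_sentences = 3000  # assumed test-set size
  for k in (10, 20):
      # number of system pairs times number of sentences (one judging pass)
      print(k, comb(k, 2) * num_sentences)  # 10 -> 135000, 20 -> 570000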
Design of the WMT Evaluation (2008-2011)
[Figure: WMT raw data. Five sampled outputs (drawn from systems A–G plus the reference) are ranked by a judge, e.g. 1. reference, 2. system C, 3. system A ≡ system F, 4. system D, and the ranking is expanded into the full set of pairwise rankings (reference ≻ system A, reference ≻ system C, …, system A ≡ system F)]
• While (evaluation period is not over):
  ➡ Sample an input sentence.
  ➡ Sample five translations of it from Systems ∪ {Reference}.
  ➡ Sample a judge.
  ➡ Receive the set of pairwise judgments from the judge.
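A rough sketch of that collection loop. The helper callables (evaluation_over, ask_judge_to_rank) and the data layout are placeholders of my own, not WMT's actual tooling:

  import random
  from itertools import combinations

  def collect_judgments(sentences, systems, judges, evaluation_over, ask_judge_to_rank):
      """Sketch of a WMT-style collection loop (hypothetical helper names)."""
      judgments = []
      candidates = systems + ["reference"]
      while not evaluation_over():
          sent = random.choice(sentences)        # sample an input sentence
          chosen = random.sample(candidates, 5)  # sample five translation sources
          judge = random.choice(judges)          # sample a judge
          # ask_judge_to_rank returns the five outputs ordered best to worst
          ranking = ask_judge_to_rank(judge, sent, chosen)
          # one 5-way ranking yields C(5, 2) = 10 pairwise judgments
          for better, worse in combinations(ranking, 2):
              judgments.append((judge, sent, better, worse))
      return judgments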
How much data do we collect?
• Only a sample of the tens of millions of possible judgments
Producing a ranking
• Then, we take this data and produce a ranking
• Human ranking methods:
  – Expected wins and variants
  – Bayesian model (relative ability)
  – TrueSkill™
Expected wins (1)
• This is the most appealing and intuitive approach
• Define wins(A), ties(A), and loses(A) as the number of times system A won, tied, or lost
• Score each system as follows:
  score(A) = (wins(A) + ties(A)) / (wins(A) + ties(A) + loses(A))
• Now sort by scores
Expected wins (2)
• Do you see any problems with this?
  score(A) = (wins(A) + ties(A)) / (wins(A) + ties(A) + loses(A))
• Look at the judgments:
  judge “dredd” ranked onlineB > JHU on sent #74 (one winner, one loser)
  judge “judy” ranked uedin > UU on sent #1734 (one winner, one loser)
  judge “reinhold” ranked JHU > UU on sent #1 (one winner, one loser)
  judge “jay” ranked onlineA = uedin on sent #953 (two winners, no losers)
Expected wins (3)
• A system is rewarded as much for a tie as for a win
  – …and most systems are variations of the same underlying architecture and data
• New formula: throw away ties
  score(A) = wins(A) / (wins(A) + loses(A))
• Wait: is this better?
  – A Grain of Salt for the WMT Manual Evaluation (Bojar et al., 2012)
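A small sketch of both scoring variants over judgments stored as (system1, system2, relation) triples; this data layout and the function name are my own simplification of the examples above:

  def expected_wins(judgments, system, count_ties=True):
      """judgments: list of (sys1, sys2, relation) with relation in {'>', '<', '='}."""
      wins = ties = loses = 0
      for a, b, rel in judgments:
          if system not in (a, b):
              continue
          if rel == "=":
              ties += 1
          elif (rel == ">" and a == system) or (rel == "<" and b == system):
              wins += 1
          else:
              loses += 1
      numer = wins + ties if count_ties else wins
      denom = wins + ties + loses if count_ties else wins + loses
      return numer / denom if denom else 0.0

  data = [("onlineB", "JHU", ">"), ("uedin", "UU", ">"),
          ("JHU", "UU", ">"), ("onlineA", "uedin", "=")]
  print(expected_wins(data, "onlineA"))                    # 1.0: the tie counts like a win
  print(expected_wins(data, "onlineA", count_ties=False))  # 0.0: no decisive judgments remain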
Expected wins (4)
• Problem 2: the luck of the draw
  – aggregation over different sets of inputs, different competitors, different judges
• Consider a case where in reality B > C, but
  – B gets compared to a bunch of good systems
  – C gets compared to a bunch of bad systems
  – we could get score(C) > score(B)
Expected wins (5)
• This can happen!
  – Systems include a human reference translation
  – They also include really good unconstrained commercial systems
Expected wins (6)
• Even more problems:
  – remember that the score for a system is the percentage of time it won in comparisons across all systems
  – what if score(B) > score(C), but in direct comparisons, C was almost always better than B?
  – this leads to cycles in the ranking
• Is this a problem?
[Figure: an example WMT ranking of systems: onlineB, rwth-combo, cmu-hyposel-combo, cambridge, lium, dcu-combo, cmu-heafield-combo, upv-combo, nrc, uedin, jhu, limsi, jhu-combo, lium-combo, rali, lig, bbn-combo, rwth, cmu-statxfer, onlineA, huicong, dfki, cu-zeman, geneva]
Summary
• List of problems:
  – Including ties rewards similar systems, while excluding them penalizes those same systems
  – Comparisons do not factor in the difficulty of the “match” (i.e., losing to the best system should count less)
  – There are cycles in the judgments
• We made intuitive changes, but how do we know whether they’re correct?
Relative ability model
Models of Translation Competitions (Hopkins & May, 2013)
• In Expected Wins, we estimate a probability of each system winning a competition
• We now move to a setup that models the relative ability of a system
  – Assume each system S_i has an inherent ability µ_i
  – Its translations are then represented by draws from a Gaussian distribution centered at µ_i
Relative ability
[Figure: each system S_i is drawn as a Gaussian centered at its ability µ_i; systems further to the right are better]
Relative ability
• A “competition” proceeds as follows:
  – Choose two systems, S_i and S_j, from the set {S}
  – Sample a “translation” from each system’s distribution:
    q_i ~ N(µ_i, σ²)
    q_j ~ N(µ_j, σ²)
  – Compare their values to determine who won
• Define d as a “decision radius”
  – Record a tie if |q_i − q_j| < d
  – Else record a win or loss
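A quick simulation of this generative story; the abilities, sigma, and decision radius below are illustrative values of my own, not ones from the paper:

  import random
  from collections import Counter

  def compete(mu_i, mu_j, sigma=0.5, d=0.3):
      """Simulate one pairwise 'competition' under the relative-ability model."""
      q_i = random.gauss(mu_i, sigma)   # sampled translation quality for S_i
      q_j = random.gauss(mu_j, sigma)   # sampled translation quality for S_j
      if abs(q_i - q_j) < d:
          return "="                    # within the decision radius: a tie
      return ">" if q_i > q_j else "<"

  # A stronger system (mu = 1.0) against a weaker one (mu = 0.0), many times
  print(Counter(compete(1.0, 0.0) for _ in range(10000)))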
Visually
[Figure: three cases on the quality axis: a tie when q_i and q_j fall within the decision radius d of each other; S_i wins when q_i exceeds q_j by more than d; S_j wins when q_j exceeds q_i by more than d]
Observations
• We can compute exact probabilities for all these events (difference of Gaussians)
• On average, a system with a higher “ability” will have higher draws, and will win
• Systems with close µs will tie more often
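Those exact probabilities follow because q_i − q_j is itself Gaussian, with mean µ_i − µ_j and variance 2σ². A sketch using SciPy's normal CDF; the numbers are again illustrative:

  from scipy.stats import norm

  def outcome_probs(mu_i, mu_j, sigma=0.5, d=0.3):
      """P(S_i wins), P(tie), P(S_j wins) under the difference-of-Gaussians model."""
      diff = norm(loc=mu_i - mu_j, scale=(2 * sigma**2) ** 0.5)  # distribution of q_i - q_j
      p_j_wins = diff.cdf(-d)                 # q_i - q_j < -d
      p_tie = diff.cdf(d) - diff.cdf(-d)      # |q_i - q_j| < d
      p_i_wins = 1.0 - diff.cdf(d)            # q_i - q_j > d
      return p_i_wins, p_tie, p_j_wins

  print(outcome_probs(1.0, 0.0))   # clearly favors S_i
  print(outcome_probs(0.5, 0.45))  # close abilities: many more ties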
Learning the model
• If we knew the system means, we could rank them
• We assume the data was generated by the process above; we need to infer values for the hidden parameters:
  – System means {µ}
  – Sampled translation qualities {q}
• We’ll use Gibbs sampling
  – Uses simple random steps to learn a complicated joint distribution
  – Converges under certain conditions
Gibbs sampling
• The judgments:
  judge “dredd” ranked onlineB > JHU on sent #74
  judge “judy” ranked uedin > UU on sent #1734
  judge “reinhold” ranked JHU > UU on sent #1
  judge “jay” ranked onlineA = uedin on sent #953
• Represent the data as tuples (S_i, S_j, π, q_i, q_j), where S_i, S_j, and π are known and q_i, q_j are unknown:
  (onlineB, JHU, >, ?, ?)
  (uedin, UU, >, ?, ?)
  (JHU, UU, >, ?, ?)
  (onlineA, uedin, =, ?, ?)
• Iterate back and forth between guessing the {q}s and the {µ}s
Iterative process
  collect all the judgments
  until convergence:
    # resample translation qualities
    for each judgment:
      q_i ~ N(µ_i, σ²)
      q_j ~ N(µ_j, σ²)
      # (adjust samples to respect judgment π)
    # resample the system means
    for each system:
      µ_i = mean({q_i})
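A compact sketch of that loop in Python. The "adjust samples to respect π" step is handled here with simple rejection sampling, which is only one way to do it (Hopkins & May use a more careful scheme); the re-estimation of the means, the constants, and the names are likewise my own simplification:

  import random
  from collections import defaultdict

  def gibbs_rank(judgments, sigma=0.5, d=0.3, iters=200):
      """judgments: list of (S_i, S_j, pi) with pi in {'>', '<', '='}."""
      systems = {s for a, b, _ in judgments for s in (a, b)}
      mu = {s: 0.0 for s in systems}          # start all abilities at zero
      for _ in range(iters):
          draws = defaultdict(list)
          # resample translation qualities, forcing them to agree with each judgment
          for a, b, pi in judgments:
              while True:
                  q_a = random.gauss(mu[a], sigma)
                  q_b = random.gauss(mu[b], sigma)
                  if pi == "=" and abs(q_a - q_b) < d:
                      break
                  if pi == ">" and q_a - q_b > d:
                      break
                  if pi == "<" and q_b - q_a > d:
                      break
              draws[a].append(q_a)
              draws[b].append(q_b)
          # re-estimate each system mean from its accepted draws
          for s in systems:
              mu[s] = sum(draws[s]) / len(draws[s])
      return sorted(mu, key=mu.get, reverse=True)   # best ability first

  data = [("onlineB", "JHU", ">"), ("uedin", "UU", ">"),
          ("JHU", "UU", ">"), ("onlineA", "uedin", "=")]
  print(gibbs_rank(data))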