Five Shades of Noise: Analyzing Machine Translation Errors in User-Generated Text
Marlies van der Wees, Arianna Bisazza, Christof Monz

Statistical Machine Translation
News sentence: 印度金融中心孟买亦受到波及。 (mumbai, india's financial center, was also affected.)
SMT output: india's financial center mumbai also affected. 😁

Statistical Machine Translation
SMS sentence: 你路上慢点 (be careful on your way / take your time)
SMT output: you are on the road to slow points 😪

SMT for user-generated text is often bad
✤ Reference: and if i go out, i will stop by your place
✦ SMT output: and if i went.
✤ Reference: i could not bring it to you
✦ SMT output: into its enemies.
✤ Reference: i've never seen a pig there
✦ SMT output: i am seen pig there.
✤ Reference: you're too delighted to be homesick
✦ SMT output: anytime you

Towards improving SMT quality for UG
✤ To target specific error types, we need to know why mistakes are made:
✦ in UG versus formal text
• contrast UG with newswire
✦ in different types of UG
• five shades of noise: weblogs, comments, speech (CTS), SMS, and chat messages
✦ in different language pairs
• Arabic-English & Chinese-English

Analyzing SMT errors in UG text
✤ What translation choices were made by the SMT system?
✤ What translation choices could have been made by the SMT system?
✤ Why did the SMT system make the choices that it made?

Word Alignment Driven Evaluation: approach *
✤ For each word alignment link in the test set (e.g. 你 / your) that is translated wrongly, determine:
✦ source phrase not in phrase table: SEEN error
✦ target phrase not in phrase table: SENSE error
✦ source and target phrases both in table, but other translation preferred: SCORE error
✤ Example phrase table (source phrase | target phrase | probability):
✦ 路上 | on the road | 0.4
✦ 路上 | on the way | 0.3
✦ 路上 | on your way | 0.2
✦ 点 | dot | 0.1
✦ 点 | point | 0.4
* Approach adopted from Irvine et al., Measuring Machine Translation Errors in New Domains, 2013
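The three error categories on this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `classify_link`, the dictionary layout of the phrase table, and the source-side keys of `toy_table` are assumptions made for illustration, and "target phrase not in phrase table" is read here as "the reference target is not listed for that source phrase".

```python
from typing import Dict

# Hypothetical phrase table layout: source phrase -> {target phrase: probability}.
PhraseTable = Dict[str, Dict[str, float]]


def classify_link(source: str, reference: str, output: str,
                  table: PhraseTable) -> str:
    """Label one aligned source/reference pair given the system output."""
    if output == reference:
        return "CORRECT"
    if source not in table:
        # The model has never seen this source phrase.
        return "SEEN"
    if reference not in table[source]:
        # Source is known, but not with the reference translation.
        return "SENSE"
    # Both are in the table, yet another candidate was preferred.
    return "SCORE"


# Toy table using the probabilities from the slide; the source keys are assumed.
toy_table: PhraseTable = {
    "路上": {"on the road": 0.4, "on the way": 0.3, "on your way": 0.2},
    "点": {"dot": 0.1, "point": 0.4},
}

if __name__ == "__main__":
    print(classify_link("路上", "on your way", "on the road", toy_table))
    print(classify_link("路上", "take your time", "on the road", toy_table))
    print(classify_link("慢", "slow", "slowly", toy_table))
```

The checks are ordered from the most basic failure (unknown source) to the most specific (known pair that lost on score), so each wrongly translated link receives exactly one label.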

Word Alignment Driven Evaluation: results
[Figure: Word-level error statistics for Arabic-English benchmarks. Chart of relative frequency (axis from 0 to 60) for the categories Correct, Seen, Sense, and Score, across the news benchmarks News 1 and News 2 and the UG benchmarks Weblogs, Comments, CTS, Chat, and SMS.]

Word Alignment Driven Evaluation: findings
✤ SMT errors for UG text differ
✦ from SMT errors for news
• many SEEN and SENSE errors for UG
✦ between different types of UG
• SMS and chat messages are most affected
✦ between different language pairs
• differences in Chinese-English are more subtle than in Arabic-English

Analyzing SMT errors in UG: what we learned
✤ Common errors in UG are due to:
✦ misspellings or Arabic dialectal forms
✦ formal lexical choices
✦ idioms translated word by word
✦ dropped pronouns in Chinese
✤ UG suffers from low model coverage
✦ generate new translation candidates
✦ normalize existing translation candidates

More Error Analysis?
✤ Visit the poster for:
✦ Model coverage analysis
✦ Qualitative Examples
✤ Read the paper for:
✦ Phrase-length analysis
✦ Detailed explanation and discussions

Poster: Five Shades of Noise: Analyzing Machine Translation Errors in User-Generated Text
Marlies van der Wees, Arianna Bisazza, Christof Monz
Informatics Institute, University of Amsterdam
ACL 2015 Workshop on Noisy User-generated Text (WNUT), Beijing, China
m.e.vanderwees@uva.nl

✤ Motivation
✦ Statistical machine translation (SMT) of user-generated (UG) text
• input SMS message: 你路上慢点 (= be careful on your way / take your time)
• output translation: you are on the road to slow points
✦ Lower translation quality for UG than for news

✤ Five Shades of Noise
✦ Two language pairs: Arabic-English & Chinese-English
✦ Five UG sets: weblogs, comments, speech, SMS, chat
✦ Two news sets: different sources, to contrast with UG

✤ Understanding SMT errors in UG text
✦ why does SMT make the errors that it makes on UG?
• low model coverage?
• poor scoring of translation options?
✦ what errors are observed for various types of UG?

✤ Quantitative Analysis: SMT Model Coverage (Arabic-English versus Chinese-English results)
✦ Approach: for each phrase pair in the test set (e.g. the source phrase paired with "take your time"), determine:
• source phrase covered in the SMT models
• target phrase covered in the SMT models
• phrase pair covered in the SMT models
• all computed for various phrase lengths
✦ Findings
• coverage of source phrases and phrase pairs is lower for UG than for news
• coverage of target phrases is more balanced among test sets
• coverage dramatically decreases for longer phrases
• SMS and chat suffer most from low coverage

✤ Qualitative Analysis: Word Alignment Driven Evaluation *
✦ Correct
✦ SEEN error: unknown source
✦ SENSE error: unknown target
✦ SCORE error: suboptimal scoring
✦ Example 1
• Input: qAlt E$An AlEyAl mtzEl$
• Ref: so the kids do not feel upset
• Output: said because of the sons
✦ Example 2
• Input: 上网了 , 你路上慢点
• Ref: i 'm online . take your time
• Output: on the internet , and you are on the road to slow points
✦ Error annotations
• missing pronoun, not inferred by SMT system
• idiom translated in small chunks, losing its meaning as a phrase
• lexical choices that are too formal, not reflecting colloquial language
• out-of-vocabulary (OOV) due to dialect or misspellings
* Irvine et al., Measuring Machine Translation Errors in New Domains, 2013

✤ Conclusions
✦ SMT errors for UG text differ
• from SMT errors for news
• between different types of UG
• between different language pairs
✦ promising solutions include
• improving scoring for news
• increasing phrase pair coverage for UG
• increasing source phrase coverage for SMS & chat

This research was funded in part by the Netherlands Organization for Scientific Research (NWO) under project number 639.022.213.
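The model coverage analysis described on the poster (is the source phrase, the target phrase, or the whole phrase pair covered by the SMT models, computed per phrase length) might be sketched as below. This is only an illustration under stated assumptions: `coverage_by_length`, the set-of-tuples model representation, grouping by source-side token count, and the placeholder tokens `s1`, `s2`, `s3` are all hypothetical and do not come from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

PhrasePair = Tuple[str, str]  # (source phrase, target phrase), space-tokenized


def coverage_by_length(test_pairs: Iterable[PhrasePair],
                       model_pairs: Set[PhrasePair]) -> Dict[int, Dict[str, float]]:
    """Return source, target, and pair coverage rates per source phrase length."""
    model_sources = {src for src, _ in model_pairs}
    model_targets = {tgt for _, tgt in model_pairs}
    counts = defaultdict(lambda: {"total": 0, "source": 0, "target": 0, "pair": 0})

    for src, tgt in test_pairs:
        bucket = counts[len(src.split())]  # assumption: length = source tokens
        bucket["total"] += 1
        bucket["source"] += src in model_sources
        bucket["target"] += tgt in model_targets
        bucket["pair"] += (src, tgt) in model_pairs

    return {
        length: {key: c[key] / c["total"] for key in ("source", "target", "pair")}
        for length, c in counts.items()
    }


# Toy data with placeholder source tokens (hypothetical, not from the paper).
model_pairs = {("s1", "on the road"), ("s1", "on the way"), ("s2", "point")}
test_pairs = [
    ("s1", "on the way"),
    ("s1", "point"),
    ("s3", "slow"),
    ("s1 s2", "on the road"),
]

if __name__ == "__main__":
    for length, rates in sorted(coverage_by_length(test_pairs, model_pairs).items()):
        print(length, rates)
```

Keeping the three rates separate mirrors the poster's distinction between unknown source phrases, unknown target phrases, and known phrases that were never observed together.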
