
The Statistical Machine Translation System of the University of Edinburgh
Philipp Koehn (pkoehn@inf.ed.ac.uk)
School of Informatics, University of Edinburgh


1. Outline
   – Overview: SMT at Edinburgh
   – Baseline System
   – Improvements
   – Evaluation
   – Related Recent Work in SMT

2. People Working on SMT at Edinburgh
   – Philipp Koehn (lecturer)
   – Miles Osborne (lecturer)
   – Amittai Axelrod (graduate student)
   – Alexandra Birch Mayne (graduate student)
   – Chris Callison-Burch (graduate student, Linear-B)
   – David Talbot (graduate student)
   – Michael White (researcher)

   MT Eval 2005 Effort
   – 3-month effort building on previous work at MIT: improved system performance, introduced other researchers to the system
   – focus on Arabic–English: deal with more data, various feature improvements
   – it is never finished: did not train on new data, some changes not completed in time

3. Phrase-Based Translation
   Example: "Morgen fliege ich nach Kanada zur Konferenz" → "Tomorrow I will fly to the conference in Canada"
   The phrase model is similar to other groups' models:
   – word-align the corpus using GIZA++ and Och's refined method
   – collect phrase pairs consistent with the word alignment
   – combine model components in a log-linear model
   – tune parameters by minimum error rate training
   – decode with Pharaoh (http://www.isi.edu/licensed-sw/pharaoh/)
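The phrase-pair collection step above can be sketched as follows. This is a minimal version of the standard consistency heuristic (a pair is kept only if no alignment link connects a word inside it to a word outside it); the function name and the toy alignment are illustrative, not taken from the system.

```python
def extract_phrases(alignment, src_len, tgt_len, max_len=7):
    """Collect phrase pairs consistent with a word alignment.

    `alignment` is a set of (src_idx, tgt_idx) links. Spans are
    inclusive index pairs. A sketch of the standard heuristic, not
    the exact extraction code used in the Edinburgh system.
    """
    phrases = set()
    for s1 in range(src_len):
        for s2 in range(s1, min(s1 + max_len, src_len)):
            # target positions linked to the source span [s1, s2]
            tgt = [t for (s, t) in alignment if s1 <= s <= s2]
            if not tgt:
                continue
            t1, t2 = min(tgt), max(tgt)
            if t2 - t1 + 1 > max_len:
                continue
            # reject if a target word in [t1, t2] links outside [s1, s2]
            if any(t1 <= t <= t2 and not (s1 <= s <= s2)
                   for (s, t) in alignment):
                continue
            phrases.add(((s1, s2), (t1, t2)))
    return phrases
```

For a three-word pair aligned as {(0,0), (1,2), (2,1)} (the last two words swap order), the swapped words can only be extracted together or individually, never with just one neighbor.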

4. System Components
   – reordering model: linear reordering cost, max. 4-word movement
   – language model: trigram LM trained using the SRILM toolkit
   – phrase translation models p(f|e) and p(e|f)
   – word translation models p(f|e) and p(e|f)
   – word penalty
   – phrase penalty

   Improvements
   – more training data (+2% BLEU)
   – bigger language model (+2% BLEU)
   – minor model improvements (+2% BLEU)
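These components are combined log-linearly: each hypothesis gets a weighted sum of feature scores, with the weights tuned by minimum error rate training. A minimal sketch, where the feature names and values are invented for illustration and do not match the system's internal names:

```python
def loglinear_score(features, weights):
    """Score a hypothesis as sum_i lambda_i * h_i(e, f).

    `features` maps feature name -> score (log-probabilities for the
    model components, raw counts for the penalties); `weights` maps
    the same names to their tuned lambdas. Illustrative sketch only.
    """
    return sum(weights[name] * h for name, h in features.items())


# Hypothetical feature values for one hypothesis:
feats = {"lm": -4.2, "tm_fe": -3.1, "tm_ef": -2.8, "word_penalty": 6.0}
lambdas = {"lm": 0.5, "tm_fe": 0.2, "tm_ef": 0.2, "word_penalty": -0.1}
score = loglinear_score(feats, lambdas)
```

The decoder compares hypotheses by this single score, which is why tuning one weight (e.g. the word penalty, revisited on slide 12) shifts the whole system's behavior.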

5. More Training Data
   Use all of the data (instead of half):
   – maximum sentence length 40 words
   – break the corpus into 2–3 parts
   – run snt2cooc separately, then merge
   – combined GIZA++ run (3–5 days CPU time)
   – plus chunking and splitting

   Chunking
   – break sentences up at commas, semicolons, colons, etc.
   – sentence-align the smaller units
   – 63.9 → 100.3 million words used
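The chunking step amounts to cutting at clause-level punctuation so the shorter units can be sentence-aligned separately. A minimal sketch (the punctuation set and function name are assumptions; the actual tool may differ):

```python
import re

def chunk(sentence):
    """Break a sentence at commas, semicolons, and colons, keeping the
    punctuation with the preceding chunk. Illustrative sketch of the
    chunking idea, not the system's actual preprocessing code."""
    parts = re.split(r'(?<=[,;:])\s+', sentence)
    return [p for p in parts if p]
```

Each returned chunk is then treated as an alignment unit of its own, which is how the usable data grew from 63.9 to 100.3 million words.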

6. Splitting
   Break up longer sentences:
   – minimize the number of crossed word alignments
   – cut sentences in the middle third
   – cut as centrally as possible
   – 100.3 → 130.3 million words used

   Splitting II
   [Figure: alignment grid between source words a–o and target words 1–12]
   Align the words using the lexical t-table with a threshold; eliminate multiply aligned words.

7. Splitting III
   [Figure: alignment grid illustrating a good split point and a bad one (2 crossings)]

   Splitting IV
   [Figure: candidate split points in the middle third, rated by crossed links: 3 crossings, 2 crossings, 1 crossing, 0 crossings]

8. Splitting V
   [Figure: among the lowest-crossing candidates, the most central split point is chosen]

   Bigger Language Model
   – dealing with memory limitations in training
   – dealing with memory limitations in decoding
   – multiple language models
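The split-point search in Splitting I–V can be sketched as: score every cut in the middle third of the source sentence by how many alignment links it would cross, then prefer the most central of the best-scoring cuts. A toy version under those assumptions (function names and tie-breaking details are mine, not the system's):

```python
def count_crossings(alignment, split_s, split_t):
    """Number of alignment links crossed by cutting the source after
    position split_s and the target after position split_t."""
    return sum(1 for (s, t) in alignment
               if (s <= split_s) != (t <= split_t))

def best_split(alignment, src_len, tgt_len):
    """Pick the split point in the middle third of the source sentence
    with the fewest crossed links, preferring the most central one.
    Sketch of the Splitting I-V idea, not the exact implementation."""
    lo, hi = src_len // 3, 2 * src_len // 3
    center = src_len / 2
    candidates = []
    for s in range(lo, hi):
        for t in range(tgt_len - 1):
            candidates.append((count_crossings(alignment, s, t),
                               abs(s + 0.5 - center), s, t))
    # sort key: fewest crossings first, then distance from the center
    return min(candidates)[2:]
```

On a perfectly monotone 6-word alignment, the diagonal cuts cross nothing, so the chosen split is one of the central diagonal points.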

9. Memory Limitations in Training
   A lot of monolingual English text is available:
   – English half of the parallel text: 130 million words
   – English Gigaword corpus: 1.78 billion words
   – the web: 1 trillion words?
   SRILM training keeps all n-grams in memory (2–4 GB limit), so training was practically limited to:
   – 800 million words (training data + part of Gigaword)
   – trigram singletons ignored
   – digits ('0'–'9') replaced by '5'

   Memory Limitations in Decoding
   Is pruning possible?
   – only words that can actually be produced need to be considered
   – the translation model can be cut down to a few (1–2) percent

                                 Unigrams    Bigrams     Trigrams
   Entire LM (trained on 130m)    291,767   4,991,346   7,881,122
   1000 sent.                      13,792   2,850,983   6,540,940
   1000 sent., top 20 transl.       9,860   2,251,111   5,590,783
   10 sent., top 20 transl.           871     127,552     488,694

   High overhead in filtering the LM.
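The filtering in the table above boils down to: collect every output word the translation model could produce for the given input sentences, then keep only LM n-grams built entirely from those words. A minimal sketch of that idea (names are illustrative):

```python
def filter_lm(ngrams, producible_words):
    """Keep only the n-grams whose words can all appear in the output,
    i.e. words the translation model can produce for this input.
    Sketch of the LM-filtering idea behind the table above."""
    vocab = set(producible_words)
    return {ng for ng in ngrams if all(w in vocab for w in ng)}
```

Run per test set (or even per sentence), this is what shrinks the trigram count from 7.9 million to under half a million for 10 sentences, at the cost of the filtering pass itself.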

10. Multiple Language Models
    Pharaoh allows multiple language models.
    Large LM:
    – trained on 800 million words (training data + part of Gigaword)
    – trigram singletons ignored
    – digits ('0'–'9') replaced by '5'
    Specialized LM:
    – trained on 1.1 million words (the news training corpus)
    – all singletons included
    – no special treatment of numbers
    The LM weights are determined by discriminative training.

    Minor Model Improvements
    – dropping unknown words during decoding
    – delete-word feature
    – limited changes to the recapitalizer
    – limited post-editing of the output
    – limited changes to the tokenization of Arabic
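Because the decoder's model is log-linear, using two LMs simply means two weighted log-probability features per n-gram. A toy sketch of that combination, with dictionaries standing in for real LMs and the floor value and weights invented for illustration:

```python
import math

def combined_lm_logprob(ngram, large_lm, specialized_lm, w_large, w_spec):
    """Log-linear combination of two language-model features, as in a
    decoder that supports multiple LMs. The dict lookups and the 1e-10
    floor for unseen n-grams are toy stand-ins; the real weights would
    come from discriminative (minimum error rate) training."""
    return (w_large * math.log(large_lm.get(ngram, 1e-10)) +
            w_spec * math.log(specialized_lm.get(ngram, 1e-10)))
```

The point of the split is that the small in-domain LM can keep its singletons and number forms while the big LM provides coverage; the tuned weights decide how much each is trusted.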

11. Evaluation for Arabic–English
    Improvements for Arabic–English:

    Eval set               ’04 system    ’05 system
    Eval 2002 (partial)    34.4% BLEU    40.4% BLEU
    Eval 2004              34.1% BLEU    34.3% BLEU
    Eval 2005              35.6% BLEU    40.5% BLEU

12. Why so Little Improvement on Eval 2004?
    – model optimized on the first 300 sentences of Eval 2002
    – very short output (length ratio 0.905)
    – the word penalty feature allows tuning of the output length
    [Figure: BLEU (34–38%) against output/reference length ratio (0.90–1.10), marking the tuned and the best length ratios]
    Manual adjustment: 34.3% → 37.7% BLEU

    Evaluation for Chinese–English
    System changes:
    – bigger language model (800 million words)
    – debugged number translator

    Eval set               ’04 system    ’05 system
    Eval 2002 (partial)    26.1% BLEU    27.2% BLEU
    Eval 2004              27.1% BLEU    28.1% BLEU
    Eval 2005              24.4% BLEU    25.1% BLEU
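The length ratio being diagnosed above is just total output words over total reference words, computed corpus-wide; BLEU's brevity penalty punishes ratios below 1, which is why output at ratio 0.905 scores so far below its potential. A minimal sketch (whitespace tokenization is a simplifying assumption):

```python
def length_ratio(outputs, references):
    """Corpus-level output/reference length ratio, the quantity tuned
    via the word penalty on the slide. Uses naive whitespace
    tokenization for illustration."""
    out_words = sum(len(s.split()) for s in outputs)
    ref_words = sum(len(s.split()) for s in references)
    return out_words / ref_words
```

Raising the word-penalty weight pushes this ratio toward 1.0, which is the manual adjustment that took Eval 2004 from 34.3% to 37.7% BLEU.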
