Building a Phrase-Based SMT System
Graham Neubig & Kevin Duh
Nara Institute of Science and Technology (NAIST)
5/10/2012
Phrase-based Statistical Machine Translation (SMT)
● Divide the sentence into phrases, reorder them, and combine:
  Today | I will give | a lecture on | machine translation | .
  → Today | machine translation | a lecture on | I will give | .
  → 今日は、 | 機械翻訳 | の講義 | を行います | 。
  → 今日は、機械翻訳の講義を行います。
● Translation models, reordering models, and language models are all learned statistically from text
This Talk
1) What are the steps required to build a phrase-based machine translation system?
2) What tools implement these steps in Moses* (an open-source statistical MT system)?
3) What are some research problems related to each of these components?
* http://www.statmt.org/moses
Steps in Training a Phrase-based SMT System
● Collecting Data
● Tokenization
● Language Modeling
● Alignment
● Phrase Extraction/Scoring
● Reordering Models
● Decoding
● Evaluation
● Tuning
Collecting Data
Collecting Data
● Sentence-parallel data
  ● Used in: translation model, reordering model
    これはペンです。 ↔ This is a pen.
    昨日は友達と食べた。 ↔ I ate with my friend yesterday.
    象は鼻が長い。 ↔ Elephants' trunks are long.
● Monolingual data (in the target language)
  ● Used in: language model
    This is a pen.
    I ate with my friend yesterday.
    Elephants' trunks are long.
Good Data is
● Big! (translation accuracy grows with LM data size [Brants 2007])
● Clean
● In the same domain as the test data
Collecting Data
● For academic workshops, data is prepared for us! (e.g. IWSLT 2011)

  Name             Type       Words
  TED              Lectures   1.76M
  News Commentary  News       2.52M
  EuroParl         Political  45.7M
  UN               Political  301M
  Giga             Web        576M

● In real systems:
  ● Data from government organizations, newspapers
  ● Crawl the web
  ● Merge several data sources
Research
● Finding bilingual pages [Resnik 03]
● Sentence alignment [Moore 02]
● Crowd-sourcing data creation [Ambati 10]
  ● Mechanical Turk, Duolingo, etc.
Tokenization
Tokenization
● Example: divide Japanese into words
  太郎が花子を訪問した。 → 太郎 が 花子 を 訪問 した 。
● Example: make English lowercase, split off punctuation
  Taro visited Hanako. → taro visited hanako .
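The English side of this step can be sketched in a few lines of Python (lowercasing and splitting punctuation into separate tokens); this is only a rough approximation of what Moses' tokenize.perl does, which handles many more cases such as abbreviations and hyphens:

```python
import re

def tokenize_en(text):
    """Minimal English preprocessing: lowercase the text and put
    spaces around punctuation so it becomes separate tokens."""
    text = text.lower()
    text = re.sub(r"([.,!?;:])", r" \1 ", text)
    return text.split()

print(tokenize_en("Taro visited Hanako."))
# ['taro', 'visited', 'hanako', '.']
```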
Tools for Tokenization
● Most European languages:
  tokenize.perl en < input.en > output.en
  tokenize.perl fr < input.fr > output.fr
● Japanese:
  MeCab: mecab -O wakati < input.ja > output.ja
  KyTea: kytea -notags < input.ja > output.ja
  JUMAN, etc.
● Chinese:
  Stanford Segmenter, LDC, KyTea, etc.
Research
● What is good tokenization for machine translation?
  ● Accuracy? Consistency? [Chang 08]
  ● Matching target-language words? [Sudoh 11]
    太郎 が 花子 を 訪問 した 。 ↔ Taro <ARG1> visited <ARG2> Hanako .
  ● Morphology (Korean, Arabic, Russian) [Niessen 01]
    단어란 도대체 무엇일까요 ? → 단어 란 도대체 무엇 일 까요 ?
  ● Unsupervised learning [Chung 09, Neubig 12]
Language Modeling
Language Modeling
● Assign a probability to each sentence:
  E1: Taro visited Hanako → P(E1)
  E2: the Taro visited the Hanako → P(E2)
  E3: Taro visited the bibliography → P(E3)
● More fluent sentences get higher probability:
  P(E1) > P(E2), P(E1) > P(E3)
n-gram Models
● We want the probability of the whole sentence, P(W = “Taro visited Hanako”)
● An n-gram model calculates it one word at a time, conditioning each word on the n-1 previous words, e.g. for a 2-gram model:
  P(W) = P(w1=“Taro”) × P(w2=“visited” | w1=“Taro”) × P(w3=“Hanako” | w2=“visited”) × P(w4=“</s>” | w3=“Hanako”)
  (</s> is a sentence-ending symbol)
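The 2-gram computation above can be sketched directly from corpus counts. This is an unsmoothed maximum-likelihood sketch, so any unseen bigram gets probability zero; real toolkits like SRILM add discounting and interpolation:

```python
from collections import Counter

def train_bigram(corpus):
    """Count context (unigram) and bigram occurrences; each sentence
    is padded with <s> at the start and </s> at the end."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        prev = "<s>"
        for w in sent.split() + ["</s>"]:
            uni[prev] += 1
            bi[(prev, w)] += 1
            prev = w
    return uni, bi

def bigram_prob(uni, bi, sent):
    """P(W) as a product of relative frequencies P(w_i | w_{i-1})."""
    p, prev = 1.0, "<s>"
    for w in sent.split() + ["</s>"]:
        p *= bi[(prev, w)] / uni[prev]
        prev = w
    return p
```

Training on a corpus containing only “taro visited hanako” assigns that sentence probability 1.0, since every bigram is deterministic.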
Tools
● SRILM Toolkit:
  Train: ngram-count -order 5 -interpolate -kndiscount -unk -text input.txt -lm lm.arpa
  Test: ngram -lm lm.arpa -ppl test.txt
● Others: KenLM, RandLM, IRSTLM
Research Problems
● Is there anything that can beat n-grams? [Goodman 01] They are:
  ● Fast to compute
  ● Easy to integrate into decoding
  ● Surprisingly strong
● Other methods:
  ● Syntactic LMs [Charniak 03]
  ● Neural networks [Bengio 06]
  ● Model M [Chen 09]
  ● etc.
Alignment
Alignment
● Find which words correspond to each other:
  太郎 が 花子 を 訪問 した 。 ↔ taro visited hanako .
● Done automatically with probabilistic methods trained on parallel text:
  P(花子|hanako) = 0.99
  P(太郎|taro) = 0.97
  P(visited|訪問) = 0.46
  P(visited|した) = 0.04
  P(花子|taro) = 0.0001
IBM/HMM Models
● One-to-many alignment models, so they are run in each direction:
  ホテル の 受付 → the hotel front desk
  the hotel front desk → ホテル の 受付
● IBM Model 1: no structure (“bag of words”)
● IBM Models 2-5, HMM: add more structure
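As a concrete illustration, IBM Model 1 can be trained with a few lines of EM. This is a bare-bones sketch of the idea, not what GIZA++ actually runs: it omits the NULL word, and the higher models add distortion and fertility structure on top:

```python
from collections import defaultdict

def ibm_model1(pairs, iters=10):
    """EM training of lexical translation probabilities t(f|e).
    `pairs` is a list of (source_words, target_words) sentence pairs."""
    t = defaultdict(lambda: 1.0)          # uniform (unnormalized) start
    for _ in range(iters):
        count = defaultdict(float)        # expected counts c(f,e)
        total = defaultdict(float)        # expected counts c(e)
        for fs, es in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z     # E-step: expected alignment count
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize to get new t(f|e)
        t = {fe: count[fe] / total[fe[1]] for fe in count}
    return t
```

On the classic two-sentence example (["a"]/["x"] and ["a","b"]/["x","y"]), EM pushes t(a|x) and t(b|y) toward 1, even though no single sentence disambiguates them.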
Combining One-to-Many Alignments
● Align in both directions (ホテル の 受付 → the hotel front desk, and the reverse), then combine into a single many-to-many alignment
● Several different combination heuristics exist
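One family of heuristics, grow-diag, can be sketched as follows. This is a simplified version of the symmetrization done by symal in Moses: start from the high-precision intersection of the two directional alignments, then grow it with neighboring links from the union (the real grow-diag-final adds a final pass for remaining unaligned words):

```python
def grow_diag(fe, ef):
    """Symmetrize two directional alignments, each a set of (i, j)
    source-target links, with a simplified grow-diag heuristic."""
    aligned = set(fe & ef)                # start from the intersection
    union = fe | ef
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - aligned):
            rows = {a for a, _ in aligned}
            cols = {b for _, b in aligned}
            # add a union link if one of its words is still unaligned
            # and it neighbors (incl. diagonally) an existing link
            if (i not in rows or j not in cols) and any(
                    (i + di, j + dj) in aligned
                    for di in (-1, 0, 1) for dj in (-1, 0, 1)):
                aligned.add((i, j))
                added = True
    return aligned
```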
Tools
● mkcls: find bilingual word classes
  ホテル の 受付 → 35 49 12
  the hotel front desk → 23 35 12 19
● GIZA++: find alignments using the IBM models (uses the classes from mkcls for smoothing)
● symal: combine alignments in both directions
● (All included in train-model.perl of Moses)
Research Problems
● Does alignment actually matter? [Aryan 06]
● Supervised alignment models [Fraser 06, Haghighi 09]
● Alignment using syntactic structure [DeNero 07]
● Phrase-based alignment models [Marcu 02, DeNero 08]
Phrase Extraction
Phrase Extraction
● Use alignments to find phrase pairs:
  ホテル の → hotel
  ホテル の → the hotel
  受付 → front desk
  ホテル の 受付 → hotel front desk
  ホテル の 受付 → the hotel front desk
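The standard criterion is that a phrase pair is extracted whenever no alignment link connects a word inside the pair to a word outside it. A minimal sketch of that consistency check (the real Moses extractor additionally extends pairs over unaligned boundary words, which is how both “hotel” and “the hotel” are produced above):

```python
def extract_phrases(f_words, e_words, align, max_len=4):
    """Extract phrase pairs consistent with `align`, a set of (i, j)
    links from source position i to target position j."""
    pairs = []
    n = len(f_words)
    for f1 in range(n):
        for f2 in range(f1, min(f1 + max_len, n)):
            # target positions linked to this source span
            es = [j for (i, j) in align if f1 <= i <= f2]
            if not es:
                continue
            e1, e2 = min(es), max(es)
            if e2 - e1 >= max_len:
                continue
            # consistency: every link into [e1, e2] must come from [f1, f2]
            if all(f1 <= i <= f2 for (i, j) in align if e1 <= j <= e2):
                pairs.append((" ".join(f_words[f1:f2 + 1]),
                              " ".join(e_words[e1:e2 + 1])))
    return pairs
```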
Phrase Scoring
● Calculate 5 standard features
● Phrase translation probabilities:
  P(f̄|ē) = c(f̄,ē)/c(ē)   P(ē|f̄) = c(f̄,ē)/c(f̄)
  e.g. c(ホテル の, the hotel) / c(the hotel)
● Lexical translation probabilities:
  – Use word-based translation probabilities (IBM Model 1)
  – Helps with sparsity
  P_lex(f̄|ē) = Π_i (1/|ē|) Σ_j P(f_i|e_j)
  e.g. (P(ホテル|the)+P(ホテル|hotel))/2 × (P(の|the)+P(の|hotel))/2
● Phrase penalty: 1 for each phrase
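The two phrase translation probabilities are plain relative frequencies over the extracted phrase-pair instances. A sketch covering just those two features (the lexical weights and phrase penalty are omitted):

```python
from collections import Counter

def score_phrases(extracted):
    """Compute P(f|e) = c(f,e)/c(e) and P(e|f) = c(f,e)/c(f) from a
    list of extracted (f_phrase, e_phrase) instances."""
    c_fe = Counter(extracted)
    c_f = Counter(f for f, e in extracted)
    c_e = Counter(e for f, e in extracted)
    p_f_given_e = {fe: c_fe[fe] / c_e[fe[1]] for fe in c_fe}
    p_e_given_f = {fe: c_fe[fe] / c_f[fe[0]] for fe in c_fe}
    return p_f_given_e, p_e_given_f
```

For example, if “the hotel” is extracted three times, twice paired with “ホテル の”, then P(ホテル の | the hotel) = 2/3.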
Tools
● extract: extract all the phrases
● phrase-extract/score: score the phrases
● (Included in train-model.perl)
Research
● Domain adaptation of translation models [Koehn 07, Matsoukas 09]
● Reducing phrase table size [Johnson 07]
● Generalized phrase extraction (Geppetto toolkit) [Ling 10]
● Phrase sense disambiguation [Carpuat 07]
Reordering Models
Lexicalized Reordering
● Assign each phrase pair a probability of monotone, swap, or discontinuous orientation
  e.g. 細い男が太郎を訪問した ↔ the thin man visited Taro:
  細い → the thin: high monotone probability
  太郎 を → Taro: high swap probability
● Conditioning on input/output, left/right, or both
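A minimal sketch of how an orientation label is assigned during training, given the source-side spans of the previous and current phrase pairs (real Moses also distinguishes left/right context and conditions the counts on the phrases themselves):

```python
def orientation(prev_span, cur_span):
    """Classify the current phrase's source span relative to the
    previous one: 'mono' if it directly follows, 'swap' if it
    directly precedes, else 'disc' (discontinuous)."""
    prev_start, prev_end = prev_span
    cur_start, cur_end = cur_span
    if cur_start == prev_end + 1:
        return "mono"
    if cur_end == prev_start - 1:
        return "swap"
    return "disc"
```

In the example above, translating “太郎 を” right after “visited” means the current source span sits immediately to the left of the previous one, which counts as a swap.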
Tools
● extract: same as phrase extraction
● lexical-reordering/score: scores lexical reordering
● (Included in train-model.perl)
Research
● Still a very open research area (especially en↔ja)
● Change the translation model:
  ● Hierarchical phrase-based [Chiang 07]
  ● Syntax-based translation [Yamada 01, Galley 06]
● Pre-ordering [Xia 04, Isozaki 10]:
  F: 彼 は パン を 食べ た
  F': 彼 は 食べ た パン を
  E: he ate bread
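The pre-ordering idea reduces to applying a learned source-side permutation before translation, so that decoding can then proceed (nearly) monotonically. A trivial sketch, assuming the permutation has already been predicted by some upstream model:

```python
def preorder(words, permutation):
    """Rearrange source words into target-like order according to a
    predicted permutation (a list of original positions)."""
    return [words[i] for i in permutation]

# F -> F' from the slide, with romanized stand-ins for the Japanese:
# 彼 は パン を 食べ た  ->  彼 は 食べ た パン を
print(preorder(["kare", "wa", "pan", "wo", "tabe", "ta"],
               [0, 1, 4, 5, 2, 3]))
# ['kare', 'wa', 'tabe', 'ta', 'pan', 'wo']
```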