Fast and Adaptive Online Training of Feature-Rich Translation Models
Spence Green, Sida Wang, Daniel Cer, Christopher D. Manning
Stanford University
ACL 2013
Feature-Rich Tuning: Research vs. Industry/Evaluations
Research: Liang et al. 2006; Tillmann and Zhang 2006; Arun and Koehn 2007; Ittycheriah and Roukos 2007; Watanabe et al. 2007; Chiang et al. 2008; Chiang et al. 2009; Haddow et al. 2011; Hopkins and May 2011; Xiang and Ittycheriah 2011; Cherry and Foster 2012; Chiang 2012; Gimpel 2012; Simianer et al. 2012; Watanabe 2012
Industry/Evaluations: n-best/lattice MERT; MIRA (ISI)
Feature-Rich Shared Task Submissions
Year   Task    # feature-rich submissions
2012   WMT     0
2012   IWSLT   1
2013   WMT     2?
2013   IWSLT   TBD
Speculation: Entrenchment of MERT
Feature-rich on small tuning sets?
Implementation complexity
Open source availability
(Top-selling phone of 2003)
Motivation: Why Feature-Rich MT?
Make MT more like other machine learning settings
Features for specific errors
Domain adaptation
Motivation: Why Online MT Tuning?
Search: decode more often → better solutions (see Liang and Klein 2009)
Computer-aided translation: incremental updating
Benefits of Our Method
Fast and scalable
Adapts to a dense/sparse feature mix
Not complicated
Online Algorithm Overview
Updating with an adaptive learning rate
Automatic feature selection via L1 regularization
Loss function: pairwise ranking
Notation
t : time/update step
w_t : weight vector in R^n
η : learning rate
ℓ_t(w) : loss on the t'th example
z_{t-1} ∈ ∂ℓ_t(w_{t-1}) : a subgradient (element of the subdifferential)
z_{t-1} = ∇ℓ_t(w_{t-1}) for differentiable loss functions
r(w) : regularization function
Warm-up: Stochastic Gradient Descent
Per-instance update: w_t = w_{t-1} − η z_{t-1}
Issue #1: the learning rate schedule
  η / t ?   η / √t ?   η / (1 + γt) ?   Yuck.
Warm-up: Stochastic Gradient Descent
SGD update: w_t = w_{t-1} − η z_{t-1}
Issue #2: the same step size for every coordinate
Intuitively, we might want:
  Frequent feature: small steps, e.g. η / t
  Rare feature: large steps, e.g. η / √t
SGD: Learning Rate Adaptation
SGD update: w_t = w_{t-1} − η z_{t-1}
Scale the learning rate with A^{-1} ∈ R^{n×n}: w_t = w_{t-1} − η A^{-1} z_{t-1}
Choices:
  A^{-1} = I (SGD)
  A^{-1} = H^{-1} (batch: Newton step)
AdaGrad [Duchi et al. 2011]
Update: w_t = w_{t-1} − η A^{-1} z_{t-1}
Set A^{-1} = G_t^{-1/2}, where G_t = G_{t-1} + z_{t-1} z_{t-1}^⊤
AdaGrad: Approximations and Intuition
For high-dimensional w_t, use diagonal G_t:
  w_t = w_{t-1} − η G_t^{-1/2} z_{t-1}
Intuition:
  1/√t schedule on a constant gradient
  Small steps for frequent features
  Big steps for rare features
[Duchi et al. 2011]
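A minimal sketch of the diagonal AdaGrad step in NumPy; the function name and the eps smoothing term (added for numerical stability) are ours, not part of the paper:

```python
import numpy as np

def adagrad_step(w, G, z, eta=0.1, eps=1e-8):
    """One diagonal-AdaGrad step on a single example.

    w : weight vector w_{t-1}
    G : running sum of squared (sub)gradients, one entry per feature
    z : (sub)gradient z_{t-1} of the current example's loss
    """
    G = G + z * z                          # G_t = G_{t-1} + z_{t-1} ⊙ z_{t-1} (diagonal only)
    w = w - eta * z / (np.sqrt(G) + eps)   # per-coordinate step; rare features get larger steps
    return w, G
```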
AdaGrad vs. SGD: 2D Illustration
[Figure: 2D optimization trajectories comparing SGD and AdaGrad]
Feature Selection
Traditional approach: frequency cutoffs
  Unattractive for large tuning sets (e.g. the bitext)
More principled: L1 regularization, r(w) = ‖w‖_1
Feature Selection: FOBOS
Two-step update:
  w_{t-1/2} = w_{t-1} − η z_{t-1}                                   (1)
  w_t = argmin_w  (1/2) ‖w − w_{t-1/2}‖_2^2  +  λ · r(w)            (2)
               (proximal term)              (regularization)
[Duchi and Singer 2009]
Extension: AdaGrad update in step (1)
Feature Selection: FOBOS
For L1, FOBOS becomes soft thresholding:
  w_t = sign(w_{t-1/2}) [ |w_{t-1/2}| − λ ]_+
Squared-L2 also has a simple form
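The closed-form L1 step is a one-liner; a sketch (here lam stands for the effective penalty, i.e. λ already scaled by the learning rate):

```python
import numpy as np

def soft_threshold(w_half, lam):
    """L1 proximal step: w_t = sign(w_{t-1/2}) * max(|w_{t-1/2}| - lam, 0)."""
    return np.sign(w_half) * np.maximum(np.abs(w_half) - lam, 0.0)
```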
Feature Selection: Lazy Regularization
Lazy updating: only update active coordinates
Big speedup in the MT setting
Easy with FOBOS:
  t'_j : last update of dimension j
  Use λ (t − t'_j)
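A sketch of the lazy variant, assuming a sparse dict-backed weight vector and a per-feature record of the last step at which it was regularized; all names here are illustrative:

```python
import math

def lazy_l1_catch_up(w, last_step, active_features, t, lam):
    """Apply the L1 shrinkage a coordinate has 'missed' since its last update.

    w               : dict feature -> weight (sparse weight vector)
    last_step       : dict feature -> step at which the feature was last regularized (t'_j)
    active_features : features firing on the current example
    t               : current step
    lam             : per-step L1 penalty
    """
    for j in active_features:
        skipped = t - last_step.get(j, t)      # t - t'_j
        penalty = lam * skipped                # accumulated shrinkage λ(t - t'_j)
        wj = w.get(j, 0.0)
        w[j] = math.copysign(max(abs(wj) - penalty, 0.0), wj)
        last_step[j] = t
    return w
```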
AdaGrad + FOBOS: Full Algorithm
1. Additive update: G_t
2. Additive update: w_{t-1/2}
3. Closed-form regularization: w_t
Not complicated. Very fast.
Recap: Pairwise Ranking
For derivation d, feature map φ(d), references e_{1:k}
Metric: B(d, e_{1:k}) (e.g. BLEU+1)
Model score: M(d) = w · φ(d)
Pairwise consistency: M(d+) > M(d−) ⟺ B(d+, e_{1:k}) > B(d−, e_{1:k})
[Hopkins and May 2011]
Loss Function: Pairwise Ranking
M(d+) > M(d−) ⟺ w · (φ(d+) − φ(d−)) > 0
Loss formulation:
  Difference vector: x = φ(d+) − φ(d−)
  Find w so that w · x > 0
  Binary classification problem between x and −x
  Logistic loss: convex, differentiable
[Hopkins and May 2011]
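A sketch of how the pairwise ranking loss turns candidate derivations into binary classification examples; the uniform pair sampling and the helper names are illustrative simplifications of the Hopkins and May (2011) sampler:

```python
import random
import numpy as np

def pro_difference_vectors(derivations, metric, feats, n_pairs=50):
    """Sample derivation pairs and return difference vectors x = phi(d+) - phi(d-).

    derivations : candidate derivations for one source sentence (e.g. an n-best list)
    metric      : d -> sentence-level score, e.g. BLEU+1 against the references
    feats       : d -> feature vector phi(d) as a numpy array
    """
    examples = []
    for _ in range(n_pairs):
        a, b = random.sample(derivations, 2)
        if metric(a) == metric(b):
            continue                                     # no preference, skip the pair
        d_plus, d_minus = (a, b) if metric(a) > metric(b) else (b, a)
        examples.append(feats(d_plus) - feats(d_minus))  # difference vector
    return examples

def logistic_loss_grad(w, x):
    """Gradient of log(1 + exp(-w.x)); this is the z_{t-1} fed to the online updates."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return (p - 1.0) * x
```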
Parallelization
Online algorithms are inherently sequential
Out-of-order updating:
  w_7 = w_6 − η z_4
  w_8 = w_7 − η z_6
  w_9 = w_8 − η z_5
Low-latency regret bound: O(√T) [Langford et al. 2009]
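A toy sketch of stale-gradient updating with a thread pool; it only illustrates the idea that updates can be applied out of submission order, and is not the actual Phrasal threading model:

```python
import threading
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed

def async_sgd(examples, grad_fn, dim, eta=0.1, workers=4):
    """Plain SGD with stale, out-of-order gradient application (illustrative only)."""
    w = np.zeros(dim)
    lock = threading.Lock()

    def work(example):
        with lock:
            snapshot = w.copy()           # gradient is computed on possibly stale weights
        return grad_fn(snapshot, example)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(work, ex) for ex in examples]
        for fut in as_completed(futures):  # gradients arrive, and are applied, out of order
            with lock:
                w -= eta * fut.result()
    return w
```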
Translation Quality Experiments
Arabic-English (Ar–En) and Chinese-English (Zh–En)
Newswire and mixed-genre experiments
BOLT bitexts: data up to 2012

         Bilingual               Monolingual
         Sentences    Tokens     Tokens
Ar–En    6.6M         375M       990M
Zh–En    9.3M         538M
MT System
Phrase-based MT: Phrasal [Cer et al. 2010]
Dense baseline: MERT
  Cer et al. 2008 line search
  Accumulates n-best lists
  Random starting points, etc.
Feature-Rich Baseline: PRO
Pairwise Ranking Optimization (PRO) [Hopkins and May 2011]
Batch log loss minimization
Phrasal implementation: L-BFGS with L2 regularization
Sanity check: Moses PRO and kb-MIRA (batch) implementations
Dense Features
 8  Hierarchical lex. reordering
 5  Moses phrase table features
 1  Rule bitext count
 1  Unique rule indicator
 1  Word penalty
 1  Linear distortion
 1  LM
 1  Unknown word
19  total
Sparse Feature Templates
Discriminative Phrase Table (PT)
  Rule indicator: 1[⟨Arabic source phrase⟩ ⇒ space program]
Discriminative Alignments (AL)
  Source word deletion: 1[⟨Arabic source word⟩ ⇒ ∅]
  Word alignments: 1[⟨Arabic source word⟩ ⇒ space]
Discriminative Lex. Reordering (LO)
  Phrase orientation: 1[swap(⟨Arabic source phrase⟩ ⇒ space)]
Evaluation: NIST OpenMT
Small tuning set: MT06
"Large" tuning set: MT0568 (≈ 4,200 segments)
BLEU-4, uncased; four references
Paper: mixed-genre (bitext) experiments
Results: Small Tuning Set (Dense)

              Ar–En                  Zh–En
              Tune     Test Avg.     Tune     Test Avg.
MERT          45.08    50.51         33.73    34.49
This paper    43.16    50.11         32.20    35.25
Results: Add More Features

                   Ar–En                  Zh–En
                   Tune     Test Avg.     Tune     Test Avg.
MERT—Dense         45.08    50.51         33.73    34.49
This paper + PT    50.61    50.52         34.92    35.12