Fast and Adaptive Online Training of Feature-Rich Translation Models
Spence Green, Sida Wang, Daniel Cer, Christopher D. Manning
Stanford University, ACL 2013
Feature-Rich Research vs. Industry/Evaluations
Research: Liang et al. 2006; Tillmann and Zhang 2006; Arun and Koehn 2007; Ittycheriah and Roukos 2007; Watanabe et al. 2007; Chiang et al. 2008; Chiang et al. 2009; Haddow et al. 2011; Hopkins and May 2011; Xiang and Ittycheriah 2011; Cherry and Foster 2012; Chiang 2012; Gimpel 2012; Simianer et al. 2012; Watanabe 2012
Industry/Evaluations: n-best/lattice MERT; MIRA (ISI)
Feature-Rich Shared Task Submissions
Number of feature-rich submissions:
2012 WMT: 0
2012 IWSLT: 1
2013 WMT: 2 (?)
2013 IWSLT: TBD
Speculation: Entrenchment of MERT
Feature-rich on small tuning sets?
Implementation complexity
Open source availability
(Image caption: top-selling phone of 2003)
Motivation: Why Feature-Rich MT?
Make MT more like other machine learning settings
Features for specific errors
Domain adaptation
Motivation: Why Online MT Tuning?
Search: decode more often, find better solutions (see Liang and Klein 2009)
Computer-aided translation: incremental updating
Benefits of Our Method
Fast and scalable
Adapts to a dense/sparse feature mix
Not complicated
Online Algorithm Overview
Updating with an adaptive learning rate
Automatic feature selection via L1 regularization
Loss function: pairwise ranking
Notation
t: time/update step
w_t ∈ R^n: weight vector
η: learning rate
ℓ_t(w): loss on the t'th example
z_{t-1} ∈ ∂ℓ_t(w_{t-1}): a subgradient (element of the subdifferential)
  z_{t-1} = ∇ℓ_t(w_{t-1}) for differentiable loss functions
r(w): regularization function
Warm-up: Stochastic Gradient Descent
Per-instance update: w_t = w_{t-1} − η z_{t-1}
Issue #1: choosing a learning rate schedule
  η/t? η/√t? η/(1 + γt)? Yuck.
Warm-up: Stochastic Gradient Descent
SGD update: w_t = w_{t-1} − η z_{t-1}
Issue #2: the same step size for every coordinate
Intuitively, we might want:
  Frequent feature: small steps, e.g. η/t
  Rare feature: large steps, e.g. η/√t
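A minimal sketch (mine, not from the slides) of the plain SGD update, to make the shared-step-size problem concrete; the function name `sgd_step` and the toy gradients are illustrative only.

```python
import numpy as np

def sgd_step(w, z, eta):
    """Plain SGD: the same global step size eta for every coordinate."""
    return w - eta * z

# Toy usage: the frequent feature (index 0) and the rare feature (index 1)
# receive exactly the same step size; per-coordinate adaptation comes next.
w = np.zeros(2)
for z in [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.5, 2.0])]:
    w = sgd_step(w, z, eta=0.1)
print(w)  # [-0.25 -0.2 ]
```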
SGD: Learning Rate Adaptation
SGD update: w_t = w_{t-1} − η z_{t-1}
Scale the learning rate with A^{-1} ∈ R^{n×n}: w_t = w_{t-1} − η A^{-1} z_{t-1}
Choices:
  A^{-1} = I (SGD)
  A^{-1} = H^{-1} (batch: Newton step)
AdaGrad [Duchi et al. 2011]
Update: w_t = w_{t-1} − η A^{-1} z_{t-1}
Set A^{-1} = G_t^{-1/2}, where G_t = G_{t-1} + z_{t-1} z_{t-1}^⊤
AdaGrad: Approximations and Intuition
For high-dimensional w_t, use the diagonal of G_t:
  w_t = w_{t-1} − η G_t^{-1/2} z_{t-1}
Intuition: a 1/√t schedule on a constant gradient
  Small steps for frequent features
  Big steps for rare features
[Duchi et al. 2011]
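A minimal diagonal-AdaGrad sketch under the notation above; the small constant `eps` added for numerical stability is my own assumption, not something on the slide.

```python
import numpy as np

def adagrad_step(w, z, G, eta, eps=1e-8):
    """Diagonal AdaGrad: G accumulates squared gradients per coordinate, so
    frequently-firing features get small steps and rare features get big steps."""
    G = G + z * z                          # G_t = G_{t-1} + diag(z_{t-1} z_{t-1}^T)
    w = w - eta * z / (np.sqrt(G) + eps)   # w_t = w_{t-1} - eta * G_t^{-1/2} z_{t-1}
    return w, G

# Toy usage with a 3-dimensional feature vector.
w, G = np.zeros(3), np.zeros(3)
for z in [np.array([1.0, 0.0, 0.1]), np.array([1.0, 2.0, 0.0])]:
    w, G = adagrad_step(w, z, G, eta=0.1)
```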
AdaGrad vs. SGD: 2D Illustration
(Figure: 2D optimization trajectories comparing SGD and AdaGrad)
Feature Selection
Traditional approach: frequency cutoffs
  Unattractive for large tuning sets (e.g. a bitext)
More principled: L1 regularization
  r(w) = ||w||_1
Feature Selection: FOBOS
Two-step update:
  (1) w_{t-1/2} = w_{t-1} − η z_{t-1}
  (2) w_t = argmin_w  (1/2) ||w − w_{t-1/2}||_2^2  (proximal term)  +  λ r(w)  (regularization)
[Duchi and Singer 2009]
Extension: use the AdaGrad update in step (1)
Feature Selection: FOBOS
For L1, FOBOS becomes soft thresholding:
  w_t = sign(w_{t-1/2}) · [ |w_{t-1/2}| − λ ]_+
Squared-L2 also has a simple closed form
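A sketch of the L1 proximal (soft-thresholding) step on the slide; numpy's `np.sign` and `np.maximum` give the closed form directly.

```python
import numpy as np

def soft_threshold(w_half, lam):
    """FOBOS L1 step: w_t = sign(w_{t-1/2}) * max(|w_{t-1/2}| - lam, 0).
    Coordinates whose magnitude falls below lam are driven exactly to zero,
    which is what performs feature selection."""
    return np.sign(w_half) * np.maximum(np.abs(w_half) - lam, 0.0)

print(soft_threshold(np.array([0.3, -0.05, 0.0, -2.0]), lam=0.1))
# -> [ 0.2 -0.   0.  -1.9]
```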
Feature Selection: Lazy Regularization
Lazy updating: only update active coordinates
  Big speedup in the MT setting
Easy with FOBOS: let t'_j be the last update of dimension j and apply λ(t − t'_j) (see the sketch below)
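A sketch of the lazy idea, assuming the L1 closed form above: rather than shrinking every weight at every step, record when coordinate j was last touched and apply the accumulated penalty λ(t − t'_j) only when j fires again. The helper name `catch_up` and the dictionary bookkeeping are illustrative.

```python
import numpy as np

def catch_up(w, j, t, last_update, lam):
    """Apply the L1 shrinkage coordinate j 'missed' since it was last touched."""
    missed = t - last_update.get(j, 0)
    if missed > 0:
        w[j] = np.sign(w[j]) * max(abs(w[j]) - lam * missed, 0.0)
    last_update[j] = t
    return w

w = np.array([0.5, -0.3])
last_update = {}
w = catch_up(w, j=0, t=4, last_update=last_update, lam=0.05)  # shrink by 4 * 0.05
print(w)  # [ 0.3 -0.3]
```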
AdaGrad + FOBOS: Full Algorithm
1. Additive update: G_t
2. Additive update: w_{t-1/2}
3. Closed-form regularization: w_t
Not complicated. Very fast.
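Putting the pieces together, a minimal sketch of one dense AdaGrad + FOBOS step (without the lazy bookkeeping). Scaling the L1 threshold by the per-coordinate rate is one common choice, an assumption on my part rather than a detail stated on the slide.

```python
import numpy as np

def adagrad_fobos_step(w, z, G, eta, lam, eps=1e-8):
    """One online update:
    1. additive update of G_t (squared-gradient accumulator)
    2. additive (adaptive) gradient step to get w_{t-1/2}
    3. closed-form L1 shrinkage to get w_t"""
    G = G + z * z                               # step 1
    rate = eta / (np.sqrt(G) + eps)             # per-coordinate learning rate
    w_half = w - rate * z                       # step 2
    w = np.sign(w_half) * np.maximum(np.abs(w_half) - lam * rate, 0.0)  # step 3
    return w, G
```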
Recap: Pairwise Ranking
For a derivation d, feature map φ(d), references e_{1:k}
Metric: B(d, e_{1:k}) (e.g. BLEU+1)
Model score: M(d) = w · φ(d)
Pairwise consistency:
  M(d+) > M(d−) ⟺ B(d+, e_{1:k}) > B(d−, e_{1:k})
[Hopkins and May 2011]
Loss Function: Pairwise Ranking
M(d+) > M(d−) ⟺ w · (φ(d+) − φ(d−)) > 0
Loss formulation:
  Difference vector: x = φ(d+) − φ(d−)
  Find w so that w · x > 0
  Binary classification problem between x and −x
Logistic loss: convex, differentiable
[Hopkins and May 2011]
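A sketch of the pairwise logistic loss and its gradient for one sampled pair, using the difference vector x = φ(d+) − φ(d−); this gradient is the z fed to the updates above. Function and variable names are mine.

```python
import numpy as np

def pairwise_logistic_grad(w, phi_plus, phi_minus):
    """Logistic loss on the difference vector x = phi(d+) - phi(d-):
    loss = log(1 + exp(-w.x)); returns (loss, gradient w.r.t. w)."""
    x = phi_plus - phi_minus
    margin = np.dot(w, x)
    loss = np.log1p(np.exp(-margin))
    grad = -x / (1.0 + np.exp(margin))   # d loss / d w
    return loss, grad
```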
Parallelization
Online algorithms are inherently sequential
Out-of-order updating:
  w_7 = w_6 − η z_4
  w_8 = w_7 − η z_6
  w_9 = w_8 − η z_5
Low-latency regret bound: O(√T)
[Langford et al. 2009]
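A toy illustration (mine, not the paper's implementation) of out-of-order updating: each gradient is computed from a slightly stale weight snapshot, then applied to the current weights in whatever order it becomes available. The `delay` parameter and the toy least-squares gradient are assumptions for the sake of the example.

```python
import numpy as np

def stale_sgd(grad_fn, examples, eta, delay=2):
    """Gradients are computed from a snapshot `delay` steps old,
    then applied to the current weights, mimicking out-of-order updates."""
    w = np.zeros_like(examples[0])
    snapshots = [w.copy()]
    for t, x in enumerate(examples):
        stale_w = snapshots[max(0, t - delay)]   # gradient from older weights...
        w = w - eta * grad_fn(stale_w, x)        # ...applied to the latest weights
        snapshots.append(w.copy())
    return w

# Toy usage: gradient of 0.5 * (w.x - 1)^2 with respect to w.
examples = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
w = stale_sgd(lambda w, x: (np.dot(w, x) - 1.0) * x, examples, eta=0.5)
```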