Sparse Feature Learning

  1. Sparse Feature Learning. Philipp Koehn, 1 March 2016

  2. Multiple Component Models: Translation Model, Language Model, Reordering Model

  3. Component Weights: [diagram of the Translation Model, Language Model, and Reordering Model, each component annotated with its weight, e.g. .05, .26, .19, .06, .21, .1, .04, .1]

  4. Even More Numbers Inside: [same diagram, now also showing individual model scores such as p(a | to) = 0.18, p(casa | house) = 0.35, p(azur | blue) = 0.77, p(la | the) = 0.32]

  5. Grand Vision
     • There are millions of parameters
       – each phrase translation score
       – each language model n-gram
       – etc.
     • Can we train them all discriminatively?
     • This implies optimization over the entire training corpus

  6. [Figure: tuning methods arranged along two axes, search space (n-best vs. iterative) and number of rule scores tuned (a handful, thousands, millions, learned from n-best lists or the aligned corpus); "our work" targets the millions end]

  7. [Figure, continued: places MERT [Och et al., 2003] at n-best search with a handful of rule scores]

  8. [Figure, continued: identical to slide 7]

  9. [Figure, continued: adds MIRA [Chiang, 2007], SampleRank [Haddow et al., 2011], and PRO [Hopkins/May, 2011]]

  10. [Figure, continued: adds MaxViolation [Yu et al., 2013] and Leave One Out [Wuebker et al., 2012]]

  11. Strategy and Core Problems
     • Process each sentence pair in the training corpus
     • Optimize parameters towards producing the reference translation
     • The reference translation may not be producible by the model
       – optimize towards the most similar translation
       – or process the sentence pair only partially
     • Avoid overfitting
     • Large corpora require efficient learning methods

  12. Sentence Level vs. Corpus Level Error Metric
     • Optimizing BLEU requires optimizing over the entire training corpus
       $$\text{BLEU}\Big(\big\{\, e_i^{\text{best}} = \arg\max_{e_i} \sum_j \lambda_j\, h_j(e_i, f_i) \,\big\},\ \big\{\, e_i^{\text{ref}} \,\big\}\Big)$$
     • Life would be easier if we could sum over sentence-level scores
       $$\sum_i \text{BLEU'}\Big(\arg\max_{e_i} \sum_j \lambda_j\, h_j(e_i, f_i),\ e_i^{\text{ref}}\Big)$$
     • For instance, BLEU+1
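
     As an illustration of such a sentence-level surrogate, here is a minimal sketch of an add-one smoothed sentence-level BLEU in the spirit of BLEU+1; the function names and the exact smoothing details are my assumptions, not taken from the slides.

     ```python
     import math
     from collections import Counter

     def ngrams(tokens, n):
         """All n-grams of a token list, as a Counter."""
         return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

     def sentence_bleu_plus1(hyp, ref, max_n=4):
         """Smoothed sentence-level BLEU: add-one smoothing on the n-gram
         precisions (BLEU+1 style), plus the usual brevity penalty."""
         hyp, ref = hyp.split(), ref.split()
         log_precision = 0.0
         for n in range(1, max_n + 1):
             hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
             matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
             total = sum(hyp_counts.values())
             # add-one smoothing keeps the score non-zero when one n-gram order has no match
             log_precision += math.log((matches + 1) / (total + 1))
         brevity = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
         return brevity * math.exp(log_precision / max_n)

     print(sentence_bleu_plus1("the house is blue", "the house is blue"))  # 1.0
     ```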

  13. features (section divider)

  14. Core Rule Properties
     • Frequency of phrase (binned; see the sketch after this slide)
     • Length of phrase
       – number of source words
       – number of target words
       – number of source and target words
     • Unaligned / added (content) words in the phrase pair
     • Reordering within the phrase pair
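
     A sketch of how the binned frequency property could be turned into a sparse indicator feature; the bin boundaries are illustrative, not the ones used in the slides.

     ```python
     def frequency_bin_feature(count):
         """Map a raw phrase-pair count to a binned indicator feature name (illustrative bins)."""
         for upper, label in [(1, "1"), (2, "2"), (3, "3"), (5, "4-5"), (10, "6-10")]:
             if count <= upper:
                 return f"freq_bin({label})"
         return "freq_bin(>10)"

     print(frequency_bin_feature(7))  # freq_bin(6-10)
     ```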

  15. Lexical Translation Features
     • lex(e) fires when an output word e is generated
     • lex(f, e) fires when an output word e is generated aligned to an input word f
     • lex(NULL, e) fires when an output word e is generated unaligned
     • lex(f, NULL) fires when an input word f is dropped
     • Could also be defined on part-of-speech tags or word classes
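
     A minimal sketch of extracting these indicator features from one word-aligned sentence pair; the data layout (alignment as a set of (source index, target index) pairs) and the feature name strings are assumptions for illustration.

     ```python
     from collections import defaultdict

     def lexical_features(src, tgt, alignment):
         """Count lexical indicator features for one aligned sentence pair.

         src, tgt: lists of tokens; alignment: set of (src_idx, tgt_idx) pairs.
         """
         feats = defaultdict(float)
         aligned_src = {i for i, _ in alignment}
         aligned_tgt = {j for _, j in alignment}
         for j, e in enumerate(tgt):
             feats[f"lex({e})"] += 1              # output word e is generated
             if j not in aligned_tgt:
                 feats[f"lex(NULL,{e})"] += 1     # e is generated unaligned
         for i, j in alignment:
             feats[f"lex({src[i]},{tgt[j]})"] += 1  # e is generated aligned to f
         for i, f in enumerate(src):
             if i not in aligned_src:
                 feats[f"lex({f},NULL)"] += 1     # input word f is dropped
         return feats
     ```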

  16. Lexicalized Reordering Features
     • Replacement of the lexicalized reordering model
     • Features differ by
       – lexicalized by first or last word of the phrase (source or target)
       – word representation replaced by word class
       – orientation type

  17. Domain Features
     • Indicator feature that the rule occurs in one specific domain
     • Probability that the rule belongs to one specific domain
     • Domain-specific lexical translation probabilities

  18. Syntax Features
     • If we have syntactic parse trees, many more features
       – number of nodes of a particular kind
       – matching of source and target constituents
       – reordering within syntactic constituents
     • Parse trees are a by-product of syntax-based models
     • More on that in future lectures

  19. Every Number in Model
     • Phrase pair indicator feature
     • Target n-gram feature
     • Phrase pair orientation feature
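
     A sketch of how a derivation's statistics could be exposed as sparse indicator features of this kind; the input representation and the feature name strings are assumptions for illustration.

     ```python
     from collections import defaultdict

     def derivation_indicator_features(phrase_pairs, target_words, orientations, n=2):
         """One indicator feature per phrase pair, target n-gram, and orientation event."""
         feats = defaultdict(float)
         for src_phrase, tgt_phrase in phrase_pairs:
             feats[f"pp({src_phrase} ||| {tgt_phrase})"] += 1
         for i in range(len(target_words) - n + 1):
             feats[f"ngram({' '.join(target_words[i:i + n])})"] += 1
         for (src_phrase, tgt_phrase), orientation in orientations:
             feats[f"orient({orientation}, {src_phrase} ||| {tgt_phrase})"] += 1
         return feats
     ```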

  20. perceptron algorithm (section divider)

  21. Optimizing Linear Model
     • We consider each sentence pair (e_i, f_i) and its alignment a_i
     • To simplify notation, we define the derivation d_i = (e_i, f_i, a_i)
     • The model score is a weighted linear combination of feature values h_j and weights λ_j (a small scoring sketch follows after this slide)
       $$\text{score}(\lambda, d_i) = \sum_j \lambda_j\, h_j(d_i)$$
     • Such models are also known as single-layer perceptrons
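
     For sparse features, the sum only has to run over features that actually fire. A minimal sketch with feature vectors stored as dicts; the names and example values are illustrative.

     ```python
     def model_score(weights, features):
         """score(λ, d) = Σ_j λ_j · h_j(d), with h stored sparsely as {feature name: value}."""
         return sum(weights.get(name, 0.0) * value for name, value in features.items())

     print(model_score({"lex(haus,house)": 0.5, "freq_bin(1)": -0.25},
                       {"lex(haus,house)": 1.0, "freq_bin(1)": 1.0}))  # 0.25
     ```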

  22. Reference and Model Best
     • Besides the reference derivation d_i^ref for sentence pair i and its score
       $$\text{score}(\lambda, d_i^{\text{ref}}) = \sum_j \lambda_j\, h_j(d_i^{\text{ref}})$$
     • we also have the model-best translation
       $$d_i^{\text{best}} = \arg\max_{d}\ \text{score}(\lambda, d) = \arg\max_{d} \sum_j \lambda_j\, h_j(d)$$
     • ... and its model score
       $$\text{score}(\lambda, d_i^{\text{best}}) = \sum_j \lambda_j\, h_j(d_i^{\text{best}})$$
     • We can view the error of our model as a function of its parameters λ
       $$\text{error}(\lambda, d_i^{\text{best}}, d_i^{\text{ref}}) = \text{score}(\lambda, d_i^{\text{best}}) - \text{score}(\lambda, d_i^{\text{ref}})$$

  23. Follow the Direction of the Gradient
     [Figure: two error(λ) curves; where the gradient at λ_current is negative (e.g. −2), we need to move right towards λ_optimal; where it is positive (e.g. 1), we need to move left]
     • Assume that we can compute the gradient d/dλ error(λ) at any point
     • If the error curve is convex, the gradient points in the direction of the optimum

  24. Move Relative to Steepness
     [Figure: three error(λ) curves with gradients of 2 (steep), 1 (medium), and 0.2 (flat) at λ_current: move a lot, move some, move little]
     • If the error curve is convex, the size of the gradient indicates the speed of change
     • Model update
       $$\Delta\lambda = -\frac{d}{d\lambda}\, \text{error}(\lambda)$$

  25. Stochastic Gradient Descent
     • We want to minimize the error
       $$\text{error}(\lambda, d_i^{\text{best}}, d_i^{\text{ref}}) = \text{score}(\lambda, d_i^{\text{best}}) - \text{score}(\lambda, d_i^{\text{ref}})$$
     • In stochastic gradient descent, we follow the direction of the gradient
       $$\frac{d}{d\lambda}\, \text{error}(\lambda, d_i^{\text{best}}, d_i^{\text{ref}})$$
     • For each λ_j, we compute the gradient pointwise
       $$\frac{d}{d\lambda_j}\, \text{error}(\lambda_j, d_i^{\text{best}}, d_i^{\text{ref}}) = \frac{d}{d\lambda_j}\, \Big( \text{score}(\lambda, d_i^{\text{best}}) - \text{score}(\lambda, d_i^{\text{ref}}) \Big)$$

  26. Stochastic Gradient Descent
     • Gradient with respect to λ_j
       $$\frac{d}{d\lambda_j}\, \text{error}(\lambda_j, d_i^{\text{best}}, d_i^{\text{ref}}) = \frac{d}{d\lambda_j}\, \Big( \sum_{j'} \lambda_{j'}\, h_{j'}(d_i^{\text{best}}) - \sum_{j'} \lambda_{j'}\, h_{j'}(d_i^{\text{ref}}) \Big)$$
     • For λ_{j'} ≠ λ_j, the terms λ_{j'} h_{j'}(d_i) are constant, so they disappear
       $$\frac{d}{d\lambda_j}\, \text{error}(\lambda_j, d_i^{\text{best}}, d_i^{\text{ref}}) = \frac{d}{d\lambda_j}\, \Big( \lambda_j\, h_j(d_i^{\text{best}}) - \lambda_j\, h_j(d_i^{\text{ref}}) \Big)$$
     • The derivative of a linear function is its factor
       $$\frac{d}{d\lambda_j}\, \text{error}(\lambda_j, d_i^{\text{best}}, d_i^{\text{ref}}) = h_j(d_i^{\text{best}}) - h_j(d_i^{\text{ref}})$$
     • ⇒ Our model update is
       $$\lambda_j^{\text{new}} = \lambda_j - \big( h_j(d_i^{\text{best}}) - h_j(d_i^{\text{ref}}) \big)$$
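
     A tiny worked instance of this update, with numbers invented for illustration: take
     $$\lambda_j = 0.5, \qquad h_j(d_i^{\text{best}}) = 2, \qquad h_j(d_i^{\text{ref}}) = 3,$$
     then
     $$\lambda_j^{\text{new}} = 0.5 - (2 - 3) = 1.5,$$
     so a feature that fires more often in the reference than in the model best has its weight increased, which is exactly the intuition on the next slide.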

  27. Intuition
     • Feature values in the model-best translation
     • Feature values in the reference translation
     • Intuition:
       – promote features whose value is bigger in the reference
       – demote features whose value is bigger in the model best

  28. Algorithm
     Input: set of sentence pairs (e, f), set of features
     Output: a weight λ_i for each feature

       λ_i = 0 for all i
       while not converged do
         for all foreign sentences f do
           d_best = best derivation according to the model
           d_ref = reference derivation
           if d_best ≠ d_ref then
             for all features h_i do
               λ_i += h_i(d_ref) − h_i(d_best)
             end for
           end if
         end for
       end while
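
     A compact Python rendering of this loop, as a sketch only: decode and reference_features stand in for the decoder and forced-decoding components the slides do not spell out, and a fixed number of epochs approximates "while not converged".

     ```python
     from collections import defaultdict

     def train_perceptron(corpus, decode, reference_features, epochs=10):
         """Structured perceptron over sparse MT features.

         corpus: iterable of (f, e) sentence pairs.
         decode(f, weights): sparse feature dict of the model-best derivation (assumed helper).
         reference_features(f, e): sparse feature dict of the reference derivation (assumed helper).
         """
         weights = defaultdict(float)            # λ_i = 0 for all i
         for _ in range(epochs):                 # stand-in for "while not converged"
             for f, e in corpus:
                 h_best = decode(f, weights)
                 h_ref = reference_features(f, e)
                 if h_best != h_ref:
                     # promote features of the reference, demote those of the model best
                     for name in set(h_best) | set(h_ref):
                         weights[name] += h_ref.get(name, 0.0) - h_best.get(name, 0.0)
         return weights
     ```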

  29. generating the reference (section divider)
