Sparse Feature Learning
Philipp Koehn, 3 March 2015




  1. Sparse Feature Learning, Philipp Koehn, 3 March 2015

  2. Multiple Component Models
      • Translation Model
      • Language Model
      • Reordering Model

  3. Component Weights
      [Figure: the component models annotated with weights such as .05, .26 (Translation Model), .19, .06 (Language Model), .21, .1, .04, .1 (Reordering Model).]

  4. Even More Numbers Inside
      [Figure: inside each component there are many more parameters, e.g. p(a | to) = 0.18, p(casa | house) = 0.35, p(azur | blue) = 0.77, p(la | the) = 0.32, in addition to the component weights.]

  5. Grand Vision
      • There are millions of parameters
        – each phrase translation score
        – each language model n-gram
        – etc.
      • Can we train them all discriminatively?
      • This implies optimization over the entire training corpus

  6. [Figure: approaches arranged along two axes: search space (n-best, iterative n-best, aligned corpus) and number of rule scores (a handful, thousands, millions); "our work" is marked on the diagram.]

  7. [Figure: the same diagram, now with MERT (Och et al., 2003) added.]

  8. [Figure: same diagram as the previous slide.]

  9. [Figure: the diagram now also includes MIRA (Chiang, 2007), SampleRank (Haddow et al., 2011), and PRO (Hopkins/May, 2011).]

  10. [Figure: the diagram further adds MaxViolation (Yu et al., 2014) and Leave One Out (Wuebker et al., 2012).]

  11. Strategy and Core Problems
      • Process each sentence pair in the training corpus
      • Optimize parameters towards producing the reference translation
      • The reference translation may not be producible by the model
        – optimize towards the most similar translation
        – or, only process the sentence pair partially
      • Avoid overfitting
      • Large corpora require efficient learning methods

  12. Sentence Level vs. Corpus Level Error Metric
      • Optimizing BLEU requires optimizing over the entire training corpus:
        BLEU({ e_i^best = argmax_{e_i} Σ_j λ_j h_j(e_i, f_i) }, { e_i^ref })
      • Life would be easier if we could sum over sentence-level scores:
        Σ_i BLEU'( argmax_{e_i} Σ_j λ_j h_j(e_i, f_i), e_i^ref )
      • For instance, BLEU+1
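
      A minimal sketch of a smoothed sentence-level score in the spirit of BLEU+1, assuming tokenized hypothesis and reference lists; for simplicity it applies add-one smoothing to all n-gram orders, whereas the original BLEU+1 smooths only the higher-order counts:

          from collections import Counter
          import math

          def sentence_bleu_plus1(hypothesis, reference, max_n=4):
              # Smoothed sentence-level BLEU: add 1 to n-gram matches and totals so a
              # single missing higher-order match never zeroes out the whole score.
              log_prec = 0.0
              for n in range(1, max_n + 1):
                  hyp_ngrams = Counter(tuple(hypothesis[i:i + n])
                                       for i in range(len(hypothesis) - n + 1))
                  ref_ngrams = Counter(tuple(reference[i:i + n])
                                       for i in range(len(reference) - n + 1))
                  matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
                  total = max(len(hypothesis) - n + 1, 0)
                  log_prec += math.log((matches + 1) / (total + 1))
              # brevity penalty, computed per sentence rather than over the corpus
              bp = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
              return bp * math.exp(log_prec / max_n)

      For example, sentence_bleu_plus1("the blue house".split(), "the blue house".split()) returns 1.0, and any partial match yields a non-zero score that can be summed per sentence.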

  13. features

  14. Core Rule Properties
      • Frequency of phrase (binned)
      • Length of phrase
        – number of source words
        – number of target words
        – number of source and target words
      • Unaligned / added (content) words in phrase pair
      • Reordering within phrase pair
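
      As a simple illustration of the first bullet, a raw phrase-pair count can be mapped to a binned indicator feature; the bin boundaries below are made up for the example, not taken from the lecture:

          def frequency_bin_feature(count, bins=(1, 2, 3, 5, 10, 100)):
              # Map a raw phrase-pair count to the name of a binned indicator feature.
              for upper in bins:
                  if count <= upper:
                      return "freq_bin<=%d" % upper
              return "freq_bin>%d" % bins[-1]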

  15. Lexical Translation Features
      • lex(e) fires when an output word e is generated
      • lex(f, e) fires when an output word e is generated aligned to an input word f
      • lex(NULL, e) fires when an output word e is generated unaligned
      • lex(f, NULL) fires when an input word f is dropped
      • Could also be defined on part-of-speech tags or word classes
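
      A minimal sketch of how these four indicator features could be fired for one sentence pair, assuming the alignment is given as a set of (source index, target index) pairs; the feature-name scheme is illustrative, not the one used by any particular decoder:

          from collections import Counter

          def lexical_features(source, target, alignment):
              feats = Counter()
              aligned_src = {i for i, _ in alignment}
              aligned_tgt = {j for _, j in alignment}
              for j, e in enumerate(target):
                  feats["lex(%s)" % e] += 1                         # output word e is generated
                  if j not in aligned_tgt:
                      feats["lex(NULL,%s)" % e] += 1                # e is generated unaligned
              for i, j in alignment:
                  feats["lex(%s,%s)" % (source[i], target[j])] += 1 # e generated aligned to f
              for i, f in enumerate(source):
                  if i not in aligned_src:
                      feats["lex(%s,NULL)" % f] += 1                # input word f is dropped
              return feats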

  16. Lexicalized Reordering Features
      • Replacement of the lexicalized reordering model
      • Features differ by
        – lexicalization by first or last word of phrase (source or target)
        – word representation replaced by word class
        – orientation type

  17. Domain Features
      • Indicator feature that the rule occurs in one specific domain
      • Probability that the rule belongs to one specific domain
      • Domain-specific lexical translation probabilities

  18. Syntax Features
      • If we have syntactic parse trees, many more features become available
        – number of nodes of a particular kind
        – matching of source and target constituents
        – reordering within syntactic constituents
      • Parse trees are a by-product of syntax-based models
      • More on that in future lectures

  19. Every Number in Model
      • Phrase pair indicator feature
      • Target n-gram feature
      • Phrase pair orientation feature
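
      A hedged sketch of what turning every model event into its own sparse feature could look like: one indicator feature per phrase pair, per target n-gram, and per phrase-pair orientation event; the naming scheme and the bigram order are illustrative assumptions:

          from collections import Counter

          def derivation_indicator_features(phrase_pairs, orientations, target, n=2):
              # phrase_pairs: list of (source phrase, target phrase) strings in the derivation
              # orientations: one orientation label (e.g. "mono", "swap") per phrase pair
              feats = Counter()
              for (src, tgt), ori in zip(phrase_pairs, orientations):
                  feats["pp=%s|%s" % (src, tgt)] += 1           # phrase pair indicator feature
                  feats["ori=%s|%s|%s" % (src, tgt, ori)] += 1  # phrase pair orientation feature
              for i in range(len(target) - n + 1):
                  feats["lm=" + " ".join(target[i:i + n])] += 1 # target n-gram feature
              return feats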

  20. perceptron algorithm

  21. Optimizing Linear Model
      • We consider each sentence pair (e_i, f_i) and its alignment a_i
      • To simplify notation, we define the derivation d_i = (e_i, f_i, a_i)
      • The model score is a weighted linear combination of feature values h_j and weights λ_j:
        score(λ, d_i) = Σ_j λ_j h_j(d_i)
      • Such models are also known as single-layer perceptrons
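
      A minimal sketch of this score, with a derivation represented by its sparse feature vector (a dict from feature name to value h_j(d)) and the weights λ stored the same way:

          def model_score(weights, features):
              # score(λ, d) = Σ_j λ_j h_j(d), summing only over features that actually fire
              return sum(weights.get(name, 0.0) * value for name, value in features.items())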

  22. Reference and Model Best
      • Besides the reference derivation d_i^ref for sentence pair i and its score
        score(λ, d_i^ref) = Σ_j λ_j h_j(d_i^ref)
      • we also have the model-best translation
        d_i^best = argmax_d score(λ, d) = argmax_d Σ_j λ_j h_j(d)
      • ... and its model score
        score(λ, d_i^best) = Σ_j λ_j h_j(d_i^best)
      • We can view the error in our model as a function of its parameters λ:
        error(λ, d_i^best, d_i^ref) = score(λ, d_i^best) − score(λ, d_i^ref)

  23. Stochastic Gradient Descent
      [Figure: error(λ) plotted as a curve over λ; the gradient at the current λ points towards the optimal λ.]
      • We cannot analytically find the optimum of the curve error(λ)
      • We can compute the gradient d/dλ error(λ) at any point
      • We want to follow the gradient towards the optimal λ value

  24. Stochastic Gradient Descent
      • We want to minimize the error
        error(λ, d_i^best, d_i^ref) = score(λ, d_i^best) − score(λ, d_i^ref)
      • In stochastic gradient descent, we follow the direction of the gradient
        d/dλ error(λ, d_i^best, d_i^ref)
      • For each λ_j, we compute the gradient pointwise
        d/dλ_j error(λ_j, d_i^best, d_i^ref) = d/dλ_j ( score(λ, d_i^best) − score(λ, d_i^ref) )

  25. Stochastic Gradient Descent
      • Gradient with respect to λ_j:
        d/dλ_j error(λ_j, d_i^best, d_i^ref) = d/dλ_j ( Σ_{j'} λ_{j'} h_{j'}(d_i^best) − Σ_{j'} λ_{j'} h_{j'}(d_i^ref) )
      • For j' ≠ j, the terms λ_{j'} h_{j'}(d_i) are constant, so they disappear:
        d/dλ_j error(λ_j, d_i^best, d_i^ref) = d/dλ_j ( λ_j h_j(d_i^best) − λ_j h_j(d_i^ref) )
      • The derivative of a linear function is its factor:
        d/dλ_j error(λ_j, d_i^best, d_i^ref) = h_j(d_i^best) − h_j(d_i^ref)
      ⇒ Our model update is λ_j^new = λ_j − ( h_j(d_i^best) − h_j(d_i^ref) )
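
      The resulting update, written as code over sparse feature vectors; the optional learning rate is an addition not shown on the slide:

          def perceptron_update(weights, best_feats, ref_feats, learning_rate=1.0):
              # λ_j ← λ_j − (h_j(d_best) − h_j(d_ref)): promote features of the reference,
              # demote features of the model-best derivation.
              for name in set(best_feats) | set(ref_feats):
                  delta = ref_feats.get(name, 0.0) - best_feats.get(name, 0.0)
                  weights[name] = weights.get(name, 0.0) + learning_rate * delta
              return weights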

  26. Intuition
      • Feature values in the model-best translation
      • Feature values in the reference translation
      • Intuition:
        – promote features whose value is bigger in the reference
        – demote features whose value is bigger in the model best

  27. Algorithm
      Input: set of sentence pairs (e, f), set of features
      Output: set of weights λ for each feature
        λ_i = 0 for all i
        while not converged do
          for all foreign sentences f do
            d_best = best derivation according to model
            d_ref = reference derivation
            if d_best ≠ d_ref then
              for all features h_i do
                λ_i += h_i(d_ref) − h_i(d_best)
              end for
            end if
          end for
        end while
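
      A sketch of this training loop in code, assuming hypothetical helpers decode(f, weights) and reference_derivation(e, f) that return the sparse feature vectors of the model-best and reference derivations; convergence is simplified to a fixed number of passes:

          def train_perceptron(corpus, decode, reference_derivation, epochs=10):
              weights = {}                              # λ_i = 0 for all i
              for _ in range(epochs):                   # stand-in for "while not converged"
                  for e, f in corpus:
                      best = decode(f, weights)         # best derivation according to model
                      ref = reference_derivation(e, f)  # reference derivation
                      if best != ref:
                          for name in set(best) | set(ref):
                              weights[name] = (weights.get(name, 0.0)
                                               + ref.get(name, 0.0) - best.get(name, 0.0))
              return weights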

  28. generating the reference

  29. Failure to Generate Reference
      • The reference translation may be anywhere in this box:
        [Figure: nested boxes: all English sentences ⊃ producible by the model ⊃ covered by the search]
      • If producible by the model → we can compute feature scores
      • If not → we cannot
