Sparse Feature Learning

  1. Sparse Feature Learning. Philipp Koehn, 1 March 2016

  2. Multiple Component Models: Translation Model, Language Model, Reordering Model

  3. Component Weights: [diagram of the Translation Model, Language Model, and Reordering Model, each component annotated with its weight, e.g. .05, .26, .19, .06, .21, .1, .04, .1]

  4. Even More Numbers Inside: [same diagram, now also showing individual model scores such as p(a | to) = 0.18, p(casa | house) = 0.35, p(azur | blue) = 0.77, p(la | the) = 0.32]

  5. Grand Vision
     • There are millions of parameters
       – each phrase translation score
       – each language model n-gram
       – etc.
     • Can we train them all discriminatively?
     • This implies optimization over the entire training corpus

  6. [Figure: tuning methods arranged along two axes, search space (n-best vs. iterative) and number of rule scores tuned (a handful, thousands, millions, learned from n-best lists or the aligned corpus); "our work" targets the millions end]

  7. [Figure, continued: places MERT [Och et al., 2003] at n-best search with a handful of rule scores]

  8. [Figure, continued: identical to slide 7]

  9. [Figure, continued: adds MIRA [Chiang, 2007], SampleRank [Haddow et al., 2011], and PRO [Hopkins/May, 2011]]

  10. [Figure, continued: adds MaxViolation [Yu et al., 2013] and Leave One Out [Wuebker et al., 2012]]

  11. Strategy and Core Problems
     • Process each sentence pair in the training corpus
     • Optimize parameters towards producing the reference translation
     • The reference translation may not be producible by the model
       – optimize towards the most similar translation
       – or process the sentence pair only partially
     • Avoid overfitting
     • Large corpora require efficient learning methods

  12. Sentence Level vs. Corpus Level Error Metric
     • Optimizing BLEU requires optimizing over the entire training corpus
       $$\text{BLEU}\Big(\big\{\, e_i^{\text{best}} = \arg\max_{e_i} \sum_j \lambda_j\, h_j(e_i, f_i) \,\big\},\ \big\{\, e_i^{\text{ref}} \,\big\}\Big)$$
     • Life would be easier if we could sum over sentence-level scores
       $$\sum_i \text{BLEU'}\Big(\arg\max_{e_i} \sum_j \lambda_j\, h_j(e_i, f_i),\ e_i^{\text{ref}}\Big)$$
     • For instance, BLEU+1
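
     As an illustration of such a sentence-level surrogate, here is a minimal sketch of an add-one smoothed sentence-level BLEU in the spirit of BLEU+1; the function names and the exact smoothing details are my assumptions, not taken from the slides.

     ```python
     import math
     from collections import Counter

     def ngrams(tokens, n):
         """All n-grams of a token list, as a Counter."""
         return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

     def sentence_bleu_plus1(hyp, ref, max_n=4):
         """Smoothed sentence-level BLEU: add-one smoothing on the n-gram
         precisions (BLEU+1 style), plus the usual brevity penalty."""
         hyp, ref = hyp.split(), ref.split()
         log_precision = 0.0
         for n in range(1, max_n + 1):
             hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
             matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
             total = sum(hyp_counts.values())
             # add-one smoothing keeps the score non-zero when one n-gram order has no match
             log_precision += math.log((matches + 1) / (total + 1))
         brevity = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
         return brevity * math.exp(log_precision / max_n)

     print(sentence_bleu_plus1("the house is blue", "the house is blue"))  # 1.0
     ```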

  13. features (section divider)

  14. Core Rule Properties
     • Frequency of phrase (binned; see the sketch after this slide)
     • Length of phrase
       – number of source words
       – number of target words
       – number of source and target words
     • Unaligned / added (content) words in the phrase pair
     • Reordering within the phrase pair
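
     A sketch of how the binned frequency property could be turned into a sparse indicator feature; the bin boundaries are illustrative, not the ones used in the slides.

     ```python
     def frequency_bin_feature(count):
         """Map a raw phrase-pair count to a binned indicator feature name (illustrative bins)."""
         for upper, label in [(1, "1"), (2, "2"), (3, "3"), (5, "4-5"), (10, "6-10")]:
             if count <= upper:
                 return f"freq_bin({label})"
         return "freq_bin(>10)"

     print(frequency_bin_feature(7))  # freq_bin(6-10)
     ```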

  15. Lexical Translation Features
     • lex(e) fires when an output word e is generated
     • lex(f, e) fires when an output word e is generated aligned to an input word f
     • lex(NULL, e) fires when an output word e is generated unaligned
     • lex(f, NULL) fires when an input word f is dropped
     • Could also be defined on part-of-speech tags or word classes
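
     A minimal sketch of extracting these indicator features from one word-aligned sentence pair; the data layout (alignment as a set of (source index, target index) pairs) and the feature name strings are assumptions for illustration.

     ```python
     from collections import defaultdict

     def lexical_features(src, tgt, alignment):
         """Count lexical indicator features for one aligned sentence pair.

         src, tgt: lists of tokens; alignment: set of (src_idx, tgt_idx) pairs.
         """
         feats = defaultdict(float)
         aligned_src = {i for i, _ in alignment}
         aligned_tgt = {j for _, j in alignment}
         for j, e in enumerate(tgt):
             feats[f"lex({e})"] += 1              # output word e is generated
             if j not in aligned_tgt:
                 feats[f"lex(NULL,{e})"] += 1     # e is generated unaligned
         for i, j in alignment:
             feats[f"lex({src[i]},{tgt[j]})"] += 1  # e is generated aligned to f
         for i, f in enumerate(src):
             if i not in aligned_src:
                 feats[f"lex({f},NULL)"] += 1     # input word f is dropped
         return feats
     ```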

  16. Lexicalized Reordering Features
     • Replacement of the lexicalized reordering model
     • Features differ by
       – lexicalized by first or last word of the phrase (source or target)
       – word representation replaced by word class
       – orientation type

  17. Domain Features
     • Indicator feature that the rule occurs in one specific domain
     • Probability that the rule belongs to one specific domain
     • Domain-specific lexical translation probabilities

  18. Syntax Features
     • If we have syntactic parse trees, many more features
       – number of nodes of a particular kind
       – matching of source and target constituents
       – reordering within syntactic constituents
     • Parse trees are a by-product of syntax-based models
     • More on that in future lectures

  19. Every Number in Model
     • Phrase pair indicator feature
     • Target n-gram feature
     • Phrase pair orientation feature
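
     A sketch of how a derivation's statistics could be exposed as sparse indicator features of this kind; the input representation and the feature name strings are assumptions for illustration.

     ```python
     from collections import defaultdict

     def derivation_indicator_features(phrase_pairs, target_words, orientations, n=2):
         """One indicator feature per phrase pair, target n-gram, and orientation event."""
         feats = defaultdict(float)
         for src_phrase, tgt_phrase in phrase_pairs:
             feats[f"pp({src_phrase} ||| {tgt_phrase})"] += 1
         for i in range(len(target_words) - n + 1):
             feats[f"ngram({' '.join(target_words[i:i + n])})"] += 1
         for (src_phrase, tgt_phrase), orientation in orientations:
             feats[f"orient({orientation}, {src_phrase} ||| {tgt_phrase})"] += 1
         return feats
     ```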

  20. perceptron algorithm (section divider)

  21. Optimizing Linear Model
     • We consider each sentence pair (e_i, f_i) and its alignment a_i
     • To simplify notation, we define the derivation d_i = (e_i, f_i, a_i)
     • The model score is a weighted linear combination of feature values h_j and weights λ_j (a small scoring sketch follows after this slide)
       $$\text{score}(\lambda, d_i) = \sum_j \lambda_j\, h_j(d_i)$$
     • Such models are also known as single-layer perceptrons
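
     For sparse features, the sum only has to run over features that actually fire. A minimal sketch with feature vectors stored as dicts; the names and example values are illustrative.

     ```python
     def model_score(weights, features):
         """score(λ, d) = Σ_j λ_j · h_j(d), with h stored sparsely as {feature name: value}."""
         return sum(weights.get(name, 0.0) * value for name, value in features.items())

     print(model_score({"lex(haus,house)": 0.5, "freq_bin(1)": -0.25},
                       {"lex(haus,house)": 1.0, "freq_bin(1)": 1.0}))  # 0.25
     ```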

  22. Reference and Model Best
     • Besides the reference derivation d_i^ref for sentence pair i and its score
       $$\text{score}(\lambda, d_i^{\text{ref}}) = \sum_j \lambda_j\, h_j(d_i^{\text{ref}})$$
     • we also have the model-best translation
       $$d_i^{\text{best}} = \arg\max_{d}\ \text{score}(\lambda, d) = \arg\max_{d} \sum_j \lambda_j\, h_j(d)$$
     • ... and its model score
       $$\text{score}(\lambda, d_i^{\text{best}}) = \sum_j \lambda_j\, h_j(d_i^{\text{best}})$$
     • We can view the error of our model as a function of its parameters λ
       $$\text{error}(\lambda, d_i^{\text{best}}, d_i^{\text{ref}}) = \text{score}(\lambda, d_i^{\text{best}}) - \text{score}(\lambda, d_i^{\text{ref}})$$

  23. Follow the Direction of the Gradient
     [Figure: two error(λ) curves; where the gradient at λ_current is negative (e.g. −2), we need to move right towards λ_optimal; where it is positive (e.g. 1), we need to move left]
     • Assume that we can compute the gradient d/dλ error(λ) at any point
     • If the error curve is convex, the gradient points in the direction of the optimum

  24. Move Relative to Steepness
     [Figure: three error(λ) curves with gradients of 2 (steep), 1 (medium), and 0.2 (flat) at λ_current: move a lot, move some, move little]
     • If the error curve is convex, the size of the gradient indicates the speed of change
     • Model update
       $$\Delta\lambda = -\frac{d}{d\lambda}\, \text{error}(\lambda)$$

  25. Stochastic Gradient Descent
     • We want to minimize the error
       $$\text{error}(\lambda, d_i^{\text{best}}, d_i^{\text{ref}}) = \text{score}(\lambda, d_i^{\text{best}}) - \text{score}(\lambda, d_i^{\text{ref}})$$
     • In stochastic gradient descent, we follow the direction of the gradient
       $$\frac{d}{d\lambda}\, \text{error}(\lambda, d_i^{\text{best}}, d_i^{\text{ref}})$$
     • For each λ_j, we compute the gradient pointwise
       $$\frac{d}{d\lambda_j}\, \text{error}(\lambda_j, d_i^{\text{best}}, d_i^{\text{ref}}) = \frac{d}{d\lambda_j}\, \Big( \text{score}(\lambda, d_i^{\text{best}}) - \text{score}(\lambda, d_i^{\text{ref}}) \Big)$$

  26. Stochastic Gradient Descent
     • Gradient with respect to λ_j
       $$\frac{d}{d\lambda_j}\, \text{error}(\lambda_j, d_i^{\text{best}}, d_i^{\text{ref}}) = \frac{d}{d\lambda_j}\, \Big( \sum_{j'} \lambda_{j'}\, h_{j'}(d_i^{\text{best}}) - \sum_{j'} \lambda_{j'}\, h_{j'}(d_i^{\text{ref}}) \Big)$$
     • For λ_{j'} ≠ λ_j, the terms λ_{j'} h_{j'}(d_i) are constant, so they disappear
       $$\frac{d}{d\lambda_j}\, \text{error}(\lambda_j, d_i^{\text{best}}, d_i^{\text{ref}}) = \frac{d}{d\lambda_j}\, \Big( \lambda_j\, h_j(d_i^{\text{best}}) - \lambda_j\, h_j(d_i^{\text{ref}}) \Big)$$
     • The derivative of a linear function is its factor
       $$\frac{d}{d\lambda_j}\, \text{error}(\lambda_j, d_i^{\text{best}}, d_i^{\text{ref}}) = h_j(d_i^{\text{best}}) - h_j(d_i^{\text{ref}})$$
     • ⇒ Our model update is
       $$\lambda_j^{\text{new}} = \lambda_j - \big( h_j(d_i^{\text{best}}) - h_j(d_i^{\text{ref}}) \big)$$
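
     A tiny worked instance of this update, with numbers invented for illustration: take
     $$\lambda_j = 0.5, \qquad h_j(d_i^{\text{best}}) = 2, \qquad h_j(d_i^{\text{ref}}) = 3,$$
     then
     $$\lambda_j^{\text{new}} = 0.5 - (2 - 3) = 1.5,$$
     so a feature that fires more often in the reference than in the model best has its weight increased, which is exactly the intuition on the next slide.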

  27. Intuition
     • Feature values in the model-best translation
     • Feature values in the reference translation
     • Intuition:
       – promote features whose value is bigger in the reference
       – demote features whose value is bigger in the model best

  28. Algorithm
     Input: set of sentence pairs (e, f), set of features
     Output: a weight λ_i for each feature

       λ_i = 0 for all i
       while not converged do
         for all foreign sentences f do
           d_best = best derivation according to the model
           d_ref = reference derivation
           if d_best ≠ d_ref then
             for all features h_i do
               λ_i += h_i(d_ref) − h_i(d_best)
             end for
           end if
         end for
       end while
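
     A compact Python rendering of this loop, as a sketch only: decode and reference_features stand in for the decoder and forced-decoding components the slides do not spell out, and a fixed number of epochs approximates "while not converged".

     ```python
     from collections import defaultdict

     def train_perceptron(corpus, decode, reference_features, epochs=10):
         """Structured perceptron over sparse MT features.

         corpus: iterable of (f, e) sentence pairs.
         decode(f, weights): sparse feature dict of the model-best derivation (assumed helper).
         reference_features(f, e): sparse feature dict of the reference derivation (assumed helper).
         """
         weights = defaultdict(float)            # λ_i = 0 for all i
         for _ in range(epochs):                 # stand-in for "while not converged"
             for f, e in corpus:
                 h_best = decode(f, weights)
                 h_ref = reference_features(f, e)
                 if h_best != h_ref:
                     # promote features of the reference, demote those of the model best
                     for name in set(h_best) | set(h_ref):
                         weights[name] += h_ref.get(name, 0.0) - h_best.get(name, 0.0)
         return weights
     ```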

  29. generating the reference (section divider)
