
Optimal Beam Search for Machine Translation

Alexander M. Rush, Yin-Wen Chang
MIT CSAIL
Cambridge, MA 02139, USA
{srush, yinwen}@csail.mit.edu

Michael Collins
Department of Computer Science, Columbia University
New York, NY 10027, USA
mcollins@cs.columbia.edu


Abstract

Beam search is a fast and empirically effective method for translation decoding, but it lacks formal guarantees about search error. We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. The new method is efficient, utilizes standard MT algorithms, and returns an exact solution on the majority of translation examples in our test data. The algorithm is 3.5 times faster than an optimized incremental constraint-based decoder for phrase-based translation and 4 times faster for syntax-based translation.

1 Introduction

Beam search (Koehn et al., 2003) and cube pruning (Chiang, 2007) have become the de facto decoding algorithms for phrase- and syntax-based translation. The algorithms are central to large-scale machine translation systems due to their efficiency and tendency to produce high-quality translations (Koehn, 2004; Koehn et al., 2007; Dyer et al., 2010). However, despite their practical effectiveness, neither algorithm provides any bound on possible decoding error.

In this work we present a variant of beam search decoding for phrase- and syntax-based translation. The motivation is to exploit the effectiveness and efficiency of beam search, but still maintain formal guarantees. The algorithm has the following benefits:

• In theory, it can provide a certificate of optimality; in practice, we show that it produces optimal hypotheses, with certificates of optimality, on the vast majority of examples.

• It utilizes well-studied algorithms and extends off-the-shelf beam search decoders.

• Empirically it is very fast: results show that it is 3.5 times faster than an optimized incremental constraint-based solver.

While our focus is on fast decoding for machine translation, the algorithm we present can be applied to a variety of dynamic programming-based decoding problems. The method relies only on having a constrained beam search algorithm and a fast unconstrained search algorithm. Similar algorithms exist for many NLP tasks.

We begin in Section 2 by describing constrained hypergraph search and showing how it generalizes translation decoding. Section 3 introduces a variant of beam search that is, in theory, able to produce a certificate of optimality. Section 4 shows how to improve the effectiveness of beam search by using weights derived from Lagrangian relaxation. Section 5 puts everything together to derive a fast beam search algorithm that is often optimal in practice. Experiments compare the new algorithm with several variants of beam search, cube pruning, A* search, and relaxation-based decoders on two translation tasks. The optimal beam search algorithm is able to find exact solutions with certificates of optimality on 99% of translation examples, significantly more than the other baselines. Additionally, the optimal beam search algorithm is much faster than other exact methods.

Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 210–221, Seattle, Washington, USA, 18–21 October 2013. © 2013 Association for Computational Linguistics

2 Background

The focus of this work is decoding for statistical machine translation. Given a source sentence, the goal is to find the target sentence that maximizes a combination of translation model and language model scores. In order to analyze this decoding problem, we first abstract away from the specifics of translation into a general form, known as a hypergraph. In this section, we describe the hypergraph formalism and its relation to machine translation.

2.1 Notation

Throughout the paper, scalars and vectors are written in lowercase, matrices are written in uppercase, and sets are written in script-case, e.g. X. All vectors are assumed to be column vectors. The function δ(j) yields an indicator vector with δ(j)_j = 1 and δ(j)_i = 0 for all i ≠ j.

2.2 Hypergraphs and Search

A directed hypergraph is a pair (V, E) where V = {1 ... |V|} is a set of vertices and E is a set of directed hyperedges. Each hyperedge e ∈ E is a tuple ⟨⟨v_2, ..., v_|v|⟩, v_1⟩ where v_i ∈ V for i ∈ {1 ... |v|}. The head of the hyperedge is h(e) = v_1. The tail of the hyperedge is the ordered sequence t(e) = ⟨v_2, ..., v_|v|⟩. The size of the tail |t(e)| may vary across different hyperedges, but |t(e)| ≥ 1 for all edges and is bounded by a constant. A directed graph is a directed hypergraph with |t(e)| = 1 for all edges e ∈ E.

Each vertex v ∈ V is either a non-terminal or a terminal in the hypergraph. The set of non-terminals is N = {v ∈ V : h(e) = v for some e ∈ E}. Conversely, the set of terminals is defined as T = V \ N.

All directed hypergraphs used in this work are acyclic: informally this implies that no hyperpath (as defined below) contains the same vertex more than once (see Martin et al. (1990) for a full definition). Acyclicity implies a partial topological ordering of the vertices. We also assume there is a distinguished root vertex 1 where for all e ∈ E, 1 ∉ t(e).

Next we define a hyperpath as x ∈ {0, 1}^|E| where x(e) = 1 if hyperedge e is used in the hyperpath and x(e) = 0 otherwise. The set of valid hyperpaths is defined as

    X = { x : Σ_{e ∈ E : h(e) = 1} x(e) = 1,
              Σ_{e : h(e) = v} x(e) = Σ_{e : v ∈ t(e)} x(e)  ∀ v ∈ N \ {1} }

The first problem we consider is unconstrained hypergraph search. Let θ ∈ R^|E| be the weight vector for the hypergraph and let τ ∈ R be a weight offset.¹ The unconstrained search problem is to find

    max_{x ∈ X} θ⊤x + τ  =  max_{x ∈ X} Σ_{e ∈ E} θ(e) x(e) + τ

This maximization can be computed for any weights and directed acyclic hypergraph in time O(|E|) using dynamic programming. Figure 1 shows this algorithm, which is simply a version of the CKY algorithm.

    procedure BESTPATHSCORE(θ, τ)
      π[v] ← 0 for all v ∈ T
      for e ∈ E in topological order do
        ⟨⟨v_2, ..., v_|v|⟩, v_1⟩ ← e
        s ← θ(e) + Σ_{i=2}^{|v|} π[v_i]
        if s > π[v_1] then π[v_1] ← s
      return π[1] + τ

    Figure 1: Dynamic programming algorithm for unconstrained hypergraph search. Note that this version only returns the highest score: max_{x ∈ X} θ⊤x + τ. The optimal hyperpath can be found by including back-pointers.

Next consider a variant of this problem: constrained hypergraph search. Constraints will be necessary for both phrase- and syntax-based decoding. In phrase-based models, the constraints will ensure that each source word is translated exactly once. In syntax-based models, the constraints will be used to intersect a translation forest with a language model. In the constrained hypergraph problem, hyperpaths must fulfill additional linear hyperedge constraints. Define the set of constrained hyperpaths as

    X′ = { x ∈ X : Ax = b }

¹ The purpose of the offset will be clear in later sections. For this section, the value of τ can be taken as 0.
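As a concrete sketch of the Figure 1 recurrence (not the authors' implementation), the unconstrained search can be written in a few lines of Python. The representation is an assumption: each hyperedge is a (tail-tuple, head) pair, the edge list is supplied in topological order, and θ is a dictionary mapping hyperedges to weights. Terminals are initialized to 0 and non-terminals to −∞, matching the pseudocode's π table.

```python
def best_path_score(vertices, edges, theta, tau=0.0):
    """Viterbi-style dynamic program of Figure 1 (score only, no back-pointers).

    edges: list of hyperedges (tail, head), tail a tuple of vertex ids,
           in topological order over the acyclic hypergraph.
    theta: dict mapping each hyperedge to its weight theta(e).
    Vertex 1 is the distinguished root; returns max over hyperpaths
    of theta'x + tau.
    """
    heads = {head for _, head in edges}          # non-terminals N
    # pi[v] = best score of a sub-hyperpath rooted at v
    pi = {v: (float("-inf") if v in heads else 0.0) for v in vertices}
    for tail, head in edges:                     # topological order
        s = theta[(tail, head)] + sum(pi[v] for v in tail)
        if s > pi[head]:
            pi[head] = s
    return pi[1] + tau

# Toy hypergraph: terminals {2, 3}, root 1, two competing hyperedges.
edges = [((2, 3), 1), ((2,), 1)]
theta = {((2, 3), 1): 1.5, ((2,), 1): 0.5}
print(best_path_score([1, 2, 3], edges, theta))  # 1.5
```

Each hyperedge is visited once and its tail summed in constant-bounded time, giving the O(|E|) runtime claimed above; the constrained problem X′ cannot be solved this way, which is what motivates the rest of the paper.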
