Constrained Recombination in an Example-based Machine Translation System

Monica Gavrila
University of Hamburg, Faculty of Mathematics, Informatics and Natural Sciences
Vogt-Koelln Str. 30, 22527, Hamburg, Germany
gavrila@informatik.uni-hamburg.de

Abstract

Constraints in natural language processing play an important role. In this paper we show what impact (word-order) constraints have on the translation results when they are applied in the recombination step of a linear EBMT system. Both the baseline EBMT system and the constrained one are implemented during this research. In the experiments we use two language pairs (Romanian-English and Romanian-German), in both directions of translation. In these language constellations, Romanian, an inflected language with a Latin root, is considered under-resourced. This aspect makes the process of translation even more challenging.

1 Introduction

Machine translation (MT), one of the most challenging domains in Natural Language Processing (NLP), plays an important role in ensuring global communication. Documents in various domains need to be translated in a large combination of language pairs. As it is quite often hard to find the right human translators, with the right domain and language knowledge, MT can be considered, at least for these cases, a solution.

Less spoken languages have to overcome a major gap in language resources and tools, which all ensure the development of a good MT system. Moreover, some of these under-resourced languages are highly inflected, with a more complicated grammar, and often have linguistic phenomena which have not been encountered in previous language combinations. On the other hand, exactly for these languages, human translators are few or missing, so MT systems are highly required.

Based mainly on the existence of a parallel corpus, which does not necessarily have to include a large number of examples1, example-based machine translation (EBMT) seems to be a solution for under-resourced languages. This MT approach, which has its start in Nagao's work (Nagao, 1984), is essentially translation by analogy. The basic premise is that, if a previously translated sentence occurs again, the same translation is likely to be correct again.

Constraints in natural language processing play an important role, such as in constraint-based grammars. Constraints usually restrict the possible values that a variable (or a feature) may take with respect to certain rules. In MT, they have been used, for example, in the SMT approach: (Canisius and van den Bosch, 2009), (Cao and Sumita, 2010).

In this paper we explore how (word-order) constraints can be used in a linear EBMT system. As we employ an under-resourced language (i.e. Romanian), we keep the systems as resource-free as possible. The algorithms are mainly based on surface forms and corpus statistics. That is why our EBMT systems borrow ideas only from the linear and template-based EBMT approaches.

1 In contrast to statistical MT (SMT).
We investigate two language pairs, Romanian-English and Romanian-German, in both directions of translation. The under-resourced language we consider in this work is Romanian, as when starting this work sufficient linguistic resources were not publicly available or, when available, they were under-developed or not sufficiently tested in comparison with the other two languages. Furthermore, there was also no real possibility of choosing among several resources, as, when available, only one resource was at hand. The use of the German-Romanian language pair raises interesting questions, as most example-based translation systems consider English as a source/target language (SL/TL), which has a simpler syntax and morphology. Romanian and German, both inflected languages, present language-specific characteristics (morphological and syntactical) that make the process of translation even more challenging.

After this short introduction, the following two sections present both EBMT systems we implemented: the baseline EBMT system Lin-EBMT and the constrained system Lin-EBMT_REC+. In section 4 we describe the data used and the translation results. The results obtained by Lin-EBMT are compared with the ones provided by different constraint settings applied in Lin-EBMT_REC+. In all the experiments the same training and test data are used. The paper ends with conclusions and further work.

2 Lin-EBMT, the Baseline EBMT System

In this section we describe Lin-EBMT, the baseline EBMT system implemented during the research. Lin-EBMT is a linear EBMT system, in the sense of the system classification found in (McTait, 2001). It is based on surface forms and uses no linguistic resources, with the exception of the parallel corpus. It contains all three steps of an EBMT system2: matching, alignment and recombination.

Before starting the translation, training and test data are tokenized and lowercased. In order to reduce the search space in the matching process, we use a word index. This approach has already been encountered in the literature, for example in (Sumita and Iida, 1991). The matching procedure is run only after the search space size has been decreased. If during the matching procedure the test sentence is found in the corpus, its translation represents the output. Otherwise, the translation steps described in the following subsections are performed.

Matching the Input

The matching procedure is a linguistically light approach, focused on finding common substrings. Since a longer common subsequence lowers the probability of boundary friction problems, the longest common subsequence (LCS) is considered. The procedure is based on the Longest Common Subsequence Similarity (LCSS) measure, which we implemented using a dynamic programming algorithm similar to the one found in (Bergroth et al., 2000). The initial character-based LCS algorithm is transformed into a token3-based one. A penalty P = 0.01 is introduced for each token-gap between the input and the matched sentence. In this way, the sentence that covers the input with the fewest token-gaps is chosen. This increases the chance that only a minimal number of sequences needs to be recombined to form the output. This approach can decrease the appearance of boundary friction and word-order problems.

The matching score is calculated as follows: given the input I and a sentence S from the example database, LCSS is calculated as

LCSS(I, S) = LCSS_T(I, S) - P * noTG,    (1)

where

LCSS_T(I, S) = Length(LCS(I, S)) / Length(I),    (2)

LCS(I, S) is the LCS between I and S, Length(x) is the number of tokens of a string x, and noTG is the number of token-gaps of the LCSS(I, S) when compared with I.

For example, considering the sentences

Input s1 = "Saving names and phone numbers ( Add name )"
Sentence in the corpus s2 = "Erasing names and numbers"

the longest common subsequence LCS(s1, s2) is "names and numbers".
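To make the scoring concrete, the sketch below illustrates one possible reading of equations (1) and (2) on the example above: a token-based LCS computed with a standard dynamic-programming table, with noTG counted as the number of gaps between consecutive matched tokens of the input. The programming language (Python), the whitespace tokenization and the gap-counting convention are our assumptions for illustration, not details taken from the original system.

# Illustrative sketch, not the original Lin-EBMT implementation.
def tokenize(sentence):
    # Whitespace tokenization as a stand-in; the data are lowercased
    # and tokenized before translation.
    return sentence.lower().split()

def lcs_tokens(a, b):
    # Token-based longest common subsequence via the standard
    # dynamic-programming table.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to recover one LCS and the input positions it covers.
    lcs, positions = [], []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            lcs.append(a[i - 1])
            positions.append(i - 1)
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return lcs[::-1], positions[::-1]

def count_token_gaps(positions):
    # noTG: gaps between consecutive matched tokens of the input
    # (one possible interpretation of the token-gap count).
    return sum(1 for p, q in zip(positions, positions[1:]) if q - p > 1)

def lcss(input_tokens, candidate_tokens, penalty=0.01):
    # LCSS(I, S) = Length(LCS(I, S)) / Length(I) - P * noTG,
    # as in equations (1) and (2).
    lcs, positions = lcs_tokens(input_tokens, candidate_tokens)
    coverage = len(lcs) / len(input_tokens)
    return coverage - penalty * count_token_gaps(positions)

s1 = tokenize("Saving names and phone numbers ( Add name )")
s2 = tokenize("Erasing names and numbers")
lcs, _ = lcs_tokens(s1, s2)
print(" ".join(lcs))             # names and numbers
print(round(lcss(s1, s2), 4))    # 0.3233

Under these assumptions, s1 and s2 give LCSS_T = 3/9 (three of the nine input tokens are covered) and one token-gap (the skipped token "phone"), so the score is approximately 0.32.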
Given the input and the example database, the matching procedure gives as output the sentences that best cover the input. The algorithm tries to match the input with an entry in the corpus and, in case this is not possible, to match parts of the input with (parts of) the sentences in the corpus. The matching algorithm is recursive and follows the steps enumerated below:

1. Find the sentence in the corpus that best matches the input, by using the similarity measure previously described. Keep it as part of

2 The steps of an EBMT system – matching, alignment and recombination – are first described in (Nagao, 1984) and specifically presented under these names in (Somers, 1999).
3 A token can be a word-form, a number, a punctuation sign, etc.