Machine Translation Overview
Marcello Federico
FBK-irst Trento, Italy
2011

M. Federico, FBK-irst SMT - Part 1 2011

Outline
• Introduction
• Approaches
• Brief history
• Evaluation
• State-of-the-art
• Examples

References:
• Philipp Koehn, Statistical Machine Translation, Cambridge University Press, 2009.
• Daniel Jurafsky and James H. Martin, Speech and Language Processing, Second Edition, Prentice Hall, 2009.
• Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
Machine Translation

Wikipedia
Machine translation, often referred to by the acronym MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another.

Preferred Definition
MT investigates the translation of "standard" language that can be systematically observed in ordinary communication (e.g. conversations, news, speeches, business letters, user manuals). MT as a discipline is not interested in the translation of literary genres that express creative and sophisticated use of language. For several reasons, this kind of language is simply outside the scope of MT. (1)

(1) For a very interesting introduction to issues related to the translation of literary works, see Umberto Eco, "Experiences in Translation", U. Toronto Press, 2001.

Introduction to MT

Why is Machine Translation so Difficult?

High-quality human translation implies:
• deep and rich understanding of source language and text
• sophisticated and creative command of target language

Nowadays, the only feasible goals for machine translation are tasks:
• for which a rough translation is adequate (gist translation)
• where a human post-editor can improve MT output (computer-assisted translation, CAT)
• focusing on small linguistic domains (translators on PDAs)

In general, the difficulty of translating depends on how similar the target and source languages are in their vocabulary, grammar, and conceptual structure.
Differences and Similarities of Languages
• Universal communicative role of language
  – names for people, words for talking about women, men, children
  – every language seems to have nouns and verbs
• Differences/similarities across large classes of languages:
  – Morphological: one vs. many morphemes per word, agglutination vs. fusion
  – Syntactic: Subj-Verb-Obj structure (E) vs. SOV (J) vs. VSO (Irish)
  – Semantic: direction/manner of motion indicated by verb/satellites
    the bottle floated out (E) → la botella salió flotando (S)
• Lexical divergences between languages:
  – Semantic: there is no corresponding word with the same meaning
    wall (E) → Wand / Mauer (G, inside/outside)
  – Syntactic: a word is better translated into another part of speech
    she likes to sing (E, v) → sie singt gerne (G, adv)
• Cultural differences: philosophical argument = is translation possible at all?

Lexical Divergences
English brother → Japanese otooto (younger), aniisan (older)
English is → Japanese iru (subject animate), aru (subject not animate)
English know → French connaître (be acquainted with), savoir (know a proposition)
English they → French ils (masculine), elles (feminine)
German Berg → English hill, mountain
Difficult Translations
• There is no way to translate the French word "bois" without doubt

  Italian   French   German   English
  albero    arbre    Baum     tree
  legno     bois     Holz     timber
  bosco     bois     Wald     wood
  foresta   forêt    Wald     forest

• Translate "And God called the light Day" with popular MT engines (1)
  Babelfish: Y dios llamado el día ligero
  Google: Y llamó Dios a la luz Día
  Reverso: Y Dios llamó el Día ligero (de luz)
  None got the right sense, but Reverso got a right one!

(1) Tried on 2nd March 2011

Difficult Translations
Source: John visita ogni giorno sua sorella Ann per vedere suo nipote Sam

Problems:
• Italian "nipote" corresponds to English "nephew", "niece", or "grandchild".
• Moreover, in English, the possessive adjective agrees with the gender of the owner, while in Italian it agrees with the gender of the owned object.

Hence, legal English translations are:
English 1: John visits every day his sister Ann to see his nephew Sam
English 2: John visits every day his sister Ann to see her nephew Sam
English 3: John visits every day his sister Ann to see her grandchild Sam
English 4: John visits every day his sister Ann to see his grandchild Sam
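The four legal translations arise from two independent choices: whose relative Sam is (his or her) and which kinship sense of "nipote" applies. The sketch below enumerates them as a Cartesian product. The word lists are assumptions made for illustration; in particular, "niece" is left out on the assumption that Sam is male, which the slide seems to imply but does not state.

```python
from itertools import product

# Assumed toy choices, not a real lexicon.
# Italian "suo" does not reveal the gender of the owner.
POSSESSIVES = ["his", "her"]
# Senses of "nipote" kept here; "niece" is excluded by assumption (Sam is male).
NOUNS = ["nephew", "grandchild"]


def candidate_translations():
    """Return every combination of possessive and kinship noun."""
    prefix = "John visits every day his sister Ann to see"
    return [
        f"{prefix} {poss} {noun} Sam"
        for poss, noun in product(POSSESSIVES, NOUNS)
    ]


for number, sentence in enumerate(candidate_translations(), start=1):
    print(f"Candidate {number}: {sentence}")
```

Without knowledge of the family relations, a system has no basis for preferring one of these candidates over the others.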
Approaches to MT
Rough classification according to the linguistic representations employed:
• Direct model: translate and re-order single words or n-grams
  – basically, no linguistic representation is used
• Transfer model: use explicit knowledge about language differences
  – analyze lexical and syntactic structure of source sentence
  – transfer structures from source to target language
  – generate corresponding sentence in the target language
• Interlingua model: extract the meaning and express it in the target language
  – analyze lexical, syntactic and semantic structure of source sentence
  – interpret the meaning into a canonical interlingua
  – generate the target sentence from the interlingua

Note: the knowledge required by the interlingua approach grows linearly with the number of languages, rather than quadratically.

Vauquois's Triangle
[Figure: Vauquois's triangle. Analysis ascends on the source side from Source String through Syntax and Semantics to the Interlingua at the apex; Generation descends on the target side through Semantics and Syntax to the Target String. Direct translation connects the two strings at the base; Transfer connects the intermediate levels.]
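The remark on knowledge growth in the interlingua approach can be made concrete with a simple counting argument. The sketch below assumes one transfer component per directed language pair, and one analyzer plus one generator per language for the interlingua. This counting scheme is an illustration, not something stated on the slides.

```python
def transfer_modules(n_languages: int) -> int:
    """Directed language pairs, each needing its own transfer component."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages: int) -> int:
    """One analyzer into, and one generator out of, the interlingua per language."""
    return 2 * n_languages


for n in (2, 5, 10, 20):
    print(f"{n} languages: transfer={transfer_modules(n)}, "
          f"interlingua={interlingua_modules(n)}")
```

Under these assumptions, adding one language to an interlingua system adds two components, while a transfer system needs a new component for every existing language in each direction.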
Approaches to MT
How are knowledge and linguistic information acquired by the system?
• Hand-crafted: knowledge for analysis, transfer, generation, meaning representation, or direct translation is manually developed
  – most commercial MT systems fall into this category
  – requires lots of human labor and expertise
  – includes: rule-based MT
• Machine-learned: representations are implemented by mathematical models learnable from data, e.g. parallel corpora of human translations
  – much less human effort is needed
  – requires huge amounts of data: the more, the better!
  – includes: statistical MT and example-based MT

Transfer-Based MT

context-free grammar      synchronous context-free grammar
NP  → DT NPB              NP  → DT[1] NPB[2] / DT[1] NPB[2]
NPB → JJ NN               NPB → JJ[1] NN[2] / NN[2] JJ[1]
NPB → NN                  NPB → NN / NN
...                       ...
DT  → the                 DT  → the / il
JJ  → north               JJ  → north / settentrionale
NN  → wind                NN  → wind / vento
...                       ...

Paired parse trees:
English: (NP (DT the) (NPB (JJ north) (NN wind)))
Italian: (NP (DT il) (NPB (NN vento) (JJ settentrionale)))
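The synchronous grammar above can be read as a recipe for rewriting an English parse tree into an Italian one: each rule says how to reorder the children of a node, and each lexical rule says how to translate a word. The following is a minimal sketch of that idea, assuming the English parse tree is already given (no parser is included). The names REORDER, LEXICON, transfer and tree_yield are hypothetical helpers introduced only for this illustration.

```python
# Reordering part of the synchronous rules from the slide:
# (parent label, source child labels) -> order of children on the target side.
REORDER = {
    ("NP", ("DT", "NPB")): (0, 1),   # NP  -> DT[1] NPB[2] / DT[1] NPB[2]
    ("NPB", ("JJ", "NN")): (1, 0),   # NPB -> JJ[1] NN[2]  / NN[2] JJ[1]
    ("NPB", ("NN",)): (0,),          # NPB -> NN / NN
}

# Lexical part of the synchronous rules: (tag, English word) -> Italian word.
LEXICON = {
    ("DT", "the"): "il",
    ("JJ", "north"): "settentrionale",
    ("NN", "wind"): "vento",
}


def transfer(tree):
    """Map a source tree to a target tree.

    A leaf is (tag, word); an inner node is (label, [children]).
    """
    label, body = tree
    if isinstance(body, str):
        return (label, LEXICON[(label, body)])
    child_labels = tuple(child[0] for child in body)
    order = REORDER[(label, child_labels)]
    return (label, [transfer(body[i]) for i in order])


def tree_yield(tree):
    """Collect the words at the leaves, left to right."""
    _, body = tree
    if isinstance(body, str):
        return [body]
    words = []
    for child in body:
        words.extend(tree_yield(child))
    return words


source_tree = ("NP", [("DT", "the"),
                      ("NPB", [("JJ", "north"), ("NN", "wind")])])
target_tree = transfer(source_tree)
print(" ".join(tree_yield(target_tree)))  # il vento settentrionale
```

The adjective-noun swap happens only in the NPB rule, so the determiner stays in place while "north wind" becomes "vento settentrionale".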
Transfer-Based MT
Note: the example shown is clearly a simplification. Working approaches use a very large number of probabilistic and lexicalized rules.

Interlingua-Based MT
• Applied to linguistic domains with a limited number of relations and concepts
  – tourist information, hotel booking, flight reservation, ...
• Semantics of a sentence can be expressed with predicate-argument structure
  – I need a twin bed room reservation for tomorrow
  – book-room(date=tomorrow,type=twin)
• Interlingua language has to be designed carefully (by hand)
  – for some applications, a formalism similar to the SQL language
• Processing steps in IBMT:
  – extract content from source sentence
  – map content into SQL-like IL format
  – generate translation from IL format
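The three IBMT processing steps can be sketched on the room-booking sentence. The code below is only a toy under stated assumptions: the keyword rules, the frame layout, and the Italian templates are all hypothetical and hand-written for this one domain, and the room type is read off the sentence itself (twin).

```python
import re

# Hypothetical Italian wording for the toy hotel-booking domain.
ROOM_IT = {"single": "camera singola",
           "twin": "camera a due letti",
           "double": "camera matrimoniale"}
DATE_IT = {"today": "oggi", "tomorrow": "domani"}


def extract_frame(sentence):
    """Steps 1-2: extract content and map it to a predicate-argument frame."""
    text = sentence.lower()
    frame = {"predicate": None, "args": {}}
    if "reservation" in text or "book" in text:
        frame["predicate"] = "book-room"
    room = re.search(r"\b(single|twin|double)\b", text)
    if room:
        frame["args"]["type"] = room.group(1)
    date = re.search(r"\b(today|tomorrow)\b", text)
    if date:
        frame["args"]["date"] = date.group(1)
    return frame


def format_frame(frame):
    """Render the frame in the predicate(arg=value,...) notation of the slide."""
    args = ",".join(f"{k}={v}" for k, v in sorted(frame["args"].items()))
    return f"{frame['predicate']}({args})"


def generate_italian(frame):
    """Step 3: generate a target sentence from the interlingua frame."""
    if frame["predicate"] != "book-room":
        raise ValueError("unsupported predicate")
    room = ROOM_IT[frame["args"]["type"]]
    day = DATE_IT[frame["args"]["date"]]
    return f"Vorrei prenotare una {room} per {day}"


frame = extract_frame("I need a twin bed room reservation for tomorrow")
print(format_frame(frame))
print(generate_italian(frame))
```

Because generation sees only the frame, the output is a paraphrase of the input rather than a word-for-word rendering, which matches the behavior of the interlingua examples in these slides.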
Interlingua-Based MT
Legend: S = speech (English), I = interlingua, T = translation (English)

• S: I'm arriving on june sixth
• I: give-information+temporal+arrival (who=I, time=(june, md6))
• T: my arrival time is sixth of june

• S: no that's not necessary
• I: negate
• T: no

• S: and i was wondering what you have in the way of rooms available during that time
• I: request-information+availability+room (room-type=question)
• T: what kind of rooms are available?

Example-Based MT
• Assumption: people translate by analogy
  – Decompose a sentence into phrases
  – Translate phrases by analogy to previous translations
  – Properly compose translation fragments into one long sentence
• Given a parallel corpus of translation examples

  Italian                                    German
  sono possibili deboli nevicate             leichte Schneefälle sind möglich
  sono possibili alcuni rovesci              ein paar Regenschauer sind möglich
  le deboli precipitazioni cesseranno        die leichten Niederschläge klingen ab
  si verificheranno deboli precipitazioni    leichte Niederschläge werden einsetzen

• Learn translation patterns

  sono possibili X        →  X sind möglich
  deboli precipitazioni   →  leichte Niederschläge
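Once such patterns are available, translating a new sentence amounts to matching a template and filling its slot with a known phrase translation. The sketch below applies the two patterns from the slide to a sentence that does not occur in the example corpus. It covers only pattern application, not pattern learning, and the helper names TEMPLATES, PHRASES, template_to_regex and translate are hypothetical.

```python
import re

# The two patterns shown on the slide.
TEMPLATES = [("sono possibili X", "X sind möglich")]
PHRASES = {"deboli precipitazioni": "leichte Niederschläge"}


def template_to_regex(source_template):
    """Turn 'sono possibili X' into a regex that captures the X slot."""
    parts = [re.escape(part) for part in source_template.split("X")]
    return re.compile("^" + "(.+)".join(parts) + "$")


def translate(sentence):
    """Translate by analogy; return None if no known pattern applies."""
    if sentence in PHRASES:
        return PHRASES[sentence]
    for source_template, target_template in TEMPLATES:
        match = template_to_regex(source_template).match(sentence)
        if match:
            filler = translate(match.group(1))
            if filler is not None:
                return target_template.replace("X", filler)
    return None


print(translate("sono possibili deboli precipitazioni"))
# leichte Niederschläge sind möglich
```

The input sentence combines a template seen with other fillers and a phrase seen in other contexts, which is the composition step that the analogy assumption relies on.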