Machine Translation Overview

Marcello Federico, FBK-irst, Trento


  1. Machine Translation Overview
     Marcello Federico, FBK-irst, Trento, Italy, 2013
     M. Federico, FBK-irst SMT - Part 1 2013

     Outline
     • Introduction
     • Motivation
     • Approaches
     • Brief history
     • Evaluation
     • State-of-the-art
     • Examples

     References:
     • P. Koehn, Statistical Machine Translation, Cambridge University Press, 2009.
     • A. Lopez, Statistical Machine Translation, ACM Computing Surveys, vol. 40, no. 3, 2008.
     • D. Jurafsky and J. H. Martin, Speech and Language Processing, Prentice Hall, 2009.
     • C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.

  2. Machine Translation

     Wikipedia: Machine translation, often referred to by the acronym MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another.

     Personal definition: MT generally investigates the automatic translation of "standard" language that can be systematically observed in ordinary communication (e.g. conversations, news, speeches, business letters, user manuals). MT is generally not concerned with literary genres, nor with creative and sophisticated use of language. For several reasons, this kind of language is simply out of the scope of MT.

     (1) For a very interesting introduction to issues related to the translation of literary work, see Umberto Eco, "Experiences in Translation", U. Toronto Press, 2001.

     Introduction to MT: Why is Machine Translation so Important? (1)
     • Information society and production of multilingual content:
       7 billion people, 6,000 languages, 250,000 translators
     • Globalization and demand for translation services:
       1,000 global companies operating in ≥ 160 countries
     • Size of worldwide translation market:
       12.5 billion $ per year ≈ 34 million $ per day
     • Size of translation industry:
       3,150 translation companies (3.1 billion $)
       200,000 freelance translators (9.4 billion $)
     • MT can improve productivity of human translators:
       integration of MT with human translation (post-editing)
     • MT can supply cheap gist translation:
       competitive quality-cost-speed trade-off

     (1) Source: Common Sense Advisory, 2010

  3. Introduction to MT: Do we need more research in MT?
     [Images: Chinglish examples, some of which result from MT errors.]

  4. Introduction to MT: Why is Machine Translation so Difficult?

     High-quality human translation implies:
     • deep and rich understanding of source language and text
     • sophisticated and creative command of target language

     Nowadays, feasible goals for machine translation are tasks where:
     • an approximate translation is still useful (gist translation)
     • human translators can post-edit MT (computer-assisted translation)
     • the linguistic domain is very focused and limited (smartphone apps)

     In general, the difficulty of translating depends on how similar the target and source languages are in their vocabulary, grammar, and conceptual structure.

     Applications of MT
     [Image: Gist translation for social media.]

  5. Applications of MT
     [Image: Speech translation app.]

     Applications of MT
     [Image: Integration of MT into computer-assisted translation.]

  6. Differences and Similarities of Languages
     • Universal communicative role of language
       – names for people, words for talking about women, men, children
       – every language seems to have nouns and verbs
     • Differences/similarities across large classes of languages:
       – Morphology: one vs. many morphemes per word, agglutination vs. fusion
       – Syntax: Subj-Verb-Obj structure (E) vs. SOV (J) vs. VSO (Irish)
       – Semantics: mapping of semantic roles and meaning of words,
         e.g. direction/manner of motion indicated by verb/satellite:
         the bottle floated out (E) → la botella salió flotando (S)
     • Lexical divergence between languages:
       – Semantic: there is no corresponding word with the same meaning
         wall (E) → Wand / Mauer (G, inside/outside)
       – Syntactic: a word is better translated into another part of speech
         she likes to sing (E, verb) → sie singt gerne (G, adverb)
     • Cultural differences: a philosophical question, is translation possible at all?

     Lexical Divergence
       – brother (E) → otooto (younger) / oniisan (older) (J)
       – is (E) → iru (animate subject) / aru (inanimate subject) (J)
       – know (E) → connaître (be acquainted with) / savoir (know a proposition) (F)
       – they (E) → ils (masculine) / elles (feminine) (F)
       – Berg (G) → hill / mountain (E)
     • some languages make distinctions that other languages don't
     • it is difficult to translate from less specific into more specific information
     • language differences enforce different conceptual structures
     • debate: do people who speak different languages think differently? (2)

     (2) Watch the talk by Lera Boroditsky (Stanford University), "How Language Shapes Thought", fora.tv.

  7. Approaches to MT
     Rough classification according to the linguistic representations employed:
     • Direct model: translate and re-order single words or n-grams
       – basically, no linguistic representation is used
     • Transfer model: use explicit knowledge about language differences
       – analyze lexical and syntactic structure of source sentence
       – transfer structures from source to target language
       – generate corresponding sentence in the target language
     • Interlingua model: extract the meaning and express it in the target language
       – analyze lexical, syntactic and semantic structure of source sentence
       – interpret the meaning into a canonical interlingua
       – generate the target sentence from the interlingua

     Note: the knowledge required by the interlingua approach grows linearly with the number of languages, rather than quadratically (one transfer component per language pair).

     Vauquois's Triangle
     [Figure: a triangle with the levels String, Syntax and Semantics on both the source and the target side. Analysis ascends the source side, Generation descends the target side. Direct translation spans the base, Transfer the middle, and the Interlingua sits at the apex.]
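
The scaling remark on the interlingua approach can be made concrete with a short count. This is a minimal sketch under two assumptions that are not stated on the slides: a transfer system needs one module per ordered language pair, and an interlingua system needs one analyzer plus one generator per language. The function names are illustrative.

```python
def transfer_modules(n_languages: int) -> int:
    """Modules for a transfer system: one per ordered (source, target) pair."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages: int) -> int:
    """Modules for an interlingua system: one analyzer and one generator per language."""
    return 2 * n_languages


# Compare the two counts as the number of languages grows.
for n in (2, 5, 10, 20):
    print(n, transfer_modules(n), interlingua_modules(n))
```

Under these assumptions, 10 languages need 90 transfer modules but only 20 interlingua modules.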

  8. Approaches to MT
     How are knowledge and linguistic information acquired by the system?
     • Hand-crafted: knowledge for analysis, transfer, generation, meaning representation, or direct translation is manually developed
       – most commercial MT systems fall into this category
       – requires lots of human labor and expertise
       – includes: rule-based MT
     • Machine-learned: representations are implemented by mathematical models learnable from data, e.g. parallel corpora of human translations
       – much less human effort is needed
       – requires huge amounts of data: the more, the better!
       – includes: statistical MT and example-based MT

     Transfer-Based MT

     context-free grammar     synchronous context-free grammar
     NP  → DT NPB             NP  → DT₁ NPB₂ / DT₁ NPB₂
     NPB → JJ NN              NPB → JJ₁ NN₂ / NN₂ JJ₁
     NPB → NN                 NPB → NN / NN
     ···                      ···
     DT  → the                DT  → the / il
     JJ  → north              JJ  → north / settentrionale
     NN  → wind               NN  → wind / vento
     ···                      ···

     Source tree: [NP [DT the] [NPB [JJ north] [NN wind]]]
     Target tree: [NP [DT il] [NPB [NN vento] [JJ settentrionale]]]
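
The synchronous grammar on this slide can be exercised with a few lines of code. The following is a minimal sketch, not the method of any particular MT system: it encodes the toy rules as Python data, parses the source side top-down, and builds the target side with the reordering given by each rule. The names `RULES`, `derive` and `translate` are illustrative assumptions.

```python
# Toy synchronous CFG from the slide. Each rule pairs a source right-hand side
# with a target right-hand side. Integers on the target side index the source
# children, so (1, 0) means "second child first, then first child".
RULES = {
    "NP":  [(("DT", "NPB"), (0, 1))],
    "NPB": [(("JJ", "NN"), (1, 0)),   # adjective and noun swap in Italian
            (("NN",), (0,))],
    "DT":  [(("the",), ("il",))],
    "JJ":  [(("north",), ("settentrionale",))],
    "NN":  [(("wind",), ("vento",))],
}


def derive(symbol, tokens, start):
    """Yield (end, target_words) for each way `symbol` covers tokens[start:end]."""
    if symbol not in RULES:  # terminal: must match the next source token
        if start < len(tokens) and tokens[start] == symbol:
            yield start + 1, []
        return
    for source_rhs, target_rhs in RULES[symbol]:
        # Expand the source side left to right, keeping each child's translation.
        partial = [(start, [])]
        for child in source_rhs:
            partial = [(end, kids + [words])
                       for pos, kids in partial
                       for end, words in derive(child, tokens, pos)]
        for end, kids in partial:
            target = []
            for item in target_rhs:
                # int: a translated child, placed in target order; str: a target word
                target.extend(kids[item] if isinstance(item, int) else [item])
            yield end, target


def translate(sentence, root="NP"):
    """Return the first target string whose derivation covers the whole sentence."""
    tokens = sentence.split()
    for end, target in derive(root, tokens, 0):
        if end == len(tokens):
            return " ".join(target)
    return None


print(translate("the north wind"))  # il vento settentrionale
```

The exhaustive top-down search is only workable for a grammar this small; it is chosen here for brevity.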

  9. Transfer-Based MT
     The grammar and derivation above are a toy example. Working approaches use a very large set of probabilistic and lexicalized rules.

     Interlingua-Based MT
     • Applied to linguistic domains with a limited number of relations and concepts
       – tourist information, hotel booking, flight reservation, ...
     • Semantics of a sentence can be expressed with a predicate-argument structure
       – I need a twin bed room reservation for tomorrow
       – book-room(date=tomorrow,type=twin)
     • The interlingua language has to be designed carefully (by hand)
       – for some applications, a formalism similar to SQL
     • Processing steps in IBMT:
       – extract content from source sentence
       – map content into SQL-like IL format
       – generate translation from IL format
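
The interlingua processing steps can be sketched for the hotel-booking example. This is a toy illustration under loud assumptions: the keyword lexicons, the Italian phrases, and the function names `analyze`, `show` and `generate_italian` are all hypothetical, and the frame simply copies the room-type word found in the sentence. It also assumes both the room type and the date are present.

```python
import re

# Hypothetical toy lexicons for one narrow domain (hotel booking).
ROOM_TYPES = {"single", "twin", "double"}
DATES = {"today", "tomorrow"}
ITALIAN = {
    "single": "camera singola", "twin": "camera a due letti",
    "double": "camera matrimoniale", "today": "oggi", "tomorrow": "domani",
}


def analyze(sentence):
    """Extract the content of an English request as an interlingua frame."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if "room" not in words:
        return None  # outside the toy domain
    frame = {"predicate": "book-room"}
    for word in words:
        if word in ROOM_TYPES:
            frame["type"] = word
        elif word in DATES:
            frame["date"] = word
    return frame


def show(frame):
    """Render the frame in predicate(argument=value,...) notation."""
    args = ",".join(f"{k}={v}" for k, v in sorted(frame.items()) if k != "predicate")
    return f"{frame['predicate']}({args})"


def generate_italian(frame):
    """Generate an Italian sentence from the frame alone, without the source text."""
    return f"Vorrei prenotare una {ITALIAN[frame['type']]} per {ITALIAN[frame['date']]}."


frame = analyze("I need a twin bed room reservation for tomorrow")
print(show(frame))              # book-room(date=tomorrow,type=twin)
print(generate_italian(frame))  # Vorrei prenotare una camera a due letti per domani.
```

Because generation reads only the frame, adding another target language would mean adding one more generator, with no change to the analysis step.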
