


  1. Introduction to Statistical Machine Translation
     ESSLLI 2005
     Chris Callison-Burch
     Philipp Koehn

     Who we are
     • Chris = PhD student at University of Edinburgh, co-founder of Linear B Ltd, a startup company that builds SMT systems
     • Philipp = Lecturer at U of Edinburgh, recently finished his PhD at University of Southern California / ISI, did postdoc at MIT

     Course Overview
     • Day 1:
       - Different approaches to MT
       - Overview of statistical MT
       - Useful resources
     • Day 2:
       - Decoding and search
     • Day 3:
       - Aligning words and phrases
     • Day 4:
       - Evaluation of translation quality
       - Using parallel corpora for other tasks
     • Day 5:
       - Syntax-based approaches to SMT

     Overview of MT

     A long history
     • Machine translation was one of the first applications envisioned for computers
     • Warren Weaver (1949): “I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text.”
     • First demonstrated by IBM in 1954 with a basic word-for-word translation system.

  2. Commercially Interesting
     • U.S. has invested in MT for intelligence purposes
     • MT is popular on the web -- it is the most used of Google's special features
     • EU spends more than €1,000,000,000 on translation costs each year. (Semi-)automating that could lead to huge savings

     Academically Interesting
     • Machine translation requires many other NLP technologies
     • Potentially: parsing, generation, word sense disambiguation, named entity recognition, transliteration, pronoun resolution, natural language understanding, and real-world knowledge

     What makes MT hard?
     • Word order
     • Word sense
     • Pronouns
     • Tense
     • Idioms

     Differing word orders
     • English word order is subject - verb - object
       Japanese order is subject - object - verb
     • English: IBM bought Lotus
       Japanese: IBM Lotus bought
     • English: Reporters said IBM bought Lotus
       Japanese: Reporters IBM Lotus bought said

     Word sense ambiguity
     • `Bank' as in river
       `Bank' as in financial institution
     • `Plant' as in a tree
       `Plant' as in a factory
     • Different word senses will likely translate into different words in another language

     Problem of pronouns
     • Some languages like Spanish can drop subject pronouns
     • In Spanish the verbal inflection often indicates which pronoun should be restored
       -o = I
       -as = you
       -a = he / she / it
       -amos = we
       -an = they
     • When should we use `she' or `he' or `it'?
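The pronoun problem above can be made concrete with a small sketch. It assumes a regular present-tense verb whose ending is one of the five listed on the slide; the names `ENDING_TO_PRONOUNS` and `candidate_pronouns` are made up for this illustration. The point is that the ending `-a` leaves three candidates, so the inflection alone cannot decide between `he', `she' and `it'.

```python
# Endings and pronouns as listed on the slide (toy coverage only).
ENDING_TO_PRONOUNS = {
    "amos": ["we"],
    "as": ["you"],
    "an": ["they"],
    "a": ["he", "she", "it"],
    "o": ["I"],
}

def candidate_pronouns(verb):
    """Return the English pronouns compatible with the verb's ending."""
    for ending, pronouns in ENDING_TO_PRONOUNS.items():
        if verb.endswith(ending):
            return pronouns
    return []  # ending not covered by this toy table

for verb in ["hablo", "hablas", "habla", "hablamos", "hablan"]:
    print(verb, "->", candidate_pronouns(verb))
```

For "habla" the function returns all three third-person pronouns; choosing among them needs context beyond the verb form.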

  3. Different tenses
     • Spanish has two versions of the past tense: one for a definite time in the past, and one for an unknown time in the past
     • When translating from English to Spanish we need to choose which version of the past tense to use

     Idioms
     • "to kick the bucket" means "to die"
     • "a bone of contention" does not have anything to do with skeletons
     • "a lame duck", "tongue in cheek", "to cave in"

     Various Approaches to Machine Translation

     Various approaches
     • Word-for-word translation
     • Syntactic transfer
     • Interlingual approaches
     • Controlled language
     • Example-based translation
     • Statistical translation

     Word-for-word translation
     • Use a machine-readable bilingual dictionary to translate each word in a text
     • Advantages: Easy to implement, results give a rough idea about what the text is about
     • Disadvantages: Problems with word order mean that this results in low-quality translation

     Syntactic transfer
     [Figure: two parse trees of the same sentence]
     Left tree: (S (NP(SUBJ) Reporters) (VP (S(OBJ) (NP(SUBJ) IBM) (VP (NP(OBJ) Lotus) (V bought))) (V said)))
     Right tree: (S (NP(SUBJ) Reporters) (VP (V said) (S(OBJ) (NP(SUBJ) IBM) (VP (V bought) (NP(OBJ) Lotus)))))
     • Parse the sentence
     • Rearrange constituents
     • Then translate the words
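The "rearrange constituents" step can be sketched in a few lines. This is only an illustration under stated assumptions: the parse tree is written by hand rather than produced by a parser, the single transfer rule (move the verb to the end of every VP) is invented for this example, and the final word-translation step is left out. The names `ENGLISH_TREE`, `to_verb_final` and `leaves` are hypothetical.

```python
# A tree node is (label, child, child, ...); a leaf is (label, word).
ENGLISH_TREE = (
    "S",
    ("NP-SUBJ", "Reporters"),
    ("VP",
     ("V", "said"),
     ("S-OBJ",
      ("NP-SUBJ", "IBM"),
      ("VP", ("V", "bought"), ("NP-OBJ", "Lotus")))),
)

def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)

def to_verb_final(node):
    """Recursively move the verb to the end of every VP (SVO -> SOV)."""
    if is_leaf(node):
        return node
    label = node[0]
    children = [to_verb_final(child) for child in node[1:]]
    if label == "VP":
        verbs = [c for c in children if c[0] == "V"]
        others = [c for c in children if c[0] != "V"]
        children = others + verbs
    return (label, *children)

def leaves(node):
    """Read the words off the tree from left to right."""
    if is_leaf(node):
        return [node[1]]
    return [word for child in node[1:] for word in leaves(child)]

print(" ".join(leaves(to_verb_final(ENGLISH_TREE))))
# prints: Reporters IBM Lotus bought said
```

The printed order matches the Japanese word order given earlier for "Reporters said IBM bought Lotus". A real system would need one such rule set per language pair.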

  4. Syntactic transfer
     [Figure: parse tree of "He adores listening to music" (VB → PRP VB1 VB2, with "listening to music" under VB2), reordered to VB → PRP VB2 VB1 with "music to listening" under VB2, then extended with the Japanese function words ha, no, ga, desu and translated word by word: kare, ongaku wo, kiku, daisuki]

     Syntactic transfer
     • Advantages: Deals with the word-order problem
     • Disadvantages:
       - Must construct transfer rules for each language pair that you deal with
       - Sometimes there is syntactic mismatch
     • Example:
       English: The bottle floated into the cave
       Spanish: La botella entró a la cueva flotando = The bottle entered the cave floating

     Interlingua
     • Assign a logical form to sentences
     • John must not go = OBLIGATORY(NOT(GO(JOHN)))
       John may not go = NOT(PERMITTED(GO(JOHN)))
     • Use logical form to generate a sentence in another language

     Interlingua
     • Advantages: Single logical form means that we can translate between all languages and only write a parser/generator for each language once
     • Disadvantages: Difficult to define a single logical form. English words in all capital letters probably won't cut it.

     Controlled language
     • Define a subset of a language which can be used to compose text to be translated
     • Issue editorial guidelines that limit each word to only one word sense, and which forbid certain difficult constructions
     • Apply syntactic transfer or interlingual approaches
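For the interlingua slides above, a minimal sketch of how the two logical forms could be held as data. The nested-tuple representation and the `render` helper are assumptions made for this illustration, not the course's formalism. It shows that the two English sentences differ only in where NOT sits relative to the modal operator.

```python
# A logical form is either an atom (str) or a (PREDICATE, argument) pair.
GO_JOHN = ("GO", "JOHN")
MUST_NOT_GO = ("OBLIGATORY", ("NOT", GO_JOHN))  # John must not go
MAY_NOT_GO = ("NOT", ("PERMITTED", GO_JOHN))    # John may not go

def render(form):
    """Render a nested logical form as PREDICATE(argument)."""
    if isinstance(form, str):
        return form
    predicate, argument = form
    return f"{predicate}({render(argument)})"

print(render(MUST_NOT_GO))  # prints: OBLIGATORY(NOT(GO(JOHN)))
print(render(MAY_NOT_GO))   # prints: NOT(PERMITTED(GO(JOHN)))
```

A generator for another language would start from these structures rather than from the English words, which is why each language would need only one parser and one generator.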

  5. Controlled language
     • Advantages: Results in more reliable, higher quality translation for the subset of language that it deals with
     • Disadvantages: Does not cover all language use, so can only be applied in limited settings

     Example-based MT
     • Fundamental idea:
       - People do not translate by doing deep linguistic analysis of a sentence.
       - They translate by decomposing a sentence into fragments, translating each of those, and then composing those properly.
     • Principle of analogy in translation

     Example of Example-Based MT
     • Translate:
       He buys a book on international politics.
     • With these examples:
       (He buys) a notebook.
       (Kare ha) nouto (wo kau).
       I read (a book on international politics).
       Watashi ha (kokusaiseiji nitsuite kakareta hon) wo yomu
     • (Kare ha) (kokusaiseiji nitsuite kakareta hon) (wo kau).

     Challenges
     • Locating similar sentences
     • Aligning sub-sentential fragments
     • Combining multiple fragments of example translations into a single sentence
     • Determining when it is appropriate to substitute one fragment for another
     • Selecting the best translation out of many candidates

     Example-based MT
     • Advantages: Uses fragments of human translations which can result in higher quality
     • Disadvantages: May have limited coverage depending on the size of the example database, and flexibility of matching heuristics

     Statistical machine translation
     • Find most probable English sentence given a foreign language sentence
     • Automatically align words and phrases within sentence pairs in a parallel corpus
     • Probabilities are determined automatically by training a statistical model using the parallel corpus
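The example-based MT example above ("He buys a book on international politics") can be replayed as a toy program. This sketch assumes the hard parts listed under Challenges are already solved: the template and the fragment alignments are typed in by hand from the slide, and matching is exact string matching. `TEMPLATE`, `FRAGMENTS` and `translate_by_example` are hypothetical names.

```python
# Template taken from "(He buys) a notebook." / "(Kare ha) nouto (wo kau)."
TEMPLATE = ("He buys {X}.", "Kare ha {X} wo kau.")

# Aligned fragments taken from the two example sentence pairs.
FRAGMENTS = {
    "a notebook": "nouto",
    "a book on international politics": "kokusaiseiji nitsuite kakareta hon",
}

def translate_by_example(sentence):
    """Fill the target template with the translation of the matched fragment."""
    source_template, target_template = TEMPLATE
    prefix, suffix = source_template.split("{X}")
    if not (sentence.startswith(prefix) and sentence.endswith(suffix)):
        return None  # no similar example sentence found
    fragment = sentence[len(prefix):len(sentence) - len(suffix)]
    if fragment not in FRAGMENTS:
        return None  # fragment not covered by the example database
    return target_template.replace("{X}", FRAGMENTS[fragment])

print(translate_by_example("He buys a book on international politics."))
# prints: Kare ha kokusaiseiji nitsuite kakareta hon wo kau.
```

The two `None` branches correspond to the limited-coverage disadvantage: anything outside the example database cannot be translated.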

  6. Statistical machine translation
     • Advantages:
       - Has a way of dealing with lexical ambiguity
       - Can deal with idioms that occur in the training data
       - Requires minimal human effort
       - Can be created for any language pair that has enough training data
     • Disadvantages: Does not explicitly deal with syntax

     Parallel corpus
     English: what is more , the relevant cost dynamic is completely under control .
     German:  im übrigen ist die diesbezügliche kostenentwicklung völlig unter kontrolle .

     English: sooner or later we will have to be sufficiently progressive in terms of own resources as a basis for this fair tax system .
     German:  früher oder später müssen wir die notwendige progressivität der eigenmittel als grundlage dieses gerechten steuersystems zur sprache bringen .

     English: we plan to submit the first accession partnership in the autumn of this year .
     German:  wir planen , die erste beitrittspartnerschaft im herbst dieses jahres vorzulegen .

     English: it is a question of equality and solidarity .
     German:  hier geht es um gleichberechtigung und solidarität .

     English: the recommendation for the year 1999 has been formulated at a time of favourable developments and optimistic prospects for the european economy .
     German:  die empfehlung für das jahr 1999 wurde vor dem hintergrund günstiger entwicklungen und einer für den kurs der europäischen wirtschaft positiven perspektive abgegeben .

     English: that does not , however , detract from the deep appreciation which we have for this report .
     German:  im übrigen tut das unserer hohen wertschätzung für den vorliegenden bericht keinen abbruch .

     Choosing an Approach
     • Many challenges in MT, many different ways of approaching the task
     • What approach you prefer will depend on your background (e.g. logicians tend towards interlingua, linguists towards syntactic transfer)
     • Objectively choosing how to approach the task is tricky

     Some Criteria
     • Do we want to design a system for a single language or for many languages?
     • Can we assume a constrained vocabulary or do we need to deal with any text?
     • What resources already exist for the languages that we're dealing with?
     • How long will it take us to develop the resources, and how large a staff will we need?

     Advantages of SMT
     • Data driven
     • Language independent
     • No need for staff of linguists or language experts
     • Can prototype a new system quickly and at a very low cost

     Choosing SMT
     • Economic reasons:
       - Low cost; rapid prototyping
     • Practical reasons:
       - Many language pairs don't have NLP resources, but do have parallel corpora
     • Quality reasons:
       - Uses chunks of human-translated text as its building blocks
       - When very large data sets are available produces state of the art results
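To illustrate what "probabilities are determined automatically by training a statistical model using the parallel corpus" can mean in practice, here is a minimal sketch of word-translation training with expectation-maximisation, in the style of IBM Model 1. This is a swapped-in textbook technique chosen for illustration; the slides do not say which model the course uses. The three-sentence corpus is invented (it is not the Europarl-style data shown on the Parallel corpus slide), the NULL word is omitted, and `train_word_model` and `best_translation` are hypothetical names.

```python
from collections import defaultdict

# Invented toy corpus of (German words, English words) sentence pairs.
CORPUS = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

def train_word_model(corpus, iterations=20):
    """Estimate t(e | f) for word pairs that co-occur, using EM."""
    # Constant start; the E-step below normalises within each sentence pair.
    t = {(e, f): 1.0 for fs, es in corpus for f in fs for e in es}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of e aligned to f
        total = defaultdict(float)  # expected count of f aligned to anything
        for fs, es in corpus:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    fraction = t[(e, f)] / norm
                    count[(e, f)] += fraction
                    total[f] += fraction
        # M-step: renormalise so the probabilities for each f sum to one.
        t = {(e, f): c / total[f] for (e, f), c in count.items()}
    return t

def best_translation(t, foreign_word):
    """Return the English word with the highest t(e | foreign_word)."""
    candidates = {e: p for (e, f), p in t.items() if f == foreign_word}
    return max(candidates, key=candidates.get)

model = train_word_model(CORPUS)
for word in ["das", "haus", "buch", "ein"]:
    print(word, "->", best_translation(model, word))
```

No dictionary is supplied: "buch" ends up paired with "book" only because the two words keep appearing in the same sentence pairs, which is the sense in which the approach is data driven and language independent.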
