lecture 21

Lecture 21: Machine Translation

CS447: Natural Language Processing (http://courses.engr.illinois.edu/cs447). Lecture 21: Machine Translation. Julia Hockenmaier, juliahmr@illinois.edu, 3324 Siebel Center.


  1. Machine Translation in 2018
 Google Translate (translate.google.com)

 Machine Translation in 2012
 Google Translate (translate.google.com)

 Why is MT difficult?

  2. Some examples
 John loves Mary.
 Jean aime Marie.
 John told Mary a story.
 Jean a raconté une histoire à Marie.
 John is a computer scientist.
 Jean est informaticien.
 John swam across the lake.
 Jean a traversé le lac à la nage.

 Correspondences
 John loves Mary.
 Jean aime Marie.
 John told Mary a story.
 Jean [a raconté] une histoire [à Marie].
 John is a [computer scientist].
 Jean est informaticien.
 John [swam across] the lake.
 Jean [a traversé] le lac [à la nage].

 Correspondences
 One-to-one:
 John = Jean, aime = loves, Mary = Marie
 One-to-many/many-to-one:
 Mary = [à Marie]
 [a computer scientist] = informaticien
 Many-to-many:
 [swam across] = [a traversé à la nage]
 Reordering required:
 told Mary(1) [a story](2) = a raconté [une histoire](2) [à Marie](1)

 Lexical divergences
 The different senses of homonymous words generally have different translations:
 English-German: (river) bank - Ufer
 (financial) bank - Bank
 The different senses of polysemous words may also have different translations:
 I know that he bought the book: Je sais qu’il a acheté le livre.
 I know Peter: Je connais Peter.
 I know math: Je m’y connais en maths.
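
The correspondence types above (one-to-one, one-to-many/many-to-one, many-to-many, reordering) can be made concrete by treating an alignment as a set of index links between the words of the two sentences. The following is a minimal Python sketch under that assumption. The function names and the word-level links chosen for the example sentence are illustrative and are not taken from the lecture.

```python
from collections import defaultdict


def correspondence_units(links):
    """Group (source_index, target_index) links into connected units."""
    src_to_tgt, tgt_to_src = defaultdict(set), defaultdict(set)
    for s, t in links:
        src_to_tgt[s].add(t)
        tgt_to_src[t].add(s)

    units, seen_src = [], set()
    for start in sorted(src_to_tgt):
        if start in seen_src:
            continue
        src_group, tgt_group, frontier = {start}, set(), [start]
        while frontier:
            s = frontier.pop()
            for t in src_to_tgt[s]:
                if t in tgt_group:
                    continue
                tgt_group.add(t)
                # Pull in every source word that shares this target word.
                for s2 in tgt_to_src[t]:
                    if s2 not in src_group:
                        src_group.add(s2)
                        frontier.append(s2)
        seen_src |= src_group
        units.append((sorted(src_group), sorted(tgt_group)))
    return units


def label(unit):
    """Name the correspondence type of one unit."""
    src, tgt = unit
    if len(src) == 1 and len(tgt) == 1:
        return "one-to-one"
    if len(src) == 1:
        return "one-to-many"
    if len(tgt) == 1:
        return "many-to-one"
    return "many-to-many"


def needs_reordering(links):
    """True if two links cross, i.e. source order and target order disagree."""
    return any(s1 < s2 and t1 > t2 for s1, t1 in links for s2, t2 in links)


# John(0) told(1) Mary(2) a(3) story(4)
# Jean(0) a(1) raconté(2) une(3) histoire(4) à(5) Marie(6)
# Assumed links: told = [a raconté], Mary = [à Marie], a = une, story = histoire.
told_links = {(0, 0), (1, 1), (1, 2), (2, 5), (2, 6), (3, 3), (4, 4)}

for unit in correspondence_units(told_links):
    print(unit, label(unit))
print("reordering required:", needs_reordering(told_links))
```

Linking every word of [swam across] to every word of [a traversé à la nage] would produce a single many-to-many unit in the same way.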

  3. Lexical divergences
 Lexical specificity
 German Kürbis = English pumpkin or (winter) squash
 English brother = Chinese gege (older) or didi (younger)
 Morphological divergences
 English: new book(s), new story/stories
 French: un nouveau livre (sg.m), une nouvelle histoire (sg.f),
 des nouveaux livres (pl.m), des nouvelles histoires (pl.f)
 - How much inflection does a language have? (cf. Chinese vs. Finnish)
 - How many morphemes does each word have?
 - How easily can the morphemes be separated?

 Syntactic divergences
 Word order: fixed or free?
 If fixed, which one? [SVO (Sbj-Verb-Obj), SOV, VSO, …]
 Head-marking vs. dependent-marking
 Dependent-marking (English): the man’s house
 Head-marking (Hungarian): the man house-his
 Pro-drop languages can omit pronouns:
 Italian (with inflection): I eat = mangio; he eats = mangia
 Chinese (without inflection): I/he eat: chī fàn

 Syntactic divergences: negation
 Language  Normal                      Negated                            Notes
 English   I drank coffee.             I didn’t drink (any) coffee.       do-support, any
 French    J’ai bu du café             Je n’ai pas bu de café.            ne..pas, du → de
 German    Ich habe Kaffee getrunken   Ich habe keinen Kaffee getrunken   keinen Kaffee = ‘no coffee’

 Semantic differences
 Aspect:
 - English has a progressive aspect:
 ‘Peter swims’ vs. ‘Peter is swimming’
 - German can only express this with an adverb:
 ‘Peter schwimmt’ vs. ‘Peter schwimmt gerade’ (‘swims currently’)
 Motion events have two properties:
 - manner of motion (swimming)
 - direction of motion (across the lake)
 Languages express either the manner with a verb and the direction with a ‘satellite’ or vice versa (L. Talmy):
 English (satellite-framed): He [swam]MANNER [across]DIR the lake
 French (verb-framed): Il a [traversé]DIR le lac [à la nage]MANNER

  4. An exercise

 Knight’s Centauri and Arcturan
 1a. ok-voon ororok sprok.
 1b. at-voon bichat dat.
 2a. ok-drubel ok-voon anok plok sprok.
 2b. at-drubel at-voon pippat rrat dat.
 3a. erok sprok izok hihok ghirok.
 3b. totat dat arrat vat hilat.
 4a. ok-voon anok drok brok jok.
 4b. at-voon krat pippat sat lat.
 5a. wiwok farok izok stok.
 5b. totat jjat quat cat.
 6a. lalok sprok izok jok stok.
 6b. wat dat krat quat cat.
 7a. lalok farok ororok lalok sprok izok enemok.
 7b. wat jjat bichat wat dat vat eneat.
 8a. lalok brok anok plok nok.
 8b. iat lat pippat rrat nnat.
 9a. wiwok nok izok kantok ok-yurp.
 9b. totat nnat quat oloat at-yurp.
 10a. lalok mok nok yorok ghirok clok.
 10b. wat nnat gat mat bat hilat.
 11a. lalok nok crrrok hihok yorok zanzanok.
 11b. wat nnat arrat mat zanzanat.
 12a. lalok rarok nok izok hihok mok.
 12b. wat nnat forat arrat vat gat.

 The original corpus
 1a. Garcia and associates.
 1b. Garcia y asociados.
 2a. Carlos Garcia has three associates.
 2b. Carlos Garcia tiene tres asociados.
 3a. his associates are not strong.
 3b. sus asociados no son fuertes.
 4a. Garcia has a company also.
 4b. Garcia tambien tiene una empresa.
 5a. its clients are angry.
 5b. sus clientes están enfadados.
 6a. the associates are also angry.
 6b. los asociados tambien están enfadados.
 7a. the clients and the associates are enemies.
 7b. los clientes y los asociados son enemigos.
 8a. the company has three groups.
 8b. la empresa tiene tres grupos.
 9a. its groups are in Europe.
 9b. sus grupos están en Europa.
 10a. the modern groups sell strong pharmaceuticals.
 10b. los grupos modernos venden medicinas fuertes.
 11a. the groups do not sell zanzanine.
 11b. los grupos no venden zanzanina.
 12a. the small groups are not modern.
 12b. los grupos pequeños no son modernos.

 The original corpus, side by side with Centauri and Arcturan
 1a. Garcia and associates. | ok-voon ororok sprok.
 1b. Garcia y asociados. | at-voon bichat dat.
 2a. Carlos Garcia has three associates. | ok-drubel ok-voon anok plok sprok.
 2b. Carlos Garcia tiene tres asociados. | at-drubel at-voon pippat rrat dat.
 3a. his associates are not strong. | erok sprok izok hihok ghirok.
 3b. sus asociados no son fuertes. | totat dat arrat vat hilat.
 4a. Garcia has a company also. | ok-voon anok drok brok jok.
 4b. Garcia tambien tiene una empresa. | at-voon krat pippat sat lat.
 5a. its clients are angry. | wiwok farok izok stok.
 5b. sus clientes están enfadados. | totat jjat quat cat.
 6a. the associates are also angry. | lalok sprok izok jok stok.
 6b. los asociados tambien están enfadados. | wat dat krat quat cat.
 7a. the clients and the associates are enemies. | lalok farok ororok lalok sprok izok enemok.
 7b. los clientes y los asociados son enemigos. | wat jjat bichat wat dat vat eneat.
 8a. the company has three groups. | lalok brok anok plok nok.
 8b. la empresa tiene tres grupos. | iat lat pippat rrat nnat.
 9a. its groups are in Europe. | wiwok nok izok kantok ok-yurp.
 9b. sus grupos están en Europa. | totat nnat quat oloat at-yurp.
 10a. the modern groups sell strong pharmaceuticals. | lalok mok nok yorok ghirok clok.
 10b. los grupos modernos venden medicinas fuertes. | wat nnat gat mat bat hilat.
 11a. the groups do not sell zanzanine. | lalok nok crrrok hihok yorok zanzanok.
 11b. los grupos no venden zanzanina. | wat nnat arrat mat zanzanat.
 12a. the small groups are not modern. | lalok rarok nok izok hihok mok.
 12b. los grupos pequeños no son modernos. | wat nnat forat arrat vat gat.
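
One way to approach the exercise is to count how often a Centauri word and an Arcturan word occur in the same sentence pair. The sketch below scores candidate pairs with the Dice coefficient, a simple co-occurrence heuristic swapped in here for illustration; it is not necessarily the method the lecture goes on to use. The names `dice_lexicon`, `CENTAURI`, and `ARCTURAN` are hypothetical, and the sentences are the twelve pairs above with the final periods removed.

```python
from collections import Counter
from itertools import product

CENTAURI = """ok-voon ororok sprok
ok-drubel ok-voon anok plok sprok
erok sprok izok hihok ghirok
ok-voon anok drok brok jok
wiwok farok izok stok
lalok sprok izok jok stok
lalok farok ororok lalok sprok izok enemok
lalok brok anok plok nok
wiwok nok izok kantok ok-yurp
lalok mok nok yorok ghirok clok
lalok nok crrrok hihok yorok zanzanok
lalok rarok nok izok hihok mok""".splitlines()

ARCTURAN = """at-voon bichat dat
at-drubel at-voon pippat rrat dat
totat dat arrat vat hilat
at-voon krat pippat sat lat
totat jjat quat cat
wat dat krat quat cat
wat jjat bichat wat dat vat eneat
iat lat pippat rrat nnat
totat nnat quat oloat at-yurp
wat nnat gat mat bat hilat
wat nnat arrat mat zanzanat
wat nnat forat arrat vat gat""".splitlines()


def dice_lexicon(src_sents, tgt_sents):
    """Map each source word to its best-scoring target word and Dice score."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in zip(src_sents, tgt_sents):
        # Count each word type once per sentence pair.
        src_types, tgt_types = set(src.split()), set(tgt.split())
        src_count.update(src_types)
        tgt_count.update(tgt_types)
        pair_count.update(product(src_types, tgt_types))

    best = {}
    # Sorted iteration keeps tie-breaking deterministic (first pair wins).
    for (s, t), n in sorted(pair_count.items()):
        score = 2 * n / (src_count[s] + tgt_count[t])
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best


lexicon = dice_lexicon(CENTAURI, ARCTURAN)
for word in ("ok-voon", "sprok", "anok", "nok"):
    print(word, lexicon[word])
```

Words that always appear in the same sentence pairs, such as sprok and dat (associates and asociados in the original corpus), receive a score of 1.0. Frequent function words and words without a counterpart on the other side are less reliable under this heuristic.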

  5. Machine translation approaches

 The Rosetta Stone
 Three different translations of the same text:
 - Hieroglyphic Egyptian (used by priests)
 - Demotic Egyptian (used for daily purposes)
 - Classical Greek (used by the administration)
 Instrumental in our understanding of ancient Egyptian
 This is an instance of parallel text:
 The Greek inscription allowed scholars to decipher the hieroglyphs

 MT History
 WW II: Code-breaking efforts at Bletchley Park, England (Alan Turing)
 1948: Shannon/Weaver: Information theory
 1949: Weaver’s memorandum defines the task
 1954: IBM/Georgetown demo: 60 sentences Russian-English
 1960: Bar-Hillel: MT too difficult
 1966: ALPAC report: human translation is far cheaper and better: kills MT for a long time
 1980s/90s: Transfer and interlingua-based approaches
 1990: IBM’s CANDIDE system (first modern statistical MT system)
 2000s: Huge interest and progress in wide-coverage statistical MT: phrase-based MT, syntax-based MT, open-source tools
 Now: Neural machine translation
