Deep Linguistic Information in Hybrid Machine Translation
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Czech Republic
Outline: From Data to an MT System
“DeepBank”: The Prague Czech-English Dependency Treebank (2.0)
– Texts, annotation style(s), alignment, tools
The platform: Treex
TectoMT: hybrid MT, English to Czech
– The (old) idea
– Overall design
– Core modules
(A Speculation on) The Future
Dec. 8, 2012 Hybrid MT Workshop - Coling 2012
The Prague Czech-English Dependency Treebank (PCEDT) 2.0
Parallel treebank
Dependency style (“Prague”)
– (surface) syntax
– syntax & semantics (and more) = “tectogrammatics”
The Prague Czech-English Dependency Treebank (PCEDT) 2.0
Parallel treebank
Dependency style (“Prague”)
– (surface) syntax
– syntax & semantics (“tectogrammatics”)
Penn Treebank translation into Czech
1 million words
Published at LDC, June 2012 (LDC2012T08)
– Also available through LINDAT-Clarin and META-SHARE
PCEDT 2.0: The Alignment(s)
Czech-English alignments
– Sentence-level (manual, natural due to translation), 1:1
At both syntactic levels
– Word (node) level: automatic, test section manually corrected (in part), m:n
Between annotation levels
– Tectogrammatics to surface syntax (m:…)
– Surface syntax to word level (1:…)
Figure labels: tectogrammatics, surface syntax, PTB syntax
Surface syntax annotation
English
– Dependency (head rules + additions, manual corrections)
– Function label (PDT-style) at all nodes (from PTB + rules)
– Lemmatization + “pure” POS tags from PTB
– Automatic (from PTB) + a few manual corrections
Czech
– PDT style, no change
– Syntax: automatic (MST); 2000 sentences fully manual for testing
– Lemmatization and tagging: automatic, 99%/96%, Spoustová et al., EACL 2009 (COMPOST tagger), http://ufal.mff.cuni.cz/compost (Czech, English & other)
– No p-level (of course)
Tectogrammatical annotation
Manual (both languages)
Major features
– Nodes with “autosemantic” words only (no function words)
– Ellipsis “restored” (new node for verbal arguments)
– (Semantic) function (dependent-head relation): verb arguments + ca. 50 functions for other relations
– Valency lexicons attached (English: links to PropBank)
– “Formemes”: prep+case style label (useful in MT and search)
– Co-reference integrated (English: BBN + more; Czech: manually)
Alignment
– To surface syntax & between Czech and English
Example sentence: This temblor-prone city dispatched inspectors, firefighters and other earthquake-trained personnel *-1 to aid San Francisco.
Accompanying Tools
TrEd (http://ufal.mff.cuni.cz/tred)
– Annotation, view/browse and search environment
– Open source, Perl
– Search and visualization:
  Simple data browser (http://ufal.mff.cuni.cz/pcedt2.0)
  PML-TQ: powerful query language for complex tree-based annotation
Treex (http://ufal.mff.cuni.cz/treex)
– Modular NLP processing environment
– Easy handling of complex NLP-annotated data
– Modules exist for Czech and English data processing, including 3rd-party tools integrated into Treex
– CPAN-distributed
PCEDT and Tectogrammatics in (hybrid) MT
The famous, (almost) “Vauquois” triangle: ANALYSIS, TRANSFER, SYNTHESIS
– t-layer: deep syntax & semantics (tectogrammatical layer)
– a-layer: shallow syntax (analytical layer)
– m-layer: POS & lemmatization (morphological layer)
– w-layer: source language (English) / target language (Czech)
Analysis-Transfer-Synthesis Hybrid System
Over 90 steps: both rule-based and statistical
ANALYSIS (source language: English; from the w-layer up to the t-layer)
– Tokenization
– Lemmatization
– Tagging (Compost)
– Parsing (MST)
– Analytical dep. function
– Convert to t-tree
– Grammatemes, formemes
TRANSFER (t-layer)
– Structural transfer
– Lexical transfer (dictionary) & lexical choice
SYNTHESIS (target language: Czech; from the t-layer down to the w-layer)
– Basic morph. categories
– Agreement
– Add function words
– Generate forms
– Concatenate
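The step-by-step design above can be pictured as a list of small blocks applied in order to a shared document. The sketch below is only an illustration in Python. It is not Treex code (Treex is written in Perl), and the names `run_scenario`, `tokenize` and `lemmatize` are hypothetical. The lowercasing "lemmatizer" is a placeholder for a real statistical tool.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, object]
Block = Callable[[Doc], Doc]


def tokenize(doc: Doc) -> Doc:
    # Toy rule-based step: split on whitespace and detach periods.
    text = str(doc["text"])
    return {**doc, "tokens": text.replace(".", " .").split()}


def lemmatize(doc: Doc) -> Doc:
    # Placeholder for a statistical step: lowercasing stands in for a lemmatizer.
    return {**doc, "lemmas": [t.lower() for t in doc["tokens"]]}


def run_scenario(blocks: List[Tuple[str, Block]], doc: Doc) -> Doc:
    # Apply every block in order and record which phase each one belonged to.
    trace: List[str] = []
    for phase, block in blocks:
        doc = block(doc)
        trace.append(f"{phase}:{block.__name__}")
    return {**doc, "trace": trace}


SCENARIO = [("analysis", tokenize), ("analysis", lemmatize)]
result = run_scenario(SCENARIO, {"text": "Machine translation should be easy."})
```

Each block reads what earlier blocks produced and adds its own layer of annotation, so rule-based and statistical steps can be mixed freely in one scenario.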
Example Translation
Tokenized: Machine translation should be easy .
Lemmatized & POS tagged: machine/NN translation/NN should/MD be/VB easy/JJ ./.
a-layer (parse) + functions:
– should (Pred) at the top
– translation (Sb), . (AuxK) and be (Obj) on the next level
– machine (Atr) and easy (Pnom) on the lowest level
Example Translation
Mark function nodes & edges to “collapse”
(same a-layer tree: should/Pred, translation/Sb, ./AuxK, be/Obj, easy/Pnom, machine/Atr)
Example Translation
T-tree backbone + formemes:
– be: v:fin
– translation: n:subj
– easy: adj:compl
– machine: n:attr
Example Translation
T-tree backbone + formemes + grammatemes:
– be: v:fin; Modality=hort, Conditional=1, Tense=PresSim
– translation: n:subj; Num=sg
– easy: adj:compl; DoC=Positive
– machine: n:attr
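The collapse from the a-layer tree to the t-tree backbone can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the TectoMT implementation. The class names, the tiny formeme table and the rule "a modal is merged into its verbal child" are all hypothetical, and attaching the period to "should" follows the slide layout only. Real systems handle many more kinds of function words.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ANode:
    ord: int
    lemma: str
    tag: str
    afun: str
    parent: Optional[int]  # ord of the head, None for the root


@dataclass
class TNode:
    ord: int
    lemma: str
    formeme: str
    parent: Optional[int]
    grammatemes: Dict[str, str] = field(default_factory=dict)


MODALITY = {"should": "hort"}  # value as on the slide; the table itself is a toy
FORMEMES = {("NN", "Sb"): "n:subj", ("NN", "Atr"): "n:attr", ("JJ", "Pnom"): "adj:compl"}


def formeme(tag: str, afun: str, is_root: bool) -> str:
    if tag.startswith("VB"):
        return "v:fin" if is_root else "v:inf"
    return FORMEMES.get((tag, afun), "x")


def t_head(node: ANode, by_ord: Dict[int, ANode], merged: Dict[int, int]) -> Optional[int]:
    p = node.parent
    while p is not None:
        target = merged.get(p, p)  # a modal is represented by its lexical verb
        if target != node.ord:
            return target
        p = by_ord[p].parent       # the node itself replaced the modal: climb on
    return None


def build_t_tree(a_nodes: List[ANode]) -> Dict[int, TNode]:
    by_ord = {n.ord: n for n in a_nodes}
    # Assumed rule: a verb hanging on a modal absorbs that modal.
    merged = {n.parent: n.ord for n in a_nodes
              if n.tag.startswith("VB") and n.parent is not None
              and by_ord[n.parent].tag == "MD"}
    t_tree: Dict[int, TNode] = {}
    for n in a_nodes:
        if n.tag == "MD" or n.afun.startswith("Aux"):
            continue  # function nodes get no t-node of their own
        head = t_head(n, by_ord, merged)
        t_tree[n.ord] = TNode(n.ord, n.lemma, formeme(n.tag, n.afun, head is None), head)
    for modal, verb in merged.items():
        if by_ord[modal].lemma in MODALITY:
            t_tree[verb].grammatemes["Modality"] = MODALITY[by_ord[modal].lemma]
    return t_tree


A_TREE = [
    ANode(1, "machine", "NN", "Atr", 2),
    ANode(2, "translation", "NN", "Sb", 3),
    ANode(3, "should", "MD", "Pred", None),
    ANode(4, "be", "VB", "Obj", 3),
    ANode(5, "easy", "JJ", "Pnom", 4),
    ANode(6, ".", ".", "AuxK", 3),
]
T_TREE = build_t_tree(A_TREE)
```

On the example sentence, "should" and the period disappear as nodes, "be" becomes the root with formeme v:fin, and the modal survives only as a grammateme on "be".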
Example Translation
Transfer starts: clone the t-tree
Grammatemes carried over: Modality=hort, Conditional=1, Tense=PresSim; DoC=Positive; Num=sg
Candidate lemmas and formemes shown on the slide:
– be: mít, být; v:fin, v:inf
– translation: překlad, posun, …; n:1
– easy: snadný, jednoduchý, p…; adj:compl, n:1
– machine: strojový, stroj; adj:attr, n:2, n:attr, adv:
* Dictionary translation: MaxEnt classifier, ~10^6 features
Example Translation
Select the best combination of lemmas & formemes (HMTM)
(same candidate tree: mít / být; překlad / posun / …; snadný / jednoduchý / p…; strojový / stroj)
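Picking the best combination of labels over a tree is commonly done with a Viterbi-style dynamic program that runs from the leaves up to the root. The Python sketch below shows that generic technique and makes no claim about how TectoMT implements its HMTM. All probabilities are invented for illustration, and labels are plain lemmas here, whereas a fuller version would use (lemma, formeme) pairs.

```python
import math
from typing import Callable, Dict, List, Tuple


def tree_viterbi(children: Dict[str, List[str]],
                 candidates: Dict[str, List[str]],
                 emit: Callable[[str, str], float],
                 trans: Callable[[str, str], float],
                 root: str) -> Tuple[Dict[str, str], float]:
    best: Dict[Tuple[str, str], float] = {}             # best log score of a subtree
    choice: Dict[Tuple[str, str], Dict[str, str]] = {}  # child labels behind that score

    def visit(node: str) -> None:
        for child in children.get(node, []):
            visit(child)
        for label in candidates[node]:
            score = emit(node, label)
            picks: Dict[str, str] = {}
            for child in children.get(node, []):
                # Best child label given this parent label.
                c_label = max(candidates[child],
                              key=lambda cl: best[(child, cl)] + trans(label, cl))
                score += best[(child, c_label)] + trans(label, c_label)
                picks[child] = c_label
            best[(node, label)] = score
            choice[(node, label)] = picks

    def backtrack(node: str, label: str, out: Dict[str, str]) -> None:
        out[node] = label
        for child, c_label in choice[(node, label)].items():
            backtrack(child, c_label, out)

    visit(root)
    root_label = max(candidates[root], key=lambda l: best[(root, l)])
    labels: Dict[str, str] = {}
    backtrack(root, root_label, labels)
    return labels, best[(root, root_label)]


# Invented toy probabilities: translation model (emission), target tree model (transition).
EMIT = {("be", "být"): 1.0, ("translation", "překlad"): 0.6, ("translation", "posun"): 0.4,
        ("machine", "stroj"): 0.6, ("machine", "strojový"): 0.4,
        ("easy", "snadný"): 0.5, ("easy", "jednoduchý"): 0.5}
TRANS = {("být", "překlad"): 0.3, ("být", "posun"): 0.1,
         ("být", "snadný"): 0.3, ("být", "jednoduchý"): 0.2,
         ("překlad", "strojový"): 0.5, ("překlad", "stroj"): 0.05,
         ("posun", "strojový"): 0.1, ("posun", "stroj"): 0.1}

CHILDREN = {"be": ["translation", "easy"], "translation": ["machine"]}
CANDIDATES = {"be": ["být"], "translation": ["překlad", "posun"],
              "machine": ["strojový", "stroj"], "easy": ["snadný", "jednoduchý"]}

labels, log_score = tree_viterbi(
    CHILDREN, CANDIDATES,
    emit=lambda node, lab: math.log(EMIT[(node, lab)]),
    trans=lambda parent, child: math.log(TRANS[(parent, child)]),
    root="be",
)
```

With these toy numbers "stroj" is the better translation of "machine" in isolation, but the tree model prefers "strojový" under "překlad", so the joint optimum differs from the node-by-node choice.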
Example Translation
Clone to a-tree, add core morphological & POS tags + agreement + function words
– mít: Gen=MInanim, C=PastP, Num=sg
– překlad: Num=sg, Case=1
– být: C=inf
– snadný: Deg=pos, Case=1, Gen=MInanim
– strojový: Deg=pos, Case=1, Gen=MInanim
– added nodes: by, .
Example Translation
Rearrange clitics
(a-tree nodes and categories unchanged: mít, překlad, by, být, snadný, strojový, .)
Example Translation
Synthesize word forms: měl, překlad, by, být, snadný, strojový, .
... and flatten the tree (capitalize, space):
Strojový překlad by měl být snadný.
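The last step, flattening the tree into a sentence, amounts to reading the word forms in surface order, joining them with spaces, attaching punctuation and capitalizing the first letter. A minimal Python sketch follows. The function name `flatten` and the punctuation set are assumptions, and the surface positions are read off the final sentence on the slide.

```python
from typing import Dict

NO_SPACE_BEFORE = {".", ",", "!", "?", ";", ":"}


def flatten(forms_by_ord: Dict[int, str]) -> str:
    # Read the forms in surface order, insert spaces, then capitalize.
    out = ""
    for _, form in sorted(forms_by_ord.items()):
        if out and form not in NO_SPACE_BEFORE:
            out += " "
        out += form
    return out[:1].upper() + out[1:]


# Synthesized forms keyed by their position in the sentence.
FORMS = {4: "měl", 2: "překlad", 7: ".", 3: "by", 5: "být", 6: "snadný", 1: "strojový"}
sentence = flatten(FORMS)
```

Keeping the surface position on each node lets earlier steps, such as clitic rearrangement, change word order simply by renumbering nodes.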