Motivation
Segmentation of input
Rules in Apertium are applied in a greedy, left-to-right, longest-match manner: a word is never processed by more than one rule
The inference algorithm should ensure that sequences of words that must be processed together by a single rule are not processed by different rules
Example (lexical categories: DT ADJ N CC DT ADJ N):
  The white house and the red cars → La casa blanca y el rojo coches (incorrect segmentation)
  The white house and the red cars → La casa blanca y los coches rojos (correct segmentation)
Víctor M. Sánchez-Cartagena
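The segmentation behaviour described above can be sketched as follows. This is a minimal illustration, not Apertium's implementation: rules are reduced to plain sequences of lexical categories, and the rule `CC DT ADJ` is a hypothetical rule added only to show how an unwanted rule can break up "the red cars".

```python
def segment(tags, rules):
    """Greedy, left-to-right, longest-match segmentation.

    Returns (start, end) spans, end exclusive. A word that starts no
    matching rule is emitted as a single-word span.
    """
    max_len = max((len(rule) for rule in rules), default=1)
    spans, i = [], 0
    while i < len(tags):
        length = 1  # default: the word is processed on its own
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tags) - i), 1, -1):
            if tuple(tags[i:i + n]) in rules:
                length = n
                break
        spans.append((i, i + length))
        i += length
    return spans


# The white house and the red cars
tags = ["DT", "ADJ", "N", "CC", "DT", "ADJ", "N"]

# Only DT ADJ N: both noun phrases are processed by a single rule each.
good = segment(tags, {("DT", "ADJ", "N")})

# Adding the hypothetical rule CC DT ADJ consumes "and the red",
# so "cars" is left on its own.
bad = segment(tags, {("DT", "ADJ", "N"), ("CC", "DT", "ADJ")})
```

Because matching is greedy, the presence of a rule for one category sequence can prevent a better rule from ever being applied further to the right, which is why the inference algorithm has to control which sequences get rules.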
Motivation
The new rule inference approach solves these issues thanks to:
A rule formalism with more generalisation power: Generalised Alignment Templates (GATs)
  Extension of the formalism defined by Sánchez-Martínez & Forcada (2009)
A more powerful inference algorithm
  Prevents overgeneralisation
  Solves conflicts between rules at a global level
  Ensures correct segmentation of input
Generalised alignment templates
A GAT is made of:
  SL word classes and restrictions, which define the SL lexical forms matched
  TL word classes, which define the output
New values are introduced in word classes so that the same GAT can be applied to lexical forms with different values of morphological inflection attributes:
  Wildcard values (*)
  SL references ($j_s) and TL references ($j_t)
Example GAT:
  SL word classes: 1 PN   2 POS   3 N-gen:ε.num:*
  TL word classes: 1 el DT-gen:$3_t.num:$3_s   2 N-gen:$3_t.num:$3_s   3 de PR   4 PN
  Restrictions: r1 = {}, r2 = {}, r3 = {}
Generalised alignment templates
Example of translation with a GAT (English → Spanish)
GAT:
  SL word classes: 1 PN   2 POS   3 N-gen:ε.num:*
  TL word classes: 1 el DT-gen:$3_t.num:$3_s   2 N-gen:$3_t.num:$3_s   3 de PR   4 PN
  Restrictions: r1 = {}, r2 = {}, r3 = {}
1. Input: Victor’s plants
2. Analysis: Victor PN   ’s POS   plant N-gen:ε.num:pl
3. Bilingual dictionary look-up: Victor PN → Víctor PN; plant N-gen:ε.num:pl → planta N-gen:f.num:pl
4. TL word classes of the GAT: el DT-gen:$3_t.num:$3_s   N-gen:$3_t.num:$3_s   de PR   PN
5. Lemmas filled in: el DT-gen:$3_t.num:$3_s   planta N-gen:$3_t.num:$3_s   de PR   Víctor PN
6. SL references resolved: el DT-gen:$3_t.num:pl   planta N-gen:$3_t.num:pl   de PR   Víctor PN
7. TL references resolved: el DT-gen:f.num:pl   planta N-gen:f.num:pl   de PR   Víctor PN
8. Output: las plantas de Víctor
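The example above can be traced in code. The sketch below is only an illustration under several assumptions: the dictionary layout, the field names (`lemma`, `cat`, `attrs`, `src`), the `"$3_t"` string encoding of references and the tiny `BIDIX` table are all hypothetical, the empty value ε is written as `""`, and restrictions are left out because they are empty in this GAT.

```python
# Hypothetical bilingual dictionary: (SL lemma, category) -> (TL lemma, TL attributes).
BIDIX = {("Victor", "PN"): ("Víctor", {}), ("plant", "N"): ("planta", {"gen": "f"})}

GAT = {
    "sl": [("PN", {}), ("POS", {}), ("N", {"gen": "", "num": "*"})],
    "tl": [
        {"lemma": "el", "cat": "DT", "attrs": {"gen": "$3_t", "num": "$3_s"}},
        {"lemma": None, "src": 3, "cat": "N", "attrs": {"gen": "$3_t", "num": "$3_s"}},
        {"lemma": "de", "cat": "PR", "attrs": {}},
        {"lemma": None, "src": 1, "cat": "PN", "attrs": {}},
    ],
}


def matches(gat, sl_forms):
    """Check categories and attribute values; '*' accepts any value."""
    if len(sl_forms) != len(gat["sl"]):
        return False
    for (cat, attrs), form in zip(gat["sl"], sl_forms):
        if cat != form["cat"]:
            return False
        for name, value in attrs.items():
            if value != "*" and form["attrs"].get(name) != value:
                return False
    return True


def lookup(form):
    """Translate one SL lexical form with the bilingual dictionary, if possible."""
    entry = BIDIX.get((form["lemma"], form["cat"]))
    if entry is None:
        return None
    lemma, tl_attrs = entry
    return {"lemma": lemma, "cat": form["cat"], "attrs": {**form["attrs"], **tl_attrs}}


def resolve(value, name, sl_forms, tl_forms):
    """Replace a $j_s / $j_t reference by the attribute value it points to."""
    if not value.startswith("$"):
        return value
    index, side = value[1:].split("_")
    forms = sl_forms if side == "s" else tl_forms
    return forms[int(index) - 1]["attrs"][name]


def apply_gat(gat, sl_forms):
    if not matches(gat, sl_forms):
        return None
    tl_forms = [lookup(form) for form in sl_forms]
    output = []
    for wc in gat["tl"]:
        lemma = wc["lemma"] or tl_forms[wc["src"] - 1]["lemma"]
        attrs = {n: resolve(v, n, sl_forms, tl_forms) for n, v in wc["attrs"].items()}
        output.append({"lemma": lemma, "cat": wc["cat"], "attrs": attrs})
    return output


victors_plants = [
    {"lemma": "Victor", "cat": "PN", "attrs": {}},
    {"lemma": "'s", "cat": "POS", "attrs": {}},
    {"lemma": "plant", "cat": "N", "attrs": {"gen": "", "num": "pl"}},
]
result = apply_gat(GAT, victors_plants)
```

The result is a sequence of TL lexical forms; producing the surface string "las plantas de Víctor" would still require morphological generation, which is outside this sketch.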
Inference of shallow-transfer rules
Method overview
Inference of shallow-transfer rules
1. Bilingual phrase extraction
  1. Analyse the SL and TL sides of the parallel corpus
  2. Compute statistical word alignments as in SMT
  3. Extract bilingual phrases compatible with the alignments (in a similar way to SMT)
Example:
  English: There were white houses
  Spanish: Había casas blancas
  Analysed SL: there ADV   be VERB-t:past   white ADJ-gen:ε.num:ε   house N-gen:ε.num:pl
  Analysed TL: haber VERB-t:past.p:3.num:sg   casa N-gen:f.num:pl   blanco ADJ-gen:f.num:pl
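Step 3 can be sketched with the usual consistency criterion from phrase-based SMT: a phrase pair is kept when no word inside it is aligned to a word outside it. The sketch below is a simplified version (it does not extend phrases over unaligned words and works on surface forms rather than lexical forms), and the word alignment used is an assumption, since the alignment figure is not recoverable from the slide.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Return (source phrase, target phrase) pairs consistent with the alignment.

    alignment is a set of (source index, target index) pairs.
    """
    pairs = []
    for s1 in range(len(src)):
        for s2 in range(s1, min(len(src), s1 + max_len)):
            # Target positions linked to the source span [s1, s2].
            tgt_idx = [t for s, t in alignment if s1 <= s <= s2]
            if not tgt_idx:
                continue
            t1, t2 = min(tgt_idx), max(tgt_idx)
            # Consistency: every link into [t1, t2] must come from [s1, s2].
            if all(s1 <= s <= s2 for s, t in alignment if t1 <= t <= t2):
                pairs.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs


src = ["there", "were", "white", "houses"]
tgt = ["había", "casas", "blancas"]
# Assumed alignment: there-había, were-había, white-blancas, houses-casas.
alignment = {(0, 0), (1, 0), (2, 2), (3, 1)}
phrases = extract_phrases(src, tgt, alignment)
```

Under this assumed alignment, "there" alone is not extracted because "había" is also linked to "were", while "white houses" and "casas blancas" form a consistent pair.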
Inference of shallow-transfer rules
2. Generation of GATs
Strategy: from each bilingual phrase, generate as many GATs as possible, as long as they correctly reproduce the original bilingual phrase
Transformations applied:
  1. Generate a very specific GAT (it only matches the initial bilingual phrase)
  2. Lexical generalisation
  3. Morphological generalisation
Inference of shallow-transfer rules
2. Generation of GATs
Step 1: Generate a very specific GAT
Bilingual phrase (English–Spanish):
  SL: white ADJ-gen:ε.num:ε   house N-gen:ε.num:pl
  TL: casa N-gen:f.num:pl   blanco ADJ-gen:f.num:pl
GAT:
  SL word classes: 1 white ADJ-gen:ε.num:ε   2 house N-gen:ε.num:pl
  TL word classes: 1 casa N-gen:f.num:pl   2 blanco ADJ-gen:f.num:pl
  Restrictions: r1 = {gen:ε, num:ε}, r2 = {gen:f, num:pl}
Inference of shallow-transfer rules
2. Generation of GATs
Step 2: Lexical generalisation
Remove lemmas from word classes if they can be obtained from the bilingual dictionary
Restrictions in all the variants: r1 = {gen:ε, num:ε}, r2 = {gen:f, num:pl}
Starting GAT:
  SL word classes: 1 white ADJ-gen:ε.num:ε   2 house N-gen:ε.num:pl
  TL word classes: 1 casa N-gen:f.num:pl   2 blanco ADJ-gen:f.num:pl
Adjective lemma removed:
  SL word classes: 1 ADJ-gen:ε.num:ε   2 house N-gen:ε.num:pl
  TL word classes: 1 casa N-gen:f.num:pl   2 ADJ-gen:f.num:pl
Noun lemma removed:
  SL word classes: 1 white ADJ-gen:ε.num:ε   2 N-gen:ε.num:pl
  TL word classes: 1 N-gen:f.num:pl   2 blanco ADJ-gen:f.num:pl
Both lemmas removed:
  SL word classes: 1 ADJ-gen:ε.num:ε   2 N-gen:ε.num:pl
  TL word classes: 1 N-gen:f.num:pl   2 ADJ-gen:f.num:pl
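The four variants shown for lexical generalisation correspond to every subset of the aligned word pairs whose TL lemma can be recovered from the bilingual dictionary. A minimal sketch, assuming a hypothetical lemma-to-lemma dictionary `bidix` and the alignment white-blanco, house-casa (the alignment lines of the figure are not recoverable from the slide):

```python
from itertools import combinations


def lexical_generalisations(sl, tl, alignment, bidix):
    """Generate GAT variants by removing lemmas from aligned word classes.

    sl, tl: lists of (lemma, tags); alignment: list of (sl index, tl index).
    A removed lemma is represented by None.
    """
    # Only pairs whose TL lemma is what the bilingual dictionary returns can be generalised.
    removable = [(i, j) for i, j in alignment if bidix.get(sl[i][0]) == tl[j][0]]
    variants = []
    for k in range(len(removable) + 1):
        for subset in combinations(removable, k):
            new_sl, new_tl = list(sl), list(tl)
            for i, j in subset:
                new_sl[i] = (None, sl[i][1])
                new_tl[j] = (None, tl[j][1])
            variants.append((new_sl, new_tl))
    return variants


sl = [("white", "ADJ-gen:ε.num:ε"), ("house", "N-gen:ε.num:pl")]
tl = [("casa", "N-gen:f.num:pl"), ("blanco", "ADJ-gen:f.num:pl")]
alignment = [(0, 1), (1, 0)]                    # assumed: white-blanco, house-casa
bidix = {"white": "blanco", "house": "casa"}    # hypothetical dictionary entries
variants = lexical_generalisations(sl, tl, alignment, bidix)
```

The first variant is the fully lexicalised GAT and the last one has no lemmas left, matching the progression shown on the slide.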
Inference of shallow-transfer rules
2. Generation of GATs
Step 3: Morphological generalisation
  Detect morphological inflection attributes whose value can be obtained with references ($j_s or $j_t) in the TL attributes
  Add wildcards (*) in the SL attributes
  Remove restrictions
Starting GAT:
  SL word classes: 1 ADJ-gen:ε.num:ε   2 N-gen:ε.num:pl
  TL word classes: 1 N-gen:f.num:pl   2 ADJ-gen:f.num:pl
  Restrictions: r1 = {gen:ε, num:ε}, r2 = {gen:f, num:pl}
Gender generalised:
  SL word classes: 1 ADJ-gen:*.num:ε   2 N-gen:*.num:pl
  TL word classes: 1 N-gen:$2_t.num:pl   2 ADJ-gen:$2_t.num:pl
  Restrictions: r1 = {num:ε}, r2 = {num:pl}
Number generalised:
  SL word classes: 1 ADJ-gen:ε.num:*   2 N-gen:ε.num:*
  TL word classes: 1 N-gen:f.num:$2_s   2 ADJ-gen:f.num:$2_s
  Restrictions: r1 = {gen:ε}, r2 = {gen:f}
Gender and number generalised:
  SL word classes: 1 ADJ-gen:*.num:*   2 N-gen:*.num:*
  TL word classes: 1 N-gen:$2_t.num:$2_s   2 ADJ-gen:$2_t.num:$2_s
  Restrictions: r1 = {}, r2 = {}
Inference of shallow-transfer rules
3. Choosing the most appropriate GATs
Strategy: select the minimum set of GATs needed to reproduce all the bilingual phrases extracted from the parallel corpus
  The appropriate level of generalisation is found: the more general the GATs, the fewer GATs are needed to reproduce the bilingual phrases
  In case of conflicts, more specific GATs are used
  First approach in which conflicts are solved at a global level
Result: a hierarchy in which specific GATs correct the errors of more general ones
Inference of shallow-transfer rules
Implementation:
  NP-hard problem, similar to the well-studied set cover problem (Garey and Johnson, 1979)
  Split into independent subproblems, one for each sequence of SL lexical categories
  Rewritten as an integer linear programming problem: optimisation of a function subject to restrictions encoded as a set of linear inequalities
  Solved with branch and cut (Xu et al., 2009)
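The selection problem can be illustrated on a toy instance. The sketch below does not use integer linear programming or branch and cut; it swaps in an exhaustive search, which is only feasible for a handful of GATs. The GAT names, the numeric `specificity` score and the `matches` / `correct` sets are hypothetical stand-ins for what would be computed from the bilingual phrases.

```python
from itertools import combinations


def select_gats(gats, phrases):
    """Smallest subset of GATs that reproduces every bilingual phrase.

    Each GAT is a dict with:
      matches: ids of the phrases whose SL side it matches
      correct: ids of the phrases it translates correctly
      specificity: higher means more specific (wins conflicts)
    """
    def reproduces_all(chosen):
        for phrase in phrases:
            matching = [g for g in chosen if phrase in g["matches"]]
            if not matching:
                return False
            # In case of conflict, the most specific GAT is the one applied.
            winner = max(matching, key=lambda g: g["specificity"])
            if phrase not in winner["correct"]:
                return False
        return True

    # Try subsets by increasing size, so the first hit is a minimum set.
    for size in range(1, len(gats) + 1):
        for chosen in combinations(gats, size):
            if reproduces_all(chosen):
                return [g["name"] for g in chosen]
    return None


phrases = {1, 2, 3, 4}
gats = [
    # General GAT: matches everything but mistranslates phrase 4.
    {"name": "general", "specificity": 0, "matches": {1, 2, 3, 4}, "correct": {1, 2, 3}},
    # Specific GATs, each covering a single phrase.
    {"name": "specific-1", "specificity": 1, "matches": {1}, "correct": {1}},
    {"name": "specific-4", "specificity": 1, "matches": {4}, "correct": {4}},
]
selected = select_gats(gats, phrases)
```

In this toy instance the general GAT is kept and a single specific GAT corrects its error on phrase 4, which mirrors the hierarchy described for the selection step.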
Inference of shallow-transfer rules
4. Optimising rules for segmentation
Generate GATs only for selected sequences of SL lexical categories so as to ensure correct segmentation:
  1. Identify key text segments: text segments that maximise BLEU when sentences are translated with the most specific GAT applied to them
  2. Discard sequences of lexical categories that usually prevent key text segments from being translated with the same rule
Example: The white house and the red cars (DT ADJ N CC DT ADJ N)
Remove redundant GATs: long GATs that produce the same translation as the combination of shorter ones
Evaluation
Evaluation goal: compare the translation quality achieved by the inferred rules with other approaches:
  Word-for-word translation
  Sánchez-Martínez & Forcada (2009)
  Apertium handcrafted rules
Procedure:
  1. Infer rules from small corpus fragments for different language pairs
  2. Translate out-of-domain texts and compute MT evaluation metrics (BLEU, TER, METEOR)
Language pair       Train          Test
Spanish ↔ Catalan   El Periódico   Consumer Eroski
English ↔ Spanish   Europarl       Newstest 2013
Breton → French     Ofis Publik    Ofis Publik *
Some results
The new algorithm systematically outperforms Sánchez-Martínez & Forcada (2009) by a statistically significant margin
The number of GATs inferred is at least one order of magnitude smaller → easier revision and maintenance of rules
Example: Spanish → English
[Figure: BLEU score (0.14 to 0.21) against the size of the training corpus in sentences (100 to 5000, log scale) for Sánchez-Cartagena et al., Sánchez-Martínez and Forcada, Apertium handcrafted rules and word-for-word translation]
Some results
Morphological generalisation involves a vast computational cost and is only useful for very small corpora
Disabling it permits using a larger training corpus and reaching the translation quality of handcrafted rules
Example: Spanish → English
[Figure: BLEU score (0.14 to 0.21) against the size of the training corpus in sentences (100 to 25000, log scale) for Sánchez-Cartagena et al. (no wildcard), Sánchez-Martínez and Forcada, Apertium handcrafted rules and word-for-word translation]
Outline
1. Introduction
2. Inferring shallow-transfer rules from small parallel corpora
3. Integrating shallow-transfer data into statistical machine translation
4. Assisting non-expert users in extending morphological dictionaries
5. Concluding remarks
Motivation
Goal: a new method for integrating shallow-transfer RBMT linguistic resources into the phrase-based SMT architecture
Why?
  Tackle the data sparseness problem in SMT
    Existing dictionaries + rule inference → generalisation of knowledge from the parallel corpus to unseen sequences of words
  Both shallow-transfer RBMT and phrase-based SMT split the text into flat sequences
  No strategy in the literature addresses the integration of shallow-transfer RBMT resources into the SMT architecture
  The existing black-box approach (Eisele et al., 2008) presents strong limitations
Motivation
Limitations of a black-box strategy:
  Incorrect/missing phrase pairs extracted
    A fierce lion eats a lot
    Un león feroz come mucho
    Phrases extracted: fierce – un león, lot – mucho, ...
  Inadequate balance between the two types of phrase pairs
Integrating RBMT data into SMT
Method overview
  Use the inner workings of the RBMT translation process to generate error-free bilingual phrases
  Join corpus-extracted and synthetic phrase pairs, do phrase scoring and add a binary feature function
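The joining step might look roughly like the sketch below. It rests on assumptions the slide does not spell out: phrase scoring is reduced to a single relative-frequency estimate, and the binary feature is assumed to flag whether a phrase pair was generated from the RBMT data. The function and field names are hypothetical.

```python
from collections import Counter


def build_phrase_table(corpus_pairs, synthetic_pairs):
    """Join both lists of (source, target) phrase pairs and score them together."""
    counts = Counter(corpus_pairs) + Counter(synthetic_pairs)
    source_totals = Counter()
    for (source, _), count in counts.items():
        source_totals[source] += count
    synthetic = set(synthetic_pairs)
    return {
        (source, target): {
            # Relative-frequency estimate of p(target | source).
            "p_target_given_source": count / source_totals[source],
            # Assumed meaning of the binary feature: 1 if the pair comes from RBMT data.
            "from_rbmt": 1 if (source, target) in synthetic else 0,
        }
        for (source, target), count in counts.items()
    }


corpus_pairs = [
    ("the white house", "la casa blanca"),
    ("the white house", "la casa blanca"),
    ("house", "casa"),
    ("house", "hogar"),
]
synthetic_pairs = [
    ("the white house", "la casa blanca"),
    ("the red cars", "los coches rojos"),
    ("house", "casa"),
]
table = build_phrase_table(corpus_pairs, synthetic_pairs)
```

Scoring both sources together lets synthetic pairs reinforce corpus-extracted ones, while the extra feature gives the decoder a way to weight the two origins differently.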
Integrating RBMT data into SMT
Generation of synthetic phrase pairs
Strategy:
  Generate phrase pairs for all the bilingual dictionary entries
  Segment the SL text to be translated with shallow-transfer rules
  All the linguistic information is extracted from the RBMT system without loss
Example:
  SL text: The white house and the red cars
  Analysis: the DT   white ADJ   house N-num:sg   and CC   the DT   red ADJ   car N-num:pl
  Generated bilingual phrases: the white house – la casa blanca; the red cars – los coches rojos
Evaluation
Evaluation goals:
  Compare the translation quality achieved by the hybrid system with other approaches:
    Pure RBMT and phrase-based SMT systems
    Black-box hybrid approach by Eisele et al. (2008)
  Measure the impact of:
    Size of the parallel and monolingual corpora
    Rules: automatically inferred or handcrafted
    Domain of the test corpus: same as or different from training
Procedure: build hybrid systems using Apertium data and compute MT evaluation metrics (BLEU, TER, METEOR)
Language pair       Train         Out-of-domain test   TL model
English ↔ Spanish   Europarl      Newstest 2010        Europarl (+ newscrawl)
Breton → French     Ofis Publik   −                    Ofis Publik + Europarl
Some results
Systematically outperforms the black-box approach
Outperforms the pure systems when the parallel corpus is very small or out-of-domain texts are translated
Increasing the size of the language model reduces the impact
Example: Spanish → English, out-of-domain, handcrafted rules
[Figure, two panels (TL model: Europarl; TL model: Europarl + newscrawl, 4x bigger): BLEU score against the size of the training corpus in sentences (10000 to 1272260) for SMT, hybrid and Apertium]
Some results
Hybrid systems with inferred rules can outperform those with only dictionaries without using any additional resource
Example: English → Spanish, out-of-domain. TL model: Europarl
[Figure: BLEU score (0.14 to 0.26) against the size of the training corpus in sentences (10000 to 1272260) for SMT, hybrid with automatically inferred rules, hybrid with handcrafted rules, hybrid with dictionaries only and Apertium]
Some results
Outperform factored translation models (Koehn and Hoang, 2007) for small parallel corpora
Example: English → Spanish, out-of-domain. TL model: Europarl (the factored system uses surface forms + morphological information)
[Figure: BLEU score (0.18 to 0.26) against the size of the training corpus in sentences (10000 to 1272260) for hybrid with automatically inferred rules and the factored system]
Outline
1. Introduction
2. Inferring shallow-transfer rules from small parallel corpora
3. Integrating shallow-transfer data into statistical machine translation
4. Assisting non-expert users in extending morphological dictionaries
5. Concluding remarks
Motivation
Goal: allow non-expert users to insert entries in RBMT morphological dictionaries
Why?
  Creating morphological dictionaries from scratch consumes a great portion of the development time of an RBMT system
  Dictionaries are less complex than transfer rules
  Non-expert users can enlarge them. No need for people with:
    Linguistic background
    Knowledge of the RBMT system
Result: reduced RBMT development costs
Overview
The system asks iteratively: Is this word a valid form of the word to be inserted?
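One way to picture the iterative questioning is as a process of elimination over candidate inflection paradigms. The sketch below is an assumption-laden illustration, not the system's actual strategy: the candidate paradigms, the heuristic of asking about the form that splits the candidates most evenly, and all names are hypothetical. The `oracle` function stands for the user's yes/no answer.

```python
def choose_paradigm(candidates, oracle):
    """Narrow down candidate paradigms with yes/no questions about word forms.

    candidates: dict mapping a paradigm name to the set of forms it would generate.
    oracle: function taking a form and returning True if the user accepts it.
    Returns the remaining paradigm names and the list of forms asked about.
    """
    remaining = dict(candidates)
    asked = []
    while len(remaining) > 1:
        def support(form):
            # Number of remaining candidates that generate this form.
            return sum(form in forms for forms in remaining.values())

        all_forms = sorted(set().union(*remaining.values()))
        # Only forms generated by some, but not all, candidates are informative.
        informative = [f for f in all_forms if 0 < support(f) < len(remaining)]
        if not informative:
            break
        # Heuristic (assumption): ask about the form that splits candidates most evenly.
        form = min(informative, key=lambda f: abs(support(f) - len(remaining) / 2))
        answer = oracle(form)
        asked.append(form)
        remaining = {name: forms for name, forms in remaining.items()
                     if (form in forms) == answer}
    return sorted(remaining), asked


# Hypothetical candidates for inserting the English noun "wish".
candidates = {
    "plural-in-s": {"wish", "wishs"},
    "plural-in-es": {"wish", "wishes"},
    "invariable": {"wish"},
}
valid_forms = {"wish", "wishes"}          # what a user would accept
chosen, questions = choose_paradigm(candidates, lambda form: form in valid_forms)
```

Each answer only requires the user to judge a single word form, which is what makes the task accessible without linguistic training.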