An Italian to Catalan RBMT system reusing data from existing language pairs
Antonio Toral, Mireia Ginestí-Rosell, Francis Tyers
2nd International Workshop on Free/Open-Source Rule-Based Machine Translation
2011/01/21

Contents
1. Introduction
2. Methodology
   - Crossdics
   - Inconsistencies
   - Coverage
3. Evaluation
   - Setting
   - Results
4. Conclusions

Two main approaches in Machine Translation: Rule-Based (RBMT) and Statistical (SMT)

What is needed to build a system for a new language pair?
- RBMT: dictionaries and rules
- SMT: parallel corpus

Drawbacks:
- RBMT: linguistic expertise in both languages, manual construction
- SMT: only applicable to language pairs with large amounts of parallel data

This paper: build an RBMT system by exploiting data from existing pairs
- We build an MT system for pair a–b given existing systems for pairs a–c and b–c
- Italian → Catalan from Apertium's Italian–Spanish and Catalan–Spanish

Motivation:
- RBMT is competitive and useful for languages without parallel corpora
- Reusing data from similar pairs significantly reduces the amount of work

Input dictionaries:
- es–it: mono es 11k, mono it 10k, bi 12k
- es–ca: mono es 44k, mono ca 40k, bi 51k

Output dictionaries:
- it–ca: mono it 7k, mono ca 8k, bi 9k

Other linguistic data:
- it tagger and disambiguation probabilities taken from it–es
- transfer rules: 35 taken from oc–ca (mainly noun phrases) + 9 manually created (verbs and clitic pronouns)
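
The idea of deriving an it–ca bilingual dictionary from es–it and es–ca can be illustrated with a minimal sketch. This is not the crossdics implementation: the function name `cross_dictionaries`, the flat `(word, pivot_word)` representation, and the toy entries are all assumptions made for illustration, and real Apertium entries also carry morphological tags.

```python
from collections import defaultdict


def cross_dictionaries(a_to_c, b_to_c):
    """Pair every a-word with every b-word that shares a pivot (c) translation."""
    # Index the b-c dictionary by its pivot-side entry.
    by_pivot = defaultdict(set)
    for b_word, c_word in b_to_c:
        by_pivot[c_word].add(b_word)

    # Join the a-c dictionary against that index.
    crossed = set()
    for a_word, c_word in a_to_c:
        for b_word in by_pivot.get(c_word, ()):
            crossed.add((a_word, b_word))
    return sorted(crossed)


# Toy entries (hypothetical, not taken from the Apertium dictionaries).
it_es = [("casa", "casa"), ("gatto", "gato"), ("libro", "libro"), ("finestra", "ventana")]
ca_es = [("casa", "casa"), ("gat", "gato"), ("llibre", "libro")]

it_ca = cross_dictionaries(it_es, ca_es)
print(it_ca)
```

In this sketch an entry whose pivot word is missing from the other dictionary (here `finestra`) is dropped, so the crossed dictionary cannot be larger than what the two inputs share through the pivot.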

Reasons for inconsistencies:
- Differences of gender and number (it–ca)
- Different ways of categorising lemmas and morphological features (es–it and es–ca)
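
The second reason can be made concrete with a small sketch that flags pivot lemmas that the two source dictionaries tag differently. Everything here is an assumption for illustration: the function name `pivot_tag_mismatches`, the `(lemma, tags)` representation of the Spanish side of each dictionary, and the sample entries and tags.

```python
def pivot_tag_mismatches(first_pivot_entries, second_pivot_entries):
    """Return pivot lemmas found in both dictionaries whose tag sets never coincide."""

    def index(entries):
        # Map each lemma to the set of tag tuples it appears with.
        tags_by_lemma = {}
        for lemma, tags in entries:
            tags_by_lemma.setdefault(lemma, set()).add(tags)
        return tags_by_lemma

    first = index(first_pivot_entries)
    second = index(second_pivot_entries)
    return {
        lemma: (sorted(first[lemma]), sorted(second[lemma]))
        for lemma in sorted(first.keys() & second.keys())
        if not (first[lemma] & second[lemma])
    }


# Hypothetical Spanish-side entries from the two source dictionaries.
es_side_of_es_it = [("casa", ("n", "f")), ("artista", ("n", "mf"))]
es_side_of_es_ca = [("casa", ("n", "f")), ("artista", ("n", "m"))]

mismatches = pivot_tag_mismatches(es_side_of_es_it, es_side_of_es_ca)
print(mismatches)
```

A cross that matches on lemma and tags together would miss such lemmas, so a report like this one indicates which entries would need manual review.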