Evaluating Data Sources in a Large Czech-English Corpus CzEng 0.9

Ondřej Bojar, Adam Liška, Zdeněk Žabokrtský
{bojar,zabokrtsky}@ufal.mff.cuni.cz, adam.liska@gmail.com
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University, Prague
May 19, 2010

Outline
• CzEng 0.9 overview
• Our contribution:
  – Evaluating CzEng 0.9 filters.
  – Implementing and evaluating new filters.
• Utility of data sources.

CzEng 0.9
• large parallel Czech-English corpus
• various sources to include as much material as possible
• 8 million parallel sentences: 93 million English tokens, 82 million Czech tokens

Share of sources by number of tokens:
eu (JRC-Acquis)    34 %
subtitles          28 %
fiction (books)    18 %
techdoc            10 %
paraweb             6 %
other               4 %

Common Processing Pipeline
All documents go through the same processing pipeline:
• conversion to UTF-8 encoded plain text
• segmentation
• sentence alignment using Hunalign
• only 1-1 aligned sentences are kept (82%)
• heuristic filters remove mis-aligned or malformed pairs
• automatic analyses at the morphological, analytical (surface syntactic) and tectogrammatical (deep syntactic) layers, using the TectoMT platform and following Functional Generative Description and the Prague Dependency Treebank (PDT, Hajič et al. (2006))

Filters Used in CzEng 0.9
A sentence pair is filtered out if any of the following holds:
• the Czech and English sentences are identical
• the lengths of the sentences are too different
• no Czech word on the Czech side or no English word on the English side
• suspicious character
• clearly suspicious segmentation or tokenization
• leftover HTML entities or tags
• relics of meta-information
The filters were not empirically evaluated!
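Two of the filters listed above, identical sentences and mismatching lengths, can be sketched in a few lines. This is only an illustration: the function names, whitespace tokenization and the `max_ratio` value are assumptions, not the CzEng 0.9 implementation.

```python
def is_identical(cs: str, en: str) -> bool:
    """Flag pairs whose Czech and English sides are the same string."""
    return cs.strip() == en.strip()


def lengths_too_different(cs: str, en: str, max_ratio: float = 3.0) -> bool:
    """Flag pairs whose token counts differ by more than max_ratio.

    max_ratio is a hypothetical value chosen for this sketch.
    Tokens are approximated by whitespace splitting.
    """
    len_cs, len_en = len(cs.split()), len(en.split())
    if min(len_cs, len_en) == 0:
        return True  # an empty side is always suspicious
    return max(len_cs, len_en) / min(len_cs, len_en) > max_ratio
```

A pair would be dropped as soon as either function returns `True`.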

New filters
Applied to segments included in CzEng 0.9. A sentence pair is filtered out if:
• the English side contains non-ASCII characters that are not present in the Czech sentence
• the numbers used in the Czech and English sentences differ
• the word-alignment score of the sentence pair is below a given threshold

New Filter: Non-ASCII characters
• Typical problem:
  "English": Koupě zboží za účelem jeho dalšího prodeje a prodej . (The purchase of goods for the purposes of re-selling and selling.)
  Czech: Specialista na osobní a nákladní vozidla . (The specialist for cars and lorries.)
• Causes: incorrect document/sentence alignment, non-parallel input
• English segments with non-ASCII characters that are not present in the Czech segment are filtered out
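The rule stated on this slide can be sketched as a character-set check. The function name is an assumption, and the sketch presumes both sides use the same Unicode normalization form; it is not the authors' code.

```python
def non_ascii_filter(en: str, cs: str) -> bool:
    """Return True if the pair should be filtered out.

    The pair is flagged when the English side contains a non-ASCII
    character that does not occur anywhere in the Czech side.
    """
    cs_chars = set(cs)
    return any(ord(ch) > 127 and ch not in cs_chars for ch in en)


# The example from the slide: the "English" side is in fact Czech.
en = "Koupě zboží za účelem jeho dalšího prodeje a prodej ."
cs = "Specialista na osobní a nákladní vozidla ."
flagged = non_ascii_filter(en, cs)  # True: e.g. "ě" is absent from the Czech side
```

Non-ASCII characters shared by both sides, such as a name spelled identically in both languages, do not trigger the filter.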

New Filter: Use of Numbers
• Filter looks for numerical and written equivalents of the numbers found in the English segment
• Filters out a wide range of mistakes:
  English: Hours must be reported in . 25 increments .
  Czech: Hodiny je nutné zadat v intervalech po 0 (Hours have to be entered in increments of 0)
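A much simplified sketch of such a number check follows. It only looks at digit sequences in the English segment, and the table of written Czech equivalents is a tiny hypothetical stand-in; the actual filter is likely to cover far more cases.

```python
import re

# Hypothetical, deliberately tiny table of written Czech equivalents.
CZECH_NUMBER_WORDS = {
    "1": {"jeden", "jedna", "jedno"},
    "2": {"dva", "dvě"},
    "3": {"tři"},
}


def numbers_mismatch(en: str, cs: str) -> bool:
    """Return True if some number in the English segment has no
    numerical or written counterpart in the Czech segment."""
    cs_tokens = set(cs.lower().split())
    cs_digits = set(re.findall(r"\d+", cs))
    for num in re.findall(r"\d+", en):
        written = CZECH_NUMBER_WORDS.get(num, set())
        if num not in cs_digits and not (written & cs_tokens):
            return True
    return False
```

On the slide's example, the English segment contains `25` while the Czech one only contains `0`, so the pair is flagged.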

New Filter: Word-alignment Score
• Filter considers alignment probabilities in both directions
• GIZA++: Hidden Markov Model, IBM Model 1, IBM Model 3 and IBM Model 4 trained on lemmas

\[
\mathrm{Score}(e_1^J, f_1^I) = \frac{1}{J} \log p(e_1^J, a \mid f_1^I) + \frac{1}{I} \log p(f_1^I, a \mid e_1^J) \tag{1}
\]
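The score combines the two directional alignment log-probabilities, each normalized by the length of the sentence being generated. A minimal sketch, assuming the log-probabilities have already been obtained from the alignment models (the function and parameter names are illustrative):

```python
def alignment_score(logp_e_given_f: float, logp_f_given_e: float,
                    len_e: int, len_f: int) -> float:
    """Length-normalized sum of the two directional log-probabilities.

    logp_e_given_f: log p(e, a | f), normalized by the English length J
    logp_f_given_e: log p(f, a | e), normalized by the Czech length I
    """
    if len_e <= 0 or len_f <= 0:
        raise ValueError("sentence lengths must be positive")
    return logp_e_given_f / len_e + logp_f_given_e / len_f


def below_threshold(score: float, threshold: float) -> bool:
    """The pair is filtered out when its score is below the threshold.
    The threshold value itself has to be tuned; none is assumed here."""
    return score < threshold
```

For instance, log-probabilities of -10 and -12 for sentences of 5 and 4 tokens give a score of -2 + (-3) = -5.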

Overall Evaluation
• Evaluated on two sets of 1000 sentence pairs:
  – CzEng filters: sentence pairs selected from aligned plaintext files
  – new filters: first 1000 segments from CzEng (randomized at the level of short sequences of sentences)
• overall precision: any filter fires ⇒ was it indeed a bad segment?

\[
\text{overall precision} = \frac{|\text{segments marked by both human and at least one filter}|}{|\text{segments marked by at least one filter}|} \tag{2}
\]

• overall recall: how many bad segments are found?

\[
\text{overall recall} = \frac{|\text{segments marked by both human and at least one filter}|}{|\text{segments marked by human}|} \tag{3}
\]
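Both measures reduce to set operations over segment identifiers. The sketch below assumes the human annotations and the filter decisions are available as sets of segment IDs; the names are illustrative.

```python
from typing import Set, Tuple


def precision_recall(human_bad: Set[int], filter_bad: Set[int]) -> Tuple[float, float]:
    """Precision and recall of the filters against human judgements.

    human_bad:  segments marked as bad by the human annotator
    filter_bad: segments marked by at least one filter
    """
    both = human_bad & filter_bad
    precision = len(both) / len(filter_bad) if filter_bad else 0.0
    recall = len(both) / len(human_bad) if human_bad else 0.0
    return precision, recall


# Toy example: 4 bad segments, the filters fire on 3, 2 of them correctly.
p, r = precision_recall({1, 2, 3, 4}, {3, 4, 5})
# p == 2/3, r == 0.5
```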

Evaluation of the Filters
• Extended sets of sentence pairs:
  – CzEng filters: 200 segments where the filter fired
  – new filters: 500 segments where the filter fired
• filter precision: the filter fires ⇒ was it indeed a bad segment?

\[
\text{filter precision} = \frac{|\text{segments marked by both human and the filter}|}{|\text{segments marked by the filter, i.e. 200 or 500}|} \tag{4}
\]

• filter recall: how many bad segments are found?

\[
\text{filter recall} = \frac{|\text{segments marked by both human and the filter}|}{|\text{segments marked by human}|} \tag{5}
\]

Evaluation of CzEng Filters

Selected CzEng Filters             Precision   Recall
Not enough letters                 94%         7%
Mismatching lengths                91%         11%
Repeated character                 88%         2%
No English word                    80%         11%
Suspicious char.                   75%         1%
Identical                          72%         26%
No Czech word                      67%         2%
Too long sentence                  12%         0%
Extra header                       2%          0%
Overall (all filters)              57%         42%
Overall (evaluated filters only)   57%         41%

• Surprisingly low precision of many filters.
• Large margin for recall improvement.

Evaluation of New Filters

Filter                            Precision   Recall
Non-ASCII characters in English   100%        4%
Number                            88%         6%
Word-alignment scores             77%         33%
Overall                           79%         40%

• Applied on top of original CzEng 0.9 filtering.
• Word-alignment can be tuned for precision/recall.

Prec/Rec for Alignment Filters
[Plot: recall (y-axis, 0 to 100) against precision (x-axis, 0 to 100) for the word-alignment score filter; legend notes: "100k", "lower is better".]
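A precision/recall curve of this kind can be traced by sweeping the score threshold over an annotated sample. The following sketch assumes a list of per-pair scores and a set of indices judged bad by a human; all names and the toy values are illustrative.

```python
from typing import List, Sequence, Set, Tuple


def sweep_thresholds(scores: Sequence[float], human_bad: Set[int],
                     thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """For each threshold, flag pairs scoring below it and report
    (threshold, precision, recall) against the human judgements."""
    results = []
    for t in thresholds:
        flagged = {i for i, s in enumerate(scores) if s < t}
        both = flagged & human_bad
        precision = len(both) / len(flagged) if flagged else 0.0
        recall = len(both) / len(human_bad) if human_bad else 0.0
        results.append((t, precision, recall))
    return results


# Toy data: pairs 0 and 2 are bad and also have the lowest scores.
curve = sweep_thresholds([-9.0, -2.0, -7.0, -1.0], {0, 2}, [-8.0, -5.0, 0.0])
```

Raising the threshold flags more pairs, which typically increases recall at the cost of precision.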

Prec/Rec for Hunalign Scores
[Plot: recall (y-axis, 0 to 100) against precision (x-axis, 0 to 100) for Hunalign quality scores, with curves for "higher is better" and "lower is better".]
⇒ Hunalign scores not suitable for filtering.

Utility of Data Sources 1

Source      Bad 1-1 Segments [%]   Most Frequent Error
subtitles   4.6                    Mismatching lengths (42.0%)
eu          33.3                   Identical (39.9%)
techdoc     10.2                   Identical (37.9%)
paraweb     59.5                   Identical (61.7%)
fiction     3.1                    Mismatching lengths (54.9%)
news        3.8                    Identical (54.1%)
navajo      11.9                   Identical (40.9%)

• Large share of Parallel Web and EU texts filtered out
• Fiction, news and subtitles show high utility

Utility of Data Sources 2 - CzEng

Source      Bad 1-1 Segments [%]   Most Frequent Error
subtitles   6.8                    Alignment score (94.5%)
eu          3.3                    Alignment score (68.7%)
techdoc     3.4                    Alignment score (93.7%)
paraweb     17.6                   ASCII (51.2%)
fiction     7.4                    Alignment score (86.0%)
news        2.2                    Alignment score (55.3%)
navajo      1.9                    Alignment score (57.1%)

• Cleanest source: news
• Original filtering still insufficient for Parallel Web segments

Conclusion
• Original CzEng 0.9 filters insufficient.
  – Overall recall ∼40%, precision only 57%.
• New filters on top of CzEng 0.9 ones:
  – Overall recall ∼40%, precision 79%.
• Most reliable sources of data: fiction, news and subtitles.
Future:
• Merge sets of filters.
• Ensemble of many high-precision filters to achieve high recall.
Download: http://ufal.mff.cuni.cz/czeng

Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, and Marie Mikulová. 2006. Prague Dependency Treebank 2.0. LDC, Philadelphia.
