
What Data Is Needed? Why?


  1. “What Data Is Needed? Why?” Dr. Khalid Choukri (Evaluations and Language Resource Association). ELRC Workshop in Deutschland, 29.09.2015

  2. What types of data? Translation
     • From the previous session, we have seen that the data-driven paradigm is the predominant approach
       – We “learn” from existing data
     • How Language Resources are produced:
       – from documents and data to valuable Language Resources for MT
       – why it is important that you help us with the data you have or know about
     • The focus is on data in all languages (EU/CEF).

  3. What Data

  4. Translations & Automated Translations

  5. Translations

  6. What types of data? Translation

  7. What types of data? “Aligned” Translation (English / French)

  8. What types of data? “Aligned” Translation

  9. Comparable Collections
     English: Telecommunication occurs when the exchange of information between two or more entities (communication) includes the use of technology. Communication technology uses channels to transmit information (as electrical signals), either over a physical medium (such as signal cables), or in the form of electromagnetic waves. The word is often used in its plural form, telecommunications, because it involves many different technologies.
     Greek: Με τον γενικό όρο τηλεπικοινωνίες, (telecommunications), χαρακτηρίζεται η κάθε μορφής ενσύρματη ή ασύρματη, ηλεκτρομαγνητική, ηλεκτρική, κ.λπ., ακουστική και οπτική επικοινωνία που πραγματοποιείται ανεξαρτήτως απόστασης. Στους σύγχρονους καιρούς, αυτή η διαδικασία σχεδόν πάντα περιλαμβάνει την αποστολή ηλεκτρομαγνητικών κυμάτων ή ηλεκτρικών σημάτων από κατάλληλες ηλεκτρονικές συσκευές, όπως το τηλέφωνο ή ο ασύρματος, αλλά παλαιότερα περιελάμβανε τη χρήση ακουστικών σημάτων, όπως τυμπάνων, ή οπτικών, όπως ο σηματοφόρος καπνός ή η λάμψη της φωτιάς.
     Spanish: Una telecomunicación es toda transmisión y recepción de señales de cualquier naturaleza, típicamente electromagnéticas, que contengan signos, sonidos, imágenes o, en definitiva, cualquier tipo de información que se desee comunicar a cierta distancia. Por metonimia, también se denomina telecomunicación (o telecomunicaciones, indistintamente) a la disciplina que estudia, diseña, desarrolla y explota aquellos sistemas que permiten dichas comunicaciones; de forma análoga, la ingeniería de telecomunicaciones resuelve los problemas técnicos asociados a esta disciplina.
     Source: first sentences of the Telecommunications articles in the English, Greek and Spanish Wikipedias. The German page is slightly different, but these are (never) translations of one source!

  10. Dictionaries / Terminologies / Ontologies

      | ID   | FR | ES | EL |
      |------|----|----|----|
      | 6905 | abandon scolaire | abandono escolar | διακοπή της σχολικής φοίτησης |
      | 920  | abats | despojo | παραπροϊόντα σφαγίων |
      | 1857 | abattage d'animaux | sacrificio de animales | σφαγή ζώων |
      | 6621 | abrogation | derogación | κατάργηση |
      | 5075 | Abruzzes | Abruzos | Αβρουζία |
      | 5339 | absentéisme | absentismo | συστηματική απουσία από την εργασία |
      | 5984 | abstentionnisme | abstencionismo | αποχή |
      | 2    | abus de confiance | abuso de confianza | απιστία |
      | 96   | abus de droit | abuso de derecho | κατάχρηση δικαιώματος |
      | 186  | abus de pouvoir | abuso de poder | κατάχρηση εξουσίας |
      | 280  | accès à l'éducation | acceso a la educación | πρόσβαση στην εκπαίδευση |
      | 372  | accès à l'emploi | acceso al empleo | πρόσβαση στην αγορά εργασίας |

  11. Where can we find such data? Digital World
      • Archives
      • Internet

  12. Digital world … Internet

  13. Internet Era & Digital Data

  14. Of course, there is a need for digital textual data!

  15. Various Formats

  16. Documented Data (Metadata): Dublin Core Metadata Element Set
      1. Title
      2. Creator
      3. Subject
      4. Description
      5. Publisher
      6. Contributor
      7. Date
      8. Type
      9. Format
      10. Identifier
      11. Source
      12. Language
      13. Relation
      14. Coverage
      15. Rights
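To make the idea of documented data concrete, here is a minimal sketch that checks which of the 15 Dublin Core elements a record still lacks. The record values, the lowercase key names, and the helper `missing_elements` are illustrative assumptions, not part of the presentation.

```python
# The 15 elements of the Dublin Core Metadata Element Set, in the order listed on the slide.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def missing_elements(record: dict) -> list:
    """Return the Dublin Core elements that are absent or empty in a record."""
    return [name for name in DUBLIN_CORE_ELEMENTS if not record.get(name)]


# Hypothetical record for one crawled page; every value here is invented.
example_record = {
    "title": "Contact page",
    "creator": "Example Ministry",
    "date": "2015-09-29",
    "format": "text/html",
    "identifier": "http://example.org/en/contact/",
    "language": "en",
    "rights": "unknown",
}

print(missing_elements(example_record))
```

A check like this could run before a resource is shared, so that gaps in the documentation are visible early.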

  17. How LRs are produced
      • Let us see some examples of raw data (HTML with tables, pictures, etc.) and how they become LRs
        – Discover and identify sources
        – Clear IPR and get the data (download, harvest, crawl, …)
        – Clean the data (e.g. detect and remove the “boilerplate”, “templates”, pictures, HTML tags, etc.; convert format)
        – Example of tools: Boilerpipe
        – Document the data
        – Align the translations when identified and break into “sentences”
        – Compute some alignment confidence
        – Share
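The cleaning and sentence-breaking steps can be sketched with the Python standard library alone. This is a simple tag-based heuristic under stated assumptions, not the algorithm used by Boilerpipe: the set of skipped tags and the names `TextExtractor`, `clean_html` and `split_sentences` are hypothetical choices made for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping scripts, styles and typical page-chrome tags."""

    # Assumption: these tags usually hold code or navigation, not running text.
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def clean_html(html: str) -> str:
    """Strip tags and skipped regions, returning plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


page = (
    "<html><head><style>p{}</style></head><body>"
    "<nav>Home | About</nav><p>Hello world. This is a test!</p>"
    "<script>var x = 1;</script><footer>Copyright</footer></body></html>"
)
print(split_sentences(clean_html(page)))
```

Real pages need far more robust boilerplate detection and language-aware sentence splitting; the sketch only shows where each step sits in the pipeline.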

  18. A Language Resource Factory
      • How can this process be turned into a factory of LR production (automation of the procedure)?
      • Some simple illustrations
      • We rather start from the digital world
        – OCR may be considered for the less-resourced languages

  19. Many web sites…

  20. … are rich in multilingual content

  21. How can we obtain this content…

  22. (slide without text)

  23. … and convert it to valuable Language Resources for Machine Translation?

  24. From a web page to the Factory
      • How does this process scale up:
        – Identify a “useful” source (a good candidate for multilingual data)
        – Review and visit all the links (the URLs referenced in each page)
        – “Click on each link” and move forward
      • Get each page and its “potentially” associated one in the other language
      • Identify the “domains”, “genre”, etc. if possible
      • Get rid of the “noise” (ads, format, boilerplate, etc.)
      • Align (documents/files, chapters, paragraphs, sentences)
      • Check accuracy of alignment
      • Use … and share
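One very crude way to check the accuracy of an alignment is to compare segment lengths, since translations of each other tend to have similar lengths. The sketch below is an assumption made for illustration, not the confidence measure of any tool named in the slides; the function names and the 0.5 threshold are hypothetical.

```python
def length_ratio_score(source: str, target: str) -> float:
    """Crude confidence in [0, 1]: 1.0 when both segments have the same length."""
    a, b = len(source.strip()), len(target.strip())
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def filter_pairs(pairs, threshold=0.5):
    """Keep only the pairs whose length ratio reaches the (assumed) threshold."""
    return [(s, t) for s, t in pairs if length_ratio_score(s, t) >= threshold]


candidates = [
    ("abus de droit", "abuso de derecho"),   # plausible French/Spanish pair
    ("abats", "sacrificio de animales"),     # deliberately misaligned pair
]
print(filter_pairs(candidates))
```

A length check only catches gross misalignments; a production pipeline would combine it with lexical evidence such as a bilingual dictionary.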

  25. A Journey in the meandering lines of Internet

  26. (automatically) Follow all referenced links

  27. Referenced links … automated process
      • http://portal.elda.org/
      • http://portal.elda.org/en/
      • http://portal.elda.org/news/rss/
      • http://portal.elda.org/login/
      • http://portal.elda.org/en/login/
      • http://portal.elda.org/reset/
      • http://portal.elda.org/about/elra/contact/
      • http://portal.elda.org/en/about/elra/contact/
      • http://portal.elda.org/tag/85/
      • http://portal.elda.org/en/tag/85/
      • http://portal.elda.org/tag/86/
      • http://portal.elda.org/en/tag/86/
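Following all referenced links starts with extracting them from a page. Here is a minimal sketch using only the Python standard library; it works on an HTML string and does no network access, and the names `LinkCollector` and `same_site_links` are hypothetical. A real crawler would also fetch each page, respect robots.txt, and keep a queue of URLs still to visit.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop "#fragment" parts.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def same_site_links(base_url, html):
    """Return unique links that stay on the same host, in order of appearance."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for link in collector.links:
        if urlparse(link).netloc == host and link not in seen:
            seen.add(link)
            result.append(link)
    return result


sample = (
    '<a href="/en/">EN</a><a href="login/">Login</a>'
    '<a href="http://example.com/x">External</a><a href="/en/#top">Top</a>'
)
print(same_site_links("http://portal.elda.org/", sample))
```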

  28. ILSP Focused Crawler
      • Research prototype for acquiring general or domain-specific, monolingual and bilingual corpora
      • Input:
        – Domain definitions (lists of terms)
        – Seed URLs
      • Modules (open source libraries/toolkits):
        – Page fetching / text extraction
        – Normalization and metadata extraction
        – Boilerplate detection (Boilerpipe)
        – Language detection (covering > 50 languages)
        – Text classification
        – Exact and near de-duplication
        – Detection of pairs of parallel documents
        – Sentence alignment (Hunalign and others)
      • Generates lists of
        – document pairs and
        – segment pairs in TMX files
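For readers unfamiliar with TMX, the sketch below writes aligned segment pairs into a small TMX document with Python's `xml.etree.ElementTree`. It is an illustration under assumptions, not the output format of the ILSP Focused Crawler: the header values (such as `creationtool="example-sketch"`) and the function name `build_tmx` are invented, and only the general `tmx`/`header`/`body`/`tu`/`tuv`/`seg` structure follows the TMX format.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute used on each <tuv>.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Serialize (source, target) segment pairs as a TMX string."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")      # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


segment_pairs = [
    ("abus de droit", "abuso de derecho"),
    ("accès à l'emploi", "acceso al empleo"),
]
print(build_tmx(segment_pairs, "fr", "es"))
```

Each `tu` element holds one aligned pair, with one `tuv` per language, which is what makes such files directly usable as training material for MT.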

  29. … it integrates technologies to crawl (part of) a multi-page website …

  30. … identify the language of each crawled page …

  31. … identify the language of each crawled page

  32. … extract several types of data descriptors (metadata)

  33. … and optionally classify each page as relevant or not to a user-defined domain

  34. It can detect boilerplate text …

  35. … HTML structure and/or URL similarity to detect document pairs

  36. … HTML structure and/or URL similarity to detect document pairs
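URL similarity can be illustrated with the link listing from the ELDA portal, where the English version of a page appears to differ only by an `/en/` path segment. The sketch below pairs URLs on that basis. Treating the unprefixed URL as the default-language version, the list of language codes, and the function names are all assumptions for illustration, not the crawler's actual method.

```python
from urllib.parse import urlparse


def strip_language_segment(url, lang_codes=("en", "fr", "de", "es", "el")):
    """Return (host, path without leading language code, language code or None)."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    if segments and segments[0] in lang_codes:
        return parsed.netloc, "/".join(segments[1:]), segments[0]
    return parsed.netloc, "/".join(segments), None


def pair_by_url(urls):
    """Pair each unprefixed URL with its language-prefixed counterparts."""
    groups = {}
    for url in urls:
        host, path, lang = strip_language_segment(url)
        groups.setdefault((host, path), {})[lang] = url
    pairs = []
    for versions in groups.values():
        # Assumption: the URL without a language code is the default-language page.
        if None in versions and len(versions) > 1:
            for lang, url in versions.items():
                if lang is not None:
                    pairs.append((versions[None], url))
    return pairs


crawled = [
    "http://portal.elda.org/login/",
    "http://portal.elda.org/en/login/",
    "http://portal.elda.org/reset/",
    "http://portal.elda.org/tag/85/",
    "http://portal.elda.org/en/tag/85/",
]
for default_url, english_url in pair_by_url(crawled):
    print(default_url, "<->", english_url)
```

Pages such as `/reset/` that have no counterpart are simply left unpaired.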

  37. Sometimes URLs are not enough for finding document pairs…
