
Workshop on statistical machine translation for curious translators



  1. Workshop on statistical machine translation for curious translators
     Víctor M. Sánchez-Cartagena
     Prompsit Language Engineering, S.L.

  2. Outline
     1) Introduction to machine translation
     2) The Abu-MaTran project
     3) Acquisition of parallel data from the web
        – How a web crawler works
        – Hands-on session: Bicrawler
     4) Statistical machine translation (SMT)
        – Introduction to SMT
        – Hands-on session: MTradumàtica
     The Abu-MaTran project

  3. Introduction to machine translation

  4. Machine translation
     ● Translation, by means of a computing system (computer + software), of texts in digital form from one natural language (source language; SL) into another (target language; TL)
     ● No human intervention whatsoever

  5. Applications of machine translation
     ● Machine translation and professional translation, even if closely related in purpose, are not interchangeable products (Sager, 1994)
     ● Is a machine translation really a translation?
       – It cannot be used the way a professional product would be
       – This does not mean machine translation is useless!

  6. Applications of machine translation
     ● Gisting (assimilation): ephemeral translation, ideally instantaneous, used to get a rough idea of a text when you do not speak the language or speak it badly
       – Internet surfing, informal communication, etc.

  7. Applications of machine translation
     ● Post-editing (dissemination): permanent translation, ideally with few errors, intended for publication after correction
       – Production of drafts for post-editing

  8. Applications of machine translation

  9. Applications of machine translation
     ● Gisting:
       – English (MT): *Match very difficult but fans unconditional support players very motivated
       – English (Cor.): The game was very difficult but the unconditional support of fans made the players very motivated
       – Spanish (SL): El partido ha sido muy difícil pero el apoyo incondicional de la afición hizo que los jugadores estuvieran muy motivados

  10. Applications of machine translation
      ● Post-editing (dissemination):
        – English (MT): *I eat you were not coming we left
        – English (Cor.): As you were not coming, we left
        – Spanish (SL): Como no venías, nos fuimos

  11. Rule-based machine translation
      ● Uses explicit representations of linguistic information: dictionaries, rules, etc.

  12. Corpus-based machine translation
      ● Learns to translate from large amounts of existing translations (bitexts = parallel corpora)
      ● Statistical machine translation (SMT) is corpus-based

  13. Approaches to machine translation
      ● Corpus-based MT works best when...
        – You have a big bitext of pre-translated and aligned sentences
        – The languages involved are not morphologically complex
        – The texts to be translated are in the same domain as those used to learn
      ● Rule-based MT works best when...
        – You do not have bitexts, or they are of low quality
        – The languages involved are typologically similar (e.g. es–ca, es–pt, es–fr)
        – You are translating formal language

  14. The Abu-MaTran project

  15. Abu-MaTran in a nutshell
      ● Marie Curie IAPP (Industry-Academia Partnerships and Pathways)
        – Core activity: transfer of knowledge
        – By means of secondments that put academic and industrial partners in contact
      ● Duration: 48 months (from January 2013); it is about to end

  16. Partners
      ● Dublin City University (Ireland)
      ● Prompsit Language Engineering (Spain)
      ● University of Alicante (Spain)
      ● University of Zagreb (Croatia)
      ● Institute for Language and Speech Processing (Greece)

  17. Abu-MaTran in a nutshell
      ● Enhance industry-academia cooperation to tackle multilinguality
      ● Increase the low industrial adoption of machine translation
      ● Transfer the know-how of industry back to academia to make research products more robust
      ● Resources produced are to be released as free/open-source software
      ● Focus on Croatian: the language of a new EU member state
      ● Emphasis on dissemination

  18. Some results (I)
      ● Multiple open-source tools released:
        – Web crawlers, rule inference toolkits for rule-based machine translation, etc.
      ● Corpora released:
        – General-domain monolingual corpora for Croatian, Serbian, Bosnian, Catalan and Finnish
        – General-domain parallel corpora for English to Croatian, Serbian, Bosnian and Finnish
        – Tourism-domain parallel corpora for English-Croatian
        – …
      ● Machine translation systems created:
        – Rule-based: Serbian-Croatian
        – Statistical: English-Croatian (general domain and tourism domain), English-Greek (tourism domain)

  19. Some results (II)
      ● Organization of the Spanish Linguistics Olympiad (2014, 2015, 2016)
      ● Workshop organization:
        – 2014, Dublin: Software management for researchers
        – 2014-2015, Zagreb: Data creation for Croatian RBMT
        – 2014, Reykjavik: Free/open-source RBMT linguistic resources
        – 2016, Dublin: Hybrid machine translation
        – 2016, Dublin: Tools for linguists
        – 2016, UA: Statistical machine translation

  20. Acquisition of parallel data from the web
      1) Web crawling
      2) Hands-on session: Bicrawler

  21. Web crawling
      ● We can find many multilingual websites on the Internet
      ● Parallel corpora are essential to build SMT systems
      ● We can automatically obtain a parallel corpus from a multilingual website with a web crawler

  22. How a web crawler works
      ● How can we turn a multilingual website ...
        – English page: "Our University Campus is regarded as one of the best in Europe" / "Study with us"
        – Spanish page: "La Universidad puede presumir de tener uno de los mejores campus europeos" / "¿Vienes?"
      ● … into a parallel corpus ready for SMT?
        – Our campus is regarded as… | La Univer…
        – Study with us | ¿Vienes?

  23. How a web crawler works
      1) Download web pages (documents)
      2) Extract text and remove HTML tags
      3) Detect language of documents
      4) Identify documents that are mutual translations (most difficult part)
      5) Extract parallel sentences from each document pair

  24. How a web crawler works
      1) Download web pages (documents)
      ● The most time-consuming part: downloading a big website can take days or even weeks!
      ● From the main page (e.g. www.ua.es), hyperlinks are followed in order to get new documents
      ● From new documents, hyperlinks are followed in order to get more documents, and so on…
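
The link-following step can be sketched as a breadth-first traversal. This is a minimal illustration, not Bicrawler's actual implementation: the `crawl` function, the `fetch` callable and the in-memory `SITE` dictionary (which stands in for real HTTP downloads from a made-up `example.org` site) are all assumptions introduced here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    fetch is a hypothetical callable mapping a URL to its HTML (or None).
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop fragments such as "#top".
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Toy in-memory "website" (hypothetical) used instead of real HTTP requests.
SITE = {
    "https://example.org/": '<a href="/en/">English</a> <a href="/es/">Español</a>',
    "https://example.org/en/": '<a href="/en/study.html">Study with us</a>',
    "https://example.org/es/": '<a href="/es/study.html">¿Vienes?</a>',
    "https://example.org/en/study.html": '<a href="https://other.example.com/">External</a>',
    "https://example.org/es/study.html": "<p>No links here</p>",
}

pages = crawl("https://example.org/", SITE.get)
```

The `seen` set keeps each page from being downloaded twice, and the host check keeps the crawl inside the website. A real crawler would also need politeness delays and robots.txt handling, which is part of why large sites take so long.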

  25. How a web crawler works
      2) Extract text and remove HTML tags
      ● HTML tags need to be stored: they are needed in subsequent steps
      ● Text is split into paragraphs
      HTML source:
        <div class="row"> <div class="col-md-12"> <h2 class="subSeccionIcono" id="vienes"><img src="https://web.ua.es/secciones-ua/images/acceso/estudia/vida-universitaria/icono1.jpg" /> Study with us </h2> <h3 class="subtituloIcono">The University of Alicante gives you a warm welcome and offers its services for accommodation and transport. Find out more here. </h3>
      Extracted text:
        Study with us
        The University of Alicante gives you a warm welcome and offers its services for accommodation and transport. Find out more here.
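
As a rough sketch of this step, the standard-library `html.parser` module can strip the markup while remembering which tag each paragraph came from. The `ParagraphExtractor` class, the `BLOCK_TAGS` set and `extract_paragraphs` are illustrative names, not part of any tool mentioned in the slides. The sample HTML is adapted from the slide, with the closing `</div>` tags added so the fragment is complete.

```python
from html.parser import HTMLParser

# Tags assumed (for this sketch) to start a new paragraph.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "td"}


class ParagraphExtractor(HTMLParser):
    """Splits a page into (tag, text) paragraphs at block-level tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._tag = None
        self._chunks = []

    def _flush(self):
        # Join the collected text and normalise whitespace.
        text = " ".join(" ".join(self._chunks).split())
        if text:
            self.paragraphs.append((self._tag, text))
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
            self._tag = None

    def handle_data(self, data):
        self._chunks.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text after the last block tag
    return parser.paragraphs


SAMPLE_HTML = """
<div class="row">
  <div class="col-md-12">
    <h2 class="subSeccionIcono" id="vienes"><img src="icono1.jpg" /> Study with us </h2>
    <h3 class="subtituloIcono">The University of Alicante gives you a warm welcome
    and offers its services for accommodation and transport. Find out more here. </h3>
  </div>
</div>
"""

paragraphs = extract_paragraphs(SAMPLE_HTML)
for tag, text in paragraphs:
    print(tag, "|", text)
```

Keeping the tag next to each paragraph is one simple way to preserve the structural information that later steps, such as comparing the HTML layout of two documents, can use.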

  26. How a web crawler works
      3) Detect language of documents
      ● English: Study with us. The University of Alicante gives you a warm welcome and offers its services for accommodation and transport. Find out more here.
      ● Spanish: ¿Vienes? La Universidad de Alicante te acoge con toda clase de facilidades para el alojamiento o el transporte. Conócelas aquí.
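
The slides do not say how the language is detected. One very simple approach, shown here only as an illustration, is to count how many frequent function words of each language appear in the text. The `STOPWORDS` lists and the `detect_language` function are assumptions made for this sketch. Practical language identifiers usually rely on statistical models over character n-grams.

```python
import re

# Tiny hand-picked function-word lists (an assumption for this sketch).
STOPWORDS = {
    "en": {"the", "of", "and", "a", "for", "its", "you", "with", "here", "out", "us"},
    "es": {"la", "de", "el", "con", "para", "te", "o", "toda", "aquí", "las"},
}


def detect_language(text):
    """Returns the language whose function words occur most often, or None."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


english = ("Study with us. The University of Alicante gives you a warm welcome "
           "and offers its services for accommodation and transport. Find out more here.")
spanish = ("¿Vienes? La Universidad de Alicante te acoge con toda clase de "
           "facilidades para el alojamiento o el transporte. Conócelas aquí.")

print(detect_language(english))
print(detect_language(spanish))
```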

  27. How a web crawler works
      4) Identify documents that are mutual translations
      ● The most difficult part
      ● Clues that help us to identify pairs of documents:
        – URL: e.g. https://web.ua.es/en/university-life.html and https://web.ua.es/es/university-life.html
        – Images
        – Numbers
        – Named entities
        – HTML structure/layout
        – Links
        – Similarity after being translated with some bilingual resource: finding parallel resources is difficult for some language pairs!
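
The URL clue is the easiest one to illustrate. The sketch below pairs documents whose URLs differ only in a language-code path segment, as in the example on the slide. The `url_key` and `pair_by_url` functions are hypothetical helpers, and the third URL (`about.html`) is made up to show a page with no counterpart. A real system would combine this clue with the others listed above.

```python
from collections import defaultdict
from urllib.parse import urlparse


def url_key(url, languages=("en", "es")):
    """Replaces the first language-code path segment with '*'.

    Returns (language, key), or (None, None) if no such segment exists.
    """
    parts = urlparse(url)
    segments = parts.path.split("/")
    for i, segment in enumerate(segments):
        if segment in languages:
            segments[i] = "*"
            return segment, parts.netloc + "/".join(segments)
    return None, None


def pair_by_url(urls, source="en", target="es"):
    """Pairs URLs that share a key and cover both languages."""
    groups = defaultdict(dict)
    for url in urls:
        lang, key = url_key(url, (source, target))
        if lang is not None:
            groups[key][lang] = url
    return [
        (group[source], group[target])
        for group in groups.values()
        if source in group and target in group
    ]


urls = [
    "https://web.ua.es/en/university-life.html",
    "https://web.ua.es/es/university-life.html",
    "https://web.ua.es/en/about.html",  # hypothetical page without a Spanish version
]
pairs = pair_by_url(urls)
print(pairs)
```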

  28. How a web crawler works
      5) Extract parallel sentences from each document pair
      ● Split sentences from each paragraph
      Paragraphs:
        – Study with us | ¿Vienes?
        – The University of Alicante gives you a warm welcome and offers its services for accommodation and transport. Find out more here. | La Universidad de Alicante te acoge con toda clase de facilidades para el alojamiento o el transporte. Conócelas aquí.
      Sentences:
        – Study with us | ¿Vienes?
        – The University of Alicante gives you a warm welcome and offers its services for accommodation and transport. | La Universidad de Alicante te acoge con toda clase de facilidades para el alojamiento o el transporte.
        – Find out more here. | Conócelas aquí.
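
A minimal sketch of the sentence-splitting step, under two simplifying assumptions: sentences end at `.`, `!` or `?` followed by whitespace and an uppercase letter, and two paragraphs are aligned sentence by sentence only when they contain the same number of sentences. `split_sentences` and `align_paragraphs` are illustrative names. Real sentence aligners also handle abbreviations, merged or split sentences, and length or lexical evidence.

```python
import re

# Split after ., ! or ? when followed by whitespace and an uppercase letter
# or an opening Spanish punctuation mark.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-ZÁÉÍÓÚÑ¿¡])")


def split_sentences(paragraph):
    return [s.strip() for s in SENTENCE_END.split(paragraph) if s.strip()]


def align_paragraphs(source_paragraph, target_paragraph):
    """Naive 1-to-1 alignment; returns [] when sentence counts differ."""
    source = split_sentences(source_paragraph)
    target = split_sentences(target_paragraph)
    if len(source) != len(target):
        return []
    return list(zip(source, target))


english = ("The University of Alicante gives you a warm welcome and offers its "
           "services for accommodation and transport. Find out more here.")
spanish = ("La Universidad de Alicante te acoge con toda clase de facilidades "
           "para el alojamiento o el transporte. Conócelas aquí.")

sentence_pairs = align_paragraphs(english, spanish)
for source, target in sentence_pairs:
    print(source, "|", target)
```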
