
Czechizator - Čechizátor



  1. Czechizator – Čechizátor
     Rudolf Rosa, rosa@ufal.mff.cuni.cz
     Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
     SloNLP, Tatranské Matliare, 18 September 2016

  4. Czechizator
     - lexicon-less “translation” from English to Czech
     - usual approach: use a bilingual lexicon
       - Czech-English texts → training → translation model
       - input “presentation” → statistical translation system → output “prezentace”
     - Czechizator approach: use a set of rules instead
       - rules: -ise → -iza, -tion → -ce, ...
       - input “presentation” → translation system → output “presentace”
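As a minimal sketch of the rule-based idea on this slide, the two ending rules shown (-ise → -iza, -tion → -ce) can be applied as plain suffix substitutions. The names `ENDING_RULES` and `apply_ending_rule` are hypothetical and are not taken from the Czechizator code; the real system has more rules and may apply them differently.

```python
# Two ending rules quoted on the slide; the full rule set is not reproduced here.
ENDING_RULES = [("tion", "ce"), ("ise", "iza")]


def apply_ending_rule(word):
    """Replace the first matching English ending with its Czechized counterpart."""
    for english_ending, czech_ending in ENDING_RULES:
        if word.endswith(english_ending):
            return word[: -len(english_ending)] + czech_ending
    # No rule applies: leave the word unchanged.
    return word


print(apply_ending_rule("presentation"))  # presentace
```

No lexicon entry for "presentation" is needed: the output follows from the ending alone, which is what lets the approach handle words never seen in training data.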

  10. Example: Czechizating ITAT titles
      - Statistical modelling in climate science
        Statistické modelování v klimat scienci
      - 12 years of Unsupervised Dependency Parsing
        12 jírů nesupervizované parsování dependence
      - Multivariable Approximation by Convolutional Kernel Networks
        Multivariabilní aproximace Konvolucional Kernel netvorksu

  11. Implementation
      - lexical translation: a set of Czechization rules
        - 43 ending-based transformation rules (see later)
        - 33 transliteration rules: th → t, ti → ci, ck → k, ph → f, sh → š, igh → aj, dg → dž, w → v, c → k…
        - 36 hard-coded translations of semi-auxiliaries: be, have, do, and, or, all, this, many, only, main…
      - grammar and function words: TectoMT
        - English-Czech machine translation system
        - Czechizator implemented as a TectoMT lexical translation model
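The transliteration rules listed on this slide could be applied roughly as follows. This is a sketch under stated assumptions: it covers only the nine rules quoted above (the slide mentions 33), and the single left-to-right, longest-match-first scan is an illustrative choice, not necessarily how Czechizator orders its rules. The names `TRANSLITERATION` and `transliterate` are hypothetical.

```python
# The nine transliteration rules quoted on the slide (the full set has 33).
TRANSLITERATION = {
    "igh": "aj", "th": "t", "ti": "ci", "ck": "k", "ph": "f",
    "sh": "š", "dg": "dž", "w": "v", "c": "k",
}


def transliterate(word):
    """Scan left to right, always taking the longest rule that matches."""
    patterns = sorted(TRANSLITERATION, key=len, reverse=True)
    output = []
    position = 0
    while position < len(word):
        for pattern in patterns:
            if word.startswith(pattern, position):
                output.append(TRANSLITERATION[pattern])
                position += len(pattern)
                break
        else:
            # No rule matches here: copy the character unchanged.
            output.append(word[position])
            position += 1
    return "".join(output)


print(transliterate("insight"))  # insajt
print(transliterate("network"))  # netvork
```

Scanning once, rather than calling `str.replace` rule by rule, keeps the output of one rule from being rewritten by another: the "c" produced by ti → ci is not turned into "k" by c → k.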

  21. Implementation
      - input: I preferred the presentation of David.
      - TectoMT analysis
        - prefer: verb, 1st person, past
        - presentation: noun, definite, object
        - David: noun+of, named ent.
      - Czechization of lemmas, TectoMT transfer of attributes
        - preferovat: verb, 1st person, past
        - prezentace: noun, accusative
        - David: noun, genitive, n.e.
      - TectoMT synthesis: Preferoval jsem prezentaci Davida.
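The transfer step in this walkthrough can be pictured with a toy data structure. The sketch below is not the TectoMT API: `Node`, `CZECHIZED`, `FEATURE_MAP` and `transfer` are hypothetical names, the lemma table simply copies the three pairs from the slide in place of the real rules, and the feature correspondences are read off the example. Analysis and synthesis, which TectoMT performs, are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    lemma: str
    pos: str
    feats: tuple


# Lemma pairs copied from the slide; a stand-in for the Czechization rules.
CZECHIZED = {"prefer": "preferovat", "presentation": "prezentace", "David": "David"}

# Attribute correspondences read off the slide example (assumed, not exhaustive).
FEATURE_MAP = {"object": "accusative", "of": "genitive"}
DROPPED = {"definite"}  # Czech has no articles, so definiteness is not carried over


def transfer(node):
    """Czechize the lemma and map English attributes to Czech ones."""
    feats = tuple(FEATURE_MAP.get(f, f) for f in node.feats if f not in DROPPED)
    return Node(CZECHIZED.get(node.lemma, node.lemma), node.pos, feats)


english = [
    Node("prefer", "verb", ("1st person", "past")),
    Node("presentation", "noun", ("definite", "object")),
    Node("David", "noun", ("of", "named entity")),
]
czech = [transfer(node) for node in english]
for node in czech:
    print(node.lemma, node.pos, node.feats)
```

Keeping lemmas and grammatical attributes separate is what allows the lexical step to be swapped out: only the lemma lookup changes, while inflection is still produced by synthesis.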

  22. Transformation rules for adjectives
      - partial → parciální
      - stable → stabilní
      - tolerant → tolerantní
      - tolerated → tolerovaný
      - turkic → turkický
      - practical → praktický
      - park → parkový
      - native → nativní
      - regular → regulární
      - fatal → fatální
      - nervous → nervózní
      - parsed → parsovaný
      - parsing → parsující
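The adjective examples on this slide suggest suffix rules with a default ending. The rule table below is inferred from those example pairs only, so it is an assumption rather than the actual Czechizator rule set; `ADJECTIVE_RULES`, `DEFAULT_SUFFIX` and `czechize_adjective` are hypothetical names. Transliteration (needed for partial → parciální and practical → praktický) is a separate step and is not included.

```python
# Suffix rules inferred from the slide's example pairs (assumption).
ADJECTIVE_RULES = [
    ("able", "abilní"),
    ("ated", "ovaný"),
    ("ical", "ický"),
    ("ant", "antní"),
    ("ing", "ující"),
    ("ive", "ivní"),
    ("ous", "ózní"),
    ("al", "ální"),
    ("ar", "ární"),
    ("ed", "ovaný"),
    ("ic", "ický"),
]
DEFAULT_SUFFIX = "ový"  # fallback seen in park → parkový


def czechize_adjective(lemma):
    """Apply the longest matching suffix rule, else append the default ending."""
    for english_suffix, czech_suffix in sorted(ADJECTIVE_RULES, key=lambda r: -len(r[0])):
        if lemma.endswith(english_suffix):
            return lemma[: -len(english_suffix)] + czech_suffix
    return lemma + DEFAULT_SUFFIX


for word in ["stable", "tolerated", "parsed", "nervous", "park"]:
    print(word, "→", czechize_adjective(word))
```

Trying longer suffixes first matters for pairs such as "tolerated" and "parsed": both end in "-ed", but only the first should lose its "-at-" as well.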

  24. What is it good for?
      - translations sometimes “reasonable”
        - scientific titles and abstracts, marketing texts:
      - Accenture Operations combines technology that digitizes and automates business processes, unlocks actionable insights, and delivers everything-as-a-service with our team's deep industry, functional and technical expertise.
      - Operacions acenturu kombinuje technologii, která digitizuje a automuje procesy businosti, unlokuje akcionabilní insajty a deliveruje everyting-as-a-servicová s funkcionální a technickou expertizou dípové industrie našeho tímu.

  26. What is it good for?
      - translations sometimes “reasonable”
        - scientific titles and abstracts, marketing texts
      - still, only a proof of concept & a fun application
        - not really useful as a standalone tool
        - maybe as a starting point for later post-editing
      - potential: combine with TectoMT lexical models
        - frequent words: translation model trained from data
        - infrequent words: insufficient training data, Czechize!
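The proposed combination amounts to a back-off: use the trained translation model for frequent words and Czechize the rest. The sketch below only illustrates that decision. The function `translate_lemma`, the `min_count` threshold, the toy lexicon and counts, and the one-rule `toy_czechize` are all hypothetical; the slides do not say how the cut-off would be chosen.

```python
def translate_lemma(lemma, lexicon, counts, czechize, min_count=5):
    """Use the trained lexicon for frequent words, otherwise fall back to Czechization."""
    if counts.get(lemma, 0) >= min_count and lemma in lexicon:
        return lexicon[lemma]
    return czechize(lemma)


# Toy stand-ins for a trained translation model and training-data frequencies.
lexicon = {"presentation": "prezentace"}
counts = {"presentation": 120, "anaphora": 1}


def toy_czechize(lemma):
    # A single transliteration rule from the slides (ph → f), enough for this example.
    return lemma.replace("ph", "f")


print(translate_lemma("presentation", lexicon, counts, toy_czechize))  # prezentace
print(translate_lemma("anaphora", lexicon, counts, toy_czechize))      # anafora
```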

  28. Complementing TectoMT
      - rare/unseen words not well handled by TectoMT
        - unreliable translation for rare words, none for unseen
      - e.g. scientific terms
        - large number and growing, rare in data
        - often rather regular translations → can be Czechized
        - anaphora → anafora, hypotactical → hypotaktický, circumfixal → cirkumfixální
      - current issues: named entities get Czechized
        - usually should be avoided, but detection insufficient

  29. Conclusion
      - lexicon-less lexical “translation” module
        - transformation (endings) and transliteration rules
        - grammar and aux words handled by TectoMT
        - Czechization of lemmas on t-layer
      - Czechization of scientific titles sometimes “good”
        - but still not really useful
      - work in progress: integrate into TectoMT
        - complement existing lexical models
        - Czechize rare and unseen words, e.g. science terms
