Learning Morphology from the Corpus

  1. Learning Morphology from the Corpus
     Ondřej Dušek, Institute of Formal and Applied Linguistics, Charles University in Prague
     November 11, 2013
     Sections: Motivation, Generation, Analysis

  2. Motivation (general)
     Morphology is needed in most NLP tasks:
     • Parsing
     • Structural MT
     • Factored phrase-based MT
     • Corpora
     • User interfaces
     • Dialogue systems
     The morphology module influences the overall quality of these systems.

  6. Motivation (personal)
     “Avoid the X@ tag in Czech as much as possible”
     • Words unknown to the Czech dictionary are relatively common in some applications
     • KHRESMOI – translation of medical text: terms
     • ALEX dialogue system – public transport: stop names
     • Up to 5% of words are not recognized in special domains
     • There's no guesser in Treex (that I know of)
     Unrecognized words receive the X@ tag:
       Dolnokrčská    X@-------------
       artroplastika  X@-------------
     “Inflect anything”
     • Translate and create unseen phrases
     • Speak freely in dialogue systems

  7. Exploiting the regularities in morphology
     • The morphology of many languages is mostly regular, apart from a certain number of exceptions
     • The size, number, and shape of inflection patterns differ

  11. Possible approaches to morphology
     Dictionaries?
     • Work well, reliable
     • Limited coverage and/or availability
     Hand-written rules?
     • Hard to maintain with complex morphology
     [Figure: illustration of a hand-written rule]
     Learning from the data!
     • Obtaining the rules automatically
     • Plenty of corpora of sufficient size available
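
As an illustration of "obtaining the rules automatically", the following minimal Python sketch counts suffix rewrite rules observed in (lemma, tag, form) triples and applies the most frequent matching rule to an unseen word. It is not the Flect system, which the slides describe only as using machine learning to predict inflection. The tag label `gen.sg`, the toy training triples, and all function names are assumptions made for this example.

```python
import os
from collections import Counter, defaultdict


def suffix_rule(lemma, form):
    """Return (strip, add): remove `strip` from the lemma's end, then append `add`."""
    prefix_len = len(os.path.commonprefix([lemma, form]))
    return lemma[prefix_len:], form[prefix_len:]


def learn_rules(examples):
    """Count, per tag, how often each suffix rule occurs in the training triples."""
    rules = defaultdict(Counter)
    for lemma, tag, form in examples:
        rules[tag][suffix_rule(lemma, form)] += 1
    return rules


def inflect(lemma, tag, rules):
    """Apply the most frequent rule whose `strip` matches; fall back to the lemma."""
    for (strip, add), _count in rules[tag].most_common():
        if lemma.endswith(strip):
            return lemma[: len(lemma) - len(strip)] + add
    return lemma


# Toy data (hypothetical tag label): Czech genitive singular forms.
EXAMPLES = [
    ("žena", "gen.sg", "ženy"),
    ("škola", "gen.sg", "školy"),
    ("město", "gen.sg", "města"),
]

rules = learn_rules(EXAMPLES)
print(inflect("artroplastika", "gen.sg", rules))  # artroplastiky
```

Even this frequency count inflects a word that is absent from the training data, which is the coverage a fixed dictionary lacks. A real system would need context features and a way to handle stem changes.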

  15. My experiments with morphology
     • in chronological (less logical) order
     1. Generation
        • with Filip Jurčíček (see also: our paper at ACL-SRW 2013)
        • Flect: statistical morphology generator
     2. Analysis
        • recent, only partially finished experiments on Czech
        • a simple morphology module to go with the Featurama tagger, comparison with others
     3. Discussion

  18. Flect: Morphology generator
     • Using machine learning to predict inflection
     • Only previous statistical morphology module known to us: Bohnet et al. (2010)
     • Flect tested on 6 languages from the CoNLL 2009 data set with a varying degree of morphological richness: EN, DE, ES, CA, JA, CS
     [Figure: natural language generation pipeline: Semantics, Syntax, Morphology, Text]

  19. The need to generate morphology
     • English – not so much: hard-coded solutions often work well enough
     • Languages with more inflection (e.g. Czech): even the simplest applications have trouble with morphology
     Example 1: Toto se líbí uživateli Jana Nováková.
       Gloss: This is liked by user (name)
       Problem: “uživateli” is masculine [masc] although the name is feminine [fem], and the name stays in the nominative [nom] where the dative [dat] is required (marked corrections: -ě, -é)
     Example 2: Děkujeme, Jan Novák, vaše hlasování bylo vytvořeno.
       Gloss: Thank you, (name), your poll has been created
       Problem: the name stays in the nominative [nom] (marked corrections: -e, -u)
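
The second example can be reproduced with a hard-coded template. The sketch below is only an illustration: the template string comes from the slide, while the `VOCATIVE` lookup table and the `greet` function are hypothetical names introduced here. It shows that plain string substitution leaves the name in the nominative, and that a hand-made lookup fixes only the names it happens to contain.

```python
TEMPLATE = "Děkujeme, {name}, vaše hlasování bylo vytvořeno."

# Hypothetical hand-made table of vocative forms; real coverage would be tiny.
VOCATIVE = {"Jan Novák": "Jane Nováku"}


def greet(name, vocative=VOCATIVE):
    """Fill the template, using the vocative form when the table knows the name."""
    return TEMPLATE.format(name=vocative.get(name, name))


# Naive substitution: the name stays in the nominative, as in the slide's example.
naive = TEMPLATE.format(name="Jan Novák")
print(naive)

# With the lookup table, the listed name is inflected correctly.
print(greet("Jan Novák"))

# Any name missing from the table silently falls back to the nominative.
print(greet("Jana Nováková"))
```

The fallback in the last call is the coverage problem a generator that can inflect unseen words is meant to address.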
