Learning Morphology from the Corpus
Ondřej Dušek
Institute of Formal and Applied Linguistics
Charles University in Prague
November 11, 2013

Motivation (general)
Morphology needed in most NLP tasks
• Parsing
• Structural MT
• Factored phrase-based MT
• Corpora
• User interfaces
• Dialogue systems
Morphology module influences overall quality of the systems

Motivation (personal)
“Avoid the X@ tag in Czech as much as possible”
• Words unknown to the Czech dictionary are relatively common in some applications
  • KHRESMOI – translation of medical text: terms
  • ALEX dialogue system – public transport: stop names
  • Up to 5% of words are not recognized in special domains
• There's no guesser in Treex (that I know of)

Dolnokrčská    X@-------------
artroplastika  X@-------------

“Inflect anything”
• Translate and create unseen phrases
• Speak freely in dialogue systems

Exploiting the regularities in morphology
• Morphology of many languages is mostly regular, but for a certain number of exceptions
• Size, number, and shape of inflection patterns differ

Possible approaches to morphology
Dictionaries?
• Work well, reliable
• Limited coverage and/or availability
Hand-written rules?
• Hard to maintain with complex morphology
Learning from the data!
• Obtaining the rules automatically
• Plenty of corpora of sufficient size available
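
To make "obtaining the rules automatically" concrete, here is a minimal sketch of one simple way to do it: collect suffix-replacement rules from (lemma, form, tag) triples and apply the most frequent matching rule to an unseen lemma. This is an illustration only, not the method used in the talk; the function names, the tag label "dat_sg", and the toy training triples are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def suffix_rule(lemma, form):
    """Return (lemma_suffix, form_suffix) left after the longest common prefix."""
    i = 0
    while i < min(len(lemma), len(form)) and lemma[i] == form[i]:
        i += 1
    return lemma[i:], form[i:]


def learn_rules(triples):
    """Count suffix rules per tag from (lemma, form, tag) triples."""
    rules = defaultdict(Counter)
    for lemma, form, tag in triples:
        rules[tag][suffix_rule(lemma, form)] += 1
    return rules


def inflect(lemma, tag, rules):
    """Apply the most frequent rule for `tag` whose lemma suffix matches.

    Falls back to the unchanged lemma when no rule applies.
    """
    for (old, new), _count in rules.get(tag, Counter()).most_common():
        if lemma.endswith(old):
            return lemma[:len(lemma) - len(old)] + new
    return lemma


# Toy "corpus" (hypothetical data): Czech feminine nouns, dative singular.
TRAINING = [
    ("žena", "ženě", "dat_sg"),
    ("ryba", "rybě", "dat_sg"),
    ("škola", "škole", "dat_sg"),
]

rules = learn_rules(TRAINING)
print(inflect("Jana", "dat_sg", rules))
```

A frequency-based rule like this will still mis-inflect words that need a stem change, which is where a statistical model over richer features becomes useful.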

My experiments with morphology
• in chronological (less logical) order
1. Generation
  • with Filip Jurčíček (see also: our paper at ACL-SRW 2013)
  • Flect: statistical morphology generator
2. Analysis
  • recent, only partially finished experiments on Czech
  • a simple morphology module to go with the Featurama tagger, comparison with others
3. Discussion

Flect: Morphology generator
• Using machine learning to predict inflection
• Only previous statistical morphology module known to us: Bohnet et al. (2010)
• Flect tested on 6 languages from the CoNLL 2009 data set with a varying degree of morphological richness
  • EN, DE, ES, CA, JA, CS
[Figure: Natural Language Generation for these languages; stages labeled Semantics, Syntax, Morphology, Text]

The need to generate morphology
• English – not so much: hard-coded solutions often work well enough
• Languages with more inflection (e.g. Czech): even the simplest applications have trouble with morphology

Toto se líbí uživateli Jana Nováková.
  This is liked by user (name)
  uživateli: [masc] [dat]; Jana Nováková: [fem] [nom]
  (corrected endings marked on the slide: ě, é, i.e. Janě Novákové)

Děkujeme, Jan Novák, vaše hlasování bylo vytvořeno.
  Thank you, (name), your poll has been created
  Jan Novák: [nom]
  (corrected endings marked on the slide: e, u, i.e. Jane Nováku)
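
The second example can be reproduced with a few lines of template code. The sketch below is only an illustration of the problem: a hard-coded template inserts the name in the nominative, while Czech needs the vocative here. The `VOCATIVE` table is a hypothetical hand-made lookup covering just this one name; a real system would need a morphological generator instead.

```python
TEMPLATE = "Děkujeme, {name}, vaše hlasování bylo vytvořeno."

# Hypothetical hand-made vocative forms, covering only this example.
VOCATIVE = {"Jan": "Jane", "Novák": "Nováku"}


def fill_naive(name):
    """Insert the name as stored (nominative), the way a hard-coded template does."""
    return TEMPLATE.format(name=name)


def fill_inflected(name):
    """Insert the name word by word, using a vocative form when one is known."""
    words = [VOCATIVE.get(word, word) for word in name.split()]
    return TEMPLATE.format(name=" ".join(words))


print(fill_naive("Jan Novák"))      # nominative: reads as wrong in Czech
print(fill_inflected("Jan Novák"))  # vocative: Jane Nováku
```

Any name missing from the table silently falls back to the nominative, which is exactly the coverage problem that motivates learning inflection from data.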