Basic Language Resources


  1. Basic Language Resources. Chris Cieri, Mike Maxwell, Stephanie Strassel. COCOSDA/ICWLR Joint Meeting, LREC 2004, Lisbon, May 2004

  2. Low Density Languages Project
     – 100k words monolingual text
     – 100k words bilingual text
     – 100k words text annotated for named entities
     – 10k word bilingual lexicon
     – Morphological parser/stemmer
     – Encoding converters
     – Languages: Bengali, Panjabi, Tamil, Tigrinya, Uzbek, Tagalog

  3. REFLEX Project
     • Research on English and Foreign Language EXploitation
       – Proposal stage only!
       – Seven languages per year
       – 250k monolingual text
       – 250k bilingual text (75k English → target language)
       – Encoding converters
       – Sentence segmenter
       – Word segmenter (where required)
       – 10k bilingual lexicon
       – POS tagset and tagger (and for some languages, 5k word annotated text)
       – Morphological analyzer (and for some languages, 5k word annotated text)
       – Named entity tagger
       – 100k text annotated for named entities

  4. Language Survey
     • Languages with > 1M speakers
     • Sociolinguistic status
       – Written status
       – News media
     • Basic linguistic typology
     • Electronic resources
       – Web sites
       – Lexicons
       – Other tools
