Ljubljana, Slovenia MultilingualWeb Workshop, Madrid, Oct 26th 2010

Marko Grobelnik (marko.grobelnik@ijs.si), Jozef Stefan Institute (http://www.ijs.si/), Ljubljana, Slovenia. MultilingualWeb Workshop, Madrid, Oct 26th 2010.


  1. Marko Grobelnik (marko.grobelnik@ijs.si)
     Jozef Stefan Institute (http://www.ijs.si/), Ljubljana, Slovenia
     MultilingualWeb Workshop, Madrid, Oct 26th 2010

  2. • Imagine a user who understands several languages
       ◦ e.g. English, German, Italian, Croatian, Serbian, Slovenian (…not so uncommon in Slovenia)
     • Such a user would want to browse and search documents in all the languages they know
     • …but of course, a query can be provided in only one language
     • We need a search engine which, given a query in one language, returns documents in selected languages
       ◦ …this is called “cross-lingual information retrieval”

  3. • …there are many research fields working with textual data, solving different problems:
       ◦ Computational Linguistics, Machine Translation, Information Retrieval, Text Mining, Semantic Web, …
     • Each of the research fields “represents” text in a slightly different way

  4. • Character (character n-grams and sequences)
     • Words (stop-words, stemming, lemmatization)
     • Phrases (word n-grams, proximity features)
     • Part-of-speech tags
     • Taxonomies / thesauri
     • Vector-space model
     • Correlated V.S.M.
     • Language models
     • Full-parsing
     • Collaborative tagging / Web2.0
     • Templates / Frames
     • Ontologies / First order theories
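
The two lowest levels in the list above, character n-grams and word n-grams, can be illustrated with a minimal Python sketch. The function names `char_ngrams` and `word_ngrams` are hypothetical, and the whitespace tokenization is a simplifying assumption; the slides do not prescribe any particular implementation.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Return all overlapping word n-grams, using naive whitespace tokenization."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Character trigrams of a single word.
print(char_ngrams("retrieval", 3))

# Word bigrams ("phrases") of a short text.
print(word_ngrams("cross lingual information retrieval", 2))
```

Character n-grams need no tokenizer or stemmer, which is one reason they are often considered when many languages have to be handled with the same code.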

  5. • Character (character n-grams and sequences)
     • Words (stop-words, stemming, lemmatization)
     • Phrases (word n-grams, proximity features)
     • Part-of-speech tags
       [callout: Search (Inf. Retrieval), Categorization, Clustering, Summarization, …]
     • Taxonomies / thesauri
     • Vector-space model
     • Correlated V.S.M.
     • Language models
     • Full-parsing
     • Collaborative tagging / Web2.0
     • Templates / Frames
     • Ontologies / First order theories

  6. • Character (character n-grams and sequences)
     • Words (stop-words, stemming, lemmatization)
     • Phrases (word n-grams, proximity features)
     • Part-of-speech tags
       [callout: Search (Inf. Retrieval), Categorization, Clustering, Summarization, …]
     • Taxonomies / thesauri
     • Vector-space model
     • Correlated V.S.M.
       [callout: Cross-lingual Inf. Retrieval, Connecting Text + Images]
     • Language models
     • Full-parsing
     • Collaborative tagging / Web2.0
     • Templates / Frames
     • Ontologies / First order theories

  7. • Character (character n-grams and sequences)
     • Words (stop-words, stemming, lemmatization)
     • Phrases (word n-grams, proximity features)
     • Part-of-speech tags
       [callout: Search (Inf. Retrieval), Categorization, Clustering, Summarization, …]
     • Taxonomies / thesauri
     • Vector-space model
     • Correlated V.S.M.
       [callout: Cross-lingual Inf. Retrieval, Connecting Text + Images]
     • Language models
     • Full-parsing
       [callout: Machine translation, Spam filtering, …]
     • Collaborative tagging / Web2.0
     • Templates / Frames
     • Ontologies / First order theories
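
To make the "Vector-space model" entry concrete, here is a minimal sketch of bag-of-words vectors compared by cosine similarity. The helper names `bag_of_words` and `cosine` and the example texts are assumptions made for illustration. The sketch also shows why the plain model is not enough for cross-lingual retrieval: an English query and a German document on the same topic share no terms.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Term-frequency vector with naive whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = bag_of_words("machine translation")
doc_en = bag_of_words("statistical machine translation systems")
doc_de = bag_of_words("statistische maschinelle übersetzung")

print(cosine(query, doc_en))  # positive: two shared terms
print(cosine(query, doc_de))  # zero: no shared terms across languages
```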

  9. • Ideally, we would want to represent text in a language-neutral form
       ◦ …so that document content would be comparable regardless of the language
     • Having this, we can solve many problems still unaddressed on the market…
     • Nowadays, we can solve this on a large scale…
       ◦ …because of the availability of large amounts of “comparable corpora” like Wikipedia

  10. [Diagram] Documents in Slovenian, Slovak, German, English, Czech, French, Hungarian, Spanish, Greek, Italian, Danish, Finnish, Lithuanian, Swedish or Dutch feed into a Language Neutral Document Representation (trained with machine learning).
      New document represented as text in any of the above languages → New document represented in Language Neutral way
      …enables cross-lingual retrieval, categorization, clustering, …
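
The slides do not say how the language neutral representation is trained beyond "machine learning". As a rough illustration only, the sketch below swaps in one simple, well-known idea: describe each document by its similarity to a set of concepts for which aligned texts exist in every language (as aligned Wikipedia articles would provide). This is not claimed to be the method used in the talk. The `ALIGNED` data is a toy stand-in with invented texts, and `language_neutral` is a hypothetical name.

```python
import math
from collections import Counter

# Toy stand-in for aligned Wikipedia articles: one short text per concept and
# language. These texts are invented for illustration, not real Wikipedia data.
ALIGNED = {
    "music": {"en": "music song band concert", "de": "musik lied band konzert"},
    "football": {"en": "football goal team match", "de": "fussball tor mannschaft spiel"},
}


def bow(text: str) -> Counter:
    """Bag-of-words term counts with naive whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def language_neutral(text: str, lang: str) -> dict:
    """Describe a document by its similarity to each concept's article in `lang`."""
    doc = bow(text)
    return {c: cosine(doc, bow(ALIGNED[c][lang])) for c in ALIGNED}


query_en = language_neutral("the band played a song at the concert", "en")
doc_de_music = language_neutral("das konzert der band mit einem neuen lied", "de")
doc_de_sport = language_neutral("die mannschaft gewann das spiel mit einem tor", "de")

print(cosine(query_en, doc_de_music))  # high: both map onto the "music" concept
print(cosine(query_en, doc_de_sport))  # zero: no shared concept
```

Once both documents live in the same concept space, the English query and the German documents become directly comparable even though they share no vocabulary.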

  11. Wikipedia Languages

      №     Language       Articles
      1     English        3,451,276
      2     German         1,139,687
      3     French         1,022,762
      4     Polish           740,342
      5     Italian          739,961
      6     Japanese         711,765
      7     Spanish          663,201
      …     …
      32    Bulgarian        107,739
      33    Persian          107,564
      34    Slovenian        101,731
      35    Waray-Waray      100,454
      …     …
      92    Walloon           11,791
      93    Irish             11,623
      94    Chuvash           11,620
      95    Armenian          11,197
      96    Yoruba            10,167
      …     …
      192   Picard             1,092
      193   Aymara             1,088
      194   Wolof              1,082
      195   Tumbuka            1,061

      • With machine learning techniques we can learn “language neutral document representation”…
      • …planned for ~200 Wikipedia languages having over 1000 articles
      • The goal is to have an updated 200x200 matrix of languages for comparing document content
        ◦ …trained statistical models + software will be open to use
        ◦ …part of FP7 MetaNet Network of Excellence
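
To give a sense of the size of a 200x200 language matrix, the short sketch below counts its cells and language pairs. The function name `pair_counts` is hypothetical, and whether the planned matrix treats a pair as ordered (English to German distinct from German to English) is not stated in the slides, so both counts are shown.

```python
from itertools import combinations


def pair_counts(n_languages: int) -> dict[str, int]:
    """Count matrix cells and language pairs for an n x n language matrix."""
    ordered = n_languages * (n_languages - 1)  # excludes the diagonal
    return {
        "cells": n_languages ** 2,
        "ordered_pairs": ordered,
        "unordered_pairs": ordered // 2,
    }


print(pair_counts(200))

# The same count, enumerated explicitly for the three largest Wikipedias
# in the table above.
top3 = ["English", "German", "French"]
print(list(combinations(top3, 2)))
```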

  12. • Cross-lingual Information Retrieval is a technique for comparing documents written in different languages
        ◦ …still largely unsolved for comparing a large number of languages
      • We are introducing “language neutral document representation”
        ◦ …based on statistical representation
      • Using Wikipedia, we are building a 200x200 matrix of languages within
        ◦ …solution will be open source
