Marko Grobelnik (marko.grobelnik@ijs.si)
Jozef Stefan Institute (http://www.ijs.si/), Ljubljana, Slovenia
MultilingualWeb Workshop, Madrid, Oct 26th, 2010
Imagine a user who understands several languages
◦ e.g. English, German, Italian, Croatian, Serbian, Slovenian (…not so uncommon in Slovenia)
Such a user would want to browse and search documents in all the languages they know
…but of course, a query can be provided in only one language
We need a search engine which, given a query in one language, returns documents in selected languages
◦ …this is called “cross-lingual information retrieval”
…there are many research fields that work with textual data and solve different problems:
◦ Computational Linguistics, Machine Translation, Information Retrieval, Text Mining, Semantic Web, …
Each of these research fields “represents” text in a slightly different way
Levels of text representation:
Character (character n-grams and sequences)
Words (stop-words, stemming, lemmatization)
Phrases (word n-grams, proximity features)
Part-of-speech tags
Taxonomies / thesauri
Vector-space model
Correlated V.S.M.
Language models
Full-parsing
Collaborative tagging / Web2.0
Templates / Frames
Ontologies / First order theories
Applications annotated alongside these levels (in slide order, top to bottom):
◦ Search (Inf. Retrieval), Categorization, Clustering, Summarization, …
◦ Cross-lingual Inf. Retrieval, Connecting Text + Images
◦ Machine translation, Spam filtering, …
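The lowest levels in the list above (characters, words, phrases) can be illustrated with a minimal Python sketch. The function names and the tiny stop-word set are assumptions made for illustration only; stemming and lemmatization are left out because they need language-specific resources.

```python
import re

# Tiny illustrative stop-word set (an assumption, not a real linguistic resource).
STOP_WORDS = {"the", "a", "of", "in", "is"}

def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def words(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize on word characters, and drop stop-words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]

def word_ngrams(tokens, n=2):
    """Return all overlapping word n-grams (phrases) as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

if __name__ == "__main__":
    tokens = words("The search engine returns documents")
    print(char_ngrams("text"))
    print(tokens)
    print(word_ngrams(tokens))
```

Character n-grams need no tokenizer at all, which is one reason they are often convenient when many languages must be handled uniformly.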
Ideally, we would want to represent the text in a language-neutral form
◦ …so that document content would be comparable regardless of the language
Having this, we can solve many problems still unaddressed on the market…
Nowadays, we can solve this on a large scale…
◦ …because of the availability of large amounts of “comparable corpora” like Wikipedia
[Figure: documents in many languages (Slovenian, Slovak, German, English, Czech, French, Hungarian, Spanish, Greek, Italian, Danish, Finnish, Lithuanian, Swedish, Dutch) feed a Language Neutral Document Representation (trained with machine learning). A new document, represented as text in any of the above languages, becomes a new document represented in a Language Neutral way.]
…enables cross-lingual retrieval, categorization, clustering, …
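One way to picture a language-neutral representation is sketched below in Python. This is not the method from the talk, which only says the representation is trained with machine learning on comparable corpora. As a simple stand-in, each document is described by its similarity to a set of aligned articles in its own language, so documents in different languages end up in the same coordinate system. The toy corpus `ALIGNED` and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical aligned "comparable corpus": entry i covers the same topic
# in every language (a stand-in for aligned Wikipedia articles).
ALIGNED = {
    "en": ["dog animal pet bark", "car engine road drive", "music song guitar concert"],
    "de": ["hund tier haustier bellen", "auto motor strasse fahren", "musik lied gitarre konzert"],
}

def bag(text):
    """Bag-of-words term counts for a text."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def neutral(text, lang):
    """Map a text to a vector with one coordinate per aligned topic."""
    doc = bag(text)
    return [cosine(doc, bag(article)) for article in ALIGNED[lang]]

def dense_cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def search(query, query_lang, docs):
    """Rank (lang, text) documents against a query in the neutral space."""
    q = neutral(query, query_lang)
    scored = [(dense_cosine(q, neutral(text, lang)), lang, text) for lang, text in docs]
    return sorted(scored, key=lambda item: item[0], reverse=True)

if __name__ == "__main__":
    docs = [("de", "das auto fahren"), ("de", "der hund bellen")]
    for score, lang, text in search("dog bark", "en", docs):
        print(f"{score:.2f} {lang} {text}")
```

The English query and the German documents share no words, yet they become comparable because both are expressed relative to the same aligned topics. A trained statistical model would replace the raw similarity coordinates used here.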
Wikipedia Languages

№     Language       Articles
1     English        3,451,276
2     German         1,139,687
3     French         1,022,762
4     Polish         740,342
5     Italian        739,961
6     Japanese       711,765
7     Spanish        663,201
…     …              …
32    Bulgarian      107,739
33    Persian        107,564
34    Slovenian      101,731
35    Waray-Waray    100,454
…     …              …
92    Walloon        11,791
93    Irish          11,623
94    Chuvash        11,620
95    Armenian       11,197
96    Yoruba         10,167
…     …              …
192   Picard         1,092
193   Aymara         1,088
194   Wolof          1,082
195   Tumbuka        1,061

With machine learning techniques we can learn “language neutral document representation”…
…planned for ~200 Wikipedia languages having over 1000 articles
The goal is to have an updated 200x200 matrix of languages for comparing document content
◦ …trained statistical models + software will be open to use
◦ …part of FP7 MetaNet Network of Excellence
Cross-lingual Information Retrieval is a technique for comparing documents written in different languages
◦ …still largely unsolved for comparing a large number of languages
We are introducing “language neutral document representation”
◦ …based on statistical representation
Using Wikipedia, we are building a 200x200 matrix of languages within
◦ …the solution will be open source