Ljubljana, Slovenia MultilingualWeb Workshop, Madrid, Oct 26th 2010

Marko Grobelnik (marko.grobelnik@ijs.si)
Jozef Stefan Institute (http://www.ijs.si/), Ljubljana, Slovenia
MultilingualWeb Workshop, Madrid, Oct 26th 2010

• Imagine a user who understands several languages
  ◦ e.g. English, German, Italian, Croatian, Serbian, Slovenian (…not so uncommon in Slovenia)
• Such a user would want to browse and search documents in all the known languages
• …but of course, a query can be provided in only one language
• We need a search engine which, given a query in one language, returns documents in selected languages
  ◦ …this is called “cross-lingual information retrieval”

• …there are many research fields working with textual data and solving different problems:
  ◦ Computational Linguistics, Machine Translation, Information Retrieval, Text Mining, Semantic Web, …
• Each of the research fields “represents” text in a slightly different way

• Character (character n-grams and sequences)
• Words (stop-words, stemming, lemmatization)
• Phrases (word n-grams, proximity features)
• Part-of-speech tags
• Taxonomies / thesauri
• Vector-space model
• Correlated V.S.M.
• Language models
• Full-parsing
• Collaborative tagging / Web2.0
• Templates / Frames
• Ontologies / First order theories

• Character (character n-grams and sequences)
• Words (stop-words, stemming, lemmatization)
• Phrases (word n-grams, proximity features)
• Part-of-speech tags
  ◦ Tasks at these levels: Search (Inf. Retrieval), Categorization, Clustering, Summarization, …
• Taxonomies / thesauri
• Vector-space model
• Correlated V.S.M.
  ◦ Tasks at these levels: Cross-lingual Inf. Retrieval, Connecting Text + Images
• Language models
• Full-parsing
  ◦ Tasks at these levels: Machine translation, Spam filtering, …
• Collaborative tagging / Web2.0
• Templates / Frames
• Ontologies / First order theories
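The lowest levels of the list above (characters, words, phrases) can be illustrated with a minimal Python sketch. The function names are hypothetical and whitespace tokenization is an assumption made for brevity; stemming and lemmatization are left out.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=2):
    """Return word n-grams (phrases) as tuples, using whitespace tokens."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text, stop_words=frozenset()):
    """Count words after dropping stop-words (a simple word-level view)."""
    return Counter(t for t in text.lower().split() if t not in stop_words)


if __name__ == "__main__":
    print(char_ngrams("text"))
    print(word_ngrams("Cross lingual information retrieval"))
    print(bag_of_words("the cat and the hat", {"the", "and"}))
```

Each function yields features that could feed a vector-space model, the next level up in the list.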

• Ideally, we would want to represent the text in a language-neutral form
  ◦ …so that document content would be comparable regardless of the language
• Having this, we can solve many problems still unaddressed on the market…
• Nowadays, we can solve this on a large scale…
  ◦ …because of the availability of large amounts of “comparable corpora” like Wikipedia

[Figure: documents in Slovenian, Slovak, German, English, Czech, French, Hungarian, Spanish, Greek, Italian, Danish, Finnish, Lithuanian, Swedish and Dutch feed a Language Neutral Document Representation (trained with machine learning). A new document represented as text in any of the above languages becomes a new document represented in a Language Neutral way.]

…enables cross-lingual retrieval, categorization, clustering, …
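The idea of a language-neutral representation can be illustrated with a toy Python sketch. This is not the method used in the talk: it assumes a hypothetical hand-written table (`TERM_CONCEPTS`) that maps each term of each language onto three shared concepts, whereas the slides describe a representation trained with machine learning. Documents and queries are projected into the shared concept space and compared by cosine similarity.

```python
import math
from collections import Counter

# Hypothetical toy data: term weights over 3 shared concepts
# (music, football, river). A real system would learn these weights.
TERM_CONCEPTS = {
    "en": {"music": [1.0, 0.0, 0.0], "football": [0.0, 1.0, 0.0], "river": [0.0, 0.0, 1.0]},
    "de": {"musik": [1.0, 0.0, 0.0], "fussball": [0.0, 1.0, 0.0], "fluss": [0.0, 0.0, 1.0]},
    "sl": {"glasba": [1.0, 0.0, 0.0], "nogomet": [0.0, 1.0, 0.0], "reka": [0.0, 0.0, 1.0]},
}
DIM = 3


def to_neutral(text, lang):
    """Project a text onto the shared concept space (unknown terms are ignored)."""
    weights = TERM_CONCEPTS[lang]
    vec = [0.0] * DIM
    for term, count in Counter(text.lower().split()).items():
        for i, w in enumerate(weights.get(term, [0.0] * DIM)):
            vec[i] += count * w
    return vec


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query, query_lang, docs):
    """Rank (lang, text) documents against a query written in query_lang."""
    q = to_neutral(query, query_lang)
    scored = [(cosine(q, to_neutral(text, lang)), lang, text) for lang, text in docs]
    return sorted(scored, key=lambda s: s[0], reverse=True)


if __name__ == "__main__":
    docs = [("de", "musik musik fluss"), ("sl", "nogomet"), ("sl", "glasba")]
    for score, lang, text in search("music", "en", docs):
        print(f"{score:.3f} {lang} {text}")
```

Because every document ends up in the same vector space, the same comparison could also serve cross-lingual categorization or clustering.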

Wikipedia Languages

№     Language       Articles
1     English        3,451,276
2     German         1,139,687
3     French         1,022,762
4     Polish         740,342
5     Italian        739,961
6     Japanese       711,765
7     Spanish        663,201
…     …
32    Bulgarian      107,739
33    Persian        107,564
34    Slovenian      101,731
35    Waray-Waray    100,454
…     …
92    Walloon        11,791
93    Irish          11,623
94    Chuvash        11,620
95    Armenian       11,197
96    Yoruba         10,167
…     …
192   Picard         1,092
193   Aymara         1,088
194   Wolof          1,082
195   Tumbuka        1,061

• With machine learning techniques we can learn “language neutral document representation”…
• …planned for ~200 Wikipedia languages having over 1000 articles
• The goal is to have an updated 200x200 matrix of languages for comparing document content
  ◦ …trained statistical models + software will be open to use
  ◦ …part of FP7 MetaNet Network of Excellence
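As a rough sketch of what a language-by-language matrix involves, the Python snippet below filters languages by the 1000-article threshold and enumerates every ordered language pair. The function names are hypothetical, the article counts are a small sample from the table, and the "ExampleLang" entry is invented only to show the filter at work.

```python
from itertools import product

# Sample counts from the table; "ExampleLang" is a made-up entry
# below the threshold, included only for illustration.
ARTICLE_COUNTS = {
    "English": 3451276,
    "German": 1139687,
    "Slovenian": 101731,
    "Tumbuka": 1061,
    "ExampleLang": 500,
}


def eligible_languages(counts, min_articles=1000):
    """Return languages with more than min_articles articles, sorted by name."""
    return sorted(lang for lang, n in counts.items() if n > min_articles)


def pair_matrix(languages):
    """Create one empty cell per ordered (source, target) language pair."""
    return {(a, b): None for a, b in product(languages, repeat=2)}


if __name__ == "__main__":
    langs = eligible_languages(ARTICLE_COUNTS)
    cells = pair_matrix(langs)
    print(len(langs), "languages,", len(cells), "cells")
    print("200 languages would give", 200 * 200, "cells")
```

With 200 languages the matrix has 40,000 cells; if comparisons are symmetric, only 200 × 199 / 2 = 19,900 distinct cross-language pairs are needed.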

• Cross-lingual Information Retrieval is a technique for comparing documents written in different languages
  ◦ …still largely unsolved for comparing a large number of languages
• We are introducing “language neutral document representation”
  ◦ …based on statistical representation
• Using Wikipedia we are building a 200x200 matrix of languages within
  ◦ …solution will be open source
