The Importance of Evaluation for Multilingual Information Retrieval
Carol Peters, ISTI-CNR, Pisa, Italy
FIRE 2011, IIT Bombay, 2-4 December 2011
From FIRE 2008 to FIRE 2010
• FIRE 2008 – CLEF: Objectives and First Results
• FIRE 2010 – 10 Years of CLEF: An Assessment
  • What we’ve done
  • What we’ve learned
  • What the next steps should be
• FIRE 2011 – Exploiting the Results for MLIR System Building
MLIR/CLIR System Evaluation
In IR the role of an evaluation campaign is to:
• Identify priority areas for research: evaluation permits hypotheses to be validated and progress assessed
• Support system development and testing: evaluation saves developers time and money
1997 – First MLIR/CLIR system evaluation campaigns in US and Japan: TREC and NTCIR
2000 – MLIR/CLIR evaluation in Europe: CLEF (extension of CLIR track at TREC)
2008 – FIRE: MLIR/CLIR evaluation for Indian languages
Results
These evaluation initiatives:
• Promote research
• Encourage creation of multi-disciplinary communities
• Produce vast amounts of valuable scientific data
• Favour understanding of issues involved in successful system development
Outline
• The Need for MLIR/CLIR?
• What are the Challenges?
• What is the Contribution of Evaluation?
• The Example of CLEF
MLIR in the Information Society
• The Web is an important platform for knowledge dissemination and acquisition
• User information needs are increasingly varied
  • From primarily academic use to widespread commercial, leisure, educational, entertainment and other uses
• Content is available in many languages and non-English content is growing rapidly
• Information providers and seekers should have equal opportunities
• Preservation of national languages
The Need for Multilingual Search
Source: http://www.internetworldstats.com/stats.htm
Countries with most Internet Users

Country         Population       Internet Users 2000   Internet Users 2011   Penetration (% of Pop.)
China           1,336,718,015    22,500,000            485,000,000           36.3%
United States   313,232,044      95,354,000            245,000,000           78.2%
India           1,189,172,906    5,000,000             100,000,000           8.4%
Japan           126,475,664      47,080,000            99,182,000            78.4%
Brazil          203,429,773      5,000,000             75,982,000            37.4%
Germany         81,471,834       24,000,000            65,125,000            79.9%
Russia          138,739,892      3,100,000             59,700,000            43.0%
UK              62,698,362       15,400,000            51,442,100            82.0%
France          65,102,719       8,500,000             45,262,000            69.5%
Nigeria         155,215,573      20,000                43,982,200            28.3%

Source: http://www.internetworldstats.com/top20.htm
MLIR-related research
Concerns the storage, access, retrieval and presentation of digital information in any of the world's languages.
Main areas of interest:
• Enabling technology (character encoding, scripts, internationalisation, localisation)
• Multiple language access, browsing, retrieval, display
• Crossing the language boundary (filtering, merging, ranking, selecting, presenting results)
The Terminology
• Multilingual Information Access (MLIA): accessing, querying and retrieving information from collections in any language (covering basic enabling techniques and including MLIR and CLIR)
• Multilingual Information Retrieval (MLIR): information retrieval in multiple languages (includes CLIR)
• Cross-Language Information Retrieval (CLIR): querying multilingual collections in one language in order to retrieve documents in other languages
The Grand Challenge
Fully multilingual, multimodal IR systems:
• capable of processing a query in any medium and any language
• finding relevant information from a multilingual multimedia collection containing documents in any language and form
• and presenting it in the style most likely to be useful to the user
Oard & Hull, AAAI Spring Symposium, Stanford, 1997
MLIR/CLIR System Development is Complex
• There are 6,800 known languages spoken in 200 countries
• Ca. 2,250 have writing systems (the others are only spoken)
• Just 300 have some kind of language processing tools
• MLIR/CLIR system development involves integrating IR techniques with Language Processing tools and Language Transfer mechanisms
MLIR/CLIR System Development is Complex
• Multilingual Portals (Localization): how many languages / how many levels should be multilingual / how to handle updates / linguistically and culturally dependent issues
• Monolingual Search for Multiple Languages: encoding and representation issues / language identification / indexing issues (stop words, stemmers, morphological analysers, named entity recognition, ...)
• Cross-Language Search: translation resources (lexicons, corpora, MT systems)
• Presentation of Results: in a form interpretable and exploitable by the user
Main Challenges
• Understanding Search in the Multilingual Context (language & culture)
• Globalisation (internationalisation & localisation)
• MLIR/CLIR System Development
  • Language processing tools
  • Best retrieval mechanisms (indexing, matching, merging)
  • Best translation resources
• From text to multimodal retrieval
• Providing effective user support
• Going from Research to Practice
Building a CLIR System
• Pre-process & index both documents and queries – generally using language-dependent techniques (tokenisation, stopwords, stemming, morphological analysis, decompounding, etc.)
• Translate: queries or documents (or both)
• Translation resources
  • Machine Translation (MT)
  • Parallel/comparable corpora
  • Bilingual Dictionaries
  • Multilingual Thesauri
  • Conceptual Interlingua
• Find relevant documents in target collection(s) & present results
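The steps on this slide can be illustrated with a minimal sketch of query translation using a bilingual dictionary. This is not a description of any system evaluated at CLEF or FIRE: the stopword lists, the English-to-Italian dictionary and all function names below are hypothetical toy values, and the scoring is plain term-frequency overlap rather than a real ranking model.

```python
import re
from collections import Counter

# Hypothetical toy resources (assumptions for illustration only). A real system
# would use full stopword lists, stemmers and translation resources per language.
STOPWORDS = {
    "en": {"the", "of", "in", "a"},
    "it": {"il", "la", "di", "in", "un"},
}
EN_TO_IT = {
    "evaluation": ["valutazione"],
    "system": ["sistema"],
    "language": ["lingua", "linguaggio"],  # ambiguous term: keep all candidates
}


def preprocess(text, lang):
    """Tokenise, lowercase and remove stopwords for the given language."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS.get(lang, set())]


def translate_query(terms, dictionary):
    """Replace each term by all its dictionary translations.

    Out-of-vocabulary terms (e.g. proper names) are passed through unchanged.
    """
    translated = []
    for term in terms:
        translated.extend(dictionary.get(term, [term]))
    return translated


def build_index(docs, lang):
    """Map each document id to the term frequencies of its pre-processed text."""
    return {doc_id: Counter(preprocess(text, lang)) for doc_id, text in docs.items()}


def search(query_terms, index):
    """Score documents by summed term frequency of the query terms."""
    scores = {}
    for doc_id, tf in index.items():
        score = sum(tf[t] for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def clir_search(query, docs):
    """English query against an Italian collection: pre-process, translate, retrieve."""
    index = build_index(docs, "it")
    terms = translate_query(preprocess(query, "en"), EN_TO_IT)
    return search(terms, index)
```

Calling `clir_search("evaluation of the system", docs)` on a small dictionary of Italian documents returns the matching document ids with their scores. Translating the query rather than the documents is only one of the options named on the slide; it is chosen here because it keeps the sketch short.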
Main CLIR Difficulties (I)
• Language identification
• Morphology: inflection, derivation, compounding, …
• OOV terms, e.g. proper names, terminology
• Multi-word concepts, e.g. phrases and idioms
• Ambiguity, e.g. polysemy
• Handling many languages: L1 -> Ln
• Merging results from different sources / media
• Presenting the results in a useful fashion
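To make the merging difficulty concrete, here is a minimal sketch of one simple strategy: min-max normalising the scores of each per-language result list before interleaving them. The slide does not prescribe a merging method; the function names and the choice of min-max normalisation are assumptions made for illustration.

```python
def min_max_normalise(results):
    """Rescale the scores of one ranked list to the range [0, 1].

    `results` is a list of (doc_id, score) pairs from a single source.
    """
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every document as equally good.
        return [(doc_id, 1.0) for doc_id, _ in results]
    return [(doc_id, (score - lo) / (hi - lo)) for doc_id, score in results]


def merge_results(result_lists):
    """Merge several ranked lists into one, ordered by normalised score."""
    merged = []
    for results in result_lists:
        merged.extend(min_max_normalise(results))
    # Ties are broken by document id so that the output is deterministic.
    return sorted(merged, key=lambda kv: (-kv[1], kv[0]))
```

Raw scores from different collections or retrieval engines are usually not comparable, which is why the sketch normalises before merging. Min-max normalisation forces the top document of every source to score 1.0, so it implicitly assumes each source holds equally relevant material; that assumption often fails in practice and is part of what makes merging hard.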
Main CLIR Difficulties (II)
• CLIR systems need clever pre-processing of target collections (e.g. semantic analysis, classification, information extraction)
• CLIR systems need intelligent post-processing of results: merging / summarization / translation
• CLIR systems need well-developed resources
  • Language Processing Tools
  • Language Resources
• Resources are expensive to acquire, maintain, update
CLIR for Multimedia
• Retrieval from a mixed media collection is a non-trivial problem
• Different media are processed in different ways and suffer from different kinds of indexing errors:
  • spoken documents indexed using speech recognition
  • handwritten documents indexed using OCR
  • images indexed using significant features
• Need for complex integration of multiple technologies
• Need for merging of results from different sources
Supporting the User
(Clough, October 2011)
MLIR/CLIR System Evaluation is Complex
• Need to evaluate single components
• Need to evaluate overall system performance
• Need to distinguish CL aspects from IR issues
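One way to separate cross-language effects from general IR issues is to score a cross-language run and a monolingual run on the same topics and relevance judgements, then express the former as a fraction of the latter. The sketch below assumes mean average precision as the measure; the slide names no specific metric, and the function names and data layout are hypothetical.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list of doc ids against a set of relevant ids."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(run, qrels):
    """Mean of average precision over all topics that have relevance judgements.

    `run` maps topic id -> ranked list of doc ids.
    `qrels` maps topic id -> set of relevant doc ids.
    """
    aps = [average_precision(run.get(topic, []), rel) for topic, rel in qrels.items()]
    return sum(aps) / len(aps) if aps else 0.0


def relative_to_monolingual(cross_run, mono_run, qrels):
    """Cross-language MAP as a fraction of the monolingual baseline MAP."""
    mono = mean_average_precision(mono_run, qrels)
    return mean_average_precision(cross_run, qrels) / mono if mono else 0.0
```

A ratio well below 1.0 suggests that translation, rather than the underlying retrieval engine, is costing performance, since both runs share the same collection, index and ranking function. It evaluates the overall system only; single components such as stemmers or translation resources still need their own experiments.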