The Importance of Evaluation for Multilingual Information Retrieval

  1. The Importance of Evaluation for Multilingual Information Retrieval
     Carol Peters, ISTI-CNR, Pisa, Italy
     FIRE 2011, IIT Bombay, 2-4 December 2011

  2. From FIRE 2008 to FIRE 2010
     • FIRE 2008: CLEF: Objectives and First Results
     • FIRE 2010: 10 Years of CLEF: An Assessment
         • What we’ve done
         • What we’ve learned
         • What the next steps should be
     • FIRE 2011: Exploiting the Results for MLIR System Building

  3. MLIR/CLIR System Evaluation
     In IR, the role of an evaluation campaign is to:
     • Identify priority areas for research
         • evaluation permits hypotheses to be validated and progress assessed
     • Support system development and testing
         • evaluation saves developers time and money
     • 1997 – First MLIR/CLIR system evaluation campaigns in US and Japan: TREC and NTCIR
     • 2000 – MLIR/CLIR evaluation in Europe: CLEF (extension of CLIR track at TREC)
     • 2008 – FIRE: MLIR/CLIR evaluation for Indian languages

  4. Results
     These evaluation initiatives:
     • Promote research
     • Encourage creation of multi-disciplinary communities
     • Produce vast amounts of valuable scientific data
     • Favour understanding of issues involved in successful system development

  5. Outline
     • The Need for MLIR/CLIR?
     • What are the Challenges?
     • What is the Contribution of Evaluation?
     • The Example of CLEF

  6. MLIR in the Information Society
     • Web is an important platform for knowledge dissemination and acquisition
     • User information needs are increasingly varied
         • From primarily academic use to widespread commercial, leisure, educational, entertainment etc. uses
     • Content is available in many languages and non-English content is growing rapidly
     • Information providers and seekers should have equal opportunities
     • Preservation of national languages

  7. The Need for Multilingual Search
     http://www.internetworldstats.com/stats.htm

  8. Countries with most Internet Users

     Country         Population      Internet Users 2000   Internet Users 2011   Penetration (% of Pop.)
     China           1,336,718,015   22,500,000            485,000,000           36.3%
     United States     313,232,044   95,354,000            245,000,000           78.2%
     India           1,189,172,906    5,000,000            100,000,000            8.4%
     Japan             126,475,664   47,080,000             99,182,000           78.4%
     Brazil            203,429,773    5,000,000             75,982,000           37.4%
     Germany            81,471,834   24,000,000             65,125,000           79.9%
     Russia            138,739,892    3,100,000             59,700,000           43.0%
     UK                 62,698,362   15,400,000             51,442,100           82.0%
     France             65,102,719    8,500,000             45,262,000           69.5%
     Nigeria           155,215,573       20,000             43,982,200           28.3%

     http://www.internetworldstats.com/top20.htm
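
The penetration column in the table above is simply 2011 users divided by population. As a small worked check (a sketch, not part of the original slides), the Python below recomputes it for three of the listed countries and also derives a 2000-to-2011 growth factor. The function names `penetration` and `growth_factor` are made up for this illustration; the figures are copied from the table.

```python
# Figures copied from the slide's table: (population, users in 2000, users in 2011).
ROWS = {
    "India": (1_189_172_906, 5_000_000, 100_000_000),
    "Brazil": (203_429_773, 5_000_000, 75_982_000),
    "Nigeria": (155_215_573, 20_000, 43_982_200),
}


def penetration(population, users_2011):
    """Percentage of the population that was online in 2011."""
    return 100.0 * users_2011 / population


def growth_factor(users_2000, users_2011):
    """How many times larger the 2011 user base is than the 2000 one."""
    return users_2011 / users_2000


if __name__ == "__main__":
    for country, (pop, u2000, u2011) in ROWS.items():
        print(
            f"{country}: {penetration(pop, u2011):.1f}% online, "
            f"{growth_factor(u2000, u2011):.0f}x the users of 2000"
        )
```

The recomputed percentages agree with the table to one decimal place, which suggests the penetration column was derived from the 2011 user counts.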

  10. MLIR related research
      • Concerns the storage, access, retrieval and presentation of digital information in any of the world's languages.
      • Main areas of interest:
          • enabling technology (character encoding, scripts, internationalisation, localisation)
          • multiple language access, browsing, retrieval, display
          • crossing the language boundary (filtering, merging, ranking, selecting, presenting results)

  11. The Terminology
      • Multilingual Information Access (MLIA)
          • Accessing, querying and retrieving information from collections in any language (covering basic enabling techniques and including MLIR and CLIR)
      • Multilingual Information Retrieval (MLIR)
          • Information retrieval in multiple languages (includes CLIR)
      • Cross-Language Information Retrieval (CLIR)
          • Querying multilingual collections in one language in order to retrieve documents in other languages

  12. The Grand Challenge
      Fully multilingual, multimodal IR systems
      • capable of processing a query in any medium and any language
      • finding relevant information from a multilingual multimedia collection containing documents in any language and form
      • and presenting it in the style most likely to be useful to the user
      Oard & Hull, AAAI Spring Symposium, Stanford, 1997

  13. MLIR/CLIR System Development is Complex
      • There are 6,800 known languages spoken in 200 countries
      • ca. 2,250 have writing systems (the others are only spoken)
      • Just 300 have some kind of language processing tools
      MLIR/CLIR system development involves integrating IR techniques with language processing tools and language transfer mechanisms

  14. MLIR/CLIR System Development is Complex
      • Multilingual Portals (Localization)
          • How many languages / how many levels should be multilingual / how to handle updates / linguistically and culturally dependent issues
      • Monolingual Search for Multiple Languages
          • encoding and representation issues / language identification / indexing issues (stop words, stemmers, morphological analysers, named entity recognition, ...)
      • Cross-Language Search
          • translation resources (lexicons, corpora, MT systems)
      • Presentation of Results
          • in a form interpretable and exploitable by the user

  15. Main Challenges
      • Understanding Search in the Multilingual Context (language & culture)
      • Globalisation (internationalisation & localisation)
      • MLIR/CLIR System Development
          • Language processing tools
          • Best retrieval mechanisms (indexing, matching, merging)
          • Best translation resources
      • From text to multimodal retrieval
      • Providing effective user support
      • Going from Research to Practice

  18. Building a CLIR System
      • Pre-process & index both documents and queries, generally using language-dependent techniques (tokenisation, stopwords, stemming, morphological analysis, decompounding, etc.)
      • Translate: queries or documents (or both)
      • Translation resources
          • Machine Translation (MT)
          • Parallel/comparable corpora
          • Bilingual Dictionaries
          • Multilingual Thesauri
          • Conceptual Interlingua
      • Find relevant documents in target collection(s) & present results
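
The three steps on this slide (pre-process, translate, retrieve) can be illustrated with a toy query-translation pipeline. The Python sketch below is only one possible reading, assuming a bilingual dictionary as the translation resource and plain term-frequency overlap as the matching function; the stopword lists, the `EN_TO_IT` dictionary, the documents and all function names are invented for illustration and are not taken from the talk or from any CLEF system.

```python
import re
from collections import Counter

# Toy resources, invented for this illustration only.
STOPWORDS = {
    "en": {"the", "of", "in"},
    "it": {"il", "la", "di", "in", "del"},
}
EN_TO_IT = {
    "evaluation": ["valutazione"],
    "search": ["ricerca", "cerca"],
    "system": ["sistema"],
}
DOCS_IT = {
    "d1": "La valutazione del sistema di ricerca",
    "d2": "Il sistema di trasporto in Italia",
}


def preprocess(text, lang):
    """Step 1: lowercase, tokenise, and drop language-specific stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS[lang]]


def translate_query(tokens):
    """Step 2: replace each term by all its dictionary translations.

    Out-of-vocabulary terms (e.g. proper names) are kept untranslated.
    """
    translated = []
    for token in tokens:
        translated.extend(EN_TO_IT.get(token, [token]))
    return translated


def search(query_en):
    """Step 3: rank target-language documents by summed term frequency."""
    query_terms = translate_query(preprocess(query_en, "en"))
    ranking = []
    for doc_id, text in DOCS_IT.items():
        term_freq = Counter(preprocess(text, "it"))
        score = sum(term_freq[t] for t in query_terms)
        if score > 0:
            ranking.append((doc_id, score))
    # Highest score first; ties broken by document id for determinism.
    return sorted(ranking, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    print(search("evaluation of the search system"))
```

Keeping every dictionary translation, as done here, is the simplest choice; it trades precision for recall and leaves the ambiguity problem unsolved. A real system would typically add stemming or decompounding in `preprocess` and a stronger ranking function.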

  19. Main CLIR Difficulties (I)
      • Language identification
      • Morphology: inflection, derivation, compounding, ...
      • OOV terms, e.g. proper names, terminology
      • Multi-word concepts, e.g. phrases and idioms
      • Ambiguity, e.g. polysemy
      • Handling many languages: L1 -> Ln
      • Merging results from different sources / media
      • Presenting the results in a useful fashion
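
To make the "merging results" difficulty concrete: scores produced by separate per-language (or per-medium) indexes are generally not comparable, so they cannot simply be sorted together. The Python sketch below shows one simple strategy, min-max normalisation of each result list followed by a single merged sort. This is an assumption chosen for illustration, not a method the slides prescribe, and the function names are hypothetical.

```python
def min_max(run):
    """Rescale the scores of one result list to the range [0, 1].

    `run` is a list of (doc_id, score) pairs from a single source.
    """
    if not run:
        return []
    scores = [score for _, score in run]
    low, high = min(scores), max(scores)
    if high == low:
        # All scores equal: treat every document as equally good.
        return [(doc_id, 1.0) for doc_id, _ in run]
    return [(doc_id, (score - low) / (high - low)) for doc_id, score in run]


def merge(runs, k=10):
    """Merge several result lists into one ranking of at most k documents."""
    merged = [pair for run in runs for pair in min_max(run)]
    # Highest normalised score first; ties broken by document id.
    merged.sort(key=lambda pair: (-pair[1], pair[0]))
    return merged[:k]


if __name__ == "__main__":
    french = [("fr1", 12.0), ("fr2", 6.0), ("fr3", 0.0)]
    german = [("de1", 0.9), ("de2", 0.3)]
    print(merge([french, german], k=3))
```

Min-max normalisation always gives each source's top document a score of 1.0, even when that source holds nothing relevant, which is one reason merging is counted among the difficulties.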

  20. Main CLIR Difficulties (II)
      • CLIR systems need clever pre-processing of target collections (e.g. semantic analysis, classification, information extraction)
      • CLIR systems need intelligent post-processing of results: merging / summarization / translation
      • CLIR systems need well-developed resources
          • Language Processing Tools
          • Language Resources
      • Resources are expensive to acquire, maintain, update

  21. CLIR for Multimedia
      • Retrieval from a mixed media collection is a non-trivial problem
      • Different media are processed in different ways and suffer from different kinds of indexing errors:
          • spoken documents indexed using speech recognition
          • handwritten documents indexed using OCR
          • images indexed using significant features
      • Need for complex integration of multiple technologies
      • Need for merging of results from different sources

  22. Supporting the User (Clough, October 2011)

  23. MLIR/CLIR System Evaluation is Complex
      • Need to evaluate single components
      • Need to evaluate overall system performance
      • Need to distinguish CL aspects from IR issues
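
One common way to separate cross-language effects from general IR effects is to score a cross-language run and a monolingual run on the same topics and collection, then report the former as a percentage of the latter. The Python sketch below illustrates this with mean average precision (MAP). It is a sketch under stated assumptions: the slide does not name a measure, and the function names and toy rankings are invented here.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """MAP over all topics in `qrels`; a topic missing from `runs` scores 0."""
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(q, []), rel) for q, rel in qrels.items())
    return total / len(qrels)


def relative_effectiveness(cross_map, mono_map):
    """Cross-language MAP as a percentage of the monolingual baseline."""
    return 100.0 * cross_map / mono_map


if __name__ == "__main__":
    qrels = {"q1": {"a", "c"}}
    monolingual = {"q1": ["a", "c", "b"]}
    cross_language = {"q1": ["b", "a", "c"]}
    mono = mean_average_precision(monolingual, qrels)
    cross = mean_average_precision(cross_language, qrels)
    print(f"CLIR reaches {relative_effectiveness(cross, mono):.1f}% of monolingual MAP")
```

Because both runs share the indexing and ranking components, the gap between them gives a rough indication of what the translation step costs. Evaluating single components, as the slide also asks, would require further targeted experiments.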
