Evolution of NTCIR: Infrastructure for Large-Scale Evaluation and Testing of Information Access Technologies

Noriko Kando
National Institute of Informatics, Japan
http://research.nii.ac.jp/ntcir/ (from November 2009: http://ntcir.nii.ac.jp/)
kando (at) nii.ac.jp

With thanks to Tetsuya Sakai for the slides.

NTCIR@CLEF, 2009-10-01
NTCIR: NII Test Collection for Information Retrieval — Research Infrastructure for Evaluating Information Access

A series of evaluation workshops designed to enhance research in information-access technologies by providing an infrastructure for large-scale evaluations.
■ Data sets, evaluation methodologies, and a forum

The project started in late 1997; a workshop is held roughly once every 18 months (NTCIR-1 through NTCIR-7 so far).

Data sets (test collections, or TCs): scientific, news, patent, and web documents in Chinese, Korean, Japanese, and English.

Tasks:
- IR: cross-lingual tasks, patents, web, Geo
- QA: monolingual and cross-lingual tasks
- Summarization, trend information, patent maps
- Opinion analysis, text mining

Community-based research activities.

[Chart: growth in the number of participating groups and countries from NTCIR-1 to NTCIR-7.] NTCIR-7 participants: 82 groups from 15 countries.
Tasks (research areas) of the NTCIR workshops

[Table: which tasks ran at NTCIR-1 ('99) through NTCIR-6 ('07).] The research areas covered so far:
- Japanese IR (news and scientific documents)
- Cross-lingual IR
- Patent retrieval (including patent maps / classification)
- Web retrieval (including navigational search and Geo)
- Result classification
- Term extraction
- Question answering
- Information access dialogue
- Summarization (including evaluation metrics)
- Cross-lingual text summarization
- Trend information
- Opinion analysis
NTCIR-7 Clusters (2007.09—2008.12)

Cluster 1. Advanced CLIA:
- Complex CLQA (Chinese, Japanese, English)
- IR for QA (Chinese, Japanese, English)

Cluster 2. User-Generated:
- Multilingual Opinion Analysis

Cluster 3. Focused Domain: Patent
- Patent Translation (English -> Japanese)
- Patent Mining (paper -> IPC)

Cluster 4. MuST:
- Multi-modal Summarization of Trends

Held together with the 2nd International Workshop on Evaluating Information Access (EVIA).
NTCIR-8 Clusters (2008.07—2009.06)

Advanced CLIA:
- Complex CLQA (Chinese, Japanese)
- IR for QA (Chinese, Japanese)

GeoTime Retrieval (English, Japanese) — new.

User-Generated:
- Multilingual Opinion Analysis (news)
- [Pilot] Community QA (using Yahoo! Answers Japan) — new
- [Pilot?] Multilingual Opinion Analysis (blog)? — new?

Focused Domain Cluster (Patent):
- Patent Translation (English -> Japanese)
- Patent Mining (paper -> IPC)
- Evaluation — new

Held together with the 3rd International Workshop on Evaluating Information Access (EVIA).

Registration is still open! You are very much welcome to join us!
NTCIR-7: Advanced CLIA

Teruko Mitamura (CMU), Eric Nyberg (CMU), Ruihua Chen (MSRA), Fred Gey (UCB), Donghong Ji (Wuhan Univ.), Noriko Kando (NII), Chin-Yew Lin (MSRA), Chuan-Jie Lin (National Taiwan Ocean Univ.), Tsuneaki Kato (Tokyo Univ.), Tatsunori Mori (Yokohama National Univ.), Tetsuya Sakai (NewsWatch)

Advisor: K. L. Kwok (Queens College)
Complex Cross-lingual Question Answering (CCLQA) Task

- Different teams can exchange their components and create a "dream-team" QA system.
- Small teams that do not possess an entire QA system can contribute.
- The IR and QA communities can collaborate.
CCLQA = Complex CLQA

• Moving towards advanced, complex questions, from the factoid questions of NTCIR-5 and NTCIR-6.
• Four question types: events, biographies, definitions, and relationships.
• Examples of complex questions:
  - Definition: What is the Human Genome Project?
  - Relationship: What is the relationship between Saddam Hussein and Jacques Chirac?
  - Event: List major events in the formation of the European Union.
  - Biography: Who is Kim Jong-Il?
ACLIA: Evaluation — EPAN tool

[Screenshot of the EPAN evaluation tool.]
ACLIA: Evaluation

- CCLQA: nugget pyramid; automatic evaluation.
- IR4QA: MAP, MS nDCG, Q-measure (preference-based).
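The F3 figures quoted later for combined IR4QA + CCLQA runs come from this nugget-pyramid style of answer scoring: each nugget carries a weight reflecting how many assessors considered it vital, recall is weight-based, precision is a length allowance, and the two are combined with an F-measure using beta = 3. Below is a minimal sketch of that scoring scheme for illustration only; the 100-character allowance per matched nugget and the function name are assumptions, not the official ACLIA/EPAN implementation.

```python
# Minimal sketch of nugget-pyramid scoring in the style used for CCLQA.
# Each nugget carries a pyramid weight (fraction of assessors who called it
# vital); recall is weight-based, precision is a length allowance, and the
# two are combined with F(beta), beta = 3. The allowance of 100 characters
# per matched nugget is an assumption borrowed from the TREC QA tracks.
def nugget_f(matched_weights, all_weights, answer_length,
             beta=3.0, allowance_per_nugget=100):
    recall = sum(matched_weights) / sum(all_weights) if all_weights else 0.0
    allowance = allowance_per_nugget * len(matched_weights)
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)
```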
Traditional "ad hoc" IR vs. IR4QA

• Ad hoc IR (evaluated using Average Precision, etc.):
  - Find as many (partially or marginally) relevant documents as possible and put them near the top of the ranked list.
• IR4QA (evaluated using... what?):
  - Find relevant documents containing different correct answers?
  - Find multiple documents supporting the same correct answer, to enhance the reliability of that answer?
  - Combine partially relevant documents A and B to deduce a correct answer?
Average Precision (AP)

AP = (1/R) * sum over ranks r of I(r) * P(r),
where R is the number of relevant documents, P(r) is the precision at rank r, and I(r) = 1 iff the document at rank r is relevant (0 otherwise).

• Used widely since the advent of TREC.
• The mean over topics is referred to as "MAP".
• Cannot handle graded relevance (but many IR researchers just love it).
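As a concrete reference point, here is a minimal Python sketch of AP as defined above (binary relevance, credit counted only at relevant ranks); the function name and example numbers are illustrative, not part of any NTCIR tool.

```python
# Minimal sketch of Average Precision (AP), assuming one binary relevance
# flag per retrieved document (1 = relevant, 0 = not relevant) and the
# total number of relevant documents R for the topic.
def average_precision(rels, R):
    """rels: list of 0/1 flags in ranked order; R: total relevant docs."""
    hits = 0
    total = 0.0
    for r, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / r  # precision at rank r, counted only at relevant ranks
    return total / R if R > 0 else 0.0

# Example: relevant docs at ranks 1, 3 and 6, out of R = 4 relevant docs overall.
print(average_precision([1, 0, 1, 0, 0, 1], R=4))  # (1/1 + 2/3 + 3/6) / 4 ≈ 0.542
```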
Q-measure (Q)

Q = (1/R) * sum over ranks r of I(r) * BR(r),
where BR(r), the blended ratio at rank r, combines precision and normalised cumulative gain:
BR(r) = (C(r) + β·cg(r)) / (r + β·cg*(r)),
with C(r) the number of relevant documents in the top r, cg(r) the cumulative gain at rank r, cg*(r) the cumulative gain at rank r of an ideal ranked list, and β a persistence parameter (set to 1 here).

• Generalises AP and handles graded relevance.
• Properties similar to AP, and higher discriminative power [Sakai and Robertson, EVIA 2008].
• Not widely used, but has been used for QA and INEX as well as IR.
• [Sakai and Robertson, EVIA 2008] provides a user model for AP and Q.
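The same style of sketch for Q with graded gains, following the blended-ratio definition above; this is an illustrative reimplementation under those definitions, not the ir4qa_eval package, and the gain levels are assumed.

```python
# Minimal sketch of the Q-measure, assuming graded gains per retrieved
# document (e.g. 0/1/2/3) and the list of gains of all relevant documents
# for the topic (used to build the ideal ranked list). beta is the
# persistence parameter (beta = 1 on the slide).
def q_measure(gains, all_rel_gains, beta=1.0):
    R = len(all_rel_gains)                       # number of relevant documents
    ideal = sorted(all_rel_gains, reverse=True)  # ideal ranked list of gains
    cg = icg = 0.0                               # cumulative gain: system / ideal
    hits = 0                                     # C(r): relevant docs in top r
    total = 0.0
    for r, g in enumerate(gains, start=1):
        if r <= R:
            icg += ideal[r - 1]
        cg += g
        if g > 0:
            hits += 1
            total += (hits + beta * cg) / (r + beta * icg)  # blended ratio BR(r)
    return total / R if R > 0 else 0.0
```

With binary gains and β = 1, BR(r) reduces to precision at relevant ranks within the top R, which is how Q generalises AP.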
nDCG (Microsoft version)

nDCG = (sum of discounted gains for the system output) / (sum of discounted gains for an ideal output).

• Fixes a bug of the original nDCG.
• But lacks a parameter that reflects the user's persistence.
• The most popular graded-relevance metric.
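A minimal sketch of this form of nDCG, assuming the graded gain values are used directly with a 1/log2(1+r) discount from rank 1 (the exact gain mapping, e.g. 2^rel − 1 versus the raw level, is an assumption); again an illustration rather than the official evaluation code.

```python
import math

# Minimal sketch of nDCG in the "Microsoft" form: discounted gains of the
# system output divided by those of an ideal output, with a 1/log2(1+r)
# discount applied from rank 1.
def ndcg(gains, all_rel_gains, cutoff=1000):
    """gains: graded gains of the ranked output; all_rel_gains: gains of all
    relevant documents for the topic (for the ideal ranking)."""
    def dcg(gs):
        return sum(g / math.log2(r + 1) for r, g in enumerate(gs[:cutoff], start=1))
    ideal = sorted(all_rel_gains, reverse=True)
    idcg = dcg(ideal)
    return dcg(gains) / idcg if idcg > 0 else 0.0
```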
IR4QA evaluation package (works for ad hoc IR in general)

Computes AP, Q, nDCG, RBP, NCU [Sakai and Robertson, EVIA 2008] and so on.
http://research.nii.ac.jp/ntcir/tools/ir4qa_eval-en
• 12 participants from China/Taiwan, the USA, and Japan.
• 40 CS runs (22 CS-CS monolingual, 18 EN-CS cross-lingual)
• 26 CT runs (19 CT-CT monolingual, 7 EN-CT cross-lingual)
• 25 JA runs (14 JA-JA monolingual, 11 EN-JA cross-lingual)
Major Approaches

• CMUJAV (CS-CS, EN-CS, JA-JA, EN-JA): proposes pseudo-relevance feedback using lexico-semantic patterns (LSP-PRF).
• CYUT (EN-CS, EN-CT, EN-JA): uses Wikipedia in several ways; post hoc results.
• MITEL (EN-CS, CT-CT): SMT and Baidu used for translation; data fusion.
• RALI (CS-CS, EN-CS, CT-CT, EN-CT): uses Wikipedia in several ways; high performance after a bug fix.
Combining IR4QA & CCLQA

• EN-CS: IR by CMU, QA by ATR/NiCT — F3 = 0.2763
• CS-CS: IR by KECIR, QA by Apath — F3 = 0.2695
• EN-JA: IR by CMU, QA by Forst — F3 = 0.2873 (CMU: 0.1739)
• JA-JA: IR by BRKLY, QA by CMU — F3 = 0.2611
System ranking by Q/nDCG vs. that by AP

[Scatter plots comparing system rankings for the CS, CT, and JA runs.]

By definition, nDCG is more forgiving of low-recall runs than AP and Q.
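A toy illustration of this point (assumed numbers, not NTCIR results): with ten relevant documents and a run that finds only two of them, at ranks 5 and 10, AP stays very low because each hit is charged against all ten relevant documents, while nDCG's slowly decaying log discount still gives the run noticeable credit.

```python
import math

# Toy illustration (assumed numbers, not NTCIR data): R = 10 relevant docs,
# binary gains, and a run that retrieves only two of them, at ranks 5 and 10.
R = 10
ranks_of_hits = [5, 10]

ap = sum((i + 1) / r for i, r in enumerate(ranks_of_hits)) / R
dcg = sum(1 / math.log2(r + 1) for r in ranks_of_hits)
idcg = sum(1 / math.log2(r + 1) for r in range(1, R + 1))

print(f"AP   = {ap:.3f}")          # about 0.04
print(f"nDCG = {dcg / idcg:.3f}")  # about 0.15
```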