
Overview of the ACLIA IR4QA (Information Retrieval for Question Answering) Task


  1. Overview of the ACLIA IR4QA (Information Retrieval for Question Answering) Task. Tetsuya Sakai, Noriko Kando, Chuan-Jie Lin, Teruko Mitamura, Donghong Ji, Kuang-Hua Chen, Eric Nyberg. 18th December 2008 @ NTCIR-7, Tokyo

  2. TALK OUTLINE 1. Task Objectives 2. Relevance Assessments 3. Evaluation Metrics 4. Participating Teams 5. Official Results 6. Lazy Evaluation 7. Unanswered Questions

  3. What are the effective IR techniques for QA?

  4. Traditional “ad hoc” IR vs IR4QA • Ad hoc IR (evaluated using Average Precision etc.): find as many (partially or marginally) relevant documents as possible and put them near the top of the ranked list • IR4QA (evaluated using… WHAT?): find relevant documents containing different correct answers? Find multiple documents supporting the same correct answer, to enhance the reliability of that answer? Combine partially relevant documents A and B to deduce a correct answer?

  5. TALK OUTLINE 1. Task Objectives 2. Relevance Assessments 3. Evaluation Metrics 4. Participating Teams 5. Official Results 6. Lazy Evaluation 7. Unanswered Questions

  6. Pooling for relevance assessments. For each topic, the submitted runs (up to 1000 documents per run) from Systems 1 to N are pooled to a depth of at least 30, and the pooled documents receive relevance assessments. Relevance levels: L2 (relevant), L1 (partially relevant), L0 (judged nonrelevant). Target document collections: CS (Simplified Chinese), CT (Traditional Chinese), JA (Japanese).
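
As a concrete illustration of the pooling step, here is a minimal Python sketch; the function name, data layout and toy run data are assumptions for illustration, not part of the official assessment toolkit.

```python
# Minimal sketch of depth-k pooling. Each run is a ranked list of doc IDs for one
# topic; the pool is the union of the top-k documents across all submitted runs,
# which is then handed to the assessors. Names and toy data are illustrative only.

def build_pool(runs, depth=30):
    """Return the depth-`depth` pool (a set of doc IDs) for a single topic."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

# Toy example with two runs (real runs contain up to 1000 documents per topic).
runs = {
    "RUN-A": ["d3", "d7", "d1", "d9"],
    "RUN-B": ["d7", "d2", "d3", "d5"],
}
print(sorted(build_pool(runs, depth=3)))  # ['d1', 'd2', 'd3', 'd7']
```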

  7. Different pool depths for different topics. Assessing the depth-30 pool was mandatory for all topics. For some topics, assessors went further: the depth-50 pool (minus the depth-30 pool), then the depth-70 pool (minus the depth-50 pool), then the depth-90 pool (minus the depth-70 pool), and finally the depth-100 pool (minus the depth-90 pool). See IR4QA Overview Tables 29-31 for details. Relevance assessments were coordinated independently by Donghong Ji (CS), Chuan-Jie Lin (CT) and Noriko Kando (JA).

  8. Sorting the pooled documents for assessors • Traditional approach: docs sorted by document IDs • IR4QA approach: sort the docs in the depth-X pool by: - the number of runs containing the doc at or above rank X (primary sort key) - the sum of the ranks of the doc within these runs (secondary sort key). Present ``popular'' documents first!
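
The sort described on this slide can be sketched as follows; the function name and data layout are assumptions for illustration, not the official assessment tool.

```python
# A sketch of the pool ordering above: primary key = number of runs containing the
# document at or above rank X (more is better), secondary key = sum of the document's
# ranks within those runs (smaller is better). Names are illustrative.

def sort_pool_for_assessors(runs, depth):
    run_count = {}  # doc -> number of runs containing it at or above rank `depth`
    rank_sum = {}   # doc -> sum of its ranks within those runs
    for ranked_docs in runs.values():
        for rank, doc in enumerate(ranked_docs[:depth], start=1):
            run_count[doc] = run_count.get(doc, 0) + 1
            rank_sum[doc] = rank_sum.get(doc, 0) + rank
    # "Popular" documents first: descending run count, then ascending rank sum.
    return sorted(run_count, key=lambda d: (-run_count[d], rank_sum[d]))
```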

  9. Assumptions behind the sort 1. Popular docs are more likely to be relevant than others; supported by [Sakai and Kando EVIA 08]. 2. If relevant docs are concentrated near the top of the list to be assessed, the assessors can judge more efficiently and consistently; at NTCIR-2, the assessors actually did not like doc lists sorted by doc IDs (but we need more empirical evidence).

  10. TALK OUTLINE 1. Task Objectives 2. Relevance Assessments 3. Evaluation Metrics 4. Participating Teams 5. Official Results 6. Lazy Evaluation 7. Unanswered Questions

  11. Average Precision (AP): AP = (1/R) Σ_{r=1}^{L} I(r) P(r), where R is the number of relevant docs, I(r) = 1 iff the doc at rank r is relevant, and P(r) is the precision at rank r. • Used widely since the advent of TREC • Mean over topics is referred to as “MAP” • Cannot handle graded relevance (but many IR researchers just love it)
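
A minimal sketch of AP under the binary-relevance definition above; the function and variable names are illustrative.

```python
# Minimal sketch of Average Precision. `ranked_docs` is one system run;
# `relevant` is the set of relevant doc IDs for the topic (R = |relevant|).

def average_precision(ranked_docs, relevant):
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for r, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:        # I(r) = 1
            hits += 1
            total += hits / r      # precision at rank r
    return total / len(relevant)   # divide by R

# MAP is simply the mean of AP over all topics.
```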

  12. Q-measure (Q): Q = (1/R) Σ_{r=1}^{L} I(r) BR(r), where BR(r) is the blended ratio at rank r (combines precision and normalised cumulative gain) and the persistence parameter β is set to 1. • Generalises AP and handles graded relevance • Properties similar to AP, and higher discriminative power [Sakai and Robertson EVIA 08, which also provides a user model for AP and Q] • Not widely used, but has been used for QA and INEX as well as IR
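
A sketch of Q-measure using the standard blended ratio BR(r) = (C(r) + β·cg(r)) / (r + β·cg*(r)), where C(r) is the number of relevant docs in the top r, cg(r) is the cumulative gain at rank r and cg*(r) is the cumulative gain of the ideal output. The gain mapping used here (e.g. L2 → 2, L1 → 1, L0 → 0) is an assumption for illustration, not necessarily the official setting.

```python
# A sketch of Q-measure with beta = 1. `gains` maps each judged doc ID to its gain
# (the mapping itself is illustrative).

def q_measure(ranked_docs, gains, beta=1.0):
    R = sum(1 for g in gains.values() if g > 0)        # number of relevant docs
    if R == 0:
        return 0.0
    ideal = sorted(gains.values(), reverse=True)       # gains of the ideal ranked output
    cg = cig = 0.0                                     # cumulative gain / ideal cumulative gain
    count, total = 0, 0.0                              # count = C(r), relevant docs in top r
    for r, doc in enumerate(ranked_docs, start=1):
        g = gains.get(doc, 0.0)
        cg += g
        cig += ideal[r - 1] if r <= len(ideal) else 0.0
        if g > 0:                                      # I(r) = 1
            count += 1
            total += (count + beta * cg) / (r + beta * cig)   # blended ratio BR(r)
    return total / R
```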

  13. nDCG (Microsoft version): the sum of discounted gains for the system output, divided by the sum of discounted gains for an ideal output. • Fixes a bug of the original nDCG • But lacks a parameter that reflects the user’s persistence • Most popular graded-relevance metric
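
A sketch of this nDCG variant, assuming the commonly cited exponential gain (2^g − 1) and log2(r + 1) discount; the cutoff value and the gain mapping are assumptions for illustration.

```python
# A sketch of nDCG: discounted gain sum for the system output, normalised by the
# discounted gain sum of an ideal output. `gains` maps judged doc IDs to gain values.
import math

def ndcg(ranked_docs, gains, cutoff=1000):
    def dcg(gain_list):
        return sum((2 ** g - 1) / math.log2(r + 1)
                   for r, g in enumerate(gain_list[:cutoff], start=1))
    system_gains = [gains.get(doc, 0.0) for doc in ranked_docs]
    ideal_gains = sorted(gains.values(), reverse=True)
    ideal_dcg = dcg(ideal_gains)
    return dcg(system_gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```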

  14. IR4QA evaluation package (works for ad hoc IR in general). Computes AP, Q, nDCG, RBP, NCU [Sakai and Robertson EVIA 08] and so on. http://research.nii.ac.jp/ntcir/tools/ir4qa_eval-en

  15. TALK OUTLINE 1. Task Objectives 2. Relevance Assessments 3. Evaluation Metrics 4. Participating Teams 5. Official Results 6. Lazy Evaluation 7. Unanswered Questions

  16. • 12 participants from China/Taiwan, USA, Japan • 40 CS runs (22 monolingual CS-CS, 18 crosslingual EN-CS) • 26 CT runs (19 monolingual CT-CT, 7 crosslingual EN-CT) • 25 JA runs (14 monolingual JA-JA, 11 crosslingual EN-JA)

  17. Oral presentations • RALI (CS-CS, EN-CS, CT-CT, EN-CT): uses Wikipedia to extract cue words for BIOGRAPHY questions; extracts person names using Wikipedia and Google; uses Google translation • CYUT (EN-CS, EN-CT, EN-JA): uses Wikipedia for query expansion and translation; uses Google translation • MITEL (EN-CS, CT-CT): uses SMT and Baidu for translation; data fusion • CMUJAV (CS-CS, EN-CS, JA-JA, EN-JA): proposes Pseudo Relevance Feedback using Lexico-Semantic Patterns (LSP-PRF)

  18. Other interesting approaches • BRKLY (JA-JA): a very experienced TREC/NTCIR participant • HIT (EN-CS): PRF most successful • KECIR (CS-CS): query expansion length optimised for each question type (definition, biography…) • NLPAI (CS-CS): uses question analysis files from other teams (next slide) • NTUBROWS (CT-CT): query term filtering, data fusion • OT (CS-CS, CT-CT, JA-JA): data fusion-like PRF • TA (EN-JA): SMT document translation from NTCIR-6 • WHUCC (CS-CS): document reranking. Please visit the posters of all 12 IR4QA teams!

  19. NLPAI (CS-CS) used question analysis files from other teams. Example key term files for the same question (about 宇宙大爆炸理论, the Big Bang theory):
      CSWHU-CS-CS-01-T:
      <KEYTERMS>
        <KEYTERM SCORE="1.0">宇宙大爆炸</KEYTERM>
        <KEYTERM SCORE="0.3">理论</KEYTERM>
      </KEYTERMS>
      Apath-CS-CS-01-T:
      <KEYTERMS>
        <KEYTERM SCORE="1.0">宇宙大爆炸理论</KEYTERM>
      </KEYTERMS>
      CMUJAV-CS-CS-01-T:
      <KEYTERMS>
        <KEYTERM SCORE="1.0">宇宙</KEYTERM>
        <KEYTERM SCORE="1.0">大</KEYTERM>
        <KEYTERM SCORE="1.0">爆炸</KEYTERM>
        <KEYTERM SCORE="1.0">理论</KEYTERM>
        <KEYTERM SCORE="1.0">宇宙 大 爆炸 理论</KEYTERM>
        <KEYTERM SCORE="1.0">宇宙大爆炸理论</KEYTERM>
        <KEYTERM SCORE="1.0">宇宙 大 爆炸</KEYTERM>
        <KEYTERM SCORE="1.0">宇宙大爆炸</KEYTERM>
      </KEYTERMS>
      Different teams come up with different sets of query terms with different weights. This clearly affects retrieval performance. Special thanks to Maofu Liu (NLPAI).
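
A minimal sketch of reading a KEYTERMS block like the ones above; the XML layout follows the slide, while the standalone-snippet setting and variable names are assumptions.

```python
# Parse a KEYTERMS block into (term, weight) pairs.
import xml.etree.ElementTree as ET

snippet = """
<KEYTERMS>
  <KEYTERM SCORE="1.0">宇宙大爆炸</KEYTERM>
  <KEYTERM SCORE="0.3">理论</KEYTERM>
</KEYTERMS>
"""

root = ET.fromstring(snippet)
keyterms = [(kt.text.strip(), float(kt.get("SCORE"))) for kt in root.findall("KEYTERM")]
print(keyterms)  # [('宇宙大爆炸', 1.0), ('理论', 0.3)]
```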

  20. TALK OUTLINE 1. Task Objectives 2. Relevance Assessments 3. Evaluation Metrics 4. Participating Teams 5. Official Results 6. Lazy Evaluation 7. Unanswered Questions

  21. CS T-runs: Top 3 teams
      Mean AP:   1. OT-CS-CS-04-T .6337   2. MITEL-EN-CS-03-T .5959   3. CMUJAV-CS-CS-02-T .5930
      Mean Q:    1. OT-CS-CS-04-T .6490   2. MITEL-EN-CS-03-T .6124   3. CMUJAV-CS-CS-02-T .6055
      Mean nDCG: 1. OT-CS-CS-04-T .8270*  2. CMUJAV-CS-CS-02-T .7951  3. MITEL-EN-CS-01-T .7949
      - MITEL is very good even though it is a crosslingual run
      - OT significantly outperforms CMUJAV with Mean nDCG (two-sided bootstrap test; α = 0.05), marked * above
      - nDCG disagrees with AP and Q
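
For reference, a generic paired two-sided bootstrap test over per-topic scores looks like the sketch below; this is a textbook shift-method version, not necessarily the exact procedure used for the official significance tests.

```python
# Paired two-sided bootstrap significance test over per-topic scores (shift method).
import random

def paired_bootstrap_p(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided p-value for the mean per-topic difference between systems A and B."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    shifted = [d - observed for d in diffs]        # impose the null hypothesis (mean diff = 0)
    extreme = 0
    for _ in range(trials):
        resample = [rng.choice(shifted) for _ in diffs]
        if abs(sum(resample) / len(resample)) >= abs(observed):
            extreme += 1
    return extreme / trials                        # reject at alpha = 0.05 if p < 0.05
```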
