TREC 2003 Tracks: A Tale of Two Evaluations (TREC and RIA)
Donna Harman
Sponsored by: NIST, ARDA, DARPA

[Figure: timeline of TREC tracks, 1992-2003, grouped by theme: Static text (Ad Hoc, Robust); Streamed text (Filtering, Routing); Human-in-the-loop (Interactive, HARD); Beyond just English (Chinese, Spanish, X → {X,Y,Z}); Beyond text (Speech, OCR, Video); Web searching (VLC, Web); Answers, not docs (Q&A, Novelty); Retrieval in a domain (Genome).]

Genomics Track

• New track for 2003
  – first year of a 5-year plan
• Motivation: explore retrieval in a domain
• Two tasks
  – primary: ad hoc task of finding MEDLINE records that focus on the basic biology of 50 specific gene names; GeneRIF data used as surrogate answers
  – secondary: extract GeneRIF data from 139 articles

QA 2003 Main Task

• Three question types
  – 413 factoids: same as the passages task, except the response must be an exact answer, not a document extract
  – 37 lists: assemble a set of instances, where each instance is a factoid question answer
  – 50 definitions: return text strings that together define the target of the question
• Final score is a weighted average of the components:
  FinalScore = ½ FactoidScore + ¼ ListScore + ¼ DefScore

QA Definition Component

• 50 questions asking for a definition of a term or biographical data for a person
  – Who is Vlad the Impaler? What is pH in chemistry?
• Questions drawn from the same logs as the factoids
• Assessor created a definition by searching the docs
• System response is an unordered set of strings
  – each string represents a different facet of the definition
  – no limit on the length of strings or the number of strings
• Assessor matched his facets to the system strings
  – could be 0, 1, or multiple matches per string
• F score with recall weighted 5 times "precision" (see the scoring sketch below)
  – "precision" is a function of length

QA Main Task Results

[Figure: final combined scores (0 to 0.6), broken into Factoid, List, and Definition components, for the best main task run per group for the top 10 groups: isi03a, LCCmainS03, nusmml03r2, lexiclone92, BBN2003C, MITCSAIL03a, irstqa2003w, IBM2003c, Albany03I2, FDUT12QA3.]
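To make the scoring concrete, here is a minimal Python sketch of the definition F score and the combined main-task score. The function names are illustrative, and the 100-character length allowance behind "precision" is an assumption drawn from the published track description rather than something stated on these slides.

```python
def definition_f_score(num_matched, num_facets, response_length,
                       beta=5.0, allowance_per_match=100):
    """F score with recall weighted beta (= 5) times "precision".

    Recall is the fraction of the assessor's facets that were matched.
    "Precision" is length-based: each matched facet earns a character
    allowance, and only responses longer than the total allowance are
    penalized.  The 100-character allowance is an assumption, not
    stated on the slides.
    """
    recall = num_matched / num_facets
    allowance = allowance_per_match * num_matched
    if response_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (response_length - allowance) / response_length
    if precision + recall == 0:
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)


def final_score(factoid_score, list_score, def_score):
    """Weighted average of the three main-task components."""
    return 0.5 * factoid_score + 0.25 * list_score + 0.25 * def_score
```

Because beta = 5, recall dominates: a system gains far more from covering additional facets than it loses from longer answers, up to the length allowance.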
HARD Track

• New track in 2003
• Goal: improve ad hoc retrieval by customizing the search to the user, using:
  1) Metadata from topic statements
     – the purpose of the search
     – the genre or granularity of the desired response
     – the user's familiarity with the subject matter
     – biographical data about the user (age, sex, etc.)
  2) Clarifying forms
     – the assessor (surrogate user) spends at most 3 minutes/topic responding to a topic-specific form
     – example uses: sense resolution, relevance judgments

Robust Retrieval Track

• Motivations:
  – focus on poorly performing topics, since average effectiveness usually masks huge variance
  – bring the traditional ad hoc task back to TREC
• Task
  – 100 topics
    • 50 old topics from TRECs 6-8
    • 50 new topics created by 2003 assessors
  – TREC 6-8 document collection: disks 4&5 (no CR)
  – standard trec_eval evaluation plus new measures

2003 Robust Retrieval Track

[Figure: recall-precision curves (precision and recall from 0 to 1) for all 100 topics, the 50 old topics, and the 50 new topics.]

Retrieval Methods

• CUNY and Waterloo expanded using the web (and possibly other collections)
  – effective, even for poor performers
• QE based on the target collection generally improved mean scores, but did not help poor performers
• Approaches for poor performers:
  – predict when to expand
  – fuse results from multiple runs (a fusion sketch follows the next slide)
  – reorder the top ranked based on clustering of the retrieved set

The Problem

[Figure: per-topic average precision (0 to 1) for the runs ok8alx and CL99XT against the best per-topic result, showing the huge variance across topics.]

RIA Workshop

• In the summer of 2003, NIST organized a 6-week workshop called Reliable Information Access (RIA)
• RIA was part of the Northeast Regional Research Center summer workshop series, sponsored by the Advanced Research and Development Activity of the US Department of Defense
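One of the approaches for poor performers listed under Retrieval Methods above is fusing results from multiple runs. The sketch below shows generic CombSUM-style fusion under an assumed min-max score normalization; it is an illustration of the general technique, not a description of any participant's actual system.

```python
from collections import defaultdict

def combsum(runs):
    """Fuse ranked runs by summing min-max-normalized scores (CombSUM).

    `runs` is a list of {doc_id: score} dicts, one per run.  Returns
    (doc_id, fused_score) pairs, best first.  A generic fusion sketch,
    not any TREC participant's actual method.
    """
    fused = defaultdict(float)
    for run in runs:
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero for constant scores
        for doc_id, score in run.items():
            fused[doc_id] += (score - lo) / span
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Example: three runs scoring the same small pool of documents.
ranked = combsum([{"d1": 3.1, "d2": 2.0}, {"d2": 9.0, "d3": 4.5}])
```

Fusion can help poor performers because a topic that one system fails on may be handled adequately by another, and summing normalized scores rewards documents that several systems agree on.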
Participants (28)

Donna Harman and Chris Buckley (coordinators)
• City University, London: Andy MacFarlane
• Clairvoyance: David Evans, David Hull, Jesse Montgomery
• Carnegie Mellon U.: Jamie Callan, Paul Ogilvie, Yi Zhang, Luo Si, Kevyn Collins-Thompson
• MITRE: Warren Greiff
• NIST: Ian Soboroff and Ellen Voorhees
• U. of Massachusetts at Amherst: Andres Corrada-Emmanuel
• U. of New York at Albany: Tomek Strzalkowski, Paul Kantor, Sharon Small, Ting Liu, Sean Ryan
• U. Waterloo: Charlie Clarke, Gordon Cormack, Tom Lynam, Egidio Terra
• Other students: Zhenmei Gu, Luo Ming, Robert Warren, Jeff Terrace

Workshop Goals

• To learn how to customize IR systems for optimal performance on any given query
• Initial strong focus on relevance feedback and pseudo-relevance (blind) feedback
• If time, expand to other tools
• Apply the results to question answering in multiple ways

Overall Approach

• Massive failure analysis, done manually
• Statistical analysis using many "identical" feedback runs from all systems
• Use the results of the above to group queries needing similar treatment

Failure Analysis

1) Chose 44 out of 150 topics that were "failures" for a single run by each system (see the selection sketch after this section):
   a) mean average precision <= average
   b) have the most variance across systems
2) Used results from 6 systems' standard runs
3) 6 people per topic (one per system) spent 45-60 minutes looking at those results
4) Short 6-person group discussion to come to a consensus about the topic
5) Individual + overall report (from templates)

Preliminary Conclusions from Failure Analysis

• Systems agreed on the causes of failure much more than had been expected
• Systems retrieve different documents, but don't retrieve different classes of documents
• The majority of failures could be fixed with better feedback and term weighting, plus query analysis that gives guidance as to the relative importance of the terms

Grouping of Queries by Failure

• All systems emphasize one aspect; miss another (21 topics)
  – 362: Identify incidents of human smuggling
• Need outside expansion of a "general" term (8 topics)
  – 438: What countries are experiencing an increase in tourism?
• Missing a difficult aspect (semantics in the query) (7 topics)
  – 401: What language and cultural differences impede the integration of foreign minorities in Germany?
• General IR technical failure (8 topics)
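A minimal Python sketch of the topic-selection criteria in step 1 of the failure analysis, assuming per-topic average precision values for one standard run per system. It illustrates the stated criteria only, not the workshop's actual scripts.

```python
import statistics

def pick_failure_topics(ap_by_topic, num_topics=44):
    """Select "failure" topics with the most cross-system disagreement.

    `ap_by_topic` maps topic_id -> list of average precision values,
    one per system's standard run.  A topic qualifies if its mean AP
    is at or below the mean over all topics; qualifying topics are
    then ranked by cross-system variance.  A sketch of the slide's
    criteria, not the workshop's real selection code.
    """
    means = {t: statistics.mean(aps) for t, aps in ap_by_topic.items()}
    overall = statistics.mean(means.values())
    failures = [t for t, m in means.items() if m <= overall]
    failures.sort(key=lambda t: statistics.variance(ap_by_topic[t]),
                  reverse=True)
    return failures[:num_topics]
```

Ranking by cross-system variance targets topics where the systems disagree most, which is exactly where a side-by-side failure analysis is most informative.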
(Blind) Relevance Feedback

Example topic: What are new methods of producing steel?

Top-ranked documents (* = relevant):
  * FBIS4-53871  title1 ...
    FT923-9006   title2 ...
  * FBIS4-27797
  * FT944-1455
    FBIS3-24678
    FT923-9281
  * FT923-10837
    FT922-11827
    FT941-11316

List of Experiments Run

• bf_base: base runs for all systems, both using blind feedback (bf) and no feedback
• bf_numdocs: vary the # of docs used for bf from 0-100
• bf_numdocs_relonly: same, but only use relevant docs
• bf_numterms: vary the # of terms added from 0-100
• bf_pass_numterms: same, but use passages as the source instead of documents
• bf_swap_doc: use documents from other systems
• bf_swap_doc_term: expand using docs and terms from other systems
• bf_swap_doc_cluster: use CLARIT clusters
• bf_swap_doc_fuse: use a fusion of other systems

[Figures: results for bf_numterms with passages, and for bf_numdocs with relevant docs only.]

Preliminary Lessons Learned

1) Failure analysis
   a) systems tend to fail for the same reason
   b) getting the right concepts into the system query is critical
2) Surprises that require more analysis
   a) bf_swap_doc: some systems are better at providing docs
   b) some systems are more robust during expansion
   c) bf_numdocs, relevant only: some relevant docs are bad feedback docs
   d) no topic had "golden" terms in its top 1-4 feedback terms

Additional Experiments

• topic_analysis: producing & comparing groups of topics using assorted measures
• qa_standard: effect of IR algorithms on QA using docs/passages
• topic_coverage: HITIQA experiment using all systems
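The bf_numdocs and bf_numterms experiments above vary the two standard knobs of blind feedback: how many top-ranked documents to assume relevant, and how many terms to add to the query. Below is a generic pseudo-relevance-feedback sketch with simple tf-idf term scoring; the function and its inputs are illustrative assumptions, not any RIA system's actual feedback code.

```python
import math
from collections import Counter

def blind_feedback_terms(query_terms, ranked_docs, doc_freq, corpus_size,
                         num_docs=10, num_terms=20):
    """Pick expansion terms from the top of an initial retrieval run.

    `ranked_docs` is a list of token lists in rank order; `doc_freq`
    maps term -> collection document frequency.  The top `num_docs`
    documents are assumed relevant and candidate terms are scored by
    tf-idf over that set.  A generic sketch of the bf_numdocs /
    bf_numterms knobs, not any RIA system's code.
    """
    tf = Counter()
    for doc in ranked_docs[:num_docs]:
        tf.update(doc)

    def tfidf(term):
        return tf[term] * math.log(corpus_size / (1 + doc_freq.get(term, 0)))

    candidates = [t for t in tf if t not in query_terms]
    candidates.sort(key=tfidf, reverse=True)
    return candidates[:num_terms]
```

In the steel example above, only some of the top-ranked documents are actually relevant; the lessons-learned slide notes that even some relevant documents turn out to be bad feedback sources, which is why varying num_docs and restricting to relevant-only were separate experiments.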
Impact

• 1620 final runs were made on the TREC 6-8 collection
• This information will be publicly distributed to open the way for important further analysis within the IR community
• Analysis within the workshop shows several promising measures for predicting blind relevance feedback failure
• Additionally, much has been learned (and will be published) about the interaction of search engines, topics, and data collections, leading to more research in this critical area

Workshop Lessons Learned

• Learning to "categorize" questions of a varied nature, like TREC topics, is much harder than anyone expected
• Doing massive and careful failure analysis across multiple systems is a big win
• Performing parallel experiments using multiple systems may be the only way of learning some general principles

Future

• TREC will continue (trec.nist.gov)
  – this year's tracks are likely to continue
    • QA: requests for required info + other info
  – one new track
    • investigate ad hoc evaluation methodologies for terabyte-scale collections
• SIGIR 2004 workshop on the RIA results
  – many more details on what was done
  – lots of time for discussion
  – breakout sessions on where to go next