Why Batch and User Evaluations Do Not Give the Same Results

A. Turpin, Curtin University of Technology, Perth, Australia
W. Hersh, Oregon Health Sciences University, Portland, Oregon

Presented at SIGIR 2001, New Orleans
[Chart: TREC ad-hoc MAP by year, 1995 through 2000; y-axis roughly 0.2 to 0.4]
Experimental method
1. Set baseline system to basic cosine vector weights
2. Identify "super" system using batch experiments (MAP comparison; see the sketch below)
3. Run 24 users on the 2 systems with the same topics
4. Send results off to NIST
5. Get relevance judgments
6. Analyse user results
7. Check batch results
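Step 2 relies on standard batch evaluation: each candidate system ranks documents for a set of TREC topics, and the system with the higher mean average precision (MAP) becomes the "improved" system for the user study. A minimal sketch of that comparison in Python, assuming ranked result lists and relevance judgments are already loaded (all names here are illustrative, not from the original study):

```python
def average_precision(ranking, relevant):
    """Average precision for one topic: the mean of the precision values
    at each rank where a relevant document appears."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """MAP over all topics; `runs` maps topic -> ranked list of doc ids,
    `qrels` maps topic -> set of relevant doc ids."""
    return sum(average_precision(runs[t], qrels[t]) for t in runs) / len(runs)

# Hypothetical usage: pick the system with the higher batch MAP.
# if mean_average_precision(okapi_runs, qrels) > mean_average_precision(cosine_runs, qrels):
#     improved_system = "okapi"
```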
Example instance recall query

Number: 414i
Title: Cuba, sugar, imports
Description: What countries import Cuban sugar?
Instances: In the time allotted, please find as many DIFFERENT countries of the sort described above as you can. Please save at least one document for EACH such DIFFERENT country. If one document discusses several such countries, then you need not save other documents that repeat those, since your goal is to identify as many DIFFERENT countries of the sort described above as possible.
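Topics like this are scored by instance recall: the fraction of the known distinct instances (here, countries importing Cuban sugar) that are covered by the documents the user saved. A minimal sketch, with a hypothetical judgment structure mapping documents to the instances they mention (names and data are illustrative):

```python
def instance_recall(saved_docs, doc_instances, all_instances):
    """Fraction of the known distinct instances covered by the saved documents.
    `doc_instances` maps a doc id to the set of instances judged to occur in it."""
    found = set()
    for doc_id in saved_docs:
        found |= doc_instances.get(doc_id, set())
    return len(found & all_instances) / len(all_instances)

# Hypothetical example for a topic with four known instances:
doc_instances = {"d1": {"Russia", "China"}, "d2": {"China"}, "d3": {"Japan"}}
all_instances = {"Russia", "China", "Japan", "Canada"}
print(instance_recall(["d1", "d2"], doc_instances, all_instances))  # 0.5
```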
[Chart: Experiment 1 - Instance Recall. Baseline vs Improved on pre-batch MAP, user instance recall, and post-batch MAP; bar values 0.213, 0.275, 0.324, 0.330, 0.385, 0.390]
8 Q&A queries
1) What are the names of three US national parks where one can find redwoods?
2) Identify a site with Roman ruins in present day France
3) Name four films in which Orson Welles appeared
4) Name three countries that imported Cuban sugar during the period of time covered by the document collection
8 Q&A queries (continued)
5) Which children's TV program was on the air longer, the original Mickey Mouse Club or the original Howdy Doody Show?
6) Which painting did Edvard Munch complete first, Vampire or Puberty?
7) Which was the last dynasty of China, Qing or Ming?
8) Is Denmark larger or smaller in population than Norway?
[Chart: Experiment 2 - Question Answering. Baseline vs Improved on pre-batch MAP, user QA success (66% vs 60%), and post-batch MAP; MAP values 0.228, 0.270, 0.327, 0.354]
Results Summary

                     Predicted   Actual
Instance recall         81%       15%  (p = 0.27)
Question answering      58%       -6%  (p = 0.41)

Why?
1. The systems were no different on the topics and collection used
2. There was a difference, but users ignored it
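The "Predicted" column is the relative MAP improvement measured in the pre-batch runs; "Actual" is the relative change in the user task measure. As a worked example, assuming the Experiment 1 pre-batch MAP values pair as baseline 0.213 and improved 0.385:

```latex
\text{Predicted improvement}
  = \frac{\mathrm{MAP}_{\text{improved}} - \mathrm{MAP}_{\text{baseline}}}{\mathrm{MAP}_{\text{baseline}}}
  = \frac{0.385 - 0.213}{0.213} \approx 0.81 = 81\%
```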
[Chart: Precision metrics on user queries and collection, Baseline vs Improved. MAP, P@10, and P@50 for the Instance Recall and QA experiments; relative improvements of 47% (p = 0.02), 57% (p = 0.03), 68% (p = 0.001), 33% (p = 0.14), 100% (p = 0.001), and 40% (p = 0.02) across the six metric/experiment combinations]
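These post-hoc metrics are computed over the queries the users actually issued, scored against the official relevance judgments. Precision at a cutoff is the simplest of them; a minimal sketch, reusing the hypothetical run/qrels structures from the MAP sketch above:

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k

# Hypothetical usage on one user query:
# precision_at_k(runs[topic], qrels[topic], 10)   # P@10
# precision_at_k(runs[topic], qrels[topic], 50)   # P@50
```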
[Chart: Number of instances on user queries and collection, Baseline vs Improved, at 10 and at 50 documents retrieved; differences of 30% (p = 0.28) and 105% (p = 0.04)]
So what happens to the difference?
• Users compensate for the lack of relevant docs within the time limit
• Users ignore high-ranked relevant documents
  – Maybe obscure document titles?
  – Don't read the list from the top?
• "Extra" relevant docs give no new information
[Chart: Number of queries per topic (0-5), Baseline vs Improved, for the IR and QA experiments; differences of 33% (p = 0.04) and 16% (p = 0.16)]
[Chart: Number of documents retrieved, relevant and irrelevant, Baseline vs Improved, for the IR and QA experiments (y-axis 0-150); differences of 35% (p = 0.01), 2% (p = 0.93), 35% (p = 0.02), and 0% (p = 0.97)]
[Chart: Percentage of top-10 relevant documents ignored (y-axis 0%-60%), Baseline vs Improved, for the IR and QA experiments; differences of 87% (p = 0.002) and 24% (p = 0.22)]
Conclusion
• In these two tasks there is no use providing users with a good weighting scheme because
  – They will ignore high-ranking relevant docs
  – They will happily issue a few extra queries
• They find answers just as well with old technology
• User interface effects?
• Task effect?
Basic cosine:
$$\mathrm{sim}(q,d) = \frac{\sum_{t \in T_{q,d}} TF(t,d)\,IDF(t)}{\sqrt{\sum_{t \in T_d} TF(t,d)^2\,IDF(t)^2}}$$

Okapi:
$$\sum_{t \in T_{q,d}} \frac{f_{d,t}}{W_d + f_{d,t}}$$

Pivoted Okapi:
$$\mathrm{sim}(q,d) = \sum_{t \in T_{q,d}} f_{q,t}\,\ln\!\left(\frac{N - f_t}{f_t}\right)\frac{f_{d,t}}{W_d + f_{d,t}}$$
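A minimal sketch of how one matching term contributes to each score, under the simplified forms above: TF·IDF for the baseline cosine (before length normalisation) and a saturating Okapi-style weight for the improved system. The parameter names, and the omission of BM25 constants such as k1 and b, are assumptions made for illustration:

```python
import math

def cosine_term_weight(tf, idf):
    """Baseline: one term's contribution TF(t,d) * IDF(t); the full cosine
    score divides the sum of these by the document vector's length."""
    return tf * idf

def okapi_term_weight(f_dt, W_d, f_qt, N, f_t):
    """Okapi-style contribution, following the 'Pivoted Okapi' form above:
    query weight * log IDF * saturating within-document term frequency."""
    idf = math.log((N - f_t) / f_t)
    return f_qt * idf * f_dt / (W_d + f_dt)
```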