Crowdsourcing for Information Retrieval Experimentation and Evaluation
Omar Alonso, Microsoft
CLEF 2011, 20 September 2011
Disclaimer
The views, opinions, positions, or strategies expressed in this talk are mine and do not necessarily reflect the official policy or position of Microsoft.
Introduction
• Crowdsourcing is hot
• Lots of interest in the research community
  – Articles showing good results
  – Workshops and tutorials (ECIR’10, SIGIR’10, NAACL’10, WSDM’11, WWW’11, SIGIR’11, etc.)
  – HCOMP
  – CrowdConf 2011
• Large companies leveraging crowdsourcing
• Start-ups
• Venture capital investment
Crowdsourcing
• Crowdsourcing is the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call.
• The application of Open Source principles to fields outside of software.
• Best-known success story: Wikipedia
Personal thoughts …

HUMAN COMPUTATION

Human computation
• Not a new idea
• There were (human) “computers” before electronic computers
• You are a human computer
Some definitions
• Human computation is computation performed by a human
• A human computation system is a system that organizes human effort to carry out computation
• Crowdsourcing is a tool that a human computation system can use to distribute tasks
Examples
• ESP game
• CAPTCHA: 200M solved every day
• reCAPTCHA: 750M to date

Crowdsourcing today
• Outsource micro-tasks
• Power law
• Attention
• Incentives
• Diversity
MTurk
• Amazon Mechanical Turk (AMT, MTurk, www.mturk.com)
• Crowdsourcing platform
• On-demand workforce
• “Artificial artificial intelligence”: get humans to do the hard part
• Named after the faux chess-playing automaton of the 18th century
MTurk – How it works
• Requesters create “Human Intelligence Tasks” (HITs) via a web services API or the dashboard
• Workers (sometimes called “Turkers”) log in, choose HITs, and perform them
• Requesters assess results and pay per HIT satisfactorily completed
• Currently >200,000 workers from 100 countries; millions of HITs completed
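As a concrete illustration of the requester side, here is a minimal sketch of creating a HIT programmatically. It uses the boto3 Python client (a present-day binding rather than the web services API available in 2011); the title, reward, and question markup are illustrative placeholders, not values from this talk.

```python
import boto3

# Sandbox endpoint: test HITs cost nothing and never reach real workers.
# Drop endpoint_url to post to the production marketplace.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# HTMLQuestion wrapper; the actual form markup is omitted here.
question_xml = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><body>
      <p>Query: <b>example query</b></p>
      <p>Is the document shown below relevant to the query?</p>
      <!-- radio buttons, optional feedback box, and submit form go here -->
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>"""

hit = mturk.create_hit(
    Title="Judge whether a document is relevant to a query",
    Description="Read a short document and answer a yes/no relevance question.",
    Keywords="relevance, search, judgment",
    Reward="0.04",                     # US dollars, passed as a string
    MaxAssignments=5,                  # independent workers per document
    AssignmentDurationInSeconds=300,
    LifetimeInSeconds=86400,
    Question=question_xml,
)
print("Created HIT:", hit["HIT"]["HITId"])
```

Completed work can later be fetched per HIT (e.g., with list_assignments_for_hit) and each assignment approved or rejected, which is the "assess results, pay per HIT" step above.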
Why is this interesting?
• Easy to prototype and test new experiments
• Cheap and fast
• No need to set up infrastructure
• Introduces experimentation early in the cycle
• In the context of IR, implement and experiment as you go
• Very helpful for new ideas

Caveats and clarifications
• Trust and reliability
• The wisdom of the crowd, revisited
• Adjust expectations
• Crowdsourcing is another data point for your analysis
• Complementary to other experiments

Why now?
• The Web
• Use humans as processors in a distributed system
• Address problems that computers aren’t good at
• Scale
• Reach
INFORMATION RETRIEVAL AND CROWDSOURCING

Evaluation
• Relevance is hard to evaluate
  – Highly subjective
  – Expensive to measure
• Click-through data
• Professional editorial work
• Verticals
You have a new idea
• Novel IR technique
• Don’t have access to click data
• Can’t hire editors
• How to test new ideas?

Crowdsourcing and relevance evaluation
• Subject pool access: no need to come into the lab
• Diversity
• Low cost
• Agile

Examples
• NLP
• Machine Translation
• Relevance assessment and evaluation
• Spelling correction
• NER
• Image tagging

Pedal to the metal
• You read the papers
• You tell your boss (or advisor) that crowdsourcing is the way to go
• You now need to produce hundreds of thousands of labels per month
• Easy, right?
Ask the right questions
• Instructions are key
• Workers are not IR experts, so don’t assume they share your terminology
• Show examples
• Hire a technical writer
• Prepare to iterate
UX design
• Time to apply all those usability concepts
• Need to grab attention
• Generic tips
  – Experiment should be self-contained.
  – Keep it short and simple.
  – Be very clear with the task.
  – Engage with the worker. Avoid boring stuff.
  – Always ask for feedback (open-ended question) in an input box.
• Localization
TREC assessment example
• A form with a closed question (binary relevance) and an open-ended question (user feedback)
• Clear title, useful keywords
• Workers need to find your task
Payments
• How much is a HIT worth?
• Delicate balance
  – Too little: no interest
  – Too much: attracts spammers
• Heuristics (a cost sketch follows below)
  – Start with something and wait to see if there is interest or feedback (“I’ll do this for X amount”)
  – Payment based on user effort. Example: $0.04 (2 cents to answer a yes/no question, 2 cents for optional feedback)
• Bonus
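To make the payment heuristics concrete, a back-of-the-envelope cost estimate can be scripted before launching a batch. The fee rate, redundancy, and collection size below are assumptions for illustration (MTurk's fee schedule has changed over the years), not figures from the talk.

```python
# Rough cost estimate for a relevance-judgment batch.
reward_per_hit = 0.04      # $0.02 for the yes/no answer + $0.02 for optional feedback
judges_per_doc = 5         # redundant judgments so a majority vote is possible
num_documents = 2000       # hypothetical collection size
platform_fee = 0.20        # assumed requester fee; check the current fee schedule

worker_payments = reward_per_hit * judges_per_doc * num_documents
total_cost = worker_payments * (1 + platform_fee)
print(f"Worker payments: ${worker_payments:,.2f}  Total with fees: ${total_cost:,.2f}")
```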
Managing crowds

Quality control
• Extremely important part of the experiment
• Approach it as “overall” quality, not just worker quality
• It is a bi-directional channel
  – You may think the worker is doing a bad job
  – The same worker may think you are a lousy requester
• Test with a gold standard
Quality control – II
• Approval rate
• Qualification test
  – Problems: slows down the experiment; relevance is difficult to “test”
  – Solution: create questions on the topics so the user gets familiar with them before starting the assessment
• Still not a guarantee of a good outcome
• Interleave gold answers throughout the experiment
• Identify workers who always disagree with the majority (a sketch of both checks follows below)
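The last two bullets (gold answers and chronic disagreement with the majority) amount to a few lines of bookkeeping over the collected judgments. The sketch below is illustrative only: the function name, thresholds, and data layout are assumptions, not part of any platform API.

```python
from collections import Counter, defaultdict

def screen_workers(judgments, gold, min_gold_accuracy=0.7, max_minority_rate=0.6):
    """judgments: iterable of (worker_id, item_id, label); gold: {item_id: label}.
    Returns the set of workers who fail the gold check or who disagree with the
    per-item majority too often. Thresholds are arbitrary starting points."""
    by_item = defaultdict(list)
    for worker, item, label in judgments:
        by_item[item].append(label)

    # Majority label per item, used as a weak reference when no gold exists.
    majority = {item: Counter(labels).most_common(1)[0][0]
                for item, labels in by_item.items()}

    gold_stats = defaultdict(lambda: [0, 0])      # worker -> [correct, seen]
    minority_stats = defaultdict(lambda: [0, 0])  # worker -> [disagreements, total]
    for worker, item, label in judgments:
        if item in gold:
            gold_stats[worker][0] += int(label == gold[item])
            gold_stats[worker][1] += 1
        minority_stats[worker][0] += int(label != majority[item])
        minority_stats[worker][1] += 1

    flagged = set()
    for worker, (disagreed, total) in minority_stats.items():
        correct, seen = gold_stats[worker]
        if seen and correct / seen < min_gold_accuracy:
            flagged.add(worker)
        if total >= 5 and disagreed / total > max_minority_rate:
            flagged.add(worker)
    return flagged
```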
Methods for measuring agreement
• Inter-agreement level
  – Agreement between judges
  – Agreement between judges and the gold set
• Some statistics (see the kappa sketch below)
  – Cohen’s kappa (2 raters)
  – Fleiss’ kappa (any number of raters)
  – Krippendorff’s alpha
• Gray areas
  – 2 workers say “relevant” and 3 say “not relevant”
  – 2-tier system
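For the two-rater case, Cohen's kappa is easy to compute directly; a small self-contained sketch follows (the example labels are made up). Libraries such as scikit-learn and statsmodels also ship implementations of these agreement statistics.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters judging the same items with nominal labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    if expected == 1.0:   # degenerate case: both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

# Two workers judging five documents on a binary scale (toy data):
worker_1 = ["relevant", "relevant", "not", "relevant", "not"]
worker_2 = ["relevant", "not", "not", "relevant", "not"]
print(round(cohens_kappa(worker_1, worker_2), 3))   # ~0.615
```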
Time to revisit things …
• Crowdsourcing offers flexibility to design and experiment
• Need to be creative
• Test different things
• Let’s dissect items that look trivial

The standard template
• Assuming a lab setting
  – Show a document
  – Question: “Is this document relevant to the query?”
• Can we do better?
• GWAP
• Barry & Schamber’s relevance criteria
  – Depth/scope/specificity
  – Accuracy/validity
  – Clarity
  – Recency
Content quality
• People like to work on things that they like
• TREC ad hoc vs. INEX
  – The TREC experiments took twice as long to complete
  – INEX (Wikipedia), TREC (LA Times, FBIS)
• Topics
  – INEX: Olympic games, movies, salad recipes, etc.
  – TREC: cosmic events, the Schengen agreement, etc.
• Content and judgments should reflect modern times
  – Airport security documents are pre-9/11
  – Antarctic exploration (global warming)
• Document length
• Randomize content
• Avoid worker fatigue

Scales and labels
• Binary
• Ternary
• Likert
  – Strongly disagree, disagree, neither agree nor disagree, agree, strongly agree
• DCG paper
  – Irrelevant, marginally, fairly, highly
• Other examples
  – Perfect, excellent, good, fair, bad
  – Highly relevant, relevant, related, not relevant
  – 0..10 (0 = irrelevant, 10 = relevant)
  – Not at all, to some extent, very much so, don’t know (David Brent)
• Usability factors
  – Provide clear, concise labels that use plain language
  – Terminology has to be familiar to assessors
The human side
• As a worker
  – I hate it when instructions are not clear
  – I’m not a spammer – I just don’t get what you want
  – Boring tasks
  – Good pay is ideal, but not the only condition for engagement
• As a requester
  – Attrition rate
  – Balancing act: a task that produces the right results and is appealing to workers
  – I want your honest answer for the task
  – I want qualified workers, and I want the system to do some of that for me
• Managing crowds and tasks is a daily activity, and it is more difficult than managing computers

Difficulty of the task
• Some topics may be more difficult than others
• Ask workers
• TREC example

Relevance justification
• Why settle for just a label?
• Let workers justify their answers
• INEX: 22% of assignments included comments
• TREC: 10% of assignments included comments
• Comments must be optional
Development & testing

Development framework
• Incremental approach
• Measure, evaluate, and adjust as you go
• Suitable for repeatable tasks
Experiment in production
• Ad hoc experimentation vs. ongoing metrics
• Lots of tasks are on the system at any moment
• Need to grab attention
• Importance of experiment metadata
• Scalability
  – Scale on data first, then on workers
  – Size of a batch
  – Cost of a deletion
• When to schedule
  – Split a large task into batches and keep a single batch in the system (see the batching sketch below)
  – Always review feedback from batch n before uploading batch n+1
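A minimal sketch of the one-batch-at-a-time scheduling described above; upload, collect, and review are hypothetical helpers standing in for your own HIT-management code.

```python
def batches(items, batch_size):
    """Yield consecutive fixed-size batches of items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Illustrative workflow: only one batch lives on the platform at any time.
# for n, batch in enumerate(batches(documents, 500), start=1):
#     upload(batch)              # post this batch of HITs
#     results = collect(batch)   # wait until all assignments come back
#     review(results)            # read the feedback before releasing batch n+1
```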
Advanced applications
• Training sets for machine learning
• Active learning
• Adaptive quality control
• Automatic generation of black/white lists
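As one hedged illustration of the first bullet, aggregated crowd judgments can seed a training set for a standard classifier. The documents, labels, and model choice below are made up for the sketch and are not from the talk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical majority-voted labels collected from relevance HITs.
docs = [
    "olympic games opening ceremony schedule",
    "simple salad recipe with tomatoes",
    "history of the schengen agreement",
    "celebrity gossip roundup",
]
labels = [1, 1, 1, 0]   # 1 = judged relevant to the topic set, 0 = not relevant

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(docs)
model = LogisticRegression().fit(features, labels)
print(model.predict(vectorizer.transform(["salad recipe with olives"])))
```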