Crowdsourcing for Information Retrieval Experimentation and Evaluation
Omar Alonso, Microsoft
CLEF 2011, 20 September 2011
Disclaimer
The views, opinions, positions, or strategies expressed in this talk are mine and do not necessarily reflect the official policy or position of Microsoft.
Introduction
• Crowdsourcing is hot
• Lots of interest in the research community
  – Articles showing good results
  – Workshops and tutorials (ECIR’10, SIGIR’10, NAACL’10, WSDM’11, WWW’11, SIGIR’11, etc.)
  – HCOMP
  – CrowdConf 2011
• Large companies leveraging crowdsourcing
• Start-ups
• Venture capital investment
Crowdsourcing
• Crowdsourcing is the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call.
• The application of Open Source principles to fields outside of software.
• Best-known success story: Wikipedia
Personal thoughts …

HUMAN COMPUTATION

Human computation
• Not a new idea
• There were (human) “computers” before electronic computers
• You are a human computer
Some definitions
• Human computation is computation performed by a human
• A human computation system is a system that organizes human effort to carry out computation
• Crowdsourcing is a tool that a human computation system can use to distribute tasks
Examples
• ESP game
• CAPTCHA: 200M solved every day
• reCAPTCHA: 750M to date

Crowdsourcing today
• Outsource micro-tasks
• Power law
• Attention
• Incentives
• Diversity
MTurk
• Amazon Mechanical Turk (AMT, MTurk, www.mturk.com)
• Crowdsourcing platform
• On-demand workforce
• “Artificial artificial intelligence”: get humans to do the hard part
• Named after the faux chess-playing automaton of the 18th century
MTurk – How it works
• Requesters create “Human Intelligence Tasks” (HITs) via a web services API or the dashboard
• Workers (sometimes called “Turkers”) log in, choose HITs, and perform them
• Requesters assess results and pay per HIT satisfactorily completed
• Currently >200,000 workers from 100 countries; millions of HITs completed
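As a concrete illustration of the requester side, here is a minimal sketch of creating a HIT programmatically. It uses the boto3 Python client (a present-day binding rather than the web services API available in 2011); the title, reward, and question markup are illustrative placeholders, not values from this talk.

```python
import boto3

# Sandbox endpoint: test HITs cost nothing and never reach real workers.
# Drop endpoint_url to post to the production marketplace.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# HTMLQuestion wrapper; the actual form markup is omitted here.
question_xml = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><body>
      <p>Query: <b>example query</b></p>
      <p>Is the document shown below relevant to the query?</p>
      <!-- radio buttons, optional feedback box, and submit form go here -->
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>"""

hit = mturk.create_hit(
    Title="Judge whether a document is relevant to a query",
    Description="Read a short document and answer a yes/no relevance question.",
    Keywords="relevance, search, judgment",
    Reward="0.04",                     # US dollars, passed as a string
    MaxAssignments=5,                  # independent workers per document
    AssignmentDurationInSeconds=300,
    LifetimeInSeconds=86400,
    Question=question_xml,
)
print("Created HIT:", hit["HIT"]["HITId"])
```

Completed work can later be fetched per HIT (e.g., with list_assignments_for_hit) and each assignment approved or rejected, which is the "assess results, pay per HIT" step above.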
Why is this interesting?
• Easy to prototype and test new experiments
• Cheap and fast
• No need to set up infrastructure
• Introduces experimentation early in the cycle
• In the context of IR, implement and experiment as you go
• Very helpful for new ideas

Caveats and clarifications
• Trust and reliability
• The wisdom of the crowd, revisited
• Adjust expectations
• Crowdsourcing is another data point for your analysis
• Complementary to other experiments

Why now?
• The Web
• Use humans as processors in a distributed system
• Address problems that computers aren’t good at
• Scale
• Reach
INFORMATION RETRIEVAL AND CROWDSOURCING

Evaluation
• Relevance is hard to evaluate
  – Highly subjective
  – Expensive to measure
• Click-through data
• Professional editorial work
• Verticals
You have a new idea
• Novel IR technique
• Don’t have access to click data
• Can’t hire editors
• How to test new ideas?

Crowdsourcing and relevance evaluation
• Subject pool access: no need to come into the lab
• Diversity
• Low cost
• Agile

Examples
• NLP
• Machine Translation
• Relevance assessment and evaluation
• Spelling correction
• NER
• Image tagging

Pedal to the metal
• You read the papers
• You tell your boss (or advisor) that crowdsourcing is the way to go
• You now need to produce hundreds of thousands of labels per month
• Easy, right?
Ask the right questions
• Instructions are key
• Workers are not IR experts, so don’t assume they share your terminology
• Show examples
• Hire a technical writer
• Prepare to iterate
UX design
• Time to apply all those usability concepts
• Need to grab attention
• Generic tips
  – Experiment should be self-contained.
  – Keep it short and simple.
  – Be very clear with the task.
  – Engage with the worker. Avoid boring stuff.
  – Always ask for feedback (open-ended question) in an input box.
• Localization
TREC assessment example
• A form with a closed question (binary relevance) and an open-ended question (user feedback)
• Clear title, useful keywords
• Workers need to find your task
Payments
• How much is a HIT worth?
• Delicate balance
  – Too little: no interest
  – Too much: attracts spammers
• Heuristics (a cost sketch follows below)
  – Start with something and wait to see if there is interest or feedback (“I’ll do this for X amount”)
  – Payment based on user effort. Example: $0.04 (2 cents to answer a yes/no question, 2 cents for optional feedback)
• Bonus
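To make the payment heuristics concrete, a back-of-the-envelope cost estimate can be scripted before launching a batch. The fee rate, redundancy, and collection size below are assumptions for illustration (MTurk's fee schedule has changed over the years), not figures from the talk.

```python
# Rough cost estimate for a relevance-judgment batch.
reward_per_hit = 0.04      # $0.02 for the yes/no answer + $0.02 for optional feedback
judges_per_doc = 5         # redundant judgments so a majority vote is possible
num_documents = 2000       # hypothetical collection size
platform_fee = 0.20        # assumed requester fee; check the current fee schedule

worker_payments = reward_per_hit * judges_per_doc * num_documents
total_cost = worker_payments * (1 + platform_fee)
print(f"Worker payments: ${worker_payments:,.2f}  Total with fees: ${total_cost:,.2f}")
```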
Managing crowds

Quality control
• Extremely important part of the experiment
• Approach it as “overall” quality, not just worker quality
• It is a bi-directional channel
  – You may think the worker is doing a bad job
  – The same worker may think you are a lousy requester
• Test with a gold standard
Quality control – II
• Approval rate
• Qualification test
  – Problems: slows down the experiment; relevance is difficult to “test”
  – Solution: create questions on the topics so the user gets familiar with them before starting the assessment
• Still not a guarantee of a good outcome
• Interleave gold answers throughout the experiment
• Identify workers who always disagree with the majority (a sketch of both checks follows below)
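The last two bullets (gold answers and chronic disagreement with the majority) amount to a few lines of bookkeeping over the collected judgments. The sketch below is illustrative only: the function name, thresholds, and data layout are assumptions, not part of any platform API.

```python
from collections import Counter, defaultdict

def screen_workers(judgments, gold, min_gold_accuracy=0.7, max_minority_rate=0.6):
    """judgments: iterable of (worker_id, item_id, label); gold: {item_id: label}.
    Returns the set of workers who fail the gold check or who disagree with the
    per-item majority too often. Thresholds are arbitrary starting points."""
    by_item = defaultdict(list)
    for worker, item, label in judgments:
        by_item[item].append(label)

    # Majority label per item, used as a weak reference when no gold exists.
    majority = {item: Counter(labels).most_common(1)[0][0]
                for item, labels in by_item.items()}

    gold_stats = defaultdict(lambda: [0, 0])      # worker -> [correct, seen]
    minority_stats = defaultdict(lambda: [0, 0])  # worker -> [disagreements, total]
    for worker, item, label in judgments:
        if item in gold:
            gold_stats[worker][0] += int(label == gold[item])
            gold_stats[worker][1] += 1
        minority_stats[worker][0] += int(label != majority[item])
        minority_stats[worker][1] += 1

    flagged = set()
    for worker, (disagreed, total) in minority_stats.items():
        correct, seen = gold_stats[worker]
        if seen and correct / seen < min_gold_accuracy:
            flagged.add(worker)
        if total >= 5 and disagreed / total > max_minority_rate:
            flagged.add(worker)
    return flagged
```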
Methods for measuring agreement
• Inter-agreement level
  – Agreement between judges
  – Agreement between judges and the gold set
• Some statistics (see the kappa sketch below)
  – Cohen’s kappa (2 raters)
  – Fleiss’ kappa (any number of raters)
  – Krippendorff’s alpha
• Gray areas
  – 2 workers say “relevant” and 3 say “not relevant”
  – 2-tier system
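For the two-rater case, Cohen's kappa is easy to compute directly; a small self-contained sketch follows (the example labels are made up). Libraries such as scikit-learn and statsmodels also ship implementations of these agreement statistics.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters judging the same items with nominal labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    if expected == 1.0:   # degenerate case: both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

# Two workers judging five documents on a binary scale (toy data):
worker_1 = ["relevant", "relevant", "not", "relevant", "not"]
worker_2 = ["relevant", "not", "not", "relevant", "not"]
print(round(cohens_kappa(worker_1, worker_2), 3))   # ~0.615
```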
Time to revisit things …
• Crowdsourcing offers flexibility to design and experiment
• Need to be creative
• Test different things
• Let’s dissect items that look trivial

The standard template
• Assuming a lab setting
  – Show a document
  – Question: “Is this document relevant to the query?”
• Can we do better?
• GWAP
• Barry & Schamber’s relevance criteria
  – Depth/scope/specificity
  – Accuracy/validity
  – Clarity
  – Recency
Content quality
• People like to work on things that they like
• TREC ad hoc vs. INEX
  – The TREC experiments took twice as long to complete
  – INEX (Wikipedia), TREC (LA Times, FBIS)
• Topics
  – INEX: Olympic games, movies, salad recipes, etc.
  – TREC: cosmic events, the Schengen agreement, etc.
• Content and judgments should reflect modern times
  – Airport security documents are pre-9/11
  – Antarctic exploration (global warming)
• Document length
• Randomize content
• Avoid worker fatigue

Scales and labels
• Binary
• Ternary
• Likert
  – Strongly disagree, disagree, neither agree nor disagree, agree, strongly agree
• DCG paper
  – Irrelevant, marginally, fairly, highly
• Other examples
  – Perfect, excellent, good, fair, bad
  – Highly relevant, relevant, related, not relevant
  – 0..10 (0 = irrelevant, 10 = relevant)
  – Not at all, to some extent, very much so, don’t know (David Brent)
• Usability factors
  – Provide clear, concise labels that use plain language
  – Terminology has to be familiar to assessors
The human side
• As a worker
  – I hate it when instructions are not clear
  – I’m not a spammer – I just don’t get what you want
  – Boring tasks
  – Good pay is ideal, but not the only condition for engagement
• As a requester
  – Attrition rate
  – Balancing act: a task that produces the right results and is appealing to workers
  – I want your honest answer for the task
  – I want qualified workers, and I want the system to do some of that for me
• Managing crowds and tasks is a daily activity, and it is more difficult than managing computers

Difficulty of the task
• Some topics may be more difficult than others
• Ask workers
• TREC example

Relevance justification
• Why settle for just a label?
• Let workers justify their answers
• INEX: 22% of assignments included comments
• TREC: 10% of assignments included comments
• Comments must be optional
Development & testing

Development framework
• Incremental approach
• Measure, evaluate, and adjust as you go
• Suitable for repeatable tasks
Experiment in production
• Ad hoc experimentation vs. ongoing metrics
• Lots of tasks are on the system at any moment
• Need to grab attention
• Importance of experiment metadata
• Scalability
  – Scale on data first, then on workers
  – Size of a batch
  – Cost of a deletion
• When to schedule
  – Split a large task into batches and keep a single batch in the system (see the batching sketch below)
  – Always review feedback from batch n before uploading batch n+1
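A minimal sketch of the one-batch-at-a-time scheduling described above; upload, collect, and review are hypothetical helpers standing in for your own HIT-management code.

```python
def batches(items, batch_size):
    """Yield consecutive fixed-size batches of items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Illustrative workflow: only one batch lives on the platform at any time.
# for n, batch in enumerate(batches(documents, 500), start=1):
#     upload(batch)              # post this batch of HITs
#     results = collect(batch)   # wait until all assignments come back
#     review(results)            # read the feedback before releasing batch n+1
```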
Advanced applications
• Training sets for machine learning
• Active learning
• Adaptive quality control
• Automatic generation of black/white lists
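As one hedged illustration of the first bullet, aggregated crowd judgments can seed a training set for a standard classifier. The documents, labels, and model choice below are made up for the sketch and are not from the talk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical majority-voted labels collected from relevance HITs.
docs = [
    "olympic games opening ceremony schedule",
    "simple salad recipe with tomatoes",
    "history of the schengen agreement",
    "celebrity gossip roundup",
]
labels = [1, 1, 1, 0]   # 1 = judged relevant to the topic set, 0 = not relevant

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(docs)
model = LogisticRegression().fit(features, labels)
print(model.predict(vectorizer.transform(["salad recipe with olives"])))
```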