  1. Crowdsourcing for Information Retrieval Experimentation and Evaluation Omar Alonso Microsoft 20 September 2011 CLEF 2011

  2. Disclaimer The views, opinions, positions, or strategies expressed in this talk are mine and do not necessarily reflect the official policy or position of Microsoft. CLEF 2011

  3. Introduction • Crowdsourcing is hot • Lots of interest in the research community – Articles showing good results – Workshops and tutorials (ECIR’10, SIGIR’10, NAACL’10, WSDM’11, WWW’11, SIGIR’11, etc.) – HCOMP – CrowdConf 2011 • Large companies leveraging crowdsourcing • Start-ups • Venture capital investment CLEF 2011

  4. Crowdsourcing • Crowdsourcing is the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call. • The application of Open Source principles to fields outside of software. • Most successful story: Wikipedia CLEF 2011

  5. Personal thoughts … CLEF 2011

  6. HUMAN COMPUTATION CLEF 2011

  7. Human computation • Not a new idea • Computers before computers • You are a human computer CLEF 2011

  8. Some definitions • Human computation is a computation that is performed by a human • Human computation system is a system that organizes human efforts to carry out computation • Crowdsourcing is a tool that a human computation system can use to distribute tasks. CLEF 2011

  9. Examples • ESP game • CAPTCHA: 200M every day • reCAPTCHA: 750M to date CLEF 2011

  10. Crowdsourcing today • Outsource micro-tasks • Power law • Attention • Incentives • Diversity CLEF 2011

  11. MTurk • Amazon Mechanical Turk (AMT, MTurk, www.mturk.com) • Crowdsourcing platform • On-demand workforce • “Artificial artificial intelligence”: get humans to do the hard part • Named after the faux chess-playing automaton of the 18th century CLEF 2011

  12. MTurk – How it works • Requesters create “Human Intelligence Tasks” (HITs) via web services API or dashboard. • Workers (sometimes called “Turkers”) log in, choose HITs, perform them. • Requesters assess results, pay per HIT satisfactorily completed. • Currently >200,000 workers from 100 countries; millions of HITs completed CLEF 2011
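For readers who want to see what "create HITs via web services API" looks like in practice, here is a minimal sketch using today's boto3 MTurk client (not the 2011-era API the talk refers to). The question file, reward amount, and assignment counts are illustrative assumptions, not values from the slides.

```python
# Sketch: create one HIT against the MTurk sandbox with boto3.
# Assumes AWS credentials are configured; question.xml is a placeholder file
# containing QuestionForm or HTMLQuestion XML.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint; switch to the production endpoint when going live.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

with open("question.xml") as f:
    question_xml = f.read()

hit = mturk.create_hit(
    Title="Judge the relevance of a document to a query",
    Description="Read a short document and say whether it answers the query.",
    Keywords="relevance, search, judgment",
    Reward="0.04",                     # dollars, passed as a string
    MaxAssignments=5,                  # number of workers per document
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=3 * 24 * 3600,
    Question=question_xml,
)
print("HIT created:", hit["HIT"]["HITId"])
```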

  13. Why is this interesting? • Easy to prototype and test new experiments • Cheap and fast • No need to set up infrastructure • Introduce experimentation early in the cycle • In the context of IR, implement and experiment as you go • For new ideas, this is very helpful CLEF 2011

  14. Caveats and clarifications • Trust and reliability • Wisdom of the crowd, revisited • Adjust expectations • Crowdsourcing is another data point for your analysis • Complementary to other experiments CLEF 2011

  15. Why now? • The Web • Use humans as processors in a distributed system • Address problems that computers aren’t good at • Scale • Reach CLEF 2011

  16. INFORMATION RETRIEVAL AND CROWDSOURCING CLEF 2011

  17. Evaluation • Relevance is hard to evaluate – Highly subjective – Expensive to measure • Click-through data • Professional editorial work • Verticals CLEF 2011

  18. You have a new idea • Novel IR technique • Don’t have access to click data • Can’t hire editors • How to test new ideas? CLEF 2011

  19. Crowdsourcing and relevance evaluation • Subject pool access: no need to come into the lab • Diversity • Low cost • Agile CLEF 2011

  20. Examples • NLP • Machine Translation • Relevance assessment and evaluation • Spelling correction • NER • Image tagging CLEF 2011

  21. Pedal to the metal • You read the papers • You tell your boss (or advisor) that crowdsourcing is the way to go • You now need to produce hundreds of thousands of labels per month • Easy, right? CLEF 2011

  22. Ask the right questions • Instructions are key • Workers are not IR experts so don’t assume the same understanding in terms of terminology • Show examples • Hire a technical writer • Prepare to iterate CLEF 2011

  23. UX design • Time to apply all those usability concepts • Need to grab attention • Generic tips – Experiment should be self-contained. – Keep it short and simple. – Be very clear with the task. – Engage with the worker. Avoid boring stuff. – Always ask for feedback (open-ended question) in an input box. • Localization CLEF 2011

  24. TREC assessment example • Form with a closed question (binary relevance) and an open-ended question (user feedback) • Clear title, useful keywords • Workers need to find your task CLEF 2011
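As a concrete illustration of such a form, a small sketch of a per-document judging template: one closed binary-relevance question plus an optional free-text feedback box. The wording and field names are assumptions, not the actual TREC form from the talk, and the rendered HTML would still need to be wrapped in MTurk's HTMLQuestion envelope.

```python
# Sketch of the judging form shown to a worker for one query/document pair.
FORM_TEMPLATE = """
<h3>Query: {query}</h3>
<div>{document}</div>
<p>Is this document relevant to the query?</p>
<label><input type="radio" name="relevance" value="yes"> Yes</label>
<label><input type="radio" name="relevance" value="no"> No</label>
<p>Optional: why did you choose that answer?</p>
<textarea name="feedback" rows="3" cols="60"></textarea>
"""

def render_hit(query: str, document: str) -> str:
    """Fill the template for one query/document pair."""
    return FORM_TEMPLATE.format(query=query, document=document)

print(render_hit("schengen agreement", "Document text goes here..."))
```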

  25. Payments • How much is a HIT? • Delicate balance – Too little, no interest – Too much, attract spammers • Heuristics – Start with something and wait to see if there is interest or feedback (“I’ll do this for X amount”) – Payment based on user effort. Example: $0.04 (2 cents to answer a yes/no question, 2 cents if you provide feedback that is not mandatory) • Bonus CLEF 2011

  26. Managing crowds CLEF 2011

  27. Quality control • Extremely important part of the experiment • Approach it as “overall” quality – not just for workers • Bi-directional channel – You may think the worker is doing a bad job. – The same worker may think you are a lousy requester. • Test with a gold standard CLEF 2011

  28. Quality control - II • Approval rate • Qualification test – Problems: slows down the experiment, difficult to “test” relevance – Solution: create questions on topics so user gets familiar before starting the assessment • Still not a guarantee of good outcome • Interject gold answers in the experiment • Identify workers that always disagree with the majority CLEF 2011
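Slide 28's two checks (interjected gold answers, workers who always disagree with the majority) can be combined into a simple screening pass. A minimal sketch follows; the thresholds and data layout are illustrative assumptions, not values from the talk.

```python
# Sketch: flag suspicious workers by (a) accuracy on interjected gold answers
# and (b) how often they disagree with the per-item majority vote.
from collections import Counter, defaultdict

def flag_workers(judgments, gold, min_gold_acc=0.7, max_disagree=0.5):
    """judgments: {(worker_id, item_id): label}; gold: {item_id: correct label}."""
    by_item = defaultdict(list)
    for (worker, item), label in judgments.items():
        by_item[item].append(label)
    majority = {item: Counter(labels).most_common(1)[0][0]
                for item, labels in by_item.items()}

    stats = defaultdict(lambda: {"gold_ok": 0, "gold_n": 0, "dis": 0, "n": 0})
    for (worker, item), label in judgments.items():
        s = stats[worker]
        s["n"] += 1
        s["dis"] += label != majority[item]
        if item in gold:
            s["gold_n"] += 1
            s["gold_ok"] += label == gold[item]

    flagged = []
    for worker, s in stats.items():
        gold_acc = s["gold_ok"] / s["gold_n"] if s["gold_n"] else None
        disagree = s["dis"] / s["n"]
        if (gold_acc is not None and gold_acc < min_gold_acc) or disagree > max_disagree:
            flagged.append((worker, gold_acc, disagree))
    return flagged
```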

  29. Methods for measuring agreement • Inter-agreement level – Agreement between judges – Agreement between judges and the gold set • Some statistics – Cohen’s kappa (2 raters) – Fleiss’ kappa (any number of raters) – Krippendorff’s alpha • Gray areas – 2 workers say “relevant” and 3 say “not relevant” – 2-tier system CLEF 2011
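The agreement statistics named on slide 29 are available off the shelf; a short sketch with scikit-learn (Cohen's kappa) and statsmodels (Fleiss' kappa) on toy binary judgments, which are invented here purely for illustration:

```python
# Sketch: inter-assessor agreement with Cohen's kappa (2 raters) and
# Fleiss' kappa (any number of raters). Requires scikit-learn and statsmodels.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy data: rows = documents, columns = workers, 1 = relevant, 0 = not relevant.
labels = np.array([
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 0, 0, 0, 1],
])

# Pairwise agreement between worker 0 and worker 1.
print("Cohen's kappa:", cohen_kappa_score(labels[:, 0], labels[:, 1]))

# Agreement across all five workers.
table, _ = aggregate_raters(labels)   # per-document counts for each category
print("Fleiss' kappa:", fleiss_kappa(table))
```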

  30. Time to re-visit things … • Crowdsourcing offers flexibility to design and experiment • Need to be creative • Test different things • Let’s dissect items that look trivial CLEF 2011

  31. The standard template • Assuming a lab setting – Show a document – Question: “Is this document relevant to the query?” • Can we do better? • GWAP • Barry & Schamber – Depth/scope/specificity – Accuracy/validity – Clarity – Recency CLEF 2011

  32. Content quality • People like to work on things that they like • TREC ad-hoc vs. INEX – TREC experiments took twice as long to complete – INEX (Wikipedia), TREC (LA Times, FBIS) • Topics – INEX: Olympic games, movies, salad recipes, etc. – TREC: cosmic events, Schengen agreement, etc. • Content and judgments according to modern times – Airport security docs are pre-9/11 – Antarctic exploration (global warming) • Document length • Randomize content • Avoid worker fatigue CLEF 2011

  33. Scales and labels • Binary • Ternary • Likert – Strongly disagree, disagree, neither agree nor disagree, agree, strongly agree • DCG paper – Irrelevant, marginally, fairly, highly • Other examples – Perfect, excellent, good, fair, bad – Highly relevant, relevant, related, not relevant – 0..10 (0 == irrelevant, 10 == relevant) – Not at all, to some extent, very much so, don’t know (David Brent) • Usability factors – Provide clear, concise labels that use plain language – Terminology has to be familiar to assessors CLEF 2011
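Since slide 33 cites the DCG paper's graded labels (irrelevant, marginally, fairly, highly), a quick sketch of how such labels feed a graded metric may help; the label-to-gain mapping and the log2(i+1) discount below are the common convention, assumed here rather than taken from the cited paper.

```python
# Sketch: map graded crowd labels to gain values and compute DCG for a
# ranked list of judged documents.
import math

GAIN = {"irrelevant": 0, "marginally": 1, "fairly": 2, "highly": 3}

def dcg(labels, k=None):
    """labels: graded judgments in ranked order (top result first)."""
    labels = labels[:k] if k else labels
    # Position i (0-based) gets discount log2(i + 2), i.e. rank 1 -> log2(2) = 1.
    return sum(GAIN[lab] / math.log2(i + 2) for i, lab in enumerate(labels))

print(dcg(["highly", "fairly", "irrelevant", "marginally"], k=3))
```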

  34. The human side • As a worker – I hate when instructions are not clear – I’m not a spammer – I just don’t get what you want – Boring task – A good pay is ideal but not the only condition for engagement • As a requester – Attrition rate – Balancing act: a task that would produce the right results and is appealing to workers – I want your honest answer for the task – I want qualified workers and I want the system to do some of that for me • Managing crowds and tasks is a daily activity and more difficult than managing computers CLEF 2011

  35. Difficulty of the task • Some topics may be more difficult • Ask workers • TREC example CLEF 2011

  36. Relevance justification • Why settle for a label? • Let workers justify answers • INEX: 22% of assignments with comments • TREC: 10% of assignments with comments • Must be optional CLEF 2011

  37. Development & testing CLEF 2011

  38. Development framework • Incremental approach • Measure, evaluate, and adjust as you go • Suitable for repeatable tasks CLEF 2011

  39. Experiment in production • Ad-hoc experimentation vs. ongoing metrics • Lots of tasks on the system at any moment • Need to grab attention • Importance of experiment metadata • Scalability – Scale on data first, then on workers – Size of batch – Cost of a deletion • When to schedule – Split a large task into batches and have a single batch in the system – Always review feedback from batch n before uploading batch n+1 CLEF 2011
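A minimal sketch of the "one batch at a time" schedule from slide 39: split the item pool into batches, publish one, and require a manual review before the next goes up. The publish_batch and collect_results helpers are hypothetical stand-ins for your own upload and download code.

```python
# Sketch: run a large labeling task one batch at a time, with a manual
# review gate between batch n and batch n+1.
def batches(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_in_batches(items, batch_size, publish_batch, collect_results):
    all_results = []
    for n, batch in enumerate(batches(items, batch_size), start=1):
        publish_batch(batch)               # only this batch is live on the system
        results = collect_results(batch)   # blocks until assignments come back
        all_results.extend(results)
        # Review worker feedback from batch n before uploading batch n+1.
        if input(f"Batch {n} done. Continue? [y/N] ").lower() != "y":
            break
    return all_results
```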

  40. Advanced applications • Training sets for machine learning • Active learning • Adaptive quality control • Automatic generation of black/white lists CLEF 2011
