HUMAN-POWERED DATA MANAGEMENT
Aditya Parameswaran
with H. Garcia-Molina, J. Widom, A. Polyzotis, M. Teh

Why should we (DM/DB folks) care?
Reason 1: Most data is unstructured
• Unstructured Data: images, videos, text
• Automated processing: not yet solved
• Incorporate xyzabc
• Structured Data

Why should we (DM/DB folks) care?
Reason 2: Software companies use crowds at scale
• We undertook a survey of industry crowdsourcing users
• [Figure: logos of surveyed companies] use crowds!
• Often 10s+ of millions of $ per year per company (on crowds + supervisors)
• Plenty of startups too!

Why should we (DM/DB folks) care?
Reason 3: Marketplaces are growing rapidly
• Crowdsourcing marketplaces: 20+ marketplaces
• Big companies have internal ones
• The size of these marketplaces doubled between 2011 and 2013

Why should we (DM/DB folks) care?
Reason 1: Most data is unstructured
Reason 2: Software companies use crowds at scale
Reason 3: Marketplaces are growing rapidly

What is Human-Powered Data Management?
Data processing algorithms and data processing systems where humans act as “data processors” (e.g., compare, label, extract)
• Machine Learning: learning accuracies
• HCI: interfaces, patterns
• Economics: incentives

Efficient Data Processing Algorithms & Systems
Data Processing Algorithms:
• Filter [SIGMOD12, VLDB14]
• Max [SIGMOD12]
• Clean [KDD12, TKDD13]
• Categorize [VLDB11]
• Search [ICDE14]
• Debugging [NIPS12]
Data Processing Systems:
• Deco [CIKM12, VLDB12, TR12, SIGMOD Record 12]
• DataSift [HCOMP13, SIGMOD14]
• HQuery [CIDR11]
Auxiliary Plugins (Quality, Pricing):
• Confidence [KDD13, TR14]
• Eviction [TR12]
• Pricing [VLDB15]
• Quality [HCOMP14]
i.stanford.edu/~adityagp/scoop.html

Data Proc. Sys.: Crowd-Powered Search
Can your search engine handle this?
• buildings in the vicinity of xxx
• type of cable that connects to
• apartments in a good school district near Urbana, with a bus stop nearby

DataSift: Crowd-Powered Search
• Non-textual content:
  “cables that plug into <img>”
  “funny pictures of cats with hats with captions”
• Time-consuming:
  “find noise canceling headphones where the battery lasts 13 hrs”
  “apartments in a nice area around Urbana”

Building DataSift: Challenges
[Figure: workflows composed of Gather, Retrieve, and Filter steps]
• Gather: ask for text reformulations of the query
• Filter: check if an item satisfies the query
Challenges:
• How many reformulations should we gather?
• How many items should we retrieve at each step?
• How do we filter items? How many people do we ask?
• How do we optimize the workflow?
• How do we guarantee correctness?

Fundamental Tradeoffs
• Latency: How long can I wait?
• Quality: What is my desired quality?
• Cost: How much am I willing to spend?

DataSift Summary
Sample applications: education, social media, commerce, journalism, …
Workflow: Gather, Retrieve, Filter
Tradeoffs: Latency, Quality, Cost
[SIGMOD14] DataSift: A Crowd-Powered Search Toolkit (demo)
[HCOMP13] An expressive and accurate crowd powered search

Filtering: The Simplest Version
Dataset of Items → Boolean Predicate → Filtered Dataset
• Example predicate: Is this image a cat?
• Question asked of humans: Does X satisfy predicate? (example answers: Y, Y, N)
• For now, all humans have the same error rates
Tradeoffs: Latency, Quality, Cost

Our Visualization of Strategies
[Figure: grid with the number of Yes answers (1 to 5) on the x-axis and the number of No answers (1 to 5) on the y-axis; each grid point is marked continue, decide PASS, or decide FAIL. Label: Markov Decision Process]

Strategy Examples
[Figure: two example strategies drawn on the same grid (Yes answers 1 to 5 on the x-axis, No answers 1 to 5 on the y-axis), with points marked continue, decide PASS, or decide FAIL]

Simplest Version
Given (via sampling, prior history, or gold standard):
• Human error probability (FP/FN): Pr[Yes | 0]; Pr[No | 1]
• A-priori probability: Pr[0]; Pr[1]
Find the strategy with minimum expected cost (# of questions), subject to:
• Expected error < t (say, 5%)
• Cost per item < m (say, 20 questions)
[Figure: strategy grid bounded by the line x + y = m]

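Evaluating Strategies
Both sums run over the grid points where the strategy stops (decide PASS or decide FAIL):
$\text{Cost} = \sum_{(x,y)} (x + y)\,\Pr[\text{reach}(x, y)]$
$\text{Error} = \sum_{(x,y)\,\in\,\text{FAIL}} \Pr[\text{reach}(x, y) \wedge 1] + \sum_{(x,y)\,\in\,\text{PASS}} \Pr[\text{reach}(x, y) \wedge 0]$
Example: Pr[reach(4, 2)] = Pr[reach(4, 1) & get a No] + Pr[reach(3, 2) & get a Yes]
[Figure: strategy grid with x = number of Yes answers and y = number of No answers, each from 1 to 5]
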
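The cost and error sums on this slide can be sketched as a forward pass over the grid. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `evaluate_strategy`, the `decide` callback, and the majority-of-three example strategy are all hypothetical. As in the grid figures, x counts Yes answers and y counts No answers.

```python
from typing import Callable, Tuple


def evaluate_strategy(
    decide: Callable[[int, int], str],  # hypothetical: returns "continue", "pass", or "fail"
    m: int,                             # maximum number of questions per item
    p_yes_given_0: float,               # false-positive rate, Pr[Yes | 0]
    p_no_given_1: float,                # false-negative rate, Pr[No | 1]
    prior_1: float,                     # a-priori probability Pr[1]
) -> Tuple[float, float]:
    """Return (expected cost, expected error) of a deterministic strategy."""
    p_yes = {0: p_yes_given_0, 1: 1.0 - p_no_given_1}
    prior = {0: 1.0 - prior_1, 1: prior_1}
    cost = 0.0
    error = 0.0
    for v in (0, 1):  # condition on the true value of the item
        reach = {(0, 0): 1.0}  # Pr[reach(x, y) | true value v]
        for total in range(m + 1):
            for x in range(total + 1):
                y = total - x
                pr = reach.get((x, y), 0.0)
                if pr == 0.0:
                    continue  # unreachable point, nothing to propagate
                action = decide(x, y)
                if action == "continue":
                    if total == m:
                        raise ValueError("strategy must stop once x + y = m")
                    reach[(x + 1, y)] = reach.get((x + 1, y), 0.0) + pr * p_yes[v]
                    reach[(x, y + 1)] = reach.get((x, y + 1), 0.0) + pr * (1.0 - p_yes[v])
                else:
                    cost += prior[v] * pr * total
                    # PASS on a true 0, or FAIL on a true 1, is an error.
                    if (action == "pass") != (v == 1):
                        error += prior[v] * pr
    return cost, error


def majority_of_three(x: int, y: int) -> str:
    """Hypothetical example strategy: stop as soon as one answer has two votes."""
    if x >= 2:
        return "pass"
    if y >= 2:
        return "fail"
    return "continue"


cost, error = evaluate_strategy(majority_of_three, 3, 0.1, 0.1, 0.5)
```

With symmetric 10% error rates and a uniform prior, this example strategy costs about 2.18 questions per item and errs on about 2.8% of items.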
Naïve Approach
For each grid point:
• Assign continue, decide PASS, or decide FAIL
For all strategies:
• Evaluate cost & error
Return the best
Complexity: O(3^g), where g = O(m^2) is the number of grid points
If m = 5, g = 21

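To see why the naïve enumeration is infeasible, the count can be worked out directly. The sketch below assumes the grid contains every point with x + y ≤ m (which gives g = 21 for m = 5, matching the slide); the helper names are hypothetical. It gives an upper bound, since points on the boundary cannot be assigned continue.

```python
def grid_points(m: int) -> int:
    """Number of grid points (x, y) with x, y >= 0 and x + y <= m."""
    return (m + 1) * (m + 2) // 2


def naive_strategy_count(m: int) -> int:
    """Upper bound on deterministic strategies: three choices per grid point."""
    return 3 ** grid_points(m)


g = grid_points(5)
count = naive_strategy_count(5)
```

Even for m = 5 this is more than ten billion candidate strategies, each of which would need its cost and error evaluated.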
Comparison
Strategy                  Computing                Money
Naïve deterministic       Not feasible             $$
Our best deterministic    Exponential; feasible    $$$

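Probabilistic Strategy Example
[Figure: strategy grid (Yes answers 1 to 5 on the x-axis, No answers 1 to 6 on the y-axis) with points marked continue, decide PASS, or decide FAIL; one point carries a triple of probabilities over the three actions, (0.2, 0.8, 0)]
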
Comparison
Strategy                  Computing                    Money
Naïve deterministic       Exponential; not feasible    $$
Our best deterministic    Exponential; feasible        $$$
The best probabilistic    Polynomial(m)                $
The best probabilistic strategy is THE BEST on both counts.

Finding the Optimal Strategy
Simple: Use Linear Programming
• Variables: “probabilistic decision per grid point”
• Constraints:
  • probability conservation
  • boundary conditions
[SIGMOD12] Crowdscreen: Algorithms for filtering data with humans

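One plausible way to write such a linear program is sketched below. It is a reconstruction from the slide's description, not necessarily the exact program in [SIGMOD12], and the symbols are assumptions: $c_{x,y}$, $p_{x,y}$, $f_{x,y}$ are the path mass at grid point $(x, y)$ that continues, decides PASS, or decides FAIL, and the weights are $w^{1}_{x,y} = \Pr[1]\,(1 - \Pr[\text{No}\mid 1])^{x}\,\Pr[\text{No}\mid 1]^{y}$ and $w^{0}_{x,y} = \Pr[0]\,\Pr[\text{Yes}\mid 0]^{x}\,(1 - \Pr[\text{Yes}\mid 0])^{y}$. Terms with a negative index are taken to be zero, and the error bound is written as non-strict because linear programs do not support strict inequalities.

```latex
\begin{align*}
\min_{c,\,p,\,f}\quad
  & \sum_{x+y \le m} (x+y)\,\bigl(p_{x,y} + f_{x,y}\bigr)\,\bigl(w^{1}_{x,y} + w^{0}_{x,y}\bigr)
  && \text{(expected cost)} \\
\text{s.t.}\quad
  & c_{0,0} + p_{0,0} + f_{0,0} = 1
  && \text{(start at the origin)} \\
  & c_{x,y} + p_{x,y} + f_{x,y} = c_{x-1,y} + c_{x,y-1}, \quad x + y \ge 1
  && \text{(probability conservation)} \\
  & c_{x,y} = 0, \quad x + y = m
  && \text{(boundary condition)} \\
  & \sum_{x+y \le m} \bigl(p_{x,y}\,w^{0}_{x,y} + f_{x,y}\,w^{1}_{x,y}\bigr) \le t
  && \text{(expected error)} \\
  & c_{x,y},\; p_{x,y},\; f_{x,y} \ge 0
\end{align*}
```

Under this formulation, the probabilistic decision at a reachable grid point would be recovered by normalizing: with $A = c_{x,y} + p_{x,y} + f_{x,y} > 0$, the point continues, passes, or fails with probabilities $(c_{x,y}/A,\; p_{x,y}/A,\; f_{x,y}/A)$. The number of variables is three per grid point, so the program size is polynomial in m.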
Generalizations
Doable:
• Multiple answers (ratings, categories)
• Multiple independent filters
• Difficulty
• Different penalty functions
Hard!
• Latency
• Different worker abilities
• Different worker probes
• A-priori scores

Generalization: Worker Abilities
          Item 1    Item 2    Item 3
Actual    0         1         0
W1        0         1         0
W2        1         1         1
W3        1         0         1
State: (W1 Yes, W1 No, …, Wn Yes, Wn No)
O(m^(2n)) points, n ≈ 1000
Explosion of state!

A Different Representation
[Figure: the strategy grid (Yes answers on the x-axis, No answers on the y-axis, each 1 to 3) redrawn with Cost (1 to 3) on the x-axis and Pr[1|Ans] (0.2 to 1) on the y-axis]

Worker Abilities: Sufficiency
State: (W1 Yes, W1 No, W2 Yes, W2 No, …, Wn Yes, Wn No)
[Figure: states plotted with Cost (1 to 5) on the x-axis and Pr[1|Ans] (0.2 to 1) on the y-axis]
Recording Pr[1|Ans] is sufficient: Strategy → Optimal

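The quantity Pr[1|Ans] can be computed with Bayes' rule even when each worker has different error rates. The sketch below is an illustration only: it assumes workers answer independently given the true value, and the function name `posterior_one` and the example error rates are hypothetical.

```python
from typing import List, Tuple


def posterior_one(
    answers: List[Tuple[bool, float, float]],  # (said_yes, Pr[Yes | 0], Pr[No | 1]) per worker
    prior_1: float,                            # a-priori probability Pr[1]
) -> float:
    """Return Pr[1 | Ans], assuming workers are independent given the true value."""
    like_1 = prior_1        # running value of Pr[1] * Pr[Ans | 1]
    like_0 = 1.0 - prior_1  # running value of Pr[0] * Pr[Ans | 0]
    for said_yes, p_yes_given_0, p_no_given_1 in answers:
        if said_yes:
            like_1 *= 1.0 - p_no_given_1
            like_0 *= p_yes_given_0
        else:
            like_1 *= p_no_given_1
            like_0 *= 1.0 - p_yes_given_0
    return like_1 / (like_1 + like_0)


# An accurate worker (10% error) says Yes; a noisy worker (40% error) says No.
posterior = posterior_one([(True, 0.1, 0.1), (False, 0.4, 0.4)], prior_1=0.5)
```

Here the accurate worker's Yes outweighs the noisy worker's No, so the posterior stays well above one half (6/7). A single number per state replaces the full vector of per-worker Yes/No counts.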
MOOCs: Application of Filtering
• Peer Evaluation: Crowdsourcing Required
• [Figure: example grades A+, A, B-, B+]
• Generalization of boolean filtering to scoring [1-5]

Experiments on MOOCs
• Stanford HCI Course
• 1000 x 5 x 5 Parts = 25000 Parts
• Graded by random peers with known error rates
• To study: how much we can reduce error for fixed cost

Summary:
For the same cost, reduction in error (distance from correct grade) of:
• 50% over median
• 30% over MLE
• 10-20% over same accuracy
[VLDB14] Optimal Crowd-Powered Rating and Filtering Algorithms

Efficient Data Processing Algorithms & Systems
Data Processing Algorithms:
• Filter [SIGMOD12, VLDB14]
• Max [SIGMOD12]
• Clean [KDD12, TKDD13]
• Categorize [VLDB11]
• Search [ICDE14]
• Debugging [NIPS12]
Data Processing Systems:
• Deco [CIKM12, VLDB12, TR12, SIGMOD Record 12]
• DataSift [HCOMP13, SIGMOD14]
• HQuery [CIDR11]
Auxiliary Plugins (Quality, Pricing):
• Confidence [KDD13, TR14]
• Eviction [TR12]
• Pricing [VLDB15]
• Quality [HCOMP14]
Tradeoffs: Latency, Quality, Cost
i.stanford.edu/~adityagp/scoop.html

VISUAL DATA MANAGEMENT with SeeDB
Aditya Parameswaran
with: Hector Garcia Molina, Sam Madden, Alkis Polyzotis, Manasi Vartak

Simplifying Data Analytics
Up to a million additional analysts will be needed to address data analytics needs in 2018 in the US alone.
--- McKinsey Big Data Report, 2013
How do we make it easier for novice data analysts to get insights from data?

Data Analytics Workflow
Query: “Staplers”
Views:
• “Production by State” [Figure: chart over MA, CA, IL, NY, with an “All Products” series]
• “Sales by Year” [Figure: chart]
• “Production by Year” [Figure: chart]
Laborious and Tiresome! Can we automate this?
Similar issues with Tableau, ShowMe, Profiler, Spotfire
