Using CrowdSourcing for Data Analytics
Hector Garcia-Molina (work with Steven Whang, Peter Lofgren, Aditya Parameswaran and others)
Stanford University

Outline:
• Big Data Analytics
• CrowdSourcing
CrowdSourcing

Real World Examples:
• Categorizing Images
• Data Gathering
• Search Relevance
• Image Matching
• Translation
Many Crowdsourcing Marketplaces! Many Research Projects!
Example tasks:
• get missing data
• verify results
• analyze data
[Figure: data flows into an analytics engine; humans feed into the analytics; results come out.]

Key Point:
• use humans judiciously
Today will illustrate with:
• Entity Resolution
• (may cover another topic briefly)

Traditional Entity Resolution
[Figure: records from System 1 through System n flow into analysis and cleansing; what matches what?]
Why is ER Challenging?
• Huge data sets
• No unique identifiers
• Missing data
• Lots of uncertainty
• Many ways to skin the cat

Simple ER Example
Simple ER Example
[Figure: four records a, b, c, d with sim(a,b) = 0.9 and sim(c,d) = 0.8.]
ER: Exact vs Approximate
[Figure: ER run separately on cameras, CDs, books, ... yields resolved cameras, resolved CDs, resolved books; together these make up the resolved products.]

Simple ER Algorithm
• Compute pairwise similarities
• Apply threshold
• Perform transitive closure
[Figure: a graph of records with pairwise similarities between 0.4 and 0.95 on the edges; with threshold = 0.7, only edges of similarity at least 0.7 survive, and transitive closure groups the connected records into clusters.]
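A minimal sketch of the three-step algorithm in Python, assuming similarities are already computed and given as a dict from record pairs to scores (the input values and the threshold below are illustrative, taken from the slide's example):

```python
from collections import defaultdict

def simple_er(records, similarities, threshold=0.7):
    """Cluster records: keep pairs with similarity >= threshold,
    then take the transitive closure via union-find."""
    parent = {r: r for r in records}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path compression
            r = parent[r]
        return r

    def union(r, s):
        parent[find(r)] = find(s)

    # Apply threshold: merge every sufficiently similar pair.
    for (r, s), sim in similarities.items():
        if sim >= threshold:
            union(r, s)

    # Transitive closure falls out of union-find: records with
    # the same root belong to the same cluster.
    clusters = defaultdict(set)
    for r in records:
        clusters[find(r)].add(r)
    return list(clusters.values())

# Example: the four-record graph from the earlier slide.
sims = {("a", "b"): 0.9, ("c", "d"): 0.8, ("b", "c"): 0.4}
print(simple_er(["a", "b", "c", "d"], sims))  # [{'a', 'b'}, {'c', 'd'}]
```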
Crowd ER
Same as this?

Crowd ER
• First Cut: For every pair of records, ask workers if they match (i.e., get similarity)
• Too expensive! (n records means n(n−1)/2 pairwise questions)

Crowd ER
• Second Cut: Compute similarities automatically; workers verify "critical" pairs
[Figure: the similarity graph again; which pairs are critical??]
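One natural reading of "critical" is pairs whose similarity falls near the threshold, where an automatic decision is least reliable. A sketch under that assumption (the band width is an illustrative tuning knob, not a value from the talk):

```python
def critical_pairs(similarities, threshold=0.7, band=0.15):
    """Select pairs whose similarity is too close to the threshold
    to trust an automatic decision; send these to the crowd."""
    return [
        pair
        for pair, sim in similarities.items()
        if abs(sim - threshold) <= band
    ]

sims = {("a", "b"): 0.95, ("a", "c"): 0.63, ("b", "c"): 0.5, ("c", "d"): 0.4}
print(critical_pairs(sims))  # [('a', 'c')] -- only 0.63 is near the threshold
```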
[Figure: pipeline; records go through pairwise analysis, which generates crowd questions; the crowd returns new evidence; global analysis produces clusters.]
Key Point:
• use humans judiciously

Key Issue: Semantics of Crowd Answer
Key Issue: Semantics of Crowd Answer
[Figure: records A, B, C, D, E; what exactly does a crowd answer about one pair tell us about the rest?]

Also an issue: Similarities as Probabilities
sim(a,b) → prob(a,b)
Strategy
• use any given ER algorithm
[Figure: current state; three records a, b, c with pairwise similarities 0.2, 0.9, and 0.5, feeding an ER result.]

Strategy
• consider ALL possible questions (three in this example): Q(a,b), Q(b,c), Q(a,c)
Strategy
• consider possible outcomes
[Figure: decision tree; each question Q(a,b), Q(b,c), Q(a,c) branches on a Yes or No answer, and each branch leads to a new state and its ER result.]

Strategy (example)
[Figure: answering Yes to Q(b,c) sets that edge to 1.0, yielding a new state and a new ER result.]
Strategy
[Figure: the same decision tree, now with a score attached to each outcome's ER result.]

Two Remaining Issues
• How do we score an ER result? Compare it against a gold standard and compute an F-score.
• Efficiency?
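Pairwise F-score is a standard way to compare an ER result to a gold-standard clustering: compute precision and recall over the sets of within-cluster record pairs. A self-contained sketch (one common choice of F-score; the talk does not pin down the exact variant):

```python
from itertools import combinations

def pairs(clustering):
    """All unordered record pairs that share a cluster."""
    return {
        frozenset(p)
        for cluster in clustering
        for p in combinations(sorted(cluster), 2)
    }

def pairwise_f1(result, gold):
    """F-score of an ER result against a gold-standard clustering."""
    r, g = pairs(result), pairs(gold)
    if not r or not g:
        return 1.0 if r == g else 0.0
    precision = len(r & g) / len(r)
    recall = len(r & g) / len(g)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

result = [{"a", "b"}, {"c"}]
gold = [{"a", "b", "c"}]
print(round(pairwise_f1(result, gold), 2))  # 0.5
```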
Gold Standard?
[Figure: convert similarities to probabilities (sim to prob); 0.2 stays 0.2, 0.9 becomes 1.0, 0.5 becomes 0.6.]

Gold Standard? (possible worlds)
[Figure: the probabilistic graph induces four possible worlds, with probabilities 0.12, 0.48, 0.08, and 0.32.]
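Treating each uncertain edge as an independent coin flip reproduces the world probabilities on the slide. A sketch, assuming edge independence (the 1.0 edge holds in every nonzero-probability world):

```python
from itertools import product

def possible_worlds(edge_probs):
    """Enumerate every subset of edges with its probability,
    assuming edges hold independently."""
    edges = list(edge_probs)
    for outcome in product([True, False], repeat=len(edges)):
        prob = 1.0
        world = set()
        for edge, holds in zip(edges, outcome):
            prob *= edge_probs[edge] if holds else 1 - edge_probs[edge]
            if holds:
                world.add(edge)
        yield world, prob

# Edge probabilities from the slide (after the sim -> prob conversion).
probs = {("a", "b"): 0.2, ("a", "c"): 1.0, ("b", "c"): 0.6}
for world, p in possible_worlds(probs):
    print(sorted(world), round(p, 2))
# The four worlds containing the certain a-c edge get probabilities
# 0.12, 0.08, 0.48, and 0.32; all other worlds have probability 0.
```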
Gold Standard? (possible clusterings, via the ER algorithm)
[Figure: the possible worlds collapse into two possible clusterings: all of a, b, c together with probability 0.68, or a and c together and b alone with probability 0.32.]

Strategy
[Figure: the decision tree again; each outcome's ER result is now scored against the gold standard (GS).]
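Putting the pieces together, a sketch of the selection loop: weight the score of each question's Yes and No outcomes by the current probability of that answer, and ask the question with the highest expected score. Here `run_er` and `score_vs_gold` are assumed hooks standing in for the ER algorithm and gold-standard scoring above, not the paper's exact formulation:

```python
def expected_score(state, question, run_er, score_vs_gold):
    """Expected post-answer score of asking one question.

    `state` maps record pairs to match probabilities. A Yes answer is
    modeled as setting the pair's probability to 1.0 and a No answer
    to 0.0 (a simplification: real workers are noisy).
    """
    p_yes = state[question]
    expected = 0.0
    for answer_prob, answer_value in ((p_yes, 1.0), (1 - p_yes, 0.0)):
        new_state = dict(state)
        new_state[question] = answer_value
        expected += answer_prob * score_vs_gold(run_er(new_state))
    return expected

def best_question(state, run_er, score_vs_gold):
    """Greedy selection: ask the question with the highest expected score."""
    return max(state,
               key=lambda q: expected_score(state, q, run_er, score_vs_gold))
```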
Evaluating Efficiently
• See: Steven E. Whang, Peter Lofgren, and Hector Garcia-Molina. Question Selection for Crowd Entity Resolution. Proc. 39th Int'l Conf. on Very Large Data Bases (PVLDB), Trento, Italy, 2013.

Sample Result
[Figure: sample experimental result omitted.]
Summary
Example tasks:
• get missing data
• verify results
• analyze data
Key Point:
• use humans judiciously

Now for something completely different!
[Figure: big data flows into a DBMS, which feeds analytics.]
Now for something completely different!
[Figure: big data and humans both feed the DBMS, which feeds analytics.]

DeCo: Declarative CrowdSourcing
End user: what is the best price for Nikon DSLR cameras?
[Figure: the query goes to the DBMS, which draws on stored data and on humans.]
DeCo: Declarative CrowdSourcing
End user: what is the best price for Nikon DSLR cameras?

Stored table:
  model   type   brand
  D7100   DSLR   Nikon
  7D      DSLR   Canon
  P5000   comp   Nikon
  ...

Crowd: what is the best price for a Nikon D7100 camera?
Example with a bit more detail:

User view (restaurant, rating, cuisine):
  Chez Panisse   4.9   French
  Chez Panisse   4.9   California
  Bytes          3.8   California
  ...

computed as Anchor ⋈o Dependent ⋈o Dependent (outer joins):

Anchor (restaurant):
  Chez Panisse
  Bytes
  ...

Dependent (restaurant, rating):
  Chez Panisse   4.8
  Chez Panisse   5.0
  Chez Panisse   4.9
  Bytes          3.6
  Bytes          4.0
  ...

Dependent (restaurant, cuisine):
  Chez Panisse   French
  Chez Panisse   California
  Bytes          California
  Bytes          California
  ...
Example, continued: fetch rules
[Figure: each table has a fetch rule that asks the crowd for new rows; fetch rules add restaurants (Chez Panisse, Bytes) to the anchor table and values (e.g., the cuisine French) to the dependent tables.]
Example, continued: resolution rules
[Figure: each dependent table has a resolution rule that reconciles the raw crowd answers (e.g., the ratings 4.8, 5.0, and 4.9 resolve to 4.9) before the join.]

Evaluation order, as sketched below:
1. Fetch
2. Resolve
3. Join
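A toy rendering of the three-phase semantics on the restaurant example, with illustrative resolution rules (averaging ratings, keeping distinct cuisines; these rule choices are assumptions for the sketch, not DeCo's fixed behavior):

```python
from statistics import mean

# Raw crowd answers for the two dependent tables. Fetch rules would
# append rows here, e.g. by asking workers "what is the rating of X?".
ratings = [("Chez Panisse", 4.8), ("Chez Panisse", 5.0), ("Chez Panisse", 4.9),
           ("Bytes", 3.6), ("Bytes", 4.0)]
cuisines = [("Chez Panisse", "French"), ("Chez Panisse", "California"),
            ("Bytes", "California"), ("Bytes", "California")]
anchors = ["Chez Panisse", "Bytes"]  # the anchor table of restaurants

def resolve_ratings(rows):
    """Resolution rule: average the crowd's ratings per restaurant."""
    by_name = {}
    for name, rating in rows:
        by_name.setdefault(name, []).append(rating)
    return {name: round(mean(vals), 1) for name, vals in by_name.items()}

def resolve_cuisines(rows):
    """Resolution rule: keep each distinct cuisine (a restaurant may have several)."""
    by_name = {}
    for name, cuisine in rows:
        by_name.setdefault(name, set()).add(cuisine)
    return by_name

# 1. Fetch (the lists above)   2. Resolve   3. Join into the user view.
rating_of = resolve_ratings(ratings)
cuisines_of = resolve_cuisines(cuisines)
view = [(name, rating_of[name], cuisine)
        for name in anchors
        for cuisine in sorted(cuisines_of[name])]
print(view)
# [('Chez Panisse', 4.9, 'California'), ('Chez Panisse', 4.9, 'French'),
#  ('Bytes', 3.8, 'California')]
```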