Using CrowdSourcing for Data Analytics
Hector Garcia-Molina (work with Steven Whang, Peter Lofgren, Aditya Parameswaran and others)
Stanford University

Outline:
• Big Data Analytics
• CrowdSourcing
CrowdSourcing

Real World Examples:
• Categorizing Images
• Data Gathering
• Search Relevance
• Image Matching
• Translation
Many Crowdsourcing Marketplaces! Many Research Projects!
Example tasks:
• get missing data
• verify results
• analyze data
[Figure: data flows into an analytics engine; humans feed into the analytics; results come out.]

Key Point:
• use humans judiciously
Today will illustrate with:
• Entity Resolution
• (may cover another topic briefly)

Traditional Entity Resolution
[Figure: records from System 1 through System n flow into analysis and cleansing; what matches what?]
Why is ER Challenging?
• Huge data sets
• No unique identifiers
• Missing data
• Lots of uncertainty
• Many ways to skin the cat

Simple ER Example
Simple ER Example
[Figure: four records a, b, c, d with sim(a,b) = 0.9 and sim(c,d) = 0.8.]
ER: Exact vs Approximate
[Figure: ER run separately on cameras, CDs, books, ... yields resolved cameras, resolved CDs, resolved books; together these make up the resolved products.]

Simple ER Algorithm
• Compute pairwise similarities
• Apply threshold
• Perform transitive closure
[Figure: a graph of records with pairwise similarities between 0.4 and 0.95 on the edges; with threshold = 0.7, only edges of similarity at least 0.7 survive, and transitive closure groups the connected records into clusters.]
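A minimal sketch of the three-step algorithm in Python, assuming similarities are already computed and given as a dict from record pairs to scores (the input values and the threshold below are illustrative, taken from the slide's example):

```python
from collections import defaultdict

def simple_er(records, similarities, threshold=0.7):
    """Cluster records: keep pairs with similarity >= threshold,
    then take the transitive closure via union-find."""
    parent = {r: r for r in records}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path compression
            r = parent[r]
        return r

    def union(r, s):
        parent[find(r)] = find(s)

    # Apply threshold: merge every sufficiently similar pair.
    for (r, s), sim in similarities.items():
        if sim >= threshold:
            union(r, s)

    # Transitive closure falls out of union-find: records with
    # the same root belong to the same cluster.
    clusters = defaultdict(set)
    for r in records:
        clusters[find(r)].add(r)
    return list(clusters.values())

# Example: the four-record graph from the earlier slide.
sims = {("a", "b"): 0.9, ("c", "d"): 0.8, ("b", "c"): 0.4}
print(simple_er(["a", "b", "c", "d"], sims))  # [{'a', 'b'}, {'c', 'd'}]
```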
Crowd ER
Same as this?

Crowd ER
• First Cut: For every pair of records, ask workers if they match (i.e., get similarity)
• Too expensive! (n records means n(n−1)/2 pairwise questions)

Crowd ER
• Second Cut: Compute similarities automatically; workers verify "critical" pairs
[Figure: the similarity graph again; which pairs are critical??]
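One natural reading of "critical" is pairs whose similarity falls near the threshold, where an automatic decision is least reliable. A sketch under that assumption (the band width is an illustrative tuning knob, not a value from the talk):

```python
def critical_pairs(similarities, threshold=0.7, band=0.15):
    """Select pairs whose similarity is too close to the threshold
    to trust an automatic decision; send these to the crowd."""
    return [
        pair
        for pair, sim in similarities.items()
        if abs(sim - threshold) <= band
    ]

sims = {("a", "b"): 0.95, ("a", "c"): 0.63, ("b", "c"): 0.5, ("c", "d"): 0.4}
print(critical_pairs(sims))  # [('a', 'c')] -- only 0.63 is near the threshold
```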
[Figure: pipeline; records go through pairwise analysis, which generates crowd questions; the crowd returns new evidence; global analysis produces clusters.]
Key Point:
• use humans judiciously

Key Issue: Semantics of Crowd Answer
Key Issue: Semantics of Crowd Answer
[Figure: records A, B, C, D, E; what exactly does a crowd answer about one pair tell us about the rest?]

Also an issue: Similarities as Probabilities
sim(a,b) → prob(a,b)
Strategy
• use any given ER algorithm
[Figure: current state; three records a, b, c with pairwise similarities 0.2, 0.9, and 0.5, feeding an ER result.]

Strategy
• consider ALL possible questions (three in this example): Q(a,b), Q(b,c), Q(a,c)
Strategy
• consider possible outcomes
[Figure: decision tree; each question Q(a,b), Q(b,c), Q(a,c) branches on a Yes or No answer, and each branch leads to a new state and its ER result.]

Strategy (example)
[Figure: answering Yes to Q(b,c) sets that edge to 1.0, yielding a new state and a new ER result.]
Strategy
[Figure: the same decision tree, now with a score attached to each outcome's ER result.]

Two Remaining Issues
• How do we score an ER result? Compare it against a gold standard and compute an F-score.
• Efficiency?
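Pairwise F-score is a standard way to compare an ER result to a gold-standard clustering: compute precision and recall over the sets of within-cluster record pairs. A self-contained sketch (one common choice of F-score; the talk does not pin down the exact variant):

```python
from itertools import combinations

def pairs(clustering):
    """All unordered record pairs that share a cluster."""
    return {
        frozenset(p)
        for cluster in clustering
        for p in combinations(sorted(cluster), 2)
    }

def pairwise_f1(result, gold):
    """F-score of an ER result against a gold-standard clustering."""
    r, g = pairs(result), pairs(gold)
    if not r or not g:
        return 1.0 if r == g else 0.0
    precision = len(r & g) / len(r)
    recall = len(r & g) / len(g)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

result = [{"a", "b"}, {"c"}]
gold = [{"a", "b", "c"}]
print(round(pairwise_f1(result, gold), 2))  # 0.5
```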
Gold Standard?
[Figure: convert similarities to probabilities (sim to prob); 0.2 stays 0.2, 0.9 becomes 1.0, 0.5 becomes 0.6.]

Gold Standard? (possible worlds)
[Figure: the probabilistic graph induces four possible worlds, with probabilities 0.12, 0.48, 0.08, and 0.32.]
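Treating each uncertain edge as an independent coin flip reproduces the world probabilities on the slide. A sketch, assuming edge independence (the 1.0 edge holds in every nonzero-probability world):

```python
from itertools import product

def possible_worlds(edge_probs):
    """Enumerate every subset of edges with its probability,
    assuming edges hold independently."""
    edges = list(edge_probs)
    for outcome in product([True, False], repeat=len(edges)):
        prob = 1.0
        world = set()
        for edge, holds in zip(edges, outcome):
            prob *= edge_probs[edge] if holds else 1 - edge_probs[edge]
            if holds:
                world.add(edge)
        yield world, prob

# Edge probabilities from the slide (after the sim -> prob conversion).
probs = {("a", "b"): 0.2, ("a", "c"): 1.0, ("b", "c"): 0.6}
for world, p in possible_worlds(probs):
    print(sorted(world), round(p, 2))
# The four worlds containing the certain a-c edge get probabilities
# 0.12, 0.08, 0.48, and 0.32; all other worlds have probability 0.
```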
Gold Standard? (possible clusterings, via the ER algorithm)
[Figure: the possible worlds collapse into two possible clusterings: all of a, b, c together with probability 0.68, or a and c together and b alone with probability 0.32.]

Strategy
[Figure: the decision tree again; each outcome's ER result is now scored against the gold standard (GS).]
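Putting the pieces together, a sketch of the selection loop: weight the score of each question's Yes and No outcomes by the current probability of that answer, and ask the question with the highest expected score. Here `run_er` and `score_vs_gold` are assumed hooks standing in for the ER algorithm and gold-standard scoring above, not the paper's exact formulation:

```python
def expected_score(state, question, run_er, score_vs_gold):
    """Expected post-answer score of asking one question.

    `state` maps record pairs to match probabilities. A Yes answer is
    modeled as setting the pair's probability to 1.0 and a No answer
    to 0.0 (a simplification: real workers are noisy).
    """
    p_yes = state[question]
    expected = 0.0
    for answer_prob, answer_value in ((p_yes, 1.0), (1 - p_yes, 0.0)):
        new_state = dict(state)
        new_state[question] = answer_value
        expected += answer_prob * score_vs_gold(run_er(new_state))
    return expected

def best_question(state, run_er, score_vs_gold):
    """Greedy selection: ask the question with the highest expected score."""
    return max(state,
               key=lambda q: expected_score(state, q, run_er, score_vs_gold))
```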
Evaluating Efficiently
• See: Steven E. Whang, Peter Lofgren, and Hector Garcia-Molina. Question Selection for Crowd Entity Resolution. Proc. 39th Int'l Conf. on Very Large Data Bases (PVLDB), Trento, Italy, 2013.

Sample Result
[Figure: sample experimental result omitted.]
Summary
Example tasks:
• get missing data
• verify results
• analyze data
Key Point:
• use humans judiciously

Now for something completely different!
[Figure: big data flows into a DBMS, which feeds analytics.]
Now for something completely different!
[Figure: big data and humans both feed the DBMS, which feeds analytics.]

DeCo: Declarative CrowdSourcing
End user: what is the best price for Nikon DSLR cameras?
[Figure: the query goes to the DBMS, which draws on stored data and on humans.]
DeCo: Declarative CrowdSourcing
End user: what is the best price for Nikon DSLR cameras?

Stored table:
  model   type   brand
  D7100   DSLR   Nikon
  7D      DSLR   Canon
  P5000   comp   Nikon
  ...

Crowd: what is the best price for a Nikon D7100 camera?
Example with a bit more detail:

User view (restaurant, rating, cuisine):
  Chez Panisse   4.9   French
  Chez Panisse   4.9   California
  Bytes          3.8   California
  ...

computed as Anchor ⋈o Dependent ⋈o Dependent (outer joins):

Anchor (restaurant):
  Chez Panisse
  Bytes
  ...

Dependent (restaurant, rating):
  Chez Panisse   4.8
  Chez Panisse   5.0
  Chez Panisse   4.9
  Bytes          3.6
  Bytes          4.0
  ...

Dependent (restaurant, cuisine):
  Chez Panisse   French
  Chez Panisse   California
  Bytes          California
  Bytes          California
  ...
Example, continued: fetch rules
[Figure: each table has a fetch rule that asks the crowd for new rows; fetch rules add restaurants (Chez Panisse, Bytes) to the anchor table and values (e.g., the cuisine French) to the dependent tables.]
Example, continued: resolution rules
[Figure: each dependent table has a resolution rule that reconciles the raw crowd answers (e.g., the ratings 4.8, 5.0, and 4.9 resolve to 4.9) before the join.]

Evaluation order, as sketched below:
1. Fetch
2. Resolve
3. Join
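A toy rendering of the three-phase semantics on the restaurant example, with illustrative resolution rules (averaging ratings, keeping distinct cuisines; these rule choices are assumptions for the sketch, not DeCo's fixed behavior):

```python
from statistics import mean

# Raw crowd answers for the two dependent tables. Fetch rules would
# append rows here, e.g. by asking workers "what is the rating of X?".
ratings = [("Chez Panisse", 4.8), ("Chez Panisse", 5.0), ("Chez Panisse", 4.9),
           ("Bytes", 3.6), ("Bytes", 4.0)]
cuisines = [("Chez Panisse", "French"), ("Chez Panisse", "California"),
            ("Bytes", "California"), ("Bytes", "California")]
anchors = ["Chez Panisse", "Bytes"]  # the anchor table of restaurants

def resolve_ratings(rows):
    """Resolution rule: average the crowd's ratings per restaurant."""
    by_name = {}
    for name, rating in rows:
        by_name.setdefault(name, []).append(rating)
    return {name: round(mean(vals), 1) for name, vals in by_name.items()}

def resolve_cuisines(rows):
    """Resolution rule: keep each distinct cuisine (a restaurant may have several)."""
    by_name = {}
    for name, cuisine in rows:
        by_name.setdefault(name, set()).add(cuisine)
    return by_name

# 1. Fetch (the lists above)   2. Resolve   3. Join into the user view.
rating_of = resolve_ratings(ratings)
cuisines_of = resolve_cuisines(cuisines)
view = [(name, rating_of[name], cuisine)
        for name in anchors
        for cuisine in sorted(cuisines_of[name])]
print(view)
# [('Chez Panisse', 4.9, 'California'), ('Chez Panisse', 4.9, 'French'),
#  ('Bytes', 3.8, 'California')]
```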