Cost-efficient Data Acquisition on Online Data Marketplaces for Correlation Analysis (VLDB'19)
Yanying Li [1], Haipei Sun [1], Boxiang Dong [2], Hui (Wendy) Wang [1]
[1] Stevens Institute of Technology, Hoboken, NJ    [2] Montclair State University, Montclair, NJ
August 28, 2019
Data Marketplace
The rising demand for valuable online datasets has led to the emergence of data marketplaces.
• Data seller: specifies data views for sale and their prices.
• Data shopper: decides which views to purchase.
Data Acquisition
We consider the data shopper's need as correlation analysis.

(a) D_S: source instance owned by data shopper Adam
  Age      Zipcode  Population
  [35,40]  10003    7,000
  [20,25]  01002    3,500
  [55,60]  07003    1,200
  [35,40]  07003    5,800
  [35,40]  07304    2,000

(b) Relevant instances on the data marketplace

D_1: Zipcode table (FD: Zipcode → State)
  Zipcode  State
  07003    NJ   (correct)
  07304    NJ   (correct)
  10001    NY   (correct)
  10001    NJ   (wrong)

D_2: Data and statistics of diseases by state
  State       Disease       # of cases
  MA          Flu           300
  NJ          Flu           400
  Florida     Lyme disease  130
  California  Lyme disease  40
  NJ          Lyme disease  200

D_3: Insurance & disease data instance
  Age      Address       Insurance         Disease
  [35,40]  10 North St.  UnitedHealthCare  Flu
  [20,25]  5 Main St.    MedLife           HIV
  [35,40]  25 South St.  UnitedHealthCare  Flu

Need: find the correlation between age groups and diseases in New Jersey.
Data Acquisition
• Requirement 1: Meaningful join
D_S ⋈ D_3 is meaningless, as it associates aggregate population data with individual records.

D_S ⋈ D_3
  Age      Zipcode  Population  Address       Insurance         Disease
  [35,40]  10003    7,000       10 North St.  UnitedHealthCare  Flu
  [35,40]  10003    7,000       25 South St.  UnitedHealthCare  Flu
  [20,25]  01002    3,500       5 Main St.    MedLife           HIV
  [35,40]  07003    5,800       10 North St.  UnitedHealthCare  Flu
  [35,40]  07003    5,800       25 South St.  UnitedHealthCare  Flu
  [35,40]  07304    2,000       10 North St.  UnitedHealthCare  Flu
  [35,40]  07304    2,000       25 South St.  UnitedHealthCare  Flu
Data Acquisition
• Requirement 1: Meaningful join
• Requirement 2: High data quality
We consider data inconsistency as the main quality issue.

FD: Zipcode → State
  Zipcode  State
  07003    NJ   (correct)
  07304    NJ   (correct)
  10001    NY   (correct)
  10001    NJ   (wrong)
Data Acquisition
• Requirement 1: Meaningful join
• Requirement 2: High data quality
• Requirement 3: Budget constraint
The data shopper has a purchase budget; the total price of the purchased datasets must be within the budget.
Our Contributions
We design a middleware service named DANCE, a Data Acquisition framework on oNline data market for CorrElation analysis, that
• provides a cost-efficient data acquisition service;
• enables budget-conscious search for high-quality data;
• maximizes the correlation of the desired attributes.
Outline
1 Introduction
2 Related Work
3 Preliminaries
4 DANCE
  • Offline Phase
  • Online Phase
5 Experiments
6 Conclusion
Related Work
Data Market
• Query-based pricing model [KUB+15]
• History-aware pricing model [U+16]
• Arbitrage-free pricing model [KUB+12, LK14, DK17]
Data Exploration via Join
• Summary graph [YPS11]
• Reverse engineering [ZEPS13]
These works do not consider data quality or budget constraints.
Preliminaries - Data Pricing
• In this paper, we mainly focus on query-based pricing functions [KUB+15].
  Input: explicit prices for a few views
  Output: the derived price for any view
• DANCE is compatible with any pricing model.
Preliminaries - Data Quality
We define data quality as the fraction of tuples that are correct with regard to all the functional dependencies.

FDs: A → B, D → E
  TID  A   B   C   D   E
  t1   a1  b2  c1  d1  e1
  t2   a1  b2  c1  d1  e1
  t3   a1  b2  c2  d1  e1
  t4   a1  b2  c3  d1  e2
  t5   a1  b3  c3  d2  e2

C(D, A → B) = {t1, t2, t3, t4}   (t5 violates A → B)
C(D, D → E) = {t1, t2, t3, t5}   (t4 violates D → E)
Q(D) = |C(D, A → B) ∩ C(D, D → E)| / |D| = 3/5 = 0.6
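To make the definition concrete, here is a minimal sketch (not the paper's implementation) of how Q(D) could be computed with pandas; choosing the consistent set C(D, FD) by a per-group majority, and the FD encoding as (lhs, rhs) column lists, are illustrative assumptions.

```python
# Sketch: Q(D) = fraction of tuples consistent with all FDs.
# The majority-based choice of the consistent set is an assumption for illustration.
import pandas as pd

def consistent_tuples(df: pd.DataFrame, lhs: list, rhs: list) -> set:
    """Indices kept for FD lhs -> rhs: within each lhs group, keep the rows
    carrying the most frequent rhs value (a simple majority heuristic)."""
    keep = set()
    for _, group in df.groupby(lhs):
        majority = group.groupby(rhs).size().idxmax()
        if not isinstance(majority, tuple):
            majority = (majority,)
        mask = (group[rhs] == pd.Series(majority, index=rhs)).all(axis=1)
        keep |= set(group.index[mask])
    return keep

def quality(df: pd.DataFrame, fds: list) -> float:
    """Q(D): fraction of tuples consistent with *all* functional dependencies."""
    ok = set(df.index)
    for lhs, rhs in fds:
        ok &= consistent_tuples(df, lhs, rhs)
    return len(ok) / len(df)

# Running example from the slide: FDs A -> B and D -> E.
D = pd.DataFrame({
    "A": ["a1"] * 5,
    "B": ["b2", "b2", "b2", "b2", "b3"],
    "C": ["c1", "c1", "c2", "c3", "c3"],
    "D": ["d1", "d1", "d1", "d1", "d2"],
    "E": ["e1", "e1", "e1", "e2", "e2"],
}, index=["t1", "t2", "t3", "t4", "t5"])

print(quality(D, [(["A"], ["B"]), (["D"], ["E"])]))  # -> 0.6
```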
Preliminaries - Join Informativeness
Definition (Join Informativeness)
Given two instances D and D′, let J be their join attribute(s). The join informativeness of D and D′ is defined as

  JI(D, D′) = (Entropy(D.J, D′.J) − I(D.J, D′.J)) / Entropy(D.J, D′.J),

where Entropy(D.J, D′.J) is the joint entropy computed from the joint distribution of D.J and D′.J in the output of the full outer join of D and D′, and I is the mutual information.
• It penalizes joins with excessive numbers of unmatched values [YPS09].
• 0 ≤ JI(D, D′) ≤ 1.
• The smaller JI(D, D′) is, the more important the join connection between D and D′ is.
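The quantity can be estimated directly from the two join columns. Below is a small illustrative sketch of my own, assuming the normalized form above and encoding unmatched sides of the full outer join as a distinct symbol; it is not the paper's code.

```python
# Sketch of JI(D, D') = (H(D.J, D'.J) - I(D.J; D'.J)) / H(D.J, D'.J),
# estimated from the empirical joint distribution over the full outer join.
import numpy as np
import pandas as pd

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def join_informativeness(d1: pd.DataFrame, d2: pd.DataFrame, j: str) -> float:
    left = d1[[j]].rename(columns={j: "J1"})
    right = d2[[j]].rename(columns={j: "J2"})
    outer = left.merge(right, left_on="J1", right_on="J2", how="outer")
    # Unmatched tuples appear with NaN on one side; keep them as a distinct symbol.
    pair = outer[["J1", "J2"]].fillna("<unmatched>")
    joint = pair.value_counts(normalize=True)        # joint distribution of (D.J, D'.J)
    p12 = joint.to_numpy()
    p1 = joint.groupby(level="J1").sum().to_numpy()  # marginal of D.J
    p2 = joint.groupby(level="J2").sum().to_numpy()  # marginal of D'.J
    h12 = entropy(p12)
    mi = entropy(p1) + entropy(p2) - h12             # mutual information
    return (h12 - mi) / h12 if h12 > 0 else 0.0
```

The more unmatched join values the full outer join produces, the larger the joint entropy relative to the mutual information, so JI grows toward 1 for weakly connected instances.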
Preliminaries - Correlation Measurement
Definition (Correlation Measurement)
Given a dataset D and two attribute sets X and Y, the correlation CORR(X, Y) is measured as
• CORR(X, Y) = Entropy(X) − Entropy(X | Y) if X is categorical,
• CORR(X, Y) = h(X) − h(X | Y) if X is numerical,
where h(X) is the cumulative entropy of attribute X,
  h(X) = −∫ P(X ≤ x) log P(X ≤ x) dx,   and   h(X | Y) = ∫ h(X | y) p(y) dy.
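For the categorical case, CORR(X, Y) is simply the information gain of X given Y. A short sketch with illustrative column names (not from the paper):

```python
# Sketch of CORR(X, Y) = H(X) - H(X | Y) for a categorical attribute X.
import numpy as np
import pandas as pd

def H(series: pd.Series) -> float:
    p = series.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def corr_categorical(df: pd.DataFrame, x: str, y: str) -> float:
    h_x = H(df[x])
    # H(X | Y): entropy of X within each Y group, weighted by P(Y = y).
    h_x_given_y = sum(len(g) / len(df) * H(g[x]) for _, g in df.groupby(y))
    return h_x - h_x_given_y

# Example: correlation between age group and disease in a joined view.
view = pd.DataFrame({
    "age_group": ["[35,40]", "[35,40]", "[20,25]", "[35,40]"],
    "disease":   ["Flu", "Flu", "HIV", "Lyme disease"],
})
print(corr_categorical(view, "age_group", "disease"))
```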
Problem Statement
Input: a set of data instances D = {D_1, ..., D_n}, source attributes A_S, target attributes A_T, purchase budget B, join informativeness threshold α, quality threshold β
Output: a set of data views T ⊆ D such that

  maximize_T   CORR(A_S, A_T)                            \\ correlation
  subject to   ∀ T_i ∈ T, ∃ D_j ∈ D s.t. T_i ⊆ D_j,
               Σ_{T_i ∈ S ∪ T} JI(T_i, T_{i+1}) ≤ α,      \\ informativeness
               Q(T) ≥ β,                                  \\ quality
               p(T) ≤ B.                                  \\ budget
Framework of DANCE
[Architecture diagram: the data shopper submits source instances and the desired correlation (A_S, A_T) to DANCE; in the offline phase DANCE requests samples from the data marketplace and constructs a join graph from them; in the online phase DANCE issues a data purchase query against the marketplace and returns the purchased data to the shopper.]
Offline Phase: construct a two-layer join graph of the datasets on the marketplace.
Online Phase: process data acquisition requests.
Dealing with Large-scale Data
Correlated Sampling
  S = { t_i ∈ D | h(t_i[J]) ≤ p }, where h hashes the join-attribute value t_i[J] and p is the sampling rate (see the sketch after this slide).
Estimation from Samples
  • E(JI(S_1, S_2)) = JI(D_1, D_2)
  • E(Q(S_1 ⋈ S_2)) = Q(D_1 ⋈ D_2)
  • E(CORR_{S_1 ⋈ S_2}(A_S, A_T)) = CORR_{D_1 ⋈ D_2}(A_S, A_T)
Re-sampling
  We design a correlated re-sampling method to deal with large join results from samples in the case of long join paths.
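A minimal sketch of the correlated-sampling step, assuming h hashes the join value to [0, 1); the hash construction and column names are illustrative, not the paper's exact implementation.

```python
# Sketch of correlated sampling S = { t in D | h(t[J]) <= p }: tuples are kept
# when a hash of their join-attribute value falls below the sampling rate p, so
# two instances sampled with the same hash keep matching join values together.
import hashlib
import pandas as pd

def _hash01(value: object, seed: str = "dance") -> float:
    """Deterministically map a join-attribute value to [0, 1)."""
    digest = hashlib.sha256(f"{seed}|{value}".encode()).hexdigest()
    return int(digest[:15], 16) / 16**15

def correlated_sample(df: pd.DataFrame, join_attr: str, p: float) -> pd.DataFrame:
    """Keep every tuple whose hashed join value is at most p."""
    keep = df[join_attr].map(_hash01) <= p
    return df[keep]

# Because the same hash drives both samples, matching Zipcode values survive
# (or are dropped) together, which preserves the join structure in S1 join S2.
D1 = pd.DataFrame({"Zipcode": ["07003", "07304", "10001"], "State": ["NJ", "NJ", "NY"]})
D2 = pd.DataFrame({"Zipcode": ["07003", "10001"], "Population": [5800, 7000]})
S1 = correlated_sample(D1, "Zipcode", p=0.5)
S2 = correlated_sample(D2, "Zipcode", p=0.5)
```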
Offline Phase: Construction of Join Graph
Construct a two-layer join graph from the data samples.
Instance layer
  Nodes: data instances
  Edges: join attribute and the minimum join informativeness
Attribute set layer
  Nodes: attribute sets
  Edges: join attribute and join informativeness
Offline Phase: Construction of Join Graph
Construct a two-layer join graph from the data samples.
[Example: at the instance level, D1 and D2 are connected by edges labeled with join attributes and informativeness, e.g. (B, 0.45) and (C, 0.6); at the attribute set level, attribute sets of D1 (e.g. AB, AC, BC, ABC) and of D2 (e.g. BC, BD, CD, BE, CE, DE, BCD, BCE, BDE, CDE, BCDE) are connected by edges such as (B, 0.45), (C, 0.6), and (BC, 0.5).]
Online Phase: Data Acquisition
We design a two-step algorithm to search for the data views.
Step 1: Find minimal weighted graphs at the instance layer.
[Example graph: instances D1-D9 connected by join edges J_ij; the search looks for a low-weight subgraph linking the source attribute set to the target attribute set.]
• This is equivalent to the Steiner tree problem and is NP-hard [Vaz13].
Online Phase: Data Acquisition
We design a two-step algorithm to search for the data views.
Step 1: Find minimal weighted graphs at the instance layer.
[Example graph: shortest-path distances s_ij between instances are estimated through landmark nodes.]
• We adapt the landmark-based approximate shortest path search algorithm [GBSW10].
Online Phase: Data Acquisition
We design a two-step algorithm to search for the data views.
Step 1: Find minimal weighted graphs at the instance layer.
Step 2: Find optimal target graphs at the attribute set layer based on Markov chain Monte Carlo (MCMC).
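For intuition only, here is a generic Metropolis-style search skeleton of the kind Step 2 relies on; score(), propose(), and feasible() are placeholder assumptions standing in for the correlation objective, the neighborhood of candidate attribute-set graphs, and the budget/quality/informativeness checks, and this is not the authors' exact algorithm.

```python
# Generic MCMC (Metropolis) search over candidate view sets: accept improving
# moves always, and worsening moves with probability exp(delta / temperature).
import math
import random

def mcmc_search(initial, score, propose, feasible, iters=1000, temperature=1.0):
    current, best = initial, initial
    for _ in range(iters):
        candidate = propose(current)
        if not feasible(candidate):      # budget / quality / informativeness checks
            continue
        delta = score(candidate) - score(current)
        if delta >= 0 or random.random() < math.exp(delta / temperature):
            current = candidate
            if score(current) > score(best):
                best = current
    return best
```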