On Data Dependencies in Dataspaces
Shaoxu Song, Tsinghua University
This is joint work with Lei Chen (HKUST) and Philip S. Yu (UIC)
sxsong@tsinghua.edu.cn
2011
Introduction
Dataspaces provide a co-existing system of heterogeneous data.
We consider three levels of elements: object : { (attribute : value) }
Example: we consider a dataspace with the following objects,
  t1 : { (name : iPod), (color : red), (manu : Apple Inc.), (tel : 567), (addr : InfiniteLoop, CA), (website : itunes.com) };
  t2 : { (name : iPod), (color : cardinal), (prod : Apple), (tel : 123), (post : InfiniteLoop, Cupert), (website : apple.com) };
  t3 : { (name : iPad), (color : white), (manu : Apple Inc.), (post : InfiniteLoop), (website : apple.com), (phn : 567) }.
Comparable Correspondence
Relationship between elements in heterogeneous data.
Metric operator 'manu ≈_{≤5} prod'
  any two respective values of manu and prod are said to be comparable, e.g., Apple Inc. and Apple, if their edit distance is ≤ 5.
Matching operator 'color ⇋ color'
  e.g., red and cardinal are said to be matched as comparable colors, via users' feedback
  often incrementally recognized in a pay-as-you-go style
A query of (manu : Apple)
  searches for values similar to Apple in both manu and prod
  e.g., (manu : Apple Inc.) in t1 and (prod : Apple) in t2
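The two operators above can be illustrated with a minimal sketch. It assumes the edit distance is the standard Levenshtein distance and models user feedback as a plain set of confirmed value pairs; the names edit_distance, metric_comparable, matching_comparable and CONFIRMED_MATCHES are hypothetical and do not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed here to be the intended edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def metric_comparable(x: str, y: str, threshold: int) -> bool:
    """Metric operator: true if the edit distance is within the threshold."""
    return edit_distance(x, y) <= threshold


# Hypothetical store of value pairs confirmed by user feedback.
CONFIRMED_MATCHES = {frozenset({"red", "cardinal"})}


def matching_comparable(x: str, y: str, matches=CONFIRMED_MATCHES) -> bool:
    """Matching operator: true for identical values or user-confirmed pairs."""
    return x == y or frozenset({x, y}) in matches


print(metric_comparable("Apple Inc.", "Apple", 5))   # manu vs prod, threshold 5
print(matching_comparable("red", "cardinal"))        # color vs color
```

In a pay-as-you-go setting the set of confirmed matches would grow over time, which is why it is passed in as a parameter here rather than fixed.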
Data Dependencies
For wider applications
  integrity constraints, schema design
  optimizing query evaluation, capturing data inconsistency, removing data duplicates
Conventional data dependencies are not directly applicable to dataspaces
  often defined on the equality function
  functional dependencies (FDs), X → A, specify the constraint of equality between the values of two objects on the same attribute, e.g., manu → addr
  cannot address the comparable correspondence in (manu, prod) or (addr, post)
Comparable Function
Specify constraints on comparable attributes
  θ(manu, prod) : [manu ≈_{≤5} manu, manu ≈_{≤5} prod, prod ≈_{≤5} prod]
Two objects are said to be comparable on (manu, prod) if at least one of the three comparison operators in θ(manu, prod) evaluates to true.
  t1, t2 are comparable on (manu, prod), since the edit distance of (t1[manu], t2[prod]) is ≤ 5
  t1, t3 are also comparable on (manu, prod), where (t1[manu], t3[manu]) satisfy 'manu ≈_{≤5} manu'
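As a rough sketch of how a comparable function might be evaluated, the code below represents θ as a list of (attribute, attribute, threshold) triples and objects as dictionaries. This representation, the names Theta and comparable, and the choice to try each operator in both directions of the pair are assumptions made for illustration; an operator is treated as inapplicable when an object lacks the attribute.

```python
from typing import Dict, List, Tuple

Obj = Dict[str, str]
Theta = List[Tuple[str, str, int]]  # (attribute, attribute, edit-distance threshold)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed to be the intended edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def comparable(t: Obj, u: Obj, theta: Theta) -> bool:
    """True if at least one operator in theta holds for the pair (t, u)."""
    for a, b, k in theta:
        # Assumption: the pair is unordered, so try both directions.
        for x, y in ((t, u), (u, t)):
            if a in x and b in y and edit_distance(x[a], y[b]) <= k:
                return True
    return False


# Only the attributes needed for this example are listed.
t1 = {"name": "iPod", "manu": "Apple Inc.", "tel": "567"}
t2 = {"name": "iPod", "prod": "Apple", "tel": "123"}
t3 = {"name": "iPad", "manu": "Apple Inc.", "phn": "567"}

THETA_MANU_PROD: Theta = [("manu", "manu", 5), ("manu", "prod", 5), ("prod", "prod", 5)]

print(comparable(t1, t2, THETA_MANU_PROD))  # via manu vs prod
print(comparable(t1, t3, THETA_MANU_PROD))  # via manu vs manu
```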
Comparable Dependencies (CDs)
A general form of dependencies on comparable functions
  ϕ1 : θ(manu, prod) → θ(addr, post)
  if the manu or prod values of two products are comparable, then their corresponding addr or post values should also be comparable
  where θ(addr, post) : [addr ≈_{≤9} addr, addr ≈_{≤9} post, post ≈_{≤9} post] is another comparable function
Application Example
Query optimization
  consider an object t1 as the query, to find objects having values similar to (manu : Apple Inc.) and (addr : InfiniteLoop, CA) of t1
  search in the manu, addr attributes specified in the query
  also search in the comparable attributes prod, post, according to the comparable functions θ(manu, prod) and θ(addr, post)
  according to ϕ1, rewrite the query by using (manu, prod) only
Related Work
Metric functional dependencies (MFDs), X →^{δ} A
  equality operator in the left-hand side
  similarity operator in the right-hand side
  for violation detection
  e.g., manu →^{2} addr
Matching dependencies (MDs), [X ≈ X] → [A ⇋ A]
  similarity operator in the left-hand side
  matching operator in the right-hand side
  for record matching
  e.g., [addr ≈ addr] → [tel ⇋ tel]
Outline Introduction Definition Validation Discovery Experiment Conclusion
Comparison Operator
We consider a general form of comparison operators, which includes the previous operators.
Let A_i ↔_{ij} A_j denote a comparison operator between two attributes A_i, A_j in a dataspace S
  equality operator A_i = A_j in functional dependencies (FDs)
  metric operator A_i ≈_{λ} A_j in metric functional dependencies (MFDs)
  matching operator A_i ⇋ A_j in matching dependencies (MDs)
The comparison operator evaluates to true if two values satisfy the corresponding constraint.
Syntax
A general comparable function
  θ(A_i, A_j) : [A_i ↔_{ii} A_i, A_i ↔_{ij} A_j, A_j ↔_{jj} A_j]
specifies a comparable constraint on two values from attribute A_i or A_j, according to their corresponding comparison operators.
A comparable dependency (CD) ϕ with general comparable functions over a dataspace S is of the form
  ϕ : θ(A_i, A_j) → θ(B_1, B_2)
If two objects have comparable values on A_i or A_j, then they must have comparable values on B_1 or B_2.
Example
Consider ϕ4 : θ(manu, prod) → θ(tel, phn), where θ(tel, phn) is [tel = tel, tel = phn, phn = phn].
We have (t1, t3) ≍ LHS(ϕ4), and the pair also agrees on the right-hand side, (t1, t3) ≍ RHS(ϕ4); this is denoted by (t1, t3) ⊨ ϕ4.
Approximate Dependencies
Due to the extremely high heterogeneity, data dependencies might not hold exactly in a given dataspace.
For ϕ4 : θ(manu, prod) → θ(tel, phn), e.g., (t1, t2) ≍ LHS(ϕ4) but (t1, t2) ≭ RHS(ϕ4), i.e., (t1, t2) ⊭ ϕ4.
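The satisfying pair (t1, t3) and the violating pair (t1, t2) for ϕ4 can be checked with a short sketch. It assumes each comparison operator is modelled as an (attribute, attribute, predicate) triple, that edit distance means Levenshtein distance, and that a pair satisfies a CD when comparability on the left-hand side implies comparability on the right-hand side. All function and variable names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Obj = Dict[str, str]
Operator = Tuple[str, str, Callable[[str, str], bool]]
Theta = List[Operator]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed to be the intended edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def within(k: int) -> Callable[[str, str], bool]:
    """Metric operator with edit-distance threshold k."""
    return lambda x, y: edit_distance(x, y) <= k


def equal(x: str, y: str) -> bool:
    """Equality operator."""
    return x == y


def comparable(t: Obj, u: Obj, theta: Theta) -> bool:
    """True if some operator in theta holds for the unordered pair (t, u)."""
    return any(a in x and b in y and op(x[a], y[b])
               for a, b, op in theta
               for x, y in ((t, u), (u, t)))


def pair_satisfies(t: Obj, u: Obj, lhs: Theta, rhs: Theta) -> bool:
    """A pair satisfies the CD unless it is comparable on LHS but not on RHS."""
    return (not comparable(t, u, lhs)) or comparable(t, u, rhs)


t1 = {"manu": "Apple Inc.", "tel": "567"}
t2 = {"prod": "Apple", "tel": "123"}
t3 = {"manu": "Apple Inc.", "phn": "567"}

LHS: Theta = [("manu", "manu", within(5)), ("manu", "prod", within(5)),
              ("prod", "prod", within(5))]
RHS: Theta = [("tel", "tel", equal), ("tel", "phn", equal), ("phn", "phn", equal)]

print(pair_satisfies(t1, t3, LHS, RHS))  # t1[tel] equals t3[phn]
print(pair_satisfies(t1, t2, LHS, RHS))  # tel values 567 and 123 differ
```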
Measure
To evaluate how well a dependency "almost" holds in a data instance.
Error measure
\[ g_3(\varphi, \mathcal{S}) = \frac{|\mathcal{S}| - \max\{\, |T| \mid T \subseteq \mathcal{S},\ T \models \varphi \,\}}{|\mathcal{S}|} \]
  the minimum fraction of objects that have to be removed from the dataspace S for the dependency ϕ to hold.
Confidence measure
\[ \mathit{conf}(\varphi, \mathcal{S}) = \frac{\max\{\, |T| \mid T \subseteq \mathcal{S},\ T \models \varphi \,\}}{|\mathcal{S}|} \]
  the maximum fraction of objects retained after removing a minimum set of violating objects with respect to ϕ; hence conf(ϕ, S) = 1 − g3(ϕ, S).
Example
ϕ4 : θ(manu, prod) → θ(tel, phn)
Error measure
  {t2} is a minimum violation set w.r.t. ϕ4, such that all the remaining objects {t1, t3} satisfy ϕ4
  g3 = 1/3
Confidence measure
  {t1, t3} is a maximum keeping set w.r.t. ϕ4
  conf = 2/3
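The two measures can be reproduced on the three example objects with a brute-force sketch. It enumerates subsets from largest to smallest, so it takes exponential time and is meant only as an illustration for tiny inputs, not as the validation method of the talk. It assumes that a set T satisfies ϕ when every pair of distinct objects in T does, that the dataspace is non-empty, and that edit distance means Levenshtein distance; all names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance (assumed to be the intended edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def within(k):
    return lambda x, y: edit_distance(x, y) <= k


def equal(x, y):
    return x == y


def comparable(t, u, theta):
    """True if some operator in theta holds for the unordered pair (t, u)."""
    return any(a in x and b in y and op(x[a], y[b])
               for a, b, op in theta
               for x, y in ((t, u), (u, t)))


def pair_satisfies(t, u, lhs, rhs):
    return (not comparable(t, u, lhs)) or comparable(t, u, rhs)


def max_satisfying_subset(objects, lhs, rhs):
    """Largest subset in which every pair satisfies the CD (brute force)."""
    for size in range(len(objects), 0, -1):
        for subset in combinations(objects, size):
            if all(pair_satisfies(t, u, lhs, rhs)
                   for t, u in combinations(subset, 2)):
                return list(subset)
    return []


def g3(objects, lhs, rhs):
    """Error measure: fraction of objects to remove (assumes objects is non-empty)."""
    kept = len(max_satisfying_subset(objects, lhs, rhs))
    return Fraction(len(objects) - kept, len(objects))


def conf(objects, lhs, rhs):
    """Confidence measure: fraction of objects kept (assumes objects is non-empty)."""
    return Fraction(len(max_satisfying_subset(objects, lhs, rhs)), len(objects))


t1 = {"manu": "Apple Inc.", "tel": "567"}
t2 = {"prod": "Apple", "tel": "123"}
t3 = {"manu": "Apple Inc.", "phn": "567"}
S = [t1, t2, t3]

LHS = [("manu", "manu", within(5)), ("manu", "prod", within(5)),
       ("prod", "prod", within(5))]
RHS = [("tel", "tel", equal), ("tel", "phn", equal), ("phn", "phn", equal)]

print(g3(S, LHS, RHS), conf(S, LHS, RHS))  # prints: 1/3 2/3
```

Using Fraction keeps the results exact, so they can be compared directly with the values 1/3 and 2/3 stated in the example.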