Discovering Conditional Functional Dependencies


Wenfei Fan, Floris Geerts, Jianzhong Li, and Ming Xiong (ICDE 2009). Amira Ghenai.


  1. Discovering Conditional Functional Dependencies
     Wenfei Fan, Floris Geerts, Jianzhong Li, and Ming Xiong
     ICDE 2009
     Amira Ghenai

  2. Outline
     • Introduction and Motivation
     • Contributions of the paper
     • Algorithms description
       • CFDMiner
       • CTANE
       • FastCFD
     • Experimental Evaluation
     • Summary

  3. Introduction & Motivation
     • CFDs (as previously discussed) were introduced for data cleaning purposes
     • CFDs are more effective than FDs in detecting and repairing inconsistencies
     • It is unrealistic to rely on human experts to design CFDs via experiments
     • CFDs should therefore be discovered automatically
     • The discovery problem is highly non-trivial

  4. Example
     Relation cust: (country code (CC), area code (AC), phone number (PN)), name (NM), and address (street (STR), city (CT), zip code (ZIP)).
     • FDs
     • CFDs: variable CFDs and constant CFDs
     (The formulas themselves, including one involving (CC, ZIP, STR), are missing from the transcript.)

  5. Main Contributions
     • Three algorithms for CFD discovery:
       1. CFDMiner: discovers constant CFDs only, using a depth-first search scheme
       2. CTANE: an extension of TANE (presented last week), which uses a levelwise approach to discover FDs
       3. FastCFD: a depth-first approach to discover general CFDs; an extension of FastFD
     • Experimental study on real-life datasets

  6. Problem Statement
     • Minimal CFDs
       A minimal CFD is one that is non-trivial and left-reduced. A CFD φ = (X → A, t_p) is left-reduced if:
       • None of its LHS attributes (X) can be removed
       • None of the constants in the LHS pattern can be upgraded to "_", i.e. t_p is "most general" (applies to variable CFDs only)

  7. Problem Statement
     • Minimal CFDs: example
       φ2 = ([CC, AC] → CT, (44, 131 ‖ EDI))
       • Constant CFD
       • True for t5 and t6
       • Can't remove CC or AC from the LHS
       • → Minimal CFD
       φ3 = ([CC, AC] → CT, (01, 212 ‖ NYC))
       • Only true for t3
       • Still holds even if CC is removed from the LHS
       • → Non-minimal CFD
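The minimality argument above can be checked mechanically. The following is a minimal sketch in Python, not the paper's implementation: `ROWS` is a hypothetical toy instance (it is not the cust table from the slides), and the names `matches`, `holds`, and `is_left_reduced_constant` are invented here for illustration. It tests whether a constant CFD holds and whether any single LHS attribute can be dropped.

```python
WILDCARD = "_"

# Hypothetical toy instance over (CC, AC, CT); not the table from the slides.
ROWS = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
    {"CC": "01", "AC": "131", "CT": "PHI"},
    {"CC": "44", "AC": "908", "CT": "MH"},
]


def matches(row, attrs, pattern):
    """True if the row agrees with the pattern on attrs ("_" matches anything)."""
    return all(pattern[a] == WILDCARD or row[a] == pattern[a] for a in attrs)


def holds(rows, lhs, rhs, pattern):
    """Check the CFD (lhs -> rhs, pattern) on rows."""
    rhs_value_for = {}
    for row in rows:
        if not matches(row, lhs, pattern):
            continue  # the CFD says nothing about this tuple
        if not matches(row, [rhs], pattern):
            return False  # RHS constant violated
        key = tuple(row[a] for a in lhs)
        if rhs_value_for.setdefault(key, row[rhs]) != row[rhs]:
            return False  # two tuples agree on the LHS but differ on the RHS
    return True


def is_left_reduced_constant(rows, lhs, rhs, pattern):
    """A valid constant CFD is left-reduced if no single LHS attribute can go.

    Checking single removals is enough: if the CFD holds for a smaller LHS,
    it also holds for every LHS between that one and the original.
    """
    if not holds(rows, lhs, rhs, pattern):
        return False
    return not any(
        holds(rows, [b for b in lhs if b != a], rhs, pattern) for a in lhs
    )


PHI2 = {"CC": "44", "AC": "131", "CT": "EDI"}
PHI3 = {"CC": "01", "AC": "212", "CT": "NYC"}
print(is_left_reduced_constant(ROWS, ["CC", "AC"], "CT", PHI2))
print(is_left_reduced_constant(ROWS, ["CC", "AC"], "CT", PHI3))
```

On this toy instance the first pattern needs both attributes, while the second still holds with AC alone, mirroring the φ2 / φ3 contrast on the slide.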

  8. Problem Statement
     • Frequent CFDs
       Given a CFD φ = (X → A, t_p) on r, its support sup(φ, r) is the set of tuples t in r such that t[X] ⪯ t_p[X] and t[A] ⪯ t_p[A]. φ is k-frequent if sup(φ, r) contains at least k tuples.
       Example:
       • φ1 = ([CC, AC] → CT, (01, 908 ‖ MH)) → 3-frequent
       • f1: [CC, AC] → CT → 8-frequent
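As a small illustration of the support definition, the sketch below counts the tuples that match a pattern on both the LHS and the RHS. It assumes a hypothetical six-tuple relation `ROWS` (not the cust instance from the slides, so the counts differ from the slide's), and the function names are invented here.

```python
WILDCARD = "_"

# Hypothetical toy instance; the slide's own table is not reproduced here.
ROWS = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
]


def support(rows, lhs, rhs, pattern):
    """Tuples that match the pattern on every LHS attribute and on the RHS."""
    attrs = list(lhs) + [rhs]
    return [
        row
        for row in rows
        if all(pattern[a] == WILDCARD or row[a] == pattern[a] for a in attrs)
    ]


def is_k_frequent(rows, lhs, rhs, pattern, k):
    """A CFD is k-frequent when its support holds at least k tuples."""
    return len(support(rows, lhs, rhs, pattern)) >= k


constant_pattern = {"CC": "01", "AC": "908", "CT": "MH"}
fd_pattern = {"CC": WILDCARD, "AC": WILDCARD, "CT": WILDCARD}  # a plain FD
print(len(support(ROWS, ["CC", "AC"], "CT", constant_pattern)))
print(len(support(ROWS, ["CC", "AC"], "CT", fd_pattern)))
```

A plain FD is the special case whose pattern consists only of "_", so every tuple of the relation supports it.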

  9. CFDMiner Algorithm
     • Goal: given an instance r of R and a support threshold k, find a canonical cover of the k-frequent minimal constant CFDs of the form (X → A, (t_p ‖ a))
     • The algorithm uses the notions of free and closed item sets for a given item set (X, t_p):
       • Closed set: can't be extended without decreasing support
       • Free set: can't be generalized without increasing support
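To make the two notions concrete, here is a brute-force sketch under stated assumptions: an item set is modelled as a frozenset of (attribute, value) pairs, `ROWS` is a hypothetical toy relation, and `cover`, `closure`, and `is_free` are names invented for this example. Real miners compute these sets far more efficiently.

```python
# Hypothetical toy instance over (CC, AC, CT).
ROWS = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "44", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
]


def cover(rows, items):
    """Indices of the rows that contain every (attribute, value) item."""
    return frozenset(
        i for i, row in enumerate(rows) if all(row[a] == v for a, v in items)
    )


def closure(rows, items):
    """Largest item set shared by all rows supporting `items`.

    If no row supports `items`, the input is returned unchanged
    (a simplification made for this sketch).
    """
    matching = [set(rows[i].items()) for i in cover(rows, items)]
    if not matching:
        return frozenset(items)
    return frozenset(set.intersection(*matching))


def is_free(rows, items):
    """Free: dropping any single item strictly increases the support.

    Single removals suffice because support only grows as items are removed.
    """
    items = frozenset(items)
    size = len(cover(rows, items))
    return all(len(cover(rows, items - {item})) != size for item in items)


ac908 = frozenset({("AC", "908")})
print(sorted(closure(ROWS, ac908)))
print(is_free(ROWS, ac908))
print(is_free(ROWS, ac908 | {("CT", "MH")}))
```

Here the set {AC = 908} is free, its closure adds CT = MH, and the extended set {AC = 908, CT = MH} is closed but no longer free.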

  10. CFDMiner Algorithm

  11. CFDMiner Algorithm
      • Relation between free/closed item sets and left-reduced constant CFDs:
        For an instance r of R, a k-frequent left-reduced constant CFD φ = (X → A, (t_p ‖ a)) holds iff:
        1. The item set (X, t_p) is a free k-frequent set and does not contain (A, a)
        2. clo(X, t_p) ⪯ (A, a) (less general), and
        3. (X, t_p) does not contain a smaller free set (Y, s_p) such that
           1. (X, t_p) ⪯ (Y, s_p) (i.e. (Y, s_p) is more general), and
           2. clo(Y, s_p) ⪯ (A, a)

  12. CFDMiner Algorithm
      • Example: φ1 = ([CC, AC] → CT, (01, 908 ‖ MH))
        1. φ1 is a 3-frequent constant CFD, and its LHS matches the free item set ([CC, AC], (01, 908))
        2. That item set contains the free item set ([AC], (908)), which is more general and belongs to the closed set ([AC, CT], (908, MH)). With (Y, s_p) = ([AC], (908)) and (A, a) = (CT, MH), clo(Y, s_p) ⪯ (A, a)
      • → φ1 is not left-reduced
      [Figure: closed sets and free sets that contain (CT, MH)]

  13. CFDMiner Algorithm
      1. Get the k-frequent closed item sets (X, t_p) and their corresponding free sets
      2. Associate with every free item set (Y, s_p) the set RHS(Y, s_p) = (X \ Y, t_p[X \ Y])
      3. Construct an ordered list L that keeps track of all k-frequent free item sets
      4. For each free item set (Y, s_p) in L:
         a. Replace RHS(Y, s_p) with RHS(Y, s_p) ∩ RHS(Y′, s_p[Y′]) where Y′ ⊊ Y
         b. After checking all subsets, CFDMiner outputs the k-frequent CFDs
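The sketch below is a brute-force stand-in for this procedure, not CFDMiner itself. It applies the three free/closed conditions directly (free and k-frequent LHS, RHS item in the closure, RHS item not already in the closure of a smaller LHS) instead of the RHS bookkeeping of steps 2 to 4, and it enumerates item sets exhaustively. `ROWS` is a hypothetical toy relation, all function names are invented here, and k is assumed to be at least 1.

```python
from itertools import combinations

# Hypothetical toy instance over (CC, AC, CT).
ROWS = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "44", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
    {"CC": "44", "AC": "131", "CT": "EDI"},
]


def cover(rows, items):
    """Indices of the rows that contain every (attribute, value) item."""
    return frozenset(
        i for i, row in enumerate(rows) if all(row[a] == v for a, v in items)
    )


def closure(rows, items):
    """Items shared by all rows supporting `items` (input if none do)."""
    matching = [set(rows[i].items()) for i in cover(rows, items)]
    return frozenset(set.intersection(*matching)) if matching else frozenset(items)


def mine_constant_cfds(rows, k):
    """Return {(lhs_items, rhs_item)} for k-frequent left-reduced constant CFDs."""
    attrs = sorted(rows[0])
    found = set()
    for size in range(1, len(attrs)):
        for lhs_attrs in combinations(attrs, size):
            # Only value combinations that occur in the data can be frequent.
            for values in {tuple(r[a] for a in lhs_attrs) for r in rows}:
                items = frozenset(zip(lhs_attrs, values))
                n = len(cover(rows, items))
                if n < k:
                    continue  # not k-frequent
                smaller = [items - {it} for it in items]
                if any(len(cover(rows, s)) == n for s in smaller):
                    continue  # not free
                # RHS items already implied by a smaller LHS are not left-reduced.
                implied = set().union(*(closure(rows, s) for s in smaller))
                for rhs in closure(rows, items) - items - implied:
                    found.add((items, rhs))
    return found


for lhs, rhs in sorted(mine_constant_cfds(ROWS, 3), key=str):
    print(sorted(lhs), "->", rhs)
```

On the toy data with k = 3, the result keeps ([AC] → CT, (908 ‖ MH)) and drops the version with the extra CC = 01 condition, which matches the left-reduction argument of the preceding example slide.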

  14. CTANE Algorithm
      • Goal: levelwise algorithm for discovering minimal k-frequent (variable and constant) CFDs; an extension of the TANE algorithm
      • Briefly, the algorithm works as follows:
        1. Compute the RHS for minimal CFDs with their LHS in L_ℓ (where L_ℓ is the corresponding level in the lattice)
        2. For each (X, t_p) ∈ L_ℓ, look for CFDs
        3. Prune L_ℓ
        4. Generate the next level L_{ℓ+1}
      • The following demonstrative example …
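Only the last step, generating the next level, is sketched here, and in a simplified Apriori-style form. The assumptions are loud ones: a lattice element is modelled as a frozenset of (attribute, pattern value) pairs where the value may be a constant or "_", the function name `next_level` is invented, and the candidate-RHS sets, validity checks, and support pruning of the actual algorithm are left out.

```python
from itertools import combinations


def next_level(level):
    """Join elements of one level into candidates that are one item larger.

    A candidate is kept only if it assigns one pattern value per attribute
    and every subset obtained by dropping one item survived in `level`.
    """
    level = set(level)
    candidates = set()
    for a, b in combinations(level, 2):
        union = a | b
        if len(union) != len(a) + 1:
            continue  # the two elements must differ in exactly one item
        attrs = [attr for attr, _ in union]
        if len(set(attrs)) != len(attrs):
            continue  # e.g. CC = 01 and CC = _ cannot be combined
        if all(union - {item} in level for item in union):
            candidates.add(union)
    return candidates


cc01 = frozenset({("CC", "01")})
cc_any = frozenset({("CC", "_")})
ac908 = frozenset({("AC", "908")})
for candidate in sorted(next_level({cc01, cc_any, ac908}), key=sorted):
    print(sorted(candidate))
```

The subset check is what makes pruning a level pay off: once an element is removed from L_ℓ, none of its supersets is generated at L_{ℓ+1}.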

  15. CTANE Algorithm
      Assume a support threshold k ≥ 3 for attributes [CC, AC, ZIP, STR].
      [Figure: two levels of the lattice and a partial third level for the attributes [CC, AC, ZIP]]

  16. FastCFD Algorithm
      • Goal: find minimal k-frequent variable and constant CFDs with a depth-first search inspired by the FastFD algorithm
      • Key idea: minimal CFDs are minimal covers of difference sets
      • Difference sets:
        • D(t1, t2; r) = {B ∈ attr(R) | t1[B] ≠ t2[B]} (the set of attributes on which t1 and t2 differ)
        • Example: D(t1, t2; r0) = {NM}
        • D_A^r is the set {Y \ {A} | Y ∈ D^r, A ∈ Y}, where D^r denotes the difference sets of r
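The definitions above translate almost directly into code. This is a quadratic, pairwise sketch on a hypothetical three-tuple relation; the function names are invented here, and the paper's own computation relies on more efficient techniques than comparing every pair.

```python
from itertools import combinations

# Hypothetical toy tuples over (CC, AC, CT).
ROWS = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "908", "CT": "MH"},
]


def difference_set(t1, t2):
    """Attributes on which the two tuples disagree."""
    return frozenset(a for a in t1 if t1[a] != t2[a])


def difference_sets(rows):
    """Difference sets of all pairs of tuples."""
    return {difference_set(t1, t2) for t1, t2 in combinations(rows, 2)}


def difference_sets_for(rows, rhs):
    """Difference sets that contain `rhs`, with `rhs` removed."""
    return {d - {rhs} for d in difference_sets(rows) if rhs in d}


def minimal(sets):
    """Keep only the sets that have no proper subset in the family."""
    return {s for s in sets if not any(other < s for other in sets)}


print(sorted(sorted(d) for d in difference_sets(ROWS)))
print(sorted(sorted(d) for d in minimal(difference_sets_for(ROWS, "CT"))))
```

For the attribute CT, the two pairs that disagree on CT give {AC} and {AC, CC}; only {AC} is minimal, so any LHS that determines CT on this toy data has to include AC.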

  17. FastCFD Algorithm
      1. FindCover algorithm:
         I. Extract the list of k-frequent free item sets in r. (A) in the example: free pattern CC = 01, giving r_{CC=01}, with k ≥ 2 for [CC, AC, PN, CT, ZIP, STR]
         II. For each item set, produce the minimal difference sets D_A^m ((B) in the example)
         III. Call FindMin to find the minimal covers of D_A^m
      2. FindMin algorithm (bottom left of the example):
         I. Order the attributes (alphabetically in the example)
         II. Enumerate all subsets of attributes in a depth-first, left-to-right fashion. Example: for the sets {[PN], [AC, CT]}, the possible subsets include [AC, PN], [CT, PN], …
         III. For each candidate subset, verify whether the resulting CFD is minimal. Example: tree for CC = 01 and Y = [AC, PN], looking for STR (input): φ′ = ([CC, AC, PN] → STR, (01, _, _ ‖ _))
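The depth-first enumeration in FindMin can be illustrated with a small minimal-cover search. This is a sketch of the general technique only: `minimal_covers` is a name invented here, the attribute order is passed in explicitly, and the pattern handling and frequency checks of the real algorithm are omitted. A cover is a set of attributes that intersects every difference set.

```python
def minimal_covers(diff_sets, attributes):
    """Enumerate minimal covers depth-first, left to right over `attributes`."""
    diff_sets = [frozenset(d) for d in diff_sets]
    covers = []

    def is_cover(chosen):
        return all(chosen & d for d in diff_sets)

    def dfs(chosen, start):
        if is_cover(chosen):
            # Minimal if dropping any one attribute breaks the cover;
            # supersets of a cover are never minimal, so stop extending.
            if all(not is_cover(chosen - {a}) for a in chosen):
                covers.append(chosen)
            return
        for i in range(start, len(attributes)):
            dfs(chosen | {attributes[i]}, i + 1)

    dfs(frozenset(), 0)
    return covers


# The slide's example family {[PN], [AC, CT]}, attributes in alphabetical order.
result = minimal_covers([{"PN"}, {"AC", "CT"}], ["AC", "CT", "PN"])
print([sorted(c) for c in result])
```

For the slide's example family the search returns [AC, PN] and [CT, PN], the two subsets named in step II. If an empty difference set is present, no cover exists and the list stays empty.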

  18. FastCFD Algorithm
      Frk: list of free sets; D: minimal difference set; A: attributes in R.
      Conditions for checking whether a CFD is valid and whether it is minimal.

  19. FastCFD Algorithm
      • Differences compared to FastFD:
        • More complicated (constants, unnamed variables)
        • k-frequent CFDs instead of 1-frequent FDs
      • Needs an efficient way of computing difference sets
      • NaiveFast algorithm: stripped-partition-based; a naive and fast approach
      • FastCFD algorithm:
        • Considers only the 2-frequent closed item sets in r, which are computed by the CFDMiner algorithm
        • Difference sets can be computed more efficiently
      • Reorder attributes so that those covering the most difference sets are treated first, to improve efficiency
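The reordering heuristic in the last bullet might look like the following one-shot sort. It is an assumption-laden sketch: the function name is invented, ties are broken alphabetically (a choice made here, not taken from the slides), and an actual implementation may recompute the order as difference sets get covered during the search.

```python
def order_by_coverage(attributes, diff_sets):
    """Sort attributes so those appearing in the most difference sets come first."""
    return sorted(
        attributes,
        key=lambda a: (-sum(a in d for d in diff_sets), a),
    )


# Hypothetical family of difference sets.
diff_sets = [{"AC", "CT"}, {"AC", "PN"}, {"PN"}]
print(order_by_coverage(["AC", "CT", "PN"], diff_sets))
```

Trying high-coverage attributes first tends to reach covers with fewer attributes earlier in a depth-first search, which is the efficiency gain the slide refers to.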

  20. Experimental Evaluation
      • Two real-life datasets and a synthetic dataset
      • The experiments studied the effect of:
        • The support threshold k
        • The number of tuples (DBSIZE)
        • The number of columns (arity)
        • The correlation factor (average range of distinct values in an attribute domain)

  21. Experimental Evaluation
      • Scalability w.r.t. DBSIZE

  22. Experimental Evaluation
      • Scalability w.r.t. arity and k

  23. Experimental Evaluation
      • Scalability w.r.t. the correlation factor (CF)
      • The results shown are on the synthetic dataset
      • Similar results were achieved on the real datasets

  24. Summary
      • CFDMiner is efficient in discovering constant CFDs.
      • CTANE works well on databases where the arity is small and the support threshold is large.
      • NaiveFast and FastCFD are very efficient when the arity of the relation is very large.
      • FastCFD is more efficient than the NaiveFast implementation, especially when the arity is large.
