Data Linkage Advanced Topics in Database Management (INFSCI 2711) Some materials are from Similarity Joins in Relational Database Systems, Augsten and Bohlen and Semantic Web for Working Ontologists, Allemang and Hendler Vladimir Zadorozhny, DINS, University of Pittsburgh Problem with matching related records A B Equi-join on name attribute: SELECT * FROM A,B WHERE A.name = B.name Problem: exact match does not work Solution: approximate match based on a “distance” measure between strings: SELECT * FROM A,B WHERE distance(A.name, B.name) <= k ??????? 1
Edit-Based Distance Minimum number of edit operations that transform one string into another. I G Edit operations : (1) insertion of a character, (2) deletion of a character, (3) replacing a character in the string by another character. Instead of counting the edit operations, a cost can be assigned to each I operation. G we introduce the empty character e G cost of transforming character a to character b : c(a,b) G cost for inserting b : c( e ,b) G cost for deleting a : c(a, e ) G cost for replacing a by b : c(a, b) G Unit cost model : all costs c(a,b) , a ¹ b are one . The string edit distance , sed( s1, s2 ), between two strings s1 and s2 is the I minimum cost sequence of edit operations that transforms s1 to s2 . Example: sed( banana , ananas ) = 2 I G remove the first character b from banana è anana G insert a new character s at the end: anana è ananas Similarity Join based on Edit Distance A B Edit distance join on name attribute: SELECT * FROM A,B WHERE sed(A.name, B.name) <= k Result of edit distance join for k = 3 2
Performance In a similarity join all pairs of records must be considered: it is essential that the I query predicate is evaluated efficiently. Often there is a preferable distance measure from the application point of view; this I distance is expensive to compute. Algorithm 1: a nested- loop similarity join which returns all pairs of records (s1,s2) I from the sets (tables) X and Y such that distance( s1,s2 ) £ k. Algorithm 1: nestedLoopNaive( X,Y ) G foreach s1 Î X do G foreach s1 Î Y do G if distance( s1,s2 ) £ k G output (s1,s2) ; G The number of calls to the expensive distance function is |X | ´ |Y| . I In many applications the result size of the similarity join is much smaller than the I cross product. Similarity Join based on Edit Distance A B Number of all record pairs, for which the similarity is computed: |A| ´ |B| = 16 Result size of the similarity join is much smaller than the cross product Result size of edit distance join ( for k = 3) = 4 Many of the record pairs, for which the similarity is computed, are very different from each other. è filters come into play . 3
Filters Filters preprocess the input sets and produce a set of candidate pairs, which is a subset of the cross product, C Í X ´ Y . The distance function is then evaluated only on the candidate pairs. If the filter condition can be evaluated faster than the distance function, the overall join will be faster. X = {1, 6, 11} Y = {2, 5, 20} X ´ Y = { (1,2), (1,5), (1,20), (6,2), (6,5), (6,20), (11,2), (11,5), (11,20) } Check: 2*x + 2*y < 10 (“distance” function) Filters : x > 10 è 2*x + 2*y >10, C1 = X ´ Y - { (11,2), (11,5), (11,20) } y > 10 è 2*x + 2*y >10 C2 = X ´ Y - { (1,20), (6,20), (11,20) } x + y > 10 è 2*x + 2*y >10 C3 = X ´ Y - { (1,20), (6,5), (6,20), (11,2), (11,5), (11,20) } Error Types of Filters Filters place a pair (s1,s2) Î X ´ Y into the candidate set based on a fast guess. There are four possibilities for this guess: Result Pair Candidate Pair • On true positives and true negatives the filter guess is correct. • Candidates that do not qualify for the result set are false positives. Since the actual distance is computed on all pairs in the candidate set, false positives are removed. • False positives increase the runtime, but do not affect the correctness. • A false negative is a pair that should be in the result set, but does not make it into the candidate set and is thus missed. • Ideally, a filter produces no false negatives (to guarantee the correctness of the result) and few false positives (to increase the efficiency) . 4
Example Filters preprocess the input sets and produce a set of candidate pairs, which is a subset of the cross product, C Í X ´ Y . X = {1, 6, 11} Y = {2, 5, 20} X ´ Y = { (1,2), (1,5), (1,20), (6,2), (6,5), (6,20), (11,2), (11,5), (11,20) } Check: 2*x + 2*y < 10 (“distance” function) Filters : x > 10 è 2*x + 2*y >10, C1 = X ´ Y - { (11,2), (11,5), (11,20) } true negative = { (1,2), (1,5), (1,20), (6,2), (6,5), (6,20) } true positive, false positive y > 10 è 2*x + 2*y >10 C2 = X ´ Y - { (1,20), (6,20), (11,20) } = { (1,2), (1,5), (6,2), (6,5), (11,2), (11,5), (11,20) } x + y > 10 è 2*x + 2*y >10 C3 = X ´ Y - { (1,20), (6,5), (6,20), (11,2), (11,5), (11,20) } = { (1,2), (1,5), (6,2) } Lower and upper bounds a lower bound function produces a value that is within the distance value for I any pair of input records. an upper bound function produces a value that is equal or larger than the I distance value. More formally, for any pair (s1,s2) of records lowerBound( s1,s2 ) is a lower I bound and upperBound( s1,s2 ) is an upper bound for the function distance iff G distance( s1,s2 ) ³ lowerBound( s1,s2 ) G distance( s1,s2 ) £ upperBound( s1,s2 ) 5
Example Filters preprocess the input sets and produce a set of candidate pairs, which is a subset of the cross product, C Í X ´ Y . X = {1, 6, 11} Y = {2, 5, 20} X ´ Y = { (1,2), (1,5), (1,20), (6,2), (6,5), (6,20), (11,2), (11,5), (11,20) } Check: 2*x + 2*y < 10 (“distance” function) Filters : Lower Bounds: 2*x + 2*y ≥ x, 2*x + 2*y ≥ y, 2*x + 2*y ≥ x+y x > 10 è 2*x + 2*y >10, C1 = X ´ Y - { (11,2), (11,5), (11,20) } true negative y > 10 è 2*x + 2*y >10 C2 = X ´ Y - { (1,20), (6,20), (11,20) } x + y > 10 è 2*x + 2*y >10 C3 = X ´ Y - { (1,20), (6,5), (6,20), (11,2), (11,5), (11,20) } Upper Bounds: 2*x + 2*y ≤ 4*max(x,y) 4*max(x,y) < 10 è 2*x + 2*y < 10, C1 = X ´ Y - { (1,2) } true positive Similarity Join with bounds Useful properties of lower and upper bounds: I G upperBound(s1,s2) £ k è distance(s1,s2) £ k è (s1,s2) is in result set (true positive) G lowerBound(s1,s2) > k è distance(s1,s2) > k è (s1,s2) is not in result set (true negative) These properties are used to rephrase the nested loop similarity join. I G Algorithm 2: nestedLoopWithBounds( X,Y ) G foreach s1 Î X do • foreach s1 Î Y do Before the distance(s1,s2) is G computed the lower and upper bound if upperBound( s1,s2 ) £ k G are evaluated. If upperBound(s1,s2) £ k è (s1,s2) is • output (s1,s2) ; G in result set (true positive) else if lowerBound( s1,s2 ) > k G • If lowerBound(s1,s2) > k è (s1,s2) is /* nothing to do */ G not in result set (true negative) else if distance( s1,s2 ) £ k • All other pairs are candidates and the G distance function must be called to output (s1,s2) ; G remove false positives. There are no false negatives. 6
Developing bounds for distance functions I Only bounding functions that are evaluated faster than the distance function are useful. I Then the similarity join with upper and lower bounds is typically much faster than the naive nested loop join. The use of lower and upper bounds in the similarity join is very appealing I since the join can be computed faster without sacrificing the correctness of the result. Upper and lower bounds have been developed for the edit distance I between strings Length Filter I The length of two strings s1 and s2 that are at edit distance k = sed( s1,s2 ) cannot differ by more than k, since k insertions or deletions are required to get strings of the same length. G sed( s1,s2 ) ³ abs( |s1| - |s2| ) lowerBound(s1,s2) I The length filter can be evaluated in constant time. Its effectiveness depends on the length distribution of the strings in the dataset. The filter is effective only for strings with different lengths. For datasets, I where many string have similar lengths, many false positives will be produced for the candidate set. 7
Using Length Filter A B 13 18 16 11 10 13 13 13 SELECT * FROM A,B WHERE ABS(LENGTH(A.name)-LENGTH(B.name)) <= k (lowerBound > k è true negative) AND sed(A.name, B.name) <= k Number of all record pairs, for which the similarity is computed ( for k = 3) : 12 < |A| ´ |B| = 16 Te length filter will prune the pairs (J. R. R. Tolkien, C. S. Lewis), (Frodo Baggins, John R. R. Tolkien), (C.S. Lewis, John R. R. Tolkien), (Bilbo Baggins, John R. R. Tolkien). Problems with Edit Distance The edit distance counts the differences between strings, but does not take I into account the string length. This can lead to undesired effects. G sed(International Business Machines Corporation, International Bussiness Machine_Corporation) = 2 G sed(IBM,BMW) = 2 G sed(Int. Business Machines Corp., International Business Machines Corporation) = 17 For above string pairs we cannot find a single distance threshold that I distinguishes matches from non-matches. The value of the edit distance ranges between zero and the length of the I longer string. Overall, the non-normalized edit distance is useful only for very small I thresholds (e.g., when we allow a single typo) or when all involved strings are of similar length. 8
Recommend
More recommend