Sherlock Rules: Proof Positive and Negative in Data Cleaning
Matteo Interlandi, Nan Tang
Outline • Motivation • Sherlock Rules • Fundamental problems • Algorithms
Roadblocks to Get Value from Data?
• Data Mining
• Machine Learning
• Rule Discovery
High Quality Data
D:
name  nation  capital
Si    China   Beijing
Yan   China   Shanghai
Ian   China   Tokyo

Data repairing with nation -> capital gives a consistent D':
name  nation  capital
Si    China   Beijing
Yan   China   Beijing
Ian   China   Beijing

Proof positive and negative gives an annotated D'':
name  nation  capital
Si    China   Beijing
Yan   China   Shanghai
Ian   China   Tokyo

Sherlock Rules help.
Proof Positive and Negative

Data tuples t1 to t3:
    name  dep  nation  capital   bornat    officePhn
t1  Si    DA   China   Beijing   ChenYang  28098001
t2  Yan   DA   China   Shanghai  Chengdu   24038698
t3  Ian   ALT  China   Beijing   Hangzhou  33668323

Tuples r1 to r3:
    name  officePhn  mobile
r1  Si    28098001   66700541
r2  Yan   24038698   66706563
r3  Ian   27364928   33668323

Tuples s1 to s3:
    country  capital
s1  China    Beijing
s2  Japan    Tokyo
s3  Chile    Santiago

Using r1 to r3:
• Proof Positive/Negative, Correction: t3[name] (Ian) is correct, t3[officePhn] = 27364928
• Proof Positive/Negative: t3[name] (Ian) is correct, t3[officePhn] is wrong

Using s1 to s3:
• Proof Positive: t1[nation, capital] is correct, t3[nation, capital] is correct
Sherlock Rules

D:
    name  dep  nation  capital   bornat    officePhn
t1  Si    DA   China   Beijing   ChenYang  28098001
t2  Yan   DA   China   Shanghai  Chengdu   24038698
t3  Ian   ALT  China   Beijing   Hangzhou  33668323

Dm:
    country  capital
s1  China    Beijing
s2  Japan    Tokyo
s3  Chile    Santiago

    name  officePhn  mobile
r1  Si    28098001   66700541
r2  Yan   24038698   66706563
r3  Ian   27364928   33668323

[Figure labels: positive, evidence, negative]
Point of Innovation

Integrity Constraints
There do not exist t1, t2 such that t1[X1] = t2[X2] but t1[B1] ≠ t2[B2].
Example: (China, Shanghai) and (China, Beijing) violate the constraint, since China = China but Shanghai ≠ Beijing. The constraint detects the conflict but does not say which value is wrong.

Sherlock Rules
If t1[X1] = t2[X2] and t1[B] = t2[B-], then t1[B] := t2[B+].
Example: (China, Shanghai) matched against (China, Beijing, Shanghai). Here Shanghai is the known-wrong value B- and Beijing is the correct value B+, so the capital is set to Beijing.
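The rule form "if t1[X1] = t2[X2] and t1[B] = t2[B-], then t1[B] := t2[B+]" can be illustrated with a minimal Python sketch. The function name `apply_sherlock_rule`, its parameters, and the dictionary encoding of tuples are assumptions made for illustration and are not the authors' implementation. The branch that marks a value as correct when it already equals B+ is an assumed reading of the "Proof Positive" case, and the column `not_capital` holding the B- value is hypothetical.

```python
def apply_sherlock_rule(t, m, evidence, target, neg_col, pos_col):
    """Apply one rule to data tuple t against master tuple m (both dicts).

    evidence maps data attributes to master attributes (X1 -> X2).
    Returns (repaired copy of t, attributes proven correct, corrections made).
    """
    repaired, positive, corrections = dict(t), set(), {}
    # The rule fires only if the evidence agrees: t1[X1] = t2[X2].
    if any(t[a] != m[b] for a, b in evidence.items()):
        return repaired, positive, corrections
    if t[target] == m[neg_col]:
        # t1[B] = t2[B-]: a known error, so t1[B] := t2[B+].
        corrections[target] = (t[target], m[pos_col])
        repaired[target] = m[pos_col]
        positive |= set(evidence) | {target}
    elif t[target] == m[pos_col]:
        # Assumed case: the value already equals B+, so it is proven correct.
        positive |= set(evidence) | {target}
    return repaired, positive, corrections


# Tuple values taken from the slides.
t3 = {"name": "Ian", "dep": "ALT", "nation": "China", "capital": "Beijing",
      "bornat": "Hangzhou", "officePhn": "33668323"}
r3 = {"name": "Ian", "officePhn": "27364928", "mobile": "33668323"}
phone_fix = apply_sherlock_rule(t3, r3, {"name": "name"},
                                "officePhn", "mobile", "officePhn")

t2 = {"name": "Yan", "nation": "China", "capital": "Shanghai"}
# "not_capital" is a hypothetical column for the B- value in (China, Beijing, Shanghai).
china = {"country": "China", "capital": "Beijing", "not_capital": "Shanghai"}
capital_fix = apply_sherlock_rule(t2, china, {"nation": "country"},
                                  "capital", "not_capital", "capital")
```

In this sketch the office phone of t3 equals the mobile number recorded in r3, so it is treated as a known error and replaced with the recorded office phone, while the matching name is marked as proven correct.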
Applying Multiple Rules
• Pos(t): attributes of t marked +
• Neg(t): attributes of t marked -
• Free(t): attributes of t that carry no mark
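One way to read Pos(t), Neg(t), and Free(t) is as a running partition of a tuple's attributes while several rules are applied. The sketch below illustrates that reading under stated assumptions and is not the authors' algorithm: the function `chase`, the tuple encoding of rules, the choice to skip a rule whose target is already in Pos(t), and the annotation-only rules (positive column `None`) are all hypothetical, as is the `not_capital` column.

```python
def chase(t, rules, master):
    """Apply rules to t until nothing changes; return (t, Pos, Neg, Free).

    Each rule is (evidence, target, neg_col, pos_col, table); pos_col may be
    None for a rule that only marks a value as wrong without correcting it.
    """
    t, pos, neg = dict(t), set(), set()
    while True:
        before = (dict(t), set(pos), set(neg))
        for evidence, target, neg_col, pos_col, table in rules:
            if target in pos:
                continue  # assumption: a value proven correct is never changed
            for m in master[table]:
                if any(t[a] != m[b] for a, b in evidence.items()):
                    continue  # evidence does not match this master tuple
                if t[target] == m[neg_col]:
                    pos.update(evidence)
                    if pos_col is None:
                        neg.add(target)         # proven wrong, no fix known
                    else:
                        t[target] = m[pos_col]  # proven wrong and corrected
                        neg.discard(target)
                        pos.add(target)
                    break
                if pos_col is not None and t[target] == m[pos_col]:
                    pos.update(evidence)        # value matches B+, so correct
                    pos.add(target)
                    break
        if (t, pos, neg) == before:
            break  # fixpoint reached
    return t, pos, neg, set(t) - pos - neg


master = {
    "r": [{"name": "Si", "officePhn": "28098001", "mobile": "66700541"},
          {"name": "Yan", "officePhn": "24038698", "mobile": "66706563"},
          {"name": "Ian", "officePhn": "27364928", "mobile": "33668323"}],
    # "not_capital" is a hypothetical B- column, after (China, Beijing, Shanghai).
    "s": [{"country": "China", "capital": "Beijing", "not_capital": "Shanghai"}],
}
rules = [
    ({"name": "name"}, "officePhn", "mobile", "officePhn", "r"),
    ({"nation": "country"}, "capital", "not_capital", "capital", "s"),
]
t2 = {"name": "Yan", "dep": "DA", "nation": "China", "capital": "Shanghai",
      "bornat": "Chengdu", "officePhn": "24038698"}
repaired, pos, neg, free = chase(t2, rules, master)
```

For t2 the phone rule confirms name and officePhn, the capital rule corrects Shanghai to Beijing, and dep and bornat stay in Free(t) because no rule touches them.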
Sherlock Rules in Action
t1 (Si, DA, China, Beijing, ChenYang, 28098001)
t1 (Si+, DA, China, Beijing, ChenYang-, 28098001+)
t1 (Si+, DA, China, Beijing, ShenYang+, 28098001+)
Pos(t1): the attributes marked +, here name, bornat, and officePhn
Transformation Rules
Fundamental Problems
• Termination
• Consistency (coNP-complete)
• Determinism
• Implication (coNP-complete)
Algorithms

Naive Repairing
• chase-based
• O(|R| × |Σ| × |M|)

Fast Repairing
• Similarity indices to reduce |M| (BK-tree, FastSS, n-gram)
• Inverted index to reduce |Σ| (hash map)
• O(|R| × |Σ| × com(S))
• Caching similarity index accesses
• Rule pruning based on dependency
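The slides name the BK-tree as one similarity index for reducing how many master values must be compared against a data value. As a rough illustration, the following is a textbook BK-tree over Levenshtein distance. The names `edit_distance` and `BKTree`, the use of edit distance as the metric, and the search radius are illustrative choices, and nothing here is taken from the authors' implementation. The sketch assumes a non-empty list of words.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, children), children keyed by distance to the word."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, radius):
        """Return all stored words within `radius` edits of `query`."""
        results, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = edit_distance(query, word)
            if d <= radius:
                results.append(word)
            # Triangle inequality: only subtrees keyed in [d - radius, d + radius]
            # can contain a match, so the others are skipped.
            for k, child in children.items():
                if d - radius <= k <= d + radius:
                    stack.append(child)
        return sorted(results)


# Capitals from tuples s1 to s3 on the slides.
capitals = BKTree(["Beijing", "Tokyo", "Santiago"])
near_beijing = capitals.search("Beijng", 1)
near_tokyo = capitals.search("Tokio", 1)
```

A lookup only descends into subtrees whose key lies within the search radius of the query's distance to the current node, which is how the index avoids comparing the query against every stored value.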
Rule Pruning Example
[Figure: rules R1, R2, R3]
t3 (Ian, ALT, Chine, Beijing, Hangzhou, 33668323)
iteration 1: {(R1, Yes), (R2, Yes), (R3, No)}
iteration 2: {(R1, Yes), (R2, No), (R3, No)}
iteration 3: {(R1, Yes), (R2, No), (R3, No)}
Conclusion
• Sherlock rules for accurately annotating and repairing data
• Fundamental problems
• Efficient algorithms