data cleaning
play

Data Cleaning Nan Tang, QCRI Big Data Cleaning Nan Tang, QCRI - PowerPoint PPT Presentation

Data Cleaning Nan Tang, QCRI Big Data Cleaning Nan Tang, QCRI Big Data Cleaning Nan Tang, QCRI Data Cleaning? 2 Data is Dirty 2 incomplete inconsistent inaccurate Data is Dirty 2 incomplete 25% companies: flawed data


  1. Error Detection FD: [country] -> [capital] name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp 6

  2. Error Detection FD: [country] -> [capital] name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp 6

  3. Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp 6

  4. Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp 6

  5. Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp DC: ⌉ t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax) 6

  6. Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp DC: ⌉ t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax) 6

  7. Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax country capital s1 China Beijing r1 Nan China Beijing Beijing 50000 1000 s2 Canada Ottawa r2 Yin China Shanghai Hongkong 40000 1200 s3 … … r3 Si Netherlands Den Hagg Utrecht 60000 1400 cap r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp DC: ⌉ t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax) 6

  8. Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax country capital s1 China Beijing r1 Nan China Beijing Beijing 50000 1000 s2 Canada Ottawa r2 Yin China Shanghai Hongkong 40000 1200 s3 … … r3 Si Netherlands Den Hagg Utrecht 60000 1400 cap r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp DC: ⌉ t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax) MD: (emp[country] = cap[country]) -> (emp[capital] <=> cap[capital]) 6

  9. Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax country capital Inclusion dependency Currency constraint Sequential dependency s1 China Beijing r1 Nan China Beijing Beijing 50000 1000 s2 Canada Ottawa r2 Yin China Shanghai Hongkong 40000 1200 Aggregation constraint Accuracy constraint … … … … s3 … … r3 Si Netherlands Den Hagg Utrecht 60000 1400 cap r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp DC: ⌉ t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax) MD: (emp[country] = cap[country]) -> (emp[capital] <=> cap[capital]) 6

  10. Data Repairing CFD CFDs Functional dependency (ICDE 2012) Rule discovery DCs Denial constraints input (QCRI, VLDB 2014) Currency constraints Error detection (ICDE 2013) Unique columns (QCRI, ICDE 2013, VLDB 2014) Fixing rules (SIGMOD 2014) ... ... input Dashboard Equivalence class Heuristic methods Set cover (VLDB 2013 demo) Generic system Sat solver NADEEF Data repairing Confidence values (SIGMOD 2013) (SIGMOD 2011) ER (SIGMOD 2014 demo) User guided (VLDB 2010 best paper) Open source Commercialize 7

  11. Data Repairing CFD CFDs Functional dependency (ICDE 2012) Rule discovery DCs Denial constraints input (QCRI, VLDB 2014) Currency constraints Error detection (ICDE 2013) Unique columns (QCRI, ICDE 2013, VLDB 2014) Fixing rules (SIGMOD 2014) ... ... input Dashboard Equivalence class Heuristic methods Set cover (VLDB 2013 demo) Generic system Sat solver NADEEF Data repairing Confidence values (SIGMOD 2013) (SIGMOD 2011) ER (SIGMOD 2014 demo) User guided (VLDB 2010 best paper) Open source Commercialize 7

  12. Automated

  13. Computing a Consistent Database D 9

  14. Computing a Consistent Database Dg ? D 9

  15. Computing a Consistent Database Consistent Dg ? D 9

  16. Computing a Consistent Database Consistent Consistent D’ D’’ Dg ? D D … … D’’’ 9

  17. Computing a Consistent Database Consistent Consistent D’ D’’ Dg ? D D … … D’’’ find a D’ such that dist(D,D’) is minimum 9

  18. Computing a Consistent Database name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 Yan China Shanghai 10 Hangzhou 40000 900 r2 a a a r3 Si China Beijing 10 Changsha 60000 1400 Miura China Tokyo 3 Kyoto 35000 800 r4 a a a 10

  19. Computing a Consistent Database FD1: [nationality] -> [capital] FD2: [areacode] -> [capital] name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 Yan China Shanghai 10 Hangzhou 40000 900 r2 a a a r3 Si China Beijing 10 Changsha 60000 1400 Miura China Tokyo 3 Kyoto 35000 800 r4 a a a 10

  20. Computing a Consistent Database FD1: [nationality] -> [capital] FD2: [areacode] -> [capital] name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 Yan China Shanghai 10 Hangzhou 40000 900 r2 a a a r3 Si China Beijing 10 Changsha 60000 1400 Miura China Tokyo 3 Kyoto 35000 800 r4 a a a 10

  21. Computing a Consistent Database FD1: [nationality] -> [capital] FD2: [areacode] -> [capital] name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 Yan China Shanghai 10 Hangzhou 40000 900 r2 a a a Beijing r3 Si China Beijing 10 Changsha 60000 1400 Miura China Tokyo 3 Kyoto 35000 800 r4 a a a Beijing 10

  22. Computing a Consistent Database FD1: [nationality] -> [capital] FD2: [areacode] -> [capital] name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 Yan China Shanghai 10 Hangzhou 40000 900 r2 a a a Beijing r3 Si China Beijing 10 Changsha 60000 1400 Miura China Tokyo 3 Kyoto 35000 800 r4 a a a Beijing Equivalence Vertex SAT … … class cover solver 10

  23. Computing a Consistent Database FD1: [nationality] -> [capital] FD2: [areacode] -> [capital] name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 Yan China Shanghai 10 Hangzhou 40000 900 r2 a a a Beijing r3 Si China Beijing 10 Changsha 60000 1400 Miura China Tokyo 3 Kyoto 35000 800 r4 a a a Beijing Equivalence Vertex SAT … … class cover solver 10

  24. User Guided

  25. Certain Fixes (VLDB 2010 Best Paper) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE r4 Mike Canada Toronto Toronto VLDB 12

  26. Certain Fixes (VLDB 2010 Best Paper) editing rule: ((country, country) -> (capital, capital)) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE r4 Mike Canada Toronto Toronto VLDB 12

  27. Certain Fixes (VLDB 2010 Best Paper) editing rule: ((country, country) -> (capital, capital)) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE r4 Mike Canada Toronto Toronto VLDB 12

  28. Certain Fixes (VLDB 2010 Best Paper) editing rule: ((country, country) -> (capital, capital)) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE Is r2[country] China? r4 Mike Canada Toronto Toronto VLDB YES. 12

  29. Certain Fixes (VLDB 2010 Best Paper) editing rule: ((country, country) -> (capital, capital)) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE Beijing s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE Is r2[country] China? r4 Mike Canada Toronto Toronto VLDB YES. 12

  30. Certain Fixes (VLDB 2010 Best Paper) editing rule: ((country, country) -> (capital, capital)) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE Beijing s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE Is r2[country] China? r4 Mike Canada Toronto Toronto VLDB YES. Is r1[country] China? Is r3[country] China? Is r4[country] Canada? … … … … 12

  31. Certain Fixes (VLDB 2010 Best Paper) editing rule: ((country, country) -> (capital, capital)) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE Beijing s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE Is r2[country] China? r4 Mike Canada Toronto Toronto VLDB YES. Is r1[country] China? Is r3[country] China? Is r4[country] Canada? … … … … check each tuple: not cheap !! 12

  32. precision: + precision: ++ recall: ++ recall: ++ Heuristic Certain (Automated) (User guided) 13

  33. precision: ++ precision: + precision: ++ recall: + recall: ++ recall: ++ Heuristic Certain Fixing Rules � (Automated) (Automated) (User guided) 13

  34. China Shanghai country capital Data patterns 14

  35. evidence China negative Shanghai country capital Data patterns 14

  36. evidence China negative Shanghai country China capital T okyo Data patterns 14

  37. evidence China negative Shanghai ? country (China, Beijing) China capital T okyo (Japan, T okyo) Data patterns 14

  38. evidence China negative Shanghai ? country (China, Beijing) China capital T okyo (Japan, T okyo) name Ian Data patterns work mail ian@gmail.com 14

  39. evidence China negative Shanghai ? country (China, Beijing) China capital T okyo (Japan, T okyo) evidence name Ian Data patterns work mail negative ian@gmail.com 14

  40. evidence China negative Shanghai ? country (China, Beijing) China capital T okyo (Japan, T okyo) evidence name Ian Data patterns work mail negative ian@gmail.com city Beijing area code 110002 14

  41. evidence China negative Shanghai ? country (China, Beijing) China capital T okyo (Japan, T okyo) evidence name Ian Data patterns work mail negative ian@gmail.com evidence city Beijing area code negative 110002 14

  42. Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing 15

  43. Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing country {capital capital China Shanghai Beijing Hongkong 15

  44. Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative country {capital capital China Shanghai Beijing Hongkong 15

  45. Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative fact country {capital capital China Shanghai Beijing Hongkong 15

  46. Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative fact country {capital capital China Shanghai Beijing Hongkong name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto 15

  47. Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative fact country {capital capital China Shanghai Beijing Hongkong name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto 15

  48. Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative fact country {capital capital China Shanghai Beijing Hongkong name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou Beijing r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto 15

  49. Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative fact country {capital capital China Shanghai Beijing Hongkong name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou Beijing r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto 15

  50. Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative fact country {capital capital China Shanghai Beijing Hongkong name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou Beijing r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto 15

  51. Confidence values � Interaction � … …

  52. Matching and Repairing (SIGMOD 2011) FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital)) name nationality capital bornAt country capital Nan China Beijing Shenyang China Beijing r1 s1 (0.9) (1.0) (1.0) (0.9) (1.0) (1.0) s2 Canada Ottawa Yan China Beijing Hangzhou r2 (1.0) (1.0) (0.8) (1.0) (0.5) (0.9) Si Canada Toronto Changsha Japan Tokyo r3 s3 (0.9) (1.0) (0.5) (0.8) (1.0) (1.0) Miura Canada Vancuver Kyoto r4 (0.9) (0.9) (0.5) (1.0) 17

  53. Matching and Repairing (SIGMOD 2011) FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital)) name nationality capital bornAt country capital Nan China Beijing Shenyang China Beijing r1 s1 (0.9) (1.0) (1.0) (0.9) (1.0) (1.0) s2 Canada Ottawa Yan China Beijing Hangzhou r2 (1.0) (1.0) (0.8) (1.0) (0.5) (0.9) Si Canada Toronto Changsha Japan Tokyo r3 s3 (0.9) (1.0) (0.5) (0.8) (1.0) (1.0) Miura Canada Vancuver Kyoto r4 (0.9) (0.9) (0.5) (1.0) 17

  54. Matching and Repairing (SIGMOD 2011) FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital)) name nationality capital bornAt country capital Nan China Beijing Shenyang China Beijing r1 s1 (0.9) (1.0) (1.0) (0.9) (1.0) (1.0) s2 Canada Ottawa Yan China Beijing Hangzhou r2 (1.0) (1.0) (0.8) (1.0) (0.5) (0.9) Si Canada Toronto Changsha Japan Tokyo r3 s3 (0.9) (1.0) (0.5) (0.8) (1.0) (1.0) Miura Canada Vancuver Kyoto r4 (0.9) (0.9) (0.5) (1.0) 17

  55. Matching and Repairing (SIGMOD 2011) FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital)) name nationality capital bornAt country capital Nan China Beijing Shenyang China Beijing r1 s1 (0.9) (1.0) (1.0) (0.9) (1.0) (1.0) s2 Canada Ottawa Yan China Beijing Hangzhou r2 (1.0) (1.0) (0.8) (1.0) (0.5) (0.9) Si Canada Toronto Changsha Japan Tokyo Ottawa � r3 s3 (0.9) (1.0) (0.5) (0.8) (1.0) (1.0) (1.0) Miura Canada Vancuver Kyoto r4 (0.9) (0.9) (0.5) (1.0) 17

  56. Matching and Repairing (SIGMOD 2011) FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital)) name nationality capital bornAt country capital Nan China Beijing Shenyang China Beijing r1 s1 (0.9) (1.0) (1.0) (0.9) (1.0) (1.0) s2 Canada Ottawa Yan China Beijing Hangzhou r2 (1.0) (1.0) (0.8) (1.0) (0.5) (0.9) Si Canada Toronto Changsha Japan Tokyo Ottawa � r3 s3 (0.9) (1.0) (0.5) (0.8) (1.0) (1.0) (1.0) Miura Canada Vancuver Kyoto r4 (0.9) (0.9) (0.5) (1.0) 17

  57. Summary of Data Repairing Consistent database (heuristic) Equivalence class Vertex cover Sat solver 18

  58. Summary of Data Repairing Users Consistent database improve accuracy Reference data (heuristic) Equivalence class Confidence value Vertex cover Sat solver 18

  59. Summary of Data Repairing Users Consistent database improve accuracy Reference data (heuristic) Equivalence class Confidence value Vertex cover Sat solver Machine learning Scared GDR 18

  60. Summary of Data Repairing Users Consistent database improve accuracy Reference data (heuristic) Equivalence class Confidence value Vertex cover Sat solver Machine learning Automated and Dependable Scared Fixing rules GDR 18

  61. Generic Data Cleaning System CFD CFDs Functional dependency (ICDE 2012) Rule discovery DCs Denial constraints input (QCRI, VLDB 2014) Currency constraints Error detection (ICDE 2013) Unique columns (QCRI, ICDE 2013, VLDB 2014) Fixing rules (SIGMOD 2014) ... ... input Dashboard Equivalence class Heuristic methods Set cover (VLDB 2013 demo) Generic system Sat solver NADEEF Data repairing Confidence values (SIGMOD 2013) (SIGMOD 2011) ER (SIGMOD 2014 demo) User guided (VLDB 2010 best paper) Open source Commercialize 19

  62. Generic Data Cleaning System CFD CFDs Functional dependency (ICDE 2012) Rule discovery DCs Denial constraints input (QCRI, VLDB 2014) Currency constraints Error detection (ICDE 2013) Unique columns (QCRI, ICDE 2013, VLDB 2014) Fixing rules (SIGMOD 2014) ... ... input Dashboard Equivalence class Heuristic methods Set cover (VLDB 2013 demo) Generic system Sat solver NADEEF Data repairing Confidence values (SIGMOD 2013) (SIGMOD 2011) ER (SIGMOD 2014 demo) User guided (VLDB 2010 best paper) Open source Commercialize 19

  63. NADEEF (SIGMOD 2013) Data Loader Data Rule Collector ETLs, CFDs, MDs, Business rules 20

  64. NADEEF (SIGMOD 2013) NADEEF Data Loader Metadata Management Data Quality Dashboard Data Auditing and Lineage Data owners Experts Indices Probabilistic models Rule Collector Metadata ETLs, CFDs, MDs, Detection and Cleaning Core Business rules Rule Compiler Violation Detection Rules Data Repairing 20

  65. NADEEF (SIGMOD 2013) metadata management and data custodians NADEEF Data Loader Metadata Management Data Quality Dashboard Data Auditing and Lineage Data owners Experts Indices Probabilistic models Rule Collector extensibility Metadata ETLs, CFDs, MDs, Detection and Cleaning Core Business rules Rule Compiler Violation Detection Rules Data Repairing interdependency heterogeneity 20

  66. NADEEF (SIGMOD 2013) 21

  67. NADEEF Online 22

  68. NADEEF for Big Data NADEEF 23

Recommend


More recommend