Error Detection FD: [country] -> [capital] name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp 6
Error Detection FD: [country] -> [capital] name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp 6
Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp 6
Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp 6
Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp DC: ⌉ t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax) 6
Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp DC: ⌉ t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax) 6
Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax country capital s1 China Beijing r1 Nan China Beijing Beijing 50000 1000 s2 Canada Ottawa r2 Yin China Shanghai Hongkong 40000 1200 s3 … … r3 Si Netherlands Den Hagg Utrecht 60000 1400 cap r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp DC: ⌉ t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax) 6
Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax country capital s1 China Beijing r1 Nan China Beijing Beijing 50000 1000 s2 Canada Ottawa r2 Yin China Shanghai Hongkong 40000 1200 s3 … … r3 Si Netherlands Den Hagg Utrecht 60000 1400 cap r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp DC: ⌉ t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax) MD: (emp[country] = cap[country]) -> (emp[capital] <=> cap[capital]) 6
Error Detection FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] name country capital city salary tax country capital Inclusion dependency Currency constraint Sequential dependency s1 China Beijing r1 Nan China Beijing Beijing 50000 1000 s2 Canada Ottawa r2 Yin China Shanghai Hongkong 40000 1200 Aggregation constraint Accuracy constraint … … … … s3 … … r3 Si Netherlands Den Hagg Utrecht 60000 1400 cap r4 Lei Netherlands Amsterdam Amsterdam 35000 800 emp DC: ⌉ t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax) MD: (emp[country] = cap[country]) -> (emp[capital] <=> cap[capital]) 6
Data Repairing CFD CFDs Functional dependency (ICDE 2012) Rule discovery DCs Denial constraints input (QCRI, VLDB 2014) Currency constraints Error detection (ICDE 2013) Unique columns (QCRI, ICDE 2013, VLDB 2014) Fixing rules (SIGMOD 2014) ... ... input Dashboard Equivalence class Heuristic methods Set cover (VLDB 2013 demo) Generic system Sat solver NADEEF Data repairing Confidence values (SIGMOD 2013) (SIGMOD 2011) ER (SIGMOD 2014 demo) User guided (VLDB 2010 best paper) Open source Commercialize 7
Data Repairing CFD CFDs Functional dependency (ICDE 2012) Rule discovery DCs Denial constraints input (QCRI, VLDB 2014) Currency constraints Error detection (ICDE 2013) Unique columns (QCRI, ICDE 2013, VLDB 2014) Fixing rules (SIGMOD 2014) ... ... input Dashboard Equivalence class Heuristic methods Set cover (VLDB 2013 demo) Generic system Sat solver NADEEF Data repairing Confidence values (SIGMOD 2013) (SIGMOD 2011) ER (SIGMOD 2014 demo) User guided (VLDB 2010 best paper) Open source Commercialize 7
Automated
Computing a Consistent Database D 9
Computing a Consistent Database Dg ? D 9
Computing a Consistent Database Consistent Dg ? D 9
Computing a Consistent Database Consistent Consistent D’ D’’ Dg ? D D … … D’’’ 9
Computing a Consistent Database Consistent Consistent D’ D’’ Dg ? D D … … D’’’ find a D’ such that dist(D,D’) is minimum 9
Computing a Consistent Database name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 Yan China Shanghai 10 Hangzhou 40000 900 r2 a a a r3 Si China Beijing 10 Changsha 60000 1400 Miura China Tokyo 3 Kyoto 35000 800 r4 a a a 10
Computing a Consistent Database FD1: [nationality] -> [capital] FD2: [areacode] -> [capital] name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 Yan China Shanghai 10 Hangzhou 40000 900 r2 a a a r3 Si China Beijing 10 Changsha 60000 1400 Miura China Tokyo 3 Kyoto 35000 800 r4 a a a 10
Computing a Consistent Database FD1: [nationality] -> [capital] FD2: [areacode] -> [capital] name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 Yan China Shanghai 10 Hangzhou 40000 900 r2 a a a r3 Si China Beijing 10 Changsha 60000 1400 Miura China Tokyo 3 Kyoto 35000 800 r4 a a a 10
Computing a Consistent Database FD1: [nationality] -> [capital] FD2: [areacode] -> [capital] name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 Yan China Shanghai 10 Hangzhou 40000 900 r2 a a a Beijing r3 Si China Beijing 10 Changsha 60000 1400 Miura China Tokyo 3 Kyoto 35000 800 r4 a a a Beijing 10
Computing a Consistent Database FD1: [nationality] -> [capital] FD2: [areacode] -> [capital] name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 Yan China Shanghai 10 Hangzhou 40000 900 r2 a a a Beijing r3 Si China Beijing 10 Changsha 60000 1400 Miura China Tokyo 3 Kyoto 35000 800 r4 a a a Beijing Equivalence Vertex SAT … … class cover solver 10
Computing a Consistent Database FD1: [nationality] -> [capital] FD2: [areacode] -> [capital] name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 Yan China Shanghai 10 Hangzhou 40000 900 r2 a a a Beijing r3 Si China Beijing 10 Changsha 60000 1400 Miura China Tokyo 3 Kyoto 35000 800 r4 a a a Beijing Equivalence Vertex SAT … … class cover solver 10
User Guided
Certain Fixes (VLDB 2010 Best Paper) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE r4 Mike Canada Toronto Toronto VLDB 12
Certain Fixes (VLDB 2010 Best Paper) editing rule: ((country, country) -> (capital, capital)) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE r4 Mike Canada Toronto Toronto VLDB 12
Certain Fixes (VLDB 2010 Best Paper) editing rule: ((country, country) -> (capital, capital)) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE r4 Mike Canada Toronto Toronto VLDB 12
Certain Fixes (VLDB 2010 Best Paper) editing rule: ((country, country) -> (capital, capital)) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE Is r2[country] China? r4 Mike Canada Toronto Toronto VLDB YES. 12
Certain Fixes (VLDB 2010 Best Paper) editing rule: ((country, country) -> (capital, capital)) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE Beijing s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE Is r2[country] China? r4 Mike Canada Toronto Toronto VLDB YES. 12
Certain Fixes (VLDB 2010 Best Paper) editing rule: ((country, country) -> (capital, capital)) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE Beijing s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE Is r2[country] China? r4 Mike Canada Toronto Toronto VLDB YES. Is r1[country] China? Is r3[country] China? Is r4[country] Canada? … … … … 12
Certain Fixes (VLDB 2010 Best Paper) editing rule: ((country, country) -> (capital, capital)) name country capital city conf country capital r1 George China Beijing Beijing SIGMOD s1 China Beijing s2 Canada Ottawa r2 Ian China Shanghai Hongkong ICDE Beijing s3 Japan Tokyo r3 Peter China Tokyo Tokyo ICDE Is r2[country] China? r4 Mike Canada Toronto Toronto VLDB YES. Is r1[country] China? Is r3[country] China? Is r4[country] Canada? … … … … check each tuple: not cheap !! 12
precision: + precision: ++ recall: ++ recall: ++ Heuristic Certain (Automated) (User guided) 13
precision: ++ precision: + precision: ++ recall: + recall: ++ recall: ++ Heuristic Certain Fixing Rules � (Automated) (Automated) (User guided) 13
China Shanghai country capital Data patterns 14
evidence China negative Shanghai country capital Data patterns 14
evidence China negative Shanghai country China capital T okyo Data patterns 14
evidence China negative Shanghai ? country (China, Beijing) China capital T okyo (Japan, T okyo) Data patterns 14
evidence China negative Shanghai ? country (China, Beijing) China capital T okyo (Japan, T okyo) name Ian Data patterns work mail ian@gmail.com 14
evidence China negative Shanghai ? country (China, Beijing) China capital T okyo (Japan, T okyo) evidence name Ian Data patterns work mail negative ian@gmail.com 14
evidence China negative Shanghai ? country (China, Beijing) China capital T okyo (Japan, T okyo) evidence name Ian Data patterns work mail negative ian@gmail.com city Beijing area code 110002 14
evidence China negative Shanghai ? country (China, Beijing) China capital T okyo (Japan, T okyo) evidence name Ian Data patterns work mail negative ian@gmail.com evidence city Beijing area code negative 110002 14
Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing 15
Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing country {capital capital China Shanghai Beijing Hongkong 15
Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative country {capital capital China Shanghai Beijing Hongkong 15
Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative fact country {capital capital China Shanghai Beijing Hongkong 15
Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative fact country {capital capital China Shanghai Beijing Hongkong name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto 15
Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative fact country {capital capital China Shanghai Beijing Hongkong name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto 15
Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative fact country {capital capital China Shanghai Beijing Hongkong name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou Beijing r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto 15
Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative fact country {capital capital China Shanghai Beijing Hongkong name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou Beijing r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto 15
Fixing Rules (SIGMOD 2014) • Syntax fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing evidence negative fact country {capital capital China Shanghai Beijing Hongkong name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou Beijing r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto 15
Confidence values � Interaction � … …
Matching and Repairing (SIGMOD 2011) FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital)) name nationality capital bornAt country capital Nan China Beijing Shenyang China Beijing r1 s1 (0.9) (1.0) (1.0) (0.9) (1.0) (1.0) s2 Canada Ottawa Yan China Beijing Hangzhou r2 (1.0) (1.0) (0.8) (1.0) (0.5) (0.9) Si Canada Toronto Changsha Japan Tokyo r3 s3 (0.9) (1.0) (0.5) (0.8) (1.0) (1.0) Miura Canada Vancuver Kyoto r4 (0.9) (0.9) (0.5) (1.0) 17
Matching and Repairing (SIGMOD 2011) FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital)) name nationality capital bornAt country capital Nan China Beijing Shenyang China Beijing r1 s1 (0.9) (1.0) (1.0) (0.9) (1.0) (1.0) s2 Canada Ottawa Yan China Beijing Hangzhou r2 (1.0) (1.0) (0.8) (1.0) (0.5) (0.9) Si Canada Toronto Changsha Japan Tokyo r3 s3 (0.9) (1.0) (0.5) (0.8) (1.0) (1.0) Miura Canada Vancuver Kyoto r4 (0.9) (0.9) (0.5) (1.0) 17
Matching and Repairing (SIGMOD 2011) FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital)) name nationality capital bornAt country capital Nan China Beijing Shenyang China Beijing r1 s1 (0.9) (1.0) (1.0) (0.9) (1.0) (1.0) s2 Canada Ottawa Yan China Beijing Hangzhou r2 (1.0) (1.0) (0.8) (1.0) (0.5) (0.9) Si Canada Toronto Changsha Japan Tokyo r3 s3 (0.9) (1.0) (0.5) (0.8) (1.0) (1.0) Miura Canada Vancuver Kyoto r4 (0.9) (0.9) (0.5) (1.0) 17
Matching and Repairing (SIGMOD 2011) FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital)) name nationality capital bornAt country capital Nan China Beijing Shenyang China Beijing r1 s1 (0.9) (1.0) (1.0) (0.9) (1.0) (1.0) s2 Canada Ottawa Yan China Beijing Hangzhou r2 (1.0) (1.0) (0.8) (1.0) (0.5) (0.9) Si Canada Toronto Changsha Japan Tokyo Ottawa � r3 s3 (0.9) (1.0) (0.5) (0.8) (1.0) (1.0) (1.0) Miura Canada Vancuver Kyoto r4 (0.9) (0.9) (0.5) (1.0) 17
Matching and Repairing (SIGMOD 2011) FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital)) name nationality capital bornAt country capital Nan China Beijing Shenyang China Beijing r1 s1 (0.9) (1.0) (1.0) (0.9) (1.0) (1.0) s2 Canada Ottawa Yan China Beijing Hangzhou r2 (1.0) (1.0) (0.8) (1.0) (0.5) (0.9) Si Canada Toronto Changsha Japan Tokyo Ottawa � r3 s3 (0.9) (1.0) (0.5) (0.8) (1.0) (1.0) (1.0) Miura Canada Vancuver Kyoto r4 (0.9) (0.9) (0.5) (1.0) 17
Summary of Data Repairing Consistent database (heuristic) Equivalence class Vertex cover Sat solver 18
Summary of Data Repairing Users Consistent database improve accuracy Reference data (heuristic) Equivalence class Confidence value Vertex cover Sat solver 18
Summary of Data Repairing Users Consistent database improve accuracy Reference data (heuristic) Equivalence class Confidence value Vertex cover Sat solver Machine learning Scared GDR 18
Summary of Data Repairing Users Consistent database improve accuracy Reference data (heuristic) Equivalence class Confidence value Vertex cover Sat solver Machine learning Automated and Dependable Scared Fixing rules GDR 18
Generic Data Cleaning System CFD CFDs Functional dependency (ICDE 2012) Rule discovery DCs Denial constraints input (QCRI, VLDB 2014) Currency constraints Error detection (ICDE 2013) Unique columns (QCRI, ICDE 2013, VLDB 2014) Fixing rules (SIGMOD 2014) ... ... input Dashboard Equivalence class Heuristic methods Set cover (VLDB 2013 demo) Generic system Sat solver NADEEF Data repairing Confidence values (SIGMOD 2013) (SIGMOD 2011) ER (SIGMOD 2014 demo) User guided (VLDB 2010 best paper) Open source Commercialize 19
Generic Data Cleaning System CFD CFDs Functional dependency (ICDE 2012) Rule discovery DCs Denial constraints input (QCRI, VLDB 2014) Currency constraints Error detection (ICDE 2013) Unique columns (QCRI, ICDE 2013, VLDB 2014) Fixing rules (SIGMOD 2014) ... ... input Dashboard Equivalence class Heuristic methods Set cover (VLDB 2013 demo) Generic system Sat solver NADEEF Data repairing Confidence values (SIGMOD 2013) (SIGMOD 2011) ER (SIGMOD 2014 demo) User guided (VLDB 2010 best paper) Open source Commercialize 19
NADEEF (SIGMOD 2013) Data Loader Data Rule Collector ETLs, CFDs, MDs, Business rules 20
NADEEF (SIGMOD 2013) NADEEF Data Loader Metadata Management Data Quality Dashboard Data Auditing and Lineage Data owners Experts Indices Probabilistic models Rule Collector Metadata ETLs, CFDs, MDs, Detection and Cleaning Core Business rules Rule Compiler Violation Detection Rules Data Repairing 20
NADEEF (SIGMOD 2013) metadata management and data custodians NADEEF Data Loader Metadata Management Data Quality Dashboard Data Auditing and Lineage Data owners Experts Indices Probabilistic models Rule Collector extensibility Metadata ETLs, CFDs, MDs, Detection and Cleaning Core Business rules Rule Compiler Violation Detection Rules Data Repairing interdependency heterogeneity 20
NADEEF (SIGMOD 2013) 21
NADEEF Online 22
NADEEF for Big Data NADEEF 23
Recommend
More recommend