Dependable Data Repairing with Fixing Rules
Jiannan Wang, Nan Tang

Data is Dirty
• incomplete, inconsistent, inaccurate, …
• 25% companies: flawed data
• 3+ trillion $: US economy
• 20%: labor productivity
• Big (clean) data: new oil

State-of-the-art
• Data transformation (ETL rules)
• Entity resolution (deduplication)
• Typos (syntactic errors)
• Truth discovery
• Statistical / ML
• Constraint (dependency) based data cleaning

Dependency Theory
• Data dependencies (a.k.a. integrity constraints)
FD: [country] -> [capital]

      name    country  capital   city      conf
r1    George  China    Beijing   Beijing   SIGMOD
r2    Ian     China    Shanghai  Hongkong  ICDE
r3    Peter   China    Tokyo     Tokyo     ICDE
r4    Mike    Canada   Toronto   Toronto   VLDB

Data dependencies are not sufficient to guide dependable data repairing

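To make the limitation concrete, here is a minimal sketch of checking the FD [country] -> [capital] against the example table. The function name `fd_violations` and the row dictionaries are illustrative assumptions, not something taken from the talk.

```python
from collections import defaultdict

# Rows copied from the example table on the slide.
ROWS = [
    {"id": "r1", "name": "George", "country": "China", "capital": "Beijing", "city": "Beijing", "conf": "SIGMOD"},
    {"id": "r2", "name": "Ian", "country": "China", "capital": "Shanghai", "city": "Hongkong", "conf": "ICDE"},
    {"id": "r3", "name": "Peter", "country": "China", "capital": "Tokyo", "city": "Tokyo", "conf": "ICDE"},
    {"id": "r4", "name": "Mike", "country": "Canada", "capital": "Toronto", "city": "Toronto", "conf": "VLDB"},
]


def fd_violations(rows, lhs, rhs):
    """Return lhs values that map to more than one distinct rhs value.

    Hypothetical helper: the result maps each violating lhs value to
    (sorted distinct rhs values, ids of the rows involved).
    """
    values = defaultdict(set)
    members = defaultdict(list)
    for row in rows:
        values[row[lhs]].add(row[rhs])
        members[row[lhs]].append(row["id"])
    return {
        key: (sorted(vals), members[key])
        for key, vals in values.items()
        if len(vals) > 1
    }


violations = fd_violations(ROWS, "country", "capital")
print(violations)
```

The check reports that r1, r2 and r3 disagree on the capital of China, but it does not say which of the three values is wrong or what the correct one is. It also reports nothing for r4, because r4 is the only Canada tuple and so cannot conflict with anything.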
User Guidance
editing rule: ((country, country) -> (capital, capital))

      name    country  capital   city      conf
r1    George  China    Beijing   Beijing   SIGMOD
r2    Ian     China    Shanghai  Hongkong  ICDE
r3    Peter   China    Tokyo     Tokyo     ICDE
r4    Mike    Canada   Toronto   Toronto   VLDB

      country  capital
s1    China    Beijing
s2    Canada   Ottawa
s3    Japan    Tokyo

• Is r2[country] China? YES. r2[capital] becomes Beijing.
• Is r1[country] China? Is r3[country] China? Is r4[country] Canada? …
• check each tuple: not cheap!!

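The interaction on this slide can be sketched as follows, treating the s1 to s3 table as reference data. Everything here is an assumption made for illustration: the names `MASTER`, `ANSWERS`, `confirm` and `apply_editing_rule` are hypothetical, and the user is simulated with a lookup table in which only the r2 answer (the one shown on the slide) is filled in.

```python
# Reference table s1..s3 from the slide: country -> capital.
MASTER = {"China": "Beijing", "Canada": "Ottawa", "Japan": "Tokyo"}

ROWS = [
    {"id": "r1", "country": "China", "capital": "Beijing"},
    {"id": "r2", "country": "China", "capital": "Shanghai"},
    {"id": "r3", "country": "China", "capital": "Tokyo"},
    {"id": "r4", "country": "Canada", "capital": "Toronto"},
]

# Simulated user. Only the r2 answer appears on the slide; rows without an
# answer are treated as unconfirmed (an assumption of this sketch).
ANSWERS = {"r2": True}
asked = []


def confirm(row_id, attr, value):
    """Record the question put to the user and return the simulated answer."""
    asked.append(f"Is {row_id}[{attr}] {value}?")
    return ANSWERS.get(row_id, False)


def apply_editing_rule(row, master, confirm):
    """Copy the capital from the reference table once the user has
    confirmed that the row's country value is correct."""
    if not confirm(row["id"], "country", row["country"]):
        return row
    capital = master.get(row["country"])
    if capital is None:
        return row
    return {**row, "capital": capital}


repaired = [apply_editing_rule(row, MASTER, confirm) for row in ROWS]
print(repaired[1])
print(len(asked), "questions asked")
```

One question is asked per tuple, which is the cost the slide points at: the repair of r2 is trustworthy, but the user has to be consulted for every row.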
• Heuristic (Automated): precision +, recall ++
• Fixing Rules (Automated): precision ++, recall +
• Certain (User guided): precision ++, recall ++

negative evidence

Data patterns
• country = China (evidence), capital = Shanghai (negative)
• country = China, capital = Tokyo: ? (China, Beijing), (Japan, Tokyo)
• name = Ian (evidence), work mail = ian@gmail.com (negative)
• city = Beijing (evidence), area code = 110002 (negative)

Fixing Rules
• Syntax
fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing

      evidence: country = China
      negative: capital in {Shanghai, Hongkong}
      fact:     capital = Beijing

• Deterministic
• Conservative

      name    country  capital              city      conf
r1    George  China    Beijing              Beijing   SIGMOD
r2    Ian     China    Shanghai -> Beijing  Hongkong  ICDE
r3    Peter   China    Tokyo                Tokyo     ICDE
r4    Mike    Canada   Toronto              Toronto   VLDB

Applying One Fixing Rule

      country  capital (negative)   capital (fact)
      China    Shanghai, Hongkong   Beijing

r2    Ian  China  Shanghai  Hongkong  ICDE
r2'   Ian  China  Beijing   Hongkong  ICDE

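A fixing rule and its application to a single tuple can be sketched in a few lines. This is an illustrative encoding under stated assumptions, not the authors' implementation: the `FixingRule` class, its field names and `apply_rule` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FixingRule:
    """Hypothetical encoding of a fixing rule."""
    evidence: dict   # attribute -> value that must match exactly
    attr: str        # attribute the rule may repair
    negative: set    # known-wrong values for attr
    fact: str        # value to write when a negative value is found


def apply_rule(rule, row):
    """Return a repaired copy of row if the rule fires, else row unchanged.

    The rule fires only when every evidence attribute matches and the
    target attribute holds one of the listed negative values.
    """
    matches = all(row.get(a) == v for a, v in rule.evidence.items())
    if matches and row.get(rule.attr) in rule.negative:
        return {**row, rule.attr: rule.fact}
    return row


# fR1 from the slides.
fr1 = FixingRule(
    evidence={"country": "China"},
    attr="capital",
    negative={"Shanghai", "Hongkong"},
    fact="Beijing",
)

r2 = {"name": "Ian", "country": "China", "capital": "Shanghai", "city": "Hongkong", "conf": "ICDE"}
r3 = {"name": "Peter", "country": "China", "capital": "Tokyo", "city": "Tokyo", "conf": "ICDE"}

r2_fixed = apply_rule(fr1, r2)
r3_fixed = apply_rule(fr1, r3)
print(r2_fixed["capital"], r3_fixed["capital"])
```

Under this encoding r2 is repaired, while r3 is left alone: Tokyo is not among fR1's negative values, so the rule does not fire on it. That is one way to read the "Conservative" label on the earlier slide.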
Applying Multiple Fixing Rules
• Fixes
fR1': (([country], [China]), (capital, {Shanghai, Hongkong, Tokyo})) -> Beijing
fR3:  (([capital, city, conf], [Tokyo, Tokyo, ICDE]), (country, {China})) -> Japan

Applying fR1':
r2    Ian  China  Shanghai  Hongkong  ICDE
r2'   Ian  China  Beijing   Hongkong  ICDE

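Applying several rules can be sketched as repeated passes over the rule list until nothing changes. The rule contents below are read off the flattened slide (fR1' with Tokyo added to its negative values, fR3 keyed on capital, city and conf), so they are an assumption, as are the tuple encoding of a rule, the helper names and the `max_passes` guard.

```python
def apply_rule(rule, row):
    """rule = (evidence dict, target attribute, negative values, fact)."""
    evidence, attr, negative, fact = rule
    if all(row[a] == v for a, v in evidence.items()) and row[attr] in negative:
        return {**row, attr: fact}
    return row


def apply_in_order(rules, row, max_passes=10):
    """Apply rules in the given order until a full pass changes nothing.

    max_passes is a safety bound for this sketch, not part of the method.
    """
    for _ in range(max_passes):
        changed = False
        for rule in rules:
            new_row = apply_rule(rule, row)
            if new_row != row:
                row, changed = new_row, True
        if not changed:
            break
    return row


# Assumed reading of the two rules on the slide.
fr1_prime = ({"country": "China"}, "capital", {"Shanghai", "Hongkong", "Tokyo"}, "Beijing")
fr3 = ({"capital": "Tokyo", "city": "Tokyo", "conf": "ICDE"}, "country", {"China"}, "Japan")

r2 = {"name": "Ian", "country": "China", "capital": "Shanghai", "city": "Hongkong", "conf": "ICDE"}
r3 = {"name": "Peter", "country": "China", "capital": "Tokyo", "city": "Tokyo", "conf": "ICDE"}

r2_a = apply_in_order([fr1_prime, fr3], r2)
r2_b = apply_in_order([fr3, fr1_prime], r2)
r3_a = apply_in_order([fr1_prime, fr3], r3)
r3_b = apply_in_order([fr3, fr1_prime], r3)
print(r2_a == r2_b, r3_a == r3_b)
```

With these two rules r2 ends up the same whichever rule runs first, but r3 does not: fR1' first gives (China, Beijing), fR3 first gives (Japan, Tokyo). A rule set with this property yields order-dependent repairs, so it would need to be checked before being applied automatically.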