NADEEF: A Commodity Data Cleaning System Data analytics, QCRI Michele Dallachiesa Amr Ebaid Ahmed Eldawy University of Trento Purdue University University of Minnesota Ahmed Elmagarmid Ihab F. Ilyas Mourad Ouzzani Nan Tang 1
2
Bob should be standardized to Robert 2
Country code determines a country 2
2
A Motivating Scenario name street city CC country phn when where David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands tran Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands name street city CC country tel gd bank David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M 3
A Motivating Scenario name street city CC country phn when where David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands tran Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands name street city CC country tel gd bank David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M If a customer’s CC is 31, but his/her ETL rules (lookup table) country is neither Netherlands nor Holland, update the country to Extended CFDs Netherlands; 3
A Motivating Scenario name street city CC country phn when where David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands tran Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands name street city CC country tel gd bank David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M If a customer’s CC is 31, but his/her If the same person from different country is neither Netherlands nor Editing rules tables has different phones, the phone Holland, update the country to ( w.r.t. master data) number from table bank is more Netherlands; reliable; 3
A Motivating Scenario name street city CC country phn when where David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands tran Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands name street city CC country tel gd bank David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M If a customer’s CC is 31, but his/her If the same person from different country is neither Netherlands nor A country code (CC) uniquely tables has different phones, the phone CFDs (FDs) Holland, update the country to determines a country number from table bank is more Netherlands; reliable; 3
A Motivating Scenario name street city CC country phn when where David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands tran Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands name street city CC country tel gd bank David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M If a customer’s CC is 31, but his/her If two purchases of the same person If the same person from different country is neither Netherlands nor happened in the Netherlands and the A country code (CC) uniquely Write a special-purpose tables has different phones, the phone Holland, update the country to US (East Coast) within 1 hour, these determines a country application number from table bank is more Netherlands; two purchases might be a fraud. reliable; 3
A Motivating Scenario name street city CC country phn when where David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands tran Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands name street city CC country tel gd bank David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M If a customer’s CC is 31, but his/her If two purchases of the same person If the same person from different country is neither Netherlands nor happened in the Netherlands and the A country code (CC) uniquely Write a special-purpose tables has different phones, the phone Holland, update the country to US (East Coast) within 1 hour, these determines a country application number from table bank is more Netherlands; two purchases might be a fraud. reliable; Challenging to capture multiple types of rules 3
The User Perspective These are our data quality rules CFD MD Customized rule 4
The User Perspective These are our data quality rules CFD Data Cleaning MD System Customized rule 4
The User Perspective These are our data quality rules CFD Data Cleaning MD System Customized rule Easy to specify and easy to deploy 4
Challenges • Heterogeneity • Interdependency • Deployment and extensibility • Metadata management and user interaction 5
Challenges • Heterogeneity • Integrity constraints (CFDs, DCs) ETL rules, customized rules • Interdependency • Deployment and extensibility • Metadata management and user interaction 5
Challenges • Heterogeneity • Integrity constraints (CFDs, DCs) ETL rules, customized rules • Interdependency • Interaction of various types of rules • Deployment and extensibility • Metadata management and user interaction 5
Challenges • Heterogeneity • Integrity constraints (CFDs, DCs) ETL rules, customized rules • Interdependency • Interaction of various types of rules • Deployment and • Download, compile and run Extend with new cleaning solutions extensibility • Metadata management and user interaction 5
Challenges • Heterogeneity • Integrity constraints (CFDs, DCs) ETL rules, customized rules • Interdependency • Interaction of various types of rules • Deployment and • Download, compile and run Extend with new cleaning solutions extensibility • Metadata management • Dashboard and metadata profiling and user interaction 5
NADEEF Architecture Data Data Loader Rule Collector 6
NADEEF Architecture Data Data Loader Rule Collector Detection and Cleaning Algorithms Core Rule compiler Rules Violation detection Data repairing 6
NADEEF Architecture NADEEF Metadata Management Data Quality Dashboard Data Data Loader Auditing and lineage Indices Probabilistic models Rule Collector Metadata Detection and Cleaning Algorithms Core Rule compiler Rules Violation detection Data repairing 6
NADEEF Architecture NADEEF Metadata Management Data Quality Dashboard Data Data Loader Auditing and lineage Indices Users Probabilistic models Rule Collector Metadata Detection and Cleaning Algorithms Core Rule compiler Rules Violation detection Data repairing 6
NADEEF Architecture metadata management and data custodians NADEEF Metadata Management Data Quality Dashboard Data Data Loader Auditing and lineage Indices Users Probabilistic models extensibility Rule Collector Metadata Detection and Cleaning Algorithms Core Rule compiler Rules Violation detection Data repairing heterogeneity interdependency 6
NADEEF Architecture metadata management and data custodians NADEEF Demo at Metadata Management Data Quality Dashboard VLDB 2013 Data Data Loader Auditing and lineage Indices Users Probabilistic models extensibility Rule Collector Metadata Detection and Cleaning Algorithms Core Rule compiler Rules Violation detection Data repairing heterogeneity interdependency 6
NADEEF Architecture metadata management and data custodians NADEEF Demo at Metadata Management Data Quality Dashboard VLDB 2013 Data Data Loader Auditing and lineage Indices Users Probabilistic models extensibility Rule Collector Metadata Detection and Cleaning Algorithms Core Rule compiler Rules Violation detection Data repairing heterogeneity interdependency 6
NADEEF Architecture metadata management and data custodians NADEEF Demo at Metadata Management Data Quality Dashboard VLDB 2013 Data Data Loader Auditing and lineage Indices Users Probabilistic models extensibility Rule Collector Metadata Detection and Cleaning Algorithms Core Rule compiler Rules Violation detection Data repairing heterogeneity interdependency A commodity data cleaning system 6
Programming Interface NADEEF Rules 7
Programming Interface NADEEF Rules Rule static semantics dynamic semantics vio(tuple t) fix(violation V) vio(tuple t 1 , tuple t 2 ) 7
Programming Interface NADEEF Rules Rule static semantics dynamic semantics vio(tuple t) fix(violation V) vio(tuple t 1 , tuple t 2 ) A simple and declarative interface 7
Sample Rules ( tran ) If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands. 8
Recommend
More recommend