Data Quality Assurance 25.06.20 | Software Engineering for Artificial Intelligence | A. Alizadeh, T. Ihlefeld | 1

Importance of Data Quality
Source: [1] Swartz (2007)

What is “dirty” data?
“We define an error to be a deviation from its ground truth value.”
Taxonomy of data errors:
- Quantitative: Outliers
- Qualitative: Duplicates, Rule violations, Pattern violations
1. Outliers include data values that deviate from the distribution of values in a column of a table.
2. Duplicates are distinct records that refer to the same real-world entity. If attribute values do not match, this could signify an error.
3. Rule violations refer to values that violate any kind of integrity constraints, such as Not Null constraints and Uniqueness constraints.
4. Pattern violations refer to values that violate syntactic and semantic constraints, such as alignment, formatting, misspelling, and semantic data types.
Source: [2] Abedjan et al. (2016)
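
The four error classes can be illustrated with one small check each. The following is a minimal sketch, not taken from Abedjan et al.: the toy table, the function names, and the MAD-based outlier threshold of 3.5 are all assumptions chosen for illustration.

```python
import re
from statistics import median

# Hypothetical toy table; every value is made up for illustration.
rows = [
    {"id": 1, "name": "Ann Lee", "zip": "64289", "age": 34},
    {"id": 2, "name": "Bob Roy", "zip": "64283", "age": 29},
    {"id": 3, "name": "ann lee", "zip": "64289", "age": 34},   # duplicate of id 1
    {"id": 4, "name": "Cy Park", "zip": "6428",  "age": 31},   # malformed zip
    {"id": 5, "name": None,      "zip": "64285", "age": 30},   # missing name
    {"id": 6, "name": "Di Moss", "zip": "64287", "age": 430},  # implausible age
]

def outliers(rows, column, threshold=3.5):
    """Outliers: values far from the column median (modified z-score via MAD)."""
    values = [r[column] for r in rows]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [r["id"] for r in rows if 0.6745 * abs(r[column] - med) / mad > threshold]

def duplicates(rows, key_columns):
    """Duplicates: records whose normalized key columns coincide."""
    groups = {}
    for r in rows:
        key = tuple(str(r[c]).strip().lower() for c in key_columns)
        groups.setdefault(key, []).append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

def not_null_violations(rows, column):
    """Rule violations: here, a Not Null constraint on one column."""
    return [r["id"] for r in rows if r[column] is None]

def pattern_violations(rows, column, pattern):
    """Pattern violations: values that do not match the expected format."""
    regex = re.compile(pattern)
    return [r["id"] for r in rows if not regex.fullmatch(str(r[column]))]

print(outliers(rows, "age"))
print(duplicates(rows, ["name", "zip"]))
print(not_null_violations(rows, "name"))
print(pattern_violations(rows, "zip", r"\d{5}"))
```

Each function returns the ids of the suspicious records, so the four checks can be run independently on the same table.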

Ways to clean data
Source: [3] Chu et al. (2016)

New / Emerging Challenges
- Scalability
- User Engagement
- Semi-structured and unstructured data
- New Applications for Streaming Data
- Growing Privacy and Security Concerns
Source: [3] Chu et al. (2016)

Simplified Data Quality Assurance Process
- Error Detection: Data Linter; Automating Large-Scale Data Quality Verification
- Data Cleaning: SampleClean, ActiveClean; HoloClean
- Evaluation: CleanML

Data Linter 1/3
“[…] cleaning which, even when automated, is a time-consuming and error-prone process of repeated inspection and correction.”
Data-linter: “[…] analyzes a user’s training data and suggests ways features can be transformed to improve model quality, for a specific model type.”
Source: [4] Hynes et al. (2017)

Data Linter 2/3
Components:
- LintDetectors: each corresponds to a specific issue to search for, for a given model type
- DataLinter: engine that applies the LintDetectors to a data set
- LintExplorer: presents output to the user
Lint Examples:
- Enum as real: An enum (a categorical value) is encoded as a real number. Consider converting to an integer and using an embedding or one-hot vector.
- Uncommon sign detector: The data includes some values that have a different sign (+/-) from the rest of the data (e.g., -9999), which can affect training. If these are special markers in the data, consider replacing them with a more neutral value (e.g., an empty or average value).
Source: [4] Hynes et al. (2017)
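
The two lint examples can be approximated with simple heuristics. This is a minimal sketch and not the Data Linter tool's own code or API: the function names and both thresholds (`max_fraction`, `max_distinct`) are assumptions picked for illustration.

```python
def uncommon_sign_lint(values, max_fraction=0.05):
    """Warn when a small minority of non-zero values has the opposite sign of the rest."""
    negatives = [v for v in values if v < 0]
    positives = [v for v in values if v > 0]
    nonzero = len(negatives) + len(positives)
    if nonzero == 0:
        return None
    minority = negatives if len(negatives) <= len(positives) else positives
    if minority and len(minority) / nonzero <= max_fraction:
        return f"uncommon sign: {len(minority)} of {nonzero} values, e.g. {minority[0]}"
    return None

def enum_as_real_lint(values, max_distinct=10):
    """Warn when a float column holds only a few distinct, integer-valued numbers."""
    if not values or not all(isinstance(v, float) for v in values):
        return None
    distinct = set(values)
    if len(distinct) <= max_distinct and all(v.is_integer() for v in distinct):
        return f"enum as real: {len(distinct)} distinct integer-valued floats"
    return None

# Hypothetical feature columns, made up for illustration.
temperatures = [21.5, 19.0, 22.3] * 10 + [-9999.0]   # -9999.0 acts as a "missing" marker
weekday_codes = [1.0, 2.0, 3.0, 2.0, 1.0] * 4        # categorical codes stored as floats

print(uncommon_sign_lint(temperatures))
print(enum_as_real_lint(weekday_codes))
print(enum_as_real_lint(temperatures))
```

A detector returns a message when its heuristic fires and `None` otherwise, which mirrors the split between detectors that search for an issue and a separate component that presents the findings.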

Data Linter 3/3
End-User Evaluation:
- Led to a DNN model’s precision increasing from 0.48 to 0.59
- After an initial model parameter tuning by the engineer
- User was unaware of the benefits of normalizing inputs to a DNN
- So the tool also served as an educational aid
Data Set Evaluation: (content not preserved)
Source: [4] Hynes et al. (2017)

Automatic Data Quality Verification 1/5
- Declarative API
  - User-defined “unit tests”
  - Combined with custom code
Source: [5] Schelter et al. (2017)
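
The idea of declarative "unit tests" for data, combined with custom code, can be sketched as follows. This is a hypothetical miniature and not the API of the system described by Schelter et al.: the `Check` class, its method names, and the `orders` table are assumptions for illustration.

```python
class Check:
    """Collects declarative constraints; nothing is evaluated until `run` is called."""

    def __init__(self, name):
        self.name = name
        self.constraints = []  # list of (description, function(rows) -> bool)

    def is_complete(self, column):
        def complete(rows):
            return all(r.get(column) is not None for r in rows)
        self.constraints.append((f"{column} is complete", complete))
        return self

    def is_unique(self, column):
        def unique(rows):
            values = [r.get(column) for r in rows]
            return len(values) == len(set(values))
        self.constraints.append((f"{column} is unique", unique))
        return self

    def satisfies(self, description, predicate):
        """Custom code: any row-level predicate supplied by the user."""
        self.constraints.append((description, lambda rows: all(predicate(r) for r in rows)))
        return self

    def run(self, rows):
        return {description: test(rows) for description, test in self.constraints}

# Hypothetical data set, made up for illustration.
orders = [
    {"order_id": 1, "customer": "a", "amount": 12.5},
    {"order_id": 2, "customer": None, "amount": 7.0},
    {"order_id": 3, "customer": "c", "amount": -1.0},
]

check = (Check("orders")
         .is_complete("order_id")
         .is_unique("order_id")
         .is_complete("customer")
         .satisfies("amount is non-negative", lambda r: r["amount"] >= 0))

report = check.run(orders)
for description, passed in report.items():
    print("PASS" if passed else "FAIL", description)
```

The user only states what the data should look like; how each constraint is evaluated stays inside the library.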

Automatic Data Quality Verification 2/5
- Declarative
  - Think about what the data should look like
- Incremental
  - Support for growing data sets
  - Only needs new data set + state
Source: [5] Schelter et al. (2017)
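
"Only needs new data set + state" can be illustrated with a completeness metric whose state is a pair of counters. This is a minimal sketch under that assumption; `CompletenessState` and `state_of` are hypothetical names, not part of any real library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CompletenessState:
    """Sufficient state for completeness: counts, not the data itself."""
    non_null: int = 0
    total: int = 0

    def merge(self, other):
        return CompletenessState(self.non_null + other.non_null, self.total + other.total)

    def metric(self):
        return self.non_null / self.total if self.total else 1.0

def state_of(values):
    values = list(values)
    return CompletenessState(sum(v is not None for v in values), len(values))

# Hypothetical daily batches of one column, made up for illustration.
day1 = ["a", None, "c", "d"]
day2 = ["e", "f", None, None]

state = state_of(day1)                    # computed once and stored
updated = state.merge(state_of(day2))     # only the new batch is scanned

print(state.metric(), updated.metric())
```

Merging the stored state with the state of the new batch gives the same metric as rescanning all data, which is what makes the approach suitable for growing data sets.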

Automatic Data Quality Verification 3/5
- Actual data quality verification
  - Compute required metrics
- Metrics provided by the tool:
  - Completeness
  - Consistency
  - Statistics → used for consistency metrics
Source: [5] Schelter et al. (2017)

Automatic Data Quality Verification 4/5
- Output
  - Fails and successes of constraints
  - “How much” a constraint failed
Source: [5] Schelter et al. (2017)

Automatic Data Quality Verification 5/5
- Learnings
  - Advantages of using a shared data quality library
  - Reuse checks and constraints
  - Reduced manual work on data
Source: [5] Schelter et al. (2017)

SampleClean
[Figure: a big data set of dirty data, contrasting no cleaning with sampling followed by cleaning]
- Two error sources: dirty data and too little data
- Benefits of clean data outweigh the error from using less data
- → Only use a clean sample
Source: [6] Krishnan et al. (2015)
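
The trade-off between the two error sources can be simulated. This is a minimal sketch with synthetic data, not the estimators from Krishnan et al.: the corruption model (20% of values shifted by 1000), the sample size of 200, and the `clean` oracle are assumptions for illustration.

```python
import random

rng = random.Random(0)
N = 10_000

# Synthetic ground truth and a dirty copy in which about 20% of values are corrupted.
true_values = [rng.uniform(0, 100) for _ in range(N)]
dirty_values = [v + 1000 if rng.random() < 0.2 else v for v in true_values]

def clean(index):
    """Stand-in for an expensive cleaning step, assumed to restore the true value."""
    return true_values[index]

def mean(xs):
    return sum(xs) / len(xs)

# Clean only a small random sample instead of the whole data set.
sample_indices = rng.sample(range(N), 200)
cleaned_sample = [clean(i) for i in sample_indices]

true_mean = mean(true_values)
dirty_estimate = mean(dirty_values)      # all data, no cleaning
sample_estimate = mean(cleaned_sample)   # little data, fully cleaned

print(f"true mean:            {true_mean:.1f}")
print(f"dirty, all records:   {dirty_estimate:.1f}")
print(f"clean, 200 records:   {sample_estimate:.1f}")
```

In this setup the sampling error of the small clean sample is much smaller than the bias introduced by the corrupted values, which is the situation the slide describes.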

Simpson’s Paradox
- Another problem: training on partially cleaned data
Source: [7] Krishnan et al. (2016)
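
Simpson's paradox can be shown with a small worked example: a trend that holds within each group reverses when the groups are mixed. The points below are made up for illustration; reading one group as already cleaned records and the other as still-dirty records is an assumption, not data from Krishnan et al.

```python
def slope(points):
    """Least-squares slope of y on x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Hypothetical groups, e.g. cleaned records and still-dirty records.
group_a = [(1, 10), (2, 11), (3, 12)]
group_b = [(6, 1), (7, 2), (8, 3)]

print(slope(group_a))            # upward trend within group A
print(slope(group_b))            # upward trend within group B
print(slope(group_a + group_b))  # downward trend once the groups are mixed
```

A model fitted to the mixture would learn the reversed trend, which is why naively training on a mix of cleaned and dirty records can mislead.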

ActiveClean 1/2
- Extends SampleClean
- Prevents the effects of partially cleaned data
- Uses samples of cleaned data and integrates them into the training of the model
Source: [7] Krishnan et al. (2016)

ActiveClean 2/2
1) Train on dirty data for initial model
2) Select sample records (Sampler)
3) Clean sample (Cleaner)
4) Update weights of model using cleaned sample (Updater)
Source: [7] Krishnan et al. (2016)
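
The four steps above can be sketched as a training loop. This is a strongly simplified illustration, not the ActiveClean algorithm itself: it uses a one-parameter linear model, uniform sampling, plain gradient steps, synthetic data, and a `clean` oracle, all of which are assumptions.

```python
import random

rng = random.Random(0)
TRUE_WEIGHT = 2.0  # assumed ground truth: y = 2 * x

xs = [rng.uniform(0.1, 1.0) for _ in range(1000)]
clean_ys = [TRUE_WEIGHT * x for x in xs]
# Dirty data: roughly 30% of the targets were lost and recorded as 0.
dirty_ys = [0.0 if rng.random() < 0.3 else y for y in clean_ys]

def fit_least_squares(xs, ys):
    """Closed-form least squares for the model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def clean(index):
    """Stand-in for the Cleaner, assumed to restore the true target."""
    return clean_ys[index]

# 1) Train on dirty data for the initial model.
initial_weight = fit_least_squares(xs, dirty_ys)

weight = initial_weight
learning_rate = 0.5
for _ in range(50):
    batch = rng.sample(range(len(xs)), 20)          # 2) select sample records
    cleaned = [(xs[i], clean(i)) for i in batch]    # 3) clean the sample
    # 4) update the weight with the squared-error gradient on the cleaned sample
    gradient = sum((weight * x - y) * x for x, y in cleaned) / len(cleaned)
    weight -= learning_rate * gradient

print(f"initial weight (dirty data): {initial_weight:.3f}")
print(f"weight after cleaned updates: {weight:.3f}")
```

The model starts from the dirty fit and is moved only by gradients computed on cleaned records, so dirty and clean records are never mixed within a single update.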

HoloClean 1/4
- Two tasks of data cleaning
  - 1) Error detection → automation works fine
  - 2) Data repairing → automation fails
Source: [8] Rekatsinas et al. (2017)

HoloClean 2/4
- Qualitative data repairing
  - Integrity constraints
  - External information
- Quantitative data repairing
  - Statistical methods
Source: [8] Rekatsinas et al. (2017)

HoloClean 3/4
- Using them separately yields bad results
- Issue addressed by HoloClean
  - Bad automation for data repairing
  - Solution: combine quantitative and qualitative data repairing
Source: [8] Rekatsinas et al. (2017)
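
Combining a qualitative and a quantitative signal can be illustrated on a tiny table. This sketch is far simpler than HoloClean: an integrity constraint (an assumed functional dependency zip → city) finds the suspicious cells, and a majority vote stands in for the statistical part. The table and function names are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical table; the assumed constraint is the functional dependency zip -> city.
rows = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
]

def violating_groups(rows, lhs, rhs):
    """Qualitative signal: groups of rows that break the dependency lhs -> rhs."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[lhs]].append(index)
    return {key: idx for key, idx in groups.items()
            if len({rows[i][rhs] for i in idx}) > 1}

def repair(rows, lhs, rhs):
    """Quantitative signal: within each violating group, adopt the most frequent value."""
    repaired = [dict(row) for row in rows]
    for indices in violating_groups(rows, lhs, rhs).values():
        most_common, _ = Counter(rows[i][rhs] for i in indices).most_common(1)[0]
        for i in indices:
            repaired[i][rhs] = most_common
    return repaired

repaired = repair(rows, "zip", "city")
print(violating_groups(rows, "zip", "city"))
print(repaired[2])
```

The constraint alone only says that the group is inconsistent; the value frequencies suggest which cell to change and what to change it to.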

HoloClean 4/4
Source: [8] Rekatsinas et al. (2017)

CleanML 1/3
“ML Community has been focusing on understanding the impact of noises to ML models”
“DB Community has been focusing on understanding the fundamental process of data cleaning”
- In most real-world applications, these problems do not occur on their own
- Common practice: data cleaning followed by ML model training
- Need to study the impact of cleaning on ML models
- Construct benchmarks to evaluate the impact
Source: [9] Li et al. (2019)