Overview of Methods for Detecting and Handling Anomalies
Laure Berti-Équille, IRD, www.ird.fr, laure.berti@ird.fr
AAFD 2012
In Search of Data Quality Problems
"Dirty data":
– Malformatted data
– Aberrant data (outliers)
– Duplicates
– Inconsistent data
– Obsolete data
– False, incorrect, erroneous data
– Incomplete, truncated, censored data
– Missing data
AAFD'12, Univ. Paris 13, Institut Galilée, 28-29 June 2012
Outline
1. Motivating Example
2. Generic Guidelines
3. Methods for Anomaly Detection
4. Techniques for Cleaning Dirty Data
5. Summary and Conclusions
IP Data Streams: A Picture
• 10 attributes, measured every 5 minutes, over four weeks
• Axes transformed for plotting
* L. Berti-Équille, T. Dasu, D. Srivastava: Discovery of Complex Glitch Patterns: A Novel Approach to Quantitative Data Cleaning. Proc. of ICDE 2011, pp. 733-744, Hannover, Germany, 2011.
Detection of Patterns of Anomalies
[Figure: patterns of outliers, duplicates and missing values across the attributes Interfaces, Utilization_Out, Utilization_In, Bytes_Out, Bytes_In, Memory, CPU, Latency, Syslog_Events, CPU_Poll]
Detection: Main Issues
• A large variety of detection methods with conflicting results
• No benchmark
• DQ problems are not necessarily rare events
• DQ problems may be (partially) correlated
• Mutual masking effects impair the detection, e.g.:
– missing values affect the detection of duplicates
– duplicate records affect the detection of outliers
– imputation methods may mask the presence of duplicates
• Classical assumptions won't work (e.g., MCAR/MAR, normality, symmetry, unimodality)
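The masking effect of duplicates on outlier detection can be illustrated with a small sketch. The readings below are made up for illustration, and the z-score rule with a threshold of 3 is only one possible detector, not the method used in the talk.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical measurements: 20 regular readings plus one extreme value.
regular = [9, 10, 11, 10] * 5
single = regular + [50]
# The same extreme reading recorded five times (duplicate records).
duplicated = regular + [50] * 5

# With a single extreme reading, the z-score rule flags it.
print(zscore_outliers(single))
# The duplicates inflate both the mean and the standard deviation,
# so the extreme readings fall below the threshold and nothing is flagged.
print(zscore_outliers(duplicated))
```

De-duplicating before running the detector would restore the first behaviour, which is one reason the order of cleaning steps matters.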
Cleaning: What Can Be Done?
• Cleaning strategies (ad hoc)
– Impute missing values → component-wise median?
– De-duplicate → retain a random record?
– Handle outliers → identify and remove? So many methods, but contradicting results?
– Drop all records that have any imperfection
– Add special categories and analyze singularities in isolation
• Almost all existing approaches are one-shot approaches to univariate glitches. Why?
• Cleaning introduces new errors!?
So Many Choices…
Data
• Missing values: Deletion, Imputation, Modeling
• Duplicates: Deletion, Fusion, Random Selection
• Outliers: Deletion, Winsorization, Trimming
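Three of the choices above (median imputation, winsorization and trimming) can be sketched in a few lines. This is a minimal illustration on a single variable; the 5%/95% cut-offs, the helper names and the sample values are assumptions, not taken from the slides.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    m = median(observed)
    return [m if v is None else v for v in values]

def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in [0, 1]) of a sorted list."""
    k = (len(sorted_values) - 1) * p
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)

def winsorize(values, lower=0.05, upper=0.95):
    """Clamp values outside the [lower, upper] percentiles to those percentiles."""
    s = sorted(values)
    lo, hi = percentile(s, lower), percentile(s, upper)
    return [min(max(v, lo), hi) for v in values]

def trim(values, lower=0.05, upper=0.95):
    """Drop values outside the [lower, upper] percentiles."""
    s = sorted(values)
    lo, hi = percentile(s, lower), percentile(s, upper)
    return [v for v in values if lo <= v <= hi]

# Hypothetical variable with two missing values and one extreme value.
raw = [12, None, 11, 13, 250, 12, None, 10, 11, 12, 13]
imputed = impute_median(raw)
winsorized = winsorize(imputed)   # same length, extremes pulled inward
trimmed = trim(imputed)           # shorter, extremes removed
print(imputed, winsorized, trimmed, sep="\n")
```

Winsorization keeps the record count unchanged while trimming shrinks it, so the two can lead to different downstream statistics on the same data.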
Guidelines Step 1 – Explore the data distributions
Goal
– Detect and count missing, extreme and aberrant data values
– Decide not to consider some values or variables
– Decide the transformation and corrective actions to apply
For continuous variables
– Discretization
– Test for normality (essential for small datasets) and normalization
– Optional test for homoscedasticity (equality of variance-covariance matrices)
– Detect non-linearity and non-monotonicity
For discrete variables
– Group the variables with small populations
– Create new relevant aggregates
Step 1 - Data Distribution Characteristics
• Dispersion
– Standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}$
– Coefficient of Variation (CV): a normalized measure of dispersion of a probability distribution, $CV = \sigma / \mu$
– IQR: Q3 - Q1
– Homoscedasticity: equality of variances for a variable on different subsets, using Levene, Bartlett or Fisher tests (if p < .05 ⇒ heteroscedasticity)
• Skewness: measure of the asymmetry of the probability distribution of a real-valued random variable, $S = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma}\right)^3$
– = 0: the distribution is symmetrical
– > 0: the mass of the distribution is concentrated on the left
– < 0: the mass of the distribution is concentrated on the right
• Kurtosis: measure of the flatness of the distribution, $K = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma}\right)^4$
– = 3: as flat as the normal distribution
– > 3: more concentrated
– < 3: flatter than the Gaussian
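The characteristics above can be computed directly from their definitions. The sketch below uses the population standard deviation (division by N) and the default quartile method of Python's `statistics.quantiles`; the function name `describe` and the sample values are illustrative assumptions.

```python
from statistics import mean, pstdev, quantiles

def describe(values):
    """Dispersion and shape statistics as defined on the slide.

    Assumes a non-zero mean (for the CV) and a non-zero standard deviation.
    """
    n = len(values)
    mu = mean(values)
    sigma = pstdev(values)              # population standard deviation (1/N)
    q1, _, q3 = quantiles(values, n=4)  # quartiles
    z = [(x - mu) / sigma for x in values]
    return {
        "std": sigma,
        "cv": sigma / mu,
        "iqr": q3 - q1,
        "skewness": sum(t ** 3 for t in z) / n,
        "kurtosis": sum(t ** 4 for t in z) / n,  # 3 for a normal distribution
    }

# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]
print(describe(symmetric))
print(describe(right_tailed))
```

On the right-tailed sample the skewness is positive, matching the slide's reading that most of the mass then sits on the left.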
Step 1 - Test for Normality
• Many DM methods assume multivariate normal distributions
• Multivariate normality can be detected by inspecting the indices of multivariate skewness and kurtosis
• As a rule of thumb, univariate normality is lacking when the skewness index > 3.0 and the kurtosis index > 10
• Non-normal distributions can sometimes be corrected by transforming variables
• Tests:
– Kolmogorov-Smirnov test: non-parametric test that quantifies the maximum distance between the empirical distribution function of the variable and the cdf of the normal distribution
– Anderson-Darling test: variant of the K-S test that gives more weight to the tails of the distributions
– Lilliefors test: variant of the K-S test for unknown mean and standard deviation
– Shapiro-Wilk test: sorts the sample values in ascending order and uses the correlation to detect small departures from normality; not suitable for very large sample sizes (SAS proc UNIVARIATE)
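The distance that the Kolmogorov-Smirnov family of tests relies on can be sketched as follows. Because the mean and standard deviation are estimated from the sample, this corresponds to the Lilliefors setting; the sketch returns only the statistic and deliberately computes no p-value. The two samples are synthetic and built only for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def normal_cdf(x, mu, sigma):
    """CDF of the normal distribution N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic_normal(values):
    """Maximum distance between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    mu, sigma = mean(xs), stdev(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        # The empirical CDF jumps from i/n to (i+1)/n at x: check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Synthetic samples: evenly spaced normal quantiles, and their exponentials
# (a log-normal shape with a long right tail).
n = 200
near_normal = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [math.exp(x) for x in near_normal]

d_normal = ks_statistic_normal(near_normal)
d_skewed = ks_statistic_normal(skewed)
print(d_normal, d_skewed)  # the skewed sample is much farther from normality
```

Turning the statistic into a decision requires critical values appropriate to the test (the Lilliefors tables when parameters are estimated), which a statistics package would normally provide.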
Guidelines Step 2 – Analyze data relationships
Goal
– Detect inconsistencies between two or more variables
– Determine relationships between one target variable and one or more variables contributing to its explanation, in order to eliminate variables with no effect
– Determine relationships between explanatory variables in order to avoid multicollinearity, which may cause regression techniques to fail
– Quantify the strength of the relationship and its sensitivity in the presence of outliers
– Detect spurious correlations
Methods
– Bivariate statistics measuring pair-wise correlations
– Discover functional dependencies (FDs)
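Once a functional dependency is known or suspected, inconsistencies between two variables can be found by checking which left-hand-side values map to more than one right-hand-side value. The sketch below checks a single given dependency rather than discovering dependencies; the records and the `zip` → `city` rule are hypothetical.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return the lhs values that map to more than one rhs value,
    i.e. the groups of rows violating the dependency lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}

# Hypothetical records: the zip code is expected to determine the city.
records = [
    {"zip": "75013", "city": "Paris"},
    {"zip": "93430", "city": "Villetaneuse"},
    {"zip": "75013", "city": "Pariss"},  # inconsistent spelling
]
violations = fd_violations(records, "zip", "city")
print(violations)
```

The check only reports that a group is inconsistent; deciding which value is correct is a separate cleaning decision.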
Guidelines Step 1 & 2 - Use the toolbox for detection
• UV statistics
– Distributional techniques: skewness, kurtosis
– Goodness-of-fit tests: normality, Chi-square tests, analysis of residuals, Kullback-Leibler divergence
– Control charts: X-Bar, CUSUM, R
• MV statistics: MCD, MVE, robust estimators
• Model-based methods: linear and logistic regression, probabilistic methods
• Clustering: distance-based, density-based and subspace-based techniques
• Classification: rule-based techniques, SVM, neural networks, Bayesian networks, information-theoretic measures, kernel-based methods
• Rule & pattern discovery: association rule discovery; FD, AFD, CFD mining
• Visualization: graphics, Q-Q plot, confusion matrix
Ultimate Research Goals
• Benchmarking
• Optimization
• Refinement
• Scalability
• Tuning
• Real-time
• Interactivity
Guidelines Step 3 - Data Preparation: Major Tasks
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains a representation reduced in volume that produces the same or similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for numerical data
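Two of the tasks above, normalization and discretization, can be sketched for a single numerical variable. Min-max scaling, z-score scaling and equal-width binning are common choices used here as assumptions; the slides do not prescribe a particular method, and all three functions assume the values are not all identical.

```python
from statistics import mean, pstdev

def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center values on the mean and scale by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins, numbered 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum would land in bin k, so it is folded into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

# Hypothetical values for illustration.
data = [2, 4, 6]
print(min_max(data))
print(z_score(data))
print(equal_width_bins([0, 5, 10], k=2))
```

Min-max scaling is sensitive to extreme values, since a single outlier sets the range, so it is usually applied after outliers have been handled.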
Outline
1. Motivating Example
2. Methods for Anomaly Detection
– Non standardized, misfielded/formatted
– Missing, truncated
– Out-of-date
– Erroneous, contradicting, false
– Inconsistencies
– Duplicates
– Outliers