Overview of Methods for Anomaly Detection and Treatment (Panorama des méthodes de détection et de traitement des anomalies)
Laure Berti-Équille, IRD, www.ird.fr, laure.berti@ird.fr
AAFD 2012

In Search of Data Quality Problems
“Dirty Data”:
– Malformatted data
– Aberrant data (outliers)
– Duplicates
– Inconsistent data
– Obsolete data
– False, incorrect, or erroneous data
– Incomplete, truncated, or censored data
– Missing data
AAFD'12, Univ. Paris 13, Institut Galilée, 28-29 June 2012

Outline
1. Motivating Example
2. Generic Guidelines
3. Methods for Anomaly Detection
4. Techniques for Cleaning Dirty Data
5. Summary and Conclusions

IP Data Streams: A Picture
• 10 attributes, measured every 5 minutes, over four weeks
• Axes transformed for plotting
* L. Berti-Équille, T. Dasu, D. Srivastava: Discovery of complex glitch patterns: A novel approach to Quantitative Data Cleaning. Proc. of ICDE 2011, pp. 733-744, Hannover, Germany, 2011.

Detection of Patterns of Anomalies
[Figure: glitch patterns per attribute. Attributes: Interfaces, Utilization_Out, Utilization_In, Bytes_Out, Bytes_In, Memory, CPU, Latency, Syslog_Events, CPU_Poll. Marked glitch types: Outliers, Duplicate, Missing.]

Detection: Main Issues
• A large variety of detection methods, with conflicting results
• No benchmark
• DQ (data quality) problems are not necessarily rare events
• DQ problems may be (partially) correlated
• Mutual masking effects impair detection, e.g.:
– missing values affect the detection of duplicates
– duplicate records affect the detection of outliers
– imputation methods may mask the presence of duplicates
• Classical assumptions do not hold (e.g., MCAR/MAR, normality, symmetry, unimodality)
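The masking effect of duplicates on outlier detection can be illustrated with a minimal sketch. The z-score rule, the threshold of 3, and the toy measurements below are assumptions chosen for illustration; they are not taken from the talk.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return []  # constant data: no value can be flagged
    return [v for v in values if abs(v - m) / s > threshold]

# Toy measurements (hypothetical), with one aberrant reading of 50.
baseline = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]
with_one_glitch = baseline + [50]
# The same aberrant record, stored three times (duplicates).
with_duplicated_glitch = baseline + [50, 50, 50]

flagged_single = zscore_outliers(with_one_glitch)
flagged_duplicated = zscore_outliers(with_duplicated_glitch)
print(flagged_single, flagged_duplicated)
```

In this toy case the single aberrant value is flagged, but its duplicated copies inflate the mean and the standard deviation enough that none of them is flagged any more.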

Cleaning: What Can Be Done?
• Cleaning strategies (ad hoc)
– Impute missing values: component-wise median?
– De-duplicate: retain a random record?
– Handle outliers: identify and remove? So many methods, but contradictory results?
– Drop all records that have any imperfection
– Add special categories and analyze singularities in isolation
• Almost all existing approaches are one-shot treatments of univariate glitches. Why?
• Cleaning introduces new errors!?
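A minimal sketch of why one-shot, glitch-by-glitch cleaning can introduce new errors: the imputed values depend on whether de-duplication runs before or after imputation. The record layout and the helper names `impute_median` and `deduplicate` are hypothetical, and de-duplication keeps the first record rather than a random one so that the result is reproducible.

```python
from statistics import median

def impute_median(records, columns):
    """Replace None in each listed column with that column's median."""
    filled = [dict(r) for r in records]
    for col in columns:
        observed = [r[col] for r in records if r[col] is not None]
        if not observed:
            continue  # nothing to impute from
        m = median(observed)
        for r in filled:
            if r[col] is None:
                r[col] = m
    return filled

def deduplicate(records, key):
    """Keep the first record seen for each combination of key columns."""
    seen, unique = set(), []
    for r in records:
        k = tuple(r[c] for c in key)
        if k not in seen:
            seen.add(k)
            unique.append(r)
    return unique

# Toy records (hypothetical): id 2 is duplicated, two values are missing.
records = [
    {"id": 1, "cpu": 10.0, "memory": 40.0},
    {"id": 2, "cpu": None, "memory": 42.0},
    {"id": 2, "cpu": None, "memory": 42.0},
    {"id": 3, "cpu": 30.0, "memory": None},
]
impute_then_dedup = deduplicate(impute_median(records, ["cpu", "memory"]), ["id"])
dedup_then_impute = impute_median(deduplicate(records, ["id"]), ["cpu", "memory"])
print(impute_then_dedup[2]["memory"], dedup_then_impute[2]["memory"])
```

The duplicate of record 2 is counted twice when the median is computed first, so the value imputed for record 3 differs between the two orders.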

So Many Choices…
Data
– Missing values: deletion, imputation, modeling
– Duplicates: deletion, fusion, random selection
– Outliers: deletion, winsorization, trimming
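Winsorization and trimming, two of the outlier treatments listed above, differ in whether extreme values are clamped or dropped. The sketch below assumes percentile cut-offs with linear interpolation; the cut-off levels and the sample are illustrative choices, not values from the talk.

```python
def percentile(sorted_values, q):
    """Percentile by linear interpolation; q in [0, 1], input already sorted."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])

def winsorize(values, lower=0.05, upper=0.95):
    """Clamp values outside the [lower, upper] percentiles to the cut-offs."""
    s = sorted(values)
    lo_cut, hi_cut = percentile(s, lower), percentile(s, upper)
    return [min(max(v, lo_cut), hi_cut) for v in values]

def trim(values, lower=0.05, upper=0.95):
    """Drop values outside the [lower, upper] percentiles."""
    s = sorted(values)
    lo_cut, hi_cut = percentile(s, lower), percentile(s, upper)
    return [v for v in values if lo_cut <= v <= hi_cut]

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # toy sample with one extreme value
winsorized = winsorize(values, 0.10, 0.90)
trimmed = trim(values, 0.10, 0.90)
print(len(winsorized), len(trimmed))
```

Winsorization keeps the sample size and pulls the extreme value toward the bulk of the data, while trimming shrinks the sample; which one is appropriate depends on the downstream analysis.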

Guidelines Step 1 – Explore the data distributions
Goal
– Detect and count missing, extreme, and aberrant data values
– Decide which values or variables to exclude
– Decide which transformations and corrective actions to apply
For continuous variables
– Discretization
– Test for normality (essential for small datasets) and normalization
– Optional test for homoscedasticity (equality of variance-covariance matrices)
– Detect non-linearity and non-monotonicity
For discrete variables
– Group the categories with small populations
– Create new relevant aggregates

Step 1 - Data Distribution Characteristics
• Dispersion
– Standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}$
– Coefficient of Variation (CV): a normalized measure of the dispersion of a probability distribution, $CV = \sigma / \mu$
– IQR: $Q_3 - Q_1$
– Homoscedasticity: equality of variances for a variable on different subsets, checked with the Levene, Bartlett, or Fisher tests (if p < .05 ⇒ heteroscedasticity)
• Skewness: measure of the asymmetry of the probability distribution of a real-valued random variable, $S = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma}\right)^3$
– = 0: the distribution is symmetrical
– > 0: the mass of the distribution is concentrated on the left
– < 0: the mass of the distribution is concentrated on the right
• Kurtosis: measure of the flatness of the distribution, $K = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma}\right)^4$
– = 3: as flat as the normal distribution
– > 3: more concentrated
– < 3: flatter than the Gaussian
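The characteristics above can be computed directly from their definitions. This is a minimal sketch using the population (1/N) forms; the function name `describe`, the quartile convention (the default of `statistics.quantiles`), and the toy samples are assumptions.

```python
from math import sqrt
from statistics import quantiles

def describe(x):
    """Dispersion and shape statistics using population (1/N) moments."""
    n = len(x)
    mean = sum(x) / n
    sigma = sqrt(sum((v - mean) ** 2 for v in x) / n)
    if sigma == 0:
        raise ValueError("constant sample: skewness and kurtosis are undefined")
    q1, _, q3 = quantiles(x, n=4)  # quartiles, default 'exclusive' method
    return {
        "mean": mean,
        "std": sigma,
        "cv": sigma / mean if mean != 0 else None,  # CV undefined for zero mean
        "iqr": q3 - q1,
        "skewness": sum(((v - mean) / sigma) ** 3 for v in x) / n,
        "kurtosis": sum(((v - mean) / sigma) ** 4 for v in x) / n,
    }

symmetric = describe([1, 2, 3, 4, 5])       # toy symmetric sample
right_tailed = describe([1, 1, 1, 2, 10])   # toy sample with a long right tail
print(symmetric["skewness"], right_tailed["skewness"])
```

The symmetric sample has zero skewness, and the long right tail yields a positive value, matching the sign convention on the slide.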

Step 1 - Test for Normality
• Many DM (data mining) methods assume multivariate normal distributions
• Multivariate normality can be assessed by inspecting the indices of multivariate skewness and kurtosis
• As a rule of thumb, lack of univariate normality is indicated when the skewness index > 3.0 and the kurtosis index > 10
• Non-normal distributions can sometimes be corrected by transforming variables
• Tests:
– Kolmogorov-Smirnov test: non-parametric test that quantifies the maximum distance between the empirical distribution function of the variable and the cdf of the normal distribution
– Anderson-Darling test: variant of the K-S test that gives more weight to the tails of the distributions
– Lilliefors test: variant of the K-S test for unknown mean and standard deviation
– Shapiro-Wilk test: sorts the sample values in ascending order and uses the correlation to detect small departures from normality; not suitable for very large sample sizes (SAS proc UNIVARIATE)
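The distance used by the Kolmogorov-Smirnov family of tests can be sketched as follows. Because the mean and standard deviation are estimated from the sample, this is the statistic of the Lilliefors variant, and standard K-S critical values do not apply to it; the sketch computes the distance only, not a p-value. The function name and the two deterministic toy samples are assumptions.

```python
from math import exp
from statistics import NormalDist, mean, stdev

def ks_distance_to_normal(x):
    """Maximum gap between the empirical CDF of x and a fitted normal CDF."""
    n = len(x)
    fitted = NormalDist(mean(x), stdev(x))  # parameters estimated from x
    d = 0.0
    for i, v in enumerate(sorted(x)):
        cdf = fitted.cdf(v)
        # The empirical CDF jumps from i/n to (i+1)/n at v; check both sides.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d

# Deterministic toy samples: evenly spaced normal quantiles, and their
# exponentials (a log-normal shape with a long right tail).
near_normal = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [exp(v) for v in near_normal]

d_normal = ks_distance_to_normal(near_normal)
d_skewed = ks_distance_to_normal(skewed)
print(d_normal, d_skewed)
```

The skewed sample lies much farther from its fitted normal distribution than the near-normal one, which is the signal the tests above turn into a decision.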

Guidelines Step 2 – Analyze data relationships
Goal
– Detect inconsistencies between two or more variables
– Determine the relationships between one target variable and the variables that contribute to explaining it, in order to eliminate variables that have no effect
– Determine the relationships between explanatory variables, in order to avoid multicollinearity, which may cause regression techniques to fail
– Quantify the strength of each relationship and its sensitivity to the presence of outliers
– Detect spurious correlations
Methods
– Bivariate statistics measuring pair-wise correlations
– Discovery of functional dependencies (FDs)
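Two of the methods above can be sketched in a few lines: a pair-wise correlation whose sensitivity to a single outlier is compared with a rank-based alternative, and a check for violations of a candidate functional dependency. Spearman's rank correlation is one possible robust choice, not necessarily the one used in the talk; the attribute names `router` and `site` and all data are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """Average ranks: tied values share the mean of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def fd_violations(rows, lhs, rhs):
    """LHS values that map to more than one RHS value (violations of lhs -> rhs)."""
    seen = defaultdict(set)
    for row in rows:
        seen[tuple(row[c] for c in lhs)].add(tuple(row[c] for c in rhs))
    return {k: v for k, v in seen.items() if len(v) > 1}

x = [1, 2, 3, 4, 5, 6, 7, 8]
y_clean = [2, 4, 6, 8, 10, 12, 14, 16]
y_dirty = [2, 4, 6, 8, 10, 12, 14, -50]  # last value replaced by an outlier
print(pearson(x, y_clean), pearson(x, y_dirty), spearman(x, y_dirty))

rows = [
    {"router": "r1", "site": "A"},
    {"router": "r1", "site": "A"},
    {"router": "r2", "site": "B"},
    {"router": "r2", "site": "C"},  # inconsistent with the row above
]
print(fd_violations(rows, ["router"], ["site"]))
```

A single outlier is enough to flip the sign of the Pearson coefficient here, while the rank-based coefficient degrades but stays positive; the dependency check reports the one `router` value associated with two different sites.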

Guidelines Step 1&2 - Use the toolbox for detection
• UV (univariate) statistics
– Distributional techniques: skewness, kurtosis
– Goodness-of-fit tests: normality, Chi-square tests, analysis of residuals, Kullback-Leibler divergence
– Control charts: X-Bar, CUSUM, R
• MV (multivariate) statistics
– MCD, MVE, robust estimators
• Model-based methods
– Linear, logistic regression
– Probabilistic methods
• Clustering
– Distance-based techniques
– Density-based techniques
– Subspace-based techniques
• Classification
– Rule-based techniques
– SVM, Neural Networks, Bayesian Networks
– Information theoretic measures
– Kernel-based methods
• Rule & Pattern Discovery
– Association rule discovery
– FD, AFD, CFD mining
• Visualization
– Graphics
– Q-Q plot
– Confusion matrix
Ultimate Research Goals
• Benchmarking
• Optimization
• Refinement
• Scalability
• Tuning
• Real-time
• Interactivity

Guidelines Step 3 - Data Preparation: Major Tasks
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains a representation that is reduced in volume but produces the same or similar analytical results
• Data discretization
– Part of data reduction, but of particular importance, especially for numerical data
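Two of the tasks above, normalization and discretization, are simple enough to sketch. Min-max scaling, z-score scaling, and equal-width binning are common choices picked here for illustration; the talk does not prescribe a specific method, and the function names and sample are assumptions.

```python
from statistics import mean, pstdev

def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return [0.0 for _ in values]  # constant column
    return [(v - m) / s for v in values]

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins, numbered 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

values = [0, 5, 10, 15, 20]  # toy numerical attribute
scaled = min_max(values)
standardized = z_score(values)
bins = equal_width_bins(values, 4)
print(scaled, bins)
```

Both scalings are sensitive to extreme values, since the minimum, maximum, mean, and standard deviation are all affected by outliers, so they are usually applied after the detection steps described earlier.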

Outline
1. Motivating Example
2. Methods for Anomaly Detection
– Non-standardized, misfielded/misformatted
– Inconsistencies
– Missing, truncated
– Duplicates
– Out-of-date
– Outliers
– Erroneous, contradicting, false
