Lecture 7, Week 12 (May 26)
33459-01 Principles of Knowledge Discovery in Data
Outlier Detection
Lecture by: Dr. Osmar R. Zaïane
33459-01: Principles of Knowledge Discovery in Data – March-June, 2006 (Dr. O. Zaiane)

Course Content
• Introduction to Data Mining
• Association analysis
• Sequential Pattern Analysis
• Classification and prediction
• Contrast Sets
• Data Clustering
• Outlier Detection
• Web Mining

What is an Outlier?
• An observation (or measurement) that is unusually different (large or small) relative to the other values in a data set.
• Outliers typically are attributable to one of the following causes:
– Error: the measurement or event is observed, recorded, or entered into the computer incorrectly.
– Contamination: the measurement or event comes from a different population.
– Inherent variability: the measurement or event is correct, but represents a rare event.

Many Names for Outlier Detection
• Outlier detection
• Outlier analysis
• Anomaly detection
• Intrusion detection
• Misuse detection
• Surprise discovery
• Rarity detection
• Detection of unusual events

Lecture Outline
Part I: What is Outlier Detection (30 minutes)
• Introduction to outlier analysis
• Definitions and Relative Notions
• Motivating Examples for outlier detection
• Taxonomy of Major Outlier Detection Algorithms
Part II: Statistics Approaches
• Distribution-Based (Univariate and multivariate)
• Depth-Based
• Graphical Aids
Part III: Data Mining Approaches
• Clustering-Based
• Distance-Based
• Density-Based
• Resolution-Based

Finding Gems
• If Data Mining is about finding gems in a database, then of all the data mining tasks (characterization, classification, clustering, association analysis, contrasting…), outlier detection is the closest to this metaphor.
[Figure: Data Mining can discover “gems” in the data]

Global versus Local Outliers
• Global outliers: vis-à-vis the whole dataset
• Local outliers: vis-à-vis a subset of the data
• Is one anomaly more of an outlier than other outliers?
• Could we rank outliers?

Different Definitions
• An observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism. [Hawkins, 1980]
• An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that dataset. [Barnett & Lewis, 1994]
• An outlier is an observation that lies outside the overall pattern of a distribution. [Moore & McCabe, 1999]
• Outliers are those data records that do not follow any pattern in an application. [Chen, Tan & Fu, 2003]

More Definitions
• An object O in a dataset is a DB(p, D)-outlier if at least a fraction p of the other objects in the dataset lie at a distance greater than D from O. [Knorr & Ng, 1997]
• An outlier in a set of data is an observation or a point that is considerably dissimilar or inconsistent with the remainder of the data. [Ramaswamy, Rastogi & Shim, 2000]
• Given an input data set with N points and parameters n and k, a point p is a D^k_n outlier if there are no more than n-1 other points p' such that D^k(p') > D^k(p), where D^k(p) denotes the distance of point p from its k-th nearest neighbor. [Ramaswamy, Rastogi & Shim, 2000]
• Given a set of observations X, an outlier is an observation that is an element of this set X but which is inconsistent with the majority of the data or inconsistent with a sub-group of X to which the element is meant to be similar. [Fan, Zaïane, Foss & Wu, 2006]

Relativity of an Outlier
• The notion of outlier is subjective and highly application-domain-dependent.
• There is an ambiguity in defining an outlier.

Topology for Outlier Detection
Outlier Detection Methods
• Statistical Methods
– Distribution-Based
– Depth-Based
– Visual
• Data Mining Methods
– Generic: Distance-Based, Clustering-Based, Density-Based, Resolution-Based
– Specific: Spatial, Sequence-Based
– Top-n

Application of Anomaly Detection
• Data Cleaning: elimination of noise (abnormal data)
– Noise can significantly affect data modeling (Data Quality)
• Network Intrusion (hackers, DoS, etc.)
• Fraud detection (credit cards, stocks, financial transactions, communications, voting irregularities, etc.)
• Surveillance
• Performance Analysis (for scouting athletes, etc.)
• Weather Prediction (environmental protection, disaster prevention, etc.)
• Real-time anomaly detection in various monitoring systems, such as structural health and transportation.
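
The two distance-based definitions above (DB(p, D)-outliers and D^k_n outliers) can be illustrated with a brute-force sketch. This is only a minimal illustration under stated assumptions, not the algorithms from the cited papers: it assumes Euclidean distance, compares every pair of points, and the function names (`db_outliers`, `kth_nn_distance`, `dkn_outliers`) are hypothetical names chosen for this example.

```python
import math


def db_outliers(points, p, D):
    """Indices of DB(p, D)-outliers: objects O for which at least a
    fraction p of the OTHER objects lie farther than distance D from O.
    Assumes Euclidean distance (an assumption of this sketch)."""
    n = len(points)
    outliers = []
    for i, o in enumerate(points):
        # Count the other objects lying farther than D from o.
        far = sum(1 for j, q in enumerate(points)
                  if j != i and math.dist(o, q) > D)
        if far >= p * (n - 1):
            outliers.append(i)
    return outliers


def kth_nn_distance(points, i, k):
    """D^k(p): distance from points[i] to its k-th nearest neighbor."""
    dists = sorted(math.dist(points[i], q)
                   for j, q in enumerate(points) if j != i)
    return dists[k - 1]


def dkn_outliers(points, n, k):
    """Indices of D^k_n outliers: points p for which no more than n-1
    other points p' have D^k(p') > D^k(p)."""
    scores = [kth_nn_distance(points, i, k) for i in range(len(points))]
    return [i for i, s in enumerate(scores)
            if sum(1 for t in scores if t > s) <= n - 1]


if __name__ == "__main__":
    # Four points on a unit square plus one far-away point.
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(db_outliers(pts, p=0.9, D=5.0))   # index of the far-away point
    print(dkn_outliers(pts, n=1, k=2))      # the single top-ranked point
```

Both functions cost O(N^2) distance computations, which is fine for a toy data set. Note the difference in what the user must supply: DB(p, D) needs a distance threshold D, whereas D^k_n only needs the number n of outliers wanted, which also yields a ranking by D^k.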

Lecture Outline
Part II: Statistics Approaches
• Distribution-Based (Univariate and multivariate)
• Depth-Based
• Graphical Aids

Outliers and Statistics
• Currently, in most applications outlier detection still depends to a large extent on traditional statistical methods.
• In Statistics, prior to the building of a multivariate (or any) statistical representation from the process data, a pre-screening/pre-treatment procedure is essential to remove noise that can affect models and seriously bias and influence statistical estimates.
• Assume a statistical distribution and find records which deviate significantly from the assumed model.

Chebyshev Theorem
• Univariate: according to Chebyshev's theorem, at least 1 - 1/3^2 ≈ 89% of the observations in any data set have a z-score less than 3 in absolute value (about 99.7% if the data are Normal), i.e. almost all data fall into the interval [µ - 3σ, µ + 3σ], where µ is the mean and σ is the standard deviation.
• The Russian mathematician P. L. Chebyshev (1821-1894) discovered that the fraction of observations falling between two distinct values, whose differences from the mean have the same absolute value, is related to the variance of the population. Chebyshev's Theorem gives a conservative estimate of that fraction.
• Theorem: the fraction of any data set lying within k standard deviations of the mean is at least 1 - 1/k^2.
• For any population or sample, at least 1 - 1/k^2 of the observations in the data set fall within k standard deviations of the mean, where k >= 1.

Distribution-Based Outlier Detection
• Univariate: the definition is based on a standard probability model (Normal, Poisson, Binomial). The method assumes or fits a distribution to the data.
• Z-score: z = (x - µ)/σ
• The z-score for each data point is computed, and the observations with a z-score greater than 3 in absolute value are declared outliers.
• Any problem with this? µ and σ are themselves very sensitive to outliers. Extreme values skew the mean: the mean of {1, 2, 3, 4, 5} is 3, while the mean of {1, 2, 3, 4, 1000} is 202.
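
The z-score rule and Chebyshev's bound above can be checked with a short sketch. It is an illustration only: it assumes the population standard deviation (`statistics.pstdev`) for σ and a threshold of 3, and the function names are hypothetical. It also shows the weakness raised on the slide: in {1, 2, 3, 4, 1000} the extreme value inflates µ and σ so much that its own z-score stays below 3.

```python
import statistics


def z_scores(data):
    """z = (x - mu) / sigma for each observation.
    Uses the population standard deviation (an assumption of this sketch)."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    if sigma == 0:
        # All values identical: no observation deviates from the mean.
        return [0.0 for _ in data]
    return [(x - mu) / sigma for x in data]


def z_score_outliers(data, threshold=3.0):
    """Observations whose |z| exceeds the threshold."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]


def chebyshev_lower_bound(k):
    """Chebyshev: at least 1 - 1/k^2 of the data lie within k sigma."""
    return 1.0 - 1.0 / (k * k)


def fraction_within(data, k):
    """Observed fraction of data within k standard deviations of the mean."""
    return sum(1 for z in z_scores(data) if abs(z) <= k) / len(data)


if __name__ == "__main__":
    small = [1, 2, 3, 4, 1000]
    print(statistics.fmean(small))        # mean pulled far from the bulk
    print(z_score_outliers(small))        # 1000 masks itself: nothing flagged

    larger = [10] * 20 + [100]
    print(z_score_outliers(larger))       # enough normal points: 100 is flagged

    # The observed fraction can never fall below Chebyshev's bound.
    print(fraction_within(larger, 3), ">=", chebyshev_lower_bound(3))
```

In the five-point example the z-score of 1000 is just under 2, so the rule misses it entirely. With twenty ordinary values and one extreme value, the same rule does flag the extreme value, which shows how strongly the outcome depends on sample size and on the estimates of µ and σ.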