Chapter 9: Outlier Analysis
Jilles Vreeken
IRDM ‘15/16, 8 Dec 2015
IRDM Chapter 9, overview
Basics & Motivation
1. Extreme Value Analysis
2. Probabilistic Methods
3. Cluster-based Methods
4. Distance-based Methods
5.
You’ll find this covered in: Aggarwal, Ch. 8, 9
December 14th–18th: Tutorials on Graph Mining
January 4th–8th: No Tutorials
The Second Midterm Test
When: December 10th 2015, from 14:15 to 15:25
Where: Günter-Hotz-Hörsaal (E2 2)
Material: Patterns, Clusters, and Classification
You are allowed to bring one (1) sheet of A4 paper with handwritten or printed notes on both sides. No other material (notes, books, course materials) or devices (calculator, notebook, cell phone, spoon, etc.) allowed.
Bring an ID; either your UdS card, or passport.
Chapter 9.1: The Basics & Motivation
Aggarwal Ch. 8.1
Outliers
An outlier is a data point very different from most of the remaining data. The standard definition is by Hawkins: “An outlier is an observation which deviates so much from the other observations as to arouse suspicion it was generated by a different mechanism.”
Example Outliers
[Figure: scatter plot over features X and Y, showing an inlier, a clear outlier, and a point marked “outlier, maybe”.]
Outliers
Outliers are also known as anomalies, abnormalities, discordants, or deviants.
Why bother?
Outlier analysis is a key area of data mining. Unlike pattern mining, clustering, and classification, it aims to describe what is not normal.
Applications are many:
- data cleaning
- fraud detection
- intrusion detection
- rare disease detection
- predictive maintenance
Not noise
Outliers are not noise:
- noise is uninteresting, outliers are interesting
- noise is random, outliers aren’t
Outliers are generated by a different process, e.g. Lionel Messi, credit card fraudsters, or rare disease patients. We have too little data to infer that process exactly, but detected outliers help us to better understand the data.
Outliers everywhere
Many, many different outlier detection methods exist, and many different methods are needed: e.g. for continuous vs. discrete data, and for tables, sequences, or graphs.
The key problem, and why outlier analysis is interesting: beforehand, we do not know what we are looking for. What is weird? What is normal?
Three Types of Outliers
Global outliers: objects that deviate from the rest of the data set. Main issue: finding a good measure of deviation.
Local outliers: objects that deviate from a selected context, e.g. differ strongly from their neighboring objects. Main issue: how to define the local context?
Collective outliers: a subset of objects that collectively deviate from the data or context, e.g. in intrusion detection. Main issue: the combinatorial number of sets of objects.
Ranking versus Thresholding
Most outlier analysis methods give a real-valued score.
How to decide whether a point is worth looking at? We set a threshold, or look at the top-k. There is no best answer; it depends on the situation.
How to evaluate? Very, very difficult: is there a ‘true’ outlier ranking? How bad is it to miss one, or to report two too many?
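The two selection strategies can be sketched as follows; the scores and point names are purely hypothetical:

```python
# Hypothetical real-valued outlier scores for five points.
scores = {"p1": 0.1, "p2": 7.9, "p3": 0.3, "p4": 5.2, "p5": 0.2}

def by_threshold(scores, t):
    """Thresholding: report every point whose score exceeds t."""
    return {p for p, s in scores.items() if s > t}

def top_k(scores, k):
    """Ranking: report the k highest-scoring points."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])
```

With a well-chosen threshold both strategies agree here, but in general top-k fixes the analyst’s workload while a threshold fixes the required degree of deviation.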
Supervised Outlier Detection
Given sufficient data, we can construct a classifier and then simply use it to predict how outlying an object is. This typically does not fly in practice.
Problem 1: insufficient training data. Outliers are rare; we can boost (resample) a training set from a small set of known outliers, or we can train on artificial samples.
Problem 2: recall. Recall is more important than accuracy: we want to catch them all.
Chapter 9.2: Extreme Value Analysis
Aggarwal Ch. 8.2
Extreme Values
The traditional statistical approach to identifying outliers is extreme value analysis: those points x ∈ D that are in the statistical tails of the probability distribution p of D are outliers. This only identifies very specific outliers.
For example, for {1, 3, 3, 3, 50, 97, 97, 97, 100} the extreme values are 1 and 100, although 50 is the most isolated point.
Tails are naturally defined for univariate distributions; defining the multivariate tail area of a distribution is more tricky.
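The example above can be checked directly. A minimal sketch, using the data from the slide and ranking points by their distance from the mean:

```python
import statistics

data = [1, 3, 3, 3, 50, 97, 97, 97, 100]
mu = statistics.fmean(data)

# Rank points by how deep they sit in the tails, i.e. distance from the mean.
ranked = sorted(data, key=lambda x: abs(x - mu), reverse=True)
# 1 and 100 come out on top as the extreme values, while 50, the most
# isolated point, is ranked last: extreme value analysis would not flag it.
```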
Problems with multivariate tails
[Figure: scatter plot over features X and Y, showing an extreme value in the tail, and a point that is an outlier but not an extreme value.]
Univariate Extreme Value Analysis
Strong relation to statistical tail confidence tests.
Assume a distribution, and consider the probability density function f_X(x) for attribute X. For some density threshold ε:
- the lower tail are those values x < l for which f_X(x) < ε
- the upper tail are those values x > u for which f_X(x) < ε
Not a density threshold
Note that this is not simply a density threshold: the conditions x < l and x > u require the region to lie at the extremes of the distribution.
Also, not all distributions have two tails; exponential distributions, for example, have only one.
Univariate
For example, for a Gaussian

  f_X(x) = 1 / ( σ · √(2π) ) · e^( −(x−μ)² / (2σ²) )

with sufficient data we can estimate μ and σ with high accuracy.
We can then compute z-scores, z_i = (x_i − μ) / σ. Large positive values correspond to the upper tail, large negative values to the lower tail.
We can write the pdf in terms of z-scores as

  f_X(z_i) = 1 / ( σ · √(2π) ) · e^( −z_i² / 2 )

The cumulative normal distribution then tells us the area of the tail beyond z_i. As a rule of thumb, z-scores with absolute values larger than 3 are extreme.
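The z-score recipe above can be sketched as follows; the sample in the usage note is hypothetical, and μ and σ are estimated from the data:

```python
import statistics

def z_scores(xs):
    """z_i = (x_i - mu) / sigma, with mu and sigma estimated from the sample."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def extreme_values(xs, threshold=3.0):
    """Rule of thumb: flag points whose absolute z-score exceeds 3."""
    return [x for x, z in zip(xs, z_scores(xs)) if abs(z) > threshold]
```

For instance, `extreme_values([10]*30 + [10.5]*30 + [100])` flags only 100; every other point sits well within three standard deviations of the mean.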
Depth-based methods
The main idea is that the convex hull of a set of data points represents the pareto-optimal extremes of the set:
- find the convex hull, and assign depth k to all x ∈ hull(D)
- remove hull(D) from D, increase k, and repeat until D is empty
The depth k identifies how extreme a point is.
Example, depth
[Figure: nested convex hulls of a point set, labelled depth 1 (outermost) through depth 4 (innermost).]
Depth-based methods
The main idea is that the convex hull of a set of data points represents the pareto-optimal extremes of the set:
- find the set S of corners of the convex hull of D
- assign depth k to all x ∈ S, remove S, and repeat until D is empty
The depth of a point identifies how extreme it is.
Very sensitive to dimensionality:
- recall how points are typically distributed over the hull of a hypersphere
- computational complexity
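The hull-peeling procedure above can be sketched in two dimensions. This is a minimal sketch: the hull routine (Andrew’s monotone chain) and the point set are illustrative, and points lying exactly on a hull edge are simply peeled in a later round here:

```python
def convex_hull(points):
    """Corners of the convex hull of 2-d points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        hull = []
        for p in seq:
            # Pop points that would make a clockwise (or straight) turn.
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
                hull.pop()
            hull.append(p)
        return hull[:-1]

    # Lower hull plus upper hull, each without its final (shared) point.
    return half_hull(pts) + half_hull(list(reversed(pts)))

def hull_peeling_depth(points):
    """Assign each point the depth k of the hull it is peeled off at."""
    depth, k, remaining = {}, 1, list(set(points))
    while remaining:
        corners = set(convex_hull(remaining))
        for p in corners:
            depth[p] = k
        remaining = [p for p in remaining if p not in corners]
        k += 1
    return depth
```

For a square with one interior point, the four corners get depth 1 and the centre depth 2, matching the nested-hull picture on the previous slide.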
Multivariate Extreme Value Analysis
We can also define tails for multivariate distributions: areas of extreme values with probability density less than some threshold.
More complicated than univariate, and it only works for unimodal distributions with a single peak.
Multivariate Extreme Value Analysis
For a multivariate Gaussian, we have its density as

  f(x) = 1 / ( √|Σ| · (2π)^(d/2) ) · e^( −½ · (x−μ) Σ⁻¹ (x−μ)ᵀ )

where Σ is the d-by-d covariance matrix, and |Σ| is its determinant.
The exponent resembles Mahalanobis distance…
Mahalanobis distance
Mahalanobis distance is defined as

  M(x, μ, Σ) = √( (x−μ) Σ⁻¹ (x−μ)ᵀ )

where Σ is a d-by-d covariance matrix, and μ a mean vector.
Essentially Euclidean distance after applying PCA and after dividing by the standard deviation; very useful in practice.
[Figure: elongated point cloud over features X and Y, with the centre μ, point a along the main axis, and point b off-axis.]
For the example in the figure, M(b, μ, Σ) > M(a, μ, Σ). (Mahalanobis, 1936)
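A minimal 2-d sketch of the definition above; the covariance matrix and the two points are hypothetical, chosen to mirror the figure (data stretched along feature X):

```python
import math

def mahalanobis_2d(x, mu, cov):
    """M(x, mu, Sigma) = sqrt((x - mu) Sigma^-1 (x - mu)^T), for 2-d data."""
    (a, b), (c, d) = cov
    det = a * d - b * c          # assumes cov is invertible
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    v = (inv[0][0] * dx[0] + inv[0][1] * dx[1],   # Sigma^-1 (x - mu)
         inv[1][0] * dx[0] + inv[1][1] * dx[1])
    return math.sqrt(dx[0] * v[0] + dx[1] * v[1])

# Hypothetical data spread: variance 9 along feature X, variance 1 along Y.
mu, cov = (0.0, 0.0), ((9.0, 0.0), (0.0, 1.0))
a_pt = (6.0, 0.0)  # on the long axis: Euclidean distance 6, M = 6/3 = 2
b_pt = (0.0, 3.0)  # off-axis: Euclidean distance 3, M = 3/1 = 3
```

Point b is Euclidean-closer to the centre than a, yet has the larger Mahalanobis distance, because distance is measured relative to the data’s spread in each direction.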
Multivariate Extreme Value Analysis
The exponent of the multivariate Gaussian density is half the squared Mahalanobis distance:

  f(x) = 1 / ( √|Σ| · (2π)^(d/2) ) · e^( −½ · M(x, μ, Σ)² )

For the probability density to fall below a threshold, the Mahalanobis distance needs to be larger than a corresponding threshold.