Chapter 9: Outlier Analysis



  1. Chapter 9: Outlier Analysis. Jilles Vreeken, IRDM ‘15/16, 8 Dec 2015

  2. IRDM Chapter 9, overview. Basics & Motivation; 1. Extreme Value Analysis; 2. Probabilistic Methods; 3. Cluster-based Methods; 4. Distance-based Methods; 5. You’ll find this covered in: Aggarwal, Ch. 8, 9

  3. December 14th–18th: Tutorials on Graph Mining. January 4th–8th: no Tutorials.

  4. The Second Midterm Test, December 10th 2015. When: from 14:15 to 15:25. Where: Günter-Hotz-Hörsaal (E2 2). Material: Patterns, Clusters, and Classification. You are allowed to bring one (1) sheet of A4 paper with handwritten or printed notes on both sides. No other material (notes, books, course materials) or devices (calculator, notebook, cell phone, spoon, etc.) allowed. Bring an ID; either your UdS card, or passport.

  5. Chapter 9.1: The Basics & Motivation. Aggarwal Ch. 8.1

  6. Outliers. An outlier is a data point very different from most of the remaining data. The standard definition is by Hawkins: “An outlier is an observation which deviates so much from the other observations as to arouse suspicion that it was generated by a different mechanism.”

  7. Example Outliers. [Figure: scatter plot of Feature Y versus Feature Z, showing an inlier, an outlier, and a borderline “outlier, maybe” point.]

  8. Outliers. An outlier is a data point very different from most of the remaining data. Outliers are also known as anomalies, abnormalities, discordants, and deviants.

  9. Why bother? Outlier analysis is a key area of data mining. Unlike pattern mining, clustering, and classification, it aims to describe what is not normal. Applications are many: data cleaning, fraud detection, intrusion detection, rare disease detection, predictive maintenance.

  10. Not noise. Outliers are not noise: noise is uninteresting, outliers are; noise is random, outliers aren’t. Outliers are generated by a different process, e.g. Lionel Messi, credit card fraudsters, or rare disease patients. We have too little data to infer that process exactly, but detected outliers help us to better understand the data.

  11. Outliers everywhere. Many, many different outlier detection methods exist, because many different methods are needed: e.g. continuous vs. discrete data; e.g. tables, sequences, graphs. The key problem, and why outlier analysis is interesting: beforehand, we do not know what we are looking for. What is weird? What is normal?

  12. Three Types of Outliers. Global outliers: objects that deviate from the rest of the data set; main issue: finding a good measure of deviation. Local outliers: objects that deviate from a selected context, e.g. differ strongly from their neighboring objects; main issue: how to define the local context? Collective outliers: a subset of objects that collectively deviate from the data or context, e.g. in intrusion detection; main issue: the combinatorial number of sets of objects.

  13. Ranking versus Thresholding. Most outlier analysis methods give a real-valued score. How to decide whether a point is worth looking at? We set a threshold, or look at the top-k; there is no best answer, it depends on the situation. How to evaluate? Very, very difficult: is there a ‘true’ outlier ranking? How bad is it to miss one, or to report two too many?
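To make the threshold-versus-top-k distinction concrete, here is a minimal Python sketch; the outlier scores, the cutoff value, and k are made-up assumptions for illustration, not from the lecture:

```python
# Hypothetical outlier scores for nine points; higher means more anomalous.
scores = [0.1, 0.3, 8.2, 0.2, 0.4, 5.1, 0.3, 0.2, 0.5]

# Option 1: thresholding. Flag every point whose score exceeds a cutoff.
threshold = 1.0
flagged = [i for i, s in enumerate(scores) if s > threshold]

# Option 2: ranking. Report only the k highest-scoring points.
k = 1
top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

print(flagged)  # [2, 5]: two points exceed the threshold
print(top_k)    # [2]: only the single highest-scoring point
```

The two policies can disagree, as here: the threshold reports two points while top-1 reports one, which is exactly why the choice depends on the situation.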

  14. Supervised Outlier Detection. Given sufficient data, we can construct a classifier and then simply use it to predict how outlying an object is; typically, this does not fly in practice. Problem 1: insufficient training data. Outliers are rare; we can boost (resample) a training set from a small set of known outliers, or we can train on artificial samples. Problem 2: recall. Recall is more important than accuracy: we want to catch them all.

  15. Chapter 9.2: Extreme Value Analysis. Aggarwal Ch. 8.2

  16. Extreme Values. The traditional statistical approach to identifying outliers is extreme value analysis: those points x ∈ X that are in the statistical tails of the probability distribution f of X are outliers. This only identifies very specific outliers. For example, for {1, 3, 3, 3, 50, 97, 97, 97, 100} the extreme values are 1 and 100, although 50 is the most isolated. Tails are naturally defined for univariate distributions; defining the multivariate tail area of a distribution is more tricky.

  17. Problems with multivariate tails. [Figure: scatter plot of Feature Y versus Feature Z, contrasting an extreme value with a point that is an outlier but not an extreme value.]

  18. Univariate Extreme Value Analysis. Strong relation to statistical tail confidence tests. Assume a distribution, and consider the probability density function f_X(x) for attribute X. The lower tail is then those values x < l for which f_X(x) < ε; the upper tail is then those values x > u for which f_X(x) < ε.
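A minimal sketch of this tail test for a Gaussian density; the standard-normal parameters and the density threshold ε = 0.01 are illustrative assumptions. Note that for a unimodal Gaussian, density below ε happens only in the two tails, so the density check alone suffices here:

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density f_X(x) of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def in_tail(x, mu=0.0, sigma=1.0, eps=0.01):
    """For a unimodal Gaussian, density below eps places x in a tail."""
    return gaussian_pdf(x, mu, sigma) < eps

print(in_tail(0.0))   # False: near the peak
print(in_tail(3.5))   # True: upper tail
print(in_tail(-3.5))  # True: lower tail
```

The threshold ε implicitly fixes the cut-offs l and u: a larger ε widens the tails.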

  19. Not a density threshold. Not all distributions have two tails: exponential distributions, for example, have only one.

  20. Univariate. For example, for a Gaussian, f_X(x) = 1/(σ·√(2π)) · e^(−(x−μ)²/(2σ²)); with sufficient data we can estimate μ and σ with high accuracy. We can then compute z-scores, z_i = (x_i − μ)/σ: large positive values correspond to the upper tail, large negative values to the lower tail. We can write the pdf in terms of z-scores as f_X(z_i) = 1/(σ·√(2π)) · e^(−z_i²/2). The cumulative normal distribution then tells us the area of the tail beyond z_i. As a rule of thumb, z-scores with absolute values larger than 3 are extreme.
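The z-score rule of thumb fits in a few lines of Python; the data set (fourteen identical inliers plus one extreme value) is made up for illustration:

```python
import math

def z_scores(values):
    """z_i = (x_i - mu) / sigma, using the population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]

# Made-up data: fourteen identical inliers and one extreme value.
data = [10] * 14 + [100]
zs = z_scores(data)

# Rule of thumb from the slide: |z| > 3 marks an extreme value.
extremes = [v for v, z in zip(data, zs) if abs(z) > 3]
print(extremes)  # [100]
```

One caveat worth noting: the extreme value inflates the estimated σ itself, and with the population standard deviation no z-score can exceed (n−1)/√n, so on very small samples the |z| > 3 rule can never fire.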

  21. Depth-based methods. The main idea is that the convex hull of a set of data points represents the Pareto-optimal extremes of the set. Find the convex hull, and assign depth k to all x ∈ hull(X); remove hull(X) from X, increase k, and repeat until X is empty. The depth k identifies how extreme a point is.

  22. Example, depth. [Figure: nested convex hulls of a point set, labeled Depth 1 through Depth 4 from the outside in.]

  23. Depth-based methods. The main idea is that the convex hull of a set of data points represents the Pareto-optimal extremes of the set: find the set S of corners of the convex hull of X, assign depth k to all x ∈ S, and repeat until X is empty. The depth of a point identifies how extreme it is. Very sensitive to dimensionality: recall how points in high dimensions are typically distributed over the hull of a hypersphere; also, the computational complexity of convex hulls grows quickly with dimension.
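The hull-peeling procedure can be sketched in pure Python for the 2-D case; the convex-hull routine below is Andrew's monotone chain, and the nested-squares data set is an illustrative assumption:

```python
def convex_hull(points):
    """Andrew's monotone chain: corner points of the 2-D convex hull."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # > 0 if the turn o -> a -> b is counter-clockwise
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hull_depths(points):
    """Peel convex hulls: depth k = iteration at which a point is a hull corner."""
    remaining, depth, k = list(points), {}, 1
    while remaining:
        corners = set(convex_hull(remaining))
        for p in corners:
            depth[p] = k
        remaining = [p for p in remaining if p not in corners]
        k += 1
    return depth

# Made-up data: an outer square, an inner square, and the centre point.
points = [(0, 0), (0, 4), (4, 0), (4, 4),
          (1, 1), (1, 3), (3, 1), (3, 3),
          (2, 2)]
depth = hull_depths(points)
print(depth[(0, 0)], depth[(1, 1)], depth[(2, 2)])  # 1 2 3
```

The outer square peels off at depth 1, the inner square at depth 2, and the centre point last, matching the nested-hull picture on the previous slide.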

  24. Multivariate Extreme Value Analysis. We can also define tails for multivariate distributions: areas of extreme values with probability density less than some threshold. More complicated than univariate, and it only works for unimodal distributions with a single peak.

  25. Multivariate Extreme Value Analysis. For a multivariate Gaussian, we have its density as f(x) = 1/√(|Σ|·(2π)^d) · e^(−(1/2)·(x−μ)·Σ⁻¹·(x−μ)ᵀ), where Σ is the d-by-d covariance matrix and |Σ| is its determinant. The exponent resembles the Mahalanobis distance…

  26. Mahalanobis distance. Mahalanobis distance is defined as M(x, μ, Σ) = √((x−μ)·Σ⁻¹·(x−μ)ᵀ), where Σ is a d-by-d covariance matrix and μ a mean vector. It is essentially Euclidean distance after applying PCA and after dividing by the standard deviation; very useful in practice. [Figure: points a and b at equal Euclidean distance from the centre, on axes Feature Y and Feature Z; M(b, μ, Σ) > M(a, μ, Σ).] (Mahalanobis, 1936)
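A small worked version of the point-a-versus-point-b comparison from the figure, in pure Python for the 2-D case with a hand-inverted covariance matrix; the toy mean, covariance, and points are assumptions for illustration:

```python
import math

def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance sqrt((x-mu) Sigma^{-1} (x-mu)^T) for 2-D data."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # closed-form 2x2 inverse
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # quadratic form dx * Sigma^{-1} * dx^T
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# Toy distribution: variance 4 along the first axis, 1 along the second.
mu = (0.0, 0.0)
cov = [[4.0, 0.0], [0.0, 1.0]]

a = (2.0, 0.0)  # along the high-variance axis, Euclidean distance 2
b = (0.0, 2.0)  # along the low-variance axis, Euclidean distance 2

print(mahalanobis_2d(a, mu, cov))  # 1.0 (2 units / standard deviation 2)
print(mahalanobis_2d(b, mu, cov))  # 2.0 (2 units / standard deviation 1)
```

Both points are equally far in Euclidean terms, yet b is twice as far in Mahalanobis terms because it deviates along the low-variance direction; this is exactly the M(b, μ, Σ) > M(a, μ, Σ) situation sketched in the figure.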

  27. Multivariate Extreme Value Analysis. For a multivariate Gaussian, we have its density as f(x) = 1/√(|Σ|·(2π)^d) · e^(−(1/2)·(x−μ)·Σ⁻¹·(x−μ)ᵀ), where Σ is the d-by-d covariance matrix and |Σ| is its determinant. The exponent is half the squared Mahalanobis distance: f(x) = 1/√(|Σ|·(2π)^d) · e^(−(1/2)·M(x, μ, Σ)²). For the probability density to fall below a threshold, the Mahalanobis distance needs to be larger than a corresponding threshold.
