Chapter 9: Outlier Analysis



  1. Chapter 9: Outlier Analysis. Jilles Vreeken, IRDM ‘15/16, 8 Dec 2015

  2. IRDM Chapter 9, overview. Basics & Motivation; 1. Extreme Value Analysis; 2. Probabilistic Methods; 3. Cluster-based Methods; 4. Distance-based Methods; 5. You’ll find this covered in: Aggarwal, Ch. 8, 9

  3. December 14th–18th: Tutorials on Graph Mining. January 4th–8th: no Tutorials.

  4. The Second Midterm Test, December 10th 2015. When: from 14:15 to 15:25. Where: Günter-Hotz-Hörsaal (E2 2). Material: Patterns, Clusters, and Classification. You are allowed to bring one (1) sheet of A4 paper with handwritten or printed notes on both sides. No other material (notes, books, course materials) or devices (calculator, notebook, cell phone, spoon, etc.) allowed. Bring an ID; either your UdS card, or passport.

  5. Chapter 9.1: The Basics & Motivation. Aggarwal Ch. 8.1

  6. Outliers. An outlier is a data point very different from most of the remaining data. The standard definition is by Hawkins: “An outlier is an observation which deviates so much from the other observations as to arouse suspicion that it was generated by a different mechanism.”

  7. Example Outliers. [Figure: scatter plot of Feature Y versus Feature Z, showing an inlier, an outlier, and a borderline “outlier, maybe” point.]

  8. Outliers. An outlier is a data point very different from most of the remaining data. Outliers are also known as anomalies, abnormalities, discordants, and deviants.

  9. Why bother? Outlier analysis is a key area of data mining. Unlike pattern mining, clustering, and classification, it aims to describe what is not normal. Applications are many: data cleaning, fraud detection, intrusion detection, rare disease detection, predictive maintenance.

  10. Not noise. Outliers are not noise: noise is uninteresting, outliers are; noise is random, outliers aren’t. Outliers are generated by a different process, e.g. Lionel Messi, credit card fraudsters, or rare disease patients. We have too little data to infer that process exactly, but detected outliers help us to better understand the data.

  11. Outliers everywhere. Many, many different outlier detection methods exist, because many different methods are needed: e.g. continuous vs. discrete data; e.g. tables, sequences, graphs. The key problem, and why outlier analysis is interesting: beforehand, we do not know what we are looking for. What is weird? What is normal?

  12. Three Types of Outliers. Global outliers: objects that deviate from the rest of the data set; main issue: finding a good measure of deviation. Local outliers: objects that deviate from a selected context, e.g. differ strongly from their neighboring objects; main issue: how to define the local context? Collective outliers: a subset of objects that collectively deviate from the data or context, e.g. in intrusion detection; main issue: the combinatorial number of sets of objects.

  13. Ranking versus Thresholding. Most outlier analysis methods give a real-valued score. How to decide whether a point is worth looking at? We set a threshold, or look at the top-k; there is no best answer, it depends on the situation. How to evaluate? Very, very difficult: is there a ‘true’ outlier ranking? How bad is it to miss one, or to report two too many?
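To make the threshold-versus-top-k distinction concrete, here is a minimal Python sketch; the outlier scores, the cutoff value, and k are made-up assumptions for illustration, not from the lecture:

```python
# Hypothetical outlier scores for nine points; higher means more anomalous.
scores = [0.1, 0.3, 8.2, 0.2, 0.4, 5.1, 0.3, 0.2, 0.5]

# Option 1: thresholding. Flag every point whose score exceeds a cutoff.
threshold = 1.0
flagged = [i for i, s in enumerate(scores) if s > threshold]

# Option 2: ranking. Report only the k highest-scoring points.
k = 1
top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

print(flagged)  # [2, 5]: two points exceed the threshold
print(top_k)    # [2]: only the single highest-scoring point
```

The two policies can disagree, as here: the threshold reports two points while top-1 reports one, which is exactly why the choice depends on the situation.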

  14. Supervised Outlier Detection. Given sufficient data, we can construct a classifier and then simply use it to predict how outlying an object is; typically, this does not fly in practice. Problem 1: insufficient training data. Outliers are rare; we can boost (resample) a training set from a small set of known outliers, or we can train on artificial samples. Problem 2: recall. Recall is more important than accuracy: we want to catch them all.

  15. Chapter 9.2: Extreme Value Analysis. Aggarwal Ch. 8.2

  16. Extreme Values. The traditional statistical approach to identifying outliers is extreme value analysis: those points x ∈ X that are in the statistical tails of the probability distribution f of X are outliers. This only identifies very specific outliers. For example, for {1, 3, 3, 3, 50, 97, 97, 97, 100} the extreme values are 1 and 100, although 50 is the most isolated. Tails are naturally defined for univariate distributions; defining the multivariate tail area of a distribution is more tricky.

  17. Problems with multivariate tails. [Figure: scatter plot of Feature Y versus Feature Z, contrasting an extreme value with a point that is an outlier but not an extreme value.]

  18. Univariate Extreme Value Analysis. Strong relation to statistical tail confidence tests. Assume a distribution, and consider the probability density function f_X(x) for attribute X. The lower tail is then those values x < l for which f_X(x) < ε; the upper tail is then those values x > u for which f_X(x) < ε.
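A minimal sketch of this tail test for a Gaussian density; the standard-normal parameters and the density threshold ε = 0.01 are illustrative assumptions. Note that for a unimodal Gaussian, density below ε happens only in the two tails, so the density check alone suffices here:

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density f_X(x) of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def in_tail(x, mu=0.0, sigma=1.0, eps=0.01):
    """For a unimodal Gaussian, density below eps places x in a tail."""
    return gaussian_pdf(x, mu, sigma) < eps

print(in_tail(0.0))   # False: near the peak
print(in_tail(3.5))   # True: upper tail
print(in_tail(-3.5))  # True: lower tail
```

The threshold ε implicitly fixes the cut-offs l and u: a larger ε widens the tails.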

  19. Not a density threshold. Not all distributions have two tails: exponential distributions, for example, have only one.

  20. Univariate. For example, for a Gaussian, f_X(x) = 1/(σ·√(2π)) · e^(−(x−μ)²/(2σ²)); with sufficient data we can estimate μ and σ with high accuracy. We can then compute z-scores, z_i = (x_i − μ)/σ: large positive values correspond to the upper tail, large negative values to the lower tail. We can write the pdf in terms of z-scores as f_X(z_i) = 1/(σ·√(2π)) · e^(−z_i²/2). The cumulative normal distribution then tells us the area of the tail beyond z_i. As a rule of thumb, z-scores with absolute values larger than 3 are extreme.
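The z-score rule of thumb fits in a few lines of Python; the data set (fourteen identical inliers plus one extreme value) is made up for illustration:

```python
import math

def z_scores(values):
    """z_i = (x_i - mu) / sigma, using the population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]

# Made-up data: fourteen identical inliers and one extreme value.
data = [10] * 14 + [100]
zs = z_scores(data)

# Rule of thumb from the slide: |z| > 3 marks an extreme value.
extremes = [v for v, z in zip(data, zs) if abs(z) > 3]
print(extremes)  # [100]
```

One caveat worth noting: the extreme value inflates the estimated σ itself, and with the population standard deviation no z-score can exceed (n−1)/√n, so on very small samples the |z| > 3 rule can never fire.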

  21. Depth-based methods. The main idea is that the convex hull of a set of data points represents the Pareto-optimal extremes of the set. Find the convex hull, and assign depth k to all x ∈ hull(X); remove hull(X) from X, increase k, and repeat until X is empty. The depth k identifies how extreme a point is.

  22. Example, depth. [Figure: nested convex hulls of a point set, labeled Depth 1 through Depth 4 from the outside in.]

  23. Depth-based methods. The main idea is that the convex hull of a set of data points represents the Pareto-optimal extremes of the set: find the set S of corners of the convex hull of X, assign depth k to all x ∈ S, and repeat until X is empty. The depth of a point identifies how extreme it is. Very sensitive to dimensionality: recall how points in high dimensions are typically distributed over the hull of a hypersphere; also, the computational complexity of convex hulls grows quickly with dimension.
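The hull-peeling procedure can be sketched in pure Python for the 2-D case; the convex-hull routine below is Andrew's monotone chain, and the nested-squares data set is an illustrative assumption:

```python
def convex_hull(points):
    """Andrew's monotone chain: corner points of the 2-D convex hull."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # > 0 if the turn o -> a -> b is counter-clockwise
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hull_depths(points):
    """Peel convex hulls: depth k = iteration at which a point is a hull corner."""
    remaining, depth, k = list(points), {}, 1
    while remaining:
        corners = set(convex_hull(remaining))
        for p in corners:
            depth[p] = k
        remaining = [p for p in remaining if p not in corners]
        k += 1
    return depth

# Made-up data: an outer square, an inner square, and the centre point.
points = [(0, 0), (0, 4), (4, 0), (4, 4),
          (1, 1), (1, 3), (3, 1), (3, 3),
          (2, 2)]
depth = hull_depths(points)
print(depth[(0, 0)], depth[(1, 1)], depth[(2, 2)])  # 1 2 3
```

The outer square peels off at depth 1, the inner square at depth 2, and the centre point last, matching the nested-hull picture on the previous slide.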

  24. Multivariate Extreme Value Analysis. We can also define tails for multivariate distributions: areas of extreme values with probability density less than some threshold. More complicated than univariate, and it only works for unimodal distributions with a single peak.

  25. Multivariate Extreme Value Analysis. For a multivariate Gaussian, we have its density as f(x) = 1/√(|Σ|·(2π)^d) · e^(−(1/2)·(x−μ)·Σ⁻¹·(x−μ)ᵀ), where Σ is the d-by-d covariance matrix and |Σ| is its determinant. The exponent resembles the Mahalanobis distance…

  26. Mahalanobis distance. Mahalanobis distance is defined as M(x, μ, Σ) = √((x−μ)·Σ⁻¹·(x−μ)ᵀ), where Σ is a d-by-d covariance matrix and μ a mean vector. It is essentially Euclidean distance after applying PCA and after dividing by the standard deviation; very useful in practice. [Figure: points a and b at equal Euclidean distance from the centre, on axes Feature Y and Feature Z; M(b, μ, Σ) > M(a, μ, Σ).] (Mahalanobis, 1936)
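A small worked version of the point-a-versus-point-b comparison from the figure, in pure Python for the 2-D case with a hand-inverted covariance matrix; the toy mean, covariance, and points are assumptions for illustration:

```python
import math

def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance sqrt((x-mu) Sigma^{-1} (x-mu)^T) for 2-D data."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # closed-form 2x2 inverse
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # quadratic form dx * Sigma^{-1} * dx^T
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# Toy distribution: variance 4 along the first axis, 1 along the second.
mu = (0.0, 0.0)
cov = [[4.0, 0.0], [0.0, 1.0]]

a = (2.0, 0.0)  # along the high-variance axis, Euclidean distance 2
b = (0.0, 2.0)  # along the low-variance axis, Euclidean distance 2

print(mahalanobis_2d(a, mu, cov))  # 1.0 (2 units / standard deviation 2)
print(mahalanobis_2d(b, mu, cov))  # 2.0 (2 units / standard deviation 1)
```

Both points are equally far in Euclidean terms, yet b is twice as far in Mahalanobis terms because it deviates along the low-variance direction; this is exactly the M(b, μ, Σ) > M(a, μ, Σ) situation sketched in the figure.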

  27. Multivariate Extreme Value Analysis. For a multivariate Gaussian, we have its density as f(x) = 1/√(|Σ|·(2π)^d) · e^(−(1/2)·(x−μ)·Σ⁻¹·(x−μ)ᵀ), where Σ is the d-by-d covariance matrix and |Σ| is its determinant. The exponent is half the squared Mahalanobis distance: f(x) = 1/√(|Σ|·(2π)^d) · e^(−(1/2)·M(x, μ, Σ)²). For the probability density to fall below a threshold, the Mahalanobis distance needs to be larger than a corresponding threshold.
