

  1. Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Stephen D. Bay 1 and Mark Schwabacher 2 1 Institute for the Study of Learning and Expertise sbay@apres.stanford.edu 2 NASA Ames Research Center Mark.A.Schwabacher@nasa.gov

  2. Motivation Detecting outliers or anomalies is an important KDD task with many practical applications, and fast algorithms are needed for large databases. In this talk, I will – Show that very simple modifications of a basic algorithm lead to extremely good performance – Explain why this approach works well – Discuss limitations of this approach

  3. Distance-Based Outliers • The main idea is to find points in low-density regions of the feature space, where the density is estimated as P(x) ≈ k / (N·V) • V is the total volume within radius d of x • N is the total number of examples • k is the number of examples inside the sphere • The distance measure determines proximity and scaling.
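
As a side note (not from the slides), the estimate P(x) ≈ k/(N·V) can be computed directly by taking d to be the distance from x to its k-th nearest neighbor; a small Python sketch with my own naming, assuming x is not itself a row of data:

```python
import math
import numpy as np

def knn_density_estimate(data, x, k):
    """Estimate the density at x as k / (N * V), where V is the volume of the
    smallest sphere around x containing k examples (x assumed not in data)."""
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    d = dists[k - 1]                     # radius that captures k examples
    dim = data.shape[1]
    # volume of a radius-d ball in `dim` dimensions
    V = (math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)) * d ** dim
    return k / (len(data) * V)
```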

  4. Outlier Definitions • Outliers are the examples for which there are fewer than p other examples within distance d – Knorr & Ng • Outliers are the top n examples whose distance to the kth nearest neighbor is greatest – Ramaswamy, Rastogi, & Shim • Outliers are the top n examples whose average distance to the k nearest neighbors is greatest – Angiulli & Pizzuti; Eskin et al. • These definitions all relate to the density estimate P(x) ≈ k / (N·V)
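
For concreteness, here is a brute-force Python sketch (my own code, not from the talk) that computes the quantities behind these three definitions on a small in-memory data set:

```python
import numpy as np

def outlier_scores(data, k, d=None):
    """Per-example scores for the three definitions above
    (brute-force O(N^2) distances; for illustration only)."""
    D = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # exclude self-distance
    knn = np.sort(D, axis=1)[:, :k]      # distances to the k nearest neighbors
    kth_dist = knn[:, -1]                # Ramaswamy et al.: k-th NN distance
    avg_dist = knn.mean(axis=1)          # Angiulli & Pizzuti / Eskin et al.
    if d is not None:
        within_d = (D < d).sum(axis=1)   # Knorr & Ng: outlier if count < p
        return kth_dist, avg_dist, within_d
    return kth_dist, avg_dist
```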

  5. Existing Methods • Nested Loops – For each example, find its nearest neighbors with a sequential scan • O(N²) • Index Trees – For each example, find its nearest neighbors with an index tree • Potentially O(N log N); in practice can be worse than the nested loop (NL) approach • Partitioning Methods – For each example, find its nearest neighbors given that the examples are stored in bins (e.g., cells, clusters) • Cell-based methods are potentially O(N), but in practice worse than NL for more than 5 dimensions (Knorr & Ng) • Cluster-based methods appear sub-quadratic

  6. Our Algorithm • Based on nested loops – For each example, find its nearest neighbors with a sequential scan • Two modifications – Randomize the order of examples • Can be done with a disk-based algorithm in linear time – While performing the sequential scan, • Keep track of the closest neighbors found so far • Prune an example once the neighbors found so far indicate that it cannot be a top outlier • Process examples in blocks • Worst case O(N²) distance computations, O(N²/B) disk accesses (see the sketch below)
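
A minimal in-memory Python sketch of these modifications (my own code, not the authors' disk-based implementation): shuffle the data, process it in blocks, track each block example's closest neighbors found so far, and prune an example once its running k-th nearest-neighbor distance falls to the current cutoff. Scoring here uses the distance to the k-th nearest neighbor rather than the average distance used in the experiments.

```python
import numpy as np

def top_outliers(data, n, k, block_size=1000, rng=None):
    """Randomized nested loop with pruning (in-memory sketch).
    Scores each example by the distance to its k-th nearest neighbor."""
    rng = np.random.default_rng() if rng is None else rng
    data = data[rng.permutation(len(data))]      # randomize example order
    cutoff = 0.0                                 # score of weakest top outlier so far
    top = []                                     # (score, index into shuffled data)
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        # keep the k+1 smallest distances seen so far; the extra slot absorbs
        # the zero distance produced when the scan reaches the example itself
        nn = np.full((len(block), k + 1), np.inf)
        alive = np.ones(len(block), dtype=bool)  # not yet pruned
        for y in data:                           # sequential scan over all data
            d = np.linalg.norm(block[alive] - y, axis=1)
            merged = np.sort(np.hstack([nn[alive], d[:, None]]), axis=1)[:, :k + 1]
            nn[alive] = merged
            # prune: the running k-th NN distance only shrinks, so once it
            # reaches the cutoff the example cannot be a top outlier
            alive[alive] = merged[:, -1] > cutoff
            if not alive.any():
                break
        scores = nn[:, -1]
        top.extend((scores[i], start + i) for i in np.flatnonzero(scores > cutoff))
        top = sorted(top, reverse=True)[:n]
        if len(top) == n:
            cutoff = top[-1][0]                  # raise the pruning threshold
    return top
```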

  7. Pruning • Outliers based on distance to the 3rd nearest neighbor (k=3) • d is the distance to the 3rd nearest neighbor of the weakest top outlier • [Illustration: a sequential scan over rows of census records; an example x is pruned as soon as 3 neighbors within distance d of x have been found]

  8. Experimental Setup • 6 data sets varying from 68K to 5M examples • Mixture of discrete and continuous features (23–55 features) • Wall time reported (CPU + I/O) – Time does not include randomization • No special caching of records • Pentium 4, 1.5 GHz, 1 GB RAM • Memory footprint ~3 MB • Mined top 30 outliers, k=5, block size = 1000, average distance score

  9. Scaling with N [Log-log plots of total time versus data set size for Corel Histogram, KDD Cup 1999, Person, and Normal 30D]

  10. Scaling Summary • Slope of a regression fit relating log time to log N, i.e., t = a·N^b, or log t = log a + b·log N:
      Data Set          Slope (b)
      Corel Histogram   1.13
      Covertype         1.25
      KDDCup 1999       1.13
      Household 1990    1.32
      Person 1990       1.16
      Normal 30D        1.15
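
The exponent b can be recovered from (N, time) measurements with an ordinary least-squares fit in log-log space; a quick sketch using made-up timings:

```python
import numpy as np

# hypothetical (N, wall time in seconds) measurements
N = np.array([1e3, 1e4, 1e5, 1e6])
t = np.array([0.4, 5.6, 78.0, 1050.0])

# t = a * N**b  =>  log t = log a + b * log N
b, log_a = np.polyfit(np.log10(N), np.log10(t), 1)
print(f"b = {b:.2f}, a = {10 ** log_a:.2e}")
```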

  11. Scaling with k [Plots of total time versus k for Person and Normal 30D] • 1 million records used for both Person and Normal 30D

  12. Average Case Analysis • Consider the operation of the algorithm at a moment in time – Outliers defined by distance to the kth neighbor – Current cutoff distance is d – Randomization + sequential scan = i.i.d. sampling of the pdf • Let p(x) = probability that a randomly drawn example lies within distance d of x: p(x) = ∫_{||x′ − x|| ≤ d} pdf(x′) dV • How many examples do we need to look at?
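
In practice p(x) can be estimated empirically as the fraction of examples that fall within distance d of x; a minimal sketch (the function name is my own):

```python
import numpy as np

def p_hat(data, x, d):
    """Empirical estimate of p(x): fraction of examples within distance d of x."""
    return np.mean(np.linalg.norm(data - x, axis=1) <= d)
```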

  13. For non-outliers, the number of samples follows a negative binomial distribution • Let P(Y = y) be the probability of obtaining the kth success on step y: P(Y = y) = C(y−1, k−1) · p(x)^k · (1 − p(x))^(y−k) • The expected number of samples with infinite data is E[Y] = Σ_{y=k}^{∞} y · P(Y = y) = k / p(x)
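
As a sanity check on this expectation, a short simulation with made-up values of k and p(x) confirms that the average number of examples scanned before an ordinary point is pruned is about k/p(x):

```python
import numpy as np

rng = np.random.default_rng(0)
k, p = 5, 0.02            # k neighbors needed; p(x) = chance a random example lies within d

# i.i.d. draws until the k-th success, repeated many times
draws = k + rng.negative_binomial(k, p, size=100_000)  # numpy counts failures only
print(draws.mean(), "vs expected", k / p)              # both about 250
```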

  14. How does the cutoff change during program execution? [Plot for the Person data set: cutoff versus percent of data set processed, for data set sizes 50K, 100K, 1M, and 5M]

  15. Scaling Rate b Versus Cutoff Ratio [Scatter plot of the polynomial scaling rate b against the relative change in cutoff (50K/5K) as N increases, for Uniform 3D, Household, Covertype, Person, KDDCup, Corel Histogram, Mixed 3D, and Normal 30D]

  16. Limitations • Failure modes – examples not in random order – examples not independent – no outliers in data

  17. The method fails when there are no outliers [Log-log plot of total time versus size for Uniform 3D, b = 1.76; inset shows the density P(x_i)] • Examples drawn from a uniform distribution in 3 dimensions

  18. However, the method is efficient if there are at least a few outliers [Log-log plot of total time versus size for Mixed 3D, b = 1.11; inset shows the density P(x_i)] • Examples drawn from a 99% uniform, 1% Gaussian distribution

  19. Future Work • Pruning eliminates examples when they cannot be a top outlier. Can we also prune examples when they are almost certain to be an outlier? • How many examples is enough? Do we need to do the full N² comparisons? • How do algorithm settings affect performance, and do they interact with data set characteristics? • How do we deal with dependent data points?

  20. Summary & Conclusions • Presented a nested loop approach to finding distance-based outliers • Efficient: scales to large data sets with millions of examples and many features • Easy to implement, and should be the new strawman for research on speeding up distance-based outlier detection

  21. Resources • Executables available from http://www.isle.org/~sbay • Comparison with GritBot on Census data http://www.isle.org/~sbay/papers/kdd03/ • Datasets are public and are available by request

  22. Scaling Summary • Growth in run time relative to a linear-time algorithm: N^(b−1) for polynomial scaling versus log10 N for N log N scaling
      N            N^0.13 (b=1.13)   N^0.32 (b=1.32)   log10 N
      100          1.8               4.4               2
      1,000        2.5               9.1               3
      10,000       3.3               19.1              4
      100,000      4.5               39.8              5
      1,000,000    6.0               83.2              6
      10,000,000   8.1               173.8             7

  23. How big a sample do we need? It depends… [Plots of correspondence versus size of the reference set for Normal 30D (k100 a500) and Person (k30 a30)]
