Fast and Scalable Outlier Detection with Approximate Nearest Neighbor Ensembles
Erich Schubert, Arthur Zimek, Hans-Peter Kriegel
Lehr- und Forschungseinheit Datenbanksysteme, Institut für Informatik, Ludwig-Maximilians-Universität München
Hanoi, 2015-04-22
E. Schubert, A. Zimek, H.-P. Kriegel: Outlier Detection with AkNN Ensembles, 2015-04-22

Motivation
Outlier Detection: Use Cases
Outliers: car crash hotspots (using KDEOS) [SZK14a], using Open Data (7 years, 1.2 million accidents) from the UK.

Outlier Detection: kNN-Outlier
kNN outlier [RRS00]: score(o) := k-dist(o) (here: k = 3)
[Figure: example points annotated with their 3-distances (0.54, 0.65, 0.81); the strongest outlier is marked]
Many outlier detection methods are based on kNN and LOF [Bre+00]. Examples: [AP02; Jin+06; Kri+09; SZK14b]
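The kNN outlier score can be illustrated with a minimal brute-force sketch. The function name `knn_outlier_scores` and the toy data are assumptions made for illustration, and the sketch assumes that a point does not count as its own neighbor (conventions differ on this).

```python
import math

def knn_outlier_scores(points, k):
    """Return the k-distance of every point (brute force, O(n^2) distances)."""
    scores = []
    for i, p in enumerate(points):
        # Distances to all other points; the point itself is excluded (assumption).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])  # k-th smallest distance = k-dist
    return scores

# Toy data (hypothetical): a unit square plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = knn_outlier_scores(points, k=3)
# Index of the point with the largest k-distance, i.e. the strongest outlier.
strongest = max(range(len(points)), key=scores.__getitem__)
```

The nested loop over all pairs is exactly the quadratic cost that motivates approximate neighbor search.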

Outlier Detection: Local Outlier Factor [Bre+00]
\[ \mathrm{LOF}(o) := \frac{1}{|k\mathrm{NN}(o)|} \sum_{p \in k\mathrm{NN}(o)} \frac{\mathrm{lrd}(p)}{\mathrm{lrd}(o)} \]
(the average relative density), where lrd(o) is the local reachability density:
\[ \mathrm{lrd}(o) := 1 \Big/ \left( \frac{1}{|k\mathrm{NN}(o)|} \sum_{p \in k\mathrm{NN}(o)} \text{reach-dist}(o \leftarrow p) \right) \]
(the inverse of the average reachability distance), and the (asymmetric) reachability of o from p is:
\[ \text{reach-dist}(o \leftarrow p) := \max\{ \underbrace{\mathrm{dist}(o, p)}_{\text{true distance}},\ \underbrace{k\text{-dist}(p)}_{\text{core size of neighbor}} \} \]
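The three definitions above translate almost line by line into code. The following is a minimal brute-force sketch, not an optimized implementation: the name `lof_scores` and the toy data are assumptions, ties at the k-th neighbor are simply cut at k, and duplicate points (which would make a reachability distance zero) are not handled.

```python
import math

def lof_scores(points, k):
    """Brute-force LOF following the LOF / lrd / reach-dist definitions."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbors of each point (self excluded; ties cut at k).
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
           for i in range(n)]
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]

    def reach_dist(o, p):
        # Asymmetric: the true distance, but at least the core size of neighbor p.
        return max(dist[o][p], k_dist[p])

    # lrd(o): inverse of the average reachability distance of o from its neighbors.
    lrd = [1.0 / (sum(reach_dist(o, p) for p in knn[o]) / len(knn[o]))
           for o in range(n)]
    # LOF(o): average ratio of the neighbors' density to o's own density.
    return [sum(lrd[p] / lrd[o] for p in knn[o]) / len(knn[o]) for o in range(n)]

# Toy data (hypothetical): a unit square plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(points, k=2)
```

On this toy data the four square corners get a LOF of 1 (same density as their neighbors), while the isolated point gets a value well above 1.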

Outlier Detection: Local Outlier Factor [Bre+00]
kNN has difficulties with different densities.
[Figure: kNN outlier scores, k = 5; both axes range from 0 to 100; legend: No Outlier, True Outlier]
LOF is designed to cope with different densities.
[Figure: LOF scores, k = 5; same axes and legend]

Outlier Detection
Many outlier detection methods are based on the k nearest neighbors. Unfortunately, computing the kNN for large data is expensive: pairwise distance computation is O(n^2), which is prohibitive for big data.
- R*-Tree [Bec+90]: good up to ≈ 30 dimensions (best: ≤ 10), but not easy to distribute to a cluster.
- PINN [dCH10; dCH12]: random projections + kd-tree.
- LSH [IM98]: may find fewer than k neighbors for outliers.
Wanted: an approximate approach to find the k nearest neighbors:
- High probability of finding the correct neighbors
- Errors should not hurt much
- Distributable to a cluster
- Supports high-dimensional data

Ingredients: Space-Filling Curves
Space-filling curves project multiple dimensions to one (Hilbert curve [Hil91], Peano curve [Pea90], and Z-curve [Mor66]).
Neighbors remain neighbors on the curve with high probability.
Each curve has "cuts" where neighborhoods are not well preserved.
Distributed sorting of large data is well understood.
However, space-filling curves do not work well with too many dimensions either, because they split one dimension at a time. We need more ingredients to improve the results.
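As an illustration of how a space-filling curve maps several dimensions to one, here is a small sketch of the Z-curve (Morton order) by bit interleaving. It assumes the coordinates have already been quantized to non-negative integers with `bits` bits each; the function name `z_order_key` and the 4 × 4 grid are hypothetical choices for this example.

```python
def z_order_key(coords, bits):
    """Interleave the bits of non-negative integer coordinates (Morton code)."""
    key = 0
    for bit in range(bits - 1, -1, -1):   # most significant bit first
        for c in coords:                  # one dimension at a time
            key = (key << 1) | ((c >> bit) & 1)
    return key

# Order the cells of a 4 x 4 grid along the Z-curve.
cells = [(x, y) for y in range(4) for x in range(4)]
order = sorted(cells, key=lambda c: z_order_key(c, bits=2))

# A "cut": the adjacent cells (1, 1) and (2, 1) are far apart on the curve.
gap = z_order_key((2, 1), bits=2) - z_order_key((1, 1), bits=2)
```

The first four cells in `order` form a compact 2 × 2 block, so most neighbors stay close on the curve. The cells (1, 1) and (2, 1), however, sit on opposite sides of the first split and receive keys 3 and 9, which is the kind of cut that several differently shifted curves are meant to compensate for.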

Ingredients: Random Projections (cf. [dCH10])
Random projections can reduce the dimensionality and preserve distances well (e.g. database-friendly [Ach01], p-stable [Dat+04]). In contrast to other dimensionality reduction methods (PCA, MDS), they project one vector at a time and thus can be distributed easily. Often, multiple projections are used and combined in an ensemble.
Objective: design an ensemble based on random projections and space-filling curves to find the k nearest neighbors.
- Distributable to a cluster with O(n) communication
- Different curves and projections avoid correlated errors
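A random projection of the "database-friendly" kind can be sketched in a few lines. This is an illustrative sketch only: the function names, the dimensions (100 to 20) and the Gaussian test vectors are assumptions, and the matrix uses the sparse entries sqrt(3) · {+1, 0, -1} with probabilities 1/6, 2/3, 1/6, scaled by 1/sqrt(d_out).

```python
import math
import random

def achlioptas_matrix(d_in, d_out, rng):
    """Sparse random matrix: entries sqrt(3) * {+1, 0, -1} with prob. 1/6, 2/3, 1/6."""
    values = (math.sqrt(3.0), 0.0, -math.sqrt(3.0))
    return [rng.choices(values, weights=(1, 4, 1), k=d_in) for _ in range(d_out)]

def project(vector, matrix):
    """Project a single vector; only the shared matrix is needed, no other data."""
    scale = 1.0 / math.sqrt(len(matrix))
    return [scale * sum(r * x for r, x in zip(row, vector)) for row in matrix]

rng = random.Random(0)
matrix = achlioptas_matrix(d_in=100, d_out=20, rng=rng)
a = [rng.gauss(0, 1) for _ in range(100)]
b = [rng.gauss(0, 1) for _ in range(100)]
# Ratio of projected to original distance; ideally close to 1.
ratio = math.dist(project(a, matrix), project(b, matrix)) / math.dist(a, b)
```

Because `project` touches one vector at a time, each node of a cluster can apply it to its own block of data once the matrix has been broadcast. The distance ratio fluctuates around 1, and the fluctuation shrinks as `d_out` grows.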

kNN SFC Ensemble Method: Ensemble for k-Nearest Neighbors
1. Generate m space-filling curves (with high diversity):
   - Different curve families (Peano, Hilbert, Z-curve)
   - Random projections or random subspaces
   - Different shift offsets
2. Project the data to each space-filling curve
3. Sort the data for each space-filling curve
4. Use a sliding window of width w × k to generate candidates
5. Merge the neighbor candidates for each point
6. Compute the real distances, and keep the k nearest neighbors
7. If needed, also emit reverse k nearest neighbors
All steps can be implemented well on a cluster: except for the sort and the sliding window, they can be expressed as "map" and "reduce" operations.

Ensemble for k-Nearest Neighbors, step 2: Project the data to each space-filling curve

    distributed on every node do
        // Blockwise I/O for efficiency
        foreach block do
            foreach curve do
                // Map to the SFC
                project data to curve
                // ...but delay the shuffle step
                store projected data locally
                // Sample data for sorting
                send sample to coordination node
            end
        end
    endon
    // Complete sort using sample distribution
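Steps 1 to 6 of the ensemble method can be sketched on a single machine as follows. This is a simplified illustration under several assumptions, not the authors' implementation: it uses only Z-curves with random shift offsets (no Hilbert or Peano curves, no random projections), it interprets the window as w × k positions on each side of a point, and the names `approx_knn`, `z_order_key` and all default parameter values are hypothetical.

```python
import heapq
import math
import random

def z_order_key(coords, bits):
    """Interleave the bits of non-negative integer coordinates (Morton code)."""
    key = 0
    for bit in range(bits - 1, -1, -1):
        for c in coords:
            key = (key << 1) | ((c >> bit) & 1)
    return key

def approx_knn(points, k, m=4, w=2, bits=10, seed=0):
    """Approximate kNN from m randomly shifted Z-curves (single-machine sketch)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]
    candidates = [set() for _ in range(n)]
    for _ in range(m):
        # Step 1: diversity through a random shift offset per dimension.
        shift = [rng.random() for _ in range(dim)]

        def key(p):
            # Step 2: scale to [0, 1], shift into [0, 2), quantize, interleave.
            cells = []
            for d in range(dim):
                span = (hi[d] - lo[d]) or 1.0
                unit = (p[d] - lo[d]) / span
                cells.append(int((unit + shift[d]) / 2.0 * ((1 << bits) - 1)))
            return z_order_key(cells, bits)

        # Step 3: sort the point indices along this curve.
        order = sorted(range(n), key=lambda i: key(points[i]))
        # Step 4: sliding window of w * k positions on each side (assumption).
        for pos, i in enumerate(order):
            window = order[max(0, pos - w * k): pos + w * k + 1]
            # Step 5: merge the candidates found on the different curves.
            candidates[i].update(j for j in window if j != i)
    # Step 6: compute real distances and keep the k nearest candidates.
    return [heapq.nsmallest(k, c, key=lambda j: math.dist(points[i], points[j]))
            for i, c in enumerate(candidates)]

rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(200)]
neighbors = approx_knn(points, k=5)
```

In a distributed setting, the key computation would correspond to the per-node projection in the pseudocode above, and the merge in step 5 to a reduce by point id. A true neighbor is missed only if it falls outside the window on every curve, which is why diverse curves are preferable to correlated ones.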
