RCDL 2014 Methods for Anomaly Detection: a Survey Kalinichenko L.A., Shanin I., Taraban I. IPI RAN
Outline • Introduction • Motivation • Methods taxonomy • Metric Data Oriented Methods • Distance-based Data Methods • Probabilistically Distributed Data Methods • Data with Correlated Features Methods • Evolving Data Oriented Methods • Discrete Sequences Data • Time Series • Graph-based Data Oriented Methods • Methods for many small graphs analysis • Methods for large single graph analysis • Text-based Data Oriented Methods • Future work RCDL-2014
Popular anomaly detection applications • Credit Card (and Mobile Phone) Fraud Detection • Suspicious Web Site Detection • Whole-Genome DNA Matching • ECG-Signal Filtering • Suspicious Transaction Detection • Analysis of Digital Sky Surveys • Social Network Analysis
Introduction Anomaly Detection • Finding data objects that have a suspicious origin. Purposes: 1. Data filtering 2. Finding rare events 3. Finding surprising data patterns
Motivation • Sometimes the data has a non-obvious form
Motivation • Sometimes anomalies are difficult to interpret
Data forms • The definition of an outlier depends on the specific problem and its data representation. The most popular data forms are: • Metric Data (data as objects in a feature space) • Evolving Data (time series and sequences) • Text-based Data (e.g. poll answers) • Graph-based Data (e.g. social networks)
Metric Data Oriented Methods • Data is a set of objects in the “feature” space. There are three main groups of methods: • Distance-based and density-based methods (with no additional assumptions) • Methods that assume that the data follows a certain probability distribution • Methods that assume that the “feature” space has highly correlated dimensions
Distance-based Data Oriented Methods • Data is represented as a set of objects in a “feature” space with a natural Euclidean metric. • An anomaly can be defined as a data object in this space that is located far from the other objects. • If no other assumptions about the data are given, there are two main approaches to detecting anomalies: • k-Nearest-Neighbor based methods • Clustering based methods
Clustering-based methods • The purpose is to form large clusters of data objects that define “normal” data; objects that fall far from every cluster are treated as anomalies. • The most popular methods are: • k-means • EM algorithm • Self-organizing maps
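A minimal sketch of the clustering idea, assuming k-means with caller-supplied starting centres and scoring each object by its distance to the nearest centre. The function names, the toy points, and the choice of two clusters are illustrative assumptions, not part of the survey.

```python
import math

def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm; `centroids` are caller-chosen starting centres."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def cluster_scores(points, centroids):
    """Anomaly score = distance to the closest cluster centre."""
    return [min(math.dist(p, c) for c in centroids) for p in points]

# Toy data (hypothetical): two tight groups and one stray point.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (2.5, 9.0)]
centroids = kmeans(points, centroids=[(0.0, 0.0), (5.0, 5.0)])
scores = cluster_scores(points, centroids)
most_anomalous = points[scores.index(max(scores))]
print(most_anomalous)
```

Note that the stray point still pulls its cluster centre toward itself, which is one reason robust variants or a separate "normal" training set are often preferred in practice.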
K-Nearest Neighbors • These methods use the set of distances from a given object to its k closest neighbors. • Assumption: an anomalous object has much larger distances to its neighbors than normal data points. • There are many versions of the kNN algorithm that consider the density of the neighborhood: • kNN with Local Outlier Factor (LOF) • kNN with Local Correlation Integral (LOCI) • kNN-DD (using Kolmogorov-Smirnov test) • etc.
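The basic assumption above can be sketched with the plain k-th-nearest-neighbor distance as the anomaly score. This is the simplest distance-only variant, not LOF, LOCI, or kNN-DD; the function name, the toy points, and k = 2 are assumptions made for illustration.

```python
import math

def knn_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Toy data (hypothetical): four close points and one distant point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (6.0, 6.0)]
scores = knn_scores(points, k=2)
outlier = points[scores.index(max(scores))]
print(outlier)
```

This brute-force version computes all pairwise distances, so it is quadratic in the number of objects; the density-aware variants listed on the slide additionally compare each score with the scores of the object's neighbors.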
Probabilistically distributed data • The data has a natural probability distribution. The main approaches: • Extreme value analysis (Markov, Chebyshev, Hoeffding inequalities, t-value test, etc.) • Probability distribution estimation (Bayesian methods) • Probabilistic mixture modeling (EM algorithm)
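As one illustration of extreme value analysis, the Chebyshev inequality bounds the probability of lying k standard deviations or more from the mean by 1/k² for any distribution with finite variance. The sketch below flags values whose bound falls under a chosen threshold; the function name, the threshold of 0.25, and the sample values are assumptions, and the mean and standard deviation are estimated from the same sample that contains the suspect value.

```python
import statistics

def chebyshev_flags(values, max_tail_prob=0.25):
    """Flag values whose Chebyshev tail bound 1/k^2 is at most max_tail_prob."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical: nothing stands out
    flagged = []
    for v in values:
        k = abs(v - mean) / std  # deviation in units of standard deviation
        # Chebyshev: P(|X - mean| >= k * std) <= 1 / k^2 (trivial for k <= 1).
        bound = 1.0 if k <= 1 else 1.0 / (k * k)
        if bound <= max_tail_prob:
            flagged.append(v)
    return flagged

# Toy measurements (hypothetical): one value far from the rest.
values = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9, 25.0]
print(chebyshev_flags(values))
```

Because the bound holds for every distribution, it is loose: a threshold of 0.25 corresponds to only two standard deviations, so distribution-specific tests are usually sharper when the distribution is known.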
Correlated dimensions data methods • Different dimensions are highly correlated with one another, so linear data models can be used. • Main assumption: the data is embedded in a lower-dimensional subspace. Main approaches: • Linear regression modeling • Principal component analysis
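A minimal sketch of the linear regression approach for two correlated features, assuming ordinary least squares of y on x and the absolute residual as the anomaly score. Principal component analysis would measure the orthogonal distance to the fitted subspace instead; the function names and the toy data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residual_scores(xs, ys):
    """Anomaly score = vertical distance from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]

# Toy data (hypothetical): y is roughly 2 * x, except at x = 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.0, 8.1, 2.0, 12.1]
scores = residual_scores(xs, ys)
suspect = scores.index(max(scores))
print(xs[suspect], ys[suspect])
```

The anomalous point also distorts the fitted line, so in this toy run the neighboring points receive inflated residuals too; robust regression is a common remedy.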
Evolving Data Oriented Methods • Data is generated by a continuous or discrete temporal process, so there are two main groups of methods: • Discrete sequences oriented methods • Example: Genome sequence analysis • Example: User-action sequence analysis • Time Series oriented methods • Example: Detecting novel events in social media • The data can be presented in a multidimensional form (such as data streams)
Discrete Sequence Oriented Methods • There are several ways to define an anomaly in a discrete sequence: • as a position anomaly (a single value is anomalous) • as a combination anomaly (the whole sequence is anomalous) • Main approaches to determining the rarity of a value or a combination: • Distance-based • analysis of the pair-wise distance matrix • Frequency-based • Model-based (Hidden Markov Models)
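A frequency-based combination anomaly can be sketched by counting n-grams across a collection of sequences and scoring each sequence by the share of its n-grams that are rare. The scoring rule, the `max_count` threshold, and the user-action sequences below are assumptions chosen for illustration.

```python
from collections import Counter

def ngrams(seq, n):
    """All contiguous windows of length n, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def rarity_scores(sequences, n=2, max_count=1):
    """Fraction of each sequence's n-grams seen at most max_count times overall."""
    counts = Counter(g for s in sequences for g in ngrams(s, n))
    scores = []
    for s in sequences:
        grams = ngrams(s, n)
        rare = sum(1 for g in grams if counts[g] <= max_count)
        scores.append(rare / len(grams) if grams else 0.0)
    return scores

# Toy user-action sequences (hypothetical).
sequences = [
    "login view view logout".split(),
    "login view logout".split(),
    "login view view view logout".split(),
    "login delete delete delete logout".split(),
]
scores = rarity_scores(sequences)
print(scores)
```

Counts here include the sequence being scored, so an n-gram repeated inside a single unusual sequence (such as `delete delete`) no longer counts as rare; scoring against a separate reference collection avoids that effect.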
Time Series Oriented Methods • There are several ways to define an anomaly in a time series: • an abrupt change in the time series • an unexpected trend in the time series • a time series of unusual shape • Main approaches are: • correlation across time or series • time series representation • HMM (Hidden Markov Model) • ARMA (Autoregressive Moving Average model) • Wavelets • Spectral
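To illustrate detecting an abrupt change with a model-based representation, the sketch below fits a first-order autoregressive model, AR(1), by least squares and scores each step by its one-step prediction error. AR(1) is the simplest special case of ARMA and is used here only as an assumption; the function names and the toy series are hypothetical.

```python
def fit_ar1(series):
    """Least squares fit of x[t] = c + phi * x[t-1]."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_p = sum(prev) / n
    mean_c = sum(curr) / n
    spp = sum((p - mean_p) ** 2 for p in prev)
    spc = sum((p - mean_p) * (x - mean_c) for p, x in zip(prev, curr))
    phi = spc / spp
    return mean_c - phi * mean_p, phi

def ar1_residuals(series):
    """Absolute one-step prediction errors for t = 1 .. len(series) - 1."""
    c, phi = fit_ar1(series)
    return [abs(x - (c + phi * x_prev))
            for x_prev, x in zip(series, series[1:])]

# Toy series (hypothetical): slow drift with a spike at t = 7.
series = [1.0, 1.2, 1.1, 1.3, 1.2, 1.4, 1.3, 5.0, 1.5, 1.4]
residuals = ar1_residuals(series)
# residuals[i] belongs to time step i + 1, since step 0 has no predecessor.
spike_index = residuals.index(max(residuals)) + 1
print(spike_index)
```

The model is fitted on the same series that contains the spike, which biases the coefficients; fitting on a clean reference window and scoring new observations is the more usual setup.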