anomaly detection

Anomaly Detection Qi Liu University of Science and Technology of - PowerPoint PPT Presentation

Anomaly Detection. Qi Liu, University of Science and Technology of China, qiliuql@ustc.edu.cn. Data Mining Tasks.


  1. Anomaly Detection. Qi Liu, University of Science and Technology of China, qiliuql@ustc.edu.cn

  2. Data Mining Tasks …

     Data:

     Tid  Refund  Marital Status  Taxable Income  Cheat
     1    Yes     Single          125K            No
     2    No      Married         100K            No
     3    No      Single          70K             No
     4    Yes     Married         120K            No
     5    No      Divorced        95K             Yes
     6    No      Married         60K             No
     7    Yes     Divorced        220K            No
     8    No      Single          85K             Yes
     9    No      Married         75K             No
     10   No      Single          90K             Yes
     11   No      Married         60K             No
     12   Yes     Divorced        220K            No
     13   No      Single          85K             Yes
     14   No      Married         75K             No
     15   No      Single          90K             Yes

  3. Anomaly/Outlier Detection
     - What are anomalies/outliers?
       - The set of data points that are considerably different from the remainder of the data
     - Natural implication is that anomalies are relatively rare
       - One in a thousand occurs often if you have lots of data
       - Context is important, e.g., freezing temps in July
     - Can be important or a nuisance
       - 10 foot tall 2 year old
       - Unusually high blood pressure

  4. Importance of Anomaly Detection
     Ozone Depletion History
     - In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
     - Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
     - The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
     Sources:
     http://exploringdata.cqu.edu.au/ozone.html
     http://www.epa.gov/ozone/science/hole/size.html

  5. Causes of Anomalies
     - Data from different classes
       - Measuring the weights of oranges, but a few grapefruit are mixed in
     - Natural variation
       - Unusually tall people
     - Data errors
       - 200 pound 2 year old

  6. Distinction Between Noise and Anomalies
     - Noise is erroneous, perhaps random, values or contaminating objects
       - Weight recorded incorrectly
       - Grapefruit mixed in with the oranges
     - Noise doesn't necessarily produce unusual values or objects
     - Noise is not interesting
     - Anomalies may be interesting if they are not a result of noise
     - Noise and anomalies are related but distinct concepts

  7. General Issues: Number of Attributes
     - Many anomalies are defined in terms of a single attribute
       - Height
       - Shape
       - Color
     - Can be hard to find an anomaly using all attributes
       - Noisy or irrelevant attributes
       - Object is only anomalous with respect to some attributes
     - However, an object may not be anomalous in any one attribute

  8. General Issues: Anomaly Scoring
     - Many anomaly detection techniques provide only a binary categorization
       - An object is an anomaly or it isn't
       - This is especially true of classification-based approaches
     - Other approaches assign a score to all points
       - This score measures the degree to which an object is an anomaly
       - This allows objects to be ranked
     - In the end, you often need a binary decision
       - Should this credit card transaction be flagged?
       - Still useful to have a score
     - How many anomalies are there?

  9. Other Issues for Anomaly Detection
     - Find all anomalies at once or one at a time
       - Swamping
       - Masking
     - Evaluation
       - How do you measure performance?
       - Supervised vs. unsupervised situations
     - Efficiency
     - Context
       - Professional basketball team

  10. Variants of Anomaly Detection Problems
     - Given a data set D, find all data points x ∈ D with anomaly scores greater than some threshold t
     - Given a data set D, find all data points x ∈ D having the top-n largest anomaly scores
     - Given a data set D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
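
The three problem variants can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the scores are made-up numbers, and the score used for the third variant (absolute distance from the median of D) is an assumed stand-in for whatever scoring method is actually in use.

```python
import statistics

def above_threshold(scores, t):
    """Variant 1: indices of points whose anomaly score exceeds threshold t."""
    return [i for i, s in enumerate(scores) if s > t]

def top_n(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]

def score_against(D, x):
    """Variant 3: score a test point x with respect to unlabeled data D.
    Assumed score (not from the slides): distance from the median of D."""
    return abs(x - statistics.median(D))

scores = [0.1, 0.3, 4.2, 0.2, 2.5]       # illustrative anomaly scores
flagged = above_threshold(scores, 2.0)   # points scoring above t = 2.0
ranked = top_n(scores, 1)                # the single most anomalous point
test_score = score_against([1, 2, 3, 4, 100], 10)
```

The first two variants differ only in how the binary decision is derived from the scores: a fixed cut-off versus a fixed budget of flagged points.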

  11. Model-Based Anomaly Detection
     - Build a model for the data and see
       - Unsupervised
         - Anomalies are those points that don't fit well
         - Anomalies are those points that distort the model
         - Examples:
           - Statistical distribution
           - Clusters
           - Regression
           - Geometric
           - Graph
       - Supervised
         - Anomalies are regarded as a rare class
         - Need to have training data

  12. Additional Anomaly Detection Techniques
     - Proximity-based
       - Anomalies are points far away from other points
       - Can detect this graphically in some cases
     - Density-based
       - Low density points are outliers
     - Pattern matching
       - Create profiles or templates of atypical but important events or objects
       - Algorithms to detect these patterns are usually simple and efficient
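
As a rough illustration of the proximity-based idea, the sketch below scores each point by the distance to its k-th nearest neighbor, so points far away from other points get large scores. The choice of this particular score, the function name, and k = 2 are assumptions made for the example, not something the slides specify.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor.
    A large score means the point is far from the other points."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Four points on a unit square plus one distant point (illustrative data)
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_distance_scores(points, k=2)
most_anomalous = scores.index(max(scores))  # index of the distant point
```

This brute-force version compares every pair of points, so it is quadratic in the number of points and is meant for small data sets only.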

  13. Graphical Approaches
     - Boxplots or scatter plots
     - Limitations
       - Not automatic
       - Subjective

  14. Convex Hull Method
     - Extreme points are assumed to be outliers
     - Use convex hull method to detect extreme values
     - What if the outlier occurs in the middle of the data?
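
The limitation raised in the last bullet can be seen in a small sketch. The code below computes a 2-D convex hull with Andrew's monotone chain algorithm (one standard way to do it, chosen here for brevity; the slides do not name an algorithm) and treats the hull vertices as the extreme points. The data set is invented: four corner points and one isolated point in the middle.

```python
def cross(o, a, b):
    """Cross product of vectors OA and OB; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would create a right turn or a straight line
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other
    return lower[:-1] + upper[:-1]

# Illustrative data: four corners and one isolated point in the middle
points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
extreme = convex_hull(points)
```

Here the hull consists of the four corners, so the isolated point (2, 2) is never reported, even though it is far from every other point.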

  15. Statistical Approaches
     Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.
     - Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)
     - Apply a statistical test that depends on
       - Data distribution
       - Parameters of distribution (e.g., mean, variance)
       - Number of expected outliers (confidence limit)
     - Issues
       - Identifying the distribution of a data set
         - Heavy tailed distribution
       - Number of attributes
       - Is the data a mixture of distributions?
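
A minimal sketch of the parametric approach, assuming the data follow a single normal distribution: estimate the mean and standard deviation, then flag values that lie many standard deviations from the mean, since those have low probability under the fitted model. The cut-off of 3 standard deviations is a common convention assumed here, and the function names and data are made up for illustration.

```python
import statistics

def z_scores(data):
    """Standardize each value using the sample mean and standard deviation."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return [(x - mu) / sigma for x in data]

def low_probability_points(data, z_cut=3.0):
    """Values more than z_cut standard deviations from the mean.
    z_cut = 3.0 is an assumed convention, not a value from the slides."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > z_cut]

# Illustrative data: 19 values near 10 and one value at 50
data = [9, 10, 11] * 6 + [10, 50]
outliers = low_probability_points(data)
```

Note that the value 50 is itself included when the mean and standard deviation are estimated, which inflates both. With fewer points the same value could fail to exceed the cut-off for exactly that reason.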

  16. Normal Distributions
     [Figures: one-dimensional Gaussian (probability density plotted against x); two-dimensional Gaussian (x and y axes)]

  17. Grubbs' Test
     - Detect outliers in univariate data
     - Assume data comes from normal distribution
     - Detects one outlier at a time, remove the outlier, and repeat
       - $H_0$: There is no outlier in data
       - $H_A$: There is at least one outlier
     - Grubbs' test statistic: $G = \dfrac{\max |X - \bar{X}|}{s}$
     - Reject $H_0$ if: $G > \dfrac{N-1}{\sqrt{N}} \sqrt{\dfrac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}}$
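
The test can be sketched in Python as follows. The statistic G is the largest absolute deviation from the sample mean divided by the sample standard deviation, and the rejection threshold is computed from N and a Student t critical value. The standard library has no t distribution, so the critical value is passed in by the caller; the value 3.0 used below is a placeholder for illustration only, not a looked-up quantile, and the function names and data are hypothetical.

```python
import math
import statistics

def grubbs_statistic(data):
    """G = max |X - mean| / s, with s the sample standard deviation."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return max(abs(x - mean) for x in data) / s

def grubbs_threshold(n, t_crit):
    """Rejection threshold for G given N = n and the critical value
    t_(alpha/N, N-2), which the caller must supply from a t table."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))

def grubbs_reject(data, t_crit):
    """True if H0 (no outlier in the data) is rejected."""
    return grubbs_statistic(data) > grubbs_threshold(len(data), t_crit)

data = [9, 10, 11, 10, 9, 11, 10, 30]   # illustrative measurements
G = grubbs_statistic(data)
reject = grubbs_reject(data, t_crit=3.0)  # 3.0 is a placeholder value
```

To follow the one-at-a-time procedure, the most deviant value would be removed after a rejection and the test repeated on the remaining data with a critical value for the new N.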

  18. Statistical-based – Likelihood Approach
     - Assume the data set D contains samples from a mixture of two probability distributions:
       - M (majority distribution)
       - A (anomalous distribution)
     - General Approach:
       - Initially, assume all the data points belong to M
       - Let $L_t(D)$ be the log likelihood of D at time t
       - For each point $x_t$ that belongs to M, move it to A
         - Let $L_{t+1}(D)$ be the new log likelihood
         - Compute the difference, $\Delta = L_t(D) - L_{t+1}(D)$
         - If $\Delta > c$ (some threshold), then $x_t$ is declared as an anomaly and moved permanently from M to A

  19. Statistical-based – Likelihood Approach
     - Data distribution, $D = (1 - \lambda) M + \lambda A$
     - M is a probability distribution estimated from data
       - Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
     - A is initially assumed to be uniform distribution
     - Likelihood at time t:

       $L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$

       $LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$
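
The procedure can be sketched for one-dimensional data as below. Several things here are assumptions rather than part of the slides: M is modeled as a normal distribution refitted to its current members, A is a uniform distribution over the range of the data, and the values of λ and c are arbitrary. The sketch also compares the increase in log likelihood, LL after the move minus LL before, against c, on the assumption that Δ is meant as the size of the change caused by moving the point.

```python
import math
import statistics

def log_likelihood(M, A, lam, uniform_density):
    """LL(D) = |M| log(1-lam) + sum log P_M(x) + |A| log(lam) + sum log P_A(x).
    Assumed models: P_M is a normal fit to M, P_A is a constant density."""
    dist = statistics.NormalDist(statistics.mean(M), statistics.stdev(M))
    ll_m = len(M) * math.log(1 - lam) + sum(math.log(dist.pdf(x)) for x in M)
    ll_a = len(A) * (math.log(lam) + math.log(uniform_density))
    return ll_m + ll_a

def likelihood_anomalies(data, lam=0.05, c=2.0):
    """Move a point from M to A when doing so raises LL by more than c.
    lam and c are illustrative values, not taken from the slides."""
    uniform_density = 1.0 / (max(data) - min(data))
    M, A = list(data), []
    for x in data:
        if len(M) <= 2:          # keep enough points to fit the normal model
            break
        before = log_likelihood(M, A, lam, uniform_density)
        trial_M = list(M)
        trial_M.remove(x)        # tentatively move x from M to A
        after = log_likelihood(trial_M, A + [x], lam, uniform_density)
        if after - before > c:   # large gain: keep x in A permanently
            M, A = trial_M, A + [x]
    return A

data = [10.0, 10.2, 9.8, 10.1, 25.0, 9.9, 10.3, 9.7, 10.0]  # made-up values
anomalies = likelihood_anomalies(data)
```

On this data only the value 25.0 ends up in A: removing it lets the normal model for M tighten considerably, while moving any of the other values costs more than it gains.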

  20. Strengths/Weaknesses of Statistical Approaches
     - Firm mathematical foundation
     - Can be very efficient
     - Good results if distribution is known
     - In many cases, data distribution may not be known
     - For high dimensional data, it may be difficult to estimate the true distribution
     - Anomalies can distort the parameters of the distribution
