Updates:
• Week 15 on 11/30
  • Project 2 progress presentation and discussion
    • Each team gets 10 min (including presentation and Q&A)
    • Focus on ideas and Q&A
  • Quiz 2 on recommender systems
• Quiz 1 will be graded over the weekend
• One review will be graded over the weekend
Welcome to DS504/CS586: Big Data Analytics
Anomaly/Outlier Detection
Prof. Yanhua Li
Time: 6:00pm – 8:50pm Thu
Location: KH116
Fall 2017
What is an outlier?
An outlier is an observation that deviates so much from other observations that it arouses suspicion that it was generated by a different mechanism.
Related terms: outliers, exceptions, peculiarities, surprises, etc.
Introduction (not an easy task)
• We are drowning in the deluge of data that are being collected world-wide, while starving for knowledge at the same time*
• Anomalous events occur relatively infrequently
• However, when they do occur, their consequences can be quite dramatic and quite often in a negative sense
“Mining needle in a haystack. So much hay and so little time”
* J. Naisbitt, Megatrends: Ten New Directions Transforming Our Lives. New York: Warner Books, 1982.
Importance of Anomaly Detection
Ozone Depletion History
• In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
• Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
• The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html
Real World Anomalies
• Credit card fraud
  • An abnormally high purchase made on a credit card
• Cyber intrusions
  • A web server involved in FTP traffic
Classical problem: Noise in measurements needs to be removed because of its detrimental effect on statistical inference, for example to make clustering more robust.
Data mining: A deviating or surprising measurement is interesting. In this case we want to detect outliers.
Anomaly/Outlier Detection
• Variants of anomaly/outlier detection problems
  • Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
  • Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  • Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
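The three problem variants can be sketched in a few lines of Python. This is only an illustration: the score used here (mean Euclidean distance from x to the points of D) is an assumed stand-in for f(x), and the names `anomaly_score`, `above_threshold`, and `top_n` are hypothetical.

```python
import math

def anomaly_score(x, database):
    """Variant 1: score of x with respect to D.
    Assumed score: mean Euclidean distance from x to every point in D."""
    return sum(math.dist(x, y) for y in database) / len(database)

def above_threshold(database, t):
    """Variant 2: all points of D whose score is greater than threshold t."""
    return [x for x in database if anomaly_score(x, database) > t]

def top_n(database, n):
    """Variant 3: the n points of D with the largest scores."""
    ranked = sorted(database, key=lambda x: anomaly_score(x, database), reverse=True)
    return ranked[:n]

# Toy database: four points in a unit square plus one distant point.
D = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
```

On this toy database, both `above_threshold(D, 5.0)` and `top_n(D, 1)` single out the distant point `(10.0, 10.0)`. Any other score function could be swapped in without changing the three variants.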
Anomaly Detection
• Challenges
  • How many outliers are there in the data?
  • The method is unsupervised
    • Validation can be quite challenging (just like for clustering)
  • Finding a needle in a haystack
• Working assumption:
  • There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data
Anomaly Detection Schemes
• General steps
  • Build a profile of the “normal” behavior
    • The profile can be patterns or summary statistics for the overall population
  • Use the “normal” profile to detect anomalies
    • Anomalies are observations whose characteristics differ significantly from the normal profile
• Types of anomaly detection schemes
  • Graphical & statistical-based
  • Distance-based
  • Model-based (machine learning/classification)
Graphical Approaches
• Boxplot (1-D), scatter plot (2-D), spin plot (3-D)
• Limitations
  • Time consuming
  • Subjective
Statistical Approaches
• Assume a parametric model describing the distribution of the data (e.g., normal distribution)
• Apply a statistical test that depends on
  • Data distribution
  • Parameters of the distribution (e.g., mean, variance)
  • Number of expected outliers (confidence limit)
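As one possible instance of the statistical approach, the sketch below assumes the data are roughly normally distributed and flags values whose z-score exceeds a cut-off. The function name `zscore_outliers` and the default cut-off of 3 standard deviations are assumptions made for illustration, not part of the lecture.

```python
import statistics

def zscore_outliers(values, z_max=3.0):
    """Flag values lying more than z_max sample standard deviations from the mean.
    Assumes an approximately normal distribution; z_max=3.0 is an assumed cut-off."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > z_max]
```

Note that the mean and standard deviation are themselves distorted by the outliers, so on very small samples a single extreme value may not reach the cut-off.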
Distance-based Approaches
• Data is represented as a vector of features
• Three major approaches
  • Nearest-neighbor based
  • Density based
  • Clustering based
K-Nearest Neighbour Graph
Definition: Every vector in the data set forms one node, and every node has pointers to its k nearest neighbours.
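A brute-force construction of this graph might look as follows. It is a minimal sketch assuming Euclidean distance and a small data set (it compares every pair of vectors); the name `knn_graph` is hypothetical.

```python
import math

def knn_graph(vectors, k):
    """Map each node index to the indices of its k nearest neighbours.
    Brute force with Euclidean distance; ties are broken by index order."""
    graph = {}
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: math.dist(v, vectors[j]))
        graph[i] = others[:k]
    return graph
```

The pointers are directed: node i may list j as a neighbour even when j does not list i.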
KDIST: Density-based outlier detection [S. Ramaswamy et al., SIGMOD, Texas, 2000.]
• Define the k-nearest-neighbour distance (KDIST) as the distance to the kth nearest vector.
• Vectors are sorted in increasing order of their KDIST. The last n vectors in the list are defined as outliers.
• Intuitive idea: when KDIST is large, the vector lies in a sparse area and is likely to be an outlier.
• Problem: the user has to know in advance how many outliers the data set contains.
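The KDIST ranking can be sketched as below. This is a brute-force illustration assuming Euclidean distance, not the optimized algorithm of the cited paper; the function names are hypothetical.

```python
import math

def kdist(vectors, i, k):
    """Distance from vector i to its k-th nearest other vector."""
    dists = sorted(math.dist(vectors[i], v) for j, v in enumerate(vectors) if j != i)
    return dists[k - 1]

def kdist_outliers(vectors, k, n):
    """Sort indices by increasing KDIST and return the last n as outliers."""
    n = max(0, min(n, len(vectors)))
    order = sorted(range(len(vectors)), key=lambda i: kdist(vectors, i, k))
    return order[len(order) - n:]
```

The parameter `n` makes the limitation on the slide concrete: the caller must decide up front how many outliers to report.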
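Density-based: Inverse distance
• For each point, the density is the inverse of the average distance to its k nearest neighbors N(x, k):
$$\mathrm{density}(x, k) = \left( \frac{\sum_{y \in N(x,k)} \mathrm{distance}(x, y)}{|N(x,k)|} \right)^{-1}$$
• In the NN and ID (inverse distance) approaches, p1 is considered an outlier, while p2 is not considered an outlier.
[Figure: points p1 and p2]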
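The inverse-distance density translates almost directly into code. The sketch below assumes Euclidean distance and no duplicate points (a point whose k nearest neighbours all coincide with it would cause a division by zero); `neighbors` and `density` are hypothetical helper names.

```python
import math

def neighbors(points, i, k):
    """Indices of the k nearest neighbours N(x, k) of point i (brute force)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def density(points, i, k):
    """Inverse of the average distance from point i to its k nearest neighbours."""
    nbrs = neighbors(points, i, k)
    mean_dist = sum(math.dist(points[i], points[j]) for j in nbrs) / len(nbrs)
    return 1.0 / mean_dist  # assumes mean_dist > 0, i.e. no duplicate points
```

A point in a tight cluster gets a high density, while an isolated point gets a density close to zero.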
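Density-based: LOF
• For each point, compute the density of its local neighborhood
• Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
• Outliers are points with the largest LOF value
$$\mathrm{Reldensity}(x, k) = \frac{\mathrm{density}(x, k)}{\sum_{y \in N(x,k)} \mathrm{density}(y, k) \,/\, |N(x,k)|}$$
• Note: as written, Reldensity is small for outliers; the LOF score is large for them because it takes the ratio the other way around (neighbors' density over the point's own density).
• In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
[Figure: points p1 and p2]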
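The relative density can be sketched on top of the inverse-distance density. With the ratio defined as the point's own density over the average density of its neighbours, an outlier receives a low value, so this sketch ranks by the smallest relative density. Euclidean distance and the helper names are assumptions, and this is not the full LOF algorithm based on reachability distances.

```python
import math

def neighbors(points, i, k):
    """Indices of the k nearest neighbours of point i (brute force)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def density(points, i, k):
    """Inverse of the average distance to the k nearest neighbours."""
    nbrs = neighbors(points, i, k)
    return len(nbrs) / sum(math.dist(points[i], points[j]) for j in nbrs)

def relative_density(points, i, k):
    """density(x, k) divided by the average density of x's k nearest neighbours."""
    nbrs = neighbors(points, i, k)
    avg_nbr_density = sum(density(points, j, k) for j in nbrs) / len(nbrs)
    return density(points, i, k) / avg_nbr_density

def lowest_relative_density(points, k, n):
    """Indices of the n points with the smallest relative density (outlier candidates)."""
    order = sorted(range(len(points)), key=lambda i: relative_density(points, i, k))
    return order[:n]
```

Because each point is compared with its own neighbourhood, a point sitting just outside a dense cluster can be flagged even when its absolute neighbour distance is smaller than the spacing inside a sparse cluster.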
Clustering-Based
• Basic idea:
  • Cluster the data into groups of different density
  • Choose points in small clusters as candidate outliers
  • Compute the distance between candidate points and non-candidate clusters
    • If candidate points are far from all other non-candidate points, they are outliers
  • E.g., DBSCAN
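The steps above can be sketched as follows. To stay short, the clustering step here is a simplified stand-in for DBSCAN: it links any two points within `eps` and takes connected components, without DBSCAN's core-point rule. The parameters `eps`, `min_size`, and `min_gap` are assumed tuning knobs, and the function names are hypothetical.

```python
import math

def eps_clusters(points, eps):
    """Label connected components of the graph linking points within eps.
    Simplified stand-in for DBSCAN (no minimum-points / core-point rule)."""
    labels = [None] * len(points)
    cluster = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] is None and math.dist(points[i], points[j]) <= eps:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels

def cluster_outliers(points, eps, min_size, min_gap):
    """Candidates are points in clusters smaller than min_size; a candidate is an
    outlier if its nearest non-candidate point is farther away than min_gap."""
    labels = eps_clusters(points, eps)
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    candidates = [i for i, lab in enumerate(labels) if sizes[lab] < min_size]
    regular = [i for i, lab in enumerate(labels) if sizes[lab] >= min_size]
    outliers = []
    for i in candidates:
        gap = min((math.dist(points[i], points[j]) for j in regular), default=math.inf)
        if gap > min_gap:
            outliers.append(i)
    return outliers
```

Choosing `min_gap` larger than `eps` is what makes the last step meaningful: a candidate that sits only slightly outside a large cluster is not reported.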
Summary
• Types of anomaly detection schemes
  • Graphical & statistical-based
  • Distance-based
    • Nearest-neighbor based
    • Density based
    • Clustering based
  • Model-based (machine learning/classification)
Crowd Sensing of Traffic Anomalies based on Human Mobility and Social Media
Bei Pan (Penny), University of Southern California
Yu Zheng, Microsoft Research
David Wilkie, University of North Carolina
Cyrus Shahabi, University of Southern California
Background
• The prevalence of location services
  • Mobile phones, GPS
  • Check-in services
• “Crowd sensing” city rhythms
  • Urban planning
  • Activity understanding
• Our interests:
  • Dynamics of urban traffic
  • Detecting and analyzing traffic anomalies
ACM SIGSPATIAL 2013
Insights
When a traffic anomaly occurs:
1) The percentage of travel on different routes may change
2) People may discuss the anomaly on social media
[Figure: routing behavior in normal times vs. routing behavior during the traffic anomaly, over routes rt1 to rt4]
Goal - Detection
[Figure: routing behavior during regular times vs. during an anomalous event; the anomalous graph marks increases and decreases of routing behavior]
Goal - Analysis
• Understand the traffic anomalies
  • Describe the anomaly using social media
  • Impact analysis on travel time delay
[Figure: detected anomalous graph]
Applications
• Individual users
• Transportation authorities
System Overview
Preliminaries
• Trajectory (tr)
  • A sequence of GPS points
    • E.g., {<loc_1, t_1>, <loc_2, t_2>, <loc_3, t_3>}
  • After map-matching & interpolation [1][2]
    • E.g., {<r_1, t'_1>, <r_2, t'_2>, <r_3, t'_3>, <r_4, t'_4>}
• Route (rt): a sequence of connected road segments
  • E.g., <r_1, r_2, r_3, r_4>
[1] J. Yuan, Y. Zheng, C. Zhang, X. Xie, and G.-Z. Sun. An interactive-voting based map matching algorithm. In MDM ’10.
[2] L.-Y. Wei, Y. Zheng, and W.-C. Peng. Constructing popular routes from uncertain trajectories. In KDD ’12.
Anomaly Detection in Routing Behavior Analysis
• Routing behavior:
  • RP_OD = <f_1, p_1, f_2, p_2, ..., f_n, p_n>
  • f: traffic flow / p: percentage
  • e.g., RP_OD = <160, 0.8, 20, 0.1, 20, 0.1>
• Anomaly detection problem definition:
  • Given a complete road network and a trajectory set in [t_0, t_1], find graphs such that
    • for each O there is at least one D for which the RP_OD at time t_1 is anomalous compared with the regular RP_OD during [t_0, t_1):
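The routing behavior vector is easy to build from per-route flows, as the sketch below shows for the slide's example. The comparison function is purely an assumption for illustration (half the L1 distance between the percentage parts); the anomaly criterion actually used in the paper is not given on the slide. Both function names are hypothetical.

```python
def routing_behavior(flows):
    """Build RP_OD = <f_1, p_1, ..., f_n, p_n> from the flows on each route."""
    total = sum(flows)
    rp = []
    for f in flows:
        rp.extend([f, f / total if total else 0.0])
    return rp

def percentage_shift(rp_regular, rp_now):
    """Assumed comparison: half the L1 distance between the percentage entries.
    Both vectors must list the same routes in the same order."""
    p_regular = rp_regular[1::2]
    p_now = rp_now[1::2]
    return sum(abs(a - b) for a, b in zip(p_regular, p_now)) / 2
```

For example, `routing_behavior([160, 20, 20])` reproduces the slide's vector `<160, 0.8, 20, 0.1, 20, 0.1>`. The shift is 0 when the route shares are unchanged and approaches 1 when the traffic moves entirely to other routes.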
System Overview
Anomaly Detection in Term Mining
[Figure: term mining diagram with labels (T_H) and (T_C)]
Impact Analysis & Visualization
• Impact: travel time delay
  • Individual travel time calculation:
    • E.g., travel time at segment a is 96 sec.
  • Mean travel time during time interval T:
  • Delayed travel time for road segment r:
• Visualization:
  • Green: < 2x regular travel time
  • Yellow: [2x, 3x] regular travel time
  • Red: > 3x regular travel time
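The colour coding can be sketched as a small function. The thresholds come from the slide; treating the delay as the ratio of the mean observed travel time to the regular travel time is an assumption, as are the function names and the 60-second baseline in the usage note.

```python
import statistics

def mean_travel_time(times):
    """Mean of the individual travel times observed on a segment during an interval."""
    return statistics.mean(times)

def delay_color(travel_time, regular_time):
    """Map travel time relative to the regular travel time onto the slide's colours."""
    ratio = travel_time / regular_time  # assumed delay measure
    if ratio < 2:
        return "green"   # less than 2x regular travel time
    if ratio <= 3:
        return "yellow"  # between 2x and 3x, inclusive
    return "red"         # more than 3x regular travel time
```

With an assumed regular travel time of 60 sec, the slide's 96 sec observation on segment a maps to green, since 96 / 60 = 1.6.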
Evaluation
• Traffic data set: (~20% of traffic flow on the Beijing road network)
• Social media data:
  • Crawled from a Chinese micro-blogging service called “Weibo”.
• Anomaly detection baseline approach
  • PCA, proposed in [1]: anomaly detection based on traffic volume
[1] S. Chawla, Y. Zheng, and J. Hu. Inferring the root cause in road traffic anomalies. In ICDM ’12.