
MSD-Kmeans: A Novel Algorithm for Efficient Detection of Global and Local Outliers



  1. MSD-Kmeans: A Novel Algorithm for Efficient Detection of Global and Local Outliers. Yuanyuan Wei, Julian Jang-Jaccard, Ruili Wang (School of Natural & Computational Science, Massey University, New Zealand); Fariza Sabrina (School of Engineering and Technology, Central Queensland University, Australia).

  2. Outline: (1) Introduction – Outlier; (2) An Issue with Extreme Values; (3) Our Approach – MSD-Kmeans; (4) Evaluation and Discussion; (5) Summary

  3. Introduction: What is an outlier? (Sources: [1], [2], [3])

  4. Types of Outliers: Global vs. local outliers ([4], [5], [6])

  5. Existing State-of-the-Art Approaches. Statistical approaches: MSD (Mean and Standard Deviation), Z-score, and MIQR (Median and Interquartile Range). Machine-learning approaches: LOF (Local Outlier Factor) and K-means.

  6. K-means Clustering: K-means is one of the most popular clustering algorithms used to detect outliers. (Figure: K-means clustering [7].)

  7. Issue with Extreme Values in K-means: The algorithm can be misled when clusters differ greatly in size or density. It is sensitive to noise and outliers, and it works best for globular clusters. For example [8], extreme values increase the intra-cluster distance and cause K-means to fail to detect local outliers.
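
As a rough illustration of the sensitivity described on this slide, the sketch below shows how a single extreme value drags a cluster centroid and inflates the average distance to it. The fare values and the helper `mean_abs_distance` are hypothetical and are not taken from the presentation.

```python
from statistics import mean

# Hypothetical 1-D cluster of fares, then the same cluster with one extreme value.
cluster = [50.0, 52.0, 51.0, 49.0, 53.0]
with_extreme = cluster + [500.0]


def mean_abs_distance(points):
    """Average distance from each point to the cluster centroid (the mean)."""
    centroid = mean(points)
    return mean([abs(p - centroid) for p in points])


centroid_clean = mean(cluster)
centroid_extreme = mean(with_extreme)
spread_clean = mean_abs_distance(cluster)
spread_extreme = mean_abs_distance(with_extreme)

print(f"centroid without extreme value: {centroid_clean:.2f}")
print(f"centroid with extreme value:    {centroid_extreme:.2f}")
print(f"mean distance without extreme:  {spread_clean:.2f}")
print(f"mean distance with extreme:     {spread_extreme:.2f}")
```

With the extreme value included, the centroid no longer sits near any of the ordinary points, so distances measured from it say little about which ordinary points are unusual.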

  8. Our Approach: MSD-Kmeans. The goal of MSD-Kmeans is to eliminate as many global outliers (extreme values) as possible, minimizing their interference with K-means clustering. It is a two-phase approach. Phase 1: remove extreme values so that the data is better normalized. Phase 2: cluster the normalized data to produce better clusters with more realistic outliers.

  9. MSD-Kmeans Phase 1: MSD algorithm for detecting global outliers. The MSD (Mean and Standard Deviation) algorithm computes the mean and standard deviation of the supplied dataset; values that deviate from the mean by more than one standard deviation are considered global outliers [9].

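  10. MSD Algorithm for Detecting Global Outliers: Let $x \in \{x_1, x_2, x_3, \ldots, x_n\}$; then $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \mu)^2}$, where $n$ is the number of data points, $\mu$ is the mean of the dataset, and $\sigma$ is its standard deviation. The global outliers are $x_{\text{global\_outliers}} = \begin{cases} x_{\text{upper}}, & x > \mu + \sigma \\ x_{\text{lower}}, & x < \mu - \sigma \end{cases}$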
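
The rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the sample standard deviation (the $n-1$ denominator); the function name `msd_global_outliers` and the fare values are hypothetical.

```python
from statistics import mean, stdev


def msd_global_outliers(values):
    """Split values into (kept, outliers) with the mean +/- one standard deviation rule."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    upper, lower = mu + sigma, mu - sigma
    kept = [x for x in values if lower <= x <= upper]
    outliers = [x for x in values if x > upper or x < lower]
    return kept, outliers


# Hypothetical fares: six ordinary trips and one extreme value.
fares = [50.0, 52.0, 51.0, 49.0, 53.0, 48.0, 500.0]
kept, outliers = msd_global_outliers(fares)
print("kept:", kept)
print("global outliers:", outliers)
```

Because the extreme value itself inflates both the mean and the standard deviation, the one-standard-deviation band is wide here and only the extreme value falls outside it.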

  11. MSD-Kmeans Phase 2: K-means algorithm for detecting local outliers (flowchart). Dataset → initial K clusters → calculate the centroids → compute the distance between objects and centroids → group by minimum distance → converged (no object moves group)? If no, recalculate the centroids; if yes, identify outliers based on intra-cluster distance: if intra-cluster distance < threshold, normal data; if intra-cluster distance > threshold, outlier.

  12. MSD-Kmeans Outliers: The method was implemented in Python using the sklearn.cluster.KMeans class to group the dataset into K clusters (K = 2). The intra-cluster distance from the centroid to each data point in its cluster is calculated with the Euclidean distance: $d(p, q) = \sqrt{\sum_{i=1}^{n}(q_i - p_i)^2}$, where $p$ is the centroid and $q$ is a data point in its cluster.
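
The distance step can be sketched with the standard library alone. The example below is only an illustration: the points are hypothetical, the centroid is computed as the coordinate-wise mean rather than taken from a fitted clustering model, and `distances_to_centroid` is a made-up helper name.

```python
import math


def distances_to_centroid(points, centroid):
    """Euclidean distance d(p, q) from the centroid p to every point q in the cluster."""
    return [math.dist(point, centroid) for point in points]


# Hypothetical 2-D cluster; its centroid is the coordinate-wise mean.
cluster_points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
centroid = tuple(sum(axis) / len(cluster_points) for axis in zip(*cluster_points))

intra_distances = distances_to_centroid(cluster_points, centroid)
print("centroid:", centroid)
print("intra-cluster distances:", intra_distances)
```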

  13. MSD-Kmeans Outliers: Threshold for the top-n outliers in each cluster. The outlier threshold is the mean of the intra-cluster distances plus 1.5 times their standard deviation: $x_{\text{local\_outlier}} = x_i > \mu_{\text{intra\_distance}} + 1.5 \cdot \sigma_{\text{intra\_distance}}$
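
A minimal sketch of this threshold follows. The slide does not say whether the sample or population standard deviation is used, so the sample version is assumed here; the distances and the name `local_outlier_flags` are hypothetical.

```python
from statistics import mean, stdev


def local_outlier_flags(intra_distances, factor=1.5):
    """Return (threshold, flags); a flag is True when the distance exceeds mean + factor * std."""
    mu = mean(intra_distances)
    sigma = stdev(intra_distances)  # assumption: sample standard deviation
    threshold = mu + factor * sigma
    return threshold, [d > threshold for d in intra_distances]


# Hypothetical distances from one centroid to the points of its cluster.
distances = [1.0, 2.0, 1.5, 2.5, 1.0, 2.0, 9.0]
threshold, flags = local_outlier_flags(distances)
print(f"threshold: {threshold:.3f}")
print("local outlier flags:", flags)
```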

  14. MSD-Kmeans pseudocode
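
The pseudocode itself did not survive in this transcript. The sketch below is one possible reading of the two phases described on the earlier slides, not the authors' pseudocode. To stay dependency-free it replaces sklearn.cluster.KMeans with a tiny 1-D Lloyd iteration (`kmeans_1d`) whose deterministic initialization is an assumption; all function names and data are hypothetical.

```python
from statistics import mean, stdev


def msd_split(values):
    """Phase 1: separate values within mean +/- one standard deviation from global outliers."""
    mu, sigma = mean(values), stdev(values)
    kept = [x for x in values if mu - sigma <= x <= mu + sigma]
    removed = [x for x in values if x < mu - sigma or x > mu + sigma]
    return kept, removed


def kmeans_1d(values, k=2, max_iter=100):
    """Tiny Lloyd iteration on 1-D data (stand-in for sklearn's KMeans)."""
    ordered = sorted(values)
    # Assumption: start the centroids at evenly spaced positions in the sorted data.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: abs(x - centroids[j])) for x in values]
        if new_labels == labels:
            break  # converged: no point moved to another group
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return centroids, labels


def msd_kmeans(values, k=2, factor=1.5):
    """Phase 1 removes global outliers; phase 2 flags local outliers per cluster."""
    kept, global_outliers = msd_split(values)
    centroids, labels = kmeans_1d(kept, k)
    local_outliers = []
    for j in range(k):
        members = [x for x, lab in zip(kept, labels) if lab == j]
        if len(members) < 2:
            continue  # a standard deviation needs at least two points
        dists = [abs(x - centroids[j]) for x in members]
        threshold = mean(dists) + factor * stdev(dists)
        local_outliers += [x for x, d in zip(members, dists) if d > threshold]
    return global_outliers, local_outliers


# Hypothetical data: two groups of fares, one unusual fare (28), one extreme fare (400).
data = [20.0, 21.0, 19.0, 20.0, 22.0, 18.0, 20.0, 21.0, 19.0, 20.0, 28.0,
        48.0, 49.0, 50.0, 51.0, 52.0, 48.0, 49.0, 50.0, 51.0, 52.0, 400.0]
global_outliers, local_outliers = msd_kmeans(data)
print("global outliers:", global_outliers)
print("local outliers:", local_outliers)
```

In this toy run the extreme fare is removed in phase 1, and the unusual fare is then flagged in phase 2 because its distance to its cluster centroid exceeds that cluster's threshold.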

  15. Experiment & Results: We used the NYC (New York City) Taxi Dataset, about 1.71 GB of data collected from registered taxis in NYC in January 2016.

  16. Experiment Setup: The aim is to find greedy, dishonest drivers who may attempt to charge high fares by detouring. Parameter: fare amount (from the Lower Manhattan suburb of SOHO to John F. Kennedy International (JFK) Airport). Figures: (a) Lower Manhattan; (b) JFK Airport (source: Google Maps).

  17. MSD-Kmeans & K-means: Fig. 1 shows K-means outlier detection with extreme values; Fig. 2 shows MSD-Kmeans outlier detection without extreme values. Outlier totals reported with the figures: 11.14% and 5.11%.

  18. Evaluation Approaches: We took a number of popular outlier detection algorithms from both the statistical and machine-learning areas and compared their results. Existing state-of-the-art methods compared: statistical (MSD, Z-score, MIQR) and machine learning (K-means, LOF). Measurements compared: TPR (True Positive Rate); FPR (False Positive Rate); Precision $= \frac{TP}{TP + FP}$; Accuracy $= \frac{TP + TN}{TP + TN + FP + FN}$; F-measure $= \frac{2TP}{2TP + FP + FN}$; and execution time.
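
The measurements listed on this slide can be computed from confusion-matrix counts as sketched below. The slide gives formulas for precision, accuracy, and F-measure only; the TPR and FPR lines use their standard definitions. The counts and the function name `outlier_metrics` are hypothetical.

```python
def outlier_metrics(tp, tn, fp, fn):
    """Compute the evaluation measures from confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),  # standard definition; also called recall
        "fpr": fp / (fp + tn),  # standard definition
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts, not taken from the experiment.
metrics = outlier_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```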

  19. Performance Comparison of Outlier Detection Algorithms using the NYC Taxi Dataset

      Algorithm      TPR (%)  FPR (%)  Precision (%)  Accuracy (%)  Recall (%)  F-measure (%)  Execution Time (ms)
      MSD            99.9     24.2     96.6           96.9          99.9        98.2           21
      Z-score        100      48.9     94.3           94.6          100         97.1           157
      MIQR*          97.8     12.6     98.1           96.4          97.8        98.0           54
      K-means [10]   99.7     55.6     93.5           93.7          99.7        96.9           1,132
      LOF* [11]      98.2     79.3     26.1           38.0          98.2        43.1           31,483
      MSD-Kmeans     98.5     11.6     98.6           97.4          98.5        98.6           824

      * MIQR (Median and Interquartile Range Method)
      * LOF (Local Outlier Factor)

  20. Comparison (two bar charts across the different algorithms: MSD, Z-score, MIQR*, K-means, LOF*, MSD-Kmeans). Execution Time (ms): 21, 157, 54, 1,132, 31,483, 824. Accuracy (%): 96.9, 94.6, 96.4, 93.7, 38, 97.4. * MIQR (Median and Interquartile Range Method); * LOF (Local Outlier Factor). The MSD-Kmeans paper using the NYC taxi dataset has been submitted to the ICONIP conference (A ranking).

  21. Summary: We propose a new outlier detection algorithm, MSD-Kmeans, which deals with extreme values better. We evaluated it on IoT datasets such as the NYC taxi data. We compared the performance indicators of MSD-Kmeans with those of other outlier detection algorithms (MSD, Z-score, MIQR, K-means, and LOF) and showed that the proposed MSD-Kmeans algorithm achieved the highest accuracy.

  22. Future Work: We are currently testing with different datasets. Indoor air quality data generated by a real IoT application, SKOMOBO (SKOol MOnitoring BOx): the initial evaluation is done and the results are being written up as a journal manuscript. Other datasets will also be used, such as those from the UCI Machine Learning Repository. Multivariate analysis of variance: expand the study to the interdependence between several variables and its impact.

  23. References:
      [1] Ved, M. (2018). Outlier Detection and Anomaly Detection with Machine Learning. Retrieved from https://medium.com/@mehulved1503/outlier-detection-and-anomaly-detection-with-machine-learning-caa96b34b7f6
      [2] Floydhub. (2019). Introduction to Anomaly Detection in Python [Online forum comment]. Retrieved from https://blog.floydhub.com/introduction-to-anomaly-detection-in-python/
      [3] Related Guided Lesson. (2019). The Ugly Duckling. Retrieved from https://www.education.com/game/the-ugly-duckling/
      [4] Rushworth, A. (2019). The local outlier factor (LOF). Retrieved from https://campus.datacamp.com/courses/anomaly-detection-in-r/distance-and-density-based-anomaly-detection?ex=9
      [5] Packt. (2017). Machine Learning Review. Retrieved from https://hub.packtpub.com/machine-learning-review/
      [6] Jose, C. (2019). Anomaly Detection Techniques in Python. Retrieved from https://medium.com/learningdatascience/anomaly-detection-techniques-in-python-50f650c75aaf
      [7] Learn by Marketing. (2019). K-Means Clustering – What it is and How it works. Retrieved from http://www.learnbymarketing.com/methods/k-means-clustering/
      [8] Science & Technology Review. (2012). Finding and Fixing a Supercomputer’s Faults. Retrieved from https://str.llnl.gov/June12/desupinski.html
      [9] Prasad, Y. S., & Krishna, G. R. (2013). Statistical Anomaly Detection Technique for Real Time Datasets. International Journal of Computer Trends and Technology (IJCTT), 6(2), 89-94.
      [10] Duan, L., Xu, L., Liu, Y., & Lee, J. (2009). Cluster-based outlier detection. Annals of Operations Research, 168(1), 151-168.
      [11] Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000, May). LOF: identifying density-based local outliers. In ACM SIGMOD Record (Vol. 29, No. 2, pp. 93-104). ACM.

  24. THANK YOU
