
Scalable Anomaly Detection with Spark and SOS (Strata NYC)



  1. Scalable Anomaly Detection with Spark and SOS. Strata NYC, September 26, 2019

  2. Hi there, my name is Jeroen Janssens

  3. Today ● SOS, World! ● Anomalies and outliers ● Evaluating outlier-selection algorithms ● Various approaches to outlier selection ● Stochastic Outlier Selection ● Conclusion

  4. SOS, World! 01-sos-world.ipynb

  5. Implementations of SOS ● Python: http://bit.ly/sos-python ● Spark: http://bit.ly/sos-spark ● R: http://bit.ly/sos-r ● Flink: http://bit.ly/sos-flink

  6. Anomalies and outliers

  7. An anomaly is an observation or event that deviates qualitatively from what is considered to be normal, according to a domain expert.

  8. Detecting anomalies is important ● Expensive ● Dangerous ● Mess up your model

  9. Human anomaly detection may suffer from ● Fatigue ● Information overload ● Emotional bias

  10. Feature-vector representation

  11. Dissimilarity-matrix representation

  12. From anomaly to outlier

  13. An outlier is a data point that deviates quantitatively from the majority of the data points, according to an outlier-selection algorithm.

  14. The symbiotic relationship between the domain expert and the algorithm

  15. Data flow diagram illustrating the relationship between the domain expert (square) and the outlier-selection algorithm (top circle).

  16. Six Euler diagrams (1/2)

  17. Six Euler diagrams (2/2)

  18. Evaluating outlier-selection algorithms

  19. Confusion matrix Computer says no.

  20. Four possible outcomes
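The four outcomes can be counted directly once a threshold is fixed. The following is a minimal sketch, not the speaker's code: the function name `confusion_counts`, the toy scores, and the labels "miss" and "correct rejection" (borrowed from signal-detection terminology to match the "hit" and "false alarm" wording used for the ROC curve) are assumptions for illustration.

```python
def confusion_counts(scores, is_anomaly, threshold):
    """Count the four outcomes when points scoring above `threshold` are flagged."""
    counts = {"hit": 0, "false_alarm": 0, "miss": 0, "correct_rejection": 0}
    for score, anomaly in zip(scores, is_anomaly):
        flagged = score > threshold
        if flagged and anomaly:
            counts["hit"] += 1                # anomaly, flagged as outlier
        elif flagged:
            counts["false_alarm"] += 1        # normal, flagged as outlier
        elif anomaly:
            counts["miss"] += 1               # anomaly, not flagged
        else:
            counts["correct_rejection"] += 1  # normal, not flagged
    return counts


# Hypothetical outlier scores and ground-truth labels from a domain expert.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
is_anomaly = [True, False, True, False, False]
counts = confusion_counts(scores, is_anomaly, threshold=0.5)
print(counts)
```

Moving the threshold trades misses for false alarms, which is why the choice is left to the domain expert.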

  21. Evaluation Illustration of relabelling a multi-class data set into multiple one-class data sets.
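One way to read the relabelling step is sketched below. It assumes that each class in turn is treated as the normal class and every other class as anomalous; the slide does not spell out the exact scheme, so this reading and the helper name `one_class_datasets` are assumptions.

```python
def one_class_datasets(labels):
    """Map each class to a list of flags: True where a point is NOT in that class.

    Assumption: the chosen class plays the role of "normal" and all other
    classes play the role of "anomaly".
    """
    return {normal: [label != normal for label in labels]
            for normal in sorted(set(labels))}


# A toy three-class label vector yields three one-class data sets.
labels = ["a", "a", "b", "c", "b"]
datasets = one_class_datasets(labels)
for normal_class, is_anomaly in datasets.items():
    print(normal_class, is_anomaly)
```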

  22. Anomalies are rare In order to evaluate the algorithm we simulate anomalies to be rare. Banana for scale.

  23. Outlier scores The dashed line indicates the threshold chosen by the domain expert.

  24. ROC curve An ROC curve plots the hit rate against the false alarm rate for all possible thresholds.
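As a rough illustration of how such a curve can be traced, the sketch below sweeps the threshold over every distinct score and records one (false alarm rate, hit rate) point per threshold. The function names, the toy data, and the trapezoidal area computation are assumptions added for illustration; they are not taken from the talk.

```python
def roc_points(scores, is_anomaly):
    """Return (false_alarm_rate, hit_rate) pairs, one per distinct threshold."""
    n_pos = sum(is_anomaly)
    n_neg = len(is_anomaly) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]
        hits = sum(1 for f, a in zip(flagged, is_anomaly) if f and a)
        false_alarms = sum(1 for f, a in zip(flagged, is_anomaly) if f and not a)
        points.append((false_alarms / n_neg, hits / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.3, 0.2, 0.1]
is_anomaly = [True, False, True, False, False]
points = roc_points(scores, is_anomaly)
auc = area_under_curve(points)
print(points, auc)
```

The area under the curve summarises performance over all thresholds in a single number, which makes algorithms comparable without committing to one threshold.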

  25. Various approaches to outlier selection

  26. Distribution-based outlier selection

  27. Distance-based outlier selection Size does matter

  28. Density-based outlier selection

  29. Stochastic Outlier Selection

  30. Stochastic Outlier Selection ● Unsupervised outlier selection algorithm ● Employs concept of affinity (inspired by t-SNE) ● One parameter: perplexity ● Computes outlier probabilities

  31. t-Distributed Stochastic Neighbor Embedding (t-SNE; Van der Maaten and Hinton) employs affinity to perform dimensionality reduction

  32. A data point is selected as an outlier when all the other data points have insufficient affinity with it.

  33. From input to output

  34. From feature matrix to dissimilarity matrix
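A minimal sketch of this step, assuming Euclidean distance as the dissimilarity measure (the slide does not name one; any dissimilarity could be substituted). The function name `dissimilarity_matrix` and the toy feature matrix are illustrative.

```python
import math


def dissimilarity_matrix(X):
    """Pairwise Euclidean distances between the rows of feature matrix X."""
    n = len(X)
    return [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]


# Three data points with two features each.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(X)
for row in D:
    print(row)
```

The result is an n-by-n matrix with zeros on the diagonal; with Euclidean distance it is also symmetric.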

  35. From input to output

  36. Smooth neighborhoods

  37. Affinity between data points
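The slide shows no formula, so the sketch below assumes a Gaussian-style affinity, a_ij = exp(-d_ij^2 / (2 * sigma_i^2)) for i != j and a_ii = 0, where each data point i has its own variance sigma_i^2. The function name, the toy dissimilarities, and the fixed variances are illustrative assumptions.

```python
import math


def affinity_matrix(D, variances):
    """Affinity that data point i has with data point j (assumed Gaussian form).

    D is a dissimilarity matrix and variances[i] is the variance of point i.
    A data point has no affinity with itself.
    """
    n = len(D)
    return [[0.0 if i == j
             else math.exp(-D[i][j] ** 2 / (2.0 * variances[i]))
             for j in range(n)]
            for i in range(n)]


D = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 1.0],
     [2.0, 1.0, 0.0]]
variances = [1.0, 1.0, 1.0]  # hypothetical fixed values for illustration
A = affinity_matrix(D, variances)
for row in A:
    print(row)
```

Because every point may have a different variance, the affinity matrix is in general not symmetric even when the dissimilarity matrix is.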

  38. From input to output

  39. From affinity to binding probability The binding matrix B is obtained by normalising each row in the affinity matrix A.
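Row normalisation can be sketched in a few lines. The toy affinity values and the function name `binding_matrix` are assumptions for illustration.

```python
def binding_matrix(A):
    """Normalise each row of affinity matrix A so that it sums to one."""
    B = []
    for row in A:
        total = sum(row)
        B.append([a / total for a in row])
    return B


# Hypothetical affinity matrix with zero self-affinity on the diagonal.
A = [[0.0, 0.6, 0.2],
     [0.5, 0.0, 0.5],
     [0.1, 0.3, 0.0]]
B = binding_matrix(A)
for row in B:
    print(row)
```

Each row of B can then be read as a discrete probability distribution over the other data points.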

  40. Binding probabilities form a graph

  41. Binding probabilities form a graph

  42. Stochastic Neighbor Graph A data point belongs to the outlier class when it is not selected by any other data point.

  43. Three SNGs The three SNGs Ga, Gb, and Gc are sampled from the discrete probability distribution P(G).

  44. Set of all SNGs

  45. Approximating outlier probabilities by sampling SNGs
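The sampling idea can be sketched as follows, assuming that in each sampled graph every data point selects exactly one other point according to its row of binding probabilities, and that a point counts as an outlier in that graph when no other point selected it. The function name, the toy binding matrix, and the sample count are illustrative assumptions.

```python
import random


def sample_outlier_probabilities(B, n_samples=10000, seed=0):
    """Estimate outlier probabilities by sampling stochastic neighbor graphs."""
    rng = random.Random(seed)
    n = len(B)
    outlier_counts = [0] * n
    for _ in range(n_samples):
        selected = [False] * n
        for i in range(n):
            # Point i binds to one neighbour j with probability B[i][j].
            j = rng.choices(range(n), weights=B[i])[0]
            selected[j] = True
        for k in range(n):
            if not selected[k]:  # no inbound edge in this sampled graph
                outlier_counts[k] += 1
    return [count / n_samples for count in outlier_counts]


# Hypothetical binding matrix: points 0 and 1 mostly select each other.
B = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.1],
     [0.5, 0.5, 0.0]]
estimates = sample_outlier_probabilities(B)
print(estimates)
```

The estimates become more accurate as the number of sampled graphs grows, at a proportional computational cost.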

  46. Demo: Sampling SNGs in CoffeeScript and D3 http://bit.ly/sos-d3

  47. Computing outlier probabilities through marginalisation

  48. Computing outlier probabilities in closed form
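A hedged sketch of the closed form, assuming that data points select their neighbours independently, so that the probability that point i is selected by nobody is the product over all other points j of (1 - b_ji). The function name and the toy binding matrix are illustrative.

```python
import math


def outlier_probabilities(B):
    """Probability that each point is selected by no other point.

    Assumes independent selections, so the probability for point i is the
    product over j != i of (1 - B[j][i]).
    """
    n = len(B)
    return [math.prod(1.0 - B[j][i] for j in range(n) if j != i)
            for i in range(n)]


# Hypothetical binding matrix: points 0 and 1 mostly select each other.
B = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.1],
     [0.5, 0.5, 0.0]]
probabilities = outlier_probabilities(B)
print(probabilities)
```

In this toy example the third point receives little binding probability from the other two, so its outlier probability is by far the highest.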

  49. Proof!

  50. Selecting outliers

  51. Adaptive variances via the perplexity parameter

  52. Continuous binary search
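The search can be sketched for a single data point as below. The sketch assumes that perplexity is defined as 2 raised to the Shannon entropy of the point's binding distribution and that the Gaussian-style affinity exp(-d^2 / (2 * variance)) is used; under those assumptions perplexity grows monotonically with the variance, so bisection applies. The function names, search bounds, and tolerance are illustrative, not taken from any particular implementation.

```python
import math


def perplexity(distances, variance):
    """Perplexity (2 ** entropy) of the binding distribution of one point."""
    d_min_sq = min(d ** 2 for d in distances)
    # Shifting by the smallest squared distance avoids underflow to all zeros;
    # the shift cancels out when the affinities are normalised.
    affinities = [math.exp(-(d ** 2 - d_min_sq) / (2.0 * variance))
                  for d in distances]
    total = sum(affinities)
    probs = [a / total for a in affinities]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy


def find_variance(distances, target, lo=1e-4, hi=1e4, tol=1e-6, max_iter=200):
    """Bisect on a log scale for the variance whose perplexity hits `target`."""
    mid = math.sqrt(lo * hi)
    for _ in range(max_iter):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the bracket
        h = perplexity(distances, mid)
        if abs(h - target) < tol:
            break
        if h > target:
            hi = mid  # neighbourhood too wide: shrink the variance
        else:
            lo = mid  # neighbourhood too narrow: grow the variance
    return mid


# Distances from one data point to its four neighbours (toy values).
distances = [1.0, 2.0, 3.0, 4.0]
variance = find_variance(distances, target=2.0)
print(variance, perplexity(distances, variance))
```

Running this search once per data point gives every point its own variance, so that each point has roughly the same effective number of neighbours regardless of local density.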

  53. Perplexity influences outlier probabilities

  54. Evaluation and comparison

  55. Outlier-score plots

  56. Real-world datasets

  57. Synthetic datasets

  58. Synthetic datasets

  59. Synthetic datasets

  60. Synthetic datasets

  61. SOS performs significantly better

  62. Spark implementation of SOS

  63. Spark implementation of SOS ● Developed by Fokko Driesprong ● Works with DataFrame API ● Available on GitHub ● Plan is to make it part of MLlib

  64. SOS on PySpark 92-pyspark-sos.ipynb

  65. Summary ● Outlier selection can support the detection of anomalies ● SOS is an intuitive and probabilistic algorithm to select outliers ● SOS has a very good performance ● No free lunch

  66. Thank you! Here are some links ● Blog: http://bit.ly/sos-blog ● D3 Demo: http://bit.ly/sos-d3 ● Python implementation: http://bit.ly/sos-python ● Spark implementation: http://bit.ly/sos-spark ● R implementation: http://bit.ly/sos-r ● Flink implementation: http://bit.ly/sos-flink
