Understanding Software System Behavior With ML and Time Series Data

QCon.ai SF, April 11, 2018. David Andrzejewski (@davidandrzej), Engineering, Sumo Logic.


  1. Understanding Software System Behavior With ML and Time Series Data QCon.ai SF – April 11, 2018 David Andrzejewski - @davidandrzej Engineering, Sumo Logic Sumo Logic Confidential

  2. Intro / context • Currently: – Sumo Logic since 2011 – Co-organizer: SF ML Meetup – @davidandrzej on Twitter • Previously: – Postdoc at LLNL – U Wisconsin • BS Comp E / CS / Math • PhD CS (ML)

  3. Continuous intelligence for machine data

  4. Overview 1. Mega-trends: “Softwarification” of Everything + ML 2. Machine data: practicalities and basic analytics 3. Machine learning, data mining, and pitfalls

  7. Trouble in software paradise!

  8. Microservices “death star”

  10. Big Data to the rescue? DEBUG-level visibility, in production • Logs (TBs / day) • Metrics (millions of data points / min) • Source code (GBs) • Traces • Events

  11. Not so fast! “Could a Neuroscientist Understand a Microprocessor?” Jonas & Kording (PLoS Comp Bio 2017) • (cool NES plotter art - Michael Fogleman)

  12. Using data to understand complex, dynamic, multi-scale systems • “Grand challenge” problem • New measurements → new science • Data: necessary but not sufficient? • Today’s systems: – Software – Biological – Social / economic

  13. Machine data time series

  14. Operational time series telemetry: the basics • What: – “Four Golden Signals” (Google SRE book): Latency, Traffic, Errors, Saturation (also: USE, RED, …) – Basic resources: CPU, memory, … – More granular timings – Event counts, cache miss rates, other internals… • How: – “push” agents/daemons (e.g., StatsD) – “pull” metrics endpoints (e.g., Prometheus) • Where: – TSDB (time series database) – OSS / commercial systems

  15. Operational time series telemetry: why • Q: WTF is my system actually doing? • Monitoring & troubleshooting: – data visualization – alerting* – summarize behavior – comparisons

  16. Operational time series telemetry: example
      “Metrics 2.0”-style key-value identifier: deployment=production, cluster=indexer, host=foobuzz-39, metric=write_latency, units=ms
      Actual data: sequence of (timestamp, value)
      timestamp   8:01   8:02   8:03   8:04   8:05   …
      value         64    128     72    144     96   …
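The identifier-plus-points layout on slide 16 can be sketched in a few lines of Python. This is only an illustration: the `series_key` helper and the in-memory `store` dictionary are hypothetical names, not part of Metrics 2.0 or any particular TSDB.

```python
def series_key(tags):
    """Canonical series identifier: key=value pairs sorted by key (an assumed convention)."""
    return ",".join(f"{key}={value}" for key, value in sorted(tags.items()))

# Key-value identifier from the slide
tags = {
    "deployment": "production",
    "cluster": "indexer",
    "host": "foobuzz-39",
    "metric": "write_latency",
    "units": "ms",
}

# Actual data: a sequence of (timestamp, value) pairs
points = [("8:01", 64), ("8:02", 128), ("8:03", 72), ("8:04", 144), ("8:05", 96)]

# Toy in-memory "database": identifier -> points
store = {series_key(tags): points}
```

Sorting the tags makes the identifier independent of the order in which an agent happens to emit them.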

  17. Quantization: rollup / time-based aggregation
      Raw event/observation data → coarser, more regular; 1-minute aggregations → 1-hour aggregations, etc.
      Aggregation: map from multiset of floats to some single-valued summary, $f : \mathrm{Multiset}(\mathbb{R}) \to \mathbb{R}$ • Min • Max • Avg • Sum • Count
      1-minute   8:00   8:01   …      8:58   8:59
      value      60.1   43.2   33.3   45.1   42.5
      1-hour     6:00   7:00   8:00   9:00   10:00
      value      …      …      33.3   …      …

  18. Quantization: rollup / time-based aggregation
      Raw event/observation data → coarser, more regular; 1-minute aggregations → 1-hour aggregations, etc.
      Aggregation: map from multiset of floats to some single-valued summary, $f : \mathrm{Multiset}(\mathbb{R}) \to \mathbb{R}$ • Min • Max • Avg • Sum • Count • Percentiles?
      1-minute   8:00   8:01   …      8:58   8:59
      value      60.1   43.2   33.3   45.1   42.5
      1-hour     6:00   7:00   8:00   9:00   10:00
      value      …      …      33.3   …      …
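A minimal sketch of the rollup idea, assuming timestamps are integer seconds since midnight. The `rollup` function and `AGGREGATORS` table are illustrative names, and the 8:30 timestamp for the 33.3 value is a guess, because the slide elides where in the hour it falls.

```python
from collections import defaultdict

# Each aggregator maps a multiset of floats to a single summary value
AGGREGATORS = {
    "min": min,
    "max": max,
    "sum": sum,
    "count": len,
    "avg": lambda values: sum(values) / len(values),
}

def rollup(points, bucket_seconds, how):
    """Group (seconds, value) points into fixed-size buckets and summarize each bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    aggregate = AGGREGATORS[how]
    return {start: aggregate(values) for start, values in sorted(buckets.items())}

HOUR = 3600
minute_points = [
    (8 * HOUR + 0 * 60, 60.1),
    (8 * HOUR + 1 * 60, 43.2),
    (8 * HOUR + 30 * 60, 33.3),  # assumed timestamp; elided on the slide
    (8 * HOUR + 58 * 60, 45.1),
    (8 * HOUR + 59 * 60, 42.5),
]

hourly_min = rollup(minute_points, HOUR, "min")
```

With the slide's five values, the 8:00 hourly bucket under Min is 33.3, matching the hourly row on the slide.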

  19. SRE percentiles • Percentile as guarantee: p99 < 2000 ms translates into unambiguous language: “No more than 1% of customer requests take longer than 2 seconds to execute” • avg = 1485 ms • p95 = 4894 ms

  20. Percentiles via $\mathrm{CDF}^{-1}$ (inverse CDF): p60 = -1.8, etc. Figure source: https://en.wikipedia.org/wiki/Normal_distribution
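As a rough check of the inverse-CDF view, Python's standard library can evaluate a normal quantile directly. The parameters below are an assumption: they correspond to the μ = -2, σ² = 0.5 curve in the Wikipedia figure, which appears to be the curve the slide's p60 value refers to.

```python
from math import sqrt
from statistics import NormalDist

# Assumed parameters (mu = -2, variance = 0.5); not stated on the slide
dist = NormalDist(mu=-2.0, sigma=sqrt(0.5))

# The q-th percentile is the inverse CDF evaluated at q / 100
p60 = dist.inv_cdf(0.60)

# Round trip: the CDF at p60 gives back 0.60
coverage = dist.cdf(p60)
```

Under that assumption, `p60` rounds to -1.8, in line with the slide.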

  21. Algebraic structure for fun and profit. Example: item counts. $f(s_1 + s_2) = f(s_1) \oplus f(s_2)$

  22. Algebraic structure for fun and profit. Example: word counts. $f(s_1 + s_2) = f(s_1) \oplus f(s_2)$: aggregate of combined data = combination of aggregates. Monoid homomorphism!
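The word-count case can be checked with `collections.Counter`, whose `+` operator plays the role of ⊕. This is a small sketch with made-up input strings; `word_counts` is an illustrative helper.

```python
from collections import Counter

def word_counts(text):
    """f: map a chunk of text to its word counts."""
    return Counter(text.split())

s1 = "error timeout error"
s2 = "timeout ok"

# Aggregate of combined data: f(s1 + s2)
combined = word_counts(s1 + " " + s2)

# Combination of aggregates: f(s1) ⊕ f(s2)
merged = word_counts(s1) + word_counts(s2)
```

Because the two results agree, partial counts can be computed per shard or per time bucket and merged later without revisiting the raw data.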

  23. Percentile original sin: $f(s_1 + s_2) \neq f(s_1) \oplus f(s_2)$. Not a monoid homomorphism • In general, cannot combine: – p95 of dataset X – p95 of dataset Y ...to say anything meaningful at all about dataset X ∪ Y • Impress your SRE/DevOps friends at parties!
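A small counterexample makes the point concrete: two pairs of datasets with identical p95 summaries can have different p95 values once combined, so no merge function on the summaries alone can be right for both. The data are made up, and `percentile` uses the nearest-rank definition as an assumption (other definitions interpolate).

```python
def percentile(values, q):
    """Nearest-rank percentile for integer q: smallest value with >= q% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceil(q * n / 100)
    return ordered[rank - 1]

# Pair A: similar sizes
x_a = [10] * 95 + [1000] * 5
y_a = [100] * 100

# Pair B: very different sizes, same per-dataset p95 values as pair A
x_b = [10] * 960 + [1000] * 40
y_b = [100] * 10

summaries_a = (percentile(x_a, 95), percentile(y_a, 95))
summaries_b = (percentile(x_b, 95), percentile(y_b, 95))

union_a = percentile(x_a + y_a, 95)
union_b = percentile(x_b + y_b, 95)
```

Both pairs summarize to p95 values of 10 and 100, yet the p95 of the combined data is 100 for pair A and 10 for pair B.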

  24. Basic aggregation: across series
      What is max write_latency of entire foobuzz cluster? f = MAX( )
      timestamp                      8:01   8:02   8:03   8:04   8:05   …
      host=foobuzz-1                   64    128     72    144     96   …
      host=foobuzz-2                   23     33     49     57     37   …
      host=foobuzz-3                   46    101     78     58     39   …
      …                                 …      …      …      …      …   …
      output (as printed on slide)   55.3   47.1   76.8   52.3   41.7
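Aggregating across series amounts to applying the function column by column. The sketch below uses only the three hosts visible on the slide (the remaining hosts are elided), so its result is the maximum over those three rows rather than the output row printed on the slide; `aggregate_across_series` is an illustrative name.

```python
def aggregate_across_series(series_by_host, aggregate):
    """Apply `aggregate` at each timestamp across all series (one output point per timestamp)."""
    return [aggregate(column) for column in zip(*series_by_host.values())]

# write_latency values at 8:01 .. 8:05 for the hosts shown on the slide
series_by_host = {
    "foobuzz-1": [64, 128, 72, 144, 96],
    "foobuzz-2": [23, 33, 49, 57, 37],
    "foobuzz-3": [46, 101, 78, 58, 39],
}

cluster_max = aggregate_across_series(series_by_host, max)
```

For these three rows the per-timestamp maxima are 64, 128, 78, 144, and 96.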

  25. Basic aggregation: across time (aka “fold”)
      What is average queue depth of each foobuzz host over this time period? f = AVG( )
      timestamp        8:01   8:02   8:03   …   AVG
      host=foobuzz-1     64    128     72   …   103.4
      host=foobuzz-2     23     33     49   …    48.6
      host=foobuzz-3     46    101     78   …    62.1
      …                   …      …      …   …
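Folding across time collapses each series to a single number. This sketch averages only the three points visible per host, so its results differ from the slide's averages, which cover the full (elided) time period; `fold_over_time` is an illustrative name.

```python
from statistics import mean

def fold_over_time(series_by_host, aggregate):
    """Collapse each series to one summary value (one output value per series)."""
    return {host: aggregate(values) for host, values in series_by_host.items()}

# Only the first three points per host are visible on the slide
visible_points = {
    "foobuzz-1": [64, 128, 72],
    "foobuzz-2": [23, 33, 49],
    "foobuzz-3": [46, 101, 78],
}

host_avg = fold_over_time(visible_points, mean)
```

Across-series aggregation keeps the time axis and drops the host axis; the fold does the opposite.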

  26. Comparison: time-shifted comparisons
      How does write_latency for this foobuzz instance compare versus yesterday?
      Series: deployment=production, cluster=indexer, host=foobuzz-21, metric=write_latency, units=ms
      [Chart: write_latency (ms, y-axis 0 to 160) from 8:01 to 8:05; series: Now, Timeshift]
      timestamp          8:01   8:02   8:03   8:04   8:05   …
      Now                  64    128     72    144     96   …
      Timeshift (-24h)     23     12     18     37     24   …
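A time-shifted comparison can be sketched by moving yesterday's timestamps forward 24 hours so that they line up with today's. The calendar dates and the `at` and `timeshift` helpers are assumptions for illustration; the values are the ones on the slide.

```python
from datetime import datetime, timedelta

def at(day, minute):
    """Timestamp on 2018-04-<day> at 08:<minute>; the dates are made up for this sketch."""
    return datetime(2018, 4, day, 8, minute)

today = {at(11, m): v for m, v in zip(range(1, 6), [64, 128, 72, 144, 96])}
yesterday = {at(10, m): v for m, v in zip(range(1, 6), [23, 12, 18, 37, 24])}

def timeshift(points, delta):
    """Move every timestamp by `delta` so two periods share a common time axis."""
    return {ts + delta: value for ts, value in points.items()}

shifted = timeshift(yesterday, timedelta(hours=24))

# Point-by-point ratio of now versus the same minute yesterday
ratio = {ts: today[ts] / shifted[ts] for ts in today if ts in shifted}
```

A ratio well above 1 at a given minute means write latency is higher now than at the same minute yesterday.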

  28. Windowing data, aka “grouping over time” • Tiled / Fixed • Sliding / Rolling • … • See Tyler Akidau (Apache Beam): – QCon SF 2016 slides – “Beyond Batch” blog posts Part 1, Part 2
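The difference between tiled and sliding windows is easiest to see on a short list. This is a count-based sketch with illustrative function names; real stream processors typically define windows by time rather than by number of points.

```python
def tiled_windows(values, size):
    """Fixed, non-overlapping windows; the last window may be partial."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def sliding_windows(values, size):
    """Overlapping windows that advance one point at a time."""
    return [values[i:i + size] for i in range(len(values) - size + 1)]

values = [64, 128, 72, 144, 96]

tiles = tiled_windows(values, 2)
slides = sliding_windows(values, 3)
```

Each point lands in exactly one tile but can appear in up to `size` sliding windows.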

  29. Handling “missing” data • Reality: often messy! • pandas fillna(): some very sane basics • Fancier model / ML based approaches: – try to “predict” missing data – “imputation” (statistics / econometrics) – inference / sampling (probabilistic models)

  30. Figure panels: Original data • Fixed value (mean) • Forward fill • Back fill • Interpolation (notebook code on GitHub)
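The four fill strategies named in the figure can be sketched in plain Python, with `None` marking a missing point. This is not the speaker's notebook code: the function names and the sample data are made up, and pandas provides equivalent operations through `fillna`, `ffill`, `bfill`, and `interpolate`.

```python
def fill_fixed(values, fill):
    """Replace every missing point with one fixed value."""
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def back_fill(values):
    """Carry the next observed value backward; trailing gaps stay None."""
    return forward_fill(values[::-1])[::-1]

def interpolate(values):
    """Linear interpolation across interior gaps; leading/trailing gaps stay None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

data = [60, None, None, 120, 90]  # made-up series with a two-point gap
observed = [v for v in data if v is not None]
mean_value = sum(observed) / len(observed)
```

Each strategy encodes a different assumption about what happened during the gap, which is why the choice matters before feeding data to alerts or models.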

  31. Fixed-threshold alerting “Wake somebody up if the site is down”
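A fixed-threshold alert reduces to a comparison per data point. In the sketch below, the `min_consecutive` parameter is an added assumption (a common way to avoid paging on a single spike), and the threshold values are made up.

```python
def fixed_threshold_alerts(points, threshold, min_consecutive=1):
    """Return timestamps at which the value has been above `threshold`
    for exactly `min_consecutive` points in a row (one alert per breach run)."""
    alerts, run = [], 0
    for ts, value in points:
        run = run + 1 if value > threshold else 0
        if run == min_consecutive:
            alerts.append(ts)
    return alerts

points = [("8:01", 64), ("8:02", 128), ("8:03", 72), ("8:04", 144), ("8:05", 96)]

spikes = fixed_threshold_alerts(points, threshold=100)
sustained = fixed_threshold_alerts(points, threshold=70, min_consecutive=2)
```

The threshold itself still has to come from a human who knows what "down" means for this metric.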

  32. MACHINE SCALE = overwhelming complexity! N ≈ one million series • Can’t analyze them all • Can’t even look at them • $\binom{N}{2}$ pairs to compare • Historical comparisons over different timescales • PROBLEM: how to “scale” expert human time and attention?

  33. “Machine learning studies computer algorithms for learning to do stuff.” -Prof. Rob Schapire (COS 511 scribe notes)

  34. ML cheat sheet: Is machine learning right for you? • Do you know what you’re trying to accomplish? NO → Uh oh. YES → next question • Can you do it with simple / deterministic analysis? YES → Do that. NO → Let’s try ML…?

  35. Predictive models and outliers. Surprise: Your prediction is wrong!

  36. Outlier detection via predictive modeling “It’s tough to make predictions, especially about the future” KEY ASSUMPTIONS 1. In “steady-state”, data exhibit some regularity / predictability 2. Learn a model of this behavior 3. Major deviations from our expectation represent new underlying behavior or totally novel “exogenous shock” 4. These surprises are valuable to discover

  37. Outlier detection via predictive modeling “It’s tough to make predictions, especially about the future” KEY ASSUMPTIONS 1. In “steady-state”, data exhibit some regularity / predictability 2. Learn a model of this behavior 3. Major deviations from our expectation represent new underlying behavior or totally novel “exogenous shock” 4. These surprises are valuable to discover KEY Qs 1. Is behavior actually regular? 2. How to model behavior? 3. How major is “major”? 4. Are surprises actually valuable?
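One of the simplest possible instances of this recipe is sketched below. The "model" is just the mean of the previous few points, and "major" is defined as a multiple of their standard deviation. Both choices, the parameter defaults, and the data are assumptions for illustration, not the approach used in the talk.

```python
from statistics import mean, stdev

def detect_outliers(values, window=5, threshold=3.0):
    """Flag index i when values[i] deviates from the trailing-window mean
    by more than `threshold` trailing-window standard deviations."""
    outliers = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        predicted = mean(history)   # the "model": recent average
        spread = stdev(history)     # scale used to decide how major is "major"
        if spread > 0 and abs(values[i] - predicted) > threshold * spread:
            outliers.append(i)
    return outliers

# Made-up series: steady around 50 with one spike
values = [50, 52, 49, 51, 50, 48, 51, 120, 50, 49]
outlier_indexes = detect_outliers(values)
```

Only the spike at index 7 is flagged. Note that the spike then inflates the spread for the windows that follow it, which is one way a naive model can mask later surprises.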
