Understanding Software System Behavior With ML and Time Series Data
QCon.ai SF – April 11, 2018
David Andrzejewski - @davidandrzej
Engineering, Sumo Logic
Sumo Logic Confidential

Intro / context
• Currently:
  – Sumo Logic since 2011
  – Co-organizer: SF ML Meetup
  – @davidandrzej on Twitter
• Previously:
  – Postdoc at LLNL
  – U Wisconsin
    • BS Comp E / CS / Math
    • PhD CS (ML)

Continuous intelligence for machine data
Overview
1. Mega-trends: “Softwarification” of Everything + ML
2. Machine data: practicalities and basic analytics
3. Machine learning, data mining, and pitfalls

Trouble in software paradise!
Microservices “death star”
Big Data to the rescue?
DEBUG-level visibility, in production
• Logs (TBs / day)
• Metrics (millions of data points / min)
• Source code (GBs)
• Traces
• Events

Not so fast!
“Could a Neuroscientist Understand a Microprocessor?” Jonas & Kording (PLoS Comp Bio 2017)
• (cool NES plotter art - Michael Fogleman)

Using data to understand complex, dynamic, multi-scale systems
• “Grand challenge” problem
• New measurements → new science
• Data: necessary but not sufficient?
• Today’s systems:
  – Software
  – Biological
  – Social / economic

Machine data time series
Operational time series telemetry: the basics
• What:
  – “Four Golden Signals” (Google SRE book)
    • Latency, Traffic, Errors, Saturation
    • (also: USE, RED, …)
  – Basic resources: CPU, memory, …
  – More granular timings
  – Event counts, cache miss rates, other internals…
• How:
  – “push” agents/daemons (eg, StatsD)
  – “pull” metrics endpoints (eg, Prometheus)
• Where:
  – TSDB (time series database)
  – OSS / Commercial systems

Operational time series telemetry: why
Q: WTF is my system actually doing?
Monitoring & troubleshooting
• data visualization
• alerting*
• summarize behavior
• comparisons

Operational time series telemetry: example
“Metrics 2.0”–style key-value identifier:
  deployment=production
  cluster=indexer
  host=foobuzz-39
  metric=write_latency
  units=ms
Actual data: sequence of (timestamp, value)
  8:01   8:02   8:03   8:04   8:05   …
  64     128    72     144    96     …

Quantization: rollup / time-based aggregation
Raw event/observation data → coarser, more regular
1-minute aggregations → 1-hour aggregations, etc
Aggregation: map from multiset of floats to some single-valued summary
  f: multiset of ℝ → ℝ
• Min
• Max
• Avg
• Sum
• Count
• Percentiles?
1-minute values:
  8:00   8:01   …      8:58   8:59
  60.1   43.2   33.3   45.1   42.5
1-hour rollup:
  6:00   7:00   8:00   9:00   10:00
  …      …      33.3   …      …

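The rollup above can be sketched in a few lines of Python. This is a minimal illustration, not a TSDB implementation: the `rollup` helper, the integer timestamps (seconds since 8:00), and the position of the 33.3 reading within the hour are all assumptions made for the example.

```python
from collections import defaultdict

def rollup(points, bucket_seconds, agg):
    """Group (timestamp, value) points into fixed buckets and aggregate each bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Bucket key is the start of the interval that contains ts.
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: agg(values) for start, values in sorted(buckets.items())}

# Hypothetical 1-minute data; timestamps are seconds since 8:00.
minute_points = [(0, 60.1), (60, 43.2), (1500, 33.3), (3480, 45.1), (3540, 42.5)]

hourly_min = rollup(minute_points, 3600, min)
hourly_max = rollup(minute_points, 3600, max)
hourly_count = rollup(minute_points, 3600, len)
print(hourly_min, hourly_max, hourly_count)
```

Any function from a list of floats to a single number can be passed as `agg`, which mirrors the "multiset of floats to single-valued summary" framing.
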
SRE percentiles
Percentile as guarantee
p99 < 2000 ms translates into unambiguous language:
“No more than 1% of customer requests take longer than 2 seconds to execute”
• avg = 1485 ms
• p95 = 4894 ms

Percentiles via CDF⁻¹
p60 = -1.8 etc...
https://en.wikipedia.org/wiki/Normal_distribution

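A rough sketch of both views of a percentile: the empirical nearest-rank percentile of observed data, and the inverse CDF of a fitted distribution. The latency values are made up, and the normal parameters (mean -2, variance 0.5) are an assumption chosen only because they give a p60 near -1.8; the slide does not say which curve it used.

```python
import math
from statistics import NormalDist

def nearest_rank_percentile(values, p):
    """Smallest observed value such that at least p% of the data is <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds.
latencies_ms = [120, 180, 250, 300, 450, 800, 950, 1200, 1900, 5000]
p50 = nearest_rank_percentile(latencies_ms, 50)
p90 = nearest_rank_percentile(latencies_ms, 90)

# Percentile of a model distribution via the inverse CDF (assumed parameters).
model = NormalDist(mu=-2.0, sigma=math.sqrt(0.5))
p60_model = model.inv_cdf(0.60)
print(p50, p90, round(p60_model, 2))
```
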
Algebraic structure for fun and profit
Example: item counts / word counts
  f(s1 + s2) = f(s1) ⊕ f(s2)
• Left side: aggregate of combined data
• Right side: combination of aggregates
• Monoid homomorphism!

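The word-count case can be checked directly with `collections.Counter`, where list concatenation plays the role of `+` and Counter addition plays the role of ⊕. The sample words are hypothetical.

```python
from collections import Counter

def word_counts(words):
    """Aggregate f: a list of words -> per-word counts."""
    return Counter(words)

# Two hypothetical chunks of data, e.g. from two hosts or two time windows.
s1 = ["error", "timeout", "error"]
s2 = ["timeout", "ok"]

aggregate_of_combined = word_counts(s1 + s2)
combination_of_aggregates = word_counts(s1) + word_counts(s2)

# Both routes give the same answer, so partial counts can be merged later.
print(aggregate_of_combined == combination_of_aggregates)
```

This property is what lets counts, sums, mins, and maxes be computed per shard or per time bucket and merged afterwards.
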
Percentile original sin:
  f(s1 + s2) ≠ f(s1) ⊕ f(s2)
Not a monoid homomorphism
• In general, cannot combine:
  – p95 of dataset X
  – p95 of dataset Y
  ...to say anything meaningful at all about dataset X ∪ Y
• Impress your SRE/DevOps friends at parties!

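A small counterexample makes the point concrete. It uses the median (p50) rather than p95 to keep the numbers tiny, and the datasets are invented: two pairs of datasets share the same per-dataset medians, yet their unions have different medians, so no combining function of the two summaries alone can be right for both.

```python
from statistics import median

# Pair A (hypothetical data)
x_a, y_a = [1, 2, 3], [7, 8, 9]
# Pair B (hypothetical data) with the same per-dataset medians as pair A
x_b, y_b = [2, 2, 2], [8, 8, 8, 8, 8]

summaries_a = (median(x_a), median(y_a))
summaries_b = (median(x_b), median(y_b))

union_median_a = median(x_a + y_a)
union_median_b = median(x_b + y_b)

# Same inputs to any would-be combiner, different correct answers.
print(summaries_a, summaries_b, union_median_a, union_median_b)
```
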
Basic aggregation: across series
What is max write_latency of entire foobuzz cluster?
                   8:01   8:02   8:03   8:04   8:05   …
  host=foobuzz-1   64     128    72     144    96     …
  host=foobuzz-2   23     33     49     57     37     …
  host=foobuzz-3   46     101    78     58     39     …
  …                …      …      …      …      …      …
  f = MAX( )       55.3   47.1   76.8   52.3   41.7

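Aggregating across series means applying the function column by column, one timestamp at a time. A minimal sketch using only the three host rows shown (a real cluster has more hosts, so these results cover just that subset):

```python
# Rows are hosts, columns are the timestamps 8:01 .. 8:05.
series = {
    "foobuzz-1": [64, 128, 72, 144, 96],
    "foobuzz-2": [23, 33, 49, 57, 37],
    "foobuzz-3": [46, 101, 78, 58, 39],
}

# zip(*rows) turns per-host rows into per-timestamp columns.
cluster_max = [max(column) for column in zip(*series.values())]
print(cluster_max)
```
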
Basic aggregation: across time (aka “fold”)
What is average queue depth of each foobuzz host over this time period?
                   8:01   8:02   8:03   …   f = AVG( )
  host=foobuzz-1   64     128    72     …   103.4
  host=foobuzz-2   23     33     49     …   48.6
  host=foobuzz-3   46     101    78     …   62.1
  …                …      …      …      …

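Aggregating across time collapses each series to one number. The sketch below writes the average as an explicit fold with `functools.reduce`, carrying a running (sum, count) pair. It uses only the three values visible per host, so the averages differ from the full-period figures in the table.

```python
from functools import reduce

series = {
    "foobuzz-1": [64, 128, 72],
    "foobuzz-2": [23, 33, 49],
    "foobuzz-3": [46, 101, 78],
}

def fold_avg(values):
    """Average as a fold: accumulate (sum, count), then divide once at the end."""
    total, count = reduce(lambda acc, v: (acc[0] + v, acc[1] + 1), values, (0, 0))
    return total / count

host_avg = {host: fold_avg(values) for host, values in series.items()}
print(host_avg)
```
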
Comparison: time-shifted comparisons
How does write_latency for this foobuzz instance compare versus yesterday?
  deployment=production
  cluster=indexer
  host=foobuzz-21
  metric=write_latency
  units=ms
[Plot: write_latency, 8:01 to 8:05, “Now” vs “Timeshift”]
                     8:01   8:02   8:03   8:04   8:05   …
  Now                64     128    72     144    96     …
  Timeshift (-24h)   23     12     18     37     24     …

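A time-shifted comparison amounts to re-keying yesterday's points 24 hours forward and joining on timestamp. A minimal sketch; the `timeshift` helper and the calendar date are assumptions for illustration.

```python
from datetime import datetime, timedelta

def timeshift(series, offset):
    """Move every timestamp by offset so older data lines up with current data."""
    return {ts + offset: value for ts, value in series.items()}

day = datetime(2018, 4, 11)  # hypothetical date
minutes = [1, 2, 3, 4, 5]
now = {day.replace(hour=8, minute=m): v
       for m, v in zip(minutes, [64, 128, 72, 144, 96])}
yesterday = {day.replace(hour=8, minute=m) - timedelta(days=1): v
             for m, v in zip(minutes, [23, 12, 18, 37, 24])}

shifted = timeshift(yesterday, timedelta(hours=24))
# Join on timestamp; skip points with no counterpart in the shifted series.
delta = {ts: now[ts] - shifted[ts] for ts in now if ts in shifted}
print(list(delta.values()))
```
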
Windowing data
Aka “grouping over time”
• Tiled / Fixed
• Sliding / Rolling
• …
• See Tyler Akidau (Apache Beam)
  – QCon SF 2016 slides
  – “Beyond Batch” blog posts Part 1, Part 2

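The difference between tiled and sliding windows is easiest to see on a short list. This is a count-based sketch (windows of N points); real systems usually window by time, and the sample values are hypothetical.

```python
def tiled_windows(values, size):
    """Non-overlapping windows: each point belongs to exactly one window."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def sliding_windows(values, size):
    """Overlapping windows that advance one point at a time."""
    return [values[i:i + size] for i in range(len(values) - size + 1)]

values = [64, 128, 72, 144, 96, 80]  # hypothetical readings
tiled = tiled_windows(values, 3)
sliding = sliding_windows(values, 3)

print(tiled)
print(sliding)
```
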
Handling “missing” data
Reality: often messy!
• pandas fillna() – some very sane basics
• Fancier model / ML based approaches
  – try to “predict” missing data
  – “imputation” (statistics / econometrics)
  – inference / sampling (probabilistic models)

[Figure panels: Original data; Fixed value (mean); Forward fill; Back fill; Interpolation] (notebook code on Github)
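The four basic strategies can be sketched in plain Python, with `None` marking a missing point. This is not the notebook code referenced above; the helper names and sample data are assumptions. pandas provides equivalents through `fillna()`, `ffill()`, `bfill()`, and `interpolate()`.

```python
def fill_mean(values):
    """Replace every gap with the mean of the observed values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def forward_fill(values):
    """Carry the last observed value forward (leading gaps stay None)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def back_fill(values):
    """Carry the next observed value backward (trailing gaps stay None)."""
    return forward_fill(values[::-1])[::-1]

def interpolate(values):
    """Linear interpolation between neighboring observed points."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

data = [10.0, None, None, 40.0, None, 60.0]  # hypothetical series with gaps
print(forward_fill(data), back_fill(data), interpolate(data))
```
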
Fixed-threshold alerting
“Wake somebody up if the site is down”
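A fixed-threshold alert is about the simplest rule possible. The sketch below adds a `min_consecutive` parameter to suppress one-off spikes; that parameter, the function name, and the threshold values are assumptions for illustration and are not taken from the talk.

```python
def threshold_alerts(points, threshold, min_consecutive=1):
    """Return timestamps where value has exceeded threshold min_consecutive times in a row."""
    alerts, run = [], 0
    for ts, value in points:
        run = run + 1 if value > threshold else 0
        # Fire once per breach, at the moment the run reaches the required length.
        if run == min_consecutive:
            alerts.append(ts)
    return alerts

points = [("8:01", 64), ("8:02", 128), ("8:03", 72), ("8:04", 144), ("8:05", 96)]
spikes = threshold_alerts(points, threshold=100)
sustained = threshold_alerts(points, threshold=70, min_consecutive=2)
print(spikes, sustained)
```
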
MACHINE SCALE = overwhelming complexity!
N ≈ one million series
• Can’t analyze them all
• Can’t even look at them
• (N choose 2) pairs to compare
• Historical comparisons over different timescales
• PROBLEM: how to “scale” expert human time and attention?

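For a sense of scale, the number of pairs can be computed directly. The one-microsecond cost per comparison is an arbitrary assumption, used only to turn the count into a duration.

```python
import math

n_series = 1_000_000
pairs = math.comb(n_series, 2)  # N choose 2 = N * (N - 1) / 2

# Assumed cost of one microsecond per pairwise comparison.
seconds = pairs * 1e-6
days = seconds / 86_400
print(f"{pairs:,} pairs, about {days:.1f} days at 1 microsecond each")
```
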
“Machine learning studies computer algorithms for learning to do stuff.”
- Prof. Rob Schapire (COS 511 scribe notes)
ML cheat sheet: Is machine learning right for you?
• Do you know what you’re trying to accomplish?
  – NO → Uh oh
  – YES → Can you do it with simple / deterministic analysis?
    • YES → Do that
    • NO → Let’s try ML…?

Predictive models and outliers
Surprise: Your prediction is wrong!
Outlier detection via predictive modeling
“It’s tough to make predictions, especially about the future”
KEY ASSUMPTIONS
1. In “steady-state”, data exhibit some regularity / predictability
2. Learn a model of this behavior
3. Major deviations from our expectation represent new underlying behavior or totally novel “exogenous shock”
4. These surprises are valuable to discover
KEY Qs
1. Is behavior actually regular?
2. How to model behavior?
3. How major is “major”?
4. Are surprises actually valuable?

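One of the simplest possible instances of this recipe: predict each point as the mean of a trailing window and flag it when the error exceeds k standard deviations of that window. This is a sketch under stated assumptions, not the method from the talk; the window size, the value of `k`, and the data are all made up.

```python
from statistics import mean, stdev

def find_outliers(values, window=5, k=3.0):
    """Flag indices whose value deviates from the trailing-window mean by more than k stdevs."""
    outliers = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        predicted = mean(history)   # the "model": recent average
        spread = stdev(history)     # how much variation counts as normal
        if spread > 0 and abs(values[i] - predicted) > k * spread:
            outliers.append(i)
    return outliers

# Hypothetical latency series with one obvious spike at index 7.
values = [50, 52, 49, 51, 50, 48, 51, 140, 50, 49]
print(find_outliers(values))
```

Here `k` is a direct answer to “how major is major?”. Note that once the spike enters the trailing window it inflates the spread, so the points right after it are judged against a much looser baseline.
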