5 Years of Metrics & Monitoring Lindsay Holmwood @auxesis
Cultural & Technical
Key retrospective questions
• What did we do well?
• What did we learn?
• What should we do differently next time?
• What still puzzles us?
What got us here won’t get us there
What did we do well? (that if we don’t talk about, we might forget)
The Pipeline
aggregation collection storage checking alerting graphing
collectd & statsd
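As a rough illustration of how little it takes to feed the collection stage, here is a minimal Python sketch of the statsd plain-text line format (`name:value|type`, with an optional `|@rate` sample-rate suffix) sent over UDP. The metric name `web.requests` and the helper names `statsd_line` and `send` are hypothetical; the host and port assume a statsd daemon listening locally on its usual default of 8125.

```python
import socket

def statsd_line(name, value, metric_type, sample_rate=None):
    """Format one metric in the statsd line protocol, e.g. 'web.requests:1|c'."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line

def send(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a single metric line over UDP (assumed local daemon)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))

if __name__ == "__main__":
    # 'c' is a counter, 'ms' a timer, 'g' a gauge.
    send(statsd_line("web.requests", 1, "c"))
    send(statsd_line("web.response_time", 320, "ms"))
```

UDP means the application never blocks on the metrics system, at the cost of silently dropped samples when nothing is listening.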
Graphite & OpenTSDB & InfluxDB
Riemann
Alert fatigue has become a recognised problem
Cottage industry
PagerDuty & VictorOps & OpsGenie
• Librato
• BigPanda
• Datadog
• AppDynamics
• Metafor
• Stackdriver
• New Relic
• PagerDuty
• Pingdom
• VictorOps
• Dataloop.io
• OpsGenie
#monitoringsucks
https://github.com/monitoringsucks/tool-repos
https://github.com/monitoringsucks/metrics-catalog
If your business had to choose one metric to alert off, what would it be?
#monitoringlove
What would we do differently next time?
Graphs & Dashboards
Apparently the hardest problem in monitoring is graphing and dashboarding.
What we’re doing wrong
Strip charts
We have a problem
Strip charts: the PHP hammer of graphing
What can the data tell us?
What is the distribution?
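One way to put this question to the data, sketched in Python. The latency samples are invented for illustration, and the `histogram` and `percentile` helpers and the 50 ms bucket width are assumptions, not anything from a particular tool. The point is that a single averaged line on a strip chart hides the slow tail that a histogram and percentiles expose.

```python
import math
from collections import Counter

def histogram(samples, bucket_width):
    """Count samples per bucket; keys are the lower bound of each bucket."""
    counts = Counter((s // bucket_width) * bucket_width for s in samples)
    return dict(sorted(counts.items()))

def percentile(samples, p):
    """Nearest-rank percentile (integer p in 1..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Invented request latencies in milliseconds (illustration only).
latencies_ms = [12, 14, 15, 13, 16, 14, 15, 13, 240, 250]

if __name__ == "__main__":
    for lower, count in histogram(latencies_ms, 50).items():
        print(f"{lower:>4} ms+ {'#' * count}")
    print("mean:", sum(latencies_ms) / len(latencies_ms))
    print("p50: ", percentile(latencies_ms, 50))
    print("p90: ", percentile(latencies_ms, 90))
```

With these samples the mean sits near 60 ms, a value that almost no individual request actually experienced: most are around 14 ms and two are above 200 ms.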
It’s not a problem with the tools
Our approach is tainted
graphing problems serviced by strip charts graphing problems we have
Basic graph layout
Black on white
Bounding box with x + y axes, labels
Colour
Differential colour engine
Maximum of 15 colours on-screen
8%
Adjust saturation, not hue
This is hue
This is saturation
Use minimal hue to call out data
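A minimal sketch of this idea using Python's standard `colorsys` module: hold one hue fixed and step the saturation to get a family of related colours, then reserve a single contrasting hue for the series you want noticed. The function name `saturation_ramp`, the chosen hue values, and the default brightness are assumptions for illustration.

```python
import colorsys

def saturation_ramp(hue, steps, value=0.9):
    """Return `steps` hex colours sharing one hue, from pale to fully saturated.

    hue and value are fractions in 0..1, as colorsys expects.
    """
    colours = []
    for i in range(1, steps + 1):
        saturation = i / steps
        r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
        colours.append(
            "#{:02x}{:02x}{:02x}".format(
                round(r * 255), round(g * 255), round(b * 255)
            )
        )
    return colours

if __name__ == "__main__":
    # Background series: one blue-ish hue, four saturation levels.
    print(saturation_ramp(hue=0.6, steps=4))
    # Call-out series: a single, different hue (orange-ish) at full saturation.
    print(saturation_ramp(hue=0.08, steps=1))
```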
Fucking Pie Charts
Experiment: Compare segment sizes
“This allows us to see very clearly that the pie chart judgements are less accurate than the bar chart judgements.” – William S. Cleveland, Principles of Graphing Data, p. 86
Pie chart comparisons are more error prone
The only time you should use a pie chart: pie eaten vs. pie not eaten
Or maybe this
What did we learn?
Democratisation of graphing tool development
Scratch our itches
Same poor UX, better paint job
We get the graphing tools we deserve
Nagios is here to stay (at least for ops)
kartar.net/2014/11/monitoring-survey---tools
Bar chart of survey responses by monitoring tool: 283, 193, 86, 77, 68, 47, 38, 17, 9, 8, 7, 6, 6
Inertia
No strong, compelling alternative
Sensu
When I hear people say “I'm not using Sensu because it's too complex” I think “and Nagios isn't hiding the same complexity from you?”
This is a problem
Using Nagios? Look at Icinga & Naemon
We don’t know stats
aggregation collection storage checking alerting graphing checks
Poor statistical literacy has implications for graphs & checks
Graphs
We need many partially overlapping and always somehow contradictory descriptive layers to approximate a rendition of reality – Niels Bohr
D3 & NVD3
Checks
Numbers & Strings & Behaviour
Numbers
Fault detection (thresholding)
Anomaly detection (trend analysis)
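The difference between the two kinds of numeric check can be sketched in a few lines of Python. This is an illustration under assumptions, not how any particular monitoring tool does it: the function names, the CPU readings, and the 90% critical limit are invented, and a simple "distance from the recent mean in standard deviations" test stands in for real trend analysis.

```python
from statistics import mean, stdev

def breaches_threshold(value, critical):
    """Fault detection: compare one reading against a fixed limit."""
    return value >= critical

def is_anomalous(history, value, max_sigma=3.0):
    """Anomaly detection: flag a reading far from the recent mean.

    Uses the sample standard deviation of `history`; needs at least
    two past readings to say anything.
    """
    if len(history) < 2:
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > max_sigma

# Invented CPU utilisation readings in percent (illustration only).
recent_cpu = [40, 42, 41, 39, 43, 40, 41, 42]

if __name__ == "__main__":
    reading = 75
    print("threshold breached:", breaches_threshold(reading, critical=90))
    print("anomalous:", is_anomalous(recent_cpu, reading))
```

A reading of 75% passes the fixed threshold check but is far outside what the recent history looks like, which is exactly the case thresholding alone misses.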
Recommend
More recommend