5 Years of Metrics & Monitoring
Lindsay Holmwood @auxesis


  1. 5 Years of Metrics & Monitoring Lindsay Holmwood @auxesis

  2. Cultural & Technical

  3. • Key retrospective questions • What did we do well? • What did we learn? • What should we do differently next time? • What still puzzles us?

  4. What got us here won’t get us there

  5. What did we do well? (that if we don’t talk about, we might forget)

  6. The Pipeline

  7. aggregation collection storage checking alerting graphing

  8. aggregation collection storage checking alerting graphing collectd & statsd
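
For the collection and aggregation stage, statsd speaks a tiny plain-text protocol over UDP. A minimal sketch (the metric names are illustrative; localhost:8125 is statsd's default listener):

```python
import socket

def statsd_send(metric, value, mtype, host="127.0.0.1", port=8125):
    """Fire-and-forget one metric to statsd using its plain-text UDP protocol."""
    payload = f"{metric}:{value}|{mtype}".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))
    sock.close()

statsd_send("app.logins", 1, "c")         # counter increment
statsd_send("app.request_ms", 320, "ms")  # timing sample, aggregated server-side
```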

  9. aggregation collection storage checking alerting graphing Graphite & OpenTSDB & InfluxDB
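
At the storage stage, Graphite's Carbon listener accepts a similarly simple plaintext line protocol: metric path, value, timestamp. A sketch assuming Carbon's default plaintext port 2003 (the metric path is made up):

```python
import socket
import time

def graphite_send(metric, value, host="127.0.0.1", port=2003):
    """Write one datapoint using Carbon's plaintext protocol: "path value timestamp\n"."""
    line = f"{metric} {value} {int(time.time())}\n"
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode())

graphite_send("servers.web01.load.shortterm", 0.42)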

  10. aggregation collection storage checking alerting graphing Riemann
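
Riemann itself is configured in Clojure as composable streams that events flow through. As a rough illustration of the idea only (none of this is Riemann's actual API), here is the stream-composition pattern sketched in Python:

```python
# Events are plain dicts: {"service": ..., "metric": ..., "state": ...}

def where(pred, *children):
    """Pass events matching a predicate down to child streams."""
    def stream(event):
        if pred(event):
            for child in children:
                child(event)
    return stream

def set_state(threshold):
    """Derive an event's state from its metric, threshold-style."""
    def stream(event):
        event["state"] = "critical" if event["metric"] > threshold else "ok"
        print(event)
    return stream

pipeline = where(lambda e: e["service"] == "api latency", set_state(0.5))
pipeline({"service": "api latency", "metric": 0.72})  # => state: critical
```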

  11. Alert fatigue has become a recognised problem

  12. Cottage industry

  13. PagerDuty & VictorOps & OpsGenie

  14. • Librato • BigPanda • Datadog • AppDynamics • Metafor • Stackdriver • New Relic • PagerDuty • Pingdom • VictorOps • Dataloop.io • OpsGenie

  15. #monitoringsucks

  16. https://github.com/monitoringsucks/tool-repos

  17. https://github.com/monitoringsucks/metrics-catalog

  18. If your business had to choose one metric to alert off, what would it be?

  19. #monitoringlove

  20. What would we do differently next time?

  21. Graphs & Dashboards

  22. Apparently the hardest problem in monitoring is graphing and dashboarding.

  23. What we’re doing wrong

  24. Strip charts

  25. We have a problem

  26. Strip charts: the PHP hammer of graphing

  27. What can the data tell us?

  28. What is the distribution?
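
One way to ask that question of the data: look at percentiles and a histogram rather than a per-interval average. A sketch with made-up, skewed latency samples:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical latency samples: long-tailed, as service latencies usually are
latencies = np.random.lognormal(mean=3.0, sigma=0.6, size=10_000)

p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms  mean={latencies.mean():.0f}ms")

# The histogram shows the tail that a mean-per-interval strip chart averages away
plt.hist(latencies, bins=100)
plt.xlabel("request latency (ms)")
plt.ylabel("count")
plt.show()
```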

  29. It’s not a problem with the tools

  30. Our approach is tainted

  31. [Diagram: “graphing problems serviced by strip charts” vs. “graphing problems we have”]

  32. Basic graph layout

  33. Black on white

  34. [Figure: the basic graph layout: a bounding box enclosing the data, with labelled x and y axes]
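
A sketch of that layout in matplotlib (the datapoints are made up):

```python
import matplotlib.pyplot as plt

hours = list(range(6))
load = [0.4, 0.6, 0.5, 0.9, 1.2, 0.8]  # invented datapoints

fig, ax = plt.subplots(facecolor="white")
ax.plot(hours, load, color="black")  # black data on a white background
ax.set_xlabel("hour")                # labelled x axis
ax.set_ylabel("load average")        # labelled y axis
# the default Axes frame provides the bounding box around the data
plt.show()
```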

  35. Colour

  36. Differential colour engine

  37. Maximum of 15 colours on-screen

  38. 8% of men are colour blind

  39. Adjust saturation, not hue

  40. This is hue

  41. This is saturation

  42. Use minimal hue to call out data
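
A sketch of that palette trick with the standard-library colorsys module: hold hue and value constant, step saturation, and keep one contrasting hue in reserve for the series you want to call out (the specific hue and saturation numbers are illustrative):

```python
import colorsys

# One hue, stepped saturation: series stay visually related but distinguishable
HUE = 0.6  # blue-ish; hue stays fixed, only saturation varies
palette = [colorsys.hsv_to_rgb(HUE, s, 0.9) for s in (0.25, 0.5, 0.75, 1.0)]

# Reserve a single contrasting hue to call out the one series that matters
highlight = colorsys.hsv_to_rgb(0.05, 1.0, 0.9)  # red-ish
```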

  43. Fucking Pie Charts

  44. Experiment: Compare segment sizes

  45. “This allows us to see very clearly that the pie chart judgements are less accurate than the bar chart judgements.” – William S. Cleveland, The Elements of Graphing Data, p. 86

  46. Pie chart comparisons are more error prone
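
You can rerun a small version of the experiment yourself: render the same (made-up) values as a pie and as a bar chart, then try to rank the segments in each. For example:

```python
import matplotlib.pyplot as plt

values = [23, 21, 20, 19, 17]  # deliberately close together; invented numbers

fig, (pie_ax, bar_ax) = plt.subplots(1, 2)
pie_ax.pie(values)                       # angle/area judgements: which slice is biggest?
bar_ax.barh(range(len(values)), values)  # position along a scale: instantly rankable
plt.show()
```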

  47. The only time you should use a pie chart Pie eaten Pie not eaten

  48. Or maybe this

  49. What did we learn?

  50. Democratisation of graphing tool development

  51. Scratch our itches

  52. Same poor UX, better paint job

  53. We get the graphing tools we deserve

  54. Nagios is here to stay (at least for ops)

  55. kartar.net/2014/11/monitoring-survey---tools

  56. [Bar chart of survey responses per monitoring tool, from 283 down to 6, with Nagios in the lead; the vertical tool-name labels were garbled in extraction]

  57. Inertia

  58. No strong, compelling alternative

  59. Sensu

  60. When I hear people say “I'm not using Sensu because it's too complex”, I think “and Nagios isn't hiding the same complexity from you?”

  61. This is a problem

  62. Using Nagios? Look at Icinga & Naemon

  63. We don’t know stats

  64. aggregation collection storage checking alerting graphing checks

  65. Poor statistical literacy has implications for graphs & checks
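
A two-line illustration of why this matters for checks: on skewed data, the mean and the median tell very different stories (the samples are invented):

```python
import statistics

# Hypothetical response times in ms, with one slow outlier
samples = [12, 14, 13, 15, 12, 16, 14, 480]

print(statistics.mean(samples))    # 72: dragged up by a single outlier
print(statistics.median(samples))  # 14: what a typical request actually saw
```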

  66. Graphs

  67. We need many partially overlapping and always somehow contradictory descriptive layers to approximate a rendition of reality – Niels Bohr

  68. D3 & NVD3

  69. Checks

  70. Numbers & Strings & Behaviour

  71. Numbers

  72. Fault detection (thresholding)
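
Thresholding is the classic Nagios-style check: compare a number against warning and critical levels, then exit with the standard plugin codes (0 OK, 1 WARNING, 2 CRITICAL). A hypothetical sketch:

```python
#!/usr/bin/env python3
"""Hypothetical Nagios-style check: threshold a number, exit with plugin codes."""
import sys

OK, WARNING, CRITICAL = 0, 1, 2

def check_threshold(value, warn, crit):
    if value >= crit:
        print(f"CRITICAL - value is {value}")
        return CRITICAL
    if value >= warn:
        print(f"WARNING - value is {value}")
        return WARNING
    print(f"OK - value is {value}")
    return OK

sys.exit(check_threshold(value=87, warn=75, crit=90))
```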

  73. Anomaly detection (trend analysis)
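
Trend analysis can start as simply as flagging values that sit far outside recent history. A rough sketch of a rolling three-sigma rule (the window size and threshold are arbitrary choices, and real traffic is rarely this well behaved):

```python
from collections import deque
import statistics

def anomalous(history, value, sigmas=3.0):
    """Flag a value more than `sigmas` standard deviations from recent history."""
    if len(history) < 2:
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return stdev > 0 and abs(value - mean) > sigmas * stdev

window = deque(maxlen=60)  # e.g. the last hour of one-minute samples
for sample in (10, 11, 10, 12, 11, 10, 11, 48):
    if anomalous(window, sample):
        print(f"anomaly: {sample}")  # only the 48 is flagged
    window.append(sample)
```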
