Avoiding alerts overload from microservices



  1. Avoiding alerts overload from microservices Sarah Wells Principal Engineer, Financial Times @sarahjwells

  2. Knowing when there’s a problem isn’t enough @sarahjwells

  3. You only want an alert when you need to take action

  4. Hello @sarahjwells

  5. 1

  6. 2 1

  7. 2 1 3

  8. 4 2 1 3

  9. Monitoring this system… @sarahjwells

  10. Microservices make it worse @sarahjwells

  11. “microservices (n,pl): an efficient device for transforming business problems into distributed transaction problems” @drsnooks

  12. The services *themselves* are simple… @sarahjwells

  13. There’s a lot of complexity around them @sarahjwells

  14. Why do they make monitoring harder? @sarahjwells

  15. You have a lot more services @sarahjwells

  16. 99 functional microservices 350 running instances @sarahjwells

  17. 52 non-functional services 218 running instances @sarahjwells

  18. That’s 568 separate services @sarahjwells

  19. If we checked each service every minute… @sarahjwells

  20. 817,920 checks per day @sarahjwells

  21. What about system checks? @sarahjwells

  22. 16,358,400 checks per day @sarahjwells

  23. “One-in-a-million” issues would hit us 16 times every day @sarahjwells

  24. Running containers on shared VMs reduces this to 92,160 system checks per day @sarahjwells

  25. For a total of 910,080 checks per day @sarahjwells
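The arithmetic behind slides 16 to 25 can be reproduced in a few lines. This is only a sketch: the slides give the totals but not the per-instance system check count, so the value 20 (and the 64 checks per minute for the shared VMs) are assumptions inferred by dividing the published figures.

```python
MINUTES_PER_DAY = 24 * 60  # one check per minute, as on slide 19

functional_instances = 350       # slide 16
non_functional_instances = 218   # slide 17
instances = functional_instances + non_functional_instances

service_checks_per_day = instances * MINUTES_PER_DAY

# Assumption (not stated on the slides): 20 system checks per instance,
# inferred because 16,358,400 / 817,920 = 20.
SYSTEM_CHECKS_PER_INSTANCE = 20
system_checks_per_day = service_checks_per_day * SYSTEM_CHECKS_PER_INSTANCE

# Expected daily hits for an issue that affects one check in a million.
one_in_a_million_hits = system_checks_per_day / 1_000_000

# Assumption: 64 system checks per minute across the shared VMs,
# inferred because 92,160 / 1,440 = 64.
SHARED_VM_CHECKS_PER_MINUTE = 64
shared_vm_checks_per_day = SHARED_VM_CHECKS_PER_MINUTE * MINUTES_PER_DAY

total_checks_per_day = service_checks_per_day + shared_vm_checks_per_day

print(instances, service_checks_per_day, system_checks_per_day)
print(shared_vm_checks_per_day, total_checks_per_day, int(one_in_a_million_hits))
```

The point of the exercise is the scale: at this volume even very rare failure modes show up daily, so alerting on every individual check result is not workable.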

  26. It’s a distributed system @sarahjwells

  27. Services are not independent @sarahjwells

  28. http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts

  29. You have to change how you think about monitoring @sarahjwells

  30. How can you make it better?

  31. 1. Build a system you can support @sarahjwells

  32. The basic tools you need @sarahjwells

  33. Log aggregation @sarahjwells

  34. Logs go missing or get delayed more now @sarahjwells

  35. Which means log based alerts may miss stuff @sarahjwells

  36. Monitoring @sarahjwells

  37. Limitations of our Nagios integration… @sarahjwells

  38. No ‘service-level’ view @sarahjwells

  39. Default checks included things we couldn’t fix @sarahjwells

  40. A new approach for our container stack @sarahjwells

  41. We care about each service @sarahjwells

  42. We care about each VM @sarahjwells

  43. We care about unhealthy instances @sarahjwells

  44. Monitoring needs aggregating somehow @sarahjwells

  45. SAWS @sarahjwells

  46. Built by Silvano Dossan. See our Engine room blog: http://bit.ly/1GATHLy

  47. "I imagine most people do exactly what I do - create a google filter to send all Nagios emails straight to the bin" @sarahjwells

  48. "Our screens have a viewing angle of about 10 degrees" @sarahjwells

  49. "It never seems to show the page I want" @sarahjwells

  50. Code at: https://github.com/muce/SAWS @sarahjwells

  51. Dashing @sarahjwells

  52. Graphing of metrics @sarahjwells

  53. https://www.flickr.com/photos/davidmasters/2564786205/

  54. The things that make those tools WORK @sarahjwells

  55. Effective log aggregation needs a way to find all related logs @sarahjwells

  56. Transaction ids tie all microservices together

  57. Make it easy for any language you use @sarahjwells
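One way to picture transaction ids tying services together is a small helper that every service calls on each request. The sketch below is illustrative only: the header name `X-Request-Id`, the `tid_` prefix, and all function names are hypothetical choices, not something the slides specify.

```python
import logging
import uuid
from typing import Dict, Optional

# Hypothetical header name; the slides do not say which header is used.
TRANSACTION_HEADER = "X-Request-Id"


def ensure_transaction_id(headers: Dict[str, str]) -> str:
    """Reuse the caller's transaction id, or mint one at the edge."""
    tid = headers.get(TRANSACTION_HEADER)
    if not tid:
        # Assumed format: a short random id with a recognisable prefix.
        tid = "tid_" + uuid.uuid4().hex[:10]
    return tid


def outgoing_headers(tid: str, extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Headers for a downstream call, carrying the same transaction id."""
    headers = dict(extra or {})
    headers[TRANSACTION_HEADER] = tid
    return headers


def log_event(logger: logging.Logger, tid: str, message: str) -> None:
    """Put the id in every log line so log aggregation can group them."""
    logger.info("transaction_id=%s %s", tid, message)
```

A real service would treat header names case-insensitively; the plain dict lookup here is a simplification. Keeping the helper this small is what makes it easy to port to each language a team uses.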

  58. @sarahjwells

  59. Services need to report on their own health @sarahjwells

  60. The FT healthcheck standard GET http://{service}/__health

  61. The FT healthcheck standard GET http://{service}/__health returns 200 if the service can run the healthcheck

  62. The FT healthcheck standard: GET http://{service}/__health returns 200 if the service can run the healthcheck; each check will return "ok": true or "ok": false
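The behaviour described on slides 60 to 62 can be sketched as a function that runs every check and always answers 200 when it managed to run them, with the per-check result in the body. Only the `"ok": true` / `"ok": false` field comes from the slides; the `checks` and `name` fields and the function name are assumptions made for illustration.

```python
import json
from typing import Callable, Dict, Tuple


def run_healthcheck(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each named check and build a response for GET /__health."""
    results = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that raises is reported as failing, not as a 5xx.
            ok = False
        results.append({"name": name, "ok": ok})
    # 200 means "the healthcheck ran", not "everything is healthy".
    return 200, json.dumps({"checks": results})
```

Separating the HTTP status from the per-check result lets a monitoring tool tell "the service is unreachable" apart from "the service is up but one of its dependencies is failing".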

  63. Knowing about problems before your clients do @sarahjwells

  64. Synthetic requests tell you about problems early https://www.flickr.com/photos/jted/5448635109

  65. 2. Concentrate on the stuff that matters @sarahjwells

  66. It’s the business functionality you should care about @sarahjwells

  67. We care about whether content got published successfully

  68. When people call our APIs, we care about speed

  69. … we also care about errors

  70. But it's the end-to-end that matters https://www.flickr.com/photos/robef/16537786315/

  71. If you just want information, create a dashboard or report

  72. Checking the services involved in a business flow @sarahjwells

  73. /__health?categories=lists-publish
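A category filter like the one in the URL above could work roughly as follows on an aggregating healthcheck. This is a guess at the mechanics, not the FT implementation: the service names, the data layout, and the support for comma-separated categories are all hypothetical.

```python
from typing import Dict
from urllib.parse import parse_qs, urlsplit

# Hypothetical registry: each service lists the business flows it belongs to.
SERVICES = {
    "lists-ingester": {"categories": {"lists-publish"}, "ok": True},
    "lists-writer": {"categories": {"lists-publish"}, "ok": False},
    "image-api": {"categories": {"image-read"}, "ok": True},
}


def health_for_url(url: str, services: Dict[str, dict]) -> dict:
    """Aggregate health for the services in the requested categories."""
    query = parse_qs(urlsplit(url).query)
    wanted = set()
    for value in query.get("categories", []):
        # Assumption: several categories may be given, separated by commas.
        wanted.update(part for part in value.split(",") if part)
    selected = {
        name: info["ok"]
        for name, info in services.items()
        if not wanted or info["categories"] & wanted
    }
    # The flow is healthy only if every service involved in it is healthy.
    return {"ok": all(selected.values()), "services": selected}


print(health_for_url("/__health?categories=lists-publish", SERVICES))
```

Grouping by business flow means an alert can say "publishing lists is broken" rather than listing individual failing services.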

  74. 3. Cultivate your alerts @sarahjwells
