How to build observable distributed systems, by @PierreVincent

  1. March 8th, 2018 – QCon London: How to build observable distributed systems. @PierreVincent, pvincent.io

  2. @PierreVincent

  3. Pierre Vincent, SRE Manager at Poppulo. @PierreVincent, pvincent.io

  4. Reaching production is only the beginning.

  5. No system is immune to failure. Be ready to recover.

  6. When distributing a system, we’re also distributing the places where things might go wrong.

  7. Monitoring only applies to known failure modes. What about everything else?

  8. “Monitoring tells you whether the system works. Observability lets you ask why it's not working.” – Baron Schwartz

  9. Metrics, Healthchecks, Tracing, Logging

  10. Healthchecks

  11. Healthchecks: Is it running? Can it perform its task? Can it accept more work?

  12. Healthchecks: Broadcast, Register, Expose

  13. Healthchecks: GET http://1.2.3.4:8080/health returns 200 OK with the body:

        {
          "service": "registration-service",
          "healthy": true,
          "workload": { "healthy": true },
          "dependencies": [
            { "name": "cassandra", "healthy": true },
            { "name": "billing-svc", "healthy": true }
          ]
        }
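
A minimal sketch of how a response like the one on slide 13 could be assembled. The function name `build_health_report`, the check callables, and the choice of 503 for the unhealthy case are assumptions for illustration; the slide only shows the 200 OK body.

```python
import json
from typing import Callable, Dict, Tuple


def build_health_report(
    service: str,
    workload_check: Callable[[], bool],
    dependency_checks: Dict[str, Callable[[], bool]],
) -> Tuple[int, str]:
    """Return an (HTTP status, JSON body) pair for a /health endpoint."""

    def safe(check: Callable[[], bool]) -> bool:
        # A check that raises is reported as unhealthy instead of crashing the endpoint.
        try:
            return bool(check())
        except Exception:
            return False

    workload_ok = safe(workload_check)
    dependencies = [
        {"name": name, "healthy": safe(check)}
        for name, check in dependency_checks.items()
    ]
    # Assumption: any failing dependency marks the whole service unhealthy.
    # Slide 14 warns that overzealous checks can backfire, so a real service
    # may prefer to report dependency state without failing on it.
    healthy = workload_ok and all(d["healthy"] for d in dependencies)
    body = {
        "service": service,
        "healthy": healthy,
        "workload": {"healthy": workload_ok},
        "dependencies": dependencies,
    }
    # Assumption: 503 signals "unhealthy"; the slide only shows the 200 case.
    return (200 if healthy else 503), json.dumps(body)
```

The returned pair can be handed to whatever HTTP framework serves the `/health` route; keeping the report logic framework-free makes it easy to test.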

  14. Overzealous healthchecks can be counter-productive. Source: HTTP Healthchecks for a Resilient Platform, Chris O’Dell, skeltonthatcher.com/blog/http-healthchecks-for-a-resilient-platform

  15. Metrics

  16. Metrics: System metrics (CPU usage), Application metrics (Error rates), Business metrics (Customer conversions)

  17. Metrics (diagram): Servers / VMs, Appliances/Infra and Services feed a Metrics collector; a Metrics query engine drives Dashboards and Alerts.

  18. Metrics (diagram): Servers / VMs, Appliances/Infra and Services each expose /metrics, which Prometheus scrapes.
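
As a rough illustration of what a `/metrics` endpoint returns, the sketch below renders one counter in the Prometheus text exposition format using only the standard library. The class name `CounterMetric` and the metric name are made up for this example; a real service would normally use an official Prometheus client library, which also handles label escaping and concurrency.

```python
from collections import defaultdict
from typing import Dict, Tuple


class CounterMetric:
    """A tiny counter that renders itself in the Prometheus text format."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        # One value per unique label combination (one "time series" each).
        self._values: Dict[Tuple[Tuple[str, str], ...], float] = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def render(self) -> str:
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            # Simplification: label values are not escaped here.
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"
```

Serving the string returned by `render()` at `/metrics` is enough for a scraper to collect it on its own schedule, which is the pull model the slide depicts.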

  19. Watch out for over-reliance on metrics. Not every metric deserves an alert: limit alerting to user-impacting symptoms. Poor fine-grained debugging: limitations at high cardinality, e.g. CustomerId. Real-time querying means some trade-offs on retention: not suitable for long-term trend analysis.
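
The high-cardinality limitation can be made concrete with a little arithmetic: in a label-based metrics store, the worst-case number of time series for one metric is the product of the number of distinct values of each label. The label names and counts below are hypothetical, chosen only to show the effect of adding a label such as a customer ID.

```python
from math import prod
from typing import Mapping


def max_series(label_cardinalities: Mapping[str, int]) -> int:
    """Worst-case number of time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-count metric.
low_cardinality = {"status": 5, "endpoint": 20}
with_customer = {**low_cardinality, "customer_id": 10_000}
```

With these made-up numbers, `max_series(low_cardinality)` is 100 while `max_series(with_customer)` is 1,000,000, which is why per-customer questions tend to be a better fit for logs or events than for metric labels.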

  20. Logging

  21. Logging: making sense of (a lot of) logs. Centralised, Searchable, Correlated.

  22. Log Correlation (diagram: services A, B, C, D, E, F, G, H and J, with trace ID a1b2c3 carried along the request path). Correlated log lines:

        ERROR [svc=A][trace=a1b2c3] Failed to process order. Cause: Order process manager responded with 500
        ERROR [svc=F][trace=a1b2c3] Failed to complete order. Cause: Shipping service responded with 500
        INFO [svc=G][trace=a1b2c3] Items verified in stock
        ERROR [svc=H][trace=a1b2c3] Failed to save order. Cause: Cassandra timeout exception
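
One possible way to get log lines shaped like those on slide 22 is to keep the trace ID in a context variable and let a logging filter stamp it on every record. This is a sketch under assumptions: the header name `X-Trace-Id`, the helper names, and the six-character generated ID are hypothetical, not taken from the talk.

```python
import contextvars
import io
import logging
import uuid
from typing import Dict, TextIO

# Holds the trace ID of the request currently being handled.
trace_id_var: contextvars.ContextVar[str] = contextvars.ContextVar("trace_id", default="-")


class TraceFilter(logging.Filter):
    """Attach the service name and current trace ID to every log record."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def filter(self, record: logging.LogRecord) -> bool:
        record.svc = self.service
        record.trace = trace_id_var.get()
        return True


def make_logger(service: str, stream: TextIO) -> logging.Logger:
    logger = logging.getLogger(f"svc.{service}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [svc=%(svc)s][trace=%(trace)s] %(message)s")
    )
    handler.addFilter(TraceFilter(service))
    logger.handlers = [handler]
    return logger


def handle_request(headers: Dict[str, str], logger: logging.Logger) -> Dict[str, str]:
    # Reuse the caller's trace ID if present, otherwise start a new trace.
    # "X-Trace-Id" is a hypothetical header name.
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex[:6]
    token = trace_id_var.set(trace_id)
    try:
        logger.info("Items verified in stock")
        # Headers to forward on any downstream call, so the next service logs the same ID.
        return {"X-Trace-Id": trace_id}
    finally:
        trace_id_var.reset(token)
```

Because every service forwards the same ID, searching the centralised logs for one trace value returns the full story of a request across services.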

  23. Unstructured: "2018-02-20T16:38:23+00:00 ERROR Read timed out". Hmmm thanks… ? The same event as structured JSON, by the question each field answers:

        When did it happen?    timestamp: 2018-02-20T16:38:23+00:00
        What is the message?   level: ERROR; log: Read timed out
        What is it?            service: registration-service; team: events
        What is it running?    commit: 542a8b8e; build: 542a8b8e.7; runtime: java-1.8.0_161
        Where is it?           region: europe-west2; node: node_e79f3e52
        Who caused it?         customerID: 55123; userID: 458
        Can I trace it?        requestID: ec667cb45
        Any other info?        ...
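
A structured event like the one on slide 23 could be produced with a small JSON formatter for the standard `logging` module. The class name `JsonFormatter` and the convention of passing per-request fields through `extra={"context": {...}}` are assumptions made for this sketch; the field names follow the slide.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Any, Mapping


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, static_fields: Mapping[str, Any]) -> None:
        super().__init__()
        # Fields that are the same for every event from this process
        # (service, team, commit, build, runtime, region, node, ...).
        self.static_fields = dict(static_fields)

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(timespec="seconds"),
            "level": record.levelname,
            "log": record.getMessage(),
        }
        event.update(self.static_fields)
        # Per-event fields (customerID, userID, requestID, ...) are expected
        # under a "context" attribute; this is a convention of this sketch.
        event.update(getattr(record, "context", {}))
        return json.dumps(event)
```

A call such as `logger.error("Read timed out", extra={"context": {"customerID": 55123, "requestID": "ec667cb45"}})` would then emit a single searchable JSON line, provided a handler using this formatter is attached to the logger.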

  24. Structured logs unleash high-cardinality exploration. Error rate spike isolated by build version. Activity spike isolated for single customer.
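
The kind of exploration described on slide 24 amounts to grouping structured events by an arbitrary field. The sketch below does this in memory; the helper name `breakdown` and the sample events are made up for illustration (only the build `542a8b8e.7` and customer `55123` echo values from the earlier slide), and a real setup would run the equivalent query in a log or event store.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping, Optional, Tuple


def breakdown(
    events: Iterable[Mapping[str, Any]],
    field: str,
    level: Optional[str] = None,
) -> List[Tuple[Any, int]]:
    """Count events per value of `field`, optionally keeping only one log level."""
    counts = Counter(
        event.get(field)
        for event in events
        if level is None or event.get("level") == level
    )
    # Most frequent value first, so the source of a spike stands out.
    return counts.most_common()


# Made-up sample events, not data from the talk.
events = [
    {"level": "ERROR", "build": "542a8b8e.7", "customerID": 55123},
    {"level": "ERROR", "build": "542a8b8e.7", "customerID": 55123},
    {"level": "ERROR", "build": "542a8b8e.7", "customerID": 60001},
    {"level": "ERROR", "build": "11aa22bb.3", "customerID": 55123},
    {"level": "INFO", "build": "11aa22bb.3", "customerID": 60001},
]
```

Calling `breakdown(events, "build", level="ERROR")` ranks builds by error count, and `breakdown(events, "customerID")` ranks customers by activity, without either field having been planned as a metric label in advance.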

  25. Tracing

  26. Tracing (diagram, timeline from 0ms to 172ms). A trace is made of spans:

        event-mgt-api          get_confirmed_attendees   172ms
        attendees-service      get_attendees             73ms
                               cassandra/select          54ms
        registration-service   get_confirm_status        78ms
                               mysql/select              41ms
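
To show how spans relate to a trace, here is a minimal in-process sketch: each span records its parent, shares the trace ID of the root, and measures its own duration. The `Tracer` and `Span` classes are hypothetical and deliberately simplified; real tracing systems also propagate the trace context across service boundaries, which this example does not attempt.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    start: float
    duration_ms: Optional[float] = None


class Tracer:
    """Collect nested spans for a single thread of execution."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, service: str, operation: str) -> Iterator[Span]:
        parent = self._stack[-1] if self._stack else None
        current = Span(
            # Children inherit the trace ID; a root span starts a new trace.
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:8],
            parent_id=parent.span_id if parent else None,
            service=service,
            operation=operation,
            start=time.perf_counter(),
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - current.start) * 1000
            self._stack.pop()
            self.finished.append(current)
```

Wrapping each outgoing call in `with tracer.span("attendees-service", "get_attendees"):` produces the parent and child records needed to draw a waterfall like the one on the slide.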

  27. @PierreVincent

  28. Known Unknowns (Monitoring & Resiliency): Healthchecks, Metrics (Alerting). Unknown Unknowns (Debugging & Exploration): Logs, Metrics (Queries), Events, Tracing.

  29. Usability of tooling is key to adoption: Cheap to instrument, Easy to explore, Reliable & Trustworthy.

  30. Visibility builds trust but requires safety.

  31. Visibility helps justify decisions.

  32. Visibility enables operability.

  33. Test (a little bit*) less. Don’t spend all your time testing... keep some for instrumenting.

  34. Thank you! Pierre Vincent, SRE Manager at Poppulo. @PierreVincent, pvincent.io