Monitoring Complex Systems: Lessons from Monitoring 10K Bank Integrations
Joy Zheng, #GHC18


  1. MONITORING COMPLEX SYSTEMS: LESSONS FROM MONITORING 10K BANK INTEGRATIONS
     Joy Zheng, #GHC18

  2. WHY MONITORING
     • Run reliable services at scale
     • Understand service performance over time
     • Quickly detect and react to problems
     PAGE 2 | GRACE HOPPER CELEBRATION 2018 #GHC18 PRESENTED BY ANITAB.ORG AND THE ASSOCIATION FOR COMPUTING MACHINERY

  3. GOALS
     • Customize monitoring to your system or company
     • Avoid incurring high customization costs

  4. WHO IS PLAID?
     • Personal finances
     • Lending
     • Banking and brokerage
     • Consumer payments
     • Business finances
     • Integration partners

  5. WHO IS PLAID?

  6. MONITORING AT PLAID
     • 10,000 financial institutions

  7. MONITORING AT PLAID
     • 10,000 financial institutions
     • Heterogeneous traffic patterns

  8. TWO SAMPLE INSTITUTIONS
              Tattersall Federal Credit Union            First Platypus Bank
     Hour 1   10 billion tries / 100 million failures    5 tries / 1 failure
     Hour 2   10 billion tries / 200 million failures    5 tries / 5 failures
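
The two traffic profiles on this slide suggest why a single absolute failure threshold does not fit every institution. The following is a minimal sketch using the slide's numbers; the `high_volume` / `low_volume` labels, the function names, and the threshold of 100 failures are assumptions made for illustration, not rules taken from the talk.

```python
# Hourly (tries, failures) for the slide's two traffic profiles.
# The "high_volume" / "low_volume" keys are labels chosen for this sketch.
SAMPLES = {
    "high_volume": {
        "hour_1": (10_000_000_000, 100_000_000),
        "hour_2": (10_000_000_000, 200_000_000),
    },
    "low_volume": {
        "hour_1": (5, 1),
        "hour_2": (5, 5),
    },
}


def failure_rate(tries: int, failures: int) -> float:
    """Fraction of tries that failed; 0.0 when there was no traffic."""
    return failures / tries if tries else 0.0


def absolute_alert(failures: int, threshold: int = 100) -> bool:
    """Hypothetical rule: fire when the raw failure count exceeds threshold."""
    return failures > threshold


for name, hours in SAMPLES.items():
    for hour, (tries, failures) in hours.items():
        print(
            f"{name} {hour}: rate={failure_rate(tries, failures):.0%} "
            f"absolute_alert={absolute_alert(failures)}"
        )
```

Under these assumptions the absolute rule fires for the high-volume profile at a 1% failure rate and stays silent for the low-volume profile even when every try fails, which is one way to read the "heterogeneous traffic patterns" point.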

  9. MONITORING AT PLAID
     • 10,000 financial institutions
     • Heterogeneous traffic patterns
     • Metrics beyond success/failure

  10. Determining Requirements

  11. OUR STEPS
      1. Convince ourselves we needed a new system
      2. Identify a full list of metrics to monitor
      3. Prioritize, prioritize, prioritize
      4. Determine technical system requirements
      5. Research technologies

  12. THE OLD SYSTEM

  13. METRICS WISHLIST

  14. PRIORITIZATION: #SHIPTHEMVP
      • Based on customer impact and instrumentation cost
        Narrowed to 1/3 of original wishlist
      • Still failed to narrow the list of metrics enough
        Result: time spent writing complex (unused) database logic

  15. TECHNICAL REQUIREMENTS
      • Scalability: 10k banks * # metrics each
      • Latency: 30s from event to metric; 1s to query metrics
      • Usability: engineers can create metrics and alerts with minimal monitoring implementation knowledge
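
The requirements on this slide can be written down as a small checklist. This is only a sketch: the `Requirements` class, the function names, and the figure of 20 metrics per institution are hypothetical; only the 10k, 30s, and 1s values come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    institutions: int = 10_000            # "10k banks" from the slide
    max_event_to_metric_s: float = 30.0   # "30s from event to metric"
    max_query_s: float = 1.0              # "1s to query metrics"


def series_count(req: Requirements, metrics_per_institution: int) -> int:
    """Scalability term from the slide: banks * metrics each."""
    return req.institutions * metrics_per_institution


def within_latency(req: Requirements, event_to_metric_s: float, query_s: float) -> bool:
    """True when both measured latencies fit inside the stated budgets."""
    return (
        event_to_metric_s <= req.max_event_to_metric_s
        and query_s <= req.max_query_s
    )


req = Requirements()
# 20 metrics per institution is an assumed figure for illustration only.
print(series_count(req, metrics_per_institution=20))  # 200000
print(within_latency(req, event_to_metric_s=5.0, query_s=0.5))  # True
```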

  16. Building a Monitoring Pipeline

  17. TWO USE CASES
      • Standard Pipeline: monitoring, alerting, and dashboards at scale for easy-to-generate metrics
      • Custom Pipeline: metrics generation involving custom aggregation

  18. STANDARD PIPELINE

  19. PROMETHEUS

  20. PROMETHEUS
      server:
        tasks_processed_count{
          server="i-abcdef123",
          status="success",
          institution="FirstPlatypusBank"
        }
      query:
        sum(rate(tasks_processed_count[5m]))
      alert:
        sum(rate(tasks_processed_count{status!="success"}[5m]))
          by (institution)
          > 100
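
To make the alert expression on this slide concrete, here is a rough Python model of what it computes: a per-second failure rate, summed per institution, compared against 100. It is a simplification and not how Prometheus is implemented: `simple_rate` ignores counter resets and the window-edge extrapolation that PromQL's `rate()` performs. The `status="error"` value, the second server ID, and `ExampleBank` are hypothetical.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplification: unlike PromQL's rate(), this ignores counter resets
    and does not extrapolate to the edges of the window.
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0) if t1 > t0 else 0.0


def failing_institutions(series, threshold=100.0):
    """Approximates:
    sum(rate(tasks_processed_count{status!="success"}[5m])) by (institution) > threshold
    """
    totals = defaultdict(float)
    for labels, samples in series:
        if labels.get("status") != "success":
            totals[labels["institution"]] += simple_rate(samples)
    return {inst: rate for inst, rate in totals.items() if rate > threshold}


# Each entry: (labels, samples taken inside one 5-minute window).
series = [
    ({"server": "i-abcdef123", "status": "error", "institution": "FirstPlatypusBank"},
     [(0.0, 0.0), (300.0, 45_000.0)]),    # 150 failures per second
    ({"server": "i-abcdef123", "status": "success", "institution": "FirstPlatypusBank"},
     [(0.0, 0.0), (300.0, 900_000.0)]),   # excluded by status != "success"
    ({"server": "i-abcdef456", "status": "error", "institution": "ExampleBank"},
     [(0.0, 0.0), (300.0, 3_000.0)]),     # 10 failures per second, below threshold
]

print(failing_institutions(series))  # {'FirstPlatypusBank': 150.0}
```

Because `rate()` yields a per-second value, the `> 100` in the slide's expression reads as "more than 100 failed tasks per second for one institution".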

  23. ALERTMANAGER
      route:
        receiver: team-monitoring-email
        group_by:
          - alertname
          - environment
        routes:
          - match_re:
              alert_type: ^(?:monitoring_uptime|prometheus)$
            routes:
              - receiver: team-monitoring-email
                match_re:
                  environment: ^(?:testing|preprod)$
              - receiver: team-monitoring-email
                match_re:
                  plaid_env: ^(?:testing|preprod)$
              - receiver: team-monitoring-pager
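
The routing tree on this slide can be traced with a small Python model. This is a sketch of the general idea only (an alert goes to the first matching child route, and falls back to the parent's receiver when no child matches); it is not Alertmanager's implementation, and it leaves out `group_by`, `continue`, and all timing settings. The `matches` and `resolve` helpers and the sample label values are hypothetical.

```python
import re

# The slide's route tree as nested dicts (group_by omitted: grouping is not modeled).
ROUTE = {
    "receiver": "team-monitoring-email",
    "routes": [
        {
            "match_re": {"alert_type": r"^(?:monitoring_uptime|prometheus)$"},
            "routes": [
                {"receiver": "team-monitoring-email",
                 "match_re": {"environment": r"^(?:testing|preprod)$"}},
                {"receiver": "team-monitoring-email",
                 "match_re": {"plaid_env": r"^(?:testing|preprod)$"}},
                {"receiver": "team-monitoring-pager"},
            ],
        }
    ],
}


def matches(route, labels):
    """A route matches when every match_re pattern fully matches its label.

    A missing label is treated as the empty string; a route with no
    matchers matches every alert.
    """
    return all(
        re.fullmatch(pattern, labels.get(name, ""))
        for name, pattern in route.get("match_re", {}).items()
    )


def resolve(route, labels, inherited=None):
    """Return the receiver of the deepest first-matching route."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        if matches(child, labels):
            return resolve(child, labels, receiver)
    return receiver


print(resolve(ROUTE, {"alert_type": "prometheus", "environment": "production"}))
print(resolve(ROUTE, {"alert_type": "prometheus", "environment": "testing"}))
print(resolve(ROUTE, {"alert_type": "other"}))
```

In this model, monitoring alerts from testing or preprod go to email, any other monitoring alert reaches the pager through the final catch-all route, and alerts of other types fall back to the top-level email receiver.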

  27. GRAFANA

  28. STANDARD PIPELINE

  29. CUSTOM PIPELINE

  30. RESULTS
      • >700 events per second processed
      • 190k+ metrics exported
      • 17 services monitored
      • 31 engineers who have contributed monitoring changes (on a team of 45)
      • <5s average delay from event to metrics generation

  31. Takeaways
