MONITORING COMPLEX SYSTEMS: LESSONS FROM MONITORING 10K BANK INTEGRATIONS
Joy Zheng #GHC18
WHY MONITORING
• Run reliable services at scale
• Understand service performance over time
• Quickly detect and react to problems
GRACE HOPPER CELEBRATION 2018 #GHC18 PRESENTED BY ANITAB.ORG AND THE ASSOCIATION FOR COMPUTING MACHINERY
GOALS
• Customize monitoring to your system or company
• Avoid incurring high customization costs
WHO IS PLAID?
• PERSONAL FINANCES
• LENDING
• BANKING AND BROKERAGE
• CONSUMER PAYMENTS
• BUSINESS FINANCES
• INTEGRATION PARTNERS
MONITORING AT PLAID
• 10,000 financial institutions
• Heterogeneous traffic patterns
TWO SAMPLE INSTITUTIONS
Sample institutions: Tattersall Federal Credit Union and First Platypus Bank

          High-volume institution                     Low-volume institution
Hour 1    10 billion tries / 100 million failures     5 tries / 1 failure
Hour 2    10 billion tries / 200 million failures     5 tries / 5 failures
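The two samples suggest why one absolute failure threshold cannot serve every institution. A minimal sketch of the arithmetic, using the counts from the slide; the keys "high_volume" and "low_volume" and the helper failure_rate are illustrative names, not part of the talk:

```python
def failure_rate(tries, failures):
    """Fraction of tries that failed; 0.0 when there was no traffic."""
    return failures / tries if tries else 0.0


# (tries, failures) per hour, copied from the slide.
# The dictionary keys are hypothetical labels for the two columns.
samples = {
    "high_volume": {
        "hour_1": (10_000_000_000, 100_000_000),
        "hour_2": (10_000_000_000, 200_000_000),
    },
    "low_volume": {
        "hour_1": (5, 1),
        "hour_2": (5, 5),
    },
}

# Failure rate per institution per hour.
rates = {
    name: {hour: failure_rate(*counts) for hour, counts in hours.items()}
    for name, hours in samples.items()
}

# Change in the absolute number of failures from hour 1 to hour 2.
extra_failures = {
    name: hours["hour_2"][1] - hours["hour_1"][1]
    for name, hours in samples.items()
}

for name in samples:
    print(name, rates[name], "extra failures:", extra_failures[name])
```

The high-volume institution gains 100 million failures while its rate moves from 1% to 2%; the low-volume institution gains only four failures while its rate moves from 20% to 100%. A threshold tuned to either count or rate for one of them would behave poorly for the other.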
MONITORING AT PLAID
• 10,000 financial institutions
• Heterogeneous traffic patterns
• Metrics beyond success/failure
Determining Requirements
OUR STEPS
1. Convince ourselves we needed a new system
2. Identify a full list of metrics to monitor
3. Prioritize, prioritize, prioritize
4. Determine technical system requirements
5. Research technologies
THE OLD SYSTEM
METRICS WISHLIST
PRIORITIZATION: #SHIPTHEMVP
• Based on customer impact and instrumentation cost
  Narrowed to 1/3 of original wishlist
• Still failed to narrow the list of metrics enough
  Result: time spent writing complex (unused) database logic
TECHNICAL REQUIREMENTS
• Scalability: 10k banks * # metrics each
• Latency: 30s from event to metric; 1s to query monitoring metrics
• Usability: Engineers can create metrics and alerts with minimal implementation knowledge
Building a Monitoring Pipeline
TWO USE CASES
1. Standard Pipeline: monitoring, alerting, and dashboards at scale for easy-to-generate metrics
2. Custom Pipeline: metrics generation involving custom aggregation
STANDARD PIPELINE
PROMETHEUS
server:
    tasks_processed_count{
        server="i-abcdef123",
        status="success",
        institution="FirstPlatypusBank"
    }
query:
    sum(rate(tasks_processed_count[5m]))
alert:
    sum(rate(tasks_processed_count{status!="success"}[5m])) by (institution) > 100
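To make the alert expression on the slide concrete, here is a minimal sketch in plain Python of what it computes: a per-second rate for each non-success series, summed by institution and compared against 100. The function simple_rate is a simplified stand-in for PromQL's rate() (the real function also handles counter resets and extrapolates to the window boundaries), and the sample series, the second server name, and the "TattersallFCU" label value are hypothetical.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last sample in the window.

    Simplified: no counter-reset handling and no extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)


def failure_rate_by_institution(series):
    """Sum the rates of all non-success series, grouped by institution."""
    totals = defaultdict(float)
    for labels, samples in series:
        if labels["status"] != "success":
            totals[labels["institution"]] += simple_rate(samples)
    return dict(totals)


def firing(series, threshold=100.0):
    """Institutions whose summed failure rate exceeds the threshold."""
    rates = failure_rate_by_institution(series)
    return sorted(inst for inst, rate in rates.items() if rate > threshold)


# Hypothetical 5-minute window: (labels, [(timestamp_seconds, counter_value), ...])
series = [
    ({"server": "i-abcdef123", "status": "error", "institution": "FirstPlatypusBank"},
     [(0, 0), (300, 45_000)]),
    ({"server": "i-abcdef456", "status": "success", "institution": "FirstPlatypusBank"},
     [(0, 0), (300, 900_000)]),
    ({"server": "i-abcdef123", "status": "error", "institution": "TattersallFCU"},
     [(0, 0), (300, 5)]),
]

print(failure_rate_by_institution(series))
print(firing(series))
```

With this data only the first institution crosses the threshold, because the comparison is on an absolute failures-per-second figure rather than a failure ratio.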
ALERTMANAGER
route:
  receiver: team-monitoring-email
  group_by:
    - alertname
    - environment
  routes:
    - match_re:
        alert_type: ^(?:monitoring_uptime|prometheus)$
      routes:
        - receiver: team-monitoring-email
          match_re:
            environment: ^(?:testing|preprod)$
        - receiver: team-monitoring-email
          match_re:
            plaid_env: ^(?:testing|preprod)$
        - receiver: team-monitoring-pager
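The routing tree can be read as "first matching child wins, otherwise the current node handles the alert". Below is a minimal Python sketch of that traversal, not Alertmanager's own code. It assumes one reading of the slide's nesting (three child routes under the alert_type matcher), treats a missing label as an empty string, and ignores grouping and the continue option; ROUTES, matches, and pick_receiver are illustrative names.

```python
import re

# One reading of the routing tree on the slide, as nested dictionaries.
ROUTES = {
    "receiver": "team-monitoring-email",
    "routes": [
        {
            "match_re": {"alert_type": r"^(?:monitoring_uptime|prometheus)$"},
            "routes": [
                {"receiver": "team-monitoring-email",
                 "match_re": {"environment": r"^(?:testing|preprod)$"}},
                {"receiver": "team-monitoring-email",
                 "match_re": {"plaid_env": r"^(?:testing|preprod)$"}},
                # No matchers: catches everything the siblings above did not.
                {"receiver": "team-monitoring-pager"},
            ],
        }
    ],
}


def matches(route, labels):
    """True when every match_re pattern fully matches the alert's label value."""
    return all(
        re.fullmatch(pattern, labels.get(name, "")) is not None
        for name, pattern in route.get("match_re", {}).items()
    )


def pick_receiver(route, labels, inherited=None):
    """Descend into the first matching child; fall back to this node's receiver."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        if matches(child, labels):
            return pick_receiver(child, labels, receiver)
    return receiver


print(pick_receiver(ROUTES, {"alert_type": "prometheus", "environment": "production"}))
print(pick_receiver(ROUTES, {"alert_type": "prometheus", "environment": "testing"}))
print(pick_receiver(ROUTES, {"alert_type": "disk_full"}))
```

Under this reading, monitoring alerts from testing or preprod go to email, monitoring alerts from any other environment page the team, and everything else falls back to the top-level email receiver.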
GRAFANA
CUSTOM PIPELINE
RESULTS
• > 700 events per second processed
• 190k+ metrics exported
• 17 services monitored
• 31 engineers who have contributed monitoring changes (on a team of 45)
• <5s average delay from event to metrics generation
Takeaways