
Monitoring at Scale: Migrating to Prometheus at Fastly (PromCon 2018)



  1. Monitoring at Scale: Migrating to Prometheus at Fastly, PromCon 2018 | Marcus Barczak @ickymettle

  2. How were we monitoring Fastly?

  3. [Logo slide]

  4. Growing pains with Ganglia ๏ Operational overhead. ๏ Limited graphing functions. ๏ No alerting support. ๏ No real API for consuming metric data.

  5. [Logo slide: adding a hosted monitoring SaaS]

  6. Growing pains doubled ๏ Now supporting two systems. ๏ Where do I put my metrics? ๏ Still writing external plugins and agents. ๏ Monitoring treated as a "post-release" phase.

  7. Scaling our infrastructure horizontally required scaling our monitoring vertically.

  8. Third time lucky

  9. Project goals ๏ Scale with our infrastructure growth. ๏ Be easy to deploy and operate. ๏ Engineer-friendly instrumentation libraries. ๏ First-class API support for data access. ๏ Reinvigorate our monitoring culture.
 See: https://peter.bourgon.org/observability-the-hard-parts/

  10. ?

  11. Getting started ๏ Build a proof of concept. ๏ Pair with pilot team to instrument their services. ๏ Iterate through the rest. ๏ Run both systems in parallel. ๏ Decommission SaaS system and Ganglia.

  12. Infrastructure build

  13. [Diagram: in SJC, a pair of Prometheus servers (prometheus A and prometheus B) each scrapes the same set of targets.]

  14. [Diagram: the same pattern repeated per datacenter: a prometheus A/B pair scraping its local targets in each of SJC, JFK, and ATL.]

  15. [Diagram: a GCP frontend stack with federator A and federator B sitting above the per-datacenter prometheus A/B pairs in SJC, JFK, and ATL.]

  16. [Diagram: query traffic arrives over TLS at the GCP frontend stack (federator A and federator B), which federates the per-datacenter prometheus A/B pairs in SJC, JFK, and ATL.]
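
The federation layer in the diagrams above can be expressed as an ordinary `/federate` scrape job. A minimal sketch; the hostnames and the `match[]` selector are assumptions, not taken from the talk:

```yaml
# Hypothetical federator scrape config: pull series from each
# datacenter's Prometheus pair via the /federate endpoint.
scrape_configs:
  - job_name: federate
    honor_labels: true      # keep the datacenter servers' original labels
    metrics_path: /federate
    params:
      "match[]":
        - '{job=~".+"}'     # in practice, restrict to aggregated recording rules
    scheme: https           # query traffic is TLS-terminated
    static_configs:
      - targets:
          - prometheus-a.sjc.example.internal:9090
          - prometheus-b.sjc.example.internal:9090
          - prometheus-a.jfk.example.internal:9090
          - prometheus-b.jfk.example.internal:9090
          - prometheus-a.atl.example.internal:9090
          - prometheus-b.atl.example.internal:9090
```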

  17. Prometheus Server Software Stack: Ghost Tunnel (TLS termination and auth), Service Discovery Sidecar (target configuration), Rules Loader (recording and alert rules), Prometheus.

  18. Prometheus Server Software Stack: Ghost Tunnel (TLS termination and auth), Service Discovery Sidecar (target configuration), Rules Loader (recording and alert rules), Prometheus. Typical Server Software Stack: Exporters (built into services or run as a sidecar), Service Discovery Proxy (service discovery and TLS exporter proxy).
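
A sketch of how Prometheus might consume what this stack produces; the file paths and TLS material locations are assumptions for illustration:

```yaml
# Hypothetical scrape job consuming the sidecar-generated target file.
scrape_configs:
  - job_name: proxied_exporters
    scheme: https                 # all exporter traffic rides over TLS
    tls_config:
      ca_file: /etc/prometheus/tls/ca.pem          # assumed paths
      cert_file: /etc/prometheus/tls/client.pem
      key_file: /etc/prometheus/tls/client-key.pem
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json         # written by the SD sidecar
```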

  19. Build your own service discovery?

  20. Fastly's infrastructure is bare-metal hardware: no cloud conveniences.

  21. Service discovery requirements ๏ Automatic discovery of targets. ๏ Self-service registration of exporter endpoints. ๏ TLS encryption for all exporter traffic. ๏ Minimal exposure of exporter TCP ports.

  22. [Diagram] Prometheus server stack: Ghost Tunnel (TLS termination and auth), PromSD Sidecar (target configuration), Prometheus. Typical server stack: Exporters (built into services or run as a sidecar), PromSD Proxy (service discovery and TLS exporter proxy). The sidecar queries the proxy for available targets and generates config for Prometheus; Prometheus scrapes the proxied targets over TLS.

  23. PromSD sidecar. 1) Fetch the list of hosts in a datacenter from configly, e.g.:

      "exporter_hosts": [
        "10.0.0.1",
        "10.0.0.2",
        "10.0.0.3",
        "10.0.0.4"
      ]

      2) Request the /targets endpoint of each host's promsd proxy to get its list of available scrape targets. 3) Output all targets as a file service discovery JSON file:

      {
        "targets": [
          "10.0.0.1:9702",
          "10.0.0.2:9702"
        ],
        "labels": {
          "__metrics_path__": "/node_exporter_9100/metrics",
          "job": "node_exporter"
        }
      },
      {
        "targets": [
          "10.0.0.1:9702",
          "10.0.0.2:9702"
        ],
        "labels": {
          "__metrics_path__": "/varnishstat_exporter_19102/metrics",
          "job": "varnishstat_exporter"
        }
      }

      4) Prometheus reads the file and scrapes the configured targets.
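
The sidecar flow above can be sketched in a few lines. `fetch_hosts`, `fetch_targets`, `PROXY_PORT`, and the example addresses are illustrative stand-ins for the configly and proxy `/targets` calls, not Fastly's actual implementation; the output shape is the standard Prometheus file-SD format:

```python
import json

def fetch_hosts():
    # Stand-in for step 1: ask configly for the hosts in a datacenter.
    return ["10.0.0.1", "10.0.0.2"]

def fetch_targets(host):
    # Stand-in for step 2: hit the promsd proxy /targets endpoint on a host
    # and get back the proxied metrics path for each installed exporter.
    return {
        "node_exporter": "/node_exporter_9100/metrics",
        "varnishstat_exporter": "/varnishstat_exporter_19102/metrics",
    }

PROXY_PORT = 9702  # the single TLS-proxied exporter port per host

def build_file_sd(hosts):
    # Step 3: group hosts by (job, metrics path) into the JSON records
    # that Prometheus's file-based service discovery expects.
    groups = {}
    for host in hosts:
        for job, path in fetch_targets(host).items():
            groups.setdefault((job, path), []).append(f"{host}:{PROXY_PORT}")
    return [
        {"targets": targets,
         "labels": {"__metrics_path__": path, "job": job}}
        for (job, path), targets in sorted(groups.items())
    ]

# Step 4: Prometheus reads this file via file_sd_configs and scrapes it.
file_sd_json = json.dumps(build_file_sd(fetch_hosts()), indent=2)
```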

  24. PromSD proxy. 1) Fetch the list of installed systemd services (e.g. node_exporter, process_exporter, varnishstat_exporter). 2) For each corresponding systemd service, fetch the local exporter target address from configly:

      "node_exporter": {
        "prometheus_properties": {
          "target": "127.0.0.1:9100"
        }
      },
      ...
      "varnishstat_exporter": {
        "prometheus_properties": {
          "target": "127.0.0.1:19102"
        }
      }

      3) Expose a /targets API, used by Prometheus and the promsd sidecar; exporters are served over proxied paths such as /node_exporter_9100/metrics and /varnish_exporter_19102/metrics.
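
The proxy side of the handshake can be sketched similarly. `list_systemd_services` and the `LOCAL_EXPORTERS` mapping are stand-ins (assumptions) for the systemd enumeration and configly lookup; the return value mirrors the `/targets` payload shown above:

```python
def list_systemd_services():
    # Stand-in for step 1: enumerate installed systemd exporter units.
    return ["node_exporter", "process_exporter", "varnishstat_exporter"]

# Stand-in for step 2: the configly record mapping each exporter service
# to its local listen address.
LOCAL_EXPORTERS = {
    "node_exporter": "127.0.0.1:9100",
    "varnishstat_exporter": "127.0.0.1:19102",
}

def build_targets_response():
    # Step 3: build the /targets payload served to the SD sidecar.
    # Services without a known local exporter address are skipped.
    return {
        svc: {"prometheus_properties": {"target": LOCAL_EXPORTERS[svc]}}
        for svc in list_systemd_services()
        if svc in LOCAL_EXPORTERS
    }
```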

  25. It worked! ๏ Really easy to leverage the file SD mechanism. ๏ New targets can be added with one line of config. ๏ TLS and authentication everywhere. ๏ Single exporter port open per host.

  26. Prometheus Adoption

  27. Prometheus at Scale at Fastly: 114 Prometheus servers globally, 28.4 M time series, 2.2 M samples/second.

  28. ... a few hours later

  29. Prometheus wins ๏ Engineers love it. ๏ Dashboard and alert quality have increased. ๏ PromQL enables some deep insights. ๏ Scaling linearly with our infrastructure growth.

  30. Still some rough edges ๏ Metrics exploration without prior knowledge. ๏ Alertmanager's flexibility. ๏ Federation and global views. ๏ Long-term storage still an open question.

  31. 😎

  32. Thanks! @ickymettle fastly.com
