Monitoring Swift: OpenStack Summit, Austin 2016


  1. Monitoring Swift
     OpenStack Summit, Austin 2016
     Adam Takvam, Sr. Systems Engineer
     Martin Lanner, Engagement Manager
     April 28, 2016

  2. 2 | SwiftStack Confidential

  3. Overview
     • Problems
     • Swift key monitoring concepts
       - What to monitor?
       - How to monitor
     • Background
       - Methods: logs + system metrics
       - Interpretation of metrics
       - Actions: thresholds + alerting
       - Usage intelligence
       - Capacity planning
       - Operational health
       - Audit trails
     • Monitoring methods - demos
       - Logging: ELK
       - Trending/Forecasting: Prometheus + Grafana
       - System monitoring: Zabbix

  4. It’s Linux!

  5. Properties of Swift
     • Distributed system
     • Extremely durable through replication or erasure coding
     • No single point of failure
     • Even distribution of data
     • Resilient
     • Self-healing capabilities
     • Can take a lot of abuse and negligence

  6. Anatomy of a Monitoring Solution
     • Agent: Gathers metrics on a host and either pushes or advertises them
       - Logstash
       - Prometheus Node Exporter
       - Zabbix Agent
       - Nagios NRPE
     • Aggregation Engines: Collect metrics from agents and provide an API with access to aggregated metric values
       - Nagios
       - Zabbix
       - Elasticsearch
       - Prometheus
     • Visualizer: Renders graphs in a human-friendly format for easy comprehension of system state
       - Kibana
       - Grafana
     • Alerting: Uses metric thresholds to trigger alerts when metrics fall out of an acceptable range
       - AlertManager
       - PagerDuty

  7. Developing a Monitoring Strategy
     Forms of Monitoring
     • System utilization: CPU, memory, disk I/O, network, auditing cycles, replicator timing
     • Performance: Transaction latency
     • Errors: Invalid requests or states
     • Outages: Service failures
     • Feature usage: Understand CRUD operations and traffic patterns
     • Audit trail: Who did what when?
     Monitoring Lifecycle
     • Measurement
     • Reporting
     • Characterization
     • Thresholds
     • Alerting
     • Root cause analysis
     • Remediation
       - Manual
       - Automated

  8. Examples of monitoring methods
     • ELK: Usage intelligence
       - Who?
       - Agents
       - HTTP response codes
       - Errors
       - Audit trails
     • Prometheus: Capacity planning
       - Data growth
       - Trending analytics
     • Zabbix: Operational health
       - Network
       - CPU
       - RAM
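
As a rough illustration of the "HTTP response codes" and "Errors" items above, the sketch below tallies status codes by class and computes a server-error rate. It assumes the status codes have already been extracted from the proxy logs as integers (in the deck this role is played by ELK); the function names `status_class_counts` and `error_rate` are hypothetical and not part of any tool named in the slides.

```python
from collections import Counter


def status_class_counts(status_codes):
    """Count responses per status class, e.g. '2xx', '4xx', '5xx'."""
    counts = Counter()
    for code in status_codes:
        counts[f"{code // 100}xx"] += 1
    return counts


def error_rate(status_codes):
    """Fraction of responses that are server errors (status >= 500)."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    server_errors = sum(1 for code in codes if code >= 500)
    return server_errors / len(codes)


if __name__ == "__main__":
    # Hypothetical sample; real input would come from parsed proxy logs.
    sample = [200, 201, 204, 404, 503]
    print(dict(status_class_counts(sample)))
    print(f"server error rate: {error_rate(sample):.1%}")
```

A rising share of 5xx responses is the kind of signal that would feed the thresholds and alerting steps of the monitoring lifecycle.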

  9. Key concepts for monitoring Swift
     • Cluster full
       - df
       - Data growth
       - Capacity planning
     • Networking
       - Availability
       - Saturation
     • Proxy state
       - CPU
       - /healthcheck
     • Auditing cycles
     • Replication cycle timing
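
The "Cluster full / df" item can be approximated on a single storage node with the Python standard library. This is a minimal sketch, assuming drives are mounted under /srv/node (the mount prefix that appears in the alert rule on slide 20); the 80% threshold and the helper names are illustrative assumptions, not values from the deck.

```python
import os
import shutil


def fill_fraction(total_bytes, used_bytes):
    """Fraction of a filesystem that is in use (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def mounted_drives(root="/srv/node"):
    """Mount points directly under root (empty list if root is absent)."""
    if not os.path.isdir(root):
        return []
    return sorted(
        os.path.join(root, name)
        for name in os.listdir(root)
        if os.path.ismount(os.path.join(root, name))
    )


def fullest_drives(root="/srv/node", threshold=0.8):
    """Return (path, fraction_used) for drives at or above threshold, fullest first."""
    report = []
    for path in mounted_drives(root):
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        fraction = fill_fraction(usage.total, usage.used)
        if fraction >= threshold:
            report.append((path, fraction))
    return sorted(report, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, fraction in fullest_drives():
        print(f"{path}: {fraction:.1%} used")
```

The fraction here is used/total, which can differ slightly from the Use% column printed by df on filesystems that reserve blocks for root.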

  10. Load balancer health checks against Swift proxy servers
      • By default, most load balancers run ICMP checks against all IPs in their pools
      • Also consider configuring the load balancer to run HTTP checks against Swift’s /healthcheck endpoint
      Example:
          demo@demo:~$ curl http://swift.swiftstack.oss/healthcheck
          OK
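
The same check that curl performs above can be scripted, for example as a probe run by a monitoring agent. This is a minimal sketch using only the Python standard library; it treats a proxy as healthy when it answers HTTP 200 with the body "OK", as in the slide's example. The function names and the 2-second timeout are assumptions.

```python
import http.client
import urllib.error
import urllib.request


def is_healthy(status, body):
    """Healthy means HTTP 200 with a body of 'OK' (surrounding whitespace ignored)."""
    return status == 200 and body.strip() == b"OK"


def check_proxy(url, timeout=2.0):
    """Fetch a /healthcheck URL and report whether the proxy looks healthy.

    Connection failures, timeouts, and HTTP error statuses all count as unhealthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy(response.status, response.read())
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


if __name__ == "__main__":
    # Hostname taken from the slide's example; replace with a real proxy address.
    print(check_proxy("http://swift.swiftstack.oss/healthcheck"))
```

Unlike an ICMP check, this fails when the host is up but the proxy service is not answering requests.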

  11. Audit trails with ELK

  12. Object size distribution

  13. Distribution of CRUD operations over time

  14. Zabbix triggers for Swift

  15. Zabbix node memory usage

  16. Zabbix drive utilization events

  17. Disk I/O

  18. Object Replicator Operations

  19. Prometheus + Grafana trending and forecasting

  20. Alerting
      Example:
          ALERT StorageCritical24Hours
            IF sum(predict_linear(node_filesystem_free{job="swiftstack",mountpoint=~"/srv/node/.*"}[1d], 24*3600))
               < sum(node_filesystem_size{job="swiftstack",mountpoint=~"/srv/node/.*"}) * 0.2
            FOR 1h
            LABELS {
              group="storage_admin",
              severity="critical"
            }
      Translation: Send a critical alert to all members of the storage_admin group if, based on the trend over the previous day, the total free storage capacity is projected to fall below 20% of the total storage capacity within the next 24 hours and that forecast has held true for at least 1 hour. The rule is re-evaluated every 5 minutes (per server config, not shown).
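
To make the forecasting step concrete, the sketch below reproduces the idea behind the rule in plain Python: fit a least-squares line to each drive's free-space samples, extrapolate 24 hours ahead, and compare the summed projection with 20% of total capacity. It is a simplified illustration rather than Prometheus's implementation (it extrapolates from the last sample instead of the evaluation time), and the function names and sample data are hypothetical.

```python
def linear_forecast(samples, horizon_seconds):
    """Least-squares line through (timestamp, value) samples,
    evaluated horizon_seconds after the latest sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = max(t for t, _ in samples)
    return slope * (last_t + horizon_seconds) + intercept


def storage_critical(free_series_by_drive, size_by_drive,
                     horizon_seconds=24 * 3600, min_free_fraction=0.2):
    """True if projected total free space drops below the given fraction of capacity."""
    projected_free = sum(
        linear_forecast(series, horizon_seconds)
        for series in free_series_by_drive.values()
    )
    total_size = sum(size_by_drive.values())
    return projected_free < min_free_fraction * total_size


if __name__ == "__main__":
    # Hypothetical drive: 1000 units in size, free space falling 60 units per day.
    series = {"d1": [(0, 300.0), (86400, 240.0)]}
    print(storage_critical(series, {"d1": 1000.0}))
```

The `FOR 1h` clause of the rule has no counterpart here; in practice it suppresses alerts from short-lived dips by requiring the condition to hold across consecutive evaluations.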

  21. Q&A / Demo

  22. Thank you!
