Monitoring Swift
OpenStack Summit, Austin 2016
Adam Takvam, Sr. Systems Engineer
Martin Lanner, Engagement Manager
April 28, 2016
2 | SwiftStack Confidential
Overview
• Problems
  - What to monitor?
  - How to monitor
• Background
  - Methods: logs + system metrics
  - Interpretation of metrics
  - Actions: thresholds + alerting
• Swift key monitoring concepts
  - Usage intelligence
  - Capacity planning
  - Operational health
  - Audit trails
• Monitoring methods - demos
  - Logging: ELK
  - Trending/Forecasting: Prometheus + Grafana
  - System monitoring: Zabbix
It’s Linux!
Properties of Swift
• Distributed system
• Extremely durable through replication or erasure coding
• No single point of failure
• Even distribution of data
• Resilient
• Self-healing capabilities
• Can take a lot of abuse and negligence
Anatomy of a Monitoring Solution
• Agent: Gathers metrics on a host and either pushes or advertises them
  - Logstash
  - Prometheus Node Exporter
  - Zabbix Agent
  - Nagios NRPE
• Aggregation Engines: Collect metrics from agents and provide an API with access to aggregated metric values
  - Nagios
  - Zabbix
  - Elasticsearch
  - Prometheus
• Visualizer: Renders graphs in a human-friendly format for easy comprehension of system state
  - Kibana
  - Grafana
• Alerting: Uses metric thresholds to trigger alerts when metrics fall out of an acceptable range
  - AlertManager
  - PagerDuty
Developing a Monitoring Strategy
Forms of Monitoring
• System utilization: CPU, memory, disk I/O, network, auditing cycles, replicator timing
• Performance: Transaction latency
• Errors: Invalid requests or states
• Outages: Service failures
• Feature usage: Understand CRUD operations and traffic patterns
• Audit trail: Who did what when?
Monitoring Lifecycle
• Measurement
• Reporting
• Characterization
• Thresholds
• Alerting
• Root cause analysis
• Remediation
  - Manual
  - Automated
Examples of monitoring methods
• ELK: Usage intelligence
  - Who?
  - Agents
  - HTTP response codes
  - Errors
  - Audit trails
• Prometheus: Capacity planning
  - Data growth
  - Trending analytics
• Zabbix: Operational health
  - Network
  - CPU
  - RAM
Key concepts for monitoring Swift
• Cluster full
  - df
  - Data growth
  - Capacity planning
• Networking
  - Availability
  - Saturation
• Auditing cycles
• Replication cycle timing
• Proxy state
  - CPU
  - /healthcheck
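The "cluster full" item comes down to roughly what df reports for each storage mount. A minimal Python sketch, assuming the drives are mounted under /srv/node and using an 80% warning threshold; both values are illustrative assumptions, not Swift defaults:

```python
import os
import shutil


def fill_ratio(path):
    """Return the used fraction (0.0-1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems can report zero size
    return usage.used / usage.total


def full_mounts(root="/srv/node", threshold=0.8):
    """List (mount, ratio) pairs under `root` at or above `threshold`.

    `root` and `threshold` are assumptions made for this sketch.
    """
    if not os.path.isdir(root):
        return []
    results = []
    for name in sorted(os.listdir(root)):
        mount = os.path.join(root, name)
        if os.path.ismount(mount):
            ratio = fill_ratio(mount)
            if ratio >= threshold:
                results.append((mount, ratio))
    return results


if __name__ == "__main__":
    for mount, ratio in full_mounts():
        print(f"{mount}: {ratio:.0%} full")
```

A point-in-time check like this only covers the "df" bullet; data growth and capacity planning need the same measurement recorded over time.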
Load balancer health checks against Swift proxy servers
• Most load balancers run ICMP checks against all IPs in their pools by default
• Also consider configuring the load balancer to run HTTP checks against Swift’s /healthcheck endpoint
Example:
  demo@demo:~$ curl http://swift.swiftstack.oss/healthcheck
  OK
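The same check can be scripted outside the load balancer. The sketch below is one possible approach in Python using only the standard library; the function names and the 2-second timeout are hypothetical choices, and it treats a proxy as healthy only when /healthcheck answers 200 with the body OK, as in the curl example:

```python
import http.client
import urllib.error
import urllib.request


def is_healthy(status, body):
    """Healthy means HTTP 200 with a body of 'OK' (surrounding whitespace ignored)."""
    return status == 200 and body.strip() == "OK"


def check_proxy(base_url, timeout=2.0):
    """GET <base_url>/healthcheck; return False on network or HTTP errors."""
    url = base_url.rstrip("/") + "/healthcheck"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return is_healthy(resp.status, body)
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


if __name__ == "__main__":
    # Hostname taken from the example above; replace with your own proxy.
    print(check_proxy("http://swift.swiftstack.oss"))
```

Keeping the decision in `is_healthy` separate from the network call makes the rule easy to test without a running cluster.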
Audit trails with ELK
Object size distribution
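A distribution like this is a histogram of object sizes. As a rough sketch, assuming the sizes in bytes have already been extracted from proxy logs or container listings, they can be grouped into power-of-two buckets. The bucket scheme and sample values are illustrative and not taken from the talk:

```python
from collections import Counter


def size_bucket(num_bytes):
    """Return the upper bound (a power of two, in bytes) of the bucket for num_bytes."""
    if num_bytes <= 1:
        return 1
    return 1 << (num_bytes - 1).bit_length()


def size_histogram(sizes):
    """Count objects per power-of-two size bucket."""
    return Counter(size_bucket(s) for s in sizes)


if __name__ == "__main__":
    sample = [0, 512, 1024, 1500, 4096, 70000]  # made-up object sizes
    for upper, count in sorted(size_histogram(sample).items()):
        print(f"<= {upper:>8} bytes: {count}")
```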
Distribution of CRUD operations over time
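One way to build such a view is to map each request's HTTP method to a CRUD category and count per time bucket. The Python sketch below assumes requests have already been parsed into (timestamp, method) pairs. The method-to-CRUD mapping is an assumption made for illustration (a PUT can also overwrite an existing object), and the hourly bucket is arbitrary:

```python
from collections import Counter
from datetime import datetime

# Assumed mapping of HTTP methods to CRUD categories.
CRUD_BY_METHOD = {
    "PUT": "create",
    "GET": "read",
    "HEAD": "read",
    "POST": "update",
    "DELETE": "delete",
}


def crud_per_hour(requests):
    """Count CRUD operations per hour from (datetime, http_method) pairs."""
    counts = Counter()
    for when, method in requests:
        op = CRUD_BY_METHOD.get(method.upper())
        if op is None:
            continue  # methods outside the mapping are ignored in this sketch
        hour = when.replace(minute=0, second=0, microsecond=0)
        counts[(hour, op)] += 1
    return counts


if __name__ == "__main__":
    sample = [  # made-up requests
        (datetime(2016, 4, 28, 10, 5), "PUT"),
        (datetime(2016, 4, 28, 10, 40), "GET"),
        (datetime(2016, 4, 28, 11, 2), "DELETE"),
    ]
    for (hour, op), n in sorted(crud_per_hour(sample).items()):
        print(hour.isoformat(), op, n)
```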
Zabbix triggers for Swift
Zabbix node memory usage
Zabbix drive utilization events
Disk I/O
Object Replicator Operations
Prometheus + Grafana trending and forecasting
Alerting
Example:
  ALERT StorageCritical24Hours
    IF sum(predict_linear(node_filesystem_free{job="swiftstack",mountpoint=~"/srv/node/.*"}[1d], 24*3600))
       < sum(node_filesystem_size{job="swiftstack",mountpoint=~"/srv/node/.*"}) * 0.2
    FOR 1h
    LABELS {
      group="storage_admin",
      severity="critical"
    }
Translation: Send a critical alert to all members of the storage_admin group if the total free storage capacity is projected to fall below 20% of the total storage capacity within the next 24 hours and that forecast has held true for at least 1 hour. The rule is recalculated every 5 minutes (per server config, not shown).
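To make the arithmetic behind the rule concrete, here is a Python sketch of the same idea: fit a least-squares line through recent free-space samples, extrapolate 24 hours ahead, and compare the result with 20% of total capacity. It only approximates what predict_linear does. It extrapolates from the newest sample rather than the evaluation time and does not model the FOR 1h hold period. The function names and sample data are invented for illustration:

```python
def linear_fit(samples):
    """Least-squares slope and intercept for (t_seconds, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def predict_free(samples, horizon_s=24 * 3600):
    """Extrapolate free space `horizon_s` seconds past the newest sample."""
    slope, intercept = linear_fit(samples)
    latest_t = max(t for t, _ in samples)
    return slope * (latest_t + horizon_s) + intercept


def storage_critical(samples, total, horizon_s=24 * 3600, min_free=0.2):
    """True when projected free space is below `min_free` of total capacity."""
    return predict_free(samples, horizon_s) < total * min_free


if __name__ == "__main__":
    # Made-up data: 100 units of capacity, free space sampled hourly for a day.
    shrinking = [(h * 3600, 50.0 - h) for h in range(25)]
    print(predict_free(shrinking))           # projected free space in 24 hours
    print(storage_critical(shrinking, 100))  # whether the threshold is crossed
```

With these samples free space falls by one unit per hour from 50, so the projection 24 hours past the last sample is about 2 units, well under the 20-unit threshold.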
Q&A / Demo
Thank you!