Open Infrastructure Summit 2019
Demonstrating At Scale Monitoring Of OpenStack Cloud Using Prometheus
Anandeep Pannu (apannu@redhat.com)
Pradeep Kilambi (prad@redhat.com)
Definitions
Implications for Open Infrastructure
Critical Monitoring Features
● Portability across different footprints
● HA, scaling, persistence available for free
● Re-use platform capabilities, e.g. Prometheus
● Users integrate for capabilities they want
● Stringent SLAs can be met
● Plug in different OSS components with the same API
● For each API, SLAs achieved can be optimized
  ○ E.g. fault management uses the message bus directly
● Metrics metadata and declarative metrics for every component, so metrics can be incorporated automatically
● Data sensing, collection and processing
  ○ Either, some or all processed at the Edge
● Centralized access to reports, alerts
● Integration with Analytics
Service Assurance Framework Architecture
Architecture Overview On-site infrastructure platform
[Figure: architecture overview. Labels: 3rd Party Integrations; Prometheus Operator; MGMT Cluster; Metrics APIs; Events; Dispatch Routing; Message Distribution Bus (AMQP 1.0); syslog, /proc, pid, kernel, cpu, mem, net; Application Components; hardware; Prometheus-based K8S Monitoring (VM, Container); Controller, Compute, Ceph, RHEV, OpenShift Nodes (All Infrastructure Nodes)]
● Collectd container -- Host / VM metrics collection framework
  ○ Collectd 5.8 with additional OPNFV Barometer specific plugins not yet in collectd project
    ● Intel RDT, Intel PMU, IPMI
    ● AMQP1.0 client plugin
    ● Procevent -- Process state changes
    ● Sysevent -- Match syslog for critical errors
    ● Connectivity -- Fast detection of interface link status changes
  ○ Integrated as part of TripleO (OSP Director)
[Figure labels: write_syslog, write_kafka, write_prometheus, amqp_09, amqp1]
AMQ 7 Interconnect - Native AMQP 1.0 Message Router
● Large Scale Message Networks
  ○ Offers shortest path (least cost) message routing
  ○ Used without broker
  ○ High Availability through redundant path topology and re-route (not clustering)
  ○ Automatic recovery from network partitioning failures
  ○ Reliable delivery without requiring storage
● QDR Router Functionality
  ○ Apache Qpid Dispatch Router QDR
  ○ Dynamically learn addresses of messaging endpoints
  ○ Stateless - no message queuing, end-to-end transfer
High Throughput, Low Latency, Low Operational Costs
[Figure: clients and servers (Client, Client B, Server, Server A, Server C) connected through the router network]
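The least-cost routing and re-route behaviour listed above can be illustrated with a small sketch. This is not how Qpid Dispatch Router is implemented; it is a plain Dijkstra search over a hypothetical three-router mesh (the router names and link costs are assumptions made up for the example), showing how a redundant path takes over when a link disappears.

```python
import heapq


def least_cost_path(links, source, target):
    """Return (cost, path) for the cheapest route, or None if unreachable.

    `links` maps frozenset({a, b}) -> cost for each bidirectional link.
    """
    # Build an adjacency list from the undirected link table.
    neighbors = {}
    for link, cost in links.items():
        a, b = tuple(link)
        neighbors.setdefault(a, []).append((b, cost))
        neighbors.setdefault(b, []).append((a, cost))

    # Standard Dijkstra search with a priority queue of (cost, node, path).
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, link_cost in neighbors.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (cost + link_cost, nxt, path + [nxt]))
    return None


# Hypothetical three-router mesh with a redundant (more expensive) path.
mesh = {
    frozenset({"router_a", "router_b"}): 1,
    frozenset({"router_b", "router_c"}): 1,
    frozenset({"router_a", "router_c"}): 5,
}

# With all links up, traffic goes a -> b -> c at total cost 2.
direct = least_cost_path(mesh, "router_a", "router_c")

# Simulate losing the a-b link: the search falls back to the direct a-c link.
lost_link = frozenset({"router_a", "router_b"})
degraded = {link: cost for link, cost in mesh.items() if link != lost_link}
rerouted = least_cost_path(degraded, "router_a", "router_c")
```

In this toy model availability comes from the topology itself: no router holds queued messages, so recovery amounts to recomputing a path over the links that remain.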
Prometheus Operator
Evolution
[Figure: Remote Site(s) (Site 1, Site 2, ... Site 10), each with compute and ceph nodes, three controllers (cntrl 1, cntrl 2, cntrl 3), OS Networks and AMQP, connected over a Layer 3 Network to Remote Sites to the Central Site. Central Site labels: Grafana; Prometheus Operator++ Cluster; QDR (x3); S (x3); G (x3); Prometheus; compute, ceph and cntrl 1-3 nodes; AMQP; OS Networks]
DCN Use Case Deployment Stack
[Figure: Primary Site with Undercloud + Container Registry, Controller Nodes, AZ0 Compute Nodes (Local Ephemeral) and Ceph Cluster 0 (two elements marked OPTIONAL), L3 routed to DCN Site 1 through DCN Site n. Each DCN site holds one availability zone (AZ1, AZ2, AZ3, AZ4, ... AZn) with Compute Nodes (Local Ephemeral)]
Configuration & Deployment
TripleO Integration Of client side components
Collectd and QDR profiles are integrated as part of TripleO
● Collectd and QDRs run as containers on the OpenStack nodes
● Configured via heat environment file
● Each node will have a Qpid Dispatch Router running with the collectd agent
● Collectd is configured to talk to the Qpid Dispatch Router and send metrics and events
● Relevant collectd plugins can be configured via the heat template file
TripleO Client side Configuration
environments/metrics-collectd-qdr.yaml

    ## This environment template to enable Service Assurance Client side bits
    resource_registry:
      OS::TripleO::Services::MetricsQdr: ../docker/services/metrics/qdr.yaml
      OS::TripleO::Services::Collectd: ../docker/services/metrics/collectd.yaml
    parameter_defaults:
      CollectdConnectionType: amqp1
      CollectdAmqpInstances:
        notify:
          notify: true
          format: JSON
          presettle: true
        telemetry:
          format: JSON
          presettle: false
TripleO Client side Configuration
params.yaml

    cat > params.yaml <<EOF
    ---
    parameter_defaults:
      CollectdConnectionType: amqp1
      CollectdAmqpInstances:
        telemetry:
          format: JSON
          presettle: true
      MetricsQdrConnectors:
        - host: qdr-white-normal-sa-telemetry.apps.dev7.nfvpe.site
          port: 443
          role: edge
          sslProfile: tlsProfile
          verifyHostname: false
    EOF
Client side Deployment
Using overcloud deploy with collectd & qdr configuration and environment templates

    cd ~/tripleo-heat-templates
    git checkout master
    cd ~
    cp overcloud-deploy.sh overcloud-deploy-overcloud.sh
    sed -i 's/usr\/share\/openstack-/home\/stack\//g' overcloud-deploy-overcloud.sh
    ./overcloud-deploy-overcloud.sh -e /usr/share/openstack-tripleo-heat-templates/environments/metrics-collectd-qdr.yaml -e /home/stack/params.yaml
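The sed step in the deployment commands is easy to misread because of the escaped slashes. As a rough illustration of its effect, the sketch below applies the same substitution in Python. The sample script line is hypothetical and only stands in for whatever overcloud-deploy.sh actually contains.

```python
import re

# Same pattern as the sed expression: s/usr\/share\/openstack-/home\/stack\//g
PATTERN = re.compile(r"usr/share/openstack-")


def rewrite_template_paths(script_text):
    """Point packaged template paths at the checkout under /home/stack."""
    return PATTERN.sub("home/stack/", script_text)


# Hypothetical line from a deploy script, for illustration only.
before = "openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates"
after = rewrite_template_paths(before)
print(after)
```

The substitution only touches the copied script file. The `-e` paths passed on the command line are arguments, so they are left exactly as typed.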
There are 3 core components to the telemetry framework:
● Prometheus (and the AlertManager)
● Smart Gateway
● QPID Dispatch Router
Each of these components has a corresponding Operator that we'll use to spin up the various application components and objects.
To deploy the telemetry framework from the script, run the following commands after cloning the telemetry-framework repo[1] into the following directory.

    cd ~/src/github.com/redhat-service-assurance/telemetry-framework/deploy/
    ./deploy.sh CREATE

[1] https://github.com/redhat-service-assurance/telemetry-framework
Deploying Service Assurance Framework: From Operator to Application
[Figure: Operators, Custom Resources, Service Assurance Framework]
Demo
    avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 75 and avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) < 90

Critical CPU Usage Alert:

    avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 90
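To make the two thresholds concrete, here is a minimal sketch of the same banding logic applied to one series' samples from a one-minute window. The function name and the "warning" label for the 75 to 90 band are assumptions. The slide only names the critical alert, and Prometheus evaluates the expressions per time series rather than through code like this.

```python
def classify_cpu(samples, warn=75.0, critical=90.0):
    """Classify a window of CPU-percent samples using the alert thresholds.

    Returns "critical" when the window average exceeds `critical`,
    "warning" when it lies strictly between `warn` and `critical`,
    and "ok" otherwise.
    """
    if not samples:
        return "ok"  # no data in the window, so neither rule can fire
    average = sum(samples) / len(samples)
    if average > critical:
        return "critical"
    if warn < average < critical:
        return "warning"
    return "ok"


# Hypothetical sample windows for a single series.
print(classify_cpu([80, 82, 84]))
print(classify_cpu([95, 97]))
```

Because both comparisons are strict, an average of exactly 90 matches neither expression. The sketch reproduces that gap rather than hiding it.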
Architecture Demo Service Assurance framework
● https://telemetry-framework.readthedocs.io/en/master/
● https://quay.io/repository/redhat-service-assurance/smart-gateway-operator?tab=info
● https://github.com/redhat-service-assurance
[Figure: Prometheus Server scrapes each Target over HTTP at /Metrics; the collected data is queried with PromQL and visualized]
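A target in this pull model is simply an HTTP endpoint that returns samples in the Prometheus text exposition format. The sketch below renders such a payload with the standard library only. The `host` label and the sample values are hypothetical, and label-value escaping (backslashes, quotes, newlines) is deliberately left out to keep the example short.

```python
def render_metric(name, labels, value):
    """Render one sample as a line of Prometheus text exposition format."""
    if labels:
        # Labels are sorted only to make the output deterministic.
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


def render_exposition(samples):
    """Join (name, labels, value) samples into a /metrics response body."""
    return "\n".join(render_metric(*sample) for sample in samples) + "\n"


# Hypothetical samples; only the metric name appears in the slides.
body = render_exposition([
    ("sa_collectd_cpu_percent", {"type": "user", "host": "compute-0"}, 42.5),
    ("sa_collectd_cpu_percent", {"type": "system", "host": "compute-0"}, 7.25),
])
print(body)
```

A real target would serve this body from its /metrics path. In practice a client library normally produces the format, so this is only meant to show what the server pulls on each scrape.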