Demonstrating At Scale Monitoring Of OpenStack Cloud Using Prometheus

Open Infrastructure Summit 2019. Demonstrating At Scale Monitoring Of OpenStack Cloud Using Prometheus. Anandeep Pannu (apannu@redhat.com), Pradeep Kilambi (prad@redhat.com).


  1. Open Infrastructure Summit 2019: Demonstrating At Scale Monitoring Of OpenStack Cloud Using Prometheus. Anandeep Pannu (apannu@redhat.com), Pradeep Kilambi (prad@redhat.com)

  3. Definitions

  6. Implications for Open Infrastructure

  9. Critical Monitoring Features

  10. ● Portability across different footprints
      ● HA, scaling, persistence available for free
      ● Re-use platform capabilities, e.g. Prometheus
      ● Users integrate for capabilities they want
      ● Stringent SLAs can be met
      ● Plug-in different OSS components with the same API
      ● For each API, SLAs achieved can be optimized
        ○ E.g. fault management uses the message bus directly
      ● Metrics meta-data and declarative metrics for every component, so metrics can be incorporated automatically
      ● Data sensing, collection and processing
        ○ Either some or all processed at the Edge
      ● Centralized access to reports, alerts
      ● Integration with Analytics

  11. Service Assurance Framework Architecture

  12. Architecture Overview On-site infrastructure platform

  14. Architecture diagram. Labels: 3rd Party Integrations; Prometheus Operator; MGMT Cluster; Metrics and Events APIs; Dispatch Routing Message Distribution Bus (AMQP 1.0); collected sources: syslog, /proc, pid, kernel, cpu, mem, net, application components, hardware; Prometheus-based K8S monitoring (VM, container); monitored nodes: Controller, Compute, Ceph, RHEV, OpenShift nodes (all infrastructure nodes)

  15. ● Collectd container -- Host / VM metrics collection framework
        ○ Collectd 5.8 with additional OPNFV Barometer specific plugins not yet in collectd project
          ● Intel RDT, Intel PMU, IPMI
          ● AMQP1.0 client plugin
          ● Procevent -- Process state changes
          ● Sysevent -- Match syslog for critical errors
          ● Connectivity -- Fast detection of interface link status changes
        ○ Integrated as part of TripleO (OSP Director)

  16. Collectd write plugins (diagram): write_syslog, write_kafka, write_prometheus, amqp_09, amqp1

  17. AMQ 7 Interconnect - Native AMQP 1.0 Message Router
      ● Large Scale Message Networks
        ○ Offers shortest path (least cost) message routing
        ○ Used without broker
        ○ High Availability through redundant path topology and re-route (not clustering)
        ○ Automatic recovery from network partitioning failures
        ○ Reliable delivery without requiring storage
      ● QDR Router Functionality
        ○ Apache Qpid Dispatch Router (QDR)
        ○ Dynamically learn addresses of messaging endpoints
        ○ Stateless - no message queuing, end-to-end transfer
      High Throughput, Low Latency, Low Operational Costs
      (Diagram labels: Client, Client B, Server, Server A, Server C)
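
The "shortest path (least cost)" routing and "re-route" behaviour on this slide can be illustrated with a small sketch. It uses Dijkstra's algorithm over a hypothetical four-router topology (the router names `edge`, `r1`, `r2`, `central` and the link costs are assumptions made up for illustration). It is not the Qpid Dispatch Router implementation, only a picture of the idea that traffic follows the cheapest path and moves to a redundant path when a link disappears.

```python
import heapq


def least_cost_path(links, source, target):
    """Return (cost, path) for the cheapest route, or None if unreachable.

    links maps an undirected link (a, b) to its cost.
    """
    # Build an adjacency map from the link table.
    neighbours = {}
    for (a, b), cost in links.items():
        neighbours.setdefault(a, []).append((b, cost))
        neighbours.setdefault(b, []).append((a, cost))

    # Standard Dijkstra: always expand the cheapest known route first.
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, link_cost in neighbours.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (cost + link_cost, nxt, path + [nxt]))
    return None


# Hypothetical topology: two redundant paths from an edge router to the centre.
links = {
    ("edge", "r1"): 1,
    ("r1", "central"): 1,
    ("edge", "r2"): 2,
    ("r2", "central"): 2,
}

print(least_cost_path(links, "edge", "central"))

# Simulate a failed link: the route moves to the redundant, more costly path.
failed = {k: v for k, v in links.items() if k != ("r1", "central")}
print(least_cost_path(failed, "edge", "central"))
```

Because no state is kept per message in this model, recovering from the failed link is just a matter of recomputing the path, which mirrors the "redundant path topology and re-route (not clustering)" point above.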

  18. Prometheus Operator

  20. Evolution

  21. Deployment diagram. Remote sites (Site 1, Site 2, Site 10): each has compute and ceph nodes and three controllers (cntrl 1, cntrl 2, cntrl 3) on OS Networks, and sends AMQP over a Layer 3 network to the central site. Central site: its own compute, ceph and controller nodes on OS Networks, plus a Prometheus Operator++ cluster with QDR, Prometheus and Grafana (the diagram also shows components labelled S and G)

  22. DCN Use Case Deployment Stack (diagram). Primary site (AZ0): Undercloud + Container Registry, Controller Nodes, Compute Nodes (Local Ephemeral) and Ceph Cluster 0; two components are marked OPTIONAL. L3 routed links connect the primary site to DCN Site 1 through DCN Site n (AZ1, AZ2, AZ3, AZ4, AZn), each with Compute Nodes (Local Ephemeral)

  23. Configuration & Deployment

  24. TripleO Integration of client-side components
      Collectd and QDR profiles are integrated as part of TripleO
      ● Collectd and QDRs run as containers on the OpenStack nodes
      ● Configured via heat environment file
      ● Each node will have a qpid dispatch router running with the collectd agent
      ● Collectd is configured to talk to the qpid dispatch router and send metrics and events
      ● Relevant collectd plugins can be configured via the heat template file

  25. TripleO Client side Configuration
      environments/metrics-collectd-qdr.yaml

        ## This environment template to enable Service Assurance Client side bits
        resource_registry:
          OS::TripleO::Services::MetricsQdr: ../docker/services/metrics/qdr.yaml
          OS::TripleO::Services::Collectd: ../docker/services/metrics/collectd.yaml
        parameter_defaults:
          CollectdConnectionType: amqp1
          CollectdAmqpInstances:
            notify:
              notify: true
              format: JSON
              presettle: true
            telemetry:
              format: JSON
              presettle: false
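
To make the two `CollectdAmqpInstances` entries easier to read, here is a minimal sketch that restates the `parameter_defaults` from the slide as a Python dict and summarises each instance. The helper name `describe_instances` and the labels "events", "metrics", "presettled" and "acknowledged" are illustrative assumptions, not TripleO or collectd terminology. The interpretation assumed here follows the collectd amqp1 plugin options: `notify: true` makes an instance carry notifications rather than metric values, and `presettle: true` sends messages without waiting for an acknowledgement.

```python
# The parameter_defaults from the slide, written as a plain Python dict.
parameter_defaults = {
    "CollectdConnectionType": "amqp1",
    "CollectdAmqpInstances": {
        "notify": {"notify": True, "format": "JSON", "presettle": True},
        "telemetry": {"format": "JSON", "presettle": False},
    },
}


def describe_instances(params):
    """Return (name, kind, delivery) for each configured AMQP instance."""
    rows = []
    for name, opts in sorted(params["CollectdAmqpInstances"].items()):
        # Both options are treated as False when absent (assumed default).
        kind = "events" if opts.get("notify", False) else "metrics"
        delivery = "presettled" if opts.get("presettle", False) else "acknowledged"
        rows.append((name, kind, delivery))
    return rows


for row in describe_instances(parameter_defaults):
    print(row)
```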

  26. TripleO Client side Configuration
      params.yaml

        cat > params.yaml <<EOF
        ---
        parameter_defaults:
          CollectdConnectionType: amqp1
          CollectdAmqpInstances:
            telemetry:
              format: JSON
              presettle: true
          MetricsQdrConnectors:
            - host: qdr-white-normal-sa-telemetry.apps.dev7.nfvpe.site
              port: 443
              role: edge
              sslProfile: tlsProfile
              verifyHostname: false
        EOF

  27. Client side Deployment
      Using overcloud deploy with collectd & qdr configuration and environment templates

        cd ~/tripleo-heat-templates
        git checkout master
        cd ~
        cp overcloud-deploy.sh overcloud-deploy-overcloud.sh
        sed -i 's/usr\/share\/openstack-/home\/stack\//g' overcloud-deploy-overcloud.sh
        ./overcloud-deploy-overcloud.sh -e /usr/share/openstack-tripleo-heat-templates/environments/metrics-collectd-qdr.yaml -e /home/stack/params.yaml

  29. There are 3 core components to the telemetry framework:
      ● Prometheus (and the AlertManager)
      ● Smart Gateway
      ● QPID Dispatch Router
      Each of these components has a corresponding Operator that we'll use to spin up the various application components and objects.

  30. To deploy the telemetry framework from the script, run the following commands after cloning the telemetry-framework repo [1] into the following directory.

        cd ~/src/github.com/redhat-service-assurance/telemetry-framework/deploy/
        ./deploy.sh CREATE

      [1] https://github.com/redhat-service-assurance/telemetry-framework

  31. Deploying Service Assurance Framework: From Operator to Application (diagram: Operators, Custom Resources, Service Assurance Framework)

  32. Demo

  33. avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 75 and avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) < 90
      Critical CPU Usage Alert: avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 90
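
The two expressions on this slide define a band between 75 and 90 and a critical level above 90, both on the one-minute average of `sa_collectd_cpu_percent`. A minimal sketch of that threshold logic for a single series follows. The function names and the label "warning" for the lower band are assumptions (the slide only names the critical alert), and the sample lists stand in for whatever samples fall inside the `[1m]` window.

```python
def avg_over_time(samples):
    """Mean of the samples inside the window (the [1m] range on the slide)."""
    return sum(samples) / len(samples)


def cpu_alert_level(samples, warn=75.0, critical=90.0):
    """Classify one series using the slide's thresholds.

    Returns "critical", "warning" (assumed name for the 75-90 band), or None.
    """
    if not samples:
        return None  # no data in the window, so neither expression matches
    avg = avg_over_time(samples)
    if avg > critical:
        return "critical"
    if warn < avg < critical:
        return "warning"
    return None


print(cpu_alert_level([80.0, 82.0, 78.0]))
print(cpu_alert_level([95.0, 92.0, 96.0]))
print(cpu_alert_level([10.0, 20.0]))
```

Both comparisons are strict, as in the slide, so an average of exactly 90 matches neither expression.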

  34. Architecture Demo Service Assurance framework

  35. ● https://telemetry-framework.readthedocs.io/en/master/
      ● https://quay.io/repository/redhat-service-assurance/smart-gateway-operator?tab=info
      ● https://github.com/redhat-service-assurance

  41. Prometheus diagram: the Prometheus Server scrapes each Target's /metrics endpoint over HTTP; the collected data is queried with PromQL and visualized
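
As a rough illustration of the target side of this diagram, the sketch below renders samples in the Prometheus text exposition format and serves them at `/metrics` using only the Python standard library. The metric name `sa_collectd_cpu_percent` and the `type` label come from the alert slide; the `host` label, the sample values, the port and the helper names are assumptions. Label values are not escaped here, so this is a simplified sketch and not a replacement for a real client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical samples: (metric name, labels, value).
SAMPLES = [
    ("sa_collectd_cpu_percent", {"host": "compute-0", "type": "user"}, 12.5),
    ("sa_collectd_cpu_percent", {"host": "compute-0", "type": "system"}, 3.0),
]


def render_metrics(samples):
    """Render samples as text exposition lines: name{label="value"} number."""
    lines = []
    for name, labels, value in samples:
        if labels:
            label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics with the rendered samples."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve /metrics on localhost (call this manually to try it)."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a Prometheus scrape job at `127.0.0.1:8000` would be one way to try it; the server is deliberately not started on import.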
