The return of OpenStack Telemetry and the 10,000 Instances
Telemetry Project Update. Alex Krzos and Julien Danjou. 8 November 2017

  1. The return of OpenStack Telemetry and the 10,000 Instances. Telemetry Project Update. Alex Krzos, Julien Danjou. 8 November 2017

  2. The return of OpenStack Telemetry and the 10,000 (now 20,000) Instances. Alex Krzos, Julien Danjou. 8 November 2017

  3. Introductions: Alex Krzos, Senior Performance Engineer @ Red Hat (akrzos@redhat.com, IRC: akrzos); Julien Danjou, Principal Software Engineer @ Red Hat (jdanjou@redhat.com, IRC: jd_)

  4. Let's talk about Telemetry and scaling...
  ● Why scale test?
  ● Telemetry Architecture
  ● Gnocchi Architecture
  ● The Road to 10,000 Instances
  ● Scale and Performance Test Results
  ● Conclusion

  5. Why Scale Test?
  ● Determine capacity and limits
  ● Develop good defaults and recommendations
  ● Characterize resource utilization
  ● Telemetry must scale, as the number of metrics collected will only increase

  6. Telemetry Architecture

  7. Gnocchi Architecture

  8. The Road to 10,000 Instances. Ocata struggled to reach 5,000 instances even with many tuned parameters and a reduced workload. Goal: achieve 10,000 instances with less tuning than Ocata and a more demanding workload. Extra credit: go beyond 10,000 on the same hardware.

  9. Workloads
  ● Boot persisting instances with network: 500 at a time, then quiesce
  ● Boot persisting instances: 1,000 at a time, then quiesce
  ● Measure Gnocchi API responsiveness: metric create/delete, resource create/delete, get measures
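
The API-responsiveness workload above boils down to timing repeated calls and summarizing the latency distribution. A minimal sketch of that idea in Python (the helper and the stand-in workload are illustrative; this is not the benchmark tooling used in the talk, where `call` would wrap an authenticated Gnocchi REST request such as GET /v1/metric/&lt;id&gt;/measures):

```python
import time
import statistics

def measure_latency(call, iterations=100):
    """Time repeated invocations of `call` and summarize the distribution.

    `call` is any zero-argument callable; in a real test it would issue
    an HTTP request against the Gnocchi API. Returns (median, p95)
    latencies in seconds.
    """
    samples = []
    for _ in range(iterations):
        start = time.monotonic()
        call()
        samples.append(time.monotonic() - start)
    samples.sort()
    median = statistics.median(samples)
    p95 = samples[int(0.95 * (len(samples) - 1))]
    return median, p95

# Stand-in workload; swap in a lambda wrapping a real API call.
median, p95 = measure_latency(lambda: sum(range(1000)), iterations=50)
```

Reporting a high percentile alongside the median matters here: the talk's later slides show occasional spikes in API timings that a mean alone would hide.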

  10. Hardware
  3 Controllers:
  ● 2 x E5-2683 v3 (28 cores / 56 threads)
  ● 128GiB memory
  ● 2 x 1TB 7.2K SATA in RAID 1
  12 Ceph storage nodes:
  ● 2 x E5-2650 v3 (20 cores / 40 threads)
  ● 128GiB memory
  ● 18 x 500GB 7.2K SAS (2 in RAID 1 for OS, 16 as OSDs), 1 NVMe journal
  59 Compute nodes:
  ● 2 x E5-2620 v2 (12 cores / 24 threads)
  ● 128GiB / 64GiB memory
  ● 2 x 1TB 7.2K SATA in RAID 1

  11. Network Topology

  12. 10,000 Instances with NICs
  Test workload (20 iterations):
  ● 500 instances with attached network booted every 30 minutes
  Gnocchi settings:
  ● metricd workers per controller = 18
  ● api workers per controller = 24
  Ceilometer settings:
  ● notification_workers = 3
  ● rabbit_qos_prefetch_count = 128
  ● 300s polling interval
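
The knobs above map onto the services' ini-style configuration files roughly as follows. This is a sketch under the usual upstream section names; exact option names vary between releases (in particular, Gnocchi API workers may be set in the uwsgi/WSGI server configuration rather than gnocchi.conf), so verify against your deployment:

```ini
# gnocchi.conf (per controller) -- sketch
[metricd]
workers = 18

[api]
# API process count is often controlled by the WSGI server instead
workers = 24

# ceilometer.conf (per controller) -- sketch
[notification]
workers = 3

[oslo_messaging_rabbit]
rabbit_qos_prefetch_count = 128
```

The 300s polling interval is set in Ceilometer's polling/pipeline YAML, not in ceilometer.conf.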

  13. Pike Results - 10k Test: Gnocchi Backlog

  14. Pike Results - 10k Test: CPU on Controllers

  15. Pike Results - 10k Test: Memory on All Hosts

  16. Pike Results - 10k Test: Disks on Controllers

  17. Pike Results - 10k Test: Disks on Ceph Storage

  18. 20,000 Instances
  Test workload (20 iterations):
  ● 1,000 instances booted
  ● 5,000 get-measures calls
  ● 1,000 metric and resource creates/deletes
  Gnocchi:
  ● metricd workers per controller = 36
  ● api processes per controller = 24
  Ceilometer:
  ● notification_workers = 5
  ● rabbit_qos_prefetch_count = 128
  ● 300s polling interval

  19. Ocata Results

  20. Ocata Results: Not in Pike

  21. Pike Results - 20k Test: Gnocchi Backlog

  22. Pike Results - 20k Test: CPU on Controllers

  23. Pike Results - 20k Test: Memory on All Hosts

  24. Pike Results - 20k Test: Disks on Controllers

  25. Pike Results - 20k Test: Disks on Ceph Storage

  26. Pike Results - 20k Test: Network on Controllers (em1)

  27. Pike Results - 20k Test: Network on Controllers (em2)

  28. API Get Measures - 20k Test

  29. API Create/Delete Metrics - 20k Test

  30. API Create/Delete Resources - 20k Test

  31. Tuning - Gnocchi
  Some differences between versions (Newton, Ocata, Pike):
  Pike (Gnocchi v4):
  ● metricd/api workers
  ● incoming storage driver (Redis is currently preferred)
  Ocata / Newton (Gnocchi 3.1 / 3.0):
  ● metricd/api workers
  ● tasks_per_worker / metric_processing_delay
  ● check scheduler (use the latest version of Gnocchi)
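
In Gnocchi v4 the incoming (short-lived) measure storage is configured separately from the archival storage, which is what makes the Redis recommendation above possible. A sketch of what that looks like in gnocchi.conf (the hostname is a placeholder; check the option names against your Gnocchi release):

```ini
# gnocchi.conf -- sketch: Redis for incoming measures,
# while archival storage (e.g. Ceph) stays unchanged
[incoming]
driver = redis
redis_url = redis://controller-vip:6379
```

Keeping the high-churn incoming writes off the archival backend is one way to reduce the backlog pressure shown in the results slides.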

  32. Tuning - Ceilometer
  ● Always avoid overwhelming the Gnocchi backlog (collect only what you need/use)
  ● Check rabbit_qos_prefetch_count, and monitor RabbitMQ too
  Pike:
  ● agent-notification workers
  Ocata:
  ● publish directly to Gnocchi (disable the collector)
  Newton:
  ● collector workers
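
"Collect only what you need/use" translates into trimming Ceilometer's polling definitions. A minimal sketch of a polling.yaml that polls only instance CPU at the 300-second interval used in these tests (source name is illustrative; meter names and file layout vary by release, so compare with your distribution's defaults):

```yaml
# polling.yaml -- sketch: poll only the meters you actually use
sources:
    - name: instance_cpu_pollster
      interval: 300
      meters:
          - cpu
```

Every meter removed here is one less stream of measures feeding the Gnocchi backlog.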

  33. Conclusion
  OpenStack Telemetry is now proven to the 10,000-instance mark and beyond in Pike.
  Minimal degradation in API response timing as more and more metrics are collected.
  Of course, there is still room for improvement:
  ● Reduce the load on the archival storage
  ● Spikes in API timings (frontend API vs. backend API)
  ● Performance testing with other storage drivers (Swift, file)

  34. THANK YOU plus.google.com/+RedHat facebook.com/redhatinc linkedin.com/company/red-hat twitter.com/RedHatNews youtube.com/user/RedHatVideos
