The Return of OpenStack Telemetry and the 10,000 Instances (and on to 20,000)
Telemetry Project Update
Alex Krzos and Julien Danjou
8 November 2017
Introductions
Alex Krzos - Senior Performance Engineer @ Red Hat - akrzos@redhat.com - IRC: akrzos
Julien Danjou - Principal Software Engineer @ Red Hat - jdanjou@redhat.com - IRC: jd_
Let's talk about Telemetry and Scaling...
● Why scale test?
● Telemetry Architecture
● Gnocchi Architecture
● The Road to 10,000 Instances
● Scale and Performance Test Results
● Conclusion
Why Scale Test?
● Determine capacity and limits
● Develop good defaults and recommendations
● Characterize resource utilization
Telemetry must scale, as the number of metrics collected will only increase.
Telemetry Architecture
Gnocchi Architecture
The Road to 10,000 Instances
● Ocata struggled to reach 5,000 instances, even with heavily tuned parameters and a reduced workload.
● Goal: achieve 10,000 instances with less tuning than Ocata and a more demanding workload.
● Extra credit: go beyond 10,000 with the same hardware.
Workloads
● Boot Persisting Instances with Network: 500 at a time, then quiesce
● Boot Persisting Instances: 1,000 at a time, then quiesce
● Measure Gnocchi API Responsiveness:
  ● Metric Create/Delete
  ● Resource Create/Delete
  ● Get Measures
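To make the API-responsiveness part of the workload concrete, here is a minimal sketch of timing those three Gnocchi operations with python-gnocchiclient. This is not the exact harness used for the results in this deck; the Keystone endpoint, credentials, archive policy name ("low"), and the metric UUID are placeholder assumptions, and the exact client call signatures may vary by gnocchiclient release.

```python
# Hedged sketch: time the three Gnocchi API operations used in the workload.
# Endpoint, credentials, archive policy and METRIC-UUID are placeholders.
import time
import uuid

from gnocchiclient.v1 import client as gnocchi_client
from keystoneauth1 import session
from keystoneauth1.identity import v3

auth = v3.Password(auth_url="http://keystone.example.com:5000/v3",
                   username="admin", password="secret", project_name="admin",
                   user_domain_id="default", project_domain_id="default")
gnocchi = gnocchi_client.Client(session=session.Session(auth=auth))


def timed(label, func, *args, **kwargs):
    """Run one API call and print how long the round-trip took."""
    start = time.time()
    result = func(*args, **kwargs)
    print("%s: %.3fs" % (label, time.time() - start))
    return result


# Metric create/delete
metric = timed("metric create", gnocchi.metric.create,
               {"name": "bench.metric", "archive_policy_name": "low"})
timed("metric delete", gnocchi.metric.delete, metric["id"])

# Resource create/delete
resource_id = str(uuid.uuid4())
timed("resource create", gnocchi.resource.create, "generic", {"id": resource_id})
timed("resource delete", gnocchi.resource.delete, resource_id)

# Get measures for an existing metric (placeholder UUID)
timed("get measures", gnocchi.metric.get_measures, "METRIC-UUID")
```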
Hardware
3 Controllers
● 2 x E5-2683 v3 - 28 Cores / 56 Threads
● 128GiB Memory
● 2 x 1TB 7.2K SATA in RAID 1
12 Ceph Storage Nodes
● 2 x E5-2650 v3 - 20 Cores / 40 Threads
● 128GiB Memory
● 18 x 500GB 7.2K SAS (2 in RAID 1 for OS, 16 OSDs), 1 NVMe Journal
59 Compute Nodes
● 2 x E5-2620 v2 - 12 Cores / 24 Threads
● 128GiB / 64GiB Memory
● 2 x 1TB 7.2K SATA in RAID 1
Network Topology
10,000 Instances with NICs Test
Workload (20 iterations)
● 500 instances with attached network booted every 30 minutes
Gnocchi Settings
● metricd workers per Controller = 18
● api workers per Controller = 24
Ceilometer Settings
● notification_workers = 3
● rabbit_qos_prefetch_count = 128
● 300s polling interval
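A hedged illustration of where these settings land on disk, using only the Python standard library. The file paths and section names are assumptions for a Pike-era deployment (in TripleO they would normally be applied through Heat/puppet parameters rather than edited directly), and the 300s polling interval lives in Ceilometer's polling.yaml rather than ceilometer.conf.

```python
# Illustrative only: merge the 10k-test settings into gnocchi.conf / ceilometer.conf.
# Section and option names are assumptions; the Gnocchi API worker count (24) is
# normally the number of httpd/uwsgi processes rather than a gnocchi.conf option,
# so it is omitted here.
import configparser


def set_options(path, options):
    """Merge a {section: {option: value}} mapping into an INI file."""
    cfg = configparser.ConfigParser()
    cfg.read(path)
    for section, values in options.items():
        if section != "DEFAULT" and not cfg.has_section(section):
            cfg.add_section(section)
        for key, value in values.items():
            cfg.set(section, key, str(value))
    with open(path, "w") as fp:
        cfg.write(fp)


set_options("/etc/gnocchi/gnocchi.conf", {
    "metricd": {"workers": 18},       # metricd workers per controller
})
set_options("/etc/ceilometer/ceilometer.conf", {
    "notification": {"workers": 3},   # shown as notification_workers on the slide
    "oslo_messaging_rabbit": {"rabbit_qos_prefetch_count": 128},
})
```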
Pike Results - 10k Test: Gnocchi Backlog
Pike Results - 10k Test: CPU on Controllers
Pike Results - 10k Test: Memory on All Hosts
Pike Results - 10k Test: Disks on Controllers
Pike Results - 10k Test: Disks on CephStorage
20,000 Instances Test
Workload (20 iterations)
● 1000 instances booted
● 5000 get measures
● 1000 metric and resource creates/deletes
Gnocchi
● metricd workers per Controller = 36
● api processes per Controller = 24
Ceilometer
● notification_workers = 5
● rabbit_qos_prefetch_count = 128
● 300s polling interval
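The following slides track the Gnocchi backlog (measures waiting to be aggregated) during this test. One simple way to sample that number yourself is the Gnocchi status endpoint; the sketch below reuses the same placeholder credentials as the earlier client example and assumes the Gnocchi 4.x layout of the status response.

```python
# Hedged sketch: sample the Gnocchi processing backlog once a minute via /v1/status.
# Credentials are placeholders; the response layout shown is the Gnocchi 4.x one and
# may differ in other releases.
import time

from gnocchiclient.v1 import client as gnocchi_client
from keystoneauth1 import session
from keystoneauth1.identity import v3

auth = v3.Password(auth_url="http://keystone.example.com:5000/v3",
                   username="admin", password="secret", project_name="admin",
                   user_domain_id="default", project_domain_id="default")
gnocchi = gnocchi_client.Client(session=session.Session(auth=auth))

while True:
    status = gnocchi.status.get()
    # Typically: {"storage": {"summary": {"measures": <backlog>, "metrics": <count>}}}
    summary = status.get("storage", {}).get("summary", {})
    print("backlog: %s measures across %s metrics"
          % (summary.get("measures"), summary.get("metrics")))
    time.sleep(60)
```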
Ocata Results
Ocata Results - Not in Pike
Pike Results - 20k Test: Gnocchi Backlog
Pike Results - 20k Test: CPU on Controllers
Pike Results - 20k Test: Memory on All Hosts
Pike Results - 20k Test: Disks on Controllers
Pike Results - 20k Test: Disks on CephStorage
Pike Results - 20k Test: Network Controllers Em1
Pike Results - 20k Test: Network Controllers Em2
API Get Measures - 20k Test
API Create/Delete Metrics - 20k Test
API Create/Delete Resources - 20k Test
Tuning - Gnocchi
Some differences between versions (Newton, Ocata, Pike):
Pike (Gnocchi v4)
● metricd/api workers
● Incoming storage driver (Redis is currently preferred)
Ocata / Newton (Gnocchi 3.1 / 3.0)
● metricd/api workers
● tasks_per_worker / metric_processing_delay
● Check scheduler (use the latest version of Gnocchi)
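For the Pike row above, the incoming-driver switch is the setting that matters most. Below is a hedged sketch of the relevant gnocchi.conf options; the Redis URL and worker count are placeholders, and the section/option names ([incoming] driver / redis_url, [metricd] workers) reflect Gnocchi 4 as we understand it.

```python
# Hedged sketch: point Gnocchi 4's incoming (pre-aggregation) storage at Redis and
# size metricd, using stdlib configparser. Values and the Redis URL are placeholders.
import configparser

CONF = "/etc/gnocchi/gnocchi.conf"
cfg = configparser.ConfigParser()
cfg.read(CONF)

for section in ("incoming", "metricd"):
    if not cfg.has_section(section):
        cfg.add_section(section)

cfg.set("incoming", "driver", "redis")                       # Redis incoming driver
cfg.set("incoming", "redis_url", "redis://controller:6379")  # placeholder endpoint
cfg.set("metricd", "workers", "36")                          # 36 per controller (20k test)

with open(CONF, "w") as fp:
    cfg.write(fp)
```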
Tuning - Ceilometer
Always avoid overwhelming the Gnocchi backlog (collect what you need/use)
● Check rabbit_qos_prefetch_count - monitor RabbitMQ too
● Pike: agent-notification workers
● Ocata: publish directly to Gnocchi (disable the collector)
● Newton: collector workers
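"Collect what you need/use" mostly means trimming polling down to the meters you actually consume and backing off the polling interval. Below is a hedged sketch that generates a trimmed polling.yaml; the meter list is purely an example and the file path assumes a Pike-era Ceilometer.

```python
# Hedged sketch: write a trimmed polling.yaml so Ceilometer only polls the meters
# actually consumed, at a 300s interval. The meter list here is just an example.
import yaml  # PyYAML

polling = {
    "sources": [{
        "name": "reduced_pollsters",
        "interval": 300,                 # 300s polling interval from the test setup
        "meters": [
            "cpu",
            "memory.usage",
            "disk.read.bytes",
            "disk.write.bytes",
        ],
    }]
}

with open("/etc/ceilometer/polling.yaml", "w") as fp:
    yaml.safe_dump(polling, fp, default_flow_style=False)
```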
Conclusion
● OpenStack Telemetry is now proven to the 10,000 instance mark and beyond in Pike.
● Minimal degradation in API response times as more and more metrics are collected.
Of course, there is still room for improvement:
● Reduce the load on the archival storage
● Spikes in API timings (frontend API vs backend API)
● Performance testing with other storage drivers (Swift, File)
THANK YOU
plus.google.com/+RedHat
facebook.com/redhatinc
linkedin.com/company/red-hat
twitter.com/RedHatNews
youtube.com/user/RedHatVideos