Operational and Scaling Wins at Workday: From 50K to 300K Cores

OpenStack Summit Berlin 2018. Edgar Magana, Imtiaz Chowdhury, Howard Abrams, Kyle Jorgensen, Sergio de Carvalho.


  1. Operational and Scaling Wins at Workday: From 50K to 300K Cores. OpenStack Summit Berlin 2018

  2. Edgar Magana (Moderator); Imtiaz Chowdhury (Architecture Overview and Use Cases); Howard Abrams (Instrumentation: Monitoring, Logging and Metrics); Kyle Jorgensen (Image Challenges: Clearing the Image Distribution Bottleneck); Sergio de Carvalho (API Challenges: Identifying and Fighting Scaling Issues)

  3. Workday provides enterprise cloud applications for financial management, human capital management (HCM), payroll, student systems, and analytics.

  4. Our Story OpenStack @ Workday

  5. Our Journey So Far (timeline axis: 2013 to 2019). Milestones shown: Cloud Engineering Team formed; Deployment automation tools ready; OpenStack Mitaka development; 50% of production workloads on OpenStack; 2 Workday services in QA; 14 services; OpenStack Icehouse in development; First production workload; Production workload on Mitaka; Internal workload; 39 services

  6. Workday Private Cloud Growth Revenue US $273M

  7. Our Private Cloud: 4.6k compute hosts, 45 clusters, 5 data centers, 300k cores, 22k running VMs, 4k active VM images

  8. How Workday Uses the Private Cloud
      ● Immutable images
      ● Weekly update
      ● Narrow update window
      https://www.blockchainsemantics.com/blog/immutable-blockchain/

  9. Architecture Evolution

  10. Initial Control Plane Architecture
      ● OpenStack Controller: MySQL, rabbitmq, keystone, glance, nova
      ● SDN Controller: Cassandra, rabbitmq, zookeeper, Contrail API

  11. Key drivers for architectural evolution
      ● 400% Scalability: Scale API services horizontally
      ● 99% High availability: Make critical services highly available
      ● 0 Downtime upgrade: Provide upgrade path without affecting the workload

  12. Control Plane (diagram): Clients -> HAProxy 1 / HAProxy 2 -> SDN Controllers and OpenStack Controllers. Legend: Stateless API services, Stateful services

  13. Instrumentation: Logging and Monitoring and Metrics, Oh My!

  14. Instrumentation Challenges
      ● No access to production systems: full automation
      ● Dispersed logs among multiple systems
      ● Sporadic issues with services: “What do you mean RabbitMQ stopped!?”
      ● Vague or subjective concerns: “Why is the system slow!?”

  15. Instrumentation Architecture (diagram): OpenStack Node (Sensu Client, Log Collector); Metrics -> Wavefront; Alerts -> Big Panda; Checks -> Uchiwa; Logs and Log Messages -> HA ELK

  16. Monitoring. For each issue, we:
      ● Fixed the issue/bug
      ● Wrote tests to address the issue/bug
      ● Wrote a check to alert if it happened again

  17. Example: Our Health Check. Our customers use our project (OpenStack) in a particular way… For each node in each cluster, test by:
      ● Start a VM with a particular image
      ● Check DNS resolves host name
      ● Verify SSH service
      ● Validate LDAP access
      ● Stop the VM
      Rinse and Repeat
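
The per-node sequence above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the check the speakers run: `start_vm` and `stop_vm` are hypothetical callables standing in for whatever provisioning tooling is used, an LDAP probe would be passed in as just another hypothetical check function, and the exit codes follow the Nagios-style convention (0 OK, 2 CRITICAL) that Sensu checks consume.

```python
import socket
from typing import Callable, List, Tuple

OK, CRITICAL = 0, 2  # Nagios-style exit codes, as consumed by Sensu checks


def dns_resolves(hostname: str) -> bool:
    """Return True if the host name resolves to an address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:
        return False


def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection succeeds (e.g. SSH on port 22)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_health_check(
    node: str,
    start_vm: Callable[[str], str],   # hypothetical: boots a test VM, returns its host name
    stop_vm: Callable[[str], None],   # hypothetical: deletes the test VM
    checks: List[Tuple[str, Callable[[str], bool]]],
) -> Tuple[int, List[str]]:
    """Boot a VM on `node`, run every check against it, and always stop it."""
    vm_host = start_vm(node)
    failures = []
    try:
        for name, check in checks:
            if not check(vm_host):
                failures.append(name)
    finally:
        stop_vm(vm_host)  # clean up even if a check raises
    return (CRITICAL if failures else OK), failures
```

A caller would pass something like `[("dns", dns_resolves), ("ssh", lambda h: port_open(h, 22))]` as `checks` and exit with the returned status so the monitoring system can raise an alert.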

  18. Troubleshooting Issues (diagram labels: Internal Wiki Support Documents; Check Failure Details; Internal Logging Collection System). Example alert:
      CRITICAL: Health validation suite had failures.
      Connection Error - While attempting to get VM details.
      See logging system with r#3FBM for details.

  19. Troubleshooting with Logs

  20. Troubleshooting with Logs

  21. Troubleshooting with Logs

  22. Troubleshooting with Logs

  23. Metrics: There’s death, and then there’s illness … (chart annotations: “If all the compute node load levels are down here…”, “What is this guy doing up here?”)

  24. Dashboards to Track Changes (six configurations compared): nbproc=1 –mc –set2; nbproc=1 +mc –set2; nbproc=1 +mc +set2; nbproc=2 –mc –set2; nbproc=2 +mc –set2; nbproc=2 +mc +set2

  25. Transient Dashboards What’s up with MySQL?

  26. Instrumentation Takeaways
      ● Can’t scale if you can’t tweak. Can’t tweak if you can’t monitor.
      ● Collect and filter all the logs
      ● Create checks for everything... especially running services
      ● Invest in a good metric visualization tool:
        ○ Create focused graphs
        ○ Dashboards start with key metrics (correlated to your service level agreements)
        ○ Be able to create one-shots and special-cases
        ○ Learn how to accurately monitor all the OpenStack services
        ○ Overview/Summary
        ○ MySQL
        ○ Networking Services
        ○ Cassandra
        ○ Network Traffic
        ○ Zookeeper
        ○ HAProxy
        ○ Hardware (CPU Load / Disk)
        ○ RabbitMQ

  27. Image Distribution: Clearing the Image Distribution Bottleneck

  28. Challenge: Control Plane Usage Example - Nova Scheduler response time

  29. Challenge: Control Plane Usage Example - Nova Scheduler response time

  30. Challenge: Control Plane Usage Example - Count of deployed VMs

  31. Large images: worst offender. Image size: ~6GB. Instance count across DCs: ~1700

  32. Problem: Many VM boots in a short period of time + large images = bottleneck (diagram: Glance serving four Compute nodes)

  33. Problem: Many VM boots in a short period of time + large images = bottleneck (diagram: Glance serving four Compute node caches)

  34. Problem: Many VM boots in a short period of time + large images = bottleneck (diagram: Glance serving four Compute node caches, marked SLOW...)

  35. Solution: Extend Nova API (diagram label: Operator)
      curl https://<host>:8774/v2.1/image_prefetch -X POST \
        ...
        -H "X-Auth-Token: MIIOvwYJKoZIQcCoIIOsDCCDasdkoas=" \
        -H "Content-Type: application/json" \
        -d '{ "image_id": "d5ac4b1a-9abe-4f88-8f5f-7896ece564b9" }'

  36. Solution: Extend Nova API (diagram components: Operator, Nova API, Nova Conductor, Nova DB, Nova Compute, libvirtd driver)
      curl https://<host>:8774/v2.1/image_prefetch -X POST \
        ...
        -H "X-Auth-Token: MIIOvwYJKoZIQcCoIIOsDCCDasdkoas=" \
        -H "Content-Type: application/json" \
        -d '{ "image_id": "d5ac4b1a-9abe-4f88-8f5f-7896ece564b9" }'

  37. Solution: Extend Nova API (diagram: Nova API answers the Operator; async job)
      HTTP/1.1 202 Accepted
      Content-Type: application/json
      Content-Length: 50
      X-Compute-Request-Id: req-f7a3bd10-ab76-427f-b6ee-79b92fc2a978
      Date: Mon, 02 Jul 2018 20:52:37 GMT
      {"job_id": "f7a3bd10-ab76-427f-b6ee-79b92fc2a978"}

  38. Solution: Extend Nova API (diagram components: Operator, Nova API, Nova DB)
      curl https://<host>:8774/v2.1/image_prefetch/image/<image_ID> ...
      OR
      curl https://<host>:8774/v2.1/image_prefetch/job/<job_ID> ...

  39. Solution: Extend Nova API (diagram: Nova API answers the Operator)
      HTTP/1.1 200 OK
      ...
      {
        "overall_status": "5 of 10 hosts done. 0 errors.",
        "image_id": "d5ac4b1a-9abe-4f88-8f5f-7896ece564b9",
        "job_id": "f7a3bd10-ab76-427f-b6ee-79b92fc2a978",
        "total_errors": 0,
        "num_hosts_done": 5,
        "start_time": "2018-07-02T20:52:37.000000",
        "num_hosts_downloading": 2,
        "error_hosts": 0,
        "num_hosts": 10
      }
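
An operator script that waits for a prefetch job might look roughly like the sketch below. The `image_prefetch` paths and the status fields come from the slides (this is the presenters' Nova extension, not an upstream Nova endpoint); the helper names, the `base_url` and `token` parameters, and the completion rule (hosts done plus errors reaching the host count) are assumptions made for illustration.

```python
import json
import time
import urllib.request


def prefetch_finished(status: dict) -> bool:
    """Assumed rule: the job is over once every host has finished or errored."""
    return status["num_hosts_done"] + status["total_errors"] >= status["num_hosts"]


def fetch_status(base_url: str, job_id: str, token: str) -> dict:
    """GET <base_url>/image_prefetch/job/<job_ID> and decode the JSON body."""
    req = urllib.request.Request(
        f"{base_url}/image_prefetch/job/{job_id}",
        headers={"X-Auth-Token": token},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_prefetch(get_status, interval: float = 10.0, max_polls: int = 60) -> dict:
    """Call get_status() repeatedly until the job finishes or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if prefetch_finished(status):
            return status
        time.sleep(interval)
    raise TimeoutError("image prefetch did not finish in time")
```

In use, `get_status` would be something like `lambda: fetch_status("https://<host>:8774/v2.1", job_id, token)`; keeping the HTTP call behind a callable also makes the polling loop easy to exercise without a live API.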

  40. Image Prefetch API Result (chart: cache hit, before vs. after)
      • VM boot time reduced by 300 sec on average
      • VM creation failure rate decreased by 20%

  41. HAProxy Bottleneck (diagram): Nova Compute -> GET image -> Load balancer -> 307 redirect -> Download from one of three HTTPD + Glance API nodes

  42. HAProxy Bottleneck

  43. Image Distribution: Key Takeaways
      • Under heavy load, downloading images can be a bottleneck
        ‒ Contribute image prefetch back to community
      • HA tradeoffs
      • API-specific monitoring allows for unique insights

  44. API Challenges: Identifying and Fighting Scaling Issues

  45. Nova Metadata API (chart: average response time in sec, annotated “14 seconds!”). Each VM makes > 20 API requests

  46. Nova Metadata API & Database Transfer Rate (charts: bytes sent in MB/sec, annotated “1 GB/sec”; average response time in sec, annotated “14 seconds!”). Each VM makes > 20 API requests

  47. Top Query by “Rows Sent”
      SELECT ...
      FROM (SELECT ...
            FROM instances
            WHERE instances.deleted = 0
              AND instances.uuid = ?
            LIMIT 1) AS instances
      LEFT OUTER JOIN instance_system_metadata
        ON instances.uuid = instance_system_metadata.instance_uuid
      LEFT OUTER JOIN instance_extra
        ON instance_extra.instance_uuid = instances.uuid
      LEFT OUTER JOIN instance_metadata
        ON instance_metadata.instance_uuid = instances.uuid
       AND instance_metadata.deleted = 0
      ...

  48. Instance Object-Relational Mapping (diagram: instances 1-to-N instance metadata; instances 1-to-N instance system metadata)
      SELECT ...
      FROM (SELECT ...
            FROM instances
            WHERE instances.deleted = 0
              AND instances.uuid = ?
            LIMIT 1) AS instances
      LEFT OUTER JOIN instance_system_metadata
        ON instances.uuid = instance_system_metadata.instance_uuid
      LEFT OUTER JOIN instance_extra
        ON instance_extra.instance_uuid = instances.uuid
      LEFT OUTER JOIN instance_metadata
        ON instance_metadata.instance_uuid = instances.uuid
       AND instance_metadata.deleted = 0
      ...

  49. Instance Object-Relational Mapping (diagram: instances 1-to-N instance metadata; instances 1-to-N instance system metadata)
      Expected result set (metadata union): 50 + 50 = 100 rows
      Actual result set (metadata product): 50 x 50 = 2,500 rows!
      SELECT ...
      FROM (SELECT ...
            FROM instances
            WHERE instances.deleted = 0
              AND instances.uuid = ?
            LIMIT 1) AS instances
      LEFT OUTER JOIN instance_system_metadata
        ON instances.uuid = instance_system_metadata.instance_uuid
      LEFT OUTER JOIN instance_extra
        ON instance_extra.instance_uuid = instances.uuid
      LEFT OUTER JOIN instance_metadata
        ON instance_metadata.instance_uuid = instances.uuid
       AND instance_metadata.deleted = 0
      ...
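
The row multiplication can be reproduced with an in-memory SQLite database. This is only an illustrative sketch: the three-column tables are a simplified stand-in for the real Nova schema, and the 50-row counts are taken from the slide's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instances (uuid TEXT PRIMARY KEY, deleted INTEGER DEFAULT 0);
    CREATE TABLE instance_metadata (instance_uuid TEXT, key TEXT, value TEXT);
    CREATE TABLE instance_system_metadata (instance_uuid TEXT, key TEXT, value TEXT);
""")
conn.execute("INSERT INTO instances (uuid) VALUES ('vm-1')")
rows = [("vm-1", f"k{i}", "v") for i in range(50)]
conn.executemany("INSERT INTO instance_metadata VALUES (?, ?, ?)", rows)
conn.executemany("INSERT INTO instance_system_metadata VALUES (?, ?, ?)", rows)

# One query joining both one-to-many tables: every metadata row is paired
# with every system metadata row of the same instance.
joined = conn.execute("""
    SELECT COUNT(*) FROM instances
    LEFT OUTER JOIN instance_system_metadata
        ON instances.uuid = instance_system_metadata.instance_uuid
    LEFT OUTER JOIN instance_metadata
        ON instances.uuid = instance_metadata.instance_uuid
    WHERE instances.uuid = 'vm-1'
""").fetchone()[0]

# Querying each child table on its own returns only the rows that exist.
separate = sum(
    conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE instance_uuid = 'vm-1'"
    ).fetchone()[0]
    for table in ("instance_metadata", "instance_system_metadata")
)

print(joined, separate)  # 2500 100
```

Loading each collection with its own query is one common way to avoid the product; whether that matches the fix the speakers applied is not stated in this excerpt.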
