Autopilot: workload autoscaling at Google


  1. Autopilot: workload autoscaling at Google. Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski, Przemysław Zych, Przemek Broniek, Jarek Kuśmierek, Paweł Nowak, Beata Strack, Piotr Witusowski, Steven Hand, John Wilkes (Google). EuroSys 2020, April 2020.

  2. Proprietary + Confidential. Google runs in containers. In any given week, we launch over two billion containers across Google.

  3. Resource limits are crucial to isolate workloads.
      ● Container limit: max amount of CPU/mem a container can use
      ● Container usage: CPU/mem used
      ● Container slack: CPU/mem wasted

  4. Borg, our scheduler, packs containers to machines by resource limits. [Figure: machines. Image source: http://dx.doi.org/10.1145/2741948.2741964 [Verma et al., EuroSys’15]]

  5. Limits are fine-grained: CPU in milli-cores, memory in bytes. Source: http://dx.doi.org/10.1145/2741948.2741964 [Verma et al., EuroSys’15]

  6. We pack containers to machines by limits. So, precise limits are crucial for efficiency and reliability. [Figure labels: machine capacity; containers A, B, and C, each with a limit.]

  7. We pack containers to machines by limits. So, precise limits are crucial for efficiency and reliability. [Figure labels: machine capacity; containers A, B, and C, each with a limit; usage of container A. Annotation: precise limits, good!]

  8. We pack containers to machines by limits. So, precise limits are crucial for efficiency and reliability. [Figure labels: machine capacity; requested usage; containers A, B, and C, each with a limit and usage. Annotations: precise limits (good!); resources are wasted (bad!); out-of-resources crash, underallocated machine (bad!).]

  9. Autopilot acts as a controller for Borg limits. Autopilot continuously adjusts resource limits: CPU/mem limits for containers (vertical scaling) and the number of replicas (horizontal scaling). [Figure labels: container limits; container counts; start/stop containers.]

  10. Autopilot Recommenders

  11. Moving window recommenders
      ● Exponentially-decaying samples (half-life of 48 hours)
      ● Compute statistics over the samples, e.g. 95%ile
      ● Add a safety margin
      [Figure: resource usage over time, with the 98%ile marked.]

  12. Moving window recommenders
      ● Exponentially-decaying samples (half-life of 48 hours)
      ● Compute statistics over the samples, e.g. 95%ile
      ● Add a safety margin
      [Figure: resource usage over time, with the 98%ile and the exponential decay marked.]

  13. Moving window recommenders
      ● Exponentially-decaying samples (half-life of 48 hours)
      ● Compute statistics over the samples, e.g. 95%ile
      ● Add a safety margin
      [Figure: resource usage over time, with the exponential decay, the safety margin, and the resulting limit marked.]
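The three steps on the moving-window slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Autopilot's implementation: the function names are hypothetical, the percentile is a simple weighted nearest-rank percentile, and the safety margin is assumed to be multiplicative (the slide does not say whether it is additive or multiplicative).

```python
HALF_LIFE_HOURS = 48.0  # half-life stated on the slide


def decay_weight(age_hours, half_life_hours=HALF_LIFE_HOURS):
    """Weight of a usage sample that is age_hours old.

    The weight halves every half_life_hours, so recent samples dominate.
    """
    return 0.5 ** (age_hours / half_life_hours)


def weighted_percentile(samples, q):
    """Smallest value whose cumulative weight reaches a fraction q of the total.

    samples is a list of (value, weight) pairs; q is in (0, 1].
    """
    ordered = sorted(samples)
    total = sum(weight for _, weight in ordered)
    cumulative = 0.0
    for value, weight in ordered:
        cumulative += weight
        if cumulative >= q * total:
            return value
    return ordered[-1][0]


def recommend_limit(usage_by_age, q=0.95, safety_margin=0.10):
    """Recommend a limit from (age_hours, usage) samples.

    Assumption: the safety margin is applied multiplicatively.
    """
    weighted = [(usage, decay_weight(age)) for age, usage in usage_by_age]
    return weighted_percentile(weighted, q) * (1.0 + safety_margin)
```

With these assumptions, a usage spike from 20 days ago carries a weight of about 0.001 and barely influences the recommended limit, while samples from the last few hours carry a weight close to 1.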

  14. Machine learning recommenders
      ● Each model is an arg-max algorithm picking a limit value.
      ● Each model is parametrized by the decay rate and the safety margin.
      ● The recommender picks the model performing the best over a longer time period.
      [Figure labels: model 1, model 2, ..., model n; limit; decay rate.]
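The selection step on this slide (pick, from a family of models parametrized by decay rate and safety margin, the one that performed best over a longer period) can be illustrated as follows. This is a hypothetical sketch: each model here is a simple decayed-peak rule rather than the arg-max algorithm the slide mentions, and the cost function (overruns penalized more heavily than slack, via the made-up `overrun_weight`) is an assumption chosen only to make the example concrete.

```python
from itertools import product


def model_limit(history, decay, margin):
    """One hypothetical model: a decayed peak of past usage plus a margin.

    decay in (0, 1] is the factor by which the remembered peak shrinks at
    every step; margin is a multiplicative safety margin.
    """
    peak = 0.0
    for usage in history:
        peak = max(usage, peak * decay)
    return peak * (1.0 + margin)


def model_cost(trace, decay, margin, overrun_weight=10.0):
    """Replay a usage trace and score one model.

    At each step the limit is computed from earlier samples only, then
    compared with the actual usage. Overruns cost overrun_weight per unit,
    unused headroom (slack) costs 1 per unit. Both weights are assumptions.
    """
    cost = 0.0
    for t in range(1, len(trace)):
        limit = model_limit(trace[:t], decay, margin)
        usage = trace[t]
        if usage > limit:
            cost += overrun_weight * (usage - limit)
        else:
            cost += limit - usage
    return cost


def pick_best_model(trace, decays, margins):
    """Return the (decay, margin) pair with the lowest cost over the trace."""
    return min(product(decays, margins), key=lambda m: model_cost(trace, *m))
```

For a perfectly flat trace such as `[100.0] * 10`, any model with a zero margin has zero cost, so `pick_best_model` selects a margin of 0.0; a spikier trace shifts the choice toward slower decay or a larger margin.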

  15. Evaluation: observational study of production jobs. Focus on memory.

  16. Autopilot efficiency: reduction of slack
      ● Absolute slack: ∫ slack(t) dt = ∫ limit(t) dt - ∫ usage(t) dt; unit: capacity of a single (largish) machine
      ● Relative slack: (av_limit - 95%ile usage) / av_limit
      [Figure: limit(t), usage(t), and slack(t) during the day; av_limit is the average limit during the day, and 95%ile usage is the 95%ile of usage during the day.]
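As a worked illustration of the two slack definitions, the sketch below computes them from sampled limit and usage series. The function names, the fixed sampling interval `dt`, the Riemann-sum approximation of the integrals, and the nearest-rank percentile are all assumptions made for the example.

```python
import math


def absolute_slack(limits, usages, dt=1.0):
    """Approximate ∫ limit(t) dt - ∫ usage(t) dt with a Riemann sum.

    limits and usages are sampled at the same instants, dt apart.
    """
    return sum(limits) * dt - sum(usages) * dt


def percentile(values, q):
    """Nearest-rank percentile of values, with q in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def relative_slack(limits, usages, q=0.95):
    """(av_limit - 95%ile usage) / av_limit, as defined on the slide."""
    av_limit = sum(limits) / len(limits)
    return (av_limit - percentile(usages, q)) / av_limit
```

For example, with a constant limit of 10 and usage samples 2, 4, 6, 8, the absolute slack is 40 - 20 = 20 (in limit-units times `dt`) and the relative slack is (10 - 8) / 10 = 0.2.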

  17. Autopiloted jobs have significantly smaller relative slack. A random sample of 5000 jobs in each category. Relative slack: (av_limit - 95%ile usage) / av_limit. [Figure: cumulative distribution function of relative slack, with the axis annotated "better" and "worse".]

  18. Autopiloted jobs save significant capacity. A random sample of 5000 jobs in each category. [Figure: cumulative distribution function of absolute slack [machines], with the axis annotated "better" and "worse".]

  19. When jobs migrate to Autopilot, their slack is significantly reduced. A random sample of 500 jobs that migrated to Autopilot in a certain month, m0. CDFs for slack for 2 months before and after migration. [Figure: cumulative distribution function of the reduction of relative slack, with the axis annotated "better" and "worse".]

  20. Autopilot reliability: how frequent are container out-of-memory errors? We count terminations of containers. We weight the number of terminations by the average number of containers of a job. [Figure labels: machine capacity; requested usage; limit; out-of-resources crash.]
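The weighting step can be made concrete with a small sketch. It assumes that "weighting by the average number of containers" means dividing a job's count of out-of-memory terminations by its average container count over the same period, so that large and small jobs are comparable; the slide does not spell out the formula, and the function name is hypothetical.

```python
def relative_ooms(oom_terminations, container_counts):
    """OOM terminations per container for one job.

    oom_terminations: number of containers of the job terminated by
    out-of-memory errors during the period.
    container_counts: samples of the job's container count over the period.
    Assumption: the weighting is a division by the average container count.
    """
    avg_containers = sum(container_counts) / len(container_counts)
    return oom_terminations / avg_containers
```

Under this assumption, a job averaging 100 containers with 5 OOM terminations scores 0.05, the same as a job averaging 20 containers with a single termination.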

  21. Autopilot reduces the frequency of out-of-memory events. OOMs are rare: 99.5% of autopiloted jobs have no OOMs. [Figure: cumulative distribution function, with the axis annotated "better" and "worse".]

  22. DevOps: Autopiloted jobs account for over 48% of Google’s fleet-wide resource usage.

  23. Autopilot’s dynamic limits could help to keep the job running despite bugs.

  24. Autopilot: workload autoscaling at Google
      1. Efficient scheduling requires fine-grained control of jobs’ limits.
      2. Humans are bad at setting the limits precisely.
      3. Autopilot uses past usage to drive future limits.
      4. Autopilot reduces relative slack by 2x ...and it reduces the number of jobs severely impacted by OOMs 10x.
