Autopilot: workload autoscaling at Google
Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski, Przemysław Zych, Przemek Broniek, Jarek Kuśmierek, Paweł Nowak, Beata Strack, Piotr Witusowski, Steven Hand, John Wilkes (Google)
EuroSys 2020, April 2020
Proprietary + Confidential
Google runs in containers
In any given week, we launch over two billion containers across Google.
Resource limits are crucial to isolate workloads
● container limit: max amount of CPU/mem a container can use
● container slack: CPU/mem wasted
● container usage: CPU/mem used
Borg, our scheduler, packs containers onto machines by resource limits.
[Figure: machines. Image source: http://dx.doi.org/10.1145/2741948.2741964 (Verma et al., EuroSys'15)]
Limits are fine-grained:
● CPU in milli-cores
● memory in bytes
Source: http://dx.doi.org/10.1145/2741948.2741964 (Verma et al., EuroSys'15)
We pack containers onto machines by limits.
So, precise limits are crucial for efficiency and reliability.
[Figure: a machine's capacity split among the limits of containers A, B and C, with each container's usage (or requested usage) drawn against its limit. Three cases are annotated:
● precise limits: good!
● resources are wasted (underallocated machine): bad!
● out-of-resources crash: bad!]
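The idea of packing by limits can be illustrated with a toy first-fit placement. This is only a sketch: first-fit is a stand-in chosen for brevity, not Borg's scheduling algorithm, and `Machine`, `first_fit`, and the capacities below are hypothetical names and numbers. What it shows is that the placement decision looks only at declared limits, never at actual usage, so an oversized limit blocks capacity that nothing else can use.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """A machine with a CPU capacity in milli-cores (hypothetical model)."""
    capacity_mcpu: int
    containers: list = field(default_factory=list)  # (name, limit) pairs

    def reserved(self) -> int:
        # Capacity already promised to containers, counted by their limits.
        return sum(limit for _, limit in self.containers)


def first_fit(machines, name, limit_mcpu):
    """Place a container on the first machine whose free capacity covers its limit.

    Returns the index of the chosen machine, or None if nothing fits.
    """
    for index, machine in enumerate(machines):
        if machine.reserved() + limit_mcpu <= machine.capacity_mcpu:
            machine.containers.append((name, limit_mcpu))
            return index
    return None


machines = [Machine(4000), Machine(4000)]
placements = {
    name: first_fit(machines, name, limit)
    for name, limit in [("A", 2500), ("B", 2000), ("C", 1500)]
}
print(placements)
```

In this run container B does not fit next to A, so it goes to the second machine, while C fills the remainder of the first one. If B's limit had been set closer to its real usage, all three might have shared a single machine.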
Autopilot acts as a controller for Borg limits.
[Figure labels: container limits; container counts; start/stop containers]
Autopilot continuously adjusts resource limits:
● CPU/Mem limits for containers (vertical scaling),
● number of replicas (horizontal scaling).
Autopilot Recommenders
Moving window recommenders
● Exponentially-decaying samples (half-life of 48 hours)
● Compute statistics over the samples, e.g. 95%ile
● Add a safety margin
[Figure: resources over time, showing usage, the 98%ile, the exponential decay, the safety margin and the resulting limit]
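The three steps of a moving window recommender can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Autopilot's implementation: the 48-hour half-life and the 95%ile come from the slide, while the function names, the 10% safety margin, and the simple weighted-percentile rule are hypothetical choices made here.

```python
HALF_LIFE_HOURS = 48.0  # half-life quoted on the slide


def decayed_weight(age_hours, half_life_hours=HALF_LIFE_HOURS):
    """Weight of a sample that is age_hours old; it halves every half-life."""
    return 2.0 ** (-age_hours / half_life_hours)


def weighted_percentile(samples, percentile):
    """Smallest value whose cumulative weight reaches percentile% of the total.

    samples is a list of (value, weight) pairs.
    """
    ordered = sorted(samples)
    threshold = sum(weight for _, weight in ordered) * percentile / 100.0
    cumulative = 0.0
    for value, weight in ordered:
        cumulative += weight
        if cumulative >= threshold:
            return value
    return ordered[-1][0]  # guard against floating-point shortfall


def recommend_limit(usage_by_age, percentile=95.0, safety_margin=0.10):
    """usage_by_age is a list of (age_hours, usage) pairs.

    The 10% safety margin is an assumed value, not taken from the slides.
    """
    weighted = [(usage, decayed_weight(age)) for age, usage in usage_by_age]
    return weighted_percentile(weighted, percentile) * (1.0 + safety_margin)


# A fresh sample of 10 units and a 20-day-old spike of 100 units.
print(recommend_limit([(0.0, 10.0), (480.0, 100.0)]))
```

After ten half-lives the old spike carries less than 0.1% of the weight, so the recommendation follows the recent usage plus the margin rather than the stale peak.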
Machine learning recommenders
● Each model is an arg-max algorithm picking a limit value.
● Each model is parametrized by the decay rate and the safety margin.
● The recommender picks the model performing the best over a longer time period.
[Figure: models 1, 2, ..., n, each proposing a limit, arranged by decay rate]
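The selection step, choosing among models that differ in decay rate and safety margin, might look roughly like the sketch below. Everything beyond those two parameters is an assumption made for illustration: each model here is a plain exponential moving average (a deliberately simplified stand-in for the per-model algorithm on the slide), and the cost function with its `overrun_penalty` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Model:
    """One candidate model, parametrized by decay rate and safety margin."""
    decay_rate: float
    safety_margin: float
    level: float = 0.0  # smoothed usage estimate (simplified stand-in)
    cost: float = 0.0   # accumulated cost over the evaluation period

    def limit(self) -> float:
        return self.level * (1.0 + self.safety_margin)

    def observe(self, usage: float, overrun_penalty: float = 10.0) -> None:
        # Score the limit proposed before seeing this sample:
        # overruns are penalized more heavily than slack (assumed weights).
        proposed = self.limit()
        if usage > proposed:
            self.cost += overrun_penalty * (usage - proposed)
        else:
            self.cost += proposed - usage
        # Then fold the sample into the smoothed estimate.
        self.level = self.decay_rate * self.level + (1.0 - self.decay_rate) * usage


def pick_best(models, usage_trace):
    """Replay the trace through every model and return the cheapest one."""
    for usage in usage_trace:
        for model in models:
            model.observe(usage)
    return min(models, key=lambda m: m.cost)


models = [Model(decay_rate=0.5, safety_margin=0.1),
          Model(decay_rate=0.5, safety_margin=1.0)]
best = pick_best(models, [100.0] * 50)
print(best.safety_margin, round(best.limit(), 1))
```

On a flat usage trace the model with the smaller margin wins, because both models pay a similar start-up overrun cost and the larger margin then accumulates slack at every step.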
Evaluation: Observational study of production jobs
Focus on memory
Autopilot efficiency - reduction of slack
● absolute slack: ∫ slack(t) dt = ∫ limit(t) dt - ∫ usage(t) dt
  unit: capacity of a single (largish) machine
● relative slack: (av_limit - 95%ile usage) / av_limit
  av_limit: average limit during the day; 95%ile usage: 95%ile of usage during the day
[Figure: limit(t), usage(t) and slack(t) over time, with av_limit and the 95%ile usage marked]
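Both slack metrics are simple to compute from sampled limit and usage series. The sketch below assumes evenly spaced samples, approximates the integrals with a rectangle rule, and uses a nearest-rank percentile; the function names and the sample data are hypothetical.

```python
import math


def absolute_slack(limits, usages, dt=1.0):
    """Approximate ∫ limit(t) dt - ∫ usage(t) dt for samples spaced dt apart."""
    return sum((limit - usage) * dt for limit, usage in zip(limits, usages))


def nearest_rank_percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def relative_slack(limits, usages):
    """(av_limit - 95%ile usage) / av_limit over the sampled period."""
    av_limit = sum(limits) / len(limits)
    return (av_limit - nearest_rank_percentile(usages, 95)) / av_limit


limits = [10.0, 10.0, 10.0, 10.0]
usages = [2.0, 4.0, 6.0, 8.0]
print(absolute_slack(limits, usages))   # 20.0
print(relative_slack(limits, usages))   # 0.2
```

Expressing the limit and usage values as fractions of one machine's capacity, and `dt` in days, would give absolute slack in machine-days, in the spirit of the unit named on the slide.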
Autopiloted jobs have significantly smaller relative slack.
A random sample of 5000 jobs in each category.
Relative slack: (av_limit - 95%ile usage) / av_limit
[Figure: cumulative distribution function of relative slack; the axis ends are annotated "better" and "worse"]
Autopiloted jobs save significant capacity.
A random sample of 5000 jobs in each category.
[Figure: cumulative distribution function of absolute slack [machines]; the axis ends are annotated "better" and "worse"]
When jobs migrate to Autopilot, their slack is significantly reduced.
A random sample of 500 jobs that migrated to Autopilot in a certain month, m0.
CDFs of slack for 2 months before and after migration.
[Figure: cumulative distribution function of the reduction of relative slack; the axis ends are annotated "better" and "worse"]
Reliability: how frequent are container out-of-memory errors?
● We count terminations of containers.
● We weight the number of terminations by the average number of containers of a job.
[Figure labels: machine capacity; requested usage; limit; Autopilot; out-of-resources crash]
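One plausible reading of this weighting is a per-job rate: the number of out-of-memory terminations divided by the job's average container count, so that large and small jobs become comparable. The sketch below encodes that reading only; the function name and the numbers are hypothetical, and the slide does not spell out the exact normalization.

```python
def relative_oom_rate(oom_terminations, container_counts):
    """OOM terminations divided by the job's average number of containers.

    container_counts holds the job's container count at each sampling point.
    Division is an assumed interpretation of "weight by" on the slide.
    """
    average_containers = sum(container_counts) / len(container_counts)
    return oom_terminations / average_containers


# A job that averaged 150 containers and saw 3 OOM terminations.
print(relative_oom_rate(3, [100, 200]))
```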
Autopilot reduces the frequency of out-of-memory events.
OOMs are rare: 99.5% of autopiloted jobs have no OOMs.
[Figure: cumulative distribution function; the axis ends are annotated "better" and "worse"]
DevOps: Autopiloted jobs account for over 48% of Google's fleet-wide resource usage.
Autopilot's dynamic limits could help to keep the job running despite bugs.
Autopilot: workload autoscaling at Google
1. Efficient scheduling requires fine-grained control of jobs' limits.
2. Humans are bad at setting the limits precisely.
3. Autopilot uses past usage to drive future limits.
4. Autopilot reduces relative slack by 2x
   ...and it reduces the number of jobs severely impacted by OOMs 10x.