Autopilot: workload autoscaling at Google

Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski, Przemysław Zych, Przemek Broniek, Jarek Kuśmierek, Paweł Nowak, Beata Strack, Piotr Witusowski, Steven Hand, John Wilkes (Google) EuroSys 2020 April 2020

Proprietary + Confidential
Google runs in containers. In any given week, we launch over two billion containers across Google.

Resource limits are crucial to isolate workloads
● container limit: max amount of CPU/mem a container can use
● container slack: CPU/mem wasted
● container usage: CPU/mem used

Borg, our scheduler, packs containers onto machines by resource limits.
[Figure: machines. Image source: http://dx.doi.org/10.1145/2741948.2741964 [Verma et al., EuroSys’15]]

Limits are fine-grained: CPU in milli-cores, memory in bytes.
Source: http://dx.doi.org/10.1145/2741948.2741964 [Verma et al., EuroSys’15]

We pack containers onto machines by limits. So, precise limits are crucial for efficiency and reliability.
[Figure: machine capacity compared with the limits and usage of containers A, B and C. Labels: precise limits (good!), resources are wasted (bad!), out-of-resources crash when requested usage exceeds the limit (bad!), underallocated machine.]
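The idea of packing by limits can be sketched in a few lines. The first-fit placement below is an assumption made only for illustration (the slides do not describe Borg's actual placement algorithm), and `Machine` and `first_fit` are hypothetical names. The point it shows is that the scheduler reasons about declared limits, not observed usage, so an inflated limit blocks capacity that nothing uses.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """A machine with a fixed CPU capacity, tracked in milli-cores."""
    capacity_millicores: int
    limits: list = field(default_factory=list)  # limits of placed containers

    def free(self) -> int:
        # Remaining capacity is computed from limits, not from actual usage.
        return self.capacity_millicores - sum(self.limits)


def first_fit(machines, limit_millicores):
    """Place a container on the first machine whose free capacity covers its limit.

    Returns the index of the chosen machine, or None if nothing fits.
    First-fit is an illustrative assumption, not Borg's real policy.
    """
    for index, machine in enumerate(machines):
        if machine.free() >= limit_millicores:
            machine.limits.append(limit_millicores)
            return index
    return None
```

For example, with two machines of 4000 milli-cores each, containers with limits 3000, 2000 and 1000 fit, while a further container with a limit of 2500 does not, regardless of how little CPU the first three actually use.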

Autopilot acts as a controller for Borg limits.
Autopilot continuously adjusts resource limits:
● CPU/Mem limits for containers (vertical scaling),
● number of replicas (horizontal scaling).
[Figure labels: container limits, container counts, start/stop containers]

Autopilot Recommenders

Moving window recommenders
● Exponentially-decaying samples (half-life of 48 hours)
● Compute statistics over the samples, e.g. 95%ile
● Add a safety margin
[Figure: resource usage over time, annotated with the exponential decay, the 98%ile of usage, the safety margin and the resulting limit]
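The three steps above can be sketched as a weighted percentile. This is a minimal illustration, not Autopilot's implementation: the function name, the `(timestamp_hours, usage)` sample format, the nearest-rank style of weighted percentile, and the multiplicative form of the safety margin are all assumptions. Only the 48-hour half-life and the 95%ile example come from the slide.

```python
def decayed_percentile_limit(samples, now_hours, percentile=95.0,
                             half_life_hours=48.0, safety_margin=0.10):
    """Recommend a limit from exponentially-decayed usage samples.

    samples: list of (timestamp_hours, usage) pairs (hypothetical format).
    A sample that is one half-life old counts half as much as a fresh one.
    """
    if not samples:
        raise ValueError("need at least one usage sample")

    # Step 1: weight every sample by its age.
    weighted = []
    for timestamp, usage in samples:
        age = now_hours - timestamp
        weighted.append((usage, 0.5 ** (age / half_life_hours)))

    # Step 2: weighted percentile, i.e. the smallest usage value whose
    # cumulative weight reaches the requested share of the total weight.
    weighted.sort()
    threshold = sum(weight for _, weight in weighted) * percentile / 100.0
    statistic = weighted[-1][0]  # fallback guards against rounding error
    cumulative = 0.0
    for usage, weight in weighted:
        cumulative += weight
        if cumulative >= threshold:
            statistic = usage
            break

    # Step 3: add the safety margin (assumed multiplicative here).
    return statistic * (1.0 + safety_margin)
```

With the decay in place, an old usage peak gradually loses influence: a spike from four days ago carries a quarter of the weight of a sample taken now.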

Machine learning recommenders
● Each model is an arg-max algorithm picking a limit value.
● Each model is parametrized by the decay rate and the safety margin.
● The recommender picks the model performing the best over a longer time period.
[Figure: model 1, model 2, ..., model n; labels: limit, decay rate]
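A toy version of this ensemble might look as follows. The slide does not say how models are scored, so everything concrete here is an assumption: the `sample_cost` function (a heavy penalty for usage above the limit, a light one for slack), the fixed grid of candidate limits, the multiplicative safety margin, and the names `Model` and `pick_best_model`. What the sketch does follow from the slide is the shape: each model does an arg-max over candidate limits using decayed scores, and the recommender keeps the model that performed best over the whole trace.

```python
def sample_cost(limit, usage, overrun_weight=1.0, slack_weight=0.1):
    """Hypothetical cost of one sample: overruns hurt more than slack."""
    if usage > limit:
        return overrun_weight * (usage - limit)
    return slack_weight * (limit - usage)


class Model:
    """One model, parametrized by a decay rate and a safety margin."""

    def __init__(self, decay, margin, candidate_limits):
        self.decay = decay
        self.margin = margin
        # Score per candidate limit; higher is better (negated decayed cost).
        self.scores = {limit: 0.0 for limit in candidate_limits}

    def observe(self, usage):
        for limit in self.scores:
            self.scores[limit] = (self.decay * self.scores[limit]
                                  - sample_cost(limit, usage))

    def recommend(self):
        best_limit = max(self.scores, key=self.scores.get)  # arg-max
        return best_limit * (1.0 + self.margin)


def pick_best_model(models, usage_trace):
    """Replay the trace and return the model with the lowest total cost."""
    totals = [0.0] * len(models)
    for usage in usage_trace:
        for index, model in enumerate(models):
            # Score the recommendation made before seeing this sample.
            totals[index] += sample_cost(model.recommend(), usage)
            model.observe(usage)
    best_index = min(range(len(models)), key=totals.__getitem__)
    return models[best_index]
```

On a flat trace, a model with a large safety margin accumulates slack cost on every sample, so the selection step favours a model with a smaller margin.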

Evaluation: Observational study of production jobs. Focus on memory.

Autopilot efficiency - reduction of slack
● absolute slack: ∫ slack(t) dt = ∫ limit(t) dt - ∫ usage(t) dt; unit: capacity of a single (largish) machine
● relative slack: (av_limit - 95%ile usage) / av_limit
[Figure: limit(t), usage(t) and slack(t) over time; average limit during the day and 95%ile usage during the day]
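Both metrics are straightforward to compute from sampled time series. The sketch below assumes evenly spaced samples of limit and usage over one day, approximates the integrals with a Riemann sum, and uses a nearest-rank percentile; these choices, the optional `machine_capacity` divisor and the function names are illustrative assumptions, not details from the slides.

```python
import math


def absolute_slack(limits, usages, dt=1.0, machine_capacity=1.0):
    """Approximate ∫ limit(t) dt - ∫ usage(t) dt for evenly spaced samples.

    Dividing by machine_capacity expresses the result in machines.
    """
    total = sum(limit - usage for limit, usage in zip(limits, usages)) * dt
    return total / machine_capacity


def nearest_rank_percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def relative_slack(limits, usages):
    """(av_limit - 95%ile usage) / av_limit over the sampled period."""
    av_limit = sum(limits) / len(limits)
    return (av_limit - nearest_rank_percentile(usages, 95)) / av_limit
```

For a constant limit of 20 and usage samples 1, 2, ..., 20, the absolute slack is 400 - 210 = 190 and the relative slack is (20 - 19) / 20 = 0.05. The two metrics can disagree sharply: relative slack looks only at the 95%ile of usage, while absolute slack also counts the time spent well below it.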

Autopiloted jobs have significantly smaller relative slack.
A random sample of 5000 jobs in each category.
Relative slack: (av_limit - 95%ile usage) / av_limit
[Figure: cumulative distribution function of relative slack, with the axis annotated "better" and "worse"]

Autopiloted jobs save significant capacity.
A random sample of 5000 jobs in each category.
[Figure: cumulative distribution function of absolute slack [machines], with the axis annotated "better" and "worse"]

When jobs migrate to Autopilot, their slack is significantly reduced.
A random sample of 500 jobs that migrated to Autopilot in a certain month, m0. CDFs for slack for 2 months before and after migration.
[Figure: cumulative distribution function of the reduction of relative slack, with the axis annotated "better" and "worse"]

Autopilot reliability: how frequent are container out-of-memory errors?
● We count terminations of containers.
● We weight the number of terminations by the average number of containers of a job.
[Figure: requested usage exceeding the limit and machine capacity, leading to an out-of-resources crash]
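One plausible reading of this weighting is a simple normalization: divide a job's OOM terminations by its average container count, so that a large job is not penalized merely for running more containers. The function below is a sketch of that interpretation; the name `relative_oom_rate` and the sampled `container_counts` input are assumptions.

```python
def relative_oom_rate(oom_terminations, container_counts):
    """OOM terminations per container, averaged over the observation period.

    oom_terminations: number of containers terminated for running out of memory.
    container_counts: the job's container count sampled over the same period.
    """
    if not container_counts:
        raise ValueError("need at least one container-count sample")
    average_containers = sum(container_counts) / len(container_counts)
    return oom_terminations / average_containers
```

Under this reading, 3 OOMs in a job that averages 200 containers gives 0.015, whereas the same 3 OOMs in a job with 3 containers gives 1.0.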

Autopilot reduces the frequency of out-of-memory events.
OOMs are rare: 99.5% of autopiloted jobs have no OOMs.
[Figure: cumulative distribution function, with the axis annotated "better" and "worse"]

DevOps: Autopiloted jobs account for over 48% of Google’s fleet-wide resource usage.

Autopilot’s dynamic limits could help to keep the job running despite bugs.

Autopilot: workload autoscaling at Google
1. Efficient scheduling requires fine-grained control of jobs’ limits.
2. Humans are bad at setting the limits precisely.
3. Autopilot uses past usage to drive future limits.
4. Autopilot reduces relative slack by 2x, and it reduces the number of jobs severely impacted by OOMs by 10x.
