Cluster management at Google 2015-02 john wilkes

Cluster management at Google 2015-02 john wilkes / johnwilkes@google.com Principal Software Engineer

For the past 15 years, Google has been building out the world’s fastest, most powerful, highest quality cloud infrastructure on the planet. Images by Connie Zhou

Hello World

job hello_world = {
  runtime = { cell = 'ic' }             // What cluster should we run in?
  binary = '.../hello_world_webserver'  // What program are we to run?
  args = { port = '%port%' }            // Command line parameters
  requirements = {                      // Resource requirements
    ram = 100M
    disk = 100M
    cpu = 0.1
  }
  replicas = 5                          // Number of tasks (slide annotation: 10000)
}

Hello World

> borgcfg .../hello_world_webserver.borg up ...
About to affect 10000 tasks and 1 packages on cell IC.
Do you wish to continue (yes/no) [no]? yes
==== Staging package hello_world_webserver.63ce1b965155c75e/johnwilkes on ic... SUCCESS
==== Making package hello_world_webserver.63ce1b965155c75e/johnwilkes on ic... SUCCESS
==== Starting job hello_world on ic... SUCCESS


Hello World What just happened? [Diagram labels: config file, binary, borgcfg, web browsers; inside the cell: BorgMaster instances with UI shards, a read/UI shard and link shards, a persistent store (Paxos), a scheduler, and Borglets]

Hello World Images by Connie Zhou


Failures task-eviction rates and causes

A 2000-machine service will have >10 machine crashes per day. • DRAM errors (1% AFR) • Disk failures (2-10% AFR) • Machine crashes (~2/year) • OS upgrades (2-6/year) This is normal; not a problem. Images by Connie Zhou
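
The ">10 machine crashes per day" figure follows directly from the per-machine rates on the slide. A minimal sketch of that arithmetic, assuming the rates are per-machine yearly averages and a 365-day year (the function name is illustrative, not from the talk):

```python
def expected_events_per_day(machines: int, events_per_machine_year: float) -> float:
    """Expected fleet-wide events per day for a given per-machine yearly rate."""
    return machines * events_per_machine_year / 365.0


FLEET = 2000  # machines in the service, from the slide

# Rates from the slide; an AFR of 1% is 0.01 failures per machine per year.
crashes_per_day = expected_events_per_day(FLEET, 2.0)       # ~2 crashes/machine/year
dram_errors_per_day = expected_events_per_day(FLEET, 0.01)  # 1% AFR
disk_failures_per_day = (
    expected_events_per_day(FLEET, 0.02),                   # 2% AFR (low end)
    expected_events_per_day(FLEET, 0.10),                   # 10% AFR (high end)
)
os_upgrades_per_day = (
    expected_events_per_day(FLEET, 2.0),                    # 2 upgrades/year
    expected_events_per_day(FLEET, 6.0),                    # 6 upgrades/year
)

print(f"machine crashes per day: {crashes_per_day:.1f}")
print(f"OS upgrades per day: {os_upgrades_per_day[0]:.1f} to {os_upgrades_per_day[1]:.1f}")
```

At 2000 machines and about 2 crashes per machine per year, that is roughly 4000 crashes a year, or just under 11 a day.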

Efficiency Advanced bin-packing algorithms. Experimental placement of production VM workload, July 2014

Efficiency Advanced bin-packing algorithms. There are no obvious bucket sizes (cf. cloud VMs). [Chart annotations: nice round numbers; gaming the system]
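
The slides do not say which bin-packing algorithms are used. As a point of reference only, here is a sketch of the classic first-fit-decreasing heuristic over two resources (CPU and RAM). It is not Borg's scheduler; the task sizes and machine capacity are made-up values, chosen to be irregular because there are no obvious bucket sizes.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    free_cpu_milli: int                 # free CPU, in thousandths of a core
    free_ram_mb: int                    # free RAM, in MiB
    tasks: list = field(default_factory=list)

    def fits(self, cpu_milli: int, ram_mb: int) -> bool:
        return cpu_milli <= self.free_cpu_milli and ram_mb <= self.free_ram_mb

    def place(self, name: str, cpu_milli: int, ram_mb: int) -> None:
        self.free_cpu_milli -= cpu_milli
        self.free_ram_mb -= ram_mb
        self.tasks.append(name)


def first_fit_decreasing(tasks, machine_cpu_milli, machine_ram_mb):
    """Place (name, cpu_milli, ram_mb) tasks on identical machines, largest first."""
    machines = []
    # Rank tasks by their dominant share of one machine, biggest first.
    ordered = sorted(
        tasks,
        key=lambda t: max(t[1] / machine_cpu_milli, t[2] / machine_ram_mb),
        reverse=True,
    )
    for name, cpu_milli, ram_mb in ordered:
        if cpu_milli > machine_cpu_milli or ram_mb > machine_ram_mb:
            raise ValueError(f"task {name} does not fit on an empty machine")
        # First machine with enough free CPU and RAM, else open a new one.
        target = next((m for m in machines if m.fits(cpu_milli, ram_mb)), None)
        if target is None:
            target = Machine(machine_cpu_milli, machine_ram_mb)
            machines.append(target)
        target.place(name, cpu_milli, ram_mb)
    return machines


# Hypothetical workload: (name, CPU in millicores, RAM in MiB).
tasks = [("a", 100, 100), ("b", 2500, 900), ("c", 1300, 3100),
         ("d", 700, 250), ("e", 3200, 1500)]
machines = first_fit_decreasing(tasks, machine_cpu_milli=4000, machine_ram_mb=4096)
for index, machine in enumerate(machines):
    print(index, machine.tasks, machine.free_cpu_milli, machine.free_ram_mb)
```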

Efficiency Advanced bin-packing algorithms. Heterogeneous workloads, May 2011. Omega paper, EuroSys 2013. [Chart: CDF of job runtime (log scale) for batch jobs and service jobs]

Efficiency Utilization: sharing clusters between prod/batch helps

Efficiency Advanced bin-packing algorithms. Data from a cluster with 12k machines, May 2011. Trace is publicly available. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. SoCC’12

Efficiency Resource reclamation could be more aggressive. Nov/Dec 2013

Efficiency Multiple applications per machine. CPI^2 paper, EuroSys 2013. [Chart panels: tasks/machine; threads/machine]

Efficiency Multiple applications per machine. CPI^2 paper, EuroSys 2013. 1. Gather CPI for all the tasks in a job 2. Find outliers 3. Take action. Outliers => victims. [Chart: distribution of task CPI, marked at μ, μ + σ, μ + 2σ, μ + 3σ]
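
Steps 1 and 2 can be sketched in a few lines. This is only an illustration: the slide marks μ through μ + 3σ but does not say which cutoff is used, so the 2σ threshold, the function name, and the sample CPI values are all assumptions. Step 3 is left out because the slide does not say what action is taken.

```python
from statistics import mean, pstdev


def find_cpi_outliers(cpi_by_task: dict, num_sigmas: float = 2.0) -> dict:
    """Return the tasks whose CPI exceeds mean + num_sigmas * stddev for the job."""
    values = list(cpi_by_task.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation over the job's tasks
    threshold = mu + num_sigmas * sigma
    return {task: cpi for task, cpi in cpi_by_task.items() if cpi > threshold}


# Step 1: CPI (cycles per instruction) gathered for every task in one job.
# These numbers are invented for the example.
job_cpi = {
    "task-0": 1.00, "task-1": 1.10, "task-2": 0.90, "task-3": 1.00, "task-4": 1.05,
    "task-5": 0.95, "task-6": 1.00, "task-7": 1.10, "task-8": 0.90, "task-9": 3.00,
}

# Step 2: tasks far above the job's typical CPI are the likely victims.
victims = find_cpi_outliers(job_cpi)
print(sorted(victims))
```

With only ten samples a single outlier can sit at most about 2.85σ above the mean, so a 3σ cutoff would never fire here; real jobs with many more tasks do not have that limitation.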

Achieving desired behavior Exposing mechanisms is fragile Better: declarative intents

Achieving desired behavior Service level objective (SLO) Examples: • availability • obtainability • reliability • velocity • freshness? • accuracy? • security?
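
One way to read "declarative intents" is that the user states a target and the system is responsible for checking and meeting it. A minimal sketch, using availability from the list above; the `Slo` type, the 99.9% target, and the request counts are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float  # minimum acceptable fraction of good events, e.g. 0.999


def meets_slo(slo: Slo, good_events: int, total_events: int) -> bool:
    """True if the observed fraction of good events is at or above the target."""
    if total_events == 0:
        return True  # nothing was attempted, so nothing was violated
    return good_events / total_events >= slo.target


availability = Slo(name="availability", target=0.999)

print(meets_slo(availability, good_events=9_995, total_events=10_000))
print(meets_slo(availability, good_events=9_980, total_events=10_000))
```

The caller states only the objective; how the system reacts when the check fails is a separate mechanism that can change without touching the declaration.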

A few other moving parts [Diagram labels: config file, borgcfg, web browsers; inside the cell: BorgMaster instances with UI shards, a read/UI shard and link shards, a persistent store (Paxos), a scheduler, and Borglets]

A few other moving parts [Diagram, built up one component per slide: master, job config, agent, app; then storage; system config; monitoring; binaries + data distribution; security; accounting/planning (labeled accounting/billing on the last slide)] Diagram from an original by Cody Smith.

Containers Everything at Google runs in a container -- including our VMs Containers give us: • resource isolation • execution isolation • CPU QoS We start over 2 billion containers per week. Image: "Container" glynlowe CC-BY-2.0 https://www.flickr.com/photos/glynlowe/10921733615

Kubernetes κυβερνήτης: Greek for “pilot” or “helmsman of a ship”. The open source cluster manager from Google. [Diagram: four machines]

Kubernetes [Diagram: seven hosts (machines), each running an agent and a container; example workloads: web server, log roller]

Pods [Diagram: web server and log roller; the Kubernetes master/scheduler above seven hosts (machines), each running an agent and a container]

Labels [Diagram: containers tagged FE or BE; the Kubernetes master/scheduler above seven hosts (machines), each running an agent and a container]

Label selectors

labels:
  role: frontend

[Diagram: containers tagged FE or BE; the Kubernetes master/scheduler above seven hosts (machines), each running an agent and a container]

Label selectors

labels:
  role: frontend
  stage: production

[Diagram: containers tagged FE or BE; the Kubernetes master/scheduler above seven hosts (machines), each running an agent and a container]
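
An equality-based label selector matches an object when every key/value pair in the selector appears in the object's labels. A small sketch of that rule; the pod names and the `backend` and `test` label values are invented for illustration.

```python
def matches(selector: dict, labels: dict) -> bool:
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(selector: dict, pods: dict) -> list:
    """Names of the pods whose labels satisfy the selector, in sorted order."""
    return sorted(name for name, labels in pods.items() if matches(selector, labels))


# Hypothetical pods and their labels.
pods = {
    "fe-1": {"role": "frontend", "stage": "production"},
    "fe-2": {"role": "frontend", "stage": "test"},
    "be-1": {"role": "backend", "stage": "production"},
}

print(select({"role": "frontend"}, pods))
print(select({"role": "frontend", "stage": "production"}, pods))
```

Adding a second label to the selector narrows the match, which is the step the two "Label selectors" slides illustrate.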

Replica controller

replicas: 3
template: ...
labels:
  role: frontend

[Diagram: three FE containers; the Kubernetes master/scheduler above seven hosts (machines), each running an agent and a container]

Replica controller

replicas: 4
template: ...
labels:
  role: frontend

[Diagram: four FE containers; the Kubernetes master/scheduler above seven hosts (machines), each running an agent and a container]
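
Raising `replicas` from 3 to 4 changes only the declared target; the controller then closes the gap between the target and what is running. A simplified sketch of one reconciliation pass, assuming pods are just a name-to-labels mapping and new pods take the selector's labels. The function and pod names are illustrative, not the Kubernetes implementation.

```python
import itertools


def reconcile(desired_replicas: int, selector: dict, pods: dict, new_names) -> dict:
    """Return the pod set after one pass that drives matching pods to the target."""
    matching = sorted(
        name for name, labels in pods.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )
    result = dict(pods)
    if len(matching) < desired_replicas:
        # Too few: start pods from the template until the target is met.
        for _ in range(desired_replicas - len(matching)):
            result[next(new_names)] = dict(selector)
    else:
        # Too many (or exactly enough): stop the surplus, if any.
        for name in matching[desired_replicas:]:
            del result[name]
    return result


selector = {"role": "frontend"}
running = {
    "fe-1": {"role": "frontend"},
    "fe-2": {"role": "frontend"},
    "fe-3": {"role": "frontend"},
    "be-1": {"role": "backend"},   # not selected, so never touched
}
names = (f"fe-{i}" for i in itertools.count(4))  # hypothetical name generator

scaled_up = reconcile(4, selector, running, names)
scaled_down = reconcile(2, selector, scaled_up, names)
print(sorted(scaled_up))
print(sorted(scaled_down))
```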

Service

id: frontend-service
port: 9000
labels:
  role: frontend

[Diagram: frontend-service in front of four FE containers; the Kubernetes master/scheduler above seven hosts (machines), each running an agent and a container]
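
A service gives clients one stable name and port, and forwards to whichever pods currently carry the selected labels. The sketch below shows that idea with a plain round-robin choice; the pod addresses and the round-robin policy are assumptions made for the example, and only the id, port, and label come from the slide.

```python
from itertools import cycle


def endpoints(selector: dict, pods: dict) -> list:
    """Addresses of pods whose labels contain every selector pair, sorted by pod name."""
    return [
        pods[name]["address"]
        for name in sorted(pods)
        if all(pods[name]["labels"].get(k) == v for k, v in selector.items())
    ]


service = {"id": "frontend-service", "port": 9000, "labels": {"role": "frontend"}}

# Hypothetical pods; the addresses are placeholders.
pods = {
    "fe-1": {"labels": {"role": "frontend"}, "address": "10.0.0.1"},
    "fe-2": {"labels": {"role": "frontend"}, "address": "10.0.0.2"},
    "be-1": {"labels": {"role": "backend"}, "address": "10.0.0.3"},
}

backends = endpoints(service["labels"], pods)
next_backend = cycle(backends)

# Three requests to frontend-service:9000 rotate across the matching pods.
routed = [next(next_backend) for _ in range(3)]
print(routed)
```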

Kubernetes The open source cluster manager from Google. ● Pods: groups of containers ● Labels ● Replica controller ● Services http://kubernetes.io

Pulling it all together Do it yourself? Sure. [Chart: resources vs. offered load]

Pulling it all together We choose to go to the roof not because it is glamorous, but because it is right there! ... the bulk of our success is the result of the methodical, relentless, persistent pursuit of 1.3-2x opportunities -- what I have come to call "roofshots". -- Luiz Barroso

Pulling it all together Porsche doesn't make cars: it designs and assembles them. 1H2014: ○ 1.7% (89k) of VW group's vehicles ○ 23% (€1.4b) of its profits. Data: Volkswagen, 2014-07-31. Image: john wilkes

Pulling it all together Cloud system providers are getting better at everything ... • capacity management • monitoring • storage + networking • reliability • software development tooling • ... Wouldn't you like to stand on others' shoulders?

Three rules of thumb: 1. Resiliency is more important than performance. 2. Relax. Let go. Build on what others have done. 3. Do more monitoring. johnwilkes@google.com http://kubernetes.io Images by Connie Zhou
