
Armada: Kubernetes multi-cluster batch scheduler


  1. Armada: Kubernetes multi-cluster batch scheduler

  2. Introduction
     G-Research, Compute Platform Engineering Team. We build tools around HTCondor clusters and Kubernetes clusters.

  3. Agenda
     1. Project requirements
     2. Current options
     3. What is Armada
     4. Armada architecture
     5. Armada from user perspective
     6. Demo
     7. Questions

  4. Project requirements
     Experimental project to explore how to use Kubernetes for HPC jobs.
     - Easily handle large queues of jobs (a million or more).
     - Enforce fair share over time.
     - Failure of a few clusters should not cause an outage.
     - Reasonable latency between job submission and job start (on the order of 10 s).
     - Maximize utilization of the clusters.
     - Smart queue instead of scheduler: implement only the minimum logic needed at the global level and let each cluster's scheduler do its own work.
     - All components should be highly available.

  5. Current options
     - Existing schedulers: default scheduler, kube-batch / Volcano, Poseidon (Firmament).
     - All of them focus on scheduling within one cluster.
     - The official limit for one cluster is 5000 nodes, but in practice this scale is difficult to achieve.

  6. What is Armada
     - Armada is a job queuing system for multiple Kubernetes clusters.
     - It maintains fair share over time, similarly to HTCondor.
     - It can handle a large number of queued jobs.
     - It allows clusters to be added to and removed from the system.
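The slides do not spell out how fair share over time is computed. As a rough illustration only, the sketch below uses an HTCondor-style scheme: each queue's usage is decayed exponentially, and shares are inversely proportional to decayed usage multiplied by the queue's priority factor. The names `QueueUsage`, `decay_usage`, `fair_shares`, the one-hour half-life and the 0.5 usage floor are all assumptions made for this example, not Armada's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class QueueUsage:
    name: str
    priority_factor: float  # assumption: larger factor means a smaller share
    usage: float = 0.0      # exponentially decayed resource usage


def decay_usage(queue: QueueUsage, current_use: float, dt_s: float,
                half_life_s: float = 3600.0) -> None:
    """Blend past usage with current usage; older usage counts less."""
    beta = 0.5 ** (dt_s / half_life_s)
    queue.usage = beta * queue.usage + (1.0 - beta) * current_use


def fair_shares(queues: list[QueueUsage], floor: float = 0.5) -> dict[str, float]:
    """Return each queue's share of capacity; all shares sum to 1."""
    # The floor keeps idle queues from getting an infinite share.
    inverse = {
        q.name: 1.0 / (max(q.usage, floor) * q.priority_factor)
        for q in queues
    }
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


idle = QueueUsage("idle-project", priority_factor=1.0)
busy = QueueUsage("busy-project", priority_factor=1.0)
# The busy queue has been using 10 CPUs for one half-life.
decay_usage(busy, current_use=10.0, dt_s=3600.0)
shares = fair_shares([idle, busy])
```

In this sketch a queue that has recently used many resources gets a smaller share until its decayed usage drops again, which is the "over time" part of fair share.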

  7. Armada architecture
     [Diagram. Labels recoverable from the slide:]
     - Armada Server: Submit API; Queue 1, Queue 2, Queue 3; Accounting; Events recording; Jobs database; Events database.
     - Kubernetes Cluster: Armada Executor; Kubernetes API Server.
     - Armada Executor to Armada Server: report usage; get workloads to schedule; report events.
     - Armada Executor to Kubernetes API Server: watch nodes & pods; create or delete pods.
     Notes:
     - Pull-based model.
     - Clusters can be removed or added without disruption.
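To make the pull-based model concrete, here is a minimal in-memory sketch. It assumes that each cluster's executor asks the server for work that fits its free capacity and then reports events back. The classes `Server` and `Executor`, the method names `lease_jobs` and `report_event`, the CPU-only resource model and the round-robin choice between queues are all hypothetical simplifications, not Armada's real API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    cpu: int


class Server:
    """Holds the queues; never pushes work to clusters."""

    def __init__(self) -> None:
        self.queues: dict[str, deque[Job]] = {}
        self.events: list[tuple[str, str, str]] = []

    def submit(self, queue: str, job: Job) -> None:
        self.queues.setdefault(queue, deque()).append(job)

    def lease_jobs(self, cluster_id: str, free_cpu: int) -> list[Job]:
        """Hand out jobs round-robin across queues while they fit."""
        leased: list[Job] = []
        progress = True
        while progress:
            progress = False
            for jobs in self.queues.values():
                if jobs and jobs[0].cpu <= free_cpu:
                    job = jobs.popleft()
                    free_cpu -= job.cpu
                    leased.append(job)
                    progress = True
        return leased

    def report_event(self, cluster_id: str, job_id: str, state: str) -> None:
        self.events.append((cluster_id, job_id, state))


class Executor:
    """Runs inside one cluster and pulls work from the server."""

    def __init__(self, cluster_id: str, server: Server, total_cpu: int) -> None:
        self.cluster_id = cluster_id
        self.server = server
        self.free_cpu = total_cpu
        self.running: list[Job] = []

    def poll_once(self) -> list[Job]:
        jobs = self.server.lease_jobs(self.cluster_id, self.free_cpu)
        for job in jobs:
            self.free_cpu -= job.cpu
            self.running.append(job)  # stands in for creating a pod
            self.server.report_event(self.cluster_id, job.job_id, "running")
        return jobs


server = Server()
for job_id in ("a1", "a2", "a3"):
    server.submit("queue-a", Job(job_id, cpu=1))
server.submit("queue-b", Job("b1", cpu=1))

first = Executor("cluster-1", server, total_cpu=2).poll_once()
# A cluster added later simply starts polling; the server needs no reconfiguration.
second = Executor("cluster-2", server, total_cpu=4).poll_once()
```

Because the server only answers requests, removing a cluster amounts to its executor no longer polling, which is one way to read the slide's claim that clusters can come and go without disruption.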

  8. Armada from user perspective
     - Queue: represents a user or project; used to maintain fair share over time; has a priority factor.
     - Job: unit of work to be run (described as a Kubernetes PodSpec).
     - Job Set: group of related jobs; the API allows observing the progress of a job set as a whole.

         jobs:
           - queue: test
             priority: 1
             jobSetId: my-project
             podSpec:
               ... kubernetes pod spec ...

         $ armadactl submit jobs.yaml
         $ armadactl watch my-project
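As a sketch of what observing a job set as a whole might involve, the example below groups job records by their job set id and counts jobs per state. The `JobRecord` type and the state names (`queued`, `running`, `succeeded`) are hypothetical; the slides do not list the states Armada reports.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class JobRecord:
    job_id: str
    queue: str
    job_set_id: str
    state: str  # hypothetical values: "queued", "running", "succeeded"


def job_set_progress(jobs: list[JobRecord], job_set_id: str) -> Counter:
    """Count the jobs of one job set by state."""
    return Counter(j.state for j in jobs if j.job_set_id == job_set_id)


jobs = [
    JobRecord("job-1", "test", "my-project", "succeeded"),
    JobRecord("job-2", "test", "my-project", "running"),
    JobRecord("job-3", "test", "my-project", "running"),
    JobRecord("job-4", "test", "other-project", "queued"),
]
progress = job_set_progress(jobs, "my-project")
```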

  9. Demo
     Running 5 clusters with 99 worker nodes each (3 control plane nodes).
     - Live: Queues, Worker Clusters
     - Snapshot: Queues, Worker Clusters

  10. Roadmap
      - Finish core functionality (cancellation, job priority change).
      - Improve scheduling (better fairness, handling heterogeneous hardware & jobs, preemption).
      - Better authentication / authorization (JWT support).
      - Improve user tooling (armadactl, potentially a UI).
      - Integration with other tooling (Argo, ...).
      - How to handle k8s namespaces & secrets.
      - GPUs.

  11. Questions?

  12. Thank you.
      Compute Platform Engineering at G-Research: Jan Kaspar, James Murkin, Jamie Poole.
      https://www.gresearch.co.uk
