Kube-Knots: Resource Harvesting through Dynamic Container Orchestration in GPU-based Datacenters
Prashanth Thinakaran, Jashwant Raj Gunasekaran, Bikash Sharma, Chita Das, Mahmut Kandemir
September 25th, IEEE CLUSTER’19
Motivation
[Figure: growth of compute used for AI training over time, across the Pre-GPU, GPU Training, and Sub-PF / Algorithmic Parallelism & TPUs eras¹]
1 https://openai.com/blog/ai-and-compute/
Motivation
• Compute demands for DNN training keep increasing, yet most of the community's contribution has gone into improving accuracy rather than resource efficiency!
• Modern GPGPUs bridge the compute gap (~10 TFLOPS).
• GPU utilization efficiency is only 33%.
• Kube-Knots focuses on Green AI (efficiency) instead of Red AI (accuracy).
1 https://openai.com/blog/ai-and-compute/ 2 Schwartz, Roy, et al. "Green AI." arXiv preprint arXiv:1907.10597 (2019)
Outline • Need for GPU resource harvesting • Cluster workload setup • Kube-Knots architecture • Correlation Based Provisioning and Peak Prediction • Results - Real system & Scalability study • Conclusion 4
Energy Proportionality 5
Need for GPU bin-packing • CPUs operate at peak efficiency for average load cases • GPUs have linear performance per watt scaling • Crucial to pack and use GPUs at 100% Utilization • A real data-center scenario! 6
Alibaba: Study of Over-commitment • Average CPU Utilization ~ 47% • Average Mem Utilization ~ 76% • Half of the scheduled containers consume < 45% of memory • Containers are provisioned for peak utilization in datacenters • Under-utilization epidemic! 7
Harvesting spare compute and memory Under-utilization calls for resource harvesting at the cluster scheduler level 8
CPUs vs GPUs
• CPUs have mature Docker/hypervisor layers for efficient resource management; enforcing bin-packing is the known solution.
• GPUs have limited support for virtualization.
• Context-switch overheads (VIPT vs. VIVT).
• Agnostic scheduling leads to QoS violations.
• Energy-proportional scheduling calls for a novel approach.
Workload heterogeneity
• Two different types of workloads in GPU-based datacenters:
• Batch workloads: HPC, DL training, etc. — long running, typically hours to days.
• Latency-sensitive workloads: DL inference, etc. — short-lived, milliseconds to a few seconds.
How to Harvest Spare Cycles
• We can conservatively provision for the average-case utilization only — roughly 80% of what the pod asks for.
• But when peaks do arrive, how do we resize the pods back up?
• Are there early markers that tell us when spare cycles can be harvested?
Correlation of resource metrics: Alibaba
• Latency-sensitive workloads: resource-utilization metrics are tightly correlated.
• Batch/long-running workloads: no metric gives a solid lead, but the load is predictable over time.
Opportunities for harvesting in batch
• Phase changes are predictable.
• I/O peaks are succeeded by memory peaks — an early marker we can exploit (see the sketch below).
• Average consumption is low compared to the peaks.
• Provisioning for peak leads to over-commitment.
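A toy sketch (synthetic traces, not paper data) of how the "I/O peaks precede memory peaks" observation could be quantified as an early marker, using a lagged Pearson correlation:

```python
import numpy as np

def lead_time(io_trace, mem_trace, max_lag=5):
    """Return the lag (in samples) at which the I/O trace best predicts the
    memory trace, using a simple lagged Pearson correlation."""
    io = np.asarray(io_trace, dtype=float)
    mem = np.asarray(mem_trace, dtype=float)
    best_lag, best_r = 0, -1.0
    for lag in range(1, max_lag + 1):
        r = np.corrcoef(io[:-lag], mem[lag:])[0, 1]   # I/O now vs. memory later
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

# Hypothetical batch-job traces: memory peaks follow I/O peaks a few samples later.
io  = [5, 80, 10, 5, 5, 5, 75, 10, 5, 5, 5, 5, 70, 5, 5]
mem = [20, 20, 20, 20, 85, 25, 20, 20, 20, 80, 25, 20, 20, 20, 88]
lag, r = lead_time(io, mem)
print(f"memory peaks lag I/O peaks by ~{lag} samples (r={r:.2f})")
```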
TensorFlow Inference on GPUs
[Figure: % of GPU memory used vs. inference batch size (1–128) for the TF services face, imc, key, ner, pos, and chk]
TensorFlow Inference on GPUs
• Inference queries are latency-sensitive (~200 ms).
• A single query consumes < 10% of the GPU; with batching this can be pushed up to ~30%.
• When the models are run inside TF, the GPU memory usually cannot be harvested (see the sketch below).
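A hedged sketch, not from the talk, of how an inference pod could keep TensorFlow from claiming the whole GPU so the remaining memory stays harvestable; it assumes TensorFlow 2.x, a single visible GPU, and a hypothetical 2 GiB budget:

```python
import tensorflow as tf

# TensorFlow by default maps most of the GPU's memory at start-up, which is why
# the slide notes that the memory "usually cannot be harvested" once an
# inference pod is running.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Grow allocations on demand instead of grabbing everything up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Alternative (mutually exclusive with memory growth): hard-cap the process
    # to a fixed slice of device memory (here an assumed 2 GiB budget).
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=2048)],
    # )

# ...build and serve the inference model as usual...
```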
Outline • Need for GPU resource harvesting • Cluster Workload setup • Kube-Knots architecture • Correlation based Provisioning and Peak Prediction • Results - Real system & Scalability study • Conclusion 16
Cluster-level workload setup
• Eight Rodinia (HPC) GPU applications — batch and long-running tasks.
• Djinn and Tonic suite's DNN inference queries — face recognition, key-point detection, speech recognition.
• We characterize the applications and group them into three bins (App-Mix-1, App-Mix-2, App-Mix-3) by plotting the COV of GPU utilization (see the sketch below):
• COV <= 1: static load with little variation.
• COV > 1: heavy-tailed, highly varying load.
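A small illustrative sketch of the COV binning described above, with made-up utilization traces (the names and numbers are assumptions, not from the paper):

```python
import numpy as np

def cov_bin(gpu_util_trace):
    """Coefficient of variation (std/mean) of a GPU-utilization time series.
    COV <= 1 -> fairly static load; COV > 1 -> heavy-tailed, highly varying load."""
    trace = np.asarray(gpu_util_trace, dtype=float)
    cov = trace.std() / trace.mean()
    return ("static" if cov <= 1.0 else "heavy-tailed"), cov

# Hypothetical per-application GPU-utilization samples (percent GPU busy):
apps = {
    "backprop": [92, 95, 90, 94, 91],           # steady HPC kernel
    "face-inf": [3, 0, 55, 0, 2, 0, 60, 0, 1],  # bursty inference service
}
for name, trace in apps.items():
    label, cov = cov_bin(trace)
    print(f"{name}: COV={cov:.2f} -> {label}")
```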
Baseline GPU Agnostic Scheduler App-Mix-1 App-Mix-2 App-Mix-3 • Ideal scheduler would strive to improve the GPU utilization in all percentiles. • In case of high COV, the cluster utilization is not stable. • Applications have varying resource needs throughout. • Keeping a GPU cluster busy throughout depends on COV mixes. • GPU Agnostic scheduler leads to QoS violations due to load imbalance. 18
Outline • Need for GPU resource harvesting • Cluster workload setup • Kube-Knots architecture • Correlation Based Provisioning and Peak Prediction • Results - Real system & Scalability study • Conclusion 19
Kube-Knots Design 20
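The design figure is not reproduced here; the conclusion describes Knots as exposing real-time GPU utilization to Kubernetes. A minimal, hypothetical sketch of such a node-level GPU probe, assuming the NVML Python bindings (pynvml) — not the authors' actual implementation:

```python
import time
import pynvml  # NVIDIA Management Library bindings (pip install nvidia-ml-py)

def sample_gpus():
    """Return [(gpu_index, sm_util_percent, mem_used_mb, mem_total_mb), ...]."""
    pynvml.nvmlInit()
    try:
        samples = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu, .memory (%)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # .used, .total (bytes)
            samples.append((i, util.gpu, mem.used // 2**20, mem.total // 2**20))
        return samples
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    # A Knots-like node agent would push these samples to the cluster datastore
    # that the CBP and PP schedulers read from; here we just print them.
    while True:
        print(sample_gpus())
        time.sleep(5)
```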
Outline • Need for GPU resource harvesting • Cluster workload setup • Kube-Knots architecture • Correlation Based Provisioning and Peak Prediction • Results - Real system & Scalability study • Conclusion 21
Correlation Based Provisioning (CBP)
• Correlation between utilization metrics is considered for application placement.
• Two pods whose memory usage is positively correlated are not co-located on the same GPU (see the sketch below).
• Pods are always resized for average utilization, not peak utilization.
• GPUs are still under-utilized due to static provisioning.
• QoS violations arise from pending pods, since most of them contend for the same resource (positive correlation).
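A hedged sketch of the CBP co-location test described above; the correlation threshold and the utilization traces are illustrative assumptions, not values from the paper:

```python
import numpy as np

CORR_THRESHOLD = 0.0  # assumed cutoff: any positive correlation blocks co-location

def can_colocate(mem_trace_a, mem_trace_b, threshold=CORR_THRESHOLD):
    """CBP-style check: refuse to pack two pods on one GPU if their memory
    utilization histories are positively correlated (they would peak together)."""
    r = np.corrcoef(mem_trace_a, mem_trace_b)[0, 1]  # Pearson correlation
    return r <= threshold, r

# Hypothetical utilization histories (percent of GPU memory):
batch_pod     = [10, 20, 40, 70, 40, 20, 10]
inference_pod = [60, 40, 20, 5, 20, 40, 60]  # peaks while the batch pod is quiet
ok, r = can_colocate(batch_pod, inference_pod)
print(f"corr={r:.2f}, co-locate={ok}")
```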
Peak Prediction (PP) Scheduler
• PP allows two positively correlating pods to be placed on the same GPU.
• PP is built on the first principle that resource peaks do not happen at the same time for all co-located apps.
• PP uses ARIMA to predict peak utilization and resize the pods accordingly.
• The autocorrelation function predicts the subsequent resource-demand trends:
  r_k = Σ_{t=1}^{n−k} (y_t − ȳ)(y_{t+k} − ȳ) / Σ_{t=1}^{n} (y_t − ȳ)²
  where n is the total number of events and ȳ is the moving average.
• When the r value is > 0, ARIMA is used to forecast the resource utilization (see the sketch below).
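A minimal sketch of the "check autocorrelation, then forecast the peak" step, assuming statsmodels' ARIMA, a placeholder model order, and a made-up utilization trace (the paper's exact model parameters are not given here):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf

def forecast_peak(util_history, horizon=6, order=(1, 0, 0)):
    """PP-style sketch: if the utilization series is positively autocorrelated,
    fit an ARIMA model and return the predicted peak over the next `horizon`
    samples; otherwise fall back to the observed peak. The ARIMA order is an
    assumed placeholder, not the one used in the paper."""
    y = np.asarray(util_history, dtype=float)
    r1 = acf(y, nlags=1, fft=False)[1]   # lag-1 autocorrelation
    if r1 <= 0:
        return float(y.max())            # no usable trend signal
    fit = ARIMA(y, order=order).fit()
    forecast = fit.forecast(steps=horizon)
    return float(np.max(forecast))

# Hypothetical GPU-memory utilization history (%) for a co-scheduled pod:
history = [20, 25, 30, 42, 55, 60, 58, 47, 35, 28, 26, 33, 45, 57]
print(f"predicted peak over next window: {forecast_peak(history):.1f}%")
```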
Outline • Need for GPU resource harvesting • Cluster workload setup • Kube-Knots Architecture • Correlation Based Provisioning and Peak Prediction • Results - Real System & Scalability Study • Conclusion 24
CBP+PP Utilization Improvements (App-Mix-1, App-Mix-2, App-Mix-3)
• CBP+PP consolidates load effectively under high and medium loads compared to the GPU-agnostic scheduler: 62% improvement in average utilization, and 80% improvement at the median and 99th percentile.
• Under low, sporadic load, CBP+PP effectively consolidates the load onto the active GPUs; GPU nodes 1, 4, 8, and 10 are minimally used, for power efficiency.
GPU Utilization Breakdown (App-Mix-1, App-Mix-2, App-Mix-3)
• CBP+PP consistently improved utilization in all cases — by up to 80% at the median and tail.
• In low-load scenarios the scope for improvement is small, yet CBP+PP still improved the average case.
Power & QoS Improvements
• Res-Ag consumes the least power on average (~33% savings), but violates QoS for 53% of the requests.
• PP consumes 10% more power than Res-Ag while ensuring QoS for almost 100% of the requests.
• CBP+PP can ensure QoS by predicting the GPU resource peaks; the additional power savings come from consolidating work onto active GPUs.
Scalability of CBP+PP for DL workloads
• Deep-learning training (DLT) and inference (DLI) workload mixes.
• 60% faster median JCT compared to DL-aware schedulers.
• 30% better than Gandiva; 11% better than Tiresias.
• QoS guarantees for DLI in the presence of DLT.
• Reduced QoS violations thanks to GPU-utilization-aware placement.
Conclusion
• Need for resource harvesting in GPU datacenters.
• Knots exposes real-time GPU utilization to Kubernetes.
• The CBP+PP scheduler improved GPU utilization by up to 80% for both average and tail-case utilization.
• QoS-aware workload consolidation led to 33% energy savings.
• Trace-driven scalability experiments show that Kube-Knots performs 36% better in terms of JCT compared to DLT schedulers.
• Kube-Knots also reduced the overall QoS violations by up to 53%.
prashanth@psu.edu http://www.cse.psu.edu/hpcl/index.html “Workload Setup Docker TensorFlow / HPC experiments used in evaluation of kube-knots,” https://hub.docker.com/r/prashanth5192/gpu September 25th, IEEE CLUSTER’19
Backup-1: Cluster Status COV
• COV of loads across the different GPUs falls in the 0–0.2 range, effectively reduced from 0.1–0.7.
• PP performs load balancing even in high-load scenarios.
• PP also harvests and consolidates under low load by keeping idle GPUs in P-state 12.
Difference Table
• Uniform (Kubernetes default scheduler): GPUs cannot be shared — low PPW and no QoS guarantees.
• Resource-Agnostic Sharing: First-Fit-Decreasing bin-packing — high PPW, but poor QoS and high queueing delays.
• Correlation Based Provisioning: utilization-metrics-based bin-packing — high PPW, assured QoS, but high queueing delays due to affinity constraints.
• Peak Prediction: predicts the resource peaks of co-scheduled apps via the autocorrelation factor — high PPW and assured QoS guarantees.