Kube-Knots: Resource Harvesting through Dynamic Container Orchestration in GPU-based Datacenters
Prashanth Thinakaran, Jashwant Raj Gunasekaran, Bikash Sharma, Chita Das, Mahmut Kandemir
September 25th, IEEE CLUSTER’19
Motivation
[Figure: growth of compute used for AI training over time, across the Pre-GPU, GPU Training, and Sub-PF / Algorithmic Parallelism & TPUs eras¹]
1 https://openai.com/blog/ai-and-compute/
Motivation
• Compute demands for DNN training keep increasing, yet most of the community's contribution has gone into improving accuracy rather than resource efficiency!
• Modern GPGPUs bridge the compute gap (~10 TFLOPS).
• GPU utilization efficiency is only 33%.
• Kube-Knots focuses on Green AI (efficiency) instead of Red AI (accuracy).
1 https://openai.com/blog/ai-and-compute/ 2 Schwartz, Roy, et al. "Green AI." arXiv preprint arXiv:1907.10597 (2019)
Outline • Need for GPU resource harvesting • Cluster workload setup • Kube-Knots architecture • Correlation Based Provisioning and Peak Prediction • Results - Real system & Scalability study • Conclusion 4
Energy Proportionality 5
Need for GPU bin-packing • CPUs operate at peak efficiency for average load cases • GPUs have linear performance per watt scaling • Crucial to pack and use GPUs at 100% Utilization • A real data-center scenario! 6
Alibaba: Study of Over-commitment • Average CPU Utilization ~ 47% • Average Mem Utilization ~ 76% • Half of the scheduled containers consume < 45% of memory • Containers are provisioned for peak utilization in datacenters • Under-utilization epidemic! 7
Harvesting spare compute and memory Under-utilization calls for resource harvesting at the cluster scheduler level 8
CPUs vs GPUs
• CPUs have mature Docker/hypervisor layers for efficient resource management; enforcing bin-packing is the known solution.
• GPUs have limited support for virtualization.
• Context-switch overheads (VIPT vs. VIVT).
• Agnostic scheduling leads to QoS violations.
• Energy-proportional scheduling calls for a novel approach.
Workload heterogeneity
• Two different types of workloads in GPU-based datacenters:
• Batch workloads: HPC, DL training, etc. — long running, typically hours to days.
• Latency-sensitive workloads: DL inference, etc. — short-lived, milliseconds to a few seconds.
How to Harvest Spare Cycles
• We can conservatively provision for the average-case utilization only — roughly 80% of what the pod asks for.
• But when peaks do arrive, how do we resize the pods back up?
• Are there early markers that tell us when spare cycles can be harvested?
Correlation of resource metrics: Alibaba
• Latency-sensitive workloads: resource-utilization metrics are tightly correlated.
• Batch/long-running workloads: no metric gives a solid lead, but the load is predictable over time.
Opportunities for harvesting in batch
• Phase changes are predictable.
• I/O peaks are succeeded by memory peaks — an early marker we can exploit (see the sketch below).
• Average consumption is low compared to the peaks.
• Provisioning for peak leads to over-commitment.
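A toy sketch (synthetic traces, not paper data) of how the "I/O peaks precede memory peaks" observation could be quantified as an early marker, using a lagged Pearson correlation:

```python
import numpy as np

def lead_time(io_trace, mem_trace, max_lag=5):
    """Return the lag (in samples) at which the I/O trace best predicts the
    memory trace, using a simple lagged Pearson correlation."""
    io = np.asarray(io_trace, dtype=float)
    mem = np.asarray(mem_trace, dtype=float)
    best_lag, best_r = 0, -1.0
    for lag in range(1, max_lag + 1):
        r = np.corrcoef(io[:-lag], mem[lag:])[0, 1]   # I/O now vs. memory later
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

# Hypothetical batch-job traces: memory peaks follow I/O peaks a few samples later.
io  = [5, 80, 10, 5, 5, 5, 75, 10, 5, 5, 5, 5, 70, 5, 5]
mem = [20, 20, 20, 20, 85, 25, 20, 20, 20, 80, 25, 20, 20, 20, 88]
lag, r = lead_time(io, mem)
print(f"memory peaks lag I/O peaks by ~{lag} samples (r={r:.2f})")
```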
TensorFlow Inference on GPUs
[Figure: % of GPU memory used vs. inference batch size (1–128) for the TF services face, imc, key, ner, pos, and chk]
TensorFlow Inference on GPUs
• Inference queries are latency-sensitive (~200 ms).
• A single query consumes < 10% of the GPU; with batching this can be pushed up to ~30%.
• When the models are run inside TF, the GPU memory usually cannot be harvested (see the sketch below).
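A hedged sketch, not from the talk, of how an inference pod could keep TensorFlow from claiming the whole GPU so the remaining memory stays harvestable; it assumes TensorFlow 2.x, a single visible GPU, and a hypothetical 2 GiB budget:

```python
import tensorflow as tf

# TensorFlow by default maps most of the GPU's memory at start-up, which is why
# the slide notes that the memory "usually cannot be harvested" once an
# inference pod is running.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Grow allocations on demand instead of grabbing everything up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Alternative (mutually exclusive with memory growth): hard-cap the process
    # to a fixed slice of device memory (here an assumed 2 GiB budget).
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=2048)],
    # )

# ...build and serve the inference model as usual...
```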
Outline • Need for GPU resource harvesting • Cluster Workload setup • Kube-Knots architecture • Correlation based Provisioning and Peak Prediction • Results - Real system & Scalability study • Conclusion 16
Cluster-level workload setup
• Eight Rodinia (HPC) GPU applications — batch and long-running tasks.
• Djinn and Tonic suite's DNN inference queries — face recognition, key-point detection, speech recognition.
• We characterize the applications and group them into three bins (App-Mix-1, App-Mix-2, App-Mix-3) by plotting the COV of GPU utilization (see the sketch below):
• COV <= 1: static load with little variation.
• COV > 1: heavy-tailed, highly varying load.
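A small illustrative sketch of the COV binning described above, with made-up utilization traces (the names and numbers are assumptions, not from the paper):

```python
import numpy as np

def cov_bin(gpu_util_trace):
    """Coefficient of variation (std/mean) of a GPU-utilization time series.
    COV <= 1 -> fairly static load; COV > 1 -> heavy-tailed, highly varying load."""
    trace = np.asarray(gpu_util_trace, dtype=float)
    cov = trace.std() / trace.mean()
    return ("static" if cov <= 1.0 else "heavy-tailed"), cov

# Hypothetical per-application GPU-utilization samples (percent GPU busy):
apps = {
    "backprop": [92, 95, 90, 94, 91],           # steady HPC kernel
    "face-inf": [3, 0, 55, 0, 2, 0, 60, 0, 1],  # bursty inference service
}
for name, trace in apps.items():
    label, cov = cov_bin(trace)
    print(f"{name}: COV={cov:.2f} -> {label}")
```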
Baseline GPU Agnostic Scheduler App-Mix-1 App-Mix-2 App-Mix-3 • Ideal scheduler would strive to improve the GPU utilization in all percentiles. • In case of high COV, the cluster utilization is not stable. • Applications have varying resource needs throughout. • Keeping a GPU cluster busy throughout depends on COV mixes. • GPU Agnostic scheduler leads to QoS violations due to load imbalance. 18
Outline • Need for GPU resource harvesting • Cluster workload setup • Kube-Knots architecture • Correlation Based Provisioning and Peak Prediction • Results - Real system & Scalability study • Conclusion 19
Kube-Knots Design 20
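The design figure is not reproduced here; the conclusion describes Knots as exposing real-time GPU utilization to Kubernetes. A minimal, hypothetical sketch of such a node-level GPU probe, assuming the NVML Python bindings (pynvml) — not the authors' actual implementation:

```python
import time
import pynvml  # NVIDIA Management Library bindings (pip install nvidia-ml-py)

def sample_gpus():
    """Return [(gpu_index, sm_util_percent, mem_used_mb, mem_total_mb), ...]."""
    pynvml.nvmlInit()
    try:
        samples = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu, .memory (%)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # .used, .total (bytes)
            samples.append((i, util.gpu, mem.used // 2**20, mem.total // 2**20))
        return samples
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    # A Knots-like node agent would push these samples to the cluster datastore
    # that the CBP and PP schedulers read from; here we just print them.
    while True:
        print(sample_gpus())
        time.sleep(5)
```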
Outline • Need for GPU resource harvesting • Cluster workload setup • Kube-Knots architecture • Correlation Based Provisioning and Peak Prediction • Results - Real system & Scalability study • Conclusion 21
Correlation Based Provisioning (CBP)
• Correlation between utilization metrics is considered for application placement.
• Two pods whose memory usage is positively correlated are not co-located on the same GPU (see the sketch below).
• Pods are always resized for average utilization, not peak utilization.
• GPUs are still under-utilized due to static provisioning.
• QoS violations arise from pending pods, since most of them contend for the same resource (positive correlation).
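A hedged sketch of the CBP co-location test described above; the correlation threshold and the utilization traces are illustrative assumptions, not values from the paper:

```python
import numpy as np

CORR_THRESHOLD = 0.0  # assumed cutoff: any positive correlation blocks co-location

def can_colocate(mem_trace_a, mem_trace_b, threshold=CORR_THRESHOLD):
    """CBP-style check: refuse to pack two pods on one GPU if their memory
    utilization histories are positively correlated (they would peak together)."""
    r = np.corrcoef(mem_trace_a, mem_trace_b)[0, 1]  # Pearson correlation
    return r <= threshold, r

# Hypothetical utilization histories (percent of GPU memory):
batch_pod     = [10, 20, 40, 70, 40, 20, 10]
inference_pod = [60, 40, 20, 5, 20, 40, 60]  # peaks while the batch pod is quiet
ok, r = can_colocate(batch_pod, inference_pod)
print(f"corr={r:.2f}, co-locate={ok}")
```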
Peak Prediction (PP) Scheduler
• PP allows two positively correlating pods to be placed on the same GPU.
• PP is built on the first principle that resource peaks do not happen at the same time for all co-located apps.
• PP uses ARIMA to predict peak utilization and resize the pods accordingly.
• The autocorrelation function predicts the subsequent resource-demand trends:
  r_k = Σ_{t=1}^{n−k} (y_t − ȳ)(y_{t+k} − ȳ) / Σ_{t=1}^{n} (y_t − ȳ)²
  where n is the total number of events and ȳ is the moving average.
• When the r value is > 0, ARIMA is used to forecast the resource utilization (see the sketch below).
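A minimal sketch of the "check autocorrelation, then forecast the peak" step, assuming statsmodels' ARIMA, a placeholder model order, and a made-up utilization trace (the paper's exact model parameters are not given here):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf

def forecast_peak(util_history, horizon=6, order=(1, 0, 0)):
    """PP-style sketch: if the utilization series is positively autocorrelated,
    fit an ARIMA model and return the predicted peak over the next `horizon`
    samples; otherwise fall back to the observed peak. The ARIMA order is an
    assumed placeholder, not the one used in the paper."""
    y = np.asarray(util_history, dtype=float)
    r1 = acf(y, nlags=1, fft=False)[1]   # lag-1 autocorrelation
    if r1 <= 0:
        return float(y.max())            # no usable trend signal
    fit = ARIMA(y, order=order).fit()
    forecast = fit.forecast(steps=horizon)
    return float(np.max(forecast))

# Hypothetical GPU-memory utilization history (%) for a co-scheduled pod:
history = [20, 25, 30, 42, 55, 60, 58, 47, 35, 28, 26, 33, 45, 57]
print(f"predicted peak over next window: {forecast_peak(history):.1f}%")
```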
Outline • Need for GPU resource harvesting • Cluster workload setup • Kube-Knots Architecture • Correlation Based Provisioning and Peak Prediction • Results - Real System & Scalability Study • Conclusion 24
CBP+PP Utilization Improvements (App-Mix-1, App-Mix-2, App-Mix-3)
• CBP+PP consolidates load effectively under high and medium loads compared to the GPU-agnostic scheduler: 62% improvement in average utilization, and 80% improvement at the median and 99th percentile.
• Under low, sporadic load, CBP+PP effectively consolidates the load onto the active GPUs; GPU nodes 1, 4, 8, and 10 are minimally used, for power efficiency.
GPU Utilization Breakdown (App-Mix-1, App-Mix-2, App-Mix-3)
• CBP+PP consistently improved utilization in all cases — by up to 80% at the median and tail.
• In low-load scenarios the scope for improvement is small, yet CBP+PP still improved the average case.
Power & QoS Improvements
• Res-Ag consumes the least power on average (~33% savings), but violates QoS for 53% of the requests.
• PP consumes 10% more power than Res-Ag while ensuring QoS for almost 100% of the requests.
• CBP+PP can ensure QoS by predicting the GPU resource peaks; the additional power savings come from consolidating work onto active GPUs.
Scalability of CBP+PP for DL workloads
• Deep-learning training (DLT) and inference (DLI) workload mixes.
• 60% faster median JCT compared to DL-aware schedulers.
• 30% better than Gandiva; 11% better than Tiresias.
• QoS guarantees for DLI in the presence of DLT.
• Reduced QoS violations thanks to GPU-utilization-aware placement.
Conclusion
• Need for resource harvesting in GPU datacenters.
• Knots exposes real-time GPU utilization to Kubernetes.
• The CBP+PP scheduler improved GPU utilization by up to 80% for both average and tail-case utilization.
• QoS-aware workload consolidation led to 33% energy savings.
• Trace-driven scalability experiments show that Kube-Knots performs 36% better in terms of JCT compared to DLT schedulers.
• Kube-Knots also reduced the overall QoS violations by up to 53%.
prashanth@psu.edu http://www.cse.psu.edu/hpcl/index.html “Workload Setup Docker TensorFlow / HPC experiments used in evaluation of kube-knots,” https://hub.docker.com/r/prashanth5192/gpu September 25th, IEEE CLUSTER’19
Backup-1: Cluster Status COV
• COV of loads across the different GPUs falls in the 0–0.2 range, effectively reduced from 0.1–0.7.
• PP performs load balancing even in high-load scenarios.
• PP also harvests and consolidates under low load by keeping idle GPUs in P-state 12.
Difference Table
• Uniform (Kubernetes default scheduler): GPUs cannot be shared — low PPW and no QoS guarantees.
• Resource-Agnostic Sharing: First-Fit-Decreasing bin-packing — high PPW, but poor QoS and high queueing delays.
• Correlation Based Provisioning: utilization-metrics-based bin-packing — high PPW, assured QoS, but high queueing delays due to affinity constraints.
• Peak Prediction: predicts the resource peaks of co-scheduled apps via the autocorrelation factor — high PPW and assured QoS guarantees.