Resource Management
Paige Calisi, Meghana Yadavalli, B Chase Babrich
Why is Resource Management Important?
● Companies pay for time and resources
● Important to understand workloads
● Traditional big-data analytics workloads differ from DL jobs
● GPUs have become the trend for high-performance computing
● Thousands of parallel floating-point units can be packed into a single chip
● This makes the same task easy to parallelize and optimize
Key Challenges
● Many data analytics frameworks
● No one-size-fits-all solution
● Fairness
● Load balancing
● Fault tolerance
● Scalability
Existing Resource Schedulers
● YARN
  ○ Introduced to relieve Hadoop of resource management and job scheduling
  ○ Takes a job and distributes it among slave nodes
● Mesos
  ○ Resource offers - frameworks with low demand pick first
  ○ Delegates scheduling to the framework - not centralized
● Tetris
  ○ Packs tasks onto machines based on their resource requirements (see the packing sketch below)
  ○ Favors jobs with small resource demands
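To make the Tetris bullet concrete, here is a minimal sketch of alignment-based packing: score each (machine, task) pair by the dot product of the task's demand vector and the machine's free-resource vector, and place the best-scoring task that fits. The resource vectors and numbers are made up for illustration, and the real Tetris scheduler combines this score with other signals (such as remaining work), so treat this as a simplification.

```python
# Sketch of Tetris-style packing (illustrative, not the actual Tetris code).
# Resource vectors are (CPU cores, memory GB, GPUs); values are invented.

def alignment_score(free, demand):
    return sum(f * d for f, d in zip(free, demand))

def fits(free, demand):
    return all(f >= d for f, d in zip(free, demand))

def pack(machines, tasks):
    """Greedily assign tasks to machines by highest alignment score."""
    placements = []
    pending = list(tasks)
    while pending:
        best = None
        for m_id, free in machines.items():
            for task_id, demand in pending:
                if fits(free, demand):
                    score = alignment_score(free, demand)
                    if best is None or score > best[0]:
                        best = (score, m_id, task_id, demand)
        if best is None:          # nothing else fits anywhere
            break
        _, m_id, task_id, demand = best
        machines[m_id] = [f - d for f, d in zip(machines[m_id], demand)]
        placements.append((task_id, m_id))
        pending = [t for t in pending if t[0] != task_id]
    return placements

machines = {"m1": [16, 64, 4], "m2": [8, 32, 2]}        # free (CPU, GB, GPU)
tasks = [("t1", [4, 16, 1]), ("t2", [8, 8, 0]), ("t3", [2, 4, 2])]
print(pack(machines, tasks))
```

Because the score rewards tasks whose demands line up with what a machine has free, small tasks tend to slot into leftover capacity, which matches the "favors jobs with small resource demands" observation.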
Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications
Problem Statement
● GPU utilization for DL is different from traditional big-data analytics workloads
  ○ Hours to weeks vs milliseconds to hours
● Identify the constraints:
  1) GPUs are a monolithic resource that cannot be shared in a fine-grained manner across users
  2) Multi-tenant clusters
  3) With respect to workload, DL frameworks use gang scheduling, which decreases the flexibility of scheduling
  4) Synchronization of parameters -> locality matters
● Identify implications for future schedulers
Project Philly Study: 3 Things
1. Queueing delays
   a. Delay incurred by users waiting for their fair share of resources
   b. Delay from waiting for locality constraints to be met
2. How GPU utilization is affected by placement decisions for distributed training jobs
   a. Distributing individual jobs across servers ignores locality constraints and increases synchronization overheads
   b. Colocation, or packing of different jobs on the same server, leads to contention for shared resources
3. Jobs might fail to complete successfully
   a. Programming errors happen early in the training process
   b. Failures due to cluster components happen later in training
System Overview
● Agnostic to ML framework; covers all supervised learning tasks
● Distributed training across GPUs: aggregate subset training results, perform synchronized updates (see the sketch below)
● Multiple GPUs on a server (PCIe), multiple servers on a rack (RDMA), multiple racks (Ethernet)
● Fair scheduling
● Collect logs from 3 main sources
  ○ YARN scheduler logs
  ○ stdout and stderr
  ○ Ganglia monitoring system
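A minimal sketch of the synchronized update described above, assuming plain data-parallel SGD with NumPy. This is illustrative only: Philly is framework-agnostic, and real frameworks perform the averaging with an all-reduce over PCIe/RDMA rather than a Python loop.

```python
# Each "GPU" computes a gradient on its own minibatch shard; the gradients
# are averaged and every replica applies the same update, keeping the model
# copies in sync. The least-squares objective is a stand-in for a real model.
import numpy as np

def local_gradient(weights, shard):
    """Placeholder per-GPU gradient for a least-squares objective."""
    x, y = shard
    return 2 * x.T @ (x @ weights - y) / len(y)

def synchronized_step(weights, shards, lr=0.01):
    grads = [local_gradient(weights, s) for s in shards]   # one gradient per GPU
    avg_grad = np.mean(grads, axis=0)                       # the aggregation (all-reduce in practice)
    return weights - lr * avg_grad                          # identical update on every replica

rng = np.random.default_rng(0)
w = np.zeros(3)
shards = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(4)]
for _ in range(5):
    w = synchronized_step(w, shards)
print(w)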
Analysis of Queueing Delays
● 2 types of queueing delays (see the sketch below):
  1) Fair-share delay: a VC has used up its GPUs, so jobs wait for GPUs to become available
  2) Fragmentation delay: free GPUs are fragmented, so large jobs would have to be spread across many racks (low locality)
● Jobs that request more GPUs have a higher probability of long queueing delays
● Conclusion: the need for gang scheduling and locality introduces fragmentation delay, so locality constraints sometimes need to be relaxed to mitigate delays
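A toy classifier for the two delay types above, assuming the scheduler knows each VC's quota and current usage; this is a hypothetical helper, not Philly's actual accounting.

```python
def classify_queueing_delay(vc_gpus_in_use, vc_gpu_quota):
    """Fair-share delay: the VC has already used up its GPU quota.
    Fragmentation delay: quota remains, but enough GPUs with acceptable
    locality are not free yet."""
    if vc_gpus_in_use >= vc_gpu_quota:
        return "fair-share delay"
    return "fragmentation delay"

print(classify_queueing_delay(vc_gpus_in_use=32, vc_gpu_quota=32))  # fair-share delay
print(classify_queueing_delay(vc_gpus_in_use=20, vc_gpu_quota=32))  # fragmentation delay
```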
Analysis of GPU Utilization
● GPU utilization is low across all jobs
● Efficiency of allocated GPUs varies with the locality and colocation scenarios that can occur in the cluster
● Observe whether a particular job requires a disproportionate amount of host memory, and isolate the memory used by jobs colocated on the same server
Training Progress and Completion
● Terminated jobs constituted 55% of GPU utilization
● A large fraction of jobs spent time training for longer than necessary
● User error is a big reason for job failure
● Semantic errors increase with a higher number of GPUs because workers need to communicate and synchronize model parameters
Lessons Learned
1) Schedulers should trade queueing delay for adhering to locality constraints
  a) Retry jobs without relaxing locality constraints
2) Aim to isolate jobs on dedicated servers and implement migration for defragmentation to support locality constraints
3) Early failures should be caught on a smaller pool of GPUs before jobs are scheduled on larger clusters (see the sketch below)
  a) Many user errors can be caught without deploying on large clusters
  b) Classify errors and don't retry errors that won't pass (e.g., syntax errors)
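A hypothetical sketch of lesson 3: run a job briefly on a small GPU pool first, and only retry failures that look transient. The error categories and the classify_failure helper are illustrative, not Philly's actual failure taxonomy.

```python
# Assumed: run_on_small_pool and run_on_cluster are caller-supplied functions
# that each return (succeeded, stderr_text) for a job.
NON_RETRYABLE = {"syntax_error", "import_error"}

def classify_failure(stderr_text):
    """Very rough classifier over job stderr; a real system would be richer."""
    if "SyntaxError" in stderr_text:
        return "syntax_error"
    if "ModuleNotFoundError" in stderr_text:
        return "import_error"
    if "NCCL" in stderr_text or "timed out" in stderr_text:
        return "network_timeout"
    return "infrastructure_failure"

def submit(job, run_on_small_pool, run_on_cluster, max_retries=2):
    ok, stderr = run_on_small_pool(job)          # catch user errors cheaply
    if not ok and classify_failure(stderr) in NON_RETRYABLE:
        return "rejected: " + classify_failure(stderr)
    for _ in range(max_retries + 1):
        ok, stderr = run_on_cluster(job)
        if ok:
            return "completed"
        if classify_failure(stderr) in NON_RETRYABLE:
            return "failed: " + classify_failure(stderr)
    return "failed: retries exhausted"
```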
Pros and Cons
Pros:
- Explained different scheduling concerns and gave us a broad understanding of how scheduling decisions affect runtime
- Failure analysis section gives good insight into very easy ways to stop wasting GPU cycles
- Highlights the importance of dynamically checking for loss convergence
Cons:
- Didn't explain much about the role preemption plays in job completion
- Flexible scheduling can lead to more time being spent saving model checkpoints
- Didn't address scalability as an issue
Themis: Fair and Efficient GPU Cluster Scheduling
Themis image taken from https://en.wikipedia.org/wiki/Themis#/media/File:0029MAN-Themis.jpg
Motivation
● Two major problems with other scheduling algorithms:
  ○ Do not account for the long running times of ML tasks
  ○ No attention is paid to the placement of the ML tasks
  ○ Example: DRF
● Alright for big-data scheduling, but not for ML
  ○ Violates Pareto efficiency (https://www.economicshelp.org/blog/glossary/pareto-efficiency/) and envy-freedom
  ○ "Even with existing fair sharing schemes, we do find users frustrated with the inability to get their work done in a timely way..."
● We would like to maximize sharing incentive (SI)
Formalization of Time
● ML App
  ○ One or more training jobs
    ■ Each job has several tasks that process a minibatch of data
● GPU Time (worked example; see the sketch below)
  ■ 10 GPU-minutes per task
  ■ 2 tasks per job: 10*2 = 20 job GPU-minutes
  ■ 2 jobs per app: 10*2*2 = 40 app GPU-minutes
● Heterogeneity across apps
  ○ Analysis of workload traces from a large internet company
● Can be mitigated with LAS
  ○ Least Attained Service
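A tiny sketch of the GPU-time rollup above, using the slide's numbers; the representation of an app as a list of jobs, each a list of per-task GPU times, is assumed for illustration.

```python
def job_gpu_time(task_times):
    return sum(task_times)

def app_gpu_time(jobs):
    return sum(job_gpu_time(tasks) for tasks in jobs)

app = [[10, 10], [10, 10]]           # 2 jobs, each with 2 tasks of 10 GPU-minutes
print(job_gpu_time(app[0]))          # 20 job GPU-minutes
print(app_gpu_time(app))             # 40 app GPU-minutes
```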
Attempts To Pay Attention To Time - Tiresias
● Uses job completion time and GPU usage as measures of service
● Implements a Least Attained Service (LAS) policy (see the ordering sketch below)
  ○ Addresses starvation of jobs and therefore fairness
● Does not encode the GPU placement preferences of jobs
  ○ Treats all GPU configurations as equivalent, regardless of placement
Image taken from http://01greekmythology.blogspot.com/2014/06/teiresias.html
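A rough sketch of LAS ordering, assuming attained service is counted as GPUs times execution time; the job fields here are invented for illustration and this ignores Tiresias' other mechanisms (such as its multi-level queues).

```python
def attained_service(job):
    """GPU service attained so far: number of GPUs times time executed."""
    return job["gpus"] * job["seconds_run"]

def las_order(queue):
    """Jobs with the least attained service are scheduled first."""
    return sorted(queue, key=attained_service)

queue = [
    {"name": "j1", "gpus": 8, "seconds_run": 3600},
    {"name": "j2", "gpus": 1, "seconds_run": 600},
    {"name": "j3", "gpus": 4, "seconds_run": 120},
]
print([j["name"] for j in las_order(queue)])   # ['j3', 'j2', 'j1']
```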
The Importance of Space
● The placement of an app can heavily affect its performance
  ○ Again we see heterogeneity
● LAS and DRF will not achieve efficiency due to these issues
  ○ Instance 1 violates SI
  ○ Instance 2 violates PE and EF
Attempts To Pay Attention To Space - Gandiva
● Squeezes as much power out of GPUs as possible by exploiting the cyclic nature of SGD
  ○ Uses a greedy scheduling policy that continuously optimizes for cluster efficiency
● Master scheduler assigns Docker containers as they become available
  ○ Scheduling policy is built around early feedback and optimizing for efficiency
● Sets the theoretical groundwork for Themis
  ○ "The primary design goal of the Gandiva scheduler is to provide early feedback to jobs"
  ○ "Cluster level fairness is not a design goal in Gandiva"
The Themis Solution
● Presented in two parts
  ○ (1) An auction mechanism that allows apps to bid for resources
    ■ "Partial Allocation auction" incentivizes truth telling
  ○ (2) Two-level scheduling architecture
    ■ Allows for hyper-parameter optimization
Key Ideas for Partial Allocation Auction
● Finish-time fairness: ρ = T_sh / T_id, the finish time in the shared cluster over the finish time with a dedicated 1/N share of the cluster
  ○ SI achieved if ρ ≤ 1 for every app (see the sketch below)
● Requires the app to be able to express a preference for each allocation
  ○ Wider interface between app and allocation engine
● Hidden payment incentivizes truth-telling
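A minimal sketch of finish-time fairness and the SI check, with assumed field names; estimating T_sh and T_id for a real app is the hard part and is not shown here.

```python
def rho(t_shared, t_ideal):
    """Finish-time fairness: finish time in the shared cluster divided by the
    finish time the app would get with a dedicated 1/N share of the cluster."""
    return t_shared / t_ideal

def sharing_incentive_met(apps):
    """SI holds if every app does at least as well as on its own 1/N slice,
    i.e. rho <= 1 for all apps."""
    return all(rho(a["t_shared"], a["t_ideal"]) <= 1 for a in apps)

apps = [
    {"name": "app1", "t_shared": 90, "t_ideal": 100},   # rho = 0.9, happy
    {"name": "app2", "t_shared": 130, "t_ideal": 100},  # rho = 1.3, SI violated
]
print(sharing_incentive_met(apps))   # False
```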
Computation of ρ
● Recall that ρ is calculated for every permutation of the available GPUs
● This process is complicated by the presence of hyper-parameter optimization or early stopping
  ○ In this case, T_sh is calculated differently
● The estimate includes a slowdown factor to account for system overhead, and R_c = the number of GPUs left in the cluster
Multi-Round Auctions
● A single-round auction does not guarantee SI (why?)
● Auctions are triggered by leases ending
● At each round, the 1-f fraction of apps with the greatest ρ values are selected to participate (see the sketch below)
  ○ Why do we do this? What happens as we vary f?
    ■ Fairness vs efficiency
● Resources left over from hidden payments are allocated at random
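A hypothetical sketch of the per-round filtering step described above: pick the 1-f fraction of apps with the largest ρ (the apps furthest from their fair finish time) to take part in this round's auction. The auction itself and the hidden payments are not modeled.

```python
def select_auction_participants(apps_rho, f):
    """apps_rho: dict app -> current rho estimate.
    f in [0, 1) is the knob: large f favors fairness (only the worst-off apps
    bid), small f favors efficiency (more apps bid)."""
    k = max(1, round((1 - f) * len(apps_rho)))
    ranked = sorted(apps_rho, key=apps_rho.get, reverse=True)
    return ranked[:k]

apps_rho = {"a": 1.4, "b": 0.8, "c": 1.1, "d": 0.9}
print(select_auction_participants(apps_rho, f=0.5))   # ['a', 'c']
```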
Themis Scheduling Architecture
● Current architectures cannot support multi-round auctioning
  ○ E.g. Mesos, Omega
  ○ Entirely pessimistic or entirely optimistic
● Themis uses "semi-optimistic" concurrency control
  ○ The top level offers resources optimistically; the bottom level is pessimistic
Widening the API Between the Apps and the Scheduler
● A crucial aspect of Themis' architecture is that an app must be able to see all other resources but only use its own
  ○ Accomplished with the app/agent split
  ○ Agents are able to see all resources, while apps can only use their own resources
● Allows the agent to interact with existing hyper-parameter optimizers
  ○ Introduces an overhead into the app writer's process
    ■ Negligible?