Resource Management
Paige Calisi, Meghana Yadavalli, B Chase Babrich
Why is Resource Management Important?
● Companies pay for time and resources
● Important to understand workloads
● Traditional big-data analytics workloads differ from DL jobs
● GPUs have become the trend for high-performance computing
● Thousands of parallel floating-point units can be packed into a single chip
● This makes the same task easy to parallelize and optimize
Key Challenges
● Many data analytics frameworks
● No one-size-fits-all solution
● Fairness
● Load balancing
● Fault tolerance
● Scalability
Existing Resource Schedulers
● YARN
  ○ Introduced to relieve Hadoop of resource management and job scheduling
  ○ Takes a job and distributes it among slave nodes
● Mesos
  ○ Resource offers - frameworks with low demand pick first
  ○ Delegates scheduling to the framework - not centralized
● Tetris
  ○ Packs tasks onto machines based on their resource requirements (see the packing sketch below)
  ○ Favors jobs with small resource demands
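To make the Tetris bullet concrete, here is a minimal sketch of alignment-based packing: score each (machine, task) pair by the dot product of the task's demand vector and the machine's free-resource vector, and place the best-scoring task that fits. The resource vectors and numbers are made up for illustration, and the real Tetris scheduler combines this score with other signals (such as remaining work), so treat this as a simplification.

```python
# Sketch of Tetris-style packing (illustrative, not the actual Tetris code).
# Resource vectors are (CPU cores, memory GB, GPUs); values are invented.

def alignment_score(free, demand):
    return sum(f * d for f, d in zip(free, demand))

def fits(free, demand):
    return all(f >= d for f, d in zip(free, demand))

def pack(machines, tasks):
    """Greedily assign tasks to machines by highest alignment score."""
    placements = []
    pending = list(tasks)
    while pending:
        best = None
        for m_id, free in machines.items():
            for task_id, demand in pending:
                if fits(free, demand):
                    score = alignment_score(free, demand)
                    if best is None or score > best[0]:
                        best = (score, m_id, task_id, demand)
        if best is None:          # nothing else fits anywhere
            break
        _, m_id, task_id, demand = best
        machines[m_id] = [f - d for f, d in zip(machines[m_id], demand)]
        placements.append((task_id, m_id))
        pending = [t for t in pending if t[0] != task_id]
    return placements

machines = {"m1": [16, 64, 4], "m2": [8, 32, 2]}        # free (CPU, GB, GPU)
tasks = [("t1", [4, 16, 1]), ("t2", [8, 8, 0]), ("t3", [2, 4, 2])]
print(pack(machines, tasks))
```

Because the score rewards tasks whose demands line up with what a machine has free, small tasks tend to slot into leftover capacity, which matches the "favors jobs with small resource demands" observation.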
Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications
Problem Statement
● GPU utilization for DL is different from traditional big-data analytics workloads
  ○ Hours to weeks vs milliseconds to hours
● Identify the constraints:
  1) GPUs are a monolithic resource that cannot be shared in a fine-grained manner across users
  2) Multi-tenant clusters
  3) With respect to workload, DL frameworks use gang scheduling, which decreases the flexibility of scheduling
  4) Synchronization of parameters -> locality matters
● Identify implications for future schedulers
Project Philly Study: 3 Things
1. Queueing delays
   a. Delay incurred by users waiting for their fair share of resources
   b. Delay from waiting for locality constraints to be met
2. How GPU utilization is affected by placement decisions for distributed training jobs
   a. Distributing individual jobs across servers ignores locality constraints and increases synchronization overheads
   b. Colocation, or packing of different jobs on the same server, leads to contention for shared resources
3. Jobs might fail to complete successfully
   a. Programming errors happen early in the training process
   b. Failures due to cluster components happen later in training
System Overview
● Agnostic to ML framework; covers all supervised learning tasks
● Distributed training across GPUs: aggregate subset training results, perform synchronized updates (see the sketch below)
● Multiple GPUs on a server (PCIe), multiple servers on a rack (RDMA), multiple racks (Ethernet)
● Fair scheduling
● Collect logs from 3 main sources
  ○ YARN scheduler logs
  ○ stdout and stderr
  ○ Ganglia monitoring system
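A minimal sketch of the synchronized update described above, assuming plain data-parallel SGD with NumPy. This is illustrative only: Philly is framework-agnostic, and real frameworks perform the averaging with an all-reduce over PCIe/RDMA rather than a Python loop.

```python
# Each "GPU" computes a gradient on its own minibatch shard; the gradients
# are averaged and every replica applies the same update, keeping the model
# copies in sync. The least-squares objective is a stand-in for a real model.
import numpy as np

def local_gradient(weights, shard):
    """Placeholder per-GPU gradient for a least-squares objective."""
    x, y = shard
    return 2 * x.T @ (x @ weights - y) / len(y)

def synchronized_step(weights, shards, lr=0.01):
    grads = [local_gradient(weights, s) for s in shards]   # one gradient per GPU
    avg_grad = np.mean(grads, axis=0)                       # the aggregation (all-reduce in practice)
    return weights - lr * avg_grad                          # identical update on every replica

rng = np.random.default_rng(0)
w = np.zeros(3)
shards = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(4)]
for _ in range(5):
    w = synchronized_step(w, shards)
print(w)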
Analysis of Queueing Delays
● 2 types of queueing delays (see the sketch below):
  1) Fair-share delay: a VC has used up its GPUs, so jobs wait for GPUs to become available
  2) Fragmentation delay: free GPUs are fragmented, so large jobs would have to be spread across many racks (low locality)
● Jobs that request more GPUs have a higher probability of long queueing delays
● Conclusion: the need for gang scheduling and locality introduces fragmentation delay, so locality constraints sometimes need to be relaxed to mitigate delays
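A toy classifier for the two delay types above, assuming the scheduler knows each VC's quota and current usage; this is a hypothetical helper, not Philly's actual accounting.

```python
def classify_queueing_delay(vc_gpus_in_use, vc_gpu_quota):
    """Fair-share delay: the VC has already used up its GPU quota.
    Fragmentation delay: quota remains, but enough GPUs with acceptable
    locality are not free yet."""
    if vc_gpus_in_use >= vc_gpu_quota:
        return "fair-share delay"
    return "fragmentation delay"

print(classify_queueing_delay(vc_gpus_in_use=32, vc_gpu_quota=32))  # fair-share delay
print(classify_queueing_delay(vc_gpus_in_use=20, vc_gpu_quota=32))  # fragmentation delay
```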
Analysis of GPU Utilization
● GPU utilization is low across all jobs
● Efficiency of allocated GPUs varies with the locality and colocation scenarios that can occur in the cluster
● Observe whether a particular job requires a disproportionate amount of host memory, and isolate the memory used by jobs colocated on the same server
Training Progress and Completion
● Terminated jobs constituted 55% of GPU utilization
● A large fraction of jobs spent time training for longer than necessary
● User error is a big reason for job failure
● Semantic errors increase with a higher number of GPUs because workers need to communicate and synchronize model parameters
Lessons Learned
1) Schedulers should trade queueing delay for adhering to locality constraints
  a) Retry jobs without relaxing locality constraints
2) Aim to isolate jobs on dedicated servers and implement migration for defragmentation to support locality constraints
3) Early failures should be caught on a smaller pool of GPUs before jobs are scheduled on larger clusters (see the sketch below)
  a) Many user errors can be caught without deploying on large clusters
  b) Classify errors and don't retry errors that won't pass (e.g., syntax errors)
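A hypothetical sketch of lesson 3: run a job briefly on a small GPU pool first, and only retry failures that look transient. The error categories and the classify_failure helper are illustrative, not Philly's actual failure taxonomy.

```python
# Assumed: run_on_small_pool and run_on_cluster are caller-supplied functions
# that each return (succeeded, stderr_text) for a job.
NON_RETRYABLE = {"syntax_error", "import_error"}

def classify_failure(stderr_text):
    """Very rough classifier over job stderr; a real system would be richer."""
    if "SyntaxError" in stderr_text:
        return "syntax_error"
    if "ModuleNotFoundError" in stderr_text:
        return "import_error"
    if "NCCL" in stderr_text or "timed out" in stderr_text:
        return "network_timeout"
    return "infrastructure_failure"

def submit(job, run_on_small_pool, run_on_cluster, max_retries=2):
    ok, stderr = run_on_small_pool(job)          # catch user errors cheaply
    if not ok and classify_failure(stderr) in NON_RETRYABLE:
        return "rejected: " + classify_failure(stderr)
    for _ in range(max_retries + 1):
        ok, stderr = run_on_cluster(job)
        if ok:
            return "completed"
        if classify_failure(stderr) in NON_RETRYABLE:
            return "failed: " + classify_failure(stderr)
    return "failed: retries exhausted"
```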
Pros and Cons
Pros:
- Explained different scheduling concerns and gave us a broad understanding of how scheduling decisions affect runtime
- Failure analysis section gives good insight into very easy ways to stop wasting GPU cycles
- Highlights the importance of dynamically checking for loss convergence
Cons:
- Didn't explain much about the role preemption plays in job completion
- Flexible scheduling can lead to more time being spent saving model checkpoints
- Didn't address scalability as an issue
Themis: Fair and Efficient GPU Cluster Scheduling
Themis image taken from https://en.wikipedia.org/wiki/Themis#/media/File:0029MAN-Themis.jpg
Motivation
● Two major problems with other scheduling algorithms:
  ○ Do not account for the long running times of ML tasks
  ○ No attention is paid to the placement of the ML tasks
  ○ Example: DRF
● Alright for big-data scheduling, but not for ML
  ○ Violates Pareto efficiency (https://www.economicshelp.org/blog/glossary/pareto-efficiency/) and envy-freedom
  ○ "Even with existing fair sharing schemes, we do find users frustrated with the inability to get their work done in a timely way..."
● We would like to maximize sharing incentive (SI)
Formalization of Time
● ML App
  ○ One or more training jobs
    ■ Each job has several tasks that process a minibatch of data
● GPU Time (worked example; see the sketch below)
  ■ 10 GPU-minutes per task
  ■ 2 tasks per job: 10*2 = 20 job GPU-minutes
  ■ 2 jobs per app: 10*2*2 = 40 app GPU-minutes
● Heterogeneity across apps
  ○ Analysis of workload traces from a large internet company
● Can be mitigated with LAS
  ○ Least Attained Service
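A tiny sketch of the GPU-time rollup above, using the slide's numbers; the representation of an app as a list of jobs, each a list of per-task GPU times, is assumed for illustration.

```python
def job_gpu_time(task_times):
    return sum(task_times)

def app_gpu_time(jobs):
    return sum(job_gpu_time(tasks) for tasks in jobs)

app = [[10, 10], [10, 10]]           # 2 jobs, each with 2 tasks of 10 GPU-minutes
print(job_gpu_time(app[0]))          # 20 job GPU-minutes
print(app_gpu_time(app))             # 40 app GPU-minutes
```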
Attempts To Pay Attention To Time - Tiresias
● Uses job completion time and GPU usage as measures of service
● Implements a Least Attained Service (LAS) policy (see the ordering sketch below)
  ○ Addresses starvation of jobs and therefore fairness
● Does not encode the GPU placement preferences of jobs
  ○ Treats all GPU configurations as equivalent, regardless of placement
Image taken from http://01greekmythology.blogspot.com/2014/06/teiresias.html
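A rough sketch of LAS ordering, assuming attained service is counted as GPUs times execution time; the job fields here are invented for illustration and this ignores Tiresias' other mechanisms (such as its multi-level queues).

```python
def attained_service(job):
    """GPU service attained so far: number of GPUs times time executed."""
    return job["gpus"] * job["seconds_run"]

def las_order(queue):
    """Jobs with the least attained service are scheduled first."""
    return sorted(queue, key=attained_service)

queue = [
    {"name": "j1", "gpus": 8, "seconds_run": 3600},
    {"name": "j2", "gpus": 1, "seconds_run": 600},
    {"name": "j3", "gpus": 4, "seconds_run": 120},
]
print([j["name"] for j in las_order(queue)])   # ['j3', 'j2', 'j1']
```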
The Importance of Space
● The placement of an app can heavily affect its performance
  ○ Again we see heterogeneity
● LAS and DRF will not achieve efficiency due to these issues
  ○ Instance 1 violates SI
  ○ Instance 2 violates PE and EF
Attempts To Pay Attention To Space - Gandiva
● Squeezes as much power out of GPUs as possible by exploiting the cyclic nature of SGD
  ○ Uses a greedy scheduling policy that continuously optimizes for cluster efficiency
● Master scheduler assigns Docker containers as they become available
  ○ Scheduling policy is built around early feedback and optimizing for efficiency
● Sets the theoretical groundwork for Themis
  ○ "The primary design goal of the Gandiva scheduler is to provide early feedback to jobs"
  ○ "Cluster level fairness is not a design goal in Gandiva"
The Themis Solution
● Presented in two parts
  ○ (1) An auction mechanism that allows apps to bid for resources
    ■ "Partial Allocation auction" incentivizes truth telling
  ○ (2) Two-level scheduling architecture
    ■ Allows for hyper-parameter optimization
Key Ideas for Partial Allocation Auction
● Finish-time fairness: ρ = T_sh / T_id, the finish time in the shared cluster over the finish time with a dedicated 1/N share of the cluster
  ○ SI achieved if ρ ≤ 1 for every app (see the sketch below)
● Requires the app to be able to express a preference for each allocation
  ○ Wider interface between app and allocation engine
● Hidden payment incentivizes truth-telling
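A minimal sketch of finish-time fairness and the SI check, with assumed field names; estimating T_sh and T_id for a real app is the hard part and is not shown here.

```python
def rho(t_shared, t_ideal):
    """Finish-time fairness: finish time in the shared cluster divided by the
    finish time the app would get with a dedicated 1/N share of the cluster."""
    return t_shared / t_ideal

def sharing_incentive_met(apps):
    """SI holds if every app does at least as well as on its own 1/N slice,
    i.e. rho <= 1 for all apps."""
    return all(rho(a["t_shared"], a["t_ideal"]) <= 1 for a in apps)

apps = [
    {"name": "app1", "t_shared": 90, "t_ideal": 100},   # rho = 0.9, happy
    {"name": "app2", "t_shared": 130, "t_ideal": 100},  # rho = 1.3, SI violated
]
print(sharing_incentive_met(apps))   # False
```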
Computation of ρ
● Recall that ρ is calculated for every permutation of the available GPUs
● This process is complicated by the presence of hyper-parameter optimization or early stopping
  ○ In this case, T_sh is calculated differently
● The estimate includes a slowdown factor to account for system overhead, and R_c = the number of GPUs left in the cluster
Multi-Round Auctions
● A single-round auction does not guarantee SI (why?)
● Auctions are triggered by leases ending
● At each round, the 1-f fraction of apps with the greatest ρ values are selected to participate (see the sketch below)
  ○ Why do we do this? What happens as we vary f?
    ■ Fairness vs efficiency
● Resources left over from hidden payments are allocated at random
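A hypothetical sketch of the per-round filtering step described above: pick the 1-f fraction of apps with the largest ρ (the apps furthest from their fair finish time) to take part in this round's auction. The auction itself and the hidden payments are not modeled.

```python
def select_auction_participants(apps_rho, f):
    """apps_rho: dict app -> current rho estimate.
    f in [0, 1) is the knob: large f favors fairness (only the worst-off apps
    bid), small f favors efficiency (more apps bid)."""
    k = max(1, round((1 - f) * len(apps_rho)))
    ranked = sorted(apps_rho, key=apps_rho.get, reverse=True)
    return ranked[:k]

apps_rho = {"a": 1.4, "b": 0.8, "c": 1.1, "d": 0.9}
print(select_auction_participants(apps_rho, f=0.5))   # ['a', 'c']
```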
Themis Scheduling Architecture
● Current architectures cannot support multi-round auctioning
  ○ E.g. Mesos, Omega
  ○ Entirely pessimistic or entirely optimistic
● Themis uses "semi-optimistic" concurrency control
  ○ The top level offers resources optimistically; the bottom level is pessimistic
Widening the API Between the Apps and the Scheduler
● A crucial aspect of Themis' architecture is that an app must be able to see all other resources but only use its own
  ○ Accomplished with the app/agent split
  ○ Agents are able to see all resources, while apps can only use their own resources
● Allows the agent to interact with existing hyper-parameter optimizers
  ○ Introduces an overhead into the app writer's process
    ■ Negligible?