DCUDA: Dynamic GPU Scheduling with Live Migration Support

Fan Guo¹, Yongkun Li¹, John C.S. Lui², Yinlong Xu¹
¹ University of Science and Technology of China
² The Chinese University of Hong Kong
Outline
1. Background & Problems
2. DCUDA Design
3. Evaluation
4. Conclusion
GPU Sharing and Scheduling
- GPUs are underloaded without sharing
  - A server may contain multiple GPUs
  - Each GPU contains thousands of cores
- GPU sharing allows multiple apps to run concurrently on one GPU
- GPU scheduling is necessary
[Figure: applications issue API calls to a frontend, which forwards them to a backend scheduler that balances load and GPU utilization across GPU 1 ... GPU N]
Current Scheduling Schemes
- Current schemes are "static"
  - Round-robin, prediction-based, least-loaded
  - They assign applications to GPUs only once, before the applications run
- State of the art: least-loaded scheduling
  - Assign each new app to the GPU with the least load
[Figure: the scheduler forwards a new app's API calls to one of GPU 1 ... GPU N]
Limitations of Static Scheduling
- Load imbalance (under least-loaded scheduling)
  - The fraction of time in which at least one GPU is overloaded while some other GPU is underloaded reaches up to 41.7% (overloaded: demand > GPU cores)
Limitations of Static Scheduling
- Why does static scheduling result in load imbalance?
- Applications are assigned before running
  - Hard to get the exact resource demand
  - So the assignment is not optimal
- No migration support
  - No way to adjust the assignment online
[Figure: the scheduler assigns a new app's API calls to one of GPU 1 ... GPU N]
Limitations of Static Scheduling
- Fairness issue caused by contention
  - Applications with low resource demand may be blocked by those with high resource demand
  - May exist even under load-balancing schemes
- Energy inefficiency
  - Compacting multiple small jobs onto one GPU saves energy
[Figure: energy consumption (J) of Triad, Kmeans, Mnist_mlp, BFS, Autoencoder, Sort, Reduction, and cifar10 under single execution vs. concurrent execution (2 apps)]
Our Goal
- Design a scheduling scheme that achieves better
  - Load balance, energy efficiency, fairness
- Key ideas of DCUDA
  - Dynamic scheduling: schedule applications after they start running, not only when executing kernels
  - Online migration: migrate running applications, with fairness and energy awareness
Outline
1. Background & Problems
2. DCUDA Design
3. Evaluation
4. Conclusion
Overall Design
- DCUDA is implemented on top of the API-forwarding framework (see the sketch below)
- Three key modules at the backend
  - Monitor: GPU utilization, each app's resource demand
  - Scheduler: load balance, energy efficiency, fairness
  - Migrator: migration of running applications
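To make the API-forwarding framework concrete, here is a minimal sketch of a frontend interceptor, assuming an LD_PRELOAD-style shim over a dynamically linked CUDA runtime; the slides do not specify the interposition mechanism, and the notify_backend call that forwards launches to the backend is hypothetical.

    // Hypothetical frontend shim: intercepts kernel launches so the DCUDA
    // backend can observe them before the real CUDA runtime runs them.
    // Build as a shared library and load it with LD_PRELOAD.
    #define _GNU_SOURCE            // for RTLD_NEXT
    #include <cuda_runtime.h>
    #include <dlfcn.h>

    typedef cudaError_t (*launch_fn)(const void*, dim3, dim3, void**,
                                     size_t, cudaStream_t);

    cudaError_t cudaLaunchKernel(const void* func, dim3 grid, dim3 block,
                                 void** args, size_t shmem,
                                 cudaStream_t stream) {
        static launch_fn real = NULL;
        if (!real)
            real = (launch_fn)dlsym(RTLD_NEXT, "cudaLaunchKernel");
        // notify_backend(func, grid, block);  // hypothetical: report to scheduler
        return real(func, grid, block, args, shmem, stream);
    }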
The Monitor
- Resource demand of each application
  - GPU cores and GPU memory
  - Key challenge: the monitor must be lightweight
- Demand on GPU cores
  - Existing tool (nvprof): large overhead (replays API calls)
  - Optimization: estimate only the first time a kernel runs and reuse the recorded info afterwards; track the info only when the kernel function is called, from the parameters of the intercepted launch API (#blocks, #threads) and a timer function (sketched below)
  - Rationale: GPU applications are iteration-based
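A sketch of the first-time estimation, assuming the monitor caches demand keyed by the kernel's function pointer (the cache structure is my assumption; the timer-based tracking of kernel duration is elided).

    // Hypothetical core-demand estimator: compute #blocks x #threads from
    // the intercepted launch parameters the first time a kernel is seen,
    // and reuse the recorded value on later iterations.
    #include <cuda_runtime.h>
    #include <unordered_map>

    static std::unordered_map<const void*, size_t> demand_cache;

    size_t core_demand(const void* func, dim3 grid, dim3 block) {
        auto it = demand_cache.find(func);
        if (it != demand_cache.end())
            return it->second;                    // recorded info, no re-profiling
        size_t blocks  = (size_t)grid.x * grid.y * grid.z;
        size_t threads = (size_t)block.x * block.y * block.z;
        size_t demand  = blocks * threads;        // total threads requested
        demand_cache[func] = demand;
        return demand;
    }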
The Monitor
- Demand on GPU memory
  - Allocated memory is easy to know, but not all allocated memory is actually used
- How to detect actual usage?
  - Pointer check with cuPointerGetAttribute() + sampling (see the sketch below)
  - False negatives (used memory misidentified as unused) are tolerated thanks to on-demand paging (with unified memory support)
- Estimation of GPU utilization
  - Periodically scan the resource demands of all applications and aggregate them
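One plausible reading of the pointer check, sketched with the CUDA driver API; the attribute queried, the sampling stride, and whether a successful query really distinguishes touched from untouched memory are all assumptions, so treat this as illustrative only.

    // Hypothetical usage probe: sample addresses inside an allocation and
    // ask the driver whether each one resolves to a known device mapping.
    #include <cuda.h>

    static int is_mapped(CUdeviceptr p) {
        unsigned int type = 0;
        CUresult rc = cuPointerGetAttribute(
            &type, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, p);
        return rc == CUDA_SUCCESS;                // failure => count as unused
    }

    // Estimate the used fraction of [base, base + bytes) by sampling
    // every `stride` bytes instead of walking the whole range.
    double used_fraction(CUdeviceptr base, size_t bytes, size_t stride) {
        size_t hits = 0, samples = 0;
        for (size_t off = 0; off < bytes; off += stride, ++samples)
            hits += is_mapped(base + off);
        return samples ? (double)hits / samples : 0.0;
    }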
The Scheduler
- A multi-stage, multi-objective scheduling policy
  - First priority: load balance
  - Case 1: a (slightly) overloaded GPU: must avoid low-demand tasks being blocked
  - Case 2: underloaded GPUs: waste energy
The Scheduler
- Load balance
  - Which GPUs: check each GPU pair; feasible candidates pair an overloaded GPU with an underloaded one
  - Which applications to migrate: minimize migration frequency and avoid the ping-pong effect; greedily migrate the heaviest feasible application (see the sketch below)
- Energy awareness
  - Compact lightweight apps onto fewer GPUs to save energy
- Fairness awareness: grouping + time slicing
  - Tradeoff between utilization and fairness: mixed packing favors utilization; a priority-based scheme favors fairness
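A minimal sketch of the greedy choice under stated assumptions: demands are additive core-equivalents, and a move is feasible when the target GPU can absorb the app without becoming overloaded (the slides give the policy, not this exact data layout).

    // Hypothetical greedy selection: for each (overloaded, underloaded)
    // GPU pair, pick the heaviest app whose move will not overload the
    // target; moving big apps first limits migration frequency, and the
    // feasibility check prevents ping-pong moves.
    #include <vector>

    struct App { int id; double demand; };     // demand in core-equivalents
    struct Gpu {
        double capacity;                       // e.g., number of GPU cores
        std::vector<App> apps;
        double load() const {
            double s = 0;
            for (const App& a : apps) s += a.demand;
            return s;
        }
    };

    // Fills (from, to, app_idx) and returns true if a feasible move exists.
    bool pick_migration(const std::vector<Gpu>& gpus,
                        int& from, int& to, int& app_idx) {
        for (size_t i = 0; i < gpus.size(); ++i) {
            if (gpus[i].load() <= gpus[i].capacity) continue;  // not overloaded
            for (size_t j = 0; j < gpus.size(); ++j) {
                if (i == j || gpus[j].load() >= gpus[j].capacity) continue;
                int best = -1; double best_d = 0;
                double room = gpus[j].capacity - gpus[j].load();
                for (size_t k = 0; k < gpus[i].apps.size(); ++k) {
                    double d = gpus[i].apps[k].demand;
                    if (d > best_d && d <= room) { best = (int)k; best_d = d; }
                }
                if (best >= 0) {
                    from = (int)i; to = (int)j; app_idx = best;
                    return true;
                }
            }
        }
        return false;
    }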
The Migrator
- Clone the runtime
  - Largest overhead: initializing libraries (>80% of migration time)
  - Handle pooling: maintain a pool of library handles for each GPU
- Migrate memory data
  - Leverage unified memory: the task can immediately run on the target GPU without first migrating its data
  - Transparent support: intercept allocation APIs and replace them (sketched below)
  - Pipeline: prefetch + on-demand paging
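A minimal sketch of the transparent substitution and the prefetch half of the pipeline, assuming the same LD_PRELOAD-style shim as before and a dynamically linked runtime; the migrator's allocation table is only hinted at in a comment.

    // Hypothetical shim: redirect cudaMalloc to unified (managed) memory so
    // data can follow a migrated task via on-demand paging; prefetching
    // overlaps the bulk of the data movement with execution on the target.
    #include <cuda_runtime.h>

    extern "C" cudaError_t cudaMalloc(void** ptr, size_t size) {
        // record (ptr, size) in the migrator's allocation table here ...
        return cudaMallocManaged(ptr, size, cudaMemAttachGlobal);
    }

    // Pipeline stage: asynchronously pull one allocation to the target GPU.
    void prefetch_to(const void* ptr, size_t size, int dst_gpu,
                     cudaStream_t stream) {
        cudaMemPrefetchAsync(ptr, size, dst_gpu, stream);
    }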
The Migrator
- Resume computing tasks
  - Tasks are in one of two states, running or waiting; only waiting tasks are migrated
  - Synchronize to wait for the completion of all running tasks
  - Redirect waiting tasks to the target GPU, preserving order with a FIFO queue (see the sketch below)
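A sketch of the resume step under the FIFO discipline; the queue layout is my assumption, the launch arguments are assumed to have been deep-copied when intercepted, and re-initializing runtime state on the target GPU (handle pooling) is elided.

    // Hypothetical resume path: drain kernels already running on the source
    // GPU, then replay the waiting launches on the target GPU in their
    // original (FIFO) order on the default stream.
    #include <cuda_runtime.h>
    #include <deque>

    struct PendingLaunch {            // an intercepted, not-yet-issued kernel
        const void* func;
        dim3 grid, block;
        void** args;                  // assumed deep-copied at intercept time
        size_t shmem;
    };

    static std::deque<PendingLaunch> waiting;  // FIFO preserves kernel order

    void resume_on(int target_gpu) {
        cudaDeviceSynchronize();      // wait for all running tasks to finish
        cudaSetDevice(target_gpu);    // redirect subsequent work
        while (!waiting.empty()) {
            PendingLaunch l = waiting.front();
            waiting.pop_front();
            cudaLaunchKernel(l.func, l.grid, l.block, l.args, l.shmem, 0);
        }
    }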
Outline
1. Background & Problems
2. DCUDA Design
3. Evaluation
4. Conclusion
Experiment Setting
- Testbed
  - Prototype implemented on the CUDA toolkit 8.0
  - Four NVIDIA 1080 Ti GPUs, each with 3584 cores and 12 GB of memory
- Workloads
  - 20 benchmark programs representing a majority of GPU application classes (HPC, data mining, machine learning, graph algorithms, deep learning)
  - 50 randomly generated sequences, each combining the 20 programs with a fixed arrival interval
- Baseline algorithm
  - Least-loaded: the most efficient static scheduling scheme
Load Balance
- Load states of a GPU: 0%-50% utilization, 50%-100% utilization, and overloaded (demand > GPU cores)
- Overloaded time of each GPU
  - Least-loaded: 14.3%-51.4%
  - DCUDA: within 6%
GPU Utilization
- DCUDA improves average GPU utilization by 14.6%
- DCUDA reduces overloaded time by 78.3% on average (over the 50 workload sequences)
Application Execution Time
- Execution time is normalized to that of single execution
- DCUDA reduces the average execution time by up to 42.1%
Impact of Different Loads
- Largest performance improvement (average execution time) under medium load
- Largest energy saving under light load
[Figures: Average Execution Time; Energy Consumption]
Outline
1. Background & Problems
2. DCUDA Design
3. Evaluation
4. Conclusion
Conclusion & Future Work
- Static GPU scheduling assigns applications only once, which leads to load imbalance
  - Low GPU utilization & high energy consumption
- We develop DCUDA, a dynamic scheduling scheme
  - Monitors resource demand and utilization with low overhead
  - Supports migration of running applications
  - Transparently supports all CUDA applications
- Limitation: DCUDA only considers scheduling within a single server and only the GPU-core resource
Thanks! Q&A

Yongkun Li
ykli@ustc.edu.cn
http://staff.ustc.edu.cn/~ykli