Performance Analysis, Modeling and Optimization of CUDA codes with the RambutanAcc DAG-based Framework
Tan Nguyen, John Bachan, Samuel Williams, David Donofrio, John Shalf, Cy Chan
Lawrence Berkeley National Laboratory
GPU Technology Conference - May 8, 2017
1
Session Info
• S7335 - PERFORMANCE ANALYSIS, MODELING, AND OPTIMIZATION OF CUDA CODES WITH A DAG-BASED FRAMEWORK
• We'll discuss how programmers can scale CUDA codes to the SMs of a GPU as well as to many GPUs of a cluster. By representing the application as a DAG (directed acyclic graph), as introduced by our RambutanAcc framework, the programmer can improve the application's performance and scalability with a fine-grained task scheduler and a lightweight communication handler implemented in a runtime system. The programmer can also gain insight into the application's behavior by using the analysis and modeling tools embedded in the framework.
• 25-minute Talk
• Cy Chan, Computer Research Scientist, Lawrence Berkeley National Laboratory, Computational Research Division
2
Outline • GPU Programming Challenges • RambutanAcc Programming and Execution Model • Experimental Results • Related Project Modeling Areas • Conclusion 3
Outline • GPU Programming Challenges • RambutanAcc Programming and Execution Model • Experimental Results • Related Project Modeling Areas • Conclusion 4
Programming Challenges • Scaling to all the cores of a GPU is a non-trivial task – Tesla C1060 (240 cores), Fermi C2050 (448 cores) – Kepler K20 (2496 cores), K40 (2880 cores), K80 (2x 2496 cores) – Pascal P100 (3584 cores) 5
Programming Challenges (cont’d)
• Placing computation on both CPU and GPU
– Maximizing performance at low programmer effort
• Data management
– GPU provides higher performance but less memory (on-device)
– Host CPU has lower performance but more memory
– Manage data across multiple nodes
– Need runtime support for data management
[Figure: a node with CPU and DRAM connected over PCIe to a GPU and its DRAM]
6
Programming Challenges (cont’d)
• Explicit communication complicates the application code
• Optimizing RDMA communication code is challenging
– Asynchronous communication
– Direct communication among GPUs
[Figure: nodes with a CPU, host DRAM, and multiple GPUs with their own DRAM, connected by an interconnect; Direct Memory Access (DMA) moves data between GPU memories]
7
Limitations of Existing Runtimes
• Lack of fine-grained scheduling on GPU
– Many runtime systems do not provide an effective mechanism to co-schedule workloads on a GPU
– Instead, the programmer must launch kernels on the GPU and hope Hyper-Q will schedule them in a smart way (very challenging in many cases)
• Lack of other GPU-aware optimizations
– Few runtimes support direct communication among GPUs
– Likewise for load balancing on a GPU and among GPUs
• Lack of performance analysis and modeling capabilities
– Most runtimes are developed primarily for production runs
– Performance analysis and modeling are important in hardware & software design exploration
8
Outline • GPU Programming Challenges • RambutanAcc Programming and Execution Model • Experimental Results • Related Project Modeling Areas • Conclusion 9
The RambutanAcc Project
• Objectives
– Analyze and model performance behavior of applications under various execution models and communication policies
– Scale our applications to effectively utilize multiple GPUs
– Optimize the performance at low programming cost
• Methodology
– Extend Rambutan, an asynchronous programming model
• Represent application code with a task graph (Directed Acyclic Graph)
– Support GPU execution within the runtime system
– Analyze the performance behavior
10
RambutanAcc Task Graphs (DAGs)
[Figure: example task graphs - a Cholesky Factorization DAG (tasks such as F(0), S(1,0), U(3,0,1)), a 3D Stencil DAG, a CNS/SMC DAG, and a 2.5D Cannon Matrix Multiply DAG (Shift A, Shift B, Accumulate C)]
11
Task Spaces and Data Spaces
• Task Space
– A task space encapsulates the behavior of a class of tasks
– Tasks are dynamically created at runtime
– E.g., task <0, 1> (iter 1) will not be created until task <0, 0> (iter 0) completes
• Data Space
– A data space encapsulates access and management of a class of data
– Data parcels are the granularity of data handled by the runtime
– A task may require data inputs, each a partition of the data space called a parcel
– Tasks may produce output parcels on execution
• Mapping of Task Spaces and Data Spaces
– A parcel is associated with a locale, indicating where the data resides (CPU or GPU DRAM)
– The runtime system is responsible for migrating parcels
[Figure: 3D Stencil DAG and its associated data space]
12
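To make these concepts concrete, here is a minimal sketch in C++ of how task indices and parcels with locales could be represented; the names (TaskIndex, Parcel, Locale) and layouts are assumptions invented for illustration, not the actual RambutanAcc API.

  #include <cstddef>

  // Hypothetical illustration of the concepts above (not the RambutanAcc API).
  enum class Locale { HostDRAM, DeviceDRAM };   // where a parcel currently resides

  struct TaskIndex {        // a point in a task space, e.g. <block, iteration>
    int block;
    int iter;
  };

  struct Parcel {           // a partition of a data space handled by the runtime
    void*  data;            // pointer valid in the parcel's current locale
    size_t bytes;
    Locale locale;          // the runtime migrates the parcel between locales
  };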
Defining Tasks • Three Task Types Type 1: Tasks running on host Type 2: Tasks running on host and offload compute intensive kernels to GPUs Type 3: Tasks running on GPUs • Specifying inputs/outputs • Specifying task computation • Specifying post-completion action (e.g. create a new task) 13
Defining Tasks
• Three Task Types
Type 1: Tasks running on host
Type 2: Tasks running on host and offload compute intensive kernels to GPUs
– Port legacy CUDA codes quickly
Type 3: Tasks running on GPUs

Type 2 tasks employ the traditional CUDA non-blocking kernel launch model:

  void launch(cudaStream_t stream){
    kernel1<<<…, 0, stream>>>(arguments);
  }
  void finish(){
    //post-completion action
    //create new task, which launches kernel2
  }
14
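For concreteness, a hedged, self-contained sketch of the Type 2 pattern follows; kernel1's body, the launch configuration, and the completion check via cudaStreamQuery are illustrative assumptions, not the actual RambutanAcc interface.

  #include <cuda_runtime.h>

  __global__ void kernel1(float* data, int n) {          // illustrative kernel body (assumed)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
  }

  // Type 2 task body: enqueue work on the stream the runtime provides and return immediately.
  void launch(cudaStream_t stream, float* d_data, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    kernel1<<<blocks, threads, 0, stream>>>(d_data, n);  // asynchronous with respect to the host
  }

  // The runtime can poll the stream without blocking; once the kernel has finished,
  // it invokes the task's post-completion action (finish()).
  bool task_done(cudaStream_t stream) {
    return cudaStreamQuery(stream) == cudaSuccess;
  }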
Defining Tasks
• Three Task Types
Type 1: Tasks running on host
Type 2: Tasks running on host and offload compute intensive kernels to GPUs
Type 3: Tasks running on GPUs
– Low launching overhead
– Run tasks on individual SMs

  __device__ void myKernel(void* sArgs, void* appArgs){
    //sArgs contains information of threadIdx, blockIdx, dimBlock and dimGrid
    //compute
  }
  void finish(){
    //post-completion action
    //create a new task
  }

[Figure: Cholesky DAG tasks (e.g., U(2,0,2), U(3,2,3)) each mapped to a grid spanning a subset of SMs]
15
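To illustrate how a Type 3 task might consume the scheduler-provided arguments, here is a hedged sketch; the SchedArgs layout and the application payload are assumptions made up for this example, not the framework's actual structures.

  #include <cuda_runtime.h>

  // Hypothetical layout of the scheduler-provided arguments (sArgs); it only
  // illustrates the idea that a task sees "virtual" grid/block coordinates
  // chosen by the scheduler rather than the launch-time grid.
  struct SchedArgs {
    int3 blockIdxV;   // virtual block index within the task's sub-grid
    int3 dimGrid;     // virtual grid dimensions assigned to this task
    int3 dimBlock;    // virtual block dimensions
  };

  __device__ void myKernel(void* sArgs, void* appArgs) {
    const SchedArgs* s = static_cast<const SchedArgs*>(sArgs);
    float* data = static_cast<float*>(appArgs);         // assumed application payload
    int i = s->blockIdxV.x * s->dimBlock.x + threadIdx.x;
    int n = s->dimGrid.x * s->dimBlock.x;
    if (i < n) data[i] += 1.0f;                          // illustrative compute
  }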
Implementation
• Task management system
– Tasks are created at runtime
– Existing tasks issue requests to create new tasks
• Task scheduler
– Depending on type, tasks are scheduled on host or GPU
– Tasks are buffered to reduce scheduling overhead
• Communication handler
– Handle all types of communication (host-host, host-GPU, GPU-GPU)
– Asynchronous fashion
[Figure: RambutanAcc runtime system - a fetching queue commits new tasks to the scheduler, which dispatches them through host and accelerator ready queues and task buffers to host workers and accelerator workers; the communication handler creates tasks and updates worker status]
16
Type-3 Tasks: Persistent Kernel
• Initially, we launch a persistent CUDA kernel and keep only a few thread blocks per SM
• The task scheduler running on the host sends tasks to a task buffer on the GPU using cudaMemcpyAsync
• After servicing a task, this kernel notifies the task scheduler on the host using UVM
(A minimal sketch of this pattern follows below.)
[Figure: persistent kernel servicing tasks on GPU - the CUDA thread grid's thread blocks are partitioned into workers (Worker 0-3), each pinned to a pair of SMs (SM 0-1, SM 2-3, SM 4-5, SM 6-7)]
17
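A heavily simplified sketch of the persistent-kernel mechanism. For brevity it uses a single host-visible task slot for both directions (the real runtime stages tasks into a GPU-side buffer with cudaMemcpyAsync, notifies the host via UVM, and runs many workers); TaskSlot, run_task, and the state encoding are invented for this illustration.

  #include <cuda_runtime.h>

  struct TaskSlot {              // hypothetical one-entry task buffer in host-mapped memory
    volatile int state;          // 0 = empty, 1 = task ready, 2 = done, -1 = shut down
    int taskId;                  // which task body to run
  };

  __device__ void run_task(int taskId) { /* dispatch to the task body, e.g. myKernel */ }

  // Persistent kernel: stays resident and polls for tasks instead of being relaunched per task.
  __global__ void persistent_worker(TaskSlot* slot) {
    while (true) {
      if (threadIdx.x == 0) {
        while (slot->state == 0 || slot->state == 2) { /* spin until a task arrives */ }
      }
      __syncthreads();
      if (slot->state == -1) return;                 // host requested shutdown
      run_task(slot->taskId);                        // service the task with this thread block
      __syncthreads();
      if (threadIdx.x == 0) {
        __threadfence_system();                      // make results visible to the host
        slot->state = 2;                             // notify the scheduler the task is done
      }
    }
  }

On the host side, the slot could live in memory allocated with cudaHostAlloc(cudaHostAllocMapped) or cudaMallocManaged; the scheduler writes taskId, sets state to 1, and polls for state == 2 before triggering the task's finish() action.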
Communication Handler
• We use GASNet to handle communication among GPUs
• Data can be routed through hosts or transferred directly among GPUs depending on hardware support
[Figure: two parcel-transfer paths - (a) routed through hosts, using a remote procedure call to request the parcel, cudaMemcpyAsync (with cudaStreamQuery for progress) to stage it between GPU and host on each side, gasnet_put_nb to move it between hosts, and a responding remote procedure call to notify the owner; (b) direct GPU-to-GPU transfer via DMA]
18
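A hedged sketch of the host-routed path on the sending side, assuming GASNet's non-blocking put (gasnet_put_nb) and a pre-negotiated remote buffer address; the function name, the synchronous wait at the end, and the surrounding runtime logic are simplifications invented for illustration.

  #include <cuda_runtime.h>
  #include <gasnet.h>

  // Stage a parcel from the local GPU to the host, then push it to a remote host.
  // dst_node and dst_addr would come from the requesting remote procedure call.
  void send_parcel_via_host(const void* d_parcel, void* h_stage, size_t bytes,
                            gasnet_node_t dst_node, void* dst_addr, cudaStream_t stream) {
    // 1) Asynchronously copy the parcel from GPU DRAM into a host staging buffer.
    cudaMemcpyAsync(h_stage, d_parcel, bytes, cudaMemcpyDeviceToHost, stream);

    // 2) The communication handler polls the stream so it never blocks other work.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
      /* make progress on other tasks / GASNet here */
    }

    // 3) Non-blocking RDMA put of the staged parcel to the destination host.
    gasnet_handle_t h = gasnet_put_nb(dst_node, dst_addr, h_stage, bytes);
    gasnet_wait_syncnb(h);   // the real runtime would overlap this with other work
  }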
Outline • GPU Programming Challenges • RambutanAcc Programming and Execution Model • Experimental Results • Related Project Modeling Areas • Conclusion 19
Studying Task Scheduling & Data Comm. Optimizations
Applications and their characteristics:
• Sparse Cholesky Matrix Factorization - Irregular algorithm, requiring tasks to be small to balance tasks on processors/GPUs. However, it is challenging to scale fine-grain tasks on high-end GPUs.
• 3D Jacobi Iterative Solver - Tasks have the same size due to structured grids. Task performance is bounded by memory bandwidth. It is also challenging to scale this application to many GPUs due to high communication costs.
• Dense Matrix Multiply - Tasks are compute intensive. However, GPUs can process these tasks quickly. Thus, the overall performance is also sensitive to communication costs.
20
Sparse Cholesky Matrix Factorization
• Each matrix is represented as a sparse list of small tiles
• The smaller the tile size, the harder it is to scale computations to all the cores of a GPU
• With the persistent kernel (Type 3 tasks), we can schedule each computation task on a subset of the available SMs
• Co-scheduling tasks on the same GPU improves the performance substantially
[Chart: time in milliseconds (0-8) vs. matrix size (128, 256, 512), comparing Type 2 (CUDA Launch) and Type 3 (Persistent Kernel) tasks]
21