Performance Analysis, Modeling and Optimization of CUDA codes with the RambutanAcc DAG-based Framework
Tan Nguyen, John Bachan, Samuel Williams, David Donofrio, John Shalf, Cy Chan
Lawrence Berkeley National Laboratory
GPU Technology Conference - May 8, 2017
1
Session Info
• S7335 - PERFORMANCE ANALYSIS, MODELING, AND OPTIMIZATION OF CUDA CODES WITH A DAG-BASED FRAMEWORK
• We'll discuss how programmers can scale CUDA codes to the SMs of a GPU as well as to many GPUs of a cluster. By representing the application as a DAG (directed acyclic graph), as introduced by our RambutanAcc framework, the programmer can improve the application's performance and scalability with a fine-grained task scheduler and a lightweight communication handler implemented in a runtime system. The programmer can also gain insight into the application's behavior by using the analysis and modeling tools embedded in the framework.
• 25-minute Talk
• Cy Chan, Computer Research Scientist, Lawrence Berkeley National Laboratory, Computational Research Division
2
Outline • GPU Programming Challenges • RambutanAcc Programming and Execution Model • Experimental Results • Related Project Modeling Areas • Conclusion 3
Outline • GPU Programming Challenges • RambutanAcc Programming and Execution Model • Experimental Results • Related Project Modeling Areas • Conclusion 4
Programming Challenges • Scaling to all the cores of a GPU is a non-trivial task – Tesla C1060 (240 cores), Fermi C2050 (448 cores) – Kepler K20 (2496 cores), K40 (2880 cores), K80 (2x 2496 cores) – Pascal P100 (3584 cores) 5
Programming Challenges (cont’d)
• Placing computation on both CPU and GPU
– Maximizing performance at low programmer effort
• Data management
– GPU provides higher performance but less memory (on-device)
– Host CPU has lower performance but more memory
– Manage data across multiple nodes
– Need runtime support for data management
[Figure: a node with CPU and DRAM connected over PCIe to a GPU and its DRAM]
6
Programming Challenges (cont’d)
• Explicit communication complicates the application code
• Optimizing RDMA communication code is challenging
– Asynchronous communication
– Direct communication among GPUs
[Figure: nodes with a CPU, host DRAM, and multiple GPUs with their own DRAM, connected by an interconnect; Direct Memory Access (DMA) moves data between GPU memories]
7
Limitations of Existing Runtimes
• Lack of fine-grained scheduling on GPU
– Many runtime systems do not provide an effective mechanism to co-schedule workloads on a GPU
– Instead, the programmer must launch kernels on the GPU and hope Hyper-Q will schedule them in a smart way (very challenging in many cases)
• Lack of other GPU-aware optimizations
– Few runtimes support direct communication among GPUs
– Likewise for load balancing on a GPU and among GPUs
• Lack of performance analysis and modeling capabilities
– Most runtimes are developed primarily for production runs
– Performance analysis and modeling are important in hardware & software design exploration
8
Outline • GPU Programming Challenges • RambutanAcc Programming and Execution Model • Experimental Results • Related Project Modeling Areas • Conclusion 9
The RambutanAcc Project
• Objectives
– Analyze and model performance behavior of applications under various execution models and communication policies
– Scale our applications to effectively utilize multiple GPUs
– Optimize the performance at low programming cost
• Methodology
– Extend Rambutan, an asynchronous programming model
• Represent application code with a task graph (Directed Acyclic Graph)
– Support GPU execution within the runtime system
– Analyze the performance behavior
10
RambutanAcc Task Graphs (DAGs)
[Figure: example task graphs - a Cholesky Factorization DAG (tasks such as F(0), S(1,0), U(3,0,1)), a 3D Stencil DAG, a CNS/SMC DAG, and a 2.5D Cannon Matrix Multiply DAG (Shift A, Shift B, Accumulate C)]
11
Task Spaces and Data Spaces
• Task Space
– A task space encapsulates the behavior of a class of tasks
– Tasks are dynamically created at runtime
– E.g., task <0, 1> (iter 1) will not be created until task <0, 0> (iter 0) completes
• Data Space
– A data space encapsulates access and management of a class of data
– Data parcels are the granularity of data handled by the runtime
– A task may require data inputs, each a partition of the data space called a parcel
– Tasks may produce output parcels on execution
• Mapping of Task Spaces and Data Spaces
– A parcel is associated with a locale, indicating where the data resides (CPU or GPU DRAM)
– The runtime system is responsible for migrating parcels
[Figure: 3D Stencil DAG and its associated data space]
12
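To make these concepts concrete, here is a minimal sketch in C++ of how task indices and parcels with locales could be represented; the names (TaskIndex, Parcel, Locale) and layouts are assumptions invented for illustration, not the actual RambutanAcc API.

  #include <cstddef>

  // Hypothetical illustration of the concepts above (not the RambutanAcc API).
  enum class Locale { HostDRAM, DeviceDRAM };   // where a parcel currently resides

  struct TaskIndex {        // a point in a task space, e.g. <block, iteration>
    int block;
    int iter;
  };

  struct Parcel {           // a partition of a data space handled by the runtime
    void*  data;            // pointer valid in the parcel's current locale
    size_t bytes;
    Locale locale;          // the runtime migrates the parcel between locales
  };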
Defining Tasks • Three Task Types Type 1: Tasks running on host Type 2: Tasks running on host and offload compute intensive kernels to GPUs Type 3: Tasks running on GPUs • Specifying inputs/outputs • Specifying task computation • Specifying post-completion action (e.g. create a new task) 13
Defining Tasks
• Three Task Types
Type 1: Tasks running on host
Type 2: Tasks running on host and offload compute intensive kernels to GPUs
– Port legacy CUDA codes quickly
Type 3: Tasks running on GPUs

Type 2 tasks employ the traditional CUDA non-blocking kernel launch model:

  void launch(cudaStream_t stream){
    kernel1<<<…, 0, stream>>>(arguments);
  }
  void finish(){
    //post-completion action
    //create new task, which launches kernel2
  }
14
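For concreteness, a hedged, self-contained sketch of the Type 2 pattern follows; kernel1's body, the launch configuration, and the completion check via cudaStreamQuery are illustrative assumptions, not the actual RambutanAcc interface.

  #include <cuda_runtime.h>

  __global__ void kernel1(float* data, int n) {          // illustrative kernel body (assumed)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
  }

  // Type 2 task body: enqueue work on the stream the runtime provides and return immediately.
  void launch(cudaStream_t stream, float* d_data, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    kernel1<<<blocks, threads, 0, stream>>>(d_data, n);  // asynchronous with respect to the host
  }

  // The runtime can poll the stream without blocking; once the kernel has finished,
  // it invokes the task's post-completion action (finish()).
  bool task_done(cudaStream_t stream) {
    return cudaStreamQuery(stream) == cudaSuccess;
  }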
Defining Tasks
• Three Task Types
Type 1: Tasks running on host
Type 2: Tasks running on host and offload compute intensive kernels to GPUs
Type 3: Tasks running on GPUs
– Low launching overhead
– Run tasks on individual SMs

  __device__ void myKernel(void* sArgs, void* appArgs){
    //sArgs contains information of threadIdx, blockIdx, dimBlock and dimGrid
    //compute
  }
  void finish(){
    //post-completion action
    //create a new task
  }

[Figure: Cholesky DAG tasks (e.g., U(2,0,2), U(3,2,3)) each mapped to a grid spanning a subset of SMs]
15
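To illustrate how a Type 3 task might consume the scheduler-provided arguments, here is a hedged sketch; the SchedArgs layout and the application payload are assumptions made up for this example, not the framework's actual structures.

  #include <cuda_runtime.h>

  // Hypothetical layout of the scheduler-provided arguments (sArgs); it only
  // illustrates the idea that a task sees "virtual" grid/block coordinates
  // chosen by the scheduler rather than the launch-time grid.
  struct SchedArgs {
    int3 blockIdxV;   // virtual block index within the task's sub-grid
    int3 dimGrid;     // virtual grid dimensions assigned to this task
    int3 dimBlock;    // virtual block dimensions
  };

  __device__ void myKernel(void* sArgs, void* appArgs) {
    const SchedArgs* s = static_cast<const SchedArgs*>(sArgs);
    float* data = static_cast<float*>(appArgs);         // assumed application payload
    int i = s->blockIdxV.x * s->dimBlock.x + threadIdx.x;
    int n = s->dimGrid.x * s->dimBlock.x;
    if (i < n) data[i] += 1.0f;                          // illustrative compute
  }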
Implementation
• Task management system
– Tasks are created at runtime
– Existing tasks issue requests to create new tasks
• Task scheduler
– Depending on type, tasks are scheduled on host or GPU
– Tasks are buffered to reduce scheduling overhead
• Communication handler
– Handle all types of communication (host-host, host-GPU, GPU-GPU)
– Asynchronous fashion
[Figure: RambutanAcc runtime system - a fetching queue commits new tasks to the scheduler, which dispatches them through host and accelerator ready queues and task buffers to host workers and accelerator workers; the communication handler creates tasks and updates worker status]
16
Type-3 Tasks: Persistent Kernel
• Initially, we launch a persistent CUDA kernel and keep only a few thread blocks per SM
• The task scheduler running on the host sends tasks to a task buffer on the GPU using cudaMemcpyAsync
• After servicing a task, this kernel notifies the task scheduler on the host using UVM
(A minimal sketch of this pattern follows below.)
[Figure: persistent kernel servicing tasks on GPU - the CUDA thread grid's thread blocks are partitioned into workers (Worker 0-3), each pinned to a pair of SMs (SM 0-1, SM 2-3, SM 4-5, SM 6-7)]
17
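A heavily simplified sketch of the persistent-kernel mechanism. For brevity it uses a single host-visible task slot for both directions (the real runtime stages tasks into a GPU-side buffer with cudaMemcpyAsync, notifies the host via UVM, and runs many workers); TaskSlot, run_task, and the state encoding are invented for this illustration.

  #include <cuda_runtime.h>

  struct TaskSlot {              // hypothetical one-entry task buffer in host-mapped memory
    volatile int state;          // 0 = empty, 1 = task ready, 2 = done, -1 = shut down
    int taskId;                  // which task body to run
  };

  __device__ void run_task(int taskId) { /* dispatch to the task body, e.g. myKernel */ }

  // Persistent kernel: stays resident and polls for tasks instead of being relaunched per task.
  __global__ void persistent_worker(TaskSlot* slot) {
    while (true) {
      if (threadIdx.x == 0) {
        while (slot->state == 0 || slot->state == 2) { /* spin until a task arrives */ }
      }
      __syncthreads();
      if (slot->state == -1) return;                 // host requested shutdown
      run_task(slot->taskId);                        // service the task with this thread block
      __syncthreads();
      if (threadIdx.x == 0) {
        __threadfence_system();                      // make results visible to the host
        slot->state = 2;                             // notify the scheduler the task is done
      }
    }
  }

On the host side, the slot could live in memory allocated with cudaHostAlloc(cudaHostAllocMapped) or cudaMallocManaged; the scheduler writes taskId, sets state to 1, and polls for state == 2 before triggering the task's finish() action.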
Communication Handler
• We use GASNet to handle communication among GPUs
• Data can be routed through hosts or transferred directly among GPUs depending on hardware support
[Figure: two parcel-transfer paths - (a) routed through hosts, using a remote procedure call to request the parcel, cudaMemcpyAsync (with cudaStreamQuery for progress) to stage it between GPU and host on each side, gasnet_put_nb to move it between hosts, and a responding remote procedure call to notify the owner; (b) direct GPU-to-GPU transfer via DMA]
18
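A hedged sketch of the host-routed path on the sending side, assuming GASNet's non-blocking put (gasnet_put_nb) and a pre-negotiated remote buffer address; the function name, the synchronous wait at the end, and the surrounding runtime logic are simplifications invented for illustration.

  #include <cuda_runtime.h>
  #include <gasnet.h>

  // Stage a parcel from the local GPU to the host, then push it to a remote host.
  // dst_node and dst_addr would come from the requesting remote procedure call.
  void send_parcel_via_host(const void* d_parcel, void* h_stage, size_t bytes,
                            gasnet_node_t dst_node, void* dst_addr, cudaStream_t stream) {
    // 1) Asynchronously copy the parcel from GPU DRAM into a host staging buffer.
    cudaMemcpyAsync(h_stage, d_parcel, bytes, cudaMemcpyDeviceToHost, stream);

    // 2) The communication handler polls the stream so it never blocks other work.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
      /* make progress on other tasks / GASNet here */
    }

    // 3) Non-blocking RDMA put of the staged parcel to the destination host.
    gasnet_handle_t h = gasnet_put_nb(dst_node, dst_addr, h_stage, bytes);
    gasnet_wait_syncnb(h);   // the real runtime would overlap this with other work
  }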
Outline • GPU Programming Challenges • RambutanAcc Programming and Execution Model • Experimental Results • Related Project Modeling Areas • Conclusion 19
Studying Task Scheduling & Data Comm. Optimizations
Applications and their characteristics:
• Sparse Cholesky Matrix Factorization - Irregular algorithm, requiring tasks to be small to balance tasks on processors/GPUs. However, it is challenging to scale fine-grain tasks on high-end GPUs.
• 3D Jacobi Iterative Solver - Tasks have the same size due to structured grids. Task performance is bounded by memory bandwidth. It is also challenging to scale this application to many GPUs due to high communication costs.
• Dense Matrix Multiply - Tasks are compute intensive. However, GPUs can process these tasks quickly. Thus, the overall performance is also sensitive to communication costs.
20
Sparse Cholesky Matrix Factorization
• Each matrix is represented as a sparse list of small tiles
• The smaller the tile size, the harder it is to scale computations to all the cores of a GPU
• With the persistent kernel (Type 3 tasks), we can schedule each computation task on a subset of the available SMs
• Co-scheduling tasks on the same GPU improves the performance substantially
[Chart: time in milliseconds (0-8) vs. matrix size (128, 256, 512), comparing Type 2 (CUDA Launch) and Type 3 (Persistent Kernel) tasks]
21