WIREFRAME: Supporting Data-dependent Parallelism through Dependency Graph Execution in GPUs
AmirAli Abdolrashidi†, Devashree Tripathy†, Mehmet E. Belviranli‡, Laxmi N. Bhuyan†, Daniel Wong†
† University of California Riverside ‡ Oak Ridge National Laboratory
MICRO 50

Introduction

Motivation
• Despite the support for parallelism, GPUs lack support for data-dependent parallelism.

Example: Wavefront Pattern
[Figure: grid of thread blocks executed wave by wave, with a barrier between waves, …until the application ends]

Example: Global Barriers (Original)
for i = 1 to nWave:
    - Kernel Launch
    - Synchronize
• Enormous host-side kernel launch overhead!
• Waiting on non-parent thread blocks
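The launch-and-synchronize loop above can be sketched on the host in plain C++, with an ordinary loop iteration standing in for each kernel launch and its barrier. This is only an illustration under stated assumptions: a 2D grid in which each thread block depends on its left and upper neighbors, so that blocks on the same anti-diagonal (x + y) form one wave. The names `Block`, `waveOf`, and `runWithGlobalBarriers` are hypothetical and do not appear in the slides.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a thread block's grid coordinates.
struct Block { int x; int y; };

// Assumption: each block depends on its left and upper neighbors,
// so its wave index is the anti-diagonal it lies on.
int waveOf(const Block& b) { return b.x + b.y; }

// Simulates the global-barrier schedule: one "launch" per wave, each
// followed by a barrier. Returns the blocks in execution order and
// stores the number of launches in *launches.
std::vector<Block> runWithGlobalBarriers(int gridX, int gridY, int* launches) {
    assert(gridX > 0 && gridY > 0);
    const int nWave = gridX + gridY - 1;  // number of anti-diagonals
    std::vector<Block> order;
    *launches = 0;
    for (int wave = 0; wave < nWave; ++wave) {
        // "Kernel launch": run every block that belongs to this wave.
        for (int y = 0; y < gridY; ++y) {
            const int x = wave - y;
            if (x >= 0 && x < gridX) order.push_back({x, y});
        }
        // "Synchronize": the next wave starts only after this one is done.
        ++*launches;
    }
    return order;
}
```

Under these assumptions a 3×3 grid needs 5 launches for 9 blocks. The launch count grows with the number of waves, and every block in a wave waits for the whole previous wave, not only for its own parents.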

Example: CDP (Nested)
RUN:
    - Parent Kernel Launch
    - Synchronize
Parent Kernel:
    for i = 1 to nWaves:
        - Child Kernel Launch
        - Synchronize
[Figure: Kernel Execution Pattern]

Example: CDP (Nested)
• No more host-side kernel launch
• Device-side kernel launch still has significant overhead
• NO multi-parent dependency support
• Still NO general dependency support!

Motivation
• There is a need for generalized support for finer-grained inter-block data dependencies, for better performance and efficiency.
[Figure labels: Intra-Block, Inter-Block, Global; Thread, Thread Block, Barrier]

Motivation
• Current limitations
    • High device-side kernel launch overhead
    • No general inter-block data dependency support

Wireframe Overview
Host (CPU):
    • Programming Model
    • Dependency Graph
    • Convert to CSR: Node Array, Edge Array
Device (GPU):
    • Global Memory: Global Node Array, Global Edge Array
    • DATS Hardware (Dependency Graph Buffer): Local Node Array, Local Edge Array, Pending Update Buffer, Node Insertion Buffer

Programming Model:

    // Parents of a thread block: its left and upper neighbors.
    // (No trailing semicolons, so the macros can be used as arguments.)
    #define parent1 dim3(blockIdx.x - 1, blockIdx.y, blockIdx.z)
    #define parent2 dim3(blockIdx.x, blockIdx.y - 1, blockIdx.z)

    // Declares the dependencies of each thread block.
    void* DepLink() {
        if (blockIdx.x > 0) WF::AddDependency(parent1);
        if (blockIdx.y > 0) WF::AddDependency(parent2);
    }

    int main() {
        // DepLink is passed as the third launch parameter.
        kernel<<<GridSize, BlockSize, DepLink>>>(0, args);
    }

    __WF__ void kernel(args) {
        processWave();
    }
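The "Convert to CSR" step and the idea of running a block as soon as its parents finish can be sketched in plain C++ on the host. This is a minimal illustration, not the Wireframe implementation: the slide does not give the exact fields of the node and edge arrays, so the layout here (offsets per node, child IDs as edges, and a per-node parent count) is an assumption, and the names `CsrGraph`, `buildWavefrontCsr`, and `readyOrder` are hypothetical. The dependency rule mirrors `DepLink`: block (x, y) depends on (x-1, y) and (x, y-1) when they exist.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical CSR layout (assumed, not taken from the slides).
struct CsrGraph {
    std::vector<int> nodeOffsets;  // size n+1: children of i are edges[nodeOffsets[i] .. nodeOffsets[i+1])
    std::vector<int> edges;        // child block IDs, grouped by parent
    std::vector<int> parentCount;  // number of parents of each block
};

// Builds the dependency graph of a gridX-by-gridY grid in CSR form.
// Block IDs are row-major: id = y * gridX + x.
CsrGraph buildWavefrontCsr(int gridX, int gridY) {
    assert(gridX > 0 && gridY > 0);
    const int n = gridX * gridY;
    CsrGraph g;
    g.nodeOffsets.assign(n + 1, 0);
    g.parentCount.assign(n, 0);
    for (int y = 0; y < gridY; ++y) {
        for (int x = 0; x < gridX; ++x) {
            const int id = y * gridX + x;
            g.nodeOffsets[id] = static_cast<int>(g.edges.size());
            // Children are the right and lower neighbors (inverse of the parent rule).
            if (x + 1 < gridX) { g.edges.push_back(id + 1); ++g.parentCount[id + 1]; }
            if (y + 1 < gridY) { g.edges.push_back(id + gridX); ++g.parentCount[id + gridX]; }
        }
    }
    g.nodeOffsets[n] = static_cast<int>(g.edges.size());
    return g;
}

// Dependency-driven order: a block becomes ready once all of its parents
// have finished, with no global barrier between waves.
std::vector<int> readyOrder(const CsrGraph& g) {
    std::vector<int> pending = g.parentCount;  // parents still running, per block
    std::queue<int> ready;
    for (int i = 0; i < static_cast<int>(pending.size()); ++i) {
        if (pending[i] == 0) ready.push(i);
    }
    std::vector<int> order;
    while (!ready.empty()) {
        const int id = ready.front();
        ready.pop();
        order.push_back(id);  // "execute" the block
        for (int e = g.nodeOffsets[id]; e < g.nodeOffsets[id + 1]; ++e) {
            const int child = g.edges[e];
            if (--pending[child] == 0) ready.push(child);
        }
    }
    return order;
}
```

Storing children rather than parents in the edge array is a design choice of this sketch: when a block finishes, its children's pending counts can be decremented by scanning one contiguous slice of the edge array.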
