WIREFRAME: Supporting Data-dependent Parallelism through Dependency Graph Execution in GPUs AmirAli Abdolrashidi † , Devashree Tripathy † , Mehmet E. Belviranli ‡ , Laxmi N. Bhuyan † , Daniel Wong † † University of California Riverside ‡ Oak Ridge National Laboratory �1 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Introduction �2 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Introduction �2 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Motivation • Despite the support for parallelism, GPUs lack support for data-dependent parallelism. �3 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example: Wavefront Pattern Barrier 1 2 1 Thread block 0 1 3 1 0 1 1 1 3 2 1 �4 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example: Wavefront Pattern 1 2 1 0 1 3 1 0 1 1 1 3 2 1 �4 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example: Wavefront Pattern 1 2 1 0 1 3 1 0 1 1 1 …until the 3 2 1 application ends �4 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example Global Barriers (Original) for i = 1 to nWave: -Kernel Launch -Synchronize �5 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example Global Barriers (Original) for i = 1 to nWave: -Kernel Launch -Synchronize Enormous host-side kernel launch overhead! �5 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example Global Barriers (Original) for i = 1 to nWave: -Kernel Launch -Synchronize Enormous host-side kernel launch overhead! Waiting on non-parent thread blocks �5 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example Global Barriers (Original) for i = 1 to nWave: -Kernel Launch -Synchronize Enormous host-side kernel launch overhead! Waiting on non-parent thread blocks �5 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example Global Barriers (Original) for i = 1 to nWave: -Kernel Launch -Synchronize Enormous host-side kernel launch overhead! Waiting on non-parent thread blocks �5 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example Global Barriers (Original) for i = 1 to nWave: -Kernel Launch -Synchronize Enormous host-side kernel launch overhead! Waiting on non-parent thread blocks �5 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch -Synchronize �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch … -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch … -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch … -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Example • No more host-side kernel launch CDP (Nested) RUN : • Device-side kernel launch still has -Parent Kernel Launch significant overhead -Synchronize Parent Kernel: • NO multi-parent dependency support for i = 1 to nWaves: -Child Kernel Launch -Synchronize • Still NO general dependency support! �7 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Motivation • There is a need for a generalized support for finer-grain inter-block data dependency for more performance and efficiency. Intra-Block Global Inter-Block Thread Thread c Block Barrier �8 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Motivation • Current limitations • High device-side kernel launch overhead • No general inter-block data dependency support �9 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Wireframe Overview Host (CPU) Device (GPU) �10 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Wireframe Overview Programming Model #define parent1 dim3 (blockIdx.x-1, blockIdx.y, blockIdx.z); #define parent2 dim3 (blockIdx.x, blockIdx.y- 1, blockIdx.z); void* DepLink() { if (blockIdx.x > 0) WF::AddDependency(parent1); if (blockIdx.y > 0) WF::AddDependency(parent2); } int main() { kernel<<<GridSize, BlockSize, DepLink>>>(0, args); Host } __WF__ void kernel(args) { (CPU) processWave(); } Device (GPU) �10 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Wireframe Overview Programming Model Dependency Graph #define parent1 dim3 (blockIdx.x-1, blockIdx.y, blockIdx.z); #define parent2 dim3 (blockIdx.x, blockIdx.y- 1, blockIdx.z); void* DepLink() { if (blockIdx.x > 0) WF::AddDependency(parent1); if (blockIdx.y > 0) WF::AddDependency(parent2); } int main() { kernel<<<GridSize, BlockSize, DepLink>>>(0, args); Host } __WF__ void kernel(args) { (CPU) processWave(); } Device (GPU) �10 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Wireframe Overview Programming Model Dependency Convert to CSR Graph #define parent1 dim3 (blockIdx.x-1, blockIdx.y, blockIdx.z); #define parent2 dim3 (blockIdx.x, blockIdx.y- 1, blockIdx.z); Node Array void* DepLink() { if (blockIdx.x > 0) WF::AddDependency(parent1); Edge Array if (blockIdx.y > 0) WF::AddDependency(parent2); } int main() { kernel<<<GridSize, BlockSize, DepLink>>>(0, args); Host } __WF__ void kernel(args) { (CPU) processWave(); } Device (GPU) �10 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Wireframe Overview Programming Model Dependency Convert to CSR Graph #define parent1 dim3 (blockIdx.x-1, blockIdx.y, blockIdx.z); #define parent2 dim3 (blockIdx.x, blockIdx.y- 1, blockIdx.z); Node Array void* DepLink() { if (blockIdx.x > 0) WF::AddDependency(parent1); Edge Array if (blockIdx.y > 0) WF::AddDependency(parent2); } int main() { kernel<<<GridSize, BlockSize, DepLink>>>(0, args); Host } __WF__ void kernel(args) { (CPU) processWave(); } Device (GPU) Global Memory Global Node Array Global Edge Array �10 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Wireframe Overview Programming Model Dependency Convert to CSR Graph #define parent1 dim3 (blockIdx.x-1, blockIdx.y, blockIdx.z); #define parent2 dim3 (blockIdx.x, blockIdx.y- 1, blockIdx.z); Node Array void* DepLink() { if (blockIdx.x > 0) WF::AddDependency(parent1); Edge Array if (blockIdx.y > 0) WF::AddDependency(parent2); } int main() { kernel<<<GridSize, BlockSize, DepLink>>>(0, args); Host } __WF__ void kernel(args) { (CPU) processWave(); } DATS Hardware Device (Dependency Graph Buffer) (GPU) Global Memory Local Node Array Local Edge Array Global Node Array Pending Update Buffer Global Edge Array Node Insertion Buffer �10 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs
Recommend
More recommend