wireframe supporting data dependent parallelism through
play

WIREFRAME: Supporting Data-dependent Parallelism through - PowerPoint PPT Presentation

WIREFRAME: Supporting Data-dependent Parallelism through Dependency Graph Execution in GPUs AmirAli Abdolrashidi , Devashree Tripathy , Mehmet E. Belviranli , Laxmi N. Bhuyan , Daniel Wong University of California


  1. WIREFRAME: 
 Supporting Data-dependent Parallelism through Dependency Graph Execution in GPUs AmirAli Abdolrashidi † , Devashree Tripathy † , Mehmet E. Belviranli ‡ , Laxmi N. Bhuyan † , Daniel Wong † † University of California Riverside ‡ Oak Ridge National Laboratory �1 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  2. Introduction �2 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  3. Introduction �2 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  4. Motivation • Despite the support for parallelism, GPUs lack support for data-dependent parallelism. �3 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  5. Example: Wavefront Pattern Barrier 1 2 1 Thread block 0 1 3 1 0 1 1 1 3 2 1 �4 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  6. Example: Wavefront Pattern 1 2 1 0 1 3 1 0 1 1 1 3 2 1 �4 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  7. Example: Wavefront Pattern 1 2 1 0 1 3 1 0 1 1 1 …until the 3 2 1 application ends �4 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  8. Example Global Barriers (Original) for i = 1 to nWave: -Kernel Launch -Synchronize �5 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  9. Example Global Barriers (Original) for i = 1 to nWave: -Kernel Launch -Synchronize Enormous host-side kernel launch overhead! �5 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  10. Example Global Barriers (Original) for i = 1 to nWave: -Kernel Launch -Synchronize Enormous host-side kernel launch overhead! Waiting on non-parent thread blocks �5 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  11. Example Global Barriers (Original) for i = 1 to nWave: -Kernel Launch -Synchronize Enormous host-side kernel launch overhead! Waiting on non-parent thread blocks �5 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  12. Example Global Barriers (Original) for i = 1 to nWave: -Kernel Launch -Synchronize Enormous host-side kernel launch overhead! Waiting on non-parent thread blocks �5 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  13. Example Global Barriers (Original) for i = 1 to nWave: -Kernel Launch -Synchronize Enormous host-side kernel launch overhead! Waiting on non-parent thread blocks �5 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  14. Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch -Synchronize �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  15. Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  16. Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  17. Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  18. Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  19. Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  20. Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch … -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  21. Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch … -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  22. Example CDP (Nested) RUN : -Parent Kernel Launch -Synchronize Parent Kernel: for i = 1 to nWaves: -Child Kernel Launch … -Synchronize Kernel Execution Pattern �6 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  23. Example • No more host-side kernel launch CDP (Nested) RUN : • Device-side kernel launch still has -Parent Kernel Launch significant overhead -Synchronize Parent Kernel: • NO multi-parent dependency support for i = 1 to nWaves: -Child Kernel Launch -Synchronize • Still NO general dependency support! �7 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  24. Motivation • There is a need for a generalized support for finer-grain inter-block data dependency for more performance and efficiency. Intra-Block Global Inter-Block Thread Thread c Block Barrier �8 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  25. Motivation • Current limitations • High device-side kernel launch overhead • No general inter-block data dependency support �9 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  26. Wireframe Overview Host (CPU) Device (GPU) �10 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  27. Wireframe Overview Programming Model #define parent1 dim3 (blockIdx.x-1, blockIdx.y, blockIdx.z); #define parent2 dim3 (blockIdx.x, blockIdx.y- 1, blockIdx.z); void* DepLink() { if (blockIdx.x > 0) WF::AddDependency(parent1); if (blockIdx.y > 0) WF::AddDependency(parent2); } int main() { kernel<<<GridSize, BlockSize, DepLink>>>(0, args); Host } __WF__ void kernel(args) { (CPU) processWave(); } Device (GPU) �10 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  28. Wireframe Overview Programming Model Dependency Graph #define parent1 dim3 (blockIdx.x-1, blockIdx.y, blockIdx.z); #define parent2 dim3 (blockIdx.x, blockIdx.y- 1, blockIdx.z); void* DepLink() { if (blockIdx.x > 0) WF::AddDependency(parent1); if (blockIdx.y > 0) WF::AddDependency(parent2); } int main() { kernel<<<GridSize, BlockSize, DepLink>>>(0, args); Host } __WF__ void kernel(args) { (CPU) processWave(); } Device (GPU) �10 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  29. Wireframe Overview Programming Model Dependency Convert to CSR Graph #define parent1 dim3 (blockIdx.x-1, blockIdx.y, blockIdx.z); #define parent2 dim3 (blockIdx.x, blockIdx.y- 1, blockIdx.z); Node Array void* DepLink() { if (blockIdx.x > 0) WF::AddDependency(parent1); Edge Array if (blockIdx.y > 0) WF::AddDependency(parent2); } int main() { kernel<<<GridSize, BlockSize, DepLink>>>(0, args); Host } __WF__ void kernel(args) { (CPU) processWave(); } Device (GPU) �10 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  30. Wireframe Overview Programming Model Dependency Convert to CSR Graph #define parent1 dim3 (blockIdx.x-1, blockIdx.y, blockIdx.z); #define parent2 dim3 (blockIdx.x, blockIdx.y- 1, blockIdx.z); Node Array void* DepLink() { if (blockIdx.x > 0) WF::AddDependency(parent1); Edge Array if (blockIdx.y > 0) WF::AddDependency(parent2); } int main() { kernel<<<GridSize, BlockSize, DepLink>>>(0, args); Host } __WF__ void kernel(args) { (CPU) processWave(); } Device (GPU) Global Memory Global Node Array Global Edge Array �10 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

  31. Wireframe Overview Programming Model Dependency Convert to CSR Graph #define parent1 dim3 (blockIdx.x-1, blockIdx.y, blockIdx.z); #define parent2 dim3 (blockIdx.x, blockIdx.y- 1, blockIdx.z); Node Array void* DepLink() { if (blockIdx.x > 0) WF::AddDependency(parent1); Edge Array if (blockIdx.y > 0) WF::AddDependency(parent2); } int main() { kernel<<<GridSize, BlockSize, DepLink>>>(0, args); Host } __WF__ void kernel(args) { (CPU) processWave(); } DATS Hardware Device (Dependency Graph Buffer) (GPU) Global Memory Local Node Array Local Edge Array Global Node Array Pending Update Buffer Global Edge Array Node Insertion Buffer �10 MICRO 50 WIREFRAME: Supporting Data-dependent Parallelism in GPUs

Recommend


More recommend