A GPU Run-Time for Event-Driven Task Parallelism
Reservoir Labs, Inc.
R-Stream Team: Athanasios Konstantinidis, Benoit Meister, Muthu Baskaran, Tom Henretty, Benoit Pradelle, Tahina Ramananandro, Sanket Tavargeri, Ann Johnson, Richard Lethin
2.3.15
GPU Programming with CUDA
• Massive data parallelism is required
• Hides global memory access latency
• What if our program is not data-parallel?
• We find synchronous chunks of data-parallel computation, i.e., wavefronts
[Figure: dependence graph (DAG) of an SPMD computation; the repeated kernel invocations between wavefronts incur global synchronization overhead]
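As a point of reference, here is a minimal host-side sketch of the wavefront scheme; the wavefront_kernel name, the per-wavefront tile counts, and the block size are illustrative, not taken from the slides. The point is that every wavefront is a separate kernel launch, so the device drains completely between wavefronts.

    #include <cuda_runtime.h>

    // Hypothetical kernel: each thread block updates one tile of wavefront w.
    __global__ void wavefront_kernel(int w, float *data) {
        // stencil update of one tile in wavefront w (omitted in this sketch)
    }

    // One kernel launch per wavefront: the inter-launch barrier is the
    // global synchronization overhead noted above.
    void run_wavefronts(int num_wavefronts, const int *tiles_per_wavefront,
                        float *d_data) {
        for (int w = 0; w < num_wavefronts; ++w) {
            wavefront_kernel<<<tiles_per_wavefront[w], 256>>>(w, d_data);
            cudaDeviceSynchronize();   // wait for the whole wavefront to finish
        }
    }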
A GPU Run-Time for Task Parallelism
• Implements an Event-Driven Tasks (EDT) execution model
• A single persistent GPU kernel executes the entire DAG (manages thread-block-level parallelism)
• On-the-fly dependence resolution
• Light-weight synchronization based on atomics
• Work stealing for load balancing
[Figure: DAG of tasks connected by light-weight atomic synchronizations (events)]
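A minimal sketch of how such a persistent worker kernel can be structured. The Task and WorkQueue types, the helpers queue_get, try_steal and run_codelet, and the MAX_STEAL_ROUNDS constant are illustrative and are sketched in the following slides; this is not the R-Stream runtime's actual code.

    // Count of tasks that have not yet executed; the host initializes it to
    // the number of tasks in the DAG (illustrative termination scheme).
    __device__ int g_tasks_remaining;

    // One launch of this kernel executes the whole DAG. Each thread block is
    // a worker: thread 0 fetches task ids, the whole block runs the codelet.
    __global__ void persistent_runtime(Task *tasks, WorkQueue *queues) {
        __shared__ int task_id;
        while (true) {
            if (threadIdx.x == 0) {
                if (atomicAdd(&g_tasks_remaining, 0) == 0) {    // read remaining
                    task_id = -2;                               // all done
                } else {
                    task_id = queue_get(&queues[blockIdx.x]);   // own queue first
                    if (task_id < 0)                            // empty: go steal
                        task_id = try_steal(queues, gridDim.x, blockIdx.x,
                                            MAX_STEAL_ROUNDS);
                }
            }
            __syncthreads();                 // broadcast task_id to the block
            if (task_id == -2) break;        // DAG finished: exit the kernel
            if (task_id >= 0) {
                run_codelet(tasks, task_id, &queues[blockIdx.x]);
                if (threadIdx.x == 0)
                    atomicSub(&g_tasks_remaining, 1);
            }
            __syncthreads();                 // keep the block in lock-step
        }
    }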
Dependence Resolution – Event-Driven Tasks (EDTs)
• Dependence counters
  • Each task has a dependence counter (dcount)
  • After a task completes, its successors' dcounts are decremented
  • A task becomes active when its dcount reaches zero
[Figure: an active task (dcount 0) firing events to inactive successors with dcounts 1 and 2]
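A sketch of this resolution step using atomics. The Task layout, the MAX_SUCC bound, and the queue_put helper are assumptions made for illustration, not the runtime's actual data structures.

    #define MAX_SUCC 8          // illustrative bound on a task's out-degree

    struct Task {
        int dcount;             // unsatisfied incoming dependences
        int num_succ;           // number of successors in the DAG
        int succ[MAX_SUCC];     // successor task ids
        int type;               // codelet type
        int params[4];          // integer parameter vector (e.g., tile coordinates)
    };

    // Epilogue step, run by one thread after the task's computation finishes:
    // decrement each successor's dcount and enqueue it when the count hits zero.
    __device__ void resolve_dependences(Task *tasks, int tid, WorkQueue *my_q) {
        Task *t = &tasks[tid];
        for (int i = 0; i < t->num_succ; ++i) {
            int s = t->succ[i];
            // atomicSub returns the previous value: 1 means this decrement
            // took the counter to zero, so this worker activates the successor.
            if (atomicSub(&tasks[s].dcount, 1) == 1)
                queue_put(my_q, s);     // the successor is now active
        }
    }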
Run-Time Architecture
[Figure: task meta-data in global memory; per-worker work queues with work stealing between them; thread blocks executing codelets inside one persistent GPU kernel]
• The run-time is defined by a single persistent GPU kernel
• Task meta-data holds the task parameters:
  • dependence counters
  • codelet type
  • integer parameter vectors
• Codelets have a three-part structure (a skeleton is sketched below):
  • Prologue: unpacks the task parameters
  • Computation
  • Epilogue: dependence resolution
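A codelet skeleton matching this prologue / computation / epilogue structure. The codelet types and the per-tile device functions (jacobi_tile, fdtd_tile, adi_tile) are hypothetical placeholders.

    // Illustrative codelet types for the benchmarks evaluated later.
    enum CodeletType { JACOBI_TILE = 0, FDTD_TILE = 1, ADI_TILE = 2 };

    __device__ void run_codelet(Task *tasks, int tid, WorkQueue *my_q) {
        Task *t = &tasks[tid];

        // Prologue: unpack the integer parameter vector (here: tile coordinates).
        int ti = t->params[0];
        int tj = t->params[1];

        // Computation: dispatch on the codelet type; every thread of the
        // block participates in updating the tile.
        switch (t->type) {
            case JACOBI_TILE: jacobi_tile(ti, tj); break;
            case FDTD_TILE:   fdtd_tile(ti, tj);   break;
            case ADI_TILE:    adi_tile(ti, tj);    break;
        }
        __syncthreads();    // the tile must be fully written before signalling

        // Epilogue: resolve dependences (see the earlier sketch).
        if (threadIdx.x == 0)
            resolve_dependences(tasks, tid, my_q);
    }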
• Work queues live in global memory, with put and get operations (a minimal queue is sketched below)
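A minimal global-memory queue sketch with atomic put and get. It assumes a fixed-capacity ring buffer per worker whose slots are initialized to -1 and never more than QUEUE_CAPACITY tasks in flight per queue; a production queue would have to harden these simplifications.

    #define QUEUE_CAPACITY 1024     // illustrative bound

    struct WorkQueue {
        int buf[QUEUE_CAPACITY];    // task ids; empty slots hold -1
        int head;                   // next slot to get from
        int tail;                   // next slot to put into
    };

    // Put: reserve a slot with an atomic, then publish the task id into it.
    __device__ void queue_put(WorkQueue *q, int task_id) {
        int slot = atomicAdd(&q->tail, 1);
        atomicExch(&q->buf[slot % QUEUE_CAPACITY], task_id);
    }

    // Get: claim the oldest reserved slot, then wait until its id has been
    // published. Returns -1 when the queue is (momentarily) empty.
    __device__ int queue_get(WorkQueue *q) {
        int h = atomicAdd(&q->head, 0);             // fresh read of head
        int t = atomicAdd(&q->tail, 0);             // fresh read of tail
        while (h < t) {
            if (atomicCAS(&q->head, h, h + 1) == h) {
                int id;
                // The producer bumps tail before publishing, so spin until
                // the id appears, clearing the slot back to -1 for reuse.
                while ((id = atomicExch(&q->buf[h % QUEUE_CAPACITY], -1)) < 0) { }
                return id;
            }
            h = atomicAdd(&q->head, 0);             // lost the race: retry
            t = atomicAdd(&q->tail, 0);
        }
        return -1;
    }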
• Workers (thread blocks):
  • unrestricted number of workers
  • bounded maximum number of stealing rounds (sketched below)
  • agnostic to the intra-thread-block configuration
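A sketch of the bounded stealing path suggested by the "max stealing rounds" bullet; the round-robin victim selection and the MAX_STEAL_ROUNDS value are assumptions, not necessarily the runtime's policy.

    #define MAX_STEAL_ROUNDS 4      // illustrative tunable

    // A worker whose own queue is empty probes the other workers' queues for
    // a bounded number of rounds, then gives up and lets the caller retry.
    __device__ int try_steal(WorkQueue *queues, int num_workers, int self,
                             int max_rounds) {
        for (int round = 0; round < max_rounds; ++round) {
            for (int v = 0; v < num_workers; ++v) {
                if (v == self) continue;            // skip our own queue
                int task_id = queue_get(&queues[v]);
                if (task_id >= 0) return task_id;   // stole a ready task
            }
        }
        return -1;      // nothing to steal right now
    }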
Experimental Evaluation
• Simple stencil programs from the PolyBench suite: Jacobi-2D 5pt, FDTD-2D, ADI
• Compared against the best known wavefront implementations (Konstantinidis et al., LCPC 2013)
• Rectangular parametric tiling is applied, enabling run-time tile-size exploration
[Figure: a rectangular tile mapped to a thread block; task parallelism across tiles]
Experimental Evaluation
• NVIDIA GTX 670
  • Compute capability: 3.0
  • Driver/runtime version: 6.5
  • Global memory: 2 GB
  • Multiprocessors: 7
  • ECC: off
Experimental Evaluation
• Jacobi 2D 5pt – Execution Timelines
[Timeline figures comparing runs with 10, 16, 23, and 30 workers; with 30 workers, 23 are active and the remaining 7 are redundant]
• FDTD 2D – Execution Timelines
[Timeline figures comparing runs with 22 and 33 workers]
• ADI – Execution Timelines
[Timeline figures comparing runs with 22 and 33 workers]
Conclusions
• Effective task parallelism with on-the-fly dependence resolution
• A single persistent GPU kernel prevents global synchronization overhead
• Evaluated against wavefront parallelism on stencil computations
The End
• Questions?