An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

Tao Chen, Shreesha Srinath, Christopher Batten, G. Edward Suh

Computer Systems Laboratory
School of Electrical and Computer Engineering
Cornell University

51st Int’l Symp. on Microarchitecture, Fall 2018
• Motivation • Computation Model Accelerator Architecture Design Methodology Evaluation

Accelerating Static Parallel Algorithms on Reconfigurable Hardware

◮ Emerging CPU+FPGA platforms (Xilinx Zynq, Altera Cyclone SoC)
◮ HLS maps parallelism statically to highly pipelined and parallel PEs

    for (int i=0; i<n; i++)
      c[i] = a[i] + b[i];

    __kernel void vvadd( __global int* c, __global int* a,
                         __global int* b, int n )
    {
      int id = get_global_id(0);
      if ( id < n )
        c[id] = a[id] + b[id];
    }

[Figure: High-Level Synthesis takes the code above to Reconfig Hardware (FPGA), which sits beside a General Purpose CPU; both attach to a Shared Mem Sys]

Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware C. Batten 2 / 18
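To make "HLS maps parallelism statically" concrete, the sketch below splits the vvadd loop into fixed contiguous chunks, one per PE, with every chunk boundary known before execution starts. It is a plain C illustration, not HLS tool output: `static_block_range`, `vvadd_static`, and the `num_pes` parameter are hypothetical names introduced here, and the chunks run one after another rather than concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: half-open range [*begin, *end) that PE `pe`
 * (0-based) out of `num_pes` processes when n iterations are split
 * into contiguous, near-equal blocks. */
void static_block_range(size_t n, size_t num_pes, size_t pe,
                        size_t *begin, size_t *end)
{
    assert(num_pes > 0 && pe < num_pes);
    size_t base = n / num_pes;   /* iterations every PE receives */
    size_t extra = n % num_pes;  /* the first `extra` PEs get one more */
    *begin = pe * base + (pe < extra ? pe : extra);
    *end = *begin + base + (pe < extra ? 1 : 0);
}

/* Vector add as num_pes independent chunks. In hardware the chunks
 * would run in parallel; here they run sequentially. */
void vvadd_static(int *c, const int *a, const int *b,
                  size_t n, size_t num_pes)
{
    for (size_t pe = 0; pe < num_pes; pe++) {
        size_t begin, end;
        static_block_range(n, num_pes, pe, &begin, &end);
        for (size_t i = begin; i < end; i++)
            c[i] = a[i] + b[i];
    }
}
```

Because the assignment of iterations to PEs is fixed up front, this style works well only when the amount of work is known ahead of time, which is the limitation the following slides address.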
Programmers are increasingly moving from thread- to task-centric programming

◮ Task-parallel programming frameworks enable creating tasks dynamically as the program executes
  ⊲ Intel Cilk Plus, Intel C++ TBB, Microsoft’s .NET TPL, Java’s Fork/Join, OpenMP
◮ Benefits of this approach:
  ⊲ hierarchical data structures
  ⊲ divide-and-conquer algorithms
  ⊲ adaptive algorithms
  ⊲ arbitrary nesting, composition
  ⊲ automatic load balancing
  ⊲ efficient in theory and practice

    int fib( int n )
    {
      if (n < 2)
        return n;
      int x = spawn fib(n-1);
      int y = fib(n-2);
      sync;
      return x + y;
    }

[Figure: Reconfig Hardware (FPGA) and General Purpose CPU attached to a Shared Mem Sys]
Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

◮ Motivation
◮ Computation Model
◮ Accelerator Architecture
◮ Design Methodology
◮ Evaluation

[Figure: the fib task-parallel example shown alongside the platform block diagram: Reconfig Hardware (FPGA) and General Purpose CPU attached to a Shared Mem Sys]
Computation Model
Explicit Continuation Passing

[Figure: task graph over tasks A through G. Labels recovered from the figure: parent task A; spawn edges to child tasks B and C carrying cont = 〈D,1〉 and cont = 〈D,2〉; a further spawn carrying cont = 〈G,2〉; tasks E and F; make-successor edges; successor G with cont = 〈D,2〉; successor task D with slots arg1 and arg2. Pattern captions: Fork/Join Pattern, Data-Flow Pattern, Data-Parallel Pattern.]
Example of Explicit Continuation Passing w/ Cilk

With spawn/sync:

    int fib( int n )
    {
      if (n < 2)
        return n;
      int x = spawn fib(n-1);
      int y = fib(n-2);
      sync;
      return x + y;
    }

With explicit continuation passing:

    task fib( cont int k, int n )
    {
      if ( n < 2 )
        send_argument( k, n );
      else {
        cont int x, y;
        spawn_next sum( k, ?x, ?y );
        spawn fib( x, n-1 );
        spawn fib( y, n-2 );
      }
    }

    task sum( cont int k, int x, int y )
    {
      send_argument( k, x+y );
    }

◮ Cilk-1 used explicit continuation passing (JPDC’96)
◮ Cilk-5 used call/return semantics for parallelism (PLDI’98)
◮ Explicit continuation passing is an elegant match for hardware
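The explicit continuation passing version can be emulated in plain C to show what `spawn_next` and `send_argument` have to track. The sketch below is a sequential illustration under stated assumptions, not the Cilk-1 runtime or the accelerator's implementation: `closure_t`, `cont_t`, the `pending` join counter, `fib_task`, and `fib_cps` are hypothetical names, spawned tasks run inline instead of being queued, and `sum` is hard-wired as the only successor task.

```c
#include <assert.h>
#include <stdlib.h>

/* A continuation names one argument slot of a waiting successor task. */
typedef struct closure closure_t;
typedef struct { closure_t *target; int slot; } cont_t;

struct closure {
    int args[2];   /* argument slots filled in by child tasks */
    int pending;   /* join counter: arguments still missing */
    cont_t k;      /* where the successor sends its own result */
};

static int final_result;  /* written when the root continuation fires */

void send_argument(cont_t k, int value);

/* Successor task: runs once both of its arguments have arrived. */
void sum(closure_t *self)
{
    cont_t k = self->k;
    int r = self->args[0] + self->args[1];
    free(self);
    send_argument(k, r);
}

/* Deliver a value to a continuation; fire the successor when ready. */
void send_argument(cont_t k, int value)
{
    if (k.target == NULL) {      /* root continuation: final answer */
        final_result = value;
        return;
    }
    k.target->args[k.slot] = value;
    if (--k.target->pending == 0)
        sum(k.target);
}

void fib_task(cont_t k, int n)
{
    if (n < 2) {
        send_argument(k, n);
        return;
    }
    closure_t *s = malloc(sizeof *s);   /* spawn_next sum(k, ?x, ?y) */
    assert(s != NULL);
    s->pending = 2;
    s->k = k;
    cont_t x = { s, 0 }, y = { s, 1 };
    fib_task(x, n - 1);                 /* spawn fib(x, n-1), run inline */
    fib_task(y, n - 2);                 /* spawn fib(y, n-2), run inline */
}

/* Convenience wrapper: run fib with a root continuation. */
int fib_cps(int n)
{
    cont_t root = { NULL, 0 };
    fib_task(root, n);
    return final_result;
}
```

No task ever waits for a child to return: a parent hands its continuation to a successor closure and finishes, and the last arriving argument triggers the successor. That absence of a call stack is what makes this style straightforward to map onto hardware queues and counters.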
Accelerator Architecture
Scheduling Tasks with Work Stealing

[Figure: animation over four PEs (PE 0 to PE 3) with a Task Queues row and a Work in Progress row. Frames, in order:]
  ⊲ Task queues and work-in-progress slots are empty
  ⊲ Task A is in progress
  ⊲ Spawn Task B: Task B enters a task queue while Task A is in progress
  ⊲ Dequeue Task B: Task B is in progress
  ⊲ Spawn Task C: Task C enters a task queue while Task B is in progress
  ⊲ Spawn Task D: the task queue holds Task C and Task D while Task B is in progress
  ⊲ Steal Task D, Steal Task C: Tasks B, D, and C are in progress
  ⊲ Spawn Task E, Spawn Task F: Tasks E and F enter task queues while Tasks D and C are in progress
  ⊲ Steal Task E, Steal Task F: Tasks E, D, C, and F are in progress

◮ Work stealing has good performance, space requirements, and communication overheads in both theory and practice
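The spawn, dequeue, and steal steps in the animation can be sketched as operations on per-PE queues. The code below is a single-threaded illustration under stated assumptions: `task_queue_t`, `queue_push`, `queue_pop`, `queue_steal`, and `next_task` are hypothetical names, the owner takes its newest task while a thief takes the victim's oldest task (the classic Cilk-style policy, which the slides do not spell out), and victims are scanned in round-robin order rather than chosen at random. A real implementation would also need synchronization between owner and thieves.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_CAP 16

/* Hypothetical per-PE task queue: a small bounded deque of task IDs. */
typedef struct {
    int tasks[QUEUE_CAP];
    int head;  /* index of the oldest task (steal end) */
    int tail;  /* one past the newest task (owner end) */
} task_queue_t;

void queue_init(task_queue_t *q) { q->head = q->tail = 0; }

/* Owner spawns a task: push at the tail. Returns false if full. */
bool queue_push(task_queue_t *q, int task)
{
    if (q->head == q->tail)
        q->head = q->tail = 0;   /* reclaim space once the queue drains */
    if (q->tail == QUEUE_CAP)
        return false;
    q->tasks[q->tail++] = task;
    return true;
}

/* Owner dequeues its newest task (LIFO). */
bool queue_pop(task_queue_t *q, int *task)
{
    if (q->head == q->tail)
        return false;
    *task = q->tasks[--q->tail];
    return true;
}

/* Thief steals the victim's oldest task (FIFO). */
bool queue_steal(task_queue_t *q, int *task)
{
    if (q->head == q->tail)
        return false;
    *task = q->tasks[q->head++];
    return true;
}

/* An idle PE tries its own queue first, then scans the other PEs. */
bool next_task(task_queue_t *queues, int num_pes, int self, int *task)
{
    if (queue_pop(&queues[self], task))
        return true;
    for (int i = 1; i < num_pes; i++) {
        int victim = (self + i) % num_pes;
        if (queue_steal(&queues[victim], task))
            return true;
    }
    return false;
}
```

For example, if PE 1 pushes two tasks and PE 0 is idle, `next_task` on PE 0 steals the older of the two, while `next_task` on PE 1 pops the newer one.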
“Flexible” Architectural Template

[Figure: block diagram of the architectural template. Labels recovered from the figure: FPGA; CPU; Interface; Tiles; Networks (Stealing Net IF, Arg/Task Net IF); Pending Task Store; Arg & Task Router; Cache; L1$; Cache Coherent Interconnect; L2 Cache; Off-Chip DRAM. Worker detail: TMU and Processing Element, with ports labeled steal, task, succ, task in, task out.]