leap shared memories
play

LEAP Shared Memories: Automating the Construction of FPGA Coherent - PowerPoint PPT Presentation

LEAP Shared Memories: Automating the Construction of FPGA Coherent Memories Hsin-Jung Yang , Kermin E. Fleming , Michael Adler , and Joel Emer Massachusetts Institute of Technology Intel Corporation May 12th, FCCM 2014


  1. LEAP Shared Memories: Automating the Construction of FPGA Coherent Memories Hsin-Jung Yang † , Kermin E. Fleming ‡ , Michael Adler ‡ , and Joel Emer †‡ † Massachusetts Institute of Technology ‡ Intel Corporation May 12th, FCCM 2014

  2. Motivation • Goal: simplifying parallel programming on FPGAs

  3. Motivation • Goal: simplifying parallel programming on FPGAs • 2D Heat Transfer Equation N M

  4. Motivation • Goal: simplifying parallel programming on FPGAs • 2D Heat Transfer Equation for( int t = 0 ; t < T ; t ++){ #pragma omp parallel num_threads(4){ int thread_id = omp_get_thread_num (); int bid_x = thread_id % 2 ; N int bid_y = thread_id / 2 ; for ( int y = bid_y *( N / 2 ); y < ( 1 + bid_y )*( N / 2 ); y ++) for ( int x = bid_x *( M / 2 ); x < ( 1 + bid_x )*( M / 2 ); x ++) U [ t + 1 , x , y ] = C0 * U [ t , x , y ] + Cx *( U [ t , x - 1 , y ]+ U [ t , x + 1 , y ]) M + Cy *( U [ t , x , y - 1 ]+ U [ t , x , y + 1 ]); } }

  5. Motivation • Goal: simplifying parallel programming on FPGAs • 2D Heat Transfer Equation for( int t = 0 ; t < T ; t ++){ #pragma omp parallel num_threads(4){ int thread_id = omp_get_thread_num (); int bid_x = thread_id % 2 ; N int bid_y = thread_id / 2 ; for ( int y = bid_y *( N / 2 ); y < ( 1 + bid_y )*( N / 2 ); y ++) for ( int x = bid_x *( M / 2 ); x < ( 1 + bid_x )*( M / 2 ); x ++) U [ t + 1 , x , y ] = C0 * U [ t , x , y ] + Cx *( U [ t , x - 1 , y ]+ U [ t , x + 1 , y ]) M + Cy *( U [ t , x , y - 1 ]+ U [ t , x , y + 1 ]); } operation on the shared array }

  6. Motivation • Goal: simplifying parallel programming on FPGAs • 2D Heat Transfer Equation for( int t = 0 ; t < T ; t ++){ #pragma omp parallel num_threads(4){ int thread_id = omp_get_thread_num (); int bid_x = thread_id % 2 ; N int bid_y = thread_id / 2 ; for ( int y = bid_y *( N / 2 ); y < ( 1 + bid_y )*( N / 2 ); y ++) for ( int x = bid_x *( M / 2 ); x < ( 1 + bid_x )*( M / 2 ); x ++) U [ t + 1 , x , y ] = C0 * U [ t , x , y ] + Cx *( U [ t , x - 1 , y ]+ U [ t , x + 1 , y ]) M + Cy *( U [ t , x , y - 1 ]+ U [ t , x , y + 1 ]); } operation on the shared array } implicit barrier synchronization

  7. Motivation • Goal: simplifying parallel programming on FPGAs • 2D Heat Transfer Equation N M

  8. Motivation • Goal: simplifying parallel programming on FPGAs • 2D Heat Transfer Equation N How to implement on FPGAs? M

  9. Programming on FPGA • 2D Heat Transfer Equation (using FPGA Block RAM) Engine Interface N RAM Block M

  10. Programming on FPGA • 2D Heat Transfer Equation (using FPGA Block RAM) Engine Interface N RAM Block M

  11. Programming on FPGA • 2D Heat Transfer Equation (using FPGA Block RAM) Engine Interface N RAM Block M

  12. Programming on FPGA • 2D Heat Transfer Equation (using FPGA Block RAM) Engine Interface N RAM Block M Difficulty: Problem size cannot fit in RAM block •

  13. Programming on FPGA • 2D Heat Transfer Equation (using LEAP Scratchpad) Engine Interface N M unlimited address space M. Adler et al. , “LEAP Scratchpads,” in FPGA, 2011.

  14. Programming on FPGA • 2D Heat Transfer Equation (using LEAP Scratchpad) Engine Interface N M unlimited address space M. Adler et al. , “LEAP Scratchpads,” in FPGA, 2011.

  15. Programming on FPGA • 2D Heat Transfer Equation (using LEAP Scratchpad) Engine Interface N M unlimited address space Difficulty: Single engine • is too slow M. Adler et al. , “LEAP Scratchpads,” in FPGA, 2011.

  16. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M

  17. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M Difficulty: Performance is limited

  18. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M Difficulty: Performance is limited Serialized requests

  19. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M Difficulty: Performance is limited Serialized requests Long latency if across FPGAs

  20. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M

  21. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M Difficulty: • Edge pixels are shared

  22. Parallel Programming on FPGA • 2D Heat Transfer Equation Engine 2 Engine 3 Engine 4 Engine 1 N Interface M Difficulty: • Edge pixels are shared Need cache coherence!

  23. Shared Memory Services: Coherent Scratchpad (CS) Engine Engine Engine Engine Interface Shared Cache Ring-based snoopy protocol pre-order request ordered request response (1) Ordering point

  24. Shared Memory Services: Coherent Scratchpad (CS) Engine Engine Engine Engine Interface Shared Cache Ring-based snoopy protocol pre-order request Modified MOSI protocol ordered request response (1) Ordering point

  25. Shared Memory Services: Coherent Scratchpad (CS) Engine Engine Engine Engine Interface Shared Cache Ring-based snoopy protocol pre-order request Modified MOSI protocol ordered request response (1) Ordering point (1) (2) Store data

  26. Shared Memory Services: Coherent Scratchpad (CS) Engine Engine Engine Engine Interface Shared Cache Ring-based snoopy protocol pre-order request Modified MOSI protocol ordered request response (1) Ordering point (1) (1) (2) Store data (2) Shared data ca (3) Store owner-bit information for every address

  27. Shared Memory Services: Coherent Scratchpad (CS) Engine Engine Engine Engine Interface Coherent Scratchpad Controller Shared Cache Ring-based snoopy protocol pre-order request Modified MOSI protocol ordered request response (1) Ordering point (2) Store data (3) Store owner-bit information owner bit data

  28. Shared Memory Services: Coherent Scratchpad

  29. Parallel Programming on FPGA • 2D Heat Transfer Equation for( int t = 0 ; t < T ; t ++){ #pragma omp parallel num_threads(4){ int thread_id = omp_get_thread_num (); int bid_x = thread_id % 2 ; N int bid_y = thread_id / 2 ; for ( int y = bid_y *( N / 2 ); y < ( 1 + bid_y )*( N / 2 ); y ++) for ( int x = bid_x *( M / 2 ); x < ( 1 + bid_x )*( M / 2 ); x ++) U [ t + 1 , x , y ] = C0 * U [ t , x , y ] + Cx *( U [ t , x - 1 , y ]+ U [ t , x + 1 , y ]) M + Cy *( U [ t , x , y - 1 ]+ U [ t , x , y + 1 ]); } } implicit barrier synchronization

  30. Parallel Programming on FPGA • 2D Heat Transfer Equation for( int t = 0 ; t < T ; t ++){ #pragma omp parallel num_threads(4){ int thread_id = omp_get_thread_num (); int bid_x = thread_id % 2 ; N int bid_y = thread_id / 2 ; for ( int y = bid_y *( N / 2 ); y < ( 1 + bid_y )*( N / 2 ); y ++) for ( int x = bid_x *( M / 2 ); x < ( 1 + bid_x )*( M / 2 ); x ++) U [ t + 1 , x , y ] = C0 * U [ t , x , y ] + Cx *( U [ t , x - 1 , y ]+ U [ t , x + 1 , y ]) M + Cy *( U [ t , x , y - 1 ]+ U [ t , x , y + 1 ]); } operation on the shared array } implicit barrier synchronization

  31. Parallel Programming on FPGA • 2D Heat Transfer Equation for( int t = 0 ; t < T ; t ++){ #pragma omp parallel num_threads(4){ int thread_id = omp_get_thread_num (); int bid_x = thread_id % 2 ; N int bid_y = thread_id / 2 ; for ( int y = bid_y *( N / 2 ); y < ( 1 + bid_y )*( N / 2 ); y ++) for ( int x = bid_x *( M / 2 ); x < ( 1 + bid_x )*( M / 2 ); x ++) U [ t + 1 , x , y ] = C0 * U [ t , x , y ] + Cx *( U [ t , x - 1 , y ]+ U [ t , x + 1 , y ]) M + Cy *( U [ t , x , y - 1 ]+ U [ t , x , y + 1 ]); } operation on the shared array } implicit barrier synchronization Finish the inner loop operations •  Computations complete  Memory operations complete Wait until all threads are finished •

  32. Shared Memory Services: Memory Consistency • Block RAM/Private Scratchpad Interface interface MEM_IFC#(type t_ADDR, type t_DATA); method void readReq (t_ADDR addr); method void write (t_ADDR addr, t_DATA data); method t_DATA readResp (); endinterface • Coherent Scratchpad Interface interface MEM_IFC#(type t_ADDR, type t_DATA); method void readReq (t_ADDR addr); method void write (t_ADDR addr, t_DATA data); method t_DATA readResp (); // t_REQ r := {READ, WRITE, FULL} method Bool requestPending (t_REQ r); endinterface

  33. Shared Memory Services: Memory Consistency • Block RAM/Private Scratchpad Interface interface MEM_IFC#(type t_ADDR, type t_DATA); method void readReq (t_ADDR addr); method void write (t_ADDR addr, t_DATA data); method t_DATA readResp (); endinterface • Coherent Scratchpad Interface interface MEM_IFC#(type t_ADDR, type t_DATA); method void readReq (t_ADDR addr); method void write (t_ADDR addr, t_DATA data); method t_DATA readResp (); // t_REQ r := {READ, WRITE, FULL} method Bool requestPending (t_REQ r); Fence support endinterface (memory consistency)

Recommend


More recommend