Architectural Support for Parallel Reduction in Scalable Shared-Memory Multiprocessors (PowerPoint PPT Presentation)


  1. Architectural Support for Parallel Reduction in Scalable Shared-Memory Multiprocessors. María Jesús Garzarán, M. Prvulovic §, Y. Zhang §, A. Jula *, H. Yu *, L. Rauchwerger *, and J. Torrellas §. U. of Zaragoza, Spain; § University of Illinois; * Texas A&M University

  2. Motivation
  • Reductions are important in scientific codes:
      for (...) {
        ...
        x = x op expression;
        ...
      }
  • Reduction parallelization algorithms are not scalable:
      parallel_for (...) {
        ...
        lock(w[x[i]]);
        x = x op expression;
        unlock(w[x[i]]);
        ...
      }

  3. Contribution
  • New architectural support for parallel reductions in CC-NUMA:
    - Speeds up parallel reduction
    - Makes parallel reduction scalable
  • Increases 16-processor speedup from 2.7 to 7.6
  [Figure: speedup vs. number of processors (1, 4, 8, 16) for the optimized (Opt) and unoptimized (NoOpt) versions]

  4. Outline
  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

  5. Background on Reduction
  • Reduction operation:
      for (...) {
        ...
        x = x op expression;
        ...
      }
    - op: an associative and commutative operator
    - x: does not occur in expression or anywhere else in the loop
  • There may be complex flow dependences across iterations
    - Parallelization of reductions needs special transformations
  • Example:
      for (i = 0; i < Nodes; i++)
        w[x[i]] += expression;

  6. Outline
  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

  7. Parallelizing Reduction in Software (I)
  • Enclose the access in an unordered critical section:
      parallel_for (...) {
        lock(w[x[i]]);
        w[x[i]] += expression;
        unlock(w[x[i]]);
      }
  • Drawbacks:
    - Overhead
    - Contention increases with the number of processors

  8. Parallelizing Reduction in Software (II)
  • Each processor accumulates on a private array:
      /* Clear private array */
      for (i = 0; i < array_size; i++)
        w_priv[pid][i] = 0;
      parallel_for (...)
        w_priv[pid][x[i]] += expression;
      barrier();
      /* Merge */
      for (i = MyBegin; i < MyEnd; i++)
        for (p = 0; p < NumProcs; p++)
          w[i] += w_priv[p][i];
      barrier();

  9. Drawbacks of the Privatization Method
  • Initialization phase:
    - Sweeps the cache before the parallel loop starts
      /* Clear private array */
      for (i = 0; i < array_size; i++)
        w_priv[pid][i] = 0;

  10. Drawbacks of the Privatization Method
  • Merging phase:
    - Work proportional to the array size
  [Figure: private arrays of P1 and P2 merged into the shared array]

  11. Drawbacks of the Privatization Method
  • Merging phase:
    - Work proportional to the array size
  [Figure: private arrays of P1 through P4 merged into the shared array]
  This method is not scalable.

  12. Outline
  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

  13. Main Idea of PCLR
  Use non-coherent cache lines in the different processors as the temporary private arrays:
  • Remove the initialization phase
  • Accumulate on cache lines
  • Remove the merging phase

  14. Removing Initialization
  • Cache lines are initialized on demand, on cache misses, so the clear loop disappears:
      /* Clear private array -- removed by PCLR */
      for (i = 0; i < array_size; i++)
        w_priv[pid][i] = 0;
      parallel_for (...)
        w_priv[pid][x[i]] += expression;
      barrier();
      /* Merge */
      for (i = MyBegin; i < MyEnd; i++)
        for (p = 0; p < NumProcs; p++)
          w[i] += w_priv[p][i];
      barrier();

  15. Removing Initialization
  [Figure: on a cache miss, the directory returns the neutral element to the CPU's cache instead of fetching the line from memory over the network]

  16. Accumulating on Cache Lines
  • No need to allocate a private array; accumulate directly on the shared array:
      parallel_for (...)
        w[x[i]] += expression;    /* replaces w_priv[pid][x[i]] += expression */
      barrier();
      /* Merge */
      for (i = MyBegin; i < MyEnd; i++)
        for (p = 0; p < NumProcs; p++)
          w[i] += w_priv[p][i];
      barrier();

  17. Accumulating on Cache Lines
  [Figure: the CPU accumulates on private cache lines with a load / add / store sequence; the directory, memory, and network are not involved]

  18. Removing the Merge
  • Lines are accumulated at the home node on displacements:
      parallel_for (...)
        w[x[i]] += expression;
      barrier();
      CacheFlush();
      barrier();
  The software merge loop over w_priv is no longer needed.

  19. Removing the Merge
  [Figure: on a displacement, the line travels over the network and the home directory/memory merges it with an add]

  20. Recognizing Reduction Data
  • Naïve approach:
    - New load & store instructions
  • Advanced mechanism:
    - Shadow addresses [like Impulse]

  21. Shadow Addresses
  • The directory recognizes shadow addresses and translates them into the original ones
  [Figure: physical memory holds the original reduction array; the shadow reduction array maps to uninstalled memory, which the directory recognizes and translates]

  22. Atomicity Issues
  • A reduction op is composed of a load-store instruction pair
  • A problem appears if a cache line is displaced between the reduction load and the store:
      load  r1, addr
      add   r1, r1, r3
      store r1, addr

  23. Atomicity Issues (example, step 1)
  r3 = 1; the cache line at addr holds 5; home memory at addr holds 2. The final result should be 8 (2 + 5 + 1).
      load  r1, addr    <- PC
      add   r1, r1, r3
      store r1, addr

  24. Atomicity Issues (example, step 2)
  After the load, r1 = 5. The line (holding 5) is then displaced, so home memory becomes 2 + 5.
      load  r1, addr
      add   r1, r1, r3  <- PC
      store r1, addr

  25. Atomicity Issues (example, step 3)
  The add executes: r1 = 6.
      load  r1, addr
      add   r1, r1, r3
      store r1, addr    <- PC

  26. Atomicity Issues (example, step 4)
  The store writes 6 into a freshly installed cache line.

  27. Atomicity Issues (example, step 5)
  When the line holding 6 is later displaced, home memory becomes 2 + 5 + 6 = 13, but the final result should be 8: the partial sum 5 was merged twice.

  28. Solution to the Atomicity Problem
  • Atomically exchange the neutral element with the memory contents:
      Before:               After:
      load  r1, addr        load  r1, neutral
      add   r1, r1, r3      swap  r1, addr
      store r1, addr        add   r1, r1, r3
                            store r1, addr
  The line can now safely be displaced between swap and store, because it holds only the neutral element.

  29. Summary of Architectural Support
  Special support in the directory/network controller:
  • Intercept a reduction cache miss and return the neutral element
  • Use an ALU to merge data at displacements or at the loop's end
  • Reduction lines can be dirty in multiple caches

  30. Advantages of PCLR
  • Removes the initialization phase:
    - Avoids cache sweeping
    - No need to allocate private arrays
  • Removes the merging phase:
    - Work at the end is proportional to cache size instead of array size

  31. Outline
  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

  32. Evaluation Methodology
  • Execution-driven simulator
  • Scalable multiprocessor: 4-16 processors
  • Detailed superscalar processor model
  • 32 KB 2-way L1, 512 KB 4-way L2
  • Round-trip latencies without contention: L1 (2 cyc), L2 (10 cyc), local memory (104 cyc), 2-hop (297 cyc)
  • Floating-point unit in the directory controller:
    - Fully pipelined
    - Latency: 6 processor cycles

  33. Applications
  • Fortran and C codes
  • Loops with reduction ops identified by the compiler:
    - Euler [HPF-2 suite]
    - Equake [SPECfp2000 suite]
    - Vml * [Sparse BLAS suite]
    - Charmm * [CHARMM appl]
    - Nbf * [GROMOS appl]
    (* kernels)
  • Reduction loops account for an average of 81.2% of Tsequential
