Architectural Support for Parallel Reduction in Scalable Shared-Memory Multiprocessors (PowerPoint PPT Presentation)


  1. Architectural Support for Parallel Reduction in Scalable Shared-Memory Multiprocessors. María Jesús Garzarán, M. Prvulovic §, Y. Zhang §, A. Jula *, H. Yu *, L. Rauchwerger *, and J. Torrellas §. U. of Zaragoza, Spain; § University of Illinois; * Texas A&M University

  2. Motivation
  • Reductions are important in scientific codes:
      for (...) {
        ...
        x = x op expression;
        ...
      }
  • Reduction parallelization algorithms are not scalable:
      parallel_for (...) {
        ...
        lock(w[x[i]]);
        x = x op expression;
        unlock(w[x[i]]);
        ...
      }

  3. Contribution
  • New architectural support for parallel reductions in CC-NUMA:
    - Speeds up parallel reduction
    - Makes parallel reduction scalable
  • Increases 16-processor speedup from 2.7 to 7.6
  [Figure: speedup vs. number of processors (1, 4, 8, 16) for the optimized (Opt) and unoptimized (NoOpt) versions]

  4. Outline
  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

  5. Background on Reduction
  • Reduction operation:
      for (...) {
        ...
        x = x op expression;
        ...
      }
    - op: an associative and commutative operator
    - x: does not occur in expression or anywhere else in the loop
  • There may be complex flow dependences across iterations
    - Parallelization of reductions needs special transformations
  • Example:
      for (i = 0; i < Nodes; i++)
        w[x[i]] += expression;

  6. Outline
  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

  7. Parallelizing Reduction in Software (I)
  • Enclose the access in an unordered critical section:
      parallel_for (...) {
        lock(w[x[i]]);
        w[x[i]] += expression;
        unlock(w[x[i]]);
      }
  • Drawbacks:
    - Overhead
    - Contention increases with the number of processors

  8. Parallelizing Reduction in Software (II)
  • Each processor accumulates on a private array:
      /* Clear private array */
      for (i = 0; i < array_size; i++)
        w_priv[pid][i] = 0;
      parallel_for (...)
        w_priv[pid][x[i]] += expression;
      barrier();
      /* Merge */
      for (i = MyBegin; i < MyEnd; i++)
        for (p = 0; p < NumProcs; p++)
          w[i] += w_priv[p][i];
      barrier();

  9. Drawbacks of the Privatization Method
  • Initialization phase:
    - Sweeps the cache before the parallel loop starts
      /* Clear private array */
      for (i = 0; i < array_size; i++)
        w_priv[pid][i] = 0;

  10. Drawbacks of the Privatization Method
  • Merging phase:
    - Work proportional to the array size
  [Figure: private arrays of P1 and P2 merged into the shared array]

  11. Drawbacks of the Privatization Method
  • Merging phase:
    - Work proportional to the array size
  [Figure: private arrays of P1 through P4 merged into the shared array]
  This method is not scalable.

  12. Outline
  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

  13. Main Idea of PCLR
  Use non-coherent cache lines in the different processors as the temporary private arrays:
  • Remove the initialization phase
  • Accumulate on cache lines
  • Remove the merging phase

  14. Removing Initialization
  • Cache lines are initialized on demand, on cache misses, so the clear loop disappears:
      /* Clear private array -- removed by PCLR */
      for (i = 0; i < array_size; i++)
        w_priv[pid][i] = 0;
      parallel_for (...)
        w_priv[pid][x[i]] += expression;
      barrier();
      /* Merge */
      for (i = MyBegin; i < MyEnd; i++)
        for (p = 0; p < NumProcs; p++)
          w[i] += w_priv[p][i];
      barrier();

  15. Removing Initialization
  [Figure: on a cache miss, the directory returns the neutral element to the CPU's cache instead of fetching the line from memory over the network]

  16. Accumulating on Cache Lines
  • No need to allocate a private array; accumulate directly on the shared array:
      parallel_for (...)
        w[x[i]] += expression;    /* replaces w_priv[pid][x[i]] += expression */
      barrier();
      /* Merge */
      for (i = MyBegin; i < MyEnd; i++)
        for (p = 0; p < NumProcs; p++)
          w[i] += w_priv[p][i];
      barrier();

  17. Accumulating on Cache Lines
  [Figure: the CPU accumulates on private cache lines with a load / add / store sequence; the directory, memory, and network are not involved]

  18. Removing the Merge
  • Lines are accumulated at the home node on displacements:
      parallel_for (...)
        w[x[i]] += expression;
      barrier();
      CacheFlush();
      barrier();
  The software merge loop over w_priv is no longer needed.

  19. Removing the Merge
  [Figure: on a displacement, the line travels over the network and the home directory/memory merges it with an add]

  20. Recognizing Reduction Data
  • Naïve approach:
    - New load & store instructions
  • Advanced mechanism:
    - Shadow addresses [like Impulse]

  21. Shadow Addresses
  • The directory recognizes shadow addresses and translates them into the original ones
  [Figure: physical memory holds the original reduction array; the shadow reduction array maps to uninstalled memory, which the directory recognizes and translates]

  22. Atomicity Issues
  • A reduction op is composed of a load-store instruction pair
  • A problem appears if a cache line is displaced between the reduction load and the store:
      load  r1, addr
      add   r1, r1, r3
      store r1, addr

  23. Atomicity Issues (example, step 1)
  r3 = 1; the cache line at addr holds 5; home memory at addr holds 2. The final result should be 8 (2 + 5 + 1).
      load  r1, addr    <- PC
      add   r1, r1, r3
      store r1, addr

  24. Atomicity Issues (example, step 2)
  After the load, r1 = 5. The line (holding 5) is then displaced, so home memory becomes 2 + 5.
      load  r1, addr
      add   r1, r1, r3  <- PC
      store r1, addr

  25. Atomicity Issues (example, step 3)
  The add executes: r1 = 6.
      load  r1, addr
      add   r1, r1, r3
      store r1, addr    <- PC

  26. Atomicity Issues (example, step 4)
  The store writes 6 into a freshly installed cache line.

  27. Atomicity Issues (example, step 5)
  When the line holding 6 is later displaced, home memory becomes 2 + 5 + 6 = 13, but the final result should be 8: the partial sum 5 was merged twice.

  28. Solution to the Atomicity Problem
  • Atomically exchange the neutral element with the memory contents:
      Before:               After:
      load  r1, addr        load  r1, neutral
      add   r1, r1, r3      swap  r1, addr
      store r1, addr        add   r1, r1, r3
                            store r1, addr
  The line can now safely be displaced between swap and store, because it holds only the neutral element.

  29. Summary of Architectural Support
  Special support in the directory/network controller:
  • Intercept a reduction cache miss and return the neutral element
  • Use an ALU to merge data at displacements or at the loop's end
  • Reduction lines can be dirty in multiple caches

  30. Advantages of PCLR
  • Removes the initialization phase:
    - Avoids cache sweeping
    - No need to allocate private arrays
  • Removes the merging phase:
    - Work at the end is proportional to cache size instead of array size

  31. Outline
  • Background on Reduction
  • Parallelizing Reduction in Software
  • Our contribution: Private Cache-Line Reduction (PCLR)
  • Evaluation
  • Related Work
  • Conclusions

  32. Evaluation Methodology
  • Execution-driven simulator
  • Scalable multiprocessor: 4-16 processors
  • Detailed superscalar processor model
  • 32 KB 2-way L1, 512 KB 4-way L2
  • Round-trip latencies without contention: L1 (2 cyc), L2 (10 cyc), local memory (104 cyc), 2-hop (297 cyc)
  • Floating-point unit in the directory controller:
    - Fully pipelined
    - Latency: 6 processor cycles

  33. Applications
  • Fortran and C codes
  • Loops with reduction ops identified by the compiler:
    - Euler [HPF-2 suite]
    - Equake [SPECfp2000 suite]
    - Vml * [Sparse BLAS suite]
    - Charmm * [CHARMM appl]
    - Nbf * [GROMOS appl]
    (* kernels)
  • Reduction loops account for an average of 81.2% of Tsequential
