Case Study: Gauss-Seidel on Multicores
Erik Hagersten, Uppsala University, Sweden
Thanks: Dan Wallin (arch), Henrik Löf (sci comp) and Sverker Holmgren (sci comp)
From Wallin et al., ICS 2006

Criteria for HPC Algorithms
• Past:
  • Minimize communication
  • Maximize scalability (1000s of CPUs)
• Multicores today:
  • Communication is “for free” [on some multicores]
  • Scalability is limited to 32 threads
  • The caches are tiny
  • Memory bandwidth is scarce
  • Data locality is key! (Both for Capacity and Capability Computing!)
Institutionen för informationsteknologi | www.it.uu.se

Natural Order Gauss-Seidel
[Figure: 2D grid swept in natural order, with points labeled by iteration number. Legend: sweep path; previous; current; data dependence; 1,2,3,4 = iteration number; cacheline layout.]

Natural Order Gauss-Seidel
[Figure: the same grid later in the sweep; the updated points are labeled with iteration 1.]
IF (convergence_test) <done> ELSE <iterate again>

Natural Order Gauss-Seidel
[Figure: the same grid during the second sweep; the updated points are labeled with iteration 2.]
Data dependence → poor parallelism

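The natural-order sweep and its convergence loop can be sketched as follows. This is a minimal illustration, not the code from the study: it assumes a 5-point stencil for the 2D Poisson problem on a square grid stored as a list of lists, and the names `gauss_seidel_sweep`, `solve`, `tol` and `max_iters` are hypothetical.

```python
def gauss_seidel_sweep(u, f, h):
    """One natural-order sweep for -Laplace(u) = f (assumed model problem).

    Updates u in place, row by row, and returns the largest change made.
    """
    n = len(u)
    max_change = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # u[i-1][j] and u[i][j-1] already hold this iteration's values;
            # u[i+1][j] and u[i][j+1] still hold the previous iteration's.
            new = 0.25 * (u[i - 1][j] + u[i + 1][j]
                          + u[i][j - 1] + u[i][j + 1] + h * h * f[i][j])
            max_change = max(max_change, abs(new - u[i][j]))
            u[i][j] = new
    return max_change


def solve(u, f, h, tol=1e-8, max_iters=10_000):
    """Repeat sweeps until the convergence test passes.

    Returns the number of sweeps used, or max_iters if it never passed.
    """
    for iteration in range(1, max_iters + 1):
        if gauss_seidel_sweep(u, f, h) < tol:  # IF (convergence_test) <done>
            return iteration
    return max_iters
```

Each update reads values written earlier in the same sweep, which is the data dependence that makes this ordering hard to parallelize.
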
Red-Black Gauss-Seidel
[Figure: 2D grid with red-black coloring; points updated in the first half-step are labeled 0.5.]

Red-Black Gauss-Seidel, step 0.5: update the blacks
[Figure: the black points of the grid are labeled 0.5.]

Red-Black Gauss-Seidel, step 1.0: update all reds
[Figure: all points of the grid are labeled 1.]
Update all blacks
<barrier>
Update all reds
<barrier>
→ great parallelism!!!

Red-Black Gauss-Seidel: parallel version
[Figure: grid divided into bands assigned to thread 0 through thread 4; all points are labeled 1.]
IN PARALLEL {
  Update all blacks
  <barrier>
  Update all reds
  <barrier>
}

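The parallel red-black scheme can be sketched with Python threads and a barrier. This is a hedged illustration only: it assumes the 2D Laplace equation with a 5-point stencil on a square grid, a split of the interior rows into contiguous bands, and the arbitrary convention that a point (i, j) is "black" when i + j is even. The function name `red_black_parallel` is hypothetical. In standard CPython the threads will not run the arithmetic concurrently, so the sketch shows the synchronization structure rather than any speedup.

```python
import threading


def red_black_parallel(u, iterations, num_threads=4):
    """Red-black Gauss-Seidel on the interior of u, updated in place."""
    n = len(u)
    barrier = threading.Barrier(num_threads)
    rows = list(range(1, n - 1))
    band = (len(rows) + num_threads - 1) // num_threads  # rows per thread

    def update_color(my_rows, parity):
        for i in my_rows:
            # First interior column j with (i + j) % 2 == parity.
            start = 1 if (i + 1) % 2 == parity else 2
            for j in range(start, n - 1, 2):
                # All four neighbors have the other color, so they are not
                # written by any thread during this half-step.
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])

    def worker(tid):
        my_rows = rows[tid * band:(tid + 1) * band]
        for _ in range(iterations):
            update_color(my_rows, 0)  # update all blacks
            barrier.wait()            # <barrier>
            update_color(my_rows, 1)  # update all reds
            barrier.wait()            # <barrier>

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because each half-step reads only points of the other color, the result does not depend on the number of threads, which makes the sketch easy to check against a single-threaded run.
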
Any Drawbacks of the Red-Black?
• Poor cache locality of Red-Black:
  • Each element will be brought into the cache twice per iteration
• Natural Order:
  • Each element will be brought into the cache once per iteration
• You can do even better…
  • Natural Order with Temporal Blocking ☺

G-S, temporal blocking: several iterations per sweep
[Figure: 2D grid with iterations 1 to 4 in progress at the same time inside an active region. Legend: sweep path; previous; current; data dependence; 1,2,3,4 = iteration number; cacheline layout; active region.]

G-S, temporal blocking: several iterations per sweep
[Figure: the same grid with the active region moved further along the sweep path.]
In this case: 4 iterations per “sweep” (σ = 4).
σ = 1.0 for natural order G-S
σ = 0.5 for red-black G-S

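One way to picture σ iterations per sweep is a row-granularity pipeline, sketched below. This simplifies the skewed active region on the slide to whole rows and assumes the 2D Laplace 5-point stencil; `update_row`, `plain_sweeps` and `temporally_blocked_sweep` are hypothetical names. Iteration k trails iteration k - 1 by one row, so only about σ + 1 neighboring rows are touched at any step.

```python
def update_row(u, i):
    """Natural-order Gauss-Seidel update of interior row i, in place."""
    n = len(u[i])
    for j in range(1, n - 1):
        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                          + u[i][j - 1] + u[i][j + 1])


def plain_sweeps(u, sigma):
    """Reference: sigma separate natural-order sweeps over the whole grid."""
    for _ in range(sigma):
        for i in range(1, len(u) - 1):
            update_row(u, i)


def temporally_blocked_sweep(u, sigma):
    """One pass over the grid that performs sigma iterations.

    At step s, rows s, s - 1, ..., s - (sigma - 1) receive iterations
    1, 2, ..., sigma, in that order.
    """
    last = len(u) - 2                  # index of the last interior row
    for s in range(1, last + sigma):   # the final steps drain the pipeline
        for k in range(1, sigma + 1):
            i = s - (k - 1)            # row that receives iteration k now
            if 1 <= i <= last:
                update_row(u, i)
```

When row i receives iteration k, row i - 1 already holds iteration k and row i + 1 holds iteration k - 1, so the pass performs the same updates on the same operands as σ plain sweeps, only in a cache-friendlier order.
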
G-S 3D, σ = 2
[Figure: temporally blocked sweep in 3D with σ = 2.]

Acumem Graph, 3D N=129
[Figure: cache miss ratio (percent, 0 to 14) versus cache size (512 KB, 1 MB, 2 MB, 4 MB, 8 MB, 16 MB, 32 MB), with one curve each for σ = 0.5, 1, 2, 4, 8 and 16.]
Miss ratio ~ Memory bandwidth

Parallel G-S, temporal blocked
[Figure: grid split side by side among thread 0, thread 1, thread 2 and thread 3, each running a temporally blocked sweep. Legend: sweep path; previous; current; data dependence; 1,2,3,4 = iteration number; cacheline layout; active region; sync flag holding an iteration number.]
Synchronization flags

Parallel G-S, temporal blocked
[Figure: the same partitioned grid at a later step, with the synchronization flags advanced.]
Wait until “lefty” is done:
Lots of communication
• Producer/Consumer flag
• Sharing of data values

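The "wait until lefty is done" rule can be sketched with one progress flag per thread. This is a deliberately reduced illustration: it performs a single natural-order sweep (σ = 1) rather than the temporally blocked version, assumes the 2D Laplace 5-point stencil on a square grid split into vertical strips, and uses a shared `threading.Condition` as a stand-in for the producer/consumer flags. All function and variable names are hypothetical.

```python
import threading


def update_point(u, i, j):
    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])


def serial_sweep(u):
    """Reference: one natural-order sweep on a single thread."""
    n = len(u)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            update_point(u, i, j)


def pipelined_sweep(u, num_threads=4):
    """One natural-order sweep with the columns split among threads."""
    n = len(u)
    cols = list(range(1, n - 1))
    width = (len(cols) + num_threads - 1) // num_threads
    rows_done = [0] * num_threads  # sync flags: last row each thread finished
    cv = threading.Condition()

    def worker(tid):
        my_cols = cols[tid * width:(tid + 1) * width]
        for i in range(1, n - 1):
            if tid > 0:
                with cv:  # consumer: wait until lefty has finished row i
                    cv.wait_for(lambda: rows_done[tid - 1] >= i)
            for j in my_cols:
                update_point(u, i, j)
            with cv:      # producer: publish progress to the right neighbor
                rows_done[tid] = i
                cv.notify_all()

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A thread's right-hand neighbor cannot start row i before the flag is published, so the value read across the right strip boundary is still from the previous iteration, and the value read across the left boundary is already updated. The result therefore matches `serial_sweep`.
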
Parallel G-S 3D
[Figure: 3D grid partitioned among threads t0, t1, t2 and t3.]

Parallel G-S 3D
Flags:
t0  0 0 0 0 0
t1  0 0 0 0 0
t2  0 0 0 0 0
t3  0 0 0 0 0

Parallel G-S 3D
[Figure: animation step in which t0 starts working (grid label 1); the flags of t0 to t3 are all still 0.]

Parallel G-S 3D
Flags:
t0  1 0 0 0 0
t1  0 0 0 0 0
t2  0 0 0 0 0
t3  0 0 0 0 0

Parallel G-S 3D
[Figure: later animation step with t0 and t1 both active; the cacheline layout (size B bytes) is marked on the grid.]
Startup cost = (#threads - 1) / (N σ)
Communication:
• one flag synchronization per N^2/#threads ops
• one communication miss per B*N/#threads bytes

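To get a feel for the magnitudes, the three expressions above can be evaluated directly. The sketch below takes them literally as written on the slide. N = 257 and 32 threads are the values used in the performance comparison; σ = 4 and a cache line size of B = 64 bytes are assumptions chosen only for illustration, and `pipeline_costs` is a hypothetical helper name.

```python
def pipeline_costs(n, threads, sigma, line_bytes):
    """Evaluate the slide's three cost expressions for given parameters."""
    return {
        # Startup cost = (#threads - 1) / (N * sigma)
        "startup_cost": (threads - 1) / (n * sigma),
        # One flag synchronization per N^2 / #threads operations
        "ops_per_flag_sync": n ** 2 / threads,
        # One communication miss per B * N / #threads bytes
        "bytes_per_comm_miss": line_bytes * n / threads,
    }


# sigma=4 and line_bytes=64 are illustrative assumptions, not slide values.
costs = pipeline_costs(n=257, threads=32, sigma=4, line_bytes=64)
for name, value in costs.items():
    print(f"{name}: {value:.4g}")
```

With these inputs the startup term is about 3 percent, and it shrinks further as N or σ grows.
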
Parallel Execution Time
[Figure: execution time per step (sec, 0 to 2.5) for threads = 1, 2, 4 and 8, at σ = 0.5 (Red-Black, RBGS), σ = 1 (Natural Order) and σ = 2, 4, 8, 16 (Temporally Blocked GS, TBGS).]

Performance comparison with Red-Black
N = 257, 32 threads (Sun E15K, US IIIcu = SMP!!)
[Figure: performance ratio TBGS/RBGS (1.0 to 3.5) versus number of threads (0 to 30), with one curve each for σ = 0.5, 1, 2, 4, 8 and 16.]

Multicore Simulation, σ = 1
[Figure: performance ratio TBGS/RBGS (0.0 to 2.5) versus number of threads (0 to 30), for the Sun 15k and for a simulated multicore.]

Using Gauss-Seidel Smoother in a Multigrid
• G-S is an important part of many real apps!
• EX: G-S as a smoother in “Multigrid”
  • Iterative algorithm
  • A more efficient smoother cuts #iterations

One slide summary
• Today’s algorithms assume expensive communication
• The communication cost of [some] multicores is close to zero
• Locality is becoming key to performance [again]
• Redesign HPC algorithms to face this fact! (For both Capacity and Capability computing)
We show:
* 3x performance gain
* ~30x less bandwidth
Is it time to revisit more algorithms?

Niagara
[Figure: block diagram. 8 CPUs, each with an L1I and a write-through (wt) L1D cache, connect through a crossbar (Xbar = 134 GB/s) to a shared 3 MB L2 split into four banks, each with its own memory controller. Memory: 4 x DDR-2 = 25 GB/s (!).]
Shared caches: Good or bad?
