Case Study: Gauss-Seidel on Multicores

Erik Hagersten, Uppsala University, Sweden
Thanks: Dan Wallin (arch), Henrik Löf (sci comp) and Sverker Holmgren (sci comp)
From Wallin et al., ICS 2006

Criteria for HPC Algorithms

Past:
• Minimize communication
• Maximize scalability (1000s of CPUs)

Multicores today:
• Communication is “for free” [on some multicores]
• Scalability is limited to 32 threads
• The caches are tiny
• Memory bandwidth is scarce
• Data locality is key! (Both for Capacity and Capability Computing!)

Department of Information Technology (Institutionen för informationsteknologi), Uppsala University | www.it.uu.se

Natural Order Gauss-Seidel

[Figure: grid swept in natural order, shown for iterations 1 and 2. Legend: sweep path, previous, current, data dependence, iteration number (1,2,3,4), cacheline layout.]

After each sweep:
IF (convergence_test) <done> else <iterate again>

Data dependence → Poor parallelism
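The sweep-and-test loop on this slide can be sketched as follows. This is only an illustration, not the code from the talk: it assumes a 2D Laplace problem with a 5-point stencil and fixed boundary values, and the names `gauss_seidel_sweep`, `solve`, `tol` and `max_sweeps` are hypothetical.

```python
def gauss_seidel_sweep(u):
    """One natural-order sweep over the interior of u (list of lists).

    Assumed problem: 2D Laplace, 5-point stencil, boundary values fixed.
    Each update already sees the new values of its upper and left
    neighbours, which is the data dependence that limits parallelism.
    Returns the largest change made to any element.
    """
    max_change = 0.0
    for i in range(1, len(u) - 1):
        for j in range(1, len(u[0]) - 1):
            new = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
            max_change = max(max_change, abs(new - u[i][j]))
            u[i][j] = new
    return max_change


def solve(u, tol=1e-10, max_sweeps=10000):
    """Repeat sweeps until the convergence test passes; return sweeps used."""
    for sweep in range(1, max_sweeps + 1):
        if gauss_seidel_sweep(u) < tol:   # IF (convergence_test) <done>
            return sweep
    return max_sweeps                      # gave up: not converged


if __name__ == "__main__":
    n = 8
    # Boundary held at 1.0, interior starts at 0.0; exact solution is 1.0.
    grid = [[1.0 if i in (0, n - 1) or j in (0, n - 1) else 0.0
             for j in range(n)] for i in range(n)]
    print("sweeps:", solve(grid))
```

The convergence test used here (largest change in a sweep below a tolerance) is one common choice; the slide does not say which test the authors used.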

Red-Black Gauss-Seidel

[Figure: red-black (checkerboard) grid. Legend: sweep path, previous, current, data dependence, iteration number (1,2,3,4), cacheline layout.]

Step 0.5: update the blacks
Step 1.0: update all reds

Update all blacks
<barrier>
Update all reds
<barrier>
→ great parallelism!!!

Red-Black Gauss-Seidel, parallel version

[Figure: the grid partitioned among thread 0 through thread 4.]

IN PARALLEL {
  Update all blacks
  <barrier>
  Update all reds
  <barrier>
}
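A minimal serial sketch of the two half-steps is given here, under the same assumed 2D 5-point Laplace stencil. Which parity counts as "black" is an assumption, and `update_color` and `red_black_iteration` are hypothetical names. Every point of one colour reads only points of the other colour, so the order of updates inside a half-step does not matter; that is what lets threads work on it independently between barriers.

```python
def update_color(u, color, row_order=None):
    """Update every interior point whose (i + j) parity equals `color`.

    Points of one colour only read neighbours of the other colour, so the
    rows may be visited in any order (or by different threads).
    """
    rows = list(range(1, len(u) - 1)) if row_order is None else list(row_order)
    for i in rows:
        for j in range(1, len(u[0]) - 1):
            if (i + j) % 2 == color:
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])


def red_black_iteration(u, row_order=None):
    """One full iteration: blacks (step 0.5), then reds (step 1.0)."""
    update_color(u, 0, row_order)   # assumed: parity 0 plays the role of "black"
    # <barrier> would go here in a threaded version
    update_color(u, 1, row_order)   # parity 1 plays the role of "red"
    # <barrier>


if __name__ == "__main__":
    a = [[float((3 * i + 5 * j) % 11) for j in range(9)] for i in range(8)]
    b = [row[:] for row in a]
    red_black_iteration(a)                          # rows top to bottom
    red_black_iteration(b, row_order=range(6, 0, -1))  # rows bottom to top
    print("order independent:", a == b)
```

Each colour needs its own pass over the grid, which is the cache-locality cost discussed on the next slide.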

Any Drawbacks of the Red-Black?

Poor cache locality of Red-Black:
• Each element will be brought into the cache twice per iteration

Natural Order:
• Each element will be brought into the cache once per iteration

You can do even better…
• Natural Order with Temporal Blocking ☺

G-S, temporal blocking: several iterations per sweep

[Figure: grid with an active region in which iterations 1 to 4 advance together along the sweep path. Legend: sweep path, previous, current, data dependence, iteration number (1,2,3,4), cacheline layout, active region.]

In this case: 4 iterations per “sweep” (σ = 4).
σ = 1.0 for natural order G-S
σ = 0.5 for red-black G-S
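One way to picture temporal blocking is the row-skewed schedule sketched here, in which iteration k trails iteration k-1 by one grid row. This is an illustrative schedule under an assumed 2D 5-point stencil, not necessarily the blocking shape used in the talk, and `sweep_row`, `natural_sweeps` and `blocked_sweep` are hypothetical names. The active region is then about σ + 2 rows tall, and each row can be updated σ times while it is still in the cache.

```python
def sweep_row(u, i):
    """Natural-order Gauss-Seidel update of interior row i (5-point stencil)."""
    for j in range(1, len(u[0]) - 1):
        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])


def natural_sweeps(u, sigma):
    """Reference: sigma separate natural-order sweeps over the whole grid."""
    for _ in range(sigma):
        for i in range(1, len(u) - 1):
            sweep_row(u, i)


def blocked_sweep(u, sigma):
    """sigma iterations in a single pass over the grid.

    At step s, iteration k (0-based) updates row s - k. Iteration k-1 has
    just finished row s - k + 1 in the same step, so every update reads
    exactly the values it would read in `natural_sweeps`.
    """
    last = len(u) - 2                      # last interior row
    for s in range(1, last + sigma):       # pipeline fill, steady state, drain
        for k in range(sigma):             # oldest iteration goes first
            i = s - k
            if 1 <= i <= last:
                sweep_row(u, i)


if __name__ == "__main__":
    a = [[float((3 * i + 5 * j) % 11) for j in range(9)] for i in range(12)]
    b = [row[:] for row in a]
    natural_sweeps(a, 4)
    blocked_sweep(b, 4)
    print("identical to 4 natural sweeps:", a == b)
```

Because the dependences are respected exactly, the blocked pass produces the same values as σ ordinary sweeps; only the memory access pattern changes.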

G-S 3D, σ = 2

[Figure]

Acumem Graph, 3D, N = 129

[Figure: cache miss ratio (percent, 0 to 14) versus cache size (512 KB to 32 MB) for σ = 0.5, 1, 2, 4, 8, 16.]

Miss ratio ~ Memory bandwidth

Parallel G-S, temporal blocked

[Figure: grid divided among thread 0 through thread 3, each working in its own active region. Legend: sweep path, previous, current, data dependence, iteration number (1,2,3,4), cacheline layout, active region, sync flag (holds an iteration number).]

Synchronization flags: wait until “lefty” is done.

Lots of communication:
• Producer/Consumer flag
• Sharing of data values
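The producer/consumer flag idea can be sketched with Python threads as below. This is a simplified illustration, not the scheme from the talk: it assumes the grid is split into column strips, it pipelines plain natural-order sweeps rather than σ-blocked active regions, and `pipelined_sweeps`, `done` and `wait_for` are hypothetical names. Each thread publishes a counter of finished (iteration, row) blocks; a thread waits until its left neighbour has finished the same row in the same iteration, and until its right neighbour has finished that row in the previous iteration.

```python
import threading


def serial_sweeps(u, iters):
    """Reference: plain natural-order Gauss-Seidel sweeps (5-point stencil)."""
    for _ in range(iters):
        for i in range(1, len(u) - 1):
            for j in range(1, len(u[0]) - 1):
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])


def pipelined_sweeps(u, iters, n_threads):
    """Same arithmetic as serial_sweeps, with one thread per column strip."""
    rows, cols = len(u), len(u[0])
    interior = rows - 2
    # Strip t owns interior columns bounds[t] .. bounds[t + 1] - 1.
    bounds = [1 + (cols - 2) * t // n_threads for t in range(n_threads + 1)]
    done = [0] * n_threads            # sync flags: blocks finished per thread
    cond = threading.Condition()

    def wait_for(t, count):
        with cond:
            cond.wait_for(lambda: done[t] >= count)

    def worker(t):
        for step in range(iters * interior):
            i = 1 + step % interior
            if t > 0:
                # "lefty" must have produced this row in this iteration
                wait_for(t - 1, step + 1)
            if t < n_threads - 1:
                # right neighbour must hold this row from the previous iteration
                wait_for(t + 1, step + 1 - interior)
            for j in range(bounds[t], bounds[t + 1]):
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])
            with cond:
                done[t] = step + 1    # publish progress to both neighbours
                cond.notify_all()

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()


if __name__ == "__main__":
    a = [[float((3 * i + 5 * j) % 11) for j in range(18)] for i in range(10)]
    b = [row[:] for row in a]
    serial_sweeps(a, 3)
    pipelined_sweeps(b, 3, 4)
    print("matches serial result:", a == b)
```

The two kinds of traffic named on the slide both show up here: the `done` counters are the producer/consumer flags, and the strip-edge columns are the shared data values.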

Parallel G-S 3D

[Figure sequence: threads t0 to t3, each with a row of synchronization flags that start at 0 and are incremented (1, 2, ...) as the sweep proceeds; cacheline layout (size B bytes).]

Startup cost = (#threads - 1) / (N σ)

Communication:
• one flag synchronization per N^2/#threads ops
• one communication miss per B*N/#threads bytes
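As a rough worked example of the expressions above, take N = 257 and 32 threads (the configuration of the Red-Black comparison) and assume σ = 4, a value picked here only for illustration:

```latex
\text{startup cost} = \frac{\#\text{threads} - 1}{N\,\sigma}
                    = \frac{31}{257 \cdot 4}
                    = \frac{31}{1028} \approx 0.030

\frac{N^{2}}{\#\text{threads}} = \frac{257^{2}}{32}
                               = \frac{66049}{32} \approx 2064
                               \ \text{ops per flag synchronization}
```

If the startup expression is read as a fraction of the work in one blocked sweep, that is about 3 percent for this configuration, and it shrinks as N or σ grows.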

Parallel Execution Time

[Figure: execution time per step (sec, 0 to 2.5) for threads = 1, 2, 4, 8, comparing red-black G-S (RBGS, σ = 0.5), natural order (σ = 1) and temporally blocked G-S (TBGS, σ = 2, 4, 8, 16).]

Performance comparison with Red-Black

N = 257, 32 threads (Sun E15K, US IIIcu = SMP!!)

[Figure: performance ratio TBGS/RBGS (1.0 to 3.5) versus number of threads (0 to 30) for σ = 0.5, 1, 2, 4, 8, 16.]

Multicore Simulation, σ = 1

[Figure: performance ratio TBGS/RBGS (0.0 to 2.5) versus number of threads (0 to 30) for the Sun 15k and a simulated multicore.]

Using Gauss-Seidel Smoother in a Multigrid

• G-S is an important part of many real apps!
• Example: G-S as a smoother in “Multigrid”
  • Iterative algorithm
  • A more efficient smoother cuts #iterations
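Why G-S suits the smoother role can be illustrated with a small experiment, sketched here under assumptions that are not in the talk: a 1D Laplace model problem with zero boundaries, 63 interior points, and 5 sweeps. The names `gs_sweep_1d` and `remaining_fraction` are hypothetical. A few sweeps remove most of an oscillatory error component but barely touch a smooth one, which is why multigrid hands the smooth part to coarser grids.

```python
import math


def gs_sweep_1d(e):
    """One Gauss-Seidel sweep for the 1D Laplace model problem.

    The right-hand side is zero and e[0] = e[-1] = 0, so e is the error.
    """
    for i in range(1, len(e) - 1):
        e[i] = 0.5 * (e[i - 1] + e[i + 1])


def remaining_fraction(mode, n=63, sweeps=5):
    """Fraction of a sine error mode (max norm) left after `sweeps` sweeps."""
    e = [math.sin(mode * math.pi * i / (n + 1)) for i in range(n + 2)]
    e[0] = e[-1] = 0.0                    # pin the boundaries exactly
    start = max(abs(x) for x in e)
    for _ in range(sweeps):
        gs_sweep_1d(e)
    return max(abs(x) for x in e) / start


if __name__ == "__main__":
    print("smooth mode (1) left:      ", remaining_fraction(1))
    print("oscillatory mode (48) left:", remaining_fraction(48))
```

In a multigrid cycle the smoother runs on every level, so any speedup of the G-S sweep itself carries over to the whole solver.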

One slide summary

• Today’s algorithms assume expensive communication
• The communication cost of [some] multicores is close to zero
• Locality is becoming key to performance [again]
• Redesign HPC algorithms to face this fact! (For both Capacity and Capability computing)

We show:
* 3x performance gain
* ~30x less bandwidth

Is it time to revisit more algorithms?

Niagara

4 x DDR-2 = 25 GB/s (!)

[Figure: 8 CPUs, each with L1I and L1D (wt), connected through a crossbar (Xbar = 134 GB/s) to a shared 3 MB L2 (four L2 blocks) and four memory controllers.]

Shared L2!

Shared caches: Good or bad?
