raising the arithmetic intensity of krylov solvers
play

Raising the Arithmetic Intensity of Krylov solvers 1 Applied - PowerPoint PPT Presentation

Raising the Arithmetic Intensity of Krylov solvers 1 Applied Mathematics, University of Antwerp, Belgium 2 Future Technology Group, Berkeley Lab, USA 3 Intel ExaScience Lab, Belgium 4 USI, Lugano, CH B. Reps 1 , P. Ghysels 2 , S. Donfack 4 , O.


  1. Raising the Arithmetic Intensity of Krylov solvers 1 Applied Mathematics, University of Antwerp, Belgium 2 Future Technology Group, Berkeley Lab, USA 3 Intel ExaScience Lab, Belgium 4 USI, Lugano, CH B. Reps 1 , P. Ghysels 2 , S. Donfack 4 , O. Schenk 4 , W. Vanroose 1 , 3 www.exa2ct.eu

  2. Exa2ct EU project

  3. GMRES, classical Gram-Schmidt 1: r 0 ← b − Ax 0 Sparse Matrix-Vector 2: v 0 ← r 0 / || r 0 || 2 product 3: for i = 0 , . . . , m − 1 do ◮ Only communication w ← Av i 4: with neighbors for j = 0 , . . . , i do 5: ◮ Good scaling but poor h j , i ← � w , v j � 6: bandwidth usage end for 7: Dot-product v i +1 ← w − � i ˜ j =1 h j , i v j 8: ◮ Global communication h i +1 , i ← || ˜ v i +1 || 2 9: v i +1 ← ˜ v i +1 / h i +1 , i ◮ Scales as log( P ) and 10: { apply Givens rotations to h : , i } 11: suffers from jitter 12: end for Scalar vector multiplication, 13: y m ← vector-vector addition argmin || H m +1 , m y m − || r 0 || 2 e 1 || 2 ◮ No communication 14: x ← x 0 + V m y m

  4. Performance of Kernel of dependent SpMV ◮ SIMD: SSE → AVX → mm512 on Xeon Phi. ◮ Similar with NEON on ARM. Bandwidth is the bottleneck and will remain even with 3D stacked memory. 64 with vectorization 32 attainable GFLOP/sec peak floating-point 16 peak memory BW 8 4 2 1 1/8 1/4 1/2 1 2 4 8 16 Arithmetic Intensity FLOP/Byte Arithmetic intensity (Flop/byte) S. Williams, A. Waterman and D. Patterson (2008)

  5. PATUS (Spatial Blocking) Intel Ivy Bridge (Xeon E5-2660v2) Intel Ivy Bridge (Xeon E5-2660v2) · 1 , 000 · 1 , 000 40 40 40 40 Memory Bandwidth in GBytes/s. Memory Bandwidth in GBytes/s. 30 30 Performance in GFlop/s Performance in GFlop/s 30 30 20 20 20 20 gcc 4.8.2 PATUS 1.1 BW gcc 4.8.2 BW PATUS 1.1 10 10 10 10 gcc 4.8.2 PATUS 1.1 BW gcc 4.8.2 BW PATUS 1.1 0 0 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 # of threads # of threads Figure : a. 7 point-stencil const coef., b. 25-point stencil var coef. ◮ PATUS leads to better results and lower bandwidth use. ◮ Good results, but Bandwidth saturation around 10 cores. Difficult to do better with spatial blocking. M. Christen, O. Schenk, and H. Burkhart. Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. IPDPS’11. M. Christen, O. Schenk, and C. Yifeng. Patus for convenient high-performance stencils: evaluation in earthquake simulations.

  6. Arithmetic intensity of k dependent SpMVs 1 SpMV k SpMVs k SpMVs in place flops 2 n nz 2 k · n nz 2 k · n nz words moved n nz + 2 n kn nz + 2 kn n nz + 2 n q 2 2 2k J. Demmel, CS 294-76 on Communication-Avoiding algorithms M. Hoemmen, Communication-avoiding Krylov subspace methods. (2010).

  7. Increasing the arithmetic intensity ◮ Divide the domain in tiles which fit in the cache ◮ Ground surface is loaded in cache and reused k times ◮ Redundant work at the tile edges u s +1 s + 1 i,j 3 u s i,j +1 2 s u s u s i,j i +1 ,j s 1 u s i − 1 ,j u s B 0 i,j − 1 0 0 B P. Ghysels, P. Klosiewicz, and W. Vanroose. ”Improving the arithmetic intensity of multigrid with the help of polynomial smoothers.” Num. Lin. Alg. Appl. 19 (2012): 253-267.

  8. Use the help of PLUTO Pluto: an automatic parallelizer and locality optimizer for multicores that performs a source to source code transformation. ◮ Automatic loop 8 4 7 parallelization and 6 vectorization. 5 t 4 2 ◮ Locality optimizations 3 1 2 based on the polyhedral 1 model. 0 2 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16 0 x ◮ Important trade-off between parallelism and temporal locality. U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan, Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model , Int. Conf. Compiler Construction (ETAPS CC), Apr 2008, Budapest, Hungary.

  9. PLUTO (Temporal blocking) Increase the arithmetic intensity based on the underlying stencils. ◮ Divide the domain in tiles 8 which fit in the cache. 4 7 6 ◮ Reuse the data in the 5 t tiles k times. 4 2 3 ◮ Use compiler auto - 1 2 1 vectorization capabilities 0 2 within each tiles. 1 3 4 5 6 7 8 9 10 12 13 14 15 16 0 x 11 ◮ Increased parallelism and temporal data reuse. ◮ Inner-tiles suitable for external spatial blocking tools e.g PATUS. ◮ Tiles representation suitable for most efficient scheduling strategy e.g TBB.

  10. Increasing the arithmetic intensity with a stencil compiler (Pluto) 256 256 naive naive naive, no-vec naive, no-vec pluto pluto pluto, no-vec pluto, no-vec 128 128 peak roofline peak roofline fitted roofline fitted roofline GFlop/s GFlop/s 64 64 32 32 16 16 0.25 0.5 1 2 4 8 16 32 64 0.25 0.5 1 2 4 8 16 32 64 arithmetic intensity (flop/byte) arithmetic intensity (flop/byte) Dual socket Intel Sandy Bridge, Intel Xeon Phi, 61 × 4 threads 16 × 2 threads

  11. Stencil Compilers ◮ Pochoir: (MIT) http://people.csail.mit.edu/yuantang/ ◮ Patus: (USI) https://code.google.com/p/patus/ ◮ Pluto: http://pluto-compiler.sourceforge.net ◮ Ghost: (Erlangen) http://tiny.cc/ghost ◮ HHRT: (Tokyo) ◮ Modesto,

  12. Krylov methods ◮ Classical Krylov K k ( A , v ) = span { v , Av , A 2 v , . . . , A k − 1 v } the residual and error can then be written as r ( k ) = P k ( A ) r (0) (1)

  13. Krylov methods ◮ Classical Krylov K k ( A , v ) = span { v , Av , A 2 v , . . . , A k − 1 v } the residual and error can then be written as r ( k ) = P k ( A ) r (0) (1) ◮ Polynomial Preconditioning P m ( A ) x = q ( A ) Ax = q ( A ) b K k ( P m ( A ) , v ) = span { v , P m ( A ) v , P m ( A ) 2 v , . . . , P m ( A ) k − 1 v } where P m ( A ) = ( A − σ 1 )( A − σ 2 ) · . . . · ( A − σ m ) is a fixed low-order polynomial r ( k ) = Q k ( P m ( A )) r (0) (2)

  14. Incomplete list of the literature on polynomial preconditioning Y. Saad, Iterative methods for sparse linear systems Chapter 12, SIAM (2003). D. O’Leary, Yet another polynomials preconditioner for the Conjugate Gradient Algorithm , Lin. Alg. Appl. (1991) p377 A. van der Ploeg, Preconditioning for sparse matrices with applications , (1994) M. Van Gijzen, A polynomial preconditioner for the GMRES algorithm J. Comp. Appl. Math 59 (1995): 91-107. A. Basermann, B. Reichel, C. Schelthoff, Preconditioned CG methods for sparse matrices on massively parallel machines , 23 (1997), 381398 . . .

  15. Convergence of CG with different short Polynomials

  16. Time to solution is reduced

  17. Recursive calculation of w k +1 = p k +1 ( A ) v 1: σ 1 = θ/δ 2: ρ 0 = 1 /σ 1 3: w 0 = 0, w 1 = 1 θ Av 4: ∆ w 1 = w 1 − w 0 5: for k=1,. . . do ρ k = 1 / (2 σ 1 − ρ k − 1 ) 6: � 2 � ∆ w k +1 = ρ k δ A ( v − w k ) + ρ k − 1 ∆ w k 7: w k +1 = w k + ∆ w k +1 8: 9: end for

  18. Mangled with Pluto

  19. Average cost for each matvec 0.05 total 0.045 matvec dot-product averaged time for each matvec 0.04 axpy 0.035 0.03 0.025 0.02 0.015 0.01 0.005 0 0 2 4 6 8 10 12 14 16 18 20 order of polynomial 2D 2024 × 2048 finite difference poisson problem on i7-2860QM

  20. Patus and Pluto

  21. Arithmetic Intensity Multigrid 2 h ( . . . ) I 2 h h A ) S ν 1 , MG = ( I − I h (3) where ◮ S is the smoother, for example ω -Jacobi where S = I − ω D − 1 A , 2 h and restriction I 2 h ◮ The interpolation I h h . These are all low arithmetic intensity.

  22. Multigrid Cost Model Increasing the number of smoothing steps per level 1 60 50 0.8 convergence rate # V-cycles 40 0.6 30 0.4 20 0.2 10 0 0 0 5 10 15 20 0 5 10 15 20 smoothing steps smoothing steps MG iterations = log(10 − 8 ) / log( ρ (V-cycle)) where ρ (V-cycle) is the spectral radius of the V-cycle, can be rounded to higher integer

  23. Work Unit Cost model ◮ Work Unit cost model ignores memory bandwidth 1 WU = smoother cost = O ( n ) ◮ (9 ν + 19)(1 + 1 4 + . . . ) log( tol ) log( ρ ) WU ≤ (9 ν + 19) 4 log( tol ) log( ρ ) WU 3 60 2500 50 2000 40 # V-cycles work units 1500 30 1000 20 500 10 0 0 0 5 10 15 20 0 5 10 15 20 smoothing steps smoothing steps

  24. 1200 0.016 peak BW, peak Flops 0.014 no SIMD 1000 stream BW 0.012 avg perf (Gflop/s) avg cost (s/Gflop) 800 0.01 peak BW, peak Flops 600 no SIMD 0.008 stream BW 0.006 400 0.004 200 0.002 0 0 0 5 10 15 20 0 5 10 15 20 # smoothing steps # smoothing steps Attainable GFlop/sec Average cost in sec/GFlop flops ( ν 1 × ω -Jac) ν 1 flops ( ω -Jac) � ν 1 × ω -Jac � = roof ( q ( ν 1 × ω -Jac)) ≈ (4) roof ( ν 1 q 1 ( ω -Jac))

  25. Roofline-based vs naive cost model 1.2 roofline model naive model 1 0.8 work units 0.6 0.4 0.2 0 0 5 10 15 20 # smoothing steps

  26. Timings for 2D 8190 2 (Sandy Bridge) 25 naive pluto 20 naive model roofline model 15 time (s) 10 5 0 0 2 4 6 8 10 12 14 16 18 20 smoothing steps P. Ghysels and W. Vanroose, Modelling the performance of geometric multigrid on many-core computer architectures , Exascience.com techreport 09.2013.1 Sept, 2013, to appear in SISC

  27. 3d results 511 3 (Sandy Bridge)

  28. Current Krylov Solvers ◮ User provides a matvec routine and selects Krylov method at command line. Krylov User Matvec ◮ No opportunities to increase Solver arithmetic intensity.

Recommend


More recommend