www.DLR.de • Chart 1 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Can Block Iterative Eigensolvers Fulfill their Performance Promise? Jonas Thies Melven R¨ ohrig-Z¨ ollner Achim Basermann DLR Simulation and Software Technology Distributed Systems and Component Software High Performance Computing Group project ESSEX
www.DLR.de • Chart 2 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Motivation Mathematical problem Roofline performance model • Find O (10) eigenpairs (2x 12 core Haswell EP) Ax i = λ i x i of a large, sparse matrix A . Peak Flop/s 1024 • Interior or extreme λ i , Peak bandwidth 256 DP GFlop/s • symmetric or general A . 64 BLAS1 (ddot) Memory gap 16 • Small memory bandwidth vs. 4 high Peak Flop rate 1 → Increase the compute intensity 1/4 1 4 16 64 Compute intensity [Flop / DP element]
www.DLR.de • Chart 3 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Regular vs. block JDQR 1: start from initial spaces V H V = I , V A := AV , H = V H AV 2: while ... do 3: (sorted) Schur decomp., H = STS H , S H S = I 4: select shift θ = T 1 , 1 u = Vs 1 , u A = V A s 1 5: r = u A − θ u 6: 7: (lock converged eigenpairs in Q) 8: (shrink subspaces if necessary) ˜ 9: Q = [ Q u ] solve approximately for t ⊥ ˜ Q : ( I − ˜ Q ˜ Q H )( A − θ I )( I − ˜ Q ˜ Q H ) t = − r 10: 11: orthogonalize t against V extend V = [ V ˜ t ], V A , H = V H V A 12: 13: end while
www.DLR.de • Chart 3 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Regular vs. block JDQR 1: start from initial spaces (. . . ) V H V = I , V A := AV , H = V H AV (. . . ) 2: while ... do while ... do 3: (sorted) Schur decomp., (. . . ) H = STS H , S H S = I (. . . ) 4: select shift θ = T 1 , 1 select shifts θ i = T i , i , i = 1 . . . n b u = Vs 1 , u A = V A s 1 u = VS : , 1: n b , u A = V A S : , 1: n b 5: r = u A − θ u 6: r : , i = u A : , i − θ i u : , i 7: (lock converged eigenpairs in Q) (. . . ) 8: (shrink subspaces if necessary) (. . . ) ˜ 9: Q = [ Q u ] ˜ Q = [ Q u ] solve approximately for t ⊥ ˜ Q : solve n b independent systems ( I − ˜ Q ˜ Q H )( A − θ I )( I − ˜ Q ˜ Q H ) t = − r 10: ( I − ˜ Q ˜ Q H )( A − θ i I )( I − ˜ Q ˜ Q H ) t : , i = − r : , i 11: orthogonalize t against V block-orthogonalize t against V , extend V = [ V ˜ t ], V A , H = V H V A 12: extend subspaces (by n b vecs) 13: end while end while
www.DLR.de • Chart 4 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Numerical behavior Block size 2 Block size 4 3 . 5 3 . 5 Andrews ck656 cfd1 cry10000 relative overhead (# spMVM) relative overhead (# spMVM) finan512 dw8192 3 3 torsion1 rdb3200l 2 . 5 2 . 5 2 2 1 . 5 1 . 5 1 1 5 10 20 30 40 50 5 10 20 30 40 50 # eigenvalues found # eigenvalues found
www.DLR.de • Chart 5 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Kernels needed for BJDQR • spMMVM: sparse Matrix times Multiple Vector Multiplication • block projection, y ← ( I − VV H ) x • block vector orthogonalization given V , W : V H V = I , find Q , R : W = QR , Q H Q = I , V H Q = 0. • subspace transformation for ‘thick restart’ V : , 1: k ← V : , 1: m · S , S ∈ C m × k , k < m • preconditioning operation approximating ( A − τ I ) − 1 x (so far unpreconditioned blocked Krylov → ESSEX-II)
www.DLR.de • Chart 6 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Row major vs. column major storage 20 y j ← ( A − θ j ) x j S ← Q T Y 15 Y ← Y − QS runtime [s] 10 5 0 1 2 4 8 block size n b Tpetra/TBB, col major
www.DLR.de • Chart 6 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Row major vs. column major storage 20 y j ← ( A − θ j ) x j S ← Q T Y 15 Y ← Y − QS runtime [s] 10 5 0 1 2 4 8 1 2 4 8 block size n b block size n b Tpetra/TBB, col major GHOST, col major
www.DLR.de • Chart 6 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Row major vs. column major storage 20 y j ← ( A − θ j ) x j S ← Q T Y 15 Y ← Y − QS runtime [s] 10 5 0 1 2 4 8 1 2 4 8 1 2 4 8 block size n b block size n b block size n b Tpetra/TBB, col major GHOST, col major GHOST, row major
www.DLR.de • Chart 6 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Row major vs. column major storage 20 y j ← ( A − θ j ) x j S ← Q T Y 15 Y ← Y − QS runtime [s] 10 5 0 1 2 4 8 1 2 4 8 1 2 4 8 block size n b block size n b block size n b Tpetra/TBB, col major GHOST, col major GHOST, row major • consequences for overall implementation (avoid strides!) • SELL-C- σ most helpful if SIMD/SIMT width > n b
www.DLR.de • Chart 7 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Overall block speedup (strong scaling) 5000 n b =1 Setup n b =2 n b =4 4000 • Non-symmetric matrix from 7-point 3D PDE discretization ( n ≈ 1 . 3 · 10 8 , n nz ≈ 9 . 4 · 10 8 ) 3000 • Seeking 20 eigenvalues • Ivy Bridge Cluster 2000 Results 1000 • n b = 2: significantly faster • n b = 4: no further improvement 0 8 16 32 64 128 See R¨ ohrig-Z¨ ollner et al SISC 37(6), 2015 nodes (const. bar heigth ≡ linear strong scaling)
www.DLR.de • Chart 8 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Block orthogonalization schemes • Given orthogonal vectors � � v 1 , . . . , v k = V • For X ∈ R n × n b find orthogonal Y ∈ R n × ˜ n b with V T Y = 0 YR 1 = X − VR 2 , and Two phase algorithms ˜ X ← ( I − VV T ) X Phase 1 Project: Phase 2 Orthogonalize: Y ← f ( ˜ X ) • communication optimal f : • CholQR or SVQB (Stathopoulos and Wu, SISC 2002) • TSQR (Demmel et al., SISC 2012) • Each phase messes with the accuracy of the other. → iterate
www.DLR.de • Chart 9 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Raising the computational intensity Kernel fusion C ← V T ˜ Phase 2 ˜ X ← XB , X X T ˜ Phase 1 ˜ X ← X − VC , ˜ B ← ˜ X X T ˜ Phase 3 ˜ B ← ˜ ˜ X ← XB , X ⇒ use SVQB Increased precision Idea Calculate value and error of each arithmetic operation • Store intermediate results as double-double (DD) numbers • Based on arithmetic building blocks (2Sum, 2Mult) Muller et al.: Handbook of Floating-Point Arithmetic, Springer 2010 • Exploit FMA operations (AVX2)
www.DLR.de • Chart 10 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Results: accuracy after one iteration Error in V T Y Error in Y T Y 10 0 10 0 breakdown! 10 − 2 10 − 2 10 − 4 10 − 4 10 − 6 10 − 6 10 − 8 10 − 8 10 − 10 10 − 10 different κ ( X , W ) 10 − 12 10 − 12 10 − 14 10 − 14 project+TSQR project/SVQB 10 − 16 10 − 16 DD project/SVQB 10 − 18 10 − 18 10 0 10 5 10 10 10 15 10 20 10 25 10 0 10 2 10 4 10 6 10 8 10 10 10 12 condition number κ ( X , V ) condition number κ ( X )
www.DLR.de • Chart 11 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Results: runtime to orthogonality 0 . 1 Iterated project/SVQB ” ” with fused operations 0 . 09 ” ” with fused DD operations 0 . 08 0 . 07 runtime/block size n b [s] 0 . 06 0 . 05 0 . 04 0 . 03 0 . 02 0 . 01 0 10 × 1 10 × 2 10 × 4 20 × 1 20 × 2 20 × 4 40 × 1 40 × 2 40 × 4 n proj × n b
www.DLR.de • Chart 12 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Block orthogonalization aleviates memory gap Peak Flop/s 1024 Faster through Peak bandwidth Fused DD 40 × 4 256 • block algorithms DP GFlop/s • kernel fusion 64 mod. G.-S. • increased precision arithmetic 16 (not data) 4 1 1/4 1 4 16 64 Compute intensity [Flop / DP element]
www.DLR.de • Chart 13 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Summary Lessons learned • don’t use BLAS-3 for ‘tall, skinny matrices’ • row-major storage ” • implement algorithms accordingly (avoid strides) • use ‘free’ operations to increase accuracy/reduce iterations Outlook: Peta and beyond • reduction of synchronization kicks in • increasing memory gap favors block methods • crucial for ESSEX-II: matrix reordering and preconditioning
Recommend
More recommend