Can Block Iterative Eigensolvers Fulfill their Performance Promise? - PowerPoint PPT Presentation

www.DLR.de • Chart 1 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Can Block Iterative Eigensolvers Fulfill their Performance Promise? Jonas Thies Melven R¨ ohrig-Z¨ ollner Achim Basermann DLR Simulation and Software Technology Distributed Systems and Component Software High Performance Computing Group project ESSEX

www.DLR.de • Chart 2 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Motivation Mathematical problem Roofline performance model • Find O (10) eigenpairs (2x 12 core Haswell EP) Ax i = λ i x i of a large, sparse matrix A . Peak Flop/s 1024 • Interior or extreme λ i , Peak bandwidth 256 DP GFlop/s • symmetric or general A . 64 BLAS1 (ddot) Memory gap 16 • Small memory bandwidth vs. 4 high Peak Flop rate 1 → Increase the compute intensity 1/4 1 4 16 64 Compute intensity [Flop / DP element]

www.DLR.de • Chart 3 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Regular vs. block JDQR 1: start from initial spaces V H V = I , V A := AV , H = V H AV 2: while ... do 3: (sorted) Schur decomp., H = STS H , S H S = I 4: select shift θ = T 1 , 1 u = Vs 1 , u A = V A s 1 5: r = u A − θ u 6: 7: (lock converged eigenpairs in Q) 8: (shrink subspaces if necessary) ˜ 9: Q = [ Q u ] solve approximately for t ⊥ ˜ Q : ( I − ˜ Q ˜ Q H )( A − θ I )( I − ˜ Q ˜ Q H ) t = − r 10: 11: orthogonalize t against V extend V = [ V ˜ t ], V A , H = V H V A 12: 13: end while

www.DLR.de • Chart 3 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Regular vs. block JDQR 1: start from initial spaces (. . . ) V H V = I , V A := AV , H = V H AV (. . . ) 2: while ... do while ... do 3: (sorted) Schur decomp., (. . . ) H = STS H , S H S = I (. . . ) 4: select shift θ = T 1 , 1 select shifts θ i = T i , i , i = 1 . . . n b u = Vs 1 , u A = V A s 1 u = VS : , 1: n b , u A = V A S : , 1: n b 5: r = u A − θ u 6: r : , i = u A : , i − θ i u : , i 7: (lock converged eigenpairs in Q) (. . . ) 8: (shrink subspaces if necessary) (. . . ) ˜ 9: Q = [ Q u ] ˜ Q = [ Q u ] solve approximately for t ⊥ ˜ Q : solve n b independent systems ( I − ˜ Q ˜ Q H )( A − θ I )( I − ˜ Q ˜ Q H ) t = − r 10: ( I − ˜ Q ˜ Q H )( A − θ i I )( I − ˜ Q ˜ Q H ) t : , i = − r : , i 11: orthogonalize t against V block-orthogonalize t against V , extend V = [ V ˜ t ], V A , H = V H V A 12: extend subspaces (by n b vecs) 13: end while end while

www.DLR.de • Chart 4 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Numerical behavior Block size 2 Block size 4 3 . 5 3 . 5 Andrews ck656 cfd1 cry10000 relative overhead (# spMVM) relative overhead (# spMVM) finan512 dw8192 3 3 torsion1 rdb3200l 2 . 5 2 . 5 2 2 1 . 5 1 . 5 1 1 5 10 20 30 40 50 5 10 20 30 40 50 # eigenvalues found # eigenvalues found

www.DLR.de • Chart 5 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Kernels needed for BJDQR • spMMVM: sparse Matrix times Multiple Vector Multiplication • block projection, y ← ( I − VV H ) x • block vector orthogonalization given V , W : V H V = I , find Q , R : W = QR , Q H Q = I , V H Q = 0. • subspace transformation for ‘thick restart’ V : , 1: k ← V : , 1: m · S , S ∈ C m × k , k < m • preconditioning operation approximating ( A − τ I ) − 1 x (so far unpreconditioned blocked Krylov → ESSEX-II)

www.DLR.de • Chart 6 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Row major vs. column major storage 20 y j ← ( A − θ j ) x j S ← Q T Y 15 Y ← Y − QS runtime [s] 10 5 0 1 2 4 8 block size n b Tpetra/TBB, col major

www.DLR.de • Chart 6 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Row major vs. column major storage 20 y j ← ( A − θ j ) x j S ← Q T Y 15 Y ← Y − QS runtime [s] 10 5 0 1 2 4 8 1 2 4 8 block size n b block size n b Tpetra/TBB, col major GHOST, col major

www.DLR.de • Chart 6 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Row major vs. column major storage 20 y j ← ( A − θ j ) x j S ← Q T Y 15 Y ← Y − QS runtime [s] 10 5 0 1 2 4 8 1 2 4 8 1 2 4 8 block size n b block size n b block size n b Tpetra/TBB, col major GHOST, col major GHOST, row major

www.DLR.de • Chart 6 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Row major vs. column major storage 20 y j ← ( A − θ j ) x j S ← Q T Y 15 Y ← Y − QS runtime [s] 10 5 0 1 2 4 8 1 2 4 8 1 2 4 8 block size n b block size n b block size n b Tpetra/TBB, col major GHOST, col major GHOST, row major • consequences for overall implementation (avoid strides!) • SELL-C- σ most helpful if SIMD/SIMT width > n b

www.DLR.de • Chart 7 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Overall block speedup (strong scaling) 5000 n b =1 Setup n b =2 n b =4 4000 • Non-symmetric matrix from 7-point 3D PDE discretization ( n ≈ 1 . 3 · 10 8 , n nz ≈ 9 . 4 · 10 8 ) 3000 • Seeking 20 eigenvalues • Ivy Bridge Cluster 2000 Results 1000 • n b = 2: significantly faster • n b = 4: no further improvement 0 8 16 32 64 128 See R¨ ohrig-Z¨ ollner et al SISC 37(6), 2015 nodes (const. bar heigth ≡ linear strong scaling)

www.DLR.de • Chart 8 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Block orthogonalization schemes • Given orthogonal vectors � � v 1 , . . . , v k = V • For X ∈ R n × n b find orthogonal Y ∈ R n × ˜ n b with V T Y = 0 YR 1 = X − VR 2 , and Two phase algorithms ˜ X ← ( I − VV T ) X Phase 1 Project: Phase 2 Orthogonalize: Y ← f ( ˜ X ) • communication optimal f : • CholQR or SVQB (Stathopoulos and Wu, SISC 2002) • TSQR (Demmel et al., SISC 2012) • Each phase messes with the accuracy of the other. → iterate

www.DLR.de • Chart 9 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Raising the computational intensity Kernel fusion C ← V T ˜ Phase 2 ˜ X ← XB , X X T ˜ Phase 1 ˜ X ← X − VC , ˜ B ← ˜ X X T ˜ Phase 3 ˜ B ← ˜ ˜ X ← XB , X ⇒ use SVQB Increased precision Idea Calculate value and error of each arithmetic operation • Store intermediate results as double-double (DD) numbers • Based on arithmetic building blocks (2Sum, 2Mult) Muller et al.: Handbook of Floating-Point Arithmetic, Springer 2010 • Exploit FMA operations (AVX2)

www.DLR.de • Chart 10 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Results: accuracy after one iteration Error in V T Y Error in Y T Y 10 0 10 0 breakdown! 10 − 2 10 − 2 10 − 4 10 − 4 10 − 6 10 − 6 10 − 8 10 − 8 10 − 10 10 − 10 different κ ( X , W ) 10 − 12 10 − 12 10 − 14 10 − 14 project+TSQR project/SVQB 10 − 16 10 − 16 DD project/SVQB 10 − 18 10 − 18 10 0 10 5 10 10 10 15 10 20 10 25 10 0 10 2 10 4 10 6 10 8 10 10 10 12 condition number κ ( X , V ) condition number κ ( X )

www.DLR.de • Chart 11 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Results: runtime to orthogonality 0 . 1 Iterated project/SVQB ” ” with fused operations 0 . 09 ” ” with fused DD operations 0 . 08 0 . 07 runtime/block size n b [s] 0 . 06 0 . 05 0 . 04 0 . 03 0 . 02 0 . 01 0 10 × 1 10 × 2 10 × 4 20 × 1 20 × 2 20 × 4 40 × 1 40 × 2 40 × 4 n proj × n b

www.DLR.de • Chart 12 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Block orthogonalization aleviates memory gap Peak Flop/s 1024 Faster through Peak bandwidth Fused DD 40 × 4 256 • block algorithms DP GFlop/s • kernel fusion 64 mod. G.-S. • increased precision arithmetic 16 (not data) 4 1 1/4 1 4 16 64 Compute intensity [Flop / DP element]

www.DLR.de • Chart 13 > SPPEXA’16 > J. Thies • Block Sparse Solvers > 2016-01-25 Summary Lessons learned • don’t use BLAS-3 for ‘tall, skinny matrices’ • row-major storage ” • implement algorithms accordingly (avoid strides) • use ‘free’ operations to increase accuracy/reduce iterations Outlook: Peta and beyond • reduction of synchronization kicks in • increasing memory gap favors block methods • crucial for ESSEX-II: matrix reordering and preconditioning

Can Block Iterative Eigensolvers Fulfill their Performance Promise? - PowerPoint PPT Presentation

www.DLR.de Chart 1 > SPPEXA16 > J. Thies Block Sparse Solvers > 2016-01-25 Can Block Iterative Eigensolvers Fulfill their Performance Promise? Jonas Thies Melven R ohrig-Z ollner Achim Basermann DLR Simulation and

50 YEARS Let Us Fulfill Your Needs Let Us Fulfill Your Needs We Are VoIP Supply VoIP Supply

Eigensolvers for Large Eigensolvers for Large Electronic Structure Calculations Osni Marques (

Block Iterative Eigensolvers for Sequences of Dense Correlated Eigenvalue Problems Birkbeck

Iterative methods for Image Processing Lothar Reichel Como, May 2018. Lecture 3: Block iterative

Iterative Block Ciphers from Tweakable Block Ciphers with Long Tweaks Ryota Nakamichi and Tetsu

Basic Techniques II: Iterative Compression Marek Cygan Institute of Informatics University of

Chapter 12: Iterative Methods ES 240: Scientific and Engineering Computation. Iterative Methods

Development Figures are from : Agile and Iterative Development: A Manager's Guide, Craig

Problem 1 k zero bits n bits IV Block Block Block Block Cipher Cipher Cipher Cipher

A Study on SLEPc, a Library for Scalable Eigensolvers, and its Scientific and Engineering

Solving Nonlinear Eigenvalue Problems with SLEPc Jose E. Roman D. Sistemes Inform` atics i

Solving Regularized Total Least Squares Problems Based on Eigensolvers Heinrich Voss

Chapter 16 Chapter 16 The Elements: The he Elements: The d -Block -Block The d -Block

Introducing EF Block TM Introduction to EF Block Building Materials Overview of EF

CRYSTAL CITY BLOCK PLAN # CCBP- J-K 2019 1 BLOCK J-K Long Range Planning Committee Block

CRYSTAL CITY BLOCK PLAN # CCBP- G 1 BLOCK G (Metro Market Square block) 1 Long Range

Securing PIM-SM Link- Local Messages J.W. Atwood Salekul Islam Concordia University

Discourse Coherence Lecture Plan: Einf uhrung in Pragmatik Discourse cohesion and

Professional Communication in Computer Science Responsible Conduct in Science II Jiri Srba

10/16/19 Engaging with People Impacted by Intellectual and Developmental Disabilities Rick

Were not crazy, we promise! Trials and tribulations of a programming language designer

Round elimination in exact communication complexity Teresa Piovesan Joint work with et, Harry

Programming Abstractions Week 13-1: Streams Stephen Checkoway What is printed by this code?

Streams Announcements Efficient Sequence Processing Sequence Operations Map, filter, and reduce