new approaches to the direct solution of large scale
play

New Approaches to the Direct Solution of Large-Scale Banded Linear - PowerPoint PPT Presentation

www.bsc.es New Approaches to the Direct Solution of Large-Scale Banded Linear Systems Samuel Rodrguez Bernabeu samuel.rodriguez@bsc.es San Jos, California 2 Solving the linear system after reordering 10 Million degrees of freedom, double


  1. www.bsc.es New Approaches to the Direct Solution of Large-Scale Banded Linear Systems Samuel Rodríguez Bernabeu samuel.rodriguez@bsc.es San José, California

  2. 2

  3. Solving the linear system after reordering 10 Million degrees of freedom, double precision, complex linear system with a bandwidth of 2500 may require 1490 GB just to store LU matrix factors using LU factorization with partial pivoting. It is possible to solve it with less than 60 GB 3

  4. Solving the linear system 4

  5. Solving the linear system after reordering → Azad, Ariful, et al. "The Reverse Cuthill-McKee Algorithm in Distributed-Memory." arXiv preprint arXiv:1610.08128 (2016).. 5

  6. BUILDING BLOCK: LM-SPIKE: MEMORY FRIENDLY DIRECT SOLVER

  7. SPIKE algorithm Algorithm 1: SPIKE solver pseudocode. Given A , P and x , solves the A 1 x = P T b for x and returns the Px � PAP T � linear system Data: A , P , b B 1 Result: P T x 1 A := PAP T A 2 C 2 2 b := Pb 3 p = ComputeOptimalNumerOfPartitions( A , nrhs ) B 2 4 for i in p do  0 � V i = A − 1 A 3 5 C 3 i B i  C i � W i = A − 1 6 i 0 B 3 F i = A − 1 i b i 7 A 4 8 end C 4 x = SolveReducedSystem( V top,bottom , W top,bottom , F top,bottom 9 ¯ ) i i i 10 for i in p do  0 ✓ �  C i � ◆ x top x i = A − 1 x bottom ¯ ¯ F i − 11 i +1 − i i − 1 B i 0 12 end 13 x = P T x Illustration block matrix decomposition layout used by the SPIKE algorithm for a 4 partitions case ( p = 4). fully asynchronous work prevents scalability A parallel hybrid banded system solver: the SPIKE algorithm. Parallel computing , 32 (2), 177-194. Polizzi, E., & Sameh, A. H. (2006). 7

  8. SPIKE algorithm Algorithm 1: SPIKE solver pseudocode. Given A , P and x , solves the A 1 x = P T b for x and returns the Px � PAP T � linear system Data: A , P , b B 1 Result: P T x 1 A := PAP T A 2 C 2 2 b := Pb 3 p = ComputeOptimalNumerOfPartitions( A , nrhs ) B 2 4 for i in p do  0 � V i = A − 1 A 3 5 C 3 i B i  C i � W i = A − 1 6 i 0 B 3 F i = A − 1 i b i 7 A 4 8 end C 4 x = SolveReducedSystem( V top,bottom , W top,bottom , F top,bottom 9 ¯ ) i i i 10 for i in p do  0 ✓ �  C i � ◆ x top x i = A − 1 x bottom ¯ ¯ F i − 11 i +1 − i i − 1 B i 0 12 end 13 x = P T x Illustration block matrix decomposition layout used by the SPIKE algorithm for a 4 partitions case ( p = 4). fully asynchronous work prevents scalability A parallel hybrid banded system solver: the SPIKE algorithm. Parallel computing , 32 (2), 177-194. Polizzi, E., & Sameh, A. H. (2006). 8

  9. Controlling memory consumption PARDISO LM-SPIKE (p=5) LM-SPIKE (p=20) 9

  10. FIRST IDEA: LOCAL MATRIX BANDWIDTH REDUCTION

  11. Could we reduce the bandwidth locally ? → 11

  12. Could we reduce the bandwidth locally ? → 12

  13. Could we reduce the bandwidth locally ? →  �  �  �  A − 1 B � A B I A I 0 0 = CA − 1 D − CA − 1 B C D I I 0 0 13

  14. Could we reduce the bandwidth locally ? +20 Million degrees of freedom linear system arising from frequency domain discretization of Maxwell’s equations for CSEM. Notice that the inclusion of high-conductivity air layer makes the problem highly ill-conditioned. 14

  15. SECOND IDEA: COMPUTING A SUBSET OF THE INVERSE (JUST FOR DIAGONALLY DOMINANT CASES?)

  16. Computing selected elements of the inverse A − 1 A − 1 A − 1     b 11 11 12 13 Algorithm 1: SPIKE solver pseudocode. Given A , P and x , solves the x = A − 1 b = A − 1 A − 1 A − 1 b 21 x = P T b for x and returns the Px � PAP T � linear system     21 22 23 A − 1 A − 1 A − 1 b 31 31 32 33 Data: A , P , b Result: P T x 1 A := PAP T x 11 = A − 1 11 b 11 + A − 1 12 b 21 + A − 1 13 b 31 2 b := Pb 3 p = ComputeOptimalNumerOfPartitions( A , nrhs ) 4 for i in p do  0 � V i = A − 1 5 i B i  C i � W i = A − 1 6 i 0 F i = A − 1 i b i 7 8 end x = SolveReducedSystem( V top,bottom , W top,bottom , F top,bottom 9 ¯ ) i i i 10 for i in p do  0 ✓ �  C i � ◆ x top x i = A − 1 x bottom ¯ ¯ F i − 11 i +1 − i i − 1 B i 0 12 end 13 x = P T x Kuzmin, A., Luisier, M., & Schenk, O. (2013, August). Fast methods for computing selected elements of the Green’s function in massively parallel nanoelectronic device simulations. In European Conference on Parallel Processing (pp. 533-544). Springer Berlin Heidelberg. 16

  17. Computing selected elements of the inverse A − 1 A − 1 A − 1     b 11 11 12 13 Algorithm 1: SPIKE solver pseudocode. Given A , P and x , solves the x = A − 1 b = A − 1 A − 1 A − 1 b 21 x = P T b for x and returns the Px � PAP T � linear system     21 22 23 A − 1 A − 1 A − 1 b 31 31 32 33 Data: A , P , b Result: P T x 1 A := PAP T x 11 = A − 1 11 b 11 + A − 1 12 b 21 + A − 1 13 b 31 2 b := Pb 3 p = ComputeOptimalNumerOfPartitions( A , nrhs ) 4 for i in p do ?  0 � V i = A − 1 Diagonal Dominance 5 i B i y  C i � W i = A − 1 6 i 0 F i = A − 1 i b i 7 x 11 ≈ A − 1 11 b 11 8 end x = SolveReducedSystem( V top,bottom , W top,bottom , F top,bottom 9 ¯ ) i i i 10 for i in p do  0 ✓ �  C i � ◆ x top x i = A − 1 x bottom ¯ ¯ F i − 11 i +1 − i i − 1 B i 0 12 end 13 x = P T x Demko, S., Moss, W. F., & Smith, P. W. (1984). Decay rates for inverses of band matrices. Mathematics of computation , 43 (168), 491-499. 17

  18. Computing selected elements of the inverse A − 1 A − 1 A − 1     b 11 11 12 13 Algorithm 1: SPIKE solver pseudocode. Given A , P and x , solves the x = A − 1 b = A − 1 A − 1 A − 1 b 21 x = P T b for x and returns the Px � PAP T � linear system     21 22 23 A − 1 A − 1 A − 1 b 31 31 32 33 Data: A , P , b Result: P T x 1 A := PAP T x 11 = A − 1 11 b 11 + A − 1 12 b 21 + A − 1 13 b 31 2 b := Pb 3 p = ComputeOptimalNumerOfPartitions( A , nrhs ) 4 for i in p do ?  0 � V i = A − 1 Diagonal Dominance 5 i B i y  C i � W i = A − 1 6 i 0 F i = A − 1 i b i 7 x 11 ≈ A − 1 11 b 11 8 end x = SolveReducedSystem( V top,bottom , W top,bottom , F top,bottom 9 ¯ ) ? i i i 10 for i in p do  0 via Neumann Series ✓ �  C i � ◆ y x top x i = A − 1 x bottom ¯ ¯ F i − 11 i +1 − i i − 1 B i 0 12 end 13 x = P T x ∞ A − 1 = ( I � A ) k X ( ) k I � A k ∞ < 1 k =0 Demko, S., Moss, W. F., & Smith, P. W. (1984). Decay rates for inverses of band matrices. Mathematics of computation , 43 (168), 491-499. 18

  19. Computing selected elements of the inverse ∞ A − 1 = ( I � A ) k X ( ) k I � A k ∞ < 1 k =0 = · ( I − A ) k ( I − A ) k − 1 ( I − A ) Truncated SPIKE Neumann Truncated SPIKE O ( n O (( k u + k l ) 2 ) Memory 2 × ( k u + k l )) O ( n 2 × ( k u + k l ) 2 ) O (( k u + k l ) 3 ) FLOPs 1 . 83 × 10 − 13 8 . 10 × 10 − 13 Relative Error Space and time complexity comparison between LU (diagonal boosting) and Neumann series approach applied to the factorization stage of the SPIKE al- gorithm ( p = 2). Relative error is referred to the solution of the whole linear system using SPIKE. Mikkelsen, C. C. K., & Manguoglu, M. (2008). Analysis of the truncated SPIKE algorithm. 19 SIAM Journal on Matrix Analysis and Applications , 30 (4), 1500-1519.

  20. Computing selected elements of the inverse 400x 250x 20

  21. Could we extend this to more general cases? Direct evaluation of A − 1 : ∞ A − 1 = ( I � A ) k X ( ) k I � A k ∞ < 1 k =0 ∞ 1 � − 1 = I � P − 1 ( I � cA ) k I � P − 1 ( I � cA ) k ∞ < 1 � k X P − 1 A � � ( ) c k =0 Indirect or Nested evaluation of A − 1 : ∞ ( I � A ) − 1 = X A k ( ) k A k ∞ < 1 k =0 ∞ ( I � cA ) − 1 = X c k A k ( ) k cA k ∞ < 1 k =0 ! j ∞ ∞ ∞ I − ( I − A ) − 1 i − 1 A − 1 = − ( I − A ) − 1 h A k + X X X A k = ? ⇐ ⇒ k =0 j =0 k =0 21

  22. Takeways Ninja skills are required, but not su ffi cient It is possible to use direct solvers for solving large 3D problems. We’re working on on-the-fly factorizations for diagonal blocks. 22

  23. www.bsc.es Thank you! samuel.rodriguez@bsc.es 23

Recommend


More recommend