A Hybrid Multithreaded Direct Sparse Triangular Solver
Andrew M. Bradley
Thanks: E. Boman, C. Dohrmann, S. Hammond, W. Held, M. Heroux, M. Hoemmen, K. Kim, S. Olivier, A. Prokopenko, S. Rajamanickam
SIAM CSC16
SAND2016-10150 C
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Problem Statement
• Solve R P T Q x = b
  • Upper or lower sparse triangular matrix T
  • Row scaling R
  • Permutations P, Q
  • Solution and RHS x, b
  • (Everything that is needed for LDL, LU, incomplete factorizations, etc.)
• Efficient to absorb user data
• For many sequential RHS with
  • Same T or
  • Same nonzero pattern pat(T)
• On a multi/many-core node
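A minimal sketch of how R P T Q x = b reduces to one triangular solve, since T (Q x) = Pᵀ R⁻¹ b. The function names, the dense list-of-rows storage, and the index-array convention for the permutations ((P v)[i] = v[p[i]], (Q x)[i] = x[q[i]]) are assumptions made for illustration; they are not the HTS interface.

```python
def lower_trisolve(T, b):
    """Serial forward substitution for a dense lower triangular T (list of rows)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = b[i] - sum(T[i][j] * y[j] for j in range(i))
        y[i] = s / T[i][i]
    return y


def solve_rptq(r, p, T, q, b):
    """Solve R P T Q x = b. r holds the diagonal of R; p and q are index arrays
    with the assumed convention (P v)[i] = v[p[i]] and (Q x)[i] = x[q[i]]."""
    n = len(b)
    c = [b[i] / r[i] for i in range(n)]   # c = R^{-1} b
    d = [0.0] * n
    for i in range(n):                    # d = P^T c
        d[p[i]] = c[i]
    y = lower_trisolve(T, d)              # y = T^{-1} d, so y = Q x
    x = [0.0] * n
    for i in range(n):                    # x = Q^T y
        x[q[i]] = y[i]
    return x


def apply_rptq(r, p, T, q, x):
    """Forward operator R P T Q x, used only to check the solve."""
    n = len(x)
    u = [x[q[i]] for i in range(n)]                                   # Q x
    v = [sum(T[i][j] * u[j] for j in range(i + 1)) for i in range(n)]  # T u
    w = [v[p[i]] for i in range(n)]                                   # P v
    return [r[i] * w[i] for i in range(n)]                            # R w


# Hypothetical 3x3 example.
r = [2.0, 4.0, 0.5]
p = [2, 0, 1]
q = [1, 2, 0]
T = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
b = [1.0, 2.0, 3.0]
x = solve_rptq(r, p, T, q, b)
```

Scaling and permutations cost O(n) each, so the triangular solve remains the only step worth parallelizing.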
Solution Approach
• Symbolic phase
  • Find parallelism in pat(T), the graph of T
• Numeric phase
  • Load data structures with numbers
• Solve phase
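The three phases can be sketched as follows for a lower triangular matrix in CSR form, using plain level scheduling as the only source of parallelism. The class and method names are hypothetical, the diagonal is assumed to be stored explicitly, and the inner loop over a level is written serially where a real implementation would run rows concurrently.

```python
class LevelScheduledSolver:
    """Sketch of symbolic / numeric / solve phases for lower triangular CSR."""

    def __init__(self, rowptr, colidx):
        # Symbolic phase: uses only the nonzero pattern pat(T).
        n = len(rowptr) - 1
        level = [0] * n
        for i in range(n):
            for k in range(rowptr[i], rowptr[i + 1]):
                j = colidx[k]
                if j < i:  # row i depends on row j
                    level[i] = max(level[i], level[j] + 1)
        self.levels = [[] for _ in range(max(level, default=-1) + 1)]
        for i, lev in enumerate(level):
            self.levels[lev].append(i)
        self.rowptr, self.colidx = rowptr, colidx
        self.values = None

    def numeric(self, values):
        # Numeric phase: load numbers into the pattern found above.
        self.values = list(values)

    def solve(self, b):
        # Solve phase: levels run in order; rows within a level are independent.
        x = [0.0] * len(b)
        for rows in self.levels:
            for i in rows:
                s, diag = b[i], None
                for k in range(self.rowptr[i], self.rowptr[i + 1]):
                    j = self.colidx[k]
                    if j == i:
                        diag = self.values[k]
                    else:
                        s -= self.values[k] * x[j]
                x[i] = s / diag
        return x


# Hypothetical 4x4 pattern: rows 0 and 1 are independent, row 2 needs row 0,
# row 3 needs rows 1 and 2.
rowptr = [0, 1, 2, 4, 7]
colidx = [0, 1, 0, 2, 1, 2, 3]
solver = LevelScheduledSolver(rowptr, colidx)   # symbolic, done once
solver.numeric([2, 3, 1, 4, 1, 2, 5])           # numeric
x1 = solver.solve([2, 3, 5, 10])                # solve
solver.numeric([1, 1, 1, 1, 1, 1, 1])           # same pattern, new numbers
x2 = solver.solve([1, 1, 2, 3])
```

The second `numeric` call reuses the level sets unchanged, which is the pattern-reuse case from the problem statement.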
Motivation: Level Scheduling
[Figure: matrix reordered by level set ("Reorder"). Plot of log10 #Rows vs. Level Index.]
Motivation: Level Scheduling
[Figure: three panels vs. Level Index (0 to 1600): log2 #Rows; Cumulative Fraction #Rows; Cumulative Fraction NNZ.]
Motivation: Hybrid
[Figure: matrix reordered ("Reorder"). Plot of log2 #Rows vs. log10 Level Index.]
Motivation: Hybrid
Solve phase on Knights Corner
Elastic cube, bilinear hexes, 86490 unknowns, L from LDL, NodeND
[Figure: speedup w.r.t. MKL trisolver (mkl_cspblas_dcsrtrsv) vs. # Threads (1, 4, 8, 16, 28, 57, 114), KMP_AFFINITY=balanced. Curves: Hybrid solver; Level scheduling only; Recursive blocking only.]
Software: HTS
• Trilinos/packages/shylu/hts
• C++ and OpenMP
• Templated on row pointer, column index, and scalar types
• CSR, CSC, forward, transpose, conjugate inputs
• Effective nonzero pattern reuse
• Will be an option in Ifpack2::LocalSparseTriangularSolver
  • Interface will support nonzero pattern reuse
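Algorithms: Switching Method
• Want robustness to downward and upward spikes in $N_i$.
• Use levels 1 to $k$, with $n_{\text{good}} \sim 10$ and $f_{\text{bad}} \sim 1\%$:
  $N_i \equiv$ size of level set $i$
  $C_i \equiv \sum_{j \le i} N_j$
  $C_i^{\text{bad}} \equiv \sum_{j \le i,\; N_j < n_{\text{good}}} N_j$
  $k \equiv \max \{\, k' : N_{k'} \ge n_{\text{good}} \ \text{and}\ C_{k'}^{\text{bad}} \le f_{\text{bad}}\, C_{k'} \,\}$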
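The switching rule on this slide can be sketched in a few lines: accumulate the total and "bad" row counts level by level and keep the last level that is itself large enough while the bad fraction so far stays within tolerance. The function name, the 1-based return value, and returning 0 when no level qualifies are assumptions for illustration.

```python
def switch_level(level_sizes, n_good=10, f_bad=0.01):
    """Return the largest 1-based k such that level k has at least n_good rows
    and the rows in small levels (size < n_good) among levels 1..k make up at
    most a fraction f_bad of all rows in levels 1..k. Returns 0 if no level
    qualifies (an assumed convention)."""
    total = 0      # C_i: rows in levels 1..i
    bad = 0        # C_i^bad: rows in levels 1..i whose level is small
    k = 0
    for i, size in enumerate(level_sizes, start=1):
        total += size
        if size < n_good:
            bad += size
        if size >= n_good and bad <= f_bad * total:
            k = i
    return k


# Hypothetical level-size profile with a downward spike at level 3.
sizes = [1000, 800, 5, 600, 3, 2, 1]
k = switch_level(sizes)
```

Here the 5-row level is tolerated because it is followed by a large level and contributes well under 1% of the rows seen so far, so level scheduling would cover levels 1 to 4 and the tail of tiny levels would go to the other method.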
Algorithms: Level Scheduling
[Figure: matrix reordered by level set ("Reorder").]
Algorithms: Pruned Point-to-Point
[Figure: two diagrams of tasks 0 to 5 arranged in Levels 1 to 3 and assigned to Thread 0 and Thread 1.]
Park, J., M. Smelyanskiy, N. Sundaram, and P. Dubey. "Sparsifying synchronization for high-performance shared-memory sparse triangular solver." In Supercomputing, pp. 124-140. Springer International Publishing, 2014.
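One simple way to thin out point-to-point synchronization is sketched below. It is a simplified illustration, not the algorithm of Park et al.: it assumes each thread executes its tasks in a fixed order, drops dependencies on the same thread (already implied by that order), and for each other thread keeps only the wait on its latest producer task. The task layout mirrors the slide (tasks 0, 2, 4 on Thread 0; tasks 1, 3, 5 on Thread 1), but the dependency edges are hypothetical.

```python
def prune_dependencies(deps, thread_of, order_of):
    """deps: task -> set of producer tasks it must wait for.
    thread_of: task -> thread id. order_of: task -> position in its thread's
    execution order. Returns task -> set of cross-thread waits that remain."""
    pruned = {}
    for task, producers in deps.items():
        latest = {}  # producer thread -> latest producer task on that thread
        for p in producers:
            t = thread_of[p]
            if t == thread_of[task]:
                continue  # same thread: program order already enforces this
            if t not in latest or order_of[p] > order_of[latest[t]]:
                latest[t] = p  # waiting on the latest covers the earlier ones
        pruned[task] = set(latest.values())
    return pruned


thread_of = {0: 0, 2: 0, 4: 0, 1: 1, 3: 1, 5: 1}
order_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}   # here: the level index
deps = {2: {0, 1}, 3: {1}, 4: {1, 2, 3}, 5: {2, 3}}  # hypothetical edges
pruned = prune_dependencies(deps, thread_of, order_of)
waits_before = sum(len(v) for v in deps.values())
waits_after = sum(len(v) for v in pruned.values())
```

In this example eight dependency edges reduce to three cross-thread waits.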
Algorithms: Recursive Blocking
[Figure: recursively blocked triangular matrix. Block labels: serial trisolve; serial mvp; inverse, parallel mvp; parallel mvp. Annotations: sparse or dense; parallel or serial.]
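A minimal sketch of the recursion, assuming a dense lower triangular T stored as a list of rows. Splitting T into [[T11, 0], [T21, T22]] gives x1 = T11⁻¹ b1 and x2 = T22⁻¹ (b2 − T21 x1), so the off-diagonal work becomes a matrix-vector product (mvp) whose rows are independent. The function name and the `leaf` cutoff are hypothetical, and the leaves use a serial trisolve only; the explicit-inverse option named on the slide is not shown.

```python
def recursive_block_solve(T, b, leaf=2):
    """Solve T x = b for dense lower triangular T by recursive 2x2 blocking."""
    x = list(b)  # work vector: holds the updated RHS, then the solution

    def solve(lo, hi):
        if hi - lo <= leaf:
            # Leaf: serial forward substitution on the diagonal block.
            for i in range(lo, hi):
                for j in range(lo, i):
                    x[i] -= T[i][j] * x[j]
                x[i] /= T[i][i]
            return
        mid = (lo + hi) // 2
        solve(lo, mid)                       # x1 = T11^{-1} b1
        for i in range(mid, hi):             # b2 -= T21 x1 (rows independent)
            x[i] -= sum(T[i][j] * x[j] for j in range(lo, mid))
        solve(mid, hi)                       # x2 = T22^{-1} b2

    solve(0, len(b))
    return x


# Hypothetical 5x5 example with a known solution.
T = [[2, 0, 0, 0, 0],
     [1, 3, 0, 0, 0],
     [0, 2, 4, 0, 0],
     [1, 0, 1, 5, 0],
     [2, 1, 0, 3, 6]]
x_true = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [sum(T[i][j] * x_true[j] for j in range(5)) for i in range(5)]
x = recursive_block_solve(T, b)
```

The mvp loop is where a threaded implementation would spend its parallel effort; the leaf size trades serial work in the trisolves against recursion depth.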
Results: UMFPACK LU on IB and KNC
[Figure: per-matrix solve phase speedup w.r.t. MKL trisolver for Hybrid, Recursive blocking, and Level scheduling. Panels: UMFPACK LU, Ivy Bridge, 20 threads (OMP_PROC_BIND=spread OMP_PLACES=cores); UMFPACK LU, Knights Corner, 240 threads (KMP_AFFINITY=compact). Additional panel: straightforward reference serial trisolver speedup w.r.t. MKL trisolver, on Knights Corner and Ivy Bridge.]
Matrices: copter2, gas_sensor, matrix-new_3, av41092, xenon2, epb3, c-71, shipsec1, xenon1, g7jac160, g7jac140sc, mark3jac120, mark3jac100sc, ct20stif, vanbody, ncvxbqp1, dawson5, venkat50, c-59, 2D_54019_highK, gridgena, torso2, finan512, twotone, torsion1, jan99jac120, boyd1, c-73b, hvdc2, rajat16, hcircuit
Results: UMFPACK LU on IB and KNC
[Figure: two columns of per-matrix plots for the 31 test matrices. Left: UMFPACK LU, Ivy Bridge, OMP_PROC_BIND=spread OMP_PLACES=cores, with 10, 20, and 40 threads. Right: UMFPACK LU, Knights Corner, KMP_AFFINITY=compact, with 60, 120, and 240 threads. Rows: solve phase speedup w.r.t. MKL trisolver; (Numeric phase time) / (parallel solve time); (Symbolic phase time) / (serial solve time).]
[Figure: solve phase speedup w.r.t. MKL trisolver vs. N (10^3 to 10^6), with the curve "Median for ≥ N". Panels: UMFPACK LU, Ivy Bridge, 20 threads, 822 UF matrices (OMP_PROC_BIND=spread OMP_PLACES=cores); UMFPACK LU, Knights Corner, 240 threads, 824 UF matrices (KMP_AFFINITY=compact).]
Future Work
• Point-to-point level scheduling
  • Group rows into tasks to minimize #dependencies
  • Size tasks to reflect level of synchronization
• Hybrid
  • Switching method(s)
  • Does not have to be 3 blocks; alternate
• HTS
  • Improve formatting of recursively blocked part to take further advantage of dense sub-blocks
• Direct sparse methods on GPU?