Improving Virtual Prototyping and Certification with Implicit Finite Element Method at Scale

Seid Koric 1,2, Robert F. Lucas 3, Erman Guleryuz 1
1 National Center for Supercomputing Applications
2 Mechanical Science and Engineering Department, University of Illinois
3 Livermore Software Technology Corporation

Blue Waters Symposium 2019, June 5th

Seid Koric, Erman Guleryuz
Todd Simons, James Ong
Robert Lucas, Roger Grimes, Francois-Henry Rouet
Jef Dawson, Ting-Ting Zhu

Overview of the project

• Today: Virtual prototypes supplement physical tests in design and certification
• Vision: Further reduce cost and risk (supplement → replacement)
• Immediate goal: Increase the impact of simulation technology
• Impact of simulation = f(speed, scale, fidelity)
• Performance scaling = f(code, input, machine)
• FEM: Partial differential equations → sparse linear system (see the sketch below)
• HPC strategy: Sparse linear algebra → dense linear algebra
• Overall approach: Scale, analyze, and improve with real-life models

[Figure: Rolls-Royce representative engine model]

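The "PDE → sparse linear system" step can be made concrete with a toy problem. The following is a minimal sketch in Python/SciPy (a generic illustration, not LS-DYNA code; the mesh size and load are arbitrary assumptions): a 1D Poisson equation discretized with linear finite elements yields a sparse stiffness matrix, which a sparse direct solver then factors and solves.

```python
# Illustration only (not LS-DYNA code): a 1D Poisson problem -u'' = f
# discretized with linear finite elements yields a sparse (tridiagonal)
# stiffness matrix; direct solvers then operate on dense sub-blocks of it.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000                      # number of interior nodes (assumed mesh size)
h = 1.0 / (n + 1)             # uniform element length

# Element stiffness (1/h) * [[1, -1], [-1, 1]] assembled over all elements
main = 2.0 / h * np.ones(n)
off  = -1.0 / h * np.ones(n - 1)
K = sp.diags([off, main, off], [-1, 0, 1], format="csc")   # sparse system matrix

f = np.ones(n) * h            # right-hand side for f(x) = 1
u = spla.spsolve(K, f)        # sparse direct solve (SuperLU under the hood)

print("nonzeros in K:", K.nnz, "of", n * n, "entries")
print("max of solution:", u.max())   # analytic maximum is 0.125
```
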
Overview of challenges

• More specific: these apply to LS-DYNA and any other significant MCAE ISV
  - Large legacy code: cannot start from scratch, must evolve gracefully
  - General-purpose code: cannot optimize for a narrow class of problems
  - Key algorithms are NP-complete/NP-hard, so heuristics are unavoidable
• More universal: these probably apply to any significant scientific or engineering code
  - Limited number of software development tools, especially for performance engineering
  - Increasing complexity of hardware architectures, combined with frequent design updates
  - Performance-portability constraints for codes used on many systems
  - Limited HPC access, especially for ISVs

Parallel scaling at the beginning of the Blue Waters project

[Figure: 100M DOF model, three implicit load steps; time (seconds, up to 14,000) vs. MPI ranks (128 to 2048) for the original code in hybrid mode (8 threads per MPI rank)]

Improvement framework and progress highlights

• Memory management improvements
  - Dynamic allocation
• Existing algorithm improvements
  - Inter-node communication
• Previously unknown bottlenecks
  - Constraint processing
• Entirely new algorithms
  - Parallel matrix reordering
  - Parallel symbolic factorization
• Computation workflow modifications
  - Offline parsing and decomposition of the model

[Diagram: Measure → Analyze → Improve → Scale-up cycle]

NCSA OVIS view of LS-DYNA execution

[Figure: free memory (GB) on MPI rank zero's node vs. time (minutes) for a 105M DOF model on 256 MPI ranks with 8 threads each; annotated phases: input processing, sequential symbolic preprocessing, reordering, and assemble, redistribute, factor and solve (marked 2X)]

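OVIS collected the node-level memory data behind this view. As a rough stand-in for readers without OVIS, the hypothetical sketch below samples MemFree from /proc/meminfo at fixed intervals (a Linux-only assumption; the function names and intervals are made up for illustration) to build a comparable free-memory timeline for one node.

```python
# Hypothetical stand-in for an OVIS-style memory trace (Linux only):
# sample MemFree from /proc/meminfo once per interval and log it with a timestamp.
import time

def read_free_gb():
    """Return MemFree from /proc/meminfo in gigabytes."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemFree:"):
                kb = int(line.split()[1])      # value is reported in kB
                return kb / (1024.0 ** 2)
    raise RuntimeError("MemFree not found")

def monitor(interval_s=60, duration_s=3600):
    """Print (elapsed minutes, free GB) pairs, e.g. on MPI rank zero's node."""
    start = time.time()
    while time.time() - start < duration_s:
        elapsed_min = (time.time() - start) / 60.0
        print(f"{elapsed_min:6.1f} min  {read_free_gb():6.1f} GB free")
        time.sleep(interval_s)

if __name__ == "__main__":
    monitor(interval_s=60, duration_s=2 * 3600)   # assumed 2-hour run
```
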
Multifrontal sparse linear solver

Sparse linear system from the implicit FEM equilibrium iterations:

$\left[{}^{t+\Delta t}K_{i-1}\right]\left\{{}^{t+\Delta t}\Delta u_{i-1}\right\} = \left\{{}^{t+\Delta t}R_{i-1}\right\}$

Multifrontal method: Input processing → Matrix reordering → Symbolic factorization → Numeric factorization → Triangular solution (see the sketch below)

[Figure: assembly tree of submatrices]
[Figure: multifrontal factorization parallel scaling; numeric factorization rate (Tflop/s) and numeric factorization memory footprint per process (GB) vs. number of threads (2,000 to 18,000)]

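The pipeline above is common to sparse direct solvers in general. As a rough, generic illustration (not LS-DYNA's solver), the sketch below runs the same phases through SciPy's SuperLU interface, which is a supernodal rather than multifrontal factorization; the test matrix and its size are arbitrary assumptions.

```python
# Generic sketch of the sparse direct-solver pipeline named on the slide,
# using SciPy's SuperLU (a supernodal, not multifrontal, solver) as a stand-in.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# --- Input processing: build an (arbitrary) sparse SPD test matrix ----------
n = 200
T = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(n, n))
A = sp.kron(sp.eye(n), T) + sp.kron(sp.diags([-1, -1], [-1, 1], shape=(n, n)), sp.eye(n))
A = A.tocsc()                                  # 2D Laplacian-like matrix, n*n unknowns
b = np.ones(A.shape[0])

# --- Matrix reordering + symbolic + numeric factorization -------------------
# splu performs fill-reducing column reordering (permc_spec), symbolic
# analysis, and numeric LU factorization in one call.
lu = spla.splu(A, permc_spec="COLAMD")

# --- Triangular solution -----------------------------------------------------
x = lu.solve(b)                                # forward/backward substitution

print("unknowns:", A.shape[0])
print("nnz(A):  ", A.nnz)
print("nnz(L+U):", lu.L.nnz + lu.U.nnz)        # fill-in created by factorization
print("residual:", np.linalg.norm(A @ x - b))
```
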
Results – Comparison with MUMPS factorization

LS-GPart nested dissection for eight processors

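Nested dissection recursively splits the problem graph with small separators and numbers each separator last, so its elimination produces limited fill and the subdomains on either side can be factored independently (and hence on different processors). The sketch below is a purely illustrative toy version on a structured grid, not LS-GPart itself; the function name, grid model, and block size are assumptions.

```python
# Toy nested dissection on a structured grid (NOT LS-GPart):
# recursively split the grid with a one-row/one-column separator, order the
# two halves first, and number the separator vertices last.
def nested_dissection(x0, x1, y0, y1, order, min_size=2):
    """Append grid points of the sub-rectangle [x0,x1) x [y0,y1) to `order`."""
    nx, ny = x1 - x0, y1 - y0
    if nx <= min_size or ny <= min_size:           # small block: natural order
        order.extend((i, j) for i in range(x0, x1) for j in range(y0, y1))
        return
    if nx >= ny:                                   # split along the longer side
        mid = x0 + nx // 2
        nested_dissection(x0, mid, y0, y1, order, min_size)
        nested_dissection(mid + 1, x1, y0, y1, order, min_size)
        order.extend((mid, j) for j in range(y0, y1))   # separator column last
    else:
        mid = y0 + ny // 2
        nested_dissection(x0, x1, y0, mid, order, min_size)
        nested_dissection(x0, x1, mid + 1, y1, order, min_size)
        order.extend((i, mid) for i in range(x0, x1))   # separator row last

order = []
n = 8
nested_dissection(0, n, 0, n, order)
perm = {p: k for k, p in enumerate(order)}         # grid point -> elimination index
print("first eliminated:", order[:4])
print("last eliminated (top-level separator):", order[-4:])
```
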
Results – LS-GPart matrix reordering quality

[Figure: nnz(L+U), normalized with respect to Metis (log scale, 0.1 to 10), for AMD, MMD, Sparspak-ND, Metis, Scotch, Spectral, and LS-GPart across a suite of test matrices]

LS-GPart added to the reordering comparison presented in "Preconditioning using Rank-structured Sparse Matrix Factorization", Ghysels et al., SIAM PP 2018

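The figure's quality metric, nnz(L+U) normalized against a reference ordering, can be reproduced in miniature with the orderings SciPy exposes. The sketch below only illustrates the metric: SciPy's SuperLU offers NATURAL, minimum-degree, and COLAMD column orderings, not the AMD/MMD/Sparspak-ND/Metis/Scotch/Spectral/LS-GPart orderings compared on the slide, and the test matrix is an arbitrary 2D Laplacian.

```python
# Illustration of the nnz(L+U) quality metric from the slide, using the column
# orderings available in SciPy's SuperLU interface (these are NOT the orderings
# compared in the figure above).
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 100                                            # arbitrary grid size
T = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.eye(n), T) + sp.kron(sp.diags([-1, -1], [-1, 1], shape=(n, n)), sp.eye(n))).tocsc()

fill = {}
for spec in ["NATURAL", "MMD_AT_PLUS_A", "COLAMD"]:
    lu = spla.splu(A, permc_spec=spec)
    fill[spec] = lu.L.nnz + lu.U.nnz               # fill-in under this ordering

base = fill["COLAMD"]                              # normalize w.r.t. one ordering
for spec, nnz in fill.items():
    print(f"{spec:14s} nnz(L+U) = {nnz:9d}   ratio = {nnz / base:5.2f}")
```
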
Results – LS-GPart performance

[Figure: time (seconds, 0 to 350) vs. processor count (128 to 2048) for LS-GPart, ParMETIS, and PT-Scotch]

Results – New symbolic factorization performance scaling

[Figure: mf3Sym with LS-GPart, 100M DOF model; time (seconds, 0 to 30) vs. MPI ranks (128 to 2048), broken down by component routines (mf3Indist, mf3FormTree, mf3Permute, mf3Assign, mf3Locate, mf3PostOrder, mf3Affinity, mf3PtrInit, mf3PrmSize, mf3Symbolic, mf3Redist, mf3Owner, mf3ObjMember, mf3KObjStats, wait, mf3DomSym, mf3Finish, mf3DomSymFct, mf3SNtile, mf3SepTree, mf3SMPsize, mf3SepSymFct); the original sequential code took ~300 seconds]

Results – Before and after Blue Waters engagement

[Figure: 100M DOF model, three implicit load steps; time (seconds, up to 14,000) vs. MPI ranks (128 to 2048) for Before (hybrid, 8 threads/MPI), After (MPP), and After (hybrid, 8 threads/MPI)]

Results – Overall practical impact

• Finite element model with 200 million degrees of freedom
• Cumulative effect of better code and more compute resources
• Two orders of magnitude reduction in time-to-solution
• Work in progress toward further practical impact

Future work and concluding remarks

• Industrial challenges remain beyond the capabilities of today's hardware and software
• New design decisions based on finer-grain analyses and more benchmarks
• More scale will also be coupled with more physics
• The right collaboration model accelerates progress
• HPC access is critical to advancing the state of the art
• The project benefits a much broader community and many sectors
• Special thanks to the Blue Waters SEAS team for technical support
