Improving Virtually Guided Product Certification with Implicit Finite Element Analysis at Scale
Seid Koric (1), Robert F. Lucas (2), Erman Guleryuz (1)
(1) National Center for Supercomputing Applications (NCSA)
(2) Livermore Software Technology Corporation (LSTC)
Blue Waters Symposium 2018, June 6th
NCSA Private Sector Partners
Seid Koric, Erman Guleryuz Todd Simons, James Ong Robert Lucas, Roger Grimes, Francois-Henry Rouet Jef Dawson, Ting-Ting Zhu
Project overview
• Long-term vision: Fully virtual product development and certification with digital twins
• Pressing need for high-performance simulations to reduce development risks and costs
• Finite Element Method widely used for product design
• Challenge: Large-scale system-level models with a wide spectrum of characteristic lengths
• Parallel performance is key for impact
• Parallel performance = f(code, input, platform)
• Measure-analyze-improve cycle with large-scale real-life models
Simulation model: Gas turbine engine
PDE, discretized to the linear system: $[K]_{i-1}^{t+\Delta t}\,\{\Delta u\}_{i-1}^{t+\Delta t} = \{R\}_{i-1}^{t+\Delta t}$
Implicit FEM, direct method based on factorization
Linear system solved at each Newton-Raphson (NR) iteration
Solving the linear system takes a large portion of the run time
O(100 M) DOF
Stiffness matrix [K] is sparse: most coefficients are zero
Triangular systems are easy to solve
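The solve-per-iteration pattern on this slide can be illustrated with a minimal sketch. It is not the production solver: the 5-DOF model, the cubic spring term `c * u**3`, and all function names are assumptions made for illustration, and a dense Cholesky factorization stands in for the sparse direct method. Each Newton-Raphson iteration factorizes the tangent stiffness and then performs two triangular solves.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for a lower-triangular matrix L."""
    y = np.zeros(len(b))
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_substitution(U, y):
    """Solve U x = y for an upper-triangular matrix U."""
    x = np.zeros(len(y))
    for i in range(len(y) - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

def newton_raphson(K0, c, f_ext, tol=1e-10, max_iter=25):
    """Solve K0 u + c u^3 = f_ext (hypothetical nonlinear model).

    Returns the solution and the number of linear solves performed.
    """
    u = np.zeros_like(f_ext)
    for iteration in range(1, max_iter + 1):
        R = f_ext - (K0 @ u + c * u**3)          # residual {R}
        if np.linalg.norm(R) <= tol * max(1.0, np.linalg.norm(f_ext)):
            return u, iteration - 1
        K = K0 + np.diag(3.0 * c * u**2)         # tangent stiffness [K]
        L = np.linalg.cholesky(K)                # factorization K = L L^T
        y = forward_substitution(L, R)           # triangular solve 1
        du = back_substitution(L.T, y)           # triangular solve 2
        u = u + du
    raise RuntimeError("Newton-Raphson did not converge")

# Illustrative data only: a small SPD tridiagonal stiffness matrix.
n = 5
K0 = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
c = 0.5
f_ext = 0.1 * np.ones(n)

u, iterations = newton_raphson(K0, c, f_ext)
print("linear solves:", iterations)
print("residual norm:", np.linalg.norm(f_ext - (K0 @ u + c * u**3)))
```

The factorization is the expensive step; the two substitutions touch each factor entry only once, which is why triangular solves are comparatively cheap.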
Parallel efficiency of phases in linear solver
105M DOF, one implicit time step, 8 threads/MPI
$E(p) = \dfrac{T(n,1)}{p\,T(n,p)}$, where $T(n,p)$ is the wall-clock time for problem size $n$ on $p$ processing units
Symbolic factorization is the sequential bottleneck
Triangular solution wall-clock time < 10 s
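As a worked instance of the efficiency definition E(p) = T(n,1) / (p T(n,p)), the sketch below evaluates it for made-up timings. The numbers are hypothetical and are not measurements from this study; the function name is likewise an assumption.

```python
def parallel_efficiency(t_serial, p, t_parallel):
    """E(p) = T(n,1) / (p * T(n,p)) for a fixed problem size n."""
    if p < 1 or t_serial <= 0 or t_parallel <= 0:
        raise ValueError("p must be >= 1 and timings must be positive")
    return t_serial / (p * t_parallel)

# Hypothetical timings in seconds, for illustration only.
t1 = 100.0   # T(n,1)
p = 4
tp = 30.0    # T(n,4)
efficiency = parallel_efficiency(t1, p, tp)
print(f"E({p}) = {efficiency:.3f}")
```

A phase that does not speed up at all, such as a sequential symbolic factorization, has T(n,p) = T(n,1), so its efficiency falls off as 1/p.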
Sparse matrix reordering - Highlights
• LS-GPart: Non-multilevel parallel nested dissection based on half-level sets
• Graph theory leveraged; the goal is to find a vertex separator of the adjacency graph of K
• Recursive partitioning of the graph defined by a tree of vertex separators
• Optimal vertex separators lead to optimal fill-in and factor FLOPS
• Popular fill-reducing reordering tools: METIS (default), Scotch, ParMETIS, PT-Scotch
• Ordering quality = f(factor non-zeros, factor flops)
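The effect of ordering separators last can be seen on a toy problem. The sketch below counts fill-in during symbolic elimination of a 3-by-3 grid graph under two orderings: the natural row-by-row order, and a hand-built nested dissection order in which the middle column {1, 4, 7} is the top-level separator. This is an illustration of the general idea only; it is not LS-GPart or METIS, and the helper names are assumptions.

```python
from itertools import combinations

def grid_graph(k):
    """Adjacency sets of a k-by-k grid (5-point stencil), numbered row by row."""
    adj = {v: set() for v in range(k * k)}
    for r in range(k):
        for c in range(k):
            v = r * k + c
            if c + 1 < k:
                adj[v].add(v + 1)
                adj[v + 1].add(v)
            if r + 1 < k:
                adj[v].add(v + k)
                adj[v + k].add(v)
    return adj

def count_fill(adj, order):
    """Count edges created when vertices are eliminated in the given order.

    Eliminating a vertex connects all of its remaining neighbors pairwise;
    every edge that did not exist before is one fill-in entry in the factor.
    """
    work = {v: set(nbrs) for v, nbrs in adj.items()}
    fill = 0
    for v in order:
        nbrs = work.pop(v)
        for a in nbrs:
            work[a].discard(v)
        for a, b in combinations(sorted(nbrs), 2):
            if b not in work[a]:
                work[a].add(b)
                work[b].add(a)
                fill += 1
    return fill

grid = grid_graph(3)
natural_order = list(range(9))
# Left half (0, 6, then its separator 3), right half (2, 8, then 5),
# and the top-level separator column (1, 4, 7) last.
nested_dissection_order = [0, 6, 3, 2, 8, 5, 1, 4, 7]

natural_fill = count_fill(grid, natural_order)
nd_fill = count_fill(grid, nested_dissection_order)
print("fill-in, natural order:          ", natural_fill)
print("fill-in, nested dissection order:", nd_fill)
```

On this grid the natural order produces 8 fill-in entries and the separator-last order produces 5. The gap grows with problem size, which is why ordering quality is judged by factor non-zeros and factor flops.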
Reordering performance and quality
4096 threads, 2 MPI/node, 8 threads/MPI
[Figure: Trace of available memory with Ovis. Axes: Free memory (GB) vs. Time (minutes)]
Numeric factorization - Highlights
• Multifrontal block low-rank factorization
• Sequence of dense matrix operations using the elimination tree
• Factorization consists of a bottom-up traversal of the tree
• Each node of the tree corresponds to a dense matrix, processed with BLAS
• Natural parallelization: Dependencies between tasks are captured by the elimination tree
• Columns in different tree branches can be factorized in parallel
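To make the role of the elimination tree concrete, the sketch below builds it from the sparsity pattern of a symmetric matrix using the standard ancestor-compression algorithm, then assigns each column a height so that columns of equal height are free of mutual dependencies. The input format (`upper_rows`) and the function names are assumptions for illustration; this is not the solver's internal data structure.

```python
def elimination_tree(n, upper_rows):
    """Return parent[] of the elimination tree (-1 marks a root).

    upper_rows[k] lists the row indices i < k with A[i, k] != 0,
    i.e. the strictly upper-triangular pattern of column k.
    """
    parent = [-1] * n
    ancestor = [-1] * n          # compressed path to the current root
    for k in range(n):
        for i in upper_rows[k]:
            while i != -1 and i < k:
                next_i = ancestor[i]
                ancestor[i] = k  # path compression
                if next_i == -1:
                    parent[i] = k
                i = next_i
    return parent

def tree_heights(parent):
    """Height of each node above its deepest leaf (leaves have height 0)."""
    height = [0] * len(parent)
    for v, p in enumerate(parent):   # children always precede their parent
        if p != -1:
            height[p] = max(height[p], height[v] + 1)
    return height

# Arrow matrix: columns 0, 1, 2 couple only to column 3.
arrow_parent = elimination_tree(4, [[], [], [], [0, 1, 2]])
# Tridiagonal matrix: column k couples to column k - 1.
chain_parent = elimination_tree(4, [[], [0], [1], [2]])

print("arrow tree:", arrow_parent, "heights:", tree_heights(arrow_parent))
print("chain tree:", chain_parent, "heights:", tree_heights(chain_parent))
```

In the arrow case, columns 0, 1, and 2 sit in separate branches and could be factorized concurrently before column 3. The tridiagonal case yields a single chain, so the tree exposes no parallelism between columns.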
[Figure panels: Numeric factorization; Triangular solution]
Memory trace for 200 M DOF model
xe (64 GB) and xe_himem (128 GB) combo
Input processing and domain decomposition on an xe_himem node
Work-in-progress and long-term vision
• Parallel symbolic factorization
• Unknown bottleneck: constraint processing
• Load balance, communication improvements
• Relevant scale for full impact?
• Dr. Yoon Ho, Rolls-Royce, ISC14
Concluding remarks
• Current state: Scaling up to 16,000 threads with hybrid parallelization, 30 Tflop/s, 200 M DOF
• Collaboration model matters: All stakeholders on board
• Software development challenges: Access to HPC, lack of portable tools, hardware complexity
• Additional challenges: Multiple development teams for large codes, evolve from present code
• Beyond FLOPS: Workflow problem on path towards fully virtual product development
• 2018 ASCR Leadership Computing Challenge (ALCC) award
• Special thanks to Blue Waters SEAS team for great technical support!