Design, Implementation and Applications of the PETSc-MUMPS Interface
Hong Zhang
Computer Science, Illinois Institute of Technology
Mathematics and Computer Science, Argonne National Laboratory

What is PETSc?
Portable, Extensible Toolkit for Scientific Computation
• Sequential and parallel data structures
• Sequential and parallel algebraic solvers
• API for advanced methods
• Portable(?) to virtually all systems
• Funded largely by the US Dept. of Energy
• www.mcs.anl.gov/petsc (free)
Structure of PETSc
(layered diagram)
• PDE application codes
• ODE integrators | Visualization
• Nonlinear solvers | Interface | Linear solvers (preconditioners + Krylov methods)
• Object-oriented matrices, vectors, indices | Grid management
• Profiling interface
• Computation and communication kernels: MPI, MPI-IO, BLAS, LAPACK

PETSc Numerical Components
• Nonlinear solvers: Newton-based methods (line search, trust region), other
• Time steppers: Euler, backward Euler, pseudo time stepping, other
• Krylov subspace methods: GMRES, CG, CGS, Bi-CG-STAB, TFQMR, Richardson, Chebyshev, other
• Preconditioners: additive Schwarz, block Jacobi, Jacobi, ILU, ICC, LU (sequential only), others
• Matrices: compressed sparse row (AIJ), blocked compressed sparse row (BAIJ), block diagonal (BDIAG), dense, matrix-free, other
• Vectors, index sets (indices, block indices, stride, other), distributed arrays
What is MUMPS?
MUltifrontal Massively Parallel sparse direct Solver
• Solution of large linear systems with symmetric positive definite and general matrices
• Iterative refinement and backward error analysis
• Partial factorization and Schur complement matrix
• Several orderings interfaced: AMD, AMF, PORD, METIS
• Written in F90 with a C interface
• Parallel version requires BLACS and ScaLAPACK
• Exploits parallelism arising both from sparsity in the matrix and from dense factorization kernels
• Partially funded by the CEC ESPRIT IV long-term research project
• www.enseeiht.fr/irit/apo/MUMPS/
MUMPS solves Ax = b in three main steps:
1. Analysis (JOB=1):
   - the host performs an ordering
   - the host carries out the symbolic factorization
2. Factorization A = LU or A = LDL^T (JOB=2):
   - A is distributed to the processors
   - the numerical factorization of each frontal matrix is conducted by a master and one or more slave processors
3. Solution (JOB=3):
   - b is broadcast from the host
   - x is computed using the distributed factors
   - x is either assembled on the host or kept distributed on the processors

MUMPS:
• Each of the phases can be called separately
• Asynchronous communication: enables overlapping of communication and computation
• Dynamic scheduling: the algorithm can adapt itself at execution time to remap work and data to appropriate processors
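For concreteness, the JOB sequence above can be driven directly through the MUMPS C interface. The sketch below follows the structure of the c_example.c distributed with MUMPS; the struct field names (n, nz, irn, jcn, a, rhs) are taken from that example and may differ slightly between MUMPS versions (recent releases use nnz instead of nz), so treat it as an illustration rather than as the interface code discussed later.

/* Hedged sketch: solve a tiny 2x2 system with the MUMPS C interface,
   calling the three phases (analysis, factorization, solution) separately.
   Matrix and right-hand side are given in assembled triplet format,
   centralized on the host, with 1-based (Fortran-style) indices. */
#include <mpi.h>
#include "dmumps_c.h"

#define JOB_INIT        -1
#define JOB_END         -2
#define USE_COMM_WORLD  -987654   /* MUMPS convention for MPI_COMM_WORLD */

int main(int argc, char **argv)
{
  DMUMPS_STRUC_C id;
  int    n = 2, nz = 2;                     /* order and number of nonzeros */
  int    irn[] = {1, 2}, jcn[] = {1, 2};    /* row/column indices (1-based) */
  double a[]   = {2.0, 3.0};                /* matrix entries               */
  double rhs[] = {4.0, 9.0};                /* overwritten with x on exit   */
  int    myid;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  id.comm_fortran = USE_COMM_WORLD;
  id.par = 1;                 /* host participates in the factorization */
  id.sym = 0;                 /* general (unsymmetric) matrix           */
  id.job = JOB_INIT;  dmumps_c(&id);

  if (myid == 0) {            /* problem defined on the host only */
    id.n = n;  id.nz = nz;    /* id.nnz in recent MUMPS releases  */
    id.irn = irn;  id.jcn = jcn;  id.a = a;  id.rhs = rhs;
  }

  id.job = 1;  dmumps_c(&id); /* analysis: ordering + symbolic factorization */
  id.job = 2;  dmumps_c(&id); /* numerical factorization A = LU              */
  id.job = 3;  dmumps_c(&id); /* solution: rhs on the host now holds x       */

  id.job = JOB_END;  dmumps_c(&id);   /* release MUMPS internal data */
  MPI_Finalize();
  return 0;
}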
PETSc-MUMPS Interface
Enables easy use of the MUMPS parallel sparse direct solvers within the PETSc environment for
• algorithmic studies
• solving computationally intensive problems

Installation of PETSc and MUMPS
1. Download PETSc
2. Configure PETSc with
   ./configure.py <petsc_config_opts> --download-mumps=yes --download-scalapack=yes --download-blacs=yes
3. Build the libraries: make all
Reference: ~petsc/python/PETSc/packages/MUMPS.py
Design of the PETSc-MUMPS Interface
Each PETSc client call is dispatched through an interface routine to the corresponding MUMPS job:

PETSc client                  PETSc-MUMPS interface               MUMPS
MatCreate(comm,&A);
MatSetType(A,MATAIJMUMPS);    MatCreate_AIJMUMPS(A);              id.job=JOB_INIT;
MatLUFactorSymbolic();        MatLUFactorSymbolic_AIJMUMPS();     id.job=1;  // Analysis
MatLUFactorNumeric();         MatLUFactorNumeric_AIJMUMPS();      id.job=2;  // Factorization
MatSolve();                   MatSolve_AIJMUMPS();                id.job=3;  // Solution
MatDestroy();                 MatDestroy_AIJMUMPS();              id.job=JOB_END;

The interface adds two matrix types, each with its own method table (MatOps):
• MATAIJMUMPS (LU, built on MATAIJ):
  MatCreate_AIJMUMPS(), MatLUFactorSymbolic_AIJMUMPS(), MatLUFactorNumeric_AIJMUMPS(), MatSolve_AIJMUMPS(), MatDestroy_AIJMUMPS()
• MATSBAIJMUMPS (Cholesky, built on MATSBAIJ):
  MatCreate_SBAIJMUMPS(), MatCholeskyFactorSymbolic_SBAIJMUMPS(), MatCholeskyFactorNumeric_SBAIJMUMPS(), MatSolve_SBAIJMUMPS(), MatDestroy_SBAIJMUMPS()
PETSc Vectors
• What are PETSc vectors?
  – Fundamental objects for storing field solutions, right-hand sides, etc.
  – Each process locally owns a subvector of contiguously numbered global indices
• Create vectors via
  – VecCreate(MPI_Comm, Vec *)
    • MPI_Comm: the processes that share the vector
  – VecSetSizes(Vec, int, int)
    • the number of elements local to this process, or the total number of elements
  – VecSetType(Vec, VecType)
    • where VecType is VEC_SEQ, VEC_MPI, or VEC_SHARED
  – VecSetFromOptions(Vec) lets you set the type at runtime

PETSc Matrix Distribution
Each process locally owns a submatrix of contiguously numbered global rows.
MatGetOwnershipRange(Mat A, int *rstart, int *rend)
  – rstart: first locally owned row of the global matrix
  – rend - 1: last locally owned row of the global matrix
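A small sketch tying these calls together (PETSc-2.3-era API, matching the calls listed above; the routine name show_layout is just an illustrative helper):

#include "petscksp.h"

/* Create a parallel vector whose type is chosen at runtime and report
   which global rows of a matrix A this process owns. */
PetscErrorCode show_layout(Mat A, PetscInt N)
{
  Vec      x;
  PetscInt rstart, rend;

  VecCreate(PETSC_COMM_WORLD, &x);
  VecSetSizes(x, PETSC_DECIDE, N);   /* local size decided by PETSc      */
  VecSetFromOptions(x);              /* e.g. -vec_type mpi at runtime    */

  /* rows rstart .. rend-1 of the global matrix live on this process */
  MatGetOwnershipRange(A, &rstart, &rend);
  PetscPrintf(PETSC_COMM_SELF, "rows %d to %d are local\n", (int)rstart, (int)rend - 1);

  VecDestroy(x);                     /* VecDestroy(&x) in PETSc >= 3.2   */
  return 0;
}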
MUMPS Matrix Input Structure
• Elemental format: input centrally on the host
• Assembled format, three options:
  1. Input centrally on the host processor
  2. Structure provided on the host (analysis), entries distributed across the processors (numerical factorization)
  3. Both structure and entries provided as local triplets (ICNTL(18)=3)

Matrix Conversion in MatLUFactorNumeric_AIJMUMPS():
PETSc stores the local part of the matrix as A_i and B_i, the local diagonal and off-diagonal blocks in row-compressed format; the interface converts them into C_i, the local matrix in the triplet format MUMPS expects.
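The conversion step amounts to expanding compressed sparse rows into 1-based (row, column, value) triplets. A hypothetical helper illustrating the idea (not the actual interface code; the names are made up):

/* Convert the locally owned rows of a matrix in compressed sparse row
   (row-compressed) storage into the 1-based local triplets that MUMPS
   accepts with ICNTL(18)=3 (distributed assembled input). */
void csr_to_triplets(int rstart, int nlocal,
                     const int *ia, const int *ja, const double *va,
                     int *irn, int *jcn, double *val)
{
  int i, k, p = 0;
  for (i = 0; i < nlocal; i++) {            /* local row i is global row rstart+i */
    for (k = ia[i]; k < ia[i + 1]; k++) {   /* nonzeros of that row               */
      irn[p] = rstart + i + 1;              /* +1: MUMPS uses Fortran indexing    */
      jcn[p] = ja[k] + 1;
      val[p] = va[k];
      p++;
    }
  }
}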
Vector Conversion in MatSolve_AIJMUMPS():
The distributed PETSc right-hand side b is gathered into the centralized b that MUMPS expects on the host; the distributed solution x computed by MUMPS is returned as a distributed PETSc x.

Using the PETSc-MUMPS Interface
mpirun -np <np> petsc-prog -ksp_type preonly -pc_type lu -mat_type aijmumps <mumps_opts>
mpirun -np <np> petsc-prog -ksp_type preonly -pc_type cholesky -mat_type sbaijmumps <mumps_opts>
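A minimal sketch of what such a petsc-prog might look like (PETSc-2.3-era API assumed; the 1-D Laplacian test matrix is only a placeholder for the application's own assembly). All solver choices are left to the runtime options, so the command lines above switch it between MUMPS LU, MUMPS Cholesky, or any other PETSc solver without recompiling:

#include "petscksp.h"

int main(int argc, char **argv)
{
  Mat         A;
  Vec         x, b;
  KSP         ksp;
  PetscInt    i, j, rstart, rend, n = 100;
  PetscScalar v;

  PetscInitialize(&argc, &argv, (char *)0, (char *)0);

  /* matrix type (aij, aijmumps, sbaijmumps, ...) is chosen by -mat_type */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);

  /* assemble a 1-D Laplacian on the locally owned rows (placeholder data) */
  MatGetOwnershipRange(A, &rstart, &rend);
  for (i = rstart; i < rend; i++) {
    v = 2.0; MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES);
    if (i > 0)     { j = i - 1; v = -1.0; MatSetValues(A, 1, &i, 1, &j, &v, INSERT_VALUES); }
    if (i < n - 1) { j = i + 1; v = -1.0; MatSetValues(A, 1, &i, 1, &j, &v, INSERT_VALUES); }
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  /* right-hand side and solution with the same layout as A */
  VecCreate(PETSC_COMM_WORLD, &b);
  VecSetSizes(b, PETSC_DECIDE, n);
  VecSetFromOptions(b);
  VecDuplicate(b, &x);
  VecSet(b, 1.0);

  /* -ksp_type preonly -pc_type lu (or cholesky) turns this into one direct solve */
  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN); /* flag argument dropped in PETSc >= 3.5 */
  KSPSetFromOptions(ksp);
  KSPSolve(ksp, b, x);

  KSPDestroy(ksp); MatDestroy(A); VecDestroy(b); VecDestroy(x); /* take pointers in PETSc >= 3.2 */
  PetscFinalize();
  return 0;
}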
An Application: Modeling of Nanostructured Materials
(figure: methods trade off system size against accuracy)

Density-Functional based Tight-Binding (DFTB)
The matrices are
• large: the ultimate goal is 50,000 atoms with electronic structure, i.e. N ~ 200,000
• sparse: the nonzero density goes to 0 as N increases
• dense solutions are requested: 60% of the eigenvalues and eigenvectors
Dense solutions of large sparse problems!

The DFTB eigenvalue problem is distinguished by the following:
• (A, B) is large and sparse
  -> iterative method
• A large number of eigensolutions (60%) are requested
  -> iterative method + multiple shift-and-invert
• The spectrum has
  - poor average eigenvalue separation O(1/N),
  - clusters with hundreds of tightly packed eigenvalues,
  - gaps >> O(1/N)
  -> iterative method + multiple shift-and-invert + robustness
• The matrix factorization (A - σB) = LDL^T satisfies
  not-very-sparse (7%) <= nonzero density <= dense (50%)
  -> iterative method + multiple shift-and-invert + robustness + efficiency
• Ax = λBx is solved many times (possibly 1000's)
  -> iterative method + multiple shift-and-invert + robustness + efficiency + initial approximation of eigensolutions
Lanczos shift-and-invert method for Ax = λBx:
K_k(C, v) = span{ v, Cv, C^2 v, ..., C^{k-1} v }
Eigenvalues of (A, B) close to σ, and their eigenvectors, are recovered from the eigensolutions of the projected tridiagonal matrix T_k.

Lanczos shift-and-invert method for Ax = λBx:
• Cost:
  - one matrix factorization
  - many triangular matrix solves
• Gain:
  - fast convergence
  - clustered eigenvalues are transformed into well-separated eigenvalues
  - preferred in most practical cases
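For completeness, the spectral transformation behind the method (standard shift-and-invert for the generalized problem; not spelled out on the slide) is:

C = (A - \sigma B)^{-1} B, \qquad
Cx = \theta x \iff Ax = \lambda Bx, \qquad
\theta = \frac{1}{\lambda - \sigma}, \quad \lambda = \sigma + \frac{1}{\theta}

Eigenvalues λ near the shift σ map to extremal θ, which the Lanczos run on K_k(C, v) finds first; each application of C requires one solve with the LDL^T (or LU) factors of A - σB, which is exactly the "many triangular matrix solves" above and the step MUMPS supplies.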
Multiple Shift-and-Invert Parallel Eigenvalue Algorithm
Idea: distributed spectral slicing, i.e. compute eigensolutions in distributed subintervals.
(figure: the spectrum [λ_min, λ_max] is divided among Proc[0], Proc[1], Proc[2], each working near its own shift σ[0], σ[1], σ[2])

Software Structure — SIPs
Shift-and-Invert Parallel Spectral Transforms
Parallelize by spectrum intervals (multiple shifts):
– Balance parallel jobs
– Ensure global orthogonality of eigenvectors
– Manage matrix storage
– Build on existing packages for data and solvers
(software stack: SIPs on top of SLEPc and ARPACK, which use PETSc and MUMPS, all on top of MPI)
Software Structure
Shift-and-Invert Parallel Spectral Transforms (SIPs)
• Select shifts
• Bookkeep and validate eigensolutions
• Balance parallel jobs
• Ensure global orthogonality of eigenvectors
• Subgroups of communicators
(software stack: SIPs / ARPACK, SLEPc / PETSc, MUMPS / MPI)

Software Structure — Algebra Packages
• SLEPc: Scalable Library for Eigenvalue Problem Computations
  www.grycap.upv.es/slepc/
• ARPACK: ARnoldi PACKage
  www.caam.rice.edu/software/ARPACK/
• MUMPS: MUltifrontal Massively Parallel sparse direct Solver
  www.enseeiht.fr/lima/apo/MUMPS/
Distributed Spectral Slicing
• Assign intervals to processes
• Compute eigensolutions near the shifts (the hard part)
  – validate and tally (handle overlaps later)
  – pick new shifts
  – shrink the assigned spectrum (communication)
• Iterate
(figure: Processes 0, 1, 2 sweep their shifts σ[0], σ[1], σ[2] across [λ_min, λ_max])

Domain decomposition: "Frequency and Space"
• When a single process cannot store the replicated matrices:
  – use more processes and distribute the matrix storage
  – introduce sub-communicators
• Comb-like communication pattern (see the sketch below)
(figure: a 4x4 grid of process ranks spanning [λ_min, λ_max], illustrating the comb pattern)
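A hedged sketch of one way to set up such a comb with MPI sub-communicators (names like nspace, freq_comm, space_comm are illustrative, not taken from SIPs): the np = nfreq * nspace processes are viewed as a grid; each column shares one spectrum interval (one shift) and one distributed copy of the matrices, while each row links the processes that hold the same matrix piece across intervals.

#include <mpi.h>

int main(int argc, char **argv)
{
  int      rank, nspace = 4;      /* processes sharing one distributed matrix copy */
  MPI_Comm freq_comm, space_comm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* tooth of the comb: same color => same spectrum (frequency) interval */
  MPI_Comm_split(MPI_COMM_WORLD, rank / nspace, rank, &freq_comm);

  /* spine of the comb: connects matching matrix pieces across intervals */
  MPI_Comm_split(MPI_COMM_WORLD, rank % nspace, rank, &space_comm);

  /* ... factor (A - sigma*B) and run Lanczos within freq_comm;
         exchange interval boundaries and eigenvalue counts along space_comm ... */

  MPI_Comm_free(&freq_comm);
  MPI_Comm_free(&space_comm);
  MPI_Finalize();
  return 0;
}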