

  1. SC13 GPU Technology Theater AmgX: Performance Acceleration for Large-Scale Iterative Methods Dr. Joe Eaton, NVIDIA

  2. Linear Solvers are Necessary: CFD, Energy, Physics, Nuclear Safety

  3. AmgX Overview
     Two forms of AMG: Classical AMG (as in HYPRE), strong convergence, scalar; un-smoothed aggregation AMG, lower setup times, handles block systems.
     Krylov methods: GMRES, CG, BiCGStab, preconditioned and 'flexible' variants.
     Classic iterative methods: Block-Jacobi, Gauss-Seidel, Chebyshev, ILU0, ILU1, with multi-colored versions for fine-grained parallelism.
     Flexible configuration: all methods usable as solvers, preconditioners, or smoothers; nesting is supported.
     Designed for non-linear problems: allows a frequently changing matrix, with parallel and efficient setup.

  4. Easy to Use
     No CUDA experience necessary to use the library.
     C API: links with C, C++, or Fortran. Small, focused API.
     Reads common matrix formats (CSR, COO, MM); a minimal CSR layout is sketched below.
     Single GPU or multi-GPU; interoperates easily with MPI, OpenMP, and hybrid parallel applications.
     Tuned for K20 and K40; supports Fermi and newer. Single and double precision. Supported on Linux and Win64.
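
     The CSR (compressed sparse row) input format mentioned above can be illustrated with a small, self-contained C sketch. The struct and the 3x3 example matrix are purely illustrative; they are not AmgX types or API.

       #include <stdio.h>

       /* Illustrative CSR container (not an AmgX type): a sparse matrix is
          stored as row offsets, column indices, and nonzero values. */
       typedef struct {
           int     n;        /* number of rows                  */
           int     nnz;      /* number of stored nonzeros       */
           int    *row_ptr;  /* size n+1: start of each row     */
           int    *col_idx;  /* size nnz: column of each value  */
           double *val;      /* size nnz: nonzero values        */
       } csr_matrix;

       int main(void)
       {
           /* 3x3 example:  [ 4 -1  0 ]
                            [-1  4 -1 ]
                            [ 0 -1  4 ]                          */
           int    row_ptr[] = {0, 2, 5, 7};
           int    col_idx[] = {0, 1,  0, 1, 2,  1, 2};
           double val[]     = {4.0, -1.0,  -1.0, 4.0, -1.0,  -1.0, 4.0};
           csr_matrix A = {3, 7, row_ptr, col_idx, val};

           /* Print the matrix row by row to show how the arrays are traversed. */
           for (int i = 0; i < A.n; ++i)
               for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
                   printf("A(%d,%d) = %g\n", i, A.col_idx[k], A.val[k]);
           return 0;
       }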

  5. Minimal Example With Config

     C API calls:

       // One header
       #include "amgx_c.h"
       // Read config file
       AMGX_create_config(&cfg, cfgfile);
       // Create resources based on config
       AMGX_resources_create_simple(&res, cfg);
       // Create solver object, A, x, b, set precision
       AMGX_solver_create(&solver, res, mode, cfg);
       AMGX_matrix_create(&A, res, mode);
       AMGX_vector_create(&x, res, mode);
       AMGX_vector_create(&b, res, mode);
       // Read coefficients from a file
       AMGX_read_system(&A, &x, &b, matrixfile);
       // Setup and Solve
       AMGX_solver_setup(solver, A);
       AMGX_solver_solve(solver, b, x);

     Config file:

       solver(main)=FGMRES
       main:max_iters=100
       main:convergence=RELATIVE_INI
       main:tolerance=0.1
       main:preconditioner(amg)=AMG
       amg:algorithm=AGGREGATION
       amg:selector=SIZE_8
       amg:cycle=V
       amg:max_iters=1
       amg:max_levels=10
       amg:smoother(amg_smoother)=BLOCK_JACOBI
       amg:relaxation_factor=0.75
       amg:presweeps=1
       amg:postsweeps=2
       amg:coarsest_sweeps=4
       determinism_flag=1

  6. Drop-In Acceleration on Real Apps
     "Using AmgX has allowed us to exploit the power of the GPU while freeing up development time to concentrate on reservoir simulation" (Garf Bowen, RidgewayKiteSoftware)
     Case: 400k-cell adaptive-implicit black-oil model, AMG pressure solver, 10-time-step benchmark.
     Total time (s), lower is better: in-house solver on one core 1150; in-house GPU solver 197; AmgX 98.

  7. Industrial Strength, Robust
     Designed to be used in commercial, academic, and research applications. AmgX has run on clusters with 100s of nodes. Industrial problems with more than 440 million unknowns have been solved successfully using 48 GPUs.

  8. ANSYS Fluent 15.0: 111 M Cell Aerodynamic Problem
     Truck body model: 111 M mixed cells, external aerodynamics, steady, k-epsilon turbulence, double-precision solver, 444 million unknowns.
     Hardware: CPU Sandy Bridge (E5-2667), 12 cores per node; GPU Tesla K40m, 4 per node.
     Comparison, lower is better: 144 CPU cores (AMG) vs. 144 CPU cores + 48 GPUs (AmgX).
     Fluent solution time per iteration: 36 s vs. 18 s (2x).
     Solver time per iteration: 29 s vs. 11 s (2.7x).

  9. Integrating AmgX into your Application

  10. Integrates Easily with MPI and OpenMP
      Adding GPU support to existing applications raises new issues:
      - What is the proper ratio of CPU cores to GPUs?
      - How can multiple CPU cores (MPI ranks) share a single GPU?
      - How does MPI switch between two sets of 'ranks', one set for CPUs and one set for GPUs?
      AmgX handles this via consolidation: multiple smaller sub-matrices are consolidated into a single matrix, handled automatically during the PCIe data copy. (A sketch of a rank-to-GPU mapping follows this slide.)
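
      As a rough illustration of how several MPI ranks on one node end up sharing the GPUs (the situation consolidation addresses), the sketch below maps each rank to a device with standard MPI and CUDA runtime calls. It is not AmgX code, and the round-robin mapping policy is only an assumption for illustration.

        #include <mpi.h>
        #include <cuda_runtime.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            /* Group the ranks that share a node into a "local" communicator. */
            MPI_Comm node_comm;
            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                                MPI_INFO_NULL, &node_comm);

            int local_rank;
            MPI_Comm_rank(node_comm, &local_rank);

            int num_gpus = 0;
            cudaGetDeviceCount(&num_gpus);

            /* Round-robin: several local ranks may map to the same device,
               which is the many-ranks-per-GPU case consolidation handles. */
            int device = (num_gpus > 0) ? local_rank % num_gpus : -1;
            if (device >= 0)
                cudaSetDevice(device);

            printf("local rank %d -> GPU %d of %d\n", local_rank, device, num_gpus);

            MPI_Comm_free(&node_comm);
            MPI_Finalize();
            return 0;
        }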

  11. (Diagram) Original problem with unknowns u1..u7, partitioned onto 2 MPI ranks (Rank 0 and Rank 1) with a boundary exchange between them, then consolidated over PCIe onto a single GPU as one matrix.

  12. Consolidation Examples (diagram)
      Arbitrary cluster: 4 nodes x [2 CPUs + 3 GPUs], connected by InfiniBand.
      Mappings shown: 1 CPU socket <=> 1 GPU; dual-socket CPU <=> 2 GPUs; dual-socket CPU <=> 4 GPUs.

  13. Benefits of Consolidation
      Add more GPUs or CPU cores without changing existing code. Run 1 MPI rank per CPU core? Fine, keep it that way: consolidation allows a flexible ratio of CPU cores per GPU. Consolidation has benefits similar to "replication" strategies in multigrid: consolidating many small coarse-grid problems reduces network communication.

  14. Whole-App Performance
      Typical application profile (flow diagram): advance in time -> compute physics -> solve non-linear PDE -> linearize -> inner loop: solve linear system, repeated until converged -> next time step.
      The linear system solve is typically 50-90% of the run time: accelerate this first. (A generic sketch of this loop structure follows this slide.)
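
      To make the profile above concrete, here is a hedged, generic C sketch of the loop structure the slide describes. The function names (compute_physics, linearize, solve_linear_system, converged) are placeholders, not AmgX or application API.

        #include <stdio.h>

        /* Placeholder stages (illustrative stubs, not a real application). */
        static void compute_physics(void)     { /* properties, sources, BCs */ }
        static void linearize(void)           { /* assemble A and b          */ }
        static void solve_linear_system(void) { /* Ax = b: the 50-90% hot spot */ }
        static int  converged(void)           { return 1; /* pretend one pass suffices */ }

        /* Generic implicit time-stepping skeleton matching the slide's flow diagram. */
        int main(void)
        {
            const int num_steps = 3;
            for (int step = 0; step < num_steps; ++step) {
                compute_physics();
                /* Non-linear inner loop: linearize, then solve the linear system.
                   The linear solve dominates run time, so it is the first thing
                   to accelerate (with AmgX on the GPU). */
                do {
                    linearize();
                    solve_linear_system();
                } while (!converged());
                printf("advanced to time step %d\n", step + 1);
            }
            return 0;
        }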

  15. Amdahl's Law
      Typical performance example: the simulation spends a 70% fraction in the linear solver, and AmgX provides a 3x solver speedup.
      Best possible result from accelerating only the solver: 1/(1 - 0.7) = 10/3 = 3.33x.
      Achieved with a 3x solver speedup: 1.87x application speedup (bar chart: total time 100 before AmgX, 53.33 after; lower is better).

  16. Drop-in Acceleration
      Solving the linear system is expensive; moving the data to the GPU is relatively cheap (1-15% of the time).
      Flow diagram: advance in time -> compute physics -> solve non-linear PDE -> linearize -> move data to/from the GPU and solve the linear system with AmgX -> until converged -> next time step. (A hedged sketch of the host-to-device copy follows this slide.)
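
      The "move data to/from GPU" step above amounts to copying the CSR arrays and vectors over PCIe. The sketch below shows the raw CUDA runtime calls for such a copy, assuming a tiny scalar (1x1 block) system; it is illustrative only and does not use the AmgX API.

        #include <cuda_runtime.h>
        #include <stdio.h>

        int main(void)
        {
            /* Tiny 2x2 system in CSR form on the host (illustrative values). */
            int    h_row_ptr[] = {0, 2, 4};
            int    h_col_idx[] = {0, 1, 0, 1};
            double h_val[]     = {4.0, -1.0, -1.0, 4.0};
            double h_b[]       = {1.0, 1.0};
            double h_x[]       = {0.0, 0.0};
            int n = 2, nnz = 4;

            /* Allocate device buffers and copy matrix and vectors over PCIe. */
            int *d_row_ptr, *d_col_idx;
            double *d_val, *d_b, *d_x;
            cudaMalloc((void **)&d_row_ptr, (n + 1) * sizeof(int));
            cudaMalloc((void **)&d_col_idx, nnz * sizeof(int));
            cudaMalloc((void **)&d_val,     nnz * sizeof(double));
            cudaMalloc((void **)&d_b,       n * sizeof(double));
            cudaMalloc((void **)&d_x,       n * sizeof(double));
            cudaMemcpy(d_row_ptr, h_row_ptr, (n + 1) * sizeof(int), cudaMemcpyHostToDevice);
            cudaMemcpy(d_col_idx, h_col_idx, nnz * sizeof(int),     cudaMemcpyHostToDevice);
            cudaMemcpy(d_val,     h_val,     nnz * sizeof(double),  cudaMemcpyHostToDevice);
            cudaMemcpy(d_b,       h_b,       n * sizeof(double),    cudaMemcpyHostToDevice);
            cudaMemcpy(d_x,       h_x,       n * sizeof(double),    cudaMemcpyHostToDevice);

            /* ... the GPU solver would run here ... */

            /* Copy the solution vector back to the host. */
            cudaMemcpy(h_x, d_x, n * sizeof(double), cudaMemcpyDeviceToHost);
            printf("x = [%g, %g]\n", h_x[0], h_x[1]);

            cudaFree(d_row_ptr); cudaFree(d_col_idx); cudaFree(d_val);
            cudaFree(d_b); cudaFree(d_x);
            return 0;
        }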

  17. What Comes Next?
      The AmgX device-pointer API allows the data to start on the GPU; it works with consolidation and requires no change to the AmgX calls.
      Expand the region of acceleration: CUDA matrix assembly on the device (see the assembly sketch after this slide).
      Flow diagram: advance in time -> compute physics -> solve non-linear PDE -> CUDA matrix assembly / linearize -> solve linear system with AmgX -> until converged -> next time step.
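
      To illustrate assembling a matrix directly in device memory (so a device-pointer API could consume it without an extra PCIe copy), here is a hedged CUDA sketch that builds a 1D Laplacian-style tridiagonal matrix in CSR form on the GPU. The kernel and layout are assumptions for illustration, not AmgX code.

        #include <cuda_runtime.h>
        #include <stdio.h>

        /* Assemble a tridiagonal matrix in device memory, one thread per row.
           Row i has stencil [-1 2 -1] clipped at the boundaries, so interior
           rows hold 3 nonzeros and the two boundary rows hold 2. */
        __global__ void assemble_laplacian_csr(int n, int *row_ptr, int *col_idx, double *val)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i > n) return;

            /* Row offsets follow directly from the per-row nonzero counts. */
            row_ptr[i] = (i == 0) ? 0 : (i == n ? 3 * n - 2 : 3 * i - 1);
            if (i == n) return;

            int k = (i == 0) ? 0 : 3 * i - 1;   /* start of row i */
            if (i > 0)     { col_idx[k] = i - 1; val[k] = -1.0; ++k; }
                             col_idx[k] = i;     val[k] =  2.0; ++k;
            if (i < n - 1) { col_idx[k] = i + 1; val[k] = -1.0; }
        }

        int main(void)
        {
            const int n = 8;
            const int nnz = 3 * n - 2;

            int *d_row_ptr, *d_col_idx;
            double *d_val;
            cudaMalloc((void **)&d_row_ptr, (n + 1) * sizeof(int));
            cudaMalloc((void **)&d_col_idx, nnz * sizeof(int));
            cudaMalloc((void **)&d_val,     nnz * sizeof(double));

            /* One thread per row, plus one extra thread to write row_ptr[n]. */
            assemble_laplacian_csr<<<1, 64>>>(n, d_row_ptr, d_col_idx, d_val);
            cudaDeviceSynchronize();

            /* The three device pointers now describe the matrix entirely in GPU
               memory; a device-pointer solver API could consume them directly.
               We copy one value back only to verify the assembly. */
            int h_last = 0;
            cudaMemcpy(&h_last, d_row_ptr + n, sizeof(int), cudaMemcpyDeviceToHost);
            printf("row_ptr[n] = %d (expected %d)\n", h_last, nnz);

            cudaFree(d_row_ptr); cudaFree(d_col_idx); cudaFree(d_val);
            return 0;
        }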

  18. Amdahl's Law Revisited
      3x linear-solver speedup on a 70% fraction, plus 4x matrix-assembly speedup on a 25% fraction.
      The potential speedup is dramatically higher: 1/(1 - 0.95) = 100/5 = 20x.
      Achieved with 3x on the solver and 4x on assembly: 2.89x application speedup (bar chart: total time 100 before AmgX, 53.33 after AmgX, 34.58 after GPU assembly as well; lower is better). A worked calculation follows this slide.
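
      The figures on slides 15 and 18 follow directly from Amdahl's law. The short C program below is an illustrative check (the function and variable names are mine, not from the talk) that reproduces, up to rounding, the 53.33, 34.58, 1.87x, 2.89x, 3.33x, and 20x numbers.

        #include <stdio.h>

        /* Amdahl's law: remaining time (per 100 units) when a fraction f of the
           work is accelerated by a factor s and the rest is left untouched. */
        static double accelerated_time(double f, double s)
        {
            return 100.0 * ((1.0 - f) + f / s);
        }

        int main(void)
        {
            /* Slide 15: 70% of the time in the linear solver, 3x solver speedup. */
            double t_solver = accelerated_time(0.70, 3.0);
            printf("solver only:     time %.2f  speedup %.3fx  limit %.2fx\n",
                   t_solver, 100.0 / t_solver, 1.0 / (1.0 - 0.70));
            /* prints 53.33, 1.875x (slide rounds to 1.87x), 3.33x */

            /* Slide 18: additionally 25% in matrix assembly, accelerated 4x,
               leaving only 5% of the original work untouched. */
            double t_both = 100.0 * (0.05 + 0.70 / 3.0 + 0.25 / 4.0);
            printf("solver+assembly: time %.2f  speedup %.2fx  limit %.2fx\n",
                   t_both, 100.0 / t_both, 1.0 / (1.0 - 0.95));
            /* prints 34.58, 2.89x, 20.00x */

            return 0;
        }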

  19. AmgX: High Performance on Modern Algorithms
      Interested in time-to-solution. Comparisons are always against state-of-the-art algorithms and implementations; the CPU codes are mature and well tuned.

  20. GPU Acceleration of Coupled Solvers: Preview of ANSYS Fluent 15.0 Performance
      Sedan geometry: 3.6 M mixed cells, steady, turbulent, external aerodynamics, coupled PBNS, double precision; AMG F-cycle on the CPU, AMG V-cycle on the GPU.
      ANSYS Fluent time for the total solution (s), lower is better: segregated solver on CPU 7070; coupled solver on CPU 5883; coupled solver on CPU+GPU 3180 (2.2x vs. the segregated CPU run, 1.9x vs. the coupled CPU run).

  21. Florida Matrix Collection
      AmgX Classical on a K40 vs. HYPRE on a Xeon E5-2670 @ 2.60 GHz (8 cores, 128 GB memory); total time to solution, higher speedup is better.
      Speedups across the matrices tested: 11.85, 8.06, 7.14, 6.82, 6.40, 2.88, 2.86, 2.63, 2.54, 2.36, 1.83, 1.41.

  22. miniFE Benchmark vs. HYPRE: Single Node, 1 CPU Socket and 1 GPU, Total Time
      All runs solved to machine tolerance. Hardware: Xeon E5-2670 @ 2.60 GHz, 8 cores, 128 GB memory, 1x K40.
      (Plot) Time (s) vs. number of unknowns (up to ~7 million) for 1GPU-Agg, 1GPU-Classical, and HYPRE on 8 cores; lower is better.
      miniFE is a "mini app" from Sandia that performs assembly and solution of a finite element mesh, typical of DOE codes.

  23. miniFE Benchmark vs. HYPRE: Dual-Socket CPU and 2 GPUs, Total Time
      All runs solved to machine tolerance. Hardware: dual-socket Xeon E5-2670 @ 2.60 GHz (8 cores each), 128 GB memory, 2x K40.
      (Plot) Time (s) vs. number of unknowns (up to ~7 million) for 2GPU-Agg and HYPRE on 16 cores; lower is better.

  24. miniFE Benchmark vs. HYPRE: Single-Socket CPU and 1 GPU, Solve Only (Setup Reused)
      All runs solved to machine tolerance. Hardware: Xeon E5-2670 @ 2.60 GHz (2x 8 cores), 128 GB memory, 1x K40.
      (Plot) Time (s) vs. number of unknowns (up to ~7 million) for 1GPU-Agg, 1GPU-Classical, and HYPRE on 8 cores; lower is better.

  25. AmgX
      Fast, scalable linear solvers with an emphasis on iterative methods. A flexible, GPU-accelerated toolkit for solving Ax = b. Solve your problems faster with minimal disruption. More than just fast solvers: AmgX helps you accelerate all of your code, and is a first step toward moving complex applications to the GPU.
      Public beta launching now: http://developer.nvidia.com/amgx
