
HACC: Extreme Scaling and Performance Across Diverse Architectures



  1. HACC: Extreme Scaling and Performance Across Diverse Architectures
     HACC (Hardware/Hybrid Accelerated Cosmology Code) Framework
     Salman Habib (HEP and MCS Divisions), Vitali Morozov, Nicholas Frontiere, Hal Finkel, Adrian Pope, Katrin Heitmann, Kalyan Kumaran, Venkatram Vishwanath, Tom Peterka, Joe Insley -- Argonne National Laboratory
     David Daniel, Patricia Fasel -- Los Alamos National Laboratory
     Zarija Lukic -- Lawrence Berkeley National Laboratory
     Justin Luitjens -- NVIDIA
     George Zagaris -- Kitware
     [Title-slide imagery: simulation renderings labeled ‘100M on Mira’ and ‘100M on Titan’; program/survey logos: ASCR, HEP, DES, LSST]

  2. Motivating HPC: The Computational Ecosystem
     • Motivations for large HPC campaigns: 1) quantitative predictions, 2) scientific discovery, exposing mechanisms, 3) system-scale simulations (‘impossible experiments’), 4) inverse problems and optimization
     • Driven by a wide variety of data sources, computational cosmology must address ALL of the above
     • Role of scalability/performance: 1) very large simulations are necessary, but it is not just a matter of running a few large simulations, 2) high throughput is essential, 3) optimal design of simulation campaigns, 4) analysis pipelines and associated infrastructure

  3. Data ‘Overload’: Observations of Cosmic Structure
     • Cosmology = Physics + Statistics
     • Mapping the sky with large-area surveys (SPT, SDSS, BOSS, LSST) across multiple wave-bands, at remarkably low levels of statistical error
     [Figures: CMB temperature anisotropy -- theory meets observations; galaxies in a moon-sized patch (Deep Lens Survey), with LSST covering 50,000 times this size (~400 PB of data); the same signal in the galaxy distribution]

  4. Large Scale Structure: The Vlasov-Poisson Equation
     • Properties of the cosmological Vlasov-Poisson equation:
     • 6-D PDE with long-range interactions and no shielding; all scales matter; models gravity-only, collisionless evolution
     • Extreme dynamic range in space and mass (in many applications, a million to one, ‘everywhere’)
     • The Jeans instability drives structure formation at all scales, starting from smooth Gaussian random field initial conditions
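     For reference, a minimal sketch of the system the slide refers to, written in comoving coordinates with scale factor a; the exact scale-factor and momentum conventions may differ from those used in the talk:

       \frac{\partial f}{\partial t}
         + \frac{\mathbf{p}}{m a^{2}}\cdot\nabla_{\mathbf{x}} f
         - m\,\nabla_{\mathbf{x}}\phi\cdot\nabla_{\mathbf{p}} f = 0,
       \qquad
       \nabla_{\mathbf{x}}^{2}\phi = 4\pi G a^{2}\bigl[\rho(\mathbf{x},t)-\bar{\rho}(t)\bigr],
       \qquad
       \rho(\mathbf{x},t) = \frac{m}{a^{3}}\int \mathrm{d}^{3}p\, f(\mathbf{x},\mathbf{p},t).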

  5. Large Scale Structure Simulation Requirements
     • Force and mass resolution: galaxy halos are ~100 kpc across, so the force resolution has to be ~kpc; with Gpc box sizes this is a spatial dynamic range of a million to one; the ratio of the largest object mass to the lightest is ~10,000:1
     • Physics: gravity dominates at scales greater than ~Mpc; at small scales, galaxy modeling and semi-analytic methods are used to incorporate gas physics/feedback/star formation
     • Computing ‘boundary conditions’: total memory in the PB+ class; performance in the 10 PFlops+ class; wall-clock of ~days/week, with in situ analysis
     [Figure: the gravitational Jeans instability growing structure across scales from ~2 Mpc to ~1000 Mpc over time -- can the Universe be run as a short computational ‘experiment’?]
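     The quoted spatial dynamic range follows directly from these scales:

       \frac{L_{\text{box}}}{\Delta x_{\text{force}}}
         \sim \frac{1\ \text{Gpc}}{1\ \text{kpc}}
         = \frac{10^{9}\ \text{pc}}{10^{3}\ \text{pc}} = 10^{6}.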

  6. Combating Architectural Diversity with HACC
     • Architecture-independent performance/scalability: a ‘universal’ top layer plus ‘plug-in’ node-level components; minimize data-structure complexity and data motion (see the sketch after this list)
     • Programming model: ‘C++/MPI + X’, where X = OpenMP, Cell SDK, OpenCL, CUDA, ...
     • Algorithm co-design: multiple algorithm options; stresses accuracy, low memory overhead, and no external libraries in the simulation path
     • Analysis tools: major analysis framework; tools deployed in stand-alone and in situ modes
     [Figure: ratios of power spectra P(k) from different implementations, with the GPU P3M version as reference -- RCB TreePM on BG/Q, RCB TreePM on Hopper, Cell P3M, and Gadget-2 agree at the sub-percent level over 0.1 < k < 1 h/Mpc; machine icons: Roadrunner, Hopper, Mira/Sequoia, Titan, Edison]
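     To make the ‘universal top layer + plug-in node-level components’ split concrete, here is a minimal hypothetical sketch; the class and function names below are illustrative assumptions, not HACC's actual interfaces:

       // Hypothetical sketch: the top layer owns the MPI domain decomposition and
       // the long-range spectral PM solve; the short-range solver is a swappable
       // node-level component, with one backend per choice of "X" in C++/MPI + X.
       #include <mpi.h>
       #include <vector>

       struct Particles {                    // per-rank particle state (mixed precision)
         std::vector<float> x, y, z;         // positions
         std::vector<float> vx, vy, vz;      // velocities
       };

       class ShortRangeSolver {              // node-level "plug-in" interface
       public:
         virtual ~ShortRangeSolver() = default;
         virtual void accumulateForces(Particles& p) = 0;   // short-range force only
       };

       class GpuP3MSolver : public ShortRangeSolver {       // CUDA/OpenCL backend
       public:
         void accumulateForces(Particles& p) override { /* chaining-mesh P3M on the GPU */ }
       };

       class TreePMSolver : public ShortRangeSolver {       // CPU/BG/Q-style backend
       public:
         void accumulateForces(Particles& p) override { /* RCB tree walk with OpenMP */ }
       };

       // Architecture-independent top layer: identical on every machine.
       void longRangePMStep(Particles& p, MPI_Comm comm) { /* pencil-decomposed FFT solve */ }
       void refreshOverloadedParticles(Particles& p, MPI_Comm comm) { /* boundary refresh */ }

       void step(Particles& p, ShortRangeSolver& shortRange, MPI_Comm comm) {
         longRangePMStep(p, comm);           // long-range force: same code everywhere
         shortRange.accumulateForces(p);     // node-level piece: pick a backend per machine
         refreshOverloadedParticles(p, comm);
       }

     The point of such a split is that porting to a new machine means writing only a new short-range backend; the MPI top layer stays untouched.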

  7. Architectural Challenges
     Roadrunner: prototype for modern accelerated architectures
     Architectural ‘features’:
     • Complex heterogeneous nodes
     • Simpler cores, lower memory/core (will weak scaling continue?)
     • Skewed compute/communication balance
     • Programming models?
     • I/O? File systems?
     [Machine images: Roadrunner, Mira/Sequoia]

  8. Accelerated Systems: Specific Issues
     Imbalances and bottlenecks:
     • Memory is primarily host-side (32 GB vs. 6 GB, against Roadrunner's 16 GB vs. 16 GB) -- an important thing to think about (in the case of HACC, the grid/particle balance)
     • PCIe is a key bottleneck; overall interconnect bandwidth does not match the Flops (not even close)
     • There is no point in ‘sharing’ work between the CPU and the GPU; performance gains will be minimal -- the GPU must dominate
     • The only reason to write a code for such a system is if you can truly exploit its power (2 X CPU is a waste of effort!)
     Strategies for success:
     • It's (still) all about understanding and controlling data motion
     • Rethink your code and even your approach to the problem
     • Isolate hotspots, and design for portability around them (modular programming)
     • Like it or not, pragmas will never be the full answer

  9. ‘HACC In Pictures’
     • HACC top layer: 3-D domain decomposition with particle replication at boundaries (‘overloading’) for the spectral PM algorithm (long-range force)
     • HACC ‘nodal’ layer: short-range solvers employing a combination of a flexible chaining mesh and RCB tree-based force evaluations
     • GPU: two options, P3M vs. TreePM
     [Figures: RCB tree levels on the host side, with ~1 Mpc and ~50 Mpc scale markers; force-splitting plot comparing the Newtonian two-particle force, the noisy CIC PM force, and the 6th-order sinc-Gaussian spectrally filtered PM force]
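     Schematically, the force splitting shown in the plot is

       F_{\text{total}}(r) = F_{\text{long}}(r) + F_{\text{short}}(r),
       \qquad
       F_{\text{short}}(r) \approx F_{\text{Newton}}(r) - F_{\text{filtered PM}}(r),

     with the short-range piece falling to zero beyond the hand-over scale, so that only the spectrally filtered PM force remains at large separations.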

  10. HACC: Algorithmic Features
     • Fully spectral particle-mesh solver: 6th-order Green function, 4th-order Super-Lanczos derivatives, high-order spectral filtering, high-accuracy polynomial for the short-range forces
     • Custom parallel FFT: pencil-decomposed, high-performance FFT (up to 15K^3)
     • Particle overloading: particle replication at ‘node’ boundaries to reduce/delay communication (intermittent refreshes); important for accelerated systems
     • Flexible chaining mesh: used to optimize the tree and P3M methods
     • Optimal splitting of gravitational forces: spectral particle-mesh melded with direct and RCB (‘fat leaf’) tree force solvers (PPTPM); short hand-over scale (dynamic range splitting ~ 10,000 X 100); pseudo-particle method for multipole expansions
     • Mixed precision: optimizes memory and performance (GPU-friendly!)
     • Optimized force kernels: high performance without assembly
     • Adaptive symplectic time-stepping: symplectic sub-cycling of the short-range force timesteps; adaptivity from an automatic density estimate via the RCB tree (see the sketch after this list)
     • Custom parallel I/O: topology-aware parallel I/O with lossless compression (factor of 2); a 1.5 trillion particle checkpoint in 4 minutes at ~160 GB/sec on Mira
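     As an illustration of the sub-cycled symplectic time stepping, here is a schematic, purely hypothetical stepping loop; the helper functions stand in for the long-range spectral PM kick, the short-range kick/drift, and the RCB-tree density estimate:

       #include <vector>

       struct Particles { std::vector<float> x, y, z, vx, vy, vz; };  // as in the earlier sketch

       void longRangeKick(Particles& p, float dt);        // spectral PM force kick
       void shortRangeKickStream(Particles& p, float dt); // tree/P3M kick plus drift
       int  estimateSubcycles(const Particles& p);        // e.g. from the RCB-tree density

       // One long timestep: half long-range kicks bracket n short-range
       // sub-cycles, with n chosen adaptively from the local density estimate.
       void fullTimestep(Particles& p, float dtLong) {
         longRangeKick(p, 0.5f * dtLong);                 // half long-range kick
         const int n = estimateSubcycles(p);              // adaptive sub-cycle count
         const float dtShort = dtLong / n;
         for (int s = 0; s < n; ++s)
           shortRangeKickStream(p, dtShort);              // symplectic short-range sub-cycling
         longRangeKick(p, 0.5f * dtLong);                 // half long-range kick
       }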

  11. HACC on Titan: GPU Implementation (Schematic)
     P3M implementation (OpenCL):
     • Spatial data is pushed to the GPU in large blocks; the data is sub-partitioned into chaining-mesh cubes
     • Forces are computed between particles in a cube and its neighboring cubes
     • The natural parallelism and simplicity lead to high performance
     • Typical push size is ~2 GB; the large push size ensures that computation time exceeds the memory transfer latency by a large factor
     • More MPI tasks/node are preferred over a single threaded MPI task (better host code performance)
     New implementations (OpenCL and CUDA):
     • P3M with data pushed only once per long time-step, completely eliminating memory transfer latencies (orders of magnitude less); uses a ‘soft boundary’ chaining mesh rather than rebuilding it every sub-cycle
     • A TreePM analog of the BG/Q code written in CUDA also produces high performance
     [Diagram: chaining mesh; a block of 3 grid units pushed to the GPU]
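     A heavily simplified CUDA sketch of the cube-vs-neighboring-cubes pattern described above; the kernel name, the cell data layout (cellStart/cellCount arrays), and the short-range interaction are placeholder assumptions, not the actual HACC kernel:

       #include <cuda_runtime.h>

       // Placeholder short-range pair weight: in HACC this would be the Newtonian
       // force minus the spectrally filtered PM contribution (a fitted polynomial);
       // here just a softened inverse cube as a stand-in.
       __device__ float shortRangeWeight(float r2) {
         const float eps2 = 1.0e-4f;                      // softening (placeholder)
         float r = sqrtf(r2 + eps2);
         return 1.0f / (r * r * r);
       }

       // One thread per particle: loop over the particle's own chaining-mesh cube
       // and its 26 neighbors, accumulating the short-range force within rmax.
       __global__ void shortRangeForces(const float4* pos, float4* acc,
                                        const int* cellStart, const int* cellCount,
                                        int3 dims, float cellSize, float rmax2, int n) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i >= n) return;
         float4 pi = pos[i];
         int3 c = make_int3((int)(pi.x / cellSize), (int)(pi.y / cellSize), (int)(pi.z / cellSize));
         float3 f = make_float3(0.f, 0.f, 0.f);
         for (int dz = -1; dz <= 1; ++dz)                 // neighboring cubes
           for (int dy = -1; dy <= 1; ++dy)
             for (int dx = -1; dx <= 1; ++dx) {
               int3 nb = make_int3(c.x + dx, c.y + dy, c.z + dz);
               if (nb.x < 0 || nb.y < 0 || nb.z < 0 ||
                   nb.x >= dims.x || nb.y >= dims.y || nb.z >= dims.z) continue;
               int cell = (nb.z * dims.y + nb.y) * dims.x + nb.x;
               for (int k = 0; k < cellCount[cell]; ++k) {
                 int j = cellStart[cell] + k;
                 if (j == i) continue;
                 float dxp = pos[j].x - pi.x, dyp = pos[j].y - pi.y, dzp = pos[j].z - pi.z;
                 float r2 = dxp * dxp + dyp * dyp + dzp * dzp;
                 if (r2 > rmax2) continue;                // outside hand-over radius
                 float w = shortRangeWeight(r2);
                 f.x += w * dxp; f.y += w * dyp; f.z += w * dzp;
               }
             }
         acc[i] = make_float4(f.x, f.y, f.z, 0.f);
       }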

  12. HACC on Titan: GPU Implementation Performance
     • The P3M kernel runs at 1.6 TFlops/node, 40.3% of peak (73% of algorithmic peak)
     • The TreePM kernel was run on 77% of Titan at 20.54 PFlops, with almost identical performance on the card
     • Because of lower overhead, the P3M code is (currently) faster by a factor of two in time to solution
     • 99.2% parallel efficiency
     [Figure: time (nsec) per substep/particle vs. number of nodes -- initial strong scaling, initial weak scaling, improved weak scaling, TreePM weak scaling, and ideal scaling]
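     A quick consistency check on these numbers, assuming Titan's full complement of 18,688 compute nodes:

       0.77 \times 18688 \approx 14390\ \text{nodes},
       \qquad
       \frac{20.54\ \text{PFlops}}{14390\ \text{nodes}} \approx 1.43\ \text{TFlops/node},

     which is in line with the ~1.6 TFlops/node kernel rate once non-kernel overheads are included.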

  13. HACC Science
     Simulations with 6 orders of magnitude of dynamic range, exploiting all supercomputing architectures -- advancing science
     [Images: CMB SZ sky map; strong lensing; synthetic catalog; the Outer Rim simulation; large-scale structure; merger trees; scientific inference of cosmological parameters]
