
Total Work-Flow: Exploiting Hybrid Computing Architectures for Scientific Computing



  1. Total Work-Flow: Exploiting Hybrid Computing Architectures for Scientific Computing ScicomP 15 Ben Bergen Computational Physics (CCS-2) Los Alamos National Laboratory Brian Albright (X-1), Kevin Bowers (D.E. Shaw), Lin Yin (X-1), William Daughton (X-1) LA-UR 09-02032 Operated by Los Alamos National Security, LLC for DOE/NNSA

  2. Overview
     - Roadrunner System Overview
     - Basic Considerations and Programming Models
     - Adapting VPIC Kinetic Plasma Code to Roadrunner
     - Optimizing Total Workflow
     - Open Science on Roadrunner

  3. Roadrunner is a Cluster

  4. Roadrunner is a Cluster of Clusters

  5. Roadrunner is a Cluster of Clusters with Accelerators

  6. Triblade Compute Node

  7. Original Blade Topology
     - One-to-one affinity between Opteron core and Cell processor
     - Newer versions of DaCS support two-to-one affinity
     - Not sure about four-to-one???

  8. Roadrunner: Basic Considerations for Adaptation
     Roadrunner has three different architectures
     - First hybrid supercomputer of the current generation, incorporating x86_64, PowerPC, and SPU ISAs
     - Codes require three executables
       - x86_64 executable runs on the Opteron host processor
       - PowerPC executable runs on the Power Processing Element (PPE) accelerator processor
       - SPU threads run on the eight Synergistic Processing Element (SPE) special-purpose vector unit processors
     - Three compilers: gcc, ppu-gcc, spu-gcc (also XL C/C++)
     - Design considerations: process launch and SPE synchronization
     [Diagram labels: Opteron, PowerPC]

  9. Roadrunner: Basic Considerations for Adaptation
     Roadrunner has three different address spaces
     - Incorporates main memory on the Opteron and Cell eDP blades plus the user-controlled local store SRAM on the SPEs
     - Codes that run on Roadrunner must handle communication between these memory spaces
       - Distributed memory communication between Opteron hosts
       - Point-to-point communication between Opteron host and Cell accelerator
       - Direct Memory Access (DMA) communication between Cell main memory and SPE local store memory
     - Opteron and Cell have different endianness
       - Some byte-swapping is necessary
     - Cell blades are diskless
     - Design considerations: communication and I/O
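The byte-swapping mentioned on this slide can be sketched as follows. This is a minimal illustration, not code from VPIC: the function names `swap_u32` and `swap_float_buffer` are hypothetical, and it assumes single-precision payloads (the Opteron is little-endian, the Cell PPE is big-endian). Which side of the link performs the swap is a design choice the slide does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of one 32-bit word. */
uint32_t swap_u32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}

/* Byte-swap a buffer of single-precision floats in place.
 * The bits are moved with memcpy so that a swapped pattern is never
 * loaded as a floating-point value. */
void swap_float_buffer(float *buf, size_t n)
{
    assert(sizeof(float) == sizeof(uint32_t));
    for (size_t i = 0; i < n; ++i) {
        uint32_t bits;
        memcpy(&bits, &buf[i], sizeof bits);
        bits = swap_u32(bits);
        memcpy(&buf[i], &bits, sizeof bits);
    }
}
```

Applying the swap once on either the sending or the receiving side is enough; applying it twice restores the original buffer.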

  10. Roadrunner: Basic Considerations for Adaptation
      Multiple tools and programming models
      - Process launch and synchronization
        - MPI, DaCS/ALF, libSPE2
      - Communication
        - MPI, DaCS/ALF, libSPE2
      Hierarchical/heterogeneous advantages
      - Fault tolerance
        - Faults can be caught at multiple levels
      - Scalability
        - Strong scalability is possible on SPEs
        - Weak scalability through distributed memory
      [Diagram labels: MPI, DaCS, libSPE2]

  11. Programming Models: Host-Centric (Function Offload)
      Pros
      - Allows staged development
      - Existing MPI codes will run on Opterons
      - Synchronous or asynchronous function offload to accelerator
      - Minimizes reliance on PPE (poor performer!)
      Cons
      - Potential data-movement bottleneck
      - Offload cost must be amortized by work done on accelerator
      [Diagram labels: Opteron, Cell]

  12. Programming Models: Accelerator-Centric
      Pros
      - Also allows staged development
      - Existing MPI codes will run on PowerPC (PPE)
      - Hides complexity of hybrid architecture
      - Avoids data-movement bottleneck
      Cons
      - Heavier reliance on PPE
      - Computationally intensive portions of code must run on SPEs
      - Requires “relay” to forward message traffic
      [Diagram labels: Opteron, Cell]

  13. Message Passing Relay
      - Direct point-to-point communication is not possible between Cells
      [Diagram labels: Opteron, Opteron, Cell, Cell]

  14. Message Passing Relay
      - Relay forwards messages through hosts to peer Cell
      [Diagram labels: Opteron, Opteron, Cell, Cell, Data, Data]

  15. Programming Models: All Roads Lead Everywhere
      There is a natural evolution of both of these approaches into a fully hybrid computing model
      - Initial difference is in the program locus, or control process
      - In the “evolved” model, the host process runs a task queue
      - Tasks may be offloaded to other host-type cores or to accelerators
      - Task data may live in the worker’s memory to avoid data-movement bottlenecks
      - More on how we can use this to follow!
      [Diagram labels: Opteron Scheduler, Opteron Core, Cell, Cell/GPU]

  16. Particle-In-Cell (PIC) Methods Simulate Plasma Physics
      [Figure captions: VPIC modeling of a single laser speckle; LLNL pF3D modeling of a laser beam; Integrated LLNL Hydra modeling of ICF experiment]
      - One application of VPIC is to simulate Laser Plasma Interactions (LPI) critical to understanding Inertial Confinement Fusion (ICF) at the National Ignition Facility (NIF)
      - Several difficulties arise during the compression of hohlraum capsules
        - Laser scattering – not enough energy to compress capsule
        - Laser scattering – laser does not target desired areas (asymmetric compression)
        - Pre-heating – electrons heat plasma, making compression more difficult

  17. Particle-In-Cell Method
      [Diagram: time iteration over a spatial domain of grids and particles, with four phases: Interpolate Field Effects, Advance Particles, Accumulate Currents, Update Fields]
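The four phases named on this slide (interpolate, advance, accumulate, update) can be sketched as one time step of a toy loop. This is a deliberately simplified illustration under stated assumptions, not VPIC's scheme: it is 1D, electrostatic, non-relativistic, periodic, uses nearest-grid-point weighting, and works in normalized units where the field update is dE/dt = -J. All type and function names (`particle1d_t`, `grid1d_t`, `cell_of`, `pic_step`) are hypothetical.

```c
#include <assert.h>

#define NG 8                              /* number of grid cells (hypothetical) */

typedef struct { double x, v, q, m; } particle1d_t;

typedef struct {
    double E[NG];                         /* electric field per cell */
    double J[NG];                         /* current density per cell */
    double dx, dt;                        /* cell size and time step */
} grid1d_t;

/* Nearest-grid-point lookup; x must already lie in [0, NG*dx). */
int cell_of(const grid1d_t *g, double x)
{
    return (int)(x / g->dx) % NG;
}

/* One time iteration over all particles and all cells. */
void pic_step(grid1d_t *g, particle1d_t *p, int np)
{
    const double length = NG * g->dx;
    assert(g->dx > 0.0 && g->dt > 0.0);

    for (int c = 0; c < NG; ++c)
        g->J[c] = 0.0;

    for (int i = 0; i < np; ++i) {
        /* 1. Interpolate field effects to the particle. */
        double e = g->E[cell_of(g, p[i].x)];
        /* 2. Advance the particle (non-relativistic push). */
        p[i].v += (p[i].q / p[i].m) * e * g->dt;
        p[i].x += p[i].v * g->dt;
        while (p[i].x < 0.0)     p[i].x += length;   /* periodic wrap */
        while (p[i].x >= length) p[i].x -= length;
        /* 3. Accumulate the particle's current onto the grid. */
        g->J[cell_of(g, p[i].x)] += p[i].q * p[i].v / g->dx;
    }

    /* 4. Update fields: dE/dt = -J in normalized units. */
    for (int c = 0; c < NG; ++c)
        g->E[c] -= g->dt * g->J[c];
}
```

Phases 1 to 3 touch each particle exactly once, which is the single-pass structure a production code would want to preserve; unlike VPIC, this toy deposition is not charge conserving.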

  18. VPIC – Vector Particle-In-Cell
      - 3D, fully relativistic, electromagnetic Particle-In-Cell (PIC) code
        - Self-consistent evolution of a kinetic plasma
        - Charge conserving (no implicit solve)
      - Optimized for data motion
        - Single precision – half the memory bandwidth/double the theoretical peak
        - Single-pass particle processing
        - Field interpolation coefficients are pre-computed
      - Optimized for modern architectures
        - Uses short-vector SIMD intrinsics (SSE, Altivec, SPU)
        - Assumes that particles do not leave the voxel in which they started
          - Exceptions are handled separately
      - O(N) particle sorting
        - Improves spatial locality of particle data access
        - Improves temporal locality of field data access
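One common way to get an O(N) particle sort is a counting sort keyed on the voxel index. The slide does not say which algorithm VPIC uses, so the following is a sketch of that general technique only; the particle layout `particle_t` and the function name `sort_particles_by_voxel` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    float x, y, z;                        /* position (hypothetical layout) */
    int   voxel;                          /* index of the containing voxel */
} particle_t;

/* Stable counting sort of particles by voxel index.
 * Runs in O(n + n_voxels) time and writes the result to `out`.
 * Returns 0 on success, -1 if the scratch allocation fails. */
int sort_particles_by_voxel(const particle_t *in, particle_t *out,
                            size_t n, int n_voxels)
{
    assert(n_voxels > 0);
    size_t *start = calloc((size_t)n_voxels + 1, sizeof *start);
    if (!start)
        return -1;

    /* Pass 1: histogram of particles per voxel, shifted by one slot. */
    for (size_t i = 0; i < n; ++i) {
        assert(in[i].voxel >= 0 && in[i].voxel < n_voxels);
        start[in[i].voxel + 1]++;
    }

    /* Prefix sum: start[v] becomes the output index of voxel v's first particle. */
    for (int v = 0; v < n_voxels; ++v)
        start[v + 1] += start[v];

    /* Pass 2: scatter each particle to the next free slot of its voxel. */
    for (size_t i = 0; i < n; ++i)
        out[start[in[i].voxel]++] = in[i];

    free(start);
    return 0;
}
```

After the sort, particles in the same voxel are contiguous in memory, so consecutive particles reuse the same field data, which is the locality benefit the slide describes.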

  19. Porting to Roadrunner (things that we did)
      - Message Passing Relay (MP Relay)
        - Flattens communication topology
        - Allows logical point-to-point communication between Cell processors
        - Abstracts remote I/O layer for restart and visualization dumps
      - Pipelined execution
        - Code restructured for data-parallel thread execution
        - Current support for serial, pthreads, and SPE threads
        - Simple, common interface: init(), finalize(), execute(function_t), sync()
      - Particle data structures
        - Optimized for efficient communication via DMA requests
        - Can be tuned to cache size on traditional cached-memory architectures (padding)
      - Voxel cache (access to field data)
        - Fully associative, least recently used (LRU) policy
        - Simple interface: voxel_cache_fetch() and voxel_cache_wait()
      - Text overlay support
        - Allows acceleration of field advance, particle sorting and accumulators
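The fully associative LRU policy named for the voxel cache can be illustrated with a small software cache. This sketch is not the VPIC implementation and does not reproduce the signatures of `voxel_cache_fetch()` or `voxel_cache_wait()`, which the slide does not give. The names `toy_cache_t`, `toy_cache_init`, and `toy_cache_fetch`, the four-line capacity, and the payload layout are all hypothetical, and a synchronous struct copy stands in for the asynchronous DMA transfer an SPE code would issue.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINES 4                     /* hypothetical capacity */

typedef struct { float ex, ey, ez; } voxel_field_t;   /* hypothetical payload */

typedef struct {
    int                  tag[CACHE_LINES];        /* cached voxel index, -1 if empty */
    uint64_t             last_used[CACHE_LINES];  /* logical time of last access */
    voxel_field_t        data[CACHE_LINES];
    uint64_t             clock;
    unsigned             misses;
    const voxel_field_t *backing;                 /* stands in for main memory */
} toy_cache_t;

void toy_cache_init(toy_cache_t *c, const voxel_field_t *backing)
{
    for (int i = 0; i < CACHE_LINES; ++i) {
        c->tag[i] = -1;
        c->last_used[i] = 0;
    }
    c->clock = 0;
    c->misses = 0;
    c->backing = backing;
}

/* Return the cached copy of `voxel`. Any line may hold any voxel
 * (fully associative); on a miss the least recently used line is evicted. */
const voxel_field_t *toy_cache_fetch(toy_cache_t *c, int voxel)
{
    assert(voxel >= 0);
    int victim = 0;
    for (int i = 0; i < CACHE_LINES; ++i) {
        if (c->tag[i] == voxel) {                 /* hit: refresh its age */
            c->last_used[i] = ++c->clock;
            return &c->data[i];
        }
        if (c->last_used[i] < c->last_used[victim])
            victim = i;                           /* oldest line seen so far */
    }
    c->misses++;
    c->data[victim] = c->backing[voxel];          /* a DMA get in a real SPE code */
    c->tag[victim] = voxel;
    c->last_used[victim] = ++c->clock;
    return &c->data[victim];
}
```

Splitting the operation into a fetch and a separate wait, as the slide's interface suggests, would let the transfer overlap with computation; this synchronous sketch omits that.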

  20. Pipeline Abstraction
      [Diagram: master thread timeline: init, execute, sync, finalize]
      - Worker threads block for execute message to reduce thread creation overhead
        - pthreads implementation uses condition variables
        - SPE implementation uses mailboxes
      - SPE symbols are exposed to the PPE through _SPUEAR_ linker magic
        - Function call is implemented through mailbox message
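The pthreads variant of this pattern, where persistent workers block on a condition variable until the master posts an execute message, might look like the sketch below. It mirrors the init/execute/sync/finalize sequence from the slide, but the `pipeline_` names, the signatures, the fixed worker count, and the example kernel `mark_done` are assumptions made for illustration. The SPE mailbox variant is not shown.

```c
#include <assert.h>
#include <pthread.h>

#define N_WORKERS 4                        /* hypothetical worker count */

typedef void (*pipeline_fn)(int worker_id, void *arg);

static struct {
    pthread_t       thread[N_WORKERS];
    int             id[N_WORKERS];
    pthread_mutex_t lock;
    pthread_cond_t  work_ready;            /* master -> workers */
    pthread_cond_t  work_done;             /* workers -> master */
    pipeline_fn     fn;
    void           *arg;
    unsigned long   generation;            /* bumped once per execute */
    int             pending;               /* workers still busy */
    int             shutdown;
} pl;

static void *worker_main(void *p)
{
    int id = *(int *)p;
    unsigned long seen = 0;
    pthread_mutex_lock(&pl.lock);
    for (;;) {
        /* Block until the master posts new work or asks for shutdown. */
        while (pl.generation == seen && !pl.shutdown)
            pthread_cond_wait(&pl.work_ready, &pl.lock);
        if (pl.shutdown)
            break;
        seen = pl.generation;
        pipeline_fn fn = pl.fn;
        void *arg = pl.arg;
        pthread_mutex_unlock(&pl.lock);
        fn(id, arg);                       /* run the kernel outside the lock */
        pthread_mutex_lock(&pl.lock);
        if (--pl.pending == 0)
            pthread_cond_signal(&pl.work_done);
    }
    pthread_mutex_unlock(&pl.lock);
    return NULL;
}

void pipeline_init(void)                   /* threads are created only once */
{
    pthread_mutex_init(&pl.lock, NULL);
    pthread_cond_init(&pl.work_ready, NULL);
    pthread_cond_init(&pl.work_done, NULL);
    pl.generation = 0;
    pl.pending = 0;
    pl.shutdown = 0;
    for (int i = 0; i < N_WORKERS; ++i) {
        pl.id[i] = i;
        int rc = pthread_create(&pl.thread[i], NULL, worker_main, &pl.id[i]);
        assert(rc == 0);
        (void)rc;
    }
}

void pipeline_execute(pipeline_fn fn, void *arg)
{
    pthread_mutex_lock(&pl.lock);
    assert(pl.pending == 0);               /* caller must sync between executes */
    pl.fn = fn;
    pl.arg = arg;
    pl.pending = N_WORKERS;
    pl.generation++;
    pthread_cond_broadcast(&pl.work_ready);
    pthread_mutex_unlock(&pl.lock);
}

void pipeline_sync(void)
{
    pthread_mutex_lock(&pl.lock);
    while (pl.pending > 0)
        pthread_cond_wait(&pl.work_done, &pl.lock);
    pthread_mutex_unlock(&pl.lock);
}

void pipeline_finalize(void)
{
    pipeline_sync();
    pthread_mutex_lock(&pl.lock);
    pl.shutdown = 1;
    pthread_cond_broadcast(&pl.work_ready);
    pthread_mutex_unlock(&pl.lock);
    for (int i = 0; i < N_WORKERS; ++i)
        pthread_join(pl.thread[i], NULL);
}

/* Example kernel: each worker records that it ran once. */
void mark_done(int worker_id, void *arg)
{
    ((int *)arg)[worker_id] += 1;
}
```

A typical call sequence is `pipeline_init()`, then any number of `pipeline_execute(mark_done, counters)` and `pipeline_sync()` pairs, then `pipeline_finalize()`. The same threads serve every execute call, which is how the thread creation overhead mentioned on the slide is avoided.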
