Petascale Parallelization of the Gyrokinetic Toroidal Code


  1. Petascale Parallelization of the Gyrokinetic Toroidal Code – Stephane Ethier, Princeton Plasma Physics Laboratory; Mark Adams, Columbia University; Jonathan Carter and Leonid Oliker, Lawrence Berkeley National Laboratory – VECPAR 2010, June 23, 2010

  2. Outline • System configurations – Blue Gene/P, Cray XT4, Hyperion cluster • Parallel gyrokinetic toroidal code (GTC-P) – First fully parallel toroidal PIC code algorithm • ITER-sized scaling experiments – 128K IBM BG/P cores – 32K Cray XT4 cores – 2K Hyperion cores

  3. Blue Gene/P System • Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM • Compute Card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 GB DDR, supports 4-way SMP • Node Card: 32 chips (4x4x2), 32 compute and 0-2 I/O cards, 435 GF/s, 64 GB • Rack: 32 node cards, 1024 chips, 4096 processors, 14 TF/s, 2 TB • System: 1 to 72 or more racks, cabled 8x8x16, 1 PF/s+, 144 TB. Figure courtesy IBM
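(As a quick consistency check on the figure's numbers: 13.6 GF/s per 4-core chip times 1024 chips is about 14 TF/s per rack, and 14 TF/s times 72 racks is about 1 PF/s; likewise 2 GB per compute card times 1024 chips gives the quoted 2 TB per rack and 144 TB for 72 racks.)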

  4. Blue Gene/P Interconnection Networks • 3-Dimensional Torus – Interconnects all compute nodes – 3.4 GB/s on all 12 node links (5.1 GB/s per node) – MPI: 3 µs latency for one hop, 10 µs to the farthest; bandwidth 1.27 GB/s – 1.7/2.6 TB/s bisection bandwidth • Collective Network – Interconnects all compute and I/O nodes – One-to-all broadcast functionality – Reduction operations functionality – 6.8 GB/s of bandwidth per link – MPI latency of one-way tree traversal: 5 µs • Low-Latency Global Barrier and Interrupt – MPI latency of one way to reach all 72K nodes: 1.6 µs. Figure courtesy IBM

  5. Cray XT4 • Single-socket 2.3 GHz quad-core AMD Opteron per compute node • 37 Gflop/s peak per node • Microkernel on Compute PEs, full-featured Linux on Service PEs • Service PEs (login, network, system, I/O) are specialized by function and form the service partition. Figure courtesy Cray

  6. Cray XT4 Network • Figure (courtesy Cray): AMD Opteron with direct-attached memory (8.5 GB/s local memory bandwidth), HyperTransport connection to the Cray SeaStar2 interconnect (4 GB/s MPI bandwidth), 7.6 GB/s per torus link, 6.5 GB/s sustained torus link bandwidth • MPI latency 4-8 µs, bandwidth 1.7 GB/s

  7. Hyperion Scalable Unit • Figure (courtesy LLNL): each scalable unit (SU) has 134 dual-socket quad-core compute nodes (1,072 cores) on a 288-port (12x24) InfiniBand 4x DDR switch; 4+4 gateway nodes at 1.5 GB/s delivered I/O each uplink over 2x10GbE and IBA 4x DDR to the SAN; one login/service/master node plus QsNet Elan3, 100BaseT control, and 1/10 GbE management networks; the SAN feeds 35 Lustre object storage servers (732 TB, 47 GB/s) and Lustre metadata servers over 2x10GbE + IBA 4x • Hyperion Phase 1 – 4 SUs, a 46 TF/s cluster: 576 nodes and 4,608 cores, 12.1 TB/s memory bandwidth, 4.6 TB capacity • 85 GF/s dual-socket 2.5 GHz quad-core Intel LV Harpertown nodes

  8. Hyperion Connectivity • Figure (courtesy LLNL): 8-SU base system plus 4 expansion SUs • Bandwidth: 4X IB DDR, 2 GB/s peak • MPI latency 2-5 µs, bandwidth 400 MB/s

  9. The Gyrokinetic Toroidal Code • 3D particle-in-cell code to study microturbulence in magnetically confined fusion plasmas • Solves the gyro-averaged Vlasov equation • Gyrokinetic Poisson equation solved in real space • 4-point average method for charge deposition • Global code (full torus as opposed to only a flux tube) • Massively parallel: typical runs use thousands of processors • Nonlinear and fully self-consistent • Written in Fortran 90/95 + MPI • Originally written by Z. Lin, subsequently modified

  10. Fusion: DOE #1 Facility Priority • November 10, 2003 – Energy Secretary Spencer Abraham announces the Department of Energy's 20-Year Science Facility Plan, setting priorities for 28 new, major science research facilities • #1 on the list of priorities is ITER, an unprecedented international collaboration on the next major step for the development of fusion • #2 is UltraScale Scientific Computing Capability

  11. Particle-in-cell (PIC) method • Particles sample the distribution function (markers) • The particles interact via a grid, on which the potential is calculated from the deposited charges • The PIC steps: "SCATTER", or deposit, charges on the grid (nearest neighbors) • Solve the Poisson equation • "GATHER" forces on each particle from the potential • Move particles (PUSH) • Repeat… (a minimal sketch of these steps follows below)
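To make the four steps concrete, here is a minimal, self-contained 1D electrostatic PIC sketch in Fortran (the language GTC is written in). It is not GTC's gyrokinetic algorithm: the periodic 1D domain, linear cloud-in-cell weights, field obtained by integrating the charge density, and unit charge-to-mass ratio are the simplest possible stand-ins, chosen only to show where SCATTER, the field solve, GATHER, and PUSH fit in the time loop.

    program pic_sketch
      implicit none
      integer, parameter :: ng = 64            ! grid points
      integer, parameter :: np = 4096          ! marker particles
      integer, parameter :: nsteps = 100
      real, parameter :: length = 1.0, dt = 0.05, dx = length / ng
      real :: x(np), v(np)                     ! particle positions and velocities
      real :: rho(ng), efield(ng)              ! grid charge density and electric field
      real :: w, qp, rhobar
      integer :: istep, ip, ig

      qp = 1.0 / np                            ! charge per marker (total charge = 1)
      call random_number(x);  x = x * length   ! uniform initial positions
      call random_number(v);  v = 0.1 * (v - 0.5)

      do istep = 1, nsteps
         ! SCATTER: deposit each marker's charge on its two nearest grid points
         rho = 0.0
         do ip = 1, np
            ig = min(int(x(ip) / dx), ng - 1)  ! left grid point, 0 .. ng-1
            w  = x(ip) / dx - ig               ! linear weight toward the right point
            rho(ig + 1)              = rho(ig + 1)              + qp * (1.0 - w) / dx
            rho(mod(ig + 1, ng) + 1) = rho(mod(ig + 1, ng) + 1) + qp * w / dx
         end do
         ! SOLVE: 1D Gauss's law dE/dx = rho - <rho>, integrated along the grid
         rhobar = sum(rho) / ng
         efield(1) = 0.0
         do ig = 2, ng
            efield(ig) = efield(ig - 1) + (rho(ig - 1) - rhobar) * dx
         end do
         efield = efield - sum(efield) / ng    ! remove the mean field
         ! GATHER the field at each marker, then PUSH (unit charge-to-mass ratio assumed)
         do ip = 1, np
            ig = min(int(x(ip) / dx), ng - 1)
            w  = x(ip) / dx - ig
            v(ip) = v(ip) + dt * ((1.0 - w) * efield(ig + 1) + w * efield(mod(ig + 1, ng) + 1))
            x(ip) = modulo(x(ip) + dt * v(ip), length)
         end do
      end do
      print *, 'mean kinetic energy after', nsteps, 'steps:', 0.5 * sum(v * v) / np
    end program pic_sketch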

  12. Parallel GTC: Domain decomposition + particle splitting • 1D domain decomposition – Several MPI processes allocated to each section of the torus • Particle splitting method – The particles in a toroidal section are divided equally among the several MPI processes assigned to it (see the sketch below) • Also has loop-level parallelism – OpenMP directives (not used in this study)
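A minimal sketch, in Fortran 90 + MPI, of how such a two-level layout can be set up with MPI_Comm_split: one communicator groups the processes that share a toroidal section (and split its particles), and a second links the processes that occupy the same position in every section. The names ntoroidal, partd_comm, and toroidal_comm, and the assumption that the process count is a multiple of the number of sections, are illustrative only, not GTC's actual variables.

    program decomp_sketch
      use mpi
      implicit none
      ! Assumed layout: ntoroidal toroidal sections, with nproc/ntoroidal
      ! particle-split processes per section (nproc must be a multiple of ntoroidal).
      integer, parameter :: ntoroidal = 64
      integer :: ierr, nproc, rank, npe_per_domain
      integer :: domain_id, pe_in_domain
      integer :: partd_comm, toroidal_comm       ! illustrative communicator names

      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      npe_per_domain = nproc / ntoroidal

      domain_id    = rank / npe_per_domain       ! which toroidal section this rank belongs to
      pe_in_domain = mod(rank, npe_per_domain)   ! its position within that section

      ! Processes sharing a toroidal section: used to split that section's particles
      ! and to combine their grid contributions (e.g. an MPI_Allreduce of the charge).
      call MPI_Comm_split(MPI_COMM_WORLD, domain_id, pe_in_domain, partd_comm, ierr)
      ! Processes holding the same position across all sections: used for
      ! section-to-section (toroidal) communication such as the particle shift.
      call MPI_Comm_split(MPI_COMM_WORLD, pe_in_domain, domain_id, toroidal_comm, ierr)

      call MPI_Comm_free(partd_comm, ierr)
      call MPI_Comm_free(toroidal_comm, ierr)
      call MPI_Finalize(ierr)
    end program decomp_sketch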

  13. Radial grid decomposition • Non-overlapping geometric partitioning
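The slide does not spell out the partitioning rule, so the sketch below shows just one plausible non-overlapping choice: equal-area annuli, which hold roughly equal numbers of markers when the markers are close to uniform per unit area. The radii and the equal-area formula are illustrative assumptions, not taken from GTC.

    program radial_partition_sketch
      implicit none
      integer, parameter :: ndomains = 8            ! radial domains (illustrative)
      real, parameter :: r_min = 0.1, r_max = 1.0   ! inner/outer minor radius (illustrative)
      real :: bounds(0:ndomains)
      integer :: i

      ! Equal-area annuli: r_i = sqrt(r_min**2 + i*(r_max**2 - r_min**2)/ndomains)
      do i = 0, ndomains
         bounds(i) = sqrt(r_min**2 + real(i) * (r_max**2 - r_min**2) / ndomains)
      end do
      do i = 1, ndomains
         print '(a,i2,a,f6.3,a,f6.3,a)', ' domain ', i, ': [', bounds(i-1), ', ', bounds(i), ')'
      end do
    end program radial_partition_sketch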

  14. Charge Deposition for Charged Rings • Charge deposition step (SCATTER operation) • Figure contrasts classic PIC deposition with GTC's 4-point average gyrokinetic deposition (W.W. Lee)
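A schematic Fortran version of the 4-point average for a single marker, written on a plain 2D Cartesian grid rather than GTC's field-aligned toroidal mesh: one quarter of the marker's charge is placed at each of four points on its gyro-ring (90 degrees apart), and each ring point is then scattered to its cell corners with the usual linear weights. The grid size, marker position, and gyroradius are made-up example values.

    program four_point_sketch
      implicit none
      integer, parameter :: nx = 64, ny = 64
      real, parameter :: dx = 1.0, dy = 1.0, pi = 3.14159265
      real :: rho(0:nx, 0:ny)                 ! grid charge density
      real :: xp, yp, rho_gyro, q             ! guiding center, gyroradius, charge (made up)
      real :: xr, yr, wx, wy
      integer :: k, i, j

      rho = 0.0
      xp = 31.7;  yp = 40.2;  rho_gyro = 2.5;  q = 1.0   ! one example marker

      do k = 0, 3                             ! four points on the gyro-ring, 90 degrees apart
         xr = xp + rho_gyro * cos(0.5 * pi * k)
         yr = yp + rho_gyro * sin(0.5 * pi * k)
         i = int(xr / dx);  wx = xr / dx - i  ! cell containing this ring point
         j = int(yr / dy);  wy = yr / dy - j
         ! A quarter of the charge, scattered to the four cell corners.
         rho(i,   j  ) = rho(i,   j  ) + 0.25 * q * (1.0 - wx) * (1.0 - wy)
         rho(i+1, j  ) = rho(i+1, j  ) + 0.25 * q * wx * (1.0 - wy)
         rho(i,   j+1) = rho(i,   j+1) + 0.25 * q * (1.0 - wx) * wy
         rho(i+1, j+1) = rho(i+1, j+1) + 0.25 * q * wx * wy
      end do
      print *, 'total deposited charge =', sum(rho)    ! equals q
    end program four_point_sketch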

  15. Overlapping partitioning • Extend the local domain to line up with the grid • Extend the local domain by the gyroradius

  16. Major Components in GTC • Particle work – O(p) – Major computational kernel: "moves particles" – Loops with large bodies; lots of loop-level parallelism • Grid (cell) work – O(g) – Poisson solver, "smoothing", E-field calculations • Particle-grid work – O(p) – Major computational kernels: "scatter" and "gather" – Unstructured and "random" access to the grid – Semi-structured grids in GTC – Cache effects are critical

  17. Major routines in GTC • Push ions (P, P-G) – Major computational kernel: "moves particles" – Loops with large bodies; gathers; lots of loop-level parallelism • Charge deposition (P-G) – Major computational kernel: "scatter" – Pressure on the cache from unstructured access to the grid – blocked grid • Shift ions, communication (P) – Sorts out particles that move out of the local domain and sends them to the "next" processor (see the sketch below) • Poisson solver (G) – Solves the Poisson equation; prior to 2007 the solve was executed redundantly on each processor, while the new version uses the PETSc solver to distribute it efficiently • Smooth (G) and Field (G) – Smaller computational kernels – Not parallel prior to 2007
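A much-simplified sketch of what a shift step does, in Fortran 90 + MPI: markers whose toroidal angle has crossed the upper boundary of the local section are packed and handed to the next rank with MPI_Sendrecv, and markers arriving from the previous rank are appended to the local array. The array layout, sizes, single shift direction, and use of row 3 for the toroidal angle are illustrative assumptions; GTC's actual shift also handles both directions and multi-hop moves.

    program shift_sketch
      use mpi
      implicit none
      integer, parameter :: ndim = 6, nmax = 10000
      real, parameter :: twopi = 6.2831853
      real :: particles(ndim, nmax), sendbuf(ndim, nmax), recvbuf(ndim, nmax)
      real :: zeta_min, zeta_max
      integer :: ierr, rank, nproc, right, left
      integer :: np_local, nsend, nrecv, nkeep, ip
      integer :: status(MPI_STATUS_SIZE)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
      right = mod(rank + 1, nproc)               ! next toroidal section
      left  = mod(rank - 1 + nproc, nproc)       ! previous toroidal section

      ! This rank owns the toroidal section [zeta_min, zeta_max).
      zeta_min =  rank      * twopi / nproc
      zeta_max = (rank + 1) * twopi / nproc

      ! Fake some local markers; row 3 holds the toroidal angle, and a few of
      ! them have been pushed slightly past zeta_max by the preceding push step.
      np_local = 1000
      call random_number(particles(:, 1:np_local))
      particles(3, 1:np_local) = zeta_min + 1.1 * (zeta_max - zeta_min) * particles(3, 1:np_local)

      ! Pack the markers that left the section; compact the ones that stay.
      nsend = 0;  nkeep = 0
      do ip = 1, np_local
         if (particles(3, ip) >= zeta_max) then
            nsend = nsend + 1
            sendbuf(:, nsend) = particles(:, ip)
         else
            nkeep = nkeep + 1
            particles(:, nkeep) = particles(:, ip)
         end if
      end do
      np_local = nkeep

      ! Exchange counts, then the marker data, around the ring of sections.
      call MPI_Sendrecv(nsend, 1, MPI_INTEGER, right, 0, &
                        nrecv, 1, MPI_INTEGER, left,  0, &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_Sendrecv(sendbuf, ndim * nsend, MPI_REAL, right, 1, &
                        recvbuf, ndim * nmax,  MPI_REAL, left,  1, &
                        MPI_COMM_WORLD, status, ierr)
      recvbuf(3, 1:nrecv) = modulo(recvbuf(3, 1:nrecv), twopi)   ! wrap around the torus
      particles(:, np_local + 1 : np_local + nrecv) = recvbuf(:, 1:nrecv)
      np_local = np_local + nrecv

      print *, 'rank', rank, 'sent', nsend, 'received', nrecv, 'now holds', np_local
      call MPI_Finalize(ierr)
    end program shift_sketch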

  18. Weak Scaling Experiments • Keep the number of cells and the number of particles per process constant • Double the size of the device in each case – The final case is an ITER-sized plasma • Cray XT4 up to 32K cores – Quad-core / flat MPI (Franklin, NERSC) • BG/P up to 128K cores – Quad-core / flat MPI (Intrepid, ANL) • Hyperion up to 2K cores – Dual-socket quad-core / flat MPI (LLNL)

  19. Absolute Performance • The Cray XT4 is the highest-performing system at both 2K and 32K processors • BG/P scales much better than the XT4 from 2K to 32K processors • Even though Hyperion (Xeon Harpertown) has higher peak performance than the XT4 (Opteron), its performance lags at 2K processors – Lower memory bandwidth relative to peak

  20. Communication Performance • The Shift routine is a good proxy for communication costs • BG/P spends the lowest percentage of time in communication – It also has the lowest-performing processor • Hyperion spends the highest percentage of time in communication

  21. Performance on XT4 • Push and Charge scale well • Shift has moderate scaling • Field and Smooth scale relatively poorly

  22. Performance on BG/P • Push, Charge and Shift scale well • Field and Smooth scale relatively poorly

  23. Performance on Hyperion • Push and Charge scale well • Shift has moderate scaling • Field and Smooth scale relatively poorly

  24. Load Imbalance • Field and Smooth are both dominated by grid-related work • At high processor counts the number of grid points per MPI task becomes imbalanced – Radial domains near the center of the circular plane contain fewer grid points, because the number of points in an annulus grows with its radius – The radial decomposition balances the number of particles instead, since particle work is >80% of the total

  25. Summary • Radial decomposition enables GTC to scale to ITER-sized devices – Without radial decomposition it is impossible to fit the full grid on a single node • The XT4 offers the best performance, but is perhaps not as scalable as BG/P • The Hyperion IB cluster seems to lag, although more data are required – It has been upgraded to Intel Nehalem nodes and enlarged since our work was completed

  26. Acknowledgements • U.S. Department of Energy – Office of Fusion Energy Sciences under contract number DE-AC02-76CH03073 – Office of Advanced Scientific Computing Research under contract number DE-AC02-05CH11231 • Computing Resources – Argonne Leadership Computing Facility – National Energy Research Scientific Computing Center – Hyperion Project at LLNL
