Petascale Parallelization of the Gyrokinetic Toroidal Code
Stephane Ethier, Princeton Plasma Physics Laboratory
Mark Adams, Columbia University
Jonathan Carter, Leonid Oliker, Lawrence Berkeley National Laboratory
VECPAR 2010, June 23rd, 2010
Outline
• System configurations
  – Blue Gene/P, Cray XT4, Hyperion cluster
• Parallel gyrokinetic toroidal code (GTC-P)
  – First fully parallel toroidal PIC code algorithm
• ITER-sized scaling experiments
  – 128K IBM BG/P cores
  – 32K Cray XT4 cores
  – 2K Hyperion cores
Blue Gene/P System
• Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM, supports 4-way SMP
• Compute Card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 GB DDR
• Node Card: 32 chips (4x4x2), 32 compute + 0-2 I/O cards, 435 GF/s, 64 GB
• Rack: 32 node cards, 1024 chips / 4096 procs, 14 TF/s, 2 TB
• System: 1 to 72 or more racks, cabled 8x8x16, 1 PF/s+, 144 TB+
Figure courtesy IBM
Blue Gene/P Interconnection Networks
• 3-Dimensional Torus
  – Interconnects all compute nodes
  – 3.4 Gb/s on all 12 node links (5.1 GB/s per node)
  – MPI: 3 µs latency for one hop, 10 µs to the farthest; bandwidth 1.27 GB/s
  – 1.7/2.6 TB/s bisection bandwidth
• Collective Network
  – Interconnects all compute and I/O nodes
  – One-to-all broadcast functionality
  – Reduction operations functionality
  – 6.8 GB/s of bandwidth per link
  – MPI latency of one-way tree traversal 5 µs
• Low-Latency Global Barrier and Interrupt
  – MPI latency of one way to reach all 72K nodes 1.6 µs
Figure courtesy IBM
Cray XT4
• Single-socket 2.3 GHz quad-core AMD Opteron per compute node
• 37 Gflop/s peak per node
• Microkernel on Compute PEs, full-featured Linux on Service PEs
• Service PEs specialized by function: Login, Network, System, and I/O PEs form the service partition of Linux nodes
Figure courtesy Cray
Cray XT4 Network
• AMD Opteron with direct-attached memory over HyperTransport; 8.5 GB/s local memory bandwidth
• Cray SeaStar2 interconnect: 7.6 GB/s peak on each of the six torus links, 6.5 GB/s sustained torus link bandwidth, 4 GB/s MPI bandwidth
• MPI latency 4-8 µs, bandwidth 1.7 GB/s
Figure courtesy Cray
Hyperion Scalable Unit
• 134 dual-socket, quad-core compute nodes per scalable unit (1,072 cores)
• Dual-socket 2.5 GHz quad-core Intel LV Harpertown nodes, 85 GF/s each
• 288-port (12x24) InfiniBand 4x DDR fabric with uplinks to the spine switch; QsNet Elan3 and 100BaseT control networks; 1/10 GbE management
• Login/Service/Master node; Gateway nodes delivering 1.5 GB/s I/O each over IBA 4x DDR
• 35 Lustre Object Storage Systems (732 TB, 47 GB/s) plus Lustre MetaData servers on the SAN
• Hyperion Phase 1 (4 SU): 46 TF/s cluster, 576 nodes and 4,608 cores, 12.1 TB/s memory bandwidth, 4.6 TB memory capacity
Figure courtesy LLNL
Hyperion Connectivity
• 8 SU base system plus 4 expansion SUs
• 4X IB DDR: 2 GB/s peak bandwidth
• MPI latency 2-5 µs, bandwidth 400 MB/s
Figure courtesy LLNL
The Gyrokinetic Toroidal Code
• 3D particle-in-cell code to study microturbulence in magnetically confined fusion plasmas
• Solves the gyro-averaged Vlasov equation
• Gyrokinetic Poisson equation solved in real space
• 4-point average method for charge deposition
• Global code (full torus as opposed to only a flux tube)
• Massively parallel: typical runs done on 1000s of processors
• Nonlinear and fully self-consistent
• Written in Fortran 90/95 + MPI
• Originally written by Z. Lin, subsequently modified
Fusion: DOE #1 Facility Priority
• November 10, 2003: Energy Secretary Spencer Abraham announces the Department of Energy 20-Year Science Facility Plan, setting priorities for 28 new, major science research facilities
• #1 on the list of priorities is ITER, an unprecedented international collaboration on the next major step for the development of fusion
• #2 is the UltraScale Scientific Computing Capability
Particle-in-cell (PIC) method
• Particles sample the distribution function (markers)
• The particles interact via a grid, on which the potential is calculated from deposited charges
The PIC steps (a sketch of this loop follows below):
• "SCATTER", or deposit, charges on the grid (nearest neighbors)
• Solve the Poisson equation
• "GATHER" forces on each particle from the potential
• Move particles ("PUSH")
• Repeat…
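To make the cycle concrete, here is a minimal sketch in Fortran 90, the language GTC is written in. It is a toy 1D periodic electrostatic model, not GTC source: the linear weighting, the stubbed field solve, and all names (np, ng, rho, phi, efield, ...) are illustrative assumptions.

  program pic_sketch
    implicit none
    integer, parameter :: np = 10000, ng = 64, nsteps = 100
    real(8), parameter :: lx = 1.0d0, dt = 0.01d0, dx = lx/ng
    real(8) :: x(np), v(np), rho(0:ng-1), phi(0:ng-1), efield(0:ng-1)
    real(8) :: w
    integer :: i, j, step

    call random_number(x)
    x = x*lx                                  ! load particles uniformly in [0, lx)
    v = 0.0d0

    do step = 1, nsteps
       ! SCATTER: deposit each particle's charge on its two nearest grid points
       rho = 0.0d0
       do i = 1, np
          j = int(x(i)/dx); w = x(i)/dx - j
          rho(j)           = rho(j)           + (1.0d0 - w)
          rho(mod(j+1,ng)) = rho(mod(j+1,ng)) + w
       end do

       ! SOLVE: Poisson equation for the potential (stubbed here; a real code
       ! would use an FFT or iterative solver)
       call solve_poisson(rho, phi)

       ! E = -dphi/dx by centred differences on the periodic grid
       do j = 0, ng-1
          efield(j) = -(phi(mod(j+1,ng)) - phi(mod(j-1+ng,ng))) / (2.0d0*dx)
       end do

       ! GATHER the field at each particle, then PUSH (advance v and x)
       do i = 1, np
          j = int(x(i)/dx); w = x(i)/dx - j
          v(i) = v(i) + dt*((1.0d0-w)*efield(j) + w*efield(mod(j+1,ng)))
          x(i) = modulo(x(i) + dt*v(i), lx)    ! periodic boundary
       end do
    end do

  contains
    subroutine solve_poisson(rho, phi)
      real(8), intent(in)  :: rho(0:)
      real(8), intent(out) :: phi(0:)
      phi = 0.0d0          ! placeholder standing in for the real field solve
    end subroutine solve_poisson
  end program pic_sketch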
Parallel GTC: Domain decomposition + particle splitting
• 1D domain decomposition
  – Several MPI processes allocated to each toroidal section of the torus
• Particle splitting method
  – The particles in a toroidal section are equally divided between the several MPI processes assigned to it (Processors 0-3 in the figure)
• Also has loop-level parallelism
  – OpenMP directives (not used in this study)
A communicator-splitting sketch of this two-level decomposition follows below.
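A minimal sketch, in Fortran 90 + MPI, of how such a two-level decomposition can be set up with communicator splitting. This is an illustration under assumed names (ntoroidal, mycolor, toroidal_comm, ...), not the actual GTC source.

  program decomp_sketch
    use mpi
    implicit none
    integer :: ierr, world_rank, world_size
    integer :: ntoroidal            ! number of toroidal sections (1D domain decomposition)
    integer :: toroidal_comm        ! processes that share the same toroidal section
    integer :: mycolor, myrank_in_section

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierr)

    ntoroidal = 64                          ! e.g. 64 toroidal sections (assumed value)
    mycolor   = mod(world_rank, ntoroidal)  ! which section this process belongs to

    ! All processes with the same color hold copies of the same toroidal grid
    ! section and split that section's particles equally among themselves.
    call MPI_Comm_split(MPI_COMM_WORLD, mycolor, world_rank, toroidal_comm, ierr)
    call MPI_Comm_rank(toroidal_comm, myrank_in_section, ierr)

    ! Charge deposited by the particle-splitting processes of one section would
    ! then be summed onto that section's grid, e.g. with an allreduce:
    ! call MPI_Allreduce(MPI_IN_PLACE, rho, nrho, MPI_DOUBLE_PRECISION, &
    !                    MPI_SUM, toroidal_comm, ierr)

    call MPI_Finalize(ierr)
  end program decomp_sketch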
Radial grid decomposition
• Non-overlapping geometric partitioning
Charge Deposition for Charged Rings
• Charge deposition step (SCATTER operation)
• Figure compares classic PIC deposition with the 4-point-average gyrokinetic method (W.W. Lee) used in GTC
A sketch of the 4-point-average deposition follows below.
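A sketch of 4-point gyro-averaged deposition: each particle is treated as a charged ring of gyroradius rho, a quarter of its charge is placed at four points 90 degrees apart on the ring, and each point is interpolated to nearby grid points. For clarity this uses a 2D Cartesian grid and bilinear weighting; GTC itself deposits on a field-aligned toroidal grid, and all names (xp, yp, rho_gyro, dens, ...) are hypothetical. Boundary handling is omitted: all ring points are assumed to lie in the interior of the domain.

  subroutine deposit_4point(np, xp, yp, rho_gyro, q, nx, ny, dx, dy, dens)
    implicit none
    integer, intent(in)  :: np, nx, ny
    real(8), intent(in)  :: xp(np), yp(np), rho_gyro(np), q(np), dx, dy
    real(8), intent(out) :: dens(0:nx, 0:ny)
    real(8), parameter   :: pi = 3.141592653589793d0
    real(8) :: xg, yg, wx, wy, qq
    integer :: i, k, ix, iy

    dens = 0.0d0
    do i = 1, np
       qq = 0.25d0 * q(i)                   ! each of the 4 ring points carries 1/4 of the charge
       do k = 0, 3                          ! 4 points 90 degrees apart on the gyro-ring
          xg = xp(i) + rho_gyro(i)*cos(0.5d0*pi*k)
          yg = yp(i) + rho_gyro(i)*sin(0.5d0*pi*k)
          ! bilinear (area-weighting) interpolation of each ring point to its cell corners
          ix = int(xg/dx); wx = xg/dx - ix
          iy = int(yg/dy); wy = yg/dy - iy
          dens(ix  , iy  ) = dens(ix  , iy  ) + qq*(1.0d0-wx)*(1.0d0-wy)
          dens(ix+1, iy  ) = dens(ix+1, iy  ) + qq*wx*(1.0d0-wy)
          dens(ix  , iy+1) = dens(ix  , iy+1) + qq*(1.0d0-wx)*wy
          dens(ix+1, iy+1) = dens(ix+1, iy+1) + qq*wx*wy
       end do
    end do
  end subroutine deposit_4point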
Overlapping partitioning
• Extend the local domain to line up with the grid
• Extend the local domain by a gyroradius
A sketch of the resulting ghost-zone reduction follows below.
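With overlapping radial domains, charge deposited into the ghost rows must be summed back onto the neighbour that owns them. The following Fortran 90 + MPI sketch shows one way to do that with pairwise exchanges between the inner and outer radial neighbours; the actual GTC-P communication pattern may differ, and all names (dens, nghost, inner, outer, ...) are assumptions.

  subroutine sum_radial_ghosts(dens, ilo, ihi, nghost, npol, inner, outer, comm)
    use mpi
    implicit none
    integer, intent(in)    :: ilo, ihi, nghost, npol, inner, outer, comm
    ! Owned radial rows ilo:ihi plus nghost overlap rows on each side
    real(8), intent(inout) :: dens(ilo-nghost:ihi+nghost, npol)
    real(8) :: buf(nghost, npol)
    integer :: ierr, stat(MPI_STATUS_SIZE), n

    n = nghost*npol

    ! Send the charge we deposited into the outer neighbour's cells and
    ! receive what the inner neighbour deposited into ours; then accumulate.
    call MPI_Sendrecv(dens(ihi+1:ihi+nghost, :), n, MPI_DOUBLE_PRECISION, outer, 0, &
                      buf,                       n, MPI_DOUBLE_PRECISION, inner, 0, &
                      comm, stat, ierr)
    if (inner /= MPI_PROC_NULL) dens(ilo:ilo+nghost-1, :) = dens(ilo:ilo+nghost-1, :) + buf

    ! Same exchange in the other radial direction.
    call MPI_Sendrecv(dens(ilo-nghost:ilo-1, :), n, MPI_DOUBLE_PRECISION, inner, 1, &
                      buf,                       n, MPI_DOUBLE_PRECISION, outer, 1, &
                      comm, stat, ierr)
    if (outer /= MPI_PROC_NULL) dens(ihi-nghost+1:ihi, :) = dens(ihi-nghost+1:ihi, :) + buf
  end subroutine sum_radial_ghosts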
Major Components in GTC
• Particle work – O(p)
  – Major computational kernel: "moves particles"
  – Large body loops; lots of loop-level parallelism
• Grid (cell) work – O(g)
  – Poisson solver, "smoothing", E-field calculations
• Particle-Grid work – O(p)
  – Major computational kernels: "scatter" and "gather"
  – Unstructured and "random" access to grids
  – Semi-structured grids in GTC
  – Cache effects are critical
Major Routines in GTC
• Push ions (P, P-G)
  – Major computational kernel: "moves particles"
  – Large body loops; gathers; lots of loop-level parallelism
• Charge deposition (P-G)
  – Major computational kernel: "scatter"
  – Pressure on cache – unstructured access to grid – blocked grid
• Shift ions, communication (P)
  – Sorts out particles that have moved out of the local domain and sends them to the "next" processor (see the sketch below)
• Poisson solver (G)
  – Solves the Poisson equation. Prior to 2007 the solve was redundantly executed on each processor; the new version uses the PETSc solver to distribute it efficiently
• Smooth (G) and Field (G)
  – Smaller computational kernels
  – Prior to 2007: NOT parallel
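A simplified Fortran 90 + MPI sketch of a shift step: particles whose toroidal angle has left the local section are packed and sent to the right-hand neighbour, and arrivals from the left are appended. GTC's actual routine differs (it shifts in both directions and may repeat until no particle is misplaced, and carries full phase-space coordinates); names here (zeta, weight, zmin, zmax, ...) are hypothetical, and the arrays are assumed allocated with spare capacity.

  subroutine shift_sketch(np, zeta, weight, zmax, right, left, comm)
    use mpi
    implicit none
    integer, intent(inout) :: np
    real(8), intent(inout) :: zeta(:), weight(:)
    real(8), intent(in)    :: zmax                 ! upper toroidal bound of the local section
    integer, intent(in)    :: right, left, comm
    real(8), allocatable   :: sendbuf(:,:), recvbuf(:,:)
    integer :: i, nsend, nrecv, nkeep, ierr, stat(MPI_STATUS_SIZE)

    ! Separate leavers (zeta >= zmax) from stayers, compacting the local arrays.
    nsend = count(zeta(1:np) >= zmax)
    allocate(sendbuf(2, nsend))
    nkeep = 0; nsend = 0
    do i = 1, np
       if (zeta(i) >= zmax) then
          nsend = nsend + 1
          sendbuf(:, nsend) = (/ zeta(i), weight(i) /)
       else
          nkeep = nkeep + 1
          zeta(nkeep) = zeta(i); weight(nkeep) = weight(i)
       end if
    end do

    ! Exchange counts, then the particle data, with the toroidal neighbours.
    call MPI_Sendrecv(nsend, 1, MPI_INTEGER, right, 0, &
                      nrecv, 1, MPI_INTEGER, left,  0, comm, stat, ierr)
    allocate(recvbuf(2, nrecv))
    call MPI_Sendrecv(sendbuf, 2*nsend, MPI_DOUBLE_PRECISION, right, 1, &
                      recvbuf, 2*nrecv, MPI_DOUBLE_PRECISION, left,  1, comm, stat, ierr)

    ! Append arrivals to the local particle arrays.
    do i = 1, nrecv
       zeta(nkeep+i)   = recvbuf(1, i)
       weight(nkeep+i) = recvbuf(2, i)
    end do
    np = nkeep + nrecv
    deallocate(sendbuf, recvbuf)
  end subroutine shift_sketch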
Weak Scaling Experiments
• Keep the number of cells and the number of particles per process constant
• Double the size of the device in each case
  – Final case is an ITER-sized plasma
• Cray XT4 up to 32K cores
  – Quad-core / flat MPI (Franklin – NERSC)
• BG/P up to 128K cores
  – Quad-core / flat MPI (Intrepid – ANL)
• Hyperion up to 2K cores
  – Dual-socket quad-core / flat MPI (LLNL)
Absolute Performance
• The Cray XT4 is the highest performing at both 2K and 32K procs
• BG/P scaling is much better than the XT4's going from 2K to 32K procs
• Even though Hyperion (Xeon Harpertown) has a higher peak performance than the XT4 (Opteron), its performance lags at 2K procs
  – Lower memory bandwidth relative to peak
Communication Performance
• The Shift routine is a good proxy for communication costs
• BG/P has the lowest percentage of time in communication
  – It also has the lowest-performing processor
• Hyperion has the highest percentage of time in communication
Performance on XT4
• Push and Charge scale well
• Shift has moderate scaling
• Field and Smooth scale relatively poorly
Performance on BG/P
• Push, Charge and Shift scale well
• Field and Smooth scale relatively poorly
Performance on Hyperion
• Push and Charge scale well
• Shift has moderate scaling
• Field and Smooth scale relatively poorly
Load Imbalance
• Field and Smooth are both dominated by grid-related work
• At high processor counts the number of grid points per MPI task is imbalanced
  – Fewer grid points in the radial domains near the center of the circular poloidal plane (see the toy illustration below)
  – The radial decomposition balances the number of particles per domain, since particle work is >80% of the total
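A toy Fortran 90 illustration of why equal-width radial domains hold unequal grid work: with roughly uniform grid spacing on a circular poloidal plane, the number of poloidal points on a flux surface grows with its radius, so inner domains own far fewer points than outer ones. The numbers (100 surfaces, 4 domains, ~8*i points per surface) are made up for illustration.

  program grid_imbalance
    implicit none
    integer, parameter :: nsurf = 100, ndom = 4
    integer :: i, d, pts(ndom)
    pts = 0
    do i = 1, nsurf
       d = (i-1)*ndom/nsurf + 1     ! equal-width radial domain owning surface i
       pts(d) = pts(d) + 8*i        ! ~8*i poloidal points on surface i (toy value)
    end do
    print '(a,4i8)', 'grid points per radial domain:', pts
  end program grid_imbalance

For these toy numbers the innermost domain ends up with 2,600 points versus 17,600 for the outermost, while each domain still holds roughly the same number of particles.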
Summary
• Radial decomposition enables GTC to scale to ITER-sized devices
  – Without radial decomposition, the full grid cannot fit on a single node
• The XT4 offers the best performance, but is perhaps not as scalable as BG/P
• The Hyperion IB cluster seems to lag, although more data are required
  – It has been upgraded to Intel Nehalem nodes and enlarged since our work was completed
Acknowledgements
• U.S. Department of Energy
  – Office of Fusion Energy Sciences under contract number DE-AC02-76CH03073
  – Office of Advanced Scientific Computing Research under contract number DE-AC02-05CH11231
• Computing Resources
  – Argonne Leadership Computing Facility
  – National Energy Research Scientific Computing Center
  – Hyperion Project at LLNL