Petascale Parallelization of the Gyrokinetic Toroidal Code


  1. Petascale Parallelization of the Gyrokinetic Toroidal Code – Stephane Ethier, Princeton Plasma Physics Laboratory; Mark Adams, Columbia University; Jonathan Carter and Leonid Oliker, Lawrence Berkeley National Laboratory – VECPAR 2010, June 23, 2010

  2. Outline • System configurations – Blue Gene/P, Cray XT4, Hyperion cluster • Parallel gyrokinetic toroidal code (GTC-P) – First fully parallel toroidal PIC code algorithm • ITER-sized scaling experiments – 128K IBM BG/P cores – 32K Cray XT4 cores – 2K Hyperion cores

  3. Blue Gene/P System • Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM • Compute Card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 GB DDR, supports 4-way SMP • Node Card: 32 chips (4x4x2), 32 compute and 0-2 I/O cards, 435 GF/s, 64 GB • Rack: 32 node cards, 1024 chips, 4096 processors, 14 TF/s, 2 TB • System: 1 to 72 or more racks, cabled 8x8x16, 1 PF/s+, 144 TB. Figure courtesy IBM
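(As a quick consistency check on the figure's numbers: 13.6 GF/s per 4-core chip times 1024 chips is about 14 TF/s per rack, and 14 TF/s times 72 racks is about 1 PF/s; likewise 2 GB per compute card times 1024 chips gives the quoted 2 TB per rack and 144 TB for 72 racks.)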

  4. Blue Gene/P Interconnection Networks • 3-Dimensional Torus – Interconnects all compute nodes – 3.4 GB/s on all 12 node links (5.1 GB/s per node) – MPI: 3 µs latency for one hop, 10 µs to the farthest; bandwidth 1.27 GB/s – 1.7/2.6 TB/s bisection bandwidth • Collective Network – Interconnects all compute and I/O nodes – One-to-all broadcast functionality – Reduction operations functionality – 6.8 GB/s of bandwidth per link – MPI latency of one-way tree traversal: 5 µs • Low-Latency Global Barrier and Interrupt – MPI latency of one way to reach all 72K nodes: 1.6 µs. Figure courtesy IBM

  5. Cray XT4 • Single-socket 2.3 GHz quad-core AMD Opteron per compute node • 37 Gflop/s peak per node • Microkernel on Compute PEs, full-featured Linux on Service PEs • Service PEs (login, network, system, I/O) are specialized by function and form the service partition. Figure courtesy Cray

  6. Cray XT4 Network • Figure (courtesy Cray): AMD Opteron with direct-attached memory (8.5 GB/s local memory bandwidth), HyperTransport connection to the Cray SeaStar2 interconnect (4 GB/s MPI bandwidth), 7.6 GB/s per torus link, 6.5 GB/s sustained torus link bandwidth • MPI latency 4-8 µs, bandwidth 1.7 GB/s

  7. Hyperion Scalable Unit • Figure (courtesy LLNL): each scalable unit (SU) has 134 dual-socket quad-core compute nodes (1,072 cores) on a 288-port (12x24) InfiniBand 4x DDR switch; 4+4 gateway nodes at 1.5 GB/s delivered I/O each uplink over 2x10GbE and IBA 4x DDR to the SAN; one login/service/master node plus QsNet Elan3, 100BaseT control, and 1/10 GbE management networks; the SAN feeds 35 Lustre object storage servers (732 TB, 47 GB/s) and Lustre metadata servers over 2x10GbE + IBA 4x • Hyperion Phase 1 – 4 SUs, a 46 TF/s cluster: 576 nodes and 4,608 cores, 12.1 TB/s memory bandwidth, 4.6 TB capacity • 85 GF/s dual-socket 2.5 GHz quad-core Intel LV Harpertown nodes

  8. Hyperion Connectivity • Figure (courtesy LLNL): 8-SU base system plus 4 expansion SUs • Bandwidth: 4X IB DDR, 2 GB/s peak • MPI latency 2-5 µs, bandwidth 400 MB/s

  9. The Gyrokinetic Toroidal Code • 3D particle-in-cell code to study microturbulence in magnetically confined fusion plasmas • Solves the gyro-averaged Vlasov equation • Gyrokinetic Poisson equation solved in real space • 4-point average method for charge deposition • Global code (full torus as opposed to only a flux tube) • Massively parallel: typical runs use thousands of processors • Nonlinear and fully self-consistent • Written in Fortran 90/95 + MPI • Originally written by Z. Lin, subsequently modified

  10. Fusion: DOE #1 Facility Priority • November 10, 2003 – Energy Secretary Spencer Abraham announces the Department of Energy's 20-Year Science Facility Plan, setting priorities for 28 new, major science research facilities • #1 on the list of priorities is ITER, an unprecedented international collaboration on the next major step for the development of fusion • #2 is UltraScale Scientific Computing Capability

  11. Particle-in-cell (PIC) method • Particles sample the distribution function (markers) • The particles interact via a grid, on which the potential is calculated from the deposited charges • The PIC steps: "SCATTER", or deposit, charges on the grid (nearest neighbors) • Solve the Poisson equation • "GATHER" forces on each particle from the potential • Move particles (PUSH) • Repeat… (a minimal sketch of these steps follows below)
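To make the four steps concrete, here is a minimal, self-contained 1D electrostatic PIC sketch in Fortran (the language GTC is written in). It is not GTC's gyrokinetic algorithm: the periodic 1D domain, linear cloud-in-cell weights, field obtained by integrating the charge density, and unit charge-to-mass ratio are the simplest possible stand-ins, chosen only to show where SCATTER, the field solve, GATHER, and PUSH fit in the time loop.

    program pic_sketch
      implicit none
      integer, parameter :: ng = 64            ! grid points
      integer, parameter :: np = 4096          ! marker particles
      integer, parameter :: nsteps = 100
      real, parameter :: length = 1.0, dt = 0.05, dx = length / ng
      real :: x(np), v(np)                     ! particle positions and velocities
      real :: rho(ng), efield(ng)              ! grid charge density and electric field
      real :: w, qp, rhobar
      integer :: istep, ip, ig

      qp = 1.0 / np                            ! charge per marker (total charge = 1)
      call random_number(x);  x = x * length   ! uniform initial positions
      call random_number(v);  v = 0.1 * (v - 0.5)

      do istep = 1, nsteps
         ! SCATTER: deposit each marker's charge on its two nearest grid points
         rho = 0.0
         do ip = 1, np
            ig = min(int(x(ip) / dx), ng - 1)  ! left grid point, 0 .. ng-1
            w  = x(ip) / dx - ig               ! linear weight toward the right point
            rho(ig + 1)              = rho(ig + 1)              + qp * (1.0 - w) / dx
            rho(mod(ig + 1, ng) + 1) = rho(mod(ig + 1, ng) + 1) + qp * w / dx
         end do
         ! SOLVE: 1D Gauss's law dE/dx = rho - <rho>, integrated along the grid
         rhobar = sum(rho) / ng
         efield(1) = 0.0
         do ig = 2, ng
            efield(ig) = efield(ig - 1) + (rho(ig - 1) - rhobar) * dx
         end do
         efield = efield - sum(efield) / ng    ! remove the mean field
         ! GATHER the field at each marker, then PUSH (unit charge-to-mass ratio assumed)
         do ip = 1, np
            ig = min(int(x(ip) / dx), ng - 1)
            w  = x(ip) / dx - ig
            v(ip) = v(ip) + dt * ((1.0 - w) * efield(ig + 1) + w * efield(mod(ig + 1, ng) + 1))
            x(ip) = modulo(x(ip) + dt * v(ip), length)
         end do
      end do
      print *, 'mean kinetic energy after', nsteps, 'steps:', 0.5 * sum(v * v) / np
    end program pic_sketch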

  12. Parallel GTC: Domain decomposition + particle splitting • 1D domain decomposition – Several MPI processes allocated to each section of the torus • Particle splitting method – The particles in a toroidal section are divided equally among the several MPI processes assigned to it (see the sketch below) • Also has loop-level parallelism – OpenMP directives (not used in this study)
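A minimal sketch, in Fortran 90 + MPI, of how such a two-level layout can be set up with MPI_Comm_split: one communicator groups the processes that share a toroidal section (and split its particles), and a second links the processes that occupy the same position in every section. The names ntoroidal, partd_comm, and toroidal_comm, and the assumption that the process count is a multiple of the number of sections, are illustrative only, not GTC's actual variables.

    program decomp_sketch
      use mpi
      implicit none
      ! Assumed layout: ntoroidal toroidal sections, with nproc/ntoroidal
      ! particle-split processes per section (nproc must be a multiple of ntoroidal).
      integer, parameter :: ntoroidal = 64
      integer :: ierr, nproc, rank, npe_per_domain
      integer :: domain_id, pe_in_domain
      integer :: partd_comm, toroidal_comm       ! illustrative communicator names

      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      npe_per_domain = nproc / ntoroidal

      domain_id    = rank / npe_per_domain       ! which toroidal section this rank belongs to
      pe_in_domain = mod(rank, npe_per_domain)   ! its position within that section

      ! Processes sharing a toroidal section: used to split that section's particles
      ! and to combine their grid contributions (e.g. an MPI_Allreduce of the charge).
      call MPI_Comm_split(MPI_COMM_WORLD, domain_id, pe_in_domain, partd_comm, ierr)
      ! Processes holding the same position across all sections: used for
      ! section-to-section (toroidal) communication such as the particle shift.
      call MPI_Comm_split(MPI_COMM_WORLD, pe_in_domain, domain_id, toroidal_comm, ierr)

      call MPI_Comm_free(partd_comm, ierr)
      call MPI_Comm_free(toroidal_comm, ierr)
      call MPI_Finalize(ierr)
    end program decomp_sketch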

  13. Radial grid decomposition • Non-overlapping geometric partitioning
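The slide does not spell out the partitioning rule, so the sketch below shows just one plausible non-overlapping choice: equal-area annuli, which hold roughly equal numbers of markers when the markers are close to uniform per unit area. The radii and the equal-area formula are illustrative assumptions, not taken from GTC.

    program radial_partition_sketch
      implicit none
      integer, parameter :: ndomains = 8            ! radial domains (illustrative)
      real, parameter :: r_min = 0.1, r_max = 1.0   ! inner/outer minor radius (illustrative)
      real :: bounds(0:ndomains)
      integer :: i

      ! Equal-area annuli: r_i = sqrt(r_min**2 + i*(r_max**2 - r_min**2)/ndomains)
      do i = 0, ndomains
         bounds(i) = sqrt(r_min**2 + real(i) * (r_max**2 - r_min**2) / ndomains)
      end do
      do i = 1, ndomains
         print '(a,i2,a,f6.3,a,f6.3,a)', ' domain ', i, ': [', bounds(i-1), ', ', bounds(i), ')'
      end do
    end program radial_partition_sketch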

  14. Charge Deposition for Charged Rings • Charge deposition step (SCATTER operation) • Figure contrasts classic PIC deposition with GTC's 4-point average gyrokinetic deposition (W.W. Lee)
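A schematic Fortran version of the 4-point average for a single marker, written on a plain 2D Cartesian grid rather than GTC's field-aligned toroidal mesh: one quarter of the marker's charge is placed at each of four points on its gyro-ring (90 degrees apart), and each ring point is then scattered to its cell corners with the usual linear weights. The grid size, marker position, and gyroradius are made-up example values.

    program four_point_sketch
      implicit none
      integer, parameter :: nx = 64, ny = 64
      real, parameter :: dx = 1.0, dy = 1.0, pi = 3.14159265
      real :: rho(0:nx, 0:ny)                 ! grid charge density
      real :: xp, yp, rho_gyro, q             ! guiding center, gyroradius, charge (made up)
      real :: xr, yr, wx, wy
      integer :: k, i, j

      rho = 0.0
      xp = 31.7;  yp = 40.2;  rho_gyro = 2.5;  q = 1.0   ! one example marker

      do k = 0, 3                             ! four points on the gyro-ring, 90 degrees apart
         xr = xp + rho_gyro * cos(0.5 * pi * k)
         yr = yp + rho_gyro * sin(0.5 * pi * k)
         i = int(xr / dx);  wx = xr / dx - i  ! cell containing this ring point
         j = int(yr / dy);  wy = yr / dy - j
         ! A quarter of the charge, scattered to the four cell corners.
         rho(i,   j  ) = rho(i,   j  ) + 0.25 * q * (1.0 - wx) * (1.0 - wy)
         rho(i+1, j  ) = rho(i+1, j  ) + 0.25 * q * wx * (1.0 - wy)
         rho(i,   j+1) = rho(i,   j+1) + 0.25 * q * (1.0 - wx) * wy
         rho(i+1, j+1) = rho(i+1, j+1) + 0.25 * q * wx * wy
      end do
      print *, 'total deposited charge =', sum(rho)    ! equals q
    end program four_point_sketch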

  15. Overlapping partitioning • Extend the local domain to line up with the grid • Extend the local domain by the gyroradius

  16. Major Components in GTC • Particle work – O(p) – Major computational kernel: "moves particles" – Loops with large bodies; lots of loop-level parallelism • Grid (cell) work – O(g) – Poisson solver, "smoothing", E-field calculations • Particle-grid work – O(p) – Major computational kernels: "scatter" and "gather" – Unstructured and "random" access to the grid – Semi-structured grids in GTC – Cache effects are critical

  17. Major routines in GTC • Push ions (P, P-G) – Major computational kernel: "moves particles" – Loops with large bodies; gathers; lots of loop-level parallelism • Charge deposition (P-G) – Major computational kernel: "scatter" – Pressure on the cache from unstructured access to the grid – blocked grid • Shift ions, communication (P) – Sorts out particles that move out of the local domain and sends them to the "next" processor (see the sketch below) • Poisson solver (G) – Solves the Poisson equation; prior to 2007 the solve was executed redundantly on each processor, while the new version uses the PETSc solver to distribute it efficiently • Smooth (G) and Field (G) – Smaller computational kernels – Not parallel prior to 2007
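A much-simplified sketch of what a shift step does, in Fortran 90 + MPI: markers whose toroidal angle has crossed the upper boundary of the local section are packed and handed to the next rank with MPI_Sendrecv, and markers arriving from the previous rank are appended to the local array. The array layout, sizes, single shift direction, and use of row 3 for the toroidal angle are illustrative assumptions; GTC's actual shift also handles both directions and multi-hop moves.

    program shift_sketch
      use mpi
      implicit none
      integer, parameter :: ndim = 6, nmax = 10000
      real, parameter :: twopi = 6.2831853
      real :: particles(ndim, nmax), sendbuf(ndim, nmax), recvbuf(ndim, nmax)
      real :: zeta_min, zeta_max
      integer :: ierr, rank, nproc, right, left
      integer :: np_local, nsend, nrecv, nkeep, ip
      integer :: status(MPI_STATUS_SIZE)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
      right = mod(rank + 1, nproc)               ! next toroidal section
      left  = mod(rank - 1 + nproc, nproc)       ! previous toroidal section

      ! This rank owns the toroidal section [zeta_min, zeta_max).
      zeta_min =  rank      * twopi / nproc
      zeta_max = (rank + 1) * twopi / nproc

      ! Fake some local markers; row 3 holds the toroidal angle, and a few of
      ! them have been pushed slightly past zeta_max by the preceding push step.
      np_local = 1000
      call random_number(particles(:, 1:np_local))
      particles(3, 1:np_local) = zeta_min + 1.1 * (zeta_max - zeta_min) * particles(3, 1:np_local)

      ! Pack the markers that left the section; compact the ones that stay.
      nsend = 0;  nkeep = 0
      do ip = 1, np_local
         if (particles(3, ip) >= zeta_max) then
            nsend = nsend + 1
            sendbuf(:, nsend) = particles(:, ip)
         else
            nkeep = nkeep + 1
            particles(:, nkeep) = particles(:, ip)
         end if
      end do
      np_local = nkeep

      ! Exchange counts, then the marker data, around the ring of sections.
      call MPI_Sendrecv(nsend, 1, MPI_INTEGER, right, 0, &
                        nrecv, 1, MPI_INTEGER, left,  0, &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_Sendrecv(sendbuf, ndim * nsend, MPI_REAL, right, 1, &
                        recvbuf, ndim * nmax,  MPI_REAL, left,  1, &
                        MPI_COMM_WORLD, status, ierr)
      recvbuf(3, 1:nrecv) = modulo(recvbuf(3, 1:nrecv), twopi)   ! wrap around the torus
      particles(:, np_local + 1 : np_local + nrecv) = recvbuf(:, 1:nrecv)
      np_local = np_local + nrecv

      print *, 'rank', rank, 'sent', nsend, 'received', nrecv, 'now holds', np_local
      call MPI_Finalize(ierr)
    end program shift_sketch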

  18. Weak Scaling Experiments • Keep the number of cells and the number of particles per process constant • Double the size of the device in each case – The final case is an ITER-sized plasma • Cray XT4 up to 32K cores – Quad-core / flat MPI (Franklin, NERSC) • BG/P up to 128K cores – Quad-core / flat MPI (Intrepid, ANL) • Hyperion up to 2K cores – Dual-socket quad-core / flat MPI (LLNL)

  19. Absolute Performance • The Cray XT4 is the highest-performing system at both 2K and 32K processors • BG/P scales much better than the XT4 from 2K to 32K processors • Even though Hyperion (Xeon Harpertown) has higher peak performance than the XT4 (Opteron), its performance lags at 2K processors – Lower memory bandwidth relative to peak

  20. Communication Performance • The Shift routine is a good proxy for communication costs • BG/P spends the lowest percentage of time in communication – It also has the lowest-performing processor • Hyperion spends the highest percentage of time in communication

  21. Performance on XT4 • Push and Charge scale well • Shift has moderate scaling • Field and Smooth scale relatively poorly

  22. Performance on BG/P • Push, Charge and Shift scale well • Field and Smooth scale relatively poorly

  23. Performance on Hyperion • Push and Charge scale well • Shift has moderate scaling • Field and Smooth scale relatively poorly

  24. Load Imbalance • Field and Smooth are both dominated by grid-related work • At high processor counts the number of grid points per MPI task becomes imbalanced – Radial domains near the center of the circular plane contain fewer grid points, because the number of points in an annulus grows with its radius – The radial decomposition balances the number of particles instead, since particle work is >80% of the total

  25. Summary • Radial decomposition enables GTC to scale to ITER-sized devices – Without radial decomposition it is impossible to fit the full grid on a single node • The XT4 offers the best performance, but is perhaps not as scalable as BG/P • The Hyperion IB cluster seems to lag, although more data are required – It has been upgraded to Intel Nehalem nodes and enlarged since our work was completed

  26. Acknowledgements • U.S. Department of Energy – Office of Fusion Energy Sciences under contract number DE-AC02-76CH03073 – Office of Advanced Scientific Computing Research under contract number DE-AC02-05CH11231 • Computing Resources – Argonne Leadership Computing Facility – National Energy Research Scientific Computing Center – Hyperion Project at LLNL
