A parallel Grad-Shafranov solver for real-time control of tokamak plasmas

M. Rampp, R. Preuss, R. Fischer, L. Giannone, K. Hallatschek
Computing Centre (RZG) of the Max-Planck-Society and Max-Planck-Institute for Plasma Physics (IPP)
Frascati, March 26, 2012
Overview

GPEC (Garching Parallel Equilibrium Code), a descendant of GEC (by Lackner et al.)

Outline:
1. Introduction
2. Basic equations and parallel algorithm for numerical solution
3. Benchmarks & notes on implementation
4. Summary & outlook

References
⊲ R. Preuss: Real-time equilibrium calculation, AUG Seminar, IPP (July 2009)
⊲ R. Preuss, R. Fischer, M. Rampp, K. Hallatschek, U. von Toussaint, L. Giannone, P. McCarthy: Parallel equilibrium algorithm for real-time control of tokamak plasmas, IPP Report R/47 (2012)
Introduction

Scope & Motivation
⊲ develop a "fast" numerical solver for the Grad-Shafranov equation
⊲ "fast": enable application in plasma equilibrium codes (CLISTE, IDE @ AUG):
  · real-time control of fusion experiments (moderate numerical resolution, runtime ≲ 1 ms)
  · efficient and scalable offline analysis (high numerical resolution)

Strategy
⊲ rely on Open-Source Software, OSS (no vendor lock-in: technology, licensing, ITER policy)
⊲ employ commodity hardware & open, well-established software standards
  · x86 Linux servers or clusters
  · standard compilers, programming models & runtime (FORTRAN 90, OpenMP, MPI)
⊲ try to exploit all possible levels of parallelism: processors/tasks, cores/threads, instructions

Status
⊲ GPEC fully implemented, validated and documented
⊲ integration into IDE completed (offline analysis of AUG data)
⊲ prototypical MPI parallelization of IDE (towards real-time capability)
Context: Grad-Shafranov equation in equilibrium codes

Grad-Shafranov equation in axisymmetric geometry:

  \Delta^*\psi = -4\pi^2 \mu_0\, r\, j_\phi
  \quad\text{with}\quad
  \Delta^* := r \frac{\partial}{\partial r}\frac{1}{r}\frac{\partial}{\partial r} + \frac{\partial^2}{\partial z^2},
  \qquad
  j_\phi := r \frac{\partial p(r,\psi)}{\partial \psi} + \frac{F(\psi)}{\mu_0 r}\frac{\mathrm{d}F(\psi)}{\mathrm{d}\psi}

Application in equilibrium codes:
⊲ expansion of the toroidal current density j_φ into a linear combination of N_p + N_F basis functions:

  p'(\psi) = \sum_{h=1}^{N_p} c_h\, \pi_h(\psi), \qquad FF'(\psi) = \sum_{k=1}^{N_F} d_k\, \varphi_k(\psi)

⊲ for each basis function, solve individually:

  \Delta^*\psi_h = -4\pi^2 \mu_0\, r^2\, \pi_h(\psi_{\mathrm{old}}), \qquad
  \Delta^*\psi_{N_p+k} = -4\pi^2 \mu_0^2\, \varphi_k(\psi_{\mathrm{old}})

⊲ therefore:

  \psi = \sum_{i=1}^{N_p} c_i\, \psi_i + \sum_{j=1}^{N_F} d_j\, \psi_{N_p+j}
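Why the decomposition into individual solves works: Δ* is a linear operator, so inserting the expansions of p′ and FF′ into the right-hand side splits it into contributions of the individual basis functions. A short sketch of the argument (notation as above; g_i denotes the right-hand side of the i-th per-basis-function equation, prefactor included):

  \Delta^*\psi
    = \sum_{h=1}^{N_p} c_h\, g_h(r,\psi_{\mathrm{old}})
    + \sum_{k=1}^{N_F} d_k\, g_{N_p+k}(r,\psi_{\mathrm{old}})
    = \sum_{h=1}^{N_p} c_h\, \Delta^*\psi_h
    + \sum_{k=1}^{N_F} d_k\, \Delta^*\psi_{N_p+k}
    = \Delta^*\Big( \sum_{h=1}^{N_p} c_h\,\psi_h + \sum_{k=1}^{N_F} d_k\,\psi_{N_p+k} \Big)

so that, with matching boundary conditions, ψ is the corresponding linear combination of the ψ_i; the weights c_h, d_k still have to be determined, which is the regression step on the next slide.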
Context: Grad-Shafranov equation in equilibrium codes

⊲ obtain the weights c_h, d_k by linear regression with the diagnostic signals m and the computed "response matrix" b:

  \begin{pmatrix} m_1 \\ m_2 \\ \vdots \\ m_{N_m} \end{pmatrix}
  =
  \begin{pmatrix}
    b_1(\psi_1) & b_1(\psi_2) & \dots & b_1(\psi_{N_p+N_F}) \\
    b_2(\psi_1) & b_2(\psi_2) & \dots & b_2(\psi_{N_p+N_F}) \\
    \vdots      & \vdots      &       & \vdots \\
    b_{N_m}(\psi_1) & b_{N_m}(\psi_2) & \dots & b_{N_m}(\psi_{N_p+N_F})
  \end{pmatrix}
  \cdot
  \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_{N_p} \\ d_1 \\ d_2 \\ \vdots \\ d_{N_F} \end{pmatrix}

Summary of the complete procedure
⊲ Picard iteration cycle:
  1) solve the GS equation \Delta^*\psi = -g(r,z) individually for the N_p + N_F basis functions; g(r,z) is computed using ψ of the last iteration step
  2) obtain c_h, d_k from linear regression
  3) compute the new \psi = \sum_{h=1}^{N_p} c_h\, \psi_h + \sum_{k=1}^{N_F} d_k\, \psi_{N_p+k}
⊲ typically 2...10 iteration steps
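Step 2 of the cycle is an ordinary linear least-squares problem. A minimal Fortran 90 sketch (not the GPEC/IDE implementation; the routine name fit_weights is made up, and the actual codes may use a regularized fit), solving it with LAPACK's dgels:

  ! Minimal sketch (not the GPEC/IDE implementation; routine name is made up):
  ! least-squares fit of the basis-function weights, i.e. step 2 of the Picard
  ! cycle, solving  min_x || b*x - m ||_2  with LAPACK's DGELS.
  subroutine fit_weights(nm, nbf, b, m, x)
    implicit none
    integer, intent(in)  :: nm, nbf       ! N_m measurements, N_p+N_F basis functions
    real(8), intent(in)  :: b(nm, nbf)    ! response matrix, entries b_i(psi_j)
    real(8), intent(in)  :: m(nm)         ! diagnostic signals
    real(8), intent(out) :: x(nbf)        ! weights (c_1..c_Np, d_1..d_NF)
    real(8), allocatable :: a(:,:), rhs(:), work(:)
    real(8) :: wquery(1)
    integer :: info, lwork

    allocate(a(nm, nbf), rhs(nm))
    a   = b                               ! DGELS overwrites its arguments
    rhs = m

    ! workspace query, then the actual solve (assumes nm >= nbf)
    call dgels('N', nm, nbf, 1, a, nm, rhs, nm, wquery, -1, info)
    lwork = int(wquery(1))
    allocate(work(lwork))
    call dgels('N', nm, nbf, 1, a, nm, rhs, nm, work, lwork, info)
    if (info /= 0) stop 'dgels failed'

    x = rhs(1:nbf)                        ! first nbf entries hold the solution
    deallocate(a, rhs, work)
  end subroutine fit_weights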
Requirements for computational performance

Real-time control
⊲ AUG machine control cycle: ≃ 1 ms
⊲ assume 10 iteration steps required for convergence
  ⇒ runtime per iteration ≲ 0.1 ms for
  · N_p + N_F ≃ 8 calls of the GS solver
  · linear regression, calculation of the response matrix, ...

Offline analysis
⊲ as fast as possible
⊲ goals: allow significantly larger numerical resolution, enable new data-analysis methodologies

Basic strategy
⊲ starting point: the GS solver dominates the runtime; GEC/CLISTE: O(ms) for a single basis function (N_p + N_F = 1)
⊲ achieve performance gains by parallelization
Excursus: why parallel?

Moore's law, reinterpreted
⊲ Moore's law is holding, in its original terms of the number of transistors:
  · transistors on a chip still doubling every 18 months at constant cost
  · V ~ l, I ~ l ⇒ P ~ l² (constant power density, Dennard scaling)
  · but: the era of decades of exponential clock-rate growth has ended (around 2005)
⊲ Moore's law "reinterpreted":
  · the number of cores per chip doubles every 18 months instead of the clock rate
  ⇒ performance improvements now come from the increase in the number of cores on a processor, plus other levels of parallelism (e.g. vector instructions)
  ⇒ the free (performance) lunch is over
⊲ Future: 100...1000 cores on a chip?

References:
◦ H. Sutter: The Free Lunch Is Over. A Fundamental Turn Toward Concurrency in Software, http://www.gotw.ca/publications/concurrency-ddj.htm
◦ H. Esmaeilzadeh et al.: Dark Silicon and the End of Multicore Scaling, http://doi.acm.org/10.1145/2000064.2000108
Excursus: parallel challenges

Typical architecture of a present-day x86_64 cluster
[Figure: cluster topology — nodes (Node 0, Node 1, Node 2, ...) connected by a network (ETH, IB, ...); each node holds 4 multicore CPU sockets; each socket shown with 2 cores and 4 hardware threads]

⊲ multiple nodes (1...1000): connected by Ethernet or InfiniBand networks
⊲ multiple sockets (CPUs) per node (2...4): connected by QPI, HT (NUMA!)
⊲ multiple cores per CPU (2...16)
⊲ multiple hyper-threads per core (2)
⊲ multiple instructions per cycle (2...4 DP ops; SSE: 128 bit, AVX: 256 bit)
Parallelization strategies

Basic considerations
⊲ exploit fine-grained parallelism (spatial grid) within the multi-core CPU: threads, SIMD vectorization (GPEC)
⊲ exploit coarse-grained parallelism (basis functions) across multiple CPUs (IDE)
  ⇒ fits well with the program structure
  ⇒ flexibility of a hybrid MPI (tasks) / OpenMP (threads) approach, examples:
  · distribute 8 BFs across 8 quad-core CPUs, each GPEC call uses all 4 CPU cores
  · distribute 8 BFs across 2 16-core CPUs, each GPEC call uses only 8 CPU cores

Communication estimates
⊲ required bandwidth for communicating a 32 × 64 array of double-precision numbers (corresponding to a single ψ_k): 32 × 64 · 8 Byte / 0.01 ms ≃ 1.6 GB/s
⊲ inter-node (InfiniBand: QDR: 1 GB/s, FDR: 1.7 GB/s)
⊲ intra-node (STREAM, 4-socket: ≃ 5 GB/s, 2-socket: ≃ 10 GB/s)
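A minimal hybrid MPI/OpenMP sketch of this coarse-grained distribution (illustrative only; names such as solve_gs are hypothetical and not taken from GPEC/IDE): each MPI task owns a subset of basis functions, the solver call can use OpenMP threads internally, and the per-basis-function flux maps are then made available to all tasks for the regression step.

  ! Hybrid MPI/OpenMP sketch of the coarse-grained distribution of basis
  ! functions (illustrative only; names are hypothetical, not from IDE).
  program distribute_basis_functions
    use mpi
    implicit none
    integer, parameter :: nbf = 8, m = 32, n = 64   ! basis functions, grid size
    real(8) :: psi_local(m, n, nbf)                 ! solutions computed by this task
    real(8) :: psi_all(m, n, nbf)                   ! all solutions after the gather
    integer :: ierr, rank, ntasks, k

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)

    psi_local = 0.0d0
    ! Round-robin assignment: task "rank" solves basis functions rank+1, rank+1+ntasks, ...
    do k = rank + 1, nbf, ntasks
       call solve_gs(psi_local(:, :, k))            ! fine-grained (OpenMP/SIMD) parallelism inside
    end do

    ! Make every task see all psi_k for the regression step.
    ! (A sum-reduction works because each slice is written by exactly one task.)
    call MPI_Allreduce(psi_local, psi_all, m*n*nbf, MPI_DOUBLE_PRECISION, &
                       MPI_SUM, MPI_COMM_WORLD, ierr)

    call MPI_Finalize(ierr)

  contains

    subroutine solve_gs(psi)
      real(8), intent(out) :: psi(m, n)
      integer :: j
      ! Placeholder for the threaded GPEC solver call; here just a dummy fill.
      !$omp parallel do
      do j = 1, n
         psi(:, j) = 1.0d0
      end do
      !$omp end parallel do
    end subroutine solve_gs

  end program distribute_basis_functions

With one basis function per task this reduces to the 8-BF examples above; the single collective per iteration is the communication step whose bandwidth requirement is estimated above.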
GPEC: discretization

GS equation: \Delta^*\psi = -g(r, z)

⊲ standard finite-difference discretization on an (M × N) grid in (r, z)
⊲ five-point stencil for \Delta^* := r \frac{\partial}{\partial r}\frac{1}{r}\frac{\partial}{\partial r} + \frac{\partial^2}{\partial z^2}:

  s_i^+ \psi_{i+1,j} + \delta_i \psi_{i,j} + s_i^- \psi_{i-1,j} + r_i^{-1}(\psi_{i,j+1} + \psi_{i,j-1}) = \rho_{i,j}

[Figure: the (M × N) computational grid, with radial index i = 0 ... M along r and vertical index j = 0 ... N along z]

⊲ solution methods for the linear system:
  · Block-Cyclic Reduction: GEC
  · Fourier analysis: GPEC
  · FACR, multigrid, ...
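The coefficients s_i^±, δ_i are not spelled out on the slide. For orientation, one standard second-order choice on a uniform (Δr, Δz) grid, scaled so that the z-neighbours carry exactly the factor r_i^{-1} as above, would be (an assumption for illustration, not necessarily the GPEC convention):

  s_i^{\pm} = \frac{\Delta z^2}{\Delta r^2\, r_{i \pm 1/2}}, \qquad
  \delta_i = -\left[ \frac{\Delta z^2}{\Delta r^2}\left(\frac{1}{r_{i+1/2}} + \frac{1}{r_{i-1/2}}\right) + \frac{2}{r_i} \right], \qquad
  \rho_{i,j} = -\frac{\Delta z^2}{r_i}\, g(r_i, z_j),
  \quad\text{with } r_{i \pm 1/2} := \tfrac{1}{2}(r_i + r_{i \pm 1})

With this choice, the sine transform of the next slide turns the z-coupling into a contribution to the diagonal, \hat{\delta}_{i,l} = \delta_i + (2/r_i)\cos(\pi l / N), which is why the transformed diagonal carries the mode index l.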
GPEC: discretization & FA solution method

Fourier analysis: applying a Discrete Sine Transform (DST) decouples the linear system in one dimension:

  s_i^+ \psi_{i+1,j} + \delta_i \psi_{i,j} + s_i^- \psi_{i-1,j} + r_i^{-1}(\psi_{i,j+1} + \psi_{i,j-1}) = \rho_{i,j}

  apply \quad \xi_{i,j} = \sqrt{\tfrac{2}{N}} \sum_{l=1}^{N-1} \hat{\xi}_{i,l} \sin(\pi j l / N), \qquad \xi := \rho, \psi

  \Rightarrow \quad s_i^+ \hat{\psi}_{i+1,l} + \hat{\delta}_{i,l} \hat{\psi}_{i,l} + s_i^- \hat{\psi}_{i-1,l} = \hat{\rho}_{i,l}

⊲ Basic solution scheme (since the 1970s):
  1. compute M iDSTs of size N ⇒ \hat{\rho}
  2. solve N tridiagonal systems of size M × M ⇒ \hat{\psi}
  3. compute M DSTs of size N ⇒ \psi
⊲ optimized FFTs and tridiagonal solver
⊲ fully parallelizable over M, N
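A compact Fortran 90 sketch of this scheme (illustrative only, not GPEC code): the decoupled tridiagonal systems of step 2 are solved with the Thomas algorithm and parallelized over the Fourier index with OpenMP, and a naive O(N²) sine transform, written directly from the definition above, stands in for the optimized DSTs of steps 1 and 3. The coefficient arrays sp, sm, dhat are assumed to be precomputed.

  ! Sketch of the Fourier-analysis solver kernel (illustrative, not GPEC code).
  module fa_kernel
    implicit none
  contains

    ! Solve  sp(i)*x(i+1) + dhat(i,l)*x(i) + sm(i)*x(i-1) = rhohat(i,l)
    ! for every Fourier mode l with the Thomas algorithm.
    subroutine solve_tridiag_all(m, n, sp, sm, dhat, rhohat, psihat)
      integer, intent(in)  :: m, n
      real(8), intent(in)  :: sp(m), sm(m), dhat(m, n-1), rhohat(m, n-1)
      real(8), intent(out) :: psihat(m, n-1)
      real(8) :: cp(m), dp(m), denom
      integer :: i, l
      !$omp parallel do private(i, cp, dp, denom)
      do l = 1, n-1                      ! decoupled Fourier modes
         cp(1) = sp(1) / dhat(1, l)
         dp(1) = rhohat(1, l) / dhat(1, l)
         do i = 2, m                     ! forward elimination
            denom = dhat(i, l) - sm(i)*cp(i-1)
            cp(i) = sp(i) / denom
            dp(i) = (rhohat(i, l) - sm(i)*dp(i-1)) / denom
         end do
         psihat(m, l) = dp(m)            ! back substitution
         do i = m-1, 1, -1
            psihat(i, l) = dp(i) - cp(i)*psihat(i+1, l)
         end do
      end do
      !$omp end parallel do
    end subroutine solve_tridiag_all

    ! Naive sine transform along j, as defined on the slide; a production
    ! code would call an optimized DST (e.g. from an FFT library) instead.
    subroutine dst_naive(m, n, xin, xout)
      integer, intent(in)  :: m, n
      real(8), intent(in)  :: xin(m, n-1)
      real(8), intent(out) :: xout(m, n-1)
      real(8), parameter :: pi = 3.141592653589793d0
      integer :: i, j, l
      !$omp parallel do private(i, l)
      do j = 1, n-1
         do i = 1, m
            xout(i, j) = sqrt(2.0d0/n) * sum( (/ (xin(i, l)*sin(pi*j*l/n), l = 1, n-1) /) )
         end do
      end do
      !$omp end parallel do
    end subroutine dst_naive

  end module fa_kernel

Each tridiagonal solve and each transform line is independent, which is the fine-grained parallelism over M and N mentioned above.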
More details of the complete GPEC algorithm

Complete algorithm (von Hagenow & Lackner, 1975)
⊲ additional "inner" iteration for taming the Dirichlet boundary condition:
  1. solve \Delta^*\psi_0 = -g(r, z) in R, with \psi_0 = 0 on \partial R
  2. compute ψ on \partial R using the Green's function:
     \psi_1(r) = -\oint_{\partial R} \frac{1}{r^*} \frac{\partial \psi_0(r^*)}{\partial n}\, G(r, r^*)\, \mathrm{d}s^*
  3. solve \Delta^*\psi = -g(r, z) using \psi_1 on \partial R
  ⇒ apply the (iDST → tridiagonal solve → DST) cycle twice
  ⇒ in between, save some work by transforming only the "boundary" values (cf. Giannone et al. 2010)

Main algorithmic steps
  dst_1:      M inverse discrete sine transforms (iDSTs) of size N
  trid_1:     solution of N tridiagonal linear systems of size M
  dst_spare1: 2 DSTs of size N plus 2 × (M − 2) scalar products of size N for computing 2(N + M − 2) Fourier coefficients
  matvec:     1 matrix-vector multiplication of rank 2(N + M) × 2(N + M)
  dst_spare2: 2 iDSTs of size N plus a corresponding rank-1 update of an N × M matrix
  trid_2:     solution of N tridiagonal linear systems of size M
  dst_2:      M DSTs of size N
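For orientation, the asymptotic cost per GS solve implied by these steps, assuming FFT-based DSTs (a rough sketch, not a GPEC measurement):

  T_{\mathrm{solve}} \;\sim\;
    \underbrace{2M \cdot \mathcal{O}(N \log N)}_{\mathrm{dst}_1,\,\mathrm{dst}_2}
  + \underbrace{2 \cdot \mathcal{O}(M N)}_{\mathrm{trid}_1,\,\mathrm{trid}_2}
  + \underbrace{\mathcal{O}\big((N+M)^2\big)}_{\mathrm{matvec}}
  + \underbrace{\mathcal{O}(M N) + \mathcal{O}(N \log N)}_{\mathrm{dst\_spare}_1,\,\mathrm{dst\_spare}_2}

Every term is independent across transform lines, Fourier modes, or matrix rows, which is the fine-grained parallelism GPEC exploits.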