Application Acceleration on Current and Future Cray Platforms
Alice Koniges, NERSC, Berkeley Lab; David Eder, Lawrence Livermore National Laboratory (speakers)
Robert Preissl, Jihan Kim (NERSC LBL), Aaron Fisher, Nathan Masters, Velimir Mlaker (LLNL), Stephan Ethier, Weixing Wang (PPPL), Martin Head-Gordon (UC Berkeley), Nathan Wichmann (Cray Inc.)
Cray User Group Meeting, May 2010
Various means of application speedup are described for 3 different codes
• GTS – magnetic fusion particle-in-cell code
  – Already optimized and hybrid (MPI + OpenMP)
  – Consider advanced hybrid techniques to overlap communication and computation
• QChem – computational chemistry
  – Optimization for GPUs and accelerators
• ALE-AMR – hydro/materials/radiation
  – Multiphysics code with MPI-everywhere model
  – Library speedup
  – Is the code appropriate for hybrid?
  – Experiences with automatic parallelization tools
GTS is a massively parallel magnetic fusion application
• Gyrokinetic Tokamak Simulation (GTS) code
• Global 3D Particle-In-Cell (PIC) code to study microturbulence and transport in magnetically confined fusion plasmas of tokamaks
• Microturbulence: a very complex, nonlinear phenomenon; key in determining instabilities of the magnetic confinement of plasmas
• GTS: highly optimized Fortran90 (+C) code
• Massively parallel hybrid parallelization (MPI + OpenMP): tested on today's largest computers (Earth Simulator, IBM BG/L, Cray XT)
PIC: follow trajectories of charged particles in electromagnetic fields
• Scatter: compute the charge density at each grid point arising from neighboring particles
• Poisson's equation: compute the field potential (solved on a 2D poloidal plane)
• Gather: calculate the force on each particle from the electric potential
• Push: move particles in time according to the equations of motion
• Repeat (a skeleton of this loop is sketched below)
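A minimal C sketch of the PIC time loop as listed above. The routine names (scatter_charge, solve_poisson, gather_forces, push_particles, shift_particles) are placeholders, not the actual GTS kernels:

```c
#include <stdio.h>

/* Placeholder routines standing in for the real GTS kernels. */
static void scatter_charge(void)  { /* deposit particle charge on the grid            */ }
static void solve_poisson(void)   { /* field potential on the 2D poloidal plane       */ }
static void gather_forces(void)   { /* interpolate the electric field to particles    */ }
static void push_particles(void)  { /* advance particles via the equations of motion  */ }
static void shift_particles(void) { /* move particles that left their toroidal domain */ }

int main(void)
{
    const int nsteps = 100;              /* illustrative number of time steps */
    for (int step = 0; step < nsteps; ++step) {
        scatter_charge();                /* 1. scatter                        */
        solve_poisson();                 /* 2. Poisson solve                  */
        gather_forces();                 /* 3. gather                         */
        push_particles();                /* 4. push                           */
        shift_particles();               /* 5. shift between toroidal domains */
    }
    printf("completed %d PIC steps\n", nsteps);
    return 0;
}
```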
The parallel model of GTS has three independent levels
• One-dimensional (1D) domain decomposition in the toroidal direction. The 5th PIC step shifts particles between toroidal domains (MPI; limited to 128 planes); particles may move to adjacent or even to more distant toroidal domains.
• Particles are divided between MPI processes within each toroidal domain: every process keeps a copy of the local grid, requiring the processes within a domain to sum their contributions to the total grid charge density.
• OpenMP compiler directives on heavily used loop regions exploit shared-memory capabilities (a sketch of the communicator setup follows).
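A minimal sketch of how the two MPI levels could be set up with communicator splits. The communicator and variable names here are illustrative, not the actual GTS names; one communicator serves the particle shift along the torus, the other the charge-density sum within a plane:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Illustrative decomposition: ntoroidal planes (<= 128 in GTS),
       with size/ntoroidal processes sharing each plane. */
    const int ntoroidal = 64;
    int plane  = rank % ntoroidal;   /* which toroidal domain I belong to */
    int within = rank / ntoroidal;   /* my index inside that domain       */

    MPI_Comm plane_comm;     /* processes sharing one poloidal plane        */
    MPI_Comm toroidal_comm;  /* same within-plane index, neighboring planes */
    MPI_Comm_split(MPI_COMM_WORLD, plane,  within, &plane_comm);
    MPI_Comm_split(MPI_COMM_WORLD, within, plane,  &toroidal_comm);

    /* Level 2: every process holds a copy of the local grid, so the total
       charge density is a sum over the plane communicator. */
    double rho_local[4] = {1, 2, 3, 4}, rho_total[4];
    MPI_Allreduce(rho_local, rho_total, 4, MPI_DOUBLE, MPI_SUM, plane_comm);

    /* Level 1: particle shifts between toroidal domains would use
       point-to-point messages on toroidal_comm (not shown). */

    MPI_Comm_free(&plane_comm);
    MPI_Comm_free(&toroidal_comm);
    MPI_Finalize();
    return 0;
}
```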
Two different hybrid models in GTS: traditional OpenMP worksharing constructs and OpenMP tasks
OpenMP tasks enable us to overlap MPI communication with independent computation, so the overall runtime can be reduced by roughly the cost of the MPI communication (the two models are contrasted in the sketch below).
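A schematic C sketch of the two models, not GTS code; work() and communicate() are placeholders. With worksharing, all threads execute the loop and the MPI call runs afterwards with the rest of the team idle; with tasking, one thread issues the MPI call while the others process the deferred tasks:

```c
#include <mpi.h>

/* Placeholders: independent particle work and an MPI call to be overlapped. */
static void work(int chunk)         { (void)chunk; /* ... compute ... */ }
static void communicate(MPI_Comm c) { MPI_Barrier(c); /* stand-in for a collective */ }

/* Model 1: traditional worksharing -- compute first, then communicate;
   all threads are idle while the single MPI call runs. */
void worksharing_version(MPI_Comm comm, int nchunks)
{
    #pragma omp parallel for
    for (int i = 0; i < nchunks; ++i)
        work(i);
    communicate(comm);
}

/* Model 2: OpenMP tasks -- the master creates deferred tasks and issues the
   MPI call immediately, so communication overlaps the particle work. */
void tasking_version(MPI_Comm comm, int nchunks)
{
    #pragma omp parallel
    {
        #pragma omp master
        {
            for (int i = 0; i < nchunks; ++i) {
                #pragma omp task firstprivate(i)
                work(i);
            }
            communicate(comm);   /* runs while other threads execute the tasks */
        }
    }   /* implicit barrier: all tasks are finished here */
}
```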
Overlapping communication with computation in the GTS shift routine is possible due to data-independent code sections
[Figure: three data-independent sections of the GTS shift routine]
Work on the particle array (packing for sending, reordering, adding after sending) can be overlapped with data-independent MPI communication using OpenMP tasks.
Reducing the limitations of single-threaded execution (MPI communication) can be achieved with OpenMP tasks
Overlapping MPI_Allreduce with particle work:
• Overlap: the master thread encounters the tasking statements (!$omp master) and creates work for the thread team for deferred execution; the MPI_Allreduce call is then executed immediately.
• The MPI implementation has to support at least MPI_THREAD_FUNNELED.
• Subdividing the tasks into smaller chunks allows better load balancing and scalability among threads (see the sketch below).
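A minimal C sketch of this pattern (GTS itself is Fortran90, and the routine and array names below are illustrative): the runtime is initialized with at least MPI_THREAD_FUNNELED, the master thread creates small task chunks for the particle work, then immediately issues the collective so it runs while the rest of the team drains the task queue:

```c
#include <mpi.h>
#include <stdlib.h>

/* Placeholder for independent particle work on one chunk of the array. */
static void particle_work(double *p, int lo, int hi)
{
    for (int i = lo; i < hi; ++i)
        p[i] *= 1.000001;
}

int main(int argc, char **argv)
{
    int provided;
    /* Only the master thread makes MPI calls -> FUNNELED is sufficient. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    const int n = 1 << 20, chunk = 1 << 14;    /* illustrative sizes */
    double *particles = calloc(n, sizeof *particles);
    double local[8] = {0}, global[8];

    #pragma omp parallel
    {
        #pragma omp master
        {
            /* Subdivide the particle work into small chunks for load balance. */
            for (int lo = 0; lo < n; lo += chunk) {
                int hi = lo + chunk < n ? lo + chunk : n;
                #pragma omp task firstprivate(lo, hi)
                particle_work(particles, lo, hi);
            }
            /* Executed immediately by the master thread while the rest of
               the team works on the deferred tasks above. */
            MPI_Allreduce(local, global, 8, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        }
    }   /* implicit barrier: all tasks complete here */

    free(particles);
    MPI_Finalize();
    return 0;
}
```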
Further communication overlap can be achieved with OpenMP tasks exploiting data-independent code regions
• Overlapping particle reordering: reordering the remaining particles and adding the shifted particles into the array can be executed independently of sending and receiving the shifted particles.
• Overlapping the remaining MPI_Sendrecv (sketched below).
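The same tasking pattern applies to the point-to-point part of the shift. In this illustrative C sketch (again not the actual GTS code; reorder_chunk, the buffer sizes, and the neighbor ranks are assumptions), the master posts the MPI_Sendrecv for the shifted particles while the remaining particles are reordered by tasks:

```c
#include <mpi.h>

/* Placeholder: reorder one block of the particle array. */
static void reorder_chunk(int c) { (void)c; /* ... move particles ... */ }

void shift_and_reorder(double *sendbuf, double *recvbuf, int nshift,
                       int left, int right, MPI_Comm toroidal_comm,
                       int nchunks)
{
    #pragma omp parallel
    {
        #pragma omp master
        {
            for (int c = 0; c < nchunks; ++c) {
                #pragma omp task firstprivate(c)
                reorder_chunk(c);        /* fill holes left by shifted particles */
            }
            /* Master exchanges shifted particles with the neighboring toroidal
               domains while the tasks above run on the other threads. */
            MPI_Sendrecv(sendbuf, nshift, MPI_DOUBLE, right, 0,
                         recvbuf, nshift, MPI_DOUBLE, left,  0,
                         toroidal_comm, MPI_STATUS_IGNORE);
        }
    }   /* implicit barrier: received particles can now be added to the array */
}
```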
OpenMP tasking version outperforms the original shifter, especially on larger poloidal domains
[Figure: wall-clock time ("Time (secs)") of the original vs. OpenMP tasking shifter, broken down by phase (including the Allreduce and Sendrecv steps), for the 256 and 2048 MPI process runs]
• Performance breakdown of the GTS shifter routine using 4 OpenMP threads per MPI process with varying domain decomposition and particles per cell on Franklin, a Cray XT4.
• MPI communication in the shift phase uses a toroidal MPI communicator (of constant size 128).
• Note, however, the performance differences between the 256 MPI process run and the 2048 MPI process run.
• The speed-up is expected to be higher on larger GTS runs with hundreds of thousands of CPUs, since MPI communication becomes more expensive.
Early experiments that overlap communication with communication are promising for future HPC systems
[Figure: wall time of the original vs. overlapped execution of two consecutive MPI_Allreduce operations on Hopper for different numbers of MPI processes and OpenMP threads per MPI process]
• Overlapping MPI communication with other consecutive, data-independent MPI communication
• Here: iterative execution of two consecutive MPI_Allreduce calls with small and larger messages on Hopper, a Cray XT5 (one way to realize this is sketched below)
• The GTS shifter and pusher routines contain such consecutive MPI communication
• Overlapping the MPI_Allreduce with larger messages (~1K bytes) pays off when the ratio of threads/sockets per node is reasonable
• Future HPC systems are expected to have many communication channels per node
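One way to realize such communication/communication overlap is to let two threads issue the independent collectives concurrently on separate communicators; unlike the task-based overlap, this requires MPI_THREAD_MULTIPLE support. A minimal C sketch (the message sizes and communicator handling are illustrative):

```c
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided;
    /* Concurrent MPI calls from different threads need THREAD_MULTIPLE. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* Give each collective its own communicator so they cannot interfere. */
    MPI_Comm comm_small, comm_large;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm_small);
    MPI_Comm_dup(MPI_COMM_WORLD, &comm_large);

    double small_in[4]   = {0}, small_out[4];     /* small message   */
    double large_in[128] = {0}, large_out[128];   /* ~1 KB message   */

    #pragma omp parallel num_threads(2)
    {
        /* The two data-independent reductions proceed at the same time. */
        if (omp_get_thread_num() == 0)
            MPI_Allreduce(small_in, small_out, 4, MPI_DOUBLE, MPI_SUM,
                          comm_small);
        else
            MPI_Allreduce(large_in, large_out, 128, MPI_DOUBLE, MPI_SUM,
                          comm_large);
    }

    MPI_Comm_free(&comm_small);
    MPI_Comm_free(&comm_large);
    MPI_Finalize();
    return 0;
}
```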
Reducing the overhead of single-threaded execution is essential for massively parallel (hybrid) codes
• The overhead of MPI communication increases when scaling applications to large numbers of MPI processes (collective MPI communication)
• Adding OpenMP compiler directives to heavily used loops can exploit shared-memory capabilities
• Overlapping MPI communication with independent computation via the new OpenMP tasking model makes use of idle cores
• Overlapping MPI communication with independent, consecutive MPI communication may be another way to reduce MPI overhead, especially on future HPC systems with many communication channels per node
Q-Chem: computational chemistry can accurately model molecular structures
• Q-Chem is used to model carbon capture (i.e., the reactivity of CO2 with other materials)
• Quantum calculations accurately predict molecular equilibrium structures (used as input to classical molecular dynamics/Monte Carlo simulations)
• RI-MP2: resolution-of-the-identity second-order Møller-Plesset perturbation theory
  – Treat the correlation energy with 2nd-order Møller-Plesset theory
  – Utilize auxiliary basis sets to approximate atomic orbital densities
  – Strengths: no self-interaction problem (unlike DFT), recovers 80-90% of the correlation energy
  – Weakness: fifth-order computational dependence on system size (expensive)
• Goal: accelerate the RI-MP2 method in Q-Chem
• Q-Chem RI-MP2 requirements: quadratic memory, cubic storage, quartic I/O, quintic computation
The dominant computational steps are fifth-order RI-MP2 routines
• The RI-MP2 routine is largely divided into seven major steps
• Test input molecules: glycine-n
• As the system size increases, step 4 becomes the dominant contribution to the wall time (e.g., for glycine-16, 83% of the total wall time is spent in step 4)
• Reason: step 4 contains three quintic computation routines (BLAS3 matrix multiplications) and a quartic I/O read
• Goal: optimize step 4
[Figure: wall time in seconds per step, measured on the Greta cluster (M. Head-Gordon group), AMD quad-core Opterons]
The GPU and the CPU are significantly different
• GPU: graphics processing unit
• The GPU devotes more transistors to data computation (the CPU devotes more to cache and flow control)
• Hence the interest in GPUs for high-performance computing
• We use CUDA (Compute Unified Device Architecture), the parallel architecture developed by NVIDIA
• Step 4: CUDA matrix-matrix multiplications (~75 GFLOPS on Tesla, ~225 GFLOPS on Fermi in double precision)
• Concurrently execute CPU and GPU routines
CPU and GPU can work together to produce a fast algorithm
• Step 4 CPU algorithm: T_tot ≈ T_load + T_mm1 + T_mm2 + T_mm3 + T_rest
• Step 4 CPU+GPU algorithm: T_tot ≈ max(T_load, T_mm1) + T_mm3 + max(T_mm2, T_copy) + T_rest
(a host-side sketch of this overlap follows)
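A minimal host-side C sketch of this kind of overlap, assuming the legacy CUBLAS API that shipped with the CUDA 2.x toolkit; the matrix names, sizes, and the load_next_block() I/O routine are illustrative, not Q-Chem code. Because cublasDgemm returns before the GPU finishes, the CPU can perform the quartic I/O for the next block while the GPU multiplies the current one; cublasGetMatrix then blocks until the product is ready, giving max(T_load, T_mm) instead of T_load + T_mm. (cublasInit()/cublasShutdown() are assumed to be called elsewhere.)

```c
#include <cublas.h>   /* legacy CUBLAS API (CUDA 2.x era) */

/* Illustrative placeholder standing in for the disk read of the next block. */
static void load_next_block(double *host_buf, int n)
{
    for (int i = 0; i < n * n; ++i)
        host_buf[i] = 0.0;
}

/* One overlapped iteration: the GPU computes C = A*B for the current block
   while the CPU loads the next block from disk. */
void overlapped_step(const double *A, const double *B, double *C,
                     double *next_block, int n)
{
    double *dA, *dB, *dC;
    cublasAlloc(n * n, sizeof(double), (void **)&dA);
    cublasAlloc(n * n, sizeof(double), (void **)&dB);
    cublasAlloc(n * n, sizeof(double), (void **)&dC);

    /* Blocking host->device copies of the current block. */
    cublasSetMatrix(n, n, sizeof(double), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(double), B, n, dB, n);

    /* The GEMM launch returns immediately; the GPU works in the background. */
    cublasDgemm('N', 'N', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n);

    /* Meanwhile the CPU performs the I/O for the next block. */
    load_next_block(next_block, n);

    /* Blocks until the GEMM has finished, then copies the result back. */
    cublasGetMatrix(n, n, sizeof(double), dC, n, C, n);

    cublasFree(dA);
    cublasFree(dB);
    cublasFree(dC);
}
```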
I/O bottleneck is a concern for the accelerated RI-MP2 code
• Tesla/Turing (TnT): NERSC GPU testbed
  – Sun SunFire x4600 servers
  – AMD quad-core ("Shanghai") processors, 4 NVIDIA Quadro FX-5800 GPUs (4 GB memory)
  – CUDA 2.3, gcc 4.4.2, ACML 4.3.0
• Franklin: NERSC Cray XT4 system
  – 2.3 GHz quad-core AMD Opteron
• RI-MP2 wall time (seconds): Franklin 4945; TnT (CPU) 6542; TnT (GPU) 1405; TnT (GPU, better I/O) 600 to 800 (?)
• 4.7x improvement, and more on systems with better I/O