Using a Hybrid Cray Supercomputer to Model Non-Icing Surfaces for Cold-Climate Wind Turbines
Accelerating Three-Body Potentials Using GPUs (NVIDIA Tesla K20X)
Masako Yamada, GE Global Research
Opportunity in Cold-Climate Wind
• Wind energy production capacity > 285 GW and growing
• Cold regions favorable
  • Lower human population
  • Good wind conditions
  • 45-50 GW opportunity from 2013-2017 (~$2 million/MW installed)
• Technical need: anti-icing surfaces
  • 3-10% energy losses due to icing
  • Shut-downs
  • Active heating is expensive
Source: VTT Technical Research Centre of Finland, http://www.vtt.fi/news/2013/28052013_wind_energy.jsp?lang=en
ALCC Awards: 40 + 40 million hours
DOE ASCR Leadership Computing Challenge awards for energy-relevant applications
1. Non-Icing Surfaces for Cold-Climate Wind Turbines
   • Jaguar (Cray XK6) at Oak Ridge National Lab
   • Molecular dynamics using LAMMPS
   • 1-million-molecule mW water droplets on engineered surfaces
   • Completed >300 simulations
   • Achieved >200x speedup from 2011 to 2013
   • >5x from GPU acceleration
2. Accelerated Non-Icing Surfaces for Cold-Climate Wind Turbines
   • Titan (Cray XK7, hybrid) at Oak Ridge National Lab
   • "Time parallelization" via the Parallel Replica method
   • Expected 10-100x faster results
Titan enables leadership-class study
• Size of simulation: ~1 million molecules
  • Droplet size >> critical nucleus size
  • Mimics physical dimensions (somewhat*)
• Duration of simulation: ~1 microsecond
  • Nucleation is an activated process
  • Freezing is rarely observed in MD simulations
• Number of simulations: ~100s
  • Study requires "embarrassingly parallel" runs
  • Different surfaces, ambient temperatures, conductivities
  • Multiple replicates required due to the stochastic nature of nucleation
*A million-molecule droplet is ~50 nm in diameter
Personal history with MD

Year    | Software/Language  | # of Molecules | Hardware
1995    | Pascal             | Few            | Desktop Mac
2000    | C, Fortran90       | Hundreds       | IBM SP, SGI O2K
2010    | NAMD, LAMMPS       | 1000s          | Linux HPC
Present | GPU-enabled LAMMPS | Millions       | Titan
>200x overall speedup since 2011
1. Switched to the mW water potential
   • A 3-body model is more expensive per interaction than a 2-body model, but:
   • Particle reduction – at least 3x
   • Timestep increase – 10x
   • No long-range forces
2. LAMMPS dynamic load balancing – 2-3x
3. GPU acceleration of the 3-body model – 5x
2011: 6 femtoseconds per 1024 CPU-seconds (SPC/E)
2013: 2 picoseconds per 1024 CPU-seconds (mW)
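As a rough consistency check, assuming the individual factors combine multiplicatively: 3 × 10 × (2 to 3) × 5 ≈ 300-450x, which brackets the measured throughput ratio of 2 ps / 6 fs ≈ 330x.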
1. mW water potential
• Stillinger-Weber 3-body potential; one particle = one water molecule
• Introduced in 2009; Nature paper in 2011
• Bulk water properties comparable to or better than existing point-charge models
• Much faster than point-charge models
  • Authors' exemplary test case: 180x faster than SPC/E
  • GE production simulation: 40-50x faster than SPC/E (asymmetric million-molecule droplet on an engineered surface, loaded onto 64 nodes)
(Figure: SPC/E vs. mW)
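For readers unfamiliar with the functional form, here is a minimal Python sketch of how a single mW pair and triplet contribute energy. The parameter values are those commonly quoted for the mW model (Molinero & Moore, 2009); the helper functions are illustrative only, not the production LAMMPS implementation.

```python
import numpy as np

# mW parameters (Molinero & Moore, 2009) -- illustrative; check the original
# paper / the LAMMPS mW.sw file before using in production.
EPS    = 6.189        # kcal/mol
SIGMA  = 2.3925       # Angstrom
A, B   = 7.049556277, 0.6022245584
p, q   = 4.0, 0.0
GAMMA  = 1.2
A_CUT  = 1.8          # cutoff = A_CUT * SIGMA
LAMBDA = 23.15
COS0   = -1.0 / 3.0   # tetrahedral angle

def phi2(r):
    """Two-body SW term; zero at and beyond the cutoff A_CUT*SIGMA."""
    if r >= A_CUT * SIGMA:
        return 0.0
    return (A * EPS * (B * (SIGMA / r) ** p - (SIGMA / r) ** q)
            * np.exp(SIGMA / (r - A_CUT * SIGMA)))

def phi3(ri, rj, rk):
    """Three-body SW term for the angle j-i-k centered on atom i."""
    rij, rik = np.linalg.norm(rj - ri), np.linalg.norm(rk - ri)
    if rij >= A_CUT * SIGMA or rik >= A_CUT * SIGMA:
        return 0.0
    cos_t = np.dot(rj - ri, rk - ri) / (rij * rik)
    return (LAMBDA * EPS * (cos_t - COS0) ** 2
            * np.exp(GAMMA * SIGMA / (rij - A_CUT * SIGMA))
            * np.exp(GAMMA * SIGMA / (rik - A_CUT * SIGMA)))

# Example: three "water molecules" (one SW particle each)
ri = np.array([0.0, 0.0, 0.0])
rj = np.array([2.76, 0.0, 0.0])
rk = np.array([0.0, 2.76, 0.0])
print(phi2(np.linalg.norm(rj - ri)) + phi3(ri, rj, rk))
```

Because one particle replaces a three-site, point-charge water molecule and there is no long-range electrostatics to sum, the extra cost of the 3-body term is more than repaid, which is where the particle-reduction and no-long-range-force factors on the previous slide come from.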
2. LAMMPS dynamic load balance
• Introduced in 2012
• Adjusts the size of processor sub-domains to equalize the number of particles per processor
• 2-3x speedup for 1-million-molecule droplets on 64 nodes (with user-specified processor mapping)
(Figure panels: no load balancing; default load balancing; user-specified mapping)
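A 1-D toy of the idea, not the actual LAMMPS implementation (which works in 3-D via the balance / fix balance commands and supports several balancing styles): place sub-domain boundaries at particle-count quantiles instead of equal-width slabs, so every rank owns roughly the same number of molecules. The droplet-like coordinate distribution below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Droplet-like, non-uniform distribution of particle x-coordinates
x = rng.normal(loc=25.0, scale=5.0, size=100_000)
box_lo, box_hi, nprocs = 0.0, 50.0, 8

# Static decomposition: equal-width slabs (what you get without balancing)
static_edges = np.linspace(box_lo, box_hi, nprocs + 1)
static_counts = np.histogram(x, bins=static_edges)[0]

# "Balanced" decomposition: slab boundaries at particle-count quantiles,
# so every rank owns roughly the same number of particles
balanced_edges = np.quantile(x, np.linspace(0.0, 1.0, nprocs + 1))
balanced_edges[0], balanced_edges[-1] = box_lo, box_hi
balanced_counts = np.histogram(x, bins=balanced_edges)[0]

print("static   max/mean load:", static_counts.max() / static_counts.mean())
print("balanced max/mean load:", balanced_counts.max() / balanced_counts.mean())
```

The max/mean load ratio is what sets the parallel efficiency: with a droplet sitting in a mostly empty box, the busiest rank in the static decomposition does several times the average work, which is consistent with the 2-3x gain quoted above.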
3. GPU acceleration of the 3-body potential
For details, see: W. Michael Brown and Masako Yamada, "Implementing Molecular Dynamics on Hybrid High Performance Computers – Three-Body Potentials," Computer Physics Communications (2013).
Load 1 million molecules on the host/CPU
• 1 million molecules, 64 nodes
• Processor sub-domains correspond to a spatial partitioning of the droplet
• 8 MPI tasks/node; 1 core per paired unit
Per node: ~15,000 molecules
• Host: AMD Opteron 6274 CPU (16 cores)
• Accelerator: NVIDIA Tesla K20X GPU
(Diagram: CPU cores and host memory alongside the GPU kernel, work groups, work items, and private/local/global memory)
Work item = fundamental unit of activity
Parallelization in LAMMPS
• Accelerator: 3-body potential; neighbor lists
• Host: time integration; thermostat/barostat; bond/angle calculations; statistics
Generic 3-body potential

$$V = \sum_i \sum_{j \neq i} \sum_{k > j} \varphi(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k) \quad \text{if } r_{ij} < r_c \text{ and } r_{ik} < r_c, \quad 0 \text{ otherwise}$$

($r_c$ = cutoff; $r_s$ = neighbor skin)

Good candidate for GPU:
1. Occupies the majority of computational time
2. Can be decomposed into independent kernels/work-items

Examples: Stillinger-Weber, MEAM, Tersoff, REBO/AIREBO, bond-order...
Redundant Computation Approach
Atom decomposition
• 1 atom → 1 computational kernel
• Fewest operations (and effective parallelization), but shared-memory access is a bottleneck
Force decomposition
• 1 atom → 3 computational kernels required
• Redundant computations, but reduced shared-memory issues
• Many work-items = more effective use of cores
(A schematic comparison follows the next slide.)
Stillinger-Weber Parallelization

$$V = \sum_i \sum_{j < i} \varphi_2(r_{ij}) + \sum_i \sum_{j \neq i} \sum_{k > j} \varphi_3(r_{ij}, r_{ik}, \theta_{jik})$$

Three kernels:
• 2-body operations
• 3-body operations with $(r_{ij} < r_s)$ .AND. $(r_{ik} < r_s)$ == .TRUE.: no data dependencies; update forces on atom $i$ only
• 3-body operations with $(r_{ij} < r_s)$ .AND. $(r_{ik} < r_s)$ == .FALSE.: neighbor-of-neighbor interactions
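Below is a schematic Python analogue of the two decomposition strategies, assuming a precomputed, symmetric neighbor list; the triplet energy and its finite-difference forces are placeholders standing in for the analytic Stillinger-Weber gradients, and the outer loops stand in for what become independent GPU work-items. The point is the write pattern: atom decomposition evaluates each triplet once but writes to three atoms, while force decomposition writes only to its own atom at the cost of recomputing triplets (and of the neighbor-of-neighbor loop discussed on the next slide).

```python
import numpy as np

def triplet_forces(pos, i, j, k):
    """Forces on (i, j, k) from one 3-body term centered on i.
    For illustration we use a simple (cos(theta_jik) + 1/3)^2 angle energy
    and differentiate it numerically; the real kernels evaluate the analytic
    Stillinger-Weber gradients."""
    def energy(p):
        u, v = p[j] - p[i], p[k] - p[i]
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return (c + 1.0 / 3.0) ** 2
    f, h = np.zeros_like(pos), 1e-6
    for a in (i, j, k):
        for d in range(3):
            q = pos.copy()
            q[a, d] += h
            f[a, d] = -(energy(q) - energy(pos)) / h
    return f[i], f[j], f[k]

def atom_decomposition(pos, neigh):
    """One work-item per central atom i: every triplet is evaluated once,
    but the work-item must also update f[j] and f[k] (shared writes)."""
    f = np.zeros_like(pos)
    for i in range(len(pos)):                       # one work-item per i
        nb = neigh[i]
        for a in range(len(nb)):
            for b in range(a + 1, len(nb)):
                fi, fj, fk = triplet_forces(pos, i, nb[a], nb[b])
                f[i] += fi
                f[nb[a]] += fj
                f[nb[b]] += fk                      # writes to 3 atoms
    return f

def force_decomposition(pos, neigh):
    """One work-item per atom t, writing only f[t]. Each triplet is
    recomputed up to three times (redundant work) in exchange for
    conflict-free writes; note the neighbor-of-neighbor loop."""
    f = np.zeros_like(pos)
    for t in range(len(pos)):                       # one work-item per t
        nb = neigh[t]
        for a in range(len(nb)):                    # triplets centered on t
            for b in range(a + 1, len(nb)):
                f[t] += triplet_forces(pos, t, nb[a], nb[b])[0]
        for j in neigh[t]:                          # triplets centered on a
            for k in neigh[j]:                      # neighbor j containing t
                if k != t:
                    f[t] += triplet_forces(pos, j, t, k)[1]
    return f

# Tiny random test system with a symmetric, cutoff-based neighbor list
rng = np.random.default_rng(1)
pos = rng.uniform(0.0, 4.0, size=(6, 3))
neigh = [[j for j in range(len(pos))
          if j != i and np.linalg.norm(pos[j] - pos[i]) < 3.0]
         for i in range(len(pos))]
assert np.allclose(atom_decomposition(pos, neigh), force_decomposition(pos, neigh))
```

Both functions produce the same total forces; force decomposition simply trades extra arithmetic for independent, conflict-free writes, which maps much better onto thousands of GPU work-items.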
Neighbor List
• The 3-body force-decomposition approach involves neighbor-of-neighbor operations
• Requires additional overhead:
  • Increase in the border size shared by two processes
  • Neighbor lists for ghost atoms "straddling" across cores
• The GPU implementation is not necessarily faster than the CPU, but less time is spent in host-accelerator data transfer (note: neighbor lists are huge)
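A 1-D toy of why the border region grows, assuming a slab decomposition; the cutoff, skin, and the roughly doubled ghost region are placeholders chosen for illustration, not the production settings.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 100.0, 2000))   # global coordinates (1-D toy)
cutoff, skin = 4.3, 1.0                      # illustrative values only
lo, hi = 40.0, 60.0                          # slab owned by this MPI rank

owned = x[(x >= lo) & (x < hi)]

# Pair-style (2-body) communication: ghosts within one cutoff+skin of the slab
ghost_pair = x[((x >= lo - (cutoff + skin)) & (x < lo)) |
               ((x >= hi) & (x < hi + (cutoff + skin)))]

# The 3-body force decomposition also needs neighbor lists *for the ghosts*
# (neighbor-of-neighbor), so the border region roughly doubles
ghost_3body = x[((x >= lo - 2 * (cutoff + skin)) & (x < lo)) |
                ((x >= hi) & (x < hi + 2 * (cutoff + skin)))]

print(len(owned), "owned;", len(ghost_pair), "ghosts (pair);",
      len(ghost_3body), "ghosts (3-body)")
```

Building these larger ghost neighbor lists on the GPU is not always faster than on the CPU, but it avoids shipping the (very large) lists across the PCIe bus every rebuild, which is where the time is actually saved.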
GPU acceleration benefit
• >5x speedup achieved in a production simulation: a 1-million-molecule water droplet on an engineered surface (64 nodes)
• Not limited to Stillinger-Weber; applicable to MEAM, Tersoff, REBO, AIREBO, bond-order potentials, etc.
Implementation
6 different surfaces
Interaction potential developed at GE Global Research
Freezing front propagation
Visualization of "latent heat" release
Visualizing crystalline regions
• Particle mobility
• Steinhardt-Nelson order parameter
(Views: side and bottom)
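A minimal per-particle Steinhardt q_l sketch (l = 6 is a common choice for separating ice-like from liquid-like molecules), assuming precomputed neighbor lists; neighbor criteria and normalization conventions vary between papers, so treat this as illustrative rather than the exact order parameter used in the study.

```python
import numpy as np
from scipy.special import sph_harm

def steinhardt_q(l, pos, neigh):
    """Per-particle Steinhardt order parameter q_l.
    pos: (N, 3) coordinates; neigh: list of neighbor-index lists."""
    q = np.zeros(len(pos))
    for i, nb in enumerate(neigh):
        if not nb:
            continue
        bonds = pos[nb] - pos[i]
        r = np.linalg.norm(bonds, axis=1)
        polar = np.arccos(np.clip(bonds[:, 2] / r, -1.0, 1.0))
        azimuth = np.arctan2(bonds[:, 1], bonds[:, 0])
        # Average each spherical harmonic over the bonds of particle i, then
        # combine the m components (scipy's sph_harm takes m, l, azimuthal
        # angle, polar angle -- in that order).
        qlm = np.array([sph_harm(m, l, azimuth, polar).mean()
                        for m in range(-l, l + 1)])
        q[i] = np.sqrt(4.0 * np.pi / (2 * l + 1) * np.sum(np.abs(qlm) ** 2))
    return q
```

Particles in crystalline environments typically cluster at higher q6 than liquid-like ones; in practice the threshold used to color the droplet is calibrated against reference bulk-ice and bulk-liquid runs.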
Advanced visualization
Mike Matheson, Oak Ridge National Lab
(Visuals/movies to be included here)
Next steps
• Quasi "time parallelization" using the Parallel Replica method
  • Launch dozens of replicates simultaneously; monitor ensemble behavior
  • Expected outcome: 10-100x faster results
• Analysis and application of simulation results
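A schematic of the Parallel Replica bookkeeping as it might apply here, assuming freezing behaves as a rare, memoryless first-passage event: many independent replicas advance in parallel, and when the first one nucleates, the simulation time summed over all replicas is credited to a single long trajectory. The event detector below is a Poisson-process placeholder, not an actual nucleation criterion.

```python
import numpy as np

rng = np.random.default_rng(3)

def replica_segment_has_event(rate, dt):
    """Placeholder: advance one replica by dt of MD and report whether a
    nucleation event occurred (here drawn from a Poisson process)."""
    return rng.random() < 1.0 - np.exp(-rate * dt)

def parallel_replica_time_to_event(n_replicas, rate, dt):
    """Accumulate simulated time over all replicas until the first event.
    Wall-clock time to the event shrinks by ~n_replicas, while the summed
    replica time stays statistically equivalent to one long trajectory."""
    t_sum = 0.0
    while True:
        for _ in range(n_replicas):      # in practice these run concurrently
            t_sum += dt
            if replica_segment_has_event(rate, dt):
                return t_sum

# Toy check: the summed first-passage time is ~exponential with mean 1/rate,
# independent of how many replicas share the work
times = [parallel_replica_time_to_event(n_replicas=32, rate=1e-3, dt=1.0)
         for _ in range(2000)]
print(np.mean(times))   # ~1000
```

The wall-clock speedup therefore scales with the number of replicas launched, which is what makes the 10-100x expectation plausible for dozens of simultaneous droplet runs.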
Credits
• Mike Brown (ORNL) – GPU acceleration
• Paul Crozier (Sandia) – dynamic load balancing
• Valeria Molinero (Utah) – mW potential
• Aaron Keyes (U. Michigan, Berkeley) – Steinhardt-Nelson order parameters
• Art Voter / Danny Perez (LANL) – Parallel Replica method
• Mike Matheson (ORNL) – visualization
• Jack Wells, Suzy Tichenor (ORNL) – general support
• Azar Alizadeh, Branden Moore, Rick Arthur, Margaret Blohm (GE Global Research)

This research was conducted in part under the auspices of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC. This research was also conducted in part under the auspices of the GE Global Research High Performance Computing program.