Breaking Through the Barriers to GPU Accelerated Monte Carlo Particle Transport
GTC 2018
Jeremy Sweezy, Scientist, Monte Carlo Methods, Codes and Applications Group
3/28/2018
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
LA-UR-18-XXXX
What is Monte Carlo Particle Transport?
– Follows the path of individual particles through a system
– Uses pseudo-random numbers to sample processes
– Randomly samples physical and non-physical processes
– Attributed to Stanislaw Ulam and Enrico Fermi
– Named because Ulam had an uncle who would borrow money from relatives because he “just had to go to Monte Carlo”
[Image: FERMIAC]
Los Alamos National Laboratory 3/23/18 | 2
Porting to Specialized Hardware is Prohibitively Expensive
– The world’s production Monte Carlo codes have decades of development
– LANL’s MCNP code has been in development since 1977
– Equally extensive V&V effort
– Codes have to run on desktop machines and supercomputers
– DOE HPC platforms have been in a state of flux for the last 10 years
  • Cell Broadband Engine
  • Intel Xeon Phi (MIC)
  • GPUs
  • ARM???
Barrier #1: Limited Resources (Money, People, Time)
Monte Carlo Random Walk on GPU Hardware has reached a Performance Wall
• At least 6 different research groups have ported the Monte Carlo random walk to GPU hardware for neutron transport
• All report results against different numbers of CPUs
• All get the same results!
• Almost all are extremely simplified
• Production codes will likely have worse performance
• What are the limitations?
  – Conditional branching
  – Random data access
  – No small computationally intensive kernel to accelerate
[Figure labels: 4.5x, 3.0x]
Barrier #2: Performance of random walk on GPUs
How do You Define Performance?
• A computer scientist might measure performance as an increase in speed:
$$P = \frac{T_{CPU}}{T_{GPU}}$$
• A Monte Carlo specialist would measure performance as a balance between speed and statistical variance, using a Figure-of-Merit:
$$FOM = \frac{\sigma_{CPU}^2 \, T_{CPU}}{\sigma_{GPU}^2 \, T_{GPU}}$$
Example:
$$FOM = \frac{0.1^2 \times 1\ \text{min}}{0.05^2 \times 2\ \text{min}} = 2$$
To date, almost all GPU implementations of Monte Carlo particle transport have focused on increasing speed.
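The two performance measures on this slide can be compared with a few lines of Python. This is a minimal sketch: the function and argument names are illustrative, not taken from any code in the talk. The numbers are those of the slide's example, a CPU run with σ = 0.1 after 1 min and a GPU run with σ = 0.05 after 2 min.

```python
def speedup(t_cpu, t_gpu):
    """Pure speed ratio: run time on the CPU divided by run time on the GPU."""
    return t_cpu / t_gpu


def figure_of_merit(sigma_cpu, t_cpu, sigma_gpu, t_gpu):
    """Relative figure-of-merit: (sigma^2 * T) on the CPU over (sigma^2 * T) on the GPU.

    Values above 1 mean the GPU run reaches a given variance in less time.
    """
    return (sigma_cpu ** 2 * t_cpu) / (sigma_gpu ** 2 * t_gpu)


if __name__ == "__main__":
    # Slide example: the GPU run takes twice as long but halves sigma.
    print("speedup:", speedup(1.0, 2.0))
    print("figure of merit:", figure_of_merit(0.1, 1.0, 0.05, 2.0))
```

With these inputs the speed ratio is 0.5 while the figure-of-merit is about 2. A run that looks slower by wall-clock time can therefore still be the better one once variance is counted.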
Next Event Estimator
• The next-event estimator calculates the probability that a particle from a source or collision event reaches a point without interaction
• Typically used for image tallies
$$S(R, E) = \frac{w}{2\pi R^2} \times \sum_{i=1}^{N} \frac{\sigma_i(R, E)}{\sigma_T} \, p_i(\mu, E \to E') \exp\left(-\int_0^R \Sigma_T(s, E')\, ds\right)$$
[Figure: ray-cast diagram with labels Cell 1, Cell 2, Image Plane, A, B, C, μ]
One to two orders of magnitude faster on GPU hardware
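As a rough illustration of the ray cast behind a next-event score, the sketch below attenuates a particle weight along a straight line to a detector point. It rests on stated assumptions rather than on MonteRay's actual interface. The ray is already split into homogeneous segments, each given as a hypothetical `(sigma_t, length)` pair. The sum over reactions is collapsed into a single scattering density `pdf_mu`, which is 0.5 for isotropic scattering.

```python
import math


def next_event_score(weight, pdf_mu, segments):
    """Score at a detector point from one source or collision event.

    weight   -- particle weight w
    pdf_mu   -- probability density of scattering toward the detector
                (assumed here to already include the reaction sum)
    segments -- list of (sigma_t, length) pairs along the ray, in order;
                sigma_t in 1/cm, length in cm (hypothetical layout)
    """
    distance = sum(length for _, length in segments)
    # Optical depth: the line integral of the total cross-section along the ray.
    optical_depth = sum(sigma_t * length for sigma_t, length in segments)
    return weight * pdf_mu * math.exp(-optical_depth) / (2.0 * math.pi * distance ** 2)


if __name__ == "__main__":
    # Two cells between the event and the detector point.
    print(next_event_score(1.0, 0.5, [(1.0, 1.0), (0.5, 2.0)]))
```

Each event and detector pixel pair is an independent ray, so an image tally with thousands of pixels produces a large batch of identical, branch-light calculations.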
Traditional Track-Length Estimator
• The standard Monte Carlo fluence estimator
• Uses the sampled distance in each cell as the fluence estimator
• Only contributes to cells through which the particle passes
• Easy to compute
• Nothing to accelerate on GPU
[Figure labels: B, Cell 1, Cell 2, Cell 3]
Computing has changed; we need to change our algorithms too!
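For comparison with the ray-casting estimators, a track-length tally can be sketched as below. The data layout is an assumption made for illustration: each track is a hypothetical `(cell_id, weight, distance)` record produced by the random walk, and the fluence estimate is the weighted path length per unit volume per history.

```python
def track_length_tally(tracks, volumes, n_histories):
    """Track-length fluence estimate per cell.

    tracks      -- iterable of (cell_id, weight, distance) records
    volumes     -- dict mapping cell_id to cell volume
    n_histories -- number of source particles simulated
    """
    sums = {cell_id: 0.0 for cell_id in volumes}
    for cell_id, weight, distance in tracks:
        # Only cells the particle actually crossed receive a score.
        sums[cell_id] += weight * distance
    return {cell_id: sums[cell_id] / (volumes[cell_id] * n_histories)
            for cell_id in volumes}


if __name__ == "__main__":
    tracks = [(1, 1.0, 2.0), (2, 1.0, 1.0), (1, 0.5, 2.0)]
    volumes = {1: 2.0, 2: 1.0, 3: 1.0}
    print(track_length_tally(tracks, volumes, n_histories=2))
```

The work per track is one multiply and one add, which is why the slide notes that there is nothing here for a GPU to accelerate. Cell 3 in the usage example gets no score because no track crosses it.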
Volumetric-Ray-Casting Estimator
• For use in place of the traditional track-length estimator on GPU
• Multiple pseudo-rays are generated at each source and collision event
• Computationally intensive estimator with lower variance
$$F(i, E') = \frac{w \left(1 - e^{-\Sigma_{T,i}(E')\, l_i}\right)}{N\, \Sigma_{T,i}(E')} \exp\left(-\int_0^{|r' - r|} \Sigma_T(r + \Omega' s', E')\, ds'\right)$$
[Figure labels: B, Cell 1, Cell 2, Cell 3, Ray-cast]
A neutron dance for a neutron fan. – P.M. Dawn
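One way to read the estimator is as an expected track length: each pseudo-ray scores every cell it crosses, weighted by the probability of reaching that cell and by the mean path length travelled inside it. The sketch below follows that reading for a single pseudo-ray. The `(cell_id, sigma_t, length)` segment layout and the function name are assumptions for illustration, not MonteRay's API, and the void-cell branch is an added limiting case.

```python
import math


def ray_cast_contributions(weight, n_rays, segments):
    """Expected track-length score in each cell crossed by one pseudo-ray.

    weight   -- particle weight w at the source or collision event
    n_rays   -- number of pseudo-rays N generated at the event
    segments -- ordered list of (cell_id, sigma_t, length) along the ray
                (hypothetical layout; sigma_t in 1/cm, length in cm)
    """
    contributions = {}
    optical_depth = 0.0  # attenuation accumulated before the current cell
    for cell_id, sigma_t, length in segments:
        tau = sigma_t * length
        if sigma_t > 0.0:
            # Mean distance travelled in the cell: (1 - exp(-tau)) / sigma_t.
            expected_length = -math.expm1(-tau) / sigma_t
        else:
            # Void cell: the limit of the expression above is the full length.
            expected_length = length
        score = weight / n_rays * math.exp(-optical_depth) * expected_length
        contributions[cell_id] = contributions.get(cell_id, 0.0) + score
        optical_depth += tau
    return contributions


if __name__ == "__main__":
    print(ray_cast_contributions(1.0, 1, [(1, 1.0, 1.0), (2, 0.0, 2.0)]))
```

Unlike a sampled track, the ray scores cells beyond the point where the particle would have collided, which is where the lower variance comes from. The price is one exponential per cell per ray.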
MonteRay - Accelerating Monte Carlo Transport with GPU Ray Tracing
• MonteRay – A library for accelerating Monte Carlo tallies with GPU
• Random walk is maintained on CPU
• Ray casting based tallies are calculated on the GPU
  – Next-Event estimator
  – Volumetric-Ray-Casting estimator, a new estimator designed for GPUs
  – Supports neutron and photon tallies
• Can be incorporated into new and legacy Monte Carlo codes
• Uses continuous energy cross-section data
• Single precision ray casting
• Single precision attenuation cross-sections
• Double precision tallies
Reduces cost of accelerating an existing Monte Carlo code with GPUs
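The slide does not say why the tallies are kept in double precision while the ray casting is single precision. One plausible reason is that a single-precision accumulator stops registering small scores once the running total grows large. The sketch below demonstrates that effect with the standard library only. `to_float32` emulates single-precision rounding, and the score values are chosen purely for illustration.

```python
import struct


def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(scores, single_precision_tally):
    """Sum single-precision scores into a single- or double-precision tally."""
    tally = 0.0
    for score in scores:
        tally += to_float32(score)  # each score is computed in single precision
        if single_precision_tally:
            tally = to_float32(tally)  # the accumulator is also rounded to float32
    return tally


if __name__ == "__main__":
    # 2**24 is where float32 can no longer represent every integer.
    scores = [16777216.0, 1.0, 1.0]
    print("float32 tally:", accumulate(scores, single_precision_tally=True))
    print("float64 tally:", accumulate(scores, single_precision_tally=False))
```

In the single-precision case the two unit scores are lost to rounding, while the double-precision tally keeps them.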
MonteRay - Testing
• Tests use:
  – GeForce GTX TitanX GPU with NVIDIA Maxwell architecture
  – 2 CPUs (Intel Haswell E5-2660 v3 at 2.60 GHz), with 10 cores each
• MonteRay linked with LANL’s C++ Monte Carlo code MCATK
• MCATK uses MPI parallelism, building shared ray buffers with MPI-3 shared memory
• 3-D Cartesian Structured Mesh Geometry
• 2 tests measured the performance of the Next-event estimator
• 4 tests measured the performance of the Volumetric-ray-casting estimator
• Volumetric-ray-casting estimator performance on GPU compared to the Track-length estimator performance on the CPU
• Base performance measured as compared to 8 CPU cores
Testing the Next-Event Estimator on GPU Hardware: Two Radiography Tests
MonteRay – Medical X-Ray Imaging Simulation
• 50-keV X-ray beam
• 0.12-mm spot size
• Radiograph used Next-Event Estimator
• Simulation useful for designing collimator to minimize scattered contribution
MonteRay – Medical X-Ray Imaging Simulation
• Source and Collided contribution calculated separately
• Source contribution relatively easy to calculate
• Collided contribution important for collimator design
• Collided performance 15-18x
[Figure labels: 14.5x, 15.3x]
MonteRay – Industrial Radiography
• Simulated a physical test object used at Los Alamos’ Dual Axis Radiographic Hydrodynamic Test Facility
• Used 4-MeV mono-energetic X-ray beam
• 100 x 100 image grid (10,000 estimators) to simulate image detector
• Calculation of scatter component needed to design collimators and experiment, but too computationally expensive
I'm a peeping-tom techie with x-ray eyes – Patrick Lee MacDonald
MonteRay – Industrial Radiography
[Figure: GPU Performance vs Number of CPU Cores; series: Source, Collided; x-axis: Number of CPU Cores / GPU; y-axis: Relative Performance; labels: 28.5x, 24.2x]
Collided calculation performance 15-32x!
Volumetric-Ray-Casting Estimator on GPU Hardware vs Track-Length Estimator on CPU Hardware
Cancer Treatment Simulation
• 2-MeV photon beam (peak of 6-MV medical accelerator photon spectrum)
• 1-cm beam radius
What is the dose to healthy tissue?
[Figure: GPU Performance vs 8 CPU Cores; labels: Tumor, 2-MeV Photon Beam]
14x performance improvement in healthy tissue
Cancer Treatment Simulation
[Figure: GPU Performance vs Number of CPU Cores in Healthy Tissue; labels: 14.3x, 10.2x]
Performance is 14x vs 8 CPU cores or 10x vs 12 CPU cores
Pressurized Water Reactor Assembly Simulation
• 16x16 Fuel Assembly
• Performance 7.5x in the control rods, 5x in the fuel, and 4.5x in the coolant
[Figure: GPU Performance vs 8 CPU Cores; labels: Fuel Pin, Control Rod]
Pressurized Water Reactor Assembly Simulation
[Figure: GPU Performance vs Number of CPU Cores; labels: 7.2x, 5.4x, 6.0x, 4.4x]
Compared to 8 CPU cores, performance is 7.2x in the control rod and 6.0x in the fuel
Criticality Accident Simulation
• Critical uranium sphere in the corner of a concrete room
• Concrete floor, walls, ceiling, and 4 concrete pillars
[Figure: GPU Performance vs 8 CPU Cores; label: Uranium Sphere]
Performance increase of 14-16x in the center of the room
Criticality Accident Simulation – Smoother Fluence Estimate
[Figure panels: Track-Length Estimator; Volumetric-Ray-Casting Estimator]
Criticality Accident Simulation
[Figure: GPU Performance vs Number of CPU Cores; labels: 15x, 10.5x]
Things are going great, and they’re only getting better – Patrick Lee MacDonald
Reflected Godiva Criticality Experiment Simulation
• U-235 sphere reflected by water
• Performance Improvement
  – 2.5x in the core
  – 1.0x in the water
[Figure: GPU Performance vs 8 CPU Cores]
Reflected Godiva Criticality Experiment Simulation
• Variance of the Volumetric-Ray-Casting estimator approaches that of the Track-Length estimator in strongly scattering material.
[Figure: Variance Ratio vs Num. Collisions; x-axis: Number of Samples per Collision (N); y-axis: Variance Ratio ($\sigma_{TL}^2 / \sigma_{VRC}^2$)]
[Figure: GPU Performance vs. Num. CPU Cores; labels: 2.2x, 2.2x]
Performance is limited by the estimator variance, not the GPU speed