Fast GPU Monte Carlo Simulation for Radiotherapy, DNA Ionization and Beyond 2017 GPU Technology Conference Shogo Okada <shogo@port.kobe-u.ac.jp> Koichi Murakami <koichi.murakami@kek.jp> Nick Henderson <nick.henderson@gmail.com>
Outline Algorithm research GPU Geant4 MPEXS experimentation Application development Geant4 multi-threading
Big Picture
( ~ p, k ) x, ~ k ∈ { γ , e − , e + , . . . } Goal: record effect of particle interaction in material
LISA Geant4 • Toolkit for simulation of particles traveling through and interacting with matter • Supports wide variety of physics models, ATLAS geometries, and materials • Extendable - users can add new models • Used in numerous and diverse application areas gMocren • high energy physics • medical physics • spacecraft • semiconductor devices • biology research
Parallelism • Simulations require many events for statistical significance • Events are IID • Each computation thread processes an event Challenges: • Random nature of simulation leads to thread divergence • Storage of secondary particles • Recording of energy deposition If you want to consider full capability of Geant4: • Very complicated geometry -- non uniform data structures • Many material types • Large data tables to support physics processes
MPEXS • MPEXS is an adaptation of the core simulation algorithm from Geant4 for GPU • Target application: X-ray radiotherapy • Geometry: uniformly discretized box • Material: Water with variable density • Physics: Low energy electromagnetics • Gamma: Compton scattering, photoelectric effect, pair-production • Electron/Positron: ionization, multiple scattering, Bremsstrahlung, positron annihilation • Each GPU thread tracks an active particle • Secondary particles are stored on thread-local secondary stacks • Threads deposit energy to a shared global domain (via atomicAdd )
MPEXS - Performance & Validation
Dose Distribution of slab phantoms z y Verification for Dose Distribution - phantom size : 30.5 x 30.5 x 30 cm - voxel size : 5 x 5 x 2 mm - field size : 10 cm 2 - SSD : 100 cm - slab materials : air (1) water (2) lung (3) bone source Beam particle and its initial kinetic energy: density - electron with 20MeV water 1.0 g/cm 3 - photon with 6MV Linac lung 0.26 g/cm 3 - photon with 18MV Linac bone 1.85 g/cm 3 air 0.0012 g/cm 3
Comparison of depth dose for γ 6MV (1) water depth dose distribution -3 10 × dose (Gy) G4 0.3 MPEXS G4CU 0.25 − G4 v9.6.3 � 0.2 − G4CU MPEXS 0.15 • x-axis: z-direction (cm) 0.1 • y-axis: dose (Gy) 0.05 residual 0.2 0 5 10 15 20 25 30 0.1 • residual = (G4CU − G4) / G4 0 MPEXS -0.1 -0.2 0 5 10 15 20 25 30 depth (cm) (2) lung (3) bone depth dose distribution depth dose distribution -3 -3 10 × 10 × 0.07 dose (Gy) dose (Gy) G4 G4 0.3 0.06 G4CU MPEXS MPEXS G4CU 0.05 0.25 0.04 0.2 0.03 0.15 lung bone 0.02 0.1 0.01 residual 0.2 residual 0 5 10 15 20 25 30 0.2 0 5 10 15 20 25 30 0.1 0.1 0 0 -0.1 -0.1 -0.2 -0.2 0 5 10 15 20 25 30 0 5 10 15 20 25 30 depth (cm) depth (cm)
Comparison of depth dose for γ 18MV (1) water depth dose distribution -3 10 × dose (Gy) G4 0.12 MPEXS G4CU 0.1 − G4 v9.6.3 � 0.08 − G4CU MPEXS 0.06 • x-axis: z-direction (cm) 0.04 • y-axis: dose (Gy) 0.02 residual 0.2 0 5 10 15 20 25 30 0.1 • residual = (G4CU − G4) / G4 MPEXS 0 -0.1 -0.2 0 5 10 15 20 25 30 depth (cm) (2) lung (3) bone depth dose distribution depth dose distribution -3 10 -3 × × 10 dose (Gy) dose (Gy) G4 G4 0.12 0.12 G4CU MPEXS MPEXS G4CU 0.1 0.1 0.08 0.08 0.06 0.06 lung bone 0.04 0.04 0.02 0.02 residual 0.2 residual 0.2 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0.1 0.1 0 0 -0.1 -0.1 -0.2 -0.2 0 5 10 15 20 25 30 0 5 10 15 20 25 30 depth (cm) depth (cm)
Comparison of depth dose for e- 20MeV (1) water depth dose distribution -3 10 × dose (Gy) 0.18 dose (Gy) G4 -4 10 G4CU 0.16 0.14 -5 − G4 v9.6.3 � 10 0.12 0.1 − G4CU MPEXS -6 10 log scale 0.08 0.06 0 5 10 15 20 25 30 depth (cm) • x-axis: z-direction (cm) 0.04 0.02 • y-axis: dose (Gy) 0 residual 0.2 0 5 10 15 20 25 30 0.1 • residual = (G4CU − G4) / G4 MPEXS 0 -0.1 -0.2 0 5 10 15 20 25 30 depth (cm) (2) lung (3) bone depth dose distribution depth dose distribution -3 10 × -3 × 10 dose (Gy) dose (Gy) dose (Gy) dose (Gy) 0.18 G4 G4 0.18 -4 -4 10 10 G4CU G4CU 0.16 0.16 log scale log scale 0.14 0.14 -5 10 -5 10 0.12 0.12 0.1 0.1 -6 10 -6 10 0.08 0.08 0 5 10 15 20 25 30 0 5 10 15 20 25 30 lung bone 0.06 depth (cm) depth (cm) 0.06 0.04 0.04 0.02 0.02 0 residual 0 0.2 0 5 10 15 20 25 30 residual 0.2 0 5 10 15 20 25 30 0.1 0.1 0 0 -0.1 -0.1 -0.2 -0.2 0 5 10 15 20 25 30 0 5 10 15 20 25 30 depth (cm) depth (cm)
Computation Time Performance 185~250 times speedup against single-core G4 simulation!! GPU: e- beam with 20MeV Tesla K20c (Kepler architecture) - 2496 cores, 706 MHz - (1) water (2) lung (3) bone 4096 x 128 threads - G4 # of primaries 1.84 1.87 1.65 - [msec/particle] 50M particles -> e- 20MeV - G4CU MPEXS 500M particles -> γ 6MV, 18MV - 0.00881 0.00958 0.00885 [msec/particle] × speedup factor CPU: 208 195 193 - Xeon E5-2643 v2 3.50 GHz ( = G4 / G4CU ) / MPEXS) γ beam with 6MV γ beam with 18MV (1) water (2) lung (3) bone (1) water (2) lung (3) bone G4 0.780 0.822 0.819 0.803 0.857 0.924 [msec/particle] G4CU MPEXS 0.00336 0.00331 0.00341 0.00433 0.00425 0.00443 [msec/particle] × speedup factor 232 248 240 185 201 208 ( = G4 / G4CU ) / MPEXS)
Algorithm Research
• MPEXS does not attempt to sort particles • Thread divergence: if threads in the same warp are tracking different particle kinds, then thread divergence occurs in physics process code • Size of particle stack is the same for each thread and is fixed at run-time. Some applications call for the generation of many secondary particles. This restriction meant that we could only run with a small number of active threads.
0 1 2 3 4 5 6 7 index particles in e- e+ e- e- � � � � memory � process � � � � computation e- e- e- e- process e+ e+ process particles in e- e+ e- e- � � � � memory
MPEXS Experiments • Initialize each thread with the same random number generator state. This leads to a non-physical simulation, but eliminates thread divergence . We saw a factor 3x speedup in these runs. • Measure the time it takes to sort particle index by selected process and perform a run length encode against the time for a single trip through event loop. Calculations indicate we should expect a factor 2x speedup if implemented in full simulation.
New Architecture • Goal 1: minimize/eliminate thread divergence • Goal 2: eliminate need for fixed-size and thread-local secondary stacks • Goal 3: maintain extensibility
How it works
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ output buffers ˠ ˠ ˠ ˠ ˠ ˠ e- e-
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ pop ˠ ˠ ˠ ˠ ˠ output buffers ˠ ˠ ˠ ˠ ˠ ˠ e- e-
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ pop ˠ ˠ ˠ ˠ ˠ process selection Compton scattering ˠ ˠ ˠ ˠ ˠ Photoelectric effect output buffers ˠ ˠ ˠ ˠ ˠ ˠ e- e-
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ pop ˠ ˠ ˠ ˠ ˠ process selection Compton scattering ˠ ˠ ˠ ˠ ˠ Photoelectric effect sort by selected process output buffers ˠ ˠ ˠ ˠ ˠ ˠ e- e-
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ pop ˠ ˠ ˠ ˠ ˠ process selection Compton scattering ˠ ˠ ˠ ˠ ˠ Photoelectric effect sort by selected process secondary generation secondary particles ˠ ˠ ˠ e- e- e- e- e- output buffers ˠ ˠ ˠ ˠ ˠ ˠ e- e-
input buffer ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ pop ˠ ˠ ˠ ˠ ˠ process selection Compton scattering ˠ ˠ ˠ ˠ ˠ Photoelectric effect sort by selected process secondary generation secondary particles ˠ ˠ ˠ e- e- e- e- e- secondary storage output buffers ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ ˠ e- e- e- e- e- e- e-
Recommend
More recommend