PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE GPUS

  1. PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE GPUS
     Fanny Nina-Paravecino, Leming Yu, Qianqian Fang*, David Kaeli
     Department of Electrical and Computer Engineering
     *Department of Bioengineering

  2. SIMULATION OF PHOTON TRANSPORT INSIDE HUMAN BRAIN
     • Photon migration in 3D turbid media
     • Prediction of experimental outcomes
     • Simulation is a time-consuming task
     GTC April 4-7, 2016 | Silicon Valley

  3. MCX.SPACE

  4. MCX AROUND THE WORLD
     • Over 30,000 unique visits from 148 countries
     • Cumulative downloads exceed 12,000 worldwide
     • Over 900 registered users from more than 350 institutions and companies around the world

  5. MCX STATISTICS

  6. OUTLINE
     • Portable Performance Monte Carlo Extreme (MCX)
       - MCX in CUDA
       - Persistent Threads in CUDA (MCX)
       - Portable Performance MCX
       - Other enhancements
       - Results
     • MCX on multiple GPUs
       - Performance Model
       - Partitioning Schemes
       - Performance Results

  7. PORTABLE PERFORMANCE MCX
     [Figure labels: photon initialization; 3D voxelated media]

  8. MONTE CARLO EXTREME (MCX)
     • Estimates the 3D light (fluence) distribution by simulating a large number of independent photons
     • Most accurate algorithm for a wide range of optical properties, including low-scattering/voids, high absorption, and short source-detector separation
     • Computationally intensive, so a great target for GPU acceleration
     • Widely adopted for bio-optical imaging applications:
       - Optical brain functional imaging
       - Fluorescence imaging of small animals for drug development
       - Gold standard for validating new optical imaging instrumentation designs and algorithms

  9. MCX APPLICATIONS
     • Simulation of photons inside human brain
     • Imaging of bone marrow in the tibia
     • Imaging of a complex mouse model using Monte Carlo simulations

  10. MCX IN CUDA [1]
      [Flowchart, box labels as recovered from the slide; one step is marked optional]
      • CPU side: start; seed GPU RNG with CPU RNG; loop of repetitions; retrieve solution; repetition complete?; normalize and save solution; end of simulation
      • GPU side (threads i, i+1, …): launch a new photon; compute a new scattering length; propagate photon until it crosses a voxel boundary; compute attenuation based on absorption; accumulate photon energy loss to the volume (global memory); end of scattering path?; compute a new scattering direction vector; exceeding time gate?; terminate photon; total photon number reached?; end of thread
      [1] Q. Fang and D. A. Boas. "Monte Carlo simulation of photon migration in 3D turbid media accelerated by graphics processing units." Optics Express 17.22 (2009): 20178-20190.

  11. PERSISTENT THREADS (PT) IN MCX
      • PT kernels alter the notion of a virtual thread lifetime, treating those threads as physical hardware threads
      • PT kernels provide a view that threads are active for the entire duration of the kernel
        - We schedule only as many threads as the GPU SMs can concurrently run
        - The threads remain active until the end of kernel execution
      [Figure: worker thread lifecycle. The thread initializes and enters the thread loop; the thread loops continuously; the thread exits the loop, cleans up, and shuts down]
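The persistent-thread lifecycle on this slide (initialize, loop continuously, exit and clean up) can be illustrated with a CPU analogy. The sketch below is Python, not the MCX CUDA kernel: the function name `run_persistent_workers`, the shared work counter, and the lock standing in for an atomic increment are all assumptions made for illustration, and the slides do not say how MCX actually hands photons to threads.

```python
import threading


def run_persistent_workers(total_photons, num_workers):
    """Run a fixed pool of workers that stay alive until all photons are done.

    Hypothetical CPU analogy of a persistent-thread kernel: num_workers plays
    the role of "as many threads as the hardware can run concurrently".
    """
    lock = threading.Lock()
    next_photon = 0                   # shared counter of photons handed out so far
    simulated = [0] * num_workers     # photons processed by each worker

    def worker(worker_id):
        nonlocal next_photon
        while True:                   # the persistent loop
            with lock:                # stands in for an atomic fetch-and-add
                if next_photon >= total_photons:
                    return            # exit the loop and shut down
                next_photon += 1
            simulated[worker_id] += 1  # placeholder for simulating one photon

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return simulated
```

Calling `run_persistent_workers(1000, 4)` returns a list of four per-worker counts that sum to 1000. The individual counts depend on scheduling, which is the point of the pattern: the worker count is fixed by the hardware, not by the amount of work.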

  12. PORTABLE PERFORMANCE MCX

      | Feature            | Fermi | Kepler | Maxwell |
      |--------------------|-------|--------|---------|
      | MaxThreadBlocks/MP | 8     | 16     | 32      |
      | MaxThreads/MP      | 1536  | 2048   | 2048    |
      | MP                 | 16    | 14     | 22      |
      | CUDA cores/MP      | 32    | 192    | 128     |

      autoBlock = MaxThreadsPerMP / MaxBlocksPerMP
      autoThread = autoBlock * MaxBlocksPerMP * MP
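As a worked example of the two formulas on this slide, the sketch below evaluates autoBlock and autoThread for the three architectures listed. The dictionary `ARCH_LIMITS` and the function `auto_launch_config` are hypothetical names introduced here. The Maxwell entry uses 2048 resident threads per multiprocessor, which is the hardware limit for that architecture. Integer division is an assumption, since the slide does not say how a non-integer quotient is handled.

```python
# Per-architecture limits taken from the slide's table.
# Maxwell uses 2048 threads per MP (the hardware limit for that architecture).
ARCH_LIMITS = {
    "Fermi":   {"max_blocks_per_mp": 8,  "max_threads_per_mp": 1536, "mp": 16},
    "Kepler":  {"max_blocks_per_mp": 16, "max_threads_per_mp": 2048, "mp": 14},
    "Maxwell": {"max_blocks_per_mp": 32, "max_threads_per_mp": 2048, "mp": 22},
}


def auto_launch_config(limits):
    """Return (autoBlock, autoThread) following the slide's two formulas."""
    # autoBlock = MaxThreadsPerMP / MaxBlocksPerMP (integer division assumed)
    auto_block = limits["max_threads_per_mp"] // limits["max_blocks_per_mp"]
    # autoThread = autoBlock * MaxBlocksPerMP * MP
    auto_thread = auto_block * limits["max_blocks_per_mp"] * limits["mp"]
    return auto_block, auto_thread


for arch, limits in ARCH_LIMITS.items():
    block, threads = auto_launch_config(limits)
    print(f"{arch}: autoBlock={block}, autoThread={threads}")
```

With these inputs the block size shrinks from 192 (Fermi) to 128 (Kepler) to 64 (Maxwell), while the total thread count grows from 24576 to 28672 to 45056. Each configuration fills every multiprocessor to its resident-thread limit.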

  13. OTHER ENHANCEMENTS
      • Autopilot improvement
      • Developed customized operations such as mcx_nextafter
      • Reduced the use of shared memory
        - Enables more threads to be launched
      • Avoided branch divergence by using indexes

  14. IMPROVEMENT PER ENHANCEMENT
      [Figure: chart of each enhancement's share (0% to 100%) of the overall performance gain for 980Ti and GK110, with overall performance annotations 1.4x and 2.4x. Legend: Autopilot; Reducing Shared Memory; Increasing Local Memory / Hide Latency; Avoid branch divergence / Customized function]

  15. PERFORMANCE MCX - RESULTS
      • Baseline: MCX version Sep 12, 2015

      | Arch    | GPU     | Photons/ms (Baseline) | Photons/ms | Speedup |
      |---------|---------|-----------------------|------------|---------|
      | Fermi   | GTX 590 | 2044.99               | 2901.92    | 1.4x    |
      | Kepler  | GT 730  | 529.89                | 1263.74    | 2.4x    |
      | Kepler  | GK110   | 2383.22               | 5238.34    | 2.2x    |
      | Maxwell | 980Ti   | 12268.98              | 19157.09   | 1.4x    |

      [Figure: bar chart of performance (photons/ms) per GPU: GTX 590, GT 730, GK110, 980Ti]

  16. MCX AS A BENCHMARK
      Performance is changing dramatically
      • Same input: -10x
      • Same sequence of code: +10x
      [Figure: size (KB) of MCX_core.sass and MCX_core.ptx for Baseline, After Improvement, and After Improvement with Hack. Data labels: 1799.76, 1550.63, 1368.96, and 0.14 (three times). CUDA 7.5, Maxwell Compute 5.2 (980Ti)]

  17. MCX ON MULTIPLE GPUS

  18. MOTIVATION
      • Monte Carlo eXtreme (MCX) simulation in OpenCL
      • Distribute workloads among different devices
        - NVIDIA GPUs / AMD GPUs / CPUs
      [Figure labels: MCXCL platform; partitioning scheme; threads; GPU 1, GPU 2, GPU 3]

  19. METHODOLOGY
      • Predict the kernel execution time
        - Evaluate the kernel runtime
        - Develop the performance model
      • Partitioning schemes
        - Core-based: the number of parallel compute units
        - Throughput: application throughput (photons/ms)
        - Iterative: throughput-based iterative partitioning
        - Fminimax: nonlinear linear programming solution for the minimax problem

  20. PERFORMANCE MODEL
      • Measure the kernel execution time on various devices
      • Simulate 1M to 25M photon migrations

  21. PERFORMANCE MODEL
      • Given n devices: $D_1, D_2, \dots, D_n$
      • Given linear performance for each device
      • Given the performance for 1M and 2M for each device
      • We can obtain the linear equation for each device as follows:
        Device 1: $y_1 = a_1 x_1 + c_1$
        Device 2: $y_2 = a_2 x_2 + c_2$
        ...
        Device n: $y_n = a_n x_n + c_n$
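Two measurements per device are enough to determine the slope and intercept of the linear model above. A minimal sketch of that step follows, assuming x is the photon count in millions and y is the kernel time in milliseconds. The units, the function names `fit_linear_model` and `predict_time`, and the sample timings are illustrative assumptions, not values from the slides.

```python
def fit_linear_model(x1, y1, x2, y2):
    """Return (a, c) of y = a * x + c passing through (x1, y1) and (x2, y2)."""
    a = (y2 - y1) / (x2 - x1)   # slope: extra time per extra unit of work
    c = y1 - a * x1             # intercept: fixed overhead
    return a, c


def predict_time(model, x):
    """Evaluate the fitted linear model at workload x."""
    a, c = model
    return a * x + c


# Hypothetical timings for one device: 1M photons in 120 ms, 2M photons in 220 ms.
model = fit_linear_model(1, 120.0, 2, 220.0)
print(model)                    # slope 100 ms per million photons, intercept 20 ms
print(predict_time(model, 10))  # predicted time for 10M photons
```

Under these made-up numbers the device costs 100 ms per million photons plus 20 ms of fixed overhead, so 10M photons are predicted to take 1020 ms. Repeating the fit for each device gives the n equations listed on the slide.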

  22. PARTITIONING SCHEME ELABORATION
      • Core-based: $\frac{\mathrm{ComputeUnits}_i}{\sum_j \mathrm{ComputeUnits}_j}$
      • Throughput: $\frac{\mathrm{Throughput}_i}{\sum_j \mathrm{Throughput}_j}$
      • Iterative approximation: core-based initialization, then iteratively evaluate throughput-based partitioning, and stop when the maximum throughput is achieved
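The proportional shares and the iterative approximation on this slide can be sketched as below. This is one reading of the slide, not the authors' implementation. The function names, the `run_time(device_index, photons)` callback, the rule of giving the rounding remainder to the last device, and the linear timing models in the usage lines are all assumptions.

```python
def proportional_partition(total_photons, weights):
    """Split photons in proportion to weights (compute units or throughput)."""
    total_weight = sum(weights)
    shares = [int(total_photons * w / total_weight) for w in weights]
    shares[-1] += total_photons - sum(shares)  # rounding remainder to the last device
    return shares


def iterative_partition(total_photons, compute_units, run_time, max_iters=20):
    """Start core-based, then re-partition by measured throughput until it stops improving.

    run_time(i, n) must return the (positive) time device i needs for n photons.
    """
    shares = proportional_partition(total_photons, compute_units)
    best_shares, best_throughput = shares, 0.0
    for _ in range(max_iters):
        times = [run_time(i, n) for i, n in enumerate(shares)]
        overall = total_photons / max(times)   # devices run concurrently
        if overall <= best_throughput:
            break                              # no further improvement
        best_shares, best_throughput = shares, overall
        rates = [n / t for n, t in zip(shares, times)]  # per-device throughput
        shares = proportional_partition(total_photons, rates)
    return best_shares, best_throughput


# Hypothetical devices: (ms per million photons, fixed overhead in ms).
models = [(100.0, 20.0), (400.0, 20.0)]


def linear_run_time(i, n):
    a, c = models[i]
    return a * n / 1e6 + c


shares, throughput = iterative_partition(10_000_000, [8, 8], linear_run_time)
print(shares, throughput)
```

In this made-up case both devices report the same number of compute units, so the core-based split is even and the slower device dominates the runtime. The iterative loop shifts work toward the faster device until both finish at nearly the same time, which is the minimax objective the Fminimax scheme addresses directly.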

  23. PERFORMANCE RESULTS
      [Figure: bar chart of throughput (0 to 30000) for Core-based, Throughput, Iterative, and Fminimax partitioning at 10M and 100M photons on each device set]

      Throughput utilization, GTX 980 Ti + GTX 590 + GT 730 (max throughput 9688 photons/ms):

      | Scheme     | 10M    | 100M   |
      |------------|--------|--------|
      | Core-based | 35.01% | 41.65% |
      | Throughput | 59.31% | 93.42% |
      | Iterative  | 68.85% | 93.77% |
      | Fminimax   | 68.85% | 93.77% |

      Throughput utilization, K40c + K20c (max throughput 30323 photons/ms):

      | Scheme     | 10M    | 100M   |
      |------------|--------|--------|
      | Core-based | 85.31% | 97.56% |
      | Throughput | 80.39% | 87.89% |
      | Iterative  | 80.39% | 87.89% |
      | Fminimax   | 80.39% | 87.89% |

  24. PERFORMANCE RESULTS
      [Figure: bar chart of throughput (0 to 4500) for Core-based, Throughput, Iterative, and Fminimax partitioning at 10M and 100M photons on each device set]

      Throughput utilization, AMD 7970M + Intel i7-3740QM (max throughput 4529 photons/ms):

      | Scheme     | 10M    | 100M   |
      |------------|--------|--------|
      | Core-based | 19.32% | 18.69% |
      | Throughput | 18.81% | 27.14% |
      | Iterative  | 18.78% | 27.91% |
      | Fminimax   | 18.78% | 27.91% |

      Throughput utilization, AMD 7970 + Fiji + Intel i7-4770 (max throughput 19176 photons/ms):

      | Scheme     | 10M    | 100M   |
      |------------|--------|--------|
      | Core-based | 15.10% | 19.06% |
      | Throughput | 16.38% | 21.10% |
      | Iterative  | 16.38% | 21.10% |
      | Fminimax   | 16.38% | 21.10% |

  25. SUMMARY
      • We have improved the performance of MCX across a range of NVIDIA GPU architectures
      • We have shown how to exploit persistent-thread kernels to automatically tune the MCX kernel
      • We developed an iterative scheme to search for the best partition for running MCX on multiple accelerators
      • We obtained a 24% and a 44% throughput utilization improvement (Iterative vs. Core-based) for 10M and 100M photon simulations, respectively

  26. FUTURE WORK
      • Instrumentation of MCX
        - Leverage SASSI to instrument MCX and better characterize the behavior of a kernel to guide auto-tuning
      • MCX on multiple GPUs
        - Evaluate our partitioning optimization for multiple devices

  27. MCX CHALLENGE
      • Interested in improving the performance of MCX by more than 40% over the current version?
        - A monetary reward will be announced soon. Stay tuned to mcx.space

  28. ACKNOWLEDGEMENT
      • This project is funded by the NIH/NIGMS under grant R01-GM114365
      • We would like to acknowledge NVIDIA for supporting this work through the NVIDIA Research Center program

  29. THANK YOU! QUESTIONS? fninaparavecino@ece.neu.edu ylm@ece.neu.edu
