red blood cell simulations with chemical transport
play

Red Blood Cell Simulations with Chemical Transport Properties Ansel - PowerPoint PPT Presentation

Red Blood Cell Simulations with Chemical Transport Properties Ansel L. Blumers Karniadakis Group, Brown University, Rhode Island, USA San Jos || GTC 2017 || May, 2017 Scientific Inquiries Aim to investigate Chemical-driven plaque


  1. Red Blood Cell Simulations with Chemical Transport Properties Ansel L. Blumers Karniadakis Group, Brown University, Rhode Island, USA San José || GTC 2017 || May, 2017

  2. Scientific Inquiries Aim to investigate Chemical-driven plaque and thrombus formation. Model chemical transport Released from red and white blood cells to plasma, Sieved through the vessel wall to surrounding tissue. Model red blood cells Red blood cell dynamics in the blood. 2 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  3. Dissipative Particle Dynamics A coarse-grained particle method for mesoscopic simulations. Pairwise Force Interaction X ( F C ij + F D ij + F R F i = ij ) i 6 = j F R ij = σ ij ω R ( r ij ) ξ ij δ t − 1 / 2 e ij random F D ij = − γ ij ω D ( r ij )( e ij · v ij ) e ij dissipative F C ij = α ij ω C ( r ij ) e ij conservative Groot, R.D., Warren, P.B., The Journal of Chemical Physics, 1997 3 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  4. Transport Properties Solves the Advection-Diffusion-Reaction equation. Pairwise Chemical Transport source / external random flux Fickian flux Li, Z., Yazdani, A., Tartakovsky, A., Karniadakis, G.E., The Journal of Chemical Physics, 2015 4 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  5. Modeling Red Blood Cell (RBC) Discretized with 500 vertices and held together by 3 types of bonded potentials. Bending rigidity Visco-elastic + hydrostatic-elastic Global area and volume constraints + local area constraint Fedosov, D.A., Caswell, B., Karniadakis, G.E., Biophysical Journal., 2010 5 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  6. Need a Fast & Robust Program Fusion 6 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  7. USER MESO 2.0 https://github.com/AnselGitAccount USER MESO Concen. & ADR Injects new capabilities into USER MESO * RBC Combining Open MPI, OpenMP and CUDA Simultaneously simulate ... chemical concentration field advection-diffusion-reaction processes USER MESO 2.0 red blood cell dynamics * S4518 , GTC 2014, Y.-H. Tang, Accelerating Dissipative Particle Dynamics Simulation on Kepler: Algorithm, Numerics and Application USERMESO 2.0 : Blumers, A., Tang, Y.-H., Li, Z., Li, X., Karniadakis, G. E., Computer Physics Communications, 2017 USERMESO : Tang, Y.-H., Karniadakis, G.E., Computer Physics Communications, 2014 7 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  8. Finding Neighbors Particles are binned into cells. Finding neighbors by calculating relative distance to other particles in adjacent cells. Particles in … Center Cell N Neighboring Cells M N x M predicates Strategy #1 : Each warp takes on particles in Center Cell. Strategy #2 : Each warp takes on particles in Neighboring Cells. 3 x 3 cells 8 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  9. Neighbor Kernel Optimization Atomics-free and parallel committing at warp-level dependency. Optimize : Save particle IDs to a neighbor list orderly . Solution : 1. Use __balloc(int) which returns a bit mask called ballot. 2. Use __popc(int) which returns number of bits set. 3. Return value of __popc(int) is broadcasted to each thread and further masked by a lane- specific mask 4. Place ID accordingly. S4518 , GTC 2014, Y.-H. Tang, Accelerating Dissipative Particle Dynamics Simulation on Kepler: Algorithm, Numerics and Application 9 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  10. Data Layout – Chemical Concentration Each particle carries an arbitrary number of chemical species. Flux is calculated using particle location and chemical concentration. Global Load 2D Texture Hardware: K20X Stall – Data Request ~ 23% ~ 44% Texture Cache Hit Rate ~ 52% ~ 36% Kernel Duration 33.7ms 43.7ms 30% 2D Texture Implementation: Each layer holds the concentration of all particles for one species. Culprit – Texture cache depletion Data locality in coordinates is optimized for texture cache hit rate. The concentration texture depletes the cache and thus disrupt the data locality optimization. Overall texture cache hit rate is therefore reduced significantly. 10 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  11. Parallel GPU/MPI Synchronization Domain decomposition: Broken up into sub-domains. Each sub-domain is computed by one CPU-GPU pair. Steps for RBC computation: 1) Compute total area and volume of local RBCs. 2) All-reduce across all nodes. 3) Enforce area and volume constraints for each RBC. Complication - Domain decomposition causes large synchronization overhead. Naive approach K-Gather Compute total area and volume of each RBC. Prior Processes prior to RBC computation. K-Apply Enforce area and volume constraints. Subsequent Processes subsequent to RBC computation. 11 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  12. Attempted Optimization #1: Multi-stream Strategy: Overlapping K-Gather and K-Apply . Uses the total area and volume from previous time step. Complication: Streaming multiprocessor saturation – kernels competing for computing resources. The resulting higher cache refresh rate is detrimental to efficiency. Consequence: Worse performance 12 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  13. Attempted Optimization #2: Multi-stream + Non-blocking Strategy: Overlapping data transfer and computation + Utilizing Non-blocking communication Algorithm: 1) Wait for the completion of MPI-Iallreduce from last time step. 5) Download data to host with asynchronous Memcpy-HtD . 2) Upload data to device with asynchronous Memcpy-HtD . 6) Enforce the area and volume constraints in K-Apply . 3) Compute the total area and volume of each RBC in K-Gather . 7) Wait for the completion of Memcpy-DtH . 4) Place asynchronous Memcpy-DtH in execution queue. 8) Sum total area and volume with MPI-Iallreduce . 13 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  14. Benchmark – Weak & Strong Scalings Metrics: (million particles) • (steps per second) à MPS/second Strong Weak Global volume - 2,097,152 Local volume (per node) - 32,768 For example: Hct 7% for a system volume of 32,768 translates to 24 RBCs and 131,072 pure fluid particles. Hct 35% for a system volume of 32,768 translates to 123 RBCs and 49,768 pure fluid particles. 14 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  15. Benchmark - Speedup Hardware: Titan, 8 Opteron 6274 clusters + 1 K20X / Cray XK7 node. Time (Non-blocking impl.) Speedup: Runtime: 1 rank with 8 OpenMP threads per node. Time (Blocking impl.) Strong Weak 15 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  16. Benchmark – Single Node Time (CPU-GPU hybrid) Speedup: Time (CPU only) Hct System Volume Solvent Particles RBC Count Total Particle Count Speedup 7% 8,192 32,768 6 37,768 3.8 16,382 65,536 12 71,536 5.1 32,768 131,072 24 143,072 5.4 65,536 262,144 49 286,644 5.7 35% 8,192 32,768 30 49,768 4.5 16,384 65,536 61 96,036 5.3 32,768 131,072 123 192,572 5.9 65,536 262,144 246 385,144 6.7 For example: Hct 7% for a system volume of 8,192 translates to 6 RBCs and 32,768 pure fluid particles. 16 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  17. Bonus Benchmark Hardware: Two Intel Xeon E5-2630L CPUs at 2.0 GHz, GeForce TITAN X Maxwell or TITAN X Pascal Runtime: 1 rank with 8 OpenMP threads. System Total particle TITAN X Maxwell TITAN X Pascal volume count speedup speedup 8,192 47,768 4.8 6.5 16,382 96,036 5.8 9.2 32,768 192,572 7.2 9.9 65,536 385,144 7.2 10.1 17 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  18. 720,778 particles ~12x speedup 5% represents RBCs @ than CPU-only 500 particles per RBC version 18 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  19. Summary USER MESO 2.0 offers GPU-accelerated ability to model advection-diffusion-reaction of chemical and RBC dynamics. Multi-stream scheduling and non-blocking communication are employed to maximize GPU- CPU concurrency. Comparing with CPU counterpart, USER MESO 2.0 produces up to 10.1 times speedup on one GPU over 16 cores in a single node. It is able to achieve a weak scaling efficiency of 91% across 256 nodes and almost linear strong scaling. 19 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

  20. Thank you for listening! Acknowledgement: This work is supported by NIH and US Army Research Laboratory. The benchmarks were performed on TITAN at Oak Ridge National Laboratory through the Innovative and Novel Computational Impact on Theory and Experiment program under project BIP118. A.L.B. would like to acknowledge Wayne Joubert (ORNL) for his effort to coordinate machine reservations. We would like also to acknowledge the support of NVIDIA Corporation with the donation of TITAN X Pascal GPU. 20 Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

Recommend


More recommend