Blasting Sand with CUDA: MPM Sand Simulation for VFX
Gergely Klár
DreamWorks Animation
[Figures: three slides, each showing the state at time t^n and at time t^(n+1)]
Grid influence
Naïve Particles-to-Grid
Gather Particles-to-Grid
Our Solution
• Each particle is read only once,
• We efficiently use shared memory for the grids,
• We significantly reduce the number of atomic operations,
• And our secret sauce: a special data structure for particle queries.
[Figure: 3 × 3 arrangement of regions, each labeled "1 CUDA Block"]
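[Figure: levels of the data structure, in order: CellBins, ParticleIDs, actual particle data]
[Figure: the same structure with a level added on top, in order: TileBins, CellBins, ParticleIDs, actual particle data]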
• In each block/tile:
  – Get blockIdx
  – Cells in the tile are TileBins[blockIdx-1]..TileBins[blockIdx]-1
  – Get a cellId for each warp from this list
• Each thread works on two affected grid nodes
• Particles of a cell are CellBins[cellId-1]..CellBins[cellId]-1
• Compute the contribution from the particle
• Store in shared memory
  – Write back to global memory
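The lookup steps above can be sketched on the CPU as two nested range queries. This is a minimal illustration in plain Python, not the CUDA kernel from the talk: the function names `bin_range` and `particles_by_tile` are hypothetical, the bins are assumed to hold inclusive running sums (so entry i is the end offset of bin i), and the entry "before index 0" is assumed to be 0.

```python
def bin_range(bins, i):
    """Index range bins[i-1] .. bins[i]-1, treating bins[-1] as 0 (assumption)."""
    start = bins[i - 1] if i > 0 else 0
    return range(start, bins[i])


def particles_by_tile(tile_bins, cell_bins, particle_ids):
    """For every tile, list the particle IDs of each of its non-empty cells.

    On the GPU, the outer loop would correspond to CUDA blocks (one per tile)
    and the middle loop to warps (one cell at a time per warp).
    """
    tiles = []
    for block_idx in range(len(tile_bins)):          # one CUDA block per tile
        cells = []
        for cell_id in bin_range(tile_bins, block_idx):   # cells of this tile
            slots = bin_range(cell_bins, cell_id)         # particles of this cell
            cells.append([particle_ids[s] for s in slots])
        tiles.append(cells)
    return tiles


# Hypothetical example: 2 tiles, 4 non-empty cells, 6 particles.
tile_bins = [3, 4]
cell_bins = [2, 3, 5, 6]
particle_ids = [1, 3, 5, 0, 4, 2]
layout = particles_by_tile(tile_bins, cell_bins, particle_ids)
```

In this sketch, tile 0 owns cells 0 to 2 and tile 1 owns cell 3; each cell then resolves to a contiguous slice of the sorted particle IDs, which is what lets a warp read the particles of one cell without searching.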
Tile & Cell Keys
● Particle coordinates: (px, py, pz)
● Cell coordinates: (ci, cj, ck) = ⌊(px, py, pz) / Δx⌋ (Δx: grid spacing)
● Tile and in-tile coordinates: (ci, cj, ck) = (ti, tj, tk) ∙ TILE_SIZE + (ri, rj, rk)
Key layout (32 bit unsigned integer):
ti: 7 bits | tj: 7 bits | tk: 7 bits | ri: 3 bits | rj: 3 bits | rk: 3 bits
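A small Python sketch of the key construction may make the bit layout concrete. It assumes details the slide does not state explicitly: that ti occupies the most significant of the 30 used bits, that TILE_SIZE is 8 (implied by the 3 in-tile bits), and that all coordinates are non-negative. The names `cell_coords`, `pack_key`, and `unpack_key` are hypothetical.

```python
import math

FIELD_BITS = (7, 7, 7, 3, 3, 3)   # ti, tj, tk, ri, rj, rk (order assumed)
TILE_SIZE = 1 << 3                # 8 cells per tile edge, from the 3 in-tile bits


def cell_coords(p, dx):
    """Cell coordinates (ci, cj, ck) = floor(p / dx)."""
    return tuple(math.floor(c / dx) for c in p)


def pack_key(cell):
    """Pack cell coordinates into one integer that fits in 32 bits."""
    tile = [c // TILE_SIZE for c in cell]      # (ti, tj, tk)
    in_tile = [c % TILE_SIZE for c in cell]    # (ri, rj, rk)
    key = 0
    for value, bits in zip(tile + in_tile, FIELD_BITS):
        if not 0 <= value < (1 << bits):
            raise ValueError("coordinate does not fit in its bit field")
        key = (key << bits) | value
    return key


def unpack_key(key):
    """Inverse of pack_key: recover (ci, cj, ck) from a key."""
    fields = []
    for bits in reversed(FIELD_BITS):          # peel fields off the low end
        fields.append(key & ((1 << bits) - 1))
        key >>= bits
    rk, rj, ri, tk, tj, ti = fields
    return (ti * TILE_SIZE + ri, tj * TILE_SIZE + rj, tk * TILE_SIZE + rk)


example_cell = cell_coords((0.25, 1.0, 2.6), 0.5)
example_key = pack_key(example_cell)
```

Because the tile bits sit above the in-tile bits in this layout, all cells of one tile share the same upper 21 bits of the key, so an ordinary integer sort groups them together.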
Tile & Cell Keys
● When sorted as uint32s, keys of the same tile will be consecutive
● Run-length encoding (RLE) counts the number of particles per cell
● The running sum of the counts gives the offsets to particles
● RLE with a mask for the tile bits counts the number of non-empty cells per tile
● The running sum of these counts gives the offsets to cells
[Diagram: Initial Particle IDs → sort → Particle IDs → RLE → inc. sum → Cell Bins → masked RLE → inc. sum → Tile Bins]
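The sort, RLE, and running-sum pipeline above can be sketched serially in Python. This is a CPU illustration under stated assumptions, not the GPU implementation: the function name `build_bins` is hypothetical, the tile mask assumes the tile bits are the upper 21 of the 30 key bits, and the running sums are taken to be inclusive ("inc. sum"). On the GPU each step would presumably be a parallel sort, run-length encode, or scan primitive.

```python
from itertools import accumulate, groupby

TILE_MASK = ((1 << 21) - 1) << 9   # assumed: tile bits above 9 in-tile bits


def run_lengths(values):
    """Run-length encode: return (unique values, length of each run)."""
    uniques, counts = [], []
    for value, run in groupby(values):
        uniques.append(value)
        counts.append(sum(1 for _ in run))
    return uniques, counts


def build_bins(keys):
    """Build ParticleIDs, CellBins, and TileBins from per-particle keys."""
    # Sort particle IDs by key so particles of a cell (and tile) are adjacent.
    particle_ids = sorted(range(len(keys)), key=lambda i: keys[i])
    sorted_keys = [keys[i] for i in particle_ids]

    # RLE on full keys: particles per non-empty cell; scan gives end offsets.
    cell_keys, cell_counts = run_lengths(sorted_keys)
    cell_bins = list(accumulate(cell_counts))

    # Masked RLE on the unique cell keys: non-empty cells per tile.
    _, tile_counts = run_lengths([k & TILE_MASK for k in cell_keys])
    tile_bins = list(accumulate(tile_counts))
    return particle_ids, cell_bins, tile_bins


# Hypothetical keys for 6 particles: three cells in one tile, one in another.
keys = [64, 0, 1 << 23, 0, 64, 1]
particle_ids, cell_bins, tile_bins = build_bins(keys)
```

Here the masked RLE runs over the unique cell keys rather than over all particles, so its counts are cells per tile, matching the slide's description of TileBins as offsets to cells.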
Results
Overall
[Chart: GPU vs. CPU, milliseconds per time step (axis 0 to 1000; smaller is better) at 262K, 884K, 2,097K, and 7,000K particles]
GPU: NVIDIA Quadro K5200
CPU: Intel Xeon CPU E5-2697 v3 @ 2.60GHz w/ 28 cores
[Charts: two panels, "Particles to Grids" and "Grids to Particles", each showing milliseconds per time step (axis 0 to 600; smaller is better) at 262K, 884K, 2,097K, and 7,000K particles]
Summary
• Particle binning with sort-RLE-scan
• Breaking the domain into tiles that fit in shared memory
• Processing the particles of a cell with a single warp
Special thanks to:
• Ken Museth
• Rob Tesdahl
• Stephen Jones
• David Tonnesen
• Jeff Budsberg
• Ibrahim Sani
• Lawrence Lee
Thank you!