Out-of-Core Proximity Computation for Particle-based Fluid Simulation

Presenter: Duksu Kim 1
Authors: Duksu Kim 1, Myung-Bae Son 2, Young J. Kim 3, Jeong-Mo Hong 4, Sung-Eui Yoon 2

1 KISTI (Korea Institute of Science and Technology Information)
2 KAIST (Korea Advanced Institute of Science and Technology)
3 Ewha Womans University, Korea
4 Dongguk University, Korea
Particle-based Fluid Simulation
Motivation
• To achieve higher realism, a large number of particles is required
  – Tens of millions of particles
• In-core algorithms (previous work)
  – Manage all data in the GPU's video memory
  – Can handle up to 5 M particles with 1 GB of memory for particle-based fluid simulation
• Recent commodity GPUs have 1–3 GB of memory (up to 12 GB)
Contributions
• Propose out-of-core methods that utilize heterogeneous computing resources to perform neighbor search for a large number of particles
• Propose a memory-footprint estimation method to identify a maximal work unit for efficient out-of-core processing
Result
• Test machine: two hexa-core CPUs (192 GB memory) and one GPU (3 GB memory)
• Map-GPU: NVIDIA's mapped memory technique, which maps CPU memory into the GPU memory address space
• Ours: handles up to 65.6 M particles (maximum data size: 13 GB)
Particle-based Fluid Simulation
• Simulation loop: neighbor search → compute force → move particles
• Performance bottleneck: neighbor search, i.e., ε-nearest neighbor (ε-NN) search
  – Takes 60–80% of the simulation computation time
Preliminary: Grid-based ε-NN
• A uniform grid with cell size l is built over the particles (ε < l)
• The ε-neighbors of a particle are found by examining its own cell and the adjacent cells
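As a concrete illustration of this preliminary, the sketch below bins particles into a uniform grid with cell size l and, for each particle, tests only the 3×3×3 block of cells around it, which is valid because ε < l. It is a minimal single-threaded CPU version with assumed names (Particle, epsilonNN) and a std::map-based grid, not the GPU implementation used in the talk.

```cpp
#include <array>
#include <cmath>
#include <map>
#include <vector>

struct Particle { float x, y, z; };

using CellIdx = std::array<int, 3>;

static CellIdx cellOf(const Particle& p, float l) {
    return { (int)std::floor(p.x / l), (int)std::floor(p.y / l), (int)std::floor(p.z / l) };
}

// For every particle, find all neighbors within distance eps.
// Requires eps < l (the cell size), so all neighbors of a particle lie in its
// own cell or one of the 26 adjacent cells.
std::vector<std::vector<int>> epsilonNN(const std::vector<Particle>& p, float eps, float l) {
    // 1) Bin particles into grid cells.
    std::map<CellIdx, std::vector<int>> grid;
    for (int i = 0; i < (int)p.size(); ++i)
        grid[cellOf(p[i], l)].push_back(i);

    // 2) For each particle, scan the 3x3x3 block of cells around it.
    std::vector<std::vector<int>> neighbors(p.size());
    const float eps2 = eps * eps;
    for (int i = 0; i < (int)p.size(); ++i) {
        CellIdx c = cellOf(p[i], l);
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz) {
                    auto it = grid.find({ c[0] + dx, c[1] + dy, c[2] + dz });
                    if (it == grid.end()) continue;
                    for (int j : it->second) {
                        if (j == i) continue;
                        float ddx = p[i].x - p[j].x;
                        float ddy = p[i].y - p[j].y;
                        float ddz = p[i].z - p[j].z;
                        if (ddx * ddx + ddy * ddy + ddz * ddz <= eps2)
                            neighbors[i].push_back(j);
                    }
                }
    }
    return neighbors;
}
```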
In-Core Algorithm (Data < Video Memory)
• Grid data and particle data are kept in the GPU's video memory; the GPU runs ε-NN and returns the results to main memory (CPU side)
• Assumption: main memory is large enough
  – A CPU-side system can be equipped with up to 4 TB
Data > Video Memory
• When the grid and particle data no longer fit in the GPU's video memory, the in-core approach breaks down
Out-of-Core Algorithm
• Divide the grid into sub-grids (blocks)
• Stream one block's grid and particle data at a time into video memory, run ε-NN on the GPU, and copy the results back to main memory (CPU side)
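A minimal sketch of this block-by-block loop, with hypothetical placeholder functions standing in for the GPU transfers and the neighbor-search kernel; the real system's data layout and GPU code are more involved.

```cpp
#include <vector>

// Hypothetical data layout for one sub-grid (block); the real block also carries
// its grid cells and boundary information.
struct Block { std::vector<int> particleIds; };

// Placeholder GPU-side steps (not a real API): in the actual system these would be
// device transfers and a neighbor-search kernel restricted to the block.
void uploadBlockToGPU(const Block&) { /* copy the block's grid + particle data to video memory */ }
void runEpsilonNNKernel(const Block&) { /* GPU epsilon-NN over the block's particles */ }
void downloadResults(const Block&, std::vector<std::vector<int>>&) { /* copy neighbor lists back */ }

// One neighbor-search pass: blocks are sized so that each fits in video memory
// (see the memory-estimation slides later), and are processed one at a time.
void outOfCoreNeighborSearch(const std::vector<Block>& blocks,
                             std::vector<std::vector<int>>& neighbors) {
    for (const Block& b : blocks) {
        uploadBlockToGPU(b);
        runEpsilonNNKernel(b);
        downloadResults(b, neighbors);   // results accumulate in CPU main memory
    }
}
```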
Boundary Region
• Particles near a block boundary require data from adjacent blocks
• Such regions are inefficient to handle in an out-of-core manner
• Multi-core CPUs handle the boundary regions (see the sketch below)
  – CPU (main) memory contains all required data
  – The boundary regions are usually much smaller than the inner regions
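A rough sketch of this CPU/GPU work split, assuming hypothetical helpers (isInBoundaryRegion, cpuNeighborSearch) and OpenMP for the multi-core CPU side; the actual classification and CPU kernel are more elaborate than shown.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

struct Particle { float x, y, z; };

// Placeholder: true if the particle lies close enough to a block boundary that its
// neighborhood spans more than one block.
bool isInBoundaryRegion(const Particle&) { return false; }

// Placeholder: CPU epsilon-NN for one particle using the full grid that resides
// in main memory.
void cpuNeighborSearch(int, const std::vector<Particle>&, std::vector<int>&) {}

// Split the work: boundary-region particles go to the multi-core CPUs (which can
// see all data in main memory); inner-region particles go to the out-of-core GPU path.
void splitWork(const std::vector<Particle>& particles,
               std::vector<int>& gpuWork,
               std::vector<std::vector<int>>& neighbors) {
    std::vector<int> boundary;
    for (int i = 0; i < (int)particles.size(); ++i) {
        if (isInBoundaryRegion(particles[i])) boundary.push_back(i);
        else                                  gpuWork.push_back(i);
    }

    #pragma omp parallel for
    for (int k = 0; k < (int)boundary.size(); ++k)
        cpuNeighborSearch(boundary[k], particles, neighbors[boundary[k]]);
}
```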
How to Divide the Grid?
• Goal: find the largest block that fits in the GPU memory
  – Improves parallel computing efficiency
    • Processes a large number of particles at once
    • Minimizes data transfer overhead
  – Reduces the boundary region
    • As the ratio of boundary regions increases, the CPU workload increases
Required Memory Size for Processing a Block B

$S_B = n_B\, S_p + S_n \sum_{p_i \in B} n_{p_i}$

– $n_B$: the number of particles in B
– $n_{p_i}$: the number of neighbor particles of particle i ($p_i$)
– $S_p$: data size for storing a particle
– $S_n$: data size for storing one neighbor-information entry
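A small worked example of this formula, with illustrative byte sizes ($S_p$ = 32, $S_n$ = 4) and counts that are assumptions rather than values from the talk.

```cpp
// Compute S_B = n_B * S_p + S_n * sum_i n_{p_i} for one block.
#include <cstddef>
#include <cstdio>
#include <vector>

// S_p: bytes stored per particle (e.g., position + velocity + attributes).
// S_n: bytes stored per neighbor entry (e.g., a particle index).
std::size_t blockMemorySize(const std::vector<std::size_t>& neighborCounts,  // n_{p_i} for each particle in B
                            std::size_t S_p = 32, std::size_t S_n = 4) {
    std::size_t total = neighborCounts.size() * S_p;   // n_B * S_p
    for (std::size_t n_pi : neighborCounts)
        total += S_n * n_pi;                           // S_n * sum of neighbor counts
    return total;
}

int main() {
    // Example: a block with 1 M particles, ~30 neighbors each.
    std::vector<std::size_t> counts(1'000'000, 30);
    std::printf("S_B = %zu bytes\n", blockMemorySize(counts));  // ~152 MB with the assumed sizes
}
```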
Hierarchical Work Distribution
• Build a workload tree over the grid
  – Each node stores the number of particles and the number of neighbors in its block
• Front nodes: the largest blocks satisfying $S_B <$ GPU memory become the work units (a sketch follows below)
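The sketch below illustrates front-node selection under stated assumptions: blocks are refined with a simple binary longest-axis split (the actual workload tree may split differently), and estimateMemory is a dummy stand-in for the estimation model on the following slides.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct Block { int lo[3], hi[3]; };   // half-open range of grid cells [lo, hi)

std::size_t estimateMemory(const Block& b) {
    // Placeholder: the real estimator applies S_B = n_B*S_p + S_n*sum(n'_{p_i}) + S_Aux
    // to the particles and cells covered by b; here we charge a dummy per-cell cost.
    std::size_t cells = (std::size_t)(b.hi[0] - b.lo[0]) * (b.hi[1] - b.lo[1]) * (b.hi[2] - b.lo[2]);
    return cells * 1024;
}

std::pair<Block, Block> splitLongestAxis(const Block& b) {
    int axis = 0;
    for (int a = 1; a < 3; ++a)
        if (b.hi[a] - b.lo[a] > b.hi[axis] - b.lo[axis]) axis = a;
    Block left = b, right = b;
    int mid = (b.lo[axis] + b.hi[axis]) / 2;
    left.hi[axis] = mid;
    right.lo[axis] = mid;
    return {left, right};
}

// Recursively split a block until its estimated footprint fits in GPU memory;
// the surviving (front) nodes become the GPU work units.
void collectFrontNodes(const Block& b, std::size_t gpuMemBytes, std::vector<Block>& front) {
    bool splittable = (b.hi[0] - b.lo[0] > 1) || (b.hi[1] - b.lo[1] > 1) || (b.hi[2] - b.lo[2] > 1);
    if (estimateMemory(b) < gpuMemBytes || !splittable) {
        front.push_back(b);                     // S_B < GPU memory: becomes a work unit
        return;
    }
    auto halves = splitLongestAxis(b);          // otherwise split and recurse
    collectFrontNodes(halves.first, gpuMemBytes, front);
    collectFrontNodes(halves.second, gpuMemBytes, front);
}
```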
Chicken-and-Egg Problem
• Computing $S_B = n_B\, S_p + S_n \sum_{p_i \in B} n_{p_i}$ requires the neighbor counts $n_{p_i}$, which are exactly what the ε-NN step is supposed to compute
• Our approach: estimate the number of neighbors for the particles
Problem Formulation
• Assumption: particles are uniformly distributed within a cell
• Idea: for a particle p, the expected number of neighbors contributed by a cell is proportional to the overlap volume between the search sphere S(p, ε) and the cell, weighted by the number of particles in the cell
Expected Number of Neighbors of a particle p located at (x, y, z)

$E_{p_{x,y,z}} = \sum_i n_{C_i} \cdot \frac{Overlap_S(p_{x,y,z}, \varepsilon, C_i)}{V(C_i)}$

– $C_i$: the cell of $p_{x,y,z}$ and its adjacent cells
– $n_{C_i}$: the number of particles in the cell
– $Overlap_S(p_{x,y,z}, \varepsilon, C_i)$: the overlap volume between the search sphere and the cell
– $V(C_i)$: the volume of the cell
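A Monte Carlo sketch of this formula for a single particle, assuming a hypothetical cellCount lookup for $n_{C_i}$; per-particle sampling like this is only illustrative, since the method on the next slides avoids it by precomputing the overlap integrals.

```cpp
// Estimate E_p = sum_i n_{C_i} * Overlap(S(p,eps), C_i) / V(C_i) by sampling points
// uniformly inside the search sphere; each sample contributes the particle density
// of the cell it lands in.
#include <cmath>
#include <functional>
#include <random>

// cellCount(cx, cy, cz): number of particles n_C in that grid cell (assumed lookup).
double expectedNeighbors(double px, double py, double pz, double eps, double l,
                         const std::function<int(int, int, int)>& cellCount,
                         int samples = 100000) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u(-eps, eps);
    const double pi = 3.14159265358979323846;
    const double cellVolume = l * l * l;
    const double sphereVolume = 4.0 / 3.0 * pi * eps * eps * eps;

    double sum = 0.0;
    int accepted = 0;
    while (accepted < samples) {
        double dx = u(rng), dy = u(rng), dz = u(rng);
        if (dx * dx + dy * dy + dz * dz > eps * eps) continue;  // rejection sampling in the sphere
        ++accepted;
        int cx = (int)std::floor((px + dx) / l);
        int cy = (int)std::floor((py + dy) / l);
        int cz = (int)std::floor((pz + dz) / l);
        sum += cellCount(cx, cy, cz) / cellVolume;              // n_{C_i} / V(C_i)
    }
    return sphereVolume * sum / accepted;                       // E_p
}
```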
Problem Formulation
• Computing $E_{p_{x,y,z}}$ for each particle individually incurs a high computational overhead
• Instead (approximation):
  – Compute the average $E_{p_{x,y,z}}$ over the particles in a cell
  – Use that value for all particles in the cell
The Average Expected Number of Neighbors of the particles in a cell $C_q$

$E_{C_q} = \frac{1}{V(C_q)} \int_0^l \int_0^l \int_0^l E_{p_{x,y,z}}\, dx\, dy\, dz$

– $l$ is the length of a cell along each dimension
– $p_{x,y,z}$ is a particle positioned at (x, y, z) in the local coordinate space of $C_q$
– Expensive to compute at runtime
The Average Expected Number of Neighbors of the particles in a cell $C_q$

$E_{C_q} = \frac{1}{V(C_q)} \int_0^l \int_0^l \int_0^l E_{p_{x,y,z}}\, dx\, dy\, dz = \frac{1}{V(C_q)} \sum_i \frac{n_{C_i}}{V(C_i)} \cdot E(C_q, C_i)$

where $E(C_q, C_i) = \int_0^l \int_0^l \int_0^l Overlap_S(p_{x,y,z}, \varepsilon, C_i)\, dx\, dy\, dz$
The Average Expected Number of Neighbors of the particles in a cell $C_q$
• Pre-compute $E(C_q, C_i) = \int_0^l \int_0^l \int_0^l Overlap_S(p_{x,y,z}, \varepsilon, C_i)\, dx\, dy\, dz$
  – The value depends only on the ratio between $l$ and $\varepsilon$
  – $l$ and $\varepsilon$ are not changed frequently by the user
  – Computed with the Monte Carlo method using many samples (e.g., 1 M)
• Use a look-up table at runtime (a sketch follows below)
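A sketch of how such a look-up table could be precomputed with Monte Carlo sampling, under the assumption that ε < l so only the 27 cells around $C_q$ can overlap the search sphere; the function name and sample count are illustrative.

```cpp
// Precompute E(C_q, C_i) for C_q and its 26 adjacent cells (indices -1..1 mapped to 0..2).
#include <array>
#include <cmath>
#include <random>

using OverlapTable = std::array<std::array<std::array<double, 3>, 3>, 3>;

OverlapTable precomputeOverlapTable(double l, double eps, int samples = 1000000) {
    std::mt19937 rng(7);
    std::uniform_real_distribution<double> inCell(0.0, l);
    std::uniform_real_distribution<double> inCube(-eps, eps);
    const double pi = 3.14159265358979323846;
    const double sphereVol = 4.0 / 3.0 * pi * eps * eps * eps;

    std::array<std::array<std::array<long, 3>, 3>, 3> hits{};   // zero-initialized counters
    for (int s = 0; s < samples; ++s) {
        // Particle position p, uniform in the cell C_q = [0, l)^3.
        double px = inCell(rng), py = inCell(rng), pz = inCell(rng);
        // Offset uniform in the search sphere S(p, eps), via rejection sampling.
        double ox, oy, oz;
        do { ox = inCube(rng); oy = inCube(rng); oz = inCube(rng); }
        while (ox * ox + oy * oy + oz * oz > eps * eps);
        // Which neighbor cell does the sampled point fall into? (eps < l => -1, 0, or 1)
        int cx = (int)std::floor((px + ox) / l);
        int cy = (int)std::floor((py + oy) / l);
        int cz = (int)std::floor((pz + oz) / l);
        ++hits[cx + 1][cy + 1][cz + 1];
    }

    // E(C_q, C_i) = V(C_q) * V(S) * Pr[sampled point lands in C_i]
    OverlapTable table{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                table[i][j][k] = (l * l * l) * sphereVol * hits[i][j][k] / samples;
    return table;
}
```

At runtime, $E_{C_q}$ is then just a 27-term weighted sum of the table entries, $E_{C_q} = \frac{1}{V(C_q)} \sum_i \frac{n_{C_i}}{V(C_i)} E(C_q, C_i)$, as in the formula on the previous slide.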
Validation
• Correlation = 0.97
• Root mean square error (RMSE) = 3.7
Chicken-and-Egg Problem

$S_B = n_B\, S_p + S_n \sum_{p_i \in B} n'_{p_i} + S_{Aux}$

– $n'_{p_i}$: the expected number of neighbors of $p_i$
– $S_{Aux} = 3.7 \cdot n_B\, S_n$: auxiliary space to cover the estimation error (3.7 is the RMSE)
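A small worked example of this final estimate, again with illustrative byte sizes and counts that are assumptions, not values from the talk.

```cpp
// Estimated block size: S_B = n_B*S_p + S_n*sum_i n'_{p_i} + S_Aux,
// with S_Aux = 3.7 * n_B * S_n (padding by the RMSE of the neighbor estimator).
#include <cstddef>
#include <cstdio>

std::size_t estimatedBlockSize(std::size_t n_B, double avgExpectedNeighbors,
                               std::size_t S_p = 32, std::size_t S_n = 4,
                               double rmse = 3.7) {
    double particles = (double)n_B * S_p;                 // n_B * S_p
    double neighbors = S_n * avgExpectedNeighbors * n_B;  // S_n * sum of expected neighbor counts
    double aux       = rmse * n_B * S_n;                  // auxiliary space for estimation error
    return (std::size_t)(particles + neighbors + aux);
}

int main() {
    // Example: 1 M particles with ~30 expected neighbors each.
    std::printf("estimated S_B = %zu bytes\n", estimatedBlockSize(1'000'000, 30.0));
}
```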
Results
• Testing environment
  – Two hexa-core CPUs
  – 192 GB main memory (CPU side)
  – One GPU (GeForce GTX 780) with 3 GB video memory
Results
• Map-GPU: NVIDIA's mapped memory technique, which maps CPU memory into the GPU memory address space
• Ours: handles up to 65.6 M particles (maximum data size: 13 GB)
Results
• 15.8 M particles (maximum data size: 6 GB)
• Up to 32.7 M particles (maximum data size: 16 GB)
Results
• Our method (a CPU core + one GPU) vs. Map-GPU: up to 26X and 51X higher performance
• Our method vs. 12 CPU cores: up to 8.4X and 6.3X higher performance
Conclusion
• Proposed an out-of-core ε-NN algorithm for particle-based fluid simulation
  – Utilizes heterogeneous computing resources
  – Utilizes GPUs in an out-of-core manner
  – Includes a hierarchical work distribution method
• Presented a novel memory estimation method
  – Based on the expected number of neighbors
• Handled a large number of particles
• Achieved much higher performance than a naïve OOC-GPU approach
Future Work
• Extend to support multiple GPUs
• Improve parallelization efficiency by employing an optimization-based approach
• Extend to other applications
Thanks! Any questions? (bluekdct@gmail.com)
Project homepage: http://sglab.kaist.ac.kr/OOCNNS
– The benchmark scenes are available on the homepage
Benefits of Our Memory Estimation Model
• Comparison: a fixed-space allocation vs. our estimation model
Benefits of Hierarchical Workload Distribution
• A larger block size yields better performance
  – E.g., 32³ and 64³ block sizes take 22% and 30% less GPU processing time, respectively, than 16³ blocks on average
Benefits of Hierarchical Workload Distribution
• However, the maximal block size varies across benchmarks and across regions of a scene
• Compared with a manually set, fixed block size based on our estimation model, the hierarchical approach shows 33% higher performance on average