Outline
- Problem Definition
- Summary of Work Done
- Space Filling Curves
- Bottom-Up Octree Construction on the GPU
- Timing Results
- The Fast Multipole Method
- Timing and Quality Results
- Conclusion
Problem Definition
To provide an efficient, parallel, GPU-based global illumination solution for point models that is many times faster than the corresponding CPU implementation.
INPUT: A 3-D point model with attributes such as 3-D coordinates, default surface diffuse color, emissivity, and surface normals.
OUTPUT: A fast parallel global illumination solution showing effects like color bleeding and soft shadows.
FMM for Global Illumination on GPU?
Global illumination is an N-body problem: each particle is affected by the presence of all other particles, so the computation is quadratic in nature. The input data (point models in our case) is very large, with more than 10^5 particles. Direct computation on the GPU is not feasible because of the high memory requirements needed to exploit the available parallelism.
The FMM solves the quadratic N-body problem efficiently, in linear time, by
a) approximating the solution to a user-defined accuracy, and
b) using a hierarchical data structure (in our case, the octree).
Contributions
1. View-independent visibility on GPU: a fast method to calculate visibility between all point pairs in parallel; required for correct global illumination. Submitted to ICVGIP, 2008 as an oral paper.
2. Octree construction on GPU:
   a) Non-adaptive, top-down: very fast but memory inefficient; parent-child relations calculated using direct SFC indexing. Published as a poster at I3D, 2008.
   b) Adaptive, top-down: fast.
   c) Adaptive, bottom-up: memory efficient compared to the non-adaptive version; supports post-order traversal, locating the leaf cell containing a queried point, least common ancestor of two cells, etc. Intend to submit as a paper for consideration.
3. FMM on GPU: fast parallel global illumination for point models. Intend to submit as a paper for consideration.
Acknowledgements: Rhushabh Goradia, Prof. Srinivas Aluru
Space Filling Curves
Consider the recursive bisection of a 2-D area into non-overlapping cells of equal size. A space filling curve (SFC) is a mapping of these cells to a one-dimensional linear ordering. We consider the z-sfc, or Morton ordering, in which the index of the cell with coordinates (x, y) is obtained by interleaving the bits of x and y (sketched below).
[Figure: Z-SFC for k = 2]
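As a concrete illustration (an addition, not from the slides): for the 3-D case used by the octree, three coordinates are interleaved instead of two. A minimal CUDA sketch of the standard bit-spreading trick; the function names are ours:

    #include <cstdint>

    // Spread the low 10 bits of v so there are two zero bits between each bit.
    __host__ __device__ inline uint32_t expandBits(uint32_t v) {
        v = (v * 0x00010001u) & 0xFF0000FFu;
        v = (v * 0x00000101u) & 0x0F00F00Fu;
        v = (v * 0x00000011u) & 0xC30C30C3u;
        v = (v * 0x00000005u) & 0x49249249u;
        return v;
    }

    // z-sfc (Morton) index of the cell at integer coordinates (x, y, z):
    // interleave the bits of the three coordinates.
    __host__ __device__ inline uint32_t morton3D(uint32_t x, uint32_t y, uint32_t z) {
        return (expandBits(x) << 2) | (expandBits(y) << 1) | expandBits(z);
    }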
Octrees and SFCs
1. Octrees can be viewed as multiple SFCs at various resolutions.
2. A parent's SFC index can be generated from a child's SFC index.
3. To establish a total order on the cells of an octree, given 2 cells:
   a) if one is contained in the other, the subcell is taken to precede the supercell;
   b) if disjoint, order them according to the order of the immediate subcells of the smallest supercell enclosing them.
The resulting linearization is identical to a post-order traversal of the octree. Points 2 and 3 are sketched below.
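A hedged sketch of points 2 and 3 (the Cell struct and function names are assumptions): each octree level consumes 3 bits of the z-sfc index, so the parent is a 3-bit right shift, and two cells are compared by lifting the deeper one to the shallower one's level.

    #include <cstdint>

    // A cell is its z-sfc code plus the level it lives at (root = level 0).
    struct Cell { uint32_t code; int level; };

    // Point 2: the parent drops the 3 bits contributed by the finest level.
    __host__ __device__ inline Cell parentOf(Cell c) {
        return Cell{ c.code >> 3, c.level - 1 };
    }

    // Point 3: subcells precede supercells; disjoint cells follow the
    // z-order of their ancestors at a common level.
    __host__ __device__ inline bool precedes(Cell a, Cell b) {
        uint32_t ca = a.code, cb = b.code;
        if (a.level > b.level) ca >>= 3 * (a.level - b.level);  // lift a
        else                   cb >>= 3 * (b.level - a.level);  // lift b
        if (ca != cb) return ca < cb;   // disjoint: order by lifted code
        return a.level > b.level;       // nested: the deeper subcell first
    }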
Octree chains
[Figure: an octree with its cells numbered 1-10, illustrating chains of single-child nodes]
Compressed Octrees
Each node in a compressed octree is either a leaf or has at least 2 children.
[Figure: the octree of the previous slide and its compressed form, cells numbered 1-10]
Memory Efficient Bottom-Up Octree on the GPU
Intuitions
INPUT: n points (say, a bunny) belonging to some 3-D domain.
OUTPUT: Octree represented in post-order, with parent-child relationships established.
BOTTOM-UP TRAVERSAL: Since every internal node in an octree has leaves in its subtree, given the leaves we can decode this hierarchical inheritance information and generate the internal nodes.
PARALLEL STRATEGY: Each internal node can be viewed as the LCA (least common ancestor) of some particular leaf pair in a compressed octree. Given the leaves, generation of internal nodes can therefore be parallelized, since each node can be generated independently from a leaf pair (see the sketch below). Many leaf pairs may have the same LCA, producing duplicate nodes, which are easily detected and removed. Parent-child relationships can then be established, and the full octree generated from the compressed octree, using SFC indices across multiple levels.
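A minimal sketch of the leaf-pair strategy (the fixed 32-bit codes and names are assumptions, not the thesis code): with the leaves sorted by z-sfc index, each thread takes one adjacent leaf pair and emits the pair's LCA; a subsequent sort-and-unique pass removes the duplicates.

    #include <cstdint>

    // LCA of two leaf codes at depth maxLevel: strip 3 bits per level
    // until the codes coincide; the shared prefix is the ancestor.
    __device__ void lca(uint32_t a, uint32_t b, int maxLevel,
                        uint32_t* code, int* level) {
        int l = maxLevel;
        while (a != b) { a >>= 3; b >>= 3; --l; }
        *code = a; *level = l;
    }

    // One thread per adjacent leaf pair; leafCodes is sorted by z-sfc index.
    __global__ void emitInternalNodes(const uint32_t* leafCodes, int numLeaves,
                                      int maxLevel,
                                      uint32_t* nodeCodes, int* nodeLevels) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= numLeaves - 1) return;
        lca(leafCodes[i], leafCodes[i + 1], maxLevel,
            &nodeCodes[i], &nodeLevels[i]);
    }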
Results: Bunny (124,531 points)

Tree level   GPU (ms)   CPU (ms)
5            1218       1101
6            1482       1692
7            2041       2621
8            2501       4291
9            3669       9645
Results: Ganesha (165,646 points)

Tree level   GPU (ms)   CPU (ms)
5            1463       1200
6            1762       1981
7            2396       2965
8            2923       4691
9            4501       8945
Fast Multipole Method
Fast Multipole Method
The FMM is concerned with evaluating the effect of a set of sources X = {x_1, ..., x_N}, with strengths q_1, ..., q_N, on a set of evaluation points Y = {y_1, ..., y_M}. More formally, given a kernel \Phi, we wish to evaluate the sum
    \phi(y_j) = \sum_{i=1}^{N} q_i \Phi(x_i, y_j),   j = 1, ..., M
Total complexity: O(NM).
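For reference, a hedged sketch of the direct O(NM) evaluation that the FMM replaces; a generic 1/r kernel stands in for the actual light transport kernel, and all names are ours:

    #include <cuda_runtime.h>
    #include <cmath>

    struct Source { float x, y, z, q; };   // position and strength

    // One thread per evaluation point; the inner loop over all N sources
    // is what makes the direct method quadratic.
    __global__ void directSum(const Source* src, int n,
                              const float3* eval, float* phi, int m) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= m) return;
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) {
            float dx = eval[j].x - src[i].x;
            float dy = eval[j].y - src[i].y;
            float dz = eval[j].z - src[i].z;
            float r  = sqrtf(dx * dx + dy * dy + dz * dz);
            if (r > 0.0f) sum += src[i].q / r;   // placeholder kernel q / r
        }
        phi[j] = sum;
    }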
The FMM attempts to reduce this complexity to O(N + M). The two main insights that make this possible are:
- factorization of the kernel into source and receiver terms, and
- the fact that many application domains do not require the function to be calculated at high accuracy.
The FMM follows a hierarchical, tree-based evaluation strategy: each node of the octree has an associated multipole expansion (summarizing the sources it contains) and local expansion (summarizing the far field acting on it).
FMM: Building Interaction Lists
Each node has two kinds of interaction lists from which the transfer of energy takes place:
- Far cell list
- Near cell list
There is no far cell list at levels 0 and 1, since at those levels every cell is a near neighbor of every other. Transfer of energy from near neighbors happens only for leaves. A construction sketch follows.
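A hedged host-side sketch of the classic construction (the helpers neighbors(), children(), parentOf(), and isAdjacent() are assumptions, defined elsewhere): the far list of a cell c holds the children of c's parent's neighbors (and siblings of c) that are not themselves adjacent to c; the adjacent ones form the near list.

    #include <vector>

    bool isAdjacent(int a, int b);          // assumed geometric adjacency test
    std::vector<int> neighbors(int cell);   // assumed: adjacent same-level cells
    std::vector<int> children(int cell);    // assumed child enumeration
    int parentOf(int cell);                 // assumed parent lookup

    void buildLists(int c, std::vector<int>& farList, std::vector<int>& nearList) {
        std::vector<int> candidates = neighbors(parentOf(c));
        candidates.push_back(parentOf(c));  // siblings of c are candidates too
        for (int p : candidates) {
            for (int k : children(p)) {
                if (k == c) continue;
                if (isAdjacent(k, c)) nearList.push_back(k); // near: direct transfer
                else                  farList.push_back(k);  // far: via expansions
            }
        }
    }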
FMM Algorithm: Step 1 — compute multipole expansions at the leaves
Step 1: GPU implementation
PARALLELIZATION STRATEGIES
1) Multiple threads per leaf (one thread per particle): each thread produces a multipole expansion for one particle in the leaf.
   Drawbacks:
   a) After generation, the per-particle expansions must be consolidated, which necessitates data transfer through GPU global memory (expensive).
   b) The thread block size is fixed on the GPU at run time, so some threads may remain idle.
2) One thread per leaf: one thread produces the full multipole expansion for the entire leaf (sketched below).
   Advantage: the work of each thread is completely independent, so there is no need for shared memory. Even when leaves contain different numbers of particles, a thread that finishes one leaf simply takes care of another, without waiting or synchronizing with other threads.
   Drawback: to realize the full GPU load, the number of leaves should be sufficiently large (at least 8192).
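A hedged sketch of the one-thread-per-leaf strategy. The coefficient count P, the array layout, and the helper accumulateMultipole() are assumptions; the actual expansion terms depend on the kernel.

    #include <cuda_runtime.h>

    #define P 16   // multipole coefficients per leaf (assumption)

    struct Particle { float x, y, z, q; };

    // Assumed helper: adds particle p's kernel-specific expansion terms
    // about the leaf center into the coefficient array c.
    __device__ void accumulateMultipole(float* c, const Particle& p, float3 center);

    // One thread builds the full expansion of one leaf, so no shared memory
    // or reduction is needed; the grid-stride loop hands a finished thread
    // its next leaf without any synchronization.
    __global__ void particleToMultipole(const Particle* particles,
                                        const int* leafStart, const int* leafCount,
                                        const float3* leafCenter,
                                        float* coeffs, int numLeaves) {
        for (int leaf = blockIdx.x * blockDim.x + threadIdx.x;
             leaf < numLeaves;
             leaf += gridDim.x * blockDim.x) {
            float local[P] = {0.0f};
            for (int i = 0; i < leafCount[leaf]; ++i)
                accumulateMultipole(local, particles[leafStart[leaf] + i],
                                    leafCenter[leaf]);
            for (int t = 0; t < P; ++t) coeffs[leaf * P + t] = local[t];
        }
    }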
FMM Algorithm: Step 2 — upward pass (multipole-to-multipole translation), for each level l = l_max - 1, ..., 2
Step 2: GPU implementation
PARALLELIZATION STRATEGY
Iterate from the last level up to the root (the root is at level 0). For every level, allocate one thread per parent node; each thread performs the multipole-to-multipole translations for all of that parent's children (sketched below).
Drawback: the GPU load becomes very small at low l_max (maximum number of levels). However, the upward pass is not highly compute intensive compared to the downward pass: on the CPU, the upward pass takes only about 1% of the time taken by the downward pass.
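A hedged sketch of the per-level launch (one thread per parent; translateM2M() and the per-level bookkeeping helpers are assumptions):

    #define P 16   // multipole coefficients per node, as above (assumption)

    // Assumed helper: shift a child expansion into its parent's frame and add.
    __device__ void translateM2M(float* parent, const float* child);

    __global__ void multipoleToMultipole(const int* firstChild, const int* childCount,
                                         const float* childCoeffs, float* parentCoeffs,
                                         int numParents) {
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= numParents) return;
        for (int c = 0; c < childCount[n]; ++c)
            translateM2M(parentCoeffs + n * P,
                         childCoeffs + (firstChild[n] + c) * P);
    }

    int nodesAtLevel(int l);            // assumed helpers, defined elsewhere
    const int* firstChildAt(int l);
    const int* childCountAt(int l);
    float* coeffsAt(int l);

    // Host side: one launch per level, deepest first. The few parents at
    // coarse levels are exactly the low-GPU-load drawback noted above.
    void upwardPass(int lmax) {
        for (int l = lmax - 1; l >= 2; --l) {
            int n = nodesAtLevel(l);
            multipoleToMultipole<<<(n + 255) / 256, 256>>>(
                firstChildAt(l), childCountAt(l), coeffsAt(l + 1), coeffsAt(l), n);
        }
    }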
FMM Algorithm: Step 3 — multipole-to-local (M2L) translation, the most expensive step of the algorithm
PARALLELIZATION: Iterate from level 2 to the last level; at each level, compute the multipole-to-local translation for every node at the current level in parallel (sketched below).
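A hedged sketch of the per-node M2L kernel; translateM2L() and the CSR-style far-list layout (farStart, farList) are assumptions.

    #define P 16   // expansion coefficients per node, as above (assumption)

    // Assumed helper: convert a far cell's multipole expansion into a
    // contribution to this node's local expansion.
    __device__ void translateM2L(float* local, const float* multipole);

    // One thread per node at the current level: each thread walks its
    // node's far cell list, stored CSR-style.
    __global__ void multipoleToLocal(const int* farStart, const int* farList,
                                     const float* multipole, float* local,
                                     int numNodes) {
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= numNodes) return;
        for (int k = farStart[n]; k < farStart[n + 1]; ++k)
            translateM2L(local + n * P, multipole + farList[k] * P);
    }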
FMM Algorithm: Step 4 — local-to-local (L2L) translation
PARALLELIZATION: Iterate from level 2 to the last level; at each level, compute the local-to-local translation for every node at the current level in parallel.
FMM Algorithm: Step 5 — evaluation (only for leaves of the tree)
PARALLELIZATION: Iterate from level 2 to the last level; if a node is a leaf, evaluate its local expansion in parallel with the other leaves at the current level.
PARALLELIZATION: Each thread performs all near-neighbor computations for a particular leaf.
A combined sketch follows.
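A hedged sketch combining the two leaf-level jobs, one thread per leaf; evaluateLocal(), directLeafSum(), and the CSR near-list layout are assumptions.

    #define P 16   // expansion coefficients per node, as above (assumption)

    struct Particle { float x, y, z, q; };

    __device__ float evaluateLocal(const float* local, Particle p);   // assumed
    __device__ float directLeafSum(const Particle* ps, int start, int count,
                                   Particle target);                  // assumed

    // Far field comes from the leaf's local expansion; near field from
    // direct summation over the leaf's near cell list.
    __global__ void evaluateLeaves(const Particle* particles,
                                   const int* leafStart, const int* leafCount,
                                   const float* local,
                                   const int* nearStart, const int* nearList,
                                   float* phi, int numLeaves) {
        int leaf = blockIdx.x * blockDim.x + threadIdx.x;
        if (leaf >= numLeaves) return;
        for (int i = 0; i < leafCount[leaf]; ++i) {
            int p = leafStart[leaf] + i;
            float sum = evaluateLocal(local + leaf * P, particles[p]);
            for (int k = nearStart[leaf]; k < nearStart[leaf + 1]; ++k) {
                int nb = nearList[k];   // index of a neighboring leaf
                sum += directLeafSum(particles, leafStart[nb], leafCount[nb],
                                     particles[p]);
            }
            phi[p] = sum;
        }
    }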
Results: Quality Comparisons — Bunny (124,531 points)
[Figure: rendered Bunny, CPU result vs. GPU result]
Results: Quality Comparisons — Ganesha (165,646 points)
[Figure: rendered Ganesha, CPU result vs. GPU result]
Results: Timing Comparisons (without visibility) — Bunny (124,531 points)

Points per leaf   GPU (hr)   CPU (hr)   Speedup
200               1.01       15.96      15.8
150               1.09       19.18      17.6
100               1.16       21.11      18.2
50                1.21       23.81      19.5
25                1.30       25.87      19.9
Results: Timing Comparisons (without visibility) — Ganesha (165,646 points)

Points per leaf   GPU (hr)   CPU (hr)   Speedup
200               1.11       14.54      13.1
150               1.16       16.58      14.3
100               1.21       20.81      17.2
50                1.28       23.15      18.1
25                1.41       26.37      18.7
Conclusion
1. Non-adaptive octree construction (speedups of up to 500 times)
2. Adaptive octree construction (speedups of up to 3 times)
3. FMM on the GPU for global illumination (speedups of up to 20 times)
Future Work: combine all three of the above on the GPU to make a complete system.
References
L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. Journal of Computational Physics, 73:325-348, 1987.
J. Carrier, L. Greengard, and V. Rokhlin. A Fast Adaptive Multipole Algorithm for Particle Simulations. SIAM Journal on Scientific and Statistical Computing, 9:669-686, July 1988.
R. Beatson and L. Greengard. A Short Course on Fast Multipole Methods.
H. Sagan. Space-Filling Curves. Springer-Verlag, 1994.
S. Seal and S. Aluru. Spatial Domain Decomposition Methods in Parallel Scientific Computing. Book chapter.
N. A. Gumerov and R. Duraiswami. Fast Multipole Method on Graphics Processors. AstroGPU 2007.
J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1):80-113, 2007.
NVIDIA CUDA Programming Guide. http://developer.nvidia.com/cuda