Outline
Problem Definition
Overview of FMM
Parallel FMM
Space Filling Curves and Compressed Octrees
Parallel Compressed Octrees
Computing Translations
Octree Textures on GPU
Problem Definition
To implement the Parallel Fast Multipole Method (FMM) on graphics hardware
Parallel FMM using multiprocessors (already done)
FMM using GPUs (to be done)
Fast Multipole Method
The FMM is concerned with evaluating the effect of a "set of sources" on a set of "evaluation points".
More formally, given sources x_1, ..., x_N with strengths q_1, ..., q_N and evaluation points y_1, ..., y_M, we wish to evaluate the sums
f(y_j) = \sum_{i=1}^{N} q_i K(y_j, x_i),  j = 1, ..., M
Total complexity of direct evaluation: O(NM)
FMM attempts to reduce this complexity to O(N + M)
The two main insights that make this possible are:
a factorization (degenerate, separable expansion) of the kernel into source and receiver terms
many application domains do not require the function to be calculated at high accuracy
FMM follows a hierarchical (quadtree/octree) decomposition of the domain; each node has associated multipole and local expansions
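As an illustration of the first insight, here is a sketch in generic notation (the symbols A_p, B_p and the number of terms P are mine, not the deck's):

```latex
% Degenerate (separable) expansion of the kernel: P terms split the dependence
% on the evaluation point y from the dependence on the source point x.
K(y, x) \approx \sum_{p=1}^{P} A_p(y)\, B_p(x)
\qquad\Longrightarrow\qquad
f(y_j) \approx \sum_{p=1}^{P} A_p(y_j) \sum_{i=1}^{N} q_i\, B_p(x_i)
```

The inner sums over the sources do not depend on the evaluation point, so they are computed once and reused, giving roughly O(P(N + M)) work instead of O(NM).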
Building Interaction Lists
Each node has two kinds of interaction lists: a Far Cell List and a Near Cell List
There is no far cell list at levels 0 and 1, since every cell is a near neighbor of every other cell
Transfer of energy from near neighbors happens only for leaves (a sketch of far-list construction follows below)
Next: Passes of FMM
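A minimal sketch of far-list construction for a uniform 2D quadtree cell, using the standard definition (children of the parent's near neighbors that are not themselves near neighbors); the types and names are mine, not the deck's:

```cpp
// Far-cell ("interaction") list of a quadtree cell, given by integer
// coordinates at its level (0 .. 2^level - 1 per dimension).
#include <cstdlib>
#include <vector>

struct Cell { int x, y, level; };

static bool isNear(const Cell& a, const Cell& b) {
    return std::abs(a.x - b.x) <= 1 && std::abs(a.y - b.y) <= 1;
}

std::vector<Cell> farList(const Cell& c) {
    std::vector<Cell> far;
    if (c.level < 2) return far;            // no far list at levels 0 and 1
    int n = 1 << c.level;                   // cells per side at this level
    Cell parent{c.x / 2, c.y / 2, c.level - 1};
    for (int dx = -1; dx <= 1; ++dx)        // parent's near neighbors (incl. itself)
        for (int dy = -1; dy <= 1; ++dy) {
            int px = parent.x + dx, py = parent.y + dy;
            if (px < 0 || py < 0 || px >= n / 2 || py >= n / 2) continue;
            for (int cx = 0; cx < 2; ++cx)  // their children at c's level
                for (int cy = 0; cy < 2; ++cy) {
                    Cell cand{2 * px + cx, 2 * py + cy, c.level};
                    if (!isNear(c, cand)) far.push_back(cand);
                }
        }
    return far;                             // at most 27 cells in 2D
}
```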
FMM Algorithm (sequence of figures illustrating the passes of the algorithm)
FMM Algorithm: the direct near-neighbor computation is performed only for leaves of the quadtree (see the skeleton below)
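A skeleton of the passes these figures illustrate, with the kernel-specific operators (P2M, M2M, M2L, L2L, L2P, P2P) left as placeholder comments; the Node layout is hypothetical, not the deck's implementation:

```cpp
#include <vector>

struct Node {
    std::vector<Node*> children;            // empty for a leaf
    std::vector<Node*> farList, nearList;   // interaction lists
    std::vector<int>   particles;           // indices of contained particles (leaves)
    /* multipole and local expansion coefficients would live here */
};

void upwardPass(Node* v) {                  // compute multipole expansions bottom-up
    for (Node* c : v->children) upwardPass(c);
    if (v->children.empty()) { /* P2M: particles -> multipole of v */ }
    else                     { /* M2M: shift children's multipoles into v */ }
}

void translate(Node* v) {                   // multipole-to-local for every node
    for (Node* u : v->farList) { /* M2L: u's multipole -> v's local expansion */ }
    for (Node* c : v->children) translate(c);
}

void downwardPass(Node* v) {                // propagate local expansions top-down
    for (Node* c : v->children) { /* L2L: shift v's local into c */ downwardPass(c); }
    if (v->children.empty()) {
        /* L2P: evaluate local expansion at v's particles          */
        /* P2P: direct interactions with nearList -- leaves only   */
    }
}
```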
Outline
Space Filling Curves
Parallel Compressed Octrees
Parallel FMM: building interaction lists and computing the various translations
Bitonic Sort
Parallel Prefix Sum
Space Filling Curves
Consider the recursive bisection of a 2D area into non-overlapping cells of equal size
A space filling curve (SFC) is a mapping of these cells to a one-dimensional linear ordering
Example: Z-SFC for k = 2
SFC Construction
A comparison-based sort to order the cells is expensive, since typically the number of cells is large
Instead, represent the integer coordinates of a cell using k bits per dimension, then interleave the bits, starting from the first dimension, to form a single integer
The index of the cell with coordinates (x, y) is this interleaved integer; it takes O(k) time to find the index (see the sketch below)
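A minimal sketch of the bit-interleaving step (the names and the choice of which dimension supplies the low bit of each pair are mine):

```cpp
#include <cstdint>

// Interleave the low k bits of x and y (x as the "first dimension"
// supplying the lower bit of each pair), giving a 2k-bit Z-SFC cell index.
uint64_t zIndex(uint32_t x, uint32_t y, unsigned k) {
    uint64_t index = 0;
    for (unsigned b = 0; b < k; ++b) {
        index |= (uint64_t)((x >> b) & 1u) << (2 * b);
        index |= (uint64_t)((y >> b) & 1u) << (2 * b + 1);
    }
    return index;   // O(k) time, matching the claim above
}
```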
Outline
Space Filling Curves
Parallel Compressed Octrees
Parallel FMM: building interaction lists and computing the various translations
Bitonic Sort
Parallel Prefix Sum
Octrees
(Figure: an example set of 10 numbered points and the octree built over them)
Compressed Octrees
Each node in a compressed octree is either a leaf or has at least 2 children
(Figure: the octree for the same 10 points and its compressed version)
Encapsulating spatial information lost in compression
Store 2 cells in each node of the compressed octree
Small cell: the smallest cell that encloses all the points the node represents
Large cell: the largest cell that encloses all the points the node represents and no other points
(Figure: small and large cells for the example tree)
Octrees and SFCs
Octrees can be viewed as multiple SFCs at various resolutions
To establish a total order on the cells of an octree, given 2 cells:
if one is contained in the other, the subcell is taken to precede the supercell
if they are disjoint, order them according to the order of the immediate subcells of the smallest supercell enclosing them
The resulting linearization is identical to a postorder traversal of the octree
(Figure: a 2 x 2 grid of cells labeled 00, 01, 10, 11 at two resolutions)
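A sketch of this total order as a comparison on (Z-index, level) pairs, in my own notation (quadtree case, 2 bits per level):

```cpp
#include <cstdint>

struct CellKey { uint64_t index; unsigned level; };

bool contains(const CellKey& a, const CellKey& b) {   // does a enclose b?
    return a.level <= b.level &&
           (b.index >> (2 * (b.level - a.level))) == a.index;
}

// true if a precedes b in the SFC linearization (postorder of the tree)
bool precedes(const CellKey& a, const CellKey& b) {
    if (contains(a, b)) return false;                  // supercell comes after
    if (contains(b, a)) return true;                   // subcell comes first
    unsigned L = a.level > b.level ? a.level : b.level;
    uint64_t ia = a.index << (2 * (L - a.level));      // align both to level L
    uint64_t ib = b.index << (2 * (L - b.level));
    return ia < ib;                                    // order of the enclosing subcells
}
```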
Parallel Compressed Octree Construction
Consider n points equally distributed across p processors; let k be the pre-specified maximum resolution
For each point, generate the index of the leaf cell containing it, i.e. the cell at the maximum resolution
Parallel sort the leaf indices to compute their SFC-linearization, i.e. the left-to-right order of leaves in the compressed octree
Each processor obtains the leftmost leaf cell of the next processor. Why? So that nodes spanning the boundary between two processors are still generated
On each processor, construct a local compressed octree for the leaf cells within it and the borrowed leaf cell
Send the out-of-order nodes to the appropriate processors
Insert the received out-of-order nodes into the already existing sorted order of nodes
(A sketch of the leaf-index generation and sort appears below)
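A minimal single-processor sketch of the first two steps (my names; std::sort stands in for the parallel, e.g. bitonic, sort used across processors):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Point { double x, y; };                     // assumed to lie in [0,1) x [0,1)

uint64_t leafIndex(const Point& p, unsigned k) {
    uint32_t cx = (uint32_t)(p.x * (1u << k));     // integer cell coordinates
    uint32_t cy = (uint32_t)(p.y * (1u << k));     // at the maximum resolution k
    uint64_t idx = 0;
    for (unsigned b = 0; b < k; ++b) {             // bit interleaving (Z-SFC)
        idx |= (uint64_t)((cx >> b) & 1u) << (2 * b);
        idx |= (uint64_t)((cy >> b) & 1u) << (2 * b + 1);
    }
    return idx;
}

std::vector<uint64_t> sortedLeafIndices(const std::vector<Point>& pts, unsigned k) {
    std::vector<uint64_t> idx;
    for (const Point& p : pts) idx.push_back(leafIndex(p, k));
    std::sort(idx.begin(), idx.end());             // SFC linearization of the leaves
    return idx;
}
```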
Outline
Space Filling Curves
Parallel Compressed Octrees
Parallel FMM: building interaction lists and computing the various translations
Bitonic Sort
Parallel Prefix Sum
Parallel FMM
The FMM computation consists of the following phases:
Building the compressed octree
Building interaction lists
Computing multipole expansions using a bottom-up traversal
Computing multipole-to-local translations for each cell using its interaction list
Computing the local expansions using a top-down traversal
Projecting the field at leaf cells back to the particles
Computing Multipole Expansions
Each processor scans its local array from left to right
If a leaf node is reached, compute its multipole expansion directly
If a node's multipole expansion is known, shift and add it to the parent's multipole expansion, provided the parent is local to the processor
Use of postorder? A node is visited only after all of its descendants, so its children's contributions are available when it is reached
If the multipole expansion of a cell is known but its parent lies on a different processor, the cell is labeled a residual node
If the multipole expansion at a node is not yet computed when it is visited, it is also labeled a residual node
Residual nodes form a tree (termed the residual tree)
The tree is present in its postorder traversal order, distributed across the processors
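A minimal sketch of this left-to-right scan, under an assumed node layout (postorder array with local parent indices; childrenPending is initialized to each node's total number of children, local and remote; P2M/M2M are placeholders):

```cpp
#include <vector>

struct OctNode {
    bool isLeaf;
    int  parent;            // index in the local array, or -1 if the parent is remote
    int  childrenPending;   // children whose contribution has not yet arrived
    bool residual = false;
    /* multipole coefficients would live here */
};

void localUpwardScan(std::vector<OctNode>& nodes) {
    for (OctNode& v : nodes) {                        // postorder: children precede parents
        bool known = v.isLeaf || v.childrenPending == 0;
        if (v.isLeaf) { /* P2M: particles -> multipole of v */ }
        if (!known) { v.residual = true; continue; }  // some child lives on another processor
        if (v.parent >= 0) {
            /* M2M: shift v's multipole into nodes[v.parent] */
            --nodes[v.parent].childrenPending;
        } else {
            v.residual = true;                        // multipole known, but parent is remote
        }
    }
}
```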
Multipole expansions on the residual tree can be computed using an efficient parallel upward tree accumulation algorithm
The residual tree can be accumulated in far fewer rounds than would be needed for the global compressed octree
Thus, the worst-case number of communication rounds is reduced from the height of the global compressed octree to the height of the residual tree, which is much smaller
Computing Multipole to Local Translations
An all-to-all communication is used to receive the fields of interaction-list nodes that reside on remote processors
Once all the information is available locally, the multipole-to-local translations are carried out within each processor in much the same way as in the sequential FMM
Computing Local Expansions
Similar to computing multipole expansions
Calculate local expansions for the residual tree
Compute local expansions for the local tree using a (right-to-left, i.e. reverse postorder) scan of the local array
The exact number of communication rounds required is the same as in computing multipole expansions
Octree Textures on GPU
The tree is stored in a texture; in the example, node A (the root) sits at grid position (0,0), B at (1,0), C at (2,0), and D at (3,0)
The content of the leaves is directly stored as an RGB value
The alpha channel is used to distinguish between an index to a child and the content of a leaf:
alpha = 1 : data
alpha = 0.5 : index
alpha = 0 : empty cell
Retrieve the value stored in the tree at a point M ∈ [0,1] × [0,1]
The tree lookup starts from the root and successively visits the nodes containing the point M until a leaf is reached
Depth 0, node A (root): I_0 = (0,0)
P_x = (I_0x + frac(M_x · 2^0)) / S_x,  P_y = (I_0y + frac(M_y · 2^0)) / S_y
frac(A) denotes the fractional part of A; S_x × S_y is the size of the node grid in the texture (4 × 1 in this example)
Let M = (0.7, 0.7)
Coordinates of M within grid A = frac(M · 2^0) = frac(0.7 × 1) = 0.7
x coordinate of the lookup point P in the texture: P_x = (I_0x + frac(M_x · 2^0)) / S_x = (0 + 0.7)/4 = 0.175
y coordinate of the lookup point P in the texture: P_y = (I_0y + frac(M_y · 2^0)) / S_y = (0 + 0.7)/1 = 0.7
Depth 1, node B: I_1 = (1,0)
P_x = (I_1x + frac(M_x · 2^1)) / S_x,  P_y = (I_1y + frac(M_y · 2^1)) / S_y
M = (0.7, 0.7)
Coordinates of M within grid B = frac(M · 2^1) = frac(0.7 × 2) = 0.4
x coordinate of the lookup point P in the texture: P_x = (I_1x + frac(M_x · 2^1)) / S_x = (1 + 0.4)/4 = 0.35
y coordinate of the lookup point P in the texture: P_y = (I_1y + frac(M_y · 2^1)) / S_y = (0 + 0.4)/1 = 0.4
Depth 2, node C: I_2 = (2,0)
P_x = (I_2x + frac(M_x · 2^2)) / S_x,  P_y = (I_2y + frac(M_y · 2^2)) / S_y
M = (0.7, 0.7)
Coordinates of M within grid C = frac(M · 2^2) = frac(0.7 × 4) = 0.8
x coordinate of the lookup point P in the texture: P_x = (I_2x + frac(M_x · 2^2)) / S_x = (2 + 0.8)/4 = 0.7
y coordinate of the lookup point P in the texture: P_y = (I_2y + frac(M_y · 2^2)) / S_y = (0 + 0.8)/1 = 0.8
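A minimal CPU-side sketch of this lookup loop; the texture layout (an S_x × S_y grid of 2 × 2 node blocks) and the convention that the (r, g) channels of an index texel hold the child node's grid position are assumptions for illustration, not GPU Gems' exact encoding:

```cpp
#include <cmath>

struct RGBA { float r, g, b, a; };   // a = 1: leaf data, a = 0.5: index, a = 0: empty

RGBA lookup(const RGBA* tex, int Sx, int Sy, float mx, float my) {
    float Ix = 0.0f, Iy = 0.0f;                       // start at the root node (0,0)
    float scale = 1.0f;                               // 2^depth
    for (int depth = 0; depth < 32; ++depth) {
        float fx = mx * scale - std::floor(mx * scale);   // frac(M * 2^depth)
        float fy = my * scale - std::floor(my * scale);
        // P = (I + frac(M * 2^depth)) / S, then scaled to texel coordinates
        int tx = (int)((Ix + fx) / Sx * (2 * Sx));
        int ty = (int)((Iy + fy) / Sy * (2 * Sy));
        RGBA c = tex[ty * (2 * Sx) + tx];
        if (c.a != 0.5f) return c;                    // leaf data or empty cell: done
        Ix = c.r; Iy = c.g;                           // descend: (r,g) holds the child node position
        scale *= 2.0f;
    }
    return RGBA{0, 0, 0, 0};                          // depth limit reached
}
```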
References
L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. Journal of Computational Physics, 73:325–348, 1987.
J. Carrier, L. Greengard, and V. Rokhlin. A Fast Adaptive Multipole Algorithm for Particle Simulations. SIAM Journal on Scientific and Statistical Computing, 9:669–686, July 1988.
R. Beatson and L. Greengard. A Short Course on Fast Multipole Methods.
B. Hariharan and S. Aluru. Efficient Parallel Algorithms and Software for Compressed Octrees with Applications to Hierarchical Methods. Parallel Computing, 31:311–331, 2005.
B. Hariharan, S. Aluru, and B. Shanker. A Scalable Parallel Fast Multipole Method for Analysis of Scattering from Perfect Electrically Conducting Surfaces. Proc. Supercomputing, page 42, 2002.
References (contd.)
H. Sagan. Space-Filling Curves. Springer-Verlag, 1994.
M. Harris. Parallel Prefix Sum (Scan) with CUDA. http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.htm
S. Lefebvre, S. Hornus, and F. Neyret. Octree Textures on the GPU. In GPU Gems 2, pages 595–614. Addison-Wesley, 2005.
T. W. Christopher. Bitonic Sort Tutorial. http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm