Outline
Project goal and problem definition
Prior work: octree textures on GPU
Parallel octrees on GPU (2 implementations)
Space filling curves
CUDA programming model
Octree construction
Results: comparison of two approaches
Conclusion
Dual Degree Project Goal
To implement the parallel Fast Multipole Method (FMM) on graphics hardware.
Parallel FMM using multiprocessors: already done.
Parallel FMM using GPUs: to be done, to achieve better speed-up.
A parallel implementation of octrees on the GPU is the first step towards implementing parallel FMM on GPUs.
Octree Textures on GPU
[Figure: a tree with nodes A, B, C, D and the texture holding their data grids at indices A (0,0), B (1,0), C (2,0), D (3,0)]
The content of the leaves is stored as an RGB value.
The alpha channel is used to distinguish between an index to a child and the content of a leaf:
alpha = 1: data
alpha = 0.5: index
alpha = 0: empty cell
[Figure: the tree with nodes A, B, C, D and the texture with grids A (0,0), B (1,0), C (2,0), D (3,0)]
Retrieve the value stored in the tree at a point M ∈ [0,1] × [0,1].
The tree lookup starts from the root and successively visits the nodes containing the point M until a leaf is reached.
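The lookup loop described above can be sketched as follows. This is only an illustration under stated assumptions: the pool contents, the use of the first two channels to hold a child grid index, and the names `make_pool` and `lookup` are all hypothetical and do not come from the slides. Only the alpha convention (1 = data, 0.5 = index, 0 = empty) and the grid indices A (0,0), B (1,0), C (2,0), D (3,0) are taken from the text.

```python
import math

# Alpha convention from the slides: 1 = data, 0.5 = index, 0 = empty cell.
DATA, INDEX, EMPTY = 1.0, 0.5, 0.0

def make_pool():
    """Hypothetical 4x1 pool of 2x2 grids, one grid per node A, B, C, D.
    Each cell is (r, g, b, alpha); in an index cell, (r, g) is assumed to
    hold the grid index of the child. The contents are invented."""
    empty = (0, 0, 0, EMPTY)
    pool = {(gx, 0): {(cx, cy): empty for cx in (0, 1) for cy in (0, 1)}
            for gx in range(4)}
    pool[(0, 0)][(1, 1)] = (1, 0, 0, INDEX)       # A -> B
    pool[(0, 0)][(0, 0)] = (3, 0, 0, INDEX)       # A -> D
    pool[(1, 0)][(0, 0)] = (2, 0, 0, INDEX)       # B -> C
    pool[(2, 0)][(1, 1)] = (0.9, 0.1, 0.1, DATA)  # leaf stored in C
    pool[(3, 0)][(0, 1)] = (0.1, 0.9, 0.1, DATA)  # leaf stored in D
    return pool

def frac(v):
    """Fractional part of v."""
    return v - math.floor(v)

def lookup(pool, m, max_depth=8):
    """Walk from the root grid until a data or empty cell is reached."""
    grid = (0, 0)                                  # root node A
    for depth in range(max_depth):
        scale = 2 ** depth
        # Which of the 2x2 cells of the current grid contains M.
        cell = (int(2 * frac(m[0] * scale)), int(2 * frac(m[1] * scale)))
        r, g, b, alpha = pool[grid][cell]
        if alpha == INDEX:                         # descend to the child grid
            grid = (int(r), int(g))
            continue
        return (r, g, b) if alpha == DATA else None
    return None
```

With this invented pool, the point M = (0.7, 0.7) visits A, then B, then C before reaching a leaf.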
$I_0 = (0,0)$: node A (root)
frac(A) denotes the fractional part of A, and $I_D$ is the index of the data grid of the node visited at depth D.
Let M = (0.7, 0.7).
Coordinates of M within grid A: $\mathrm{frac}(M \cdot 2^0) = \mathrm{frac}(0.7 \times 1) = 0.7$
x coordinate of the lookup point P in the texture: $P_x = \dfrac{I_{0x} + \mathrm{frac}(M_x \cdot 2^0)}{S_x} = \dfrac{0 + 0.7}{4} = 0.175$
y coordinate of the lookup point P in the texture: $P_y = \dfrac{I_{0y} + \mathrm{frac}(M_y \cdot 2^0)}{S_y} = \dfrac{0 + 0.7}{1} = 0.7$
$I_1 = (1,0)$: node B
M = (0.7, 0.7)
Coordinates of M within grid B: $\mathrm{frac}(M \cdot 2^1) = \mathrm{frac}(0.7 \times 2) = 0.4$
x coordinate of the lookup point P in the texture: $P_x = \dfrac{I_{1x} + \mathrm{frac}(M_x \cdot 2^1)}{S_x} = \dfrac{1 + 0.4}{4} = 0.35$
y coordinate of the lookup point P in the texture: $P_y = \dfrac{I_{1y} + \mathrm{frac}(M_y \cdot 2^1)}{S_y} = \dfrac{0 + 0.4}{1} = 0.4$
$I_2 = (2,0)$: node C
M = (0.7, 0.7)
Coordinates of M within grid C: $\mathrm{frac}(M \cdot 2^2) = \mathrm{frac}(0.7 \times 4) = 0.8$
x coordinate of the lookup point P in the texture: $P_x = \dfrac{I_{2x} + \mathrm{frac}(M_x \cdot 2^2)}{S_x} = \dfrac{2 + 0.8}{4} = 0.7$
y coordinate of the lookup point P in the texture: $P_y = \dfrac{I_{2y} + \mathrm{frac}(M_y \cdot 2^2)}{S_y} = \dfrac{0 + 0.8}{1} = 0.8$
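The three steps above all apply the same formula, $P = (I_D + \mathrm{frac}(M \cdot 2^D)) / S$, per coordinate. A minimal sketch that recomputes them, assuming S = (4, 1) as in the worked arithmetic; the function name `texture_point` is hypothetical:

```python
import math

def frac(v):
    """Fractional part of v."""
    return v - math.floor(v)

def texture_point(grid_index, m, depth, size):
    """Texture coordinates P of the lookup point for the grid visited at
    the given depth: P = (I_D + frac(M * 2^D)) / S, per coordinate."""
    ix, iy = grid_index
    sx, sy = size
    scale = 2 ** depth
    return ((ix + frac(m[0] * scale)) / sx,
            (iy + frac(m[1] * scale)) / sy)

if __name__ == "__main__":
    m, size = (0.7, 0.7), (4, 1)
    # Grids visited in the walk-through: A at depth 0, B at 1, C at 2.
    for depth, grid in enumerate([(0, 0), (1, 0), (2, 0)]):
        px, py = texture_point(grid, m, depth, size)
        print(depth, round(px, 3), round(py, 3))
```

Floating-point rounding means the raw results differ from the slide values in the last few digits, which is why the loop rounds before printing.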
Drawbacks
It is very difficult to create such a data representation in parallel.
There is no a priori knowledge of the position where a particular node of the octree is going to land in the texture.
The same amount of memory is allocated for both leaves and internal nodes. Internal nodes do not contain much information, so they do not need the same memory space as a leaf.
It is difficult to compute the post-order traversal.
CUDA Programming Model Space Filling Curves Parallel Octrees on GPU Parallel Bitonic Sort
Space Filling Curves: Motivation
Easy to implement in practice.
Mathematical representation that enables fast computation of data ownership.
Easy to parallelize.
Good quality load balancing.
An SFC-based domain decomposition meets the above requirements.
Space Filling Curves
Consider the recursive bisection of a 2D area into non-overlapping cells of equal size.
A space filling curve (SFC) is a mapping of these cells to a one-dimensional linear ordering.
[Figure: Z-SFC for k = 2]
SFC Construction
The run time to order … cells, …, is expensive since typically …
Integer coordinates of a cell having a particle: …
Represent each integer coordinate of the cell using k bits, then interleave the bits, starting from the first dimension, to form a single 2k-bit integer.
Index of the cell with coordinates … = …
… time to find the index.
Do a parallel integer sort to get the SFC ordering.
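The bit-interleaving step can be sketched as below. This is a serial illustration, not the GPU code: `morton_index` and `sfc_order` are hypothetical names, the built-in sort stands in for the parallel integer sort, and the bit order (the first dimension supplies the more significant bit of each pair) is an assumption about what "starting from the first dimension" means.

```python
def morton_index(x, y, k):
    """Interleave the k-bit integer coordinates x and y into one 2k-bit
    index. Assumption: within each bit pair, the bit of the first
    dimension (x) is the more significant one."""
    index = 0
    for bit in range(k - 1, -1, -1):              # most significant bit first
        index = (index << 1) | ((x >> bit) & 1)   # first dimension
        index = (index << 1) | ((y >> bit) & 1)   # second dimension
    return index

def sfc_order(cells, k):
    """Order occupied cells along the Z-curve by sorting on their index.
    Only cells that contain a particle need to be passed in."""
    return sorted(cells, key=lambda c: morton_index(c[0], c[1], k))
```

For example, with k = 2 the cell (2, 1) has coordinates 10 and 01 in binary, which interleave to 1001, that is, index 9.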
SFC and Octrees
SFC decomposition is very similar, though not identical, to an octree decomposition.
Octrees can be viewed as multiple SFCs at various resolutions.
The process of assigning indices can be viewed hierarchically.
Any ambiguity? The indices make it possible to:
check if a cell is contained in another cell;
find the smallest cell containing two given cells;
find the immediate subcell of a cell that contains a given cell.
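Because the index of a cell at one level is a prefix of the indices of the cells it contains at finer levels, the three queries above reduce to bit shifts. A minimal 2D sketch, assuming a cell is represented as a hypothetical pair (level, SFC index within that level), with two index bits per level:

```python
BITS_PER_LEVEL = 2  # 2D: each level appends two bits to the index

def contains(outer, inner):
    """True if cell `inner` lies inside cell `outer` (or equals it)."""
    (lo, io), (li, ii) = outer, inner
    return lo <= li and (ii >> (BITS_PER_LEVEL * (li - lo))) == io

def smallest_common_cell(a, b):
    """Smallest cell that contains both a and b."""
    (la, ia), (lb, ib) = a, b
    level = min(la, lb)
    ia >>= BITS_PER_LEVEL * (la - level)   # bring both to the same level
    ib >>= BITS_PER_LEVEL * (lb - level)
    while ia != ib:                        # strip bits until the prefixes agree
        ia >>= BITS_PER_LEVEL
        ib >>= BITS_PER_LEVEL
        level -= 1
    return (level, ia)

def immediate_subcell(outer, inner):
    """Child of `outer` that contains the strictly finer cell `inner`."""
    (lo, io), (li, ii) = outer, inner
    if not (li > lo and contains(outer, inner)):
        raise ValueError("inner must be a proper descendant of outer")
    return (lo + 1, ii >> (BITS_PER_LEVEL * (li - lo - 1)))
```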
CUDA Programming Model
[Figure: a kernel launched as a grid of 3 × 2 thread blocks, Block (0,0) to Block (2,1); each block contains 4 × 3 threads, Thread (0,0) to Thread (3,2)]
CUDA Programming Model
Blocks of threads.
Grids of thread blocks.
Any computation that is done independently on different data many times can be isolated into a function, called a kernel, that is executed on the GPU as many different threads.
A GPU may run the blocks of a grid sequentially if it has few parallel resources, or in parallel if it has many.
How does CUDA fit with octrees and SFCs?
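To make the block and thread hierarchy concrete, the sketch below enumerates, serially, the global coordinates that the threads of a 2D launch would see. It is an emulation in Python rather than CUDA code; in a CUDA kernel the same quantity is computed as `blockIdx.x * blockDim.x + threadIdx.x` (and likewise for y). The function name is hypothetical.

```python
def global_thread_ids(grid_dim, block_dim):
    """List the global (x, y) coordinate of every thread in a 2D launch
    with grid_dim = (blocks in x, blocks in y) and
    block_dim = (threads in x, threads in y)."""
    gx, gy = grid_dim
    bx, by = block_dim
    ids = []
    for block_y in range(gy):
        for block_x in range(gx):
            for thread_y in range(by):
                for thread_x in range(bx):
                    # Same arithmetic a kernel does with blockIdx,
                    # blockDim and threadIdx.
                    ids.append((block_x * bx + thread_x,
                                block_y * by + thread_y))
    return ids
```

A launch of 3 × 2 blocks of 4 × 3 threads covers a 12 × 6 index space, one thread per element, which is the kind of mapping an SFC-linearized level of nodes needs.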
Octree Construction
The maximum number of levels (L) is given.
Nodes at level k: $2^{kd}$, where d is the number of dimensions.
The 2-D position of the parent of a node in the upper layer can be calculated directly from the 2-D position of the child node.
Each node also stores:
nodeType {-1 for empty, -2 for filled, -3 for filled internal node}
numEmptyNodes {number of empty nodes in the subtree of the node (including itself)}
dataLocation {(level, 2-D position in that level)}
[Figure: tree levels k = 0 to k = 3]
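The two counting facts above can be written down directly. A small sketch, assuming each level halves the cell size in every dimension so that the parent position is obtained by integer division by two; the function names are hypothetical:

```python
def nodes_at_level(k, d=2):
    """Number of nodes at level k of a complete tree in d dimensions."""
    return 2 ** (k * d)

def parent_position(x, y):
    """2-D position, in the level above, of the parent of the node at
    position (x, y). Assumes each parent covers a 2 x 2 block of children."""
    return (x // 2, y // 2)
```

For instance, level 3 of a quadtree has 64 nodes, and the node at (5, 2) on that level has its parent at (2, 1) on level 2.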
Octree Construction contd.
Multiple passes, each considering two adjacent levels: a child level and its parent level.
In each pass, allocate threads so that each thread handles four nodes.
These four nodes come one after the other in the SFC linearization of that level.
Each thread checks the number of empty nodes among those four nodes.
[Figure: threads T(0,0), T(1,0), T(0,1), T(1,1), each covering one group of four nodes]
Octree Construction contd.
If all four nodes are empty, the thread sets the nodeType field of their parent to -1 and its numEmptyNodes field to the number of empty nodes in the subtree plus 1. The dataLocation field of the parent remains null.
If exactly one of the four nodes is non-empty, the thread sets the nodeType of the non-empty node to -1, and in its parent it sets numEmptyNodes to the number of empty nodes in the subtree plus 1, nodeType to the nodeType of the non-empty node, and dataLocation to the dataLocation of the non-empty node.
Otherwise, it just sets the nodeType field of the parent to -3 and numEmptyNodes to the number of empty nodes in the subtree.
Repeat the same procedure for the remaining levels to generate the complete octree.
The procedure is highly data parallel, with zero communication between the GPU threads.
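One pass of this bottom-up construction can be sketched serially as follows. It is a reading of the three rules under assumptions, not the original kernel: the first two cases are taken to be "all four children empty" and "exactly one child non-empty", the "number of empty nodes in the subtree" is taken to be the sum of the children's numEmptyNodes, and the Python names (`Node`, `build_parent`, `build_parent_level`) are hypothetical. Each call to `build_parent` corresponds to the work of one GPU thread.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

EMPTY, FILLED, INTERNAL = -1, -2, -3   # nodeType codes from the slides

@dataclass
class Node:
    node_type: int = EMPTY
    num_empty: int = 1                  # empty nodes in the subtree, incl. itself
    data_location: Optional[Tuple[int, Tuple[int, int]]] = None

def build_parent(children):
    """Build the parent of four SFC-consecutive children (one thread's work)."""
    non_empty = [c for c in children if c.node_type != EMPTY]
    subtree_empty = sum(c.num_empty for c in children)
    if len(non_empty) == 0:
        # All four empty: the parent is empty too and counts itself.
        return Node(EMPTY, subtree_empty + 1, None)
    if len(non_empty) == 1:
        # A single non-empty child is pulled up into the parent and the
        # child is marked empty, which adds one empty node to the subtree.
        child = non_empty[0]
        parent = Node(child.node_type, subtree_empty + 1, child.data_location)
        child.node_type = EMPTY
        return parent
    # Two or more non-empty children: filled internal node.
    return Node(INTERNAL, subtree_empty, None)

def build_parent_level(child_level):
    """Build a whole parent level; the groups are independent of each other."""
    return [build_parent(child_level[i:i + 4])
            for i in range(0, len(child_level), 4)]
```

Since every group of four children is processed independently, the list comprehension in `build_parent_level` maps directly onto one thread per group with no communication between threads.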
PostorderTraversal
For each node, directly calculate its post-order number (PONA) according to the non-adaptive tree, in parallel.
Also calculate the number of empty nodes (NE) before the current node in the post-order numbering of the non-adaptive tree.
Final post-order number in the adaptive tree: PONA − NE.
To calculate PONA, make use of a table structure:
Maximum level    Total number of nodes in tree
0                1
1                5
2                21
3                85
4                341
5                1365
6                5461
…                …
How to calculate NE? … time for … processors.
PostorderTraversal
[Figure: two stages, labeled "Calculates NE" and "Calculates PONA"]
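One way to compute PONA from the table of subtree sizes is sketched below for a quadtree. This is a derivation under assumptions, not the formula from the slides: nodes are identified by (level, SFC index), the children of node i are 4i to 4i+3, numbering starts at 0, and NE is counted here by a serial scan rather than in parallel. All function names are hypothetical.

```python
def subtree_size(height):
    """Nodes in a complete quadtree of the given height: 1, 5, 21, 85, ..."""
    return (4 ** (height + 1) - 1) // 3

def pona(level, sfc_index, max_level):
    """0-based post-order number of a node in the non-adaptive tree."""
    table = [subtree_size(h) for h in range(max_level + 1)]
    # All proper descendants of the node are visited before it.
    number = table[max_level - level] - 1
    # For every ancestor-or-self on the path, whole subtrees of the
    # siblings to its left are visited earlier.
    for j in range(1, level + 1):
        digit = (sfc_index >> (2 * (level - j))) & 3
        number += digit * table[max_level - j]
    return number

def adaptive_numbers(max_level, empty):
    """Final numbers PONA - NE for every non-empty node; `empty` is a set
    of (level, sfc_index) pairs. NE is accumulated by a serial scan."""
    nodes = [(l, i) for l in range(max_level + 1) for i in range(4 ** l)]
    nodes.sort(key=lambda n: pona(n[0], n[1], max_level))
    numbers, ne = {}, 0
    for number, node in enumerate(nodes):
        if node in empty:
            ne += 1
        else:
            numbers[node] = number - ne
    return numbers

def reference_postorder(max_level, level=0, index=0):
    """Brute-force recursive post-order, used only to check pona()."""
    out = []
    if level < max_level:
        for child in range(4):
            out += reference_postorder(max_level, level + 1, 4 * index + child)
    out.append((level, index))
    return out
```

The values returned by `subtree_size` are the entries of the table above, so `pona` needs only table lookups and the base-4 digits of the SFC index, which is what lets every node be numbered independently.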
Octree Construction
Initial memory allocation is the same as in implementation one.
Also allocate a global array G having the size of the octree.
Within each node we store nodeType and dataLocation as in implementation one.
We do not store the number of empty nodes, but we store the … of the node.
Within each pass, the construction of the parent level from the child level nodes (or leaves) is exactly the same as in implementation one.
Once the parent level is constructed, copy the SFC-linearized child level to the global array G and delete the child array from memory.
Copying in the next pass starts from where it ended in the last pass.
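The copy step with a running offset can be sketched as follows. This is a simplified serial illustration: nodes are reduced to hypothetical (level, SFC index) pairs, the construction of the parent level is omitted, and copying the root level at the end is an assumption that the slides do not state.

```python
def fill_global_array(max_level):
    """Copy every level, deepest first, into one array G of octree size.
    Each entry is a (level, sfc_index) pair standing in for a node."""
    total = sum(4 ** level for level in range(max_level + 1))
    G = [None] * total
    offset = 0                          # where the previous pass stopped
    for level in range(max_level, -1, -1):
        child_level = [(level, sfc) for sfc in range(4 ** level)]
        # (the parent level would be built from child_level here)
        G[offset:offset + len(child_level)] = child_level
        offset += len(child_level)      # the next pass continues from here
    return G
```

After the last pass, G holds the levels back to back, each in SFC order, which is the input to the sort described next.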
PostorderTraversal
Parallel sort the nodes in the global array G to get the post-order traversal of the tree. How?
We order two nodes based on the level they are in and their SFC index within that level.
Note that the SFC ordering of the nodes in a particular level is the same as their ordering in the post-order traversal of the tree.
For any two nodes $N_i$ (level $L_i$, SFC index in the level = i) and $N_j$ (level $L_j$, SFC index in the level = j) in G with …, all nodes having SFC indices less than or equal to … at level $L_j$ come before $N_i$ in the final post order of the tree.
On simplifying, in the array G, if the position of …, then a swap between $N_j$ and $N_i$ is required if …
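The exact inequality on the slide did not survive, so the comparison below is a reconstruction from the statement above rather than the original formula. For a quadtree, the descendants of $N_i$ at a deeper level $L_j$ have SFC indices from $i \cdot 4^{L_j - L_i}$ to $(i+1) \cdot 4^{L_j - L_i} - 1$, so a deeper node $N_j$ precedes $N_i$ exactly when its index shifted right by $2(L_j - L_i)$ bits is at most i. The function names are hypothetical, and the built-in sort stands in for the parallel sort.

```python
from functools import cmp_to_key

def comes_before(a, b):
    """True if node a = (level, sfc_index) precedes node b in post order."""
    (la, ia), (lb, ib) = a, b
    if la == lb:
        return ia < ib                  # same level: SFC order
    if la > lb:
        # a is deeper: it precedes b if it lies in b's subtree or in a
        # subtree to the left of b.
        return (ia >> (2 * (la - lb))) <= ib
    # a is shallower: it precedes b only if b lies strictly to its right.
    return (ib >> (2 * (lb - la))) > ia

def postorder_sort(nodes):
    """Sort (level, sfc_index) pairs into post order."""
    def compare(a, b):
        if a == b:
            return 0
        return -1 if comes_before(a, b) else 1
    return sorted(nodes, key=cmp_to_key(compare))

def reference_postorder(max_level, level=0, index=0):
    """Brute-force recursive post-order, used only as a check."""
    out = []
    if level < max_level:
        for child in range(4):
            out += reference_postorder(max_level, level + 1, 4 * index + child)
    out.append((level, index))
    return out
```

Because the comparison depends only on the two nodes being compared, it can serve as the compare-exchange test of a sorting network.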
PostorderTraversal contd.
For … processors, the sorting algorithm takes … steps to sort … elements.
Comparison function for sorting: $(id_j, id_i)$
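The outline names parallel bitonic sort as the sorting method. The sketch below emulates a standard bitonic sorting network serially; it is not the authors' kernel. It assumes the input length is a power of two and uses plain `<` on the elements; the function name and the returned stage count are additions for illustration. Within one stage every compare-exchange touches a distinct pair, so on a GPU the inner loop over i would be spread across threads.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network.
    Returns (sorted list, number of compare-exchange stages)."""
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    stages = 0
    k = 2
    while k <= n:                      # size of the bitonic sequences merged
        j = k // 2
        while j >= 1:                  # distance between compared elements
            for i in range(n):         # independent pairs: parallel on a GPU
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    out_of_order = a[partner] < a[i] if ascending else a[i] < a[partner]
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            stages += 1
            j //= 2
        k *= 2
    return a, stages
```

For n = 8 the network has 1 + 2 + 3 = 6 stages, and in general it has log2(n) · (log2(n) + 1) / 2 stages of n/2 independent compare-exchanges each.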