
Efficient Stream Reduction on the GPU

  1. Efficient Stream Reduction on the GPU
     David Roger (Grenoble University), Ulf Assarsson (Chalmers University of Technology), Nicolas Holzschuch (Cornell University)

  2. Stream Reduction
     Removing unwanted elements from a stream.
     (Figure: an input stream and the resulting reduced stream.)

  3. Applications
     ● Tree traversal:
       – Ray tracing
       – Collision detection
     ● Often the bottleneck

  4. Sequential Algorithm
     ● Algorithm:
           i = 0
           for j = 0 to n-1 do
               if x[j] is valid then
                   x[i] = x[j]
                   i = i + 1
     ● Easy: one single loop
     ● Linear complexity
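
The loop above translates almost line for line into Python. This is a minimal sketch: the name `compact_sequential` and the `is_valid` predicate are illustrative choices, not taken from the slides.

```python
def compact_sequential(x, is_valid):
    """In-place stream reduction; returns the number of elements kept."""
    i = 0
    for j in range(len(x)):
        if is_valid(x[j]):
            x[i] = x[j]  # move the valid element to the next free slot
            i += 1
    return i


stream = [3, None, 7, None, None, 1]
kept = compact_sequential(stream, lambda v: v is not None)
print(stream[:kept])  # [3, 7, 1]
```

The write index `i` depends on every earlier iteration, which is why this loop does not parallelize directly.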

  5. On GPU
     ● Parallelism
     ● We assume no scatter
       – We will speak about scatter later

  6. Talk Structure
     ● Previous Works
     ● Algorithm Overview
     ● Details and Implementation
     ● Results
     ● Future Works & Conclusion

  7. Previous works: Horn's Method
     ● Prefix sum scan: computes the displacements
     ● Dichotomic search: performs the displacements
     (Figure: the input stream, its prefix sum 0 0 1 1 1 1 2 3 3 4 4 5 5 5 6 6, and the reduced stream.)
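
The two passes can be sketched in Python as follows. The sketch assumes that the prefix sum counts the unwanted elements up to and including each position (an inclusive scan). That reading matches the numbers on the slides but is not stated there. The name `horn_reduce` and the `None` marker for unwanted elements are illustrative.

```python
from itertools import accumulate


def horn_reduce(x, is_valid):
    """Gather-only stream reduction: prefix sum, then one search per output."""
    n = len(x)
    if n == 0:
        return []
    # sums[j] = number of unwanted elements in x[0..j] (assumed inclusive scan).
    sums = list(accumulate(0 if is_valid(v) else 1 for v in x))
    out = []
    for i in range(n):  # on the GPU, each output position i runs independently
        lo, hi = i + sums[i], i + sums[n - 1]
        if hi > n - 1:
            break  # fewer than i + 1 valid elements: nothing to write here
        # Dichotomic search for the smallest j with j - sums[j] == i.
        while lo < hi:
            mid = (lo + hi) // 2
            if mid - sums[mid] < i:
                lo = mid + 1
            else:
                hi = mid
        out.append(x[lo])
    return out


# Valid positions implied by the prefix sum 0 0 1 1 1 1 2 3 3 4 4 5 5 5 6 6.
valid_positions = {0, 1, 3, 4, 5, 8, 10, 12, 13, 15}
stream = [j if j in valid_positions else None for j in range(16)]
print(horn_reduce(stream, lambda v: v is not None))
# [0, 1, 3, 4, 5, 8, 10, 12, 13, 15]
```

Each output element only reads from the input, so no scatter is needed. The cost is one O(log n) search per element.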

  8. Previous works
     ● Prefix sum scan
       – Hillis and Steele, Horn: O(n log n)
       – Blelloch, Sengupta et al., Harris et al.: O(n)
       – Sengupta et al. Hybrid: O(n)
     ● Dichotomic search: O(n log n)
     ● Overall complexity: O(n log n)

  9. Other approaches
     ● Geometry shader + stream output
       – NV_transform_feedback
       – Input stream: vertices in a VBO
       – Geometry shader discards NULL elements
       – Output stream: vertices in a VBO
     ● No fragments, no fragment shader
     ● Bitonic sort
       – Slow
     ● Sum scan + Scatter with vertex engine

  12. Our approach
      (Figure: input stream, split in blocks → reduction of the blocks → concatenation → reduced stream.)

  13. Reduction of the blocks
      ● In parallel
      ● Using previous works
        – Prefix sum scan
        – Dichotomic search
      ● Complexity
        – s: size of a block
        – One block: O(s log s)
        – n/s blocks: O(n log s)

  14. Concatenation of the blocks
      ● Prefix sum scan
        – Computes displacements of the blocks in parallel
      ● Line drawing
        – Segment extremities moved by scattering (vertex engine)
        – Other elements linearly interpolated (rasterization)
      ● Complexity: O(n)
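
A CPU-side sketch of this pass, assuming each reduced block keeps its valid elements packed at the front and that the number of valid elements per block is known from the block reduction. The name `concatenate_blocks` and the explicit `counts` argument are illustrative. The slice copy stands in for what the rasterizer does between the two moved endpoints.

```python
from itertools import accumulate


def concatenate_blocks(blocks, counts):
    """Pack the valid prefix of every reduced block into one output stream."""
    # Exclusive prefix sum of the counts: where each block's segment starts.
    offsets = [0] + list(accumulate(counts))[:-1]
    out = [None] * sum(counts)
    for block, count, start in zip(blocks, counts, offsets):
        # Only the two segment endpoints (start, start + count) need a scatter;
        # the elements in between are filled by interpolating along the segment.
        out[start:start + count] = block[:count]
    return out


reduced_blocks = [[1, 2, None, None], [None] * 4, [5, None, None, None], [6, 7, 8, 9]]
print(concatenate_blocks(reduced_blocks, [2, 0, 1, 4]))  # [1, 2, 5, 6, 7, 8, 9]
```

The prefix sum here runs over n/s block counts rather than n elements, and no per-element search is involved.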

  15. Concatenation of the blocks
      (Figure: reduced blocks → reduced stream.)

  16. Concatenation of the blocks
      (Figure: reduced blocks → reduced stream. Step shown: move the extremities with the vertex shader.)

  17. Concatenation of the blocks
      (Figure: reduced blocks → reduced stream. Steps shown: move the extremities with the vertex shader, then rasterization.)

  18. Algorithmic complexity
      ● All previous works: O(n log n)
      ● Our algorithm: O(n log s)
        – s is the size of the blocks
        – s is a constant!
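
The bound follows from adding the two passes. A short derivation, taking from the earlier slides that reducing one block of size s costs O(s log s) and that the concatenation pass costs O(n):

$$
T(n) \;=\; \underbrace{\frac{n}{s}\cdot O(s \log s)}_{\text{block reduction}} \;+\; \underbrace{O(n)}_{\text{concatenation}} \;=\; O(n \log s) + O(n) \;=\; O(n \log s)
$$

For a fixed block size s, the factor log s is a constant, so T(n) = O(n).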

  19. Overview
      (Figure: input stream, split in blocks → prefix sum scan + dichotomic search → prefix sum scan + line drawing → reduced stream.)

  20. Why is it efficient?
      The key is block concatenation:
        – Dichotomic search is avoided
        – Vertex engine: provides scatter, but is less efficient
          ● Use it for a few elements (segment extremities)
          ● Interpolate the other elements

  25. Dichotomic search details
      ● Gather: at output position i, search j in the input such that i = j - sum[j]
      ● Search bounds: i + sum[i] ≤ j ≤ i + sum[15]
      ● Example: i = 5, so 6 ≤ j ≤ 11; the search result is j = 8, where sum[j] = 3
      (Figure: an input block with positions 0 to 15, its prefix sum 0 0 1 1 1 1 2 3 3 4 4 5 5 5 6 6, and the reduced block.)

  26. Dichotomic search pseudo-code
      Search j0 such that i = j0 - sum[j0]:
          lowBound = i + sum[i]
          upBound = i + sum[n-1]
          if (upBound > n-1) discard
          j = (lowBound + upBound) / 2
          found = j - sum[j] - i
          while (found ≠ 0) {
              if (found < 0) lowBound = j
              else upBound = j
              j = (lowBound + upBound) / 2
              found = j - sum[j] - i
          }

  27. Dichotomic search improvement
      Search j0 such that i = j0 - sum[j0]:
          lowBound = i + sum[i]
          upBound = i + sum[n-1]
          if (upBound > n-1) discard
          j = (lowBound + upBound) / 2
          found = j - sum[j] - i
          while (found ≠ 0) {
              if (found < 0) lowBound = j - found
              else upBound = j - found
              j = (lowBound + upBound) / 2
              found = j - sum[j] - i
          }
      This works because j - sum[j] is contracting: it changes by at most 1 when j changes by 1.
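
The effect of the tighter bounds can be checked with a small Python sketch. It assumes `sums` is an inclusive prefix sum of the unwanted-element flags and returns the smallest matching `j`, which under that assumption is always a valid element. The function name `find_source` and the `use_contraction` switch are illustrative, and the loop is written in lower-bound form rather than copied from the slide.

```python
def find_source(sums, i, use_contraction=True):
    """Return (j, steps): smallest j with j - sums[j] == i, or (None, 0)."""
    n = len(sums)
    lo, hi = i + sums[i], i + sums[n - 1]
    if hi > n - 1:
        return None, 0  # output position i is past the end of the reduced block
    steps = 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        found = mid - sums[mid] - i
        if found < 0:
            # j - sums[j] grows by at most 1 per step, so skip |found| positions.
            lo = mid - found if use_contraction else mid + 1
        elif found > 0:
            # Symmetric argument: the target is at least `found` positions left.
            hi = mid - found if use_contraction else mid
        else:
            hi = mid  # a match; keep looking for the leftmost one
    return lo, steps


sums = [0, 0, 1, 1, 1, 1, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6]
print(find_source(sums, 5)[0])  # 8
```

Both variants return the same index. The second element of the result lets the iteration counts be compared on a given input.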

  30. Lines wrapping
      ● We use 2D textures: wrap line segments
        – Split all segments in two
      ● Or
        – Use geometry engine to split only when necessary
      (Figure: concatenation.)
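
The wrapping problem can be illustrated with a small Python sketch, assuming the 1D stream is stored row by row in a texture `width` texels wide. The function `wrap_segment` and its span format are illustrative and do not come from the slides.

```python
def wrap_segment(start, length, width):
    """Split the 1D segment [start, start + length) into per-row spans.

    Each span is (row, col_begin, col_end), with col_end exclusive.
    """
    spans = []
    pos, end = start, start + length
    while pos < end:
        row, col = divmod(pos, width)
        run = min(end - pos, width - col)  # stop at the end of the texture row
        spans.append((row, col, col + run))
        pos += run
    return spans


print(wrap_segment(6, 5, 8))  # [(0, 6, 8), (1, 0, 3)]
print(wrap_segment(2, 3, 8))  # [(0, 2, 5)]
```

When a segment is no longer than one texture row, it crosses at most one row boundary, so it yields at most two spans.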

  42. Behavior: linear complexity (figure)

  43. Behavior: block size (figure)

  44. Behavior: fill ratio (figure)

  45. Comparison with previous works (figure)

  48. Scatter? (future work)
      ● Scatter available in CUDA
      ● Possible improvements

  49. Scatter? (future work)
      Reduction of the blocks:
        ● without scatter: sum scan + search, O(n log s)
        ● with scatter: sequential algorithm (loop over the block), O(n)
      Concatenation:
        ● Simpler
        ● No wrapping
      (Figure: input stream, split in blocks → reduced stream.)
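
One possible arrangement of the scatter-based variant, sketched on the CPU in Python. The slides give no implementation details, so the counting pass, the fused write into the final position, and the name `reduce_with_scatter` are all assumptions made for illustration.

```python
from itertools import accumulate


def reduce_with_scatter(x, is_valid, s):
    """Block-wise stream reduction when writes to arbitrary positions are allowed."""
    blocks = [x[b:b + s] for b in range(0, len(x), s)]
    # Pass 1: every block counts its valid elements (blocks are independent).
    counts = [sum(1 for v in blk if is_valid(v)) for blk in blocks]
    # Exclusive prefix sum over the blocks gives each block's output offset.
    offsets = [0] + list(accumulate(counts))[:-1]
    out = [None] * sum(counts)
    # Pass 2: the sequential loop runs inside each block and scatters
    # each valid element straight to its final position.
    for blk, start in zip(blocks, offsets):
        i = start
        for v in blk:
            if is_valid(v):
                out[i] = v
                i += 1
    return out


data = [1, None, 2, None, None, 3, 4]
print(reduce_with_scatter(data, lambda v: v is not None, 3))  # [1, 2, 3, 4]
```

Each block does O(s) work and needs no search, and the output is written linearly, so no 2D wrapping step is required.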

  50. Scatter? (future work)
      ● Overall complexity: O(n)
      ● ... but other techniques are also O(n)
        – Sum scan (Harris et al. or Sengupta et al.) + scatter
      ● Future work: tests with CUDA
        – Expected speed-up ≥ 2.5

  51. Conclusion
      ● Orthogonal to previous works:
        – We don't compete with them, we use them!
      ● Better asymptotic complexity
        – O(n) vs. O(n log n)
      ● Significant speed-up
      ● Does not require scatter

  52. Thank you
