
General Transformations for GPU Execution of Tree Traversals


  1. General Transformations for GPU Execution of Tree Traversals
     Michael Goldfarb*, Youngjoon Jo**, Milind Kulkarni (* now at Qualcomm; ** now at Google)
     School of Electrical and Computer Engineering
     Thursday, November 21, 13

  2. GPU execution of irregular programs
     • GPUs offer promise of massive, energy-efficient parallelism
     • Much success in mapping regular applications to GPUs
       • Regular memory accesses, predictable computation
     • Much less success in mapping irregular applications
       • Pointer-based data structures
       • Unpredictable, input-dependent computation and memory accesses

  3. Tree traversal algorithms
     • Many irregular algorithms are built around tree traversal
       • Barnes-Hut
       • Nearest-neighbor
       • 2-point correlation
     • Numerous papers describing how to map tree traversal algorithms to GPUs

  4. Point correlation
     • Data mining algorithm
     • Goal: given a set of N points in k dimensions and a point p, find all points within a radius r of p
     • Naïve approach: compare all N points with p
     • Better approach: build kd-tree over points, traverse tree for point p, prune subtrees that are far from p
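As a baseline for the naïve approach above, a minimal Python sketch might look like the following. The function name `naive_correlation` and the sample data are illustrative assumptions, not taken from the slides; the strict `<` comparison mirrors the `dist(...) < r` test used later in the deck.

```python
import math

def naive_correlation(points, p, r):
    """Count the points strictly within distance r of p by checking all N points."""
    # Hypothetical helper: one distance computation per point, no pruning.
    return sum(1 for q in points if math.dist(q, p) < r)

# Illustrative data (assumed, not from the talk).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
count = naive_correlation(points, (0.0, 0.0), 1.5)
```

Every query costs N distance computations here, which is the work the kd-tree approach tries to avoid by pruning whole subtrees.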


  7-20. Point correlation
     [Figure, built up over several slides: a kd-tree with nodes A, B, C, D, E, F, G, H, I, J, K, followed by step-by-step animation frames whose highlighting did not survive extraction]

  21. Point correlation

          KDCell root = /* build kdtree */;
          Set<Point> ps;
          double radius;
          foreach Point p in ps {
            recurse(p, root, radius);
          }
          ...
          void recurse(Point p, KDCell node, double r) {
            if (tooFar(p, node, r)) return;
            if (node.isLeaf()) {
              // a leaf has no children, so the traversal stops here either way
              if (dist(node.point, p) < r) p.correlated++;
            } else {
              recurse(p, node.left, r);
              recurse(p, node.right, r);
            }
          }
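The pseudocode above leaves the tree construction and `tooFar` unspecified. A runnable Python sketch, under assumptions that are not from the slides, could fill them in as follows: each cell stores an axis-aligned bounding box, each leaf holds one point, splits are at the median, and `too_far` tests the distance from p to the box. The function returns a count instead of incrementing `p.correlated`.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]

@dataclass
class KDCell:
    lo: Point                       # lower corner of the cell's bounding box (assumed layout)
    hi: Point                       # upper corner of the cell's bounding box
    point: Optional[Point] = None   # set only on leaves
    left: Optional["KDCell"] = None
    right: Optional["KDCell"] = None

    def is_leaf(self) -> bool:
        return self.point is not None

def build(points: Sequence[Point], depth: int = 0) -> KDCell:
    """Build a kd-tree with one point per leaf, splitting at the median."""
    k = len(points[0])
    lo = tuple(min(q[d] for q in points) for d in range(k))
    hi = tuple(max(q[d] for q in points) for d in range(k))
    if len(points) == 1:
        return KDCell(lo, hi, point=points[0])
    pts = sorted(points, key=lambda q: q[depth % k])
    mid = len(pts) // 2
    return KDCell(lo, hi,
                  left=build(pts[:mid], depth + 1),
                  right=build(pts[mid:], depth + 1))

def too_far(p: Point, node: KDCell, r: float) -> bool:
    """True if no point inside the node's bounding box can be within r of p."""
    # Per-dimension gap between p and the box; zero where p lies inside the box.
    gap = [max(l - x, 0.0, x - h) for x, l, h in zip(p, node.lo, node.hi)]
    return math.hypot(*gap) >= r

def recurse(p: Point, node: KDCell, r: float) -> int:
    """Return how many points in node's subtree are strictly within r of p."""
    if too_far(p, node, r):
        return 0                    # prune the whole subtree
    if node.is_leaf():
        return 1 if math.dist(node.point, p) < r else 0
    return recurse(p, node.left, r) + recurse(p, node.right, r)

# Illustrative data (assumed): one traversal per point, as in the foreach loop.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (1.0, 1.0), (2.0, 2.0), (5.0, 1.0)]
root = build(points)
counts = [recurse(p, root, 1.5) for p in points]
```

Each entry of `counts` includes the query point itself, since every point is also stored in the tree.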

  22-26. Basic pattern

          TreeNode root;                          // tree structure
          Set<Point> ps;
          foreach Point p in ps {                 // repeated traversal
            recurse(p, root, ...);
          }
          ...
          recurse(Point p, TreeNode node, ...) {
            if (truncate?(p, node, ...)) { ... }
            recurse(p, node.child1, ...);         // recursive traversal
            recurse(p, node.child2, ...);
            ...
          }

     Lots of parallelism!

  27-28. What’s the problem?
     • GPUs add high overhead for recursion
     • GPUs work best when memory accesses are regular and strided, but irregular algorithms have unpredictable memory accesses
     • Status quo: ad hoc solutions
       • New algorithm? New GPU techniques!
     Want generally applicable techniques for mapping irregular applications to GPUs

  29. Contributions
     • Two general techniques for mapping tree traversals to GPUs
       • Autoropes: eliminates recursion overhead
       • Lockstepping: promotes memory coalescing
     • Compiler pass to automatically apply techniques to recursive tree-traversal code
     • Significant GPU speedups on 5 tree-traversal algorithms

  30. Naïve GPU implementation
     • Warp-based SIMT (single-instruction, multiple-thread) execution
     • 32 points put in a single warp
     • Warp traverses tree
     • All points in warp must execute same instruction
     • If points diverge, some points sit idle while other threads execute

  31-48. Naïve GPU implementation
     [Figure, animated over several slides: the tree with nodes A through K traversed by a warp; in the extracted text some node labels appear twice, apparently once per traversing point, but the per-step highlighting did not survive extraction]
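To make the cost of divergence concrete, the following Python sketch models a warp running the recursive traversal. It is a simplified model under stated assumptions, not a GPU implementation: the warp visits a node whenever at least one lane is still active there, and lanes that have truncated sit idle until the warp moves on. The tree, the node names `n0` to `n6`, the two-lane warp, and the `cut` table are all hypothetical.

```python
class Node:
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right = name, left, right

def warp_recurse(node, active, truncate, stats):
    """Visit node with the given active lanes, mimicking one warp in recurse()."""
    if node is None or not active:
        return
    stats["warp_steps"] += 1            # the whole warp spends a step on this node
    stats["lane_steps"] += len(active)  # only the active lanes do useful work
    # Lanes that truncate here skip the subtree and idle while the rest continue.
    survivors = [lane for lane in active if not truncate(lane, node)]
    warp_recurse(node.left, survivors, truncate, stats)
    warp_recurse(node.right, survivors, truncate, stats)

# Hypothetical 7-node tree and a 2-lane "warp" (a real warp has 32 lanes).
tree = Node("n0",
            Node("n1", Node("n3"), Node("n4")),
            Node("n2", Node("n5"), Node("n6")))
cut = {0: {"n2"}, 1: {"n1"}}            # lane 0 prunes at n2, lane 1 prunes at n1
truncate = lambda lane, node: node.name in cut[lane]

stats = {"warp_steps": 0, "lane_steps": 0}
warp_recurse(tree, [0, 1], truncate, stats)
utilization = stats["lane_steps"] / (stats["warp_steps"] * 2)
```

In this model the warp ends up walking the union of the nodes its lanes need, so `utilization` drops below 1 as soon as the lanes prune different subtrees.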

  49. Lots of accesses to tree
     • Many accesses just moving up the tree in order to later move down again
     • Lots of function stack manipulation
     • Trees are very large, cannot be stored in GPU’s fast memory
     • Want to minimize accesses to tree

  50-52. How to avoid extra accesses to tree?
     • Typical technique: ropes
       • Pointers in each tree node that let a traversal jump to the next part of the tree
       • Effectively linearizes traversal
     [Figure: the tree with nodes A through K, shown alongside the bullets]
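The rope idea above can be sketched in Python as follows. This is an illustrative reading of the bullets, not the authors' code: it assumes a binary tree visited in preorder, and it defines a node's rope as the node the traversal visits once that node's subtree is finished or skipped. The names `install_ropes` and `rope_traverse` and the example tree are hypothetical.

```python
class Node:
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right = name, left, right
        self.rope = None   # next node to visit after this node's subtree

def install_ropes(node, next_node=None):
    """Preprocessing pass: point each node's rope past its own subtree."""
    if node is None:
        return
    node.rope = next_node
    # After the left subtree comes the right sibling if there is one.
    install_ropes(node.left, node.right if node.right is not None else next_node)
    install_ropes(node.right, next_node)

def rope_traverse(root, truncate):
    """Stackless preorder traversal: descend, or follow the rope when done."""
    visited = []
    node = root
    while node is not None:
        visited.append(node.name)
        is_leaf = node.left is None and node.right is None
        if truncate(node) or is_leaf:
            node = node.rope                                   # jump ahead, never back up
        else:
            node = node.left if node.left is not None else node.right
    return visited

def recursive_traverse(node, truncate, visited):
    """Reference recursive traversal, for comparison with the rope version."""
    if node is None:
        return visited
    visited.append(node.name)
    if not truncate(node):
        recursive_traverse(node.left, truncate, visited)
        recursive_traverse(node.right, truncate, visited)
    return visited

# Hypothetical tree; prune the subtree under "b".
tree = Node("a",
            Node("b", Node("d"), Node("e")),
            Node("c", Node("f"), Node("g")))
install_ropes(tree)
order = rope_traverse(tree, lambda n: n.name == "b")
```

The loop never revisits a parent and needs no call stack, which is the sense in which ropes linearize the traversal; the cost is an extra pointer per node and a preprocessing pass over the tree.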
