GPU Multisplit Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens S. Ashkiani (UC Davis) GPU Multisplit GTC 2016 1 / 16
Motivating Simple Example: Compaction

Compaction (i.e., binary split)
    Traditional approach: Flags → Scan → Shuffle
    e.g., splitter: 10
Other option: split input into two buckets
Can also be solved by sorting keys
    Not always possible
    Loses “stability”, i.e., initial order within buckets is not preserved

    input keys    25 12  4 76  7 17  6  1
    buckets        1  1  0  1  0  1  0  0
    compacted      4  7  6  1
    output keys    4  7  6  1 25 12 76 17
    buckets        0  0  0  0  1  1  1  1
    sorted keys    1  4  6  7 12 17 25 76
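The Flags → Scan → Shuffle pipeline can be sketched on the CPU as follows. This is a minimal sequential reference, not the GPU implementation; the function names and the choice that keys strictly below the splitter go to bucket 0 are assumptions made for illustration.

```python
from itertools import accumulate


def exclusive_scan(values):
    """offsets[i] = sum of values[0:i]."""
    return [0] + list(accumulate(values))[:-1]


def compact(keys, splitter):
    """Keep keys below the splitter, preserving input order (hypothetical helper)."""
    flags = [1 if k < splitter else 0 for k in keys]   # Flags
    offsets = exclusive_scan(flags)                    # Scan
    out = [None] * sum(flags)
    for i, k in enumerate(keys):                       # Shuffle (scatter)
        if flags[i]:
            out[offsets[i]] = k
    return out


def binary_split(keys, splitter):
    """Stable split into bucket 0 (below splitter) followed by bucket 1."""
    in_b0 = [1 if k < splitter else 0 for k in keys]
    b0_before = exclusive_scan(in_b0)
    n0 = sum(in_b0)
    out = [None] * len(keys)
    for i, k in enumerate(keys):
        if in_b0[i]:
            out[b0_before[i]] = k
        else:
            # i - b0_before[i] earlier keys already went to bucket 1.
            out[n0 + i - b0_before[i]] = k
    return out
```

Applied to the input keys on this slide with splitter 10, compact keeps 4 7 6 1 and binary_split places 25 12 76 17 after them in their original relative order.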
What is Multisplit?

Multisplit (generalization of binary split)
    Let’s try multiple buckets, e.g., splitters: 10 and 20
    Can also be solved by sorting keys

    input keys    25 17  4 76  7 12  6  1
    buckets        2  1  0  2  0  1  0  0
    output keys    4  7  6  1 17 12 25 76
    buckets        0  0  0  0  1  1  2  2
    sorted keys    1  4  6  7 12 17 25 76
    buckets        0  0  0  0  1  1  2  2
Multisplit primitive

Input:
    unordered set of keys (or key-value pairs)
    m, number of buckets
    a user-specified function that identifies the bucket of each key
Output:
    keys (or key-value pairs) separated into m buckets

[Figure: output array partitioned into contiguous buckets B_0, B_1, B_2, B_3]
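The interface of the primitive can be sketched as a sequential reference. The name multisplit, its signature, and the splitter-based bucket function below are illustrative assumptions, not the authors' API; the sketch only pins down what a stable multisplit is expected to return.

```python
from bisect import bisect_right


def multisplit(keys, m, bucket_of):
    """Stable reference multisplit: keys grouped by bucket, input order kept
    inside each bucket. bucket_of(key) must return an integer in [0, m)."""
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[bucket_of(k)].append(k)
    return [k for bucket in buckets for k in bucket]


def splitter_bucket(splitters):
    """Bucket function for sorted splitters: the number of splitters <= key
    (an assumed convention for keys equal to a splitter)."""
    return lambda k: bisect_right(splitters, k)
```

With splitters 10 and 20 and the keys 25 17 4 76 7 12 6 1, multisplit(keys, 3, splitter_bucket([10, 20])) returns 4 7 6 1 17 12 25 76, while sorted(keys) returns 1 4 6 7 12 17 25 76: the same buckets, but sorting also reorders keys inside each bucket.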
A Fast and Flexible Data-Organization Primitive

characterizing key-value pairs into buckets
    General load balancing
    Priority queues
    Single Source Shortest Path (SSSP)
        Serial (Dijkstra): processing the vertex with the lowest weight
        Bellman-Ford-Moore → all vertices in parallel
        delta-stepping formulation of SSSP [Davidson et al., 2014]
            classifying vertices into buckets by their weights
            processing the lowest weights in parallel
            But no multisplit primitive → used radix-sort instead
            By using our own multisplit → 2.1x faster
    other applications
        colored prefix-sum
        reorganizing into 8 direction-based buckets in GPU based ray tracers [Yang et al., 2013]
        first step in building GPU hash tables [Alcantara et al., 2009]
        in the shallow stages of k-d tree construction [Wu et al., 2011]
Common Approaches

1 Recursive scan-based split
    ⌈log(m)⌉ rounds of binary splits
    Example with buckets B_0 = {i ≤ 40}, B_1 = {i > 40}:

    index                          0  1  2  3  4  5  6  7
    initial keys                  59 46 31  6 24 82  3 17
    B_0 flags                      0  0  1  1  1  0  1  1
    exclusive scan                 0  0  0  1  2  3  3  4
    B_1 flags                      1  1  0  0  0  1  0  0
    right-to-left exclusive scan   2  1  1  1  1  0  0  0
    output keys                   31  6 24  3 17 59 46 82

2 Radix sort
    sorting keys
    overkill (sorted within buckets)
    initial order is not preserved
    Example: 7-bit binary representation, ≤ 7 splits

    initial keys   59      46      31      6       24      82      3       17
    binary         0111011 0101110 0011111 0000110 0011000 1010010 0000011 0010001
    sorted keys    3 6 17 24 31 46 59 82

3 Reduced bit sort
    sort (bucket ID, keys)
    ⌈log m⌉-bit bucket IDs
    key-value sort
    Example:

    new keys (bucket IDs)        1  1  0  0  0  1  0  0
    new values (original keys)  59 46 31  6 24 82  3 17

4 Randomized insertions
    a PRAM algorithm
    large buffers for buckets
    random insertions, then compaction
    initial order is not preserved
    [Figure: keys randomly inserted into oversized buffers B_0 (17 24 3 31 6) and B_1 (82 46 59), then compacted into the output]
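One way to read approach 1 is as ⌈log m⌉ stable binary splits, one per bit of the bucket ID, starting from the least significant bit. The sketch below follows that reading on the CPU; the function names are hypothetical, and it computes bucket-1 positions from the bucket-0 scan instead of a separate right-to-left scan, which yields the same stable result.

```python
from itertools import accumulate


def stable_binary_split(items, bit_of):
    """Scan-based stable split: items with bit 0 first, then items with bit 1."""
    bits = [bit_of(x) for x in items]
    zeros = [1 - b for b in bits]
    zeros_before = [0] + list(accumulate(zeros))[:-1]  # exclusive scan
    n_zeros = sum(zeros)
    out = [None] * len(items)
    for i, x in enumerate(items):
        if bits[i] == 0:
            out[zeros_before[i]] = x
        else:
            out[n_zeros + i - zeros_before[i]] = x
    return out


def multisplit_by_binary_splits(keys, m, bucket_of):
    """Multisplit as repeated binary splits over the bits of the bucket ID."""
    rounds = (m - 1).bit_length()  # equals ceil(log2(m)) for m >= 1
    pairs = [(bucket_of(k), k) for k in keys]
    for r in range(rounds):
        pairs = stable_binary_split(pairs, lambda p, r=r: (p[0] >> r) & 1)
    return [k for _, k in pairs]
```

Because every round is stable, the final order is sorted by bucket ID with the input order preserved inside each bucket. This is also the effect of a stable key-value sort on ⌈log m⌉-bit bucket IDs (approach 3).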
Designing an Efficient Approach

Stable Multisplit → unique permutation + data movement

1 Deriving all permutations → global computations
    histogram (h_0, ..., h_{m−1})
    key order per bucket

    \[
    u_i \in B_j \;\Rightarrow\; p(i) =
    \underbrace{\sum_{k=0}^{j-1} h_k}_{\text{number of keys in previous buckets}}
    + \underbrace{\left|\{\, u_r : u_r \in B_j,\ r < i \,\}\right|}_{\text{number of keys before me in my own bucket}}
    \]

2 Final data movements → global random scatters

[Figure: input keys scattered into contiguous output buckets B_0, B_1, B_2, B_3]
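The permutation p(i) can be evaluated directly from its two terms: an exclusive scan of the histogram gives the number of keys in previous buckets, and a running per-bucket counter gives the number of earlier keys in the same bucket. The sketch below is a sequential illustration with hypothetical function names.

```python
def multisplit_permutation(bucket_ids, m):
    """Return p, where p[i] is the output position of key i."""
    hist = [0] * m
    rank_in_bucket = []
    for b in bucket_ids:
        rank_in_bucket.append(hist[b])  # keys of bucket b seen before key i
        hist[b] += 1
    # Exclusive scan of the histogram: keys in all previous buckets.
    starts, total = [], 0
    for h in hist:
        starts.append(total)
        total += h
    return [starts[b] + r for b, r in zip(bucket_ids, rank_in_bucket)]


def scatter(keys, perm):
    """Final data movement: write key i to position perm[i]."""
    out = [None] * len(keys)
    for k, p in zip(keys, perm):
        out[p] = k
    return out
```

For bucket IDs 2 1 0 2 0 1 0 0 the histogram is (4, 2, 2), the bucket start offsets are (0, 4, 6), and the permutation is 6 4 0 7 1 5 2 3. Scattering the keys 25 17 4 76 7 12 6 1 with it yields 4 7 6 1 17 12 25 76.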
Our high-level ideas

1 Global computations
    Localize computations
        several large enough local subproblems → local histograms
        a single small enough global computation → global histogram
        several large enough local subproblems → permutations + scatters
    Avoid shared memory and synchronization: utilize intrinsics
    [Figure: three-stage pipeline: Pre-scan (local) → Scan (global) → Post-scan (local)]

2 Global random scatters
    Reordering keys locally in the last stage → local multisplits
    more computational cost but better memory access pattern (coalesced writes)
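The local / global / local decomposition can be sketched sequentially as below. Each tile stands in for one local subproblem (a warp or a block); the tile size, the function name, and the bucket-major scan order are assumptions chosen for illustration, and nothing here models GPU intrinsics.

```python
def multisplit_three_stage(keys, m, bucket_of, tile=4):
    tiles = [keys[s:s + tile] for s in range(0, len(keys), tile)]

    # Pre-scan (local): one histogram per tile.
    local_hist = []
    for t in tiles:
        h = [0] * m
        for k in t:
            h[bucket_of(k)] += 1
        local_hist.append(h)

    # Scan (global): exclusive scan over all m * len(tiles) counters,
    # bucket-major, so offsets[w][b] is where tile w starts writing bucket b.
    offsets = [[0] * m for _ in tiles]
    running = 0
    for b in range(m):
        for w in range(len(tiles)):
            offsets[w][b] = running
            running += local_hist[w][b]

    # Post-scan (local): each tile computes final positions and scatters.
    out = [None] * len(keys)
    for w, t in enumerate(tiles):
        cursor = list(offsets[w])
        for k in t:
            b = bucket_of(k)
            out[cursor[b]] = k
            cursor[b] += 1
    return out
```

The only global step is the scan over m × (number of tiles) counters; everything else touches a single tile at a time, and the result is the same stable order regardless of the tile size.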
Granularity Tradeoffs

We experimented with a couple of different subproblem granularities

1 warp
    warp-synchronous model with minimal warp divergence
    fast communication via warp-wide ballot/shuffles
2 block
    more expensive communication via shared memory
    cheaper global computation (scan over m × N_blocks elements)
    more locality to extract after reordering

Property                  Direct MS   Warp-level MS          Block-level MS
Subproblem                warp        warp                   block
Reordering                –           warp-wide reordering   block-wide reordering
Computational load        low         medium                 high
Coalesced memory access   low         medium                 high
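The effect of local reordering on write locality can be illustrated with a rough proxy: for each subproblem, count the maximal runs of consecutive output addresses in lane order. Fewer runs suggests better coalescing. The metric, the tile size, and the function names are assumptions for illustration only; they are not the authors' cost model.

```python
def stable_positions(bucket_ids, m):
    """Output position of every key under a stable multisplit."""
    counts = [0] * m
    for b in bucket_ids:
        counts[b] += 1
    starts, total = [], 0
    for c in counts:
        starts.append(total)
        total += c
    positions = []
    for b in bucket_ids:
        positions.append(starts[b])
        starts[b] += 1
    return positions


def count_runs(addresses):
    """Number of maximal runs of consecutive addresses, in lane order."""
    if not addresses:
        return 0
    return 1 + sum(1 for a, b in zip(addresses, addresses[1:]) if b != a + 1)


def write_runs(bucket_ids, m, tile, reorder):
    """Total runs over all tiles, with or without local reordering."""
    positions = stable_positions(bucket_ids, m)
    runs = 0
    for s in range(0, len(positions), tile):
        lane_addresses = positions[s:s + tile]
        if reorder:
            # A local stable multisplit makes lanes write in ascending
            # address order, which is what sorting the addresses models.
            lane_addresses = sorted(lane_addresses)
        runs += count_runs(lane_addresses)
    return runs
```

For bucket IDs 2 1 0 2 0 1 0 0 with tiles of four keys, writing in input order produces 7 runs, while reordering inside each tile first produces 5. Larger tiles leave more locality to extract, which matches the low / medium / high ordering in the table.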
Implementation details & Optimizations

Warp-level MS

1 Pre-scan (Local):
    read keys
    warp histogram
        bit-by-bit balloting
        ⌈log m⌉ rounds
    store warp histogram
        one histogram per warp w = 0, ..., L−1: h_{0,w} h_{1,w} ... h_{m−1,w}

    procedure warp_histogram(bucket_id[0:31])
        Input: bucket_id[0:31]
        Output: histo[0:m-1]
        for each thread i = 0:31 parallel warp do
            histo_bmp[i] = 0xFFFFFFFF;
            for (int k = 0; k < ceil(log2(m)); k++) do
                temp_buffer = ballot(bucket_id[i] & 0x01);
                if ((i >> k) & 0x01) then
                    histo_bmp[i] &= temp_buffer;
                else
                    histo_bmp[i] &= XOR(0xFFFFFFFF, temp_buffer);
                end if
                bucket_id[i] >>= 1;
            end for
            histo[i] = popc(histo_bmp[i]);
        end for
        return histo[0:m-1];
    end procedure

2 Scan (Global): exclusive scan on histograms
    Scan over m × N_warps elements, ordered bucket by bucket:
    h_{0,0} h_{0,1} ... h_{0,L−1}   h_{1,0} h_{1,1} ... h_{1,L−1}   ...   h_{m−1,0} h_{m−1,1} ... h_{m−1,L−1}
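The ballot-based warp histogram can be emulated on the CPU to check its logic. In the sketch below, ballot is a plain Python stand-in for the warp-wide vote intrinsic and the inner loop over lanes stands in for 32 threads running in lockstep; both are assumptions of the emulation. A full warp of 32 lanes and m ≤ 32 are also assumed.

```python
WARP_SIZE = 32
FULL_MASK = 0xFFFFFFFF


def ballot(predicates):
    """Emulated warp vote: bit j of the result is set iff lane j's predicate holds."""
    mask = 0
    for lane, p in enumerate(predicates):
        if p:
            mask |= 1 << lane
    return mask


def warp_histogram(bucket_id, m):
    """Lane i ends up holding the number of keys in bucket i (for i < m)."""
    assert len(bucket_id) == WARP_SIZE and 1 <= m <= WARP_SIZE
    ids = list(bucket_id)
    histo_bmp = [FULL_MASK] * WARP_SIZE
    rounds = (m - 1).bit_length()  # equals ceil(log2(m)) for m >= 1
    for k in range(rounds):
        # Lanes whose current lowest bucket-ID bit is 1.
        temp_buffer = ballot([b & 1 for b in ids])
        for i in range(WARP_SIZE):
            if (i >> k) & 1:
                histo_bmp[i] &= temp_buffer              # keep lanes with bit k == 1
            else:
                histo_bmp[i] &= FULL_MASK ^ temp_buffer  # keep lanes with bit k == 0
        ids = [b >> 1 for b in ids]
    # popc: count the lanes whose bucket ID matched lane index i on every bit.
    return [bin(histo_bmp[i]).count("1") for i in range(m)]
```

After round k, lane i's bitmap retains exactly the lanes whose bucket ID agrees with i on bits 0 through k, so after ⌈log m⌉ rounds the population count is the size of bucket i within the warp.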