GPU Sample Sort
Nikolaj Leischner, Vitaly Osipov, Peter Sanders
Institut für Theoretische Informatik - Algorithmik II, Fakultät für Informatik
KIT: University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association
www.kit.edu
Overview
- Introduction
- Tesla architecture
- Compute Unified Device Architecture Model
- Performance Guidelines
- Sample Sort
- Algorithm Overview
- High Level GPU Algorithm Design
- Flavor of Implementation Details
- Experimental Evaluation
- Future Trends
Introduction: multi-way sorting algorithms

Sorting is important.

Divide-and-conquer approaches:
- recursively split the input into tiles until the tile size is $M$ (e.g. the cache size)
- sort each tile independently
- combine the intermediate results

Two-way approaches:
- two-way distribution (quicksort): $\log_2(n/M)$ scans to partition the input
- two-way merge sort: $\log_2(n/M)$ scans to combine the intermediate results

Multi-way approaches:
- $k$-way distribution (sample sort): only $\log_k(n/M)$ scans to partition
- $k$-way merge sort: only $\log_k(n/M)$ scans to combine

Multi-way approaches are beneficial when memory bandwidth is an issue!
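To make the scan counts concrete, the following minimal sketch counts how many $k$-way passes are needed until every tile has at most $M$ elements, which is $\lceil \log_k(n/M) \rceil$. The function name `partition_passes` and the sizes `n` and `M` are illustrative assumptions, not values from the talk; only $k = 128$ appears later in the slides.

```python
def partition_passes(n, M, k):
    """Count k-way passes until n elements are split into tiles of size <= M.

    Uses integer arithmetic only, so it equals ceil(log_k(n / M)) without
    floating-point rounding issues. Assumes perfectly balanced splits.
    """
    passes = 0
    tiles = 1  # number of tiles after the passes done so far
    while n > tiles * M:
        tiles *= k
        passes += 1
    return passes


# Hypothetical sizes chosen for illustration only.
n = 2**27
M = 2**17
for k in (2, 128):
    print(f"k = {k:3d}: {partition_passes(n, M, k)} passes over the data")
```

Under these assumed sizes the two-way scheme needs five times as many passes over the data as the 128-way scheme, which is the bandwidth argument made on the slide.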
NVidia Tesla Architecture
- 30 Streaming Multiprocessors (SM) × 8 Scalar Processors (SP) each: 240 physical cores overall
- 16 KB shared memory per SM, similar to a CPU L1 cache
- 4 GB global device memory
Compute Unified Device Architecture Model

Similar to the SPMD (single-program multiple-data) model:
- a block of concurrent threads executes a scalar sequential program, a kernel
- thread blocks constitute a grid

C code:

    int main {
        //serial
        //parallel
        Kernel<<<...>>>
        //serial
    }

[Figure: a grid of thread blocks TBlock(0,0), TBlock(0,1), TBlock(0,2), TBlock(1,0), TBlock(1,1), TBlock(1,2) with global memory; eight SPs with shared memory; label "virtualizes".]
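The grid, block, and kernel terminology can be illustrated with a sequential emulation in Python. This is only a sketch of the indexing scheme: on a GPU the threads of a block run concurrently, whereas here they run one after another. The names `launch` and `scale_kernel` are hypothetical and are not part of CUDA.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a kernel launch: run the kernel once per (block, thread) pair.

    Sequential stand-in for a one-dimensional grid of one-dimensional blocks.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


def scale_kernel(block_idx, thread_idx, block_dim, data, factor):
    """Scalar sequential program executed by every emulated thread."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(data):  # the last block may be only partially used
        data[i] *= factor


data = list(range(10))
launch(scale_kernel, 3, 4, data, 2)  # 3 blocks of 4 threads cover 10 elements
print(data)
```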
Performance Guidelines

General pattern in GPU algorithm design:
- decompose the problem into many data-independent sub-problems
- solve the sub-problems by blocks of cooperative parallel threads

Performance guidelines:
- conditional branching: follow the same execution path
- shared memory: exploit the fast on-chip memory
- coalesced global memory operations: load/store requests to the same memory block → fewer memory accesses
Algorithm Overview

Algorithm 1: Serial Sample Sort

SampleSort($e = \langle e_1, \dots, e_n \rangle$, $k$)
begin
    if $n < M$ then return SmallSort($e$)
    choose a random sample $S = \langle S_1, \dots, S_{ak-1} \rangle$ of $e$
    Sort($S$)
    $\langle s_0, s_1, \dots, s_k \rangle = \langle -\infty, S_a, \dots, S_{a(k-1)}, \infty \rangle$
    for $1 \le i \le n$ do
        find $j \in \{1, \dots, k\}$ such that $s_{j-1} \le e_i \le s_j$
        place $e_i$ in bucket $b_j$
    return Concatenate(SampleSort($b_1$, $k$), ..., SampleSort($b_k$, $k$))
end
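Algorithm 1 can be sketched in Python as follows. Several details are assumptions made for this sketch rather than choices stated on the slide: the default values of `k`, `M`, and the oversampling factor `a`, sampling with replacement, the use of `sorted` in place of SmallSort, breaking ties toward the lower bucket ($s_{j-1} < e_i \le s_j$), and the fallback when a bucket receives the whole input (for example, when all keys are equal).

```python
import bisect
import random


def sample_sort(e, k=4, M=16, a=4, rng=None):
    """Serial sample sort following Algorithm 1 (illustrative sketch)."""
    rng = rng or random.Random(0)
    n = len(e)
    if n < M:
        return sorted(e)  # stands in for SmallSort

    # Random sample of a*k - 1 elements, drawn with replacement (assumption).
    S = sorted(rng.choice(e) for _ in range(a * k - 1))
    # Splitters S_a, S_2a, ..., S_a(k-1); 1-based S_i is S[i - 1].
    splitters = [S[a * j - 1] for j in range(1, k)]

    buckets = [[] for _ in range(k)]
    for x in e:
        # bisect_left counts the splitters strictly smaller than x,
        # which is the 0-based index of the bucket with s_{j-1} < x <= s_j.
        buckets[bisect.bisect_left(splitters, x)].append(x)

    # Guard (assumption): if one bucket received everything, recursion
    # would not make progress, so sort directly.
    if max(len(b) for b in buckets) == n:
        return sorted(e)

    result = []
    for b in buckets:
        result.extend(sample_sort(b, k, M, a, rng))
    return result


data = [random.Random(1).randrange(50) for _ in range(200)]
print(sample_sort(data) == sorted(data))
```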
High Level GPU Algorithm Design

Phase 1. Choose splitters.
Phase 2. Each of the $p$ thread blocks (TB):
- computes the bucket indices $id$, $0 \le id \le k-1$, of its elements
- stores the bucket sizes in DRAM
Phase 3. Prefix sum over the $k \times p$ table → global offsets.
Phase 4.
- as in Phase 2 → local offsets
- local + global offsets → final positions

Parameters:
- distribution degree $k = 128$
- threads per block $t = 256$
- elements per thread $l = 8$
- number of blocks $p = n / (t \cdot l)$

[Figure: the input is split into tiles of $t \cdot l$ elements, one per thread block; each thread block produces bucket counts $0, 1, \dots, k-1$; a prefix sum over the counts determines where the thread blocks write buckets $0, 1, 2, \dots, k-1$ in the output.]
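Phases 2 to 4 can be traced on the CPU with the sketch below, in which each "thread block" is simply a slice of $t \cdot l$ input elements processed in a loop. The function name `distribute`, the tiny values of `t` and `l`, the example data, and the bucket-major layout of the $k \times p$ table are assumptions for illustration. The splitters of Phase 1 are passed in as an argument, and ties go to the lower bucket.

```python
import bisect
from itertools import accumulate


def distribute(data, splitters, t=4, l=2):
    """One k-way distribution pass, emulating Phases 2-4 sequentially."""
    k = len(splitters) + 1
    tile = t * l  # elements handled by one thread block
    blocks = [data[i:i + tile] for i in range(0, len(data), tile)]
    p = len(blocks)  # ceil(n / (t * l)) blocks

    # Phase 2: every block counts how many of its elements fall in each bucket.
    counts = [[0] * p for _ in range(k)]  # k x p table, one row per bucket
    for b, block in enumerate(blocks):
        for x in block:
            counts[bisect.bisect_left(splitters, x)][b] += 1

    # Phase 3: exclusive prefix sum over the table in bucket-major order.
    flat = [c for row in counts for c in row]
    offsets = [0] + list(accumulate(flat))[:-1]
    # offsets[j * p + b] is where block b starts writing its part of bucket j.

    # Phase 4: recompute bucket indices, derive local offsets, scatter.
    out = [None] * len(data)
    for b, block in enumerate(blocks):
        local = [0] * k  # elements of each bucket already written by block b
        for x in block:
            j = bisect.bisect_left(splitters, x)
            out[offsets[j * p + b] + local[j]] = x
            local[j] += 1

    bucket_starts = [offsets[j * p] for j in range(k)]
    return out, bucket_starts


out, starts = distribute([5, 1, 9, 3, 7, 2, 8, 6, 4, 0], [2, 5], t=2, l=2)
print(out, starts)
```

Each bucket occupies a contiguous range of the output beginning at the corresponding entry of `bucket_starts`, so the next recursion level can treat every bucket as an independent input.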