
GPU Sample Sort, Nikolaj Leischner, Vitaly Osipov, Peter Sanders - PowerPoint PPT Presentation



  1. GPU Sample Sort. Nikolaj Leischner, Vitaly Osipov, Peter Sanders. Institut für Theoretische Informatik - Algorithmik II. KIT – Universität des Landes Baden-Württemberg und nationales Großforschungszentrum in der Helmholtz-Gemeinschaft. www.kit.edu

  2. Overview: Introduction; Tesla Architecture; Computing Unified Device Architecture (CUDA) Model; Performance Guidelines; Sample Sort Algorithm Overview; High-Level GPU Algorithm Design; Flavor of Implementation Details; Experimental Evaluation; Future Trends.

  3. Introduction: multi-way sorting algorithms. Sorting is important. Divide-and-conquer approaches: recursively split the input into tiles until the tile size is M (e.g. the cache size); sort each tile independently; combine the intermediate results. Two-way approaches: two-way distribution (quicksort) ⇒ log_2(n/M) scans to partition the input; two-way merge sort ⇒ log_2(n/M) scans to combine the intermediate results. Multi-way approaches: k-way distribution (sample sort) ⇒ only log_k(n/M) scans to partition; k-way merge sort ⇒ only log_k(n/M) scans to combine. Multi-way approaches are beneficial when memory bandwidth is the issue! (A worked example follows below.)
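To make the bandwidth argument concrete, a worked example (the values n/M = 2^20 and k = 128 are chosen here for illustration, not taken from the slides):

    log_2(n/M)   = log_2(2^20)              = 20 scans          (two-way)
    log_128(n/M) = log_2(2^20) / log_2(128) = 20/7 ≈ 2.9 ⇒ 3 scans  (128-way)

Since every scan streams the whole input through memory, the 128-way approach moves roughly one seventh of the data.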


  4. NVIDIA Tesla Architecture: 30 streaming multiprocessors (SMs) × 8 scalar processors (SPs) each, 240 physical cores overall; 16 KB shared memory per SM, similar to a CPU L1 cache; 4 GB global device memory. (A device-query sketch follows below.)
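As a minimal illustration of these figures (a sketch, not from the slides; it runs on any CUDA device, though the quoted numbers are specific to a GT200-class Tesla), the CUDA runtime reports them via cudaGetDeviceProperties:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);           // query device 0
        // 30 SMs x 8 SPs = 240 scalar cores on a GT200-class Tesla
        printf("streaming multiprocessors: %d\n", prop.multiProcessorCount);
        printf("shared memory per block:   %zu KB\n", prop.sharedMemPerBlock / 1024);
        printf("global device memory:      %zu MB\n", prop.totalGlobalMem / (1024 * 1024));
        return 0;
    }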

  5. Computing Unified Device Architecture (CUDA) Model. Similar to the SPMD (single-program multiple-data) model: a parallel block of concurrent threads executes a scalar sequential program, a kernel; thread blocks constitute a grid, which virtualizes the physical multiprocessors. [Figure: host C code alternates serial sections with parallel kernel launches (Kernel<<<...>>>); the grid of thread blocks TBlock(0,0) ... TBlock(1,2) works on global memory, and each block runs on one SM's eight SPs with their shared memory.] (A minimal kernel-launch sketch follows below.)
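A minimal sketch of this pattern (the kernel name and sizes are illustrative): serial host code launches a kernel, and a grid of thread blocks executes it in parallel.

    #include <cuda_runtime.h>

    // The kernel: a scalar sequential program; every thread of every block
    // in the grid runs the same code on its own element (SPMD).
    __global__ void scaleKernel(float* data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float* d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        // ... serial section: prepare d_data ...
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        scaleKernel<<<blocks, threads>>>(d_data, 2.0f, n);  // parallel section
        cudaDeviceSynchronize();
        // ... serial section continues ...
        cudaFree(d_data);
        return 0;
    }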

  6. Performance Guidelines. General pattern in GPU algorithm design: decompose the problem into many data-independent sub-problems; solve the sub-problems by blocks of cooperative parallel threads. Guidelines: conditional branching - threads should follow the same execution path; shared memory - exploit the fast on-chip memory; coalesced global memory operations - load/store requests to the same memory block ⇒ fewer memory accesses. (A sketch of coalesced vs. strided access follows below.)
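A sketch of the coalescing guideline (both kernels are illustrative): consecutive threads touching consecutive addresses let the hardware merge a warp's 32 loads into a few memory transactions, while strided access splits them.

    // Coalesced: thread i reads element i, so a warp's loads fall into the
    // same memory block and are served by few transactions.
    __global__ void copyCoalesced(const int* in, int* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: thread i reads element 32*i, so every load hits a different
    // memory block and the warp needs up to 32 separate transactions.
    __global__ void copyStrided(const int* in, int* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (32 * i < n) out[i] = in[32 * i];
    }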

  7. Algorithm Overview.

    Algorithm 1: Serial Sample Sort

    SampleSort(e = <e_1, ..., e_n>, k)
    begin
        if n < M then return SmallSort(e)
        choose a random sample S = S_1, ..., S_{ak-1} of e
        Sort(S)
        <s_0, s_1, ..., s_k> = <-∞, S_a, ..., S_{a(k-1)}, ∞>
        for 1 ≤ i ≤ n do
            find j ∈ {1, ..., k} such that s_{j-1} ≤ e_i ≤ s_j
            place e_i in bucket b_j
        return Concatenate(SampleSort(b_1, k), ..., SampleSort(b_k, k))
    end
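A compilable rendering of Algorithm 1 as plain C++ host code (a sketch: the base-case size M, the oversampling factor a, and all helper names are choices made here, not taken from the paper):

    #include <algorithm>
    #include <random>
    #include <vector>

    std::vector<int> sampleSort(std::vector<int> e, int k, int a = 4,
                                std::size_t M = 1024) {
        if (e.size() < M) {                           // base case: SmallSort
            std::sort(e.begin(), e.end());
            return e;
        }
        // Choose a random sample S of size a*k - 1 and sort it.
        static std::mt19937 gen(42);
        std::uniform_int_distribution<std::size_t> pick(0, e.size() - 1);
        std::vector<int> S(a * k - 1);
        for (int& s : S) s = e[pick(gen)];
        std::sort(S.begin(), S.end());
        // Splitters s_1, ..., s_{k-1}: every a-th sample (s_0 = -inf and
        // s_k = +inf are implicit in the bucket search below).
        std::vector<int> splitters;
        for (int j = 1; j < k; ++j) splitters.push_back(S[j * a - 1]);
        // Place each element into the bucket whose splitter range contains it.
        std::vector<std::vector<int>> buckets(k);
        for (int x : e) {
            int j = std::upper_bound(splitters.begin(), splitters.end(), x)
                    - splitters.begin();
            buckets[j].push_back(x);
        }
        // Recurse on each bucket and concatenate the results.
        std::vector<int> out;
        for (auto& b : buckets) {
            auto sorted = sampleSort(std::move(b), k, a, M);
            out.insert(out.end(), sorted.begin(), sorted.end());
        }
        return out;
    }

(With many duplicate keys a bucket can equal the whole input; a production version needs an equal-keys case, which this sketch omits.)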

  8. High-Level GPU Algorithm Design.
    Phase 1: choose splitters.
    Phase 2: each of the p thread blocks computes the bucket indices id, 0 ≤ id ≤ k-1, of its elements and stores its bucket sizes in DRAM.
    Phase 3: prefix sum over the resulting k × p table ⇒ global offsets.
    Phase 4: as in Phase 2, compute local offsets; local + global offsets ⇒ final positions.
    Parameters: distribution degree k = 128, threads per block t = 256, elements per thread l = 8, number of blocks p = n / (t · l).
    [Figure: the input, split into tiles of t·l elements, is processed by the thread blocks; their k × p bucket-size table feeds a prefix sum; the output is scattered by bucket indices 0 ... k-1.] (A simplified sketch of Phases 2-3 follows below.)
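A simplified sketch of Phases 2 and 3 (illustrative only: one element per thread instead of l = 8, per-block counters via atomics, and Thrust for the scan; the paper's implementation is more refined):

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // Phase 2 (sketch): each thread binary-searches the k-1 sorted splitters
    // for its element's bucket index and bumps its block's counter for that
    // bucket. blockCounts is laid out bucket-major: k rows of p blocks.
    __global__ void countBuckets(const int* in, int n, const int* splitters,
                                 int k, int* blockCounts) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int lo = 0, hi = k - 1;                       // bucket index in [0, k-1]
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (in[i] > splitters[mid]) lo = mid + 1; else hi = mid;
        }
        atomicAdd(&blockCounts[lo * gridDim.x + blockIdx.x], 1);
    }

    // Phase 3 (sketch): an exclusive prefix sum over the k x p table turns
    // the per-block bucket sizes into the blocks' global write offsets.
    void globalOffsets(thrust::device_vector<int>& counts) {
        thrust::exclusive_scan(counts.begin(), counts.end(), counts.begin());
    }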

