Dynamic Data Structures for the GPU


  1. Dynamic Data Structures for the GPU John Owens Child Family Professor of Engineering & Entrepreneurship Department of Electrical & Computer Engineering UC Davis Joint work with Martin Farach-Colton

  2. CUDA Programming Model (SPMD + SIMD)
• Flow: Copy data to “device” (GPU); run kernels; copy results back
• A kernel is executed as a grid of thread blocks
• One thread block maps to one GPU “core” (SM)
• A thread block is a batch of threads that can cooperate with each other by:
  • Efficiently sharing data through shared memory
  • Synchronizing their execution
• Two threads from two different blocks cannot cooperate
• Blocks are independent
[Figure: host launches Kernel 1 over Grid 1 and Kernel 2 over Grid 2; each grid is a 2D array of thread blocks, and each block (e.g., Block (1,1)) is a 2D array of threads]
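A minimal sketch of this model (kernel and variable names are mine, not from the talk): threads within one block cooperate through shared memory and __syncthreads(), each block produces its result independently, and the host launches the kernel as a grid of blocks.

```cuda
#include <cstdio>

// Each block sums 256 inputs cooperatively through shared memory.
__global__ void blockSum(const int *in, int *out) {
    __shared__ int tile[256];                  // per-block shared memory
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[gid];
    __syncthreads();                           // synchronize the block

    // Tree reduction: threads cooperate by sharing data through 'tile'.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // blocks are independent
}

int main() {
    const int n = 1024, threads = 256, blocks = n / threads;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));   // managed memory stands in for
    cudaMallocManaged(&out, blocks * sizeof(int));  // the explicit copy step
    for (int i = 0; i < n; ++i) in[i] = 1;
    blockSum<<<blocks, threads>>>(in, out);    // launch a grid of thread blocks
    cudaDeviceSynchronize();
    printf("block 0 sum = %d\n", out[0]);      // expect 256
    cudaFree(in); cudaFree(out);
    return 0;
}
```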

  3. Computation/Memory Hierarchy

Level      | Computation                                   | Memory
-----------|-----------------------------------------------|-------------------------------------------------
Global     | Kernels                                       | DRAM (12 GB)
Per-block  | Blocks (MIMD within a kernel) (~15)           | L2 cache (1.57 MB)
Per-warp   | Warps (MIMD within a block)                   | Shared/L1 cache (48 kB/SM × 15 SMs = 720 kB)
Per-thread | Threads (32-wide SIMD within a warp) (≥ 30k)  | Registers (64k/SM × 4 B/register = 262 kB/SM × 15 SMs = 3.93 MB)
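A small sketch of how the levels of this table surface in CUDA source (names are illustrative): g_data lives in global DRAM, s_tile in the per-block shared/L1 space, and r_val in a per-thread register.

```cuda
__global__ void hierarchyDemo(float *g_data) {   // g_data: global memory (DRAM)
    __shared__ float s_tile[128];                // shared memory: one copy per block
    float r_val = g_data[threadIdx.x];           // r_val: a per-thread register
    s_tile[threadIdx.x] = 2.0f * r_val;
    __syncthreads();                             // cooperate within the block
    g_data[threadIdx.x] = s_tile[threadIdx.x];   // write back to global memory
}
```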

  4. Memory: What does/doesn’t matter
• Matters:
  • Use fastest level of memory hierarchy (e.g., thread coarsening)
  • Coalesced memory accesses: threads in a warp should access neighboring locations in memory (a sketch follows below)
• Doesn’t matter (to date):
  • Cache & cache-obliviousness: 1.57 MB L2 / 30k threads = 51 B/thread
[Figure: multisplit example showing the initial key distribution over buckets (keys 0–256), warp-level reordering, block-level reordering, and the final multisplit result]
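A sketch contrasting the two access patterns (kernel names are illustrative). In the coalesced version, the 32 threads of a warp read 32 neighboring words, which the hardware combines into a few wide transactions; in the strided version each warp touches many distant cache lines and wastes bandwidth.

```cuda
// Coalesced: thread i of a warp accesses element i.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];      // neighboring threads -> neighboring addresses
}

// Strided: consecutive threads are 'stride' elements apart.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];      // each warp spans many cache lines
}
```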

  5. NVIDIA OptiX & the BVH
Tero Karras. Maximizing parallelism in the construction of BVHs, octrees, and k-d trees. In High-Performance Graphics, HPG ’12, pages 33–37, June 2012.
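Builders in this style assign each primitive a Morton code (interleaved bits of its quantized centroid coordinates), sort the primitives on those codes in parallel, and construct the hierarchy over the sorted order. A sketch of a common form of the 30-bit 3D Morton code:

```cuda
// Expands a 10-bit integer into 30 bits by inserting two zeros
// between each pair of adjacent bits (the usual interleaving helper).
__device__ unsigned int expandBits(unsigned int v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a point with coordinates in [0,1).
__device__ unsigned int morton3D(float x, float y, float z) {
    unsigned int xx = expandBits((unsigned int)fminf(fmaxf(x * 1024.0f, 0.0f), 1023.0f));
    unsigned int yy = expandBits((unsigned int)fminf(fmaxf(y * 1024.0f, 0.0f), 1023.0f));
    unsigned int zz = expandBits((unsigned int)fminf(fmaxf(z * 1024.0f, 0.0f), 1023.0f));
    return xx * 4 + yy * 2 + zz;
}
```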

  6. The problem • Many data structures are built on the CPU and used on the GPU • Very few data structures can be built on the GPU • Sorted array • (Cuckoo) hash table • Several application-specific DS (e.g., BVH tree) • No data structures can be updated on the GPU

  7. Scale of updates • Update 1–few items • Fall back to serial case, slow, probably don’t care • Update very large number of items • Rebuild whole data structure from scratch • Middle ground: our goal • Question: When do you do this in practice?

  8. Approach • Pick data structures useful in serial case, try to find parallelizations? • Pick what look like parallel-friendly data structures with parallel-friendly updates?

  9. If you think of other/interesting data structure candidates, I’m all ears! If you think “But surely he’s already considered X and rejected it”, you’re probably wrong!

  10. Cache-oblivious lookup array (COLA)
• Supports dictionary and range queries
• log n sorted levels, each level 2x the size of the last
• Insert into a filled level results in a merge, possibly cascaded (see the sketch below)
• Operations are coarse (threads cooperate)
[Figure: panels a–c illustrating the levels of the array during insertion]
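A host-side sketch of the cascading merge (mine, not from the talk): level k holds either 0 or 2^k sorted keys; inserting when level 0 is full merges downward until an empty level absorbs the doubled run. On the GPU each merge would be a coarse, cooperative kernel; serial C++ is used here for clarity.

```cuda
#include <algorithm>
#include <vector>

struct Cola {
    std::vector<std::vector<int>> levels;    // levels[k] holds 0 or 2^k sorted keys

    void insert(int key) {
        std::vector<int> carry{key};         // sorted run being pushed down
        for (size_t k = 0; ; ++k) {
            if (k == levels.size()) levels.emplace_back();
            if (levels[k].empty()) {         // an empty level absorbs the run
                levels[k] = std::move(carry);
                return;
            }
            // Filled level: merge, empty it, and cascade to level k+1.
            std::vector<int> merged(levels[k].size() + carry.size());
            std::merge(levels[k].begin(), levels[k].end(),
                       carry.begin(), carry.end(), merged.begin());
            levels[k].clear();
            carry = std::move(merged);
        }
    }
};
```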

  11. COLA results/questions
• Insertions/lookups for point queries: 600M/52M for COLA vs. 140M/326M for a hash table
• Deletes using tombstones
• Semantics for parallel insert/delete operations?
• Minimum batch size?
• Atom size for searching?
• Fractional cascading?
Saman Ashkiani, Shengren Li, Martin Farach-Colton, Nina Amenta, and John D. Owens. GPU COLA: A dynamic dictionary data structure for the GPU. February 2016. Unpublished.

  12. Hash-array mapped trie (HAMT)
• Hash maps in Clojure
• S-nodes (key-value pairs)
• C-nodes (branching nodes, bitmap-indexed; a lookup sketch follows below)
• Operations are fine (threads operate independently)
• Has concurrent (CPU) implementation
• Requires fine-grained memory allocation
• Custom memory allocators?
[Figure: root C-node with bitmap 0110 pointing into a subtrie; an inner C-node with bitmap 0101 branches to S-nodes holding keys 0010, 1001, and 0001]
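A hedged sketch of the bitmap trick that keeps C-nodes compact (structure and function names are mine): five hash bits select a logical slot, the bitmap says whether that slot is occupied, and a popcount over the lower bitmap bits gives the child’s index in a dense array. This per-key independence is what lets GPU threads traverse the trie individually.

```cuda
struct CNode {
    unsigned int bitmap;    // bit i set => logical slot i has a child
    void **children;        // dense array of child C-nodes / S-nodes
};

// One step of a lookup: descend from 'node' using 5 hash bits per level.
__device__ void *hamtChild(const CNode *node, unsigned int hash, int level) {
    unsigned int slot = (hash >> (5 * level)) & 31u;  // 5 bits -> 32-way branch
    unsigned int bit  = 1u << slot;
    if ((node->bitmap & bit) == 0) return nullptr;    // empty slot: key absent
    int idx = __popc(node->bitmap & (bit - 1));       // # of children before slot
    return node->children[idx];
}
```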

  13. Relaxed Radix Balanced (RRB) Trees • Clojure and Scala’s Vector • ~Relaxed unsorted B-tree • Index/update/iterations cheap • concat/insert-at/split in O(log n) (an indexing sketch follows below)
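A hedged sketch of why indexing stays cheap (names are mine): a strict 32-way vector trie locates element i by pure radix arithmetic, while the “relaxed” nodes that make O(log n) concat possible carry a cumulative size table, so the radix guess may need a small correction.

```cuda
struct RrbNode {
    void **children;
    int   *sizes;       // cumulative subtree sizes; null for a strict radix node
    int    nChildren;
};

// One level of descent toward element *i; *i is adjusted for the chosen subtree.
void *rrbDescend(const RrbNode *node, int *i, int shift) {
    int slot = (*i >> shift) & 31;            // radix guess: exact if strict
    if (node->sizes) {                        // relaxed node: correct the guess
        while (node->sizes[slot] <= *i) ++slot;
        if (slot > 0) *i -= node->sizes[slot - 1];
    }
    return node->children[slot];
}
```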

  14. Packed memory array (PMA) • Differs from RRB tree: • Stores ordered elements (set, not list) • Tree is implicit • Maintains gaps between elements • Insertions require rebalancing (a simplified sketch follows below)
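A much-simplified, hedged sketch of PMA insertion (the density thresholds and implicit-tree windows of a real PMA are omitted): elements sit in sorted order with EMPTY gaps; an insert shifts elements into the nearest gap to the right, and when no gap is available the array is grown and its elements spread back out evenly.

```cuda
#include <algorithm>
#include <vector>

const int EMPTY = -1;   // sentinel: keys are assumed non-negative here

void rebalance(std::vector<int> &a) {        // spread elements out evenly
    std::vector<int> live;
    for (int x : a) if (x != EMPTY) live.push_back(x);
    std::fill(a.begin(), a.end(), EMPTY);
    if (live.empty()) return;
    double step = (double)a.size() / live.size();
    for (size_t k = 0; k < live.size(); ++k)
        a[(size_t)(k * step)] = live[k];
}

void pmaInsert(std::vector<int> &a, int key) {
    size_t pos = 0;                          // first live element greater than key
    while (pos < a.size() && (a[pos] == EMPTY || a[pos] <= key)) ++pos;
    if (pos > 0 && a[pos - 1] == EMPTY) { a[pos - 1] = key; return; }  // free slot
    size_t gap = pos;                        // nearest gap to the right
    while (gap < a.size() && a[gap] != EMPTY) ++gap;
    if (gap == a.size()) {                   // no room: grow, rebalance, retry
        a.resize(2 * a.size() + 1, EMPTY);
        rebalance(a);
        pmaInsert(a, key);
        return;
    }
    for (size_t j = gap; j > pos; --j) a[j] = a[j - 1];  // shift into the gap
    a[pos] = key;
}
```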

  15. Cross-cutting issues • Useful models for GPU memory hierarchy • Independent threads vs. cooperative threads? • Memory allocation (& impact on hardware) • Persistent data structures • Integration into higher-level programming environments • Use cases!
