Understanding GPU performance: How to get peak FLOPS (GPU version)

  1. Understanding GPU performance: How to get peak FLOPS (GPU version). Kenjiro Taura

  2. Contents: 1. Data Access Performance

  4. Data access performance: data access performance is important on GPUs too

  5. Memory organization

     Pascal (P100)

     | level         | line size | capacity    | associativity |
     |---------------|-----------|-------------|---------------|
     | L1            | 32B       | 24KB/SM     | ?             |
     | L2            | 32B       | 4MB/device  | ?             |
     | Global Memory |           | 12/16GB     | N/A           |
     | Shared Memory |           | 64KB (∗)    | N/A           |

     Volta (V100)

     | level         | line size | capacity          | associativity |
     |---------------|-----------|-------------------|---------------|
     | L1            | 32B       | 32-128 KB/SM (∗)  | ?             |
     | L2            | 32B       | 6MB/device        | ?             |
     | Global Memory |           | 16GB              | N/A           |
     | Shared Memory |           | ≤ 96KB (∗)        | N/A           |

     (∗): 128KB is split between L1 and Shared Memory (configurable)

     source: https://arxiv.org/abs/1804.06826
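The capacities in the table can be compared with what the CUDA runtime reports for the installed device. The following is a minimal sketch, not part of the original slides: `print_memory_sizes` and `dummy_kernel` are hypothetical names, and the 50% carveout value is an arbitrary choice used only to show where the configurable L1/shared split is requested. The carveout is a hint, so the sketch only prints its status.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical empty kernel; it exists only so that a carveout hint
// can be attached to some kernel.
__global__ void dummy_kernel() {}

// Prints the memory sizes the runtime reports for `device`.
// Returns 0 if the properties could be queried, -1 otherwise.
int print_memory_sizes(int device) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

  printf("device                   : %s\n", prop.name);
  printf("global memory            : %zu bytes\n", prop.totalGlobalMem);
  printf("L2 cache                 : %d bytes\n", prop.l2CacheSize);
  printf("shared memory per SM     : %zu bytes\n", prop.sharedMemPerMultiprocessor);
  printf("shared memory per block  : %zu bytes\n", prop.sharedMemPerBlock);

  // Ask for roughly 50% of the combined L1/shared storage to be used as
  // shared memory when dummy_kernel runs (a preference, not a guarantee).
  cudaError_t e = cudaFuncSetAttribute(
      dummy_kernel, cudaFuncAttributePreferredSharedMemoryCarveout, 50);
  printf("carveout hint            : %s\n", cudaGetErrorString(e));
  return 0;
}
```

Calling `print_memory_sizes(0)` from `main` prints the values for the first device; on a P100 or V100 the global memory and L2 figures should be close to the table.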

  6. Global vs. Shared Memory

     - global memory and the L1/L2 caches are the ordinary memory that forms a hierarchy
       - cudaMalloc returns a pointer to global memory
       - accesses to global memory are transparently cached in the L1/L2 caches
     - shared memory is an explicitly managed scratch memory
       - its latency is shorter than that of L1 (esp. on Pascal)
       - you explicitly move data between global and shared memory
       - data are shared only within a thread block
       - the programming interface is covered shortly
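The two kinds of memory described above can be sketched in CUDA as follows. This is an illustrative example, not code from the slides: the kernel `reverse_in_block`, the constant `BLOCK`, and the host helper `reverse_blocks` are hypothetical names, the input length is assumed to be a multiple of `BLOCK`, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int BLOCK = 256;  // threads per block (an assumption of this sketch)

// Each thread block stages BLOCK elements in shared memory and writes them
// back to global memory in reversed order within the block.
__global__ void reverse_in_block(const float *in, float *out) {
  __shared__ float buf[BLOCK];                // scratch memory private to this thread block
  int base = blockIdx.x * BLOCK;
  buf[threadIdx.x] = in[base + threadIdx.x];  // explicit move: global -> shared
  __syncthreads();                            // wait until every thread in the block has written buf
  out[base + threadIdx.x] = buf[BLOCK - 1 - threadIdx.x];  // explicit move: shared -> global
}

// Hypothetical host helper: reverses each consecutive group of BLOCK elements.
std::vector<float> reverse_blocks(const std::vector<float> &h_in) {
  size_t n = h_in.size();
  assert(n % BLOCK == 0);
  size_t bytes = n * sizeof(float);

  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, bytes);   // cudaMalloc hands back global memory
  cudaMalloc(&d_out, bytes);
  cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

  unsigned blocks = static_cast<unsigned>(n / BLOCK);
  reverse_in_block<<<blocks, BLOCK>>>(d_in, d_out);

  std::vector<float> h_out(n);
  cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel
  cudaFree(d_in);
  cudaFree(d_out);
  return h_out;
}
```

The loads from `in` and the stores to `out` go through the L1/L2 caches without any action by the programmer, whereas nothing reaches `buf` unless a thread copies it there.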

  7. Latency measurement: the same pointer-chasing experiment as we did on the CPU

         for (long i = 0; i < N; i++) {
           p = p->next;
         }

     [Figure: N elements, each the size of a cache line, linked by next pointers in a random order]
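One way the pointer-chasing experiment might be set up on a GPU is sketched below. It is not the author's benchmark code: `Node`, `random_cycle`, `chase`, and `cycles_per_load` are hypothetical names, the 32-byte element size is taken from the 32B line size listed for L1/L2, a single thread does the traversal, and CUDA error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// One list element, padded to 32 bytes so each element occupies one cache line.
struct Node {
  Node *next;
  char pad[32 - sizeof(Node *)];
};

// Returns nxt such that following i -> nxt[i] visits all n elements in a
// random order and then comes back to the start (a single cycle).
std::vector<size_t> random_cycle(size_t n, unsigned seed) {
  assert(n > 0);
  std::vector<size_t> order(n);
  std::iota(order.begin(), order.end(), size_t(0));
  std::mt19937 rng(seed);
  std::shuffle(order.begin(), order.end(), rng);
  std::vector<size_t> nxt(n);
  for (size_t i = 0; i < n; i++) nxt[order[i]] = order[(i + 1) % n];
  return nxt;
}

// A single thread follows `steps` next pointers and records the elapsed cycles.
__global__ void chase(Node *p, long steps, long long *cycles, Node **last) {
  long long t0 = clock64();
  for (long i = 0; i < steps; i++) {
    p = p->next;  // each load depends on the result of the previous one
  }
  long long t1 = clock64();
  *cycles = t1 - t0;
  *last = p;      // keep the result live so the loop is not optimized away
}

// Average cycles per load when traversing a list of n elements.
double cycles_per_load(size_t n, long steps) {
  Node *d_nodes = nullptr;
  long long *d_cycles = nullptr;
  Node **d_last = nullptr;
  cudaMalloc(&d_nodes, n * sizeof(Node));
  cudaMalloc(&d_cycles, sizeof(long long));
  cudaMalloc(&d_last, sizeof(Node *));

  // Build the list on the host using device addresses, then copy it over.
  std::vector<size_t> nxt = random_cycle(n, 12345u);
  std::vector<Node> h(n);
  for (size_t i = 0; i < n; i++) h[i].next = d_nodes + nxt[i];
  cudaMemcpy(d_nodes, h.data(), n * sizeof(Node), cudaMemcpyHostToDevice);

  chase<<<1, 1>>>(d_nodes, static_cast<long>(n), d_cycles, d_last);  // warm-up pass
  chase<<<1, 1>>>(d_nodes, steps, d_cycles, d_last);                 // timed pass

  long long cycles = 0;
  cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
  cudaFree(d_nodes);
  cudaFree(d_cycles);
  cudaFree(d_last);
  return static_cast<double>(cycles) / static_cast<double>(steps);
}
```

Calling `cycles_per_load(n, steps)` for increasing `n` (region size = `32 * n` bytes) gives one point per region size. Linking the elements into a single random cycle ensures that every element is touched before any is revisited, so the working set really is the whole region.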

  8. Data size vs. latency: even an L1 cache hit takes 30 (Volta) to 100 (Pascal) cycles

     [Figure: latency per load in a random list traversal. x-axis: size of the region (bytes), from 1024 to about 6.7 × 10^7; y-axis: latency/load (GPU cycles), from 0 to 700; series: p 8, v 8]

  9. Shared memory
