SYNCHRONIZATION IS BAD, BUT IF YOU MUST… (S9329)
Olivier Giroux, Distinguished Architect; ISO C++ Chair of Concurrency & Parallelism.
MY COORDINATES
ogiroux@nvidia.com
[Diagram: the memory model sits at the intersection of ISO C++ (WG21), the community of users, and NVIDIA GPU architects and engineers.]
WHAT THIS TALK IS ABOUT:
❌ cudaDeviceSynchronize()
❌ __syncthreads()
❌ __shfl_sync()
✅ Using atomics to do blocking synchronization.
PSA: DON’T RUN SERIAL CODE IN THREADS
[Timeline: threads T0–T6 spend nearly every slot blocked. “All blocked and no play makes T6 a dull thread…”]
PSA: RARE CONTENTION IS FINE
[Timeline: the same threads T0–T6, now blocked only occasionally.]
UNCONTENDED EXCHANGE LOCK

struct mutex {                  // suspend atomic<> disbelief for now
    __host__ __device__ void lock() {
        while(1 == l.exchange(1, memory_order_acquire))
            ;
    }
    __host__ __device__ void unlock() {
        l.store(0, memory_order_release);
    }
    atomic<int> l = ATOMIC_VAR_INIT(0);
};

[Chart: critical sections per second (1e6–1e9) vs. thread occupancy (1–2048), V100 vs. CPU.]

Awesome. 🎊 Thanks for attending my talk. 🎊
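A usage sketch (mine, not on the slide): funnel every thread through the lock to update one shared counter. The names bump and counter, and the __managed__ placement, are illustrative assumptions; the talk's atomic<> stands in for what ships later.

// Hypothetical usage of the exchange lock above.
__managed__ mutex m;        // assumes mutex is constant-initializable in managed memory
__managed__ int counter = 0;

__global__ void bump(int times) {
    for (int i = 0; i < times; ++i) {
        m.lock();           // uncontended fast path: one exchange...
        ++counter;          // ...a deliberately tiny critical section...
        m.unlock();         // ...and one release store
    }
}

Because lock()/unlock() are __host__ __device__, the same mutex works from host threads too. On pre-Volta GPUs, lanes of one warp contending this lock can starve each other, which is exactly where the next slide comes in.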
Deadlock. Deadlock. 🐱🕴 Deadlock. Deadlock.
SIMT ATOMIC CONCERN SCALE:
😏 Atomic result feeds branch, closes loop, Volta+
😲 Atomic result feeds branch, closes loop
😭 Atomic result feeds branch, inside loop
🤩 Atomic result feeds branch, outside loop
🙃 Atomic result feeds arithmetic
😂 Atomic result ignored
👷 No atomics
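To make the scale concrete, a small sketch of my own showing the top rung next to the bottom rungs:

// Top of the scale: the atomic's result feeds the branch that closes the loop.
// Pre-Volta, a lane spinning here can starve the lock-holding lane in the same
// warp; Volta+ independent thread scheduling guarantees eventual progress.
__device__ void spin(atomic<int>& l) {
    while (1 == l.exchange(1, memory_order_acquire))
        ;   // result → branch → loop back-edge
}

// Near the bottom: the result only feeds arithmetic...
__device__ int ticket(atomic<int>& c) {
    return c.fetch_add(1, memory_order_relaxed) + 1;
}

// ...or is ignored outright.
__device__ void count(atomic<int>& c) {
    c.fetch_add(1, memory_order_relaxed);
}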
SIMT FAMILY HISTORY
Time zero: SIMD (1966–1970).
Pixar CHAP (1984): scalar channel programs.
Tesla SIMT (2007): scalar thread programs. Forward progress = nope. ☹
Volta SIMT (2017): scalar thread programs. Forward progress = YES! 🤙
Source: Wikipedia, SIGGRAPH proceedings, IEEE Micro.
APPLICABILITY
SYNCHRONIZATION DECISION CHECKLIST
CONs:
1. Serialization is bad.
2. Critical path / Amdahl’s law.
3. Latency is high.
PROs:
1. Algorithmic gains.
2. Latency hiding.
3. Throughput is high.
TL;DR: Sometimes, it’s a win.
APP #1: GPU-RESIDENT METHODS
Keep local state in registers & shared memory, with synchronization.
Instead of separate launches with state invalidation between them (Grid0<<<>>>, then Grid1<<<>>>), run one cooperative grid with a global barrier: 20× faster for RNNs.
See Greg Diamos’ GTC 2016 talk for more.
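A minimal sketch of the cooperative-grid pattern (not Diamos' actual RNN code; the update line is a stand-in): one persistent kernel with a grid-wide barrier between timesteps instead of one launch per timestep. Such a kernel must be launched with cudaLaunchCooperativeKernel.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void rnn_resident(float* state, int timesteps) {
    cg::grid_group grid = cg::this_grid();
    auto i = grid.thread_rank();            // this thread's slice of the state
    for (int t = 0; t < timesteps; ++t) {
        state[i] = 0.5f * state[i] + 0.5f;  // stand-in for the real RNN update
        grid.sync();                        // global barrier replaces a kernel re-launch
    }
}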
APP #2: LOCK-FREE IS NOT ALWAYS FASTER

// *continue* to suspend atomic<> disbelief for now
__host__ __device__ bool lock_free_writer_version(atomic<int>& a, atomic<int>& b) {
    int expected = -1;
    if(a.compare_exchange_strong(expected, 1, memory_order_relaxed))
        b.store(1, memory_order_relaxed);       // <-- exposed dependent latency
    return expected == -1;
}

// This version is a ~60% speedup at GPU application level, despite progress hazards.
__host__ __device__ bool starvation_free_writer_version(atomic<int>& a, atomic<int>& b) {
    int expected_a = -1, expected_b = -1;
    bool success_a = a.compare_exchange_strong(expected_a, 1, memory_order_relaxed),  // <-- overlapped,
         success_b = b.compare_exchange_strong(expected_b, 1, memory_order_relaxed);  //     not dependent
    if(success_a)           // Note: we almost always succeed at both.
        while(!success_b)   // <-- This rarely-taken loop makes this a deadlock-free
                            //     algorithm: a different progress category.
            success_b = b.compare_exchange_strong(expected_b = -1, 1, memory_order_relaxed);
    else if(success_b)
        b.store(-1, memory_order_relaxed);
    return expected_a == -1;
}
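One way to see the contract, via an illustrative harness I'm assuming (not from the talk): however many threads race, at most one caller can observe true per round, because only one compare-exchange can move a from -1 to 1.

// Hypothetical test harness: count how many racers win the election.
__global__ void race(atomic<int>* a, atomic<int>* b, atomic<int>* winners) {
    if (starvation_free_writer_version(*a, *b))
        winners->fetch_add(1, memory_order_relaxed);  // expect exactly 1 per round
}

The design point: issuing both compare-exchanges back-to-back overlaps their latencies, and the cleanup loop that restores deadlock-freedom is almost never executed, so the common path stays short.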
APP #3: CONCURRENT DATA STRUCTURES
Even if mutexes hide in every node, GPUs can build tree structures fast.
[Chart: build throughput, multi-threading (CPU) vs. acceleration (RTX 2070).]
For more, see my CppCon 2018 talk on YouTube, and the ‘Parallel Forall’ blog post.
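The flavor of "a mutex in every node", as a sketch of my own (the CppCon 2018 talk shows the real data structure): fine-grained locking on the way down a binary search tree, so contention stays per-node rather than global.

struct node {
    mutex m;           // the exchange lock from earlier in the talk
    int key;
    node* child[2];    // [0] = left, [1] = right
};

__device__ void insert(node* root, node* fresh) {
    for (node* cur = root;;) {
        int dir = fresh->key > cur->key;   // which side the new key belongs on
        cur->m.lock();
        node* next = cur->child[dir];
        if (next == nullptr) {             // attach point found: link while locked
            cur->child[dir] = fresh;
            cur->m.unlock();
            return;
        }
        cur->m.unlock();                   // interior node: release and descend
        cur = next;
    }
}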
PRE-REQUISITES
PR #1: FORWARD-PROGRESS
Concurrent algorithm taxonomy — Maurice Herlihy and Nir Shavit. 2011. On the nature of progress. In Proceedings of the 15th International Conference on Principles of Distributed Systems (OPODIS’11):
• Maximal progress: every thread succeeds. Minimal progress: some thread succeeds, eventually.
• Scheduling assumptions range from none (any thread scheduler) to requiring that critical sections which run in isolation eventually complete.
Compute_6x → Compute_7x widens the classes of algorithms with guaranteed progress; App #2 falls in the newly supported region.
PR #2: MEMORY CONSISTENCY
Classic CUDA C++ (CUDA 9.0–10.2, Volta+; see PTX 6 chapter 8 for the asm) vs. ISO C++11 atomics (CUDA 10.3, Volta+ — 🎊 later this year! 🎊):

int atomic<int>::load(memory_order_seq_cst)
    asm("fence.sc.sys;");
    asm("ld.acquire.sys.b32 %0, [%1];" ::: "memory");
int atomic<int>::load(memory_order_acquire)
    asm("ld.acquire.sys.b32 %0, [%1];" ::: "memory");
int atomic<int>::load(memory_order_relaxed)
    asm("ld.relaxed.sys.b32 %0, [%1];" ::: "memory");    OR: x = *(volatile int*)ptr;

void atomic<int>::store(int, memory_order_seq_cst)
    asm("fence.sc.sys;");
    asm("st.relaxed.sys.b32 [%0], %1;" ::: "memory");
void atomic<int>::store(int, memory_order_release)
    asm("st.release.sys.b32 [%0], %1;" ::: "memory");
void atomic<int>::store(int, memory_order_relaxed)
    asm("st.relaxed.sys.b32 [%0], %1;" ::: "memory");    OR: *(volatile int*)ptr = x;

int atomic<int>::exchange(int, memory_order_seq_cst)
    asm("fence.sc.sys;");
    asm("atom.exch.acquire.sys.b32 %0, [%1], %2;" ::: "memory");
int atomic<int>::exchange(int, memory_order_acq_rel)
    asm("atom.exch.acq_rel.sys.b32 %0, [%1], %2;" ::: "memory");
int atomic<int>::exchange(int, memory_order_release)
    asm("atom.exch.release.sys.b32 %0, [%1], %2;" ::: "memory");
int atomic<int>::exchange(int, memory_order_acquire)
    asm("atom.exch.acquire.sys.b32 %0, [%1], %2;" ::: "memory");
int atomic<int>::exchange(int, memory_order_relaxed)
    asm("atom.exch.relaxed.sys.b32 %0, [%1], %2;" ::: "memory");    OR: y = atomicExch_system(ptr, x);

And so on… Our ASPLOS 2019 paper: https://github.com/NVlabs/ptxmemorymodel.
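What the table buys you in practice, as a sketch I'm adding: classic message passing, written once in portable ISO C++ and correct on both host and device, because the release store lowers to st.release.sys (or stronger) and the acquire load to ld.acquire.sys.

__host__ __device__ void produce(atomic<int>& flag, int& data) {
    data = 42;                             // payload
    flag.store(1, memory_order_release);   // publish: orders the payload before the flag
}

__host__ __device__ int consume(atomic<int>& flag, int& data) {
    while (flag.load(memory_order_acquire) != 1)
        ;                                  // wait for publication
    return data;                           // guaranteed to observe 42
}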
PR #3: TRUE SHARING
• Concurrent data sharing between CPU and GPU is a new possibility.
• Real usefulness has some more conditions:

Platform / allocator                 | Load/store sharing | Atomic (low cont’n) | Atomic (high cont’n)
Any ARM/Windows/Mac, or unmanaged    | Nope.              | Not at all.         | Not at all.
x86 Linux (CPU/GPU), managed         | Yes.               | Technically…        | …but no.
x86 Linux (GPU/GPU), managed         | YES!               | TRY IT!             | TRY IT!
POWER Linux (all pairs), managed     | YES!               | TRY IT!             | TRY IT!
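A minimal sketch of what the "TRY IT!" rows permit, under my assumptions (Linux, a device reporting cudaDevAttrConcurrentManagedAccess = 1, and an atomic<> usable from both host and device code, like the one previewed above): one atomic flag in managed memory, stored by the GPU while the CPU concurrently spins on it.

#include <new>
#include <thread>

__global__ void signal(atomic<int>* flag) {
    flag->store(1, memory_order_release);   // GPU publishes...
}

int main() {
    atomic<int>* flag;
    cudaMallocManaged(&flag, sizeof(atomic<int>));
    new (flag) atomic<int>(0);              // construct in managed memory
    std::thread waiter([=] {                // ...while the CPU concurrently spins
        while (flag->load(memory_order_acquire) == 0)
            ;
    });
    signal<<<1, 1>>>(flag);
    waiter.join();                          // returns once true sharing worked
    cudaDeviceSynchronize();                // the one allowed use in this talk: teardown
    cudaFree(flag);
}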
PRELIMINARIES
CONTENTION IS THE ISSUE, DIFFERENTLY.
• BW = 1/Lat.
• NUMA is a punishing depressor of CPU perf.
• Little’s law finally kicks in.
• The bathtub curve is due to the statistical likelihood of finding a peer in the pipeline.
[Chart: latency (1e-8 to 1e-5 seconds, log scale) vs. contending threads (1–2048); V100, POWER, x86.]

__host__ __device__ void test(int my_thread, int total_threads, int final_value) {
    // a is a shared atomic<int> counter, initially 0. Each thread claims the
    // tickets congruent to its id, so every increment is contended.
    for(int old; my_thread < final_value; my_thread += total_threads)
        while(!a.compare_exchange_weak(old = my_thread, my_thread + 1, memory_order_relaxed))
            ;
}
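Little's law ties the chart's two readings together. A rough worked example, with numbers read off the chart (so approximate, and assuming steady state with one operation in flight per thread):

    throughput = threads / latency ≈ 2048 / 1e-5 s ≈ 2×10^8 operations per second, system-wide.

Latency grows at high occupancy, but concurrency grows faster; "BW = 1/Lat" only describes the single-thread end of the curve.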
CONTENDING PROCESSORS ARE CRUSHED…
Crushed? ½ millisecond.
[Charts: latency (seconds, log scale) vs. thread occupancy (1–2048), and vs. GPU × CPU thread mix (0–2048 each); contended latency climbs to ~0.5 ms. V100, x86.]
…UNLESS THE PROCESSORS ARE NVLINK’ED.
[Charts: latency (seconds, log scale) vs. GPU × CPU thread mix (0–2048 each); x86 + V100 (PCIE) vs. POWER + V100 (NVLINK). The NVLINK pair holds up under contention.]
ALL OF THE FOLLOWING SLIDES ARE NVLINK’ED.
And not log scale, because it’s legible in linear scale now. Thanks.
[Chart: latency (1e-7 to 1e-5 seconds) vs. GPU × CPU thread mix.]
CONTENDED MUTEXES
CONTENDED MUTEXES, AS AN EXERCISE TO THINK ABOUT THROUGHPUT AND FAIRNESS
CONTENDED EXCHANGE LOCK

struct mutex {
    __host__ __device__ void lock() {
        while(1 == l.exchange(1, memory_order_acquire))
            ;
    }
    __host__ __device__ void unlock() {
        l.store(0, memory_order_release);
    }
    atomic<int> l = ATOMIC_VAR_INIT(0);
};

[Chart: latency (1e-7 to 1e-5 seconds) vs. GPU × CPU thread mix (0–2048 each).]

Not awesome. 🎊 Stay. Keep attending my talk. 🎊
CONTENDED EXCHANGE LOCK
Heavy system pressure:
• A lot of requests.
• Each request is slow.
[Chart: latency (1e-7 to 1e-5 seconds) vs. GPU × CPU thread mix (0–2048 each).]