Algorithms for Optimization of Remote Core Locking Synchronization in Hierarchical Multicore Computer Systems
Alexey Paznikov (apaznikov@gmail.com)
Saint Petersburg Electrotechnical University "LETI"; Siberian State University of Telecommunications and Information Sciences, Novosibirsk; Rzhanov Institute of Semiconductor Physics, Siberian Branch of RAS, Novosibirsk
The first summer school on practice and theory of concurrent computing, July 3-7, 2017, ITMO University, St. Petersburg
Multicore computer systems with shared memory
[Figure: two NUMA nodes connected by QuickPath / HyperTransport; each node has a memory controller and CPUs with four cores each, per-core L1 and L2 caches, and an L3 cache per CPU]
Architectural features of modern multicore computer systems (CS):
▪ Scalability: the number of CPU cores exceeds $10^2$ to $10^3$
▪ Hierarchical structure: multilevel caches, logical processors
▪ Non-uniform memory access (NUMA systems)
▪ Heterogeneous structure: specialized accelerators and co-processors
▪ A variety of mechanisms for maintaining a consistent memory state
Synchronization in multithreaded programs: locks
[Figure: threads T1 and T2 contending for a critical section]
Parallel execution of critical sections is serialized, that is, replaced by sequential execution.
Drawbacks:
▪ Low scalability
▪ Serialization
▪ Deadlocks, livelocks
▪ Starvation
▪ Priority inversion
▪ Unpredictable order of critical section execution (FIFO)
▪ Reasonable overheads
Implementations:
▪ Queue locks (CLH, MCS)
▪ Spinlocks (test-and-set-based, exponential-backoff-based)
▪ Flat combining (CC-Synch, DSM-Synch, Oyama lock, etc.)
▪ Futex-based (PThreads mutex, PThreads read-write lock)
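As an illustration of the spinlock family listed above, here is a minimal sketch of a test-and-set spinlock with bounded exponential backoff, written with C11 atomics and PThreads. The names (tas_lock_t, tas_lock, run_spinlock_demo) and the backoff constants are assumptions made for this example; they do not come from the slides or from any particular library.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Test-and-set spinlock; all names here are illustrative. */
typedef struct {
    atomic_flag locked;
} tas_lock_t;

/* Busy-wait for roughly `iters` loop iterations. */
static void spin_pause(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++) { }
}

static void tas_lock(tas_lock_t *l) {
    unsigned delay = 16;               /* initial backoff, chosen arbitrarily */
    const unsigned max_delay = 4096;   /* backoff cap, chosen arbitrarily */
    /* test_and_set returns the previous value: true means the lock is held. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        spin_pause(delay);             /* back off before touching the flag again */
        if (delay < max_delay)
            delay *= 2;                /* exponential growth up to the cap */
    }
}

static void tas_unlock(tas_lock_t *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

enum { NTHREADS = 4, NITERS = 10000 };

static tas_lock_t lock = { ATOMIC_FLAG_INIT };
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        tas_lock(&lock);
        counter++;                     /* critical section */
        tas_unlock(&lock);
    }
    return NULL;
}

/* Runs NTHREADS workers and returns the final counter value. */
static long run_spinlock_demo(void) {
    pthread_t tids[NTHREADS];
    counter = 0;
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&tids[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return counter;
}
```

Every waiting thread polls the same global flag, which is the source of the contention and serialization listed among the drawbacks; the backoff only reduces how often the flag is hit.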
Synchronization in multithreaded programs: lock-free algorithms and concurrent data structures
[Figure: threads T1 and T2 updating a shared structure with repeated CAS operations]
Atomic operations (CAS, compare-and-swap; LL/SC, load-linked/store-conditional) are used to ensure thread safety.
Drawbacks:
▪ High complexity of parallel program development
▪ ABA problem
▪ Atomic operations apply only to variables the size of a machine word
▪ Low efficiency of atomic operations
Modern algorithms:
▪ Lock-free producer-consumer
▪ Exponential backoff
▪ Elimination arrays
▪ Diffracting trees
▪ Sorting networks
▪ Cliff Click hash table
▪ Skip list
▪ Split-ordering
Solutions for the ABA problem:
▪ Quiescent-based schemes
▪ Pointer-based schemes (hazard pointers, drop-the-anchor, pass-the-buck)
▪ Reference counting
▪ Tagged state reference
▪ Intermediate nodes
▪ TM-based
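To make the CAS retry loop and the "tagged state reference" remedy for the ABA problem concrete, the following is a minimal sketch of a lock-free LIFO stack over a fixed node pool. The head is a single 64-bit word that packs a node index with a tag incremented on every successful update, so a stale CAS fails even if the same index reappears. All names (pool, push, pop, stack_demo) and the pool size are assumptions for this example, and the 32-bit tag can in principle wrap around.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define POOL_SIZE 8u
#define NIL UINT32_MAX              /* "no node" marker */

typedef struct {
    int value;
    _Atomic uint32_t next;          /* index of the next node, or NIL */
} node_t;

static node_t pool[POOL_SIZE];

/* head = (tag << 32) | index; starts as tag 0, empty stack. */
static _Atomic uint64_t head = NIL;

static uint64_t pack(uint32_t tag, uint32_t idx) {
    return ((uint64_t)tag << 32) | idx;
}
static uint32_t idx_of(uint64_t h) { return (uint32_t)h; }
static uint32_t tag_of(uint64_t h) { return (uint32_t)(h >> 32); }

/* Pushes pool[idx]; retries until the CAS on head succeeds. */
static void push(uint32_t idx) {
    uint64_t old = atomic_load(&head);
    uint64_t desired;
    do {
        atomic_store(&pool[idx].next, idx_of(old));
        desired = pack(tag_of(old) + 1, idx);
        /* On failure, `old` is refreshed with the current head value. */
    } while (!atomic_compare_exchange_weak(&head, &old, desired));
}

/* Pops the top node and returns its index, or NIL if the stack is empty. */
static uint32_t pop(void) {
    uint64_t old = atomic_load(&head);
    uint64_t desired;
    do {
        if (idx_of(old) == NIL)
            return NIL;
        desired = pack(tag_of(old) + 1, atomic_load(&pool[idx_of(old)].next));
    } while (!atomic_compare_exchange_weak(&head, &old, desired));
    return idx_of(old);
}

/* Single-threaded usage example; returns the head tag afterwards. */
static uint32_t stack_demo(void) {
    pool[0].value = 10;
    pool[1].value = 20;
    push(0);
    push(1);
    uint32_t first = pop();         /* LIFO order: node 1 comes out first */
    uint32_t second = pop();
    uint32_t third = pop();         /* stack is empty now */
    assert(first == 1 && second == 0 && third == NIL);
    return tag_of(atomic_load(&head));
}
```

The tag counts successful updates (two pushes and two pops in the usage example), which is what lets a CAS distinguish "same index, same state" from "same index, reused after intervening operations".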
Synchronization in multithreaded programs: transactional memory (TM)
[Figure: threads T1 and T2 executing a transaction]
Code is organized into transactional sections that guarantee thread-safe access to shared memory areas (the protection applies to data, not to code sections).
Drawbacks:
▪ Very high overheads
▪ Transactions can be cancelled
▪ Restricted set of operations inside transactional sections
▪ Debugging complexity
▪ The program has to be refactored
TM implementations:
▪ GCC TM
▪ LazySTM
▪ TinySTM
▪ DTMC
▪ RSTM
▪ STM Monad
Synchronization in multithreaded programs: locks (summary)
In addition to the drawbacks and implementations of locks listed earlier:
▪ Locks become bottlenecks
▪ High access contention makes lock acquisition expensive
Development of more efficient locking algorithms is urgent today.
Time of critical section execution
[Figure: threads T1, T2, T3 execute critical sections CS1, CS2, CS3 one after another along the time axis t; the critical path consists of the critical sections and the transfers of lock ownership between them]
Time of critical section execution (critical path): $t = t_1 + t_2$, where
▪ $t_1$ is the time to execute the instructions of the critical section
▪ $t_2$ is the time to transfer lock ownership
What the transfer of lock ownership costs in existing locking algorithms:
▪ Spinlock: access to a global flag
▪ PThread mutex: context switch
▪ MCS: thread activation
▪ Flat combining: acquisition of a global lock
Developing locking algorithms that minimize the time of lock ownership transfer is a relevant problem.
Time of critical section execution: access to global variables
[Figure: threads T1, T2, T3 execute critical sections CS1, CS2, CS3 along the time axis t; each critical section accesses a global variable v, and the access time is part of the critical path]
Time of critical section execution (critical path): $t = t_1 + t_2 + t_3$, where
▪ $t_1$ is the time to execute the instructions of the critical section
▪ $t_2$ is the time to transfer lock ownership
▪ $t_3$ is the time of access to global variables
Localization of memory access in existing locking algorithms:
▪ Spinlock: no localization
▪ PThread mutex: no localization
▪ MCS: no localization
▪ Flat combining: partial localization
Developing locking algorithms that ensure memory access localization is relevant today.
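A short worked estimate shows why the two overhead terms matter. It is a rough sketch under stated assumptions, not a result from the slides: $n$ threads each execute one critical section under the same lock, all sections have identical costs, and the terms are written as $t_1$, $t_2$, $t_3$ (instruction time, ownership transfer time, global variable access time). Since the sections run one after another, the time to complete all of them is approximately

$$T_n \approx n\,(t_1 + t_2 + t_3) = n\,t_1 + n\,(t_2 + t_3).$$

Only $n\,t_1$ is useful work; the overhead $n\,(t_2 + t_3)$ grows linearly with the number of contending threads, so any reduction in $t_2$ or $t_3$ is multiplied by $n$.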
Remote Core Locking (RCL) technique
[Figure: threads T1, T2, T3 hand critical sections CS1, CS2, CS3 over to an RCL server running on core X, which accesses the global variables v1 and v2]
RCL shortens the critical path of critical section execution, $t = t_1 + t_2 + t_3$, by minimizing:
▪ the time $t_2$ of lock ownership transfer
▪ the time $t_3$ of access to global variables
Lozi J. P. et al. Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications // USENIX Annual Technical Conference. – 2012. – P. 65-76.
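The delegation idea can be sketched in a few dozen lines: each client publishes a request (function pointer plus argument) in its own cache-line-sized slot, and a single server thread scans the slots, runs the pending critical sections, and clears each slot to signal completion. This is a simplified illustration, not the liblock implementation; the names (request_t, remote_exec, run_rcl_demo), the 64-byte line size, and the sched_yield calls (added only so the demo behaves on machines with few cores) are assumptions of this example.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NCLIENTS 4
#define NITERS   1000

typedef void *(*cs_fn)(void *);

/* One request slot per client, aligned to an assumed 64-byte cache line. */
typedef struct {
    _Alignas(64) _Atomic(cs_fn) fn;   /* non-NULL means "request pending" */
    void *arg;
    void *ret;
} request_t;

static request_t requests[NCLIENTS];
static atomic_bool server_stop;
static long shared_counter;           /* touched only by the server thread */

/* Server loop: execute every pending critical section, then clear its slot. */
static void *server(void *unused) {
    (void)unused;
    while (!atomic_load_explicit(&server_stop, memory_order_acquire)) {
        int served = 0;
        for (int i = 0; i < NCLIENTS; i++) {
            cs_fn fn = atomic_load_explicit(&requests[i].fn, memory_order_acquire);
            if (fn != NULL) {
                requests[i].ret = fn(requests[i].arg);
                atomic_store_explicit(&requests[i].fn, (cs_fn)NULL,
                                      memory_order_release);
                served++;
            }
        }
        if (served == 0)
            sched_yield();            /* demo-only politeness, see lead-in */
    }
    return NULL;
}

/* Client side: post the request and wait until the server clears the slot. */
static void *remote_exec(int client, cs_fn fn, void *arg) {
    requests[client].arg = arg;
    atomic_store_explicit(&requests[client].fn, fn, memory_order_release);
    while (atomic_load_explicit(&requests[client].fn, memory_order_acquire) != NULL)
        sched_yield();
    return requests[client].ret;
}

static void *cs_increment(void *arg) {
    (void)arg;
    shared_counter++;                 /* runs on the server thread only */
    return NULL;
}

static void *client(void *arg) {
    int id = *(int *)arg;
    for (int i = 0; i < NITERS; i++)
        remote_exec(id, cs_increment, NULL);
    return NULL;
}

/* Starts one server and NCLIENTS clients; returns the final counter. */
static long run_rcl_demo(void) {
    pthread_t srv, tids[NCLIENTS];
    int ids[NCLIENTS];
    shared_counter = 0;
    atomic_store(&server_stop, false);
    int rc = pthread_create(&srv, NULL, server, NULL);
    assert(rc == 0);
    for (int i = 0; i < NCLIENTS; i++) {
        ids[i] = i;
        rc = pthread_create(&tids[i], NULL, client, &ids[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < NCLIENTS; i++)
        pthread_join(tids[i], NULL);
    atomic_store_explicit(&server_stop, true, memory_order_release);
    pthread_join(srv, NULL);
    return shared_counter;
}
```

Because only the server thread ever touches shared_counter, the variable can stay in that core's cache, and no lock ownership changes hands between clients; these are the two effects the slide attributes to RCL.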
Remote Core Locking (RCL) technique: placement on the hardware
[Figure: threads T1, T2, T3 and the RCL server mapped onto CPU cores 1-8 of a system with two NUMA nodes connected by QuickPath / HyperTransport; each core has its own L1 and L2 caches, L3 caches are shared by groups of cores, and each NUMA node has its own memory]
Remote Core Locking (RCL) technique: time of access to global variables
Time of critical section execution (critical path): $t = t_1 + t_2 + t_3$
The time $t_3$ of access to a global variable inside a critical section depends on:
▪ the NUMA node on which the memory for the variable is allocated
▪ the processor core on which the RCL server runs
▪ which thread, on which processor core, accessed the variable last
[Figure: threads T1, T2, T3 and the RCL server mapped onto CPU cores 1-8 with per-core L1 and L2 caches and shared L3 caches; the global variables v1 and v2 reside in the memory of NUMA nodes 1 and 2, connected by QuickPath / HyperTransport]
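One of the factors above, the core on which the server runs, can be controlled explicitly. The following is a minimal, Linux-specific sketch that pins a thread to a single CPU with the GNU extension pthread_setaffinity_np and then reads the affinity mask back. The helper names (first_allowed_cpu, pin_self_to_cpu, run_pinning_demo) are assumptions for this example, and the choice of "first allowed CPU" is arbitrary; the slides do not prescribe this code. Controlling the NUMA node of the variable's memory is a separate matter and is not shown here.

```c
#define _GNU_SOURCE                 /* for pthread_setaffinity_np, CPU_* macros */
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Returns the lowest-numbered CPU the calling thread may run on, or -1. */
static int first_allowed_cpu(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restricts the calling thread to a single CPU; returns 0 on success. */
static int pin_self_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

typedef struct {
    int cpu;                        /* CPU the server should run on */
    int pinned_ok;                  /* set to 1 if pinning was confirmed */
} server_arg_t;

static void *server_thread(void *p) {
    server_arg_t *a = p;
    cpu_set_t now;
    CPU_ZERO(&now);
    /* Pin, then read the mask back to confirm it holds exactly one CPU. */
    a->pinned_ok = pin_self_to_cpu(a->cpu) == 0
        && pthread_getaffinity_np(pthread_self(), sizeof(now), &now) == 0
        && CPU_COUNT(&now) == 1
        && CPU_ISSET(a->cpu, &now);
    /* A real server would enter its request-serving loop here. */
    return NULL;
}

/* Returns 1 if a server thread could be pinned to one CPU, else 0. */
static int run_pinning_demo(void) {
    server_arg_t a = { first_allowed_cpu(), 0 };
    pthread_t tid;
    if (a.cpu < 0)
        return 0;
    int rc = pthread_create(&tid, NULL, server_thread, &a);
    assert(rc == 0);
    pthread_join(tid, NULL);
    return a.pinned_ok;
}
```

Under Linux's default local allocation policy, memory pages are normally placed on the NUMA node of the thread that first touches them, so the thread that initializes a variable also influences where that variable lives.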
Example of critical section execution in RCL

    #include <pthread.h>
    /* The liblock header and the definitions of NTHREADS, NITERS and
       topology are not shown on the slide and must be provided. */

    liblock_lock_t lock;
    int global_var = 0;

    /* Critical section: executed by the RCL server on behalf of a thread. */
    void *cs(void *arg) {
        global_var++;
        return NULL;
    }

    void *thread(void *arg) {
        int i;
        for (i = 0; i < NITERS; i++) {
            /* Critical section execution */
            liblock_exec(&lock, cs, NULL);
        }
        return NULL;
    }

    int main() {
        const char *liblock_name = "rcl";
        /* Lock initialization */
        liblock_lock_init(liblock_name, &topology->hw_threads[0], &lock, 0);

        pthread_t tids[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            /* Creation of thread */
            liblock_thread_create(&tids[i], NULL, thread, NULL);
        }
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(tids[i], NULL);
        }
        liblock_lock_destroy(&lock);
        return 0;
    }
Example of critical section execution in RCL: core number
In the call liblock_lock_init(liblock_name, &topology->hw_threads[0], &lock, 0), the second argument, &topology->hw_threads[0], is the core number: it selects the processor core used for the lock at initialization. The rest of the example (the critical section cs, the thread function, thread creation with liblock_thread_create, execution with liblock_exec, and liblock_lock_destroy) is unchanged.