Read-Log-Update: A Lightweight Synchronization Mechanism for Concurrent Programming
Paper Reading Group
Authors: Alexander Matveev, Nir Shavit, Pascal Felber, Patrick Marlier
Presented by: Maksym Planeta, 24.09.2015
Table of Contents Introduction RLU design Evaluation Conclusion
Motivation
What is bad with RCU?
◮ Complex to use for writers;
◮ Optimized for a low number of writers;
◮ High delays in synchronize_rcu.
Contributions
RCU + STM = RLU.
◮ Update several objects with a single counter increment; traverse doubly linked lists in both directions!
◮ Stay compatible with RCU.
RCU recap
[Figure 2 from the paper: concurrent search and removal on an RCU-based linked list a → b → c. T1 runs search(c): ➊ rcu_read_lock(), ➌ rcu_dereference(b), ➎ rcu_dereference(c), ➐ rcu_read_unlock(). T2 runs remove(b): ➋ unlink b, ➍ synchronize_rcu(), ➏ grace period, ➑ kfree(b).]
Single point manipulation

    static inline void list_add_rcu(struct list_head *new,
                                    struct list_head *prev,
                                    struct list_head *next)
    {
        new->next = next;
        new->prev = prev;
        rcu_assign_pointer(list_next_rcu(prev), new);
        next->prev = new;
    }
RLU style

    /* ... some important code that we consider later ... */
    /* Update references */
    rlu_assign_ptr(&(new->next), next);
    rlu_assign_ptr(&(prev->next), new);
    /* Commit */
    rlu_reader_unlock();
Table of Contents Introduction RLU design Evaluation Conclusion
Basic idea
1. All operations read the global clock when they start;
2. The clock is used to dereference shared objects;
3. Write operations write to a log (an RCU-style copy of the object);
4. Increment the global clock to commit a write (corresponds to swapping pointers in RCU);
5. Wait for old readers to finish (corresponds to synchronize_rcu);
6. Write back objects from the log (corresponds to RCU memory reclamation).
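The commit steps above can be sketched in a few lines of C. This is a single-threaded model for illustration only: the helper names and the two-reader array are mine, not the paper's API, and real RLU uses atomic operations and per-thread contexts.

```c
/* Minimal sketch of the RLU clock protocol (assumed names, not the paper's). */
#include <assert.h>

#define INF 1000000

static int global_clock = 22;        /* g-clock */
static int write_clock  = INF;       /* committing writer's w-clock */
static int reader_local_clock[2];    /* l-clock per reader thread */

/* Step 1: a reader samples the global clock when it starts. */
static void reader_lock(int tid) {
    reader_local_clock[tid] = global_clock;
}

/* Steps 4-5: the writer publishes its log by advancing the clock;
 * readers that start afterwards will steal the new copies. */
static int commit_start(void) {
    write_clock = global_clock + 1;  /* enable stealing */
    global_clock += 1;               /* advance g-clock */
    return write_clock;
}

/* The writer must wait only for readers that started before the commit. */
static int reader_must_wait(int tid, int new_clock) {
    return reader_local_clock[tid] < new_clock;
}
```

A reader that sampled l-clock = 22 forces the committing writer to wait, while a reader that starts after the commit (l-clock = 23) does not.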
Read-read example
[Diagram: threads T1, T2, T3, each with l-clock, w-clock = ∞, and an empty w-log; g-clock = 22; objects O1, O2, O3 unlocked in memory. ➊ T1 and T2 read g-clock (l-clock ← 22); ➋ they read O1 and O2 directly, since the objects are not locked; ➌ a reader then reads O2, also not locked.]
Write-read example
[Diagram: g-clock = 22, T2.w-clock = ∞. ➌ T2 reads O2, ➍ logs O2 in its w-log (and locks it), ➎ updates O2 in the w-log; ➏ logs O3 (and locks it), ➐ updates O3 in the w-log. A concurrent reader that finds an object locked by T2 checks: if its l-clock ≥ T2.w-clock, it steals the new copy from T2's w-log; otherwise it reads the original object.]
Read-write-steal example
[Diagram: ➑ T2 commits O2: 1) w-clock ← 23, 2) g-clock ← 23, 3) wait for readers with l-clock < 23 (here T1). Meanwhile ➊ T3 reads g-clock (l-clock ← 23) and ➋ reads O3, which is locked by T2; since l-clock ≥ T2.w-clock, T3 steals the new copy from T2's w-log. ➍ Once the old readers are done, 4) T2 writes back its w-log.]
Real list add

    int rlu_list_add(rlu_thread_data_t *self, list_t *list, val_t val)
    {
        node_t *prev, *next, *new;
    restart:
        rlu_reader_lock();
        /* Find right place ... */
        if (!rlu_try_lock(self, &prev) ||
            !rlu_try_lock(self, &next)) {
            rlu_abort(self);
            goto restart;
        }
        new = rlu_new_node();
        new->val = val;
        rlu_assign_ptr(&(new->next), next);
        rlu_assign_ptr(&(prev->next), new);
        rlu_reader_unlock();
    }
Reader lock 1: function RLU _ READER _ LOCK (ctx) ctx.is-writer ← false 2: ctx.run-cnt ← ctx.run-cnt +1 ⊲ Set active 3: memory fence 4: ctx.local-clock ← global-clock ⊲ Record global clock 5: 6: function RLU _ READER _ UNLOCK (ctx) ctx.run-cnt ← ctx.run-cnt +1 ⊲ Set inactive 7: if ctx.is-writer then 8: RLU _ COMMIT _ WRITE _ LOG (ctx) ⊲ Write updates 9: 173 173
Memory commit 44: function RLU _ COMMIT _ WRITE _ LOG (ctx) ctx.write-clock ← global-clock +1 ⊲ Enable stealing 45: FETCH _ AND _ ADD (global-clock, 1) ⊲ Advance clock 46: 47: RLU _ SYNCHRONIZE (ctx) ⊲ Drain readers RLU _ WRITEBACK _ WRITE _ LOG (ctx) ⊲ Safe to write back 48: RLU _ UNLOCK _ WRITE _ LOG (ctx) 49: ctx.write-clock ← ∞ ⊲ Disable stealing 50: RLU _ SWAP _ WRITE _ LOGS (ctx) ⊲ Quiesce write-log 51: 173
Pointer dereference 10: function RLU _ DEREFERENCE (ctx, obj) ptr-copy ← GET _ COPY (obj) ⊲ Get copy pointer 11: 12: if IS _ UNLOCKED (ptr-copy) then ⊲ Is free? return obj ⊲ Yes ⇒ return object 13: if IS _ COPY (ptr-copy) then ⊲ Already a copy? 14: ⊲ Yes ⇒ return object return obj 15: thr-id ← GET _ THREAD _ ID (ptr-copy) 16: if thr-id = ctx.thr-id then ⊲ Locked by us? 17: return ptr-copy ⊲ Yes ⇒ return copy 18: other-ctx ← GET _ CTX (thr-id) ⊲ No ⇒ check for steal 19: if other-ctx.write-clock ≤ ctx.local-clock then 20: return ptr-copy ⊲ Stealing ⇒ return copy 21: return obj ⊲ No stealing ⇒ return object 22: 173
RLU deferring
1. On commit, do not increment the global clock and do not run RLU synchronize;
2. Instead, save the write-log and create a new log for the next writer;
3. Synchronize only when a writer tries to lock an object that is already locked.
RLU deferring advantages
1. Fewer RLU synchronize calls;
2. Less contention on the global clock;
3. Less stealing, hence fewer cache misses.
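The point of deferral is to amortize synchronization: commits only buffer the write-log, and the expensive clock advance plus synchronize happens once, on a lock conflict. A toy counter sketch (names and structure are mine, not the paper's) makes the amortization visible:

```c
/* Toy model of write-log deferral: three commits, one synchronize. */
#include <assert.h>

static int pending_logs = 0;   /* saved write-logs awaiting write-back */
static int sync_calls   = 0;   /* actual RLU synchronize invocations */

/* Deferred commit: just save the log; no clock increment, no sync. */
static void rlu_commit_deferred(void) {
    pending_logs++;
}

/* A writer hit an object locked by a deferred log: drain everything. */
static void rlu_sync_on_conflict(void) {
    sync_calls++;
    pending_logs = 0;
}
```

With deferral, n conflict-free commits cost one synchronize instead of n, which is where advantages 1 and 2 above come from.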
Table of Contents Introduction RLU design Evaluation Conclusion
Linked lists
[Figure 4: throughput (operations/µs) for a user-space linked list of 1,000 nodes with 2% (left), 20% (middle), and 40% (right) updates, over 4-16 threads; variants: RCU, Harris, Harris (HP), RLU (leaky).]
Hash table
[Figure 5: throughput (operations/µs) for a user-space hash table (1,000 buckets of 100 nodes) with 2% (left), 20% (middle), and 40% (right) updates, over 4-16 threads; variants: RCU, Harris, Harris (HP), RLU (leaky), RLU (defer).]
Resizable hash table
[Figure 6: throughput (operations/µs) for a resizable hash table (64K items, 8-16K buckets), over 1-14 threads; variants: RCU 8K, RCU 16K, RCU 8-16K, RLU 8K, RLU 16K, RLU 8-16K.]
Update-only stress test (hash table)
[Figure 7: throughput (operations/µs) for a stress test on a hash table (10,000 buckets of one node each) with 100% updates, over 1-16 threads; variants: RCU, RLU, RLU (defer).]
Citrus search tree (throughput)
[Throughput (operations/µs) for a Citrus tree of 100,000 nodes with 10%, 20%, and 40% updates, RCU vs. RLU, over 1-80 threads.]