

1. Scalable NUMA-aware Blocking Synchronization Primitives (Sanidhya Kashyap, Changwoo Min, Taesoo Kim)

2–4. The rise of big NUMA machines

5–7. Importance of NUMA awareness
[Diagram: workers W1–W6 on NUMA node 1 and NUMA node 2 contend for a lock L protecting a shared file. A NUMA-oblivious lock interleaves waiters from both nodes; a NUMA-aware/hierarchical lock serves waiters from the same node together.]
Idea: Make synchronization primitives NUMA aware!

8–11. Locks' research efforts and their use

Lock research efforts:
– Dekker's algorithm (1962)
– Semaphore (1965)
– Lamport's bakery algorithm (1974)
– Backoff lock (1989)
– Ticket lock (1991)
– MCS lock (1991)
– HBO lock (2003)
– Hierarchical lock – HCLH (2006)
– Flat combining NUMA lock (2011)
– Remote Core Locking (2012)
– Cohort lock (2012)
– RW cohort lock (2013)
– Malthusian lock (2014)
– HMCS lock (2015)
– AHMCS lock (2016)
(The slide brackets the entries from Flat combining NUMA lock (2011) onward as "NUMA-aware locks".)

Linux kernel lock adoption/modification (TTAS is sketched below):
– 1990s: spinlock → TTAS; semaphore → TTAS + block; rwsem → TTAS + block
– 2011: spinlock → ticket; mutex → TTAS + block; rwsem → TTAS + block
– 2014: spinlock → ticket; mutex → TTAS + spin + block; rwsem → TTAS + spin + block
– 2016: spinlock → qspinlock; mutex → TTAS + spin + block; rwsem → TTAS + spin + block

Takeaway: adopting NUMA-aware locks is not easy.
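To make the table concrete: TTAS (test-and-test-and-set) is the spinning scheme most of the early kernel locks above build on. A minimal C11 sketch, purely illustrative and not the kernel's implementation:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool locked; } ttas_lock_t;

    static void ttas_acquire(ttas_lock_t *l) {
        for (;;) {
            /* "Test": spin on plain loads so waiters share the cache line
               read-only instead of ping-ponging it with atomic writes. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* "Test-and-set": attempt the expensive atomic write only when
               the lock looked free. */
            if (!atomic_exchange_explicit(&l->locked, true,
                                          memory_order_acquire))
                return;
        }
    }

    static void ttas_release(ttas_lock_t *l) {
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }

The "+ block" and "+ spin + block" variants in the table layer a sleep/wake path on top of this kind of spinning core.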

12. Issues with NUMA-aware primitives
● Memory footprint overhead
  – A single cohort lock instance: 1,600 bytes
  – Example: 1–4 GB of lock space for 10 M inodes, vs. 38 MB for Linux's locks (at roughly 4 bytes per Linux spinlock, 10 M locks ≈ 38 MB; at hundreds of bytes per NUMA-aware lock instance, the same count runs into gigabytes)
● No support for blocking/parking behavior

13–17. Blocking/parking approach
● Under-subscription: #threads <= #cores
● Over-subscription: #threads > #cores
● Spin-then-park strategy (sketched below):
  1) Spin for a certain duration
  2) Add to a parking list
  3) Schedule out (park/block)
[Graph: lock throughput vs. #threads, comparing spinning, parking, and spin + park across the under-subscribed and over-subscribed regions.]
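A minimal sketch of the spin-then-park wait loop on Linux, using a futex for the park step; SPIN_LIMIT, the state encoding, and the wake-side protocol are assumptions for illustration, not the paper's code:

    #define _GNU_SOURCE
    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SPIN_LIMIT 1024            /* "certain duration": tunable */

    /* Wait for *state to become nonzero (lock granted). */
    static void spin_then_park(atomic_int *state) {
        /* 1) Spin for a certain duration. */
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (atomic_load_explicit(state, memory_order_acquire))
                return;
        /* 2)+3) Park: FUTEX_WAIT enqueues us on the kernel's wait list and
           schedules us out; the releaser sets *state and calls FUTEX_WAKE. */
        while (!atomic_load_explicit(state, memory_order_acquire))
            syscall(SYS_futex, state, FUTEX_WAIT_PRIVATE, 0, NULL, NULL, 0);
    }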

18. Issues with blocking synchronization primitives
● High memory footprint for NUMA-aware locks
● Inefficient blocking strategy
  – Scheduling overhead in the critical path
  – Cache-line contention while scheduling out

19. CST lock
● NUMA-aware lock
● Low memory footprint
  – Allocates socket-specific data structures only when used
  – 1.5–10X less memory consumption
● Efficient parking/wake-up strategy
  – Limits spinning to a waiter's time quantum
  – Passes the lock to an active (spinning) waiter
  – Improves scalability by 1.2–4.7X

20. CST lock design
● NUMA-aware lock
  ➢ Cohort lock principle
  + Mitigates cache-line contention and bouncing
● Memory-efficient data structure (struct sketch below)
  ➢ Allocates a per-socket structure (snode) only when first used
  ➢ Snodes stay live for the lifetime of the lock
  + Does not stress the memory allocator
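The data layout implied by the following slides might look roughly like this C sketch; the field names (socket_list, global_tail, waiting_tail, parking_tail, snode_next) come from the diagrams later in the deck, while the exact types are assumptions:

    struct cst_qnode;                    /* per-thread node; lives on the stack */

    struct snode {                       /* per-socket node, allocated on first use */
        struct cst_qnode *waiting_tail;  /* tail of this socket's spinning waiters */
        struct cst_qnode *parking_tail;  /* tail of this socket's parked waiters   */
        struct snode     *snode_next;    /* links the snodes of one lock instance  */
        int               socket_id;
    };

    struct cst_lock {
        struct snode *socket_list;       /* snodes allocated so far (lazily)        */
        struct snode *global_tail;       /* tail of the inter-socket (global) queue */
    };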

21. CST lock design
● NUMA-aware parking list (sketched below)
  ➢ Maintains separate per-socket parking lists for readers and writers
  + Mitigates cache-line contention in over-subscribed scenarios
  + Allows distributed wake-up of parked readers
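One plausible shape for the per-socket reader/writer parking lists, with a distributed wake-up that only walks the local socket's list; the two-list split per snode comes from the slide, but the names and the wake routine are illustrative assumptions, not the paper's code:

    #include <stdatomic.h>
    #include <stddef.h>

    enum { PW = 0, UW = 1 };              /* parked / unparked waiter states */

    struct parked_waiter {
        struct parked_waiter *p_next;     /* parking-list link */
        atomic_int            state;      /* PW while parked   */
    };

    struct snode_parking {                /* per-socket, one per snode */
        struct parked_waiter *reader_list;
        struct parked_waiter *writer_list;
    };

    /* Wake every reader parked on one socket. Each socket can run this
       independently, which is the "distributed wake-up" above: no thread
       walks a global list, and the touched cache lines stay local. */
    static void wake_local_readers(struct snode_parking *s) {
        for (struct parked_waiter *w = s->reader_list; w; w = w->p_next) {
            atomic_store_explicit(&w->state, UW, memory_order_release);
            /* ...followed by a futex/scheduler wake-up of w's thread. */
        }
        s->reader_list = NULL;
    }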

22. CST lock design
● Removes scheduler intervention (see the sketch below)
  ➢ Passes the lock to a spinning waiter
  ➢ Waiters park themselves if more than one task is running on their CPU (system load)
  + Scheduler is not involved in the critical path
  + Guarantees forward progress of the system
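The "park only under load" rule could be sketched as below. nr_running_on() is a hypothetical helper: inside the kernel it would be the per-CPU run-queue length, and plain userspace code has no direct equivalent:

    /* Hypothetical helper: number of runnable tasks on a CPU (in-kernel,
       the per-CPU run-queue length). */
    extern int nr_running_on(int cpu);

    /* Keep spinning while this waiter is alone on its CPU, so the holder
       can hand the lock over without waking the scheduler; park only when
       the CPU is time-shared, i.e. the system is over-subscribed. */
    static int should_park(int cpu) {
        return nr_running_on(cpu) > 1;
    }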

23–26. Lock instantiation (allocation sketch below)
● Initially, no snodes are allocated (socket_list is empty).
● The first thread arriving from a particular socket initiates the allocation of that socket's snode.
[Diagram: threads T1/S1, T2/S1, T3/S1 on socket 1 share a single snode (socket_list = [S1]); when T4/S2 arrives on socket 2, a second snode is allocated (socket_list = [S1, S2]); global_tail points into the lock's global queue.]
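A sketch of the lazy snode allocation, under the layout assumed earlier (socket_list made atomic for a lock-free push); a real implementation must also handle two threads of the same socket racing to allocate, and a failed calloc, both elided here:

    #include <stdatomic.h>
    #include <stdlib.h>

    struct snode {
        int socket_id;
        struct snode *snode_next;
        /* waiting_tail / parking_tail elided */
    };

    struct cst_lock {
        _Atomic(struct snode *) socket_list;
        /* global_tail elided */
    };

    /* Return the caller's socket's snode, allocating it on first use. */
    static struct snode *get_snode(struct cst_lock *lock, int socket_id) {
        struct snode *head = atomic_load(&lock->socket_list);
        for (struct snode *n = head; n; n = n->snode_next)
            if (n->socket_id == socket_id)
                return n;                       /* already allocated */

        struct snode *snode = calloc(1, sizeof(*snode));
        snode->socket_id = socket_id;
        do {                                    /* lock-free push onto socket_list */
            snode->snode_next = head;
        } while (!atomic_compare_exchange_weak(&lock->socket_list, &head, snode));
        return snode;
    }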

27–30. CST lock phase (acquire sketch below)
● Each thread allocates a thread-specific structure on its stack.
● Three states for each node:
  – L → locked
  – UW → unparked/spinning waiter
  – PW → parked/blocked/scheduled-out waiter
[Diagram: T1 on socket 1 enqueues its stack-allocated node (fields: next, p_next) on snode S1's waiting_tail; the snode also carries parking_tail and snode_next; T1's node moves to state L and the snode acquires the global lock via global_tail.]
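The local (per-socket) half of the acquire phase can be sketched MCS-style, using the node fields and states from the diagram; the global phase, where the snode at the head of the local queue competes for global_tail, is elided, and the parking transition (UW → PW) is only marked in a comment:

    #include <stdatomic.h>
    #include <stddef.h>

    enum qstate { STATE_L, STATE_UW, STATE_PW };  /* locked / spinning / parked */

    struct cst_qnode {                      /* lives on the acquiring thread's stack */
        _Atomic(enum qstate)        state;
        _Atomic(struct cst_qnode *) next;   /* local (per-socket) queue link */
        struct cst_qnode           *p_next; /* parking-list link             */
    };

    /* Enqueue on the snode's waiting_tail; return once the local lock is held. */
    static void cst_local_acquire(_Atomic(struct cst_qnode *) *waiting_tail,
                                  struct cst_qnode *me) {
        atomic_store(&me->state, STATE_UW);
        atomic_store(&me->next, NULL);
        struct cst_qnode *prev = atomic_exchange(waiting_tail, me);
        if (!prev) {                        /* queue was empty: local lock is ours */
            atomic_store(&me->state, STATE_L);
            return;
        }
        atomic_store(&prev->next, me);      /* link behind our predecessor */
        while (atomic_load(&me->state) != STATE_L)
            ;  /* spin as UW until the predecessor hands over (sets STATE_L);
                  a real waiter may instead switch to STATE_PW and park. */
    }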
