

1. Scalable NUMA-aware Blocking Synchronization Primitives (Sanidhya Kashyap, Changwoo Min, Taesoo Kim)

2–4. The rise of big NUMA machines

5–7. Importance of NUMA awareness
[Diagram: workers W1–W6 on NUMA node 1 and NUMA node 2 contend for a lock L protecting a shared file. A NUMA-oblivious lock interleaves waiters from both nodes; a NUMA-aware/hierarchical lock serves waiters from the same node together.]
Idea: Make synchronization primitives NUMA aware!

8–11. Locks' research efforts and their use

Lock research efforts:
– Dekker's algorithm (1962)
– Semaphore (1965)
– Lamport's bakery algorithm (1974)
– Backoff lock (1989)
– Ticket lock (1991)
– MCS lock (1991)
– HBO lock (2003)
– Hierarchical lock – HCLH (2006)
– Flat combining NUMA lock (2011)
– Remote Core Locking (2012)
– Cohort lock (2012)
– RW cohort lock (2013)
– Malthusian lock (2014)
– HMCS lock (2015)
– AHMCS lock (2016)
(The slide brackets the entries from Flat combining NUMA lock (2011) onward as "NUMA-aware locks".)

Linux kernel lock adoption/modification (TTAS is sketched below):
– 1990s: spinlock → TTAS; semaphore → TTAS + block; rwsem → TTAS + block
– 2011: spinlock → ticket; mutex → TTAS + block; rwsem → TTAS + block
– 2014: spinlock → ticket; mutex → TTAS + spin + block; rwsem → TTAS + spin + block
– 2016: spinlock → qspinlock; mutex → TTAS + spin + block; rwsem → TTAS + spin + block

Takeaway: adopting NUMA-aware locks is not easy.
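To make the table concrete: TTAS (test-and-test-and-set) is the spinning scheme most of the early kernel locks above build on. A minimal C11 sketch, purely illustrative and not the kernel's implementation:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool locked; } ttas_lock_t;

    static void ttas_acquire(ttas_lock_t *l) {
        for (;;) {
            /* "Test": spin on plain loads so waiters share the cache line
               read-only instead of ping-ponging it with atomic writes. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* "Test-and-set": attempt the expensive atomic write only when
               the lock looked free. */
            if (!atomic_exchange_explicit(&l->locked, true,
                                          memory_order_acquire))
                return;
        }
    }

    static void ttas_release(ttas_lock_t *l) {
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }

The "+ block" and "+ spin + block" variants in the table layer a sleep/wake path on top of this kind of spinning core.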

12. Issues with NUMA-aware primitives
● Memory footprint overhead
  – A single cohort lock instance: 1,600 bytes
  – Example: 1–4 GB of lock space for 10 M inodes, vs. 38 MB for Linux's locks (at roughly 4 bytes per Linux spinlock, 10 M locks ≈ 38 MB; at hundreds of bytes per NUMA-aware lock instance, the same count runs into gigabytes)
● No support for blocking/parking behavior

13–17. Blocking/parking approach
● Under-subscription: #threads <= #cores
● Over-subscription: #threads > #cores
● Spin-then-park strategy (sketched below):
  1) Spin for a certain duration
  2) Add to a parking list
  3) Schedule out (park/block)
[Graph: lock throughput vs. #threads, comparing spinning, parking, and spin + park across the under-subscribed and over-subscribed regions.]
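A minimal sketch of the spin-then-park wait loop on Linux, using a futex for the park step; SPIN_LIMIT, the state encoding, and the wake-side protocol are assumptions for illustration, not the paper's code:

    #define _GNU_SOURCE
    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SPIN_LIMIT 1024            /* "certain duration": tunable */

    /* Wait for *state to become nonzero (lock granted). */
    static void spin_then_park(atomic_int *state) {
        /* 1) Spin for a certain duration. */
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (atomic_load_explicit(state, memory_order_acquire))
                return;
        /* 2)+3) Park: FUTEX_WAIT enqueues us on the kernel's wait list and
           schedules us out; the releaser sets *state and calls FUTEX_WAKE. */
        while (!atomic_load_explicit(state, memory_order_acquire))
            syscall(SYS_futex, state, FUTEX_WAIT_PRIVATE, 0, NULL, NULL, 0);
    }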

18. Issues with blocking synchronization primitives
● High memory footprint for NUMA-aware locks
● Inefficient blocking strategy
  – Scheduling overhead in the critical path
  – Cache-line contention while scheduling out

19. CST lock
● NUMA-aware lock
● Low memory footprint
  – Allocates socket-specific data structures only when used
  – 1.5–10X less memory consumption
● Efficient parking/wake-up strategy
  – Limits spinning to a waiter's time quantum
  – Passes the lock to an active (spinning) waiter
  – Improves scalability by 1.2–4.7X

20. CST lock design
● NUMA-aware lock
  ➢ Cohort lock principle
  + Mitigates cache-line contention and bouncing
● Memory-efficient data structure (struct sketch below)
  ➢ Allocates a per-socket structure (snode) only when first used
  ➢ Snodes stay live for the lifetime of the lock
  + Does not stress the memory allocator
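The data layout implied by the following slides might look roughly like this C sketch; the field names (socket_list, global_tail, waiting_tail, parking_tail, snode_next) come from the diagrams later in the deck, while the exact types are assumptions:

    struct cst_qnode;                    /* per-thread node; lives on the stack */

    struct snode {                       /* per-socket node, allocated on first use */
        struct cst_qnode *waiting_tail;  /* tail of this socket's spinning waiters */
        struct cst_qnode *parking_tail;  /* tail of this socket's parked waiters   */
        struct snode     *snode_next;    /* links the snodes of one lock instance  */
        int               socket_id;
    };

    struct cst_lock {
        struct snode *socket_list;       /* snodes allocated so far (lazily)        */
        struct snode *global_tail;       /* tail of the inter-socket (global) queue */
    };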

21. CST lock design
● NUMA-aware parking list (sketched below)
  ➢ Maintains separate per-socket parking lists for readers and writers
  + Mitigates cache-line contention in over-subscribed scenarios
  + Allows distributed wake-up of parked readers
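One plausible shape for the per-socket reader/writer parking lists, with a distributed wake-up that only walks the local socket's list; the two-list split per snode comes from the slide, but the names and the wake routine are illustrative assumptions, not the paper's code:

    #include <stdatomic.h>
    #include <stddef.h>

    enum { PW = 0, UW = 1 };              /* parked / unparked waiter states */

    struct parked_waiter {
        struct parked_waiter *p_next;     /* parking-list link */
        atomic_int            state;      /* PW while parked   */
    };

    struct snode_parking {                /* per-socket, one per snode */
        struct parked_waiter *reader_list;
        struct parked_waiter *writer_list;
    };

    /* Wake every reader parked on one socket. Each socket can run this
       independently, which is the "distributed wake-up" above: no thread
       walks a global list, and the touched cache lines stay local. */
    static void wake_local_readers(struct snode_parking *s) {
        for (struct parked_waiter *w = s->reader_list; w; w = w->p_next) {
            atomic_store_explicit(&w->state, UW, memory_order_release);
            /* ...followed by a futex/scheduler wake-up of w's thread. */
        }
        s->reader_list = NULL;
    }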

22. CST lock design
● Removes scheduler intervention (see the sketch below)
  ➢ Passes the lock to a spinning waiter
  ➢ Waiters park themselves if more than one task is running on their CPU (system load)
  + Scheduler is not involved in the critical path
  + Guarantees forward progress of the system
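The "park only under load" rule could be sketched as below. nr_running_on() is a hypothetical helper: inside the kernel it would be the per-CPU run-queue length, and plain userspace code has no direct equivalent:

    /* Hypothetical helper: number of runnable tasks on a CPU (in-kernel,
       the per-CPU run-queue length). */
    extern int nr_running_on(int cpu);

    /* Keep spinning while this waiter is alone on its CPU, so the holder
       can hand the lock over without waking the scheduler; park only when
       the CPU is time-shared, i.e. the system is over-subscribed. */
    static int should_park(int cpu) {
        return nr_running_on(cpu) > 1;
    }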

23–26. Lock instantiation (allocation sketch below)
● Initially, no snodes are allocated (socket_list is empty).
● The first thread arriving from a particular socket initiates the allocation of that socket's snode.
[Diagram: threads T1/S1, T2/S1, T3/S1 on socket 1 share a single snode (socket_list = [S1]); when T4/S2 arrives on socket 2, a second snode is allocated (socket_list = [S1, S2]); global_tail points into the lock's global queue.]
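A sketch of the lazy snode allocation, under the layout assumed earlier (socket_list made atomic for a lock-free push); a real implementation must also handle two threads of the same socket racing to allocate, and a failed calloc, both elided here:

    #include <stdatomic.h>
    #include <stdlib.h>

    struct snode {
        int socket_id;
        struct snode *snode_next;
        /* waiting_tail / parking_tail elided */
    };

    struct cst_lock {
        _Atomic(struct snode *) socket_list;
        /* global_tail elided */
    };

    /* Return the caller's socket's snode, allocating it on first use. */
    static struct snode *get_snode(struct cst_lock *lock, int socket_id) {
        struct snode *head = atomic_load(&lock->socket_list);
        for (struct snode *n = head; n; n = n->snode_next)
            if (n->socket_id == socket_id)
                return n;                       /* already allocated */

        struct snode *snode = calloc(1, sizeof(*snode));
        snode->socket_id = socket_id;
        do {                                    /* lock-free push onto socket_list */
            snode->snode_next = head;
        } while (!atomic_compare_exchange_weak(&lock->socket_list, &head, snode));
        return snode;
    }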

27–30. CST lock phase (acquire sketch below)
● Each thread allocates a thread-specific structure on its stack.
● Three states for each node:
  – L → locked
  – UW → unparked/spinning waiter
  – PW → parked/blocked/scheduled-out waiter
[Diagram: T1 on socket 1 enqueues its stack-allocated node (fields: next, p_next) on snode S1's waiting_tail; the snode also carries parking_tail and snode_next; T1's node moves to state L and the snode acquires the global lock via global_tail.]
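The local (per-socket) half of the acquire phase can be sketched MCS-style, using the node fields and states from the diagram; the global phase, where the snode at the head of the local queue competes for global_tail, is elided, and the parking transition (UW → PW) is only marked in a comment:

    #include <stdatomic.h>
    #include <stddef.h>

    enum qstate { STATE_L, STATE_UW, STATE_PW };  /* locked / spinning / parked */

    struct cst_qnode {                      /* lives on the acquiring thread's stack */
        _Atomic(enum qstate)        state;
        _Atomic(struct cst_qnode *) next;   /* local (per-socket) queue link */
        struct cst_qnode           *p_next; /* parking-list link             */
    };

    /* Enqueue on the snode's waiting_tail; return once the local lock is held. */
    static void cst_local_acquire(_Atomic(struct cst_qnode *) *waiting_tail,
                                  struct cst_qnode *me) {
        atomic_store(&me->state, STATE_UW);
        atomic_store(&me->next, NULL);
        struct cst_qnode *prev = atomic_exchange(waiting_tail, me);
        if (!prev) {                        /* queue was empty: local lock is ours */
            atomic_store(&me->state, STATE_L);
            return;
        }
        atomic_store(&prev->next, me);      /* link behind our predecessor */
        while (atomic_load(&me->state) != STATE_L)
            ;  /* spin as UW until the predecessor hands over (sets STATE_L);
                  a real waiter may instead switch to STATE_PW and park. */
    }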
