
ShflLocks: Scalable and Practical Locking for Manycore Systems



  1. ShflLocks: Scalable and Practical Locking for Manycore Systems. Changwoo Min, COSMOSS Lab / ECE / Virginia Tech. https://cosmoss-vt.github.io/

  2. The file system becomes a bottleneck on manycore systems. An embarrassingly parallel application: the Exim mail server on a RAMDISK. [Figure: messages/sec vs. #core (0-80) for btrfs, ext4, F2FS, XFS; throughput is (1) saturated, (2) collapsed, or (3) never scales.]

  3. Even on slower storage media, the file system becomes a bottleneck. Exim mail server at 80 cores. [Figure: messages/sec on RAMDISK, SSD, and HDD for btrfs, ext4, F2FS, XFS.]

  4. FxMark: file systems are not scalable on manycore systems. One example workload: creating files in a shared directory. Locks are critical for performance and scalability. [Figure: FxMark microbenchmarks (DRBL, DRBM, DRBH, DWOL, DWOM, DWAL, DWTL, DWSL, MRPL, MRPM, MRPH, MRDL, MRDM, MWCL, MWCM, MWUL, MWUM, MWRL, MWRM, plus O_DIRECT variants) and applications (Exim, RocksDB, DBENCH); throughput (M ops/sec or messages/sec) vs. #core for btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS.]

  5. Future hardware will further exacerbate the problem. An embarrassingly parallel application: the Exim mail server on a RAMDISK. [Figure: same messages/sec vs. #core plot as slide 2 for btrfs, ext4, F2FS, XFS, showing saturated, collapsed, and never-scaling behavior.]

  6. Why does this happen? Memory access is NOT scalable. 1. Read operations are scalable, whether the data read is private or shared.

  7. Why does this happen? Memory access is NOT scalable. 1. Read operations are scalable. 2. Write operations are NOT scalable: writes to private data scale, but writes to shared data do not.

  8. Why does this happen? Memory access is NOT scalable. 1. Read operations are scalable. 2. Write operations on shared data are NOT scalable. 3. Write operations also interfere with read operations, e.g., on a shared lock variable (flag) and on the shared data protected by that lock.
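The pattern behind points 1-3 can be reproduced with a few lines of code. The following is a minimal sketch (my illustration, not part of the slides): each thread increments either its own cache-line-padded counter or a single shared counter. The private version scales with the thread count, while the shared version collapses because every write invalidates the counter's cache line in all other cores.

    /* Minimal sketch (illustration only): private vs. shared writes. */
    #include <pthread.h>
    #include <stdatomic.h>

    #define NTHREADS 8
    #define ITERS    10000000L

    /* One counter per thread, padded to a cache line: "write private". */
    struct padded { atomic_long v; char pad[64 - sizeof(atomic_long)]; };
    static struct padded private_ctr[NTHREADS];

    /* A single counter shared by all threads: "write shared". */
    static atomic_long shared_ctr;

    static void *writer(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++) {
            atomic_fetch_add(&private_ctr[id].v, 1);  /* "write private": scales    */
            /* atomic_fetch_add(&shared_ctr, 1); */   /* "write shared": collapses  */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, writer, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }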

  9. Why does this happen? Cache coherence is not scalable. Cache-coherence traffic dominates. When a core writes a cache line under the widely used MESI protocol, the writer's line is upgraded from Shared to an exclusive state (Exclusive/Modified), and every reader's copy goes from Shared to Invalid. [Figure: two sockets, each with its own memory and LLC.] We should therefore minimize contended cache lines and core-to-core communication traffic.
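As one concrete way to limit this coherence traffic (an illustrative sketch, not code from the talk), a test-and-test-and-set (TTAS) spinlock, the variant listed for the Linux kernel on the next slide, spins on a plain load so that all waiters keep the lock's cache line in the Shared state, and issues the invalidating atomic write only when the lock appears free:

    /* Illustrative TTAS spinlock: waiters spin on a read-only load, so the
     * lock line stays Shared across waiting cores; the coherence-heavy
     * atomic exchange is attempted only when the lock looks free. */
    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_bool locked;
    } ttas_lock_t;

    static void ttas_lock(ttas_lock_t *l)
    {
        for (;;) {
            /* Local spin: no invalidations while the lock is held. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* Lock looks free: try to grab it with one atomic write. */
            if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
                return;
        }
    }

    static void ttas_unlock(ttas_lock_t *l)
    {
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }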

  10. Lock research efforts and their use in the Linux kernel.
Research efforts: Dekker's algorithm (1962), semaphore (1965), Lamport's bakery algorithm (1974), backoff lock (1989), ticket lock (1991), MCS lock (1991), HBO lock (2003), hierarchical lock HCLH (2006), flat-combining NUMA lock (2011), Remote Core Locking (2012), cohort lock (2012), RW cohort lock (2013), Malthusian lock (2014), HMCS lock (2015), AHMCS lock (2016); the 2011-2016 entries are NUMA-aware locks.
Linux kernel lock adoption/modification: 1990s: spinlock = TTAS, semaphore = TTAS + block, rwsem = TTAS + block. 2011: spinlock = ticket (2.6), mutex = TTAS + block (2.6), rwsem = TTAS + block. 2014: spinlock = ticket, mutex = TTAS + spin + block (3.16), rwsem = TTAS + spin + block (3.16). 2016: spinlock = qspinlock (4.4), mutex = TTAS + spin + block, rwsem = TTAS + spin + block.
Adopting new locks is necessary, but it is not easy.

  11. Two dimensions of lock design goals. 1) High throughput: at high thread count, minimize lock contention; with a single thread, impose no penalty when uncontended; under oversubscription, avoid bookkeeping overheads. 2) Minimal lock size: the memory footprint must scale to millions of locks (e.g., one lock per file inode).

  12. Lock performance: throughput (e.g., each thread creates a file, a serial operation, in a shared directory). [Figure: operations/second vs. #threads for the stock kernel, with regions for 1 socket, >1 socket, and oversubscribed.] Performance crashes beyond one socket due to non-uniform memory access (NUMA): accessing the local socket's memory is faster than accessing a remote socket's memory.

  13. Lock performance: throughput (continued). [Same figure as slide 12.] Performance crashes beyond one socket due to NUMA, and NUMA also affects the oversubscribed case. Goal: prevent the throughput crash beyond one socket.

  14. Existing research efforts. Making locks NUMA-aware: two-level locks, a per-socket lock plus a global lock, generally organized hierarchically. [Figure: a global lock on top of per-socket locks for Socket-1 and Socket-2.] Problems: they require extra memory allocation and do not care about single-thread throughput. Example: CST [1]. ([1] Scalable NUMA-aware Blocking Synchronization Primitives, ATC 2017.)
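To make the two-level structure concrete, here is a minimal sketch of the general per-socket + global pattern using pthread mutexes. This is my illustration, not CST's actual implementation; real cohort/CST locks additionally hand the global lock directly to a waiter on the same socket before releasing it, which is omitted here. The per-socket array is exactly the extra memory allocation criticized above.

    #include <pthread.h>

    #define NSOCKETS 8

    /* Illustrative two-level NUMA-aware lock: a global lock plus one lock
     * per socket (this per-socket array is the extra memory the slide
     * points out). */
    struct numa_lock {
        pthread_mutex_t global;
        pthread_mutex_t per_socket[NSOCKETS];
    };

    static void numa_lock_init(struct numa_lock *l)
    {
        pthread_mutex_init(&l->global, NULL);
        for (int s = 0; s < NSOCKETS; s++)
            pthread_mutex_init(&l->per_socket[s], NULL);
    }

    /* Acquire: compete locally first, so at most one thread per socket
     * contends for the global lock at any time. */
    static void numa_lock_acquire(struct numa_lock *l, int socket_id)
    {
        pthread_mutex_lock(&l->per_socket[socket_id]);
        pthread_mutex_lock(&l->global);
    }

    static void numa_lock_release(struct numa_lock *l, int socket_id)
    {
        pthread_mutex_unlock(&l->global);
        pthread_mutex_unlock(&l->per_socket[socket_id]);
    }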

  15. Lock performance: throughput (e.g., each thread creates a file, a serial operation, in a shared directory). [Figure: operations/second vs. #threads for the stock kernel and CST.] CST maintains throughput beyond one socket (high thread count) and in the oversubscribed case (384 threads), but its single-thread throughput is poor because it executes multiple atomic instructions. Single-thread performance matters in the uncontended case. Setup: 8-socket, 192-core machine.

  16. Lock performance: memory footprint (e.g., each thread creates a file, a serial operation, in a shared directory). [Figure: locks' memory footprint vs. #threads; peaks of roughly 18 (Stock) and 140 (CST).] CST has a large memory footprint because it allocates per-socket structures and a global lock: in the worst case, about 1 GB of lock footprint out of 32 GB of application memory.

  17. Lock performance: memory footprint (continued). [Figure: locks' memory footprint vs. #threads; peaks of roughly 18 (Stock), 140 (CST), and 820 (hierarchical lock).] CST has a large memory footprint because it allocates per-socket structures and a global lock: in the worst case, about 1 GB of lock footprint out of 32 GB of application memory.

  18. Lock performance: memory footprint (continued). [Same figure as slide 17: Stock = 18, CST = 140, hierarchical lock = 820.] A lock's memory footprint affects its adoption.

  19. Two goals for our new lock: 1) a NUMA-aware lock with no memory overhead; 2) high throughput at both low and high thread counts.

  20. Key idea: sort waiters on the fly. Observations: hierarchical locks avoid NUMA overhead by passing the lock within a socket, and queue-based locks already maintain a set of waiters.
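Concretely (an illustrative sketch under hypothetical names such as struct qnode and shfl_lock, not the authors' exact code), an MCS-style queue lock already keeps one queue node per waiter on that waiter's stack; recording the waiter's socket ID in the node is the only extra state that on-the-fly sorting needs, so the lock word itself stays a single pointer:

    /* Illustrative sketch: an MCS-style waiter queue node augmented with
     * the waiter's socket (NUMA node) ID.  The qnode lives on the waiter's
     * stack, as in a plain MCS lock, so the lock itself stays one word and
     * needs no extra per-lock allocation. */
    #include <stdatomic.h>
    #include <stdbool.h>

    struct qnode {
        struct qnode *_Atomic next;   /* next waiter in the queue                */
        atomic_bool   locked;         /* waiter spins here until handed the lock */
        int           socket_id;      /* NUMA socket this waiter runs on         */
    };

    struct shfl_lock {
        struct qnode *_Atomic tail;   /* single word: points to the last waiter  */
    };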

  21. Shuffling: design methodology. Representing a waiting queue. [Figure: a single waiter t1, its qnode tagged with its socket ID (e.g., socket 0) and linked from the lock's tail pointer; legend: socket ID, shuffler, waiter's qnode, tail.]

  22. Shuffling: design methodology. Another waiter (t2) arrives from a different socket. [Figure: queue t1 → t2, each qnode tagged with its socket ID.]

  23. Shuffling: design methodology. More waiters join. [Figure: queue t1 → t2 → t3 → t4.]

  24. Shuffling: design methodology. The shuffler (t1) sorts the waiters behind it by socket ID, grouping waiters from its own socket next to it. [Figure: the reordered queue t1 → t2 → t3 → t4.]
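The following sketch shows what one shuffling pass could look like under the qnode layout assumed after slide 20. It is my single-threaded illustration of the idea, not the paper's algorithm, which must also cope with waiters enqueueing concurrently and with the lock handoff itself: the shuffler walks the waiters queued behind it and splices those from its own socket directly after the group gathered so far, so consecutive lock handoffs stay within one socket.

    /* Illustrative, single-threaded sketch of one shuffling pass over the
     * waiter queue (uses struct qnode from the sketch after slide 20). */
    static void shuffle(struct qnode *shuffler)
    {
        struct qnode *insert_after = shuffler;  /* end of the gathered same-socket group */
        struct qnode *prev = shuffler;
        struct qnode *cur  = atomic_load(&shuffler->next);

        while (cur) {
            struct qnode *nxt = atomic_load(&cur->next);

            if (cur->socket_id == shuffler->socket_id &&
                cur != atomic_load(&insert_after->next)) {
                /* Unlink cur from its current position ...              */
                atomic_store(&prev->next, nxt);
                /* ... and splice it right after the gathered group.     */
                atomic_store(&cur->next, atomic_load(&insert_after->next));
                atomic_store(&insert_after->next, cur);
                insert_after = cur;             /* the group grew by one  */
                cur = nxt;                      /* prev stays in place    */
            } else {
                if (cur->socket_id == shuffler->socket_id)
                    insert_after = cur;         /* already adjacent to group */
                prev = cur;
                cur = nxt;
            }
        }
    }

For example, if t1 and t3 were on socket 0 while t2 and t4 were on socket 1, a queue t1 → t2 → t3 → t4 would become t1 → t3 → t2 → t4 after the pass, letting t1 hand the lock to a same-socket waiter.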
