Scaling Synchronization Primitives: Ph.D. Defense of Dissertation, Sanidhya Kashyap


  1. Scaling Synchronization Primitives: Ph.D. Defense of Dissertation, Sanidhya Kashyap

  2. Rise of the multicore machines. Hardware limitation: frequency has stagnated. [Figure: CPU trends, 1970-2020, log scale; frequency (MHz), data-intensive application performance, and number of hardware threads. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten.] Applications inherently scaled with increasing frequency; now machines instead have multiples of processors (multi-socket).

  3. Multicore machines → “The free lunch is over.” Today’s de facto standard: concurrent applications that scale with increasing cores, i.e., operating systems, cloud services, data processing systems, and databases. Synchronization primitives are the basic building block for designing such applications.

  4. Synchronization primitives provide some form of consistency required by applications and determine the ordering/scheduling of concurrent events.
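
     As a minimal illustration of the ordering a primitive enforces, the hedged C sketch below serializes increments to a shared counter with a pthread mutex (the counter and thread count are illustrative, not from the talk):

         #include <pthread.h>
         #include <stdio.h>

         #define NTHREADS 4                        /* illustrative thread count */

         static long counter;                      /* shared state */
         static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

         static void *worker(void *arg) {
             for (int i = 0; i < 1000000; i++) {
                 pthread_mutex_lock(&lock);        /* enforce mutual exclusion */
                 counter++;                        /* critical section */
                 pthread_mutex_unlock(&lock);
             }
             return NULL;
         }

         int main(void) {
             pthread_t t[NTHREADS];
             for (int i = 0; i < NTHREADS; i++)
                 pthread_create(&t[i], NULL, worker, NULL);
             for (int i = 0; i < NTHREADS; i++)
                 pthread_join(t[i], NULL);
             printf("counter = %ld\n", counter);   /* always 4000000 with the lock */
             return 0;
         }

     Without the lock, concurrent increments race and the final count is unpredictable; the primitive imposes a total order on the updates.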

  5. Embarrassingly parallel application performance: typical application performance on a manycore machine. [Figure: messages/second (0-100K, higher is better) vs. # threads (1 to 192).]

  6. Embarrassingly parallel application performance: typical application performance on a manycore machine. [Figure: messages/second (0-1200K) vs. # threads (1 to 192).] Synchronization is required at several places.

  7. Future hardware will exacerbate the scalability problem: from a single core to many-core machines with 100-1000s of CPUs, and storage (SSD) that is 10x to 1000x faster. Challenge: maintain application scalability.

  8. How can we minimize the overhead of synchronization primitives for large multicore machines? Efficiently schedule events by leveraging HW/SW.

  9. Thesis contributions
     • Hardware timestamping [Eurosys’18]: Timestamping is costly on large multicore machines because of cache contention due to atomic instructions. Approach: use the per-core invariant hardware clock.
     • Double scheduling [ATC’18]: Double scheduling in a virtualized environment (VM / hypervisor / OS) introduces various types of preemption problems. Approach: expose semantic information across the layers.
     • Decoupling lock design from policy [SOSP’19]: There is a discrepancy between how locks are designed and how they are used. Approach: decouple lock design from HW/SW design policy via the shuffling mechanism.
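
     A hedged sketch of the first idea on x86, assuming an invariant TSC (the helper name is illustrative, not the dissertation's API). Reading the per-core hardware clock touches no shared cache line, unlike an atomic fetch-add on a global software counter:

         #include <stdint.h>
         #include <x86intrin.h>

         /* Read the invariant TSC. RDTSCP also returns the contents of
          * IA32_TSC_AUX (on Linux this usually encodes the CPU and node ID)
          * and orders the read after preceding instructions. */
         static inline uint64_t hw_timestamp(unsigned *core_id) {
             return __rdtscp(core_id);
         }

     Comparing raw TSC values across cores additionally requires a bound on the offset between the per-core clocks, which is what the Eurosys'18 contribution establishes.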

  11. Example: Email service

  12. Embarrassingly parallel application performance: the workload is process intensive and stresses the memory subsystem, file system, and scheduler. [Figure: messages/second (0-100K) vs. # threads (1 to 192); performance degrades due to inefficient locks.] Locks appear on every hot path:

     File system:
         file_create (...) {
             spin_lock(superblock);
             ...
             spin_unlock(superblock);
         }
         rename_file (...) {
             write_lock(global_rename_lock);
             ...
             write_unlock(global_rename_lock);
         }

     Scheduler:
         process_create (...) {
             mm_lock(process->lock);
             ...
             mm_unlock(process->lock);
         }
         process_schedule (...) {
             spin_lock(run_queue->lock);
             ...
             spin_unlock(run_queue->lock);
         }

  13. Synchronization primitive: locks
     • Provide mutual exclusion among tasks
     • Guard a shared resource
     • Variants: mutex, readers-writer lock, spinlock
     Threads that want to modify the data structure wait for their turn by either spinning or sleeping; the lock protects access to the data structure.
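
     To make the spinning variant concrete, here is a hedged C11 sketch of a test-and-set spinlock (the names mirror the slide's pseudocode but this is not the kernel's implementation):

         #include <stdatomic.h>

         typedef struct { atomic_flag held; } spinlock_t;

         static void spin_lock(spinlock_t *l) {
             /* Atomically set the flag; retry (spin) while it was already set. */
             while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
                 ;   /* busy-wait: the waiting thread stays on the CPU */
         }

         static void spin_unlock(spinlock_t *l) {
             atomic_flag_clear_explicit(&l->held, memory_order_release);
         }

     A mutex differs in that a contended waiter sleeps (is parked by the scheduler) instead of spinning, trading wakeup latency for CPU time.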

  14. Locks: the MOST WIDELY used primitive. [Figure: # lock API() calls (x1000) in the Linux kernel, 2002 vs. 2019: roughly a 4x increase.] More locks are in use to improve OS scalability.

  15. Locks are used in a complicated manner: a system call can acquire up to 12 locks (an average of 4).

  16. Issue with current lock designs: locks are designed for specific hardware and software requirements. Timeline of lock designs, 1989-2017: Ticket lock, Backoff lock, MCS lock, HBO lock, HCLH lock, FC lock, Cohort lock, CST lock, HMCS lock, RCL, Malthusian lock. The timeline distinguishes locks used in practice from locks with the best performance.

  17. Issue with current lock designs: locks are designed for specific hardware and software requirements.
     Locks in practice: generic; focus on simple locks with more applicability; forgo hardware characteristics; throughput worsens with more cores.
     Locks in research: hardware-specific design; high throughput at high thread count; use extra memory, so suitable only for pre-allocated data.
     In both cases, HW/SW policies are statically tied together.

  18. Incorporating HW/SW policies dynamically → scalable and practical locking algorithms

  19. Two trends driving locks’ innovation: evolving hardware and application requirements.

  20. Two dimensions of lock design / goals:
     1) High throughput: at high thread count, minimize lock contention; at a single thread, no penalty when not contended; under oversubscription, avoid bookkeeping overhead.
     2) Minimal lock size: a small memory footprint that scales to millions of locks.

  21. Two dimensions of lock design / goals: the high-thread-count case. Application: multi-threaded to utilize cores and improve performance. Lock: minimize lock contention while maintaining high throughput.

  22. Two dimensions of lock design / goals: the single-thread case. Application: a single thread performs an operation; fine-grained locking is common. Lock: minimal or almost no lock/unlock overhead.

  23. Two dimensions of lock design / goals: the oversubscribed case. Application: more threads than cores, a common scenario (e.g., threads blocked on I/O). Lock: minimize scheduler overhead while waking or parking threads.

  24. Two dimensions of lock design / goals: minimal lock size. Application: locks are embedded in data structures (e.g., file inodes). Lock: a large lock can stress the memory allocator or break data structure alignment.
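
     A hedged illustration of why lock size matters when locks are embedded in objects (the structs are illustrative, not the kernel's inode):

         #include <pthread.h>
         #include <stdint.h>
         #include <stdio.h>

         /* An inode-like object that embeds its lock. With millions of
          * cached inodes, every byte of lock state is paid millions of times. */
         struct small_inode {
             uint64_t ino;
             uint32_t lock;          /* 4-byte spinlock word */
         };

         struct big_inode {
             uint64_t ino;
             pthread_mutex_t lock;   /* 40 bytes on typical glibc/x86-64 */
         };

         int main(void) {
             printf("small_inode: %zu bytes\n", sizeof(struct small_inode));
             printf("big_inode:   %zu bytes\n", sizeof(struct big_inode));
             return 0;
         }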

  25. Lock performance: throughput. Benchmark: each thread creates a file, a serial operation, in a shared directory. [Figure: operations/second vs. # threads on stock Linux, with regions for 1 socket, >1 socket, and oversubscribed.] Throughput collapses after one socket, due to non-uniform memory access (NUMA). Setup: 192-core/8-socket machine. 1. Understanding Manycore Scalability of File Systems [ATC’16]
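
     A hedged sketch of such a microbenchmark, in the spirit of the cited work but not its exact code (paths, thread count, and file count are illustrative):

         #include <fcntl.h>
         #include <pthread.h>
         #include <stdio.h>
         #include <sys/stat.h>
         #include <unistd.h>

         #define NTHREADS 8                /* illustrative; the talk scales to 384 */
         #define FILES_PER_THREAD 10000

         static void *create_files(void *arg) {
             long id = (long)arg;
             char path[64];
             for (int i = 0; i < FILES_PER_THREAD; i++) {
                 /* All threads create in the same shared directory, so the
                  * file system serializes them on directory/superblock locks. */
                 snprintf(path, sizeof path, "shared_dir/t%ld_f%d", id, i);
                 int fd = open(path, O_CREAT | O_WRONLY, 0644);
                 if (fd >= 0) close(fd);
             }
             return NULL;
         }

         int main(void) {
             mkdir("shared_dir", 0755);
             pthread_t t[NTHREADS];
             for (long i = 0; i < NTHREADS; i++)
                 pthread_create(&t[i], NULL, create_files, (void *)i);
             for (int i = 0; i < NTHREADS; i++)
                 pthread_join(t[i], NULL);
             return 0;   /* time externally, e.g., with /usr/bin/time */
         }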

  26. Lock performance: throughput. Benchmark: each thread creates a file, a serial operation, in a shared directory. [Figure: same setup, stock Linux.] Throughput collapses after one socket due to NUMA, and NUMA also affects the oversubscribed case. Goal: prevent throughput collapse after one socket. Setup: 192-core/8-socket machine.

  27. Existing research efforts: hierarchical locks
     • Goal: high throughput at high thread count
     • Making locks NUMA-aware: use extra memory to improve throughput; two-level locks, per-socket and global
     • Avoid NUMA overhead → pass the global lock within the same socket
     [Diagram: a global lock on top of per-socket locks for Socket-1 and Socket-2.]
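
     A hedged C11 sketch of the two-level idea, in the spirit of cohort locks; the names, the ticket-lock levels, and the BATCH bound are illustrative, not any published implementation:

         #include <stdatomic.h>
         #include <stdbool.h>

         #define MAX_SOCKETS 8
         #define BATCH 64      /* hypothetical cap on intra-socket handoffs */

         struct socket_lock {
             atomic_int ticket, serving;   /* per-socket ticket lock */
             int handoffs;                 /* consecutive local passes */
             bool global_held;             /* does this socket hold the global lock? */
         };

         struct numa_lock {
             atomic_int g_ticket, g_serving;          /* global ticket lock */
             struct socket_lock sockets[MAX_SOCKETS]; /* zero-initialized */
         };

         void numa_lock_acquire(struct numa_lock *l, int socket_id) {
             struct socket_lock *s = &l->sockets[socket_id];
             /* Level 1: contend only with threads on the same socket. */
             int t = atomic_fetch_add(&s->ticket, 1);
             while (atomic_load(&s->serving) != t)
                 ;                                    /* socket-local spinning */
             /* Level 2: take the global lock unless a socket peer
              * passed it to us (global_held left true by the releaser). */
             if (!s->global_held) {
                 int g = atomic_fetch_add(&l->g_ticket, 1);
                 while (atomic_load(&l->g_serving) != g)
                     ;
                 s->global_held = true;
                 s->handoffs = 0;
             }
         }

         void numa_lock_release(struct numa_lock *l, int socket_id) {
             struct socket_lock *s = &l->sockets[socket_id];
             /* Best-effort check for a waiter on this socket. */
             bool waiter = atomic_load(&s->ticket) != atomic_load(&s->serving) + 1;
             if (waiter && s->handoffs++ < BATCH) {
                 /* Pass within the socket: release only the per-socket
                  * lock; the global lock stays with this socket. */
                 atomic_fetch_add(&s->serving, 1);
             } else {
                 /* Hand the global lock to another socket for fairness. */
                 s->global_held = false;
                 s->handoffs = 0;
                 atomic_fetch_add(&l->g_serving, 1);
                 atomic_fetch_add(&s->serving, 1);
             }
         }

     Batching the handoffs keeps the protected data hot in one socket's cache, trading short-term cross-socket fairness for throughput.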

  28. Existing research efforts: hierarchical locks
     Problems:
     • Require extra memory allocation
     • Do not care about single-thread throughput
     Example: CST² allocates the socket structure on first access and handles oversubscription (# threads > # CPUs).
     2. Scalable NUMA-aware Blocking Synchronization Primitives [ATC’17]
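
     A hedged sketch of the first-access allocation pattern (types and names are illustrative, not CST's actual code):

         #include <stdatomic.h>
         #include <stdlib.h>

         #define MAX_SOCKETS 8

         struct socket_node { int dummy; /* per-socket queue state lives here */ };

         struct cst_like_lock {
             _Atomic(struct socket_node *) nodes[MAX_SOCKETS];
         };

         /* Allocate a socket's structure only when a thread on that socket
          * first contends, so uncontended locks stay small. */
         static struct socket_node *get_node(struct cst_like_lock *l, int sid) {
             struct socket_node *n = atomic_load(&l->nodes[sid]);
             if (n)
                 return n;
             struct socket_node *fresh = calloc(1, sizeof *fresh);
             struct socket_node *expected = NULL;
             if (atomic_compare_exchange_strong(&l->nodes[sid], &expected, fresh))
                 return fresh;       /* we installed it */
             free(fresh);            /* another thread on this socket won the race */
             return expected;
         }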

  29. Lock performance: throughput. Benchmark: each thread creates a file, a serial operation, in a shared directory. [Figure: operations/second vs. # threads, stock vs. CST.] CST maintains throughput beyond one socket (high thread count) and in the oversubscribed case (384 threads), but has poor single-thread throughput because it issues multiple atomic instructions. Setup: 192-core/8-socket machine.
