The Impact of Thread-Per-Core Architecture on Application Tail Latency



  1. The Impact of Thread-Per-Core Architecture on Application Tail Latency. Pekka Enberg, Ashwin Rao, and Sasu Tarkoma, University of Helsinki. ANCS 2019.

  2. Introduction • Thread-per-core architecture has emerged to eliminate the overheads of traditional multi-threaded architectures in server applications. • Partitioning hardware resources can improve parallelism, but there are various trade-offs that applications need to consider. • Takeaway: request steering and OS interfaces are holding back the thread-per-core architecture.

  3. Outline • Overview of thread-per-core • A key-value store • Impact on tail latency • Problems in the approach • Future directions

  4. Outline • Overview of thread-per-core • A key-value store • Impact on tail latency • Problems in the approach • Future directions

  5. What is thread-per-core? • Thread-per-core = no multiplexing of a CPU core at the OS level. • Eliminates thread context-switching overhead [Qin 2018; Seastar]. • Enables elimination of thread synchronization by partitioning [Seastar]. • Eliminates thread scheduling delays [Ousterhout 2019]. Ousterhout et al. 2019. Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads. NSDI '19. Qin et al. 2018. Arachne: Core-Aware Thread Management. OSDI '18. Seastar: a framework for high-performance server applications on modern hardware. http://seastar.io/
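
To make the per-core model concrete, here is a minimal, illustrative C++ sketch (not code from the talk): one thread is spawned per CPU core, pinned with pthread_setaffinity_np, and each thread runs its own event loop over a private epoll instance, so no other application thread is ever multiplexed onto that core.

```cpp
// Minimal thread-per-core skeleton (illustrative only, not the Sphinx code):
// one pinned thread per CPU core, each running a private epoll event loop.
#include <pthread.h>
#include <sched.h>
#include <sys/epoll.h>
#include <thread>
#include <vector>

static void pin_to_core(unsigned core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void core_loop(unsigned core) {
    pin_to_core(core);              // this thread owns the core exclusively
    int ep = epoll_create1(0);      // per-core epoll instance, no shared state
    epoll_event events[64];
    for (;;) {
        int n = epoll_wait(ep, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            // handle I/O only for connections owned by this core
        }
    }
}

int main() {
    unsigned cores = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned c = 0; c < cores; ++c)
        workers.emplace_back(core_loop, c);
    for (auto& w : workers) w.join();
}
```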

  6. Interrupt isolation for thread-per-core • The in-kernel network stack runs in kernel threads, which interfere with application threads. • Network stack processing must be isolated to CPU cores that are not running application threads. • Interrupt isolation can be done with IRQ affinity and IRQ balancing configuration changes. • The NIC receive-side scaling (RSS) configuration needs to align with the IRQ affinity configuration. Li et al. 2014. Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. SoCC '14.
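
As a hedged illustration of what such an IRQ affinity change can look like on Linux (this is not the configuration used in the paper), the sketch below writes a CPU list to /proc/irq/<irq>/smp_affinity_list so that the NIC's interrupts are serviced only by the cores reserved for the network stack; the IRQ numbers and core ranges are assumptions and are machine-specific.

```cpp
// Illustrative IRQ-affinity sketch (not from the talk): steer a NIC's IRQs to
// a dedicated set of cores by writing a CPU list to
// /proc/irq/<irq>/smp_affinity_list. Requires root, and irqbalance must not
// rewrite the setting afterwards. The IRQ numbers here are hypothetical; on a
// real machine they are listed in /proc/interrupts.
#include <fstream>
#include <string>
#include <vector>

static bool set_irq_affinity(int irq, const std::string& cpu_list) {
    std::ofstream f("/proc/irq/" + std::to_string(irq) + "/smp_affinity_list");
    if (!f) return false;
    f << cpu_list << '\n';             // e.g. "0-1": the network-stack cores
    return static_cast<bool>(f);
}

int main() {
    std::vector<int> nic_irqs = {120, 121, 122, 123};  // assumed NIC RX/TX IRQs
    for (int irq : nic_irqs)
        set_irq_affinity(irq, "0-1");  // keep NIC IRQs off the application cores
}
```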

  7. Partitioning in thread-per-core • Partitioning of hardware resources (such as the NIC and DRAM) can improve parallelism by eliminating thread synchronization. • Different ways of partitioning resources: shared-everything, shared-nothing, and shared-something.

  8. Shared-everything (diagram: CPU0-CPU3 and a single data region in DRAM).

  9. Shared-everything (diagram). Hardware resources are shared between all CPU cores.

  10. Shared-everything (diagram). Every request can be processed on any CPU core.

  11. Shared-everything (diagram). Data access must be synchronized.

  12. Shared-everything • Advantages: every request can be processed on any CPU core; no request steering needed. • Disadvantages: shared memory scales badly on multicore [Holland 2011]. • Examples: Memcached (when the thread pool size equals the CPU core count). Holland et al. 2011. Multicore OSes: Looking Forward from 1991, Er, 2011. HotOS '11.
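
For illustration only (not code from the talk), a shared-everything store reduces to a single table that every worker thread may touch, so each access has to be synchronized across cores; a minimal sketch:

```cpp
// Shared-everything sketch (illustrative): one table shared by all worker
// threads; every access is serialized by a lock, which is the multicore
// scaling bottleneck the slide refers to.
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

class SharedEverythingStore {
public:
    void put(const std::string& key, std::string value) {
        std::lock_guard<std::mutex> guard(lock_);   // cross-core synchronization
        table_[key] = std::move(value);
    }
    std::optional<std::string> get(const std::string& key) {
        std::lock_guard<std::mutex> guard(lock_);
        auto it = table_.find(key);
        if (it == table_.end()) return std::nullopt;
        return it->second;
    }
private:
    std::mutex lock_;                               // shared by every CPU core
    std::unordered_map<std::string, std::string> table_;
};

int main() {
    SharedEverythingStore store;
    store.put("key", "value");                      // any core may serve this
}
```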

  13. Shared-nothing (diagram: CPU0-CPU3, each with its own data partition in DRAM).

  14. Shared-nothing (diagram). Hardware resources are partitioned between CPU cores.

  15. Shared-nothing (diagram). A request can be processed on only one specific CPU core.

  16. Shared-nothing (diagram). Data access does not require synchronization.

  17. Shared-nothing (diagram). Requests need to be steered.

  18. Shared-nothing • Advantages: data access does not require synchronization. • Disadvantages: request steering is needed [Lim 2014; Didona 2019]; CPU utilization imbalance if data is not distributed well (“hot partition”); sensitive to skewed workloads. • Examples: the Seastar framework and the MICA key-value store. Didona et al. 2019. Sharding for Improving Tail Latencies in In-memory Key-value Stores. NSDI '19. Lim et al. 2014. MICA: A Holistic Approach to Fast In-memory Key-value Storage. NSDI '14.
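
To make request steering concrete, here is a small illustrative C++ sketch (an assumption on my part, not the steering policy of the paper or of MICA): the owning core for a key is picked by hashing the key modulo the core count, and a request that lands on the wrong core is forwarded to the owner.

```cpp
// Request-steering sketch (illustrative): in a shared-nothing design every key
// belongs to exactly one core, so the thread that receives a request must
// first determine the owning core and forward the request there if needed.
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

// Assumed policy: partition by key hash modulo the number of cores.
std::size_t owner_core(const std::string& key, std::size_t num_cores) {
    return std::hash<std::string>{}(key) % num_cores;
}

int main() {
    const std::size_t cores = 4;
    // A request for "foo" is served locally only if this core owns the key;
    // otherwise it is forwarded (e.g. over a per-core message queue).
    std::cout << "GET foo is owned by core " << owner_core("foo", cores) << '\n';
}
```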

  19. Shared-something (diagram: CPU0-CPU3 grouped into clusters, each cluster with its own data region in DRAM).

  20. Shared-something (diagram). Hardware resources are partitioned between CPU core clusters.

  21. Shared-something (diagram). No synchronization is needed for data access across different CPU clusters.

  22. Shared-something (diagram). Data access needs to be synchronized within the same CPU core cluster.

  23. Shared-something • Advantages: a request can be processed on many cores; shared memory scales at small core counts [Holland 2011]; potentially improved hardware-level parallelism (for example, partitioning around sub-NUMA clustering could improve memory controller utilization). • Disadvantages: request steering becomes more complex. Holland et al. 2011. Multicore OSes: Looking Forward from 1991, Er, 2011. HotOS '11.
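
As an illustration of this middle ground (again an assumption, not the paper's design), the sketch below partitions the key space across core clusters, e.g. NUMA nodes, so that a lock is contended only by the cores of one cluster rather than by every core:

```cpp
// Shared-something sketch (illustrative): the key space is partitioned across
// core clusters (e.g. NUMA nodes or sub-NUMA clusters); an access is
// synchronized only with the other cores of the same cluster.
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

constexpr std::size_t kClusters = 2;     // assumption: two core clusters

struct Partition {
    std::mutex lock;                     // contended only within one cluster
    std::unordered_map<std::string, std::string> table;
};

std::array<Partition, kClusters> partitions;

void put(const std::string& key, std::string value) {
    Partition& p = partitions[std::hash<std::string>{}(key) % kClusters];
    std::lock_guard<std::mutex> guard(p.lock);
    p.table[key] = std::move(value);
}

int main() { put("foo", "bar"); }
```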

  24. Takeaways • Partitioning improves parallelism, but there are trade-offs that applications need to consider. • Isolation of the in-kernel network stack is needed to avoid interference with application threads.

  25. Outline • Overview of thread-per-core • A key-value store • Impact on tail latency • Problems in the approach • Future directions

  26. A shared-nothing key-value store • To measure the impact of thread-per-core on tail latency, we designed a shared-nothing key-value store. • Memcached wire-protocol compatible for easier evaluation. • Software-based request steering with message passing between threads. • A lockless single-producer, single-consumer (SPSC) queue per thread.
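
For illustration, a bounded lockless SPSC ring buffer of the kind such per-thread message passing is typically built on; this is a generic sketch, not the Sphinx implementation, and the type and method names are my own:

```cpp
// Illustrative lockless SPSC ring buffer: one producer thread (the steering
// side) pushes messages, one consumer thread (the owning application thread)
// pops them, with no locks on either side.
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");
public:
    bool push(T item) {                        // called by the producer only
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;            // queue full
        buffer_[head & (Capacity - 1)] = std::move(item);
        head_.store(head + 1, std::memory_order_release);     // publish the slot
        return true;
    }
    std::optional<T> pop() {                   // called by the consumer only
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        std::size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return std::nullopt;                // queue empty
        T item = std::move(buffer_[tail & (Capacity - 1)]);
        tail_.store(tail + 1, std::memory_order_release);     // free the slot
        return item;
    }
private:
    std::array<T, Capacity> buffer_{};
    std::atomic<std::size_t> head_{0};         // written only by the producer
    std::atomic<std::size_t> tail_{0};         // written only by the consumer
};

int main() {
    SpscQueue<int, 8> q;
    q.push(42);
    return q.pop().value_or(0) == 42 ? 0 : 1;
}
```

In a steering flow built on such queues, a connection-handling thread would push a parsed request into the owning thread's queue; the owner pops it, serves it from its private partition, and returns the response over a queue in the opposite direction.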

  27. Shared-nothing (diagram). Taking the shared-nothing model…

  28. KV store design (diagram: across CPU0-CPU3, application threads with sockets and message passing in userspace, SoftIRQ threads with IRQ polling in the kernel, and NIC RX queues plus DRAM in hardware). …and implementing it on Linux.

  29. KV store design (diagram). The in-kernel network stack is isolated on its own CPU cores.

  30. KV store design (diagram). Application threads run on their own CPU cores.

  31. KV store design (diagram). Message passing between the application threads.

  32. Outline • Overview of thread-per-core • A key-value store • Impact on tail latency • Problems in the approach • Future directions

  33. Impact on tail latency • Comparison of Memcached (shared-everything) and Sphinx (shared-nothing). • Measured read and update latency with the Mutilate tool. • Testbed servers (Intel Xeon): 24 CPU cores with an Intel 82599ES NIC (modern), and 8 CPU cores with a Broadcom NetXtreme II NIC (legacy). • Varied IRQ isolation configurations.

  34. Impact on tail latency (figure).

  35. Impact on tail latency (figure).

  36. 99th percentile latency over concurrency for updates (figure: 99th percentile update latency in ms versus the number of concurrent connections, for Memcached and Sphinxd on the legacy and modern testbeds).

  37. 99th percentile latency over concurrency for updates (same figure, with the Memcached and Sphinx curves called out).

  38. 99th percentile latency over concurrency for updates (same figure). No locking, better CPU cache utilization.

  39. Latency percentiles for updates (figure: percentile in % versus update latency in ms, for Memcached and Sphinxd on the legacy and modern testbeds).
