  1. Enhancing Server Efficiency in the Face of Killer Microseconds
  Amirhossein Mirhosseini, Akshitha Sriraman, Thomas F. Wenisch
  University of Michigan
  HPCA 2019, 02/18/2019

  2. Killer Microseconds [Barroso’17]
  • Frequent microsecond-scale pauses in datacenter applications
  – Stalls for accessing emerging memory & I/O devices
  – Mid-tier servers synchronously waiting for leaf nodes
  – Brief idle periods in high-throughput microservices
  • Modern computing systems are not effective at hiding microseconds
  – Micro-architectural techniques are insufficient
  – OS/software context switches are too coarse-grained

  3. Our proposal: Duplexity
  • Cost-effective, highly multithreaded server design
  • Heterogeneous design: dyads of cores
  – Master core for latency-sensitive microservices
  – Lender core for latency-insensitive applications
  • Key idea 1: the master core may “borrow” threads from the lender core to fill utilization holes
  • Key idea 2: cores protect threads’ cache states to avoid excessive tail latencies and QoS violations
  Duplexity improves core utilization by 4.8x in the presence of killer microseconds

  4. Outline
  • Killer microseconds
  • Why (scaling) SMT is not an option
  • Duplexity server architecture
  • Evaluation methodology and results

  5. Modern HW is great at hiding nanosecond-scale stalls…
  Caches! OoO! MLP! Spec! Prefetching!
  Micro-architectural techniques can hide at best 100s of nanoseconds

  6. Modern OS is great at hiding millisecond-scale stalls…
  Context switch!
  OS context switching typically has an average overhead of at least 5-20 μs

  7. But today’s devices and microservices inflict μs-scale stalls
  • Emerging memory and I/O technologies:
  – NVM, disaggregated memory, … : O(1 μs)
  – High-end flash, accelerators, … : O(10 μs)
  • Brief idle periods:
  – With μs-scale microservices, idle periods also shrink to μs scales
  • A 200K QPS service at 50% load has average idle periods of only 10 μs
  Need HW/SW mechanisms to hide μs-scale latencies
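The arithmetic behind shrinking idle periods can be sketched with an even-arrival toy model (an assumption of ours, not the paper's queueing analysis; bursty real traffic clusters the slack into the longer contiguous idle periods quoted above, but either way the slack is microsecond-scale):

```python
def mean_idle_per_request_us(qps: float, load: float) -> float:
    """Average idle time per request in microseconds for a single server,
    assuming evenly spaced arrivals (illustrative toy model only)."""
    interarrival_us = 1e6 / qps            # time between consecutive requests
    return interarrival_us * (1.0 - load)  # the fraction of it spent idle

print(mean_idle_per_request_us(200_000, 0.5))  # 2.5
```

At 200K QPS requests arrive every 5 μs, so even at 50% load the per-request slack is only a few microseconds.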

  8. Multithreading is the obvious solution
  • OS context switches are too coarse-grained for μs-scale periods
  – User-level cooperative multithreading [Cho’18]
  – Hardware (simultaneous) multithreading [Yamamoto’95][Tullsen’95][Tullsen’96] …
  But we need a lot of (10+) threads to fill μs-scale stall/idle periods
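Why "10+" threads: if one thread stalls for tens of microseconds and each other thread runs only a microsecond or so before stalling itself, many peers are needed to keep the pipeline busy. A simple round-robin sketch (our assumption, not the paper's analysis):

```python
import math

def threads_to_fill_stall(stall_us: float, run_us: float) -> int:
    """How many hardware threads are needed so that, while one thread
    stalls for `stall_us`, the others (each running `run_us` before
    stalling in turn) keep the core busy. Round-robin toy model."""
    return 1 + math.ceil(stall_us / run_us)

print(threads_to_fill_stall(10.0, 1.0))  # 11
```

A 10 μs stall with 1 μs run bursts already demands on the order of ten peer threads, which is far beyond the 2-8 contexts of conventional SMT cores.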

  9. Simply adding more threads is not enough
  • Complicates fetch/dispatch/issue logic, prolonging its critical path
  • Requires a larger register file
  • Pressure/thrashing in L1 caches
  • Higher tail latency due to interference among threads
  – Up to 5.7x higher tail latency
  – 1.5x higher tail even at low load with a low-IPC co-runner
  Need complexity management and performance isolation mechanisms

  10. Duplexity
  • Two main objectives:
  – Maximize performance density and energy efficiency
  • Fill utilization “holes” arising from killer microseconds
  Borrow latency-insensitive batch threads to fill microservices’ utilization holes
  – Minimize disruption of latency-sensitive threads
  • Avoid excessive tail latency due to interference
  Isolate stateful μarch structures (e.g., caches) to avoid QoS violations

  11. Duplexity: a server made of “dyads”
  • Master core
  – Designed for latency-sensitive microservices
  – “Borrows” threads from the lender core to fill utilization holes
  • Lender core
  – Designed for latency-insensitive batch applications
  • Shared backlog of batch threads
  [Diagram: dyads of master and lender cores with shared thread backlogs, LLC, and memory/I/O controllers]
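The borrow/return protocol between a master core and the shared backlog can be sketched as follows (class and method names are our own illustrative choices, not the paper's interface):

```python
from collections import deque

class Dyad:
    """Toy model of a Duplexity dyad: the master core borrows filler
    threads from a shared backlog when its latency-sensitive thread
    stalls, and returns them when the master thread resumes."""
    def __init__(self, backlog):
        self.backlog = backlog   # shared FIFO of batch threads
        self.fillers = []        # threads currently borrowed

    def master_stalls(self, max_fillers=4):
        # pull batch work from the backlog to fill the utilization hole
        while self.backlog and len(self.fillers) < max_fillers:
            self.fillers.append(self.backlog.popleft())

    def master_resumes(self):
        # filler threads are preempted and returned to the backlog
        self.backlog.extend(self.fillers)
        self.fillers = []

backlog = deque(["pagerank-0", "pagerank-1", "sssp-0"])
d = Dyad(backlog)
d.master_stalls()
print(d.fillers)        # ['pagerank-0', 'pagerank-1', 'sssp-0']
d.master_resumes()
print(len(backlog))     # 3
```

Because the backlog is shared across dyads, batch work is never tied to one core and can soak up utilization holes wherever they appear.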

  12. Lender Core
  • Latency-insensitive batch threads
  – In-order execution
  • Variable number of virtual contexts needed
  – FIFO run-queue of virtual contexts in memory

  13. Lender Core
  • Hierarchical Simultaneous Multithreading (HSMT)
  – Backlog of virtual contexts
  – Inspired by Balanced Multithreading [Tune’04]
  [Diagram: 8-way in-order SMT datapath, with eight PCs fed from FIFO queues of virtual contexts through fetch/select, an instruction buffer, issue queues, functional units, a shared register file, and I/D caches]
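The hierarchy can be sketched as a two-level mapping: only a fixed number of virtual contexts occupy the physical SMT slots at a time, while the rest wait in in-memory FIFOs (function name and the simple fill policy below are illustrative assumptions, not the paper's exact design):

```python
from collections import deque

PHYSICAL_SLOTS = 8  # the lender core's 8-way in-order SMT datapath

def map_virtual_contexts(vctxs):
    """Sketch of hierarchical SMT: at most PHYSICAL_SLOTS virtual
    contexts are resident on the physical PCs; the remainder wait in
    an in-memory FIFO backlog until a resident context stalls."""
    pending = deque(vctxs)
    active = [pending.popleft()
              for _ in range(min(PHYSICAL_SLOTS, len(pending)))]
    return active, pending

active, pending = map_virtual_contexts([f"vctx{i}" for i in range(12)])
print(len(active), len(pending))  # 8 4
```

This keeps the expensive per-context hardware bounded at eight ways while the number of schedulable batch threads can grow with memory.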

  14. Master Core
  • Single latency-sensitive master thread
  • Borrows threads from the lender core to fill μs-scale holes
  – Single-threaded out-of-order mode for the master thread
  – Multi-threaded in-order mode for filler threads
  • Inspired by MorphCore [Khubaib’12]
  Filler threads thrash the cache, TLB, and branch predictor state of the master thread → increased tail latency
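The MorphCore-style mode switch amounts to a small state machine (class and mode names below are our illustrative labels):

```python
class MasterCore:
    """Toy sketch of the master core's execution-mode switch:
    out-of-order single-threaded while the master thread runs,
    in-order multi-threaded while it stalls and fillers are borrowed."""
    def __init__(self):
        self.mode = "single-thread OoO"

    def on_master_stall(self):
        self.mode = "multi-thread in-order"  # filler threads execute

    def on_stall_resolved(self):
        self.mode = "single-thread OoO"      # master resumes at once

core = MasterCore()
core.on_master_stall()
print(core.mode)  # multi-thread in-order
core.on_stall_resolved()
print(core.mode)  # single-thread OoO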

  15. Segregating State
  • Naive solution: replicate all stateful μarch structures
  – Register files, caches, branch predictors, TLBs, etc.
  • Master core replicates only inexpensive structures (e.g., TLBs and predictors)
  [Diagram: branch predictor and TLBs replicated per thread (✓); register file and I/D caches not replicated (✗)]
  Caches and register files are large and power-hungry → full replication undermines the performance-density and energy-efficiency objectives

  16. Segregating Register Files
  • Repurpose the physical RF as architectural RFs for filler threads
  • Retain the master thread’s architectural registers
  – Facilitates a fast restart when the stall resolves
  [Diagram: master thread arch+phys registers; on a stall, the master thread’s arch regs are retained while the rest become the filler threads’ arch regs]
  What about caches?
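The register-file split can be illustrated with a small partitioning calculation (all sizes below are hypothetical round numbers, not the paper's configuration):

```python
def partition_registers(total_phys=256, arch_per_thread=32, n_fillers=4):
    """Sketch of repurposing the physical register file while the master
    thread stalls: its architectural registers are retained in place,
    and the remaining physical registers are divided among the filler
    threads as their architectural register files."""
    master_retained = arch_per_thread
    available = total_phys - master_retained
    per_filler = available // n_fillers
    return master_retained, per_filler

print(partition_registers())  # (32, 56)
```

Because the master thread's architectural state never leaves the core, resuming it is a matter of flipping modes rather than reloading registers.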

  17. Master-Lender Dyads
  • Master core remotely accesses the L1 I/D caches of the lender core
  – Protects the master thread’s state
  – Allows filler threads to hit on their own cache state
  • L0 I/D caches act as effective bandwidth filters
  [Diagram: master-thread mode vs. filler-thread mode, showing the master and lender cores’ L1 I/D caches and the master core’s L0 I/D caches]
  The master thread can resume execution almost immediately as the stall resolves
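The cache-isolation rule reduces to a simple routing decision per access (the function and labels below are our illustrative rendering of the idea, not the hardware's actual interface):

```python
def l1_for(thread_kind: str) -> str:
    """Sketch of L1 routing in a dyad: the master thread always uses the
    master core's own L1, so filler threads can never evict its state;
    filler threads running on the master core reach the lender core's
    L1 remotely (filtered through small L0 caches), where their own
    working sets already live."""
    if thread_kind == "master":
        return "master-core L1"
    return "lender-core L1 (remote, via L0 filter)"

print(l1_for("master"))  # master-core L1
print(l1_for("filler"))  # lender-core L1 (remote, via L0 filter)
```

The payoff is on both sides: the master thread's working set is intact when it resumes, and the fillers hit on state they warmed while running on the lender core.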

  18. Evaluation Methodology
  • Master thread:
  – Open-source μs-scale microservices
  • Locality-sensitive hashing, protocol routing, remote caching, word stemming
  • Filler threads:
  – Data-parallel distributed graph algorithms
  • PageRank, single-source shortest path
  • Design alternatives:
  – Baseline single-threaded OoO, SMT, Duplexity+Replication; more alternatives in the paper

  19. Evaluation
  Duplexity achieves 34% higher average core utilization compared to SMT
  Within 4% of the utilization achieved by Duplexity + replication

  20. Evaluation
  Duplexity improves performance density by 49%, 28%, and 10% compared to baseline, SMT, and Duplexity+Replication

  21. Evaluation
  SMT worsens tail latency by 2.7x on average (up to 5.7x)
  Duplexity maintains tail latency within 19%

  22. Conclusions
  • Killer microseconds: frequent μs-scale pauses in microservices
  • Modern computing systems are not effective at hiding microseconds
  • Our proposal: Duplexity
  – Cost-effective, highly multithreaded server architecture
  – Heterogeneous design:
  • Master cores for latency-sensitive microservices
  • Lender cores for latency-insensitive batch applications
  – The master core may “borrow” threads from the lender core to fill utilization holes
  – Cores protect their threads’ cache states to avoid QoS violations
  Duplexity improves utilization by 4.8x while maintaining tail latency within 19%

  23. Questions?
