  1. No Tradeoff: Low Latency + High Efficiency. Christos Kozyrakis, http://mast.stanford.edu

  2. Latency-critical Applications: a growing class of online workloads (search, social networking, software-as-a-service (SaaS), …) occupying 1,000s of servers in datacenters.

  3. Example: Web Search. [Diagram: queries (Q) fan out from front-end web servers through a back-end tree of root, parent, and leaf servers; responses (R) are aggregated back up.] Metrics: queries per second (QPS) at a 99th-percentile latency threshold. Massive distributed datasets, multiple tiers, high fan-out.
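High fan-out is why the 99th percentile, not the average, sets the user-visible latency: the root must wait for its slowest leaf. A minimal Monte Carlo sketch (my illustration, not from the slides; the 1 ms / 10 ms leaf latency model and the fan-out of 100 are made up) shows the effect:

```python
import random

def leaf_latency_ms():
    # Hypothetical leaf model: 1 ms typical, 10 ms for the slowest 1% of requests.
    return 10.0 if random.random() < 0.01 else 1.0

def percentile(samples, p):
    return sorted(samples)[int(p * (len(samples) - 1))]

fanout = 100    # the root answers only after all leaves respond
trials = [max(leaf_latency_ms() for _ in range(fanout)) for _ in range(10_000)]
print("median root latency:", percentile(trials, 0.50), "ms")   # ~10 ms
print("p99 root latency:   ", percentile(trials, 0.99), "ms")
```

With a fan-out of 100, the probability that every leaf is fast is only 0.99^100 ≈ 0.37, so the rare slow leaf becomes the common case at the root.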

  4. Characteristics: high throughput + low latency (100 μsec – 10 msec); focus on tail latency (e.g., 99th-percentile SLO); distributed and (often) stateful; multi-tier with high fan-out; diurnal load patterns, but load is never 0.

  5. Trends [Adrian Cockcroft '14]: apps as collections of micro-services; loosely coupled services, each with a bounded context; even lower tail-latency requirements, from 100s of μsec down to just a few μsec.

  6. Conventional Wisdom for LC Apps

  7. #1: LC apps cannot be power-managed.

  8. LC Apps & Power Management [Lo et al '14]. [Figure: power and tail latency vs. search load, with DVFS on and off.]

  9. #2: LC apps must use dedicated servers.

  10. LC Apps & Server Sharing [Lo et al '15]. [Table: impact of interference on websearch tail latency (as % of the SLO) at loads from 0% to 90%. LLC and DRAM interference push latency beyond 300% of the SLO at most loads; HyperThread sharing grows from ~110% to >300% at the highest load; CPU power contention ranges from ~100% to 124%; network interference stays between 36% and 64%.]

  11. LC Apps & Core Sharing [Leverich '14]. [Figure: achieved QoS when co-locating memcached with a 1 ms RPC service on shared cores, across load combinations; regions show where both meet QoS, where one fails, and where both fail. QoS requirement: 95th-percentile latency < 5x the low-load 95th percentile.] Bad QoS even with minor interference!

  12. #3: LC apps must use local Flash.

  13. Local vs. Remote NVMe Access. [Figure: p95 read latency (μs) vs. IOPS (thousands) for local Flash, iSCSI (1 core), and libaio+libevent (1 core); remote access over these stacks costs roughly a 75% throughput drop and 2x the latency.]

  14. Sharing NVMe Flash. [Figure: p95 read latency (μs) vs. total IOPS (thousands) for read ratios of 100%, 99%, 95%, 90%, 75%, and 50%: "the curve you paid for" vs. "the curve you get with sharing".] Flash performance varies a lot with the read/write ratio.

  15. #4: LC apps must bypass the kernel for I/O: cannot use TCP, cannot use Ethernet, must use specialized HW.

  16. LC Apps & Linux/TCP/Ethernet [Belay et al '15]. [Figure: 64-byte TCP echo on Linux vs. the hardware limit: a 4.8x gap in latency (microseconds) and an 8.8x gap in throughput (millions of requests per second).]

  17. Conventional Wisdom for LC Apps

  18. Low Latency + High Efficiency: understand the problem; understand the opportunity.

  19. Understanding the Latency Problem [Delimitrou '15]. [Figure: latency (μsec) vs. load for memcached with N=1 and N=4 servers, compared against M/M/1 and M/M/4 queuing models.] Latency-critical apps behave as queuing systems; tail latency is affected by service time, queuing delay, scheduling time, and load imbalance.
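The queuing view explains the shape of these curves. In an M/M/1 queue the response time is exponentially distributed with rate μ - λ, so the p-th percentile is -ln(1-p)/(μ - λ), which blows up as load approaches 1. A short sketch (mine, not from the talk; the 100 μs service time is an arbitrary example):

```python
import math

def mm1_latency_percentile(service_time_us, load, p=0.99):
    """p-th percentile response time of an M/M/1 queue (in microseconds)."""
    mu = 1.0 / service_time_us        # service rate (requests per microsecond)
    lam = load * mu                   # arrival rate at the given utilization
    return -math.log(1.0 - p) / (mu - lam)

for load in (0.2, 0.5, 0.8, 0.9, 0.95):
    print(f"load={load:.2f}  p99={mm1_latency_percentile(100, load):8.0f} us")
```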

  20. Understanding the Latency Problem [Leverich '15]. [Figure: 95th-percentile latency (μsecs) vs. memcached load (3% to 100% of peak QPS), with the provisioned QPS marked.]

  21. Understanding the Opportunity [Lo et al '14]. [Figure: tail latency vs. % of maximum cluster load; there is significant latency slack at low load.] Iso-latency: maintain a constant tail latency at all loads. Enables power management, HW sharing, adaptive batching, … Key point: use tail latency as the control signal.
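A minimal sketch of what "tail latency as the control signal" means as a feedback loop (illustrative only: read_p99_ms, set_resource_level, the thresholds, and the step size are all hypothetical; the Pegasus and Heracles controllers that follow are more refined):

```python
import time

def iso_latency_loop(read_p99_ms, set_resource_level, slo_ms=10.0,
                     period_s=5.0, step=0.05):
    """Keep tail latency near (but under) the SLO by trading away the slack."""
    level = 1.0                          # 1.0 = all resources devoted to the LC app
    while True:
        p99 = read_p99_ms()              # end-to-end tail latency measurement
        if p99 > slo_ms:                 # SLO violated: give everything back at once
            level = 1.0
        elif p99 > 0.9 * slo_ms:         # close to the SLO: hold steady
            pass
        else:                            # latency slack: reclaim resources gradually
            level = max(0.1, level - step)
        set_resource_level(level)        # e.g., a power cap, core count, or BE share
        time.sleep(period_s)
```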

  22. Low Latency + High Efficiency: understand the problem; understand the opportunity; end-to-end & latency-aware management.

  23. Pegasus: Iso-latency + Fine-grain DVFS [Lo et al '14]. [Diagram: the search tree of parent and leaf servers under Pegasus control.] Measures latency slack end-to-end; sets a uniform latency goal across all servers; uses RAPL as the knob for power; power is set by a workload-specific policy.
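A sketch of the flavor of policy Pegasus implements, translating latency slack into a per-server RAPL package power cap (my simplification: the sysfs path, step sizes, and wattage bounds are assumptions, and the policy in [Lo et al '14] is workload-specific and more careful):

```python
# Assumed powercap sysfs path; it varies by kernel version and socket.
RAPL_LIMIT = "/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw"

def next_power_cap(p99_us, target_us, cap_uw,
                   min_uw=40_000_000, max_uw=120_000_000):
    """Pick the next package power cap (microwatts) from the latency slack."""
    slack = (target_us - p99_us) / target_us
    if slack < 0:                       # SLO violated: jump straight to max power
        return max_uw
    if slack > 0.20:                    # plenty of slack: lower the cap slightly
        return max(min_uw, int(cap_uw * 0.97))
    if slack < 0.05:                    # slack nearly gone: raise the cap slightly
        return min(max_uw, int(cap_uw * 1.05))
    return cap_uw                       # comfortable band: leave the cap alone

def apply_power_cap(cap_uw):
    # Requires root and the intel_rapl powercap driver.
    with open(RAPL_LIMIT, "w") as f:
        f.write(str(cap_uw))
```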

  24. Pegasus: Iso-latency + Fine-grain DVFS [Lo et al '14]. Achieves dynamic energy proportionality.

  25. Heracles: Iso-latency + QoS Isolation [Lo et al '15]. [Diagram: a top-level controller takes end-to-end latency readings and decides whether best-effort (BE) tasks can grow; internal feedback loops manage CPU cores and memory (LLC, DRAM bandwidth), CPU power (DVFS), and network bandwidth (HTB).] Goal: meet the SLO while using idle resources for batch workloads. End-to-end latency measurements control the isolation mechanisms.
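The top-level decision Heracles makes each interval can be sketched roughly as follows (an approximation of the logic in [Lo et al '15], not its exact rules or thresholds); the subcontrollers for cores/memory, power, and network then act on the verdict:

```python
def heracles_top_level(p99_us, slo_us, lc_load, load_threshold=0.85):
    """Decide what best-effort (BE) tasks may do during the next interval."""
    slack = (slo_us - p99_us) / slo_us
    if slack < 0:
        return "disable_be"          # SLO violated: stop all BE work immediately
    if lc_load > load_threshold:
        return "disallow_growth"     # LC app near peak: no headroom for BE to grow
    if slack < 0.10:
        return "shrink_be"           # slack is thin: take resources back from BE
    return "allow_growth"            # ample slack: subcontrollers may give BE more
```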

  26. Heracles: Iso-latency + QoS Isolation [Lo et al '15]. [Animation: with latency slack, the controller grants BE tasks cores, LLC capacity, core frequency, and network bandwidth alongside the LC workload.]

  27. Heracles: Iso-latency + QoS Isolation [Lo et al '15]. [Animation: as the latency readings change, the controller re-divides cores, LLC capacity, core frequency, and network bandwidth between the LC workload and BE tasks.]

  28. Iso-latency + QoS Isolation [Lo et al '15]: >90% HW utilization with no tail-latency problems. LC apps + best-effort tasks on the same servers; HW & SW isolation mechanisms (cores, caches, network, power); iso-latency-based control of resource allocation.

  29. Low Latency + High Efficiency: understand the problem; understand the opportunity; end-to-end & latency-aware management; optimize for modern multi-core hardware; specialize the SW.

  30. System SW for Low Latency + High BW. [Diagram: the conventional stack: apps in userspace, the OS kernel with its TCP/IP stack in kernelspace, and cores sharing the NIC RX/TX queues.]

  31. System SW for Low Latency + High BW. [Diagram: first step: a separate control plane, with per-core RX/TX queues feeding the kernel TCP/IP stack.]

  32. System SW for Low Latency + High BW. [Diagram: using Dune, the host Linux kernel stays in host ring 0, a dataplane OS kernel runs in guest ring 0, and the control plane and apps run in ring 3; each core owns its RX/TX queues.]

  33. System SW for Low Latency + High BW [Belay et al '14]. [Diagram: the IX architecture: apps (HTTPd, memcached) link against libIX in ring 3; the IX dataplane with a custom TCP/IP transport runs in guest ring 0 on Dune; the host OS kernel runs in host ring 0; each core owns its RX/TX queues.]

  34. Run-to-Completion. [Diagram: each iteration of the dataplane loop (1) pulls a batch of packets from the RX queue, (2) runs them through the TCP/IP stack, (3) delivers event conditions to the event-driven app via libIX, (4) collects the app's batched syscalls, (5) runs them through the TCP/IP stack, and (6) transmits on the TX queue.] Improves data cache locality; removes scheduling unpredictability.
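The structure of the loop, sketched in Python (illustrative only: rx_queue, tx_queue, tcp, and app_handler are hypothetical stand-ins; the actual IX dataplane is in-kernel C operating directly on NIC descriptor rings):

```python
def run_to_completion(rx_queue, tx_queue, tcp, app_handler):
    while True:
        packets = rx_queue.poll()                          # (1) pull a batch of packets
        events = [tcp.process_rx(p) for p in packets]      # (2) network processing
        syscalls = app_handler(events)                      # (3)+(4) app consumes events,
                                                            #         returns batched syscalls
        segments = [tcp.process_tx(s) for s in syscalls]   # (5) build outgoing segments
        tx_queue.send(segments)                             # (6) transmit
        # Each batch is finished before the next one starts, so its data stays
        # hot in the cache and no scheduler can preempt a request mid-flight.
```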

  35. Adaptive Batching. [Diagram: the same run-to-completion loop, with an adaptive batch calculation between the RX FIFO and the TCP/IP stack.] Enabled by iso-latency; improves instruction cache locality and prefetching.
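The batching rule can be sketched as follows (my reading of the general idea, not IX's exact algorithm): batch only what has already queued up, capped so that batching itself cannot inflate the tail.

```python
def next_batch_size(pending_packets, max_batch=64):
    # At low load the batch is effectively 1, adding no latency; under bursts it
    # grows toward max_batch, amortizing per-iteration costs and keeping the
    # instruction cache warm across similar requests. max_batch is an assumption.
    return max(1, min(pending_packets, max_batch))
```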

  36. IX: Low Latency + High Bandwidth [Belay et al '14]. [Figure: latency (μs) vs. throughput (RPS × 10^6) for Linux and IX, average and 99th percentile.] With TCP, commodity HW, protection, and 1M connections: 2x less tail latency and 5x more RPS at the SLA; IX's tail latency is lower than Linux's average latency.

  37. Low Latency + High Bandwidth + High Efficiency. [Figure: time series (0–200 seconds) of memcached (MC) load, memcached latency, best-effort (BE) performance, and power as the load varies.]

  38. ReFlex: IX + QoS Scheduling. [Diagram: ReFlex extends the IX dataplane (apps with libIX in ring 3, IX/ReFlex in guest ring 0 on Dune, the Linux kernel in host ring 0) with per-core NVMe submission/completion queues (SQ/CQ) next to the network RX/TX queues.]

  39. ReFlex: IX + QoS Scheduling [Klimovic et al '17]. [Diagram: the request path: from network RX through the TCP/IP stack and libIX event conditions to the ReFlex server; its batched syscalls pass through the QoS scheduler into the NVMe submission queue.]
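The scheduler's job is to keep tenants within a Flash budget in which writes are far more expensive than reads. A simplified token-bucket sketch in that spirit (my illustration; the cost ratio, rates, and structure are assumptions, not the calibrated model from [Klimovic et al '17]):

```python
import time

READ_COST, WRITE_COST = 1, 10        # assumed relative costs of a read vs. a write

class Tenant:
    def __init__(self, tokens_per_sec):
        self.rate = tokens_per_sec   # provisioned budget (weight for reserved IOPS)
        self.tokens = 0.0
        self.last = time.monotonic()

    def try_submit(self, is_write):
        """Return True if the I/O may go to the NVMe submission queue now."""
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        cost = WRITE_COST if is_write else READ_COST
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                 # over budget: defer so other tenants keep their QoS
```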

  40. ReFlex: IX + QoS Scheduling [Klimovic et al '17]. [Diagram: the completion path: NVMe completions from the device completion queue flow back through the scheduler and TCP/IP stack to the ReFlex server and out on the network TX queue.]

  41. ReFlex: Local ≈ Remote Latency. [Figure: p95 read latency (μs) vs. IOPS (thousands) for local Flash (1 thread), ReFlex (1 thread), and libaio (1 thread); Linux reaches ~75K IOPS/core, while ReFlex reaches ~850K IOPS/core.]

  42. ReFlex: Local ≈ Remote Latency. [Figure: same plot; unloaded latency is 78 μs for local Flash, 99 μs for ReFlex, and 121 μs for libaio.]

  43. ReFlex: Local ≈ Remote Latency. [Figure: p95 read latency (μs) vs. IOPS (thousands) for local Flash, ReFlex, and libaio with 1 and 2 threads; with 2 threads ReFlex saturates the Flash device.]
