  1. Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads. Harshad Kasture, Daniel Sanchez. ASPLOS 2014

  2. Motivation
     - Low server utilization in datacenters is a major source of inefficiency (L. Barroso and U. Hölzle, The Case for Energy-Proportional Computing)

  3. Common Industry Practice
     [Diagram: a latency-critical application running alone on a six-core machine with a shared last-level cache]
     - Dedicated machines for latency-critical applications guarantee QoS

  4. Common Industry Practice
     [Diagram: the same six-core machine; most cores sit unused]
     - Dedicated machines for latency-critical applications guarantee QoS
     - Under-utilization of machine resources

  5. Colocation to Improve Utilization
     [Diagram: latency-critical applications colocated with batch apps on the six cores and the shared last-level cache]
     - Can utilize spare resources by colocating batch apps

  6. Sharing Causes Interference!
     [Diagram: colocated applications contending for the shared last-level cache]
     - Can utilize spare resources by colocating batch apps
     - Contention in shared resources degrades QoS

  7. Outline
     - Introduction
     - Analysis of latency-critical apps
     - Inertia-oblivious cache management schemes
     - Ubik: Inertia-aware cache management
     - Evaluation

  8. Understanding Latency-Critical Applications
     [Diagram: clients connect to front-end servers, which fan each request out to many back-end servers in the datacenter]
     - A large number of backend servers participate in handling every user request
     - Total service time is determined by the tail latency behavior of the backend

  9. Understanding Latency-Critical Applications
     - Service latency is highly sensitive to changes in load
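The load sensitivity above can be illustrated with a textbook M/M/1 queue. This is an assumption for illustration only; the talk does not tie service latency to a specific queueing model, and the 1000 requests/s service rate below is hypothetical.

```python
def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean response time of an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable at or above 100% utilization")
    return 1.0 / (service_rate - arrival_rate)

# Hypothetical server handling 1000 requests/s: latency grows non-linearly
# with load, so a small load increase near saturation blows up service time.
for util in (0.5, 0.9, 0.99):
    w = mm1_response_time(util * 1000.0, 1000.0)
    print(f"utilization {util:.0%}: mean response time {w * 1e3:.2f} ms")
```

In this toy model the mean response time is 2 ms at 50% load but 100 ms at 99% load, which is why latency-critical services cannot simply be run at high utilization.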

  10. Understanding Latency-Critical Applications
     [Timeline: short active bursts separated by idle periods]
     - Short bursts of activity interspersed with idle periods
     - Need guaranteed high performance during active periods

  11. Inertia and Transient Behavior
     [Diagram: per-core IPC over time on a six-core machine with a shared last-level cache]

  12. Inertia and Transient Behavior
     [Diagram: IPC over time, with the transient begin and end marked after a reconfiguration]
     - Transient lengths can dominate tail latency!
     - Any dynamic reconfiguration scheme has to be inertia-aware
     - Many hardware resources exhibit inertia: branch predictors, prefetchers, memory bandwidth…
     - LLCs are one of the biggest sources of inertia
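LLC inertia can be made concrete with a toy LRU simulation (all parameters here are hypothetical, not from the talk): after a cache partition is enlarged, the hit rate only recovers once the new space has been refilled, one miss at a time.

```python
from collections import OrderedDict
import random

def run(trace, capacity, cache):
    """Simulate an LRU cache with `capacity` lines; return the hit count."""
    hits = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)   # refresh recency on a hit
            hits += 1
        else:
            cache[addr] = None        # a miss installs a new line
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the LRU line
    return hits

random.seed(0)
mk = lambda n: [random.randrange(2000) for _ in range(n)]  # uniform toy trace
cache = OrderedDict()
run(mk(20000), 500, cache)                  # reach steady state at a small size
cold = run(mk(2000), 1500, cache) / 2000    # right after growing: still cold
run(mk(20000), 1500, cache)                 # let the transient complete
warm = run(mk(2000), 1500, cache) / 2000
print(f"hit rate just after growing: {cold:.2f}, after warm-up: {warm:.2f}")
```

The window measured right after the resize shows a noticeably lower hit rate than steady state at the same size: the capacity is there, but the working set is not yet in it.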

  13. Outline
     - Introduction
     - Analysis of latency-critical apps
     - Inertia-oblivious cache management schemes
     - Ubik: Inertia-aware cache management
     - Evaluation

  14. Inertia-Oblivious Cache Management
     [Diagram: two latency-critical apps (LC1, LC2) alternating active and idle phases, colocated with two batch apps (Batch1, Batch2) sharing the LLC]

  15. Unmanaged LLC (LRU Replacement)
     [Diagram: LLC space over time under unmanaged LRU sharing]
     - ✖ Unconstrained interference results in poor tail-latency behavior

  16. Utility-Based Cache Partitioning (UCP)
     [Diagram: LLC space over time; partitions are reconfigured as apps become active or idle]
     - ✔ High batch throughput
     - ✖ Poor tail latency (low allocation)
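UCP's core idea (Qureshi and Patt's utility-based cache partitioning) is to give each additional cache way to whichever application's miss curve predicts the largest drop in misses. The sketch below is the simple greedy variant, not the paper's exact lookahead algorithm, and the miss curves are made up for illustration.

```python
def ucp_greedy(miss_curves, total_ways):
    """Greedily assign cache ways to maximize total utility (misses avoided).

    miss_curves[i][w] = misses of app i when given w ways (w = 0..total_ways).
    Returns the number of ways allocated to each app.
    """
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        # Marginal utility of one more way, per app, at its current allocation.
        gains = [curve[a] - curve[a + 1] for curve, a in zip(miss_curves, alloc)]
        winner = max(range(len(gains)), key=gains.__getitem__)
        alloc[winner] += 1
    return alloc

# Hypothetical miss curves: app0 saturates after one way, app1 keeps benefiting.
app0 = [100, 40, 35, 34, 34, 34, 34, 34, 34]
app1 = [100, 90, 80, 70, 60, 50, 40, 30, 20]
print(ucp_greedy([app0, app1], 8))  # → [1, 7]
```

Because the allocator chases marginal utility only, an idle latency-critical app with a flat miss curve loses its space immediately, which is exactly the tail-latency problem the slide points out.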

  17. OnOff: Efficient but Unsafe
     [Diagram: LLC space over time; a latency-critical app's partition is handed to batch apps while it idles and returned when it becomes active]
     - ✔ High batch throughput

  18. Cross-Request LLC Inertia
     [Chart: LLC access breakdown (%) for Shore-MT with a 2 MB LLC, split into misses, cross-request hits, and same-request hits]
     - Other applications are qualitatively similar (see paper for details)

  19. StaticLC: Safe but Inefficient
     [Diagram: LLC space over time; each latency-critical app keeps a fixed partition even while idle]
     - ✔ Low tail latency (preserves LLC state)
     - ✖ Low batch throughput (poor space utilization)

  20. Outline
     - Introduction
     - Analysis of latency-critical apps
     - Inertia-oblivious cache management schemes
     - Ubik: Inertia-aware cache management
     - Evaluation

  21. Ubik: Performance Guarantee
     [Graph: instructions completed over time for a request; progress with Ubik vs. progress with a constant-size partition, with the deadline marked]
     - Performance and overall progress under Ubik after the deadline are identical to static partitioning

  22. Ubik: Overview
     [Diagram: activity, target partition size, and actual partition size over time, with the nominal static size and a smaller idle size marked]

  23. Ubik: Overview
     [Diagram: as before, adding a boosted size above the nominal static size]

  24. Ubik: Overview
     [Diagram: target and actual partition sizes over time, with idle, nominal static, and boosted sizes marked]

  25. Ubik: Overview
     [Diagram: target and actual partition sizes over time, with idle, nominal static, and boosted sizes marked]
     - Constraint: cycles lost during the transient must be compensated for by cycles gained at the boosted size, before the deadline

  26. Analyzing Transients
     [Diagram: partition size grows from s1 to s2 between transient begin and transient end; progress with Ubik lags progress at constant size (s2), costing T_transient cycles and some lost performance]
     - Need accurate predictions for:
       - the length of the transient from s1 to s2
       - cycles lost during the transient from s1 to s2

  27. Hardware Support
     - Utility monitors to measure per-application miss curves
       [Plot: miss probability vs. size, with p_s1 and p_s2 marked at sizes s1 and s2]
     - Fine-grained cache partitioning
     - Memory-level parallelism (MLP) profiler
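Utility monitors build a miss curve by recording, for each hit, its LRU stack distance: an access with stack distance d would hit in any cache of more than d lines. The sketch below is a simplified, fully associative, unsampled version of that bookkeeping (real UMONs sample sets and track ways in hardware).

```python
def miss_curve(trace, max_size):
    """Return misses[s] = misses the trace would incur with s LRU lines,
    for s = 0..max_size, computed from LRU stack distances."""
    stack = []                     # most recently used address at the front
    hits_at_depth = [0] * max_size
    for addr in trace:
        if addr in stack:
            d = stack.index(addr)  # stack distance of this reuse
            if d < max_size:
                hits_at_depth[d] += 1
            stack.remove(addr)
        stack.insert(0, addr)
    total = len(trace)
    # With s lines, every access whose stack distance is < s is a hit.
    return [total - sum(hits_at_depth[:s]) for s in range(max_size + 1)]

trace = [0, 1, 2, 0, 1, 2, 3, 0]
print(miss_curve(trace, 4))  # → [8, 8, 8, 5, 4]
```

One pass over the trace yields the whole curve, which is what makes miss curves cheap enough to drive partitioning decisions like UCP's and Ubik's.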

  28. Bounds on Transient Behavior
     [Equations: closed-form bounds on the transient length T_transient and the cycles lost L when growing from s1 to s2, expressed in terms of the miss probabilities p_s over sizes s1 ≤ s < s2, the lines gained (s2 − s1), and the miss penalty; see the paper for the derivation]
     [Diagram: partition size and progress over time between transient begin and transient end, comparing progress with Ubik against constant-size (s2) performance]

  29. Ubik: Partition Sizing
     - Use transient analysis to identify feasible (idle size, boosted size) pairs
     [Diagram: candidate pair 1 on a size-vs-time plot, with the deadline marked]

  30. Ubik: Partition Sizing
     - Use transient analysis to identify feasible (idle size, boosted size) pairs
     [Diagram: candidate pairs 1 and 2, with the deadline marked]

  31. Ubik: Partition Sizing
     - Use transient analysis to identify feasible (idle size, boosted size) pairs
     [Diagram: candidate pairs 1, 2, and 3, with the deadline marked]

  32. Ubik: Partition Sizing
     - Use transient analysis to identify feasible (idle size, boosted size) pairs
     [Diagram: candidate pairs 1 through 4, with the deadline marked; pair 4 is infeasible]

  33. Ubik: Partition Sizing
     - Use transient analysis to identify feasible (idle size, boosted size) pairs
     - Choose the pair that yields the maximum batch throughput
     [Diagram: the feasible candidate pairs, with the deadline marked]
     - See paper for details
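The sizing step above can be sketched as a brute-force search (a simplification; the paper derives the feasible set analytically from the transient bounds). The feasibility predicate and throughput model below are toy stand-ins, not Ubik's actual formulas.

```python
def choose_sizes(candidates, meets_deadline, batch_throughput):
    """Pick the feasible (idle_size, boosted_size) pair maximizing batch throughput.

    candidates:       iterable of (idle_size, boosted_size) pairs
    meets_deadline:   predicate from the transient analysis (assumed given)
    batch_throughput: estimated batch throughput for a pair (assumed given)
    """
    feasible = [p for p in candidates if meets_deadline(*p)]
    if not feasible:
        raise ValueError("no feasible pair: fall back to the static partition")
    return max(feasible, key=lambda p: batch_throughput(*p))

# Toy stand-ins: a smaller idle size frees more space for batch apps, but the
# boost must be large enough (and the jump small enough) to meet the deadline.
pairs = [(i, b) for i in range(0, 9) for b in range(i, 9)]
ok = lambda i, b: b - i <= 4 and b >= 6   # hypothetical feasibility rule
tput = lambda i, b: 8 - i                 # batch gains from a small idle size
print(choose_sizes(pairs, ok, tput))      # → (2, 6)
```

The structure matches the slides: feasibility comes from the transient analysis, and among feasible pairs the one that leaves the most cache to batch apps wins.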

  34. Outline
     - Introduction
     - Analysis of latency-critical apps
     - Inertia-oblivious cache management schemes
     - Ubik: Inertia-aware cache management
     - Evaluation

  35. Workloads
     - Five diverse latency-critical apps:
       - xapian (search engine)
       - masstree (in-memory key-value store)
       - moses (statistical machine translation)
       - shore-mt (multi-threaded DBMS)
       - specjbb (Java middleware)
     - Batch applications: random mixes of SPEC CPU2006 benchmarks

  36. Target System
     [Diagram: six cores, each attached to one bank of the shared L3; three latency-critical apps (LC1–LC3) and three batch apps (Batch1–Batch3) pinned to cores]
     - 6 OOO cores
     - Private L1I, L1D, and L2 caches
     - 12 MB shared LLC
     - 400 6-app mixes: 3 latency-critical + 3 batch apps
     - Apps pinned to cores

  37. Metrics
     [Diagram: baseline system with a private LLC slice per core]
     - Baseline system has private LLCs
     - We report:
       - normalized tail latency
       - throughput improvement for batch applications
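The tail-latency metric can be computed as in this sketch. The percentile choice (95th here) and the nearest-rank method are assumptions for illustration; the slide only says tail latency is reported relative to the private-LLC baseline.

```python
import math

def tail_latency(latencies, pct=95):
    """Return the pct-th percentile latency (nearest-rank method)."""
    xs = sorted(latencies)
    rank = max(1, math.ceil(pct / 100 * len(xs)))
    return xs[rank - 1]

def normalized_tail(scheme_lat, baseline_lat, pct=95):
    """Tail latency under a scheme, normalized to the private-LLC baseline."""
    return tail_latency(scheme_lat, pct) / tail_latency(baseline_lat, pct)

baseline = list(range(1, 101))      # toy latencies 1..100 -> p95 = 95
scheme = [2 * x for x in baseline]  # uniformly 2x slower -> ratio 2.0
print(normalized_tail(scheme, baseline))  # → 2.0
```

A normalized tail of 1.0 means the scheme matches the dedicated-machine baseline, which is the bar Ubik aims to meet while still sharing the cache.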

  38. Results: Unmanaged LLC (LRU)
     [Chart: higher is better]

  39. Results: UCP
     [Chart: higher is better]

  40. Results: OnOff
     [Chart: higher is better]

  41. Results: StaticLC
     [Chart: higher is better]

  42. Results: Ubik
     [Chart: higher is better]

  43. Results: Summary
     [Chart comparing LRU, UCP, OnOff, StaticLC, and Ubik against the private-LLC baseline; higher is better]

  44. Conclusions
     - To guarantee tail latency, dynamic resource management schemes must be inertia-aware
     - Ubik: inertia-aware cache capacity management
       - Preserves the tail latency of latency-critical apps
       - Achieves high cache space utilization for batch apps
       - Requires minimal additional hardware

  45. Thanks for your attention! Questions?
