Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads
Harshad Kasture, Daniel Sanchez
ASPLOS 2014
Motivation
Low server utilization in datacenters is a major source of inefficiency (L. Barroso and U. Hölzle, The Case for Energy-Proportional Computing)
Common Industry Practice
(figure: a latency-critical application running alone on six cores that share a last-level cache)
Dedicated machines for latency-critical applications guarantee QoS, but leave machine resources underutilized
Colocation to Improve Utilization
(figure: latency-critical and batch applications colocated on six cores that share a last-level cache)
- Spare resources can be utilized by colocating batch apps
- But sharing causes interference: contention in shared resources degrades QoS
Outline
- Introduction
- Analysis of latency-critical apps
- Inertia-oblivious cache management schemes
- Ubik: inertia-aware cache management
- Evaluation
Understanding Latency-Critical Applications
(figure: clients issuing requests through datacenter front ends, fanned out across many back ends)
- Large number of backend servers participate in handling every user request
- Total service time determined by tail latency behavior of backend
Understanding Latency-Critical Applications
Service latency highly sensitive to changes in load
Understanding Latency-Critical Applications
(figure: timeline alternating between active and idle periods)
- Short bursts of activity interspersed with idle periods
- Need guaranteed high performance during active periods
Inertia and Transient Behavior
(figure: per-core IPC over time after an LLC reallocation, with transient begin and end marked)
- Transient lengths can dominate tail latency!
- Any dynamic reconfiguration scheme has to be inertia-aware
- Many hardware resources exhibit inertia: branch predictors, prefetchers, memory bandwidth...
- LLCs are one of the biggest sources of inertia
Outline
- Introduction
- Analysis of latency-critical apps
- Inertia-oblivious cache management schemes
- Ubik: inertia-aware cache management
- Evaluation
Inertia-Oblivious Cache Management
(figure: two latency-critical apps, LC1 and LC2, each alternating between active and idle phases, sharing the LLC with batch apps Batch1 and Batch2)
Unmanaged LLC (LRU Replacement)
(figure: LLC space over time with all apps freely intermixed under LRU)
✖ Unconstrained interference results in poor tail-latency behavior
Utility-Based Cache Partitioning (UCP)
(figure: LLC space over time, repartitioned at each reconfiguration to maximize aggregate hits)
✔ High batch throughput
✖ Poor tail latency (latency-critical apps can receive low allocations)
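For intuition, here is a minimal sketch of the marginal-utility idea UCP builds on: each cache way goes to whichever app saves the most misses by receiving it. This is not the exact lookahead algorithm from Qureshi and Patt's UCP paper (which also handles non-convex miss curves); the function name and data layout are illustrative.

```python
# Minimal sketch of utility-based partition sizing: repeatedly give the next
# cache way to whichever app gains the most hits from it.

def greedy_partition(miss_curves, total_ways):
    """miss_curves[a][w] = misses per kilo-instruction for app a with w ways."""
    alloc = [1] * len(miss_curves)          # every app gets at least one way
    for _ in range(total_ways - sum(alloc)):
        # Marginal utility of one more way: misses avoided by growing w -> w+1.
        gains = [mc[alloc[a]] - mc[alloc[a] + 1] if alloc[a] + 1 < len(mc) else 0.0
                 for a, mc in enumerate(miss_curves)]
        best = max(range(len(gains)), key=lambda a: gains[a])
        alloc[best] += 1
    return alloc
```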
OnOff: Efficient but Unsafe
(figure: LLC space over time; the latency-critical apps' space is handed to batch apps whenever they idle and taken back on reconfiguration)
✔ High batch throughput
✖ Unsafe: idle apps lose their LLC state, so each active period starts with a long transient
Cross-Request LLC Inertia
(figure: LLC access breakdown (%) for Shore-MT with a 2 MB LLC, split into misses, same-request hits, and cross-request hits)
A large fraction of hits are to state brought in by earlier requests; other applications are qualitatively similar (see paper for details)
StaticLC: Safe but Inefficient
(figure: LLC space over time with fixed partitions reserved for the latency-critical apps)
✔ Low tail latency (preserves LLC state)
✖ Low batch throughput (poor space utilization)
Outline
- Introduction
- Analysis of latency-critical apps
- Inertia-oblivious cache management schemes
- Ubik: inertia-aware cache management
- Evaluation
Ubik: Performance Guarantee
(figure: instructions executed vs. time from the start of a request; progress under Ubik catches up with constant-size progress by the deadline)
Both instantaneous performance and overall progress under Ubik are identical to static partitioning after the deadline
Ubik: Overview
(figure: app activity on top; target and actual partition sizes over time below, annotated with the idle size, the nominal static size, and the boosted size)
- While the app is idle, its target size drops to a small idle size, freeing LLC space for batch apps
- When the app becomes active, its target size jumps to a boosted size above its nominal static size
- Constraint: cycles lost while the partition grows back from the idle size must be compensated by cycles gained while running at the boosted size, before the deadline
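In symbols (notation introduced here for clarity, not taken from the slides): writing L for the cycles lost during the grow transient, T for its length, D for the cycles until the request's deadline, and G(s_b) for the cycles gained per cycle of running at the boosted size s_b rather than the nominal static size, an (idle, boosted) pair is admissible roughly when

\[
  \underbrace{L}_{\text{lost in transient}} \;\le\; \underbrace{(D - T)\, G(s_b)}_{\text{gained at boosted size before the deadline}}
\]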
Analyzing Transients
Need accurate predictions for:
- The length of the transient from s1 to s2 (T_transient)
- Cycles lost during the transient from s1 to s2 (L)
(figure: actual partition size growing from s1 to s2 over T_transient; instructions executed under Ubik vs. at constant size s2, with the gap labeled as lost performance)
Hardware Support
- Utility monitors to measure per-application miss curves (miss probability p_s as a function of partition size s, e.g., p_s1 and p_s2)
- Fine-grained cache partitioning
- Memory-level parallelism (MLP) profiler
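As a rough illustration of how utility monitors yield a miss curve, assuming a UMON-style sampled tag array with per-way (stack-distance) hit counters: by LRU's stack property, an allocation of w ways would have captured exactly the hits counted at distances 1..w. Function and variable names here are mine, not from the paper.

```python
# Minimal sketch: converting UMON per-way hit counters into a miss curve.
# way_hits[i] counts sampled hits at LRU stack distance i+1.

def miss_curve(way_hits, accesses):
    """Return p_s (miss probability) for partition sizes s = 0 .. len(way_hits)."""
    curve = []
    hits_so_far = 0
    for w in range(len(way_hits) + 1):
        if w > 0:
            hits_so_far += way_hits[w - 1]   # hits captured by w ways
        curve.append((accesses - hits_so_far) / accesses)
    return curve
```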
Bounds on Transient Behavior
(figure: partition size growing from s1 to s2 over T_transient; progress with Ubik vs. constant size s2, with the performance gap labeled L)
Each miss grows the partition by one line, so with miss probabilities p_s, cycles per access c, MLP M, and memory latency \ell:
\[ T_{\text{transient}} \;=\; \sum_{s=s_1}^{s_2-1} \frac{c}{M\,p_s} \;\le\; \frac{c\,(s_2-s_1)}{M\,p_{s_2}} \]
\[ L \;=\; \frac{\ell}{M} \sum_{s=s_1}^{s_2-1} \frac{p_s - p_{s_2}}{p_s} \;\le\; \frac{\ell\,(s_2-s_1)}{M}\Bigl(1 - \frac{p_{s_2}}{p_{s_1}}\Bigr) \]
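A small sketch of these bounds as reconstructed above (the exact formulas in the paper may differ); arguments mirror the symbols on the slide:

```python
# Sketch of the transient bounds: p[s] is the miss probability at partition
# size s (in lines), c is cycles per LLC access, M is the app's MLP, and
# mem_lat is the memory latency in cycles. Growing from s1 to s2 takes one
# miss per line gained.

def transient_bounds(p, s1, s2, c, M, mem_lat):
    """Return (transient length in cycles, cycles lost vs. constant size s2)."""
    t_transient = sum(c / (M * p[s]) for s in range(s1, s2))
    lost = (mem_lat / M) * sum((p[s] - p[s2]) / p[s] for s in range(s1, s2))
    return t_transient, lost
```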
Ubik: Partition Sizing
- Use transient analysis to identify feasible (idle size, boosted size) pairs, as sketched below
(figure: candidate size-vs-time trajectories 1-4 against the deadline; trajectory 4, whose transient cannot be repaid by the deadline, is infeasible)
- Choose the pair that yields the maximum batch throughput
- See paper for details
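A toy version of that search, reusing transient_bounds from above. The feasibility test and the "freed space" objective are simplified stand-ins for the paper's analysis, which compares progress against the nominal static partition and models batch throughput directly.

```python
# Toy search over (idle, boosted) pairs. A pair is feasible if the transient
# loss is repaid, before the deadline, by the extra hits earned at the boosted
# size relative to the nominal static size. Among feasible pairs, pick the one
# freeing the most LLC space for batch apps while the app idles.

def pick_sizes(p, static, max_size, deadline, c, M, mem_lat):
    best = None
    for idle in range(1, static + 1):
        for boosted in range(static, max_size + 1):
            t, lost = transient_bounds(p, idle, boosted, c, M, mem_lat)
            # Cycles gained per cycle at the boosted size (illustrative model):
            gain_rate = (p[static] - p[boosted]) * mem_lat / (M * c)
            if lost <= max(0.0, deadline - t) * gain_rate:
                freed = static - idle            # space batch apps get while idle
                if best is None or freed > best[0]:
                    best = (freed, idle, boosted)
    return best  # (freed space, idle size, boosted size), or None if infeasible
```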
Outline
- Introduction
- Analysis of latency-critical apps
- Inertia-oblivious cache management schemes
- Ubik: inertia-aware cache management
- Evaluation
Workloads
Five diverse latency-critical apps:
- xapian (search engine)
- masstree (in-memory key-value store)
- moses (statistical machine translation)
- shore-mt (multi-threaded DBMS)
- specjbb (Java middleware)
Batch applications: random mixes of SPEC CPU2006 benchmarks
Target System
(figure: six-core chip with a banked shared L3; three latency-critical and three batch apps pinned to cores)
- 6 OOO cores, each with private L1I, L1D, and L2 caches
- 12 MB shared LLC, organized as 6 banks
- 400 6-app mixes: 3 latency-critical + 3 batch apps
- Apps pinned to cores
Metrics
(figure: baseline six-core system in which each core has a private LLC)
- Baseline system has private LLCs
- We report: normalized tail latency, and throughput improvement for batch applications
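For concreteness, a sketch of this reporting; the percentile used here (95th) is an assumption, since the slide does not specify which tail is measured.

```python
# Normalized tail latency: a high percentile of per-request latency under the
# evaluated scheme, divided by the same percentile under the private-LLC
# baseline. Values near 1.0 mean the tail is preserved.

def normalized_tail(latencies, baseline_latencies, pct=0.95):
    def tail(xs):
        xs = sorted(xs)
        return xs[min(len(xs) - 1, int(pct * len(xs)))]
    return tail(latencies) / tail(baseline_latencies)
```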
Results
(charts, one per scheme, higher is better: Unmanaged LLC (LRU), UCP, OnOff, StaticLC, and Ubik, plus a summary chart comparing all schemes against the private-LLC baseline)
Conclusions
- To guarantee tail latency, dynamic resource management schemes must be inertia-aware
- Ubik: inertia-aware cache capacity management
  - Preserves the tail latency of latency-critical apps
  - Achieves high cache space utilization for batch apps
  - Requires minimal additional hardware
Thanks for your attention! Questions?