SNC-Meister: Admitting More Tenants With Tail Latency SLOs
  Timothy Zhu (Carnegie Mellon University), Daniel S. Berger (University of Kaiserslautern), Mor Harchol-Balter (Carnegie Mellon University)
  Presented by: Zane Ma & Shuo Feng
Cloud Request Latency
  High-performance cloud computing in a single datacenter (e.g., MapReduce, Heron, HDFS)
  Cloud networks provide latency service-level objectives (SLOs)
  Typically guarantee 99% or 99.9% request latency, rather than packet latency
  Goal: achieving high tenancy while meeting tail latency SLOs
Latency Causes
  Assumption: typical behavior, with no hardware failures, flash crowds, etc.
  Short-lived bursts caused by network queues and services
  [Diagram: within the datacenter network, Tenant VM 1 and Tenant VM 2 send through a queue at the switch, then through a queue at the server VM]
Modeling Latency
  Deterministic Network Calculus (DNC)
    Calculate fixed maximum rate/burst constraints from historical traces
    Consider the worst-case scenario from adversarial coordination (i.e., 100% latency)
    Used by Silo (SIGCOMM 2015), QJump (NSDI 2015), PriorityMeister (SoCC 2014)
  Stochastic Network Calculus (SNC)
    Model maximum rate/burstiness as a probabilistic distribution
    Does not assume all tenants are adversarially correlated, which allows a lower target latency percentile (e.g., 99.9%)
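The contrast between the two approaches can be illustrated with a toy simulation. The sketch below is not taken from the paper or from any network calculus library. It assumes hypothetical tenants that each burst to a peak rate independently with a fixed probability, and it compares the deterministic worst case (all tenants bursting together) with the empirical 99.9th percentile of the aggregate load. All names and values are illustrative assumptions.

```python
import random


def worst_case_load(peak_rates):
    """Deterministic view: assume every tenant bursts at the same time."""
    return sum(peak_rates)


def percentile_load(peak_rates, burst_prob, percentile, samples=50_000, seed=0):
    """Stochastic view: empirical percentile of the aggregate load when each
    tenant bursts independently with probability burst_prob (an assumption)."""
    rng = random.Random(seed)
    loads = []
    for _ in range(samples):
        load = sum(rate for rate in peak_rates if rng.random() < burst_prob)
        loads.append(load)
    loads.sort()
    # Index of the requested percentile in the sorted samples.
    index = min(len(loads) - 1, int(percentile * len(loads)))
    return loads[index]


# Hypothetical workload: 20 tenants, each bursting to 100 Mbps 10% of the time.
peaks = [100.0] * 20
dnc_bound = worst_case_load(peaks)
snc_estimate = percentile_load(peaks, burst_prob=0.1, percentile=0.999)
print(f"worst case (100%): {dnc_bound:.0f} Mbps")
print(f"99.9th percentile: {snc_estimate:.0f} Mbps")
```

In this toy setting the 99.9th percentile sits well below the all-burst-together worst case, which is the kind of headroom a stochastic analysis can use to admit more tenants.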
Modeling Latency
  [Figure: Deterministic Network Calculus vs. Stochastic Network Calculus, with the 99.9% latency marked]
SNC Example
  [Diagram: Tenant VM 1 (arrival process A1) and Tenant VM 2 (arrival process A2) feed a queue at the switch (service process S1); the switch output (arrival process A3) feeds a queue at the server VM (service process S2)]
  Goal: get a 99% latency SLO bound between Tenant VM 1 and the server VM
  Total latency = switch latency + server latency
  First attempt: Total latency = Latency(A1, S1, 0.99) + Latency(A3, S2, 0.99)
  S1 is slowed down by A2, so use S’1 = Leftover(S1, A2)
  A3 = Output(A1, S’1)
  Total latency = Latency(A1, S’1, 0.99) + Latency(A3, S2, 0.99)
  Adding latencies does not preserve the SLO percentile! Combine them with Convolution(L1, L2, 0.99)
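One way to see why adding per-hop percentiles fails is the union bound: if each hop's bound is violated with probability at most 1%, the sum of the two bounds can be violated with probability up to 2%. The sketch below works through that arithmetic. It only illustrates the problem and is not SNC-Meister's Convolution operator; the function names are hypothetical.

```python
def end_to_end_guarantee(per_hop_percentiles):
    """Percentile guaranteed for the sum of per-hop latency bounds (union bound).

    Each hop may violate its bound with probability (1 - p); without any
    independence assumption, the violation probabilities can add up.
    """
    violation = sum(1.0 - p for p in per_hop_percentiles)
    return max(0.0, 1.0 - violation)


def per_hop_percentile(target, hops):
    """Percentile each hop must meet so the summed bound still meets target,
    splitting the allowed violation probability evenly across hops."""
    return 1.0 - (1.0 - target) / hops


print(end_to_end_guarantee([0.99, 0.99]))
print(per_hop_percentile(0.99, 2))
```

Under this naive approach, adding two 99% bounds guarantees only about 98% end to end, and each hop would need a 99.5% bound to reach 99% overall.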
SNC Operators
  Operator              Meaning
  Latency(A, S, N)      N% latency for a given A, S
  Leftover(S, A)        S adjusted/reduced by A
  Output(A, S)          Resultant output distribution of A and S
  Convolution(L1, L2)   Combine latencies L1, L2
  Aggregation(A1, A2)   Multiplexed A1 and A2
SNC Implementation Challenges
  SNC order-of-operations optimizations
  Tunable dependencies between tenants
  Modeling burstiness: Markov Modulated Poisson Process (switching between high and low phases)
  Programming language abstraction for applying SNC operators
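A Markov Modulated Poisson Process can be sketched as a Poisson arrival stream whose rate depends on a hidden phase that switches at exponentially distributed times. The minimal two-phase simulation below uses illustrative parameter names and values; they are assumptions, not figures from the paper or the talk.

```python
import random


def simulate_mmpp(arrival_rates, switch_rates, duration, seed=0):
    """Simulate a two-phase MMPP and return arrival times in increasing order.

    arrival_rates[i]: Poisson arrival rate while in phase i (requests/sec).
    switch_rates[i]:  rate of leaving phase i (1 / mean time spent in phase i).
    """
    rng = random.Random(seed)
    now, phase, arrivals = 0.0, 0, []
    while now < duration:
        # Time spent in the current phase is exponentially distributed.
        phase_end = min(duration, now + rng.expovariate(switch_rates[phase]))
        rate = arrival_rates[phase]
        if rate > 0:
            # Within a phase, arrivals form a Poisson process with this rate.
            t = now + rng.expovariate(rate)
            while t < phase_end:
                arrivals.append(t)
                t += rng.expovariate(rate)
        now, phase = phase_end, 1 - phase
    return arrivals


# Hypothetical tenant: phase 0 is "high" (200 req/s), phase 1 is "low" (10 req/s).
arrivals = simulate_mmpp(arrival_rates=(200.0, 10.0), switch_rates=(2.0, 0.5),
                         duration=60.0)
print(len(arrivals), "arrivals in 60 seconds")
```

Because exponential inter-arrival times are memoryless, restarting the arrival clock at each phase change does not bias the process.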
Experimental Setup
  Silo: DNC, fixed 1.5Kb bursts, trial-and-error manual bandwidth selection
  Silo++: Silo with dynamic bandwidth selection
  QJump: manual priority class assignment
  QJump++: QJump with automatically assigned priority class
  PriorityMeister: automatically derived rates from tenant trace
  Traces: real 2015 production traces from a large internet company
Results
  [Figures: "More Tenants" and "High Network Utilization"]
Results
  [Figures: "Scales to Cluster Size" and "Scales to high SLO %"; axis label: #Tenants]
Future Work / Discussion
  Bootstrapping representative historical traces/logs is a chicken-and-egg problem. How can we improve the process?
  How can we build fault tolerance into SNC-Meister? Any practical SLO mechanism should account for as many failure scenarios as possible.
  The paper makes an assumption about latency within a single datacenter. Why do we need this assumption? What if it is not met?
  When most of the tenants are dependent on one another, why does SNC show higher latency than DNC?
Backup Slides
  SNC Operators