The Only Constant is Change: Incorporating Time-Varying Bandwidth - PowerPoint PPT Presentation

The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers Di Xie, Ning Ding, Y. Charlie Hu, Ramana Kompella 1

Cloud Computing is Hot Private Cluster 2

Key Factors for Cloud Viability • Cost • Performance 3

Performance Variability in Cloud • BW variation in cloud due to contention [Schad’10 VLDB] • Causing unpredictable performance [Figure: Bandwidth (Mbps), Local Cluster vs. Amazon EC2] 4

Reserving BW in Data Centers • SecondNet [Guo’10] – Per VM-pair, per VM access bandwidth reservation • Oktopus [Ballani’11] – Virtual Cluster (VC) – Virtual Oversubscribed Cluster (VOC) 5

How BW Reservation Works • Only fixed-BW reservation • Request <N, B> 1. Determine the model 2. Allocate and enforce the model [Figure: Virtual Cluster Model, N VMs attached to a virtual switch; Bandwidth B constant over Time 0 to T] 6

Network Usage for MapReduce Jobs Hadoop Sort, 4GB per VM 7

Network Usage for MapReduce Jobs Hadoop Sort, 4GB per VM Hadoop Word Count, 2GB per VM 8

Network Usage for MapReduce Jobs Hadoop Sort, 4GB per VM Hadoop Word Count, 2GB per VM Hive Join, 6GB per VM 9

Network Usage for MapReduce Jobs Hadoop Sort, 4GB per VM Hadoop Word Count, 2GB per VM Hive Join, 6GB per VM Hive Aggregation, 2GB per VM 10

Network Usage for MapReduce Jobs Time-varying network usage Hadoop Sort, 4GB per VM Hadoop Word Count, 2GB per VM Hive Join, 6GB per VM Hive Aggregation, 2GB per VM 11

Motivating Example • 4 machines, 2 VMs/machine, non-oversubscribed 1Gbps network • Hadoop Sort – N: 4 VMs – B: 500Mbps/VM [Figure labels: 500Mbps, 500Mbps, Not enough BW] 12

Motivating Example • 4 machines, 2 VMs/machine, non-oversubscribed 1Gbps network • Hadoop Sort – N: 4 VMs – B: 500Mbps/VM [Figure label: 500Mbps] 13

Under Fixed-BW Reservation Model [Figure: Bandwidth (500Mbps, 1Gbps) vs. Time (0 to 30) for Job1, Job2, Job3 under the Virtual Cluster Model] 14

Temporally-Interleaved Virtual Cluster (TIVC) • Key idea: Time-Varying BW Reservations • Compared to fixed-BW reservation – Improves utilization of data center • Better network utilization • Better VM utilization – Increases cloud provider’s revenue – Reduces cloud user’s cost – Without sacrificing job performance 20

Challenges in Realizing TIVC Q1: What are the right model functions? Q2: How to automatically derive the models? [Figure: Virtual Cluster Model with N VMs on a virtual switch; request <N, B> with constant Bandwidth B over Time 0 to T, next to request <N, B(t)> with time-varying Bandwidth over Time 0 to T] 21

Challenges in Realizing TIVC Q3: How to efficiently allocate TIVC? Q4: How to enforce TIVC? 22

Challenges in Realizing TIVC • What are the right model functions? • How to automatically derive the models? • How to efficiently allocate TIVC? • How to enforce TIVC? 23

How to Model Time-Varying BW? Hadoop Hive Join 25

TIVC Models [Figure: Bandwidth vs. Time panels. Virtual Cluster: constant B over 0 to T. Pulse models: base level B_b and level B, with a single pulse between T_1 and T_2, or multiple pulses between T_11 and T_12, T_21 and T_22, T_31 and T_32, within 0 to T] 26

Hadoop Sort 27

Hadoop Word Count 28

Hadoop Hive Join 29

Hadoop Hive Aggregation 30

Challenges in Realizing TIVC ✓ What are the right model functions? • How to automatically derive the models? • How to efficiently allocate TIVC? • How to enforce TIVC? 31

Possible Approach • “White-box” approach – Given source code and data of cloud application, analyze quantitative networking requirement – Very difficult in practice • Observation: Many jobs are repeated many times – E.g., 40% of jobs are recurring in Bing’s production data center [Agarwal’12] – Of course, data itself may change across runs, but size remains about the same 32

Our Approach • Solution: “Black-box” profiling-based approach 1. Collect traffic trace from profiling run 2. Derive TIVC model from traffic trace • Profiling: Same configuration as production runs – Same number of VMs – Same input data size per VM – Same job/VM configuration • How much BW should we give to the application? 33

Impact of BW Capping 34

Impact of BW Capping No-elongation BW threshold 35

Choosing BW Cap • Tradeoff between performance and cost – Cap > threshold: same performance, costs more – Cap < threshold: lower performance, may cost less • Our Approach: Expose tradeoff to user 1. Profile under different BW caps (only below-threshold ones) 2. Expose run times and cost to user 3. User picks the appropriate BW cap 36

From Profiling to Model Generation • Collect traffic trace from each VM – Instantaneous throughput in 10ms bins • Generate models for individual VMs • Combine to obtain overall job’s TIVC model – Simplify allocation by working with one model – Does not lose efficiency since per-VM models are roughly similar for MapReduce-like applications 37

Generate Model for Individual VM 1. Choose B_b 2. Periods where B > B_b, set to B_cap [Figure: BW vs. Time, with levels B_b and B_cap] 38

Maximal Efficiency Model • $\text{Efficiency} = \frac{\text{Application Traffic Volume}}{\text{Reserved Bandwidth Volume}}$ • Enumerate B_b to find the maximal efficiency model [Figure: BW vs. Time, with levels B_b and B_cap] 39
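
The two model-generation slides above can be sketched in a few lines. This is a minimal illustration, not the Proteus implementation: it assumes the trace is a list of per-bin throughput samples already capped at B_cap, it reserves per bin rather than merging bins into a small number of pulses, and the names `pulse_model`, `efficiency`, and `best_base` are hypothetical.

```python
from typing import List, Tuple


def pulse_model(trace: List[float], b_base: float, b_cap: float) -> List[float]:
    """Reserve b_cap in bins where demand exceeds b_base, else reserve b_base."""
    return [b_cap if x > b_base else b_base for x in trace]


def efficiency(trace: List[float], model: List[float]) -> float:
    """Application traffic volume divided by reserved bandwidth volume.

    All bins have equal width, so the bin width cancels out of the ratio.
    """
    reserved = sum(model)
    return sum(trace) / reserved if reserved > 0 else 0.0


def best_base(trace: List[float], b_cap: float) -> Tuple[float, float]:
    """Enumerate candidate B_b values and return (B_b, efficiency) of the best."""
    # Assumption: only values that occur in the trace (plus 0) need to be tried,
    # because the model changes only when B_b crosses a sample value.
    candidates = sorted({0.0, *trace})
    scored = [
        (efficiency(trace, pulse_model(trace, b, b_cap)), b)
        for b in candidates
        if b <= b_cap
    ]
    eff, b = max(scored)
    return b, eff


# Hypothetical per-bin throughput in Mbps, profiled under a 1000 Mbps cap.
example_trace = [100, 100, 800, 900, 100, 50]
best_b, best_eff = best_base(example_trace, b_cap=1000)
print(best_b, round(best_eff, 3))
```

A low B_b wastes reservation on pulses that are barely used, while a high B_b wastes it on a tall base level; the enumeration picks the value in between that reserves the least volume for the same traffic.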

Challenges in Realizing TIVC ✓ What are the right model functions? ✓ How to automatically derive the models? • How to efficiently allocate TIVC? • How to enforce TIVC? 40

TIVC Allocation Algorithm • Spatio-temporal allocation algorithm – Extends VC allocation algorithm to time dimension – Employs dynamic programming • Properties – Locality aware – Efficient and scalable • 99th percentile: 28ms when scheduling 5,000 jobs on a 64,000-VM data center 41
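
The temporal half of the allocation problem can be illustrated with a single-link admission check. This sketch is not the dynamic-programming allocator from the talk; it only shows, under assumed data structures, how to test whether a set of pulse-shaped reservations stays within a link's capacity at every instant. The `Reservation` class and `fits_on_link` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Reservation:
    start: float                 # absolute start time of the job
    duration: float              # job length T
    base: float                  # base bandwidth B_b
    peak: float                  # pulse bandwidth B
    # Pulses as half-open (begin, end) offsets relative to `start`.
    pulses: List[Tuple[float, float]] = field(default_factory=list)

    def bandwidth_at(self, t: float) -> float:
        rel = t - self.start
        if rel < 0 or rel >= self.duration:
            return 0.0
        for begin, end in self.pulses:
            if begin <= rel < end:
                return self.peak
        return self.base

    def breakpoints(self) -> List[float]:
        pts = [self.start, self.start + self.duration]
        for begin, end in self.pulses:
            pts += [self.start + begin, self.start + end]
        return pts


def fits_on_link(capacity: float, existing: List[Reservation],
                 candidate: Reservation) -> bool:
    """True if total reserved bandwidth never exceeds the link capacity."""
    everyone = existing + [candidate]
    # Total demand is piecewise constant and changes only at breakpoints,
    # so with half-open intervals it is enough to sample each breakpoint.
    points = sorted({p for r in everyone for p in r.breakpoints()})
    return all(sum(r.bandwidth_at(t) for r in everyone) <= capacity
               for t in points)


# Hypothetical jobs on a 1000 Mbps link: one 800 Mbps pulse in the middle.
job_a = Reservation(start=0, duration=30, base=100, peak=800, pulses=[(10, 20)])
same_time = Reservation(start=0, duration=30, base=100, peak=800, pulses=[(10, 20)])
shifted = Reservation(start=10, duration=30, base=100, peak=800, pulses=[(10, 20)])

print(fits_on_link(1000, [job_a], same_time))  # pulses overlap
print(fits_on_link(1000, [job_a], shifted))    # pulses interleave
```

Two jobs whose pulses collide do not fit, but the same two jobs fit once one is shifted so that its pulse falls in the other's base period, which is the interleaving that gives TIVC its name.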

Challenges in Realizing TIVC ✓ What are the right model functions? ✓ How to automatically derive the models? ✓ How to efficiently allocate TIVC? • How to enforce TIVC? 42

Enforcing TIVC Reservation • Possible to enforce completely in hypervisor – Does not have control over upper level links – Requires online rate monitoring and feedback – Increases hypervisor overhead and complexity • Observation: Few jobs share a link simultaneously – Most small jobs will fit into a rack – Only a few large jobs cross the core – In our simulations, < 26 jobs share a link in 64,000-VM data center 43

Enforcing TIVC Reservation • Enforcing BW reservation in switches – Avoid complexity in hypervisors – Can be implemented on commodity switches • Cisco Nexus 7000 supports 16k policers 44

Challenges in Realizing TIVC ✓ What are the right model functions? ✓ How to automatically derive the models? ✓ How to efficiently allocate TIVC? ✓ How to enforce TIVC? 45

Proteus: Implementing TIVC Models 1. Determine the model 2. Allocate and enforce the model 46

Evaluation • Large-scale simulation – Performance – Cost – Allocation algorithm • Prototype implementation – Small-scale testbed 47

Simulation Setup • 3-level tree topology – 16,000 Hosts x 4 VMs – 4:1 oversubscription • Workload – N: exponential distribution around mean 49 – B(t): derived from real Hadoop apps [Figure: tree with 20 Aggr Switches (50Gbps uplinks), 20 ToR Switches per Aggr Switch (10Gbps uplinks), 40 Hosts per ToR Switch (1Gbps links)] 48
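
The topology numbers on this slide can be cross-checked with a little arithmetic. The sketch below assumes one reading of the figure labels (20 aggregation switches with 50Gbps uplinks, 20 ToR switches under each with 10Gbps uplinks, 40 hosts per ToR at 1Gbps); the variable names are illustrative only.

```python
# Assumed reading of the figure labels on the Simulation Setup slide.
AGGR_SWITCHES = 20
TORS_PER_AGGR = 20
HOSTS_PER_TOR = 40
VMS_PER_HOST = 4

HOST_LINK_GBPS = 1
TOR_UPLINK_GBPS = 10
AGGR_UPLINK_GBPS = 50

hosts = AGGR_SWITCHES * TORS_PER_AGGR * HOSTS_PER_TOR
vms = hosts * VMS_PER_HOST

# Oversubscription = bandwidth entering a switch from below / its uplink.
tor_oversub = HOSTS_PER_TOR * HOST_LINK_GBPS / TOR_UPLINK_GBPS
aggr_oversub = TORS_PER_AGGR * TOR_UPLINK_GBPS / AGGR_UPLINK_GBPS

print(hosts, vms, tor_oversub, aggr_oversub)
```

Under this reading the counts agree with the 16,000 hosts and 4:1 oversubscription stated in the bullets, and with the 64,000-VM data center quoted on the allocation slide.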

Batched Jobs • Scenario: 5,000 time-insensitive jobs • Completion time reduction: 42%, 21%, 23%, 35% • Mixed workload: 1/3 of each type • All remaining results are for the mixed workload 49

Varying Oversubscription and Job Size 25.8% reduction for non-oversubscribed network 50

Dynamically Arriving Jobs • Scenario: Accommodate users’ requests in shared data center – 5,000 jobs, Poisson arrival, varying load • Rejected: VC 9.5%, TIVC 3.4% 51

Analysis: Higher Concurrency • Under 80% load [Figure annotations: 28% higher, 7% higher, 28% higher (job concurrency, VM utilization, revenue); rejected jobs are large; charge VMs] 52

Tenant Cost and Provider Revenue • Charging model – VM time T and reserved BW volume B – $\text{Cost} = N (k_v T + k_b B)$ – k_v = 0.004$/hr, k_b = 0.00016$/GB • 12% less cost for tenants • Providers make more money [Figure annotation: Amazon target utilization] 53
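
A worked example of the charging model may help. The sketch assumes T is measured in hours and B is the reserved volume per VM in decimal gigabytes, consistent with the units of k_v and k_b on the slide; the job shapes are made-up numbers, and `reserved_volume_gb` and `job_cost` are hypothetical helpers.

```python
from typing import List, Tuple

K_V = 0.004     # $ per VM-hour (from the slide)
K_B = 0.00016   # $ per GB of reserved bandwidth volume (from the slide)


def reserved_volume_gb(segments: List[Tuple[float, float]]) -> float:
    """Reserved volume in GB for (bandwidth in Mbps, length in seconds) segments.

    Mbps * seconds = megabits; dividing by 8 gives megabytes and by 1000
    gives (decimal) gigabytes.
    """
    return sum(mbps * seconds for mbps, seconds in segments) / 8 / 1000


def job_cost(n_vms: int, hours: float, volume_gb: float) -> float:
    """Cost = N * (k_v * T + k_b * B)."""
    return n_vms * (K_V * hours + K_B * volume_gb)


# Hypothetical one-hour job on 4 VMs.
fixed = reserved_volume_gb([(500, 3600)])               # 500 Mbps all hour
varying = reserved_volume_gb([(100, 2400), (500, 1200)])  # base plus one pulse

fixed_cost = job_cost(4, 1.0, fixed)
tivc_cost = job_cost(4, 1.0, varying)
print(round(fixed_cost, 4), round(tivc_cost, 4))
```

With the same run time, the time-varying reservation pays only for the volume it actually reserves, so the bandwidth term shrinks while the VM-time term is unchanged.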

Testbed Experiment • Setup – 18 machines – Tc and NetFPGA rate limiter • Real MapReduce jobs • Procedure – Offline profiling – Online reservation 54

Testbed Result • TIVC finishes job faster than VC, Baseline finishes the fastest • Baseline suffers elongation, TIVC achieves similar performance as VC 55

Conclusion • Network reservations in cloud are important – Previous work proposed fixed-BW reservations – However, cloud apps exhibit time-varying BW usage • We propose TIVC abstraction – Provides time-varying network reservations – Uses simple pulse functions – Automatically generates model – Efficiently allocates and enforces reservations • Proteus shows TIVC benefits both cloud provider and users significantly 56
