A Benes Packet Network
Longbo Huang & Jean Walrand, EECS @ UC Berkeley
Longbo Huang, Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University
Data centers are important computing resources
Provide most of our computing services:
- Web service: Facebook, Email
- Information processing: MapReduce
- Data storage: Flickr, Google Drive
[figure: Google data centers within the US. Src: http://royal.pingdom.com/2008/04/11/map-of-all-google-data-center-locations/]
We focus on data center networking!
The data center networking problem
Networking is the foundation of data centers' functionality:
- Hundreds of thousands of interconnected servers
- Dynamic traffic flowing among servers
- Large volumes of data requiring small latency
- Traffic statistics may be hard to obtain
Questions:
- How do we connect the servers?
- How do we route traffic to achieve the best rate allocation?
- How do we ensure small delay?
- How do we adapt to traffic changes?
Benes Network + Utility Optimization + Backpressure
Benes Network:
- High throughput
- Small delay (logarithmic in network size)
- Connects 2N servers with O(N log N) switch modules
Flow Utility Maximization:
- Ensures the best allocation of resources
Backpressure:
- Throughput optimal
- Robust to system dynamics
- Requires no statistical info
Benes Network
Building a 2^n x 2^n Benes network
Benes Network
Routing circuits: [figure: circuits 1–4 routed through the UP and DOWN subnetworks]
Benes Network
Routing circuits: [figure: inputs 1 to 2^n routed recursively through the UP/DOWN halves]
- Non-blocking for circuits
- Full throughput for packets
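The recursive UP/DOWN construction above can be sketched in code. A 2^n x 2^n Benes network is an input column and an output column of 2^(n-1) 2x2 switches, sandwiching two 2^(n-1) x 2^(n-1) Benes sub-networks. This small sketch (the function name is our own, not from the talk) counts switch modules to confirm the O(N log N) scaling for N = 2^n inputs.

```python
def benes_switch_count(n: int) -> int:
    """Number of 2x2 switch modules in a 2^n x 2^n Benes network."""
    if n == 1:
        return 1                      # base case: a single 2x2 crossbar
    half = 2 ** (n - 1)               # switches per boundary column
    # one input column + one output column + two half-size sub-networks
    return 2 * half + 2 * benes_switch_count(n - 1)

for n in range(1, 6):
    N = 2 ** n
    # closed form: (2n - 1) * 2^(n-1) switches, i.e. O(N log N)
    assert benes_switch_count(n) == (2 * n - 1) * 2 ** (n - 1)
    print(N, benes_switch_count(n))
```

The closed form follows because the network has 2n - 1 columns of 2^(n-1) switches each.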
Benes Network
Flow Utility Maximization:
- Random arrivals A_sd(t)
- Flow control: admit R_sd(t) in [0, A_sd(t)]
- Each (s, d) flow has utility U_sd(r_sd)
- Each link has capacity 1 packet/slot
The flow utility maximization problem: maximize the sum of U_sd(r_sd) over all rate allocations supportable by the network.
Backpressure can be directly applied; however, each node then needs 2^n queues, one for each destination.
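To make the utility maximization concrete, here is a minimal dual (sub)gradient sketch for one flow on one link, with U(r) = log(1 + r) as in the talk's simulations. The capacity, step size, and iteration count are illustrative choices, not values from the talk.

```python
# Solve: maximize log(1 + r) subject to r <= capacity,
# via a congestion price (dual variable) on the link.
capacity = 0.5        # link capacity (packets/slot), illustrative
lam = 1.0             # dual variable: congestion price
step = 0.01

for _ in range(20000):
    # primal step: r maximizes log(1 + r) - lam * r  =>  r = 1/lam - 1
    r = max(0.0, 1.0 / lam - 1.0)
    # dual step: raise the price if the link is overloaded, lower it otherwise
    lam = max(1e-6, lam + step * (r - capacity))

print(r, lam)   # r -> capacity, lam -> U'(capacity) = 1/(1 + capacity)
```

This price-based decomposition is exactly the structure the later "Intuition" slide exploits: the dual variables behave like queue backlogs.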
Grouped-Backpressure (G-BP)
The idea:
- Divide traffic into two groups
- Perform routing & scheduling on the mixed traffic
- Rely on Backpressure & symmetry for stability
Key components:
1. A fictitious reference system for control
2. A special queueing structure
3. An admission & regulation mechanism
4. Dynamic scheduling
G-BP Component 1 – Reference System
[figure: the fictitious reference system; these nodes remain the same]
G-BP Component 2 – Queueing Structure
- Each switch node in columns 1 to n-1 maintains 4 queues (same in both systems)
- Each input server in column 0 maintains 2 queues (same in both systems)
- Each node in column n maintains 2 queues, for D_1 and D_2 (also in the physical system)
- Each node in columns n to 2n-1 maintains 2 queues (only in the physical system)
G-BP Component 3 – Admission & Regulation
Admission queue at input: [equation]
Regulation queue at output: [equation]
G-BP Component 3 – Admission & Regulation
Admission decisions at input (up flow to d in D_1):
- Update γ_sd(t)
- Admit packets: the input server admits packets
Note: q_d(t) is "idealized". In practice:
- delayed arrivals at d
- delayed feedback to s
[figure: source congestion, destination congestion, and the need to admit, passed back to the source]
G-BP Component 3 – Admission & Regulation
Admission decisions at input (low flow to d in D_2):
- Update γ_sd(t)
- Admit packets: the input server rejects packets
[figure: source congestion, destination congestion, and the need to admit, passed back to the source]
G-BP Component 4 – Dynamic Scheduling Which flow to serve over this link?
G-BP Component 4 – Dynamic Scheduling
Define flow weights: [equation]
G-BP Component 4 – Dynamic Scheduling
- If W_1U > W_2U and W_1U > 0, send 1U packets over link [m, m']
- At m', randomly put the arrival into 1U or 1L
G-BP Component 4 – Dynamic Scheduling
- If W_1U < W_2U and W_2U > 0, send 2U packets over link [m, m']
- At m', randomly put the arrival into 2U or 2L
G-BP Component 4 – Dynamic Scheduling - If queue is not empty, transmit packet - Else remain idle
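The scheduling rule above can be sketched as a max-weight decision on a single link [m, m']: compare the backpressure weights of the two groups and serve the larger one if it is positive, otherwise idle. The function name and example queue lengths are illustrative, not from the talk.

```python
def schedule_link(q_m, q_next):
    """Max-weight decision on link [m, m'].
    q_m, q_next: dicts of queue lengths at m and m' for groups '1U', '2U'.
    Returns the group to serve, or None to stay idle."""
    w1 = q_m['1U'] - q_next['1U']      # backpressure weight of group 1
    w2 = q_m['2U'] - q_next['2U']      # backpressure weight of group 2
    if w1 >= w2 and w1 > 0:
        return '1U'                    # then m' randomly places it in 1U or 1L
    if w2 > w1 and w2 > 0:
        return '2U'                    # then m' randomly places it in 2U or 2L
    return None                        # no positive weight: remain idle

served = schedule_link({'1U': 5, '2U': 3}, {'1U': 1, '2U': 4})
print(served)   # group 1 has weight 4, group 2 has weight -1
```

Serving only positive differentials is what pushes packets "downhill" toward less congested nodes.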
G-BP Component 4 – Dynamic Scheduling - If queue is not empty, transmit packet - Place packets into corresponding queues
Grouped-Backpressure architecture: Admission control → G-Backpressure based on the fictitious system → Free-flow forwarding
Grouped-Backpressure – Performance
Theorem: Under the G-BP* algorithm, (i) both the physical & fictitious networks are stable, and (ii) we achieve: [utility bound]
* This is the idealized algorithm
Remarks:
- No statistical info is needed
- Distributed hop-by-hop routing & scheduling
- Four queues per node (plain BP needs 2^n)
Grouped-Backpressure – Analysis Idea
[figure: a toy example with queues Q_1(t)–Q_6(t); network N1 runs Backpressure (BP), network N2 runs free-flow (FF); input rates 0.5, shared link rate 1-ε]
- Update γ_i(t); H_1(t), H_2(t) are bounded
- Admit packets: if H_i(t) > Q_i(t) + q(t), admit the arrivals; else, do not admit; q(t) is bounded
- Rates into Q_5(t), Q_6(t) are (1-ε)/2 < 0.5, so Q_5(t), Q_6(t) are stable
- Q_1(t) – Q_4(t) are stable by Backpressure ⇒ network stability
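The admission rule in the analysis, "admit iff H_i(t) > Q_i(t) + q(t)", is simple enough to state directly in code. This is only a sketch of the threshold test; the numbers in the usage lines are illustrative, not from the talk.

```python
def admit(H_i: float, Q_i: float, q_dest: float) -> bool:
    """Admit this slot's arrivals iff the admission backlog H_i(t)
    exceeds the source backlog Q_i(t) plus the (idealized)
    destination backlog q(t)."""
    return H_i > Q_i + q_dest

print(admit(10.0, 4.0, 3.0))   # admission pressure wins: admit
print(admit(5.0, 4.0, 3.0))    # source/destination congestion wins: reject
```

Because H_i(t) and q(t) are bounded, this threshold test keeps the admitted rates close to the target rates while reacting to congestion.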
Grouped-Backpressure – Intuition
The flow optimization problem: [equation]
Due to the random arrivals, form the augmented & relaxed flow optimization problem: [equation]
Taking the dual decomposition, the dual form: [equation]
[figure: the dual variables correspond to the admission queue and the data queue]
Grouped-Backpressure – Proof Steps
Step 1 – Define a Lyapunov function: [equation]
Step 2 – Compute the Lyapunov drift Δ(t) = E{ L(t+1) - L(t) | X(t) }
Step 3 – Plug in the optimal solution of the relaxed problem, γ*_ε = r*_ε
Step 4 – Do a telescoping sum
Step 5 – H(t) is stable
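The steps above follow the standard drift-plus-penalty pattern; a sketch of the key inequalities (B is the usual constant from bounded arrivals and service, and V the control parameter; these are generic, not values from the talk) is:

```latex
% Step 1: quadratic Lyapunov function over the queue state X(t)
L(t) = \tfrac{1}{2} \sum_{i} X_i(t)^2
% Steps 2-3: drift-plus-penalty bound, plugging in the relaxed optimum
\Delta(t) - V\,\mathbb{E}\Big\{ \sum_{s,d} U_{sd}(\gamma_{sd}(t)) \,\Big|\, X(t) \Big\}
  \;\le\; B - V \sum_{s,d} U_{sd}\big(\gamma^{*}_{\varepsilon,sd}\big)
% Step 4: telescoping over t = 0,\dots,T-1 and letting T \to \infty
\liminf_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}
  \mathbb{E}\Big\{\sum_{s,d} U_{sd}(\gamma_{sd}(t))\Big\}
  \;\ge\; \sum_{s,d} U_{sd}\big(\gamma^{*}_{\varepsilon,sd}\big) - \frac{B}{V}
```

The same telescoping sum also bounds the time-average backlog, which is why H(t) is stable (Step 5).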
Grouped-Backpressure – Simulation*
Setting: 16x16 Benes network, ε = 0.01, utility = log(1+r)
[figure: delays of 1.8 ms, 1 ms, and 0.5 ms, for 1 Gbps links and 500-byte packets]
* This is the idealized algorithm
Grouped-Backpressure – Simulation
Delay versus network size – logarithmic growth (V = 20, ε = 0.01)
Delay reduced by "biasing" BP
Assume each packet has 500 bytes and each link runs at 1 Gbit/second; then every slot is 4 microseconds.
[figure: delays of 0.7, 0.6, 0.6, 0.5, 0.5, 0.4, 0.35, 0.3, 0.26 ms across network sizes]
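The slot-to-time conversion used on these slides is a one-liner: 500 bytes at 1 Gbps is 500 * 8 / 10^9 seconds, i.e. 4 microseconds per slot. This tiny check (the helper name is ours) converts simulated slot counts into milliseconds under that assumption.

```python
PACKET_BYTES = 500
LINK_BPS = 1e9
SLOT_SEC = PACKET_BYTES * 8 / LINK_BPS      # 4e-6 s per slot

def slots_to_ms(slots: float) -> float:
    """Convert a delay in slots to milliseconds."""
    return slots * SLOT_SEC * 1e3

print(SLOT_SEC)            # seconds per slot
print(slots_to_ms(125))    # e.g. a 125-slot delay in ms
```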
Grouped-Backpressure – Simulation
Setting: 16x16 Benes network, ε = 0.01, utility = w·log(1+r)
Adaptation to traffic change – at time 5, the weights w_sd change
Summary
- Using the Benes network and Backpressure for data center networking
- Scalable: built with basic switch modules
- Simple: four queues per node
- Small delay: logarithmic in network size
- High throughput: supports all rates in the capacity region
- Distributed: hop-by-hop routing and scheduling
- Future research: implementation issues
Thank you very much! More info: www.eecs.berkeley.edu/~huang