Deep-Q: Traffic-driven QoS Inference using Deep Generative Network
Shihan Xiao, Dongdong He, Zhibo Gong
Network Technology Lab, Huawei Technologies Co., Ltd., Beijing, China

Background
• What is a QoS model?
– [Figure: traffic and network are the inputs of the QoS model; delay, jitter, packet loss, … are its outputs]

Background
• Why is it important?
– Online QoS monitoring (delay monitoring for SLA guarantee & anomaly detection): real-time active QoS measurements on every path (Path A, B, C) are costly; a QoS model removes most of that cost
– Offline traffic analysis (delay inference): given a traffic trace and the network, a QoS model infers QoS without QoS measurements
– “What if” analysis (delay prediction): predict how QoS will change if a flow switches from Path A to Path C

Traditional Methods
• 1. Network simulator: NS2, NS3, OMNeT++, …
– Input: traffic and network; output: delay, jitter, packet loss
– Slow and inaccurate

Traditional Methods
• 2. Mathematical modeling: queuing theory under simplified assumptions
– Input: traffic and network; output: delay, jitter, packet loss
– Large human-analysis cost & inaccurate
• A fast, accurate & low-cost QoS model is helpful!
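
To illustrate what a simplified assumption looks like in queuing theory, consider the textbook M/M/1 queue. The slides do not name a specific queue model, so this choice is an assumption made only for illustration. With Poisson arrivals at rate λ and exponential service at rate μ, the mean delay through the queue is:

```latex
\[
  T = \frac{1}{\mu - \lambda}, \qquad \rho = \frac{\lambda}{\mu} < 1
\]
```

With μ = 1000 packets/s, a load of λ = 800 packets/s gives T = 5 ms, and λ = 900 packets/s gives T = 10 ms. The formula holds only under the arrival and service assumptions above, which measured traffic often does not satisfy.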

Key Observations
• Observation 1: Per-link traffic load is much easier to collect than per-path QoS values, and is well supported by existing tools (e.g., SNMP)
• Observation 2: Traffic load is the key factor behind QoS changes
– [Figure: collected link load matrixes (node index × node index) are the traffic input of the QoS model; delay, jitter, packet loss are its outputs]
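
A link load matrix of the kind described above can be sketched as follows. This is a minimal illustration, assuming per-link byte counters polled over a fixed interval (as an SNMP-style collector might provide); the function name `link_load_matrix`, the counter values, and the 60-second interval are all hypothetical.

```python
def link_load_matrix(num_nodes, link_bytes, interval_s):
    """Build an N x N matrix whose entry [i][j] is the load in bits/s on link i -> j.

    link_bytes maps (src_node, dst_node) to the bytes counted during the interval.
    Node pairs without a link keep a load of 0.0.
    """
    matrix = [[0.0] * num_nodes for _ in range(num_nodes)]
    for (src, dst), byte_count in link_bytes.items():
        matrix[src][dst] = byte_count * 8 / interval_s
    return matrix


# Hypothetical counters for a 3-node network, polled over 60 seconds.
counters = {(0, 1): 1_500_000, (1, 2): 750_000, (2, 0): 300_000}
load = link_load_matrix(3, counters, interval_s=60)
```

Collecting one such matrix per polling interval yields the sequence of matrixes that serves as the traffic input.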

Key Observations
• Observation 3: Different traffic loads lead to different QoS distributions
– [Figure: testbed measurement; measured delay samples for 40 traffic loads (per 20 min)]

Key Observations
• Target problem: Given a set of traffic load matrixes collected during time T, what are the distributions of the QoS values (delay, jitter, loss, ...) of each network path during T?
– Different traffic loads lead to different QoS distributions

Solution of Deep-Q
• Why does deep learning help?
– Low human-analysis cost: data-driven vs. human-engineered model
– Fast inference: running time of milliseconds vs. hours
– [Figure: a network simulator with human-engineered delay and loss models maps packets to QoS values, with a running time of hours; an automatically trained neural network maps the traffic load matrix to delay/jitter/loss QoS values, with a running time of milliseconds]

Key Technology: Deep Generative Network
• State-of-the-art DGNs in deep learning
– GAN (Generative Adversarial Network) & VAE (Variational Autoencoder)
• So what is the difference?
– Image domain, (conditional) GAN example: input “this small bird has a pink breast and crown, and black primaries and secondaries”; infer image samples. Source: ICML 2016, “Generative Adversarial Text to Image Synthesis”
– Image domain, (conditional) VAE example: input number 2; infer image samples. Source: NIPS 2014, “Semi-supervised Learning with Deep Generative Models”
– Network domain, Deep-Q: input traffic load matrixes; infer the probability distribution of delay (us)

Key Technology: Deep Generative Network
• Differences
• Image domain (GAN & VAE)
– Application: text label to images
– Input: discrete label (discrete & low/high dimensional)
– Output: image samples (discrete & high dimensional)
– Target: the generated image samples satisfy the “real” image distribution and match the label class
• Network domain (Deep-Q)
– Application: traffic load matrixes to QoS values
– Input: traffic statistics (continuous & high dimensional)
– Output: QoS values (continuous & low dimensional)
– Target: the generated QoS values satisfy the real QoS distribution and match the traffic statistics
• Deep-Q requires high accuracy on the output distribution, but GAN & VAE do not apply directly!

Deep-Q Solution
• 1. Handle the continuous high-dimensional input
– Extract traffic features from a sequence of high-dimensional traffic load matrixes
– LSTM (Long Short-Term Memory) module: a state-of-the-art deep learning method for learning features from a data sequence
– [Figure: the micro-load matrixes M_t^1, M_t^2, M_t^3, …, M_t^n collected during time t are fed one by one into a chain of LSTM cells linked by their hidden states; the output is the traffic features]
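
The feature-extraction idea above can be sketched with a standard LSTM cell in plain Python. This is only an illustration of how a sequence of matrixes is folded into a fixed-size feature vector: the weights are random and untrained, and the names `lstm_step` and `extract_features`, the hidden size, and the toy 2 x 2 matrixes are all assumptions, not taken from the Deep-Q implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, W, b):
    """One standard LSTM step: returns the new hidden state and cell state."""
    z = x + h  # concatenate the input vector and the previous hidden state

    def gate(name, activation):
        return [activation(sum(w * v for w, v in zip(row, z)) + bias)
                for row, bias in zip(W[name], b[name])]

    i = gate("i", sigmoid)    # input gate
    f = gate("f", sigmoid)    # forget gate
    o = gate("o", sigmoid)    # output gate
    g = gate("g", math.tanh)  # candidate cell state
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def extract_features(matrices, hidden_size, seed=0):
    """Run the LSTM over a sequence of load matrixes; the last hidden state is the feature vector."""
    rng = random.Random(seed)
    input_size = len(matrices[0]) * len(matrices[0][0])
    # Random (untrained) weights, for illustration only.
    W = {k: [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
             for _ in range(hidden_size)] for k in "ifog"}
    b = {k: [0.0] * hidden_size for k in "ifog"}
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for m in matrices:
        x = [v for row in m for v in row]  # flatten the N x N matrix
        h, c = lstm_step(x, h, c, W, b)
    return h


# Three hypothetical 2 x 2 micro-load matrixes (normalized loads).
seq = [[[0.1, 0.4], [0.3, 0.0]],
       [[0.2, 0.5], [0.1, 0.0]],
       [[0.6, 0.2], [0.4, 0.0]]]
features = extract_features(seq, hidden_size=4)
```

The feature vector has a fixed length (`hidden_size`) regardless of how many matrixes are in the sequence, which is what lets a downstream model consume traffic of varying duration.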

Deep-Q Solution
• 2. Handle the continuous low-dimensional output
– Challenge: QoS distribution inference requires high accuracy
– Solution: a new metric, “Cinfer loss”, to accurately quantify the QoS distribution error
– [Figure: CDF (Cumulative Distribution Function) curves of X, the inferred QoS distribution, and Y, the target QoS distribution; x-axis: delay (ms); y-axis: cumulative probability; the error is the height difference between the two curves]
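
One way to write the height-difference idea down is as the area between the two CDF curves. The slides do not give the exact definition of Cinfer loss, so the formula below is an illustrative assumption rather than the authors' definition:

```latex
\[
  \mathcal{L}(X, Y) \;=\; \int_{-\infty}^{\infty} \bigl| F_X(d) - F_Y(d) \bigr| \,\mathrm{d}d
\]
```

Here F_X and F_Y are the CDFs of the inferred and target QoS distributions, and d is the QoS value (for example, delay in ms). For one-dimensional distributions, this area equals the 1-Wasserstein distance between X and Y.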

Deep-Q Solution
• Deep-Q: a stable & accurate inference engine
– Built upon VAE (stable) and augmented with Cinfer loss (accurate)
• A simple example of learning ability (target distribution vs. inferred distribution)
– VAE (L2 loss): stable but inaccurate
– GAN (KL loss): more accurate but unstable
– Deep-Q (Cinfer loss): stable & accurate

Deep-Q Solution
• Cinfer-loss computation for training
– The exact computation is NP-hard
– The approximation must be fully differentiable so that gradients can be computed for training
• Step 1: Discretization
– From an integral to a discrete sum over bins
– [Figure: the CDF curve (cumulative probability vs. delay in ms) before and after discretization into bins]
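
The discretization step can be sketched as follows, assuming the quantity being discretized is the area between the inferred and target CDF curves (the slides do not spell out the exact integrand, so that is an assumption). The names `cdf_height` and `binned_cdf_error`, the bin range, and the sample values are hypothetical.

```python
def cdf_height(samples, edge):
    """Empirical CDF value at `edge`: the fraction of samples <= edge."""
    return sum(1 for s in samples if s <= edge) / len(samples)


def binned_cdf_error(inferred, target, lo, hi, num_bins):
    """Replace the integral of |F_X - F_Y| by a sum over equal-width bins.

    The height difference is evaluated at the right edge of each bin
    and multiplied by the bin width.
    """
    width = (hi - lo) / num_bins
    total = 0.0
    for k in range(1, num_bins + 1):
        edge = lo + k * width
        total += abs(cdf_height(inferred, edge) - cdf_height(target, edge)) * width
    return total


# Hypothetical delay samples in ms; the inferred set is the target shifted by 1 ms.
target = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]
error = binned_cdf_error(shifted, target, lo=0.0, hi=6.0, num_bins=6)  # 1.0
```

In this toy case the two CDFs differ by 0.25 over four 1 ms bins, so the discretized error is 1.0, matching the 1 ms shift. Note that `cdf_height` uses a hard comparison, so this version is not yet differentiable.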

Deep-Q Solution
• Cinfer-loss computation for training
• Step 2: Bin height computation (required to be differentiable)
– An intuitive method: calculate the bin index of each sample and count the number of samples per bin
– Problem: the ceil function is non-differentiable and difficult to approximate!
– [Figure: binned CDF, cumulative probability vs. delay (ms)]
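
The intuitive method can be sketched as below, assuming equal-width bins where bin k covers the interval (lo + k·width, lo + (k+1)·width]. The function names and the sample delays are hypothetical.

```python
import math


def bin_counts(samples, lo, width, num_bins):
    """Count samples per bin, locating each sample's bin with the ceil function."""
    counts = [0] * num_bins
    for s in samples:
        idx = math.ceil((s - lo) / width) - 1
        idx = min(max(idx, 0), num_bins - 1)  # clamp out-of-range samples
        counts[idx] += 1
    return counts


def cdf_bin_heights(samples, lo, width, num_bins):
    """Cumulative probability at the right edge of each bin."""
    heights, running = [], 0
    for count in bin_counts(samples, lo, width, num_bins):
        running += count
        heights.append(running / len(samples))
    return heights


delays = [0.5, 1.2, 1.7, 2.9, 3.0]  # hypothetical delay samples in ms
counts = bin_counts(delays, lo=0.0, width=1.0, num_bins=4)        # [1, 2, 2, 0]
heights = cdf_bin_heights(delays, lo=0.0, width=1.0, num_bins=4)  # [0.2, 0.6, 1.0, 1.0]
```

The bin index is piecewise constant in the sample value, so its derivative is zero almost everywhere and undefined at the bin edges. No useful gradient can flow from the bin heights back to the samples, which is why this method cannot be used for training.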

Deep-Q Solution
• Cinfer-loss computation for training
• Step 2: Bin height computation (required to be differentiable)
– A differentiable method with some math tricks (borrowed from deep learning)
– Step 1): Use the Sign function
– Step 2): Approximate the Sign function with tanh
– Approximation error < 10^−5 in experiments
– [Figure: binned CDF, cumulative probability vs. delay (ms)]
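
A minimal sketch of the two steps, under the assumption that the bin height is first written as an average of 0.5·(1 + sign(edge − sample)) and that sign(x) is then replaced by tanh(k·x) for a large k. The exact formulation used by Deep-Q is not shown in the slides; the sharpness value k = 50 and all function names are hypothetical.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def hard_cdf_height(samples, edge):
    """Fraction of samples below `edge`, written with the Sign function.

    A sample exactly on the edge contributes 0.5.
    """
    return sum(0.5 * (1 + sign(edge - s)) for s in samples) / len(samples)


def soft_cdf_height(samples, edge, sharpness=50.0):
    """Same quantity with sign(x) replaced by tanh(sharpness * x)."""
    return sum(0.5 * (1.0 + math.tanh(sharpness * (edge - s)))
               for s in samples) / len(samples)


def soft_height_gradient(samples, edge, sharpness=50.0):
    """Derivative of soft_cdf_height with respect to each sample."""
    n = len(samples)
    return [-0.5 * sharpness * (1.0 - math.tanh(sharpness * (edge - s)) ** 2) / n
            for s in samples]


delays = [0.5, 1.2, 1.7, 2.9, 3.4]  # hypothetical delay samples in ms
hard = hard_cdf_height(delays, edge=2.0)
soft = soft_cdf_height(delays, edge=2.0)
gap = abs(hard - soft)
grads = soft_height_gradient(delays, edge=2.0)
```

Because tanh is smooth, the gradient exists everywhere; it is largest for samples near a bin edge and vanishes for samples far from it. A larger sharpness reduces the approximation error but concentrates the gradient on fewer samples.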

Deep-Q Solution
• Put it all together
– Collect network traffic data from the underlay: traffic load matrixes X along time
– Training phase of Deep-Q: automatic feature engineering & QoS modeling, with end-to-end training using Cinfer loss
– Inference phase of Deep-Q: sample Z from N(0,1) and decode it into network QoS (delay, jitter, loss, …)
– [Figure: the traffic load matrix X along time is turned into space-time features X’; network QoS (delay, jitter, loss, …) enters the VAE encoder, which produces Z; the VAE decoder maps Z to network QoS (delay, jitter, loss, …)]
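
The inference phase can be sketched as repeated sampling: draw Z from N(0,1), decode it together with the traffic features, and collect the outputs as an empirical QoS distribution. In the sketch below `toy_decoder` is a hypothetical hand-written stand-in for a trained VAE decoder, and conditioning the decoder on the traffic features is an assumption about the architecture; none of the numbers come from Deep-Q.

```python
import random


def toy_decoder(z, features):
    """Hypothetical stand-in for a trained decoder: maps (z, features) to a delay in ms."""
    base = 1.0 + 4.0 * sum(features) / len(features)  # heavier load, larger delay
    return max(0.0, base + 0.5 * z)


def infer_delay_samples(features, num_samples, seed=0):
    """Draw z ~ N(0, 1) repeatedly and decode each draw into one delay sample."""
    rng = random.Random(seed)
    return [toy_decoder(rng.gauss(0.0, 1.0), features) for _ in range(num_samples)]


# Hypothetical traffic features for a lightly and a heavily loaded period.
light = infer_delay_samples([0.1, 0.2, 0.1], num_samples=1000)
heavy = infer_delay_samples([0.8, 0.9, 0.7], num_samples=1000)
```

Each call produces a set of samples rather than a single value, so percentiles and the CDF of the inferred delay can be read directly from the output.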

Experiment Setup
• Testbed topology
– [Figure: two testbed topologies, a data center network and an overlay IP network; labeled elements: NEU200 probes, CPEs, routers r0 to r4, AS-1, AS-2, Internet]
• Traffic traces: WIDE backbone network [1]
– Training set: 24 hours of traffic traces on April 12, 2017
– Test set: 24 hours of traffic traces on April 13, 2017
• Neural network: TensorFlow implementation with 2 hidden layers
[1] Traffic traces are publicly available at http://mawi.wide.ad.jp/mawi/

Experiment Results
• Delay inference in the data center topology
– [Figure: traffic and real delay; distribution error, mean error, 90-percentile error, and 99-percentile error of inference, for deep learning methods and queuing theory]
– 1. Deep learning methods achieve on average 3x higher accuracy than queuing theory
– 2. Deep-Q achieves the lowest errors and the most stable performance in all cases
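
The mean and percentile errors named above could be computed along the following lines. The slides do not define these errors, so the use of relative error and the nearest-rank percentile are assumptions, and the sample data is synthetic.

```python
def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceiling of q * n / 100
    return ordered[rank - 1]


def relative_error(inferred, real):
    return abs(inferred - real) / abs(real)


def inference_errors(inferred, real):
    """Relative error of the mean, 90th percentile, and 99th percentile."""
    def mean(xs):
        return sum(xs) / len(xs)

    return {
        "mean": relative_error(mean(inferred), mean(real)),
        "p90": relative_error(percentile(inferred, 90), percentile(real, 90)),
        "p99": relative_error(percentile(inferred, 99), percentile(real, 99)),
    }


# Synthetic example: the inferred delays overestimate the real ones by 10%.
real = [float(d) for d in range(1, 101)]
inferred = [1.1 * d for d in real]
errors = inference_errors(inferred, real)
```

Here all three errors are approximately 0.1, reflecting the uniform 10% overestimate. Tail percentiles such as the 99th are usually harder to match than the mean, which is why they are reported separately.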

Experiment Results
• Packet loss inference in the overlay IP topology
– [Figure: inference errors of deep learning methods and queuing theory]
– 1. Deep learning methods achieve on average 3x higher accuracy than queuing theory
– 2. Deep-Q achieves the lowest errors and the most stable performance in all cases
• Deep-Q inference speed: < 10 ms for network scale < 200 nodes

Conclusion
• Deep-Q: an accurate, fast and low-cost QoS inference engine
– Automation: LSTM module for automatic traffic feature extraction
– High stability: an extended VAE inference structure with the encoder and decoder
– High accuracy: a new metric, “Cinfer loss”, to accurately quantify the QoS distribution error
• Future vision:
– Learn device-level QoS models (routers/switches) → scalable network-level QoS models
– Learn high-level application QoE from traffic traces