Intelligent Operation and Maintenance of Public Cloud Based on GPU-Accelerated Machine Learning
Feng Xie (stephen.xf@alibaba-inc.com)
Zhaowei Ouyang (zhaowei.oyzw@alibaba-inc.com)
AIOps on Public Cloud
• Scenario: Scheduling; Maintenance and Upgrading; Portrait; Product recommendations
• Case: Resource arrangement; Power management; Resource demand analysis; Load Balance; Outage/Failure prediction; Anomaly detection; Migration downtime prediction; VM Portrait; Customer Portrait; Cluster health portrait; Analysis of purchasing behavior; …
• Algorithm: Regression; Time Series; Classification; Clustering; …
• Platform: MaxCompute; Blink; Rapids; SLS; Dask; HybridDB for MySQL
• Data: Node Data (KPI, Abnor., Power); VM Data (KPI, Abnor., Event); Cluster Data (KPI, Abnor., Event); IDC Data (Power, Rack Data)
Machine Learning Platform Architecture
• Data sources: Offline Data (MaxCompute); Online Data (SLS / HybridDB for MySQL / Blink)
• Front end: Web Server; Message Queue
• Compute: Dask Client; Dask Scheduler; Dask Worker (Data Prepare); Dask Worker (Train); Dask Worker (Predict)
• Storage: Model Repository (OSS); Redis
KPI Prediction
• KPIs: CPU; Network; Storage
• Uses: Load Balancing; Resource Scheduling; Traffic Warning; Anomaly detection
CPU Load Time Series
[Figure panels: Periodicity; Similarity]
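Training Flow Chart
1. Acquire training data.
2. Apply the FFT to each series and check whether it is periodic.
   • Not periodic: train a general regression model.
   • Periodic: pass the series to clustering.
3. Check whether the periodic series falls into a cluster.
   • Yes: train a cluster-specific regression model.
   • No: train a general regression model.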
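The routing in the training flow can be sketched as a small function. This is only an illustration under stated assumptions: `plan_training`, `is_periodic`, and `assign_cluster` are hypothetical names that do not come from the slides, and the two callables stand in for the FFT check and the clustering step.

```python
from collections import defaultdict


def plan_training(series_by_id, is_periodic, assign_cluster):
    """Decide which model each series should contribute to.

    is_periodic(values) -> bool        (hypothetical stand-in for the FFT check)
    assign_cluster(values) -> label or None  (hypothetical stand-in for clustering)

    Returns a dict whose keys are "general" or ("cluster", label) and whose
    values are lists of series ids.
    """
    groups = defaultdict(list)
    for series_id, values in series_by_id.items():
        if not is_periodic(values):
            # Non-periodic series go to the general regression model.
            groups["general"].append(series_id)
            continue
        label = assign_cluster(values)
        if label is None:
            # Periodic but not placed in any cluster: general model again.
            groups["general"].append(series_id)
        else:
            groups[("cluster", label)].append(series_id)
    return dict(groups)


# Toy stand-ins, chosen only to exercise the three branches.
demo = {"a": [1, 1, 1, 1], "b": [0, 1, 0, 1], "c": [0, 5, 0, 5]}
demo_plan = plan_training(
    demo,
    is_periodic=lambda v: len(set(v)) > 1,
    assign_cluster=lambda v: 0 if max(v) <= 1 else None,
)
print(demo_plan)
```

Each key of the returned dict would then correspond to one regression model to train.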
Predicting Flow Chart
1. Acquire historical data.
2. Apply the FFT to the series and check whether it is periodic.
   • Not periodic: predict with a general regression model.
   • Periodic: check whether the series is classified into a cluster.
3. Classified?
   • Yes: predict with a cluster-specific regression model.
   • No: predict with a general regression model.
Periodicity of Time Series
The Fast Fourier Transform (FFT) transforms the time series data from the time domain to the frequency domain. The frequency-domain distribution is then analyzed to determine whether the series is periodic.

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}$$

Here x(n) is the n-th sample of a series of length N and X(k) is its k-th frequency coefficient.
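One way to turn the frequency-domain check into code is sketched below with NumPy. The slides do not say how the spectrum is analyzed, so the decision rule here (the strongest non-zero frequency must hold at least `power_ratio` of the total power) and the name `dominant_period` are assumptions made for illustration.

```python
import numpy as np


def dominant_period(values, power_ratio=0.3):
    """Return the dominant period in samples, or None if no clear period.

    Assumed rule (not from the slides): a series counts as periodic when a
    single frequency bin holds at least `power_ratio` of the spectral power.
    """
    x = np.asarray(values, dtype=float)
    x = x - x.mean()                      # remove the constant offset
    power = np.abs(np.fft.rfft(x)) ** 2   # one-sided power spectrum
    power[0] = 0.0                        # ignore any residual DC term
    total = power.sum()
    if total == 0.0:
        return None                       # constant series: nothing to detect
    k = int(np.argmax(power))
    if power[k] / total < power_ratio:
        return None                       # power is spread out: not periodic
    return len(x) / k                     # bin k means k cycles per window


t = np.arange(240)
daily = np.sin(2 * np.pi * t / 24)        # 10 days of hourly samples
noise = np.random.default_rng(0).normal(size=240)
print(dominant_period(daily), dominant_period(noise))
```

The threshold would need tuning on real KPI data; a fixed ratio is only the simplest possible choice.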
Basics of Signal Processing
[Figure: input time series, original series, and its frequency-domain representation]
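Similarity of Time Series
Use the DTW distance as the measure of similarity between time series. The pairwise DTW distances d_ij are mapped to similarities K_ij:

$$K_{ij} = e^{-d_{ij} / (2\sigma^2)}$$

For series indexed 0, …, n, the values form an (n+1) × (n+1) similarity matrix:

$$K = \begin{pmatrix} K_{00} & K_{01} & \cdots & K_{0,n-1} & K_{0n} \\ K_{10} & K_{11} & \cdots & K_{1,n-1} & K_{1n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ K_{n-1,0} & K_{n-1,1} & \cdots & K_{n-1,n-1} & K_{n-1,n} \\ K_{n0} & K_{n1} & \cdots & K_{n,n-1} & K_{nn} \end{pmatrix}$$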
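A minimal CPU sketch of the DTW distance and the resulting similarity matrix follows. It uses the textbook dynamic-programming recurrence with an absolute-difference local cost, which is an assumption since the slides do not specify the local cost; the function names and the default `sigma` are hypothetical as well.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference local cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i, j] = step + min(cost[i - 1, j],
                                    cost[i, j - 1],
                                    cost[i - 1, j - 1])
    return float(cost[n, m])


def similarity_matrix(series, sigma=1.0):
    """Map pairwise DTW distances d_ij to K_ij = exp(-d_ij / (2 * sigma**2))."""
    count = len(series)
    d = np.zeros((count, count))
    for i in range(count):
        for j in range(i + 1, count):
            d[i, j] = d[j, i] = dtw_distance(series[i], series[j])
    return np.exp(-d / (2.0 * sigma ** 2))


demo_series = [[0, 1, 2], [0, 1, 1, 2], [1, 1, 1]]
K = similarity_matrix(demo_series)
print(K)
```

Series of different lengths are allowed, which is one reason to prefer DTW over a pointwise distance; a matrix like `K` is the kind of input a similarity-based clustering step would consume.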
GPU-Accelerated FFT and DTW Distance Calculation
• Use cuFFT to accelerate the FFT calculation over massive time series data.
• Computing the Dynamic Time Warping (DTW) distance has high time and space complexity, so the parallel computing power of the GPU is used to accelerate it.
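The slides do not describe how the DTW computation is parallelized. One common approach, shown here only as an assumption, is the anti-diagonal (wavefront) order: all cells with the same i + j depend only on earlier diagonals, so they can be computed together. The sketch uses NumPy vectorization on the CPU as a stand-in for GPU threads, and `dtw_wavefront` is a hypothetical name.

```python
import numpy as np


def dtw_wavefront(a, b):
    """DTW filled one anti-diagonal at a time.

    Every cell on diagonal d = i + j reads only diagonals d - 1 and d - 2,
    so the whole diagonal is updated in a single vectorized step. On a GPU
    each cell of the diagonal could map to one thread.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for d in range(2, n + m + 1):
        # Valid cells satisfy 1 <= i <= n and 1 <= j = d - i <= m.
        i = np.arange(max(1, d - m), min(n, d - 1) + 1)
        j = d - i
        best = np.minimum(np.minimum(cost[i - 1, j], cost[i, j - 1]),
                          cost[i - 1, j - 1])
        cost[i, j] = np.abs(a[i - 1] - b[j - 1]) + best
    return float(cost[n, m])


print(dtw_wavefront([1, 3], [1, 2, 3]))
```

The result is the same as with the row-by-row recurrence; only the evaluation order changes, replacing n × m sequential steps with n + m − 1 parallel ones.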
Clustering results
Time Series Regression Model

| Model | Advantage | Disadvantage |
| --- | --- | --- |
| ARIMA | Simple | Low accuracy; Hyperparameter Optimization |
| LSTM | High accuracy | Complicated; Hyperparameter Optimization; Poor interpretability |
| XGBoost Regression Tree | High accuracy; Good interpretability | Complicated; Hyperparameter Optimization |
Regression Algorithm Result: XGBoost
[Figure: prediction result for the next 24 hours]
Regression Algorithm Accuracy

| Algorithm | RMSE | MAPE |
| --- | --- | --- |
| ARIMA | 70% < 5 | 70% < 0.5 |
| XGB Regression tree | 83% < 5 | 83% < 0.5 |
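For reference, the two error metrics can be computed as below. Reading each table entry as the share of series whose error falls under a threshold is an interpretation, not something the slide states, and `share_below` is a hypothetical helper added to illustrate that reading. The sample numbers are made up.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.5 means 50%).

    Undefined when `actual` contains zeros.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)))


def share_below(errors, threshold):
    """Fraction of per-series errors strictly below `threshold`."""
    return float(np.mean(np.asarray(errors, dtype=float) < threshold))


print(rmse([10, 20, 30], [12, 18, 33]))
print(mape([10, 20, 30], [12, 18, 33]))
print(share_below([1, 4, 6, 9], 5))
```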
Migration Downtime Prediction
Use an XGBoost classification tree to predict whether a VM is migration-sensitive.
Features
• Average vCPU utilization (one hour before migration)
• Fluctuation amplitude of vCPU utilization (one day before migration)
• VM instance type (number of vCPUs, memory size)
• …
Result
• Migration-insensitive VM (downtime <= 100 ms)
• Migration-sensitive VM (downtime > 100 ms)
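The listed features and the 100 ms label could be derived roughly as follows. This is a sketch under assumptions: the input is taken to be one utilization sample per minute, "fluctuation amplitude" is taken to mean max minus min, and all function and key names are hypothetical.

```python
import numpy as np

SENSITIVE_DOWNTIME_MS = 100.0  # threshold stated on the slide


def migration_features(cpu_util_per_minute, vcpus, memory_gib):
    """Build a feature dict from per-minute vCPU utilization (assumed sampling)."""
    util = np.asarray(cpu_util_per_minute, dtype=float)
    last_hour = util[-60:]      # 60 samples before migration
    last_day = util[-1440:]     # 24 * 60 samples before migration
    return {
        "avg_util_1h": float(last_hour.mean()),
        # Assumed definition of amplitude: range over the last day.
        "util_amplitude_1d": float(last_day.max() - last_day.min()),
        "vcpus": vcpus,
        "memory_gib": memory_gib,
    }


def is_migration_sensitive(downtime_ms):
    """Training label: sensitive when measured downtime exceeds 100 ms."""
    return downtime_ms > SENSITIVE_DOWNTIME_MS


demo_util = [10.0] * 1380 + [50.0] * 60   # quiet day, busy final hour
demo_features = migration_features(demo_util, vcpus=4, memory_gib=16)
print(demo_features, is_migration_sensitive(120.0))
```

Rows of such features, paired with the boolean label, would form the training set for the classifier.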
Migration Prediction Flow Chart
1. A classification algorithm predicts whether the VM is migration-sensitive.
   • Insensitive: migrate immediately.
   • Sensitive: continue with the steps below.
2. A regression algorithm predicts the load for the next 24 hours.
3. A classification algorithm predicts the nearest migration-insensitive window in the next 24 hours.
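The migration decision flow can be sketched as a short function. The names `plan_migration` and `is_sensitive_at` are hypothetical, the forecast is assumed to be one load value per hour, and the lambda in the demo is a toy stand-in for the classifier rather than the real model.

```python
def plan_migration(is_sensitive_now, forecast_load, is_sensitive_at):
    """Return how many hours to wait before migrating, or None.

    is_sensitive_now: classifier verdict for the VM's current state.
    forecast_load: predicted load for the coming hours (assumed hourly).
    is_sensitive_at(load) -> bool: classifier applied to a predicted load.
    """
    if not is_sensitive_now:
        return 0                      # insensitive: migrate immediately
    for hours_ahead, load in enumerate(forecast_load, start=1):
        if not is_sensitive_at(load):
            return hours_ahead        # nearest insensitive window
    return None                       # no suitable window in the forecast


toy_classifier = lambda load: load > 50   # stand-in, not the real model
print(plan_migration(True, [80, 75, 30, 20], toy_classifier))
```

Returning `None` leaves the caller to decide what to do when the whole forecast horizon looks sensitive, a case the slide does not cover.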
Classification Algorithm Accuracy: XGBoost
• Accuracy ≈ 70%
• Migration-sensitive recall: 76%
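For readers who want the definitions behind these two numbers, a minimal sketch is given below. The labels are made up for illustration and do not reproduce the reported results; 1 marks a migration-sensitive VM.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Accuracy over all samples and recall for the `positive` class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    positives = sum(t == positive for t in y_true)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    accuracy = correct / len(y_true)
    # Recall: share of truly sensitive VMs that the model flagged.
    recall = true_pos / positives if positives else float("nan")
    return accuracy, recall


toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(accuracy_and_recall(toy_true, toy_pred))
```

Recall on the sensitive class matters here because a missed sensitive VM is the one that gets migrated at a bad moment.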
Classification Algorithm Performance: XGBoost
• Latency: 10 ms versus 25 ms, a 60% drop
• Throughput: 20x speed-up
• GPU: 8 × NVIDIA Tesla P100
• CPU: 2-socket Intel Xeon E5-2682 v4 (Broadwell)
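The 60% figure is consistent with the two latencies on the slide if 25 ms is read as the CPU latency and 10 ms as the GPU latency; that assignment is inferred from the arithmetic, not stated. A quick check, with `relative_drop` as a hypothetical helper:

```python
def relative_drop(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


cpu_latency_ms = 25.0   # assumed to be the CPU figure
gpu_latency_ms = 10.0   # assumed to be the GPU figure
print(relative_drop(cpu_latency_ms, gpu_latency_ms))
```

The 20x throughput gain cannot be derived from these latencies alone, since it also depends on batching and on using eight GPUs.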
Questions?