
Accelerated Machine Learning Feng Xie (stephen.xf@alibaba-inc.com)



  1. Intelligent Operation and Maintenance of Public Cloud Based on GPU-Accelerated Machine Learning
     Feng Xie (stephen.xf@alibaba-inc.com), Zhaowei Ouyang (zhaowei.oyzw@alibaba-inc.com)

  2. AIOps on Public Cloud
     Scenario: Scheduling; Maintenance and upgrading; Portrait; Product recommendations
     Case: Resource arrangement; Outage/failure prediction; Analysis of purchasing behavior; VM portrait; Power management; Customer portrait; Anomaly detection; Resource demand analysis; Load balance; Migration downtime prediction; Cluster health portrait; …
     Algorithm: Regression; Time series; Classification; Clustering; …
     Platform: MaxCompute; HybridDB for MySQL; Blink; Rapids; SLS; Dask
     Data: Node data (KPI, Abnor., Power); VM data (KPI, Abnor., Event); Cluster data (KPI, Abnor., Event); IDC data (Power, Rack data)

  3. Machine Learning Platform Architecture
     Components: Web Server; Message Queue; Dask Client; Scheduler; Dask Workers (Data Prepare, Train, Predict); Model Repository (OSS); Redis
     Offline data: MaxCompute
     Online data: SLS / HybridDB for MySQL / Blink

  4. KPI Prediction
     KPIs: CPU; Network; Storage
     Applications: Load balancing; Resource scheduling; Traffic warning; Anomaly detection

  5. CPU Load Time Series: Periodicity and Similarity

  6. Training Flow Chart
     Acquire training data → FFT → Is periodic?
       No (is not periodic): train a general regression model
       Yes: Clustering → Clustering?
         Yes: train a cluster-specific regression model
         No: train a general regression model

  7. Predicting Flow Chart
     Acquire historical data → FFT → Is periodic?
       No (is not periodic): predict with a general regression model
       Yes: Classified?
         Yes: predict with a cluster-specific regression model
         No: predict with a general regression model
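The two flow charts share the same routing decision, which can be sketched in a few lines. This is only an illustration of the branching shown on the slides: the function name `choose_model`, the string labels, and the use of `None` for "no cluster assigned" are assumptions, not names from the presentation.

```python
from typing import Optional


def choose_model(is_periodic: bool, cluster_id: Optional[int]) -> str:
    """Pick the regression model that the flow charts route a series to.

    is_periodic: outcome of the FFT-based periodicity check.
    cluster_id:  cluster the series was assigned to, or None if it could not
                 be clustered/classified (hypothetical encoding).
    """
    if not is_periodic:
        # Non-periodic series always use the general model.
        return "general"
    if cluster_id is None:
        # Periodic, but no cluster matched: fall back to the general model.
        return "general"
    # Periodic and clustered: use that cluster's own model.
    return f"cluster-{cluster_id}"


if __name__ == "__main__":
    print(choose_model(is_periodic=True, cluster_id=2))
```

The same function applies at training time (which model to fit) and at prediction time (which model to query), which is why both charts have the same shape.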

  8. Periodicity of Time Series
     The Fast Fourier Transform (FFT) is used to transform the time series data from the time domain to the frequency domain. The frequency domain distribution is analyzed to determine whether it is periodic or not.
     $X(k) = \sum_{n=0}^{N-1} x(n) \cdot e^{-i 2\pi k n / N}$
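A minimal sketch of a frequency-domain periodicity check follows. It evaluates the discrete Fourier transform directly in pure Python (O(N²)) rather than with an FFT library, and it assumes one possible decision rule: a series counts as periodic when a single frequency bin holds at least `power_ratio` of the non-DC spectral power. The function names and the 0.5 default are illustrative assumptions; the slides do not state the rule that was actually used.

```python
import cmath
import math


def dft_magnitudes(series):
    """|X(k)| for k = 0 .. N//2, computed with the naive DFT sum."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]  # remove the constant (DC) offset
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(centered))
        mags.append(abs(acc))
    return mags


def dominant_period(series, power_ratio=0.5):
    """Return the dominant period in samples, or None if no bin dominates.

    power_ratio is a hypothetical threshold, not a value from the slides.
    """
    mags = dft_magnitudes(series)
    power = [m * m for m in mags[1:]]  # skip k = 0
    total = sum(power)
    if total == 0:
        return None  # constant series: nothing to detect
    best = max(range(len(power)), key=power.__getitem__)
    if power[best] / total < power_ratio:
        return None
    return len(series) / (best + 1)  # bin k corresponds to period N / k


if __name__ == "__main__":
    # Four days of hourly samples with a daily cycle.
    daily = [math.sin(2 * math.pi * t / 24) for t in range(96)]
    print(dominant_period(daily))
```

For large fleets the direct sum would be replaced by an FFT implementation; the decision logic on top of the spectrum stays the same.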

  9. Basics of Signal Processing
     [Figure: input time series; original series; frequency domain]

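  10. Similarity of Time Series
      Use DTW distance as a measure of similarity between time series, with $d_{ij}$ the DTW distance between series $i$ and $j$:
      $K_{ij} = e^{-d_{ij} / (2\sigma^2)}$
      $D = \begin{pmatrix} 0 & d_{01} & \cdots & d_{0,n-1} & d_{0n} \\ d_{10} & 0 & \cdots & d_{1,n-1} & d_{1n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ d_{n-1,0} & d_{n-1,1} & \cdots & 0 & d_{n-1,n} \\ d_{n0} & d_{n1} & \cdots & d_{n,n-1} & 0 \end{pmatrix}$
      $K = \begin{pmatrix} 1 & K_{01} & \cdots & K_{0,n-1} & K_{0n} \\ K_{10} & 1 & \cdots & K_{1,n-1} & K_{1n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ K_{n-1,0} & K_{n-1,1} & \cdots & 1 & K_{n-1,n} \\ K_{n0} & K_{n1} & \cdots & K_{n,n-1} & 1 \end{pmatrix}$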
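As an illustration of the similarity measure, the sketch below computes the textbook dynamic-programming DTW distance and turns pairwise distances into similarities with the exponential kernel from this slide. The local cost `|x - y|`, the function names, and the default `sigma=1.0` are assumptions; this is a plain CPU version, not the GPU implementation the presentation refers to.

```python
import math


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with |x - y| as the local cost."""
    m = len(b)
    inf = float("inf")
    # prev[j] holds the best cumulative cost for the previous row of the table.
    prev = [0.0] + [inf] * m
    for x in a:
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(x - b[j - 1])
            # Extend the cheapest of: insertion, deletion, match.
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]


def similarity_matrix(series_list, sigma=1.0):
    """K[i][j] = exp(-d_ij / (2 * sigma**2)), with d_ij the DTW distance."""
    k = len(series_list)
    sim = [[1.0] * k for _ in range(k)]  # d_ii = 0, so the diagonal is 1
    for i in range(k):
        for j in range(i + 1, k):
            d = dtw_distance(series_list[i], series_list[j])
            sim[i][j] = sim[j][i] = math.exp(-d / (2 * sigma ** 2))
    return sim


if __name__ == "__main__":
    for row in similarity_matrix([[1, 2, 3], [1, 2, 2, 3], [0, 0, 0]]):
        print(row)
```

A matrix like this can be fed to any clustering method that accepts a precomputed similarity; which method the authors used is not stated on the slides.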

  11. GPU-Accelerated FFT and DTW Distance Calculation
      Use cuFFT to accelerate the FFT calculation of massive time series data.
      Calculating the Dynamic Time Warping (DTW) distance has high time and space complexity, so the parallel computing power of the GPU is used to accelerate it.

  12. Clustering results

  13. Time Series Regression Model
      Model | Advantage | Disadvantage
      ARIMA | Simple hyperparameter optimization | Low accuracy
      LSTM | High accuracy | Complicated hyperparameter optimization; poor interpretability
      XGBoost regression tree | High accuracy; good interpretability | Complicated hyperparameter optimization

  14. Regression Algorithm Result: XGBoost
      [Figure: prediction result for the next 24 hours]

  15. Regression Algorithm Accuracy
      Algorithm | RMSE | MAPE
      ARIMA | 70% < 5 | 70% < 0.5
      XGB regression tree | 83% < 5 | 83% < 0.5
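The two error metrics in this table can be computed as below. The sketch also includes a helper for reading entries such as "83% < 5" as the share of series whose error falls under a threshold; that reading is an assumption about how the table was built, and the helper name `share_below` is hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.25 means 25%).

    Assumes no actual value is zero.
    """
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def share_below(values, threshold):
    """Fraction of per-series error values strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)


if __name__ == "__main__":
    actual = [10.0, 20.0]
    predicted = [13.0, 16.0]
    print(rmse(actual, predicted), mape(actual, predicted))
```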

  16. Migration Downtime Prediction
      Use an XGBoost classification tree to predict whether a VM is migration-sensitive.
      Features
      • Average vCPU utilization (1 hour before migration)
      • Fluctuation amplitude of vCPU utilization (1 day before migration)
      • VM instance type (number of vCPUs and amount of memory)
      • ……
      Result
      • Migration-insensitive VM (downtime <= 100 ms)
      • Migration-sensitive VM (downtime > 100 ms)

  17. Migration Prediction Flow Chart
      Classification algorithm: is the VM migration-sensitive?
        Insensitive: migrate immediately
        Sensitive: Regression algorithm (predict the next 24 hours of load) → Classification algorithm (predict the nearest migration-insensitive window in the next 24 hours)
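The decision flow above can be sketched as a small function. Both models are passed in as stand-ins: `forecast` plays the role of the regression output for the next hours, and `is_sensitive` plays the role of the classifier. In the usage example the classifier is replaced by a toy load threshold purely for illustration; the function name, the argument names, and the threshold of 60 are all hypothetical and do not come from the slides.

```python
from typing import Callable, Optional, Sequence


def hours_until_migration(
    current_load: float,
    forecast: Sequence[float],
    is_sensitive: Callable[[float], bool],
) -> Optional[int]:
    """Return how many hours to wait before migrating.

    0    -> the VM is migration-insensitive now: migrate immediately.
    h>0  -> earliest forecast hour classified as migration-insensitive
            (forecast[h - 1] is the predicted load h hours ahead).
    None -> no insensitive window inside the forecast horizon.
    """
    if not is_sensitive(current_load):
        return 0
    for offset, load in enumerate(forecast, start=1):
        if not is_sensitive(load):
            return offset
    return None


if __name__ == "__main__":
    # Toy stand-in for the classifier: "sensitive" when load exceeds 60.
    toy_rule = lambda load: load > 60
    print(hours_until_migration(80, [75, 70, 40, 90], toy_rule))
```

In a real deployment the classifier would take the full feature set (utilization statistics, instance type, and so on) rather than a single load value.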

  18. Classification Algorithm Accuracy: XGBoost
      Accuracy ≈ 70%
      Migration-sensitive recall: 76%

  19. Classification Algorithm Performance: XGBoost
      Latency: 25 ms (CPU) → 10 ms (GPU), a 60% drop
      Throughput: 20x speed-up
      GPU: 8 × NVIDIA Tesla P100
      CPU: 2-socket Intel Xeon E5-2682 v4 (Broadwell)

  20. Questions?
