Automating Operations with Machine Learning
Matt Callanan, Senior Software Development Engineer, Expedia Group
Willie Wheeler, Principal Applications Engineer, Expedia Group
• Too many alerts
• Too many false alerts
• Alerts not actionable
• Diagnosis takes too long
• Remediation takes too long
Can automation and machine learning help solve these problems?
Overview
• Automating Operations
• Machine Learning Overview
• Automating Ops with ML
Automating Operations
[Diagram: the operations loop. System healthy → System unhealthy → Detect → Diagnose → Remediate → System healthy]
[Diagram: sources of change. Delivery pipeline: Develop → Build → Deploy → Monitor → Rollback. Other change sources: Firewalls, Cloud Services, Feature Switch, Queues, Network, Databases, Config Change, Infrastructure]
[Diagram: the change sources and delivery pipeline overlaid with the operations loop (System healthy → System unhealthy → Detect → Diagnose → Remediate)]
How does it work?
[Diagram: components of the automated loop: Apps, Metrics, Logging, Tracing, Anomaly Detection, Alert Manager, Diagnostics, Treatment, Remediation, Deployment, Bot]
Machine Learning Overview
Machine learning systems perform tasks by learning from data, instead of requiring explicit programming.
Traditional programming: Inputs + Program → Computer → Outputs
Machine learning: Inputs + Outputs → Computer → Program
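To make the contrast concrete, here is a minimal sketch in Python. The function names and the toy data are illustrative assumptions, not part of the talk: one function is a rule written by hand, the other derives an equivalent rule from example input/output pairs using ordinary least squares.

```python
from typing import Callable, List, Tuple


def handwritten_program(x: float) -> float:
    """Traditional programming: a human writes the rule explicitly."""
    return 2.0 * x + 1.0


def learn_program(examples: List[Tuple[float, float]]) -> Callable[[float], float]:
    """Machine learning: derive the rule from (input, output) pairs.

    Fits a line y = slope * x + intercept by ordinary least squares.
    """
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in examples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Toy observations (hypothetical): inputs paired with the outputs we saw.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
learned = learn_program(examples)
```

Here `learned` behaves like `handwritten_program` even though nobody wrote the rule down; the computer produced the program from the data.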
There are many approaches to ML.
Convolutional neural network: image → “Dog” (image: https://www.flickr.com/photos/a_peach/8631368705)
Recurrent neural network: “Hay dos estudiantes en la cocina.” → “There are two students in the kitchen.”
[Diagram: ML workflow. Data Ingestion → Pre-Processing → Data Analysis → Training Data → Model Training → Model Repo → Executable Model → ML-enabled Application]
Automating Ops with ML
1. Anomalies and Anomaly Detection
2. ML Ops
3. Our Approach
4. Situation Diagnostics
Anomalies and Anomaly Detection
An anomaly is an unusual data point.
Anomalies signal a change. We want to detect them.
Time Series Anomaly Detection with Adaptive Alerting
Adaptive Alerting (AA) minimises Mean Time To Detect (MTTD) by performing anomaly detection on streaming time series data. AA supports both classical and ML-based time series anomaly detection algorithms.
• Constant threshold
• STL regression
• Recurrent neural network
• Holt-Winters
• Custom regression
• AWS random cut forest
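[Diagram: classification bands around the prediction made at T-1. The actual value is classified as Normal, Weak Anomaly, or Strong Anomaly depending on how far it falls above or below the prediction.]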
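The banding idea can be sketched as a small function. This is an illustration only: the band widths are hypothetical parameters, and the label strings `"NORMAL"` and `"STRONG"` are assumed by analogy with the `"WEAK"` value that appears in the anomaly messages; Adaptive Alerting's actual detectors may compute the bands differently.

```python
def classify(actual: float, predicted: float,
             weak_width: float, strong_width: float) -> str:
    """Classify a point by its distance from the prediction.

    weak_width and strong_width are the half-widths of the inner and
    outer bands around the prediction (assumed symmetric here).
    """
    if weak_width < 0 or strong_width < weak_width:
        raise ValueError("expected 0 <= weak_width <= strong_width")
    deviation = abs(actual - predicted)
    if deviation > strong_width:
        return "STRONG"
    if deviation > weak_width:
        return "WEAK"
    return "NORMAL"
```

For example, with a prediction of 100, a weak width of 10, and a strong width of 20, an actual value of 115 falls between the two bands and is a weak anomaly.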
metrics (Kafka) → Anomaly detectors (Kafka streams) → anomalies (Kafka)
Input (metrics):
    { "metric": "details-tp99", "timestamp": 1549932163, "value": 150 }
Output (anomalies):
    { "metric": "details-tp99", "timestamp": 1549932163, "value": 150, "anomaly": "WEAK" }
[Diagram: Training data (S3) → ML training → ML detectors, feeding the same flow: metrics (Kafka) → Anomaly detectors (Kafka streams) → anomalies (Kafka)]
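The transformation from a metric message to an anomaly message can be sketched as follows. This is a minimal illustration that leaves Kafka out entirely and uses a constant-threshold detector; the class name, the threshold values (120 and 200), and the `"NORMAL"`/`"STRONG"` labels are assumptions made for the example, not Adaptive Alerting's actual API.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstantThresholdDetector:
    """Hypothetical detector with fixed upper thresholds."""
    weak_upper: float
    strong_upper: float

    def classify(self, value: float) -> str:
        if value >= self.strong_upper:
            return "STRONG"
        if value >= self.weak_upper:
            return "WEAK"
        return "NORMAL"


def enrich(message: str, detector: ConstantThresholdDetector) -> str:
    """Parse a metric message and attach the detector's classification."""
    metric_point = json.loads(message)
    metric_point["anomaly"] = detector.classify(metric_point["value"])
    return json.dumps(metric_point)


# Thresholds are made up for illustration.
detector = ConstantThresholdDetector(weak_upper=120, strong_upper=200)
incoming = '{"metric": "details-tp99", "timestamp": 1549932163, "value": 150}'
outgoing = enrich(incoming, detector)
```

In a streaming deployment the same `enrich` step would sit between the input and output topics; only the transport changes.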
ML Ops
• Data Scientists and Engineers tend to operate in siloes
• Models can take many months to deploy
• Training and retraining consistency issues
• Data Scientists:
  • Experiment Tracking for Training and hyperparameter optimisation (HPO)
• Engineers:
  • Scalability, Robustness, Repeatability
  • Reliable Deployment of Models
[Diagram: Kubeflow / Kubeflow Pipelines on Kubernetes. Pipeline steps, each a Container Op: ingest, preprocess, analyze, train, deploy. The steps use a Data Store (e.g. S3) and deploy to a Non-Kubernetes Runtime (Web Service, Streaming).]
Adaptive Alerting ML Training: Repositories and Docker Image Stack
Repositories:
• aa-ct-trainer: Python code, Dockerfile, “*_starter.py” scripts (ingest_starter.py, preprocess_starter.py, analyze_starter.py, train_starter.py, deploy_starter.py), and “create_pipeline.py” under /pipeline
• aa-ml-training-core: Python library “aa-ml-training-core” (pandas, scipy, matplotlib, boto3, etc.)
• aa-ml-training-pipelines: Python library “aa-ml-training-pipelines” (Kubeflow Pipelines SDK)
• aa-ml-training-docker: Dockerfiles for aa-ml-training-pipelines-base, aa-ml-training-app-base, and aa-ml-training-miniconda3
Docker image stack (base first):
• Alpine Linux
• aa-ml-training-miniconda3
• aa-ml-training-app-base and aa-ml-training-pipelines-base, with Python libraries (pandas, scipy, matplotlib, boto3, etc.)
• “aa-ct-trainer” image (/aa-ct-trainer → /app)
Our Approach
• Anomaly Detection
  • Building high-volume system
  • Several thousand auto-generated CT detectors for 5xx
  • Fine-tuned ML models
  • Exploring additional Deep Learning and LSTM models
• Diagnostics
  • Automated runbook diagnostics
  • Exploring ML for situation diagnostics
• Auto-Remediation
  • Automated rollback hints
  • Piloting full auto-rollback remediation
• Ingest at least 4 weeks of data
• Remove outliers (e.g. Hampel filtering)
• Interpolate missing data
• Analyse data set (exploratory data analysis, EDA)
• Visualise with scatter plots and autocorrelation function (ACF) plots
• Perform tests, e.g. adfuller (augmented Dickey-Fuller stationarity test)
• Record properties of data set (e.g. stationary vs single/dual seasonality)
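The outlier-removal and interpolation steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the talk's actual implementation: the window size and the 3-sigma cut-off are common defaults, the function names are made up, and flagged outliers are turned into gaps so that the interpolation step fills them.

```python
from statistics import median
from typing import List, Optional, Sequence


def hampel_remove(values: Sequence[Optional[float]],
                  half_window: int = 3,
                  n_sigmas: float = 3.0) -> List[Optional[float]]:
    """Return a copy with Hampel-flagged outliers replaced by None.

    A point is flagged when it lies more than n_sigmas robust standard
    deviations (1.4826 * MAD) from the median of its sliding window.
    """
    cleaned = list(values)
    for i, x in enumerate(values):
        if x is None:
            continue
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = [v for v in values[lo:hi] if v is not None]
        med = median(window)
        mad = median([abs(v - med) for v in window])
        if abs(x - med) > n_sigmas * 1.4826 * mad:
            cleaned[i] = None
    return cleaned


def interpolate_linear(values: Sequence[Optional[float]]) -> List[float]:
    """Fill None gaps linearly; gaps at the edges take the nearest value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


# Toy series (hypothetical) with one obvious spike at index 5.
raw = [10, 11, 10, 12, 11, 100, 10, 11, 12, 10, 11]
cleaned = hampel_remove(raw)
filled = interpolate_linear(cleaned)
```

On the toy series, only the spike is flagged, and interpolation replaces it with the midpoint of its neighbours.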
• Based on profile properties, select algorithms to explore
• For each algorithm:
  • Explore hyperparameter space (HPO)
  • For each HP combination, find the best fit
  • Select model with best set of HPs for algorithm
• Compare scores of each algorithm
• Select the best model from the best algorithm
• Ensemble detectors could serve multiple models
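The selection loop above can be sketched as a small grid search. This is an illustration under assumptions: the two toy forecasters (exponentially weighted moving average and simple moving average), the hyperparameter grids, and mean squared one-step-ahead error as the score are all stand-ins chosen for brevity, not the algorithms or metric used in the talk. For simplicity it also scores on the same series it fits; a real run would score on held-out data.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def ewma_forecasts(series: Sequence[float], alpha: float) -> List[float]:
    """One-step-ahead forecasts: the forecast for t is the level through t-1."""
    level = series[0]
    forecasts = []
    for x in series[1:]:
        forecasts.append(level)
        level = alpha * x + (1 - alpha) * level
    return forecasts


def moving_average_forecasts(series: Sequence[float], window: int) -> List[float]:
    """One-step-ahead forecasts: mean of the last `window` observations."""
    forecasts = []
    for t in range(1, len(series)):
        history = series[max(0, t - window):t]
        forecasts.append(sum(history) / len(history))
    return forecasts


def mse(series: Sequence[float], forecasts: Sequence[float]) -> float:
    """Mean squared error of forecasts against series[1:]."""
    actuals = series[1:]
    return sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / len(actuals)


# Hypothetical search space: algorithm name -> (forecaster, HP grid).
SEARCH_SPACE: Dict[str, Tuple[Callable, list]] = {
    "ewma": (ewma_forecasts, [0.1, 0.3, 0.5, 0.9]),
    "moving_average": (moving_average_forecasts, [1, 3, 5]),
}


def select_model(series: Sequence[float]) -> Tuple[str, float, float]:
    """Return (algorithm, best hyperparameter, score) with the lowest MSE."""
    best_per_algorithm = []
    for name, (forecaster, grid) in SEARCH_SPACE.items():
        # Best hyperparameter value for this algorithm.
        score, hp = min((mse(series, forecaster(series, hp)), hp) for hp in grid)
        best_per_algorithm.append((score, name, hp))
    # Best algorithm overall.
    score, name, hp = min(best_per_algorithm)
    return name, hp, score


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
best_name, best_hp, best_score = select_model(series)
```

On this steadily rising toy series the forecaster that tracks the most recent value wins, which is the kind of outcome the data profile from the previous step is meant to anticipate.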
• When a detector is created, a schedule is also created, e.g. daily or weekly
• The scheduler initiates the Data Prep, Training, and Deployment tasks
• Good for fast anomaly detection
• Detect mean shift over period of time
• Less volatile anomaly classification
• Treat first 3 weeks as “training data”
• Treat final 7 days as “test data”
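A time-ordered split like this can be sketched in a few lines. The function name and the `(timestamp, value)` tuple layout are assumptions for illustration; the key point is that the split is by time rather than random, so the test data always follows the training data.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Sequence, Tuple

Point = Tuple[datetime, float]


def split_train_test(points: Sequence[Point],
                     test_days: int = 7) -> Tuple[List[Point], List[Point]]:
    """Put the final `test_days` days in the test set, the rest in training."""
    if not points:
        raise ValueError("no data points to split")
    ordered = sorted(points, key=lambda p: p[0])
    cutoff = ordered[-1][0] - timedelta(days=test_days)
    train = [p for p in ordered if p[0] <= cutoff]
    test = [p for p in ordered if p[0] > cutoff]
    return train, test


# Four weeks of hypothetical daily points.
start = datetime(2019, 1, 1, tzinfo=timezone.utc)
points = [(start + timedelta(days=d), float(d)) for d in range(28)]
train, test = split_train_test(points)
```

With four weeks of daily data this yields 21 training points and 7 test points.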
Situation Diagnostics
Time series anomaly detection generates a large volume of individually unactionable anomalies. We need a small number of actionable anomalies.
ExpediaDotCom/haystack
Haystack is Expedia Group’s OSS distributed tracing product. Key features:
• Dynamic service graph
• Distributed tracing
• State snapshotting
• Anomaly detection on telemetry
• Trace telemetry
[Figure: Haystack service graph]
Haystack data creates new automation opportunities:
• Incident classification
• Incident diagnostics
• Incident prediction
Can machine learning help?
A graph network works with graph-structured inputs and outputs. It is similar to a convolutional network, but can learn over an arbitrary graph topology (not just a grid).
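As a rough intuition for how information flows over a graph rather than a grid, the sketch below runs one round of neighbour aggregation on a small call graph. This is only the message-passing idea that graph networks build on, with a fixed averaging rule; a real graph network uses learned update functions. The service names and error-rate features are hypothetical.

```python
from typing import Dict, List, Tuple


def message_passing_step(features: Dict[str, float],
                         edges: List[Tuple[str, str]]) -> Dict[str, float]:
    """One aggregation round over (caller, callee) edges.

    Each node's new feature is the mean of its own feature and the
    features of the services it calls.
    """
    callees: Dict[str, List[str]] = {node: [] for node in features}
    for caller, callee in edges:
        callees[caller].append(callee)
    updated = {}
    for node, own in features.items():
        inputs = [own] + [features[c] for c in callees[node]]
        updated[node] = sum(inputs) / len(inputs)
    return updated


# Hypothetical call graph: web -> booking -> {geo, payments}.
edges = [("web", "booking"), ("booking", "geo"), ("booking", "payments")]
# Hypothetical per-service error rates; only "geo" is unhealthy.
error_rate = {"web": 0.0, "booking": 0.0, "geo": 0.9, "payments": 0.0}

after_one = message_passing_step(error_rate, edges)
after_two = message_passing_step(after_one, edges)
```

After one round the problem in `geo` shows up at its caller `booking`; after two rounds it reaches `web`. Stacking rounds is what lets a model relate a symptom at the edge of the graph to a cause deeper in it.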
Use cases
• Incident classification: Graph net + classifier → “Hotel bookings drop”
• Incident diagnostics: Graph net → “Geo Service: bad deployment”
• Incident prediction: Graph net → “ALERT: Hotel bookings drop in 8m”
Summary
Close Your Loops
• Open loops are costly and error prone
Automate to Reduce MTTK, MTTD, MTTR
• Outage Detection, Diagnosis, and Remediation can be automated to create closed loop systems
Machine Learning
• ML helps automate detection and diagnosis by converting historical observations into predictions and classifications
MLOps
• Break down the siloes between Data Science and Engineering
Graph Networks
• What story is your call graph telling you?
Ops automation
• Closing Loops and Opening Minds (AWS re:Invent): https://www.youtube.com/watch?v=O8xLxNje30M
KubeFlow
• http://kubeflow.org
Forecasting & anomaly detection
• Forecasting: Principles and Practice: https://otexts.com/fpp2/
• Outlier Detection: A Survey: https://web.cs.hacettepe.edu.tr/~aykut/classes/spring2013/bil682/supplemental/Outlier_Detection_A_Survey.pdf
Machine learning
• Awesome Public Datasets: https://github.com/awesomedata/awesome-public-datasets
• Awesome TensorFlow: https://github.com/jtoy/awesome-tensorflow
Graph networks
• Relational inductive biases, deep learning, and graph networks: https://arxiv.org/abs/1806.01261
• A comprehensive survey on graph neural networks: https://arxiv.org/abs/1901.00596
• Graph Nets: https://github.com/deepmind/graph_nets
Thanks!
Matt Callanan: @mcallana
Willie Wheeler: @williewheeler
ExpediaDotCom/adaptive-alerting
ExpediaDotCom/haystack
Hiring: www.lifeatexpedia.com