Automating Operations with Machine Learning
Matt Callanan, Senior Software Development Engineer, Expedia Group
Willie Wheeler, Principal Applications Engineer, Expedia Group

• Too many alerts
• Too many false alerts
• Alerts not actionable
• Diagnosis takes too long
• Remediation takes too long

Can automation and machine learning help solve these problems?

Overview
• Automating Operations
• Machine Learning Overview
• Automating Ops with ML

Automating Operations

[Figure: operations loop. System healthy → System unhealthy → Detect → Diagnose → Remediate → System healthy]

[Figure: sources of change (Firewalls, Cloud Services, Feature Switch, Queues, Network, Databases, Config Change, Infrastructure) around a delivery pipeline (Develop, Build, Deploy, Monitor, Rollback)]

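[Figure: sources of change (Firewalls, Cloud Services, Feature Switch, Queues, Network, Databases, Config Change, Infrastructure) and delivery pipeline (Develop, Build, Deploy, Monitor, Rollback), overlaid with the operations loop (System healthy, System unhealthy, Detect, Diagnose, Remediate)]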

How does it work?

[Figure: system components: Apps, Metrics, Logging, Tracing, Anomaly Detection, Alert Manager, Diagnostics, Remediation, Treatment, Deployment, Bot]

Machine Learning Overview

Machine learning systems perform tasks by learning from data, instead of requiring explicit programming.

Traditional programming: Inputs + Program → Computer → Outputs
Machine learning: Inputs + Outputs → Computer → Program
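The contrast above can be made concrete with a toy example. This is only a minimal sketch: the latency values, the labels, and the midpoint rule are all assumptions chosen for illustration, not anything from the talk.

```python
from typing import List


# Traditional programming: a person writes the rule (the "program") by hand.
def is_slow_handwritten(latency_ms: float) -> bool:
    return latency_ms > 200.0  # hypothetical threshold picked by a human


# Machine learning: inputs and known outputs go in, and the rule comes out.
def learn_threshold(latencies: List[float], labels: List[bool]) -> float:
    """Learn a cut-off as the midpoint between the slowest 'normal'
    example and the fastest 'slow' example."""
    normal = [x for x, is_slow in zip(latencies, labels) if not is_slow]
    slow = [x for x, is_slow in zip(latencies, labels) if is_slow]
    return (max(normal) + min(slow)) / 2.0


# Hypothetical training data: latencies (inputs) with labels (outputs).
latencies = [90.0, 110.0, 130.0, 300.0, 340.0]
labels = [False, False, False, True, True]
threshold = learn_threshold(latencies, labels)  # (130 + 300) / 2 = 215.0


def is_slow_learned(latency_ms: float) -> bool:
    return latency_ms > threshold
```

The learned rule has the same shape as the handwritten one, but its threshold came from data, so retraining on new data changes the behaviour without changing the code.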

There are many approaches to ML.

Convolutional neural network: image → “Dog” (image: https://www.flickr.com/photos/a_peach/8631368705)
Recurrent neural network: “Hay dos estudiantes en la cocina.” → “There are two students in the kitchen.”

[Figure: ML pipeline: Data Ingestion, Pre-Processing, Data Analysis, Training Data, Model Training, Model Repo, Executable Model, ML-enabled Application]

Automating Ops with ML

1. Anomalies and Anomaly Detection
2. ML Ops
3. Our Approach
4. Situation Diagnostics

Anomalies and Anomaly Detection

An anomaly is an unusual data point.

Anomalies signal a change. We want to detect them.

Time Series Anomaly Detection with Adaptive Alerting

Adaptive Alerting (AA) minimises Mean Time To Detect (MTTD) by performing anomaly detection on streaming time series data. It supports both classical and ML-based time series anomaly detection algorithms.

• Constant threshold
• STL regression
• Recurrent neural network
• Holt-Winters
• Custom regression
• AWS random cut forest

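[Figure: anomaly bands around the prediction. An actual value close to the prediction is Normal; further away, above or below, it is a Weak Anomaly; beyond that, a Strong Anomaly. Labels: prediction, actual, T-1]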
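The banding in the figure can be sketched as a small classifier. This is an illustrative sketch only: the band widths are hypothetical parameters, and the "NORMAL" and "STRONG" label strings are assumed by analogy with the "WEAK" label that appears in the deck's sample message.

```python
def classify(actual: float, predicted: float,
             weak_width: float, strong_width: float) -> str:
    """Classify a point by how far the actual value is from the prediction.

    weak_width and strong_width are hypothetical band half-widths,
    with weak_width < strong_width.
    """
    deviation = abs(actual - predicted)
    if deviation > strong_width:
        return "STRONG"
    if deviation > weak_width:
        return "WEAK"
    return "NORMAL"


# A value 50 away from the prediction falls between the two bands.
level = classify(actual=150.0, predicted=100.0,
                 weak_width=30.0, strong_width=60.0)
```

Because the bands are symmetric here, a value far below the prediction is treated the same as one far above it; a real detector may want separate upper and lower widths.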

metrics (Kafka) → Anomaly detectors (Kafka Streams) → anomalies (Kafka)
Input message (metrics): { "metric": "details-tp99", "timestamp": 1549932163, "value": 150 }
Output message (anomalies): { "metric": "details-tp99", "timestamp": 1549932163, "value": 150, "anomaly": "WEAK" }
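The message flow above amounts to enriching each metric message with an anomaly level. The sketch below shows only that transformation, with no Kafka involved; the function name, the constant-threshold rule, and the threshold values are assumptions made for illustration.

```python
import json


def to_anomaly_message(metric_json: str,
                       weak_above: float, strong_above: float) -> str:
    """Copy the metric message and add an "anomaly" field.

    Uses a simple constant-threshold rule; weak_above and strong_above
    are hypothetical thresholds, not values from the talk.
    """
    message = json.loads(metric_json)
    value = message["value"]
    if value > strong_above:
        level = "STRONG"
    elif value > weak_above:
        level = "WEAK"
    else:
        level = "NORMAL"
    return json.dumps({**message, "anomaly": level})


metric = '{"metric": "details-tp99", "timestamp": 1549932163, "value": 150}'
anomaly = to_anomaly_message(metric, weak_above=100, strong_above=200)
```

Keeping the original fields intact means downstream consumers can treat the anomalies topic as a superset of the metrics topic.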

[Figure: Training data (S3) → ML training → ML detectors, alongside the streaming path: metrics (Kafka) → Anomaly detectors (Kafka Streams) → anomalies (Kafka)]

ML Ops

• Data Scientists and Engineers tend to operate in siloes
• Models can take many months to deploy
• Training and retraining consistency issues

• Data Scientists:
  • Experiment Tracking for Training and HPO
• Engineers:
  • Scalability, Robustness, Repeatability
  • Reliable Deployment of Models

[Figure: Kubeflow / Kubeflow Pipelines on Kubernetes. A Data Store (e.g. S3) feeds the pipeline stages ingest, preprocess, analyze, train, and deploy, each run as a Container Op; the result is deployed to a Non-Kubernetes Runtime (Web Service, Streaming)]

[Figure: Adaptive Alerting ML Training: repositories and Docker image stack]
Repositories:
• aa-ml-training-core: Python library “aa-ml-training-core”
• aa-ml-training-pipelines: Python library “aa-ml-training-pipelines” (Kubeflow Pipelines SDK)
• aa-ml-training-docker: Dockerfiles for aa-ml-training-miniconda3, aa-ml-training-app-base, aa-ml-training-pipelines-base
• aa-ct-trainer: Python code, “*_starter.py” scripts (ingest_starter.py, preprocess_starter.py, analyze_starter.py, train_starter.py, deploy_starter.py), “create_pipeline.py”, Dockerfile
Docker image stack:
• Alpine Linux
• aa-ml-training-miniconda3
• aa-ml-training-app-base (Python libraries: pandas, scipy, matplotlib, boto3, etc.)
• aa-ml-training-pipelines-base (Python libraries: pandas, scipy, matplotlib, boto3, etc.)
• “aa-ct-trainer” (/aa-ct-trainer → /app)

Our Approach

• Anomaly Detection
  • Building high-volume system
  • Several thousand auto-generated CT detectors for 5xx
  • Fine-tuned ML models
  • Exploring additional Deep Learning and LSTM models
• Diagnostics
  • Automated runbook diagnostics
  • Exploring ML for situation diagnostics
• Auto-Remediation
  • Automated rollback hints
  • Piloting full auto-rollback remediation

• Ingest at least 4 weeks of data
• Remove outliers (e.g. Hampel filtering)
• Interpolate missing data
• Analyse data set (EDA)
• Visualise with scatter plots, ACF
• Perform tests, e.g. adfuller (Augmented Dickey-Fuller test for stationarity)
• Record properties of data set (e.g. stationary vs single/dual seasonality)
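The outlier-removal and interpolation steps above can be sketched with the standard library alone. This is a minimal illustration, not the Adaptive Alerting implementation: the window size, the 3-sigma cut-off, and the sample values are assumptions.

```python
from statistics import median
from typing import List, Optional


def hampel_filter(values: List[float], window: int = 3,
                  n_sigmas: float = 3.0) -> List[Optional[float]]:
    """Replace outliers with None using a sliding median and MAD."""
    k = 1.4826  # scales MAD to a standard deviation for Gaussian data
    cleaned: List[Optional[float]] = list(values)
    for i in range(len(values)):
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        neighbourhood = values[lo:hi]
        med = median(neighbourhood)
        mad = median(abs(x - med) for x in neighbourhood)
        if abs(values[i] - med) > n_sigmas * k * mad:
            cleaned[i] = None
    return cleaned


def interpolate_missing(values: List[Optional[float]]) -> List[float]:
    """Fill None gaps linearly; gaps at the edges copy the nearest value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("nothing to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled


raw = [10.0, 11.0, 10.0, 12.0, 11.0, 100.0, 10.0, 11.0, 12.0, 10.0]
cleaned = hampel_filter(raw)          # the spike at index 5 becomes None
series = interpolate_missing(cleaned)  # index 5 becomes 10.5
```

One caveat of this simple form: when a window is perfectly flat its MAD is zero, so any deviation at all is flagged. A production filter would usually add a minimum tolerance.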

• Based on profile properties, select algorithms to explore
• For each algorithm:
  • Explore hyperparameter space (HPO)
  • For each HP combination, find the best fit
  • Select model with best set of HPs for algorithm
• Compare scores of each algorithm
• Select the best model from the best algorithm
• Ensemble detectors could serve multiple models
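The selection loop above can be sketched as a grid search over two toy forecasters. Everything named here is hypothetical: the two algorithms, their hyperparameter grids, and the mean-absolute-error score are stand-ins chosen for illustration, and these forecasters have no separate fitting step.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def moving_average_forecast(history: List[float], window: int) -> float:
    recent = history[-window:]
    return sum(recent) / len(recent)


def ewma_forecast(history: List[float], alpha: float) -> float:
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


# Hypothetical algorithms and hyperparameter grids to explore.
ALGORITHMS: Dict[str, Tuple[Callable[..., float], Dict[str, list]]] = {
    "moving_average": (moving_average_forecast, {"window": [2, 4, 8]}),
    "ewma": (ewma_forecast, {"alpha": [0.2, 0.5, 0.8]}),
}


def score(forecast: Callable[..., float], params: dict,
          series: List[float], warmup: int = 8) -> float:
    """Mean absolute one-step-ahead error (lower is better)."""
    errors = [abs(forecast(series[:t], **params) - series[t])
              for t in range(warmup, len(series))]
    return sum(errors) / len(errors)


def select_model(series: List[float]) -> Tuple[str, dict, float]:
    best_per_algorithm = {}
    for name, (forecast, grid) in ALGORITHMS.items():
        # Every combination of hyperparameter values for this algorithm.
        candidates = [dict(zip(grid, combo))
                      for combo in product(*grid.values())]
        scored = [(score(forecast, p, series), p) for p in candidates]
        best_per_algorithm[name] = min(scored, key=lambda sp: sp[0])
    # Compare the best model of each algorithm and keep the overall winner.
    best_name = min(best_per_algorithm, key=lambda n: best_per_algorithm[n][0])
    best_score, best_params = best_per_algorithm[best_name]
    return best_name, best_params, best_score


series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0,
          14.0, 16.0, 15.0, 17.0]
best_name, best_params, best_score = select_model(series)
```

An exhaustive grid is the simplest way to explore the space; for larger grids, random or Bayesian search would cover the same loop structure with fewer evaluations.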

• When a detector is created, a schedule is created, e.g. daily/weekly
• Scheduler initiates Data Prep, Training, and Deployment tasks

• Good for fast anomaly detection
• Detect mean shift over period of time
• Less volatile anomaly classification

• Treat first 3 weeks as “training data”
• Treat final 7 days as “test data”
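The split above can be sketched as a cut on timestamps. This is a minimal sketch assuming the data is a time-ordered list of (epoch seconds, value) pairs; the hourly sampling and the placeholder values are made up for illustration.

```python
from typing import List, Tuple

SECONDS_PER_DAY = 86_400
Point = Tuple[int, float]  # (epoch seconds, value)


def split_train_test(points: List[Point],
                     test_days: int = 7) -> Tuple[List[Point], List[Point]]:
    """Hold out the final test_days of a time-ordered series as test data."""
    cutoff = points[-1][0] - test_days * SECONDS_PER_DAY
    train = [p for p in points if p[0] <= cutoff]
    test = [p for p in points if p[0] > cutoff]
    return train, test


# Hypothetical input: 4 weeks of hourly points with placeholder values.
points = [(hour * 3600, float(hour % 24)) for hour in range(28 * 24)]
train, test = split_train_test(points)
```

Splitting by time rather than at random keeps the test data strictly after the training data, which matches how the detector will meet new data once deployed.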

Situation Diagnostics

Time series anomaly detection generates a large volume of individually unactionable anomalies. We need a small number of actionable anomalies.

ExpediaDotCom/haystack
Haystack is Expedia Group’s OSS distributed tracing product. Key features:
• Distributed tracing
• Dynamic service graph
• State snapshotting
• Trace telemetry
• Anomaly detection on telemetry

[Figure: Haystack service graph (ExpediaDotCom/haystack)]

ExpediaDotCom/haystack
Haystack data creates new automation opportunities:
• Incident classification
• Incident diagnostics
• Incident prediction
Can machine learning help?

A graph network works with graph-structured inputs and outputs. It is similar to a convolutional network, but it can learn over an arbitrary graph topology, not just a regular grid.
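To give a feel for what "graph-structured inputs and outputs" means, here is one round of plain, unweighted message passing over a tiny service graph. It is only a sketch of the building block: a real graph network learns the update and aggregation functions, whereas this one uses a fixed sum. The service names, edges, and error scores are hypothetical.

```python
from typing import Dict, List, Tuple


def message_passing_step(node_features: Dict[str, float],
                         edges: List[Tuple[str, str]]) -> Dict[str, float]:
    """Each node adds the features sent by its neighbours to its own.

    edges are (sender, receiver) pairs; the input and the output are both
    graphs with the same nodes, only the node features change.
    """
    incoming = {node: 0.0 for node in node_features}
    for sender, receiver in edges:
        incoming[receiver] += node_features[sender]
    return {node: node_features[node] + incoming[node]
            for node in node_features}


# Hypothetical call chain: web calls booking, booking calls geo.
# Errors flow from callee to caller, so edges point upstream.
edges = [("geo", "booking"), ("booking", "web")]
features = {"geo": 1.0, "booking": 0.0, "web": 0.0}  # geo is erroring

after_one = message_passing_step(features, edges)
after_two = message_passing_step(after_one, edges)
```

After one step the signal from the failing service reaches its direct caller; after two it reaches the caller's caller, which is how information spreads along the topology instead of along a fixed grid.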

Use cases
• Incident classification: Graph net + classifier → “Hotel bookings drop”
• Incident diagnostics: Graph net → “Geo Service: bad deployment”
• Incident prediction: Graph net → “ALERT: Hotel bookings drop in 8m”

Summary

Close Your Loops
• Open loops are costly and error prone
Automate to Reduce MTTK, MTTD, MTTR
• Outage Detection, Diagnosis, and Remediation can be automated to create closed loop systems
Machine Learning
• ML helps automate detection and diagnosis by converting historical observations into predictions and classifications
MLOps
• Break down the siloes between Data Science and Engineering
Graph Networks
• What story is your call graph telling you?

Ops automation
• Closing Loops and Opening Minds (AWS re:Invent): https://www.youtube.com/watch?v=O8xLxNje30M
KubeFlow
• http://kubeflow.org
Forecasting & anomaly detection
• Forecasting: Principles and Practice: https://otexts.com/fpp2/
• Outlier Detection: A Survey: https://web.cs.hacettepe.edu.tr/~aykut/classes/spring2013/bil682/supplemental/Outlier_Detection_A_Survey.pdf

Machine learning
• Awesome Public Datasets: https://github.com/awesomedata/awesome-public-datasets
• Awesome TensorFlow: https://github.com/jtoy/awesome-tensorflow
Graph networks
• Relational inductive biases, deep learning, and graph networks: https://arxiv.org/abs/1806.01261
• A comprehensive survey on graph neural networks: https://arxiv.org/abs/1901.00596
• Graph Nets: https://github.com/deepmind/graph_nets

Thanks!
Matt Callanan: @mcallana
Willie Wheeler: @williewheeler
ExpediaDotCom/adaptive-alerting
ExpediaDotCom/haystack
Hiring: www.lifeatexpedia.com
