Automating Operations with Machine Intelligence, Rob Harrop



  1. Automating Operations with Machine Intelligence Rob Harrop

  2. CEO @ Skipjaq Co-founder @ SpringSource Automated performance management

  3. Why automate operations? Why now? What does automated operations look like? How do we build for automation? Solving a real problem…

  4. Why automate operations?

  5. More Complexity

  6. Monolith -> Microservices Strong -> Eventual Consistency Assume reliability -> Assume failure

  7. More Deployments

  8. [Chart: y-axis ticks 10 to 40; x-axis from "Very end of 2009" to "Today"] Credit: Mike Brittain, Engineering Director @ Etsy

  9. Less time to identify fixes Rollbacks more likely Tiny window for human intervention

  10. Harder Faster

  11. Why now?

  12. We have to

  13. We can

  14. Trends Cloud Containers Observability Microservices ML/AI

  15. Current trends provide the impetus and tools for automation by AI

  16. Automated Operations

  17. Move 37

  18. Move 78 - God’s Touch

  19. AI Human

  20. Types of Operation Actions Wholly performed by human Wholly performed by AI Co-operation between human and AI Actionable insight

  21. On Metrics Data is not insight Gathering metrics is not automating operations But, metrics are critical to automating operations

  22. Human ≠ Manual

  23. Actions by Human Testing Deployment Provisioning

  24. Cooperative Actions Anomaly alerting Rollback broken builds Dependency upgrade

  25. Actions by AI Predictive auto scaling Workload placement Automatic rollback Performance optimisation? Security?

  26. Actions and Actionable Insights

  27. Building for Automation

  28. Requirements for Operations Visible metrics and logs Ability to start/stop/restart/move workload Ability to change configuration Ability to modify dependencies Ability to wire/rewire external services

  29. Self-contained package Disposable processes Externally-configurable Externally-observable Externalised dependencies Externalised service wiring

  30. 12+1 Factor

  31. 13th Factor - Observability Metrics as event streams Standard metrics - CPU usage, memory usage, … Service-specific metrics - Leads received, items sold, …
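One way to read "metrics as event streams" is that every observation, standard or service-specific, is written as a small timestamped event that any collector can tail. The sketch below is only an illustration of that idea: the field names (`ts`, `name`, `value`, `kind`), the metric names, and the JSON-lines format are assumptions, not something the slides specify.

```python
import json
import sys
import time


def metric_event(name, value, kind, timestamp=None):
    """Build one metric observation as a flat, timestamped event."""
    return {
        "ts": time.time() if timestamp is None else timestamp,
        "name": name,
        "value": value,
        "kind": kind,  # "standard" or "service" (hypothetical labels)
    }


def emit(event, stream):
    """Write the event as one JSON line so a collector can consume the stream."""
    stream.write(json.dumps(event, sort_keys=True) + "\n")


if __name__ == "__main__":
    # A standard metric and a service-specific metric share the same stream.
    emit(metric_event("cpu.usage", 0.42, "standard"), sys.stdout)
    emit(metric_event("leads.received", 1, "service"), sys.stdout)
```

Writing to a plain stream keeps the process externally observable without tying it to a particular monitoring backend.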

  32. Case Study Detecting Anomalous DB CPU

  33. Background Consumer-facing web application running Rails against PostgreSQL on AWS RDS Mix of transactional and batch workloads running against the same database Question: when is the DB unusually overloaded?

  34. Detecting Anomalies Policy-based Statistical model Predictive model Classification model

  35. Policy Based Fixed threshold alerting How well does this work?

  36. Not Very
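A small sketch of why a fixed threshold tends to disappoint on a database with mixed workloads. The numbers are invented for illustration: a batch job that legitimately pushes CPU to about 90%, and an unusual daytime load of 70% where 30% is normal. The 80% limit is likewise an assumption.

```python
def threshold_alerts(cpu_percent, limit=80.0):
    """Return the indices of observations above a fixed limit."""
    return [i for i, value in enumerate(cpu_percent) if value > limit]


# Hypothetical CPU readings: index 3 is the genuinely unusual point,
# indices 5 and 6 are the expected batch window.
series = [30, 32, 31, 70, 29, 90, 91, 30]
print(threshold_alerts(series))  # prints [5, 6]
```

The policy fires on the expected batch window and stays silent on the point that is actually out of character, which is the gap the statistical and predictive models try to close.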

  37. Statistical Model Twitter AnomalyDetection package - Seasonal Hybrid ESD Is this point unexpected in our distribution? - With seasonal and trend effects removed

  38. Statistical Model: metrics stream -> sliding window of observations (1 month, 1 year?) -> run the model (S-H-ESD) on each new observation -> is the new point an outlier?
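The sliding-window loop can be sketched in a few lines. This is a deliberately simplified stand-in, not the Twitter AnomalyDetection package and not full S-H-ESD: it removes a per-phase median as the seasonal estimate and scores the newest point with a median/MAD robust z-score instead of running the generalized ESD test. The class name, the `threshold` of 3.5, and the `min_scale` floor are all assumptions made for the example.

```python
from collections import deque
from statistics import median


class SlidingWindowDetector:
    """Keep a window of observations and test each new point against it."""

    def __init__(self, window, period, threshold=3.5, min_scale=1.0):
        self.values = deque(maxlen=window)
        self.period = period          # samples per seasonal cycle
        self.threshold = threshold    # robust z-score cut-off (assumed)
        self.min_scale = min_scale    # floor so a flat series is not over-sensitive
        self.count = 0                # total observations seen, used to track phase

    def _residuals(self, values, start_phase):
        # Seasonal estimate: median of all window values sharing the same phase.
        by_phase = {}
        for i, value in enumerate(values):
            by_phase.setdefault((start_phase + i) % self.period, []).append(value)
        seasonal = {phase: median(vs) for phase, vs in by_phase.items()}
        return [v - seasonal[(start_phase + i) % self.period]
                for i, v in enumerate(values)]

    def observe(self, value):
        """Add one observation; return True if it looks like an outlier."""
        self.values.append(value)
        self.count += 1
        if len(self.values) < 2 * self.period:
            return False  # not enough history to estimate seasonality
        start_phase = (self.count - len(self.values)) % self.period
        residuals = self._residuals(list(self.values), start_phase)
        centre = median(residuals)
        mad = median([abs(r - centre) for r in residuals])
        scale = max(1.4826 * mad, self.min_scale)
        return abs(residuals[-1] - centre) / scale > self.threshold
```

Medians are used throughout so that an earlier anomaly still sitting in the window does not mask the next one, which is the same motivation behind the "hybrid" part of S-H-ESD.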

  39. Predictive Model Train a model to predict values in the time series Prediction error > critical value => outlier
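The rule "prediction error > critical value => outlier" does not depend on which model makes the prediction. The sketch below swaps in a least-squares AR(1) fit as a cheap stand-in for the neural network the talk goes on to show; the choice of AR(1), the function names, and the critical value of mean plus `k` standard deviations of the training errors are assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_ar1(train):
    """Least-squares fit of x[t] ~ a + b * x[t-1] on the training series."""
    xs, ys = train[:-1], train[1:]
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    a = my - b * mx
    return a, b


def critical_value(train, model, k=3.0):
    """Assumed rule: mean absolute training error plus k standard deviations."""
    a, b = model
    errors = [abs(y - (a + b * x)) for x, y in zip(train[:-1], train[1:])]
    return mean(errors) + k * pstdev(errors)


def is_outlier(previous, observed, model, critical):
    """Flag the observation if its prediction error exceeds the critical value."""
    a, b = model
    return abs(observed - (a + b * previous)) > critical


# Hypothetical training set: a repeating CPU pattern between 40% and 50%.
train = [40, 42, 44, 46, 48, 50, 48, 46, 44, 42] * 5
model = fit_ar1(train)
critical = critical_value(train, model)
```

With this setup, `is_outlier(44, 46, model, critical)` evaluates a step the model has seen many times, while `is_outlier(44, 80, model, critical)` evaluates a jump far outside the training errors. Re-fitting `model` and `critical` on a fresh training window is the re-train step.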

  40. [Diagram: feed-forward neural network. Layer L1: inputs x1, x2, x3 and bias +1. Layer L2: hidden units a1(2), a2(2), a3(2) and bias +1. Layer L3: output h_W,b(x)]

  41. [Diagram: unrolled recurrent network; inputs x0 to x4 feed a repeated cell A, producing outputs h0 to h4] From: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  42. Predictive Model: metrics stream -> model -> prediction -> is the prediction error an outlier??? Training set (last month??) -> re-train (nightly, weekly?) -> model

  43. Handling Anomalies Actionable alerts - Confidence in predictions No alerts for pointless things

  44. Handling Anomalies Taking action - Rewiring services to read-replica? - Kill long-running queries?

  45. Handling Anomalies Confidence in the model leads to confidence in automation
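The link between model confidence and automation can be sketched as a small decision rule: low-confidence anomalies produce no alert, mid-confidence ones go to a human, and only high-confidence ones trigger an automated action such as rewiring to a read-replica. The `Response` names and both cut-offs (0.6 and 0.95) are hypothetical values chosen for the example.

```python
from enum import Enum


class Response(Enum):
    IGNORE = "ignore"            # no alerts for pointless things
    ALERT_HUMAN = "alert_human"  # actionable alert, human decides
    AUTOMATE = "automate"        # e.g. rewire to read-replica, kill long queries


def choose_response(confidence, alert_at=0.6, automate_at=0.95):
    """Map the model's confidence in an anomaly (0.0 to 1.0) to a response."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence >= automate_at:
        return Response.AUTOMATE
    if confidence >= alert_at:
        return Response.ALERT_HUMAN
    return Response.IGNORE
```

As confidence in the model grows, `automate_at` can be lowered so that more of the cooperative actions move to the AI side.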

  46. Summary Increasing complexity and deployment speed make operational automation a must We must build services that are ready for automation Simple models can often beat complex ones Cheap compute and storage make large-scale ML available to everyone

  47. Thank You
