Pivotal's best practices for achieving sustainable Cloud Operations

  1. Pivotal’s best practices for achieving sustainable Cloud Operations. Konstantin Semenov - Principal Software Engineer

  2. About myself: started my career in software engineering 22 years ago; involved in a diverse range of projects, from database management to 3D modelling; enjoy playing music in my spare time.

  3. Agenda: who Cloud Ops are; Pivotal values; Google SRE; extracted practices; questions.

  4. Who are Cloud Ops? The reverse of DevOps; 50% of time spent on development; provide feedback to the product teams; develop best practices for operators.

  5. Pivotal Core Values: eXtreme Programming; test-driven development; small releases ➡ small updates; pair programming ➡ pair operations; continuous integration ➡ continuous upgrade; collective ownership ➡ no superheroes, please; sustainable pace ➡ sane working hours.

  6. HumanOps: humans are part of the system; humans impact systems; humans impact business; human issues count as system issues; escalate to humans as a last resort.

  7. HumanOps: human metrics; system metrics.

  8. Cloud Ops EU: our first experiment, a distributed operations team. Regularly sharing context within distributed teams is hard.

  9. Comic Relief: a large-scale telethon in the UK; collected over £82 million in donations in one night; deployed across AWS, vSphere and GCP.

  10. Google Partnership: we were invited to the CRE (Customer Reliability Engineering) trial run; well-aligned with Pivotal principles.

  11. Service Levels: What does it mean to have 99.9% availability? What is the SLI/SLO/SLA relationship? How would you choose them?

  12. Service Levels: What does it mean to have 99.9%?

      | Level  | Outage per month |
      |--------|------------------|
      | 99%    | 7 hours          |
      | 99.9%  | 43.2 minutes     |
      | 99.95% | 21.6 minutes     |
      | 99.99% | 4.32 minutes     |
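
The figures in this table can be reproduced with a few lines of arithmetic. The sketch below assumes a 30-day month, which is what the 43.2-minute figure implies; the function and constant names are illustrative and not from the talk.

```python
# Allowed downtime for a given availability level.
# Assumption: a "month" is 30 days (43,200 minutes).
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def allowed_downtime_minutes(availability_percent: float,
                             window_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return how many minutes of outage fit inside the availability target."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return window_minutes * unavailable_fraction


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.95, 99.99):
        minutes = allowed_downtime_minutes(level)
        print(f"{level}% -> {minutes:.2f} minutes per 30 days")
```

For 99% this gives 432 minutes, or 7.2 hours, which the table rounds to 7 hours.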

  13. Service Levels

      | Issue                                                        | MTTD    | MTTR   | MTBF     | Impact | Loss (min/yr) |
      |--------------------------------------------------------------|---------|--------|----------|--------|---------------|
      | Containers run without dependent services                    | 0       | 3 min  | 90 days  | 10%    | 1             |
      | VM exposed to Internet traffic                               | 120 min | 60 min | 365 days | 10%    | 18            |
      | Applications can cause collateral damage to log availability | 60 min  | 30 min | 90 days  | 0%     | 0             |
      | Traffic spike prevents mitigation                            | 10 min  | 60 min | 180 days | 100%   | 142           |
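
The Loss column appears to follow from the other columns as (MTTD + MTTR) × (365 / MTBF) × impact. That formula is inferred from the numbers in the table rather than stated on the slide, so the sketch below should be read as an assumption; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    issue: str
    mttd_min: float   # mean time to detect, in minutes
    mttr_min: float   # mean time to repair, in minutes
    mtbf_days: float  # mean time between failures, in days
    impact: float     # fraction affected, 0.0 to 1.0

    def loss_minutes_per_year(self) -> float:
        """Expected bad minutes per year (formula inferred from the table)."""
        incidents_per_year = 365 / self.mtbf_days
        return (self.mttd_min + self.mttr_min) * incidents_per_year * self.impact


risks = [
    Risk("Containers run without dependent services", 0, 3, 90, 0.10),
    Risk("VM exposed to Internet traffic", 120, 60, 365, 0.10),
    Risk("Applications can cause collateral damage to log availability", 60, 30, 90, 0.0),
    Risk("Traffic spike prevents mitigation", 10, 60, 180, 1.0),
]

if __name__ == "__main__":
    # Largest expected loss first, so the biggest risk is at the top.
    for risk in sorted(risks, key=Risk.loss_minutes_per_year, reverse=True):
        print(f"{round(risk.loss_minutes_per_year()):>4} min/yr  {risk.issue}")
```

Sorting by expected loss makes the comparison against an availability target direct: a single risk that costs more minutes per year than the target allows has to be mitigated or the target relaxed.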

  14. Service Levels: it’s all about risk assessment; set clear expectations of performance.

  15. Error budget: is usually defined within a 30-day rolling window; helps to prioritise innovation over stability; it’s a budget, so it is meant to be spent.
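
A minimal sketch of a time-based error budget over a 30-day rolling window is shown below. It assumes the budget is measured in minutes of downtime and that each outage is recorded with the day it happened; the outage dates, durations and function names are hypothetical.

```python
from datetime import date, timedelta

WINDOW_DAYS = 30  # rolling window length, as on the slide


def budget_minutes(slo_percent: float, window_days: int = WINDOW_DAYS) -> float:
    """Total downtime the SLO allows inside one window."""
    return window_days * 24 * 60 * (100.0 - slo_percent) / 100.0


def remaining_budget(slo_percent: float, outages, today: date,
                     window_days: int = WINDOW_DAYS) -> float:
    """Budget left after subtracting outages that fall inside the window.

    `outages` is an iterable of (day, downtime_minutes) pairs.
    """
    window_start = today - timedelta(days=window_days)
    spent = sum(minutes for day, minutes in outages
                if window_start < day <= today)
    return budget_minutes(slo_percent, window_days) - spent


# Hypothetical outage history, for illustration only.
outages = [(date(2024, 1, 1), 20.0), (date(2024, 1, 20), 10.0)]

if __name__ == "__main__":
    for today in (date(2024, 1, 25), date(2024, 2, 5)):
        left = remaining_budget(99.9, outages, today)
        print(f"{today}: {left:.1f} minutes of budget left")
```

Because the window rolls, the older outage drops out of the calculation by the second date and that part of the budget becomes available to spend again.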

  16. Pivotal Tracker: web-based project management system; over 100,000 active users; runs on a commercial distribution of Cloud Foundry; migrated from AWS to GCP with no downtime.

  17. Platform updates: security patches; general support timeframe; scheduled nighttime/weekend maintenance windows are more error-prone due to human factors, with no one to ask for help when something fails.

  18. Deployment train: inspired by agile release engineering; all pending updates are applied every morning.

  19. Train driver: controls what changes board the train, and whether the train is allowed to leave; holds the pager; on duty for a week; writes deployment reports.

  20. Fire drills: keep the teams in practice with the relevant tooling; can be performed with development teams; are essential for becoming a train driver.

  21. Dungeons & Dragons: help develop troubleshooting skills; get the team more intimately familiar with various parts of the system; are fun!

  22. Toil snake: toil reduction prioritisation tool; clearly indicates the biggest pain. (Chart labels: Backup, Upgrades, Tunnel.)
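
One way to read the toil snake is as a running tally of toil by category. The sketch below assumes every toil occurrence is logged as one entry and that the longest tally marks the biggest pain. The log is illustrative: it reuses the three category names visible on the slide, but the counts are not measured data.

```python
from collections import Counter

# Hypothetical toil log: one entry per occurrence of manual work.
toil_log = [
    "Backup", "Upgrades", "Tunnel", "Backup", "Upgrades", "Tunnel",
    "Backup", "Tunnel", "Backup", "Tunnel", "Backup",
]


def toil_snake(log):
    """Return (category, count) pairs, most frequent category first."""
    return Counter(log).most_common()


if __name__ == "__main__":
    for category, count in toil_snake(toil_log):
        # One '#' per occurrence gives a crude text version of the chart.
        print(f"{category:<10}{'#' * count} ({count})")
```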

  23. End of General Support: reviewed weekly; feedback to product teams.

  24. Bit Rot: indicates how long a component hasn’t been updated; surfaces update issues.
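
A bit-rot indicator of this kind can be sketched as the number of days since each component was last updated, listed stalest first. The component names and dates below are hypothetical, and the talk does not say how the metric was actually computed.

```python
from datetime import date


def bit_rot_days(last_updated: date, today: date) -> int:
    """Days elapsed since the component was last updated."""
    return (today - last_updated).days


def stalest_first(last_updates: dict, today: date) -> list:
    """Return (component, days_since_update) pairs, oldest update first."""
    ages = [(name, bit_rot_days(updated, today))
            for name, updated in last_updates.items()]
    return sorted(ages, key=lambda pair: pair[1], reverse=True)


# Hypothetical components and last-update dates, for illustration only.
last_updates = {
    "component-a": date(2024, 1, 10),
    "component-b": date(2024, 2, 20),
    "component-c": date(2023, 12, 1),
}

if __name__ == "__main__":
    for name, days in stalest_first(last_updates, today=date(2024, 3, 1)):
        print(f"{name}: {days} days since last update")
```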

  25. Questions?

  26. References
      Cloud Foundry Foundation - https://www.cloudfoundry.org/
      HumanOps - https://www.humanops.com/
      Google SRE - https://landing.google.com/sre/

  27. Thank you!
