Service Ownership: Learn Faster - Holly Allen, Service Engineering

  1. Service Ownership: Learn Faster. Holly Allen, Service Engineering, @hollyjallen

  2. Holly Allen: software development and leadership for 18 years

  3. @hollyjallen,#QConSF Nov 2018

  6. Software! 😎

  8. S L O W 😪

  10. Measure Design Learn

  11. Toyota Production System

  17. Kaizen: Continuous Improvement

  18. Measure Design Learn

  22. Executive dedication to learning

  23. High Trust Teams

  24. Measure Design Learn

  26. 🚁 Slack launched February 2014

  27. 5 Years: Grew to 13+ million weekly active users, with active sessions of 10+ hours a day

  28. 5 Years: From 10 to 15,000 servers in 25 cloud data centers worldwide

  29. 5 Years: From 8 to 1,200 people in 9 offices worldwide

  30. Measure Design Learn

  32. ✅ Continuous Deployment ✅ Experiment Frameworks ✅ User Research

  33. Something didn't scale...

  34. 😮 Centralized Operations

  35. Who should be responsible for the management, monitoring and operation of a production application?

  36. Centralized Operations: Division of Labor

  37. Devs: Features, Scale, Architecture. Ops: Cloud Infra, Deployment, Monitoring.

  38. Ops is getting the pages

  39. Product Development grew faster than Operations. A lot faster.

  40. 20 Product Developers, 1 Ops Engineer

  41. How can operations reliably reach the developers when there's a problem?

  42. "Call Maude, she knows how this works"

  43. Devs: "I've never been on-call before, this is scary!" Ops: "Now I know I can find a developer when I need to."

  44. Ops is getting the first pages. Ultra-senior devs on-call.

  45. Measure Design Learn

  46. How can operations reliably reach the developers when there's a problem?

  47. 📠 Most devs go on-call, Fall 2017

  48. Kaizen: Continuous Improvement

  49. "Wait, I'm on-call now?"

  50. Devs: "I'm glad I'm only on call a few times a year." Ops: "I'll be able to reach a search engineer if I need to."

  51. Learn by Doing

  52. On-call 3 times a year 🤕

  53. Ops is getting the first pages. Ultra-senior devs on-call. Seven One dev rotations

  54. Continuous Deployment: 100+ prod deploys a day

  55. What Changed?

  58. Page the dev

  59. Devs: "I don't understand this part of the code." Ops: "These are the machine alerts I'm seeing."

  60. Human Routers

  61. "Call Andy, he knows how this works"

  62. Postmortems weren't a great place for learning

  63. Can we catch problems earlier?

  67. Investing in tech to make detection and remediation faster

  68. Reorg! Fall 2017: Operations is out, Service Engineering is in

  69. How can Slack ensure that developers know when there's a problem?

  70. Centralized Operations Service Ownership

  71. Measure Design Learn

  72. "We are the toolsmith and specialists. We empower Service Ownership"

  73. Devs: Features, Reliability, Performance, Postmortems. Service: Cloud Platform, Observability tools, Service Discovery, Define best practice.

  74. 👌 I joined Slack in February 2018

  75. How to empower development teams to improve service reliability?

  76. Define service health and operational maturity • At least one alerting health metric, like latency or throughput

  77. Send metrics to Prometheus. Observability team is here to help! 🔯

  78. Define service health and operational maturity • Team should be on-call ready • At least 4, preferably 6 engineers participating to make it sustainable • 24/7 or during the weekday, depending on the service

  79. Define service health and operational maturity • Runbooks for standard actions and troubleshooting • Central location in our code repository • Up to date and useable by any engineer

  80. Define service health and operational maturity • Paging alerts should link to the runbook • Make responding to a page easy • Practice incident response

  81. Incident Lunch ⛑

  82. Site Reliability Engineers • Devops generalists • Emotional intelligence • Mentoring • Ambassadors • Operational maturity

  83. SRE embedded in dev teams

  84. Devs SRE Ops

  85. Devs: "Um, where are the SREs?" SREs: "I'm over here doing operational tasks."

  86. SRE Ops is still getting the first pages

  87. How do we lower operational burden on the SREs?

  88. Plan: Send paging alerts to the development teams

  89. Devs: "We need training." SREs: "We're going to plan this out perfectly."

  91. Host level alerts: Hundreds of them

  92. Test with the users

  94. 💫 Everything was fine!

  95. Empowered Continuous Improvement

  96. Devs SRE Ops

  97. How do we test our understanding of how Slack will fail?

  98. "Disasterpiece Theater is an ongoing series of exercises in which we will purposely cause a part of Slack to fail."

  99. Measure Design Learn

  100. Success Metrics • Increased engineer confidence • Validate reliability improvements • Learn something new • Practice incident response
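
The "service health and operational maturity" criteria on slides 76 to 80 read like a checklist, so they can be sketched as one. The Python below is only an illustration under stated assumptions: `ServiceProfile`, `readiness_gaps`, the field names, and the example runbook path are all hypothetical and are not taken from the talk or from Slack's tooling. It encodes four of the listed criteria (an alerting health metric, at least 4 on-call engineers, a runbook in the code repository, paging alerts that link to the runbook).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceProfile:
    """Hypothetical record of one service's operational state."""
    name: str
    alerting_metrics: List[str] = field(default_factory=list)  # e.g. ["latency"]
    oncall_engineers: int = 0           # engineers in the on-call rotation
    runbook_path: str = ""              # assumed location in the code repository
    alerts_link_to_runbook: bool = False


def readiness_gaps(svc: ServiceProfile) -> List[str]:
    """Return the checklist items (slides 76 to 80) the service does not yet meet."""
    gaps = []
    if not svc.alerting_metrics:
        gaps.append("needs at least one alerting health metric")
    if svc.oncall_engineers < 4:
        gaps.append("needs at least 4 engineers on the on-call rotation")
    if not svc.runbook_path:
        gaps.append("needs a runbook in the code repository")
    if not svc.alerts_link_to_runbook:
        gaps.append("paging alerts should link to the runbook")
    return gaps


if __name__ == "__main__":
    # Made-up example service; values are for illustration only.
    search = ServiceProfile(
        name="search",
        alerting_metrics=["latency"],
        oncall_engineers=5,
        runbook_path="runbooks/search.md",
    )
    for gap in readiness_gaps(search):
        print(f"{search.name}: {gap}")
```

A check like this could run in CI or in a periodic report, so that gaps surface before a team starts receiving pages. The "preferably 6" engineers and the "24/7 or during the weekday" choice are left out here because they are judgment calls per service.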
