Service Ownership Learn Faster Holly Allen Service Engineering @hollyjallen
Holly Allen Software development and leadership for 18 years
@hollyjallen,#QConSF Nov 2018
Software! 😎
S L O W 😪
Measure Design Learn
Toyota Production System
Kaizen: Continuous Improvement
Measure Design Learn
Executive dedication to learning
High Trust Teams
Measure Design Learn
🚁 Slack launched February 2014
5 Years: Grew to 13+ million weekly active users, with active sessions of 10+ hours a day
5 Years: From 10 to 15,000 servers in 25 cloud data centers worldwide
5 Years: From 8 to 1,200 people in 9 offices worldwide
Measure Design Learn
✅ Continuous Deployment ✅ Experiment Frameworks ✅ User Research
Something didn't scale...
😮 Centralized Operations
Who should be responsible for the management, monitoring and operation of a production application?
Centralized Operations: Division of Labor
Devs: Features, Scale, Architecture
Ops: Cloud Infra, Deployment, Monitoring
Ops is getting the pages
Product Development grew faster than Operations. A lot faster.
20 Product Developers, 1 Ops Engineer
How can operations reliably reach the developers when there's a problem?
"Call Maude, she knows how this works"
Devs: "I've never been on-call before, this is scary!"
Ops: "Now I know I can find a developer when I need to."
Ops is getting the first pages. Ultra-senior devs on-call.
Measure Design Learn
How can operations reliably reach the developers when there's a problem?
📠 Most devs go on-call, Fall 2017
Kaizen: Continuous Improvement
"Wait, I'm on-call now?"
Devs: "I'm glad I'm only on call a few times a year"
Ops: "I'll be able to reach a search engineer if I need to."
Learn by Doing
On-call 3 times a year 🤕
Ops is getting the first pages. Ultra-senior devs on-call. Seven One dev rotations
Continuous Deployment: 100+ prod deploys a day
What Changed?
Page the dev
Devs: "I don't understand this part of the code"
Ops: "These are the machine alerts I'm seeing"
Human Routers
"Call Andy, he knows how this works"
Postmortems weren't a great place for learning
Can we catch problems earlier?
Investing in tech to make detection and remediation faster
Reorg! Operations is out, Service Engineering is in. Fall 2017
How can Slack ensure that developers know when there's a problem?
Centralized Operations → Service Ownership
Measure Design Learn
"We are the toolsmith and specialists. We empower Service Ownership"
Devs: Features, Reliability, Performance, Postmortems
Service: Cloud Platform, Observability tools, Service Discovery, Define best practice
👌 I joined Slack in February 2018
How to empower development teams to improve service reliability?
Define service health and operational maturity
• At least one alerting health metric, like latency or throughput
Send metrics to Prometheus. Observability team is here to help! 🔯
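As a rough illustration of "one alerting health metric", the sketch below renders a latency gauge in the Prometheus text exposition format and checks it against a threshold. The metric name, the `service="search"` label, and the 0.5 second threshold are all hypothetical, not taken from the talk. In a real Prometheus setup the threshold would normally live in an alerting rule rather than in application code; `should_alert` only stands in for that rule.

```python
# Hypothetical sketch: metric name, label, and threshold are illustrative only.

def render_gauge(name, help_text, value, labels=None):
    """Render one gauge sample in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        # Sort labels so the output is stable between runs.
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


def should_alert(latency_seconds, threshold_seconds=0.5):
    """Stand-in for an alerting rule: True when latency breaches the threshold."""
    return latency_seconds > threshold_seconds


if __name__ == "__main__":
    text = render_gauge(
        "service_request_latency_seconds",  # assumed metric name
        "Request latency of the service.",
        0.72,
        {"service": "search"},
    )
    print(text, end="")
    print("alert:", should_alert(0.72))
```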
Define service health and operational maturity
• Team should be on-call ready
• At least 4, preferably 6 engineers participating to make it sustainable
• 24/7 or during the weekday, depending on the service
• Runbooks for standard actions and troubleshooting
• Central location in our code repository
• Up to date and useable by any engineer
• Paging alerts should link to the runbook
• Make responding to a page easy
• Practice incident response
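The maturity criteria above could be checked mechanically. The following is a minimal sketch under stated assumptions: the `ServiceReadiness` record, its field names, and the `readiness_gaps` helper are hypothetical and are not described in the talk. Only the thresholds (at least one alerting health metric, at least 4 engineers, a runbook in the repository, paging alerts linked to the runbook) come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceReadiness:
    """Hypothetical record of a service's operational maturity."""
    name: str
    alerting_health_metrics: list = field(default_factory=list)
    on_call_engineers: int = 0
    runbook_path: str = ""  # location in the code repository, "" if missing
    # Paging alert name -> runbook link ("" when the alert has no link).
    paging_alerts: dict = field(default_factory=dict)


def readiness_gaps(svc):
    """Return the list of criteria from the slides that the service misses."""
    gaps = []
    if not svc.alerting_health_metrics:
        gaps.append("no alerting health metric")
    if svc.on_call_engineers < 4:
        gaps.append("fewer than 4 engineers on the on-call rotation")
    if not svc.runbook_path:
        gaps.append("no runbook in the code repository")
    for alert, link in sorted(svc.paging_alerts.items()):
        if not link:
            gaps.append(f"paging alert '{alert}' does not link to the runbook")
    return gaps


if __name__ == "__main__":
    svc = ServiceReadiness(
        name="search",
        alerting_health_metrics=["latency"],
        on_call_engineers=3,
        runbook_path="runbooks/search.md",
        paging_alerts={"HighLatency": "runbooks/search.md", "QueueBacklog": ""},
    )
    for gap in readiness_gaps(svc):
        print(gap)
```

An empty result would mean the service meets every listed criterion; the check says nothing about whether the runbook is actually up to date, which still needs a human review.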
Incident Lunch ⛑
Site Reliability Engineers
• DevOps generalists
• Emotional intelligence
• Mentoring
• Ambassadors
• Operational maturity
SRE embedded in dev teams
Devs SRE Ops
Devs: "Um, where are the SREs?"
SREs: "I'm over here doing operational tasks"
SRE Ops is still getting the first pages
How do we lower operational burden on the SREs?
Plan: Send paging alerts to the development teams
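One way to picture this plan is a routing table from service to owning team, so that a page reaches the developers directly instead of passing through a human router. This is only a sketch: the service names, team names, alert shape, and the choice to fall back to an `sre` rotation for unowned services are all assumptions, not details from the talk.

```python
from collections import defaultdict


def route_alerts(alerts, owners, fallback="sre"):
    """Group paging alerts by the rotation that should receive them.

    alerts: list of dicts with "name" and "service" keys (assumed shape).
    owners: mapping of service name -> owning team's on-call rotation.
    fallback: rotation paged when a service has no registered owner (assumed).
    """
    pages = defaultdict(list)
    for alert in alerts:
        team = owners.get(alert["service"], fallback)
        pages[team].append(alert["name"])
    return dict(pages)


if __name__ == "__main__":
    owners = {"search": "search-team", "messaging": "messaging-team"}
    alerts = [
        {"name": "HighLatency", "service": "search"},
        {"name": "QueueBacklog", "service": "messaging"},
        {"name": "DiskFull", "service": "legacy-cron"},
    ]
    for team, names in route_alerts(alerts, owners).items():
        print(team, names)
```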
Devs: "We need training"
SREs: "We're going to plan this out perfectly"
Host-level alerts. Hundreds of them.
Test with the users
💫 Everything was fine!
Empowered Continuous Improvement
Devs SRE Ops
How do we test our understanding of how Slack will fail?
"Disasterpiece Theater is an ongoing series of exercises in which we will purposely cause a part of Slack to fail."
Measure Design Learn
Success Metrics
• Increased engineer confidence
• Validate reliability improvements
• Learn something new
• Practice incident response