
Have You Tried Turning It Off and On Again? David N. Blank-Edelman



  1. 7/6/18 Have You Tried Turning It Off and On Again? David N. Blank-Edelman, Senior Cloud Ops Advocate (source: http://leyanda.de/index.php?option=com_content&view=article&id=11)

  2. @otterbook source: https://medium.com/@Ganticdotco/i-cant-help-but-think-of-the-blue-screen-of-death-f7a47be7ac67

  4. This is Production.

  5. This is Production. (source: https://www.flickr.com/photos/mayhem/4970272960/)

  6. source: http://leyanda.de/index.php?option=com_content&view=article&id=11

  7. Q&A

  8. Volunteers? Rules

  9. Level Set: SRE

  11. Seeking SRE: Conversations About Running Production Systems at Scale. Edited by David N. Blank-Edelman

  12. • Airbnb • Amazon • Apple • Baidu • Dropbox • Etsy • Facebook • GitHub • LinkedIn • Microsoft • Netflix • Pinterest • Spotify • Stack Exchange • Twitter • Uber • Yahoo! • Yelp

  14. What Makes SRE, SRE (dramatic recreation)
      • Hire only coders
      • Have an SLA for your service
      • Measure and report performance against SLA
      • Use Error Budgets and gate launches on them
      • Common staffing pool for SRE and DEV
      • Excess Ops work overflows to DEV team
      • Cap SRE operational load at 50%
      • Share 5% of ops work with DEV team
      • Oncall teams at least 8 people, or 6x2
      • Maximum of 2 events per oncall shift
      • Post mortem for every event
      • Post mortems are blameless and focus on process and technology, not people

  15. monitor / SLO

  16. monitor / SLO / decide. Observation #1: Create virtuous and reinforcing feedback loops.

  17. What Makes SRE, SRE (dramatic recreation)
      • Hire only coders
      • Have an SLA for your service
      • Measure and report performance against SLA
      • Use Error Budgets and gate launches on them
      • Common staffing pool for SRE and DEV
      • Excess Ops work overflows to DEV team
      • Cap SRE operational load at 50%
      • Share 5% of ops work with DEV team
      • Oncall teams at least 8 people, or 6x2
      • Maximum of 2 events per oncall shift
      • Post mortem for every event
      • Post mortems are blameless and focus on process and technology, not people
      Observation #2: You can’t fire your way to reliable.

  18. Observation #2: You can’t fire your way to resilient. The Actual Talk

  19. Q: What are the characteristics of an operations practice that actively influence a system towards greater resiliency? Q: What are some of the characteristics of an operations practice that actively influence a system towards greater resiliency?

  20. The Nature of the Work

  21. Interfaces

  22. Data

  23. Errors

  24. Ambiguity

  25. “...I would like to beg you, dear Sir, as well as I can, to have patience with everything unresolved in your heart and to try to love the questions themselves as if they were locked rooms or books written in a very foreign language. Don't search for the answers, which could not be given to you now, because you would not be able to live them. And the point is, to live everything. Live the questions now. Perhaps then, someday far in the future, you will gradually, without even noticing it, live your way into the answer.” (Rainer Maria Rilke, Letters to a Young Poet, #4)

  26. Q: What are some more of the characteristics of an operations practice that actively influence a system towards greater resiliency? (More) Characteristics of an Operations Practice

  27. Check In. David N. Blank-Edelman, Senior Cloud Ops Advocate. @otterbook, dnb@microsoft.com, /in/dnblankedelman
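
Slide 14 lists "Use Error Budgets and gate launches on them", and slides 15 and 16 build up a monitor, SLO, decide loop. One way that gate might look in code is sketched below. It is only an illustration under stated assumptions: the talk does not show an implementation, the SLO is assumed to be request-based (a target success ratio over a measurement window), and the names `SloWindow`, `allowed_failures`, `budget_remaining`, and `launch_allowed` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO measurement window (hypothetical shape)."""
    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # "monitor": requests observed in the window
    failed_requests: int  # "monitor": requests that violated the SLO

    def allowed_failures(self) -> float:
        # The error budget: how many requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        # Fraction of the budget still unspent; negative once overspent.
        allowed = self.allowed_failures()
        if allowed == 0:
            # A 100% target (or an empty window) leaves no budget to spend.
            return 0.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / allowed


def launch_allowed(window: SloWindow) -> bool:
    # "decide": gate launches on the budget, shipping only while some is left.
    return window.budget_remaining() > 0.0
```

With a 99.9% target over 1,000,000 requests, roughly 1,000 failures are permitted, so a window with 400 failures still allows launches and one with 1,500 failures does not. Feeding each new window's counts back into this check is one small instance of the feedback loop named in Observation #1.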
