The Practice of Chaos Engineering (Ana Medina)


  1. The Practice of Chaos Engineering. Ana Medina, Chaos Engineer at Gremlin (@ana_m_medina)

  2. #reactive18 @ana_m_medina: Gremlin, Uber, SFEFCU, Google, Quicken Loans, Stanford University, Miami Dade College

  3. How many of you have heard of Chaos Engineering?

  4. How many of you have run a Chaos Engineering experiment?

  5. What is Chaos Engineering?

  6. Chaos Engineering: thoughtful, planned experiments designed to reveal the weaknesses in our systems.

  7. Chaos Engineering: "Inject something harmful to build an immunity." -@KoltonAndrus, Gremlin Founder and CEO

  8. Why?
     ● Microservices
     ● Systems are scaling fast
     ● Downtime is really expensive
     ● Our dependencies will fail
     ● Pager fatigue and burnout really hurt

  9. Use cases:
     ● Outage reproduction
     ● On-call training
     ● Strengthening new products
     ● Battle-testing new infrastructure and services

  10. What do you need before doing Chaos Engineering?
      ● Monitoring/Observability
      ● On-Call and Incident Management
      ● Cost of Downtime Per Hour

  11. Chaos Engineering is not:
      ● Unexpected or unmonitored experiments
      ● Creating outages

  12. “Chaos Engineering Without Observability ... Is Just Chaos” -@mipsytipsy, Charity Majors, CEO of Honeycomb

  13. Minimize the blast radius
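
One way to keep the blast radius small is to run an experiment against only a small random subset of hosts. The sketch below is a minimal illustration under that assumption; `select_targets`, its `fraction` parameter, and the host names are hypothetical and do not come from the talk or from any Gremlin API.

```python
import math
import random


def select_targets(hosts, fraction=0.1, rng=None):
    """Pick a small random subset of hosts to limit the blast radius.

    Hypothetical helper: `fraction` is the share of hosts to affect.
    """
    if not hosts:
        return []
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = rng or random.Random()
    # Round down so the subset stays small, but always keep at least one host.
    count = max(1, math.floor(len(hosts) * fraction))
    return rng.sample(list(hosts), count)


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(20)]
    print(select_targets(fleet, fraction=0.25))
```

Starting with a small `fraction` and widening it only after the system behaves as expected follows the slide's advice to minimize the blast radius.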

  14. Level 0, The Beginning: Chaos Monkey. Value provided: prepare for host failures in the cloud. Approach taken: random. Maturity required: low.

  15. Level 1, The First Step: Infrastructure Failures. Value provided: prepare for host-level failures. Approach taken: disciplined. Maturity required: basic operations.

  16. Level 1.5, Intermediate: Network Failures. Value provided: prepare for high-impact events. Approach taken: gameday. Maturity required: networking expertise.

  17. Level 2, The Next Step: Application Failures. Value provided: safely validate the user experience. Approach taken: precision experiments. Maturity required: advanced.

  18. Latency added to 50% of Android traffic

  19. Exceptions: 50% of Android traffic failed
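
The two experiments above (latency and exceptions applied to half of the Android traffic) can be sketched as a wrapper around a request handler. This is an illustrative sketch only: the request shape (a dict with a `platform` key), the `inject_faults` name, and its parameters are assumptions, not the tooling used in the talk.

```python
import random
import time


def inject_faults(handler, *, platform="android", fraction=0.5,
                  latency_s=0.0, fail=False,
                  rng=random.random, sleep=time.sleep):
    """Wrap `handler` so a share of requests from one platform is degraded.

    Hypothetical sketch: a request is a dict with a "platform" key.
    `rng` and `sleep` are injectable so the behaviour can be tested.
    """
    def wrapped(request):
        # Only requests from the chosen platform are eligible, and only
        # `fraction` of those are actually affected.
        targeted = request.get("platform") == platform and rng() < fraction
        if targeted:
            if latency_s > 0:
                sleep(latency_s)  # latency experiment
            if fail:
                raise RuntimeError("injected failure")  # exception experiment
        return handler(request)
    return wrapped


if __name__ == "__main__":
    slow = inject_faults(lambda req: "ok", latency_s=0.2)
    print(slow({"platform": "android"}))
    print(slow({"platform": "ios"}))
```

Restricting the fault to one platform and one fraction of its traffic keeps the experiment precise, which matches the "precision experiments" approach described for Level 2.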

  20. You can and should inject chaos at every layer of your stack:
      ● Application
      ● API
      ● Caching
      ● Database
      ● Hardware
      ● Cloud Infrastructure / Bare metal

  21. Top places to inject chaos

  23. https://www.gremlin.com/community/tutorials/what-i-learned-running-the-chaos-lab-kafka-breaks/

  24. Getting started:
      ● Identify your top 5 critical systems
      ● Choose a system
      ● Whiteboard the system
      ● Determine what experiment you want to run (resource, state, network)
      ● Determine the blast radius
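
The checklist above can be captured as a small experiment plan that is written down before anything is injected. The sketch below is one possible shape for such a plan; the `Experiment` class, its field names, and the example system name `checkout-api` are hypothetical. Only the three experiment categories come from the slide.

```python
from dataclasses import dataclass

# The three experiment categories named on the slide.
CATEGORIES = {"resource", "state", "network"}


@dataclass(frozen=True)
class Experiment:
    """Hypothetical record of a planned chaos experiment."""
    system: str          # which critical system is under test
    category: str        # one of CATEGORIES
    hypothesis: str      # what we expect to happen
    blast_radius: float  # share of hosts or traffic affected, in (0, 1]

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not 0 < self.blast_radius <= 1:
            raise ValueError("blast_radius must be in (0, 1]")


if __name__ == "__main__":
    plan = Experiment(
        system="checkout-api",  # illustrative name
        category="network",
        hypothesis="Added latency triggers timeouts, not cascading failures",
        blast_radius=0.05,
    )
    print(plan)
```

Validating the category and blast radius up front makes it harder to launch an experiment that nobody has scoped.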

  25. Companies doing Chaos Engineering

  26. Chaos Days

  27. Chaos Days: a dedicated day for your entire company to focus on building resilience instead of new products. https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/

  28. “What could go wrong?” “Do we know what will happen if this breaks?”

  29. Chaos Day crew:
      ● VP Engineering / CTO / COO
      ● Executive Assistant
      ● Engineering Director / Manager
      ● Senior Engineer
      ● New Grad / Intern Engineer

  30. What experiments can you run?
      ● Reproduce outage conditions
      ● Unpredictable circumstances
      ● Large traffic spikes
      ● Race conditions
      ● Datacenter failure
      ● Time travel: system clocks out of sync
      ● Network errors
      ● CPU overloads
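
As an illustration of the "time travel" item, clock skew can be simulated in tests by passing a clock object into time-sensitive code instead of reading the system clock directly. The sketch below assumes that design; `SkewedClock` and `is_token_valid` are hypothetical names and do not change the real system clock.

```python
import time


class SkewedClock:
    """Hypothetical clock that reports time shifted by a fixed offset."""

    def __init__(self, offset_s=0.0, base=time.time):
        self.offset_s = offset_s
        self._base = base

    def now(self):
        # Real (or injected) time plus the simulated skew.
        return self._base() + self.offset_s


def is_token_valid(expires_at, clock):
    """Example of time-sensitive logic that takes the clock as an argument."""
    return clock.now() < expires_at


if __name__ == "__main__":
    expires_at = time.time() + 300           # token valid for 5 minutes
    accurate = SkewedClock()
    fast = SkewedClock(offset_s=3600)        # host clock runs one hour ahead
    print(is_token_valid(expires_at, accurate))
    print(is_token_valid(expires_at, fast))
```

Running the same check under an accurate and a skewed clock shows how a host with a drifting clock could reject tokens that are still valid elsewhere.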

  36. gremlin.com/chaos-monkey/

  37. Learn more: Join the Chaos, join Slack: bit.ly/chaos-eng-slack (1,900+ members across the world)

  38. THANKS! ana@gremlin.com @ana_m_medina
