The Practice of Chaos Engineering
Ana Medina, Chaos Engineer at Gremlin
@ana_m_medina
#reactive18 @ana_m_medina
Gremlin, Uber, SFEFCU, Google, Quicken Loans, Stanford University, Miami Dade College
How many of you have heard of Chaos Engineering?
How many of you have run a Chaos Engineering experiment?
What is Chaos Engineering?
Chaos Engineering: Thoughtful, planned experiments designed to reveal the weaknesses in our systems.
Chaos Engineering: “Inject something harmful to build an immunity.” -@KoltonAndrus, Gremlin Founder and CEO
Why?
● Microservices
● Systems are scaling fast
● Downtime is really expensive
● Our dependencies will fail
● Pager fatigue and burnout really hurt
Use Cases:
● Outage reproduction
● On-call training
● Strengthen new products
● Battle test new infrastructure and services
What do you need before doing Chaos Engineering?
● Monitoring/Observability
● On-Call and Incident Management
● Cost of Downtime Per Hour
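A rough way to put a number on "Cost of Downtime Per Hour" is sketched below. The formula (annual revenue spread evenly over the year, plus the hourly cost of the people responding) is an assumption for illustration, not a figure or method from the talk, and `downtime_cost_per_hour` and its parameters are hypothetical names.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def downtime_cost_per_hour(annual_revenue, responder_hourly_cost=0.0, responders=0):
    """Estimate what one hour of full downtime costs.

    Assumes revenue is earned evenly across the year, which understates
    the cost of an outage during peak traffic.
    """
    lost_revenue = annual_revenue / HOURS_PER_YEAR
    response_cost = responder_hourly_cost * responders
    return lost_revenue + response_cost


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to make the arithmetic easy to follow.
    print(downtime_cost_per_hour(8_760_000, responder_hourly_cost=75, responders=4))
```

Even a coarse estimate like this gives a baseline for deciding how much engineering time resilience work deserves.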
Chaos Engineering is not
● Unexpected or unmonitored experiments
● Creating outages
“Chaos Engineering Without Observability ... Is Just Chaos” -@mipsytipsy, Charity Majors, CEO of Honeycomb
Minimize the blast radius
Level 0, The Beginning: Chaos Monkey
● Value provided: Prepare for host failures in the cloud
● Approach taken: Random
● Maturity required: Low
Level 1, The First Step: Infrastructure Failures
● Value provided: Prepare for host-level failures
● Approach taken: Disciplined
● Maturity required: Basic Operations
Level 1.5, Intermediate: Network Failures
● Value provided: Prepare for high impact events
● Approach taken: Gameday
● Maturity required: Networking expertise
Level 2, The Next Step: Application Failures
● Value provided: Safely validate the user experience
● Approach taken: Precision Experiments
● Maturity required: Advanced
Latency: added to 50% of Android traffic
Exceptions: 50% of Android traffic failed
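An application-level experiment like the two above can be sketched as a wrapper around a request handler. This is a minimal illustration, not Gremlin's implementation: the `Experiment` fields, the `with_chaos` wrapper, and the dict-shaped request with a `platform` key are all hypothetical.

```python
import random
import time
from dataclasses import dataclass


class InjectedFault(Exception):
    """Raised instead of running the handler while an experiment is active."""


@dataclass
class Experiment:
    platform: str            # client traffic to target, e.g. "android"
    fraction: float          # share of matching requests to impact (0.0 to 1.0)
    latency_s: float = 0.0   # delay added before the request is handled
    fail: bool = False       # raise InjectedFault instead of handling


def with_chaos(handler, experiment, rng=None, sleep=time.sleep):
    """Wrap a handler so a fraction of targeted requests is delayed or failed."""
    rng = rng or random.Random()

    def wrapped(request):
        targeted = request.get("platform") == experiment.platform
        # Only targeted traffic is sampled; everything else passes through.
        if targeted and rng.random() < experiment.fraction:
            if experiment.latency_s > 0:
                sleep(experiment.latency_s)
            if experiment.fail:
                raise InjectedFault("injected by chaos experiment")
        return handler(request)

    return wrapped


if __name__ == "__main__":
    slow_android = Experiment(platform="android", fraction=0.5, latency_s=0.3)
    handler = with_chaos(lambda request: "ok", slow_android)
    print(handler({"platform": "android"}))
```

Filtering on platform before sampling keeps the blast radius limited to the chosen slice of traffic; requests from other clients never see the fault.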
You can and should inject chaos at every layer of your stack
● Application
● API
● Caching
● Database
● Hardware
● Cloud Infrastructure / Bare metal
Top places to inject chaos
https://www.gremlin.com/community/tutorials/what-i-learned-running-the-chaos-lab-kafka-breaks/
Getting Started:
● Identify top 5 critical systems
● Choose system
● Whiteboard the system
● Determine what experiment you want to run (resource, state, network)
● Determine Blast Radius
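The getting-started steps can be captured as a small plan that is reviewed before anything is injected. This is a sketch under assumptions: `ExperimentPlan`, `review`, measuring blast radius as a share of hosts, and the 25% default limit are all hypothetical choices, not part of the talk.

```python
from dataclasses import dataclass
from enum import Enum


class ExperimentType(Enum):
    RESOURCE = "resource"
    STATE = "state"
    NETWORK = "network"


@dataclass(frozen=True)
class ExperimentPlan:
    system: str                      # the system chosen from the critical list
    experiment_type: ExperimentType  # resource, state, or network
    targeted_hosts: int              # hosts the experiment is allowed to touch
    total_hosts: int                 # hosts currently running the system

    @property
    def blast_radius(self) -> float:
        """Share of the system's hosts that the experiment can affect."""
        return self.targeted_hosts / self.total_hosts


def review(plan, critical_systems, max_blast_radius=0.25):
    """Return a list of problems; an empty list means the plan looks sane."""
    problems = []
    if plan.system not in critical_systems:
        problems.append(f"{plan.system} is not on the critical systems list")
    if not 0 < plan.targeted_hosts <= plan.total_hosts:
        problems.append("targeted_hosts must be between 1 and total_hosts")
    elif plan.blast_radius > max_blast_radius:
        problems.append("blast radius exceeds the agreed limit")
    return problems


if __name__ == "__main__":
    critical = ["checkout", "login", "search", "payments", "notifications"]
    plan = ExperimentPlan("checkout", ExperimentType.NETWORK, 2, 20)
    print(plan.blast_radius, review(plan, critical))
```

Writing the plan down in a checkable form makes the blast radius an explicit decision rather than a side effect of how the experiment is run.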
Companies doing Chaos Engineering
Chaos Days
Chaos Days: A dedicated day for your entire company to focus on building resilience instead of new products.
https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/
“What could go wrong?” “Do we know what will happen if this breaks?”
Chaos Day Crew:
● VP Engineering / CTO / COO
● Executive Assistant
● Engineering Director / Manager
● Senior Engineer
● New Grad / Intern Engineer
What experiments can you run?
● Reproduce outage conditions
● Unpredictable circumstances
● Large traffic spikes
● Race conditions
● Datacenter failure
● Time travel: system clocks out of sync
● Network errors
● CPU overloads
gremlin.com/chaos-monkey/
Learn more: Join the Chaos, Join Slack: bit.ly/chaos-eng-slack (1,900+ members across the world)
THANKS! ana@gremlin.com @ana_m_medina