Building Confidence in Healthcare Systems Through Chaos Engineering
Carl Chesser @che55er | che55er.io
About me
Our Story | Traffic Management Patterns | Introducing Chaos Experiments | Summary
Our Story
The Challenge
Changing existing service deployments in a complex deployment environment supporting critical workloads.
Incrementally Build, Allowing for Change
We wanted to control our deployments through declarative configuration that would allow us to continue to change.
Complexity was Growing within our Systems
As we pursued more ways of increasing availability and infrastructure features, our systems grew in size and complexity.
Cross Functional Team Alignment for Experiments
As we built and tested our infrastructure, we wanted cross team alignment when evaluating the layers of the infrastructure.
Introducing OpenStack
We wanted to have clean availability zone separation, and therefore wanted to ensure we didn't have shared resources.
Introduce the Tiger Team
We had three different organizations, but wanted one cross functional team. Platform and Operations were already located together, and we needed to get our Infrastructure team located with them as well.
Starting with DC/OS
When we began our journey, we were leveraging DC/OS to manage our workloads (via Marathon). With our usage of DC/OS and OpenStack, we needed to better understand the reactions of these systems in common failure modes.
Validate Early Concerns
We introduced gamedays to start validating concerns of the whole system. We simulated traffic through the system while killing VMs, powering off a hypervisor, stopping availability zones, and stopping shared infrastructure in DC/OS.
Evolving to Kubernetes
As we lived with our current system, we knew we would need to evolve it to Kubernetes.
"Look, there on the horizon!"
Competing Time in Growing Both Systems
As we were evolving our system, we wanted to collapse the amount of effort and time to start comparing effects of production workloads.
Leveraging Spinnaker
When we built our deployments for DC/OS, we added support for DC/OS to Spinnaker. We then leveraged it to deploy to both systems as we compared the behavior in Kubernetes.
Fear of Running Experiments on Live Traffic
We are not Netflix! The introduction of chaos experiments on live production systems, even for a small percentage of traffic, can seem too risky.
Larry becomes defensive when first approached about applying chaos experiments in production at ACME corporation.
Introduced Shadow Traffic
Rather than delaying when we could start evaluating our newer system, we could leverage a replay of production traffic.
Traffic Management Patterns
API Gateway to Facilitate Change
We evolved our systems many times by leveraging a control gate into our system. The gateway is used as an abstraction of the backing system.
Chaining Traffic
Supports an API gateway to simply call another gateway, versus the backing set of services.
Canary Traffic
Supports gradually transitioning a subset of traffic to a different target by leveraging chaining. Avoid the Big Bang.
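
As a rough illustration of the canary pattern, the sketch below routes a configurable percentage of requests to a second, chained target. The `CanaryRoute` class, its field names, and the example URLs are hypothetical and are not taken from the talk; hashing a request key is just one common way to keep routing stable for a given caller.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryRoute:
    """Hypothetical gateway route: a stable target plus a chained canary target."""
    stable_target: str   # e.g. the existing backing services
    canary_target: str   # e.g. another gateway in front of the new system
    canary_percent: int  # 0-100, raised gradually over time

    def choose_target(self, request_key: str) -> str:
        # Hash the key into a bucket 0-99 so the same key is routed consistently.
        digest = hashlib.sha256(request_key.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        if bucket < self.canary_percent:
            return self.canary_target
        return self.stable_target


if __name__ == "__main__":
    # Placeholder URLs, assumed for illustration only.
    route = CanaryRoute("https://old-gateway.example", "https://new-gateway.example", 10)
    keys = [f"client-{i}" for i in range(1000)]
    hits = sum(route.choose_target(k) == route.canary_target for k in keys)
    print(f"{hits} of {len(keys)} requests routed to the canary")
```

Raising `canary_percent` in small steps moves traffic gradually, and setting it back to 0 returns all traffic to the stable target.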
Shadowing Traffic
Replays a percentage of traffic to another backend. Background replay of safe requests (read-only, HTTP GET). Build in a bulkhead for your resource pool supporting the replay of traffic to avoid unnecessary stress on your service at bursts of traffic.
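
One possible shape for shadowing with a bulkhead is sketched below: a sampled share of GET requests is handed to a small dedicated thread pool, and replays are dropped rather than queued when that pool is busy. `ShadowReplayer`, the `send` callable, and the `max_in_flight` parameter are assumptions made for this example, not the gateway implementation described in the talk.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


class ShadowReplayer:
    """Hypothetical helper: replays a sample of safe requests to a shadow backend."""

    def __init__(self, send: Callable[[str, str], object], percent: float,
                 max_in_flight: int = 4, rng: Optional[random.Random] = None):
        self._send = send          # assumed callable: send(method, path)
        self._percent = percent    # share of traffic to replay, 0-100
        self._rng = rng or random.Random()
        # Bulkhead: a dedicated pool, with a semaphore capping in-flight replays.
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self.dropped = 0           # replays shed because the bulkhead was full

    def maybe_replay(self, method: str, path: str) -> bool:
        """Return True if the request was handed off for background replay."""
        if method.upper() != "GET":                 # only replay read-only requests
            return False
        if self._rng.random() * 100 >= self._percent:
            return False                            # not sampled this time
        if not self._slots.acquire(blocking=False):
            self.dropped += 1                       # shed the replay, never queue it
            return False
        future = self._pool.submit(self._send, method, path)
        future.add_done_callback(lambda _f: self._slots.release())
        return True

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

Because the replay runs on its own bounded pool, a burst of traffic or a slow shadow backend costs at most `max_in_flight` threads and never delays the live request path. The `dropped` counter is only safe as written when `maybe_replay` is called from a single thread.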
Shadow Allows Early Testing
Rather than imposing a canary early with experiments, where a small percentage of failure still introduces undesirable risk, look to leverage a shadow of traffic.
Learning from Production as we built the New
We were able to further compare and evaluate the system as we expanded the new and applied gameday exercises.
Transitioning to Kubernetes became Simple*
We identified an issue in our existing system, and through our continual assessment of the new system and practicing traffic management, it became a simple* choice.
* Simple by it being well understood, practiced, and supported by data.
Applied in our Cross Site Kubernetes Support
We deploy services across data center sites, and we were able to leverage traffic across sites for a site incident.
Introducing Chaos Experiments (gamedays)
Align the Introduction of Chaos with Organized Experiments
Optimize engineering focus on the introduction of chaos as planned experiments. Minimize the opportunity for chaos to become a scapegoat for mysterious issues.
Prepare for the Experiment
Describe the scenario, what is expected to occur, how it will be measured, who is needed. Identify prerequisites that are needed to be completed (ex. improved telemetry on connection refresh of data store).
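
The preparation items above could be captured in a small record so that nothing is missed before the gameday starts. This is only a sketch: the `ExperimentPlan` and `Prerequisite` classes and their field names are hypothetical, and the sample values are illustrative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Prerequisite:
    description: str
    done: bool = False


@dataclass
class ExperimentPlan:
    """Hypothetical gameday plan record; field names are illustrative only."""
    scenario: str                # what will be done to the system
    expected_outcome: str        # what is expected to occur
    measurements: List[str]      # how it will be measured
    participants: List[str]      # who is needed
    prerequisites: List[Prerequisite] = field(default_factory=list)

    def blockers(self) -> List[str]:
        """Prerequisites that still need to be completed before the gameday."""
        return [p.description for p in self.prerequisites if not p.done]

    def ready(self) -> bool:
        return not self.blockers()


if __name__ == "__main__":
    plan = ExperimentPlan(
        scenario="Power off one hypervisor while simulated traffic flows",
        expected_outcome="Workloads are rescheduled and requests keep succeeding",
        measurements=["request error rate", "time to reschedule"],
        participants=["Platform", "Operations", "Infrastructure"],
        prerequisites=[
            Prerequisite("Improved telemetry on connection refresh of data store"),
        ],
    )
    print("ready:", plan.ready(), "| blockers:", plan.blockers())
```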
Observability is Critical
You need easy access to essential telemetry data of all the parts of the system. You want to be able to ask different and new questions of your system without having to change the system. When you discover a gap in visibility, focus on how to make it easy to rebuild your system with the improvement through low coordination.
Utilize a Dedicated Space
Have a common space (physical/virtual) where everyone attends during the experiment. You want to optimize communication when assessing the experiment. Schedule an adequate amount of time for multiple iterations (ex. whole afternoon).
Understand and Embrace needed Compliance
Production systems will bear more compliance and controls. Much of this is around risk, so focus on the introduction through low risk scenarios (ex. non-live systems being built).
Plan to be Surprised
We almost always learned something new about the larger system and the effects of compounding failures. Capture what was surprising (actual results vs. expected results) in an open and searchable repository. Plan added time to digest the surprises.
Cross Functional Involvement
Helps share knowledge on how different layers of a system are viewed during the experiment. Diverse perspectives can accelerate and improve group learning.
Prepares Your Team
Your entire team may not be able to participate, but they should be able to learn from the findings. Experiments help you practice how you look into the system and where signals normally arise, and they identify gaps in essential telemetry for broader insight.
Summary
Work to build cross functional teams to maximize learning
Plan for your experiments
Identify how to make it easy to improve observability into your system
Remind your teams and leadership on measurable improvements through this practice
Identify how you can minimize risk through traffic management approaches