Controlled Chaos: Taming Organic, Federated Growth of Microservices


  1. Controlled Chaos: Taming Organic, Federated Growth of Microservices

  2. Southwest meltdown, July 20, 2016. Key figures: 1 router, 5 days, $80M losses, -$3.4B market cap. Timeline: one of 2,000 routers fails; all monitors green; billions of packets in, 0 out; 30 min to discovery; 12 hours rebooting; 5 days to full recovery. Image: goodfreephotos.com Source: https://www.washingtonpost.com/lifestyle/travel/airline-computer-outages-like-deltas-are-bound-to-repeat-themselves-heres-what-to-know/2016/08/11/578a83cc-5d8d-11e6-9d2f-b1a3564181a1_story.html

  3. Mission control: service landscape, behaviors, operational patterns. Tobias Kunze, Co-Founder & CEO, tobias@glasnostic.com, @tkunze. Image: NASA

  4. The agile operating model. Agile Organization: Hierarchy, Autonomous Teams, Rapid Learning & Decision Cycles. Mission Control: IT Ops, DevOps, SRE. Service Landscape: Monolith, Microservices, Shared Services, Organic Federated Growth. Traffic Control: Integration, APIs, Gateways, Service Mesh. Cloud Ecosystem: VMs, PaaS, SaaS, Containers, Serverless

  5. Security: evolving topologies, ephemeral actors. Loss of perimeter; loss of blueprint; changing identities; volatile behaviors. Fundamentally new challenge

  6. Stability: complex emergent behaviors. Large-scale; complex; non-linear; unpredictable. Fundamentally new challenge

  7. Can’t engineer away: resource limits, scaling behaviors, request behaviors, pool sizing. YAML fragments shown on the slide: replicas: 3; spec.containers resources.limits (nvidia.com/gpu: 1, memory: 400Mi); resources.requests (cpu: 200m, memory: 100Mi); livenessProbe (initialDelaySeconds: 30, periodSeconds: 3); http (timeout: 10s; retries: attempts: 3, perTryTimeout: 2s); nginx resources.limits (memory: 1Gi). Image: teachersource.com
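
The retry settings on this slide (`attempts: 3`, `perTryTimeout: 2s`, `timeout: 10s`) look harmless for a single service. A rough sketch of why such per-service tuning may not compose: if every hop in a call chain retries independently, worst-case load multiplies per hop. The function names are hypothetical, and reading `attempts` as the total number of tries per hop is an assumption, not something the slide states.

```python
def worst_case_calls(attempts_per_hop: int, depth: int) -> int:
    """Worst-case number of calls that reach the deepest service for one
    user request, if each of `depth` hops independently makes up to
    `attempts_per_hop` tries against the next hop."""
    if attempts_per_hop < 1 or depth < 0:
        raise ValueError("attempts_per_hop must be >= 1 and depth >= 0")
    return attempts_per_hop ** depth


def worst_case_time_s(attempts: int, per_try_timeout_s: float,
                      overall_timeout_s: float) -> float:
    """Upper bound on the time a single hop spends on one request:
    all tries time out, capped by the overall request timeout."""
    return min(attempts * per_try_timeout_s, overall_timeout_s)


if __name__ == "__main__":
    # Three hops, each configured with attempts: 3
    print(worst_case_calls(3, 3))
    # attempts: 3, perTryTimeout: 2s, timeout: 10s
    print(worst_case_time_s(3, 2.0, 10.0))
```

With three hops at three tries each, one user request can fan out to 27 calls at the deepest service, which is the kind of emergent behavior no single team's YAML accounts for.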

  8. Environment over code

  9. Source: nats.aero

  10. Successful missions are run. Image: NASA

  11. Coping strategies: do nothing, monitor nodes, trace requests

  12. Service mesh: natural landscape evolves faster than baroque YAML. Source: gagliardiphotography.com

  13. Golden signals: requests, latency, concurrency, bandwidth. Image source: bbc.co.uk
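
The slide only names the four signals. As a minimal sketch of how they might be derived from observed calls between two services, the following assumes a hypothetical `Call` record and one simple definition per signal (rate, mean duration, average number in flight, bytes per second); none of these definitions come from the talk.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Call:
    """One observed call on a route (hypothetical record layout)."""
    start_s: float
    end_s: float
    bytes_sent: int


def golden_signals(calls: Sequence[Call], window_s: float) -> Dict[str, float]:
    """Summarize calls that fall entirely inside one observation window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    count = len(calls)
    busy_s = sum(c.end_s - c.start_s for c in calls)
    return {
        # Requests: calls per second over the window
        "requests_per_s": count / window_s,
        # Latency: mean call duration
        "latency_s": busy_s / count if count else 0.0,
        # Concurrency: average number of calls in flight
        "concurrency": busy_s / window_s,
        # Bandwidth: bytes per second over the window
        "bandwidth_bytes_per_s": sum(c.bytes_sent for c in calls) / window_s,
    }


if __name__ == "__main__":
    sample = [Call(0.0, 1.0, 100), Call(0.5, 2.5, 300)]
    print(golden_signals(sample, window_s=10.0))
```

Note that concurrency here equals request rate times mean latency (Little's law), so the three can be cross-checked against each other.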

  14. Operational patterns: control systemic failures, assure performance, deploy with confidence, build resilience. Patterns: Bulkhead, Circuit Breaker, Quarantine, Fault Injection, Backpressure, Quality of Service, Canary, Brownout, Segmentation, Blast Radius
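
To make one of the listed patterns concrete, here is a minimal sketch of circuit-breaker state logic. The talk argues for applying such patterns in the environment rather than in service code, so this in-process version only illustrates the logic; the class, its parameters, and the thresholds are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock            # injectable so tests need no sleeping
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; call rejected")
            # Reset period elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures,
            # (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(lambda: fetch(url))`, where `fetch` stands in for whatever client function is in use.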

  15. Ex. 1: cache thrashing. Behavior: 1. New organic growth; 2. Upstream fan-out changed; 3. Shared cache thrashes; 4. Wide, unspecific slowness. Remediation: 1. See widespread slowness; 2. Identify bottleneck; 3. Correlate with deployment; 4. Quarantine deployment. Diagram labels: Service Landscape, Organic Growth, Emergent Behavior, Quarantine
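
The remediation steps in this example (see slowness, correlate with a deployment, quarantine it) can be sketched roughly as follows. The data shapes, the fixed latency threshold, and the "most recent deployment before the first breach" heuristic are assumptions made for illustration, not the speaker's method.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]      # (timestamp_s, latency_ms)
Deployment = Tuple[float, str]    # (timestamp_s, deployment name)


def first_breach(samples: Sequence[Sample], threshold_ms: float) -> Optional[float]:
    """Timestamp of the first latency sample above the threshold, if any."""
    for timestamp, latency_ms in samples:
        if latency_ms > threshold_ms:
            return timestamp
    return None


def quarantine_candidate(samples: Sequence[Sample],
                         deployments: Sequence[Deployment],
                         threshold_ms: float) -> Optional[str]:
    """Most recent deployment at or before the first latency breach."""
    breach = first_breach(samples, threshold_ms)
    if breach is None:
        return None
    prior = [(t, name) for t, name in deployments if t <= breach]
    if not prior:
        return None
    return max(prior)[1]


if __name__ == "__main__":
    latency = [(0, 20.0), (60, 25.0), (120, 180.0), (180, 240.0)]
    deploys = [(30, "svc-a v2"), (100, "svc-b v7"), (150, "svc-c v1")]
    print(quarantine_candidate(latency, deploys, threshold_ms=100.0))
```

The result is only a candidate: the point of quarantining is that it is cheap to try and cheap to undo, without first establishing a root cause.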

  16. Ex. 2: cascading failure at Target. Behavior: 1. K8s flip-flopping. Remediation: 1. See logging spikes; 2. Backpressure, circuit-break. Diagram sequence: 1 Cluster maintenance; 2 Network outage; 3 Intermittently available; 4 CPU spike; 5 Starvation; 6 Unhealthy; 7 Migrate; 8 CPU spike; 9 Starvation; 10 Unhealthy; 11 Migrate. Diagram components: Service, Kafka, Sidecar, OpenStack, K8s, Docker, Node; logging spikes. Patterns applied: Backpressure, Circuit Breaker. Source: https://medium.com/@daniel.p.woods/on-infrastructure-at-scale-a-cascading-failure-of-distributed-systems-7cff2a3cd2df
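
The slide names backpressure as a remediation for the logging spikes. One common way to exert backpressure on a noisy sender is a token bucket; the sketch below assumes that technique and uses hypothetical names and limits, since the talk does not say how the backpressure is implemented.

```python
import time
from typing import Callable


class TokenBucket:
    """Admit at most `rate_per_s` events per second on average,
    with bursts of up to `burst` events."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = burst
        self._tokens = float(burst)    # start with a full bucket
        self._clock = clock            # injectable so tests need no sleeping
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the event may pass now."""
        now = self._clock()
        refill = (now - self._last) * self.rate_per_s
        self._tokens = min(self.capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A sender that gets `False` would drop, sample, or delay the message, which keeps a logging spike from starving the shared pipeline downstream.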

  17. Ex. 3: security breach. Behavior: 1. DoS; 2. Segmentation violation. Remediation: 1. Identify sources; 2. Segment. Diagram labels: Org 1 to Org m, Room 1 to Room n, Participant, Gateway, Relay, Sources, Segmentation. Security, Governance
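
A segmentation check of the kind this example implies can be sketched as a lookup over observed flows. The endpoint names and the rule used here (a flow is allowed only when both endpoints belong to the same known segment) are illustrative assumptions, not taken from the slide.

```python
from typing import Dict, Iterable, List, Tuple

Flow = Tuple[str, str]   # (source endpoint, destination endpoint)


def segmentation_violations(flows: Iterable[Flow],
                            segment_of: Dict[str, str]) -> List[Flow]:
    """Return flows that cross segments or involve an unknown endpoint."""
    violations: List[Flow] = []
    for source, destination in flows:
        src_segment = segment_of.get(source)
        dst_segment = segment_of.get(destination)
        if src_segment is None or dst_segment is None or src_segment != dst_segment:
            violations.append((source, destination))
    return violations


if __name__ == "__main__":
    segments = {"participant-a": "org-1", "room-1": "org-1", "room-n": "org-m"}
    observed = [("participant-a", "room-1"),
                ("participant-a", "room-n"),
                ("unknown-host", "room-1")]
    print(segmentation_violations(observed, segments))
```

Treating unknown endpoints as violations is a deliberately strict choice, since the deck describes identities and topologies as constantly changing.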

  18. Runtime control examples: Deploy to Production

  19. Runtime control examples: Deploy to Production, Architect in Real Time

  20. Architect in real time. Diagram labels: 1,000,000’s of Cars; 100’s of Applications; Car Access; Managed IoT; IoT Access; Access; Stream Processing; Auth; Data Sanitization; Streaming Access; Diagnostics; Car Data; Data Synthesis; Analytics; Query Pipelines; Planning; Privacy; Service Layer

  21. Runtime control examples: Deploy to Production, Architect in Real Time, Define Structure

  22. Summary. New reality: agile operating model, service landscape, environment over code. Golden signals: rapid MTTR, operational patterns

  23. Takeaways. Developers: 1. Avoid distributed systems; 2. Resilient federations; 3. Compensate; 4. Redundancy, plausibility; 5. Debug at unit level; 6. Defer design to runtime. Operators: 1. Environment over code; 2. Rapid detect–react loops; 3. Signals, patterns; 4. No root causes: remediate; 5. No process debugging; 6. Architect at runtime. Applies to all decentralized architectures, not just microservices.

  24. Mission Control for Agile Architectures. glasnostic.com. Tobias Kunze, Co-Founder & CEO, tobias@glasnostic.com, @tkunze. Image: NASA
