
Building and running applications at scale in Zalando



  1. Building and running applications at scale in Zalando - Online fashion store - Checkout case - By Pamela Canchanya

  2. About Zalando

  3. About Zalando - > 250 million visits per month - ~ 5.4 billion EUR revenue 2018 - > 300.000 product choices - > 26 million active customers - > 15.500 employees in Europe - > 70% of visits via mobile devices - 17 countries - ~ 2.000 brands

  4. Black Friday at a glance

  5. Zalando Tech

  6. From monolith to microservice architecture - Reorganization - > 1000 microservices

  7. Tech organization - > 200 development teams - > 1100 developers - Platform

  8. End to end responsibility

  9. Checkout goal: “Allow customers to buy seamlessly and conveniently”

  10. Checkout landscape - Programming languages: Java, Scala, Node JS - Client side: React - Communication: REST & messaging - Data storage: Cassandra - Configurations: ETCD - Container: Docker - Infrastructure: AWS & Kubernetes - Many more

  11. Checkout architecture (diagram: Skipper, Tailor, Frontend fragments, Backend for frontend, Checkout service, Cassandra, Dependencies)

  12. Checkout is a critical component in the shopping journey - Direct impact on business revenue - Direct impact on customer experience

  13. Checkout challenges in a microservice ecosystem - Increased points of failure - Multiple dependencies evolving independently

  14. Lessons learnt building Checkout with - Reliability patterns - Scalability - Monitoring

  15. Building microservices with reliability patterns

  16. Checkout confirmation page (diagram: Delivery Destination, Delivery Service, Payments Service, Cart)

  17. Checkout confirmation page Delivery Service

  18. Unwanted error

  19. Doing retries: for (var i = 1; i <= numRetries; i++) { try { return getDeliveryOptionsForCheckout(cart); } catch (error) { if (i >= numRetries) { throw error; } } }

  20. Retry for transient errors like a network error or service overload

  21. Retries for some errors: try { getDeliveryOptionsForCheckout(cart) match { case Success() => /* return result */ case TransientFailure => /* retry operation */ case Error => /* throw error */ } } catch { case e: Exception => println("Delivery options exception") }
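
Retrying only transient errors can be sketched in JavaScript as follows. This is a minimal illustration, not the Checkout implementation: `TransientError` and `retryOnTransient` are hypothetical names introduced here, and the operation is assumed to be synchronous.

```javascript
// Hypothetical error type marking failures worth retrying
// (for example a network error or service overload).
class TransientError extends Error {}

// Calls `operation` up to `maxAttempts` times.
// Only TransientError triggers another attempt; any other error is rethrown at once.
function retryOnTransient(operation, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return operation();
    } catch (error) {
      const isLastAttempt = attempt === maxAttempts;
      if (!(error instanceof TransientError) || isLastAttempt) {
        throw error; // permanent error, or retries exhausted
      }
    }
  }
  throw new Error("maxAttempts must be at least 1");
}
```

A caller would wrap the delivery call, for example `retryOnTransient(() => getDeliveryOptionsForCheckout(cart), 3)`, and still has to handle the error that is thrown once the attempts run out.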

  22. Retries with exponential backoff (diagram: Attempt 1, Attempt 2, Attempt 3, each labelled 100 ms, separated by exponential backoff time)

  23. Retries get exhausted and failures become permanent
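
One way to compute and apply exponential backoff is sketched below. The 100 ms base delay, the 5000 ms cap, and the names `backoffDelayMs` and `retryWithBackoff` are assumptions made for this example. The slide does not say how the delays were chosen.

```javascript
// Delay before retry number `retry` (1-based): base, 2x base, 4x base, ...
// capped at `maxMs`. Both defaults are assumed values.
function backoffDelayMs(retry, baseMs = 100, maxMs = 5000) {
  return Math.min(maxMs, baseMs * 2 ** (retry - 1));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs an async operation and waits an exponentially growing delay
// between failed attempts. Rethrows the last error when attempts run out.
async function retryWithBackoff(operation, maxAttempts = 3, baseMs = 100) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= maxAttempts) throw error; // retries exhausted
      await sleep(backoffDelayMs(attempt, baseMs));
    }
  }
}
```

Production setups often add random jitter to the delay so that many clients do not retry at the same instant. That is left out here to keep the sketch short.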

  24. Prevent execution of operations that are likely to fail

  25. Circuit breaker pattern - Reference: Circuit breaker pattern, Martin Fowler blog post

  26. Open circuit: operations fail immediately - Error rate > threshold (50%) - Target getDeliveryOptionsForCheckout = failure

  27. Fallback as an alternative to failure - Unwanted failure: no Checkout - Fallback: only Standard delivery service with a default delivery promise
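
The open-circuit behaviour can be sketched as a small class. This is a simplified illustration under stated assumptions: the `CircuitBreaker` class, its option names, the minimum call count, and the reset period are all hypothetical. Only the 50% error-rate threshold comes from the slide.

```javascript
// Simplified circuit breaker: opens when the error rate exceeds a threshold,
// fails fast while open, and starts counting afresh after a reset period.
class CircuitBreaker {
  constructor({ errorThreshold = 0.5, minCalls = 4, resetAfterMs = 1000, now = Date.now } = {}) {
    this.errorThreshold = errorThreshold; // 50%, as on the slide
    this.minCalls = minCalls;             // assumed: calls needed before judging
    this.resetAfterMs = resetAfterMs;     // assumed: how long the circuit stays open
    this.now = now;                       // injectable clock, useful for tests
    this.calls = 0;
    this.failures = 0;
    this.openedAt = null;                 // null means the circuit is closed
  }

  call(operation) {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.resetAfterMs) {
        throw new Error("circuit open"); // fail immediately, target is not called
      }
      // Reset period elapsed: close the circuit and start counting again.
      this.openedAt = null;
      this.calls = 0;
      this.failures = 0;
    }
    this.calls++;
    try {
      return operation();
    } catch (error) {
      this.failures++;
      const errorRate = this.failures / this.calls;
      if (this.calls >= this.minCalls && errorRate > this.errorThreshold) {
        this.openedAt = this.now(); // open the circuit
      }
      throw error;
    }
  }
}
```

Libraries that implement this pattern usually track errors over a sliding window and add an explicit half-open state. This sketch keeps cumulative counters to stay short.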

  28. Putting all together - Do retries of operations with exponential backoff - Wrap operations with a circuit breaker - Handle failures with fallbacks when possible - Otherwise make sure to handle the exceptions - circuitCommand( getDeliveryOptionsForCheckout(cart).retry(2) ).onSuccess(/* do something with result */).onError(getDeliveryOptionsForCheckoutFallback)
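
The composition of retry, circuit breaker and fallback can be sketched with plain higher-order functions. Everything below is illustrative: `withRetry`, `withCircuitBreaker` and `withFallback` are hypothetical helpers, not the API of the library behind `circuitCommand`. The delivery call is a stand-in that always fails, and backoff between retries is omitted to keep the example synchronous.

```javascript
// Hypothetical stand-in for the real delivery call; here it always fails.
function getDeliveryOptionsForCheckout(cart) {
  throw new Error("delivery service unavailable");
}

// Fallback from the slides: offer only standard delivery.
function getDeliveryOptionsForCheckoutFallback(cart) {
  return ["STANDARD"];
}

// Retries a failing call up to `retries` extra times (no backoff in this sketch).
function withRetry(operation, retries) {
  return (...args) => {
    for (let attempt = 0; ; attempt++) {
      try {
        return operation(...args);
      } catch (error) {
        if (attempt >= retries) throw error;
      }
    }
  };
}

// Opens after `maxFailures` consecutive failures and fails fast for `cooldownMs`.
// After the cooldown one trial call is let through.
function withCircuitBreaker(operation, maxFailures, cooldownMs) {
  let consecutiveFailures = 0;
  let openedAt = 0;
  return (...args) => {
    const open = consecutiveFailures >= maxFailures;
    if (open && Date.now() - openedAt < cooldownMs) {
      throw new Error("circuit open"); // do not call the target
    }
    try {
      const result = operation(...args);
      consecutiveFailures = 0; // a success closes the circuit
      return result;
    } catch (error) {
      consecutiveFailures++;
      openedAt = Date.now();
      throw error;
    }
  };
}

// Returns the fallback result whenever the wrapped call throws.
function withFallback(operation, fallback) {
  return (...args) => {
    try {
      return operation(...args);
    } catch (error) {
      return fallback(...args);
    }
  };
}

const getDeliveryOptions = withFallback(
  withCircuitBreaker(withRetry(getDeliveryOptionsForCheckout, 2), 3, 60000),
  getDeliveryOptionsForCheckoutFallback
);
```

With the always-failing stand-in, `getDeliveryOptions(cart)` returns the standard-delivery fallback instead of throwing. The order of the wrappers matters: the breaker counts a call as failed only after its retries are used up.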

  29. Scaling microservices

  30. Traffic pattern

  31. Traffic pattern

  32. Microservice infrastructure (diagram: incoming requests reach a load balancer and are distributed by instance across three instances; each instance runs a container with a Node env or JVM env) - Use Zalando base image

  33. Scaling horizontally (diagram: load balancer, three instances, container)

  34. Scaling horizontally (diagram: load balancer, four instances, container)

  35. Scaling vertically (diagram: load balancer, three instances, container)

  36. Scaling vertically (diagram: load balancer, three instances, container)

  37. Scaling consequences - Cassandra: more service connections, more saturation, and risk of an unhealthy database

  38. Microservices cannot be scalable if downstream microservices cannot scale
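
The effect on the database can be made concrete with a little arithmetic. The sketch below assumes each service instance keeps a fixed-size connection pool. The function names and all numbers in the usage note are hypothetical, not Zalando figures.

```javascript
// Total connections the service opens to the database.
function totalConnections(instances, connectionsPerInstance) {
  return instances * connectionsPerInstance;
}

// Largest instance count that stays within the database connection limit.
function maxInstancesWithinLimit(dbConnectionLimit, connectionsPerInstance) {
  return Math.floor(dbConnectionLimit / connectionsPerInstance);
}
```

For example, scaling from 10 to 30 instances with 20 connections each raises the total from 200 to 600 connections. If the database is assumed to handle 500 connections, the service cannot safely go beyond 25 instances without also scaling the database or shrinking the pools.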

  39. Low traffic rollouts (diagram: Service v2 at 0% traffic and Service v1 at 100% traffic, each with instances 1 to 4)

  40. High traffic rollouts (diagram: Service v2 at 0% traffic and Service v1 at 100% traffic, with different numbers of instances)

  41. Rollout with not enough capacity

  42. Rollouts should allocate the same capacity as the version serving 100% of traffic
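
A capacity check before shifting traffic could look like the sketch below. It assumes that load is spread evenly and that the sustainable request rate per instance is known. The function names and the figures in the usage note are hypothetical.

```javascript
// Instances needed to serve `peakRps` when one instance handles `rpsPerInstance`.
function instancesNeeded(peakRps, rpsPerInstance) {
  return Math.ceil(peakRps / rpsPerInstance);
}

// True when the new version can take 100% of the traffic on its own.
function canTakeFullTraffic(newVersionInstances, peakRps, rpsPerInstance) {
  return newVersionInstances >= instancesNeeded(peakRps, rpsPerInstance);
}
```

With an assumed 600 requests per second and 100 requests per second per instance, a new version deployed with 4 instances fails the check, while 6 instances pass it.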

  43. Monitor microservices

  44. Monitoring microservice ecosystem - Four-layer model of microservice ecosystem: Microservice, Application platform, Communication, Hardware

  45. Monitoring microservice ecosystem - Four-layer model of microservice ecosystem: Microservice, Application platform, Communication, Hardware - Infrastructure metrics: Application platform, Communication, Hardware

  46. Monitoring microservice ecosystem - Four-layer model of microservice ecosystem: Microservice, Application platform, Communication, Hardware - Microservice metrics: Microservice

  47. First example

  48. Hardware metrics

  49. Communication metrics

  50. Rate and responses of API endpoints

  51. Dependencies metrics

  52. Language specific metrics

  53. Second Example

  54. Infrastructure metrics

  55. Node JS metrics

  56. Frontend microservice metrics

  57. Anti pattern: Dashboard usage for outage detection

  58. Alerting - “Something is broken, and somebody needs to fix it right now! Or, something might break soon, so somebody should look soon.” - Practical Alerting, Monitoring distributed systems, Google SRE Book

  59. Alert: Unhealthy instances, 1 of 5 - Cause: no more memory, JVM is misconfigured

  60. Alert: Service checkout is returning 4XX responses above threshold 25% - Cause: recent change broke the API contract for an unconsidered business rule

  61. Alert: No orders in last 5 minutes - Cause: downstream dependency is experiencing connectivity issues

  62. Alert: Checkout database disk utilization is 80% - Cause: saturation of data storage by an increase in traffic
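
The four example alerts can be expressed as rules over a metrics snapshot. This is a sketch only: the `evaluateAlerts` function and the field names of the snapshot are hypothetical, and a real setup would define such rules in the monitoring system rather than in application code. The thresholds are the ones shown on the slides.

```javascript
// Takes a hypothetical metrics snapshot and returns the alerts that fire.
function evaluateAlerts(metrics) {
  const alerts = [];
  if (metrics.unhealthyInstances > 0) {
    alerts.push(`Unhealthy instances ${metrics.unhealthyInstances} of ${metrics.totalInstances}`);
  }
  const ratio4xx =
    metrics.totalResponses > 0 ? metrics.responses4xx / metrics.totalResponses : 0;
  if (ratio4xx > 0.25) {
    alerts.push("Checkout is returning 4XX responses above threshold 25%");
  }
  if (metrics.ordersLast5Min === 0) {
    alerts.push("No orders in last 5 minutes");
  }
  if (metrics.diskUtilization >= 0.8) {
    alerts.push("Checkout database disk utilization is at or above 80%");
  }
  return alerts;
}
```

Each rule describes a symptom that a customer or operator would notice. The probable causes listed on the slides are what the on-call engineer finds after the alert fires.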

  63. Alerts notify about symptoms

  64. Alerts should be actionable

  65. Incident response (figure: Five stages of incident response, from Production-ready microservices)

  66. Example of postmortem - Summary of incident: No orders in last 5 minutes, 13.05.2019 between 16:00 and 16:45 - Impact on customers: 2K customers could not complete checkout - Impact on business: 50K euros lost in orders that could not be completed - Analysis of root cause: Why were there no orders? - Action items: ...

  67. Every incident should have a postmortem

  68. Preparing for Black Friday - Business forecast - Load testing of real customer journey - Capacity planning

  69. Checklist for every microservice involved in Black Friday - Are the architecture and dependencies reviewed? - Are the possible points of failure identified and mitigated? - Are reliability patterns implemented? - Are the configurations adjustable without a deployment? - Do we have a scaling strategy? - Is monitoring in place? - Are all alerts actionable? - Is our team prepared for 24x7 incident management?

  70. Situation room

  71. Black Friday pattern of requests - > 4,200 orders per minute

  72. My summary of learnings - Think outside the happy path and mitigate failures with reliability patterns - Services are scalable proportionally with their dependencies - Monitor the microservice ecosystem

  73. Resources - Service reliability engineering - Production-ready microservices - Monitoring and alerting - Tools used by Zalando: Tailor, Skipper - Load testing in Zalando - Kubernetes in Zalando

  74. Obrigada - Thank you - Danke - Contact: Pamela Canchanya, pam.cdm@posteo.net, @pamcdm

  75. Building and running applications at scale in Zalando - Online fashion store - Checkout case - By Pamela Canchanya
