
Crisis to Calm: A Story of Data Validation @ Netflix, by Lavanya Kanchanapalli



  1. Crisis to Calm: A Story of Data Validation @ Netflix Lavanya Kanchanapalli

  2. ● Rollback data ● Increase capacity ● Netflix starts to work

  3. Root Cause ● Duplicate objects ● Apps failed ● Cascading failures ● Netflix goes down!

  4. Diagram: a Data Change reaches the Netflix Microservices Cloud (App1, App2, ..., Appn) and causes a Change in Behavior

  5. Metadata Architecture (diagram labels: Source Systems, Video Metadata Service, Amazon S3, Netflix Services, Traffic)

  6. ● Single publisher ● Multiple consumers ● Hollow (hollow.how) ● Versioned data ● Fast propagation

  7. Bad data happens!

  8. Leaked Content

  9. Disables Features

  10. Deletes data

  11. Data change = Code Push

  12. (1) Detection (2) Staggering (3) Rollback

  13. (1) Detection (2) Staggering (3) Rollback

  14. (1) Detection (2) Staggering (3) Rollback

  15. Many Unknowns

  16. Would it be too slow?

  17. Would there be too many failures?

  18. Would it cost too much?

  19. (1) Detection (2) Staggering (3) Rollback

  20. (1) Detection (2) Staggering (3) Rollback

  21. Chapter 1: Circuit Breakers

  22. Metadata Architecture (diagram labels: Source Systems, Video Metadata Service, Amazon S3, Netflix Services, Traffic)

  23. Circuit Breakers ● Integrity checks ● Duplicate detection ● Object counts ● Semantic checks
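
The duplicate detection and object count checks named on this slide could look roughly like the following. This is a minimal Python sketch, not the talk's implementation: it assumes each data version is a list of records with an `id` field, and the function names and the 10% count threshold are hypothetical.

```python
from collections import Counter


def find_duplicate_ids(records):
    """Duplicate detection: return ids that appear more than once, sorted."""
    counts = Counter(r["id"] for r in records)
    return sorted(i for i, n in counts.items() if n > 1)


def object_count_ok(previous, candidate, max_change=0.10):
    """Object count check: pass only if the count moved by at most max_change."""
    if not previous:
        return True  # assumption: nothing to compare against on first publish
    change = abs(len(candidate) - len(previous)) / len(previous)
    return change <= max_change


def run_circuit_breakers(previous, candidate):
    """Return a list of failure messages; an empty list means safe to publish."""
    failures = []
    dups = find_duplicate_ids(candidate)
    if dups:
        failures.append(f"duplicate ids: {dups}")
    if not object_count_ok(previous, candidate):
        failures.append(
            f"object count changed from {len(previous)} to {len(candidate)}"
        )
    return failures
```

A publisher would call `run_circuit_breakers(previous, candidate)` before announcing a new version and hold the version back if the returned list is non-empty.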

  24. Know your data change

  25. Knobs are key to sanity ● On/off ● Threshold ● Exclusions
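
One way to picture the three knobs is as per-breaker configuration. The sketch below is an illustration under stated assumptions: `BreakerKnobs`, `breaker_trips`, and the 5% default are hypothetical names and values, and each data version is assumed to be a dict mapping object id to a non-`None` value.

```python
from dataclasses import dataclass, field


@dataclass
class BreakerKnobs:
    enabled: bool = True      # on/off switch for this breaker
    threshold: float = 0.05   # max tolerated fraction of changed objects
    exclusions: set = field(default_factory=set)  # object ids to ignore


def breaker_trips(previous, candidate, knobs):
    """Return True if the candidate version should be blocked."""
    if not knobs.enabled:
        return False
    ids = (set(previous) | set(candidate)) - knobs.exclusions
    if not ids:
        return False
    # An id counts as changed if it was added, removed, or modified.
    changed = sum(1 for i in ids if previous.get(i) != candidate.get(i))
    return changed / len(ids) > knobs.threshold
```

With knobs like these, a known and intended bulk change can be let through by raising the threshold or excluding specific ids, rather than by removing the check.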

  26. Business value is the key

  27. Efficiency ● Change Isolation ● Sampling
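
One plausible reading of these two bullets is: validate only what changed between versions, and cap expensive checks by sampling. The sketch below assumes that reading; the function names, the `max_checks` cap, and the dict-of-id-to-value data shape (with non-`None` values) are hypothetical.

```python
import random


def changed_ids(previous, candidate):
    """Change isolation: ids added, removed, or modified between versions."""
    return sorted(
        i for i in set(previous) | set(candidate)
        if previous.get(i) != candidate.get(i)
    )


def pick_ids_to_validate(previous, candidate, max_checks=100, seed=0):
    """Sampling: cap how many changed ids receive the expensive check."""
    ids = changed_ids(previous, candidate)
    if len(ids) <= max_checks:
        return ids
    # A fixed seed keeps the sample reproducible between runs.
    return sorted(random.Random(seed).sample(ids, max_checks))
```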

  28. Chapter 2: Canaries

  29. Traditional Canaries (diagram labels: Traffic, Shadow Traffic, Baseline (Old Code), Canary (New Code), Netflix Services, Video Metadata Service, Source System, Amazon S3)

  30. Data Canaries (diagram labels: Traffic, Netflix Data Canary Service, Netflix Services, Video Metadata Service, Source System, Amazon S3)

  31. Netflix Data Canary Service ● Pick key use case(s) ● Pick data to test ● Test with latest data
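
The three steps on this slide can be sketched as a small harness that runs chosen use cases against the latest candidate data before it is released. This is an assumption-laden illustration: `run_data_canary`, `canary_passes`, and the use-case callables are hypothetical and do not describe the actual canary service.

```python
def run_data_canary(candidate_data, use_cases):
    """Run each key use case against the candidate data version.

    use_cases maps a name to a callable that takes the data and returns a
    truthy value on success. An exception counts as a failure.
    """
    results = {}
    for name, check in use_cases.items():
        try:
            results[name] = bool(check(candidate_data))
        except Exception:
            results[name] = False
    return results


def canary_passes(results):
    """The candidate is acceptable only if every use case succeeded."""
    return all(results.values())
```

For example, a use case might assert that a handful of well-known titles can still be looked up in the candidate data.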

  32. (1) Detection (2) Staggering (3) Rollback

  33. Staggered Rollout (shown on a map of AWS Global Infrastructure)
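
A staggered rollout can be sketched as releasing a data version to one region at a time and stopping at the first unhealthy region. The code below is a minimal sketch under that assumption; `staggered_rollout`, `apply_version`, and `is_healthy` are hypothetical names, and the region order is chosen by the caller.

```python
def staggered_rollout(version, regions, apply_version, is_healthy):
    """Apply version region by region; stop at the first unhealthy region.

    Returns (regions updated successfully, failed region or None).
    """
    done = []
    for region in regions:
        apply_version(region, version)
        if not is_healthy(region):
            return done, region
        done.append(region)
    return done, None
```

Because later regions never receive a version that failed earlier, the blast radius of bad data is limited to the region where the problem first appears.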

  34. (1) Detection (2) Staggering (3) Rollback

  35. Keep calm & rollback ● Pin back ● Root cause ● Unpin
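
The pin back, root cause, unpin sequence can be sketched with a tiny versioned feed in which a pin overrides the latest published version. This is an illustrative model only: the `VersionedFeed` class and its methods are hypothetical and are not Hollow's API.

```python
class VersionedFeed:
    """Tracks published data versions and an optional pinned version."""

    def __init__(self):
        self.versions = []   # published version numbers, in publish order
        self.pinned = None   # when set, consumers load this version

    def publish(self, version):
        self.versions.append(version)

    def pin(self, version):
        """Pin back: force consumers onto a known-good published version."""
        if version not in self.versions:
            raise ValueError(f"unknown version: {version}")
        self.pinned = version

    def unpin(self):
        """Unpin: resume following the latest version after the fix."""
        self.pinned = None

    def current(self):
        """Version consumers should load right now, or None if nothing is published."""
        if self.pinned is not None:
            return self.pinned
        return self.versions[-1] if self.versions else None
```

In this model, publishing continues while the pin is in place, so a fixed version is picked up as soon as the feed is unpinned.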

  36. Rollback ● Visibility ● Traversing

  37. Data Diff UI

  38. Circuit Breaker UI

  39. Pinning UI

  40. Data validation is key to high availability ● Data change = Code push ● Circuit breakers & canaries ● Staggering and rollback

  41. Thank You
