Crisis to Calm: A Story of Data Validation @ Netflix
Lavanya Kanchanapalli
● Rollback data ● Increase capacity ● Netflix starts to work
Root Cause ● Duplicate objects ● Apps failed ● Cascading failures ● Netflix goes down!
Data Change → Change in Behavior (diagram: Netflix microservices in the cloud: App1, App2, ..., Appn)
Metadata Architecture (diagram: Source System, Video Metadata Service, Amazon S3, Netflix Services, Traffic)
● Single publisher ● Multiple consumers ● Hollow (hollow.how) ● Versioned data ● Fast propagation
Bad data happens!
Leaked Content
Disables Features
Deletes data
Data change = Code Push
1. Detection 2. Staggering 3. Rollback
Many Unknowns
Would it be too slow?
Would there be too many failures?
Would it cost too much?
Chapter 1: Circuit Breakers
Circuit Breakers ● Integrity checks ● Duplicate detection ● Object counts ● Semantic checks
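As an illustration of two of the checks listed above (duplicate detection and object counts), here is a minimal Python sketch. It is not Netflix's or Hollow's implementation: the record shape (`{"id": ...}`), the function names, and the 5% default threshold are all assumptions made for the example.

```python
from collections import Counter


def find_duplicate_ids(records):
    """Return the ids that appear more than once in the candidate data set."""
    counts = Counter(r["id"] for r in records)
    return sorted(i for i, n in counts.items() if n > 1)


def object_count_change(previous_count, current_count):
    """Fractional change in object count between two data versions."""
    if previous_count == 0:
        return 0.0 if current_count == 0 else float("inf")
    return abs(current_count - previous_count) / previous_count


def run_circuit_breakers(previous, candidate, max_count_change=0.05):
    """Return a list of failure messages; an empty list means no breaker tripped.

    max_count_change is a hypothetical threshold, not a value from the talk.
    """
    failures = []
    dupes = find_duplicate_ids(candidate)
    if dupes:
        failures.append(f"duplicate ids: {dupes}")
    change = object_count_change(len(previous), len(candidate))
    if change > max_count_change:
        failures.append(f"object count changed by {change:.0%}")
    return failures
```

In this sketch a publisher would call `run_circuit_breakers` on the candidate data before announcing a new version, and hold the version back if the returned list is non-empty.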
Know your data change
Knobs are key to sanity ● On/off ● Threshold ● Exclusions
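The three knobs above could be modeled as a small per-breaker configuration object. The sketch below is only an assumption about how such knobs might fit together; `BreakerKnobs`, `breaker_trips`, and the default values are hypothetical names and numbers, not taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class BreakerKnobs:
    enabled: bool = True                          # on/off switch
    threshold: float = 0.05                       # max tolerated fraction of changed objects
    exclusions: set = field(default_factory=set)  # ids the check should ignore


def breaker_trips(changed_ids, total_objects, knobs):
    """Return True if the change should block publication under the given knobs."""
    if not knobs.enabled or total_objects == 0:
        return False
    # Excluded ids do not count toward the threshold.
    counted = [i for i in changed_ids if i not in knobs.exclusions]
    return len(counted) / total_objects > knobs.threshold
```

Keeping the knobs in data rather than in code means a noisy breaker can be loosened, narrowed, or switched off without redeploying the publisher.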
Business value is the key
Efficiency ● Change Isolation ● Sampling
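One possible reading of the two efficiency techniques above: change isolation restricts validation to the objects that differ between two versions, and sampling validates only a fraction of them. The sketch below assumes each version is a plain dict keyed by object id; the function names and the fixed seed are illustrative choices.

```python
import random


def changed_keys(previous, candidate):
    """Keys added, removed, or modified between two versions (dicts keyed by id)."""
    keys = set(previous) | set(candidate)
    return sorted(k for k in keys if previous.get(k) != candidate.get(k))


def sample_keys(keys, fraction, seed=0):
    """Seeded random sample of about `fraction` of the keys (at least one if any exist)."""
    keys = list(keys)
    if not keys:
        return []
    size = min(len(keys), max(1, round(len(keys) * fraction)))
    return random.Random(seed).sample(keys, size)
```

A fixed seed keeps the sample reproducible, which helps when a failed validation run has to be replayed during root-cause analysis.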
Chapter 2: Canaries
Traditional Canaries (diagram: shadow traffic sent to a Baseline (Old Code) and a Canary (New Code); Source System, Video Metadata Service, Amazon S3, Netflix Services, Traffic)
Data Canaries (diagram: Source System, Video Metadata Service, Amazon S3, Netflix Data Canary Service, Netflix Services, Traffic)
Netflix Data Canary Service ● Pick key use case(s) ● Pick data to test ● Test with latest data
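The three steps above (pick a use case, pick the data, test against the latest data) can be sketched as a small harness. This is a hypothetical illustration rather than the actual canary service: `run_data_canary`, `is_playable`, and the record fields `title` and `runtime_minutes` are assumptions made for the example.

```python
def run_data_canary(use_case, candidate_data, test_ids):
    """Run one key use case against the latest data and collect failures.

    use_case: callable that takes a record and returns True on success.
    candidate_data: dict of id -> record for the data version under test.
    test_ids: the ids chosen for testing.
    """
    failures = {}
    for item_id in test_ids:
        record = candidate_data.get(item_id)
        try:
            if record is None:
                failures[item_id] = "missing from candidate data"
            elif not use_case(record):
                failures[item_id] = "use case returned failure"
        except Exception as exc:  # a crash in consumer code also counts as a failure
            failures[item_id] = f"use case raised {type(exc).__name__}"
    return failures


def is_playable(record):
    """Hypothetical use case: a title needs a name and a positive runtime."""
    return bool(record["title"]) and record["runtime_minutes"] > 0
```

Because the use case is real consumer logic run against the new data, this kind of check can catch problems that purely structural checks on the data would miss.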
Staggered Rollout (figure: Amazon Web Services Global Infrastructure map)
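A staggered rollout can be sketched as a loop that publishes a data version to one region at a time and stops at the first sign of trouble. The callbacks `publish`, `is_healthy`, and `rollback` are hypothetical hooks supplied by the caller; the talk does not describe the real mechanism at this level of detail.

```python
def staggered_rollout(version, regions, publish, is_healthy, rollback):
    """Publish `version` region by region; on the first unhealthy region,
    roll back every region touched so far and stop.

    Returns (succeeded, regions_touched).
    """
    touched = []
    for region in regions:
        publish(region, version)
        touched.append(region)
        if not is_healthy(region):
            # Undo in reverse order so the most recently changed region recovers first.
            for r in reversed(touched):
                rollback(r)
            return False, touched
    return True, touched
```

The point of staggering is that bad data which slips past detection reaches only the regions published so far, not all of them at once.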
Keep calm & rollback ● Pin back ● Root cause ● Unpin
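The pin back, root cause, unpin sequence relies on data being versioned. The toy in-memory feed below shows the idea under that assumption; `VersionedFeed` and its methods are invented for this sketch and are not the Hollow API.

```python
class VersionedFeed:
    """Toy versioned data feed: consumers read the latest version unless pinned."""

    def __init__(self):
        self._versions = []   # published snapshots, oldest first
        self._pinned = None   # version number consumers are pinned to, if any

    def publish(self, snapshot):
        self._versions.append(snapshot)
        return len(self._versions)  # version numbers start at 1

    def pin(self, version):
        """Pin back: make consumers read a known-good earlier version."""
        if not 1 <= version <= len(self._versions):
            raise ValueError(f"unknown version {version}")
        self._pinned = version

    def unpin(self):
        """Unpin: resume following the latest published version."""
        self._pinned = None

    def current_version(self):
        return self._pinned if self._pinned is not None else len(self._versions)

    def read(self):
        if not self._versions:
            raise LookupError("nothing has been published yet")
        return self._versions[self.current_version() - 1]
```

While the pin is in place, publishing can continue without affecting consumers, which leaves time to find the root cause before unpinning.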
Rollback ● Visibility ● Traversing
Data Diff UI
Circuit Breaker UI
Pinning UI
Data validation is key to high availability ● Data change = Code push ● Circuit breakers & canaries ● Staggering and rollback
Thank You