The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016
our world ACROBAT FLEA parallel world
# A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. --Leslie Lamport
our world ENGINEER ACROBAT FLEA computing parallel world
PROLOGUE DISTRIBUTED SYSTEMS
DECOMPOSING THE MONOLITH [Diagram: Devices, Proxy/Routing, Traffic, Edge Service instances, Playback Service instances]
# Distributed systems are different because they fail often. --Jeff Hodges, Notes on Distributed Systems for Young Bloods
TABLE OF CONTENTS FORCES AT WORK CHAPTER 1: THE WEIRD DATA IN THE CATALOG • Metadata impacts on availability CHAPTER 2: THE VANISHING OF CRITICAL SERVICES • Crashing services and cascading failures CHAPTER 3: THE THROTTLE • Latency spikes and the impact of fallbacks
Whoops, something went wrong … Netflix Streaming Error We’re having trouble playing this title right now. Please try again later or select a different title.
CHAPTER ONE THE WEIRD DATA IN THE CATALOG
45 MINUTES!! Clock, by heyyobecky4lyfe, Tumblr
VIDEO METADATA ARCHITECTURE [Diagram: Traffic, Netflix Services, Video Metadata Service, Source System, Amazon S3]
[Diagram: Netflix Playback Service, Amazon S3] { String msg = "This should never happen!"; throw new IllegalStateException(msg); }
MITIGATION BLAST RADIUS Explosion, CC BY 2.0, Andrew Kuznetsov 2008, Flickr
Amazon WS Global Infrastructure STAGGERED ROLLOUT
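A staggered rollout limits blast radius by applying a change to one region at a time and stopping at the first region that fails verification. The following is a minimal sketch of that idea, not Netflix's tooling: the class name, the `deployAndVerify` callback, and the region names in the usage note are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of a staggered rollout; names are illustrative only.
final class StaggeredRollout {
    /**
     * Applies a change region by region, in order. deployAndVerify is assumed
     * to deploy to the given region and return true only if the region looks
     * healthy afterwards. Returns the regions that were rolled out successfully.
     */
    static List<String> rollOut(List<String> regions, Predicate<String> deployAndVerify) {
        List<String> completed = new ArrayList<>();
        for (String region : regions) {
            if (!deployAndVerify.test(region)) {
                break; // stop here so the remaining regions never see the bad change
            }
            completed.add(region);
        }
        return completed;
    }
}
```

For example, `StaggeredRollout.rollOut(List.of("region-a", "region-b", "region-c"), check)` leaves `region-c` untouched if `check` fails on `region-b`.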
Diagnosis? Pager
PREVENTION CANARIES Canary, CC BY 2.0, Steve P2008 2014, Flickr
TRADITIONAL CANARY [Diagram: Traffic, Netflix Services, Video Metadata Service with Baseline (Old Code) and Canary (New Code), Source System, Amazon S3]
CONSISTENCY VALID STATE TRANSITIONS
DATA CANARY [Diagram: Traffic, Netflix Services, Data Tester and Data Canary services, Video Metadata Service, Source System, Amazon S3]
SEEING RETURNS Australia with AAT, CC BY-SA 2.0, Ssolbergj 2010, Wikimedia
Verify consistency prior to applying state changes. …one tool is a data canary.
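One way to read "verify consistency prior to applying state changes" is as a gate in front of the publish step: a candidate data snapshot is exercised by a set of checks, and it only replaces the live snapshot if every check passes. The sketch below assumes metadata is a simple string map and that checks are predicates; `DataCanary`, `MetadataPublisher`, and their methods are hypothetical names, not taken from the talk.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: runs every check against a candidate data snapshot.
final class DataCanary {
    private final List<Predicate<Map<String, String>>> checks;

    DataCanary(List<Predicate<Map<String, String>>> checks) {
        this.checks = checks;
    }

    /** Returns true only if the candidate passes every check without throwing. */
    boolean passes(Map<String, String> candidate) {
        for (Predicate<Map<String, String>> check : checks) {
            try {
                if (!check.test(candidate)) {
                    return false;
                }
            } catch (RuntimeException e) {
                return false; // a check that blows up counts as a failed canary
            }
        }
        return true;
    }
}

// Hypothetical sketch: keeps serving the last known good snapshot on failure.
final class MetadataPublisher {
    private volatile Map<String, String> live;
    private final DataCanary canary;

    MetadataPublisher(Map<String, String> initial, DataCanary canary) {
        this.live = initial;
        this.canary = canary;
    }

    /** Applies the state change only after the canary has verified the data. */
    boolean publish(Map<String, String> candidate) {
        if (!canary.passes(candidate)) {
            return false;
        }
        live = candidate;
        return true;
    }

    Map<String, String> live() {
        return live;
    }
}
```

The design choice worth noting is that a rejected snapshot leaves the previous data in place, so bad data delays an update instead of breaking playback.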
CHAPTER TWO THE VANISHING OF CRITICAL SERVICES
# A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. --Leslie Lamport
LOG DATA [Diagram: Devices, Proxy/Routing, Traffic, Edge Service, Log Data Service, Playback Service, Cassandra]
Proxy [Diagram: Devices, Proxy/Routing, Traffic, Edge Service, Log Data Service, Playback Service, Cassandra]
CASCADING FAILURE [Diagram: Devices, Proxy/Routing, Traffic, Edge Service, Log Data Service, Playback Service, Cassandra]
[Diagram: Log Data Service, Playback Service, Cassandra] { throw new OutOfMemoryError(); }
PREVENTION MANAGING RESOURCE CONSTRAINTS Whatever you ask, CC BY-SA 2.0, Kreg Steppe 2008, Flickr
REDUCE SURFACE AREA Astronomical Clock, CC BY 2.0, Andrew Fleming 2011, Flickr
1 SO MANY JARS!! Keep Only Dependencies which are Necessary
2 LIMIT "MAGIC" Magic, CC BY-ND 2.0, Daniel Lee 2013, Flickr
3 ADD KILL SWITCHES Medusa Kill Switch, CC BY-NC-ND 2.0, Scott Hart 2013, Flickr
4 FAVOR IMMUTABILITY [Diagram: Playback Service in DEV, TEST, PROD]
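Of the four steps above, the kill switch (step 3) is the easiest to show in code. A minimal sketch, assuming the switch is an in-memory flag that an operator or a dynamic configuration system can flip at runtime; the `KillSwitch` class and its methods are hypothetical and do not describe any Netflix library.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical sketch: a runtime flag that turns a non-essential feature off.
final class KillSwitch {
    private final AtomicBoolean enabled = new AtomicBoolean(true);

    void disable() {
        enabled.set(false);
    }

    void enable() {
        enabled.set(true);
    }

    boolean isEnabled() {
        return enabled.get();
    }

    /** Runs the feature only while the switch is on; otherwise returns the given default. */
    <T> T run(Supplier<T> feature, T whenDisabled) {
        return enabled.get() ? feature.get() : whenDisabled;
    }
}
```

With something like this wrapped around the call to a log data client, a misbehaving dependency can be cut off without a redeploy.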
[Diagram: Proxy/Routing, Traffic, Log Data Service, Playback Service, Cassandra] try { remoteService.call(); } catch (Throwable t) { /* Oops! */ System.exit(1); }
MITIGATION CIRCUIT BREAKERS It's Electric, CC BY-ND 2.0, Alan Hochberg 2008, Flickr
FAILURE TESTING Wrecking Ball in Building, CC BY 2.0, Jason Eppink 2008, Flickr
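A circuit breaker stops calling a failing dependency for a while and serves a fallback instead, which keeps one broken service from dragging down its callers. The sketch below is a deliberately small illustration of the pattern and not the API of any real library: `SimpleCircuitBreaker`, its thresholds, and the injected clock are all assumptions made for the example.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch of a count-based circuit breaker.
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;   // injected so the behavior is testable
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < openMillis;
    }

    /**
     * Calls the remote dependency unless the circuit is open. Synchronized for
     * simplicity; a production breaker would not hold a lock across the call.
     */
    synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // short-circuit: do not touch the dependency
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong(); // trip (or re-trip) the circuit
            }
            return fallback.get();
        }
    }
}
```

Once the open window has elapsed, the next call is allowed through as a trial; a single further failure re-opens the circuit because the failure count is only reset on success.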
FAILURE TESTING [Diagram: Devices, Proxy/Routing, Traffic, Log Data Service, Playback Service, Cassandra] • Applying Failure Testing Research @Netflix by Kolton Andrus and Peter Alvaro • Automating Chaos Experiments in Production by Ali Basiri
Manage resource constraints by reducing surface area. Leverage circuit breakers and rigorously test failures.
CHAPTER THREE THE THROTTLE
PLAYBACK ARCHITECTURE [Diagram: Devices, Proxy/Routing, Traffic, Edge Service, Playback Service, URL Service]
NETFLIX CLIENT JARS [Diagram: Playback Service, URL Client, URL Service] URL Client: • Service Discovery • RPC • Metrics • Retries and Timeouts • Circuit-breakers and Fallbacks
THROTTLING [Diagram: Traffic, Concurrent Requests, Playback Service, Throttled Requests (HTTP 503)]
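The throttling slide describes capping concurrent requests and rejecting the overflow with HTTP 503. A minimal sketch of that behavior using a semaphore follows; `ConcurrencyThrottle` and its integer status codes are hypothetical stand-ins for whatever server framework is actually in use.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: rejects work once too many requests are in flight.
final class ConcurrencyThrottle {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    private final Semaphore permits;

    ConcurrencyThrottle(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    /** Runs the work if a permit is free; otherwise answers 503 immediately. */
    int handle(Runnable work) {
        if (!permits.tryAcquire()) {
            return SERVICE_UNAVAILABLE; // throttled: fail fast instead of queueing
        }
        try {
            work.run();
            return OK;
        } finally {
            permits.release();
        }
    }
}
```

Rejecting immediately keeps the rejection itself cheap, which matters because the caller's fallback path runs for every throttled request.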
[Diagram: Proxy/Routing, Traffic, Edge Service, Playback Service, URL Service] { System.gc(); }
NETFLIX CLIENT JARS [Diagram: Playback Service, URL Client, URL Service] URL Client: • Service Discovery • RPC • Metrics • Retries and Timeouts • Circuit-breakers • Heavy Fallback
FALLBACK TESTING • No fallback, CPU held at 90%: 15 RPS • With 100% fallback, CPU held at 90%: 58 RPS • Load generated with Siege: https://github.com/JoeDog/siege
SELECTING FALLBACKS Fallback options: STATIC, CACHE, SERVICE
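A light fallback avoids further remote calls: it answers from a local cache of the last good value or, failing that, from a static default. The sketch below illustrates those two options under the assumption that the primary lookup is a function from key to string; `LightFallback` and its fields are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: cache-then-static fallback with no remote call.
final class LightFallback {
    private final Map<String, String> lastGood = new ConcurrentHashMap<>();
    private final Function<String, String> primary;
    private final String staticDefault;

    LightFallback(Function<String, String> primary, String staticDefault) {
        this.primary = primary;
        this.staticDefault = staticDefault;
    }

    String get(String key) {
        try {
            String value = primary.apply(key);
            // ConcurrentHashMap rejects null values, so a null result is
            // treated like a failure and handled by the catch block.
            lastGood.put(key, value);
            return value;
        } catch (RuntimeException e) {
            // Cheap, local fallback: last good value if there is one, else a constant.
            return lastGood.getOrDefault(key, staticDefault);
        }
    }
}
```

Both fallback branches are in-memory lookups, so the cost of failing stays well below the cost of succeeding.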
[Diagram: Proxy/Routing, Traffic, Edge Service, Playback Service, URL Service] { return Response.status(503).build(); }
REQUEST BUCKETING • CRITICAL: Customer Experience or Streaming Impact • NON-CRITICAL: Performance Impact Fire Buckets at Oakworth Station, CC BY 2.0, Tim Green 2015, Flickr
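Request bucketing can be sketched as a classifier that tags each request as critical or non-critical and picks a target shard accordingly. Everything named below (`Bucket`, `RequestRouter`, the operation names, the shard identifiers) is hypothetical; the talk does not specify how requests are classified.

```java
import java.util.Set;

// Hypothetical sketch: two buckets, each served by its own shard.
enum Bucket { CRITICAL, NON_CRITICAL }

final class RequestRouter {
    private final Set<String> criticalOperations;
    private final String criticalShard;
    private final String nonCriticalShard;

    RequestRouter(Set<String> criticalOperations, String criticalShard, String nonCriticalShard) {
        this.criticalOperations = criticalOperations;
        this.criticalShard = criticalShard;
        this.nonCriticalShard = nonCriticalShard;
    }

    /** Anything not explicitly listed as critical lands in the non-critical bucket. */
    Bucket classify(String operation) {
        return criticalOperations.contains(operation) ? Bucket.CRITICAL : Bucket.NON_CRITICAL;
    }

    /** Picks the shard that should serve the operation. */
    String shardFor(String operation) {
        return classify(operation) == Bucket.CRITICAL ? criticalShard : nonCriticalShard;
    }
}
```

Keeping the critical list explicit and small means a new, unclassified operation cannot compete with playback-critical traffic by default.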
APPLICATION SHARDING [Diagram: Devices, Proxy/Routing, Traffic, Edge Service, Critical Playback Service, Non-Critical Playback Service, URL Service, Non-Critical URL Service]
CRITICAL Country Road at Sunrise, CC BY-SA 2.0, Susanne Nilsson 2014, Flickr
NON-CRITICAL Traffic, CC BY-NC 2.0, jonbgem 2008, Flickr
No heavy fallbacks!! Fallbacks should be light and fast. Shard your application based on operational characteristics.
EPILOGUE KEY TAKEAWAYS
KEY TAKEAWAYS CHAPTER 1: THE WEIRD DATA IN THE CATALOG • Verify consistency prior to applying state changes. • One tool is a data canary. CHAPTER 2: THE VANISHING OF CRITICAL SERVICES • Manage resource constraints by reducing surface area. • Leverage circuit breakers and rigorously test failures. CHAPTER 3: THE THROTTLE • No heavy fallbacks!! Fallbacks should be light and fast. • Shard your application based on operational characteristics.
Plan to fail. The unexpected will happen.
PARTING THOUGHT DISTRIBUTED SYSTEMS SOCIAL
Questions? Haley Tucker