The Forces That Disrupt Netflix, Haley Tucker, Nov. 7, 2016


  1. The Forces That Disrupt Netflix Haley Tucker Nov. 7, 2016

  2. [Illustration labels: our world, ACROBAT, FLEA, parallel world]

  3. "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." --Leslie Lamport

  4. [Illustration labels: our world, ENGINEER, ACROBAT, FLEA, computing, parallel world]

  5. PROLOGUE DISTRIBUTED SYSTEMS

  6. DECOMPOSING THE MONOLITH [Diagram: Devices, Traffic, Proxy/Routing, Edge Service, Playback Service, other Netflix Services]

  7. "Distributed systems are different because they fail often." --Jeff Hodges, Notes on Distributed Systems for Young Bloods

  8. TABLE OF CONTENTS: FORCES AT WORK
     CHAPTER 1: THE WEIRD DATA IN THE CATALOG • Metadata impacts on availability
     CHAPTER 2: THE VANISHING OF CRITICAL SERVICES • Crashing services and cascading failures
     CHAPTER 3: THE THROTTLE • Latency spikes and the impact of fallbacks

  9. Whoops, something went wrong… Netflix Streaming Error: We’re having trouble playing this title right now. Please try again later or select a different title.

  10. CHAPTER ONE THE WEIRD DATA IN THE CATALOG

  11. 45 MINUTES!! Clock, by heyyobecky4lyfe, Tumblr

  12. VIDEO METADATA ARCHITECTURE [Diagram: Traffic, Netflix Services, Video Metadata Service, Source System, Amazon S3]

  13. Netflix Playback Service: { String msg = "This should never happen!"; throw new IllegalStateException(msg); } [Diagram label: Amazon S3]

  14. MITIGATION: BLAST RADIUS. Explosion, CC BY 2.0, Andrew Kuznetsov 2008, Flickr

  15. AWS Global Infrastructure

  16. AWS Global Infrastructure: STAGGERED ROLLOUT

  17. Diagnosis? Pager

  18. PREVENTION: CANARIES. Canary, CC BY 2.0, Steve P2008 2014, Flickr

  19. TRADITIONAL CANARY [Diagram: Traffic, Netflix Services, Video Metadata Service with Baseline (Old Code) and Canary (New Code), Source System, Amazon S3]

  20. CONSISTENCY VALID STATE TRANSITIONS

  21. DATA CANARY [Diagram: Traffic, Netflix Services, Data Tester, Data Canary Service, Video Metadata Service, Source System, Amazon S3]

  22. SEEING RETURNS Australia with AAT, CC BY-SA 2.0, Ssolbergj 2010, Wikimedia

  23. Verify consistency prior to applying state changes. …one tool is a data canary.
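One way to picture the takeaway of slides 20 to 23 is a gate that validates a candidate metadata snapshot before it replaces the last known-good one. The sketch below is a minimal illustration under assumptions, not the mechanism Netflix uses: the `VideoMetadata` fields, the two checks, and all class and method names are hypothetical, and a real data canary (slide 21) exercises the new data in a running canary service rather than in a single in-process check.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical data-canary gate: validate a candidate snapshot before publishing it. */
class DataCanarySketch {

    /** Hypothetical catalog entry; the real metadata model is not shown in the slides. */
    static final class VideoMetadata {
        final long videoId;
        final String title;
        final int runtimeMinutes;

        VideoMetadata(long videoId, String title, int runtimeMinutes) {
            this.videoId = videoId;
            this.title = title;
            this.runtimeMinutes = runtimeMinutes;
        }
    }

    /** Assumed consistency checks; real checks would encode the valid state transitions. */
    static final List<Predicate<VideoMetadata>> DEFAULT_CHECKS = List.of(
            v -> v.title != null && !v.title.trim().isEmpty(),
            v -> v.runtimeMinutes > 0);

    /** True only if the snapshot is non-empty and every entry passes every check. */
    static boolean isSafeToPublish(List<VideoMetadata> candidate,
                                   List<Predicate<VideoMetadata>> checks) {
        if (candidate.isEmpty()) {
            return false; // an empty catalog is treated here as an invalid transition
        }
        return candidate.stream()
                .allMatch(v -> checks.stream().allMatch(c -> c.test(v)));
    }

    /** Returns the candidate if it validates, otherwise keeps the last known-good snapshot. */
    static List<VideoMetadata> nextSnapshot(List<VideoMetadata> lastGood,
                                            List<VideoMetadata> candidate,
                                            List<Predicate<VideoMetadata>> checks) {
        return isSafeToPublish(candidate, checks) ? candidate : lastGood;
    }

    public static void main(String[] args) {
        List<VideoMetadata> lastGood = List.of(new VideoMetadata(1L, "Some Title", 45));
        List<VideoMetadata> weird = List.of(new VideoMetadata(2L, "", 0));
        List<VideoMetadata> served = nextSnapshot(lastGood, weird, DEFAULT_CHECKS);
        System.out.println("Serving last good snapshot: " + (served == lastGood));
    }
}
```

The design choice illustrated is that a rejected snapshot leaves the previous state in place, so bad data costs freshness rather than availability.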

  24. CHAPTER TWO THE VANISHING OF CRITICAL SERVICES

  25. "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." --Leslie Lamport

  26. LOG DATA [Diagram: Devices, Traffic, Proxy/Routing, Edge Service, Log Data Service, Playback Service, Cassandra]

  27. Proxy [Diagram: Devices, Traffic, Proxy/Routing, Edge Service, Log Data Service, Playback Service, Cassandra]

  28. CASCADING FAILURE [Diagram: Devices, Traffic, Proxy/Routing, Edge Service, Log Data Service, Playback Service, Cassandra]

  29. Log Data Service: { throw new OutOfMemoryError(); } [Diagram labels: Playback Service, Cassandra]

  30. PREVENTION: MANAGING RESOURCE CONSTRAINTS. Whatever you ask, CC BY-SA 2.0, Kreg Steppe 2008, Flickr

  31. REDUCE SURFACE AREA. Astronomical Clock, CC BY 2.0, Andrew Fleming 2011, Flickr

  32. 1 SO MANY JARS!! Keep Only Dependencies which are Necessary

  33. 2 LIMIT “MAGIC”. Magic, CC BY-ND 2.0, Daniel Lee 2013, Flickr

  34. 3 ADD KILL SWITCHES. Medusa Kill Switch, CC BY-NC-ND 2.0, Scott Hart 2013, Flickr

  35. 4 FAVOR IMMUTABILITY [Diagram: Playback Service in DEV, TEST, and PROD]

  36. try { remoteService.call(); } catch (Throwable t) { /* Oops! */ System.exit(1); } [Diagram labels: Traffic, Proxy/Routing, Log Data Service, Playback Service, Cassandra]

  37. MITIGATION: CIRCUIT BREAKERS. It's Electric, CC BY-ND 2.0, Alan Hochberg 2008, Flickr

  38. FAILURE TESTING. Wrecking Ball in Building, CC BY 2.0, Jason Eppink 2008, Flickr

  39. FAILURE TESTING. References: "Applying Failure Testing Research @Netflix" by Kolton Andrus and Peter Alvaro; "Automating Chaos Experiments in Production" by Ali Basiri. [Diagram labels: Devices, Traffic, Proxy/Routing, Log Data Service, Playback Service, Cassandra]

  40. Manage resource constraints by reducing surface area. Leverage circuit breakers and rigorously test failures.
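The circuit-breaker idea from slide 37 can be sketched as a wrapper that stops calling a failing dependency for a while and returns a fallback instead. This is a minimal illustration under assumptions, not the client library the deck refers to: the class name, the consecutive-failure threshold, and the injected clock are all hypothetical. Unlike the snippet on slide 36, it catches only `RuntimeException` and never exits the process.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical count-based circuit breaker with a fixed open period. */
class CircuitBreakerSketch<T> {
    private final int failureThreshold;   // consecutive failures that trip the breaker
    private final long openMillis;        // how long to fail fast once tripped
    private final LongSupplier clock;     // injected so the behavior is testable
    private int consecutiveFailures = 0;
    private long openedAt = -1;           // -1 means the breaker has not tripped

    CircuitBreakerSketch(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** True while calls are being short-circuited to the fallback. */
    synchronized boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < openMillis;
    }

    /**
     * Runs the remote call unless the breaker is open. Synchronizing the whole
     * call keeps the sketch short; it also serializes callers, which a real
     * implementation would avoid.
     */
    synchronized T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // fail fast without touching the dependency
        }
        try {
            T result = remoteCall.get(); // also serves as the trial call after the open period
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong(); // trip, or re-trip after a failed trial
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        CircuitBreakerSketch<String> breaker =
                new CircuitBreakerSketch<>(2, 1000L, System::currentTimeMillis);
        for (int i = 0; i < 3; i++) {
            String result = breaker.call(
                    () -> { throw new IllegalStateException("log data service down"); },
                    () -> "skipped logging");
            System.out.println(result + ", open=" + breaker.isOpen());
        }
    }
}
```

Injecting the clock also makes the open and recovery paths easy to exercise in the kind of failure testing slide 38 calls for.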

  41. CHAPTER THREE THE THROTTLE

  42. PLAYBACK ARCHITECTURE [Diagram: Devices, Traffic, Proxy/Routing, Edge Service, Playback Service, URL Service]

  43. NETFLIX CLIENT JARS [Diagram: Playback Service, URL Client, URL Service. Client features: Service Discovery, RPC, Metrics, Retries and Timeouts, Circuit-breakers and Fallbacks]

  44. THROTTLING [Diagram labels: Traffic, Concurrent Requests, Throttled Requests (HTTP 503), Playback Service]
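The throttling on slide 44 can be sketched as a cap on concurrent requests, with anything over the cap rejected immediately as HTTP 503. This is a minimal illustration under assumptions: the class name and the limit are hypothetical, and the slides do not show how the Playback Service actually counts or rejects requests.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical concurrency throttle: reject work beyond a fixed number of in-flight requests. */
class ConcurrencyThrottleSketch {
    static final int HTTP_OK = 200;
    static final int HTTP_SERVICE_UNAVAILABLE = 503;

    private final Semaphore permits;

    ConcurrencyThrottleSketch(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    /** Runs the work if a permit is free; otherwise returns 503 without queueing. */
    int handle(Runnable work) {
        if (!permits.tryAcquire()) {
            return HTTP_SERVICE_UNAVAILABLE; // shed load instead of letting requests pile up
        }
        try {
            work.run();
            return HTTP_OK;
        } finally {
            permits.release(); // always return the permit, even if the work throws
        }
    }

    public static void main(String[] args) {
        ConcurrencyThrottleSketch throttle = new ConcurrencyThrottleSketch(1);
        // The nested call arrives while the outer request still holds the only permit.
        int outer = throttle.handle(() ->
                System.out.println("nested request status: " + throttle.handle(() -> { })));
        System.out.println("outer request status: " + outer);
    }
}
```

Rejecting without waiting keeps the cost of a throttled request small, which matters once the caller's fallback path starts running for every rejection.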

  45. [Diagram labels: Traffic, Proxy/Routing, Edge Service, Playback Service, URL Service. Code fragment shown: } System.gc(); }]

  46. NETFLIX CLIENT JARS [Diagram: Playback Service, URL Client, URL Service. Client features: Service Discovery, RPC, Metrics, Retries and Timeouts, Circuit-breakers, Heavy Fallback]

  47. FALLBACK TESTING: 15 RPS with no fallback, CPU held at 90%; 58 RPS with 100% fallback, CPU held at 90%. Siege: https://github.com/JoeDog/siege

  48. SELECTING FALLBACKS. Fallback options: STATIC, CACHE, SERVICE

  49. Code fragment shown: return Response.status(503).build(); [Diagram labels: Traffic, Proxy/Routing, Edge Service, Playback Service, URL Service]

  50. REQUEST BUCKETING CRITICAL NON-CRITICAL Customer Experience or Streaming Performance Impact Impact Fire Buckets at Oakworth Station, CC BY 2.0, Tim Greene 2015, Flickr

  51. APPLICATION SHARDING [Diagram: Devices, Traffic, Proxy/Routing, Edge Service, Critical Playback Service, Non-Critical Playback Service, URL Service, Non-Critical URL Service]

  52. CRITICAL. Country Road at Sunrise, CC BY-SA 2.0, Susanne Nilsson 2014, Flickr

  53. NON-CRITICAL. Traffic, CC BY-NC 2.0, jonbgem 2008, Flickr

  54. APPLICATION SHARDING [Diagram: Devices, Traffic, Proxy/Routing, Edge Service, Critical Playback Service, Non-Critical Playback Service, URL Service, Non-Critical URL Service]

  55. No heavy fallbacks!! Fallbacks should be light and fast. Shard your application based on operational characteristics.
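Request bucketing and application sharding (slides 50 to 54) can be pictured as a lookup that assigns each request type to a bucket and routes each bucket to its own cluster. The sketch below is only an illustration under assumptions: the request type names, the shard names, and the choice to treat unknown request types as non-critical are all hypothetical and are not taken from the deck.

```java
import java.util.Map;

/** Hypothetical request bucketing: route critical and non-critical traffic to separate shards. */
class RequestBucketingSketch {

    enum Bucket { CRITICAL, NON_CRITICAL }

    /** Assumed classification; the deck does not list which requests fall in which bucket. */
    private static final Map<String, Bucket> BUCKETS = Map.of(
            "startPlayback", Bucket.CRITICAL,
            "prefetchArtwork", Bucket.NON_CRITICAL);

    /** Unknown request types default to NON_CRITICAL so they cannot crowd the critical shard. */
    static Bucket bucketFor(String requestType) {
        return BUCKETS.getOrDefault(requestType, Bucket.NON_CRITICAL);
    }

    /** Hypothetical shard names, one cluster per bucket. */
    static String shardFor(String requestType) {
        return bucketFor(requestType) == Bucket.CRITICAL
                ? "critical-playback-service"
                : "non-critical-playback-service";
    }

    public static void main(String[] args) {
        System.out.println("startPlayback -> " + shardFor("startPlayback"));
        System.out.println("prefetchArtwork -> " + shardFor("prefetchArtwork"));
    }
}
```

With separate shards, each cluster can be sized and throttled for its own operational characteristics, so a spike in non-critical traffic does not consume capacity needed for critical requests.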

  56. EPILOGUE KEY TAKEAWAYS

  57. KEY TAKEAWAYS
      CHAPTER 1: THE WEIRD DATA IN THE CATALOG • Verify consistency prior to applying state changes. • One tool is a data canary.
      CHAPTER 2: THE VANISHING OF CRITICAL SERVICES • Manage resource constraints by reducing surface area. • Leverage circuit breakers and rigorously test failures.
      CHAPTER 3: THE THROTTLE • No heavy fallbacks!! Fallbacks should be light and fast. • Shard your application based on operational characteristics.

  58. Plan to fail. The unexpected will happen.

  59. PARTING THOUGHT DISTRIBUTED SYSTEMS SOCIAL

  60. Questions? Haley Tucker
