

  1. NetflixOSS – A Cloud Native Architecture LASER Sessions 2&3 – Overview September 2013 Adrian Cockcroft @adrianco @NetflixOSS http://www.linkedin.com/in/adriancockcroft

  2. Presentation vs. Tutorial • Presentation – Short duration, focused subject – One presenter to many anonymous audience – A few questions at the end • Tutorial – Time to explore in and around the subject – Tutor gets to know the audience – Discussion, rat-holes, “bring out your dead”

  3. Attendee Introductions • Who are you, where do you work • Why are you here today, what do you need • “Bring out your dead” – Do you have a specific problem or question? – One sentence elevator pitch • What instrument do you play?

  4. Content
     • Why Public Cloud?
     • Migration Path
     • Service and API Architectures
     • Storage Architecture
     • Operations and Tools
     • Example Applications

  5. Cloud Native – a new engineering challenge: construct a highly agile and highly available service from ephemeral and assumed-broken components

  6. How to get to Cloud Native
     • Freedom and Responsibility for Developers
     • Decentralize and Automate Ops Activities
     • Integrate DevOps into the Business Organization

  7. Four Transitions • Management: Integrated Roles in a Single Organization – Business, Development, Operations -> BusDevOps • Developers: Denormalized Data – NoSQL – Decentralized, scalable, available, polyglot • Responsibility from Ops to Dev: Continuous Delivery – Decentralized small daily production updates • Responsibility from Ops to Dev: Agile Infrastructure - Cloud – Hardware in minutes, provisioned directly by developers

  8. Netflix BusDevOps Organization (org chart)
     • Chief Product Officer, with VP Product Management, VP UI Engineering, VP Discovery Engineering and VP Platform, each with their own Directors
     • Code, independently updated with continuous delivery: each product development and platform group has its own Developers + DevOps
     • Data, denormalized, independently updated and scaled: UI Data Sources, Discovery Data Sources, Platform Data Sources
     • Infrastructure, cloud-based, self-service, updated and scaled: AWS under each group

  9. Decentralized Deployment

  10. Asgard Developer Portal http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html

  11. Ephemeral Instances
      • Largest services are autoscaled
      • Average lifetime of an instance is 36 hours
      • (chart labels: Push, Autoscale Up, Autoscale Down)
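For concreteness, here is a minimal sketch of the kind of autoscaled, zone-balanced group these ephemeral instances live in, using the AWS SDK for Java. The group name, AMI ID, instance type, zones and sizes are illustrative assumptions, not Netflix's actual configuration (Netflix drives this workflow through Asgard rather than raw SDK calls):

```java
// A minimal sketch (not Asgard code) of creating an Auto Scaling group that
// spans three availability zones, using the AWS SDK for Java v1.
// Names, AMI, instance type, zones and sizes are hypothetical.
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;
import com.amazonaws.services.autoscaling.model.CreateLaunchConfigurationRequest;

public class AutoscaleSketch {
    public static void main(String[] args) {
        AmazonAutoScaling asg = AmazonAutoScalingClientBuilder.defaultClient();

        // Launch configuration: which AMI and instance type new instances use.
        asg.createLaunchConfiguration(new CreateLaunchConfigurationRequest()
                .withLaunchConfigurationName("myservice-v042")   // hypothetical name
                .withImageId("ami-12345678")                      // hypothetical AMI
                .withInstanceType("m1.xlarge"));

        // Auto Scaling group balanced across three zones; individual instances
        // are ephemeral and get replaced as the group scales up and down.
        asg.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
                .withAutoScalingGroupName("myservice-v042")
                .withLaunchConfigurationName("myservice-v042")
                .withAvailabilityZones("us-east-1a", "us-east-1b", "us-east-1c")
                .withMinSize(12)
                .withMaxSize(60));
    }
}
```

The group, not any individual instance, is the unit of capacity, which is why a 36-hour average instance lifetime is unremarkable.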

  12. Netflix Member Web Site Home Page Personalization Driven – How Does It Work?

  13. How Netflix Used to Work (architecture diagram)
      • Datacenter: monolithic web app and monolithic streaming app, backed by Oracle and MySQL, plus content management and content encoding
      • AWS cloud services alongside the datacenter
      • Consumer electronics and customer devices (PC, PS3, TV…) served through CDN edge locations run by Limelight, Level 3 and Akamai

  14. How Netflix Streaming Works Today (architecture diagram)
      • AWS cloud services: web site / discovery API, user data, personalization, DRM, streaming API, QoS logging, CDN management and steering
      • Datacenter: content encoding
      • Consumer electronics and customer devices (PC, PS3, TV…) stream from OpenConnect CDN boxes at CDN edge locations

  15. The AWS Question Why does Netflix use AWS when Amazon Prime is a competitor?

  16. Netflix vs. Amazon Prime • Do retailers competing with Amazon use AWS? – Yes, lots of them, Netflix is no different • Does Prime have a platform advantage? – No, because Netflix also gets to run on AWS • Does Netflix take Amazon Prime seriously? – Yes, but so far Prime isn’t impacting our growth

  17. Streaming Bandwidth (charts, Nov 2012 vs. March 2013): mean bandwidth +39% in 6 months

  18. The Google Cloud Question Why doesn’t Netflix use Google Cloud as well as AWS?

  19. Google Cloud – Wait and See
      • Pros: Cloud Native • Huge scale for internal apps • Exposing internal services • Nice clean API model • Starting a price war • Fast for what it does • Rapid start & minute billing
      • Cons: In beta until recently • Few big customers yet • Missing many key features • Different arch model • Missing billing options • No SSD or huge instances • Zone maintenance windows
      • But: anyone interested is welcome to port NetflixOSS components to Google Cloud

  20. Cloud Wars: Price and Performance
      • AWS vs. GCS war – what changed: everyone using AWS or GCS gets the price cuts and performance improvements as they happen; no need to switch vendor
      • Private cloud – no change: locked in for three years

  21. The DIY Question Why doesn’t Netflix build and run its own cloud?

  22. Fitting Into Public Scale (diagram, roughly 1,000 to 100,000 instances)
      • Public cloud: startups, at around 1,000 instances
      • Grey area: Netflix
      • Private: Facebook, at around 100,000 instances

  23. How big is Public?
      • AWS maximum possible instance count: 4.2 million (May 2013); growth >10x in three years, >2x per annum – http://bit.ly/awsiprange
      • Upper bound estimate based on the number of AWS public IP addresses
      • Every provisioned instance gets a public IP by default (some VPC instances don’t)
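The arithmetic behind that upper bound is simple enough to sketch: take AWS's published public IP ranges and add up the addresses each CIDR block can hold. The CIDR list below is a placeholder, not the real published data (that is what the bit.ly link points at); only the method is the point:

```java
// Sketch of the estimation method from the slide: sum the addresses in the
// published AWS public IP ranges to bound the possible instance count.
// The CIDR blocks below are illustrative placeholders, not the real list.
import java.util.List;

public class AwsAddressUpperBound {
    static long addressesIn(String cidr) {
        int prefix = Integer.parseInt(cidr.substring(cidr.indexOf('/') + 1));
        return 1L << (32 - prefix);            // 2^(32 - prefix) addresses per block
    }

    public static void main(String[] args) {
        List<String> publishedRanges = List.of(   // placeholder CIDRs
                "23.20.0.0/14", "50.16.0.0/14", "54.224.0.0/12");
        long upperBound = publishedRanges.stream()
                .mapToLong(AwsAddressUpperBound::addressesIn)
                .sum();
        System.out.println("Upper bound on provisioned instances: " + upperBound);
    }
}
```

It is an upper bound because not every address in a range is assigned to a running instance, and a loose one at that; the trend line (>2x per annum) matters more than the absolute number.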

  24. The Alternative Supplier Question What if there is no clear leader for a feature, or AWS doesn’t have what we need?

  25. Things We Don’t Use AWS For
      • SaaS applications – PagerDuty, AppDynamics
      • Content Delivery Service
      • DNS Service

  26. CDN Scale (diagram, gigabits to terabits, comparing Akamai, Netflix Openconnect, AWS CloudFront, Limelight, YouTube and Level 3; scale markers: Startups, Netflix, Facebook)

  27. Content Delivery Service – Open Source Hardware Design + FreeBSD, bird, nginx; see openconnect.netflix.com

  28. DNS Service
      • AWS Route53 is missing too many features (for now)
      • Multiple vendor strategy: Dyn, Ultra, Route53
      • Abstracted the (broken) DNS vendor APIs with Denominator
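The idea behind Denominator is worth a sketch: hide each vendor's (often broken) DNS API behind one neutral interface so the same record change can be pushed to Dyn, Ultra or Route53 interchangeably. This is not Denominator's actual API; every type below is hypothetical and only illustrates the multi-vendor abstraction:

```java
// Hypothetical illustration of a multi-vendor DNS abstraction (the concept
// behind Denominator), not the real Denominator API.
import java.util.List;
import java.util.Map;

interface DnsProvider {
    // One vendor-neutral operation; each vendor adapter maps it to its own API.
    void upsertRecord(String zone, String name, String type, int ttl, List<String> values);
}

class MultiVendorDns {
    private final Map<String, DnsProvider> providers;  // e.g. "dyn", "ultra", "route53"

    MultiVendorDns(Map<String, DnsProvider> providers) {
        this.providers = providers;
    }

    // Apply the same change to every vendor so no single one is a lock-in point.
    void upsertEverywhere(String zone, String name, String type, int ttl, List<String> values) {
        providers.values().forEach(p -> p.upsertRecord(zone, name, type, ttl, values));
    }
}
```

The payoff is operational: DNS-based failover and multi-region steering do not depend on any one vendor's feature set or uptime.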

  29. What Changed? Get out of the way of innovation (diagram)
      • Old approach: cost reduction → lower margins → less revenue; process reduction → slow down developers → less competitive
      • Instead: speed up developers → more competitive → more revenue → higher margins
      • Best of breed, by the hour • Choices based on scale

  30. Availability Questions Is it running yet? How many places is it running in? How far apart are those places?

  31. Netflix Outages • Running very fast with scissors – Mostly self inflicted – bugs, mistakes from pace of change – Some caused by AWS bugs and mistakes • Incident Life-cycle Management by Platform Team – No runbooks, no operational changes by the SREs – Tools to identify what broke and call the right developer • Next step is multi-region active/active – Investigating and building in stages during 2013 – Could have prevented some of our 2012 outages

  32. Real Web Server Dependencies Flow (Netflix home page business transaction as seen by AppDynamics)
      • Each icon is three to a few hundred instances, deployed across three AWS zones
      • Components in the flow: web service (start here), Cassandra, memcached, S3 bucket, personalization movie group choosers (for US, Canada and Latam)

  33. Three Balanced Availability Zones – Test with Chaos Gorilla (diagram)
      • Load balancers spread traffic across Zone A, Zone B and Zone C
      • Cassandra and Evcache replicas in every zone
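A Chaos Gorilla style test amounts to proving that losing a whole zone is survivable. Here is a minimal sketch of what such a drill does, using the AWS SDK for Java; it is not the Simian Army code, and the zone name is an illustrative assumption:

```java
// Sketch of a zone-failure drill: find every instance in one availability zone
// and terminate it, to prove the other two zones can carry the traffic.
// Not the Simian Army implementation; zone name is hypothetical.
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.DescribeInstancesRequest;
import com.amazonaws.services.ec2.model.Filter;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;
import java.util.ArrayList;
import java.util.List;

public class ZoneFailureDrill {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Collect every instance in the "failed" zone.
        List<String> doomed = new ArrayList<>();
        ec2.describeInstances(new DescribeInstancesRequest()
                        .withFilters(new Filter("availability-zone").withValues("us-east-1a")))
                .getReservations()
                .forEach(r -> r.getInstances().forEach(i -> doomed.add(i.getInstanceId())));

        // Terminate them all at once to simulate losing the zone.
        if (!doomed.isEmpty()) {
            ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(doomed));
        }
    }
}
```

With three balanced zones and replicated Cassandra/Evcache, the remaining two zones should absorb the load while autoscaling rebuilds capacity.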

  34. Isolated Regions (diagram)
      • US-East: load balancers over Zone A, Zone B and Zone C, with Cassandra replicas in each zone
      • EU-West: load balancers over Zone A, Zone B and Zone C, with Cassandra replicas in each zone

  35. Highly Available NoSQL Storage A highly scalable, available and durable deployment pattern based on Apache Cassandra

  36. Single Function Micro-Service Pattern (diagram)
      • One keyspace per service, replacing a single table or materialized view
      • Many different single-function REST clients call a stateless data access REST service (Astyanax Cassandra client), which talks to a single-function Cassandra cluster managed by Priam, between 6 and 144 nodes
      • Optional datacenter update flow
      • Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones; AppDynamics service flow visualization
      • Scale: over 50 Cassandra clusters, over 1,000 nodes, over 30 TB of backups, over 1M writes/s/cluster
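As a sketch of what one of these single-function services looks like, here is a stateless JAX-RS resource fronting one keyspace. The SubscriberStore interface is a hypothetical stand-in for an Astyanax-backed data access layer; the real Netflix services differ in detail:

```java
// Minimal sketch of the single-function pattern: one stateless REST resource
// over one Cassandra keyspace. SubscriberStore is a hypothetical DAO.
import javax.ws.rs.GET;
import javax.ws.rs.PUT;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/subscriber")
@Produces(MediaType.APPLICATION_JSON)
public class SubscriberResource {

    /** Hypothetical DAO over a single Cassandra keyspace (e.g. via Astyanax). */
    public interface SubscriberStore {
        String get(String id);
        void put(String id, String json);
    }

    private final SubscriberStore store;

    public SubscriberResource(SubscriberStore store) {
        // No local state beyond the Cassandra client: any instance can serve any request,
        // so the service scales horizontally behind the load balancer.
        this.store = store;
    }

    @GET
    @Path("/{id}")
    public String get(@PathParam("id") String id) {
        return store.get(id);
    }

    @PUT
    @Path("/{id}")
    public void put(@PathParam("id") String id, String json) {
        store.put(id, json);
    }
}
```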

  37. Stateless Micro-Service Architecture (instance stack diagram)
      • Linux base AMI (CentOS or Ubuntu), Java (JDK 6 or 7), Tomcat
      • Optional Apache frontend, memcached, non-java apps
      • Application war file: base servlet, platform, client interface jars, Astyanax
      • Healthcheck, status servlets, JMX interface, Servo autoscale
      • Monitoring: AppDynamics appagent and machineagent, Epic/Atlas
      • Log rotation to S3, GC and thread dump logging
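The healthcheck servlet in that stack is the hook that load balancers and deployment tooling poll before sending an instance traffic. A minimal sketch, assuming a plain servlet container and a placeholder dependency check (the real Netflix base servlet is richer than this):

```java
// Sketch of a healthcheck servlet: return HTTP 200 when the instance can serve
// traffic, 503 otherwise. dependenciesHealthy() is a hypothetical placeholder.
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HealthcheckServlet extends HttpServlet {

    /** Hypothetical hook: e.g. verify Cassandra/memcached clients are initialized. */
    protected boolean dependenciesHealthy() {
        return true;
    }

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        if (dependenciesHealthy()) {
            resp.setStatus(HttpServletResponse.SC_OK);
            resp.getWriter().println("OK");
        } else {
            // Unhealthy instances are taken out of the load balancer pool.
            resp.setStatus(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
        }
    }
}
```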

  38. Cassandra Instance Architecture (instance stack diagram)
      • Linux base AMI (CentOS or Ubuntu), Java (JDK 7)
      • Tomcat and Priam on the JDK – healthcheck, status
      • Cassandra server
      • Local ephemeral disk space – 2TB of SSD or 1.6TB disk holding commit log and SSTables
      • Monitoring: AppDynamics appagent and machineagent, Epic/Atlas, GC and thread dump logging

  39. Cassandra at Scale Benchmarking to Retire Risk

  40. Scalability from 48 to 288 nodes on AWS
      http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
      Client writes/s by node count, replication factor = 3:
      • 48 nodes – 174,373 writes/s
      • 96 nodes – 366,828 writes/s
      • 144 nodes – 537,172 writes/s
      • 288 nodes – 1,099,837 writes/s
      Used 288 m1.xlarge instances (4 CPU, 15 GB RAM, 8 ECU), Cassandra 0.86; the benchmark config only existed for about 1 hour

  41. Cassandra Disk vs. SSD Benchmark Same Throughput, Lower Latency, Half Cost http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html
