NetflixOSS – A Cloud Native Architecture LASER Sessions 2&3 – Overview September 2013 Adrian Cockcroft @adrianco @NetflixOSS http://www.linkedin.com/in/adriancockcroft
Presentation vs. Tutorial • Presentation – Short duration, focused subject – One presenter to a large anonymous audience – A few questions at the end • Tutorial – Time to explore in and around the subject – Tutor gets to know the audience – Discussion, rat-holes, “bring out your dead”
Attendee Introductions • Who are you, where do you work • Why are you here today, what do you need • “Bring out your dead” – Do you have a specific problem or question? – One sentence elevator pitch • What instrument do you play?
Content Why Public Cloud? Migration Path Service and API Architectures Storage Architecture Operations and Tools Example Applications
Cloud Native A new engineering challenge Construct a highly agile and highly available service from ephemeral and assumed broken components
How to get to Cloud Native Freedom and Responsibility for Developers Decentralize and Automate Ops Activities Integrate DevOps into the Business Organization
Four Transitions • Management: Integrated Roles in a Single Organization – Business, Development, Operations -> BusDevOps • Developers: Denormalized Data – NoSQL – Decentralized, scalable, available, polyglot • Responsibility from Ops to Dev: Continuous Delivery – Decentralized small daily production updates • Responsibility from Ops to Dev: Agile Infrastructure - Cloud – Hardware in minutes, provisioned directly by developers
Netflix BusDevOps Organization [Org chart: Chief Product Officer over VP Product Management, VP UI Engineering, VP Discovery Engineering, and VP Platform, each with their own Directors and Developers + DevOps teams. Code: independently updated, continuous delivery. Data: UI, Discovery, and Platform data sources are denormalized, independently updated and scaled. Infrastructure: AWS cloud, self-service, updated and scaled.]
Decentralized Deployment
Asgard Developer Portal http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
Ephemeral Instances • Largest services are autoscaled • Average lifetime of an instance is 36 hours [Chart: instance count over a day, showing a code push, autoscale up, and autoscale down]
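The deck shows no code for this, so here is a hedged sketch of how such an autoscaled group could be defined with the AWS SDK for Java. The group name, launch config, zone list, sizes, and the 10% policy are illustrative assumptions, not Netflix's actual configuration (Netflix drives this through Asgard):

```java
import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;

public class AutoscaleSketch {
    public static void main(String[] args) {
        AmazonAutoScalingClient autoscaling = new AmazonAutoScalingClient();

        // Hypothetical service group spread across three availability zones.
        autoscaling.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
                .withAutoScalingGroupName("api-prod-v042")            // assumed name
                .withLaunchConfigurationName("api-prod-launchconfig") // assumed launch config
                .withAvailabilityZones("us-east-1a", "us-east-1c", "us-east-1d")
                .withMinSize(12)
                .withMaxSize(120)
                .withDesiredCapacity(36));

        // Grow capacity by 10% when a (separately defined) CloudWatch alarm fires.
        autoscaling.putScalingPolicy(new PutScalingPolicyRequest()
                .withAutoScalingGroupName("api-prod-v042")
                .withPolicyName("scale-up-on-load")
                .withAdjustmentType("PercentChangeInCapacity")
                .withScalingAdjustment(10)
                .withCooldown(300));
    }
}
```

With instances living 36 hours on average, the group membership churns constantly; nothing in the service may depend on any particular instance surviving.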
Netflix Member Web Site Home Page Personalization Driven – How Does It Work?
How Netflix Used to Work [Diagram: the datacenter ran a monolithic web app and a monolithic streaming app backed by Oracle and MySQL, plus content management (Oracle) and content encoding; a few AWS cloud services used MySQL; content flowed through Limelight, Level 3, and Akamai CDN edge locations to consumer electronics and customer devices (PC, PS3, TV…)]
How Netflix Streaming Works Today [Diagram: AWS cloud services handle the web site / discovery API, user data, personalization, DRM, streaming API, QoS logging, and CDN management and steering; the datacenter retains content encoding; OpenConnect CDN boxes at CDN edge locations deliver content to consumer electronics and customer devices (PC, PS3, TV…)]
The AWS Question Why does Netflix use AWS when Amazon Prime is a competitor?
Netflix vs. Amazon Prime • Do retailers competing with Amazon use AWS? – Yes, lots of them, Netflix is no different • Does Prime have a platform advantage? – No, because Netflix also gets to run on AWS • Does Netflix take Amazon Prime seriously? – Yes, but so far Prime isn’t impacting our growth
[Chart: streaming bandwidth, Nov 2012 vs. March 2013; mean bandwidth +39% in six months]
The Google Cloud Question Why doesn’t Netflix use Google Cloud as well as AWS?
Google Cloud – Wait and See • Pros: Cloud native; huge scale for internal apps; exposing internal services; nice clean API model; starting a price war; fast for what it does; rapid start & minute billing • Cons: In beta until recently; few big customers yet; missing many key features; different arch model; missing billing options; no SSD or huge instances; zone maintenance windows • But: Anyone interested is welcome to port NetflixOSS components to Google Cloud
Cloud Wars: Price and Performance • What Changed: the AWS vs. GCS price war; everyone using AWS or GCS gets the price cuts and performance improvements as they happen, with no need to switch vendor • No Change: private cloud $$, locked in for three years
The DIY Question Why doesn’t Netflix build and run its own cloud?
Fitting Into Public Scale [Diagram: scale spectrum from 1,000 to 100,000 instances, with a grey area between public and private; startups at small public scale, Netflix at large public scale, Facebook private]
How big is Public? AWS maximum possible instance count: 4.2 million (May 2013). Growth >10x in three years, >2x per annum (http://bit.ly/awsiprange). This is an upper-bound estimate based on the number of public IP addresses; every provisioned instance gets a public IP by default (some VPC instances don't).
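The upper-bound arithmetic is just a sum over the published EC2 CIDR blocks. A self-contained sketch; the CIDR list here is a made-up stand-in for the real ranges behind the link above:

```java
import java.util.Arrays;
import java.util.List;

public class AwsIpCount {
    public static void main(String[] args) {
        // Stand-in values; the real list is published by AWS (see link above).
        List<String> ec2Cidrs = Arrays.asList("23.20.0.0/14", "50.16.0.0/15", "54.224.0.0/12");

        long total = 0;
        for (String cidr : ec2Cidrs) {
            int prefix = Integer.parseInt(cidr.split("/")[1]);
            total += 1L << (32 - prefix); // a /14 contains 2^(32-14) = 262,144 addresses
        }
        System.out.printf("Upper bound on provisionable instances: %,d%n", total);
    }
}
```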
The Alternative Supplier Question What if there is no clear leader for a feature, or AWS doesn’t have what we need?
Things We Don't Use AWS For SaaS Applications – Pagerduty, Appdynamics Content Delivery Service DNS Service
CDN Scale [Chart: CDN capacity from gigabits to terabits; AWS CloudFront, Limelight, and Level 3 at gigabit scale, Akamai, YouTube, and Netflix Open Connect at terabit scale; startups through Netflix and Facebook shown for comparison]
Content Delivery Service Open Source Hardware Design + FreeBSD, bird, nginx see openconnect.netflix.com
DNS Service AWS Route53 is missing too many features (for now) Multiple vendor strategy Dyn, Ultra, Route53 Abstracted (broken) DNS APIs with Denominator
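Denominator's goal is one portable API over Dyn, UltraDNS, and Route53. The sketch below is not Denominator's real API; it is a hypothetical illustration of the abstraction idea, with vendor adapters hidden behind a single interface so a broken or missing-feature provider can be swapped out:

```java
import java.util.List;
import java.util.Map;

/** Hypothetical vendor-neutral DNS interface (illustrative, not Denominator's API). */
interface DnsProvider {
    void putRecord(String zone, String name, String type, int ttl, List<String> values);
    Map<String, List<String>> listRecords(String zone);
}

/** Writes every change to all configured vendors so any one of them can fail. */
class MultiVendorDns implements DnsProvider {
    private final List<DnsProvider> vendors; // e.g. adapters for Dyn, UltraDNS, Route53

    MultiVendorDns(List<DnsProvider> vendors) { this.vendors = vendors; }

    @Override
    public void putRecord(String zone, String name, String type, int ttl, List<String> values) {
        for (DnsProvider vendor : vendors) {
            vendor.putRecord(zone, name, type, ttl, values);
        }
    }

    @Override
    public Map<String, List<String>> listRecords(String zone) {
        return vendors.get(0).listRecords(zone); // reads can come from any healthy vendor
    }
}
```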
What Changed? [Diagram: two feedback loops. Cost reduction: lower margins, slow down developers, less revenue, less competitive. Process reduction: higher margins, speed up developers, more revenue, more competitive.] Get out of the way of innovation Best of breed, by the hour Choices based on scale
Availability Questions Is it running yet? How many places is it running in? How far apart are those places?
Netflix Outages • Running very fast with scissors – Mostly self-inflicted – bugs, mistakes from pace of change – Some caused by AWS bugs and mistakes • Incident lifecycle management by Platform Team – No runbooks, no operational changes by the SREs – Tools to identify what broke and call the right developer • Next step is multi-region active/active – Investigating and building in stages during 2013 – Could have prevented some of our 2012 outages
Real Web Server Dependencies Flow (Netflix home page business transaction as seen by AppDynamics) [Diagram: the flow starts at the web service and fans out to memcached, Cassandra clusters spanning three AWS zones, an S3 bucket, and the personalization movie group choosers (for US, Canada and Latam); each icon is three to a few hundred instances]
Three Balanced Availability Zones Test with Chaos Gorilla [Diagram: load balancers in front of Zones A, B, and C, each zone holding Cassandra and Evcache replicas]
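A Chaos-Gorilla-style zone test can be approximated by terminating everything in one zone and verifying the other two absorb the traffic. A hedged sketch with the AWS SDK for Java; the zone choice is an assumption, and the real Chaos Gorilla ships as part of Netflix's Simian Army rather than as this loop:

```java
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.*;
import java.util.ArrayList;
import java.util.List;

public class ZoneFailureDrill {
    public static void main(String[] args) {
        AmazonEC2Client ec2 = new AmazonEC2Client();
        String doomedZone = "us-east-1a"; // assumed zone under test

        // Find all instances in the target zone ("availability-zone" is a real EC2 filter).
        DescribeInstancesResult result = ec2.describeInstances(new DescribeInstancesRequest()
                .withFilters(new Filter("availability-zone").withValues(doomedZone)));

        List<String> ids = new ArrayList<>();
        for (Reservation r : result.getReservations()) {
            for (Instance i : r.getInstances()) {
                ids.add(i.getInstanceId());
            }
        }

        // Simulate the zone outage; the two surviving zones should keep serving.
        if (!ids.isEmpty()) {
            ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(ids));
        }
    }
}
```

The point of the balanced layout is that losing any one of the three zones leaves two full replica sets and two thirds of the serving capacity.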
Isolated Regions [Diagram: US-East and EU-West each have their own load balancers and Zones A, B, and C with Cassandra replicas; the regions share no infrastructure]
Highly Available NoSQL Storage A highly scalable, available and durable deployment pattern based on Apache Cassandra
Single Function Micro-Service Pattern [Diagram: many different single-function REST clients call a stateless data access REST service, which uses the Astyanax Cassandra client to reach a single-function Cassandra cluster managed by Priam (between 6 and 144 nodes); one keyspace replaces a single table or materialized view; optional datacenter update flow; each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones; AppDynamics provides service flow visualization] Over 50 Cassandra clusters, over 1,000 nodes, over 30TB of backups, over 1M writes/s/cluster
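A minimal Astyanax write/read against such a single-function cluster might look like the sketch below; the cluster name, keyspace, column family, and seed host are illustrative assumptions:

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class SingleFunctionStore {
    // One column family per micro-service, standing in for a single table/view.
    private static final ColumnFamily<String, String> CF =
            ColumnFamily.newColumnFamily("user_prefs", StringSerializer.get(), StringSerializer.get());

    public static void main(String[] args) throws Exception {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("prefs_cluster")          // assumed cluster name
                .forKeyspace("prefs")                 // assumed keyspace
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
                .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("pool")
                        .setPort(9160)
                        .setMaxConnsPerHost(3)
                        .setSeeds("127.0.0.1:9160"))  // assumed seed node
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        Keyspace keyspace = context.getClient();

        // Write one row, then read the column back.
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(CF, "user-123").putColumn("max_bitrate", "3000", null);
        m.execute();

        String value = keyspace.prepareQuery(CF)
                .getKey("user-123").getColumn("max_bitrate")
                .execute().getResult().getStringValue();
        System.out.println(value);
    }
}
```

Because the REST service in front of this is stateless, any instance can serve any request, and the whole tier scales horizontally behind the load balancers.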
Stateless Micro-Service Architecture [Instance stack: Linux base AMI (CentOS or Ubuntu); optional Apache frontend, memcached, non-Java apps; Java (JDK 6 or 7); Tomcat; application war file with base servlet, platform and client interface jars, and Astyanax; healthcheck and status servlets, JMX interface, Servo autoscale; AppDynamics appagent and machineagent monitoring plus Epic/Atlas; log rotation to S3; GC and thread dump logging]
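The healthcheck servlet this stack relies on can be as small as the sketch below; the checks are placeholder assumptions, and the real base servlet ships in Netflix's platform jar:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/** Minimal healthcheck: load balancers and tooling poll this to judge instance health. */
public class HealthcheckServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        boolean healthy = dependenciesOk();
        resp.setStatus(healthy ? HttpServletResponse.SC_OK
                               : HttpServletResponse.SC_SERVICE_UNAVAILABLE);
        resp.getWriter().print(healthy ? "OK" : "FAIL");
    }

    private boolean dependenciesOk() {
        return true; // placeholder: real checks would probe caches, clients, thread pools, etc.
    }
}
```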
Cassandra Instance Architecture [Instance stack: Linux base AMI (CentOS or Ubuntu); Java (JDK 7); Tomcat and Priam on the JDK providing healthcheck and status; Cassandra server; AppDynamics appagent and machineagent monitoring plus Epic/Atlas; local ephemeral disk space (2TB of SSD or 1.6TB disk) holding the commit log and SSTables; GC and thread dump logging]
Cassandra at Scale Benchmarking to Retire Risk
Scalability from 48 to 288 nodes on AWS http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html [Chart: client writes/s by node count, replication factor = 3: 48 nodes 174,373; 96 nodes 366,828; 144 nodes 537,172; 288 nodes 1,099,837] Used 288 m1.xlarge instances (4 CPU, 15 GB RAM, 8 ECU) running Cassandra 0.8.6; the benchmark config only existed for about an hour
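The point of the chart is that per-node throughput stays roughly constant as the cluster grows: 174,373 / 48 ≈ 3,630 writes/s per node at the low end versus 1,099,837 / 288 ≈ 3,820 writes/s per node at the high end, i.e. near-linear horizontal scaling across a 6x node-count range.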
Cassandra Disk vs. SSD Benchmark Same Throughput, Lower Latency, Half Cost http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html