
Freeing the Whale: How to Fail at Scale. Oliver Gould, CTO, Buoyant



  1. From QConSF, November 9, 2016. Freeing the Whale: How to Fail at Scale. Oliver Gould, CTO, Buoyant

  2. 2010: A FAILWHALE ODYSSEY

  3. Twitter, 2010: 10^7 users; 10^7 tweets/day; 10^2 engineers; 10^1 ops eng; 10^1 services; 10^1 deploys/week; 10^2 hosts; 0 datacenters; 10^1 user-facing outages/week. Source: https://blog.twitter.com/2010/measuring-tweets

  4. objective: reliability, flexibility

  5. objective -> solution: reliability -> platform; flexibility -> SOA + devops, i.e. “microservices”

  6. “Resilience is an imperative: our software runs on the truly dismal computers we call datacenters. Besides being heinously complex… they are unreliable and prone to operator error.” Marius Eriksen (@marius), RPC Redux

  7. software you didn’t write, hardware you can’t touch, network you can’t trace: they break in new and surprising ways, and your customers shouldn’t notice

  8. freeing the whale (photo: Johanan Ottensooser)

  9. mesos.apache.org: UC Berkeley, 2010; Twitter, 2011; Apache, 2012. Abstracts compute resources. Promise: don’t worry about the hosts

  10. aurora.apache.org: Twitter, 2011; Apache, 2013. Schedules processes on Mesos. Promise: no more puppet, monit, etc.

  11. timelines x800, users x300, notifications x1000; Aurora (or Marathon, or …); Mesos; host, host, host, host, host, host

  12. timelines x800, users x300, notifications x1000; Aurora (or Marathon, or …); Mesos; 🔦, host, host, host, host, host

  13. service discovery (timelines, users, zookeeper): create ephemeral /svc/users/node_012345 {“host”: “host-abc”, “port”: 4321}

  14. service discovery (timelines, users, zookeeper): watch /svc/users/*

  15. service discovery (timelines, users, zookeeper): GetUser(olix0r)

  16. service discovery (timelines, users, zookeeper): GetUser(olix0r); uh oh.

  17. service discovery (timelines, users, zookeeper): GetUser(olix0r); client caches results

  18. service discovery (timelines, users, zookeeper): GetUser(olix0r); zookeeper serves empty results?!

  19. service discovery (timelines, users, zookeeper): GetUser(olix0r); service discovery is advisory
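
The "service discovery is advisory" idea can be sketched as a client that keeps its last known addresses when discovery suddenly reports nothing. This is a minimal Python sketch, not Finagle's or linkerd's implementation; the class name `AdvisoryResolver` and its methods are hypothetical.

```python
class AdvisoryResolver:
    """Caches the last non-empty address set reported by service discovery."""

    def __init__(self):
        self._addresses = frozenset()

    def on_update(self, addresses):
        # Treat discovery as advice: an empty update is more likely a
        # discovery failure than every instance disappearing at once,
        # so keep serving from the cached set.
        if addresses:
            self._addresses = frozenset(addresses)

    def addresses(self):
        return self._addresses


resolver = AdvisoryResolver()
resolver.on_update([("host-abc", 4321)])  # normal watch update
resolver.on_update([])                    # e.g. zookeeper serves empty results
print(resolver.addresses())
```

A real client would also need some way to drop addresses that have genuinely stopped responding, for example through connection-level failure detection.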

  20. github.com/twitter/finagle: RPC library (JVM); asynchronous; built on Netty; scala; functional; strongly typed; first commit: Oct 2010

  21. languages, libraries; business; [7] application; json, protobuf, thrift, …; [6] presentation; rpc; http/2, mux, …; [5] session; [4] transport; kubernetes, mesos, swarm, …; [3] network; datacenter; canal, weave, …; [2] link; aws, azure, digitalocean, gce, …; [1] physical

  22. “It’s slow” is the hardest problem you’ll ever debug. Jeff Hodges (@jmhodges), Notes on Distributed Systems for Young Bloods

  23. observability: counters (e.g. client/users/failures), histograms (e.g. client/users/latency/p99), tracing
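
A rough Python sketch of the two metric kinds named on this slide: counters keyed by a path-like name, and a latency percentile. The helper names `record` and `percentile` are hypothetical, and the percentile is computed over raw samples for simplicity (production systems usually use bucketed histograms instead).

```python
from collections import Counter

counters = Counter()
latencies_ms = []


def record(name, latency_ms, ok):
    """Count one request and remember its latency."""
    counters[f"{name}/requests"] += 1
    if not ok:
        counters[f"{name}/failures"] += 1
    latencies_ms.append(latency_ms)


def percentile(samples, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]


# Simulate 100 calls to "users": latencies 1..100 ms, every 10th call fails.
for i in range(1, 101):
    record("client/users", latency_ms=i, ok=(i % 10 != 0))

print(counters["client/users/failures"], percentile(latencies_ms, 99))
```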

  24. tracing

  25. timeouts & retries: web (timeout=400ms, retries=3) -> timelines (timeout=400ms, retries=2) -> users (timeout=200ms, retries=3) -> db

  26. timeouts & retries: web (timeout=400ms, retries=3) -> timelines (timeout=400ms, retries=2: 800ms!) -> users (timeout=200ms, retries=3: 600ms!) -> db
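
The arithmetic behind the "800ms!" and "600ms!" flags can be reproduced in a few lines of Python. This sketch assumes, as the slide's numbers imply, that worst-case time at a hop is its per-attempt timeout multiplied by its retries setting; the function names are hypothetical.

```python
# (service, per-attempt timeout in ms, retries setting) as read off the slide
hops = [("web", 400, 3), ("timelines", 400, 2), ("users", 200, 3)]


def worst_case_ms(timeout_ms, attempts):
    """Time a hop can spend if every attempt runs to its timeout."""
    return timeout_ms * attempts


def over_budget(hops, budget_ms):
    """Downstream hops whose worst case exceeds the caller's timeout."""
    flagged = []
    for name, timeout_ms, attempts in hops[1:]:
        total = worst_case_ms(timeout_ms, attempts)
        if total > budget_ms:
            flagged.append((name, total))
    return flagged


print(over_budget(hops, budget_ms=400))
```

Both downstream hops can keep working long after web's 400ms timeout has expired, which is the problem that deadlines address.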

  27. deadlines: web (timeout=400ms) -> timelines (77ms elapsed, deadline=323ms) -> users (113ms elapsed, deadline=210ms) -> db
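
The deadline numbers on this slide follow from subtracting the time spent at each hop from the remaining budget. A minimal Python sketch, with the hypothetical helper `propagate`:

```python
def propagate(timeout_ms, elapsed_per_hop_ms):
    """Return the deadline each hop passes downstream.

    Each hop subtracts the time it has already spent from the budget it
    received; once the budget is used up, nothing is sent further.
    """
    remaining = timeout_ms
    deadlines = []
    for elapsed in elapsed_per_hop_ms:
        remaining -= elapsed
        if remaining <= 0:
            break  # deadline exceeded: fail fast instead of calling downstream
        deadlines.append(remaining)
    return deadlines


# web starts with 400ms; timelines spends 77ms, users spends 113ms.
print(propagate(400, [77, 113]))
```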

  28. retries typical: retries=3

  29. retries: typical: retries=3; worst-case: 300% more load!!!

  30. budgets: typical: retries=3, worst-case: 300% more load!!!; better: retryBudget=20%, worst-case: 20% more load
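
A retry budget caps retries as a fraction of regular traffic instead of a fixed count per request. The Python sketch below is a deliberately simplified illustration with a hypothetical `RetryBudget` class; it counts over the whole lifetime of the object, whereas a production budget would typically use a sliding time window.

```python
class RetryBudget:
    """Allows retries up to a fixed percentage of recorded requests."""

    def __init__(self, percent):
        self.percent = percent
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_retry(self):
        # Integer arithmetic avoids floating-point drift in the comparison.
        if (self.retries + 1) * 100 <= self.requests * self.percent:
            self.retries += 1
            return True
        return False


budget = RetryBudget(percent=20)
for _ in range(100):
    budget.record_request()

# Even if every request fails and asks for 3 retries, only 20 are granted.
granted = sum(budget.try_retry() for _ in range(300))
print(granted)
```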

  31. load shedding via cancellation web web timelines timelines timeout! users users db db

  32. load shedding via cancellation web web timelines timelines timeout! users users db db

  33. backpressure: web: 1000 requests; timelines: 1000 requests; users: 100 requests; db

  34. backpressure: web: 1000 failed; timelines: 1000 failed 💁; users; db

  35. backpressure: web: 100 ok + 900 failed/redirected/etc; timelines: 100 ok; users: 100 ok; db
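
One way to picture backpressure is an admission limit: a service accepts only as much work as it can handle and rejects the rest immediately, so the caller can fail fast or redirect. This is a minimal single-threaded Python sketch with a hypothetical `BoundedService` class; the capacity of 100 mirrors the slide's numbers.

```python
class BoundedService:
    """Admits at most `capacity` concurrent requests; rejects the rest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0

    def try_acquire(self):
        if self.in_flight >= self.capacity:
            return False  # signal backpressure to the caller
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1


db = BoundedService(capacity=100)
results = [db.try_acquire() for _ in range(1000)]  # burst of 1000 requests
ok = sum(results)
rejected = len(results) - ok
print(ok, rejected)
```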

  36. request-level load balancing: lb algorithms: • round-robin • fewest connections • queue depth • exponentially-weighted moving average (ewma) • aperture
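
To illustrate the ewma option from this list, here is a simplified Python sketch that tracks an exponentially-weighted moving average of latency per endpoint and picks the lowest. It is not Finagle's implementation; the class name `EwmaBalancer` and the smoothing factor `alpha=0.5` are assumptions chosen for the example.

```python
class EwmaBalancer:
    """Picks the endpoint with the lowest moving-average latency."""

    def __init__(self, endpoints, alpha=0.5):
        self.alpha = alpha
        self.ewma = {endpoint: 0.0 for endpoint in endpoints}

    def pick(self):
        # Ties go to the endpoint listed first.
        return min(self.ewma, key=self.ewma.get)

    def observe(self, endpoint, latency_ms):
        previous = self.ewma[endpoint]
        self.ewma[endpoint] = self.alpha * latency_ms + (1 - self.alpha) * previous


lb = EwmaBalancer(["a", "b"])
lb.observe("a", 100)   # a: 50.0
lb.observe("b", 10)    # b: 5.0
first = lb.pick()
lb.observe("b", 500)   # b slows down: 252.5
second = lb.pick()
print(first, second)
```

Because recent samples weigh more than old ones, a replica that suddenly slows down loses traffic quickly and regains it as its latency recovers.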

  37. So just rewrite everything in Finagle!?

  38. linkerd

  39. github.com/buoyantio/linkerd: service mesh proxy; built on finagle & netty; suuuuper pluggable: http, thrift, …; etcd, consul, kubernetes, marathon, zookeeper, …; …

  40. Linkers and Loaders, John R. Levine, Academic Press

  41. linker for the datacenter

  42. logical naming: applications refer to logical names: /s/users; requests are bound to concrete names: /#/io.l5d.zk/prod/users, /#/io.l5d.zk/staging/users; delegations express routing: /s => /#/io.l5d.zk/prod
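
The binding of logical names to concrete names can be sketched as repeated prefix rewriting. This Python sketch is a simplification, not linkerd's resolver: the function `resolve` is hypothetical, and it assumes that later delegations take precedence over earlier ones and that any name starting with `/#/` is concrete.

```python
def resolve(name, dtab, max_depth=10):
    """Rewrite `name` by prefix until it is concrete (starts with /#/).

    `dtab` is a list of (prefix, destination) delegations. Later entries
    are tried first (an assumption of this sketch). Returns None if no
    rule matches or the depth limit is reached.
    """
    for _ in range(max_depth):
        if name.startswith("/#/"):
            return name
        for prefix, dest in reversed(dtab):
            if name == prefix or name.startswith(prefix + "/"):
                name = dest + name[len(prefix):]
                break
        else:
            return None  # no delegation matched
    return None


base = [("/s", "/#/io.l5d.zk/prod")]
print(resolve("/s/users", base))

# Appending a more specific delegation reroutes one service only.
staging = base + [("/s/users", "/#/io.l5d.zk/staging/users")]
print(resolve("/s/users", staging))
```

Because routing is data rather than code, a single extra delegation can redirect one logical name without touching the application.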

  43. per-request routing: staging GET / HTTP/1.1 
 Host: mysite.com 
 l5d-dtab: /s/B => /s/B2

  44. per-request routing: debug proxy GET / HTTP/1.1 
 Host: mysite.com 
 l5d-dtab: /s/E => /s/P/s/E

  45. linkerd service mesh (Service A, Service B, Service C; each instance paired with a linkerd): transport security, service discovery, circuit breaking, backpressure, deadlines, retries, tracing, metrics, keep-alive, multiplexing, load balancing, per-request routing, service-level objectives

  46. demo: gob’s microservice

  47. web l5d gen word l5d l5d

  48. web l5d gen word l5d l5d gen-v2 l5d

  49. web namerd l5d gen word l5d l5d gen-v2 l5d

  50. github.com/buoyantio/linkerd-examples

  51. linkerd roadmap: • Battle test HTTP/2 • TLS client certs • Deadlines • Dark Traffic • All configurable everything

  52. thanks! more at linkerd.io; slack: slack.linkerd.io; email: ver@buoyant.io; twitter: • @olix0r • @linkerd
