the distributed pit of success



  1. THE DISTRIBUTED PIT OF SUCCESS Greg Beech

  2. ABOUT ME Lead Engineer @ Deliveroo — joined March 2015 • Tech lead for international expansion • Set up Deliveroo for Business • Currently rebuilding our Live Operations tooling PAST • Head of Platform Development @ blinkbox Books • Principal Engineer @ blinkbox Movies • Test Engineer @ Microsoft

  3. ABOUT DELIVEROO

  4. FOUNDED 2013 • RESTAURANT-QUALITY FOOD TO YOUR HOME [Images: THEN, NOW]

  5. $475M FUNDING RAISED

  6. DAILY ORDERS [Chart: daily orders, 2013 to 2017]

  7. 12 COUNTRIES 150 CITIES

  8. ENGINEERS [Chart: engineers by year, 2013 to 2017] • 600,000 SLOC • 38,000 COMMITS • 6,900 PULL REQUESTS • 3,200 DEPLOYS

  9. TECHNOLOGY CHALLENGES

  10. ARCHITECTURE

  11. APP SERVERS [Chart: app servers, 2015 to 2017; annotations: LITERALLY ONE SERVER, HEROKU APP LIMIT, BIGGEST. HEROKU. APP. EVER.]

  12. DEGRADING PERFORMANCE • General purpose models sub-optimal in most cases • Caching difficult due to geo, availability, timings, etc. • Long dyno boot time makes auto-scaling slow • Constrained to Ruby on Rails

  13. DEGRADING PERFORMANCE [Chart: Restaurant List TTFB by quarter, Q4 '15 to Q1 '17]

  14. BUILD TIMES [Chart: build times by year, 2013 to 2017; labelled values: 2 min, 4 min, 7 min, 25 min, 2h15] https://xkcd.com/303/

  15. DEVELOPMENT PROCESS [Diagram labels: master, ticket, staging, QA]

  16. REDUCING DEVELOPMENT VELOCITY • CI becomes part of dev workflow • 70+ developers cause merge conflicts • “God objects” are hard to understand • “House of Cards” development in certain areas

  17. REDUCING DEVELOPMENT VELOCITY [Chart: Pull Requests per Engineer by quarter, Q4 '15 to Q1 '17]

  18. DECREASING RELIABILITY • A single problem can bring everything down: placing orders, customer service, rider dispatch • Rollbacks increasingly difficult with commit frequency • PG replication, analyse and vacuum settings critical

  19. DECREASING RELIABILITY [Chart: Uptime and # Outages (Unscheduled) by quarter, Q4 '15 to Q1 '17]

  20. IT’S NOT GOING TO GET EASIER [Chart: 2015, 2017, 2019]

  21. HOW DO WE FIX THIS?

  22. LARGE SCALE ARCHITECTURE [Diagram: CLIENT APPS, EDGE SERVICES, EVENT BUS, DOMAIN SERVICES, MONITORING]

  23. DOMAIN SERVICES • Own part of the domain • Granular, purely RESTful APIs • Send & receive from the bus • Use other domain service APIs [Diagram: CLIENT APPS, EDGE SERVICES, EVENT BUS, DOMAIN SERVICES, MONITORING]

  24. EDGE SERVICES • Do not own any of the domain • Present more aggregated APIs, implement search, etc. • Receive-only from the bus • Use domain or edge service APIs [Diagram: CLIENT APPS, EDGE SERVICES, EVENT BUS, DOMAIN SERVICES, MONITORING]

  25. 1-4 SERVICES/APPS PER TEAM

  26. 12 FACTOR APPS • One codebase, many deploys • Scale out as stateless processes • Dev/prod parity (time, personnel, tools) • Find the rest at https://12factor.net/

  27. DATA SHARING RULES • No shared data stores — no exceptions • All internal data exposed as REST APIs — no RPC • Publish events when data changes — no payloads

  28. ACTUAL REST WITH HYPERMEDIA (slide callout on the "_links" object: LINKS!!)
      {
        "_links": {
          "self": { "href": "https://api.deliveroo.com/orders/2457" },
          "restaurant": { "href": "https://api.deliveroo.com/restaurants/203" },
          "user": { "href": "https://api.deliveroo.com/users/814" }
        },
        "id": 2457,
        "status": "placed",
        "deliver_from": "2016-09-21T09:25:00Z",
        "deliver_by": "2016-09-21T09:35:00Z",
        "scheduled": false
      }
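A consumer of a representation like the one on slide 28 can navigate by link relation instead of building URLs from ids. The following is a minimal Ruby sketch of that idea, not code from the talk: `link_href`, `follow`, and the stand-in fetcher are hypothetical names, and a real client would perform an HTTP GET where the lambda is.

```ruby
require "json"

# Order representation copied from slide 28.
ORDER_JSON = <<~JSON
  {
    "_links": {
      "self": { "href": "https://api.deliveroo.com/orders/2457" },
      "restaurant": { "href": "https://api.deliveroo.com/restaurants/203" },
      "user": { "href": "https://api.deliveroo.com/users/814" }
    },
    "id": 2457,
    "status": "placed",
    "deliver_from": "2016-09-21T09:25:00Z",
    "deliver_by": "2016-09-21T09:35:00Z",
    "scheduled": false
  }
JSON

# Hypothetical helper: returns the href for a named link relation and
# raises KeyError if the representation does not advertise that relation.
def link_href(resource, rel)
  resource.fetch("_links").fetch(rel).fetch("href")
end

# Hypothetical helper: resolves a relation by passing its href to `fetcher`,
# any callable that maps a URL to a parsed representation.
def follow(resource, rel, fetcher)
  fetcher.call(link_href(resource, rel))
end

order = JSON.parse(ORDER_JSON)

# Stand-in for an HTTP client so the sketch runs without network access.
# It returns a representation containing only a self link.
stub_fetcher = ->(url) { { "_links" => { "self" => { "href" => url } } } }

restaurant = follow(order, "restaurant", stub_fetcher)
puts link_href(restaurant, "self")
```

The client never concatenates `/restaurants/` with an id; it only knows the relation name `restaurant`, so the owning service stays free to change its URL layout.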

  29. THE N+1 SELECTS PROBLEM [Diagram: EDGE calls DOMAIN] • GET /restaurants?geohash=gcpvhepze4b8 • GET /restaurants/203 • GET /restaurants/812 • GET /restaurants/1074 • GET /restaurants/1309 • GET /restaurants/1873 • GET /restaurants/2132 • GET /restaurants/2493 • GET /restaurants/2873

  30. THE N+1 SELECTS SOLUTION? [Diagram: EDGE calls DOMAIN through a LOCAL CACHE] • GET /restaurants?geohash=gcpvhepze4b8 • GET /restaurants/203 • GET /restaurants/812 • GET /restaurants/1074 • GET /restaurants/1309 • GET /restaurants/1873 • GET /restaurants/2132 • GET /restaurants/2493 • GET /restaurants/2873

  31. CACHE CORRECTNESS • ETag: high consistency but high latency (EDGE sends GET /restaurants/203 to LOCAL CACHE; LOCAL CACHE sends GET /restaurants/203 with If-None-Match: "xxx" to DOMAIN) • Cache-Control: low latency but low consistency (EDGE sends GET /restaurants/203 to LOCAL CACHE; LOCAL CACHE sends GET /restaurants/203 to DOMAIN only if expired)
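The trade-off on slide 31 can be seen in a small simulation. This is a sketch under stated assumptions rather than anything shown in the talk: `FakeOrigin` stands in for a domain service, `LocalCache` for the edge-side cache, and `max_age` plays the role of a `Cache-Control` freshness window. With `max_age: 0` every read is revalidated with the stored ETag; with a positive `max_age` reads inside the window never reach the origin.

```ruby
# Stand-in for a domain service holding one resource. It counts requests and
# full responses so the cost of each caching strategy is visible.
class FakeOrigin
  attr_reader :requests, :full_responses

  def initialize(body)
    @body = body
    @version = 1
    @requests = 0
    @full_responses = 0
  end

  def update(body)
    @body = body
    @version += 1
  end

  # Returns [status, etag, body]. The body is nil for 304 Not Modified.
  def get(if_none_match: nil)
    @requests += 1
    etag = %("v#{@version}")
    return [304, etag, nil] if if_none_match == etag

    @full_responses += 1
    [200, etag, @body]
  end
end

# Local cache in front of the origin. max_age is in seconds.
class LocalCache
  Entry = Struct.new(:etag, :body, :fetched_at)

  def initialize(origin, max_age:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @origin = origin
    @max_age = max_age
    @clock = clock
    @entry = nil
  end

  def get
    now = @clock.call
    # Cache-Control style: a fresh entry is served without contacting the origin.
    return @entry.body if @entry && now - @entry.fetched_at < @max_age

    # ETag style: revalidate; a 304 means the cached body is still current.
    status, etag, body = @origin.get(if_none_match: @entry&.etag)
    body = @entry.body if status == 304
    @entry = Entry.new(etag, body, now)
    body
  end
end

origin = FakeOrigin.new("restaurant 203, first version")
cache = LocalCache.new(origin, max_age: 0) # always revalidate
cache.get
cache.get
origin.update("restaurant 203, second version")
puts cache.get
puts "requests=#{origin.requests} full_responses=#{origin.full_responses}"
```

With `max_age: 0` the update is visible on the very next read, at the price of one origin round trip per read. Raising `max_age` removes those round trips but serves the old version until the window expires.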

  32. REPRESENTATIONAL STATE NOTIFICATION (RESN) • Send CREATE/UPDATE/DELETE events for entities • Low latency and high consistency [Diagram: EDGE, LOCAL CACHE, DOMAIN; requests: GET /restaurants/203; event: UPDATE /restaurants/203]
      {
        "topic": "restaurants",
        "type": "update",
        "href": "https://api.deliveroo.com/restaurants/203"
      }
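One way a consumer might react to the event on slide 32 is to drop its cached copy and re-read from the owning service on the next access. The sketch below assumes that approach; `NotifiedCache` and its `fetcher` argument are hypothetical, and the event shape is the one shown on the slide.

```ruby
require "json"

# Hypothetical consumer-side cache keyed by resource URL. `fetcher` is any
# callable that maps a URL to a representation (an HTTP GET in real use).
class NotifiedCache
  def initialize(fetcher)
    @fetcher = fetcher
    @entries = {}
  end

  # Returns the cached representation, fetching it on a miss.
  def get(href)
    @entries.fetch(href) { @entries[href] = @fetcher.call(href) }
  end

  # Handles one RESN event. The event carries no payload, so the cache
  # forgets its copy and re-reads from the owner the next time it is asked.
  def handle_event(raw_event)
    event = JSON.parse(raw_event)
    case event.fetch("type")
    when "update", "delete" then @entries.delete(event.fetch("href"))
    when "create" then nil # nothing is cached yet for a new entity
    end
  end
end

reads = 0
fetcher = lambda do |href|
  reads += 1
  { "href" => href, "read" => reads }
end

cache = NotifiedCache.new(fetcher)
href = "https://api.deliveroo.com/restaurants/203"

cache.get(href)
cache.get(href) # served from the cache, no second fetch

event = JSON.generate({ "topic" => "restaurants", "type" => "update", "href" => href })
cache.handle_event(event)

puts cache.get(href)["read"] # fetched again after the update event
```

Reads between events cost nothing, and an update is visible as soon as the event arrives, which is the "low latency and high consistency" combination the slide claims.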

  33. WHY NO PAYLOADS? • Transfer of authority from service to bus • Encourages incomplete domain modelling • Bus becomes a critical source for data loss • Need to handle multiple representations in consumers

  34. WHAT ABOUT STREAMS? • Tens of millions of location/availability pings per day • Nonsensical to model as entities with identity • Non-critical immutable value objects may be sent in payloads
      {
        "_links": {
          "rider": { "href": "https://api.deliveroo.com/riders/872" }
        },
        "created_at": "2016-09-21T09:25:00Z",
        "latitude": 51.52168804,
        "longitude": -0.14303600
      }
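A ping like the one on slide 34 has no identity of its own, so it can be modelled as a frozen value object that serialises straight into the event payload. This is an illustrative sketch only: `LocationPing` and `to_payload` are hypothetical names, and the field names follow the slide's example with the coordinate key spelled `longitude`.

```ruby
require "json"
require "time"

# Hypothetical value object for a rider location ping. It has no id and is
# frozen after construction, so it is compared and passed around by value.
LocationPing = Struct.new(:rider_href, :created_at, :latitude, :longitude, keyword_init: true) do
  # Builds the payload shape from the slide: a link to the rider plus values.
  def to_payload
    {
      "_links" => { "rider" => { "href" => rider_href } },
      "created_at" => created_at.getutc.iso8601,
      "latitude" => latitude,
      "longitude" => longitude
    }
  end
end

ping = LocationPing.new(
  rider_href: "https://api.deliveroo.com/riders/872",
  created_at: Time.utc(2016, 9, 21, 9, 25, 0),
  latitude: 51.52168804,
  longitude: -0.14303600
).freeze

puts JSON.generate(ping.to_payload)
```

The rider is still referenced by link rather than embedded, so the only data travelling in the payload is the non-critical, immutable part.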

  35. LANGUAGES AND TOOLS

  36. WHICH LANGUAGE SHOULD WE USE?

  37. WE LOVE RUBY • Easier migration path from existing codebase • Well known within the company • No need to argue over approaches & standards • Quick to write, test and iterate • Performant enough for most applications

  38. ROO ON RAILS
      $ rails new my_service --database=postgresql
      $ cd my_service
      $ echo "gem 'roo_on_rails'" >> Gemfile
      $ bundle && bundle exec roo_on_rails

  39. CASE STUDY: LIVE OPERATIONS

  40. WHAT IS LIVE OPS? • Manual intervention to get orders delivered • Finding and resolving issues with orders, riders, etc. • Like software, easier to fix things the earlier you find them

  41. LIVE OPERATIONS HISTORICALLY [Table columns: order id, time, restaurant, address, zone, status; annotation: REPEATED SCANNING]

  42. WHAT ARE OUR GOALS? • Reduce investigations per order by 93% • Reduce or hold “unacceptable” orders • Give visibility into live issues

  43. LIVE OPERATIONS NOW: BACKEND [Diagram labels: orders, riders, pickups, deliveries; EVENT HANDLERS; RULES ENGINE; NOTIFY HANDLERS; live issues]
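The slide gives only the component names, so the following Ruby sketch is one possible reading of how event handlers, a rules engine, and notify handlers could fit together, not the Live Operations implementation. `Rule`, `RulesEngine`, the `no_rider_assigned` rule, and the `ready_for_pickup` status are all hypothetical.

```ruby
# Hypothetical rule: a name plus a predicate over the known state of an order.
Rule = Struct.new(:name, :predicate)

class RulesEngine
  def initialize(rules, notifier)
    @rules = rules
    @notifier = notifier
    # Per-order state accumulated from events, keyed by order id.
    @orders = Hash.new { |orders, id| orders[id] = {} }
  end

  # Event handler: merge the event into the order's state, then evaluate
  # every rule and hand each match to the notify handler as a live issue.
  def handle(event)
    order = @orders[event.fetch(:order_id)]
    order.merge!(event)
    @rules.each do |rule|
      next unless rule.predicate.call(order)

      @notifier.call({ rule: rule.name, order_id: order[:order_id] })
    end
  end
end

issues = []
rules = [
  Rule.new("no_rider_assigned", ->(order) { order[:status] == "ready_for_pickup" && order[:rider_id].nil? })
]
engine = RulesEngine.new(rules, ->(issue) { issues << issue })

engine.handle({ order_id: 2457, status: "placed" })
engine.handle({ order_id: 2457, status: "ready_for_pickup" })
engine.handle({ order_id: 2457, rider_id: 872 })

p issues
```

The sketch raises an issue each time a matching event arrives and does no de-duplication or resolution; a real system would need both.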

  44. LIVE OPERATIONS NOW: DASHBOARD

  45. ARCHITECTURE: NOW [Diagram components: DASHBOARD, LIVE OPS API, EVENT BUS; connection label: orders, riders, etc.]

  46. ARCHITECTURE: NEXT [Diagram components: DASHBOARD, LIVE OPS API, LIVE ISSUES API, EVENT BUS; connection labels: https; live issues, etc.; live issues; orders, riders, etc.]

  47. ARCHITECTURE: FINAL? [Diagram components: DASHBOARD, LIVE OPS API, LIVE ISSUES API, EVENT BUS; connection labels: https; web sockets; live issues, etc.; live issues; orders, riders, etc.]

  48. ENABLE TEAMS TO WORK LIKE STARTUPS • Identify problems in their area • Set goals and define metrics • Fast develop/test/deploy cycles • Evolve easily, but only when necessary • Succeed even with limited distributed experience

  49. gsb@deliveroo.co.uk @gregbeech
