THE DISTRIBUTED PIT OF SUCCESS Greg Beech
ABOUT ME Lead Engineer @ Deliveroo — joined March 2015 • Tech lead for international expansion • Set up Deliveroo for Business • Currently rebuilding our Live Operations tooling PAST • Head of Platform Development @ blinkbox Books • Principal Engineer @ blinkbox Movies • Test Engineer @ Microsoft
ABOUT DELIVEROO
FOUNDED 2013 RESTAURANT-QUALITY FOOD TO YOUR HOME THEN NOW
$475M FUNDING RAISED
DAILY ORDERS 2013 2015 2017
12 COUNTRIES 150 CITIES
ENGINEERS 600,000 SLOC 38,000 COMMITS 6,900 PULL REQUESTS 3,200 DEPLOYS 2013 2014 2015 2016 2017
TECHNOLOGY CHALLENGES
ARCHITECTURE
APP SERVERS BIGGEST. HEROKU. APP. EVER. HEROKU APP LIMIT LITERALLY ONE SERVER 2015 2016 2017
DEGRADING PERFORMANCE • General purpose models sub-optimal in most cases • Caching difficult due to geo, availability, timings, etc. • Long dyno boot time makes auto-scaling slow • Constrained to Ruby on Rails
DEGRADING PERFORMANCE [Chart: Restaurant List TTFB by quarter, Q4 '15 to Q1 '17]
BUILD TIMES 2h15 https://xkcd.com/303/ 25 min 7 min 4 min 2 min 2013 2014 2015 2016 2017
DEVELOPMENT PROCESS master ticket staging QA
REDUCING DEVELOPMENT VELOCITY • CI becomes part of dev workflow • 70+ developers cause merge conflicts • “God objects” are hard to understand • “House of Cards” development in certain areas
REDUCING DEVELOPMENT VELOCITY [Chart: Pull Requests per Engineer by quarter, Q4 '15 to Q1 '17]
DECREASING RELIABILITY • Single problem can bring everything down: placing orders, customer service, rider dispatch • Rollbacks increasingly difficult with commit frequency • PG replication, analyse and vacuum settings critical
DECREASING RELIABILITY [Chart: Uptime and # Outages (Unscheduled) by quarter, Q4 '15 to Q1 '17]
IT’S NOT GOING TO GET EASIER 2015 2017 2019
HOW DO WE FIX THIS?
LARGE SCALE ARCHITECTURE [Diagram: CLIENT APPS, EDGE SERVICES, DOMAIN SERVICES, EVENT BUS, MONITORING]
DOMAIN SERVICES • Own part of the domain • Granular, purely RESTful APIs • Send & receive from the bus • Use other domain service APIs
EDGE SERVICES • Do not own any of the domain • Present more aggregated APIs, implement search, etc. • Receive-only from the bus • Use domain or edge service APIs
1-4 SERVICES/APPS PER TEAM
12 FACTOR APPS • One codebase, many deploys • Scale out as stateless processes • Dev/prod parity (time, personnel, tools) • Find the rest at https://12factor.net/
DATA SHARING RULES • No shared data stores — no exceptions • All internal data exposed as REST APIs — no RPC • Publish events when data changes — no payloads
ACTUAL REST WITH HYPERMEDIA (slide callout on the "_links" block: LINKS!!)
{
  "_links": {
    "self": { "href": "https://api.deliveroo.com/orders/2457" },
    "restaurant": { "href": "https://api.deliveroo.com/restaurants/203" },
    "user": { "href": "https://api.deliveroo.com/users/814" }
  },
  "id": 2457,
  "status": "placed",
  "deliver_from": "2016-09-21T09:25:00Z",
  "deliver_by": "2016-09-21T09:35:00Z",
  "scheduled": false
}
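A minimal sketch of why the links matter: a consumer can navigate from an order to its restaurant by link relation instead of building URLs from ids. The helper name `link_href` is hypothetical, and the document is trimmed from the slide above.

```ruby
require "json"

# Sample order document, trimmed from the slide above.
ORDER_JSON = <<~JSON
  {
    "_links": {
      "self": { "href": "https://api.deliveroo.com/orders/2457" },
      "restaurant": { "href": "https://api.deliveroo.com/restaurants/203" },
      "user": { "href": "https://api.deliveroo.com/users/814" }
    },
    "id": 2457,
    "status": "placed"
  }
JSON

# Hypothetical helper: look up the URL for a link relation, raising
# KeyError if the document does not advertise that relation.
def link_href(document, rel)
  document.fetch("_links").fetch(rel).fetch("href")
end

order = JSON.parse(ORDER_JSON)
puts link_href(order, "restaurant")
```

Because the consumer only depends on the relation name, the owning service stays free to change its URL structure.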
THE N+1 SELECTS PROBLEM [Diagram: EDGE calls DOMAIN with GET /restaurants?geohash=gcpvhepze4b8, followed by one request per restaurant: GET /restaurants/203, GET /restaurants/812, GET /restaurants/1074, GET /restaurants/1309, GET /restaurants/1873, GET /restaurants/2132, GET /restaurants/2493, GET /restaurants/2873]
THE N+1 SELECTS SOLUTION? [Diagram: the same requests, with a LOCAL CACHE between EDGE and DOMAIN for the per-restaurant GETs]
CACHE CORRECTNESS • ETag: high consistency but high latency. The local cache revalidates every GET /restaurants/203 against the domain service with If-None-Match: "xxx" • Cache-Control: low latency but low consistency. The local cache forwards GET /restaurants/203 to the domain service only if the cached entry has expired
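The ETag strategy could be sketched roughly as follows. Everything here is illustrative: `EtagCache`, `HttpResult` and `FakeDomainService` are hypothetical names, and the fake stands in for a real HTTP call so the sketch runs without a network. Every read still costs a round trip to the domain service (the latency), but the body is only transferred when it has changed.

```ruby
require "digest"

# Hypothetical response shape: status code, ETag header value, body.
HttpResult = Struct.new(:status, :etag, :body)

# Local cache that revalidates every read with If-None-Match.
class EtagCache
  def initialize(fetcher)
    @fetcher = fetcher # callable: (url, if_none_match) -> HttpResult
    @entries = {}      # url => last full (200) HttpResult
  end

  def get(url)
    cached = @entries[url]
    response = @fetcher.call(url, cached&.etag)
    # 304 Not Modified: the cached copy is still current.
    return cached.body if cached && response.status == 304

    @entries[url] = response
    response.body
  end
end

# Stand-in for a domain service exposing a single resource.
class FakeDomainService
  attr_reader :full_responses

  def initialize(body)
    @body = body
    @full_responses = 0
  end

  def update(body)
    @body = body
  end

  def call(_url, if_none_match)
    etag = %("#{Digest::MD5.hexdigest(@body)}")
    return HttpResult.new(304, etag, nil) if if_none_match == etag

    @full_responses += 1
    HttpResult.new(200, etag, @body)
  end
end
```

A Cache-Control variant would skip the call entirely until the entry expires, trading consistency for latency.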
REPRESENTATIONAL STATE NOTIFICATION (RESN)
• Send CREATE/UPDATE/DELETE events for entities
{
  "topic": "restaurants",
  "type": "update",
  "href": "https://api.deliveroo.com/restaurants/203"
}
• Low latency and high consistency
[Diagram: EDGE, LOCAL CACHE, DOMAIN, with GET /restaurants/203 requests and an UPDATE /restaurants/203 event]
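One way a consumer might use these events is to keep a local cache and evict an entry when an update or delete event names its href. This is a sketch under assumptions: `ResnCache` and its methods are hypothetical names, the fetcher is any callable that performs the GET, and lazy eviction is just one option (an eager re-fetch on update would also fit the slide).

```ruby
# Local cache kept consistent by RESN events rather than by
# revalidating or expiring entries.
class ResnCache
  def initialize(fetcher)
    @fetcher = fetcher # callable: href -> representation
    @entries = {}      # href => cached representation
  end

  # Serve from the cache, fetching from the owning service on a miss.
  def get(href)
    @entries.fetch(href) { @entries[href] = @fetcher.call(href) }
  end

  # Handle an event shaped like the one on the slide. The event has
  # no payload, so the next read goes back to the source of truth.
  def handle_event(event)
    case event.fetch("type")
    when "update", "delete"
      @entries.delete(event.fetch("href"))
    end
    # "create" events need no eviction: nothing is cached yet.
  end
end
```

Reads are served locally (low latency) and entries are dropped as soon as the owning service announces a change (high consistency), without the bus ever carrying the data itself.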
WHY NO PAYLOADS? • Transfer of authority from service to bus • Encourages incomplete domain modelling • Bus becomes a critical source for data loss • Need to handle multiple representations in consumers
WHAT ABOUT STREAMS?
• Tens of millions of location/availability pings per day
• Nonsensical to model as entities with identity
• Non-critical immutable value objects may be sent in payloads
{
  "_links": {
    "rider": { "href": "https://api.deliveroo.com/riders/872" }
  },
  "created_at": "2016-09-21T09:25:00Z",
  "latitude": 51.52168804,
  "longitude": -0.14303600
}
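A consumer might model such a ping as a frozen value object with no identity of its own, only a link back to the rider. `RiderPing` and `from_payload` are hypothetical names, and the sketch assumes the coordinate field is spelled `longitude`.

```ruby
require "json"
require "time"

# Sample payload in the shape shown on the slide.
SAMPLE_PING = <<~JSON
  {
    "_links": { "rider": { "href": "https://api.deliveroo.com/riders/872" } },
    "created_at": "2016-09-21T09:25:00Z",
    "latitude": 51.52168804,
    "longitude": -0.14303600
  }
JSON

# Immutable value object: equality is by value, and there is no id.
RiderPing = Struct.new(:rider_href, :created_at, :latitude, :longitude,
                       keyword_init: true) do
  # Build a frozen ping from a JSON payload, failing loudly on
  # missing fields rather than carrying partial data.
  def self.from_payload(json)
    data = JSON.parse(json)
    new(
      rider_href: data.fetch("_links").fetch("rider").fetch("href"),
      created_at: Time.iso8601(data.fetch("created_at")),
      latitude: Float(data.fetch("latitude")),
      longitude: Float(data.fetch("longitude"))
    ).freeze
  end
end
```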
LANGUAGES AND TOOLS
WHICH LANGUAGE SHOULD WE USE?
WE LOVE RUBY Easier migration path from existing codebase Well known within the company No need to argue over approaches & standards Quick to write, test and iterate Performant enough for most applications
ROO ON RAILS
$ rails new my_service --database=postgresql
$ cd my_service
$ echo "gem 'roo_on_rails'" >> Gemfile
$ bundle && bundle exec roo_on_rails
CASE STUDY: LIVE OPERATIONS
WHAT IS LIVE OPS? • Manual intervention to get orders delivered • Finding and resolving issues with orders, riders, etc. • Like software, easier to fix things the earlier you find them
LIVE OPERATIONS HISTORICALLY [Screenshot: orders table with columns order id, time, restaurant, address, zone, status] REPEATED SCANNING
WHAT ARE OUR GOALS? • Reduce investigations per order by 93% • Reduce or hold “unacceptable” orders • Give visibility into live issues
LIVE OPERATIONS NOW: BACKEND [Diagram: orders, riders, pickups, deliveries; EVENT HANDLERS; RULES ENGINE; NOTIFY HANDLERS; live issues]
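The slide does not show how these pieces are built, so the following is only a guess at the general shape: events are checked against rules, and each match becomes a live issue passed to notify handlers. `RulesEngine`, `LiveIssue`, the event fields and the example rule are all hypothetical.

```ruby
# Hypothetical live issue raised when a rule matches an event.
LiveIssue = Struct.new(:kind, :order_id)

class RulesEngine
  def initialize
    @rules = []     # [kind, predicate] pairs
    @notifiers = [] # callables invoked with each LiveIssue
  end

  # Register a rule: a name plus a predicate over an event hash.
  def rule(kind, &predicate)
    @rules << [kind, predicate]
  end

  # Register a notify handler.
  def on_issue(&notifier)
    @notifiers << notifier
  end

  # Event handler entry point: evaluate every rule against the
  # event and notify once per matching rule.
  def handle(event)
    @rules.each do |kind, predicate|
      next unless predicate.call(event)

      issue = LiveIssue.new(kind, event.fetch(:order_id))
      @notifiers.each { |notifier| notifier.call(issue) }
    end
  end
end

# Usage with an invented rule and threshold:
engine = RulesEngine.new
engine.rule(:late_pickup) { |e| e[:type] == "pickup" && e[:minutes_late] > 10 }
engine.on_issue { |issue| puts "#{issue.kind} on order #{issue.order_id}" }
engine.handle({ type: "pickup", order_id: 2457, minutes_late: 15 })
```

Pushing issues to agents this way replaces repeatedly scanning a list of orders with reacting only to the orders that need attention.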
LIVE OPERATIONS NOW: DASHBOARD
ARCHITECTURE: NOW [Diagram: DASHBOARD, LIVE OPS API, EVENT BUS; labelled flow: orders, riders, etc.]
ARCHITECTURE: NEXT [Diagram: DASHBOARD, LIVE OPS API, LIVE ISSUES API, EVENT BUS; labelled flows: live issues, etc.; https; live issues; orders, riders, etc.]
ARCHITECTURE: FINAL? [Diagram: DASHBOARD, LIVE OPS API, LIVE ISSUES API, EVENT BUS; labelled flows: https; web sockets; live issues, etc.; https; live issues; orders, riders, etc.]
ENABLE TEAMS TO WORK LIKE STARTUPS • Identify problems in their area • Set goals and define metrics • Fast develop/test/deploy cycles • Evolve easily, but only when necessary • Succeed even with limited distributed experience
gsb@deliveroo.co.uk @gregbeech