Scaling Uber with Node.js Amos Barreto @amos_barreto
Uber is everyone’s private driver.
• REQUEST: Tap to select location
• RIDE: Sit back and relax, tell your driver your destination
• RATE: Help us maintain a quality service by rating your experience
YOUR DRIVERS
Your Drivers
• UBER QUALIFIED: Uber only partners with drivers who have a keen eye for customer service and a passion for the trade.
• RIDER RATED: Tell us what you think. Your feedback helps us work with drivers to constantly improve the Uber experience.
• LICENSED & INSURED: From insurance to background checks, every driver meets or beats local regulations.
LOGISTICS
#OMGUBERICECREAM
UberChopper #OMGUBERCHOPPER
#UBERVALENTINES
#ICANHASUBERKITTENS
Trip State Machine (Simplified): Request → Dispatch → Accept → Arrive → Begin → End
Trip State Machine (Extended): Request → Dispatch (1) → Expire / Reject → Dispatch (2) → Accept → Arrive → Begin → End
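One way to read the extended diagram is as a transition table. The sketch below is a minimal illustration, not Uber's implementation: the state names, the event names, and the `createTrip` and `applyEvent` helpers are all hypothetical, and it assumes that an expired or rejected dispatch simply leads to another dispatch attempt.

```javascript
// Hypothetical transition table inferred from the slide's diagram.
// Each state maps an event name to the next state.
const TRANSITIONS = {
  Requested: { dispatch: 'Dispatched' },
  // Assumption: an expired or rejected offer triggers another dispatch.
  Dispatched: { accept: 'Accepted', expire: 'Dispatched', reject: 'Dispatched' },
  Accepted: { arrive: 'Arrived' },
  Arrived: { begin: 'OnTrip' },
  OnTrip: { end: 'Ended' },
  Ended: {},
};

// A new trip starts in the Requested state with no dispatch attempts yet.
function createTrip() {
  return { state: 'Requested', dispatchAttempts: 0 };
}

// Return a new trip object after applying `event`, or throw if the event
// is not allowed in the current state.
function applyEvent(trip, event) {
  const allowed = TRANSITIONS[trip.state] || {};
  if (!Object.prototype.hasOwnProperty.call(allowed, event)) {
    throw new Error(`Invalid event "${event}" in state "${trip.state}"`);
  }
  const next = allowed[event];
  // Count every transition into Dispatched, including re-dispatches.
  const dispatchAttempts =
    next === 'Dispatched' ? trip.dispatchAttempts + 1 : trip.dispatchAttempts;
  return { state: next, dispatchAttempts };
}
```

Feeding the events `dispatch`, `reject`, `accept`, `arrive`, `begin`, `end` through `applyEvent` walks a trip to `Ended` with two dispatch attempts, while an out-of-order event such as `end` on a fresh trip throws.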
OUR STORY
Version 1
• PHP dispatch
• Outsourced to remote PHP contractors in Midwest
• Half the code in Spanish
• Cron
• Flat file
• Lifetime: 6-9 months
“I read an article on HackerNews about a new framework called Node.js”
Jason Roberts
Tradeoffs
• Learning curve
• Scalability
• Performance
• Library ecosystem
• Database drivers
• Documentation
• Monitoring
• Production operations
Version 2
• Lifetime: 9 months
• Developed in-house
• Node.js application
• Prototyped on 0.2
• Launched in production with 0.4
• MongoDB datastore
“I really don’t see dispatch changing much in the next three years”
Expect the unexpected
Version 3
• Mongo did not scale with volume of GPS logs (global write lock)
• Swapped Mongo for Redis and flat files
(Diagram labels: CN ×4; SF, NYC, SEA, CHI)
Decoupling storage of different types of data
Version 3 (continued)
• Node.js Mongo client failed to recognize replica set topology changes
(Diagram labels: CN ×4; SF, NYC, SEA, CHI)
Be wary of immature client libraries
Commits to client modules over time
Version 3 (continued)
(Diagram labels: SF, NYC, SEA, CHI, BOS, PAR)
Focus on driving business value
Capacity planning, forecasting, and load testing are your friends
Measure everything
Version 4
• Nickname: The Grid
• Multi-process dispatch
• Peer assignment
• Redis is now considered the source of truth
• Use Lua interpreter for atomic operations
• Fan out to all city peers to find nearby cars
(Diagram labels: CN ×3; SF, NYC, SEA, CHI; SF, NYC, CHI; SF, CHI)
    -- clientHash, driverHash, clientAssignmentHash, clientToken, driverPeerId,
    -- and countKey are assumed to be bound earlier in the script (for example
    -- from KEYS/ARGV); the slide does not show that part.
    local clientStatus = redis.call('hget', clientHash, 'status')
    local driverStatus = redis.call('hget', driverHash, 'status')

    -- Dispatch only if the rider is still waiting and the driver is still open.
    if clientStatus == 'WaitingForPickup' and driverStatus == 'Open' then
      local clientPeerId = redis.call('hget', clientAssignmentHash, clientToken)
      redis.call('hset', driverHash, 'status', 'DispatchPending')
      redis.call('hset', clientAssignmentHash, clientToken, driverPeerId)
      -- Move the client's assignment count from its previous peer, if any,
      -- to the driver's peer.
      if clientPeerId then
        redis.call('zincrby', countKey, -1, clientPeerId)
      end
      redis.call('zincrby', countKey, 1, driverPeerId)
      return redis.status_reply('SUCCESS')
    else
      return redis.error_reply('ERROR - clientStatus: '..tostring(clientStatus)..', driverStatus: '..tostring(driverStatus))
    end
Version 4 (continued)
(Diagram labels: SF1, SF2, NY1, NY2, SEA1, SEA2, CHI1, PAR1, BOS1, BOS2, SF3, NY3, NY4, SEA3, SEA4, CHI2, CHI3)
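The "fan out to all city peers" step from Version 4 can be sketched roughly as follows. This is an illustration under stated assumptions rather than the actual dispatch code: a peer is modeled as any object with a hypothetical `nearby(location)` method that returns (or resolves to) a list of cars carrying a `distanceKm` field, and both `mergeNearest` and `findNearbyCars` are invented names.

```javascript
// Merge the per-peer result lists and keep the `limit` closest cars.
function mergeNearest(perPeerResults, limit) {
  return perPeerResults
    .flat()
    .sort((a, b) => a.distanceKm - b.distanceKm)
    .slice(0, limit);
}

// Query every peer concurrently. A peer that fails is skipped so that one
// unhealthy process does not fail the whole lookup (a design assumption).
async function findNearbyCars(peers, location, limit) {
  const settled = await Promise.allSettled(
    // The async wrapper turns synchronous throws into rejections as well.
    peers.map(async (peer) => peer.nearby(location))
  );
  const results = settled
    .filter((r) => r.status === 'fulfilled')
    .map((r) => r.value);
  return mergeNearest(results, limit);
}
```

Every lookup touches every peer for the city, so the number of location queries grows with the number of peers.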
Version 5
(Diagram labels: SF ×9)
Version 5
(Chart: max # of loc queries vs. # of nodes)
Version 5
(Diagram labels: CN ×3; ncar ×4; SF, NYC, SEA, CHI; SF, NYC, CHI; SF, CHI)
Break out services as needed
Understand V8 to optimize Node.js applications
(Diagram labels: SF1, SF2, NY1, NY2, SEA1, SEA2, CHI1, PAR1, BOS1, BOS2, SF3, NY3, NY4, SEA3, SEA4, CHI2, CHI3)
Don’t take vacation ;)
Don’t live in Chicago!
Stateless applications…
No single points of failure…
Replicated data stores…
Dynamic application topology…
Version 6
(Diagram labels: SF1, NY2, NY3, NY1, SEA3, PAR1, CHI1, BOS1, BOS2, SF3, SF2, NY4, SEA1, SEA4, CHI2, CHI3, SEA2; Grid Manager ×3)
Version 7
(Diagram label: haproxy)
Do the obvious
Pros
• every application is horizontally scalable
• flexible, partially dynamic topology
• failure recovery is manual in the worst case
• supports primary business case very well
• conservative estimate: 1-2 years of runway
Never be satisfied
Cons
• what happens when a city outscales the capacity of a single Redis instance?
• who wants to wake up in the middle of the night for server crashes?
• what about future business use cases?
#WORLDCLASS
World Class
• city agnostic dispatch application
• “stateless” applications
• scale to 100x current load
• flexible data model
Every now and then it’s okay to bend the rules
Realtime Analytics
So why did we stick with Node.js?
• JavaScript is easy to learn
• Simple interface with thorough documentation
• Lends itself to fast prototyping
• Asynchronous, nimble
• Avoid concurrency challenges
• Increasingly mature module ecosystem
How to win with Node.js?
• measure everything, particularly response times and event loop lag
• learn to take heap dumps to debug memory issues
• strace, perf, and flame graphs are necessary tools for improving performance
• small, reusable components to reduce duplication
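Event loop lag, mentioned in the first bullet above, can be approximated by scheduling a timer and checking how late it fires. The following is a minimal sketch, assuming a fixed sampling interval; `createLagTracker` and `startLagMonitor` are hypothetical names, and the injectable `now` clock exists only to make the arithmetic easy to test.

```javascript
// Default clock: monotonic time in milliseconds.
const defaultNow = () => Number(process.hrtime.bigint()) / 1e6;

// Track how much later than `intervalMs` each tick arrives.
function createLagTracker(intervalMs, now = defaultNow) {
  let last = now();
  let maxLag = 0;
  function tick() {
    const current = now();
    // Lag is the delay beyond the expected interval, never negative.
    const lag = Math.max(0, current - last - intervalMs);
    if (lag > maxLag) maxLag = lag;
    last = current;
    return lag;
  }
  return { tick, maxLag: () => maxLag };
}

// Sample lag on a timer and hand each sample to `onSample`
// (for example, a metrics client). Returns a function that stops sampling.
function startLagMonitor(intervalMs, onSample) {
  const tracker = createLagTracker(intervalMs);
  const timer = setInterval(() => onSample(tracker.tick()), intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return () => clearInterval(timer);
}
```

A call such as `startLagMonitor(100, (lag) => console.log(lag))` would report a lag near zero on an idle process and larger values whenever synchronous work blocks the loop.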
The Human Factor
Thank you. Questions?