Hi, I'm Andrew Godwin
- Django core developer
- Senior Software Engineer at
- Used to complain about migrations a lot
Distributed Systems
c = 299,792,458 m/s
Early CPUs: 5 MHz clock → ~60 m propagation distance per cycle (vs. a ~2 cm chip)
Modern CPUs: 3 GHz clock → only ~10 cm propagation distance per cycle
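Those propagation distances fall straight out of dividing c by the clock frequency; for example:

```python
# How far a signal can travel in one clock cycle: distance = c / frequency.
C = 299_792_458  # speed of light, m/s

def propagation_distance(clock_hz: float) -> float:
    """Distance (in metres) light covers during one clock cycle."""
    return C / clock_hz

print(propagation_distance(5e6))  # early CPU at 5 MHz -> ~60 m
print(propagation_distance(3e9))  # modern CPU at 3 GHz -> ~0.1 m (10 cm)
```

At 3 GHz a signal cannot even cross a large chip within one cycle, so even a single modern machine is, physically, a small distributed system.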
Distributed systems are made of independent components
They are slower and harder to write than synchronous systems
But they can be scaled up much, much further
Trade-offs
There is never a perfect solution.
Fast, cheap, good: pick two
[Diagram: Load Balancer → WSGI servers × 3 → Workers × 3]
[Diagram: the same, plus a single shared Cache]
[Diagram: the same, plus Caches × 3]
[Diagram: the same, plus a Database]
CAP Theorem
[Diagram: triangle of Partition Tolerant / Available / Consistent]
PostgreSQL: CP
- Consistent everywhere
- Handles network latency/drops
- Can't write if the main server is down
Cassandra: AP
- Can read/write to any node
- Handles network latency/drops
- Data can be inconsistent
It's hard to design a product that might be inconsistent
But if you take the tradeoff, scaling is easy
Otherwise, you must find other solutions
Read Replicas (often called master/slave)
[Diagram: Load Balancer → WSGI × 3 → Workers × 3 → Main database + Replicas × 2]
Replicas scale reads forever... But writes must go to one place
If a request writes to a table, it must be pinned to the main database, so its later reads do not get old data
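Django supports this split through its DATABASE_ROUTERS setting; here is a minimal sketch (the class name and the aliases 'main', 'replica1', 'replica2' are hypothetical, and a real deployment would also pin a request's reads to the main database for a while after it writes):

```python
import random

class PrimaryReplicaRouter:
    """Sketch of a Django-style database router: reads are spread
    across replicas, writes are pinned to the main database."""

    replicas = ["replica1", "replica2"]

    def db_for_read(self, model, **hints):
        # Replicas scale reads: pick any one of them
        return random.choice(self.replicas)

    def db_for_write(self, model, **hints):
        # Writes must go to one place
        return "main"
```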
When your write load is too high, you must then shard
Vertical Sharding: [Diagram: Users, Tickets, Events, and Payments each on their own database]
Horizontal Sharding: [Diagram: Users split across four databases by key range: 0-2, 3-5, 6-8, 9-A]
Both: [Diagram: Users, Events, and Tickets each split across four databases by key range: 0-2, 3-5, 6-8, 9-A]
Both plus caching: [Diagram: the same, with a cache in front of each sharded set]
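The 0-2 / 3-5 / 6-8 / 9-A buckets above amount to routing on the leading digit of a stable hash of the key; a sketch (shard names are hypothetical, and the last bucket is widened to cover all remaining hex digits):

```python
import hashlib

# Hypothetical shard names mirroring the ranges on the slide
SHARDS = ["users_0-2", "users_3-5", "users_6-8", "users_9-A"]

def shard_for(key: str) -> str:
    """Route a key to a shard by the first hex digit of a stable hash."""
    digit = int(hashlib.md5(key.encode()).hexdigest()[0], 16)
    if digit <= 2:
        return SHARDS[0]
    if digit <= 5:
        return SHARDS[1]
    if digit <= 8:
        return SHARDS[2]
    return SHARDS[3]  # 9 and above
```

The hash must be stable across processes (not Python's randomized built-in `hash()`), so the same key always lands on the same shard.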
Teams have to scale too; nobody should have to understand everything in a big system.
Services allow complexity to be reduced, in exchange for speed
[Diagram: User Service, Event Service, and Ticket Service, each owning its own sharded databases and cache]
[Diagram: a WSGI server calling out to the User Service, Event Service, and Ticket Service]
Each service is its own, smaller project, managed and scaled separately.
But how do you communicate between them?
Direct Communication
[Diagrams: with 3 services, every pair connects directly; at 5 and then 8 services, the point-to-point links multiply rapidly]
Message Bus
[Diagram: the same services, all connected through one message bus instead of point-to-point links]
A single point of failure is not always bad - if the alternative is multiple, fragile ones
Channels and ASGI provide a standard message bus built with certain tradeoffs
[Diagram: Django + the Channels library (the Django Channels project) → ASGI (Channel Layer) → backing store, e.g. Redis, RabbitMQ]
Pure Python: [Diagram: ASGI (Channel Layer) → backing store, e.g. Redis, RabbitMQ]
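To make the semantics concrete, here is a toy in-memory channel layer in plain Python; this illustrates the send/receive model only, not the real Channels API (which is async and backed by Redis or RabbitMQ):

```python
from collections import defaultdict, deque

class InMemoryChannelLayer:
    """Toy message bus: senders and receivers share only a channel
    name, never a direct connection to each other."""

    def __init__(self, capacity=100):
        self.capacity = capacity           # finite queues: sends can fail
        self.channels = defaultdict(deque)

    def send(self, channel, message):
        queue = self.channels[channel]
        if len(queue) >= self.capacity:
            raise RuntimeError(f"channel {channel!r} is full")
        queue.append(message)

    def receive(self, channel):
        queue = self.channels[channel]
        if not queue:
            raise RuntimeError(f"nothing waiting on {channel!r}")
        return queue.popleft()             # FIFO: oldest message first

layer = InMemoryChannelLayer()
layer.send("tickets", {"type": "ticket.created", "id": 1})
print(layer.receive("tickets"))  # {'type': 'ticket.created', 'id': 1}
```

Because everything goes through the layer, adding a fourth service means one new connection to the bus, not three new point-to-point links.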
Failure Mode
- At most once: messages either do not arrive, or arrive exactly once
- At least once: messages arrive once, or arrive multiple times
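Under at-least-once delivery, consumers must tolerate duplicates; the usual answer is idempotent handling keyed on a message id (the message shape here is hypothetical):

```python
# Ids we have already processed; a real system would persist these
# (e.g. in Redis or the database) rather than hold them in memory.
seen_ids = set()

def handle(message):
    """Process a message exactly once even if it is delivered twice.
    Returns True if we actually did the work, False for a duplicate."""
    if message["id"] in seen_ids:
        return False          # duplicate delivery: ignore it
    seen_ids.add(message["id"])
    # ... real work would happen here ...
    return True

assert handle({"id": 1}) is True
assert handle({"id": 1}) is False   # redelivered: skipped
```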
Guarantees vs. Latency
- Low latency: messages arrive very quickly but go missing more often
- Low loss rate: messages are almost never lost but arrive more slowly
Queuing Type
- First In First Out (FIFO): consistent performance for all users
- Last In First Out (LIFO): hides backlogs but makes them worse
Queue Sizing
- Finite queues: sending can fail
- Infinite queues: make problems even worse
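The standard library's `queue` module shows the finite-queue tradeoff directly: with a bounded queue it is the sender, not the rest of the system, that sees the failure and can react:

```python
import queue

q = queue.Queue(maxsize=2)   # finite queue: sending can fail
q.put_nowait("job-1")
q.put_nowait("job-2")
try:
    q.put_nowait("job-3")    # queue full -> the sender gets the error
except queue.Full:
    print("queue full: apply backpressure (retry, shed load, or error out)")
```

An unbounded queue would accept "job-3" silently, letting the backlog grow until the consumer is hopelessly behind, which is how infinite queues make problems worse.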
You must understand what you are making (This is surprisingly uncommon)
Design as much as possible around shared-nothing
- Per-machine caches
- On-demand thumbnailing
- Signed cookie sessions
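Signed cookie sessions are shared-nothing because any server holding the secret can verify a session without shared session storage; a sketch using HMAC (the secret and cookie format here are hypothetical):

```python
import hashlib
import hmac

SECRET = b"server-side-secret"  # shared across web servers, never sent to clients

def sign(value):
    """Attach an HMAC so any server can verify the cookie locally."""
    mac = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}.{mac}"

def verify(cookie):
    """Return the value if the signature checks out, else None."""
    value, _, mac = cookie.rpartition(".")
    expected = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return value if hmac.compare_digest(mac, expected) else None

cookie = sign("user_id=42")
print(verify(cookie))               # 'user_id=42'
print(verify(cookie + "tampered"))  # None
```

This is the same idea behind Django's `signed_cookies` session backend: the session lives in the client's cookie, so web servers need no shared state at all.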
Has to be shared? Try to split it
Still has to be shared? Try sharding it.
Django's job is to be slowly replaced by your code
Just make sure you match the API contract of what you're replacing!
Don't try to scale too early; you'll pick the wrong tradeoffs.
Thanks. Andrew Godwin @andrewgodwin channels.readthedocs.io