November 6, 2018 Scaling Slack The Good, The Unexpected, and The Road Ahead Michael Demmer mdemmer@slack-corp.com | @mjdemmer
Me
(Not) This Talk
1. 2016: Monolith
2. 2016-2018: Microservices
3. 2016-2018: Best Practices
4. 2018: Lessons Learned
This Talk
1. 2016: How Slack Worked
2. 2016-2018: Things Got More Interesting
3. 2016-2018: What We Did About It
4. 2018+: Themes and Road Ahead
Slack in 2016
Slack
Workspaces, Channels, Users, and more
A workspace logically contains all channels and messages, as well as users, emoji, bots, and more. All interactions occur within the workspace boundary.
[Diagram: example workspaces (Acme Corp, Oceanic Airlines, Delos, …, Duff Beer) in us_east_1, with example channels (#brainstorming, #proj-roadrunner, #marketing) and users (@alice, @bob, @carol)]
Slack Facts (2016)
User Base: 4M Daily Active Users
Largest Organizations: >10,000 Active Users
Connectivity: 2.5M peak simultaneous connected; Avg 10 hrs/day
Engineering Style: Conservative, Pragmatic, Minimal; most systems built on technology more than 10 years old
How Slack Works (2016)
[Architecture diagram: the Client holds a Websocket to the Message Proxy / Message Server (Java) and makes HTTP API Calls to the Webapp (PHP), which is backed by MySQL and a Job Queue. Other labels: RTM Service, us_west_1, us_east_1]
Client / Server Flow
Initial login:
● Download full workspace model with all channels, users, emoji, etc.
● Establish real time websocket
[Diagram: 1: Client calls rtm.start on the Webapp (PHP). 2: Response is {prefs: {...}, users: {...}, channels: {...}, emoji: {...}, ms: "ms1.slack-msgs.com"}. 3: Client makes a websocket connect to the Message Proxy]
Client / Server Flow
Initial login:
● Download full workspace model with all channels, users, emoji, etc.
● Establish real time websocket
While connected:
● Push updates via websocket
● API calls for channel history, message edits, create channels, etc.
[Diagram: the Message Proxy pushes {message: ...} to the Client; the Client calls reactions.add on the Webapp (PHP)]
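The two phases of this flow can be sketched as a small in-memory simulation. This is an illustrative sketch only: `WorkspaceClient`, `boot`, `apply_event`, and the event shapes are hypothetical names, not Slack's client code or wire format. Only the payload keys (`users`, `channels`, `emoji`, `ms`) are taken from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class WorkspaceClient:
    """Toy client that holds the full workspace model in memory."""
    users: dict = field(default_factory=dict)
    channels: dict = field(default_factory=dict)
    emoji: dict = field(default_factory=dict)
    messages: list = field(default_factory=list)

    def boot(self, payload: dict) -> str:
        # Initial login: load the complete model from an rtm.start-style
        # response and return the message server host to connect to.
        self.users = dict(payload["users"])
        self.channels = dict(payload["channels"])
        self.emoji = dict(payload["emoji"])
        return payload["ms"]

    def apply_event(self, event: dict) -> None:
        # While connected: apply an update pushed over the websocket.
        # The event types here are assumptions made for this example.
        if event["type"] == "message":
            self.messages.append(event)
        elif event["type"] == "user_change":
            self.users[event["user"]["id"]] = event["user"]


client = WorkspaceClient()
ms_host = client.boot({
    "users": {"U1": {"id": "U1", "name": "alice"}},
    "channels": {"C1": {"id": "C1", "name": "brainstorming"}},
    "emoji": {},
    "ms": "ms1.slack-msgs.com",
})
client.apply_event({"type": "message", "channel": "C1", "text": "hello"})
client.apply_event({"type": "user_change", "user": {"id": "U1", "name": "alice2"}})
```

Because the whole model is loaded at boot, every later update is a cheap in-memory mutation, which is why the client feels real time. The cost is that the boot payload grows with the workspace.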
Sharding And Routing
Workspace Sharding
● Assign a workspace to a DB and MS shard at creation
● Metadata table lookup for each API request to route
[Diagram: the Webapp (PHP) runs "select * from teams where id=1234" against the Mains and gets back {id:1234, domain:demmer, db_shard:35, ms_shard:11, ...}, then routes to the MySQL Shards and Message Servers]
Sharding And Routing
Workspace Sharding
● Assign a workspace to a DB and MS shard at creation
● Metadata table lookup for each API request to route
"Herd of Pets"
● DBs run in active/active pairs with application failover
● Service hosts are addressed in config and manually replaced
[Diagram: Webapp (PHP), Mains, MySQL Shards, Message Servers]
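The per-request metadata lookup can be sketched as follows. This is a minimal sketch under stated assumptions: the `TeamRow` type, the `TEAMS` dict standing in for the metadata table, the `route_request` function, and the host naming scheme are all hypothetical. Only the row fields (`id`, `domain`, `db_shard`, `ms_shard`) and their example values come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamRow:
    id: int
    domain: str
    db_shard: int
    ms_shard: int


# Stand-in for the metadata table on the mains database (example data).
TEAMS = {
    1234: TeamRow(id=1234, domain="demmer", db_shard=35, ms_shard=11),
}


def route_request(team_id: int) -> tuple[str, str]:
    """Look up the workspace row and return (db host, message server host)."""
    row = TEAMS.get(team_id)
    if row is None:
        raise KeyError(f"unknown workspace {team_id}")
    # Host names are made up for illustration.
    return f"db-shard-{row.db_shard}", f"ms-shard-{row.ms_shard}"


db_host, ms_host = route_request(1234)
```

Since every API request goes through this lookup, all traffic for one workspace lands on the same database shard and message server shard. That keeps the model simple, but it also means one very large workspace can become a hot spot.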
Why This Worked
Client Experience: Data model lends itself to a seamless, rich real-time client experience.
● Full data model available in memory
● Updates appear instantly
● Everything feels real time
Server Experience: Implementation model is straightforward, easy to reason about and debug.
● All operations are workspace scoped
● Horizontally scale by adding servers
● Few components or dependencies
Things Get More Interesting...
Things Get More Interesting
● Size and Scale
● Product Model
Slack Growth
Slack Facts (2018)
User Base (2x): >8M Daily Active Users
Largest Organizations (10x!): >125,000 Active Users
Connectivity (3x): >7M peak simultaneous connected; Avg 10 hrs/day
Engineering Style: Still pragmatic, but embrace complexity where needed to solve hardest problems
Change the Model
A workspace logically contains all channels and messages, as well as users, emoji, bots, and more. All interactions occur within the workspace boundary.
[Diagram: example workspaces (Acme Corp, Oceanic Airlines, Delos, …, Duff Beer) in us_east_1, with example channels (#brainstorming, #proj-roadrunner, #marketing) and users (@alice, @bob, @carol)]
Change the Model
[Diagram: Workspaces (Acme Corp, Duff Beer, Oceanic Airlines, Delos) alongside an Enterprise, Wayne Enterprises, with workspaces Wayne Finance, Wayne Shipping, and Wayne Security]
Change the Model
[Diagram: Workspaces (Acme Corp, Duff Beer, Oceanic Airlines, Delos); an Enterprise, Wayne Enterprises, with workspaces Wayne Finance, Wayne Shipping, and Wayne Security; and Shared Channels between Agents of SHIELD and Stark Industries]
Challenges
Recurring Issues
● Large organizations: Boot metadata download is slow and expensive
● Thundering Herd: Load to connect >> Load in steady state
● Hot spots: Overwhelm database hosts (mains and shards) and other systems
● Herd of Pets: Manual operation to replace specific servers
● Cross Workspace Channels: Need to change assumptions about partitioning
So What Did We Do?
What Did We Do
● Thin Client Model: Flannel Cache
● Vitess: Fine-Grained DB Sharding
● Message Services: Service Decomposition
What Did We Do Thin Client Model Flannel Cache
Challenge: Boot Model Explosion
boot_payload_size ~= (num_users * user_profile_bytes) + (num_channels * (channel_info_size + (num_users_in_channel * user_id_bytes)))
Users      Profiles   Channels   Total
12         6 KB       1 KB       7 KB
530        140 KB     28 KB      168 KB
4,008      5 MB       2 MB       7 MB
44,030     36 MB      25 MB      59 MB
148,170    78 MB      40 MB      118 MB
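The formula on this slide translates directly into code. The sketch below is illustrative: the function mirrors the slide's terms, but every input value (profile size, channel info size, channel membership, ID size) is an assumption chosen for the example, not a measured Slack number, so the outputs are not meant to reproduce the table.

```python
def boot_payload_size(num_users, user_profile_bytes, num_channels,
                      channel_info_size, num_users_in_channel, user_id_bytes):
    """Approximate boot payload size in bytes, following the slide's formula."""
    profiles = num_users * user_profile_bytes
    # Each channel carries its own info plus one user ID per member.
    channels = num_channels * (channel_info_size
                               + num_users_in_channel * user_id_bytes)
    return profiles + channels


# Hypothetical small team: 12 users, 5 channels, everyone in every channel.
small = boot_payload_size(
    num_users=12, user_profile_bytes=500, num_channels=5,
    channel_info_size=100, num_users_in_channel=12, user_id_bytes=10,
)

# Hypothetical large organization: 100,000 users, 10,000 channels,
# 1,000 members per channel on average.
large = boot_payload_size(
    num_users=100_000, user_profile_bytes=500, num_channels=10_000,
    channel_info_size=100, num_users_in_channel=1_000, user_id_bytes=10,
)
```

With these made-up inputs the small team needs about 7 KB and the large organization about 151 MB. The channel term grows with both the number of channels and their membership, so it can overtake the profile term as an organization grows.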
Thin Client Model
[Architecture diagram, before: Client with a Websocket to the Message Proxy / Message Server and HTTP API Calls to the Webapp, backed by MySQL and a Job Queue. Other labels: RTM Service, us_west_1, us_east_1]
Thin Client Model
[Architecture diagram, after: the same components with the Flannel Cache and Consul added. Labels: Client, Websocket, Flannel Cache, Message Proxy, Message Server, RTM Service, HTTP API Calls, Webapp, MySQL, Job Queue, Consul, us_west_1, us_east_1]
Thin Client Model
Flannel Service: Globally distributed edge cache
Minimize Workspace Model: Much smaller boot payload
Routing: Workspace affinity for cache locality
Query API: Fetch unknown objects from cache
Cache Updates: Proxy subscription messages to clients
[Diagram: Websocket, Flannel Service, RTM Service]
Thin Client Model
Unblock Large Organizations
Adapting clients to a lazy load model was a critical change to enable Slack for large organizations.
● Huge reduction in payload times on initial connect
● Flannel efficiently responds to more than 1 million queries per second
● Adds challenges of cache coherency and reconciling business logic
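The lazy load idea can be sketched as a small cache that fetches objects on first use and applies pushed updates to whatever it already holds. This is a toy illustration, not Flannel's implementation: `LazyObjectCache`, `get`, `apply_update`, and the `ORIGIN` dict are hypothetical names introduced for the example.

```python
class LazyObjectCache:
    """Toy edge cache: fetch objects on first use instead of at boot."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # callable: object_id -> object
        self._objects = {}
        self.origin_fetches = 0

    def get(self, object_id):
        # Query path: serve from cache, fall back to the origin on a miss.
        if object_id not in self._objects:
            self._objects[object_id] = self._fetch(object_id)
            self.origin_fetches += 1
        return self._objects[object_id]

    def apply_update(self, object_id, obj):
        # Update path: refresh objects that are already cached so that
        # later reads do not return stale data.
        if object_id in self._objects:
            self._objects[object_id] = obj


# Stand-in for the backing store (example data).
ORIGIN = {"U1": {"name": "alice"}, "U2": {"name": "bob"}}

cache = LazyObjectCache(lambda oid: dict(ORIGIN[oid]))
first = cache.get("U1")    # miss: fetched from the origin
second = cache.get("U1")   # hit: served from the cache
cache.apply_update("U1", {"name": "alice2"})
```

The `apply_update` path hints at the cache coherency challenge named on the slide: once data lives in a cache, every change has to reach it, or clients see stale objects.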
What Did We Do Vitess Fine-Grained DB Sharding
Challenge: Hot Spots & Manual Repair
Vitess
[Architecture diagram, before: Client, Websocket, Flannel Cache, Message Proxy, Message Server, RTM Service, HTTP API Calls, Webapp, MySQL, Job Queue, Consul, us_west_1, us_east_1]
Vitess
[Architecture diagram, after: the same components with VtGate and VtTablet added on the path from the Webapp to MySQL]
Vitess
Flexible Sharding: Vitess manages per-table sharding policy
Topology Management: Database servers self-register
Single Master: Using GTID and semi-sync replication
Failover: Orchestrator promotes a replica on failover
Resharding Workflows: Automatically expand the cluster
[Diagram: Webapp, VtGate, VtTablet, MySQL]
Vitess
Fine-Grained Sharding
Migrating to a channel-sharded / user-sharded data model helps mitigate hot spots for large teams and thundering herds.
● Retains MySQL at the core for developer / operations continuity
● More mature topology management and cluster expansion systems
● Data migrations that change the sharding model take a long time
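The effect of changing the sharding key can be sketched with a toy hash-based router. This is an assumption-laden illustration: `shard_for` uses SHA-256 modulo a fixed shard count, which is not how Vitess maps keys to shards, and the shard count and IDs are invented for the example.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # arbitrary value for the example


def shard_for(key: str) -> int:
    """Map a sharding key to a shard with a stable hash (not Vitess's scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# One large workspace with many channels (hypothetical IDs).
workspace_id = "T1234"
channel_ids = [f"C{n}" for n in range(1000)]

# Workspace-sharded: every channel's data lands on the workspace's shard.
by_workspace = Counter(shard_for(workspace_id) for _ in channel_ids)

# Channel-sharded: each channel is placed independently.
by_channel = Counter(shard_for(cid) for cid in channel_ids)
```

Here `by_workspace` has a single entry holding all 1,000 channels, while `by_channel` spreads them over several shards. That spread is what relieves the hot spot a single large team creates under workspace sharding.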
What Did We Do Message Services Service Decomposition
Challenge: Shared Channels?
[Diagram: Agents of SHIELD and Stark Industries, each shown with its own Message Server]
Message Server to Services
[Architecture diagram, before: Client, Websocket, Flannel Cache, Message Proxy, Message Server, RTM Service, HTTP API Calls, Webapp, VtGate, VtTablet, MySQL, Job Queue, Consul, us_west_1, us_east_1]
Message Server to Services
[Architecture diagram, after: the Message Server is shown alongside separate services labeled Gateway Server, Channel Server, Presence Server, and Admin Server. Remaining labels: Client, Websocket, Flannel Cache, RTM Service, HTTP API Calls, Webapp, VtGate, VtTablet, MySQL, Job Queue, Consul, us_west_1, us_east_1]