
Scaling Slack: The Good, The Unexpected, and The Road Ahead



  1. November 6, 2018 Scaling Slack The Good, The Unexpected, and The Road Ahead Michael Demmer mdemmer@slack-corp.com | @mjdemmer

  2. Me

  3. (Not) This Talk
     1. 2016: Monolith
     2. 2016-2018: Microservices
     3. 2016-2018: Best Practices
     4. 2018: Lessons Learned

  4. This Talk
     1. 2016: How Slack Worked
     2. 2016-2018: Things Got More Interesting
     3. 2016-2018: What We Did About It
     4. 2018+: Themes and Road Ahead

  5. Slack in 2016

  6. Slack

  7. Workspaces, Channels, Users, and more
     A workspace logically contains all channels and messages, as well as users, emoji, bots, and more. All interactions occur within the workspace boundary.
     [Diagram: example workspaces Acme Corp, Oceanic Airlines, Delos, Duff Beer, ...; channels #brainstorming, #proj-roadrunner, #marketing; users @alice, @bob, @carol; region us_east_1.]

  8. Slack Facts (2016)
     User Base: 4M Daily Active Users
     Largest Organizations: >10,000 Active Users
     Connectivity: 2.5M peak simultaneous connected; Avg 10 hrs/day
     Engineering Style: Conservative, Pragmatic, Minimal; most systems use technology more than 10 years old

  9. How Slack Works (2016)
     [Architecture diagram. Labels: Client, Websocket, Message Proxy, Message Server (Java), RTM Service, HTTP API Calls, Webapp (PHP), MySQL, Job Queue; regions us_west_1 and us_east_1.]

  10. Client / Server Flow
      Initial login:
      ● Download full workspace model with all channels, users, emoji, etc.
      ● Establish real time websocket
      [Diagram: 1: client calls rtm.start on the Webapp (PHP); 2: the response carries prefs: {...}, users: {...}, channels: {...}, emoji: {...}, ms: "ms1.slack-msgs.com"; 3: websocket connect to the Message Proxy.]

  11. Client / Server Flow
      Initial login:
      ● Download full workspace model with all channels, users, emoji, etc.
      ● Establish real time websocket
      While connected:
      ● Push updates via websocket
      ● API calls for channel history, message edits, create channels, etc.
      [Diagram: {message: ...} arrives through the Message Proxy; reactions.add goes to the Webapp (PHP).]
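The boot flow on the two slides above can be sketched in a few lines. This is a minimal illustration, not Slack's client code: the `WorkspaceModel` class, the `boot` and `apply_push` helpers, the `wss://` scheme, and the `channel_created` event shape are all assumptions made for the example. Only the response keys (`prefs`, `users`, `channels`, `emoji`, `ms`) come from the slide.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class WorkspaceModel:
    """Full in-memory workspace model, as held by the 2016-era client."""
    prefs: dict = field(default_factory=dict)
    users: dict = field(default_factory=dict)
    channels: dict = field(default_factory=dict)
    emoji: dict = field(default_factory=dict)


def boot(rtm_start_response: dict) -> Tuple[WorkspaceModel, str]:
    """Build the model from an rtm.start-style response and pick the websocket host."""
    model = WorkspaceModel(
        prefs=dict(rtm_start_response.get("prefs", {})),
        users=dict(rtm_start_response.get("users", {})),
        channels=dict(rtm_start_response.get("channels", {})),
        emoji=dict(rtm_start_response.get("emoji", {})),
    )
    # Assumption: the "ms" field is a bare hostname reached over wss://.
    websocket_url = "wss://" + rtm_start_response["ms"]
    return model, websocket_url


def apply_push(model: WorkspaceModel, event: dict) -> None:
    """Apply one pushed update; the event shape here is hypothetical."""
    if event.get("type") == "channel_created":
        channel = event["channel"]
        model.channels[channel["id"]] = channel


response = {
    "prefs": {},
    "users": {"U1": {"name": "alice"}},
    "channels": {"C1": {"id": "C1", "name": "brainstorming"}},
    "emoji": {},
    "ms": "ms1.slack-msgs.com",
}
model, url = boot(response)
apply_push(model, {"type": "channel_created",
                   "channel": {"id": "C2", "name": "marketing"}})
print(url, sorted(model.channels))
```

Because the whole model lives in memory after boot, every later read is local, which is what makes the client feel real time and also what makes the boot payload grow with the size of the workspace.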

  12. Sharding And Routing
      Workspace Sharding
      ● Assign a workspace to a DB and MS shard at creation
      ● Metadata table lookup for each API request to route
      [Diagram labels: Message Servers, Mains, Webapp (PHP), MySQL Shards. Example lookup: select * from teams where id = 1234, returning {id:1234, domain:demmer, db_shard:35, ms_shard:11, ...}.]

  13. Sharding And Routing
      Workspace Sharding
      ● Assign a workspace to a DB and MS shard at creation
      ● Metadata table lookup for each API request to route
      "Herd of Pets"
      ● DBs run in active/active pairs with application failover
      ● Service hosts are addressed in config and manually replaced
      [Diagram labels: Message Servers, Mains, Webapp (PHP), MySQL Shards.]
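The per-request metadata lookup can be sketched as below. This is an illustrative stand-in, assuming an in-memory dictionary in place of the metadata table; the `WorkspaceRoute` class and `route_request` function are hypothetical names, and only the example row (id 1234, domain demmer, db_shard 35, ms_shard 11) comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkspaceRoute:
    """One row of the (hypothetical) workspace metadata table."""
    id: int
    domain: str
    db_shard: int
    ms_shard: int


# Stand-in for the metadata table; the single row mirrors the slide's example.
METADATA = {
    1234: WorkspaceRoute(id=1234, domain="demmer", db_shard=35, ms_shard=11),
}


def route_request(workspace_id: int) -> WorkspaceRoute:
    """Look up which DB shard and message-server shard own a workspace."""
    try:
        return METADATA[workspace_id]
    except KeyError:
        raise LookupError(f"unknown workspace {workspace_id}") from None


route = route_request(1234)
print(route.db_shard, route.ms_shard)  # prints: 35 11
```

Since the shard assignment is fixed when the workspace is created, every request for that workspace lands on the same database pair and message server, which is simple to reason about but concentrates the load of a very large workspace on one set of hosts.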

  14. Why This Worked
      Client Experience: Data model lends itself to a seamless, rich real-time client experience.
      ● Full data model available in memory
      ● Updates appear instantly
      ● Everything feels real time
      Server Experience: Implementation model is straightforward, easy to reason about and debug.
      ● All operations are workspace scoped
      ● Horizontally scale by adding servers
      ● Few components or dependencies

  15. Things Get More Interesting...

  16. Things Get More Interesting
      ● Size and Scale
      ● Product Model

  17. Slack Growth

  18. Slack Facts (2018)
      User Base: >8M Daily Active Users
      Largest Organizations: >125,000 Active Users
      Connectivity: >7M peak simultaneous connected; Avg 10 hrs/day
      Engineering Style: Still pragmatic, but embrace complexity where needed to solve hardest problems

  19. Slack Facts (2018)
      User Base (2x): >8M Daily Active Users
      Largest Organizations (10x!): >125,000 Active Users
      Connectivity (3x): >7M peak simultaneous connected; Avg 10 hrs/day
      Engineering Style: Still pragmatic, but embrace complexity where needed to solve hardest problems

  20. Change the Model
      A workspace logically contains all channels and messages, as well as users, emoji, bots, and more. All interactions occur within the workspace boundary.
      [Diagram: example workspaces Acme Corp, Oceanic Airlines, Delos, Duff Beer, ...; channels #brainstorming, #proj-roadrunner, #marketing; users @alice, @bob, @carol; region us_east_1.]

  21. Change the Model
      Workspaces: Acme Corp, Duff Beer, Oceanic Airlines, Delos
      Enterprise: Wayne Enterprises, containing Wayne Finance, Wayne Shipping, Wayne Security

  22. Change the Model
      Workspaces: Acme Corp, Duff Beer, Oceanic Airlines, Delos
      Enterprise: Wayne Enterprises, containing Wayne Finance, Wayne Shipping, Wayne Security
      Shared Channels: Agents of SHIELD, Stark Industries

  23. Challenges
      Recurring Issues
      ● Large organizations: Boot metadata download is slow and expensive
      ● Thundering Herd: Load to connect >> Load in steady state
      ● Hot spots: Overwhelm database hosts (mains and shards) and other systems
      ● Herd of Pets: Manual operation to replace specific servers
      ● Cross Workspace Channels: Need to change assumptions about partitioning

  24. So What Did We Do?

  25. What Did We Do
      ● Thin Client Model: Flannel Cache
      ● Vitess: Fine-Grained DB Sharding
      ● Message Services: Service Decomposition

  26. What Did We Do
      ● Thin Client Model: Flannel Cache

  27. Challenge: Boot Model Explosion
      boot_payload_size ~= (num_users * user_profile_bytes) + (num_channels * (channel_info_size + (num_users_in_channel * user_id_bytes)))

      Users      Profiles   Channels   Total
      12         6 KB       1 KB       7 KB
      530        140 KB     28 KB      168 KB
      4,008      5 MB       2 MB       7 MB

  28. Challenge: Boot Model Explosion
      boot_payload_size ~= (num_users * user_profile_bytes) + (num_channels * (channel_info_size + (num_users_in_channel * user_id_bytes)))

      Users      Profiles   Channels   Total
      12         6 KB       1 KB       7 KB
      530        140 KB     28 KB      168 KB
      4,008      5 MB       2 MB       7 MB
      44,030     36 MB      25 MB      59 MB
      148,170    78 MB      40 MB      118 MB
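The payload formula can be evaluated directly. The sketch below is a rough illustration only: it assumes one average membership count for every channel (the slide's num_users_in_channel varies per channel), and the byte sizes in the example call are made-up inputs, not Slack's measurements.

```python
def boot_payload_size(num_users: int,
                      user_profile_bytes: int,
                      num_channels: int,
                      channel_info_size: int,
                      avg_users_in_channel: int,
                      user_id_bytes: int) -> int:
    """Approximate boot payload in bytes, following the slide's formula."""
    profiles = num_users * user_profile_bytes
    # Each channel carries its own info plus one user id per member.
    channels = num_channels * (
        channel_info_size + avg_users_in_channel * user_id_bytes
    )
    return profiles + channels


# Hypothetical inputs chosen only to exercise the formula.
size = boot_payload_size(
    num_users=1000,
    user_profile_bytes=500,
    num_channels=200,
    channel_info_size=300,
    avg_users_in_channel=50,
    user_id_bytes=9,
)
print(size)  # prints: 650000
```

The channel term multiplies the number of channels by the number of members per channel, so when both grow with the organization the payload grows faster than linearly in the user count.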

  29. Thin Client Model
      [Architecture diagram. Labels: Client, Websocket, Message Proxy, Message Server, RTM Service, HTTP API Calls, Webapp, MySQL, Job Queue; regions us_west_1 and us_east_1.]

  30. Thin Client Model
      [Architecture diagram, now with Flannel Cache and Consul added. Labels: Client, Websocket, Flannel Cache, Message Proxy, Message Server, RTM Service, HTTP API Calls, Webapp, MySQL, Job Queue, Consul; regions us_west_1 and us_east_1.]

  31. Thin Client Model
      Flannel Service: Globally distributed edge cache
      Minimize Workspace Model: Much smaller boot payload
      Routing: Workspace affinity for cache locality
      Query API: Fetch unknown objects from cache
      Cache Updates: Proxy subscription messages to clients
      [Diagram labels: Websocket, Flannel, RTM Service.]

  32. Thin Client Model
      Unblock Large Organizations: Adapting clients to a lazy load model was a critical change to enable Slack for large organizations.
      ● Huge reduction in payload times on initial connect
      ● Flannel efficiently responds to more than 1 million queries per second
      ● Adds challenges of cache coherency and reconciling business logic
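The lazy load model described above can be sketched as a client-side cache that fetches an object only the first time it is needed. This is an assumption-laden illustration, not Flannel's protocol: `LazyObjectCache`, its `fetch` callback, and the `BACKING` dictionary standing in for the edge cache are all hypothetical.

```python
from typing import Callable, Dict


class LazyObjectCache:
    """Client-side store that loads users or channels on first use."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch          # hypothetical query to the edge cache
        self._objects: Dict[str, dict] = {}
        self.misses = 0

    def get(self, object_id: str) -> dict:
        """Return the object, querying the edge cache only on a miss."""
        if object_id not in self._objects:
            self.misses += 1
            self._objects[object_id] = self._fetch(object_id)
        return self._objects[object_id]

    def apply_update(self, object_id: str, fields: dict) -> None:
        """Apply a pushed update to an object the client already holds."""
        if object_id in self._objects:
            self._objects[object_id].update(fields)
        # Objects never loaded are skipped; a later get() fetches them fresh.


# Stand-in for the data behind the edge cache.
BACKING = {"U1": {"name": "alice"}, "U2": {"name": "bob"}}

cache = LazyObjectCache(lambda object_id: dict(BACKING[object_id]))
cache.get("U1")
cache.get("U1")                      # second read is served locally
cache.apply_update("U1", {"name": "alice2"})
print(cache.misses, cache.get("U1")["name"])  # prints: 1 alice2
```

The trade-off named on the slide shows up even in this toy version: once objects are cached in more than one place, every update path has to keep those copies coherent.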

  33. What Did We Do
      ● Vitess: Fine-Grained DB Sharding

  34. Challenge: Hot Spots & Manual Repair

  35. Vitess
      [Architecture diagram. Labels: Client, Websocket, Flannel Cache, Message Proxy, Message Server, RTM Service, HTTP API Calls, Webapp, MySQL, Job Queue, Consul; regions us_west_1 and us_east_1.]

  36. Vitess
      [Architecture diagram, now with VtGate and VtTablet added between the Webapp and MySQL. Labels: Client, Websocket, Flannel Cache, Message Proxy, Message Server, RTM Service, HTTP API Calls, Webapp, VtGate, VtTablet, MySQL, Job Queue, Consul; regions us_west_1 and us_east_1.]

  37. Vitess
      Flexible Sharding: Vitess manages per-table sharding policy
      Topology Management: Database servers self-register
      Single Master: Using GTID and semi-sync replication
      Failover: Orchestrator promotes a replica on failover
      Resharding Workflows: Automatically expand the cluster
      [Diagram labels: Webapp, VtGate, VtTablet, MySQL.]

  38. Vitess
      Fine-Grained Sharding: Migrating to a channel-sharded / user-sharded data model helps mitigate hot spots for large teams and thundering herds.
      ● Retains MySQL at the core for developer / operations continuity
      ● More mature topology management and cluster expansion systems
      ● Data migrations that change the sharding model take a long time
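The effect of changing the sharding key can be shown with a plain hash. This is a generic sketch and not Vitess's API (Vitess expresses the key-to-shard mapping through its own vindex configuration); the `shard_for` helper, the shard count of 16, and the id formats are assumptions for the example.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical cluster size


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a sharding key to a shard number with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


workspace_id = "T1234"
channel_ids = [f"C{n}" for n in range(100)]

# Workspace-sharded: every channel row is keyed by its workspace.
by_workspace = {shard_for(workspace_id) for _ in channel_ids}

# Channel-sharded: each channel row is keyed by its own id.
by_channel = {shard_for(channel_id) for channel_id in channel_ids}

print(len(by_workspace), len(by_channel))
```

With the workspace key, all 100 channels of one large workspace sit on a single shard; with the channel key, the same rows spread across the cluster, which is the hot-spot mitigation the slide describes.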

  39. What Did We Do
      ● Message Services: Service Decomposition

  40. Challenge: Shared Channels?
      [Diagram: Agents of SHIELD and Stark Industries, each shown with its own Message Server.]

  41. Challenge: Shared Channels?
      [Diagram: Agents of SHIELD and Stark Industries, each shown with its own Message Server.]

  42. Message Server to Services
      [Architecture diagram. Labels: Client, Websocket, Flannel Cache, Message Proxy, Message Server, RTM Service, HTTP API Calls, Webapp, VtGate, VtTablet, MySQL, Job Queue, Consul; regions us_west_1 and us_east_1.]

  43. Message Server to Services
      [Architecture diagram with the message tier split into separate servers. Labels: Client, Websocket, Flannel Cache, Gateway Server, Channel Server, Presence Server, Admin Server, Message Server, RTM Service, HTTP API Calls, Webapp, VtGate, VtTablet, MySQL, Job Queue, Consul; regions us_west_1 and us_east_1.]
