

  1. Scaling Slack Bing Wei Infrastructure@Slack


  4. Our Mission: To make people’s working lives simpler, more pleasant, and more productive.

  5. From supporting small teams to serving gigantic organizations of hundreds of thousands of users

  6. Slack Scale ◈ 6M+ DAU, 9M+ WAU; 5M+ peak simultaneously connected ◈ Avg 10+ hrs/weekday connected; avg 2+ hrs/weekday in active use ◈ 55% of DAU outside the US

  7. Cartoon Architecture: WebApp (PHP/Hack) over HTTP, backed by sharded MySQL; Job Queue (Redis/Kafka); Messaging Server (Java) over WebSocket

  8. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client pub/sub ○ Messaging Server

  9. Challenge: Slowness connecting to Slack

  10. Login Flow in 2015: 1. HTTP POST from the user, with the user’s token, to WebApp (backed by MySQL). 2. HTTP response: a snapshot of the team and a WebSocket URL.

  11. Some example snapshot response sizes:
      users / channels → response size
      30 / 10 → 200K
      500 / 200 → 2.5M
      3K / 7K → 20M
      30K / 1K → 60M

  12. Login Flow in 2015: 1. HTTP POST from the user, with the user’s token, to WebApp (backed by MySQL). 2. HTTP response: a snapshot of the team and a WebSocket URL. 3. WebSocket connection to the Messaging Server for real-time events.

  13. Real-time Events on WebSocket: 100+ types of events flow from the Messaging Server to the user, e.g. chat messages, typing indicators, file uploads, file comments, thread replies, user presence changes, user profile changes, reactions, pins, stars, channel creations, app installations, etc.

  14. Login Flow in 2015 ◈ Client architecture ○ Download a snapshot of the entire team ○ Updates trickle in through the WebSocket ○ Eventually consistent snapshot of the whole team
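The client model above (full snapshot at boot, then WebSocket deltas) can be sketched roughly as follows. This is an illustrative reconstruction, not Slack's actual client code; all class, field, and event names are assumptions.

```python
# Hypothetical sketch of the 2015 client model: boot from a full team
# snapshot, then apply real-time WebSocket events to stay current.
class ClientModel:
    def __init__(self, snapshot):
        # The snapshot holds the entire team: every user and channel.
        self.users = {u["id"]: u for u in snapshot["users"]}
        self.channels = {c["id"]: c for c in snapshot["channels"]}

    def apply_event(self, event):
        # Updates trickle in over the WebSocket; applying each one keeps
        # the local copy eventually consistent with the server.
        if event["type"] == "user_change":
            self.users[event["user"]["id"]] = event["user"]
        elif event["type"] == "channel_created":
            self.channels[event["channel"]["id"]] = event["channel"]

snapshot = {"users": [{"id": "U1", "name": "ana"}],
            "channels": [{"id": "C1", "name": "general"}]}
model = ClientModel(snapshot)
model.apply_event({"type": "channel_created",
                   "channel": {"id": "C2", "name": "random"}})
```

The pain points on the next slides follow directly from this design: the snapshot grows with team size, and every reconnect pays the full download again.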

  15. Problems: initial team snapshot takes time

  16. Problems: initial team snapshot takes time; large client memory footprint

  17. Problems: initial team snapshot takes time; large client memory footprint; expensive reconnections

  18. Problems: initial team snapshot takes time; large client memory footprint; expensive reconnections; reconnect storms

  19. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client pub/sub ○ Messaging Server

  20. Improvements ◈ Smaller team snapshot ○ Client local storage + deltas ○ Move objects out + load in parallel ○ Simplified objects, e.g. canonical avatars
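The "local storage + delta" idea can be sketched as follows: instead of refetching the whole team, the client reports the version it has cached and the server returns only objects that changed since then. This is an assumption-laden illustration; the function name and version scheme are hypothetical, not Slack's API.

```python
# Illustrative delta sync: return only objects updated after the
# version the client already has in local storage.
def delta_since(all_objects, client_version):
    """Objects whose update counter is newer than the client's cache."""
    return [o for o in all_objects if o["updated"] > client_version]

objects = [{"id": "U1", "updated": 5},
           {"id": "U2", "updated": 9},
           {"id": "U3", "updated": 12}]

# The client last synced at version 8, so only U2 and U3 come back.
delta = delta_since(objects, 8)
```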

  21. Improvements ◈ Incremental boot ○ Load one channel first

  22. Improvements ◈ Rate limits ◈ POPs ◈ Load-testing framework

  23. Support New Product Features (product launch)

  24. Cope with New Product Features (product launch)

  25. Still... Limitations ◈ What if team sizes keep growing? ◈ Outages when clients dump their local storage

  26. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client pub/sub ○ Messaging Server

  27. Client Lazy Loading ◈ Download less data upfront ◈ Fetch more on demand

  28. Flannel: Edge Cache Service. A query engine backed by cache in edge locations.

  29. What is in Flannel’s cache ◈ Support big objects first ○ Users ○ Channel membership ○ Channels

  30. Login and Message Flow with Flannel: 1. WebSocket connection from the user to Flannel. 2. HTTP POST to WebApp (backed by MySQL) to download a snapshot of the team. 3. WebSocket: JSON events streamed from the Messaging Server.

  31. A Man in the Middle: Flannel sits on the WebSocket between the user and the Messaging Server, and uses real-time events to update its cache, e.g. user creation, user profile changes, channel creation, a user joining a channel, a channel converting to private.
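The man-in-the-middle pattern can be sketched in a few lines: the edge cache applies the same real-time events it proxies, so later on-demand queries (e.g. mention suggestions) are answered at the edge instead of the main region. This is a toy illustration; the class, method, and event names are hypothetical, not Flannel's real interface.

```python
# Minimal sketch of an edge cache sitting on the event stream.
class EdgeCache:
    def __init__(self):
        self.users = {}

    def on_event(self, event):
        # Keep the cache fresh from events flowing through the WebSocket.
        if event["type"] in ("user_created", "user_profile_changed"):
            self.users[event["user"]["id"]] = event["user"]

    def query_users(self, prefix):
        # On-demand query served from the edge, e.g. for mention
        # suggestions or the quick switcher.
        return sorted(u["name"] for u in self.users.values()
                      if u["name"].startswith(prefix))

cache = EdgeCache()
cache.on_event({"type": "user_created", "user": {"id": "U1", "name": "bing"}})
cache.on_event({"type": "user_created", "user": {"id": "U2", "name": "bob"}})
```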

  32. Edge Locations: main region in us-east-1; a mix of AWS and Google Cloud

  33. Examples Powered by Flannel: Quick Switcher

  34. Examples Powered by Flannel: Mention Suggestions

  35. Examples Powered by Flannel: Channel Header

  36. Examples Powered by Flannel: Channel Sidebar

  37. Examples Powered by Flannel: Team Directory

  38. Flannel Results ◈ Launched Jan 2017 ○ Loads 200K-user teams ◈ 5M+ simultaneous connections at peak ◈ 1M+ client queries/sec

  39. Flannel Results

  40. This is not the end of the story

  41. Evolution of Flannel

  42. Web Client Iterations: Flannel Just-In-Time Annotation. Right before a web client is about to access an object, Flannel pushes that object to the client.

  43. A Closer Look: why does Flannel sit on the WebSocket?

  44. Old Way of Cache Updates: the Messaging Server sends Flannel LOTS of duplicated JSON events, one copy per connected user.

  45. Publish/Subscribe (Pub/Sub) to Update Cache: the Messaging Server publishes Thrift events that Flannel subscribes to.

  46. Pub/Sub Benefits ◈ Less Flannel CPU ◈ Simpler Flannel code ◈ Schematized data ◈ Flexibility for cache management

  47. Flexibility for Cache Management. Previously: ◈ Load when the first user connects ◈ Unload when the last user disconnects

  48. Flexibility for Cache Management. With Pub/Sub: ◈ Received events are isolated from user connections
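A toy broker makes the decoupling above concrete: Flannel subscribes to a team's topic once, so the Messaging Server publishes each event once per subscriber rather than once per connected user, and the cache's lifetime no longer tracks user connections. This is a sketch under assumed names, not Slack's actual pub/sub implementation.

```python
from collections import defaultdict

# Toy publish/subscribe broker (illustrative only).
class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # One delivery per subscription, independent of how many end
        # users are currently connected behind that subscriber.
        for cb in self.subscribers[topic]:
            cb(event)

received = []
broker = Broker()
broker.subscribe("team:T1", received.append)   # one subscription per cache
broker.publish("team:T1", {"type": "user_change", "user": "U1"})
broker.publish("team:T2", {"type": "user_change", "user": "U9"})  # not subscribed
```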

  49. Another Closer Look ◈ With Pub/Sub, does Flannel still need to be on the WebSocket path?

  50. Next Step: move Flannel out of the WebSocket path

  51. Next Step: move Flannel out of the WebSocket path. Why? Separation and flexibility.

  52. Evolution with Product Requirements: Grid for big enterprises

  53. Before Grid: team affinity, for cache efficiency

  54. Now: team affinity, Grid-aware

  55. Grid Awareness Improvements: saves 22 GB of Flannel memory per host, 1.1 TB in total

  56. Grid Awareness Improvements. For our biggest customer: P99 user connect latency 40s -> 4s; DB shard CPU idle 25% -> 90%

  57. Future: team affinity, Grid-aware, scatter & gather

  58. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client pub/sub ○ Messaging Server

  59. Expand Pub/Sub to the Client Side: client-side pub/sub reduces the number of events clients have to handle.

  60. Presence Events: 60% of all events, and O(N²); a 1000-user team ⇒ 1000 × 1000 = 1M events

  61. Presence Pub/Sub: clients ◈ track who is in the current view ◈ subscribe/unsubscribe with the Messaging Server when the view changes
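The subscribe/unsubscribe step above is essentially a set diff: when the visible users change, the client subscribes to the newly visible ones and unsubscribes from those that scrolled out of view. A minimal sketch, with hypothetical names:

```python
# Client-side presence pub/sub: compute which presence subscriptions
# to add and remove when the current view changes.
def presence_diff(old_view, new_view):
    """Return (to_subscribe, to_unsubscribe) as sets of user IDs."""
    return new_view - old_view, old_view - new_view

old_view = {"U1", "U2", "U3"}
new_view = {"U2", "U3", "U4"}   # the user scrolled to a different channel
subscribe, unsubscribe = presence_diff(old_view, new_view)
```

Instead of receiving all O(N²) presence events for the team, each client now handles only presence changes for the handful of users it is actually displaying.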

  62. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client pub/sub ○ Messaging Server

  63. What is the Messaging Server?

  64. A message router.

  65. Event Routing and Fanout: 1. Events happen on a team (WebApp/DB). 2. The Messaging Server fans the events out.

  66. Limitations ◈ Sharded by team: single point of failure

  67. Limitations ◈ Sharded by team: single point of failure ◈ Shared channels: shared state among teams


  69. Topic Sharding: everything is a topic (public/private channels, DMs, group DMs, users, teams, grids)

  70. Topic Sharding: a natural fit for shared channels

  71. Topic Sharding: a natural fit for shared channels; reduces user-perceived failures
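The sharding idea above can be illustrated with a simple hash: each topic maps independently to a shard, so a shared channel lives on exactly one shard regardless of which teams share it, and losing a shard affects only the topics placed on it rather than an entire team. This sketch uses CRC32 as a stand-in hash; the topic naming scheme and shard count are assumptions, not Slack's.

```python
import zlib

# Illustrative topic sharding: hash each topic to a shard independently.
def shard_for(topic, num_shards):
    # crc32 is a stable stand-in; a real system might use consistent
    # hashing to limit data movement when shards are added or removed.
    return zlib.crc32(topic.encode()) % num_shards

topics = ["channel:C123", "dm:D45", "user:U1", "team:T9"]
placement = {t: shard_for(t, 4) for t in topics}
```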

  72. Other Improvements: auto failure recovery

  73. Other Improvements: auto failure recovery; publish/subscribe

  74. Other Improvements: auto failure recovery; publish/subscribe; fanout at the edge

  75. Our Journey: an ongoing evolution of problem, incremental change, and architectural change

  76. Journey Ahead: more to come. Get in touch: https://slack.com/jobs

  77. Thanks! Any questions? @bingw11
