  1. Lessons from an Internet-Scale Notification System Atul Adya

  2. History
     ● End-client notification system Thialfi
       ○ Presented at SOSP 2011
     ● Since then:
       ○ Scaled by several orders of magnitude
       ○ Used by many more products and in different ways
     ● Several unexpected “lessons”

  3. Case for Notifications
     ● Ensuring cached data is fresh across users and devices
     [Diagram: a "Colin is online" update kept fresh on Bob's browser, Alice's notebook, and Phil's phones]

  4. Common Pattern #1: Polling
     [Diagram: clients repeatedly asking "Did it change yet?" and almost always hearing "No!"]
     ● Cost and speed issues at scale: 100M clients polling at 10-minute intervals => ~166K QPS
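
For context, here is a minimal sketch of the polling pattern this slide criticizes, written in Java. The VersionService interface and the class names are hypothetical, not part of Thialfi; the point is only that every client spends a request per interval whether or not anything changed, which is where the ~166K QPS figure comes from.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical service interface: returns the current version of an object.
    interface VersionService {
      long getVersion(String objectId);
    }

    class PollingClient {
      private long lastSeenVersion = -1;

      void start(VersionService service, String objectId) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Every client asks "did it change yet?" every 10 minutes and almost always hears "no".
        // 100M clients / 600 seconds between polls ~= 166K queries per second at the server.
        scheduler.scheduleAtFixedRate(() -> {
          long version = service.getVersion(objectId);
          if (version > lastSeenVersion) {   // "Yes!" -- refetch the object
            lastSeenVersion = version;
            refetch(objectId);
          }                                   // "No!"  -- the request was wasted
        }, 0, 10, TimeUnit.MINUTES);
      }

      private void refetch(String objectId) { /* application-specific fetch */ }
    }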

  5. Common Pattern #2: App pushes updates over point-to-point channels
     ● Complicated for every app to build
       ○ Bookkeeping: object ids, endpoints, registrations, cursors, ACLs, pending messages
       ○ Plumbing: fan out to endpoints, manage channels, ensure reliable delivery (XMPP, HTTP, GCM)
     [Diagram: a "Colin is online" update fanned out by the application over per-client channels]

  6. Our Solution: Thialfi
     ● Scalable: handles hundreds of millions of clients and objects
     ● Fast: notifies clients in less than a second
     ● Reliable: even when entire data centers fail
     ● Easy to use and deploy: Chrome Sync (Desktop/Android), Google Plus, Contacts, Music, GDrive

  7. Thialfi Programming Overview
     [Diagram: clients C1 and C2, each using the Thialfi client library, send "Register X" to the Thialfi service in the data center; the application backend sends "Update X"; the service, having recorded X: C1, C2, delivers "Notify X" to both registered clients]
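
A minimal sketch of what this programming model looks like from the application's side, in Java. The interface and method names are illustrative, not the actual Thialfi client library API; the shape follows the slide: the application registers for object X and later receives a notification that X has a newer version, at which point it refetches the data from its own backend.

    // Illustrative types, not the real Thialfi client library API.
    interface NotificationListener {
      // Called when a registered object has a newer version than the client has seen.
      void notifyChanged(String objectId, long version);
    }

    interface NotificationClient {
      // Ask the service to remember that this client cares about objectId.
      void register(String objectId);
      void unregister(String objectId);
    }

    class PresenceCache implements NotificationListener {
      private final NotificationClient client;

      PresenceCache(NotificationClient client) {
        this.client = client;
        client.register("X");  // "Register X" on the slide
      }

      @Override
      public void notifyChanged(String objectId, long version) {
        // "Notify X": the cached copy of X is stale; refetch it from the application backend.
        refetchFromBackend(objectId, version);
      }

      private void refetchFromBackend(String objectId, long version) { /* app-specific fetch */ }
    }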

  8. Thialfi Architecture
     [Diagram: client libraries exchange registrations, notifications, and acknowledgments with the data center over HTTP/XMPP/GCM; inside the data center, the Registrar and Matcher (each backed by Bigtable) manage this state, and a Bridge translates notifications coming from the application backend]
     ● Matcher: Object → registered clients, version
     ● Registrar: Client ID → registered objects, unacked messages
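
As a rough mental model of the two server-side tables named above, here is an illustrative in-memory sketch in Java. The real system shards this state and stores it in Bigtable; the class and field names below are made up.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // Illustrative only: the Matcher's state, keyed by object.
    class MatcherState {
      // Object ID -> latest known version of the object.
      final Map<String, Long> latestVersion = new HashMap<>();
      // Object ID -> client IDs registered for that object.
      final Map<String, Set<String>> registeredClients = new HashMap<>();
    }

    // Illustrative only: the Registrar's state, keyed by client.
    class RegistrarState {
      // Client ID -> object IDs the client has registered for.
      final Map<String, Set<String>> registrations = new HashMap<>();
      // Client ID -> notifications delivered to the client but not yet acknowledged.
      final Map<String, Set<String>> unackedNotifications = new HashMap<>();
    }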

  9. Thialfi Abstraction
     ● Objects have unique IDs and version numbers, monotonically increasing on every update
     ● Delivery guarantee
       ○ Registered clients learn the latest version number
       ○ Reliable signal only: cached object ID X at version Y
     ● (Think “Cache Invalidation”)

  10. Thialfi Characteristics
     ● Built around soft state
       ○ Recover registration state from clients
       ○ Lost notification signal: InvalidateUnknownVersion
     ● Registration-Sync
       ○ Exchange hash of registrations between client & server
       ○ Helps in edge cases, async storage, cluster switch
     ● Multi-Platform
       ○ Libraries: C++, Java, JavaScript, Objective-C
       ○ OS: Windows/Mac/Linux, browsers, Android, iOS
       ○ Channels: HTTP, XMPP, GCM, Internal-RPC
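
A sketch of the Registration-Sync hash above, in Java. It assumes the digest is a SHA-1 over the client's sorted registered object IDs (a later war story mentions a SHA-1 hash in this path, but the exact wire format and encoding here are assumptions):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Set;
    import java.util.TreeSet;

    // Sketch: client and server each hash their view of the client's registrations;
    // if the digests differ, they fall back to exchanging the full registration list.
    final class RegistrationDigest {
      static byte[] digest(Set<String> registeredObjectIds) {
        try {
          MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
          // Sort so that both sides hash the same set in the same order.
          for (String objectId : new TreeSet<>(registeredObjectIds)) {
            sha1.update(objectId.getBytes(StandardCharsets.UTF_8));
            sha1.update((byte) 0);  // separator to avoid ambiguous concatenation
          }
          return sha1.digest();
        } catch (NoSuchAlgorithmException e) {
          throw new AssertionError(e);  // SHA-1 is always available on the JVM
        }
      }
    }

A mismatch between the client's and the server's digests triggers a registration sync, which is exactly the loop the Reg-Sync war stories below are about.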

  11. Some Lesions Ouch! I mean, Lessons

  12. Lesson 1: Is this thing on?
     ● Launch your system and no one is using it
       ○ How do I know it is working?
     ● People start using it
       ○ Is it working now?
     ● Magically, you know it works 99.999% of the time
       ○ Which 99.999%?
     ● How to distinguish among ephemeral, disconnected, and buggy clients?
     ● You can never know

  13. Lesson 1: Is this thing on?
     ● What’s the best you can do?
     ● Continuous testing in production
       ○ But may not be able to get client monitoring
     ● Look at server graphs
       ○ End-to-end, e.g., latency
       ○ More detailed, e.g., reg-sync per client type

  14. Lesson 1: Is this thing on?
     ● But graphs are not sufficient
       ○ Even when it looks right, averages can be deceptive
       ○ How to know if you are “missing” some traffic?
     ● Have other ways of getting more reports: customer monitoring, real customers, Twitter, ...

  15. Lesson 2: And you thought you could debug?
     ● Monitoring indicates that there is a problem
       ○ Server text logs: but hard to correlate
       ○ Structured logging: may have to log selectively
         ■ E.g., cannot log incoming stream multiple times
       ○ Client logs: typically not available
       ○ Monitoring graphs: but can be too many signals
     ● Specific user has a problem (needle-in-a-haystack)
       ○ Structured logging, if available
       ○ Custom production code!

  16. War Story: VIP Customer
     ● Customer unable to receive notifications
     ● Whole team spent hours looking
     ● Early on, debugging support was poor
       ○ Text logs: had rolled over
       ○ Structured logs: not there yet
       ○ Persistent state: had no history
     ● Eventually got “lucky”
       ○ Version numbers were timestamps
       ○ Saw last notification “version” was very old
       ○ Deflected the bug

  17. Opportunity: Monitoring & Debugging Tools
     ● Automated tools to detect anomalies
       ○ Machine-learning based?
     ● Tools for root-cause analysis
       ○ Which signals to examine when a problem occurs
     ● Finding needles in a haystack
       ○ Dynamically switch on debugging for a “needle”
         ■ E.g., trace a client’s registration and notifications
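
One way to realize the "dynamically switch on debugging for a needle" idea is a per-client trace switch that production code consults before emitting verbose logs. This is a speculative sketch in Java, not how Thialfi actually did it; the class, method names, and storage are invented.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.logging.Logger;

    // Illustrative per-client tracing switch: an operator adds a client ID at runtime,
    // and only that client's registration/notification path logs verbosely.
    final class ClientTracer {
      private static final Logger log = Logger.getLogger("notification.trace");
      // In practice this would be backed by a dynamically pushed config, not a local set.
      private static final Set<String> tracedClients = ConcurrentHashMap.newKeySet();

      static void traceClient(String clientId) { tracedClients.add(clientId); }

      static void trace(String clientId, String event) {
        if (tracedClients.contains(clientId)) {
          log.info(() -> "client=" + clientId + " " + event);
        }
      }
    }

    // Hypothetical call site inside the registration path:
    //   ClientTracer.trace(clientId, "register object=" + objectId + " version=" + version);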

  18. Lesson 3: Clients considered harmful
     ● Started out: “Offloading work to clients is good”
     ● But client code is painful:
       ○ Maintenance burden of multiple platforms
       ○ Upgrades: days, weeks, months, years … never
       ○ Hurts evolution and agility

  19. War Story: Worldwide crash of Chrome on Android (alpha)
     ● Switched a flag to route message delivery through a different client code path
     ● Tested this path extensively with tests
     ● Unfortunately, our Android code did network access from the main thread on this path
     ● Newer OS versions than the ones in our tests crashed the application when this happened
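
The Android behavior behind this crash is that newer OS versions enforce a "no network on the main thread" policy and throw NetworkOnMainThreadException. Below is a minimal sketch of the shape of the fix in Java; the class and method names are placeholders, not the actual Chrome or Thialfi code.

    import android.os.Handler;
    import android.os.Looper;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Placeholder names. The point is the shape of the fix: never do network I/O on the
    // Android main (UI) thread, because newer OS versions kill the app when that happens.
    final class MessageDeliveryPath {
      private final ExecutorService networkExecutor = Executors.newSingleThreadExecutor();
      private final Handler mainThreadHandler = new Handler(Looper.getMainLooper());

      void deliver(byte[] message) {
        // The buggy path effectively called sendOverNetwork(message) right here,
        // on the main thread.
        networkExecutor.execute(() -> {
          sendOverNetwork(message);                      // network I/O off the main thread
          mainThreadHandler.post(this::onDeliveryDone);  // hop back for state/UI updates
        });
      }

      private void sendOverNetwork(byte[] message) { /* channel-specific send */ }
      private void onDeliveryDone() { /* update state on the main thread */ }
    }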

  20. War Story: Strange Reg-Sync Loops
     ● Discovered unnecessary registrations for a (small) customer
     ● “Some JavaScript clients are in a Reg-Sync loop”
     ● Theories: races; a bug in the app, the library, Closure, ...
     ● Theory: HTTP clients switching too much
       ○ Nope!

  21. War Story: Buggy Platform
     ● Logged the platform of every Reg-Sync-looping client
     ● Found “6.0”, which meant Safari
     ● Wrote a test but failed to find the bug
     ● Engineer searched for “safari javascript runtime bug”
     ● Ran the test in a loop
       ○ SHA-1 hash was not the same in all runs of the loop!
       ○ Safari’s JavaScript JIT sometimes mis-compiled i++ as ++i

  22. Future direction: “Thin” client
     ● Move complexity to where it can be maintained
     ● Remove most code from the client
       ○ Trying to make the library a thin wrapper around the API
     ● Planning to use Spanner (synchronous store)
     ● But still keeping the soft-state aspects of Thialfi

  23. Lesson 4: Getting your foot (code) in the door
     ● Developers will use a system iff it obviously makes things better than doing it on their own
     ● Clean semantics and reliability are not the selling points you think they are
       ○ Clients care about features, not properties

  24. Lesson 4: Getting your foot (code) in the door
     ● May need “unclean” features to get customers
       ○ Best-effort data along with versions
       ○ Support special object ids for users
       ○ Added a new server (Bridge) for translating messages
     ● Customers may not be able to meet your strong requirements
       ○ Version numbers are not feasible for many systems
       ○ Allow timestamps instead of version numbers

  25. Lesson 4: Getting your foot (code) in the door
     ● Understand their architecture and review their code for integrating with your system
       ○ “Error” path broken: invalidateUnknownVersion
       ○ Naming matters: changing it to mustResync
     ● Know where your customer’s code is, so that you can migrate them to newer infrastructure
     ● Debugging tools also needed for “bug deflection”

  26. War Story: “Thialfi is unreliable”
     ● A team used Thialfi as a reliable “backup” path to augment their unreliable “fast” path
     ● Experienced an outage when their fast path became really unreliable
     ● Informed us that Thialfi was dropping notifications!
     ● Investigation revealed:
       ○ Under stress, their backend dropped messages on that path and gave up publishing into Thialfi after a few retries

  27. Lesson 5: You are building your castle on sand
     ● You will do a reasonable job thinking through your own design, protocols, failures, etc.
     ● Your outage is likely to come from a violation of one of your assumptions, or from another system several levels of dependencies away

  28. War Story: Delayed replication in Chrome Sync
     ● A Chrome backend dependency stopped sending notifications to Thialfi
     ● When it unwedged, traffic went up by more than 3X; we only had capacity for 2X
     [Graph: incoming feed QPS spiking after the backlog cleared]

  29. War Story: Delayed replication in Chrome Sync
     ● Good news: internal latency remained low and the system did not fall over
     ● Bad news: end-to-end latency spiked to minutes for all customers
     ● Isolation was not strong enough: not only Chrome Sync but all customers saw elevated latency

  30. Opportunity: Resource Isolation
     ● Need the ability to isolate various customers from each other
     ● A general problem for shared infrastructure services

  31. War Story: Load balancer config change
     ● Thialfi needs clients to be stable w.r.t. clusters
       ○ Not globally reshuffled during a single-cluster outage
     ● Changed the inter-cluster load balancer config to remove ad hoc cluster stickiness
       ○ Previously discussed with the owning team
     ● The config change caused a large-scale loss of cluster stickiness for clients

  32. War Story: Load balancer config change
     [Graph: number of active clients exploding after the config change]
     ● Client flapping between clusters caused an explosion in the number of active clients
       ○ The same client was using resources many times over

  33. Fix: Consistent hash routing
     ● Reverted the load balancer config change
     ● Use consistent hashing for cluster selection
       ○ Route each client based on its client id
       ○ Not geographically optimal
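
A minimal sketch of consistent hashing for cluster selection as described above, in Java. It assumes a simple hash ring keyed by client id with virtual nodes per cluster; the real routing layer is more involved, and all names here are illustrative.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.List;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Illustrative hash ring: a client id always maps to the same cluster as long as the
    // cluster set is unchanged, and only ~1/N of clients move when a cluster is added or
    // removed -- which is the stickiness property Thialfi needs.
    final class ClusterRing {
      private final TreeMap<Long, String> ring = new TreeMap<>();

      ClusterRing(List<String> clusters, int virtualNodesPerCluster) {
        for (String cluster : clusters) {
          for (int i = 0; i < virtualNodesPerCluster; i++) {
            ring.put(hash(cluster + "#" + i), cluster);
          }
        }
      }

      String clusterFor(String clientId) {
        long h = hash(clientId);
        SortedMap<Long, String> tail = ring.tailMap(h);
        Long key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();  // wrap around the ring
        return ring.get(key);
      }

      private static long hash(String key) {
        try {
          byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
          long h = 0;
          for (int i = 0; i < 8; i++) {
            h = (h << 8) | (d[i] & 0xff);
          }
          return h;
        } catch (NoSuchAlgorithmException e) {
          throw new AssertionError(e);
        }
      }
    }

Note that this trades geographic optimality for stability, matching the slide: a client hashes to the same cluster regardless of where it connects from.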
