Lessons from an Internet-Scale Notification System
Atul Adya
History
● End-client notification system Thialfi
  ○ Presented at SOSP 2011
● Since then:
  ○ Scaled by several orders of magnitude
  ○ Used by many more products and in different ways
● Several unexpected "lessons"
Case for Notifications
Ensuring cached data is fresh across users and devices
[Diagram: a "Colin is online" update propagating to Bob's browser, Alice's notebook, and Phil's phones]
Common Pattern #1: Polling
[Diagram: clients repeatedly asking "Did it change yet?" and almost always hearing "No!"]
Cost and speed issues at scale: 100M clients polling at 10-minute intervals => 166K QPS
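The arithmetic behind that figure, assuming each of the 100M clients polls once every 10 minutes (600 seconds):

100,000,000 clients / 600 s ≈ 166,667 requests per second ≈ 166K QPS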
Common Pattern #2: App pushes updates over point-to-point channels
Complicated for every app to build:
● Bookkeeping: object ids, endpoints, registrations, cursors, ACLs, pending messages
● Plumbing: fan out to endpoints, manage channels, ensure reliable delivery (XMPP, HTTP, GCM)
Our Solution: Thialfi
● Scalable: handles hundreds of millions of clients and objects
● Fast: notifies clients in less than a second
● Reliable: even when entire data centers fail
● Easy to use and deploy: Chrome Sync (Desktop/Android), Google Plus, Contacts, Music, GDrive
Thialfi Programming Overview
[Diagram: clients C1 and C2 each embed the Thialfi client library and Register X; the application backend sends Update X to the Thialfi service in the data center, which records "X: C1, C2" and sends Notify X to both registered clients]
Thialfi Architecture
[Diagram: the client library exchanges registrations, notifications, and acknowledgments with the data center over HTTP/XMPP/GCM; inside the data center, the Registrar and Matcher (each backed by Bigtable) handle client and object state, and a Bridge translates notifications published by the application backend]
● Matcher: object ID → registered clients, latest version
● Registrar: client ID → registered objects, unacked messages
Thialfi Abstraction
● Objects have unique IDs and version numbers, monotonically increasing on every update
● Delivery guarantee: registered clients learn the latest version number
● Reliable signal only: "cached object ID X is at version Y"
(Think "Cache Invalidation")
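A minimal sketch of what the client-facing contract looks like under this abstraction. The interface and method names below are illustrative assumptions (the shipped libraries differ in detail), but they capture the register/invalidate model described above:

```java
// Hypothetical sketch of the Thialfi client-side contract (names are assumptions).
public interface NotificationClient {
  // Express interest in an object; Thialfi keeps this registration in soft state.
  void register(String objectId);
  void unregister(String objectId);
}

public interface NotificationListener {
  // Reliable signal: the registered object is now at (at least) this version.
  // The application refetches its cached copy if its version is older.
  void invalidate(String objectId, long version);

  // The version was lost (e.g., server soft state was rebuilt); refetch unconditionally.
  void invalidateUnknownVersion(String objectId);

  // The server asked the client to replay its registrations (Registration-Sync).
  void reissueRegistrations();
}
```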
Thialfi Characteristics
● Built around soft state
● Recover registration state from clients
● Lost notification signal: InvalidateUnknownVersion
● Registration-Sync (sketched below):
  ○ Exchange hash of registrations between client & server
  ○ Helps in edge cases, async storage, cluster switch
● Multi-Platform:
  ○ Libraries: C++, Java, JavaScript, Objective-C
  ○ OS: Windows/Mac/Linux, browsers, Android, iOS
  ○ Channels: HTTP, XMPP, GCM, Internal-RPC
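A rough sketch of the Registration-Sync idea, assuming a digest over the sorted set of registered object IDs; the class name, digest scheme (SHA-256), and exchange details are illustrative assumptions, not the actual protocol:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedSet;

final class RegistrationSync {
  // Compute an order-independent digest by hashing the sorted registration list.
  static byte[] digest(SortedSet<String> registeredObjectIds) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    for (String objectId : registeredObjectIds) {
      md.update(objectId.getBytes(StandardCharsets.UTF_8));
      md.update((byte) 0);  // separator so {"ab","c"} hashes differently from {"a","bc"}
    }
    return md.digest();
  }

  // The client includes its digest with requests; the server compares it against
  // the digest of its own view and, on mismatch, asks the client to reissue its
  // registrations so the two sides reconverge.
  static boolean inSync(byte[] clientDigest, byte[] serverDigest) {
    return MessageDigest.isEqual(clientDigest, serverDigest);
  }
}
```

Comparing a small digest on every exchange keeps the common case cheap; the full registration list only crosses the wire when the hashes disagree.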
Some Lesions
Ouch! I mean, Lessons
Lesson 1: Is this thing on?
● Launch your system and no one is using it
  ○ How do I know it is working?
● People start using it
  ○ Is it working now?
● You magically know it works 99.999% of the time
  ○ Which 99.999%?
● How to distinguish among ephemeral, disconnected, and buggy clients?
You can never know
Lesson 1: Is this thing on?
What's the best you can do?
● Continuous testing in production
  ○ But may not be able to get client monitoring
● Look at server graphs
  ○ End-to-end, e.g., latency
  ○ More detailed, e.g., reg-sync per client type
Lesson 1: Is this thing on?
● But graphs are not sufficient
  ○ Even when it looks right, averages can be deceptive
  ○ How to know if you are "missing" some traffic?
● Have other ways of getting more reports: customer monitoring, real customers, Twitter, ...
Lesson 2: And you thought you could debug?
● Monitoring indicates that there is a problem
  ○ Server text logs: but hard to correlate
  ○ Structured logging: may have to log selectively
    ■ E.g., cannot log the incoming stream multiple times
  ○ Client logs: typically not available
  ○ Monitoring graphs: but can be too many signals
● Specific user has a problem (needle-in-a-haystack)
  ○ Structured logging - if available
  ○ Custom production code!
War Story: VIP Customer
● Customer unable to receive notifications
● Whole team spent hours looking
● Early on, debugging support was poor
  ○ Text logs - had rolled over
  ○ Structured logs - not there yet
  ○ Persistent state - had no history
● Eventually got "lucky"
  ○ Version numbers were timestamps
  ○ Saw that the last notification "version" was very old
  ○ Deflected the bug
Opportunity: Monitoring & Debugging Tools
● Automated tools to detect anomalies
  ○ Machine-learning based?
● Tools for root-cause analysis
  ○ Which signals to examine when a problem occurs
● Finding needles in a haystack
  ○ Dynamically switch on debugging for a "needle" (see the sketch below)
    ■ E.g., trace a client's registrations and notifications
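One way to realize the "needle" idea is a dynamically updatable allowlist of client IDs for which verbose tracing is emitted; this is a generic sketch under that assumption, not Thialfi's actual tooling:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

final class NeedleTracer {
  private static final Logger log = Logger.getLogger("needle");
  // Client IDs currently under investigation; updated at runtime (e.g., via an
  // admin RPC or config push) without restarting servers or flooding the logs.
  private final Set<String> tracedClients = ConcurrentHashMap.newKeySet();

  void trace(String clientId) { tracedClients.add(clientId); }
  void untrace(String clientId) { tracedClients.remove(clientId); }

  // Called on every registration / notification event; only the "needle"
  // clients produce detailed log lines.
  void maybeLog(String clientId, String event) {
    if (tracedClients.contains(clientId)) {
      log.info(clientId + ": " + event);
    }
  }
}
```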
Lesson 3: Clients considered harmful
● Started out: "Offloading work to clients is good"
● But client code is painful:
  ○ Maintenance burden of multiple platforms
  ○ Upgrades: days, weeks, months, years ... never
  ○ Hurts evolution and agility
War Story: Worldwide crash of Chrome on Android (alpha)
● Switched a flag to deliver messages via a different client code path
● Tested this path extensively with tests
● Unfortunately, our Android code did network access from the main thread on this path
● Newer OS versions than the ones in our tests crashed the application when this happened
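For context, this is the standard Android failure mode: newer OS versions raise NetworkOnMainThreadException when blocking network I/O happens on the main thread, while older versions silently allow it. A minimal sketch of the buggy pattern and the usual fix; the endpoint and method names are hypothetical, not the actual Chrome or Thialfi code:

```java
import java.io.IOException;
import java.net.URL;
import java.util.concurrent.ExecutorService;

class MessageHandler {
  // Illustrative only. On newer Android versions, blocking network I/O on the
  // main thread throws android.os.NetworkOnMainThreadException (a RuntimeException),
  // which crashes the app if uncaught; older versions allowed it, which is why
  // tests on older OS images did not catch the bug.
  void onMessageReceivedBuggy() throws IOException {
    // Hypothetical endpoint; blocking call made directly on the calling (main) thread.
    new URL("https://example.com/ack").openConnection().getInputStream().close();
  }

  // Fix: move the network call off the main thread, e.g. onto a background executor.
  void onMessageReceivedFixed(ExecutorService background) {
    background.execute(() -> {
      try {
        new URL("https://example.com/ack").openConnection().getInputStream().close();
      } catch (IOException e) {
        // log / retry
      }
    });
  }
}
```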
War Story: Strange Reg-Sync Loops
● Discovered unnecessary registrations for a (small) customer
● "Some JavaScript clients are in a Reg-Sync loop"
● Theories: races; a bug in the app, the library, or Closure; ...
● Theory: HTTP clients switching too much
  ○ Nope!
War Story: Buggy Platform
● Logged the platform of every Reg-Sync-looping client
● Found "6.0", and that meant Safari
● Wrote a test but failed to find the bug
● Engineer searched for "safari javascript runtime bug"
● Ran the test in a loop
  ○ SHA-1 hash was not the same in all runs of the loop!
  ○ Safari's JavaScript engine sometimes mis-JITed i++ into ++i
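To see why that miscompilation corrupts a hash, recall that pre- and post-increment return different values when used inside an expression; a contrived Java illustration of the effect, not the actual Safari or library code:

```java
public class IncrementDemo {
  public static void main(String[] args) {
    int i = 0;
    int post = i++;  // post-increment: expression value is the OLD i, so post == 0
    int j = 0;
    int pre = ++j;   // pre-increment: expression value is the NEW j, so pre == 1
    System.out.println(post + " vs " + pre);  // prints "0 vs 1"
    // Inside a hash loop like h = mix(h, data[k++]), substituting ++k shifts every
    // word by one position, so the resulting digest (e.g., SHA-1) comes out different
    // whenever the JIT takes the buggy path.
  }
}
```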
Future direction: "Thin" client
● Move complexity to where it can be maintained
● Remove most code from the client
  ○ Trying to make the library a thin wrapper around the API
● Planning to use Spanner (synchronous store)
● But still keeping the soft-state aspects of Thialfi
Lesson 4: Getting your foot (code) in the door
● Developers will use a system iff it obviously makes things better than doing it on their own
● Clean semantics and reliability are not the selling points you think they are
  ○ Clients care about features, not properties
Lesson 4: Getting your foot (code) in the door
● May need "unclean" features to get customers
  ○ Best-effort data along with versions
  ○ Support special object ids for users
  ○ Added a new server (Bridge) for translating messages
● Customers may not be able to meet your strong requirements
  ○ Version numbers not feasible for many systems
  ○ Allow time instead of version numbers (see the sketch below)
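One way to read "time instead of version numbers": a publisher that cannot maintain a per-object counter uses the update's wall-clock timestamp as the monotonically increasing version. A hedged sketch; the Publisher interface and notify call are assumed shapes, not the real API:

```java
// Hypothetical publisher-side shape; "Publisher" and "notify" are assumptions.
interface Publisher {
  void notify(String objectId, long version);
}

class TimeAsVersion {
  static void publishUpdate(Publisher publisher, String objectId) {
    // The update's wall-clock time doubles as the version. Caveat: clocks can
    // step backwards (NTP corrections, server failover), so this is only
    // approximately monotonic.
    publisher.notify(objectId, System.currentTimeMillis());
  }
}
```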
Lesson 4: Getting your foot (code) in the door
● Understand their architecture and review their code when integrating with your system
  ○ "Error" path broken: invalidateUnknownVersion
  ○ Naming matters: renamed it to mustResync
● Know where your customer's code is, so that you can migrate them to newer infrastructure
● Debugging tools are also needed for "bug deflection"
War Story: "Thialfi is unreliable"
● A team used Thialfi as a reliable "backup" path to augment their unreliable "fast" path
● Experienced an outage when their fast path became really unreliable
● Informed us that Thialfi was dropping notifications!
● Investigation revealed:
  ○ Under stress, the backend dropped messages on their path and gave up publishing into Thialfi after a few retries
Lesson 5: You are building your castle on sand
● You will do a reasonable job thinking through your own design, protocols, failures, etc.
● Your outage is likely to come from a violation of one of your assumptions, or from another system several levels of dependencies away
War Story: Delayed replication in Chrome Sync
● A Chrome backend dependency stopped sending notifications to Thialfi
● When it unwedged, traffic went up by more than 3x; we only had capacity for 2x
[Graph: incoming feed QPS spiking once the dependency recovered]
War Story: Delayed replication in Chrome Sync
● Good news: internal latency remained low and the system did not fall over
● Bad news: end-to-end latency spiked to minutes for all customers
● Isolation was not strong enough: not only Chrome Sync but all customers saw elevated latency
Opportunity: Resource Isolation
● Need the ability to isolate customers from each other
● A general problem for shared infrastructure services
War Story: Load balancer config change
● Thialfi needs clients to be stable w.r.t. clusters
  ○ Not globally reshuffled during a single-cluster outage
● A change to the inter-cluster load balancer config removed ad hoc cluster stickiness
  ○ Previously discussed with the owning team
● The config change caused a large-scale loss of cluster stickiness for clients
War Story: Load balancer config change
[Graph: number of active clients exploding after the change]
● Client flapping between clusters caused an explosion in the number of active clients
  ○ The same client was using resources many times over
Fix: Consistent hash routing
● Reverted the load balancer config change
● Use consistent hashing for cluster selection (sketched below)
  ○ Route each client based on its client id
  ○ Not geographically optimal
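A minimal sketch of consistent-hash cluster selection of the kind described: each cluster occupies several points on a hash ring, and a client is routed to the first cluster at or after the hash of its client id, so routing stays stable as long as the cluster set does. The hash function and virtual-node count are illustrative assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

final class ConsistentClusterRouter {
  private final TreeMap<Long, String> ring = new TreeMap<>();

  ConsistentClusterRouter(List<String> clusters, int virtualNodesPerCluster) {
    for (String cluster : clusters) {
      for (int v = 0; v < virtualNodesPerCluster; v++) {
        ring.put(hash(cluster + "#" + v), cluster);
      }
    }
  }

  // Route a client to the first cluster clockwise from the hash of its id.
  // The same client id always maps to the same cluster, independent of which
  // frontend or load balancer handled the request.
  String clusterFor(String clientId) {
    long h = hash(clientId);
    Map.Entry<Long, String> e = ring.ceilingEntry(h);
    return (e != null ? e : ring.firstEntry()).getValue();
  }

  private static long hash(String s) {
    CRC32 crc = new CRC32();  // illustrative; a production system would use a stronger hash
    crc.update(s.getBytes(StandardCharsets.UTF_8));
    return crc.getValue();
  }
}
```

The trade-off matches the slide: hashing on client id gives stability across load balancer changes and single-cluster outages, at the cost of not always picking the geographically closest cluster.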