Lessons from an Internet-Scale Notification System
Atul Adya
History
● End-client notification system Thialfi
  ○ Presented at SOSP 2011
● Since then:
  ○ Scaled by several orders of magnitude
  ○ Used by many more products and in different ways
● Several unexpected "lessons"
Case for Notifications
Ensuring cached data is fresh across users and devices
[Diagram: a "Colin is online" update propagating to Bob's browser, Alice's notebook, and Phil's phones]
Common Pattern #1: Polling
[Diagram: clients repeatedly asking "Did it change yet?" and almost always hearing "No!"]
Cost and speed issues at scale: 100M clients polling at 10-minute intervals => 166K QPS
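The arithmetic behind that figure, assuming each of the 100M clients polls once every 10 minutes (600 seconds):

100,000,000 clients / 600 s ≈ 166,667 requests per second ≈ 166K QPS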
Common Pattern #2: App pushes updates over point-to-point channels
Complicated for every app to build:
● Bookkeeping: object ids, endpoints, registrations, cursors, ACLs, pending messages
● Plumbing: fan out to endpoints, manage channels, ensure reliable delivery (XMPP, HTTP, GCM)
Our Solution: Thialfi
● Scalable: handles hundreds of millions of clients and objects
● Fast: notifies clients in less than a second
● Reliable: even when entire data centers fail
● Easy to use and deploy: Chrome Sync (Desktop/Android), Google Plus, Contacts, Music, GDrive
Thialfi Programming Overview
[Diagram: clients C1 and C2 each embed the Thialfi client library and Register X; the application backend sends Update X to the Thialfi service in the data center, which records "X: C1, C2" and sends Notify X to both registered clients]
Thialfi Architecture
[Diagram: the client library exchanges registrations, notifications, and acknowledgments with the data center over HTTP/XMPP/GCM; inside the data center, the Registrar and Matcher (each backed by Bigtable) handle client and object state, and a Bridge translates notifications published by the application backend]
● Matcher: object ID → registered clients, latest version
● Registrar: client ID → registered objects, unacked messages
Thialfi Abstraction
● Objects have unique IDs and version numbers, monotonically increasing on every update
● Delivery guarantee: registered clients learn the latest version number
● Reliable signal only: "cached object ID X is at version Y"
(Think "Cache Invalidation")
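A minimal sketch of what the client-facing contract looks like under this abstraction. The interface and method names below are illustrative assumptions (the shipped libraries differ in detail), but they capture the register/invalidate model described above:

```java
// Hypothetical sketch of the Thialfi client-side contract (names are assumptions).
public interface NotificationClient {
  // Express interest in an object; Thialfi keeps this registration in soft state.
  void register(String objectId);
  void unregister(String objectId);
}

public interface NotificationListener {
  // Reliable signal: the registered object is now at (at least) this version.
  // The application refetches its cached copy if its version is older.
  void invalidate(String objectId, long version);

  // The version was lost (e.g., server soft state was rebuilt); refetch unconditionally.
  void invalidateUnknownVersion(String objectId);

  // The server asked the client to replay its registrations (Registration-Sync).
  void reissueRegistrations();
}
```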
Thialfi Characteristics
● Built around soft state
● Recover registration state from clients
● Lost notification signal: InvalidateUnknownVersion
● Registration-Sync (sketched below):
  ○ Exchange hash of registrations between client & server
  ○ Helps in edge cases, async storage, cluster switch
● Multi-Platform:
  ○ Libraries: C++, Java, JavaScript, Objective-C
  ○ OS: Windows/Mac/Linux, browsers, Android, iOS
  ○ Channels: HTTP, XMPP, GCM, Internal-RPC
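A rough sketch of the Registration-Sync idea, assuming a digest over the sorted set of registered object IDs; the class name, digest scheme (SHA-256), and exchange details are illustrative assumptions, not the actual protocol:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedSet;

final class RegistrationSync {
  // Compute an order-independent digest by hashing the sorted registration list.
  static byte[] digest(SortedSet<String> registeredObjectIds) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    for (String objectId : registeredObjectIds) {
      md.update(objectId.getBytes(StandardCharsets.UTF_8));
      md.update((byte) 0);  // separator so {"ab","c"} hashes differently from {"a","bc"}
    }
    return md.digest();
  }

  // The client includes its digest with requests; the server compares it against
  // the digest of its own view and, on mismatch, asks the client to reissue its
  // registrations so the two sides reconverge.
  static boolean inSync(byte[] clientDigest, byte[] serverDigest) {
    return MessageDigest.isEqual(clientDigest, serverDigest);
  }
}
```

Comparing a small digest on every exchange keeps the common case cheap; the full registration list only crosses the wire when the hashes disagree.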
Some Lesions
Ouch! I mean, Lessons
Lesson 1: Is this thing on?
● Launch your system and no one is using it
  ○ How do I know it is working?
● People start using it
  ○ Is it working now?
● You magically know it works 99.999% of the time
  ○ Which 99.999%?
● How to distinguish among ephemeral, disconnected, and buggy clients?
You can never know
Lesson 1: Is this thing on?
What's the best you can do?
● Continuous testing in production
  ○ But may not be able to get client monitoring
● Look at server graphs
  ○ End-to-end, e.g., latency
  ○ More detailed, e.g., reg-sync per client type
Lesson 1: Is this thing on?
● But graphs are not sufficient
  ○ Even when it looks right, averages can be deceptive
  ○ How to know if you are "missing" some traffic?
● Have other ways of getting more reports: customer monitoring, real customers, Twitter, ...
Lesson 2: And you thought you could debug?
● Monitoring indicates that there is a problem
  ○ Server text logs: but hard to correlate
  ○ Structured logging: may have to log selectively
    ■ E.g., cannot log the incoming stream multiple times
  ○ Client logs: typically not available
  ○ Monitoring graphs: but can be too many signals
● Specific user has a problem (needle-in-a-haystack)
  ○ Structured logging - if available
  ○ Custom production code!
War Story: VIP Customer
● Customer unable to receive notifications
● Whole team spent hours looking
● Early on, debugging support was poor
  ○ Text logs - had rolled over
  ○ Structured logs - not there yet
  ○ Persistent state - had no history
● Eventually got "lucky"
  ○ Version numbers were timestamps
  ○ Saw that the last notification "version" was very old
  ○ Deflected the bug
Opportunity: Monitoring & Debugging Tools
● Automated tools to detect anomalies
  ○ Machine-learning based?
● Tools for root-cause analysis
  ○ Which signals to examine when a problem occurs
● Finding needles in a haystack
  ○ Dynamically switch on debugging for a "needle" (see the sketch below)
    ■ E.g., trace a client's registrations and notifications
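One way to realize the "needle" idea is a dynamically updatable allowlist of client IDs for which verbose tracing is emitted; this is a generic sketch under that assumption, not Thialfi's actual tooling:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

final class NeedleTracer {
  private static final Logger log = Logger.getLogger("needle");
  // Client IDs currently under investigation; updated at runtime (e.g., via an
  // admin RPC or config push) without restarting servers or flooding the logs.
  private final Set<String> tracedClients = ConcurrentHashMap.newKeySet();

  void trace(String clientId) { tracedClients.add(clientId); }
  void untrace(String clientId) { tracedClients.remove(clientId); }

  // Called on every registration / notification event; only the "needle"
  // clients produce detailed log lines.
  void maybeLog(String clientId, String event) {
    if (tracedClients.contains(clientId)) {
      log.info(clientId + ": " + event);
    }
  }
}
```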
Lesson 3: Clients considered harmful
● Started out: "Offloading work to clients is good"
● But client code is painful:
  ○ Maintenance burden of multiple platforms
  ○ Upgrades: days, weeks, months, years ... never
  ○ Hurts evolution and agility
War Story: Worldwide crash of Chrome on Android (alpha)
● Switched a flag to deliver messages via a different client code path
● Tested this path extensively with tests
● Unfortunately, our Android code did network access from the main thread on this path
● Newer OS versions than the ones in our tests crashed the application when this happened
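For context, this is the standard Android failure mode: newer OS versions raise NetworkOnMainThreadException when blocking network I/O happens on the main thread, while older versions silently allow it. A minimal sketch of the buggy pattern and the usual fix; the endpoint and method names are hypothetical, not the actual Chrome or Thialfi code:

```java
import java.io.IOException;
import java.net.URL;
import java.util.concurrent.ExecutorService;

class MessageHandler {
  // Illustrative only. On newer Android versions, blocking network I/O on the
  // main thread throws android.os.NetworkOnMainThreadException (a RuntimeException),
  // which crashes the app if uncaught; older versions allowed it, which is why
  // tests on older OS images did not catch the bug.
  void onMessageReceivedBuggy() throws IOException {
    // Hypothetical endpoint; blocking call made directly on the calling (main) thread.
    new URL("https://example.com/ack").openConnection().getInputStream().close();
  }

  // Fix: move the network call off the main thread, e.g. onto a background executor.
  void onMessageReceivedFixed(ExecutorService background) {
    background.execute(() -> {
      try {
        new URL("https://example.com/ack").openConnection().getInputStream().close();
      } catch (IOException e) {
        // log / retry
      }
    });
  }
}
```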
War Story: Strange Reg-Sync Loops
● Discovered unnecessary registrations for a (small) customer
● "Some JavaScript clients are in a Reg-Sync loop"
● Theories: races; a bug in the app, the library, or Closure; ...
● Theory: HTTP clients switching too much
  ○ Nope!
War Story: Buggy Platform
● Logged the platform of every Reg-Sync-looping client
● Found "6.0", and that meant Safari
● Wrote a test but failed to find the bug
● Engineer searched for "safari javascript runtime bug"
● Ran the test in a loop
  ○ SHA-1 hash was not the same in all runs of the loop!
  ○ Safari's JavaScript engine sometimes mis-JITed i++ into ++i
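To see why that miscompilation corrupts a hash, recall that pre- and post-increment return different values when used inside an expression; a contrived Java illustration of the effect, not the actual Safari or library code:

```java
public class IncrementDemo {
  public static void main(String[] args) {
    int i = 0;
    int post = i++;  // post-increment: expression value is the OLD i, so post == 0
    int j = 0;
    int pre = ++j;   // pre-increment: expression value is the NEW j, so pre == 1
    System.out.println(post + " vs " + pre);  // prints "0 vs 1"
    // Inside a hash loop like h = mix(h, data[k++]), substituting ++k shifts every
    // word by one position, so the resulting digest (e.g., SHA-1) comes out different
    // whenever the JIT takes the buggy path.
  }
}
```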
Future direction: "Thin" client
● Move complexity to where it can be maintained
● Remove most code from the client
  ○ Trying to make the library a thin wrapper around the API
● Planning to use Spanner (synchronous store)
● But still keeping the soft-state aspects of Thialfi
Lesson 4: Getting your foot (code) in the door
● Developers will use a system iff it obviously makes things better than doing it on their own
● Clean semantics and reliability are not the selling points you think they are
  ○ Clients care about features, not properties
Lesson 4: Getting your foot (code) in the door
● May need "unclean" features to get customers
  ○ Best-effort data along with versions
  ○ Support special object ids for users
  ○ Added a new server (Bridge) for translating messages
● Customers may not be able to meet your strong requirements
  ○ Version numbers not feasible for many systems
  ○ Allow time instead of version numbers (see the sketch below)
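One way to read "time instead of version numbers": a publisher that cannot maintain a per-object counter uses the update's wall-clock timestamp as the monotonically increasing version. A hedged sketch; the Publisher interface and notify call are assumed shapes, not the real API:

```java
// Hypothetical publisher-side shape; "Publisher" and "notify" are assumptions.
interface Publisher {
  void notify(String objectId, long version);
}

class TimeAsVersion {
  static void publishUpdate(Publisher publisher, String objectId) {
    // The update's wall-clock time doubles as the version. Caveat: clocks can
    // step backwards (NTP corrections, server failover), so this is only
    // approximately monotonic.
    publisher.notify(objectId, System.currentTimeMillis());
  }
}
```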
Lesson 4: Getting your foot (code) in the door
● Understand their architecture and review their code when integrating with your system
  ○ "Error" path broken: invalidateUnknownVersion
  ○ Naming matters: renamed it to mustResync
● Know where your customer's code is, so that you can migrate them to newer infrastructure
● Debugging tools are also needed for "bug deflection"
War Story: "Thialfi is unreliable"
● A team used Thialfi as a reliable "backup" path to augment their unreliable "fast" path
● Experienced an outage when their fast path became really unreliable
● Informed us that Thialfi was dropping notifications!
● Investigation revealed:
  ○ Under stress, the backend dropped messages on their path and gave up publishing into Thialfi after a few retries
Lesson 5: You are building your castle on sand
● You will do a reasonable job thinking through your own design, protocols, failures, etc.
● Your outage is likely to come from a violation of one of your assumptions, or from another system several levels of dependencies away
War Story: Delayed replication in Chrome Sync
● A Chrome backend dependency stopped sending notifications to Thialfi
● When it unwedged, traffic went up by more than 3x; we only had capacity for 2x
[Graph: incoming feed QPS spiking once the dependency recovered]
War Story: Delayed replication in Chrome Sync
● Good news: internal latency remained low and the system did not fall over
● Bad news: end-to-end latency spiked to minutes for all customers
● Isolation was not strong enough: not only Chrome Sync but all customers saw elevated latency
Opportunity: Resource Isolation
● Need the ability to isolate customers from each other
● A general problem for shared infrastructure services
War Story: Load balancer config change
● Thialfi needs clients to be stable w.r.t. clusters
  ○ Not globally reshuffled during a single-cluster outage
● A change to the inter-cluster load balancer config removed ad hoc cluster stickiness
  ○ Previously discussed with the owning team
● The config change caused a large-scale loss of cluster stickiness for clients
War Story: Load balancer config change
[Graph: number of active clients exploding after the change]
● Client flapping between clusters caused an explosion in the number of active clients
  ○ The same client was using resources many times over
Fix: Consistent hash routing
● Reverted the load balancer config change
● Use consistent hashing for cluster selection (sketched below)
  ○ Route each client based on its client id
  ○ Not geographically optimal
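A minimal sketch of consistent-hash cluster selection of the kind described: each cluster occupies several points on a hash ring, and a client is routed to the first cluster at or after the hash of its client id, so routing stays stable as long as the cluster set does. The hash function and virtual-node count are illustrative assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

final class ConsistentClusterRouter {
  private final TreeMap<Long, String> ring = new TreeMap<>();

  ConsistentClusterRouter(List<String> clusters, int virtualNodesPerCluster) {
    for (String cluster : clusters) {
      for (int v = 0; v < virtualNodesPerCluster; v++) {
        ring.put(hash(cluster + "#" + v), cluster);
      }
    }
  }

  // Route a client to the first cluster clockwise from the hash of its id.
  // The same client id always maps to the same cluster, independent of which
  // frontend or load balancer handled the request.
  String clusterFor(String clientId) {
    long h = hash(clientId);
    Map.Entry<Long, String> e = ring.ceilingEntry(h);
    return (e != null ? e : ring.firstEntry()).getValue();
  }

  private static long hash(String s) {
    CRC32 crc = new CRC32();  // illustrative; a production system would use a stronger hash
    crc.update(s.getBytes(StandardCharsets.UTF_8));
    return crc.getValue();
  }
}
```

The trade-off matches the slide: hashing on client id gives stability across load balancer changes and single-cluster outages, at the cost of not always picking the geographically closest cluster.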