Honey, I shrunk the database! Resilience and recoverability in Cloud Native services
Jeffrey Farber, Sidney Shek
Cloud infrastructure = reliable services, right?
SUPER RESILIENT CLOUD-BASED ARCHITECTURE
Canary Progressive Rollouts
Distributed Blue-Green Deployments
Multi-Region Cassandra Database
5-minute backups
WITH LOTS OF DEPENDENCIES
Reliability Promise = 99.999%
Recovery Time Objective = 1 hour
Recovery Point Objective = 5 minutes
Until…
BET YOU DIDN’T SEE THIS COMING (THIS IS A TRUE STORY)
WE RESTORE… a 2-hour-old snapshot
WE RESTORE… BUT WHAT ABOUT OUR DEPENDENCIES?
Lost data?
Wrong (?) data?
May have correct data, but how to sync?
10x normal load?
Big 💪 will happen. Accept and plan for it.
Emergent behaviour: our systems are complex and unpredictable.
Broad-spectrum solutions: incorporate general recovery methods to handle the unexpected.
PATTERNS FOR GUARDING AGAINST THE IMPOSSIBLE
1. Event sourcing: minimize data loss after a restore.
2. Easily recoverable replication: get downstream systems back in sync.
3. Local and distributed redundancy: having fallbacks for fallbacks for fallbacks…
Event sourcing Minimize data loss after a restore
Let’s add recovery here
Event Sourcing: Goals
Minimize data loss of a DB restore (RPO): critical applications can't afford hours of data loss.
Recover from bugs ruining data: databases can replicate your data across regions. They also replicate your bugs.
Accuracy & time (RTO): we need confidence in our restored data, and we need it quickly!
Event Sourcing: Write Events
Before: the service reads and writes the main database directly (INSERT INTO table, DELETE FROM table, SELECT FROM table).
With event sourcing, writes split from reads: each write is INSERTed/APPENDed as a command into a historical command store, and the events are then replayed/applied to the main database; reads still SELECT FROM the main database.
MAKE SURE THE COMMAND STORE IS AN INDEPENDENT STORE!
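One way to wire this write path, sketched in Python with in-memory stand-ins for the two stores; the function names, sequence scheme, and command shapes are illustrative assumptions, not the service's actual code.

import time

# In-memory stand-ins. In a real deployment the command store MUST be
# independent infrastructure from the main database, so a bug or outage
# that ruins one does not ruin the other.
command_store: list[dict] = []            # append-only historical command store
main_db: dict[tuple, dict] = {}           # main datastore keyed by (user, resource)

def write_event(stream: str, command_type: str, params: dict, actor: str) -> dict:
    """Append the command to the command store first, then apply it to the main DB."""
    command = {
        "sequence": int(time.time() * 1000),   # placeholder; see the sequence options a few slides later
        "stream": stream,
        "commandType": command_type,
        "params": params,
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "actor": actor,
    }
    command_store.append(command)          # INSERT/APPEND into the command store
    apply_command(command)                 # project the event onto the main database
    return command

def apply_command(command: dict) -> None:
    """Apply one command to the main datastore (the read model)."""
    params = command["params"]
    key = (params["user"], params["resource"])
    if command["commandType"] == "grant_permission":
        main_db.setdefault(key, {"permissions": []})["permissions"].append(params["permission"])
    elif command["commandType"] == "revoke_permission":
        row = main_db.get(key)
        if row and params["permission"] in row["permissions"]:
            row["permissions"].remove(params["permission"])

# Example write, mirroring the command record shown on the next slide:
write_event("user123", "grant_permission",
            {"user": "user123", "resource": "issueABC", "permission": "view"},
            "jira_share_service")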
Event Sourcing: Write Events, an example command record
[
  {
    "sequence": 20,
    "stream": "user123",
    "commandType": "grant_permission",
    "params": {
      "user": "user123",
      "resource": "issueABC",
      "permission": "view"
    },
    "timestamp": "2019-07-24 3:30 PM",
    "actor": "jira_share_service"
  }
]
1. "sequence": order all writes, so we can replay them in the same order.
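For readers following along in code, a rough Python typing of the record above; the field names come straight from the JSON, while the types are inferred assumptions.

from typing import TypedDict

class Params(TypedDict):
    user: str
    resource: str
    permission: str

class Command(TypedDict):
    sequence: int          # orders all writes so replay preserves order (point 1)
    stream: str            # partitions ordering, e.g. per user (point 2, later slides)
    commandType: str       # e.g. "grant_permission"
    params: Params
    timestamp: str         # e.g. "2019-07-24 3:30 PM"
    actor: str             # e.g. "jira_share_service"

example: Command = {
    "sequence": 20,
    "stream": "user123",
    "commandType": "grant_permission",
    "params": {"user": "user123", "resource": "issueABC", "permission": "view"},
    "timestamp": "2019-07-24 3:30 PM",
    "actor": "jira_share_service",
}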
"sequence": 20, Order all writes, so we can replay in same order 1 Event Sourcing S TRICTLY ORDERED Goals Write Events Stream Sequence user123 19 Generating SET sequence = sequence + 1 WHERE sequence = 19 Recovery sequences -> 19, 20, 21...
"sequence": 20, Order all writes, so we can replay in same order 1 Event Sourcing M OSTLY ORDERED Goals sequence = {timestamp}{unique_node_id} Write Events sequences -> 1565312340, 1565323450, ... Generating Recovery
"sequence": 20, Order all writes, so we can replay in same order 1 Event Sourcing M OSTLY ORDERED Goals sequence = {timestamp}{unique_node_id} Write Events sequences -> 1565312340, 1565323450, ... Generating No SPOF (database) Only certain writes need strict ordering Clock skew window is small (< 1 sec) Recovery Don’t know previous sequence
"sequence": 20, Order all writes, so we can replay in same order 1 Event Sourcing M OSTLY ORDERED + CUSTOMER - DICTATED STRICT ORDERING Goals write2 write1 ?after={timestamp1} Write Events /create /delete Generating {sequence/timestamp1} {sequence/timestamp2} Recovery timestamp2 > timestamp1
2. "stream": streams guarantee order; parallelize across streams.
stream user123: {sequence: 20, commandType: "grant", permission: "view"}, {sequence: 21, commandType: "revoke", permission: "view"}
stream user456: {sequence: 74, commandType: "grant", permission: "edit"}, {sequence: 75, commandType: "revoke", permission: "edit"}
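A small sketch of that partitioning, using the two streams from the slide: ordering is enforced only within a stream, so independent streams can be processed side by side.

from collections import defaultdict

events = [
    {"stream": "user123", "sequence": 20, "commandType": "grant", "permission": "view"},
    {"stream": "user123", "sequence": 21, "commandType": "revoke", "permission": "view"},
    {"stream": "user456", "sequence": 74, "commandType": "grant", "permission": "edit"},
    {"stream": "user456", "sequence": 75, "commandType": "revoke", "permission": "edit"},
]

# Partition by stream, then sort each stream's commands by sequence.
by_stream: dict[str, list[dict]] = defaultdict(list)
for event in events:
    by_stream[event["stream"]].append(event)
for commands in by_stream.values():
    commands.sort(key=lambda c: c["sequence"])

# user123's grant/revoke must apply in order 20, 21; user456 (74, 75) can be
# processed at the same time because the streams are independent.
print({stream: [c["sequence"] for c in commands] for stream, commands in by_stream.items()})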
Event Sourcing: Recovery
1. Restore the snapshot.
2. Recover streams in parallel, replaying the events newer than the snapshot:
   user123: snapshot at 19, replay 20, 21
   user456: snapshot at 73, replay 74, 75, 76, ...
Bonus: process all of a stream's events in-memory.
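A hedged sketch of this recovery flow, assuming the restored snapshot records the last applied sequence per stream and that newer commands survive in the independent command store; apply_command stands for the same projection logic as in the earlier write-path sketch.

from concurrent.futures import ThreadPoolExecutor

def recover(snapshot_sequences: dict[str, int],
            command_store: list[dict],
            apply_command) -> None:
    """Replay commands newer than the restored snapshot, one stream at a time.

    Within a stream, events are applied strictly in sequence order; across
    streams, replay is parallelised because streams are independent.
    """
    def replay_stream(stream: str) -> None:
        last_applied = snapshot_sequences.get(stream, 0)
        pending = sorted(
            (c for c in command_store
             if c["stream"] == stream and c["sequence"] > last_applied),
            key=lambda c: c["sequence"],
        )
        for command in pending:   # e.g. user123: 20, 21 after a snapshot at 19
            apply_command(command)

    streams = {c["stream"] for c in command_store}
    with ThreadPoolExecutor() as pool:
        list(pool.map(replay_stream, streams))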
Event Sourcing: Recovery, replaying stream "user123" into the main datastore
Events for stream "user123" in the command store: sequence 20 (grant_permission: user "user123", resource "issueABC", permission "view") and sequence 21 (grant_permission, truncated on the slide).
Main datastore before replay: Stream/Sequence table has user123 at sequence 19; User/Resource/Permissions table has user123, issueABC, [].
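To make the diagram concrete, a sketch of replaying those events onto the restored rows; the table shapes are assumptions, and because the slide truncates the second event its params below are placeholders.

# Restored snapshot state for stream "user123" (last applied sequence: 19).
stream_sequences = {"user123": 19}
permissions_table = {("user123", "issueABC"): []}   # User / Resource / Permissions

events = [
    {"stream": "user123", "sequence": 20, "commandType": "grant_permission",
     "params": {"user": "user123", "resource": "issueABC", "permission": "view"}},
    {"stream": "user123", "sequence": 21, "commandType": "grant_permission",
     "params": {"user": "user123", "resource": "issueABC", "permission": "edit"}},  # placeholder params; slide truncates this event
]

for event in sorted(events, key=lambda e: e["sequence"]):
    if event["sequence"] <= stream_sequences[event["stream"]]:
        continue   # already reflected in the restored snapshot
    params = event["params"]
    if event["commandType"] == "grant_permission":
        permissions_table[(params["user"], params["resource"])].append(params["permission"])
    stream_sequences[event["stream"]] = event["sequence"]

print(permissions_table)   # {('user123', 'issueABC'): ['view', 'edit']}
print(stream_sequences)    # {'user123': 21}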