Autopsy of an automation disaster
Simon J Mudd (Senior Database Engineer)
Percona Live, 25th April 2017

To err is human
To really foul things up requires a computer [1] (or a script)

[1]: http://quoteinvestigator.com/2010/12/07/foul-computer/

Booking.com
● Based in Amsterdam since 1996
● Online Hotel/Accommodation/Travel Agent (OTA):
  ● +1,200,000 properties in 227 countries
  ● +1,200,000 room nights reserved daily
  ● +40 languages (website and customer service)
  ● +15,700 people working in 187 offices worldwide
  ● Part of the Priceline Group, PCLN on Nasdaq
● And we use MySQL:
  ● Thousands (1000s) of servers, ~90% replicating
  ● >150 masters: ~30 with >50 slaves & ~10 with >100 slaves

Session Summary
1. MySQL replication at Booking.com
2. Automation disaster: external eye
3. Chain of events: analysis
4. Learning / takeaway

MySQL replication at Booking.com
● Typical MySQL replication deployment at Booking.com:

                         +---+
                         | M |
                         +---+
                           |
  +------+-- ... --+---------------+-------- ...
  |      |         |               |
+---+  +---+     +---+           +---+
| S1|  | S2|     | Sn|           | M1|
+---+  +---+     +---+           +---+
                                   |
                              +-- ... --+
                              |         |
                            +---+     +---+
                            | T1|     | Tm|
                            +---+     +---+

MySQL replication at Booking.com’
● We use and contribute to Orchestrator:

MySQL replication at Booking.com’’
● Orchestrator allows us to:
  ● Visualize our replication deployments
  ● Move slaves for planned maintenance of an intermediate master
  ● Automatically replace an intermediate master in case of its unexpected failure (thanks to pseudo-GTIDs when we have not deployed GTIDs)

MySQL replication at Booking.com’’
● Orchestrator allows us to:
  ● Visualize our replication deployments
  ● Move slaves for planned maintenance of an intermediate master
  ● Automatically replace an intermediate master in case of its unexpected failure (thanks to pseudo-GTIDs where we have not deployed GTIDs; Orchestrator handles both types)
  ● Automatically replace a master in case of a failure (failing over to a slave)

MySQL replication at Booking.com’’
● Orchestrator allows us to:
  ● Visualize our replication deployments
  ● Move slaves for planned maintenance of an intermediate master
  ● Automatically replace an intermediate master in case of its unexpected failure (thanks to pseudo-GTIDs when we have not deployed GTIDs)
  ● Automatically replace a master in case of a failure (failing over to a slave)
● But Orchestrator cannot replace a master alone:
  ● Booking.com uses DNS for master discovery
  ● So Orchestrator calls custom hooks (a script) to repoint DNS (and to do other magic)

MySQL replication at Booking.com’’
● We also contribute to Orchestrator
  ● To allow for better scaling
  ● To improve integration with our own tooling
  ● To ensure that it can work on all our systems (MySQL, MariaDB) with GTID or pseudo-GTID
  ● To ensure it can provide us with an HA service
● Shlomi has not stopped improving it and others contribute too

MySQL replication at Booking.com’’
● So it can handle this:

Our subject database
● Simple replication deployment (in two data centers):

DNS (master)     +---+
points here -->  | A |
                 +---+
                   |
                   +------------------------+
                   |                        |
Reads            +---+                    +---+
happen here -->  | B |                    | X |
                 +---+                    +---+
                                            |
                                          +---+  And reads
                                          | Y |  <-- happen here
                                          +---+

Split brain: 1st event
● A and B (two servers in same data center) fail at the same time:

DNS (master)     +\-/+
points here -->  | A |
but accesses     +/-\+
are now failing
Reads            +\-/+                    +---+
happen here -->  | B |                    | X |
but accesses     +/-\+                    +---+
are now failing                             |
                                          +---+  And reads
                                          | Y |  <-- happen here
                                          +---+

(I will cover how/why this happened later.)

Split brain: 1st event’
● Orchestrator fixes things:

                 +\-/+
                 | A |
                 +/-\+

Reads            +\-/+                    +---+  Now, DNS (master)
happen here -->  | B |                    | X |  <-- points here
but accesses     +/-\+                    +---+
are now failing                             |
                                          +---+  Reads
                                          | Y |  <-- happen here
                                          +---+

Split brain: disaster
● A few things happen overnight and we wake up to this:

                 +\-/+
                 | A |
                 +/-\+

DNS              +---+                    +---+
points here -->  | B |                    | X |
                 +---+                    +---+
                                            |
                                          +---+
                                          | Y |
                                          +---+

Split brain: disaster’
● And to make things worse, reads are still happening on Y:

                 +\-/+
                 | A |
                 +/-\+

DNS (master)     +---+                    +---+
points here -->  | B |                    | X |
                 +---+                    +---+
                                            |
                                          +---+  Reads
                                          | Y |  <-- happen here
                                          +---+

Split brain: disaster’’
● This is not good:
  ● When A and B failed, X was promoted as the new master
  ● Something made DNS point to B (we will see what later) → writes are now happening on B
  ● But B is outdated: all writes to X (after the failure of A) did not reach B
  ● So we have data on X that cannot be read on B
  ● And we have new data on B that is not read on Y

                 +\-/+
                 | A |
                 +/-\+

DNS (master)     +---+                    +---+
points here -->  | B |                    | X |
                 +---+                    +---+
                                            |
                                          +---+  Reads
                                          | Y |  <-- happen here
                                          +---+
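The divergence described on this slide can be sketched with a toy model. This is an illustration only, not anything from the incident itself: each write is an integer ID, each server is the set of writes it has applied, and the write IDs and the function name `simulate_split_brain` are invented for the example.

```python
# Toy model (illustration only): each write is an integer ID and each
# server is represented by the set of writes it has applied.

def simulate_split_brain():
    shared = {1, 2, 3}  # writes replicated everywhere before the failure
    b, x, y = set(shared), set(shared), set(shared)

    # A and B fail; X is promoted and Y keeps replicating from X.
    for write in (4, 5):
        x.add(write)
        y.add(write)

    # DNS is later repointed to B, so new writes land on B only.
    for write in (6, 7):
        b.add(write)

    return {"B": b, "X": x, "Y": y}


state = simulate_split_brain()
missing_on_b = state["X"] - state["B"]     # written to X, never reached B
unreadable_on_y = state["B"] - state["Y"]  # written to B, invisible to readers on Y
print(sorted(missing_on_b), sorted(unreadable_on_y))
```

Neither set is empty, so neither B nor X holds all the data: that is the split brain.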

Split-brain: analysis
● Digging more in the chain of events, we find that:
  ● After the 1st failure of A, a 2nd failure was detected and Orchestrator promoted B
  ● So after their failures, A and B came back and formed an isolated replication chain

                 +\-/+
                 | A |
                 +/-\+

                 +\-/+                    +---+  DNS (master)
                 | B |                    | X |  <-- points here
                 +/-\+                    +---+
                                            |
                                          +---+  Reads
                                          | Y |  <-- happen here
                                          +---+

Split-brain: analysis
● Digging more in the chain of events, we find that:
  ● After the 1st failure of A, a 2nd failure was detected and Orchestrator promoted B
  ● So after their failures, A and B came back and formed an isolated replication chain
  ● And something caused a failure of A

                 +---+
                 | A |
                 +---+
                   |
                 +---+                    +---+  DNS (master)
                 | B |                    | X |  <-- points here
                 +---+                    +---+
                                            |
                                          +---+  Reads
                                          | Y |  <-- happen here
                                          +---+

Split-brain: analysis
● Digging more in the chain of events, we find that:
  ● After the 1st failure of A, a 2nd failure was detected and Orchestrator promoted B
  ● So after their failures, A and B came back and formed an isolated replication chain
  ● And something caused a failure of A
● But how did DNS end up pointing to B?
  ● The failover to B called the DNS repointing script
  ● The script stole the DNS entry from X and pointed it to B

                 +\-/+
                 | A |
                 +/-\+

                 +---+                    +---+  DNS (master)
                 | B |                    | X |  <-- points here
                 +---+                    +---+
                                            |
                                          +---+  Reads
                                          | Y |  <-- happen here
                                          +---+
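One way to picture how a repointing script can "steal" a DNS entry is to compare an unconditional update with a guarded one. The sketch below is hypothetical and is not the script used at Booking.com: DNS is modelled as a plain dictionary, and the names `naive_repoint`, `guarded_repoint` and `db-master` are made up for illustration.

```python
def naive_repoint(dns, name, new_master):
    """Unconditionally point `name` at the promoted host."""
    dns[name] = new_master
    return True


def guarded_repoint(dns, name, failed_master, new_master):
    """Repoint only if `name` still points at the host that just failed."""
    if dns.get(name) != failed_master:
        return False  # the entry belongs to another chain: leave it alone
    dns[name] = new_master
    return True


# 1st event: A fails while holding the DNS entry, X is promoted.
dns = {"db-master": "A"}
first = guarded_repoint(dns, "db-master", failed_master="A", new_master="X")

# Later: A fails again inside the isolated A -> B chain, B is promoted.
naive_dns = dict(dns)
naive_repoint(naive_dns, "db-master", new_master="B")
second = guarded_repoint(dns, "db-master", failed_master="A", new_master="B")
```

In this model the naive version moves `db-master` from X to B, reproducing the incident, while the guarded version refuses because the entry no longer points at the failed host.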

Split-brain: analysis
● Digging more in the chain of events, we find that:
  ● After the 1st failure of A, a 2nd failure was detected and Orchestrator promoted B
  ● So after their failures, A and B came back and formed an isolated replication chain
  ● And something caused a failure of A
● But how did DNS end up pointing to B?
  ● The failover to B called the DNS repointing script
  ● The script stole the DNS entry from X and pointed it to B
● But is that all: what made A fail?

                 +\-/+
                 | A |
                 +/-\+

DNS (master)     +---+                    +---+
points here -->  | B |                    | X |
                 +---+                    +---+
                                            |
                                          +---+  Reads
                                          | Y |  <-- happen here
                                          +---+

Split-brain: analysis’
● What made A fail?
  ● Once A and B came back up as a new replication chain, they had outdated data
  ● If B had come back before A, it could have been re-cloned under X

                 +---+
                 | A |
                 +---+
                   |
                 +---+                    +---+  DNS (master)
                 | B |                    | X |  <-- points here
                 +---+                    +---+
                                            |
                                          +---+  Reads
                                          | Y |  <-- happen here
                                          +---+

Split-brain: analysis’
● What made A fail?
  ● Once A and B came back up as a new replication chain, they had outdated data
  ● If B had come back before A, it could have been re-cloned under X
  ● But as A came back before re-cloning, it injected heartbeat and p-GTID into B

                 +\-/+
                 | A |
                 +/-\+

                 +---+                    +---+  DNS (master)
                 | B |                    | X |  <-- points here
                 +---+                    +---+
                                            |
                                          +---+  Reads
                                          | Y |  <-- happen here
                                          +---+

Split-brain: analysis’
● What made A fail?
  ● Once A and B came back up as a new replication chain, they had outdated data
  ● If B had come back before A, it could have been re-cloned under X
  ● But as A came back before re-cloning, it injected heartbeat and p-GTID into B
  ● Then B could have been re-cloned without problems

                 +---+
                 | A |
                 +---+
                   |
                 +---+                    +---+  DNS (master)
                 | B |                    | X |  <-- points here
                 +---+                    +---+
                                            |
                                          +---+  Reads
                                          | Y |  <-- happen here
                                          +---+

Split-brain: analysis’
● What made A fail?
  ● Once A and B came back up as a new replication chain, they had outdated data
  ● If B had come back before A, it could have been re-cloned under X
  ● But as A came back before re-cloning, it injected heartbeat and p-GTID into B
  ● Then B could have been re-cloned without problems
  ● But A was re-cloned instead (human error #1)

                 +---+
                 | A |
                 +---+

                 +\-/+                    +---+  DNS (master)
                 | B |                    | X |  <-- points here
                 +/-\+                    +---+
                                            |
                                          +---+  Reads
                                          | Y |  <-- happen here
                                          +---+
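Re-cloning A while B was still replicating from it is what the slides call human error #1. A pre-flight check along the following lines might have flagged the risk. This is only a sketch under assumptions: the topology is a hand-written dictionary, and `reclone_blockers` is a hypothetical helper, not part of Orchestrator or of Booking.com's tooling.

```python
def reclone_blockers(host, replicas_of, dns_master):
    """Return the reasons why re-cloning `host` looks unsafe (empty list if none)."""
    reasons = []
    replicas = replicas_of.get(host, [])
    if replicas:
        # Taking this host down would look like a master failure to its replicas.
        reasons.append("has replicas: " + ", ".join(replicas))
    if host == dns_master:
        reasons.append("is the current DNS master")
    return reasons


# Topology after A and B came back: A -> B is isolated from X -> Y.
replicas_of = {"A": ["B"], "B": [], "X": ["Y"], "Y": []}

blockers_a = reclone_blockers("A", replicas_of, dns_master="X")
blockers_b = reclone_blockers("B", replicas_of, dns_master="X")
```

In this model B has no blockers and could be re-cloned under X, while A is flagged because B still replicates from it.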
