Distributed Systems COMP9243 Lecture 8: Fault Tolerance



  1. DISTRIBUTED SYSTEMS [COMP9243] Lecture 8: Fault Tolerance
     ➀ Failure
     ➁ Reliable Communication
     ➂ Process Resilience
     ➃ Recovery

     DEPENDABILITY
     Availability: system is ready to be used immediately
     Reliability: system can run continuously without failure
     Safety: when a system (temporarily) fails to operate correctly, nothing catastrophic happens
     Maintainability: how easily a failed system can be repaired
     Building a dependable system comes down to controlling failure and faults.

     CASE STUDY: AWS FAILURE 2011
     ➜ April 21, 2011
     ➜ EBS (Elastic Block Store) in US East region unavailable for about 2 days
     ➜ 13% of volumes in one availability zone got stuck
     ➜ led to control API errors and outage in whole region
     ➜ led to problems with EC2 instances and RDS in most popular region
     ➜ due to reconfig error and re-mirroring storm
     ➜ http://aws.amazon.com/message/65648/

     AWS EBS Overview:
     ➜ Region → Availability Zones
     ➜ Clusters → Nodes → Volumes
     ➜ Volume: replicated in cluster
     ➜ Control Plane Services: API for volumes for whole region
     ➜ Networks: primary, secondary
     [Figure: a region containing availability zones (AZ), an EBS cluster of nodes, and a control plane server]
     What happened?:
     ➜ US east AZ
     ➜ network config problem
     ➜ re-mirroring storm
     ➜ CP API thread starvation
     ➜ node race condition
     ➜ CP election overload

  2. Solution:
     ➜ Disconnect bad cluster
     ➜ Throttle re-mirroring
     ➜ Add more disk space
     ➜ Slowly un-throttle re-mirroring
     ➜ Volumes unstuck → reconnect cluster
     ➜ 0.07% data lost
     Lessons learned:
     ➜ Back off
     ➜ Re-establish connectivity to previous replicas
     ➜ Shorter timeouts
     ➜ Snapshot stuck volumes
     ➜ CP: one AZ shouldn't crash another AZ
     ➜ Make it easier to use multiple AZs

     FAILURE
     Terminology:
     Failure: a system fails when it does not meet its promises or cannot provide its services in the specified manner
     Error: part of the system state that leads to failure (i.e., it differs from its intended value)
     Fault: the cause of an error (results from design errors, manufacturing faults, deterioration, or external disturbance)
     Recursive:
     ➜ Failure can be a fault
     ➜ Manufacturing fault leads to disk failure
     ➜ Disk failure is a fault that leads to database failure
     ➜ Database failure is a fault that leads to email service failure

     TOTAL VS PARTIAL FAILURE
     Total Failure: all components in a system fail
     ➜ Typical in nondistributed system
     Partial Failure: one or more (but not all) components in a distributed system fail
     ➜ Some components affected
     ➜ Other components completely unaffected
     ➜ Considered as fault for the whole system

     CATEGORISING FAULTS AND FAILURES
     Types of Faults:
     Transient Fault: occurs once, then disappears
     Intermittent Fault: occurs, vanishes, reoccurs, vanishes, etc.
     Permanent Fault: persists until faulty component is replaced
     Types of Failures:
     Process Failure: process proceeds incorrectly or not at all
     Storage Failure: secondary storage is inaccessible
     Communication Failure: communication link or node failure
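The "Back off" lesson above can be illustrated with a small retry helper. This is only a sketch of one common approach (capped exponential backoff with random jitter); the function names `backoff_delays` and `retry`, the default parameters, and the choice of `ConnectionError` as the retryable error are assumptions for illustration, not anything taken from the AWS post-mortem.

```python
import random
import time


def backoff_delays(base=0.1, cap=10.0, attempts=6, rng=None):
    """Return one delay per retry: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # the ceiling doubles, up to the cap
        delays.append(rng.uniform(0, ceiling))     # jitter spreads retries out in time
    return delays


def retry(operation, delays, sleep=time.sleep):
    """Try operation() once, then once more after each delay, until it succeeds."""
    last_error = None
    for delay in [0.0] + list(delays):
        if delay:
            sleep(delay)
        try:
            return operation()
        except ConnectionError as exc:  # hypothetical choice of retryable error
            last_error = exc
    raise last_error
```

The jitter is there so that many clients that failed at the same moment do not all retry at the same moment, which is the kind of synchronised load a re-mirroring storm produces.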

  3. FAILURE MODELS
     Crash Failure: a server halts, but works correctly until it halts
     Fail-Stop: server will stop in a way that clients can tell that it has halted
     Fail-Resume: server will stop, then resume execution at a later time
     Fail-Silent: clients do not know server has halted
     Omission Failure: a server fails to respond to incoming requests
     Receive Omission: fails to receive incoming messages
     Send Omission: fails to send messages
     Response Failure: a server's response is incorrect
     Value Failure: the value of the response is wrong
     State Transition Failure: the server deviates from the correct flow of control
     Timing Failure: a server's response lies outside the specified time interval
     Arbitrary Failure: a server may produce arbitrary response at arbitrary times (aka Byzantine failure)

     DETECTING FAILURE
     Failure Detector:
     ➜ Service that detects process failures
     ➜ Answers queries about status of a process
     Reliable:
     ➜ Failed – crashed
     ➜ Unsuspected – hint
     Unreliable:
     ➜ Suspected – may still be alive
     ➜ Unsuspected – hint
     Synchronous systems:
     ➜ Timeout
     ➜ Failure detector sends probes to detect crash failures
     Asynchronous systems:
     ➜ Timeout gives no guarantees
     ➜ Failure detector can track suspected failures
     ➜ Combine results from multiple detectors
     ➜ How to distinguish communication failure from process failure?
     ➜ Ignore messages from suspected processes
       • Turn an asynchronous system into a synchronous one
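The unreliable, timeout-based detector described above can be sketched in a few lines. This is an illustrative sketch only: the class name `TimeoutFailureDetector`, the explicit `now` argument (used instead of a real clock so the behaviour is easy to follow), and the majority rule in `combined_status` are all assumptions, not part of the lecture.

```python
class TimeoutFailureDetector:
    """Unreliable detector: reports 'suspected' or 'unsuspected', never 'failed'."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}  # process id -> time of the last heartbeat received

    def heartbeat(self, process, now):
        """Record that a heartbeat (or probe reply) from process arrived at time now."""
        self.last_heard[process] = now

    def status(self, process, now):
        """A process is suspected if it was never heard from or has been silent too long."""
        last = self.last_heard.get(process)
        if last is None or now - last > self.timeout:
            return "suspected"  # it may still be alive: slow network or slow process
        return "unsuspected"


def combined_status(detectors, process, now):
    """Combine several detectors: suspect only if a strict majority of them suspect."""
    votes = sum(1 for d in detectors if d.status(process, now) == "suspected")
    return "suspected" if votes > len(detectors) // 2 else "unsuspected"
```

A suspicion can be withdrawn: if a late heartbeat arrives, `status` goes back to "unsuspected", which is why such a detector only gives hints in an asynchronous system.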

  4. FAULT TOLERANCE
     Fault Tolerance:
     ➜ System can provide its services even in the presence of faults
     Goal:
     ➜ Automatically recover from partial failure
     ➜ Without seriously affecting overall performance
     Techniques:
     ➜ Prevention: prevent or reduce occurrence of faults
     ➜ Masking: hide the occurrence of the fault
     ➜ Prediction: predict the faults that can occur and deal with them
     ➜ Recovery: restore an erroneous state to an error-free state

     FAILURE PREVENTION
     Make sure faults don't happen:
     ➜ Quality hardware
     ➜ Hardened hardware
     ➜ Quality software

     FAILURE PREDICTION
     Deal with expected faults:
     ➜ Test for error conditions
     ➜ Error handling code
     ➜ Error correcting codes
       • checksums
       • erasure codes

     FAILURE MASKING
     Try to hide occurrence of failures from other processes
     Mask:
     ➀ Communication Failure → Reliable Communication
     ➁ Process Failure → Process Resilience
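The two coding techniques listed under failure prediction can be illustrated briefly. The sketch below uses CRC-32 from Python's standard `zlib` module as the checksum and a single XOR parity block as the simplest possible erasure code; the helper names are hypothetical, and real systems typically use stronger codes (for example Reed-Solomon) that survive more than one lost block.

```python
import zlib


def with_checksum(payload):
    """Attach a CRC-32 checksum so corruption can be detected later."""
    return payload, zlib.crc32(payload)


def is_intact(payload, checksum):
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(payload) == checksum


def xor_parity(blocks):
    """XOR equal-length blocks byte by byte to form one parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])
```

A checksum only detects an error; the parity block goes one step further and lets the reader reconstruct one lost block without asking the sender again.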

  5. Redundancy:
     ➜ Information redundancy
     ➜ Time redundancy
     ➜ Physical redundancy
     [Figure: (a) stages A, B, C in series; (b) the same stages triplicated as A1 to A3, B1 to B3, C1 to C3, with voters V1 to V9 between stages]

     RELIABLE COMMUNICATION
     ➜ Communication channel experiences failure
     ➜ Focus on masking crash (lost/broken connections) and omission (lost messages) failures

     Two Army Problem:
     Non-faulty processes but lossy communication.
     [Figure: armies 1 and 2, 3000 troops each, and an opposing army of 5000]
     ➜ 1 → 2 attack!
     ➜ 2 → 1 ack
     ➜ 2: did 1 get my ack?
     ➜ 1 → 2 ack ack
     ➜ 1: did 2 get my ack ack?
     ➜ etc.
     Consensus with lossy communication is impossible.
     Why does TCP work?

     RELIABLE POINT-TO-POINT COMMUNICATION
     ➜ Reliable transport protocol (e.g., TCP)
     ➜ Masks omission failure
     ➜ Does not mask crash failure
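Physical redundancy with voters, as in the triplicated-stage figure above, can be sketched as a majority vote over replica outputs. The names `majority_vote` and `tmr_stage` are hypothetical, and the sketch assumes replica outputs are hashable values that can be compared for equality.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def tmr_stage(replicas, x):
    """Run every replica of one stage on input x and pass on the voted result."""
    return majority_vote([replica(x) for replica in replicas])
```

With three replicas per stage, one faulty replica is outvoted by the other two, so the fault is masked before it reaches the next stage.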

  6. Example: Failure and RPC:
     Possible failures:
     ➜ Client cannot locate server
     ➜ Request message to server is lost
     ➜ Server crashes after receiving a request
     ➜ Reply message from server is lost
     ➜ Client crashes after sending a request
     How to deal with the various kinds of failure?

     RELIABLE GROUP COMMUNICATION
     [Figure: (a) sender with history buffer multicasts M25 to four receivers whose last received messages are 24, 24, 23, 24; the third receiver missed message #24. (b) three receivers reply ACK 25 and the third reports "Missed 24"]

     SCALABILITY OF RELIABLE MULTICAST
     Feedback Implosion: sender is swamped with feedback messages
     Nonhierarchical Multicast:
     ➜ Use NACKs
     ➜ Feedback suppression: NACKs multicast to everyone
     ➜ Prevents other receivers from sending NACKs if they've already seen one
     ➜ Reduces (N)ACK load on server
     ➜ Receivers have to be coordinated so they don't all multicast NACKs at same time
     ➜ Multicasting feedback also interrupts processes that successfully received message
     Hierarchical Multicast:
     [Figure: sender S and receivers R grouped into local-area networks, each with a coordinator C; coordinators are linked by (long-haul) connections in a tree with a root]
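Feedback suppression is usually coordinated with random timers: each receiver that missed a message waits a random time before multicasting its NACK and cancels it if someone else's NACK arrives first. The small simulation below is a sketch under simplifying assumptions (a single fixed `propagation` delay between any two receivers, timers drawn uniformly from `[0, max_delay]`); the function name and parameters are hypothetical.

```python
import random


def simulate_nack_suppression(missing_receivers, max_delay=1.0,
                              propagation=0.05, rng=None):
    """Return the receivers that actually multicast a NACK for one lost message."""
    rng = rng or random.Random()
    # Each receiver that missed the message picks a random timer.
    timers = sorted((rng.uniform(0, max_delay), r) for r in missing_receivers)
    senders = []
    first_nack_time = None
    for fire_time, receiver in timers:
        # Once the first NACK has had time to arrive, every later timer is cancelled.
        if first_nack_time is not None and fire_time >= first_nack_time + propagation:
            break
        senders.append(receiver)
        if first_nack_time is None:
            first_nack_time = fire_time
    return senders
```

The sender then sees a handful of NACKs instead of one per receiver; the smaller the propagation delay relative to the timer range, the fewer duplicate NACKs get through.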

  7. PROCESS RESILIENCE
     Protection against process failures
     Groups:
     ➜ Organise identical processes into groups
       • Process groups are dynamic
       • Processes can be members of multiple groups
       • Mechanisms for managing groups and group membership
     ➜ Deal with all processes in a group as a single abstraction
     Flat vs Hierarchical Groups:
     ➜ Flat group: all decisions made collectively
     ➜ Hierarchical group: coordinator makes decisions

     REPLICATION
     Create groups using replication
     Primary-Based:
     ➜ Primary-backup
     ➜ Hierarchical group
     ➜ If primary crashes others elect a new primary
     Replicated-Write:
     ➜ Active replication or Quorum
     ➜ Flat group
     ➜ Ordering of requests (atomic multicast problem)
     k Fault Tolerance:
     ➜ can survive faults in k components and still meet its specifications
     ➜ k + 1 replicas enough if fail-silent (or fail-stop)
     ➜ 2k + 1 required if Byzantine

     STATE MACHINE REPLICATION
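The replica counts for k fault tolerance follow from how a client can pick a result. The sketch below assumes a silent replica is represented by `None` and that replies are hashable values; `replicas_needed` and `client_result` are hypothetical names used only for illustration.

```python
from collections import Counter


def replicas_needed(k, byzantine=False):
    """Replicas required to tolerate k faulty ones: k + 1 if fail-silent, 2k + 1 if Byzantine."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return 2 * k + 1 if byzantine else k + 1


def client_result(responses, byzantine=False):
    """Pick a result from replica responses; None in the list marks a silent replica."""
    answered = [r for r in responses if r is not None]
    if not answered:
        return None
    if not byzantine:
        return answered[0]  # fail-silent replicas never lie, so any reply is correct
    # Byzantine replicas may lie, so the k liars must be outvoted by k + 1 correct replies.
    value, count = Counter(answered).most_common(1)[0]
    return value if count > len(responses) // 2 else None
```

For example, with k = 2 a fail-silent group needs 3 replicas because one survivor is enough, while a Byzantine group needs 5 so that the 3 correct replies form a majority over the 2 faulty ones.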
