Detecting Failures in Distributed Systems with the FALCON Spy Network



  1. DETECTING FAILURES IN DISTRIBUTED SYSTEMS WITH THE FALCON SPY NETWORK
     Joshua B. Leners, Hao Wu, Wei-Lun Hung, Marcos K. Aguilera, Michael Walfish
     SOSP’11
     Presented by Khiem Ngo

  2. PROBLEM
     • Reliable distributed systems must handle crash failures
       • Application crashes, hardware failures, etc.
     • Detecting failures can take longer than recovery
     • Building a fast, reliable, and unobtrusive failure detector is challenging
       • Distributed systems are built on an asynchronous communication environment
       • Existing failure detection techniques (e.g., end-to-end timeouts) are unreliable or disruptive
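To illustrate why an end-to-end timeout is called unreliable or slow on this slide, here is a minimal sketch of such a baseline detector. It is not code from the paper; the class name `TimeoutDetector`, its methods, and the timeout values are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class TimeoutDetector:
    """Hypothetical baseline: suspect a node if no heartbeat arrived recently."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_heartbeat = {}  # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        # Record that a heartbeat from `node` arrived at time `now`.
        self.last_heartbeat[node] = now

    def suspected(self, node, now):
        # A node is suspected if it never sent a heartbeat or has been
        # silent for longer than the timeout.
        last = self.last_heartbeat.get(node)
        return last is None or now - last > self.timeout_s


# Two detectors that differ only in their (assumed) timeout.
aggressive = TimeoutDetector(timeout_s=0.5)
conservative = TimeoutDetector(timeout_s=10.0)
for detector in (aggressive, conservative):
    detector.heartbeat("n1", now=0.0)

# Case 1: n1 is alive, but the network delays its next heartbeat.
print(aggressive.suspected("n1", now=1.0))     # True: a false suspicion
print(conservative.suspected("n1", now=1.0))   # False

# Case 2: n1 really crashed right after t=0. The conservative detector
# only notices once the full timeout has elapsed.
print(conservative.suspected("n1", now=9.0))   # False: crash still undetected
print(conservative.suspected("n1", now=10.5))  # True
```

The short timeout reacts quickly but wrongly suspects a slow node, while the long timeout avoids that mistake at the cost of a long detection delay.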

  3. PROBLEM: PRIMARY VS. SUPPLEMENTAL
     [PRIMARY]
     • Formal theories and definitions of several classes of failure detectors
     • How consensus and atomic broadcast are made possible in an asynchronous network with failure detectors
     [SUPPLEMENTAL]
     • How to build a failure detector that is fast, reliable, and minimally disruptive

  4. KEY TECHNIQUES: FALCON
     • FALCON: a network of spies chained together to monitor different layers of the system
     • FALCON: Fast And Lethal Component Observation Network
     • Monitored layers: Application, Operating System, Virtual Machine Monitor, and Network
     • Each spy uses inside information (e.g., process table, internal timeouts, etc.) → fast
     • Lower-level spies monitor higher-level ones
     • Kill the layer when in doubt to achieve reliability
     • Try to kill the smallest possible component → low disruption
     • Use an end-to-end timeout as the last resort
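The chained-spy idea on this slide can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not FALCON's actual protocol or API: the `Layer` class, the `detect` function, the `spy_answers` and `alive` flags, and the string verdicts are all hypothetical. The sketch assumes that a responsive spy kills only the layer directly above it, and that the end-to-end timeout is consulted only when no spy answers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Layer:
    name: str
    spy_answers: bool = True  # hypothetical flag: does this layer's spy reply?
    alive: bool = True        # what the spy's inside information reports


def detect(layers: List[Layer],
           e2e_timeout_expired: bool) -> Tuple[str, Optional[str]]:
    """Return (verdict, killed_layer) for the top layer.

    `layers` is ordered from highest (application) to lowest (network).
    """
    for i, layer in enumerate(layers):
        if not layer.spy_answers:
            continue  # in doubt about this layer: ask the spy one level down
        if i == 0:
            # The application's own spy answered, so trust its inside information.
            return ("UP" if layer.alive else "DOWN", None)
        # A lower-level spy answered while everything above it was silent.
        # Kill only the layer directly above it (the smallest component that
        # resolves the doubt), so that reporting DOWN is guaranteed to be true.
        victim = layers[i - 1]
        victim.alive = False
        return ("DOWN", victim.name)
    # No spy answered at all: fall back to the end-to-end timeout.
    return ("DOWN" if e2e_timeout_expired else "UNKNOWN", None)


stack = [
    Layer("application", spy_answers=False),  # application spy is silent
    Layer("os"),
    Layer("vmm"),
    Layer("network"),
]
print(detect(stack, e2e_timeout_expired=False))  # ('DOWN', 'application')
```

In this sketch, killing before reporting is what makes a DOWN verdict reliable: once a lower layer has killed the component above it, the report cannot be a false suspicion.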

  5. Figures: FALCON architecture; spy architecture

  6. KEY TECHNIQUES: PRIMARY VS. SUPPLEMENTAL
     [PRIMARY]
     • Formal theories and definitions of several classes of failure detectors
     • (Theoretically) shows that simpler solutions for consensus and atomic broadcast are possible with reliable failure detectors (RFD)
     [SUPPLEMENTAL]
     • Builds a failure detector that is fast, reliable, and minimally disruptive
     • (Experimentally) shows that some distributed system tasks can be made simpler with RFD

  7. KEY FINDINGS
     • FALCON is fast and achieves sub-second detection
       • Its detection time is an order of magnitude faster than baseline FDs
     • FALCON’s CPU overhead is small (< 1% per component)
     • FALCON causes little disruption in spite of surgical kills
     • FALCON reduces the unavailability period after crashes (6x)
     • FALCON helps simplify distributed system programming

     Replication approach   Lines of code   # replicas/witnesses
     Paxos                  1759            3
     Primary-backup         1388            2

  8. Figure: Detection time of FALCON and baseline failure detector under various failures

  9. KEY TAKEAWAYS
     • FALCON is a chained network of spies monitoring different layers
     • FALCON uses inside information and local timeouts for fast detection, and surgical killing for accuracy
     • FALCON causes little disruption and helps simplify distributed system programming
     • FALCON does not contradict the FLP impossibility result
     • FALCON cannot handle Byzantine faults
     • FALCON cannot differentiate between a slow network and a failed network
