Detecting Failures in Distributed Systems with the FALCON Spy Network

May 29, 2023


DETECTING FAILURES IN DISTRIBUTED SYSTEMS WITH THE FALCON SPY NETWORK
Joshua B. Leners, Hao Wu, Wei-Lun Hung, Marcos K. Aguilera, Michael Walfish
SOSP’11
Presented by Khiem Ngo
PROBLEM
• Reliable distributed systems must handle crash failures
  • Application crashes, hardware failures, etc.
• Detecting a failure can take longer than recovering from it
• Building a fast, reliable, and unobtrusive failure detector is challenging
  • Distributed systems are built on an asynchronous communication environment
  • Existing failure detection techniques (e.g., end-to-end timeouts) are unreliable or disruptive
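The weakness of an end-to-end timeout can be seen in a minimal sketch. The class below is hypothetical (its name, its `timeout` parameter, and the explicit `now` timestamps are illustrative assumptions, not part of FALCON); it only shows why a single timeout value forces a choice between false suspicions and slow detection.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TimeoutDetector:
    """Hypothetical end-to-end detector: suspects a peer after `timeout` seconds of silence."""
    timeout: float
    last_heard: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, peer: str, now: float) -> None:
        # Record the time at which the peer was last heard from.
        self.last_heard[peer] = now

    def suspects(self, peer: str, now: float) -> bool:
        # A peer that was never heard from is not suspected in this sketch.
        last = self.last_heard.get(peer)
        return last is not None and now - last > self.timeout


# A live but slow peer: its next heartbeat is delayed, so at t = 3 s
# the last message either detector has seen is from t = 0 s.
hasty = TimeoutDetector(timeout=1.0)
patient = TimeoutDetector(timeout=10.0)
for detector in (hasty, patient):
    detector.heartbeat("server", now=0.0)

false_suspicion = hasty.suspects("server", now=3.0)    # short timeout suspects the live peer
still_waiting = patient.suspects("server", now=3.0)    # long timeout has not reacted yet
late_detection = patient.suspects("server", now=10.5)  # a real crash is noticed only after 10 s
```

A short timeout reports a slow peer as failed, while a long timeout delays every genuine detection by the full timeout period.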
PROBLEM: PRIMARY VS. SUPPLEMENTAL
[Primary]
• Formal theories and definitions of several classes of failure detectors
• How failure detectors make consensus and atomic broadcast possible in an asynchronous network
[Supplemental]
• How to build a failure detector that is fast, reliable, and minimally disruptive
KEY TECHNIQUES: FALCON
• FALCON: a network of spies chained together to monitor different layers of the system
• FALCON: Fast And Lethal Component Observation Network
• Monitored layers: application, operating system, virtual machine monitor, and network
• Each spy uses inside information (e.g., process table, internal timeouts) → fast
• Lower-level spies monitor higher-level ones
• Kill the layer when in doubt, to achieve reliability
• Try to kill the smallest possible component → low disruption
• Use an end-to-end timeout as the last resort
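The bullets above can be illustrated with a simplified sketch. It is not FALCON's actual protocol: the `Verdict` values, the `Spy` class, and the `check` function are hypothetical names, each spy's probe is a stand-in callable, and killing is modelled as setting a flag. The sketch assumes spies are ordered from the highest layer to the lowest.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List, Optional, Tuple


class Verdict(Enum):
    UP = auto()         # the spy is sure its layer is alive
    DOWN = auto()       # the spy is sure its layer has crashed
    SUSPECT = auto()    # the spy is in doubt
    NO_ANSWER = auto()  # the spy itself could not be reached


@dataclass
class Spy:
    layer: str
    probe: Callable[[], Verdict]  # stand-in for inspecting inside information
    killed: bool = False

    def kill(self) -> None:
        # Stand-in for actually terminating the monitored layer.
        self.killed = True


def check(spies: List[Spy], e2e_timed_out: bool) -> Tuple[str, Optional[str]]:
    """Return (status, layer reported down); spies are ordered highest layer first."""
    for index, spy in enumerate(spies):
        verdict = spy.probe()
        if verdict is Verdict.NO_ANSWER:
            continue  # spy unreachable: ask the spy of the next lower layer
        if verdict is Verdict.SUSPECT:
            spy.kill()  # kill when in doubt, so that "down" is certainly true
            verdict = Verdict.DOWN
        if verdict is Verdict.DOWN:
            return ("down", spy.layer)  # smallest component that explains the failure
        if index == 0:
            return ("up", None)  # the top-level spy vouches for the application
        break  # a lower layer is up but higher spies were silent: no confident answer
    # Last resort: fall back to the end-to-end timeout.
    return ("down", "end-to-end") if e2e_timed_out else ("unknown", None)


LAYERS = ["application", "os", "vmm", "network"]


def build(verdicts: List[Verdict]) -> List[Spy]:
    # Build one spy per layer whose probe always returns the given verdict.
    return [Spy(layer, (lambda v=v: v)) for layer, v in zip(LAYERS, verdicts)]


spies = build([Verdict.SUSPECT, Verdict.UP, Verdict.UP, Verdict.UP])
result = check(spies, e2e_timed_out=False)  # the application is killed and reported down
```

Walking from the highest layer down means the first confident "down" names the smallest component, which is what keeps the kill surgical in this sketch.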
[Figures: FALCON architecture; spy architecture]
KEY TECHNIQUES: PRIMARY VS. SUPPLEMENTAL
[Primary]
• Formal theories and definitions of several classes of failure detectors
• (Theoretically) shows that simpler solutions for consensus and atomic broadcast are possible with reliable failure detectors (RFDs)
[Supplemental]
• Builds a failure detector that is fast, reliable, and minimally disruptive
• (Experimentally) shows that some distributed system tasks can be made simpler with an RFD
KEY FINDINGS
• FALCON is fast and achieves sub-second detection
  • Its detection time is an order of magnitude faster than that of baseline FDs
• FALCON’s CPU overhead is small (< 1% per component)
• FALCON causes little disruption despite its surgical kills
• FALCON reduces the unavailability period after crashes (6x)
• FALCON helps simplify distributed system programming

Replication approach    Lines of code    # replicas/witnesses
Paxos                   1759             3
Primary-backup          1388             2
[Figure: Detection time of FALCON and baseline failure detectors under various failures]
KEY TAKEAWAYS
• FALCON is a chained network of spies monitoring different layers
• FALCON uses inside information and local timeouts for fast detection, and surgical killing for accuracy
• FALCON causes little disruption and helps simplify distributed system programming
• FALCON does not contradict the FLP impossibility result
• FALCON cannot handle Byzantine faults
• FALCON cannot differentiate between a slow network and a failed network
