Model Checking of Fault-Tolerant Distributed Algorithms, Igor Konnov - PowerPoint PPT Presentation


  1. Model Checking of Fault-Tolerant Distributed Algorithms. Igor Konnov, joint work with Annu Gmeiner, Ulrich Schmid, Helmut Veith, and Josef Widder.

  2. Distributed Systems: are they always working?

  3. No... some failing systems:
     Therac-25 (1985): the radiation therapy machine gave massive overdoses, e.g., due to race conditions between concurrent tasks.
     Qantas Airbus in-flight upset near Learmonth (2008): 1 out of 3 replicated components failed, and the computer initiated a dangerous altitude drop.
     Ariane 501 maiden flight (1996): primary/backup, i.e., 2 replicated computers; both ran into the same integer overflow.
     Netflix outages due to Amazon's cloud (ongoing): one is not sure what is going on there; hundreds of computers are involved.

  4. Why do they fail?

  5. Why do they fail? Faults at the design/implementation phase. Faults at runtime, outside the control of the designer/developer; e.g., a crack in a diode in the data link interface of the Space Shuttle led to erroneous messages being sent (Driscoll, Honeywell).

  6. Why do they fail? Faults at the design/implementation phase; approach: find and fix faults before operation ⇒ model checking. Faults at runtime, outside the control of the designer/developer; e.g., a crack in a diode in the data link interface of the Space Shuttle led to erroneous messages being sent (Driscoll, Honeywell).

  7. Why do they fail? Faults at the design/implementation phase; approach: find and fix faults before operation ⇒ model checking. Faults at runtime, outside the control of the designer/developer, e.g., a crack in a diode in the data link interface of the Space Shuttle that led to erroneous messages being sent (Driscoll, Honeywell); approach: keep the system operational despite faults ⇒ fault-tolerant distributed algorithms.

  8. Bringing both together. Goal: automatically verified fault-tolerant distributed algorithms, e.g., Paxos, Fast Byzantine Consensus, etc.

  9. Bringing both together. Goal: automatically verified fault-tolerant distributed algorithms, e.g., Paxos, Fast Byzantine Consensus, etc. Model checking FTDAs is a research challenge: computers run independently at different speeds and exchange messages with uncertain delays; faults; parameterization; ... Fault tolerance makes model checking harder.

  10. Why Model Checking? An alternative proof approach; useful counter-examples; the ability to define and vary assumptions about the system and see why it breaks; closer to the code level; a good degree of automation. [Figure: a transition system with states s0:{r}, s1:{y}, s2:{y}, s3:{r,y,g}, s4:{g}, and Linear Temporal Logic formulas F(...) and G(...) evaluated over the runs s0 s1 s2 s3 s4 and s'0 s'1 s'2 s'3 s'4.]

  11. Distributed Algorithms: Model Checking Challenges. Unbounded data types; an unbounded number of rounds (round numbers are part of messages); parameterization in multiple parameters: among n processes, f ≤ t are faulty, with n > 3t (in contrast to concurrent programs); diverse fault models (adverse environments); continuous time (fault-tolerant clock synchronization); degrees of concurrency: synchronous, asynchronous, partially synchronous (a process makes at most 5 steps between 2 steps of any other process).

  12. Challenge #1: fault models.
     Clean crashes (least severe): faulty processes prematurely halt after/before a "send to all".
     Crash faults: faulty processes prematurely halt, (also) in the middle of a "send to all".
     Omission faults: faulty processes follow the algorithm, but some messages sent by them might be lost.
     Symmetric faults: faulty processes send arbitrarily, to all or to nobody.
     Byzantine faults (most severe): faulty processes can do anything; this encompasses all behaviors of the above models.
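To make the hierarchy concrete, the following Python sketch models only the sending behavior of a single faulty process under each fault model; it is an illustrative rendering, not the formal semantics used in the talk, and the function name and probabilities are made up.

    import random

    def send_to_all(model, message, receivers):
        """Return the (receiver, message) pairs actually delivered when a faulty
        process executes "send <message> to all" under the given fault model.
        Illustrative sketch only."""
        if model == "clean crash":
            # halts before or after the whole broadcast: all or nothing
            return [] if random.random() < 0.5 else [(r, message) for r in receivers]
        if model == "crash":
            # may halt in the middle of the broadcast: an arbitrary prefix
            k = random.randint(0, len(receivers))
            return [(r, message) for r in receivers[:k]]
        if model == "omission":
            # follows the algorithm, but any of its messages may be lost
            return [(r, message) for r in receivers if random.random() < 0.8]
        if model == "symmetric":
            # sends an arbitrary message to everybody, or nothing to anybody
            payload = random.choice([None, message, "arbitrary"])
            return [] if payload is None else [(r, payload) for r in receivers]
        if model == "byzantine":
            # unrestricted: arbitrary messages to an arbitrary subset of receivers
            return [(r, random.choice([message, "arbitrary"]))
                    for r in receivers if random.random() < 0.5]
        raise ValueError("unknown fault model: " + model)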

  13. Challenges #2 & #3: Pseudo-code and Communication. Translate the pseudo-code into a formal description that allows us to verify the algorithm and does not oversimplify the original algorithm. Assumptions about the communication medium are usually written in plain English, spread across research papers, and constitute folklore knowledge.

  14. Typical Structure of a Computation Step: receive messages; compute using messages and local variables (description in English with basic control flow, if-then-else); send messages. The whole step is atomic.

  15. Typical Structure of a Computation Step: receive messages; compute using messages and local variables (description in English with basic control flow, if-then-else); send messages. The whole step is atomic, and in the published pseudo-code the receiving and sending of messages are left implicit.
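A minimal Python skeleton of such an atomic receive-compute-send step is sketched below; the Process class, its message handling, and the compute hook are illustrative assumptions, not part of the slides.

    from typing import Dict, List, Tuple

    class Process:
        """A process that executes atomic receive-compute-send steps."""

        def __init__(self, pid: int):
            self.pid = pid
            self.local: Dict[str, int] = {}   # local variables

        def step(self, incoming: List[str]) -> List[Tuple[int, str]]:
            """One atomic step of the process."""
            # 1. receive messages (delivered by the environment, implicit in papers)
            received = list(incoming)
            # 2. compute using the received messages and local variables
            #    (the part usually described in English / pseudo-code)
            to_send = self.compute(received)
            # 3. send messages (returned to the environment, implicit in papers)
            return to_send

        def compute(self, received: List[str]) -> List[Tuple[int, str]]:
            # algorithm-specific; overridden by a concrete algorithm
            return []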

  16. Challenge #4: Parameterized Model Checking. The parameterized model checking problem: given a process template P(n, t, f), a resilience condition RC: n > 3t ∧ t ≥ f ≥ 0, fairness constraints Φ (e.g., "all messages will be delivered"), and an LTL-X formula ϕ, show that for all n, t, and f satisfying RC: (P(n, t, f))^(n−f) ∥ (f faults) ⊨ (Φ → ϕ). [Figure: n processes, of which at most t may be faulty and f actually are.]

  17. Challenge #5: Liveness in Distributed Algorithms. The interplay of safety and liveness is a central challenge in DAs: achieving safety and liveness together is non-trivial, and asynchrony and faults lead to impossibility results [Fischer, Lynch, Paterson'85].

  18. Challenge #5: Liveness in Distributed Algorithms. The interplay of safety and liveness is a central challenge in DAs: achieving safety and liveness together is non-trivial, and asynchrony and faults lead to impossibility results [Fischer, Lynch, Paterson'85]. There is rich literature on verifying safety (e.g., in concurrent systems). From the distributed algorithms perspective, "doing nothing is always safe", so "tools verify algorithms that actually might do nothing". Verification efforts often have to simplify assumptions.

  19. Summary. We have to model: faults; the communication medium, captured in English; and algorithms written in pseudo-code. And we have to check: safety and liveness of parameterized systems, with unbounded integers and non-standard fairness constraints.

  20. Model Checking for Small System Sizes
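As a rough illustration of what "small system sizes" means here, the sketch below enumerates the parameter instances (n, t, f) that satisfy the resilience condition n > 3t ∧ t ≥ f ≥ 0 for small n; each such instance would then be checked as an ordinary finite-state model. The function name and the bound are made up for the example.

    def small_instances(max_n: int):
        """Yield all (n, t, f) with n > 3t and t >= f >= 0, for 1 <= n <= max_n."""
        for n in range(1, max_n + 1):
            for t in range(0, n):
                if n > 3 * t:
                    for f in range(0, t + 1):
                        yield n, t, f

    # Instances a model checker would be run on for n up to 5:
    # all (n, 0, 0), plus (4, 1, 0), (4, 1, 1), (5, 1, 0), (5, 1, 1).
    for instance in small_instances(5):
        print(instance)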

  21.-23. Fault-tolerant distributed algorithms. n processes communicate by messages; all processes know that at most t of them might be faulty; f are actually faulty. [Figure: a cloud of n processes, t of them marked as potentially faulty, f as actually faulty.]

  24. Asynchronous Reliable Broadcast [Srikanth & Toueg'87]. The core of the classic broadcast algorithm from the DA literature; it solves an agreement problem depending on the inputs v_i.
     Variables of process i:
       v_i : {0, 1}, initially 0 or 1
       accept_i : {0, 1}, initially 0
     An atomic step:
       if v_i = 1
         then send (echo) to all;
       if received (echo) from at least t + 1 distinct processes
          and not sent (echo) before
         then send (echo) to all;
       if received (echo) from at least n − t distinct processes
         then accept_i := 1;

  25. Asynchronous Reliable Broadcast [Srikanth & Toueg'87], continued. The same pseudo-code as on the previous slide, annotated as follows: the setting is asynchronous with t Byzantine faults; the algorithm is correct if n > 3t; the code is parameterized in n and t ⇒ process template P(n, t, f).
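A compact Python rendering of this threshold-guarded core for one correct process is sketched below; it is an illustrative reading of the pseudo-code, not the formal model from the talk, and it does not model Byzantine senders or message delivery.

    class BroadcastProcess:
        """One correct process of the Srikanth & Toueg'87 broadcast core."""

        def __init__(self, n: int, t: int, v: int):
            assert v in (0, 1)
            self.n, self.t = n, t        # the code refers only to n and t, never to f
            self.v = v                   # initial value
            self.accept = 0
            self.sent_echo = False
            self.echo_senders = set()    # distinct processes we received (echo) from

        def receive_echo(self, sender: int) -> None:
            self.echo_senders.add(sender)

        def atomic_step(self) -> bool:
            """One atomic step; returns True iff (echo) is sent to all."""
            send = False
            if self.v == 1:
                send = True
            if len(self.echo_senders) >= self.t + 1 and not self.sent_echo:
                send = True
            if len(self.echo_senders) >= self.n - self.t:
                self.accept = 1
            if send and not self.sent_echo:
                self.sent_echo = True    # (echo) is sent to all at most once
                return True
            return False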

  26. Threshold-Guarded Distributed Algorithms. Standard construct: quantified guards (t = f = 0). Existential guard: if received m from some process then ... Universal guard: if received m from all processes then ...

  27. Threshold-Guarded Distributed Algorithms. Standard construct: quantified guards (t = f = 0). Existential guard: if received m from some process then ... Universal guard: if received m from all processes then ... What if faults might occur?

  28. Threshold-Guarded Distributed Algorithms. Standard construct: quantified guards (t = f = 0). Existential guard: if received m from some process then ... Universal guard: if received m from all processes then ... What if faults might occur? Fault-tolerant algorithms: n processes, at most t of them Byzantine. Threshold guard: if received m from n − t processes then ... (the processes cannot refer to f!)
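The three kinds of guards can be written as predicates over the set of distinct senders a process has received m from; the sketch below is illustrative and only borrows the parameter names from the slides.

    def existential_guard(senders: set) -> bool:
        # received m from some process
        return len(senders) >= 1

    def universal_guard(senders: set, n: int) -> bool:
        # received m from all processes (only sensible when t = f = 0)
        return len(senders) == n

    def threshold_guard(senders: set, n: int, t: int) -> bool:
        # received m from n - t distinct processes;
        # f does not appear: processes cannot refer to the actual number of faults
        return len(senders) >= n - t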

  29.-30. Counting Argument in Threshold-Guarded Algorithms. Correct processes count incoming messages from distinct processes. [Figure: t + 1 messages received in a system of n processes, of which at most t may be faulty and f actually are.]

  31. Counting Argument in Threshold-Guarded Algorithms. Correct processes count incoming messages from distinct processes; once t + 1 messages from distinct processes have been received, at least one non-faulty process has sent the message.
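The arithmetic behind this step is standard and worth spelling out once: among any t + 1 distinct senders, at most f ≤ t can be faulty, so at least one sender is correct. In LaTeX:

    \underbrace{(t + 1)}_{\text{distinct senders}} - \underbrace{f}_{\text{faulty senders}} \;\ge\; (t + 1) - t \;=\; 1 .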
