Towards Automatically Checking Thousands of Failures with Micro-Specifications
Haryadi S. Gunawi, Thanh Do†, Pallavi Joshi, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau†, Remzi H. Arpaci-Dusseau†, Koushik Sen
University of California, Berkeley
† University of Wisconsin, Madison
Cloud Era
• Solve bigger human problems
• Use clusters of thousands of machines
Failures in The Cloud
“The future is a world of failures everywhere” - Garth Gibson
“Recovery must be a first-class operation” - Raghu Ramakrishnan
“Reliability has to come from the software” - Jeffrey Dean
Why Is Failure Recovery Hard?
• Testing is not advanced enough against complex failures
– Diverse, frequent, and multiple failures
– Facebook photo loss
• Recovery is underspecified
– Need to specify failure recovery behaviors
– Customized well-grounded protocols
• Example: Paxos Made Live - An Engineering Perspective [PODC '07]
Our Solutions
• FTS (“FATE”): Failure Testing Service
– New abstraction for failure exploration
– Systematically exercise 40,000 unique combinations of failures
• DTS (“DESTINI”): Declarative Testing Specification
– Enable concise recovery specifications
– We have written 74 checks (3 lines / check)
• Note: Names have changed since the paper
Summary of Findings
• Applied FATE and DESTINI to three cloud systems: HDFS, ZooKeeper, Cassandra
• Found 16 new bugs
• Reproduced 74 bugs
• Problems found
– Inconsistency
– Data loss
– Rack awareness broken
– Unavailability
Outline
• Introduction
• FATE
• DESTINI
• Evaluation
• Summary
[Figure: pipeline across nodes M, C, 1, 2, 3]
• No failures: allocation request (Alloc. Req.), Setup Stage, then Data Transfer Stage
• Failure at Setup Stage (X1). Recovery: recreate fresh pipeline
• Failure at Data Transfer Stage (X2). Recovery: continue on surviving nodes
• X3: bug in Data Transfer Stage recovery
• Failures at DIFFERENT STAGES lead to DIFFERENT FAILURE BEHAVIORS
• Goal: exercise different failure recovery paths
FATE
• A failure injection framework
– Target I/O points
– Systematically explore failures
– Multiple failures
• New abstraction of failure scenario
– Remember injected failures
– Increase failure coverage
Failure ID
[Figure: nodes 2 and 3]
Category          Field          Value
Static            Func. call     OutputStream.read()
                  Source file    BlockReceiver.java
Dynamic           Stack trace    …
Domain specific   Source         Node 2
                  Destination    Node 3
                  Net. message   Data Packet
Failure type                     Crash After
Hash                             12348729
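The table above suggests that a failure ID is a record of static, dynamic, and domain-specific fields plus a failure type, summarized by a hash. The following is a minimal sketch of that idea, not FATE's implementation: the class name `FailureID`, the `digest()` method, the choice of SHA-1, and the `<stack trace>` placeholder are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FailureID:
    """Hypothetical record mirroring the fields listed on the slide."""
    func_call: str      # static
    source_file: str    # static
    stack_trace: str    # dynamic
    source: str         # domain specific
    destination: str    # domain specific
    net_message: str    # domain specific
    failure_type: str

    def digest(self) -> str:
        # Serialize fields in declaration order so that two records with
        # equal fields always produce the same digest.
        joined = "|".join(f"{k}={v}" for k, v in asdict(self).items())
        return hashlib.sha1(joined.encode("utf-8")).hexdigest()[:8]


# Values taken from the slide; the stack trace is elided there,
# so a placeholder string stands in for it here.
fid = FailureID(
    func_call="OutputStream.read()",
    source_file="BlockReceiver.java",
    stack_trace="<stack trace>",
    source="Node 2",
    destination="Node 3",
    net_message="Data Packet",
    failure_type="Crash After",
)
same = FailureID(**asdict(fid))
other = FailureID(**{**asdict(fid), "destination": "Node 1"})
```

Under this sketch, `fid.digest()` equals `same.digest()` and differs from `other.digest()`, which is what lets a testing service recognize a failure it has already injected.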
How Do Developers Build a Failure ID?
• FATE intercepts all I/Os
• Use AspectJ to collect information at every I/O point
– I/O buffers (e.g., file buffer, network buffer)
– Target I/O (e.g., file name, IP address)
• Reverse engineer for domain-specific information
Exploring Failure Space
[Figure: nodes M, C, 1, 2, 3 with failure points A, B, C]
• Single failures: Exp #1: A, Exp #2: B, Exp #3: C
• Failure pairs: AB, AC, BC
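The experiments on this slide (A, B, C, then AB, AC, BC) correspond to enumerating every combination of up to two distinct failure points. A minimal sketch of that enumeration follows; the function name `failure_experiments` and its parameters are hypothetical, and it lists combinations up front, which may differ from how FATE discovers failure points while a workload runs.

```python
from itertools import combinations


def failure_experiments(points, max_failures):
    """Yield every combination of 1..max_failures distinct failure points."""
    for k in range(1, max_failures + 1):
        for combo in combinations(points, k):
            yield combo


# Failure points A, B, C as labeled on the slide.
experiments = list(failure_experiments(["A", "B", "C"], max_failures=2))
for number, combo in enumerate(experiments, start=1):
    print(f"Exp #{number}: {''.join(combo)}")
```

The count grows quickly with the number of failure points and the number of simultaneous failures, which is why remembering already-injected failures matters for coverage.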
Outline
• Introduction
• FATE
• DESTINI
• Evaluation
• Summary
DESTINI
• Enable concise recovery specifications
• Check if expected behaviors match actual behaviors
• Important elements:
– Expectations
– Facts
– Failure Events
– Check Timing
• Interpose network and disk protocols
Writing specifications
“Violation if expectation is different from actual facts”
violationTable() :- expectationTable(), NOT-IN actualTable()
Datalog syntax:
:-   derivation
,    AND
Correct recovery vs. incorrect recovery
[Figure: two pipelines over nodes M, C, 1, 2, 3 with failures (X), one recovering correctly and one incorrectly]
Expected Nodes (Block, Node):
B, Node 1
B, Node 2
incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
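The `incorrectNodes` rule reads as an anti-join: keep every expected (Block, Node) row that has no matching row among the actual facts. Here is a minimal sketch of that semantics using Python sets. The helper `not_in` is hypothetical, the expected rows come from the slide, and both "actual" tables are invented for illustration because the slide does not list them.

```python
def not_in(expected, actual):
    """Return the rows of `expected` that do not appear in `actual`."""
    return {row for row in expected if row not in actual}


# Expected (Block, Node) rows as shown on the slide.
expected_nodes = {("B", "Node 1"), ("B", "Node 2")}

# Hypothetical actual facts for the two recovery outcomes.
actual_after_correct_recovery = {("B", "Node 1"), ("B", "Node 2")}
actual_after_incorrect_recovery = {("B", "Node 1")}

violations_correct = not_in(expected_nodes, actual_after_correct_recovery)
violations_incorrect = not_in(expected_nodes, actual_after_incorrect_recovery)
```

In this sketch an empty result means the check passes, and any remaining row names the block and node where recovery fell short of the expectation.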