1 DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems
Haopeng Liu, Guangpu Li, Jeffrey Lukman, Jiaxin Li, Shan Lu, Haryadi Gunawi, and Chen Tian
2 Cloud systems
4 Distributed concurrency bugs (DCbugs)
• Unexpected timing among distributed operations
• Example: MapReduce-3274
[Figure: two message timelines over nodes A, B, and C; the second one ends in a hang]
8 DCbugs need to be tackled
• Common in distributed systems [1, 2, 3]
  – 26% of failures are caused by non-determinism [1]
  – 6% of software bugs in cloud systems [2]
• Difficult to avoid, expose, and diagnose
  – Hadoop Map/Reduce / MAPREDUCE-3274: “That is one monster of a race!”
  – HBase / HBASE-4397: “There isn’t a week going by without new bugs about races.”
  – HBase / HBASE-6147: “We have already found and fix many cases, however it seems exist many other cases.”
  – Hadoop Map/Reduce / MAPREDUCE-4819: “This has become quite messy, sigh.”
  – Hadoop Map/Reduce / MAPREDUCE-4099: “Great catch, Sid! Apologies for missing the race condition.”
  – Hadoop Map/Reduce / MAPREDUCE-3634: “We [prefer] debug crashes instead of hanging jobs.”
Can we detect DCbugs before they manifest?
[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14
[2] Gunawi. What Bugs Live in the Cloud? In SoCC’14
[3] Leesatapornwongsa. TaxDC. In ASPLOS’16
17 Previous work
• Model checking
  – Work on abstracted models
  – Face state-space explosion issue
18 Our idea
• Follow the philosophy of traditional concurrency bug detection
[Figure: Machine 1, Machine 2, Machine 3, Machine 4]
22 Example
[Figure: nodes A, B, and C; the code below is shown for node B]
//UnReg thread
void unReg(jID){
  ...
  jMap.remove(jID);
  ....
}
//RPC thread
Task getTask(jID){
  return jMap.get(jID);
}
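The two outcomes of this race can be reproduced deterministically by running the two operations in a fixed order. The following is a minimal sketch, not the Hadoop code: the class name, the `register` helper, the `run` driver, and the use of `String` for job IDs and tasks are all assumptions made for illustration; only `jMap`, `unReg`, and `getTask` come from the slide.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the node-B code on the slide.
// Only jMap, unReg, and getTask are taken from the slide; the rest is assumed.
class JobMapRaceSketch {
    private final Map<String, String> jMap = new ConcurrentHashMap<>();

    // Assumed helper: puts a job into the map before the race starts.
    void register(String jID, String task) {
        jMap.put(jID, task);
    }

    // UnReg thread on the slide: removes the job.
    void unReg(String jID) {
        jMap.remove(jID);
    }

    // RPC thread on the slide: looks the job up.
    String getTask(String jID) {
        return jMap.get(jID);
    }

    // Runs the two racing operations in a chosen order and returns
    // what getTask observed.
    static String run(boolean unRegFirst) {
        JobMapRaceSketch node = new JobMapRaceSketch();
        node.register("job1", "task1");
        if (unRegFirst) {
            node.unReg("job1");
            return node.getTask("job1"); // the job is already gone
        }
        String observed = node.getTask("job1");
        node.unReg("job1");
        return observed;
    }

    public static void main(String[] args) {
        System.out.println("getTask before unReg: " + run(false)); // task1
        System.out.println("unReg before getTask: " + run(true));  // null
    }
}
```

When `unReg` wins the race, `getTask` returns `null`. Each map access is thread-safe on its own, so the problem is the order of the two operations, not a data-structure corruption.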
24 Local concurrency bug detection
[Figure: threads T1, T2, and T3 with read (r) and write (w) accesses, processed in four stages: Trace, HB, Triage (assert(r)), and Trigger (+sleep)]
Is the problem solved?
Challenges
• C1 (Trace): How to handle the huge amount of memory accesses?
• C2 (HB): What’s the happens-before model?
• C3 (Triage): How to estimate the distributed impact of a race?
• C4 (Trigger): How to trigger with distributed time manipulation?
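The HB stage of this pipeline reports a pair of accesses only when they conflict and neither happens before the other. The sketch below illustrates that check with vector clocks, a generic technique chosen here as an assumption; the slides do not say how DCatch represents its happens-before model. The `Access` type, the clock values, and the three-thread trace are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Generic happens-before race check over a recorded trace.
// Vector clocks are an assumption of this sketch, not DCatch's stated design.
class HbRaceSketch {
    // One traced access: a label, the variable touched, read or write,
    // and the vector clock of the accessing thread at that moment.
    static final class Access {
        final String id;
        final String var;
        final boolean isWrite;
        final int[] clock;

        Access(String id, String var, boolean isWrite, int[] clock) {
            this.id = id;
            this.var = var;
            this.isWrite = isWrite;
            this.clock = clock;
        }
    }

    // a happens before b iff a <= b in every component and a != b.
    static boolean happensBefore(int[] a, int[] b) {
        boolean strictlyLess = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlyLess = true;
        }
        return strictlyLess;
    }

    // Reports every conflicting pair (same variable, at least one write)
    // that is not ordered by happens-before in either direction.
    static List<String> findRaces(List<Access> trace) {
        List<String> races = new ArrayList<>();
        for (int i = 0; i < trace.size(); i++) {
            for (int j = i + 1; j < trace.size(); j++) {
                Access a = trace.get(i);
                Access b = trace.get(j);
                boolean conflict = a.var.equals(b.var) && (a.isWrite || b.isWrite);
                boolean ordered = happensBefore(a.clock, b.clock)
                        || happensBefore(b.clock, a.clock);
                if (conflict && !ordered) {
                    races.add(a.id + "/" + b.id);
                }
            }
        }
        return races;
    }

    // Hypothetical trace: T1 writes, then starts T2 and T3, which access
    // the same variable without any ordering between them.
    static List<Access> demoTrace() {
        return List.of(
                new Access("put", "jMap", true, new int[] {1, 0, 0}),
                new Access("remove", "jMap", true, new int[] {1, 1, 0}),
                new Access("get", "jMap", false, new int[] {1, 0, 1}));
    }

    public static void main(String[] args) {
        System.out.println(findRaces(demoTrace())); // [remove/get]
    }
}
```

The pairwise loop is quadratic in the number of traced accesses, which is one way to see why challenge C1 (the volume of memory accesses) matters before the HB analysis even starts.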
36 Contribution
• A comprehensive HB model for distributed systems
• DCatch tool detects DCbugs from correct runs
  [Figure: the Trace, HB, Triage, Trigger pipeline with challenges C1 to C4 marked "Solved by DCatch"]
• Evaluate on 4 systems
• Report 32 DCbugs, with 20 of them being truly harmful
39 Outline
• Motivation
• DCatch Happens-before Model
• DCatch tool
• Evaluation
• Conclusion