DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems

  1. DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems
     Haopeng Liu, Guangpu Li, Jeffrey Lukman, Jiaxin Li, Shan Lu, Haryadi Gunawi, and Chen Tian*

  2. Cloud systems

  4. Distributed concurrency bugs (DCbugs)
     • Unexpected timing among distributed operations
     • Example: MapReduce-3274
       [Figure: two timelines over nodes A, B, and C; the second ends in a hang]

  8. DCbugs need to be tackled
     • Common in distributed systems [1, 2, 3]
       – 26% of failures are caused by non-determinism [1]
       – 6% of software bugs in cloud systems [2]
     • Difficult to avoid, expose and diagnose
       – Hadoop Map/Reduce / MAPREDUCE-3274: “That is one monster of a race!”
       – HBase / HBASE-4397: “There isn’t a week going by without new bugs about races.”
       – HBase / HBASE-6147: “We have already found and fix many cases, however it seems exist many other cases.”
       – Hadoop Map/Reduce / MAPREDUCE-4819: “This has become quite messy, sigh.”
       – Hadoop Map/Reduce / MAPREDUCE-4099: “Great catch, Sid! Apologies for missing the race condition.”
       – Hadoop Map/Reduce / MAPREDUCE-3634: “We [prefer] debug crashes instead of hanging jobs.”
     Can we detect DCbugs before they manifest?

     [1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14
     [2] Gunawi. What Bugs Live in the Cloud? In SoCC’14
     [3] Leesatapornwongsa. TaxDC. In ASPLOS’16

  17. Previous work
      • Model checking
        – Works on abstracted models
        – Faces the state-space explosion issue

  18. Our idea
      • Follow the philosophy of traditional concurrency bug detection
      [Figure: Machine 1, Machine 2, Machine 3, Machine 4]

  22. Example (nodes A, B, C). Two threads on node B:

          // UnReg thread
          void unReg(jID) {
              ...
              jMap.remove(jID);
              ...
          }

          // RPC thread
          Task getTask(jID) {
              return jMap.get(jID);
          }
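The race in this example can be reproduced deterministically. The following is a minimal Java sketch, not code from Hadoop or from DCatch: the `String` task type, the `register` helper, the job ID `"job1"`, and the latch that forces each ordering are all assumptions made for illustration. Only the names `jMap`, `unReg`, and `getTask` come from the slide.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

class JobMapRace {
    // Shared state on one node; the field and method names mirror the slide.
    private final Map<String, String> jMap = new ConcurrentHashMap<>();

    void register(String jID, String task) { jMap.put(jID, task); } // hypothetical helper
    void unReg(String jID) { jMap.remove(jID); }
    String getTask(String jID) { return jMap.get(jID); }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Runs both handlers on separate threads; the latch forces the chosen order. */
    static String run(boolean unRegFirst) {
        JobMapRace node = new JobMapRace();
        node.register("job1", "task1");
        CountDownLatch firstDone = new CountDownLatch(1);
        String[] result = new String[1];

        Thread unRegThread = new Thread(() -> {
            if (!unRegFirst) awaitQuietly(firstDone); // wait for getTask to go first
            node.unReg("job1");
            if (unRegFirst) firstDone.countDown();
        });
        Thread rpcThread = new Thread(() -> {
            if (unRegFirst) awaitQuietly(firstDone);  // wait for unReg to go first
            result[0] = node.getTask("job1");
            if (!unRegFirst) firstDone.countDown();
        });

        unRegThread.start();
        rpcThread.start();
        try {
            unRegThread.join();
            rpcThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println("getTask before unReg: " + run(false)); // task1
        System.out.println("unReg before getTask: " + run(true));  // null
    }
}
```

Each map operation is individually thread-safe here, yet the result of `getTask` still depends on which handler runs first. That ordering dependence is the kind of timing assumption the talk is concerned with.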

  24. Local concurrency bug detection
      Is the problem solved?

  27. Local concurrency bug detection
      [Figure: threads T1, T2, T3 with a read (r) and a write (w); pipeline Trace → HB → Triage → Trigger; assert(r) appears at the Triage step and +sleep at the Trigger step]
      Challenges
      • C1 (Trace): How to handle the huge amount of memory accesses?
      • C2 (HB): What’s the happens-before model?
      • C3 (Triage): How to estimate the distributed impact of a race?
      • C4 (Trigger): How to trigger with distributed time manipulation?
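The Trace and HB steps can be illustrated with a generic vector-clock check. This is a sketch under stated assumptions, not DCatch's happens-before model, which these slides do not spell out. The `Access` record layout, the three-entry clocks, and the sample trace are hypothetical. A pair of accesses is reported as a race candidate when both touch the same variable from different threads, at least one is a write, and neither happens before the other.

```java
import java.util.ArrayList;
import java.util.List;

class HbRaceSketch {
    /** One traced memory access, stamped with the vector clock at that point. */
    static final class Access {
        final String thread;
        final String variable;
        final boolean isWrite;
        final int[] clock;

        Access(String thread, String variable, boolean isWrite, int[] clock) {
            this.thread = thread;
            this.variable = variable;
            this.isWrite = isWrite;
            this.clock = clock;
        }
    }

    /** a happens before b iff a <= b in every entry and a < b in at least one. */
    static boolean happensBefore(int[] a, int[] b) {
        boolean strictlyLess = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlyLess = true;
        }
        return strictlyLess;
    }

    /** Conflicting accesses from different threads with no happens-before order. */
    static boolean isRaceCandidate(Access x, Access y) {
        boolean conflicting = !x.thread.equals(y.thread)
                && x.variable.equals(y.variable)
                && (x.isWrite || y.isWrite);
        return conflicting
                && !happensBefore(x.clock, y.clock)
                && !happensBefore(y.clock, x.clock);
    }

    /** Checks every pair in the trace and describes each candidate found. */
    static List<String> candidates(List<Access> trace) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < trace.size(); i++) {
            for (int j = i + 1; j < trace.size(); j++) {
                Access x = trace.get(i);
                Access y = trace.get(j);
                if (isRaceCandidate(x, y)) {
                    found.add(x.thread + "/" + y.thread + " on " + x.variable);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Access> trace = new ArrayList<>();
        trace.add(new Access("T1", "jMap", false, new int[] {1, 0, 0})); // r
        trace.add(new Access("T2", "jMap", true, new int[] {0, 1, 0}));  // w, unordered with r
        trace.add(new Access("T3", "jMap", true, new int[] {1, 1, 1}));  // w, ordered after both
        System.out.println(candidates(trace)); // [T1/T2 on jMap]
    }
}
```

The pairwise loop is quadratic in the number of traced accesses, which gives a sense of why challenge C1 matters once every memory access is recorded.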

  36. Contribution
      • A comprehensive HB model for distributed systems
      • The DCatch tool detects DCbugs from correct runs
        [Figure: pipeline Trace → HB → Triage → Trigger; challenges C1 to C4 marked "Solved by DCatch"]
      • Evaluated on 4 systems
      • Reported 32 DCbugs, 20 of them truly harmful

  39. Outline
      • Motivation
      • DCatch happens-before model
      • DCatch tool
      • Evaluation
      • Conclusion
