Understanding Issue Correlations: A Case Study of the Hadoop System

Jian Huang, Xuechen Zhang, Karsten Schwan


  1. Understanding Issue Correlations: A Case Study of the Hadoop System. Jian Huang, Xuechen Zhang†, Karsten Schwan

  6. Why Issue Study Matters? Scalable distributed systems are complex [Yuan et al., OSDI’14]: complicated system + error-prone + hard to debug. Issue study and issue patterns lead to better software and debugging tools.

  10. Hadoop: A Representative Distributed System. HDFS (storage) and MapReduce (computation). [Figure: The Evolution of Apache Hadoop; number of reported issues (x1000) per year, 2008 to 2015.] Learn from issues: more than 6 years of experience.

  12. What Can We Learn From Issues?
      Related work:
      • [Gunawi et al., SoCC’14] What Bugs Live in the Cloud?
      • [Lu et al., FAST’13] A Study of Linux File System Evolution
      • ……
      Our focus: issue correlations (programming, systems, tools).

  14. Our Findings
      • Half of the issues are independent
      • MapReduce issues tend to relate to YARN
      • One third of the issues have similar causes
      • ......
      • Memory: GC is still the No. 1 concern
      • Storage: “99.99% of data reliability” is challenged
      • Programming: one third of them relate to interfaces
      • Tools: the logging in Hadoop is error-prone
      • ......
      Categories: Programming, Systems, Tools.

  19. Methodology Used in Our Study
      Hadoop ecosystem: …, Hive, Pig, Flume, …, HCatalog, Mahout, Cascading, HBase, on top of MapReduce (computation) and HDFS (storage).
      Closed issues:    2340, 2359
      Examined issues:  2038, 2180
      Sampling rate:    89.8%
      Sampling period:  ~6 years, 5 years

  21. Methodology Used in Our Study
      Per-issue inputs: description, patches, follow-up discussions, source code analysis.
      Labeling: IssueID, Subcomponent, Type, Causes, ……, Create/Commit Time, CorrelatedIssueID.
      Labeled issues are stored in HPatchDB.

  22. Where Are the Correlated Issues From? (“Do you know where I’m from?”)
      External correlation: correlated issues appear in other systems.
      Internal correlation: correlated issues appear in the same system.

      #Correlated issues     0       1       2      3      >=4
      External  HDFS         94.7%   4.8%    0.5%   -      -
      External  MapReduce    79.3%   17.1%   2.8%   0.5%   0.3%
      Internal  HDFS         52.7%   32.8%   9.1%   3.1%   2.3%
      Internal  MapReduce    59.3%   32.7%   5.6%   1.3%   1.0%

      Callouts shown over the table:
      • A significant number of issues are independent.
      • Half of them are from YARN.
      • Half of them are independent.

  30. How the Issues Are Correlated? (“Do you know our relationship?”)
      • Similar causes: issues have similar causes.
      • Blocking other issues: issues need to be fixed before fixing other issues.
      • Fix on fix: issues are caused by fixing other issues.
      [Figure: percentage (%) of issues per correlation type (similar causes, blocking other issues, fix on fix) for HDFS and MapReduce.]
      Callouts shown over the figure:
      • 26-33% of the issues have similar causes.
      • Issues that block others appear more frequently in HDFS.
      • Mostly due to functional dependency.

  36. On the Issue Correlations with System Characteristics. Programming, Systems, Tools (chart labels: 27%, 26%, 47%).

  38. How Issues Relate to Systems? [Figure: percentage (%) of issues by category (security, networking, storage, file system, memory, cache) for HDFS and MapReduce.]

  39. How Issues Relate to Systems? GC is still the No. 1 concern; memory-friendly objects are preferred.
      • LightWeightGSet vs. java.util structures
      • Object cache for long-lived objects: ReplicasMap, ReplicasInfo

  40. How Issues Relate to Systems? Many issues that occur in file systems such as EXT4 also appear in Hadoop. File system semantics: namespace management, file permission, consistency (e.g., fsck), etc.

  41. How Issues Relate to Systems? The claim of 99.99% data reliability in cloud storage is challenged. Issues in rack placement policy: 0.16% of blocks and their replicas are in the same rack upon system upgrade.

  42. How Issues Relate to Systems? One quarter of networking issues cause resource wastage. Example: reading a block leaks a socket.
        Peer peer = newTcpPeer(dnAddr);
      - return newBlockReader(…);
      + try {
      +   reader = newBlockReader(…);
      +   return reader;
      + } catch (IOException ex) {
      +   throw ex;
      + } finally {
      +   if (reader == null) closeQuietly(peer);
      + }
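
The socket-leak fix on slide 42 can be illustrated with a small, self-contained Java sketch. It is not Hadoop code: `Peer`, `BlockReader`, `leaky`, `fixed`, and the open-peer counter are hypothetical stand-ins introduced here, and only `closeQuietly` and the try/finally shape follow the slide. The sketch assumes the reader takes ownership of the peer once it has been constructed, so the peer is closed only when construction fails.

```java
import java.io.Closeable;
import java.io.IOException;

class PeerLeakSketch {
    /** Hypothetical stand-in for a DataNode connection; counts instances still open. */
    static final class Peer implements Closeable {
        static int open = 0;
        Peer() { open++; }
        @Override public void close() { open--; }
    }

    /** Hypothetical stand-in for a block reader that owns the peer once constructed. */
    static final class BlockReader {
        final Peer peer;
        BlockReader(Peer peer, boolean fail) throws IOException {
            if (fail) throw new IOException("simulated failure while creating the reader");
            this.peer = peer;
        }
    }

    /** Close a resource and ignore any IOException, as the helper on the slide suggests. */
    static void closeQuietly(Closeable c) {
        try { c.close(); } catch (IOException ignored) { }
    }

    /** Shape before the fix: if the constructor throws, the peer is never closed. */
    static BlockReader leaky(boolean fail) throws IOException {
        Peer peer = new Peer();
        return new BlockReader(peer, fail);
    }

    /** Shape after the fix: the peer is closed whenever no reader was produced. */
    static BlockReader fixed(boolean fail) throws IOException {
        Peer peer = new Peer();
        BlockReader reader = null;
        try {
            reader = new BlockReader(peer, fail);
            return reader;
        } finally {
            if (reader == null) closeQuietly(peer);
        }
    }

    public static void main(String[] args) {
        try { leaky(true); } catch (IOException expected) { }
        System.out.println("open peers after leaky path: " + Peer.open);

        Peer.open = 0; // reset the counter between the two experiments
        try { fixed(true); } catch (IOException expected) { }
        System.out.println("open peers after fixed path: " + Peer.open);
    }
}
```

Running `main` reports one open peer after the leaky path and zero after the fixed path. The `catch (IOException ex) { throw ex; }` clause from the slide is left out of the sketch because it does not change behavior; the `finally` block alone releases the peer.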
