Re-Engineering Software Engineering in a Data-Centric World - PowerPoint PPT Presentation


  1. Re-Engineering Software Engineering in a Data-Centric World. Miryung Kim, University of California, Los Angeles

  2. Confluence. Interdisciplinary thinking via confluences, George Varghese @ SIGCOMM 2014 Keynote

  3. Confluence: Interdisciplinary Thinking. Inflection Point.

  4. Confluence: Impressionism. Inflection Point.

  5. Confluence: Data Analytics and SE. Inflection Point: ML, Big Data, AI.

  6. Takeaway Message: A Case for Software Engineering for Data Analytics (SE4DA). Bug finding is a huge problem in data analytics. SE4DA is underserved; somehow people have gravitated to applying data analytics to SE. SE4DA requires re-thinking software engineering techniques.

  7. There is a huge opportunity for data analytics.

  8. Data analytics are in high demand, yet …

  9. Bugs are huge problems in data analytics. Data analytics used by thousands of scientists produce misleading or wrong results [BBC News]. The widespread harm ranges from a wrong medical diagnosis to incorrect interpretation of stock history [Dataversity]. Predictably inaccurate: The prevalence and perils of bad big data [Deloitte].

  10. Growth of Data Analytics Papers in SE. [Bar chart: growth of Data Analytics (AI, Big Data, ML) in ASE papers, 2016 to 2019; series: Data Analytics, Rest.]

  11. SE4DA is under-investigated (SE4DA: 13, DA4SE: 105). SE4DA (4%): improving SE for data analytics. DA4SE (37%): applying data analytics to SE. Rest: 59%.

  12. Outline: Making a Case for Software Engineering for Data Analytics (SE4DA). ① Studies, Data Scientists: shift to data-centric SW development. ② Differences between traditional SW vs. data-centric SW dev process. ③ Tools: debugging & testing for big data analytics. ④ Open problems in SE4DA.

  13. Part 1. Data Scientists in Software Teams: State of the Art and Challenges. Miryung Kim, Thomas Zimmermann, Rob DeLine, Andrew Begel

  14. ① Data Scientists ② Challenges ③ Difference ④ Tools. The Emerging Roles of Data Scientists on Software Teams. We are at a tipping point where there are large-scale telemetry, machine, quality, and user data. Data scientists are an emerging role in SW teams. To understand their working styles and challenges, we conducted the first in-depth interview study and the largest-scale survey of professional data scientists.

  15. Methodology for Studying “Data Scientists”. In-Depth Interviews [ICSE’16]: 5 women and 11 men from eight different Microsoft organizations. Survey [TSE 2018]: 793 responses, covering demographics/self-perception, skills and tool usage, working styles, time spent, and challenges and best practices. Disciplines pictured: Bioinformatics, Finance, Physics, Economics, Math, Computer Science, Cog Sci, Statistics, ML.

  16. Time Spent on Activities. Hours spent on certain activities (self-reported, survey, N=532).

  17. What is a “Data Scientist”? Clustering of 532 data scientists at Microsoft, based on relative time spent in activities: 9 distinct categories.

  18. Category 1: Data Shaper. Analyzing and preparing data; post-graduate degrees; algorithms, machine learning, and optimizations; less familiar with front-end programming.

  19. Category 2: Platform Builder. Instrument code to collect data; big data and distributed systems; back-end and front-end programming; SQL, C, C++ and C#.

  20. Category 3: Data Analyzer. Familiar with statistics; not familiar with front-end programming; difficulty with data transformation; R Studio or statistical analysis.

  21. Common challenges: Data scientists find it difficult to ensure “correctness”. Validation is a major challenge: “Honestly, we don’t have a good method for this.” “Just because the math is right, doesn’t mean that the answer is right.” Explainability is important: “to gain insights, you must go one level deeper.”

  22. Outline: Making a Case for Software Engineering for Data Analytics (SE4DA). ① Studies, Data Scientists: shift to data-centric SW development. ② Differences between traditional SW vs. data-centric SW dev process. ③ Tools: debugging & testing for big data analytics. ④ Open problems in SE4DA.

  23. Part 2. How is Traditional Development Different from Big Data Analytics Development? [Interactions’12] [ICSE-SEIP’19] [NIPS’15] [TSE’19] [ICSE’16] [TSE’18]

  24. Traditional vs. Big Data Analytics Development. Traditional: 1 Develop; 2 Run; 3 Test; 4 Debug; 5 Repeat. Big data analytics: 1 Develop locally; 2 Test locally with sample data; 3 Execute the job on the cloud, hoping that it will work; 4 Several hours later, the job crashes or produces wrong output; 5 Repeat.

  25. Traditional vs. Big Data Analytics Development. 1. Data is huge, remote, and distributed. Steps shown: 1 Develop locally; 2 Test with sample.

  26. Traditional vs. Big Data Analytics Development. 2. Writing tests is hard: we don’t even know the full input, and we don’t know the expected output. 3. Failures are hard to define. Steps shown: 2 Test with sample; 4 The job crashes or produces wrong output.

  27. Traditional vs. Big Data Analytics Development. 4. System stack is complex with little visibility. Operators shown: Filter, Map, Reduce. Step shown: 3 Execute the job on the cloud.

  28. Traditional vs. Big Data Analytics Development. 5. Gap between logical vs. physical execution. Diagram labels: Zipcode, Trips, Map, Map, Filter, Join (⨝), Map, ReduceByKey. Step shown: 3 Execute the job on the cloud.

  29. Traditional vs. Big Data Analytics Development. 6. Data tracing is hard. Error shown: “Task 31 failed 3 times; aborting job. ERROR Executor: Exception in task 31 in stage 0 (TID 31) java.lang.NumberFormatException”. Steps shown: 3 Execute the job on the cloud; 4 The job crashes or produces wrong output; 5 Repeat.
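The error on slide 29 names a task and an exception type but not the record that triggered it, which is what makes tracing hard. The following is a minimal sketch of that situation in plain Python, not Spark code: the `records` list, the "name,age" format, and the helpers `parse_age` and `parse_all_traced` are all hypothetical, and Python's `ValueError` stands in for `java.lang.NumberFormatException`. It shows one possible workaround, which is to keep the raw record next to each parse failure instead of aborting.

```python
# Hypothetical input: "name,age" lines; one record is malformed ("4I").
records = ["alice,34", "bob,27", "carol,4I", "dave,51"]


def parse_age(line):
    """Parse the age field; raises ValueError on non-numeric text."""
    _name, age = line.split(",")
    return int(age)


def parse_all_traced(lines):
    """Parse every line, recording (index, raw line, error message)
    for each failure instead of aborting on the first one."""
    ages, failures = [], []
    for index, line in enumerate(lines):
        try:
            ages.append(parse_age(line))
        except ValueError as error:
            failures.append((index, line, str(error)))
    return ages, failures


ages, failures = parse_all_traced(records)
for index, line, error in failures:
    print(f"record {index}: {line!r} -> {error}")
```

On a real cluster the records are remote and partitioned, so this kind of per-record bookkeeping has to be supported by the runtime rather than added by hand in every job.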

  30. Outline: Making a Case for Software Engineering for Data Analytics (SE4DA). ① Studies, Data Scientists: shift to data-centric SW development. ② Differences between traditional SW vs. data-centric SW dev process. ③ Tools: debugging & testing for big data analytics. ④ Open problems in SE4DA.

  31. Part 3. Debugging and Testing for Big Data Analytics. Tyson Condie, Ari Ekmekji, Muhammad Ali Gulzar, Miryung Kim, Matteo Interlandi, Shaghayegh Mardani, Todd Millstein, Madanlal Musuvathi, Kshitij Shah, Sai Deep Tetali, Seunghyun Yoo

  32. Insights from Debugging and Testing for Apache Spark • Designing interactive debug primitives requires a deep understanding of the internal execution model, job scheduling, and materialization. • Providing traceability requires modifying a runtime. • Abstraction is a powerful force in simplifying program paths.

  33. Enabling interactive debugging requires us to re-think a traditional debugger • Pausing the entire computation on the cluster could reduce throughput. • It is clearly infeasible for a user to inspect billions of records through a regular watchpoint.

  34. BigDebug: Interactive Debug Primitives for Big Data Analytics [ICSE 2016]. Diagram: Program (DAG) with Stage 1 and Stage 2, built from Map, Filter, and Reduce operators, plus Stored Data Records. Primitives: ① Simulated Breakpoint; ② On-Demand Watchpoint; ③ Realtime Repair; ④ Backward Tracing. Annotation shown: age < 0.
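The idea of a watchpoint that does not pause the job can be sketched in a few lines of plain Python. This is an illustration only, not BigDebug's API: `guarded_watchpoint`, the `people` records, and the reading of the slide's `age < 0` annotation as a watchpoint guard are all assumptions. Every record keeps flowing to the next stage, and only the records that satisfy the guard are copied aside for inspection.

```python
def guarded_watchpoint(records, guard, captured):
    """Pass every record through unchanged, copying only those that
    satisfy the guard into `captured` for later inspection."""
    for record in records:
        if guard(record):
            captured.append(record)
        yield record


# Hypothetical (name, age) records flowing between two stages.
people = [("alice", 34), ("bob", -1), ("carol", 27), ("dave", -7)]

captured = []
watched = guarded_watchpoint(people, lambda r: r[1] < 0, captured)

# The downstream stage still consumes the full stream.
names = [name.upper() for name, _age in watched]

print(names)
print(captured)
```

The guard keeps the inspected set small, which addresses the point on slide 33 that a user cannot look through billions of records at a regular watchpoint.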

  35. Titian: Data Provenance for Apache Spark [VLDB 2016]. Diagram: Program (DAG) with Stage 1 and Stage 2, built from Map, Filter, and Reduce operators; a Lineage Table per stage on Workers 1 to 3, joined (⨝) across stages.
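As a rough illustration of what a lineage table makes possible, the sketch below tags each input record with an id, carries the id through a filter and a reduce-by-key step, and then traces an output back to the inputs that produced it. It is plain Python under stated assumptions, not Titian's implementation or Spark code: the `inputs` data, the zipcode filter, and the `trace_back` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: (zipcode, trip_minutes) pairs; one value is negative.
inputs = [("90024", 12), ("90024", -5), ("10001", 30), ("90024", 8)]

# Tag each record with an input id (its position) so stages can carry it.
tagged = list(enumerate(inputs))

# Filter stage: keep zipcodes starting with "9", preserving the ids.
kept = [(rid, rec) for rid, rec in tagged if rec[0].startswith("9")]

# Reduce stage: sum minutes per zipcode. The lineage table maps each
# output key to the set of input ids that contributed to it.
totals = defaultdict(int)
lineage = defaultdict(set)
for rid, (zipcode, minutes) in kept:
    totals[zipcode] += minutes
    lineage[zipcode].add(rid)


def trace_back(output_key):
    """Return the input records that contributed to one output key."""
    return [inputs[rid] for rid in sorted(lineage.get(output_key, ()))]


print(dict(totals))
print(trace_back("90024"))
```

Tracing the suspicious total for "90024" returns three input records, including the negative one, instead of the whole input. In a distributed run the id sets live on different workers per stage, which is why the slide shows lineage tables being joined across stages.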
