Mining Software Engineering Data

Ahmed E. Hassan
Queen’s University
www.cs.queensu.ca/~ahmed
ahmed@cs.queensu.ca

Tao Xie
North Carolina State University
www.csc.ncsu.edu/faculty/xie
xie@csc.ncsu.edu

Some slides are adapted from tutorial slides co-prepared by Jian Pei from Simon Fraser University, Canada.
An up-to-date version of this tutorial is available at http://ase.csc.ncsu.edu/dmse/

Ahmed E. Hassan
• NSERC/RIM Software Engineering Research Chair, Queen’s University, Canada
• Leads the SAIL research group at Queen’s
• Co-chair for Workshop on Mining Software Repositories (MSR) from 2004-2006
• Chair of the steering committee for MSR

A. E. Hassan and T. Xie: Mining Software Engineering Data 2

Tao Xie
• Assistant Professor at North Carolina State University, USA
• Leads the ASE research group at NCSU
• PC Co-Chair of ICSM 2009, MSR 2011
• Co-organizer of 2007 Dagstuhl Seminar on Mining Programs and Processes

Acknowledgments
• Jian Pei, SFU
• Thomas Zimmermann, Microsoft Research
• Peter Rigby, U. of Victoria
• Sunghun Kim, HKUST
• John Anvik, U. of Victoria

Tutorial Goals
• Learn about:
– Recent and notable research and researchers in mining SE data
– Data mining and data processing techniques and how to apply them to SE data
– Risks in using SE data due to, e.g., noise and project culture
• By the end of the tutorial, you should be able to:
– Retrieve SE data
– Prepare SE data for mining
– Mine interesting information from SE data

Mining SE Data
• MAIN GOAL
– Transform static record-keeping SE data to active data
– Make SE data actionable by uncovering hidden patterns and trends
[Figure: SE data sources: Bugzilla, mailings, code repository, execution traces, CVS]
Mining SE Data
• SE data can be used to:
– Gain empirically-based understanding of software development
– Predict, plan, and understand various aspects of a project
– Support future development and project management activities

Overview of Mining SE Data
[Figure: three-layer diagram]
• software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance, …
• data mining techniques: classification, association/patterns, clustering, …
• software engineering data: code bases, change history, program states, structural entities, bug reports, …
[Figure, repeated on three further slides: the same diagram annotated with the venues and years (1999 to 2008) of related publications, including ASE, ICSE, FSE, PLDI, POPL, ISSTA, OSDI, SOSP, OOPSLA, and KDD; one version labels the last data type "bug reports/nl"]

Tutorial Outline
• Part I: What can you learn from SE data?
– A sample of notable recent findings for different SE data types
• Part II: How can you mine SE data?
– Overview of data mining techniques
– Overview of SE data processing tools and techniques
Types of SE Data
• Historical data
– Version or source control: cvs, subversion, perforce
– Bug systems: bugzilla, GNATS, JIRA
– Mailing lists: mbox
• Multi-run and multi-site data
– Execution traces
– Deployment logs
• Source code data
– Source code repositories: sourceforge.net, google code

Historical Data
“History is a guide to navigation in perilous times. History is who we are and why we are the way we are.”
- David C. McCullough

Historical Data
• Track the evolution of a software project:
– source control systems store changes to the code
– defect tracking systems follow the resolution of defects
– archived project communications record rationale for decisions throughout the life of a project
• Used primarily for record-keeping activities:
– checking the status of a bug
– retrieving old code

Percentage of Project Costs Devoted to Maintenance
[Figure: chart of maintenance cost percentage (y-axis 60 to 100) by year (x-axis 1975 to 2005); data points labeled Zelkowitz 79, Lientz & Swanson 81, McKee 1984, Huff 90, Moad 90, Eastwood 93, Port 98, Erlikh 00]

Survey of Software Maintenance Activities
• Perfective: add new functionality
• Corrective: fix faults
• Adaptive: new file formats, refactoring
[Figure: charts of maintenance effort by category from three studies: Lientz, Swanson, Tompkins [1978], MIS Survey; Nosek, Palvia [1990]; Schach, Jin, Yu, Heller, Offutt [2003], Mining ChangeLogs (Linux, GCC, RTP). Values shown: 2.2, 18.2, 39.0, 17.4, 56.7, 60.3]

Source Control Repositories
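As a rough illustration of how change-log messages might be sorted into the three maintenance categories named above (perfective, corrective, adaptive), here is a minimal Python sketch. The keyword lists, the `classify` and `summarize` helpers, and the "unknown" fallback are assumptions made for this example; they are not the method used by any of the surveys cited on the slide.

```python
import re
from collections import Counter

# Hypothetical keyword stems per category. These are illustrative
# assumptions, not heuristics taken from the cited studies.
KEYWORDS = {
    "corrective": ("fix", "bug", "fault", "defect", "crash"),
    "adaptive": ("refactor", "port", "format", "rename", "cleanup"),
    "perfective": ("add", "new", "feature", "support", "implement"),
}


def classify(message: str) -> str:
    """Return the first category with a stem that prefixes a word in the message."""
    words = re.findall(r"[a-z]+", message.lower())
    for category, stems in KEYWORDS.items():
        if any(word.startswith(stem) for word in words for stem in stems):
            return category
    return "unknown"


def summarize(messages):
    """Return the percentage of messages assigned to each category."""
    counts = Counter(classify(m) for m in messages)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


if __name__ == "__main__":
    log = [
        "Fix crash in parser",
        "Add export feature",
        "Fix typo in error message",
        "Refactor logging module",
    ]
    for category, share in summarize(log).items():
        print(f"{category}: {share:.1f}%")
```

Categories are checked in dictionary order, so a message matching several lists is assigned to the first one; a real study would need a more careful scheme and manual validation, since commit messages are noisy.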