The Many Faces of Software Analytics David Lo School of Information Systems Singapore Management University davidlo@smu.edu.sg Talk at the University of Luxembourg, Dec 2014
A Brief Self-Introduction [Map: two marked locations, 6,496 miles or 10,454 km] 2
A Brief Self-Introduction From Wikipedia 3
A Brief Self-Introduction 4
Singapore Management University Third university in Singapore Number of students: 7000+ (UG) 1000+ (PG) Schools: Information Systems Economics Law Business Accountancy Social Science 5
School of Information Systems Undergraduates: 1000+ Master students: 100+ Doctoral students: 50+ 6
Our Research Group @ SMU 7
Our Research Group @ SMU 9 PhD Students 1 Visiting Professor 1 Research Engineer (Jan 2015) 8
Software Analytics "Data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services" (Zhang and Xie, 2012) 9
Software Analytics: Definition Analysis of a large amount of software data stored in various repositories in order to: Understand software development process Help improve software maintenance Help improve software reliability And more 10
Software Analytics Mailings Bugzilla Code Dev. Execution SVN Network traces 11
Research Directions: Software Analytics Analytics for Coding & Collaboration Analytics for Testing & Debugging Analytics for Requirement & Design Validation 12
Our Past and Current Work Analytics for Coding & Collaboration 13
Intelligent Multi Modal Code Search 14
Intelligent Multi Modal Code Search
User Query (e.g., structured query, free text, code example…) → Code Search Engine ← Code base (version control system, collaboration sites…)
Code Search Engine → Relevant Code (e.g., code fragment, method, class, projects, …) 15
Intelligent Multimodal Code Search
Dependence Query Language: Nodes: func A, func B, var C, var D; Relations: C dataDepends A, D dataDepends B, D isFieldOf C; Targets: D
Free Text: How do I load properties from an XML file?
Code Examples
[Figure: the query forms above as inputs to the Code Search Engine] 16
Structured Code Search (ASE10) A developer can define a query about the dependence relationship in a bug pattern or a need-to-refactor code pattern. Using our search engine, he/she can find x1, x2, and x3, which are instances of the code pattern. [Figure: Bug Report, Query, Dependence Based Code Search Engine, Codes containing X1, X2, X3] 17
Workflow of Our Approach
Query side: Query → Query Graph Construction and Splitting → Query Graphs → Graph Query Processing → Post-Filtering and Merging → Query Results
Code side: Code → SDG → SDG Indexing → Indexed Graph 18
Dependence Query Language (DQL) Allows developers to describe a target Involving several code elements Including the dependencies between the elements Composed of 4 parts Query identifier declarations [D] Code element (node) constraints [N] Relation constraints [R] Desired target identifiers [T] 19
Dependence Query Language (DQL)
Node Description [N]: Code element constraints: contains <Text>, inFile <FileName>, inFunction <FnName>, controlType <for/while/switch/if>, etc.
Relation Description [R]: Relationship constraints: A (transitively) controls B, A calls B, A is data dependent on B, A is one step (directly) <depend-operation> on B, A textual contains B, etc. 20
Query Splitting
Split a query with disjunctions of conditions
Result: Multiple queries with only conjunctions
Original query:
  function/control-point A, variable B; A contains "abc" or contains "de"; A dataDepends B; want A
Split queries:
  function A, variable B; A contains "abc"; A dataDepends B; want A
  control-point A, variable B; A contains "abc"; A dataDepends B; want A
  function A, variable B; A contains "de"; A dataDepends B; want A
  control-point A, variable B; A contains "de"; A dataDepends B; want A 21
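The splitting step above amounts to expanding every disjunction into its alternatives. A minimal sketch, assuming a query is held as plain Python data; the function name `split_query` and the dictionary layout are hypothetical and are not part of DQL or the tool described in the talk:

```python
from itertools import product

def split_query(declarations, constraints, target):
    """Expand disjunctions into queries that contain only conjunctions.

    declarations: identifier -> list of alternative node kinds
    constraints:  list of constraints, each given as a list of alternatives
    target:       identifier the developer wants back
    (This data layout is an assumption made for illustration.)
    """
    ids = list(declarations)
    queries = []
    # Choose one node kind per identifier ...
    for kinds in product(*(declarations[i] for i in ids)):
        # ... and one alternative per constraint.
        for picked in product(*constraints):
            queries.append({
                "declarations": dict(zip(ids, kinds)),
                "constraints": list(picked),
                "target": target,
            })
    return queries

# The example from the slide: two kinds for A, two alternatives for "contains".
queries = split_query(
    declarations={"A": ["function", "control-point"], "B": ["variable"]},
    constraints=[['A contains "abc"', 'A contains "de"'], ["A dataDepends B"]],
    target="A",
)
for q in queries:
    print(q)
```

With two node kinds for A and two textual alternatives, the expansion yields the four conjunctive queries listed on the slide.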
Query Graph Construction
Query Declarations: Each identifier becomes a node in the query graph
Relation Descriptions: Each dependence relation becomes an edge in the query graph
[Figure: query graph with nodes A: declaration, B: actual-out, C: expression] 22
Query Graph Splitting
Divide the query graph into two sub-graphs
Each captures only control OR data dependences
[Figure: query graph with nodes A: declaration, B: actual-out, C: expression, D: control point, split into two sub-graphs] 23
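Splitting a query graph by dependence type can be sketched as a partition of its edges. The edge list below is hypothetical, since the edges in the slide's figure are not recoverable from the text. The relation names `controls` and `dataDepends` follow the DQL vocabulary used earlier, but which relations count as control or data is an assumption:

```python
# Assumed grouping of relation names into the two dependence families.
CONTROL_RELATIONS = {"controls"}
DATA_RELATIONS = {"dataDepends"}

def split_by_dependence(edges):
    """Partition (source, relation, destination) edges into two sub-graphs."""
    control = [e for e in edges if e[1] in CONTROL_RELATIONS]
    data = [e for e in edges if e[1] in DATA_RELATIONS]
    return control, data

# Hypothetical query graph over the node names shown on the slide.
edges = [
    ("D", "controls", "B"),
    ("D", "controls", "C"),
    ("B", "dataDepends", "A"),
    ("C", "dataDepends", "B"),
]
control_graph, data_graph = split_by_dependence(edges)
print("control:", control_graph)
print("data:", data_graph)
```

Each sub-graph can then be matched against the index separately, and the sub-results combined afterwards.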
Graph Indexing and Query
Purpose: Locate all instances of a given graph pattern in a large graph (Cheng et al., ICDE08)
[Figure: panels (a) and (b) showing a labeled graph and a query graph; three results found, marked with a triangle, a square, and a star] 24
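To make "locate all instances of a graph pattern" concrete, here is a naive backtracking matcher over labeled directed graphs. It is a plain brute-force sketch, not the indexed approach of Cheng et al., and the small graph and pattern are hypothetical rather than taken from the slide's figure:

```python
def find_matches(pattern_nodes, pattern_edges, graph_nodes, graph_edges):
    """Return every mapping of pattern nodes to distinct graph nodes that
    preserves labels and pattern edges.

    *_nodes: dict node -> label; *_edges: set of (source, destination).
    """
    order = list(pattern_nodes)
    results = []

    def consistent(mapping):
        # Every pattern edge whose endpoints are both mapped must exist in the graph.
        return all(
            (mapping[u], mapping[v]) in graph_edges
            for (u, v) in pattern_edges
            if u in mapping and v in mapping
        )

    def extend(mapping):
        if len(mapping) == len(order):
            results.append(dict(mapping))
            return
        p = order[len(mapping)]
        for g, label in graph_nodes.items():
            if label == pattern_nodes[p] and g not in mapping.values():
                mapping[p] = g
                if consistent(mapping):
                    extend(mapping)
                del mapping[p]

    extend({})
    return results

# Hypothetical data: pattern "an A node pointing to a B node".
pattern_nodes = {"a": "A", "b": "B"}
pattern_edges = {("a", "b")}
graph_nodes = {"A1": "A", "A2": "A", "A3": "A", "B1": "B", "B2": "B"}
graph_edges = {("A1", "B1"), ("A2", "B2")}

matches = find_matches(pattern_nodes, pattern_edges, graph_nodes, graph_edges)
print(matches)
```

A brute-force search like this scales poorly on large dependence graphs, which is the motivation for building an index first.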
Result Filtering & Merging Result Filtering Textual conditions (e.g., textual contains) Other relation descriptions Result Merging Split 1: Disjunctions Split 2: Data vs. Control Dependences Need to union the sub-results 25
Evaluation
Two open source projects: expat, gpsbabel
Project name  Description           Version     Size (LOC)
expat         XML handling library  2002-05-17  13
                                    2002-05-22  13
gpsbabel      GPS toolkit           2004-10-27  50
                                    2005-03-21  54
Four software maintenance tasks
From pairs of snapshots from version histories
Developer change = Gold standard 26
Overall Results: Accuracy
Task  # Targets  Text Search     Code Clone Detection  Our approach
                 FP        FN    FP    FN              FP       FN
1     2          526       0     0     2               36       0
2     8(186)     829(651)  0     0     8               200(22)  0
3     37         297       0     23    3               25       2
4     19         86        0     9     2               3        0
For task 2, the numbers in brackets are the adjusted numbers after considering correct locations that are not yet modified by developers 27
Free Text Code Search (FSE12) Find optimum connected graph that meets user needs Greedy subgraph search algorithm with shortest path indexing 28
Example Based Code Search (ASEJ15)
Example 1 and Example 2 (two code fragments):
  if(b > 1){ b = ext() + foo(); c = ext(); }
  if(c > 3){ c = getStr(); }
PDGs Generation Engine: Lightweight type inference, Extend to compilable codes, Generate PDGs
Query Generation Engine: Closed subgraph mining (Mine common subgraphs), Recover textual information, Generate dependency query
        Our    Manual
Prec.   0.684  0.584
Recall  0.721  0.767
F1      0.702  0.664 29
Coding & Collaboration
Structured Code Search (ASE10)
Free Text Code Search (FSE12)
Active Code Search (ASE14)
Example Based Code Search (ASEJ15)
Multi-Criteria Project Search (ICECCS13)
Structured + Topic Model (WCRE10)
Similar Project Search (ICSM12) 30
Coding & Collaboration
Recommending Answer Posts (ASE11)
Recommending Related Libraries (WCRE13)
Recommending API Methods Given Feature Requests (ASE13) 31
Coding & Collaboration Observatory of Automated Content Recommending Developer Tweets and Trends Categorization Recommendation Tags to Contents (ASE11) (ICPC14) (MSR13, ICSME14) (WCRE11) Project Recommending Identification of Best Answerers Success Relevant Microblogs (QMC13) Estimation (ICSM12) (CSMR13) 32
Coding & Collaboration Software Diffusion Collaboration Coding Practice New Media Patterns Usages APSEC12 PLOS13 WCRE10 MUD14 COMPSAC13 CSMR13 CSMR13 SAC13 MSR12 33
Our Past and Current Work Analytics for Testing & Debugging 34
Bug Finding and Fixing are Hard! Software bugs cost the US economy 59.5 billion dollars annually, as stated by the US National Institute of Standards and Technology in 2002 (Tassey, 2002). Software debugging is an expensive and time-consuming task in software projects. Testing and debugging activities account for 30-90% of the labor expended on a project (Beizer, 1990). 35
Bug Finding Techniques A buggy program Analyze program List of possible buggy program elements 36
Bug Finding Techniques Bug Report Failure Bug Finder Anomaly 37
Spectrum-Based Fault Localization
Block ID  Program Element
1         double a, x; double ap, del, sum; int n; double temp; if (x <= 0.0)
2         { return 0.0; }
3         del = sum = 1.0 / (ap = a); for (n = 1; n <= ITMAX; ++n) {
4         sum += del *= x / ++ap; if (Abs(del) < Abs(sum) * EPS) {
5         /* BUGS: supposed to be: */ /* temp = sum * exp(-x + a * log(x) - Lgamma(a)) */ temp = sum * exp(x + a * log(x) - Lgamma(a)); return temp; } }
Program spectra: one column per test case (T1, T2, T3, T4, …); the last row gives the Status of Test Case Execution (F or P) 38
Measuring suspiciousness
Program Elements → Suspiciousness Scores
e.g., spectrum-based fault localization (Abreu et al., TAICPART-MUTATION’07; Lucia et al., ICSM’10) 39
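A suspiciousness score can be computed directly from program spectra. The sketch below uses the Ochiai measure, a commonly used spectrum-based formula, on hypothetical coverage data for a five-block program. The test names, coverage sets, and helper names are assumptions for illustration and are not the data from the slide:

```python
import math

def ochiai(ef, ep, nf):
    """Ochiai score: ef / sqrt((ef + nf) * (ef + ep)).

    ef: failing tests that execute the block
    ep: passing tests that execute the block
    nf: failing tests that do not execute the block
    """
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

def rank_blocks(coverage, outcomes):
    """coverage: test -> set of executed block ids; outcomes: test -> 'P' or 'F'."""
    total_failed = sum(1 for o in outcomes.values() if o == "F")
    blocks = set().union(*coverage.values())
    scores = {}
    for b in blocks:
        ef = sum(1 for t, cov in coverage.items() if b in cov and outcomes[t] == "F")
        ep = sum(1 for t, cov in coverage.items() if b in cov and outcomes[t] == "P")
        scores[b] = ochiai(ef, ep, total_failed - ef)
    # Most suspicious first; ties broken by block id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical spectra: only the failing test reaches block 5.
coverage = {"T1": {1, 2}, "T2": {1, 3, 4, 5}, "T3": {1, 3, 4}}
outcomes = {"T1": "P", "T2": "F", "T3": "P"}

ranking = rank_blocks(coverage, outcomes)
for block, score in ranking:
    print(block, round(score, 3))
```

Blocks executed mostly by failing tests rise to the top of the list, which is what a developer would inspect first.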
Motivation There is no single fault localization technique that is the best in all cases. (Lucia et al., JSEP, 2014) Combine different techniques? 40
Fusion Localizer (ASE14) 41
Step 2. Techniques selection
A set of fault localization techniques → Choosing the techniques to be fused: (A) Overlap-based selection, (B) Bias-based selection → Selected fault localization techniques 42
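Once techniques have been selected, their scores have to be combined into a single ranking. The slides shown here do not spell out that step, so the following is only an illustrative sketch that assumes min-max normalization followed by summing the normalized scores (the classic CombSUM data fusion rule). The function names and the score values are hypothetical:

```python
def normalize(scores):
    """Min-max normalize one technique's scores into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {elem: 0.0 for elem in scores}
    return {elem: (s - lo) / (hi - lo) for elem, s in scores.items()}

def comb_sum(score_lists):
    """Sum the normalized scores each program element receives (CombSUM)."""
    fused = {}
    for scores in score_lists:
        for elem, s in normalize(scores).items():
            fused[elem] = fused.get(elem, 0.0) + s
    return fused

# Hypothetical outputs of two selected techniques, on different scales.
technique_a = {"b1": 0.2, "b2": 0.9, "b3": 0.5}
technique_b = {"b1": 10, "b2": 30, "b3": 40}

fused = comb_sum([technique_a, technique_b])
for elem in sorted(fused, key=fused.get, reverse=True):
    print(elem, round(fused[elem], 3))
```

Normalizing first keeps a technique with a larger numeric range from dominating the fused ranking.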