Automatic Identification of Bug-fix Commits: The Case of GitHub Projects
Yujuan Jiang, Rodrigo Morales, Bram Adams, Foutse Khomh
• Case study projects
• Approach
• Research questions
• Result (so far)
Case Study Projects
Keywords: GitHub, C language
Approach
• Data Collection
• Feature Extraction (Text & Source code)
• Model Training
• Evaluation
Approach: Data Collection
Approach: Feature Extraction
• Textual Analysis: keywords
• Code Analysis
Approach: Feature Extraction
1) Textual Analysis: keywords + feature words
• Feature words: all words → stem + remove stop words → filter
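The textual pipeline above (all words, stemming and stop-word removal, filtering) could be sketched roughly as follows. This is only an illustration under stated assumptions: the keyword list, the stop-word list, the suffix-stripping `crude_stem` helper, and the document-frequency filter with its `min_commits` threshold are all hypothetical stand-ins, since the slides do not name the actual lists, stemmer, or filter criterion.

```python
import re
from collections import Counter

# Hypothetical lists for illustration; the slides do not give the actual ones.
KEYWORDS = {"fix", "bug", "error", "crash"}
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "to", "and", "for", "is", "when"}


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(message):
    """Lowercase the message, keep alphabetic words, drop stop words, stem."""
    words = re.findall(r"[a-z]+", message.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def select_feature_words(messages, min_commits=2):
    """Assumed filter step: keep stems found in at least `min_commits` messages."""
    doc_freq = Counter()
    for msg in messages:
        doc_freq.update(set(tokenize(msg)))
    return sorted(w for w, n in doc_freq.items() if n >= min_commits)


def text_features(message, feature_words):
    """Return keyword counts followed by feature-word counts for one message."""
    counts = Counter(tokenize(message))
    keyword_part = [counts[k] for k in sorted(KEYWORDS)]
    feature_part = [counts[w] for w in feature_words]
    return keyword_part + feature_part
```

Each commit message then becomes a fixed-length numeric vector, which is the shape a downstream classifier typically expects.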
Approach: Feature Extraction
2) Source Code Analysis: Patch Parser + re Script
• Commits → Parser → Commit Profile → Features
• Features: # of while loops, # of ifs, # of boolean ......
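A patch parser combined with regular expressions might look something like the sketch below. It assumes patches arrive as unified diffs and counts only two of the listed features (while loops and ifs) on added and removed lines; the function names and the `PATTERNS` table are hypothetical, and the patterns are rough approximations that do not skip comments or string literals.

```python
import re

# Rough approximations of C constructs; comments and strings are not excluded.
PATTERNS = {
    "while_loops": re.compile(r"\bwhile\s*\("),
    "ifs": re.compile(r"\bif\s*\("),
}


def changed_lines(patch):
    """Yield added and removed lines of a unified diff, without the +/- prefix."""
    for line in patch.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers (a removed line starting with "--" is also skipped)
        if line.startswith(("+", "-")):
            yield line[1:]


def commit_profile(patch):
    """Count each pattern over the changed lines of one patch."""
    counts = {name: 0 for name in PATTERNS}
    for line in changed_lines(patch):
        for name, pattern in PATTERNS.items():
            counts[name] += len(pattern.findall(line))
    return counts
```

Further counts would be added as extra entries in `PATTERNS`; a production parser would likely need real tokenization of the C source rather than line-level regular expressions.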
Approach: Feature Extraction
Approach: Model Training
• Black data: manually label 300 bug-fixing commits for each project
• Grey data: unlabelled
• White data: bottom k of the grey data, selected by LPU
• Black data + White data → SVM, Random Forest
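The selection of white data can be illustrated with a small sketch. It assumes that "bottom k" means the k unlabelled commits that the LPU step scores as least likely to be bug fixes, and that those scores are available as a mapping from commit id to a number; both the score format and the function names are hypothetical.

```python
def select_white(grey_scores, k):
    """Return ids of the k unlabelled commits with the lowest scores.

    `grey_scores` maps commit id -> score, where a higher score is assumed
    to mean "more likely a bug fix" (as produced by the LPU step).
    """
    ranked = sorted(grey_scores, key=grey_scores.get)
    return ranked[:k]


def build_training_set(black_ids, grey_scores, k):
    """Label black commits 1 and the bottom-k grey commits 0."""
    white_ids = select_white(grey_scores, k)
    return [(c, 1) for c in black_ids] + [(c, 0) for c in white_ids]
```

The resulting labelled pairs are what a second-stage classifier such as an SVM or a random forest would be trained on; a larger k yields more negative examples at the risk of including mislabelled bug fixes.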
Approach: Evaluation
Research Questions
• Does our classifier outperform the baseline keyword-based approach?
• How does the parameter k affect the classifier?
• Which metrics matter most for identifying bug-fixing commits?
• Is the hybrid approach (LPU combined with SVM) more effective than a single classifier?
• Which combination of LPU options makes the classifier work best?
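For reference, a keyword-based baseline of the kind mentioned in the first question is often just a pattern match on the commit message. The sketch below is one plausible form; the keyword list is hypothetical, since the slides do not say which keywords the baseline uses.

```python
import re

# Hypothetical keyword list for illustration only.
BUG_KEYWORDS = ("fix", "bug", "defect", "fault", "crash")
BUG_PATTERN = re.compile(r"\b(" + "|".join(BUG_KEYWORDS) + r")\w*", re.IGNORECASE)


def keyword_baseline(message):
    """Return True if the commit message contains a word starting with a keyword."""
    return BUG_PATTERN.search(message) is not None
```

Because the match is anchored at a word boundary, "Fixed" and "Bugfix" are flagged while "prefix" is not.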
Result (so far): recall
• Libgit2: 76.95%
• openFrameworks: 96.67%
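Recall here is the fraction of true bug-fixing commits that the classifier identifies. A minimal sketch of the computation, assuming labels are encoded as 1 for bug fix and 0 otherwise (the encoding and function name are assumptions, not taken from the slides):

```python
def recall(true_labels, predicted_labels):
    """Recall = true positives / (true positives + false negatives)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("no positive examples in true_labels")
    return tp / (tp + fn)
```

Recall alone does not penalize false positives, so it is usually read alongside precision.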
Result (so far): key features
[Figure: dot plot of per-feature scores for Libgit2; features labelled X2 to X51 and X16676, listed starting with X5, X6, X7, X22, X20; horizontal axis from 0.000 to 0.030]
[Figure: the same dot plot annotated with LPU and SVM]