Smelling Source Code Using Deep Learning Tushar Sharma http://www.tusharma.in
What is a smell? "…certain structures in the code that suggest (sometimes they scream for) the possibility of refactoring." - Kent Beck 20 Definitions of smells: http://www.tusharma.in/smells/smellDefs.html Smells' catalog: http://www.tusharma.in/smells/
Implementation smells
Design Smells
Architecture Smells
How do smells get detected?
Metrics-based smell detection Code (or source artifact) → Source model → Metrics → Smells
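A metrics-based detector of this kind can be sketched in a few lines. This is a minimal illustration, not the tool used in the talk: the keyword-counting approximation of cyclomatic complexity and the threshold of 8 are assumptions chosen for the example.

```python
import re

# Rough list of decision points that add a path through a method.
# This is an approximation; a real tool would work on a parsed source model.
_DECISION = re.compile(r"\b(if|for|while|case|catch)\b|&&|\|\||\?")


def cyclomatic_complexity(method_source: str) -> int:
    """Approximate cyclomatic complexity as 1 + number of decision points."""
    return 1 + len(_DECISION.findall(method_source))


def is_complex_method(method_source: str, threshold: int = 8) -> bool:
    """Flag the method when the metric exceeds the (hypothetical) threshold."""
    return cyclomatic_complexity(method_source) > threshold


sample = """
void f(int x) {
    if (x > 0 && x < 10) { g(); }
    for (int i = 0; i < x; i++) { h(i); }
}
"""

print(cyclomatic_complexity(sample), is_complex_method(sample))
```

The detector is only as good as the metric and the threshold, which is one motivation for learning the decision function from examples instead.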
Machine learning-based smell detection Training: Existing examples → Machine learning algorithm → Trained model f(x). Detection: Code (or source artifact) → Source model → Trained model f(x) → Smells
Machine learning-based smell detection Existing academic work: - Support vector machines - Bayesian belief network - Logistic regression - CNN. Take metrics as the features/input: m → f(m). Validation on balanced samples
Research questions RQ1: Would it be possible to use deep learning methods to detect code smells? RQ2: Is transfer-learning feasible in the context of detecting smells? Transfer-learning refers to the technique where a learning algorithm exploits the commonalities between different learning tasks to enable knowledge transfer across the tasks
Overview C# and Java repositories → CodeSplit → Code fragments; smell detection → Detected smells; Code fragments + Detected smells → Learning data generator → Positive and negative samples → Preprocess (Tokenizer) → Tokenized samples → Deep learning models → Research questions
Data Curation
Repositories download C# and Java repositories; figures shown: 2,528, 1,072, and 100 selected repositories. Repository attributes: Architecture, Community, CI, Documentation, History, License, Issues, Unit test, Stars
Splitting code fragments C# and Java source code → CodeSplit → Code fragments (methods or classes) https://github.com/tushartushar/CodeSplitJava
Smell detection C# and Java code → Detected code smells. Java: https://github.com/tushartushar/DesigniteJava http://www.designite-tools.com/
Generating training and evaluation samples Code smells + Code fragments → Sample generator → Positive and negative samples
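The sample generation step can be pictured as a simple labelling pass: a fragment is a positive sample if the smell detector reported it, and a negative sample otherwise. The sketch below assumes a hypothetical `Fragment` record and a list of smelly fragment identifiers; neither name comes from the talk's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    path: str    # hypothetical identifier of the method or class
    source: str  # fragment text produced by the splitter


def generate_samples(fragments, smelly_paths):
    """Split fragments into positive (smelly) and negative (clean) samples."""
    smelly = set(smelly_paths)
    positive = [f for f in fragments if f.path in smelly]
    negative = [f for f in fragments if f.path not in smelly]
    return positive, negative


fragments = [
    Fragment("A.run", "void run() { try { go(); } catch (Exception e) { } }"),
    Fragment("A.stop", "void stop() { halt(); }"),
]
# Assume the detector reported an empty catch block in A.run.
positive, negative = generate_samples(fragments, smelly_paths=["A.run"])
print(len(positive), len(negative))
```

Running one such pass per smell yields a separate positive/negative set for each smell under study.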
Tokenizing learning samples Code fragments → Tokenizer → Tokenized samples https://github.com/dspinellis/tokenizer
Tokenizing learning samples
Example method:
    public void InternalCallback(object state)
    {
        Callback(State);
        try
        {
            timer.Change(Period, TimeSpan.Zero);
        }
        catch (ObjectDisposedException)
        {
        }
    }
1-D: 123 2002 40 2003 41 59 474 123 2004 46 2005
2-D:
    123 2002 40 2003 41 59
    474 123 2004 46 2005 40 2006 44 2007 46 2008
    125 329 40 2009 41 123 125 125
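The difference between the 1-D and 2-D forms can be sketched with plain lists. The token ids are copied from the 2-D example on this slide. The helper names and the use of 0 as a padding value are assumptions made for illustration.

```python
# Token rows copied from the 2-D example on the slide.
rows = [
    [123, 2002, 40, 2003, 41, 59],
    [474, 123, 2004, 46, 2005, 40, 2006, 44, 2007, 46, 2008],
    [125, 329, 40, 2009, 41, 123, 125, 125],
]


def to_1d(rows):
    """Flatten the rows into one token sequence (1-D sample)."""
    return [token for row in rows for token in row]


def to_2d(rows, pad=0):
    """Pad each row to the longest row so the sample is rectangular (2-D)."""
    width = max(len(row) for row in rows)
    return [row + [pad] * (width - len(row)) for row in rows]


flat = to_1d(rows)
grid = to_2d(rows)
print(len(flat), len(grid), len(grid[0]))
```

The 1-D form keeps only token order, while the 2-D form also preserves the line structure of the fragment.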
Data preparation 5,146 and 311,533 samples → 70-30 split → 3,602 and 218,073 (70%); 1,544 and 93,460 (30%). Training samples: 3,602 and 3,602. Evaluation samples: 1,544 and 93,460
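The figures on this slide are consistent with a 70-30 split followed by balancing the training set. The sketch below reproduces the arithmetic. It assumes that 5,146 is the count of positive samples and 311,533 the count of negative samples, and that balancing is done by random undersampling; the slide text does not state either point explicitly.

```python
import random


def split_70_30(n):
    """Return (train, evaluation) counts for a 70-30 split of n samples."""
    train = n * 70 // 100
    return train, n - train


# Assumed labelling: the smaller group is the positive (smelly) class.
pos_train, pos_eval = split_70_30(5146)
neg_train, neg_eval = split_70_30(311533)

# Balance the training set by drawing as many negatives as there are positives.
rng = random.Random(0)
neg_train_ids = rng.sample(range(neg_train), pos_train)

print(pos_train, neg_train, pos_eval, neg_eval, len(neg_train_ids))
```

Only the training set is balanced here; the evaluation set keeps the original class ratio, as the evaluation counts on the slide suggest.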
Selection of smells • Complex method: the method has high cyclomatic complexity • Magic number: an unexplained numeric literal is used in an expression • Empty catch block: a catch block of an exception is empty • Multifaceted abstraction: a class has more than one responsibility assigned to it
Architecture - CNN
Layers: Inputs → [Convolution layer → Batch normalization layer → Max pooling layer] (repeat this set of hidden units) → Dropout layer (0.1) → Flatten layer → Dense layer 1 (32, relu) → Dense layer 2 (1, sigmoid) → Output
• Filters = {8, 16, 32, 64}
• Kernel size = {5, 7, 11}
• Pooling window = {2, 3, 4, 5}
• Dynamic batch size = {32, 64, 128, 256}
• Callbacks
  • Early stopping (patience = 5)
  • Model checkpoint
Architecture - RNN
Layers: Inputs → Embedding layer → [LSTM layer] (repeat this set of hidden units) → Dropout layer (0.2) → Dense layer (1, sigmoid) → Output
• Dimensionality of embedding layer = {16, 32}
• LSTM units = {32, 64, 128}
• Dynamic batch size = {32, 64, 128, 256}
• Callbacks
  • Early stopping (patience = 2)
  • Model checkpoint
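The two callbacks can be illustrated without a deep learning framework. The sketch below simulates early stopping with a patience value and records which epoch a model checkpoint would keep. It assumes the monitored quantity is validation loss; the loss values are made up for the example.

```python
def train_with_early_stopping(val_losses, patience):
    """Return (epochs_run, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: this is where a checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch + 1, best_epoch
    return len(val_losses), best_epoch


losses = [0.70, 0.62, 0.60, 0.61, 0.63, 0.59]  # illustrative values only
print(train_with_early_stopping(losses, patience=2))
print(train_with_early_stopping(losses, patience=5))
```

With the smaller patience the run stops before the late improvement in the last epoch, which shows the trade-off between training time and the chance of missing a better model.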
Running experiments
• Phase 1: Grid search for optimal hyper-parameters
  • Validation set: 20%
  • Number of configurations
    • CNN = 144
    • RNN = 18
• Phase 2: experiment with the optimal hyper-parameters
GRNET supercomputing facility; each experiment using 1 GPU with 64 GB memory
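The configuration counts can be reproduced by enumerating a grid. The sketch below is one reading that matches the reported numbers: it assumes the grid is the full cross-product of the hyper-parameter sets from the architecture slides and 1 to 3 repetitions of the hidden-unit set (the layer counts compared later in the talk).

```python
from itertools import product

layers = [1, 2, 3]  # assumed: repetitions of the hidden-unit set

# CNN: filters x kernel size x pooling window x layers
cnn_grid = list(product([8, 16, 32, 64], [5, 7, 11], [2, 3, 4, 5], layers))

# RNN: embedding dimensionality x LSTM units x layers
rnn_grid = list(product([16, 32], [32, 64, 128], layers))

print(len(cnn_grid), len(rnn_grid))
```

Each tuple in a grid corresponds to one training run in phase 1.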
Results
RQ1. Would it be possible to use deep learning methods to detect code smells? [Figure: bar charts of F1 and AUC-ROC per smell (CM: complex method, ECB: empty catch block, MN: magic number, MA: multifaceted abstraction) for CNN-1D, CNN-2D, and RNN; F1 bar labels range from 0.02 to 0.57]
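For reference, F1 is the harmonic mean of precision and recall on the positive (smelly) class. A minimal sketch of the computation follows; the labels and predictions are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = smelly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


y_true = [1, 1, 0, 0, 0, 0, 0, 0]  # illustrative ground truth
y_pred = [1, 0, 1, 0, 0, 0, 0, 0]  # illustrative predictions
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Because F1 ignores true negatives, it stays informative when clean fragments vastly outnumber smelly ones in the evaluation set.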
CNN-1D vs CNN-2D
Panel 1: CNN-1D (max) - 0.05; CNN-2D (max) - 0.04
Panel 2: CNN-1D (max) - 0.40; CNN-2D (max) - 0.39
Panel 3: CNN-1D (max) - 0.36; CNN-2D (max) - 0.35
Panel 4: CNN-1D (max) - 0.18; CNN-2D (max) - 0.16
CNN vs RNN
Smell   RNN and CNN-1D   RNN and CNN-2D
CM      -22.94           -33.81
ECB     80.23            91.94
MN      48.96            38.58
MA      -349.12          -205.26
Difference in percentage; comparing max F1
Are more deep layers always good?
Model    Layers   CM     ECB    MN     MA
CNN-1D   1        0.36   0.05   0.36   0.08
CNN-1D   2        0.40   0.05   0.36   0.18
CNN-1D   3        0.40   0.05   0.36   0.19
CNN-2D   1        0.39   0.04   0.35   0.07
CNN-2D   2        0.39   0.04   0.34   0.16
CNN-2D   3        0.39   0.05   0.34   0.10
RNN      1        0.34   0.21   0.48   0.28
RNN      2        0.36   0.24   0.48   0.22
RNN      3        0.37   0.23   0.48   0.20
RQ2: Is transfer-learning feasible in the context of detecting smells? [Figure: bar charts of F1 per smell (CM, ECB, MN, MA) for CNN-1D and CNN-2D, comparing transfer-learning with direct-learning; F1 bar labels range from 0.01 to 0.57]
Conclusions • It is feasible to make the deep learning model learn to detect smells. • Transfer-learning is feasible. • Improvements: many possibilities - Performance - Add more smells of different kinds
Relevant links • Source code and data: https://github.com/tushartushar/DeepLearningSmells • Smell detection tool: Java - https://github.com/tushartushar/DesigniteJava; C# - http://www.designite-tools.com • CodeSplit: Java - https://github.com/tushartushar/CodeSplitJava; C# - https://github.com/tushartushar/DeepLearningSmells/tree/master/CodeSplit • Tokenizer: https://github.com/dspinellis/tokenizer
Thank you!! Courtesy: spikedmath.com