Smelling Source Code Using Deep Learning - Tushar Sharma


  1. Smelling Source Code Using Deep Learning Tushar Sharma http://www.tusharma.in

  2. What is a smell? "…certain structures in the code that suggest (sometimes they scream for) the possibility of refactoring." - Kent Beck
      20 definitions of smells: http://www.tusharma.in/smells/smellDefs.html
      Smells catalog: http://www.tusharma.in/smells/

  3. Implementation smells

  4. Design smells

  5. Architecture smells

  6. How are smells detected?

  7. Metrics-based smell detection: code (or another source artifact) is parsed into a source model, metrics are computed from the model, and rules over the metrics flag smells.
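      A minimal sketch of how a metrics-based detector works (the metric name and threshold below are illustrative assumptions, not Designite's actual rules): a smell is flagged when a metric computed on the source model crosses a threshold.

          # Metrics-based smell detection as a threshold rule over a computed metric.
          # The metric key and the threshold of 10 are illustrative assumptions.
          def detect_smells(method_metrics):
              """method_metrics: metrics computed from the source model for one method."""
              smells = []
              if method_metrics.get("cyclomatic_complexity", 0) > 10:
                  smells.append("Complex Method")
              return smells

          print(detect_smells({"cyclomatic_complexity": 12}))  # ['Complex Method']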

  8. Machine learning-based smell detection: existing labeled examples are fed to a machine learning algorithm, which produces a trained model f(x); new code (or another source artifact) is converted into a source model and passed to f(x), which reports smells.

  9. Machine learning-based smell detection - existing academic work: support vector machines, Bayesian belief networks, logistic regression, and CNNs. These approaches take metrics as the features/input (m -> f(m)) and validate on balanced samples.

  10. Research questions
      RQ1: Would it be possible to use deep learning methods to detect code smells?
      RQ2: Is transfer-learning feasible in the context of detecting smells?
      Transfer-learning refers to the technique where a learning algorithm exploits the commonalities between different learning tasks to enable knowledge transfer across the tasks.

  11. Overview: C# and Java repositories are split by CodeSplit into code fragments; the detected smells and the code fragments feed the learning-data generator, which produces positive and negative samples; a preprocessing step runs the Tokenizer to produce tokenized samples, which are fed to the deep learning models to answer the research questions.

  12. Data Curation

  13. Repositories: 1,072 C# and 2,528 Java repositories downloaded; 100 repositories selected based on quality attributes: architecture, community, CI, documentation, history, license, issues, unit tests, and stars.

  14. Splitting code fragments: CodeSplit splits C# and Java source code into code fragments (methods or classes). https://github.com/tushartushar/CodeSplitJava

  15. Smell detection: DesigniteJava (Java) and Designite (C#) analyze the repositories and report the detected code smells. https://github.com/tushartushar/DesigniteJava http://www.designite-tools.com/

  16. Generating training and evaluation samples: the sample generator combines the detected code smells with the code fragments to produce positive and negative samples.
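      A minimal sketch of a learning-data generator (assumed data shapes and names; the actual generator lives in the DeepLearningSmells repository): fragments reported as smelly become positive samples, the remaining fragments become negative samples.

          # Pair each code fragment with a label derived from the smell detector's report.
          def generate_samples(fragments, smelly_ids):
              """fragments: dict mapping fragment id -> source text.
              smelly_ids: set of fragment ids the detector reported for a given smell."""
              positives, negatives = [], []
              for frag_id, source in fragments.items():
                  (positives if frag_id in smelly_ids else negatives).append(source)
              return positives, negatives

          pos, neg = generate_samples({"A.foo": "int x = 42;", "A.bar": "return;"}, {"A.foo"})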

  17. Tokenizing learning samples: the Tokenizer converts each code fragment into a tokenized sample (a sequence of integer token IDs). https://github.com/dspinellis/tokenizer

  18. Tokenizing learning samples - example. Source fragment (C#):
      public void InternalCallback(object state)
      {
          Callback(State);
          try
          {
              timer.Change(Period, TimeSpan.Zero);
          }
          catch (ObjectDisposedException)
          {
          }
      }
      1-D representation (flat sequence of token IDs): 123 2002 40 2003 41 59 474 123 2004 46 2005
      2-D representation (token IDs arranged in rows):
      123 2002 40 2003 41 59
      474 123 2004 46 2005 40 2006 44 2007 46 2008
      125 329 40 2009 41 123 125 125
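      Before training, the variable-length token sequences need a fixed shape; a minimal sketch assuming Keras' pad_sequences utility (not necessarily the authors' exact preprocessing):

          # Turn variable-length token-ID sequences into fixed-shape model inputs.
          # 1-D: one flat sequence per fragment; 2-D: one row of token IDs per chunk of code.
          import numpy as np
          from tensorflow.keras.preprocessing.sequence import pad_sequences

          one_d = pad_sequences([[123, 2002, 40, 2003, 41, 59, 474]], maxlen=50)   # shape (1, 50)
          rows = [[123, 2002, 40, 2003, 41, 59],
                  [474, 123, 2004, 46, 2005, 40, 2006, 44, 2007, 46, 2008],
                  [125, 329, 40, 2009, 41, 123, 125, 125]]
          two_d = np.expand_dims(pad_sequences(rows, maxlen=30), axis=0)           # one sample: (1, 3, 30)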

  19. Data preparation: samples are divided into training and evaluation sets using a 70-30 split; 5,146 samples split into 3,602 training and 1,544 evaluation samples, and 311,533 samples split into 218,073 training and 93,460 evaluation samples.
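      A sketch of the 70-30 split (assuming scikit-learn; the authors' split code may differ):

          # Split the samples 70-30 into training and evaluation sets.
          from sklearn.model_selection import train_test_split

          samples = list(range(5146))                 # stand-in for 5,146 tokenized samples
          train, evaluation = train_test_split(samples, test_size=0.30, random_state=42)
          print(len(train), len(evaluation))          # 3602 1544, matching the slide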

  20. Selection of smells
      • Complex method: the method has high cyclomatic complexity
      • Magic number: an unexplained numeric literal is used in an expression
      • Empty catch block: a catch block of an exception is empty
      • Multifaceted abstraction: a class has more than one responsibility assigned to it

  21. Architecture - CNN
      Layer stack: Inputs -> [Convolution layer -> Batch normalization layer -> Max pooling layer] (repeat this set of hidden units) -> Dropout layer (0.1) -> Flatten layer -> Dense layer 1 (32, relu) -> Dense layer 2 (1, sigmoid) -> Output
      • Filters = {8, 16, 32, 64}
      • Kernel size = {5, 7, 11}
      • Pooling window = {2, 3, 4, 5}
      • Dynamic batch size = {32, 64, 128, 256}
      • Callbacks: early stopping (patience = 5), model checkpoint
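      A minimal Keras sketch of the 1-D CNN variant described above (layer choices follow the slide; the optimizer, loss, and input shape are assumptions rather than the authors' exact code):

          from tensorflow.keras import layers, models, callbacks

          def build_cnn_1d(max_len, n_repeats=1, filters=32, kernel_size=5, pool_window=2):
              model = models.Sequential()
              model.add(layers.Input(shape=(max_len, 1)))       # tokenized sample as a 1-D sequence
              for _ in range(n_repeats):                        # "repeat this set of hidden units"
                  model.add(layers.Conv1D(filters, kernel_size, activation='relu'))
                  model.add(layers.BatchNormalization())
                  model.add(layers.MaxPooling1D(pool_window))
              model.add(layers.Dropout(0.1))
              model.add(layers.Flatten())
              model.add(layers.Dense(32, activation='relu'))
              model.add(layers.Dense(1, activation='sigmoid'))  # binary: smelly vs. not smelly
              model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
              return model

          # Callbacks listed on the slide
          cbs = [callbacks.EarlyStopping(patience=5),
                 callbacks.ModelCheckpoint('cnn1d_best.h5', save_best_only=True)]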

  22. Architecture - RNN
      Layer stack: Inputs -> Embedding layer -> LSTM layer (repeat this set of hidden units) -> Dropout layer (0.2) -> Dense layer (1, sigmoid) -> Output
      • Dimensionality of embedding layer = {16, 32}
      • LSTM units = {32, 64, 128}
      • Dynamic batch size = {32, 64, 128, 256}
      • Callbacks: early stopping (patience = 2), model checkpoint
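      A matching Keras sketch of the RNN (vocabulary size, optimizer, and loss are assumptions):

          from tensorflow.keras import layers, models, callbacks

          def build_rnn(max_len, vocab_size=5000, embedding_dim=16, lstm_units=32, n_layers=1):
              model = models.Sequential()
              model.add(layers.Input(shape=(max_len,)))          # sequence of integer token IDs
              model.add(layers.Embedding(vocab_size, embedding_dim))
              for i in range(n_layers):                          # "repeat this set of hidden units"
                  model.add(layers.LSTM(lstm_units, return_sequences=(i < n_layers - 1)))
              model.add(layers.Dropout(0.2))
              model.add(layers.Dense(1, activation='sigmoid'))
              model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
              return model

          cbs = [callbacks.EarlyStopping(patience=2),
                 callbacks.ModelCheckpoint('rnn_best.h5', save_best_only=True)]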

  23. Running experiments
      • Phase 1: grid search for optimal hyper-parameters (validation set = 20%); number of configurations: CNN = 144, RNN = 18
      • Phase 2: experiment with the optimal hyper-parameters
      • Executed on the GRNET supercomputing facility; each experiment used 1 GPU with 64 GB memory
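      A sketch of the phase-1 grid search over the CNN hyper-parameter values from slide 21 (treating the number of repeated hidden-unit sets, 1-3, as the fourth grid dimension is an assumption that reproduces the 144 configurations; batch size is handled dynamically per the slide):

          # Enumerate every CNN hyper-parameter configuration for the grid search.
          from itertools import product

          configs = list(product([8, 16, 32, 64],     # filters
                                 [5, 7, 11],          # kernel size
                                 [2, 3, 4, 5],        # pooling window
                                 [1, 2, 3]))          # repeated hidden-unit sets
          print(len(configs))                         # 144, the CNN configuration count on the slide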

  24. Results

  25. RQ1. Would it be possible to use deep learning methods to detect code smells? [Bar charts: F1 and AUC-ROC per smell (CM, ECB, MN, MA) for CNN-1D, CNN-2D, and RNN; F1 values range from about 0.02 to 0.57.]

  26. CNN-1D vs CNN-2D (maximum F1 per smell): CM 0.40 vs 0.39, ECB 0.05 vs 0.04, MN 0.36 vs 0.35, MA 0.18 vs 0.16; CNN-1D performs marginally better than CNN-2D on each smell.

  27. CNN vs RNN (difference in percentage, comparing max F1)
      Smell | RNN and CNN-1D | RNN and CNN-2D
      CM    |  -22.94        |  -33.81
      ECB   |   80.23        |   91.94
      MN    |   48.96        |   38.58
      MA    | -349.12        | -205.26

  28. Are more deep layers always good?
      Model  | Layers | CM   | ECB  | MN   | MA
      CNN-1D | 1      | 0.36 | 0.05 | 0.36 | 0.08
      CNN-1D | 2      | 0.40 | 0.05 | 0.36 | 0.18
      CNN-1D | 3      | 0.40 | 0.05 | 0.36 | 0.19
      CNN-2D | 1      | 0.39 | 0.04 | 0.35 | 0.07
      CNN-2D | 2      | 0.39 | 0.04 | 0.34 | 0.16
      CNN-2D | 3      | 0.39 | 0.05 | 0.34 | 0.10
      RNN    | 1      | 0.34 | 0.21 | 0.48 | 0.28
      RNN    | 2      | 0.36 | 0.24 | 0.48 | 0.22
      RNN    | 3      | 0.37 | 0.23 | 0.48 | 0.20

  29. RQ2: Is transfer-learning feasible in the context of detecting smells? [Bar charts: F1 per smell (CM, ECB, MN, MA) for CNN-1D and CNN-2D under transfer-learning and direct-learning; F1 values range from about 0.01 to 0.57.]
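      A minimal sketch of the transfer-learning setup (the direction, training on C# samples and evaluating on Java samples, and the toy model below are assumptions; direct-learning would train and evaluate on the same language):

          import numpy as np
          from tensorflow.keras import layers, models

          max_len = 100
          model = models.Sequential([layers.Input(shape=(max_len,)),
                                     layers.Embedding(5000, 16),
                                     layers.LSTM(32),
                                     layers.Dense(1, activation='sigmoid')])
          model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

          # Random placeholders standing in for tokenized C# and Java samples with smell labels.
          x_cs, y_cs = np.random.randint(0, 5000, (64, max_len)), np.random.randint(0, 2, 64)
          x_java, y_java = np.random.randint(0, 5000, (64, max_len)), np.random.randint(0, 2, 64)

          model.fit(x_cs, y_cs, epochs=1, verbose=0)        # learn to detect a smell from C# samples
          print(model.evaluate(x_java, y_java, verbose=0))  # apply the trained model to Java samples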

  30. Conclusions
      • It is feasible to make the deep learning model learn to detect smells.
      • Transfer-learning is feasible.
      • Improvements - many possibilities: performance, and adding more smells of different kinds.

  31. Relevant links
      Source code and data: https://github.com/tushartushar/DeepLearningSmells
      Smell detection tool: Java - https://github.com/tushartushar/DesigniteJava ; C# - http://www.designite-tools.com
      CodeSplit: Java - https://github.com/tushartushar/CodeSplitJava ; C# - https://github.com/tushartushar/DeepLearningSmells/tree/master/CodeSplit
      Tokenizer: https://github.com/dspinellis/tokenizer

  32. Thank you!! Courtesy: spikedmath.com
