Empirical analysis of the relationship between CC and SLOC in a large corpus of Java methods
Davy Landman, Alexander Serebrenik, Jurgen Vinju
Metrics • Lines of Code (SLOC) • Cyclomatic Complexity (CC) • Popular in practice and research
Metrics • Lines of Code (SLOC) = 7 • Cyclomatic Complexity (CC) = 2
    public double sqrt(int n) {                  // 1
        // Newton-Raphson method
        double r = n / 2.0;                      // 2
        while (abs(r - (n / r)) > 0.00001) {     // 3
            r = 0.5 * (r + (n / r));             // 4
        }                                        // 5
        return r;                                // 6
    }                                            // 7
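The two numbers on this slide can be reproduced with a rough, text-based count. The sketch below is only an illustration and not the tooling used in the study: the class name `MethodMetricsSketch`, the set of tokens treated as decision points, and the naive comment stripping are all assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MethodMetricsSketch {
    // Tokens counted as decision points (an assumption of this sketch).
    private static final Pattern DECISION =
        Pattern.compile("\\b(if|while|for|case|catch)\\b|&&|\\|\\||\\?");

    // The method from the slide, kept as plain text so it can be measured.
    static final String SQRT = String.join("\n",
        "public double sqrt(int n) {",
        "    // Newton-Raphson method",
        "    double r = n / 2.0;",
        "    while (abs(r - (n / r)) > 0.00001) {",
        "        r = 0.5 * (r + (n / r));",
        "    }",
        "    return r;",
        "}");

    /** Drops a trailing line comment; string literals and block comments are not handled. */
    static String stripLineComment(String line) {
        int idx = line.indexOf("//");
        return (idx >= 0 ? line.substring(0, idx) : line).trim();
    }

    /** SLOC: lines that still contain code after removing line comments and whitespace. */
    static int sloc(String source) {
        int count = 0;
        for (String line : source.split("\n")) {
            if (!stripLineComment(line).isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /** CC: one path through the method plus one per decision point found. */
    static int cc(String source) {
        int decisions = 0;
        for (String line : source.split("\n")) {
            Matcher m = DECISION.matcher(stripLineComment(line));
            while (m.find()) {
                decisions++;
            }
        }
        return 1 + decisions;
    }

    public static void main(String[] args) {
        System.out.println("SLOC = " + sloc(SQRT) + ", CC = " + cc(SQRT));
    }
}
```

For the slide's method this gives SLOC = 7 (the comment line is skipped) and CC = 2 (one `while`). A real measurement would work on a parsed syntax tree rather than on regular expressions.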
M. Shepperd. "A critique of cyclomatic complexity as a software metric." Software Engineering Journal 3.2 (1988)
Citations • Total: 218 • Last 5 years: 90
CC redundant? • Shepperd’s critique was based on 8 papers (1979-1987) • 7 papers followed (1991-2013) • Fortran, PL/1, Pascal, COBOL, C, C++, and Java • SLOC & CC correlate linearly: R² = 0.65 - 0.95
Our research • Identify differences in 15 papers • Get data • Reproduce!
We do not conclude that CC is redundant with SLOC • Our result: R² = 0.43 • Differences from related work: • Aggregation • Power transform • Larger methods correlate even less • Differing variance
Corpus • 13K Open Source Java Projects (14GB of Java) • 17M methods in 362M SLOC
[Figure: two log-log histograms, Frequency vs. SLOC of a Method and Frequency vs. CC of a Method]
E. Linstead, S. K. Bajracharya, T. C. Ngo, P. Rigor, C. V. Lopes, and P. Baldi, “Sourcerer: mining and searching internet-scale software repositories,” Data Mining and Knowledge Discovery 18.2 (2009).
First result • Correlation (R²): 0.43 • Lower than other papers: 0.65 - 0.95 • Why?
Other explanations • Correlation (R²): 0.43 • Lower than other papers: 0.65 - 0.95
                     Yes   No
Power transform        4   12
File level (sum)       9    6
Power transform
[Figure: two histograms of Frequency vs. SLOC of a Method, one on linear axes and one on logarithmic axes]
Method level • R² = 0.43 • R² = 0.70 (power transform)
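One way to see why a transform can raise R²: if CC grows roughly as a power of SLOC, the relation is curved on raw axes but close to a straight line after taking log10 of both metrics. The sketch below uses six made-up (SLOC, CC) pairs, not the corpus, and the class name `LogTransformSketch` is hypothetical. R² is computed here as the squared Pearson correlation, which equals the R² of a simple linear fit.

```java
class LogTransformSketch {
    // Made-up data for illustration only: CC grows roughly like sqrt(SLOC).
    static final double[] SLOC = {1, 4, 16, 64, 256, 1024};
    static final double[] CC = {1, 2, 5, 6, 20, 26};

    /** Squared Pearson correlation between x and y. */
    static double r2(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i] / n;
            meanY += y[i] / n;
        }
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return (sxy * sxy) / (sxx * syy);
    }

    /** Element-wise log10; all inputs must be positive. */
    static double[] log10(double[] values) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = Math.log10(values[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.printf("raw R^2   = %.2f%n", r2(SLOC, CC));
        System.out.printf("log10 R^2 = %.2f%n", r2(log10(SLOC), log10(CC)));
    }
}
```

On this toy data the log-transformed fit scores noticeably higher than the raw one. The size of the gap depends entirely on the data, so the numbers say nothing about the corpus itself.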
File level • Example: 1 file, 30 “small” methods. • File SLOC = 30 * avg(SLOC_m) = 30 * 2.5 • File CC = 30 * avg(CC_m) = 30 * 2 • Volume factor causes high correlation [1]
[1] K. El Emam, S. Benlarbi, N. Goel, S.N. Rai. "The confounding effect of class size on the validity of object-oriented metrics." IEEE Transactions on Software Engineering 27.7 (2001)
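The volume factor can be made visible with a small simulation. In the sketch below every method's SLOC and CC are drawn independently, so the method-level correlation is close to zero by construction, yet summing per file produces a high file-level R² because both sums scale with the number of methods. All distributions, the seed, and the class name `AggregationSketch` are assumptions chosen for illustration; none of it is corpus data.

```java
import java.util.Random;

class AggregationSketch {
    /** Running sums for the squared Pearson correlation (R^2 of a simple linear fit). */
    static final class R2 {
        private long n;
        private double sx, sy, sxx, syy, sxy;

        void add(double x, double y) {
            n++;
            sx += x;
            sy += y;
            sxx += x * x;
            syy += y * y;
            sxy += x * y;
        }

        double value() {
            double cov = sxy - sx * sy / n;
            double varX = sxx - sx * sx / n;
            double varY = syy - sy * sy / n;
            return (cov * cov) / (varX * varY);
        }
    }

    /** Returns {method-level R^2, file-level R^2} for a synthetic corpus. */
    static double[] simulate(int files, long seed) {
        Random rnd = new Random(seed);
        R2 methodLevel = new R2();
        R2 fileLevel = new R2();
        for (int f = 0; f < files; f++) {
            int methods = 1 + rnd.nextInt(100);      // 1..100 methods per file
            int fileSloc = 0;
            int fileCc = 0;
            for (int m = 0; m < methods; m++) {
                int sloc = 1 + rnd.nextInt(10);      // 1..10, independent of cc
                int cc = 1 + rnd.nextInt(4);         // 1..4, independent of sloc
                methodLevel.add(sloc, cc);
                fileSloc += sloc;
                fileCc += cc;
            }
            fileLevel.add(fileSloc, fileCc);         // file metric = sum over methods
        }
        return new double[] {methodLevel.value(), fileLevel.value()};
    }

    public static void main(String[] args) {
        double[] r2 = simulate(1000, 42L);
        System.out.printf("method-level R^2 = %.3f, file-level R^2 = %.3f%n", r2[0], r2[1]);
    }
}
```

The file-level correlation here comes entirely from the method count per file, which is the confounding effect described in [1].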
File level • R² = 0.87 • R² = 0.65 • Aggregation causing it?
[Figure: log-log histogram of Frequency vs. SLOC of a Method, marking the tails that contain the largest 50%, 25%, 10%, 1%, and 0.1% of methods]
Israel Herraiz and Ahmed E. Hassan, “Beyond lines of code: Do we need more complexity metrics?” Making Software: What Really Works, and Why We Believe It (2010)
Statistics
Tail    min. SLOC   # Methods   R²     R² “power”
100%    1           17.8M       0.43   0.70
50%     3           8.9M        0.45   0.62
25%     9           4.5M        0.42   0.44
10%     20          1.8M        0.38   0.27
1%      77          179K        0.29   0.05
0.1%    230         18K         0.21   0.00
Large Methods
Variance • R² = 0.43 means 57% of the variance is not explained • Unexplained part (residual) = actual CC - predicted CC
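The link between R² and the unexplained share can be spelled out with a least-squares fit: the unexplained fraction is the sum of squared residuals divided by the total sum of squares, and R² is one minus that. The sketch below uses five made-up (SLOC, CC) pairs, and the class name `ResidualSketch` is hypothetical.

```java
class ResidualSketch {
    // Made-up data for illustration only.
    static final double[] SLOC = {1, 2, 3, 4, 5};
    static final double[] CC = {1, 1, 2, 2, 4};

    static double mean(double[] v) {
        double sum = 0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    /** Least-squares fit y = intercept + slope * x; returns {intercept, slope}. */
    static double[] fit(double[] x, double[] y) {
        double meanX = mean(x);
        double meanY = mean(y);
        double sxy = 0, sxx = 0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new double[] {meanY - slope * meanX, slope};
    }

    /** Residual per point: actual y minus the y predicted by the fitted line. */
    static double[] residuals(double[] x, double[] y) {
        double[] coef = fit(x, y);
        double[] res = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            res[i] = y[i] - (coef[0] + coef[1] * x[i]);
        }
        return res;
    }

    /** Share of the variance in y that the line does not explain, i.e. 1 - R^2. */
    static double unexplainedFraction(double[] x, double[] y) {
        double meanY = mean(y);
        double ssRes = 0, ssTot = 0;
        double[] res = residuals(x, y);
        for (int i = 0; i < y.length; i++) {
            ssRes += res[i] * res[i];
            ssTot += (y[i] - meanY) * (y[i] - meanY);
        }
        return ssRes / ssTot;
    }

    public static void main(String[] args) {
        double unexplained = unexplainedFraction(SLOC, CC);
        System.out.printf("R^2 = %.3f, unexplained = %.3f%n", 1 - unexplained, unexplained);
    }
}
```

Plotting the residuals against SLOC is also a quick check for differing variance: if their spread widens as SLOC grows, a single R² summarizes the fit poorly.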
[Figure panels: Method level • log10(Method level) • File level • log10(File level)]
Differing variance complicates the interpretation of linear models
Summary • Method level • File level • Large methods • Differing variance (data, scripts & preprint: http://is.gd/icsme_cc)