Pushing the Boundaries in Regression Testing
Shin Yoo & Mark Harman / King's College London
Shmuel Ur / IBM Haifa
Paolo Tonella & Angelo Susi / FBK
Regression Testing: Large Test Suites, Limited Resources
Selection (Large Test Suites, Limited Resources)
Uses impact analysis to precisely identify the changed parts of the program and test only those parts.
Often not the answer:
- requires static analysis
- not enough reduction
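As a concrete illustration of selection, a minimal Python sketch (names and data are invented, not from the talk): keep only the tests whose coverage intersects the changed parts of the program.

```python
# Coverage-based regression test selection sketch (illustrative data):
# a test is selected if it exercises at least one changed element.

def select_tests(coverage, changed):
    """coverage: test id -> set of covered elements (e.g. functions);
    changed: set of elements modified since the last release."""
    return [t for t, covered in coverage.items() if covered & changed]

coverage = {
    "T1": {"parse", "render"},
    "T2": {"parse", "save"},
    "T3": {"load"},
}
print(select_tests(coverage, changed={"save"}))  # -> ['T2']
```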
Minimisation (Large Test Suites)
Seeks to reduce the size of test suites while satisfying test adequacy goals.
[figure: requirements R1–R4 each covered by at least one of tests T1–T3]
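A minimal greedy set-cover sketch of minimisation in Python; the requirement-to-test mapping is invented to echo the figure's R1–R4 and T1–T3.

```python
# Greedy set-cover test suite minimisation sketch (illustrative mapping).

def minimise(tests, requirements):
    """Pick tests until every requirement is satisfied (greedy heuristic)."""
    remaining, suite = set(requirements), []
    while remaining:
        # Take the test covering the most still-unsatisfied requirements.
        best = max(tests, key=lambda t: len(tests[t] & remaining))
        if not tests[best] & remaining:
            break  # remaining requirements are unsatisfiable
        suite.append(best)
        remaining -= tests[best]
    return suite

tests = {"T1": {"R1", "R2"}, "T2": {"R2", "R3"}, "T3": {"R3", "R4"}}
print(minimise(tests, {"R1", "R2", "R3", "R4"}))  # -> ['T1', 'T3']
```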
Prioritisation (Limited Resource)
Seeks to achieve test adequacy as much, and as early, as possible.
[figure: coverage (%) over time for two orderings, A-B-C-D-E vs. C-E-B-A-D]
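A minimal sketch of coverage-based prioritisation ("additional greedy"): repeatedly pick the test that adds the most not-yet-covered elements, so adequacy rises early in the ordering. Test names and coverage data are invented.

```python
# Additional-greedy test case prioritisation sketch (illustrative data).

def prioritise(coverage):
    order, covered, pool = [], set(), dict(coverage)
    while pool:
        # Next test = the one contributing the most new coverage.
        best = max(pool, key=lambda t: len(pool[t] - covered))
        order.append(best)
        covered |= pool.pop(best)
    return order

coverage = {"A": {1, 2}, "B": {2, 3, 4}, "C": {1, 5}, "D": {4}, "E": {5, 6}}
print(prioritise(coverage))  # -> ['B', 'C', 'E', 'A', 'D']
```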
Publication Trend
Relative interest in different subjects, 1977–2009. Milestones: Fischer et al. (1977) on selection; Leung et al. on the cost model; Harrold et al. on minimisation; Rothermel et al. on prioritisation.
[figure: papers per year, 1977–2009, by topic: Minimisation, Selection, Prioritisation, Empirical/Comparative]
Potential Impact
Summary of 157 papers on regression testing techniques from a recent survey: 92% purely academic vs. 8% with an author from industry; 80% toy programs vs. 20% industrial-scale subjects.
Research Output
• Pareto Efficient Multi-Objective Test Case Selection: S. Yoo & M. Harman, ISSTA 2007
• Measuring and Improving Latency to Avoid Test Suite Wear-Out: S. Yoo, M. Harman & S. Ur, SBST 2009
• Clustering Test Cases to Achieve Effective and Scalable Prioritisation Incorporating Expert Knowledge: S. Yoo, M. Harman, P. Tonella & A. Susi, ISSTA 2009
Multi-Objectiveness: Problems
• “After performing minimisation, the test suite is still too big. What can I actually do in the next 6 hours?”
• “I care about more than just code coverage. Can I also achieve X with the minimised suite?”
Single Objective vs. Multi Objective

Test Case   Blocks covered (of 10)   Time
T1          8                        4
T2          9                        5
T3          4                        3
T4          5                        3

Single objective (additional greedy): choose the test case with the highest blocks-per-time ratio as the next one.
1) T1 (ratio = 8 / 4 = 2.0)
2) T2 (adds 2 uncovered blocks; ratio = 2 / 5 = 0.4)
∴ {T1, T2} (takes 9 hours)

Multi objective: plot the Pareto frontier of coverage (%) against execution time and compare it with the additional greedy result.
“But what if you have only 7 hours...?”
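A minimal Python sketch of the multi-objective view: compare every subset of the four tests on (coverage, time) and keep the non-dominated ones. The per-test block sets below are invented, chosen only to be consistent with the totals and ratios above.

```python
# Pareto frontier sketch over the four tests (illustrative block sets).
from itertools import combinations

tests = {
    "T1": (frozenset(range(1, 9)), 4),            # 8 blocks, 4 hours
    "T2": (frozenset(range(1, 8)) | {9, 10}, 5),  # 9 blocks, 5 hours
    "T3": (frozenset({7, 8, 9, 10}), 3),          # 4 blocks, 3 hours
    "T4": (frozenset({1, 2, 3, 9, 10}), 3),       # 5 blocks, 3 hours
}

points = []
for r in range(len(tests) + 1):
    for subset in combinations(tests, r):
        covered = frozenset().union(*(tests[t][0] for t in subset))
        cost = sum(tests[t][1] for t in subset)
        points.append((subset, len(covered), cost))

# Keep only non-dominated (coverage, time) points.
frontier = [p for p in points
            if not any(q[1] >= p[1] and q[2] <= p[2] and q[1:] != p[1:]
                       for q in points)]
for subset, cov, cost in sorted(frontier, key=lambda p: p[2]):
    print(subset, f"{cov}/10 blocks", f"{cost}h")
```

Under these assumed sets the frontier contains a 7-hour subset ({T1, T3}) with full coverage, exactly the kind of option the single-objective greedy ({T1, T2}, 9 hours) never surfaces.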
Latency: Problems
• “My regression testing seems to depend on the same test cases all the time; is this okay?”
• “Code coverage is a necessary but not a sufficient test adequacy criterion; how do we make it safer?”
The Coverage Trap
T5 may detect an unknown fault. If the minimisation technique never picks T5, fault detection capability is compromised. We should consider multiple subsets; if necessary, we should improve the remaining part of the test suite.
[figure: two alternative minimisations of tests T1–T9, both reaching 100% coverage without T5]
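One possible way to act on this, sketched in Python (an illustration, not the algorithm from the SBST 2009 paper): break greedy ties randomly, so that successive regression cycles draw different, equally adequate subsets and no test is permanently left out.

```python
# Randomised-greedy sketch: repeated calls can return different adequate
# subsets, spreading selection across the whole suite over many cycles.
import random

def adequate_subset(coverage, goal, rng):
    remaining, suite = set(goal), []
    pool = dict(coverage)
    while remaining and pool:
        # Ties on added coverage are broken randomly.
        best = max(pool, key=lambda t: (len(pool[t] & remaining),
                                        rng.random()))
        if not pool[best] & remaining:
            break
        suite.append(best)
        remaining -= pool.pop(best)
    return suite

coverage = {
    "T1": {1, 2}, "T2": {2, 3}, "T3": {3, 4}, "T4": {4, 5},
    "T5": {1, 5}, "T6": {2, 4},
}
rng = random.Random(0)
for cycle in range(3):  # different cycles may exercise different tests
    print(cycle, sorted(adequate_subset(coverage, {1, 2, 3, 4, 5}, rng)))
```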
Designing Test Suites
[figure: average maximum statement coverage (%) vs. number of times the reduction technique has been applied, for flex, grep, sed, space, gzip, and printtokens]
ComplexBranch
[figure: average maximum branch coverage (%) vs. number of times the reduction technique has been applied; series: Hill Climbing, EDA, Random, Original]
Expertise: Problems
• “I have seen the results of automated test case prioritisation and I don't agree. This is the way it should be!”
• “We need to prioritise a specific set of tests due to business priority.”
Interleaved Clusters Prioritisation
1. Cluster test cases (T1–T6 in the figure)
2. Intra-cluster prioritisation
3. Inter-cluster prioritisation
4. Interleaving clusters
Experimental Setup
• Simple agglomerative hierarchical clustering (k=14), with Hamming distance between statement coverage vectors as the dissimilarity metric
• Inter-cluster prioritisation by pairwise comparison: a human user model with a controlled error rate
• The resulting clusters C1, C2, ..., C13, C14 are interleaved into the prioritised test suite (see the sketch below)
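A rough Python sketch of the pipeline above, with invented coverage vectors and k=3 instead of 14. Note that the study's inter-cluster ordering came from (modelled) human pairwise comparisons; here clusters are simply interleaved round-robin after intra-cluster ordering by raw coverage.

```python
# Cluster coverage vectors by Hamming distance, then interleave clusters.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

names = ["T1", "T2", "T3", "T4", "T5", "T6"]
cov = np.array([  # rows: tests; columns: statements (illustrative)
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 1],
])

k = 3  # the study above used k=14 on larger suites
labels = fcluster(linkage(pdist(cov, metric="hamming"), method="average"),
                  t=k, criterion="maxclust")

clusters = {}
for name, row, label in zip(names, cov, labels):
    clusters.setdefault(label, []).append((int(row.sum()), name))

# Intra-cluster: most-covering test first; inter-cluster: round-robin.
queues = [sorted(c, reverse=True) for c in clusters.values()]
order = []
while any(queues):
    for q in queues:
        if q:
            order.append(q.pop(0)[1])
print(order)
```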
Tolerance (gzip)
[figure: APFD vs. human error rate from 0.05 to 0.95 for gzip; APFD degrades as the error rate grows]
This is what we initially expected to see. But...
Tolerance
[figure: APFD vs. human error rate from 0.05 to 0.95 for space suite1, space suite2, schedule suite2, and sed; APFD stays nearly flat]
Some test suites are very resilient to errors.
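APFD, the metric on the y-axes above, rewards orderings that reveal faults early: APFD = 1 - (TF1 + ... + TFm) / (n * m) + 1 / (2n), where n is the number of tests, m the number of faults, and TFi the position of the first test exposing fault i. A small Python sketch with invented fault data:

```python
# Average Percentage of Faults Detected for a prioritised test suite.

def apfd(order, detects):
    """order: prioritised list of test ids;
    detects: fault id -> set of test ids that reveal it."""
    n, m = len(order), len(detects)
    position = {t: i + 1 for i, t in enumerate(order)}
    first = [min(position[t] for t in tests if t in position)
             for tests in detects.values()]
    return 1 - sum(first) / (n * m) + 1 / (2 * n)

detects = {"f1": {"T3"}, "f2": {"T1", "T3"}, "f3": {"T2"}}
print(apfd(["T3", "T2", "T1"], detects))  # 1 - 4/9 + 1/6 ≈ 0.722
```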
Boundaries & Open Questions • Scalability: not only quantitative but also qualitative scalability • Complexity: set-up cost, oracle cost, dependency • Effectiveness: are the traditional metrics good enough?
Summary
• Regression testing is hard: there is no single solution
• The multi-objective paradigm allows us to formulate the problem in its full complexity
• Code coverage needs to be rethought
• Humans are a vast pool of knowledge