

1. Unit Testing Tool Competition Round Four
Urko Rueda, René Just, Juan P. Galeotti, Tanja E. J. Vos
The 9th International Workshop on Search-Based Software Testing (SBST'16)

2. Contents
1. About the Tool Competition
2. The Tools
3. The Methodology
4. The Results
5. Lessons learned

3. About the Tool Competition
Benchmarked Java unit testing at the class level.

Year | Edition | Venue | FITTEST (crest.cs.ucl.ac.uk/fittest) | Coverage metrics | Mutation metrics | CUTs / Projects / Tools | Entrants (SBST & non-SBST)
2012 | 1st | ICST'13 | ✓ | Cobertura | Javalanche | 77 / 5 / 2 | Manual & Randoop baselines
2013 | Round Two | FITTEST'13 | ✓ | JaCoCo | PITest | 63 / 9 / 4 | 1st + T3 & EvoSuite
2014 | Round Three | SBST'15 | ✗ | | | 63 / 9 / 8 | 2nd + Commercial & GRT & jTexPert & MOSA(EvoSuite)
2015 | Round Four | SBST'16 | ✗ | Defects4J (Cobertura) | Defects4J (Major) | 68 / 5 / 4 | Randoop baseline & T3 & EvoSuite & jTexPert; new: real fault finding metric

Defects4J: github.com/rjust/defects4j

4. About the Tool Competition
§ Why?
  § Towards testing field maturity (this is just Java ...)
  § Insight into tool improvements and future developments
§ What is new in the 4th edition?
  § Benchmark infrastructure split into:
    § Test generation
    § Test execution & test assessment (Defects4J)
  § Benchmark subjects (from the Defects4J dataset)
  § Time budgets (1, 2, 4 & 8 minutes)
  § Flaky tests (non-compilable, non-reliable pass)

5. The Tools
§ SBST and non-SBST tools
§ Command-line tools
§ Fully automated: no human intervention

Tool | Technique | Static analysis | 2012 | 2013 | 2014 | 2015
Randoop (baseline) | Random | ✗ | ✓ | ✓ | ✓ | ✓
T3 | Random | ✗ | ✗ | ✓ | ✓ | ✓
jTexPert | Random (guided) | ✓ | ✗ | ✗ | ✓ | ✓
EvoSuite | Evolutionary algorithm | ✓ | ✗ | ✓ | ✓ | ✓

6. The Methodology
§ Tool deployment
  § Installation: Linux environment
  § Wrapper implementation: the runtool script
  § Std. IN/OUT communication protocol (sketched on the next slide)
  § 4th edition adds a time budget
  § Tune-up cycle: setup, run, resolve issues
§ Benchmark infrastructure
  § Defects4J integration
  § Decouples test generation from test execution/assessment
  § Tools run over non-contest benchmark samples

7. The Methodology: the runtool protocol
(Sequence diagram between the benchmark framework and the runtool wrapper for a tool T, over standard input/output.)
1. The framework sends "BENCHMARK", followed by the source path, binary path, classpath, and the classpath for JUnit compilation.
2. The tool runs its preparation and answers "READY".
3. In a loop, the framework sends a time budget and the name of a CUT; the tool generates test case files in ./temp/testcases and answers "READY"; the framework then compiles, executes, and measures the test cases.
A minimal wrapper sketch follows.
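A minimal sketch of a runtool wrapper following the protocol above, assuming one token per line and a time budget given in seconds; the generateTests helper is hypothetical and stands in for the actual tool invocation:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class RunTool {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            if (!"BENCHMARK".equals(in.readLine())) return;   // protocol handshake
            String srcPath   = in.readLine();  // source path of the benchmark project
            String binPath   = in.readLine();  // compiled classes of the benchmark project
            String classPath = in.readLine();  // classpath for loading the CUTs
            String junitCp   = in.readLine();  // classpath for JUnit compilation
            // ... tool-specific preparation goes here ...
            System.out.println("READY");       // preparation done
            String budgetLine;
            while ((budgetLine = in.readLine()) != null) {         // one iteration per CUT
                int timeBudget = Integer.parseInt(budgetLine.trim());  // assumed unit: seconds
                String cut = in.readLine();                        // fully qualified CUT name
                generateTests(cut, timeBudget, classPath);         // hypothetical helper
                System.out.println("READY");   // test files written to ./temp/testcases
            }
        }

        private static void generateTests(String cut, int budget, String classPath) {
            // invoke the actual test generator here
        }
    }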

8. The Methodology
§ Benchmark infrastructure
  § Two HP Z820 workstations, each with:
    § 2 CPU sockets, 20 cores in total
    § 256 GB RAM
  § 32 virtual machines (16 per workstation)
§ Test generation VMs
  § 1 core: controls for a tool's multi-threading capability
  § 8 GB RAM
§ Test execution/assessment VMs (tool independent)
  § 2 cores
  § 16 GB RAM: resolves out-of-memory issues

9. The Methodology: benchmark architecture
(Diagram.) Two HP Z820 workstations (20-core CPU, 256 GB RAM) host 16 VMs each. The benchmark distributes the 80 CUTs over test generation VMs (1-core CPU, 8 GB RAM) and test execution VMs (2-core CPU, 16 GB RAM). Each tool's runtool wrapper (T3, jTexpert, EvoSuite, Randoop) generates test cases under the 1, 2, 4 and 8 minute time budgets, replicated across the 32 VMs: RUNs 1-3 on one workstation, RUNs 4-6 on the other. The generated test cases and collected metrics feed an aggregator that calculates the score.

10. The Methodology: generation and assessment flow
(Diagram.) Each tool (EvoSuite, Randoop, T3, jTexpert) is driven by its runtool wrapper with a fixed CUT and a time budget (1, 2, 4 or 8 min) and generates test classes. These are run against the fixed CUT to detect and remove non-compilable test classes and flaky tests. The remaining flake-free test classes are then run against the fixed CUT, the buggy CUT (1 real fault), and the mutated CUTs to collect metrics and calculate the score.

11. The Methodology
§ Flaky tests
  § Pass during generation
  § But might fail during execution/assessment
    § False-positive warnings
    § Non-reliable fault detection
    § Non-reliable mutation analysis
§ Defects4J flaky-test sanity check (sketched below)
  § Removes non-compiling test classes
  § Removes tests failing over 5 executions (on the fixed CUT versions)
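A sketch of the sanity-check idea, assuming JUnit 4 on the classpath: run each generated test class several times against the fixed CUT and flag any test that ever fails. The class and method names here are hypothetical; Defects4J's actual implementation differs in detail.

    import java.util.HashSet;
    import java.util.Set;
    import org.junit.runner.JUnitCore;
    import org.junit.runner.Result;
    import org.junit.runner.notification.Failure;

    public class FlakyCheck {
        /** Runs a generated test class `runs` times against the fixed CUT and
         *  returns the names of tests that failed at least once: on a fixed
         *  (correct) version, any failure marks the test as unreliable. */
        public static Set<String> flakyTests(Class<?> testClass, int runs) {
            Set<String> flaky = new HashSet<>();
            for (int i = 0; i < runs; i++) {                  // the contest used 5 executions
                Result result = JUnitCore.runClasses(testClass);
                for (Failure f : result.getFailures()) {
                    flaky.add(f.getTestHeader());             // e.g. "testFoo(com.example.FooTest)"
                }
            }
            return flaky;                                     // removed before assessment
        }
    }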

12. The Methodology
§ The metrics: test effectiveness
  § Code coverage (fixed benchmark versions)
    § Defects4J <- Cobertura
    § Statement coverage
    § Condition coverage
  § Mutation score
    § Defects4J <- Major framework (all mutation operators)
  § Real fault detection (buggy benchmark versions)
    § 1 real fault per benchmark
    § Score of 0 or 1, independent of how many tests reveal it

13. The Methodology
§ The scoring formula

  covScore(T, L, C, r) := w_i * cov_i + w_b * cov_b + w_m * cov_m + (real fault found ? w_f : 0)

  T = tool; L = time budget; C = CUT; r = RUN (1..6)
  Coverages: cov_i = statement coverage; cov_b = condition coverage; cov_m = mutant kill ratio
  Weights: w_i = 1; w_b = 2; w_m = 4; w_f = 4
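Transcribed directly into Java, the formula reads as below; a minimal sketch, assuming the coverage inputs are ratios in [0, 1]:

    public class CovScore {
        // Weights as defined above: w_i = 1, w_b = 2, w_m = 4, w_f = 4
        static final double W_I = 1, W_B = 2, W_M = 4, W_F = 4;

        /** covScore(T, L, C, r) for one tool/budget/CUT/run combination. */
        static double covScore(double covStmt, double covCond, double covMut,
                               boolean realFaultFound) {
            return W_I * covStmt                 // statement coverage (cov_i)
                 + W_B * covCond                 // condition coverage (cov_b)
                 + W_M * covMut                  // mutant kill ratio (cov_m)
                 + (realFaultFound ? W_F : 0);   // real fault bonus (w_f)
        }
    }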

14. The Methodology
§ The scoring formula: time penalty (a sketch follows)
  § Test generation slot: L .. 2L
  § No penalty if genTime <= L
  § Penalty for the extra time taken (genTime - L)
  § Half the covScore if the tool must be killed (genTime > 2L)
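A sketch of the time penalty rules above. The half-score rule for genTime > 2L is stated on the slide; how the extra time within the L..2L slot scales into a deduction is an assumption here, shown as a linear deduction for illustration:

    public class TimePenalty {
        /** Applies the time penalty to a covScore; genTime and budget (L) share a unit. */
        static double applyTimePenalty(double covScore, double genTime, double budget) {
            if (genTime <= budget) {
                return covScore;                                 // finished within L: no penalty
            }
            if (genTime > 2 * budget) {
                return covScore / 2;                             // tool had to be killed after 2L
            }
            double extraFraction = (genTime - budget) / budget;  // share of the L..2L slot used
            return covScore * (1 - extraFraction);               // assumed linear deduction
        }
    }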

15. The Methodology
§ The scoring formula: tests penalty, computed from:
  § #Classes = number of generated test classes; #uClasses = number of uncompilable test classes
  § #Tests = number of test cases; #fTests = number of flaky tests

16. The Methodology
§ The scoring formula: tool score

  Score(T, L, C, r) := tScore(T, L, C, r) - penalty(T, L, C, r)
  Score(T, L, C) := avg(Score(T, L, C, r)) over all runs r
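A minimal sketch of the final computation: per-run tScore minus penalties, averaged over the runs. The input arrays are hypothetical; the tests penalty from the previous slide is assumed to be folded into penaltyPerRun.

    public class ToolScore {
        /** Score(T, L, C): per-run score minus penalties, averaged over all runs. */
        static double score(double[] tScorePerRun, double[] penaltyPerRun) {
            double sum = 0;
            for (int r = 0; r < tScorePerRun.length; r++) {
                sum += tScorePerRun[r] - penaltyPerRun[r];   // Score(T, L, C, r)
            }
            return sum / tScorePerRun.length;                // average over runs 1..6
        }
    }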

17. The Methodology
§ Conclusion validity
  § Reliability of treatment implementation
    § Tool deployment instructions EQUAL for all participants
  § Reliability of measures
    § Efficiency: wall-clock time via Java's System.currentTimeMillis()
    § Effectiveness: Defects4J
  § Tools' non-deterministic nature: 6 runs (hardware capacity limit)

18. The Methodology
§ Internal validity
  § CUTs from Defects4J (uniform and arbitrary selection from 5 open-source projects)
  § Tools and benchmark infrastructure tuned up on samples distinct from the contest benchmarks
  § runtool wrappers: implemented by the tool authors
§ Construct validity
  § Scoring formula weights: the value assigned to each quality indicator
  § Empirical studies: correlation of the proxy metrics for test effectiveness and fault-finding capability

19. The Results
§ The contest ran for ~1 week (test generation, execution and assessment) on 32 VMs.
§ A single virtual machine would have used 8 CPU months!

20. Lessons learned
§ Testing tool improvements
  § Automation, test effectiveness, comparability
§ Benchmarking infrastructure improvements
  § Decoupling test generation from execution/assessment
  § Flaky test identification and sanity checking
  § Fault-finding capability measurement
  § Test effectiveness as a function of test generation time
§ What next?
  § Automated parallelization of the benchmark contest
  § More tools, new languages? (e.g. C#?)

21. Contact us
Universidad Politécnica de Valencia, ES: urueda@pros.upv.es, tvos@dsic.upv.es
Open Universiteit Heerlen, NL: tanja.vos@ou.nl
University of Massachusetts Amherst, MA, USA: rjust@cs.umass.edu
University of Buenos Aires, Argentina: jgaleotti@dc.uba.ar
Web: http://sbstcontest.dsic.upv.es/
