Duet Benchmarking: Improving Measurement Accuracy in the Cloud
Lubomír Bulej, François Farquet, Vojtěch Horký, Aleksandar Prokopec, Petr Tůma
Software Regression Testing … of Performance
Common Testing Pipeline
[Diagram: developers write code, write more code, and write even more code, committing each change to the repository; a commit hook triggers the pipeline of check out code, build, and run tests, which ends in a pass or fail verdict.]
Project Context
Graal Java JIT+AOT Compiler
• Currently ~5 merge commits per day
• Bare minimum testing JDK 8 + JDK 11
• Running ~60 standard benchmarks
• Minimum warm up time 5 minutes
• Minimum 10 executions
5 x 2 x 60 x 5 x 10 = 30000 minutes = ~21 days
Use more machines · Skip some commits · Skip some benchmarks
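The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch of the slide's estimate; the variable names are illustrative, and the result counts machine time needed for one day's worth of commits.

```python
# Back-of-the-envelope cost of benchmarking one day's worth of merge commits.
# All figures are taken from the slide; variable names are illustrative.
commits_per_day = 5    # ~5 merge commits per day
jdk_versions = 2       # JDK 8 + JDK 11
benchmarks = 60        # ~60 standard benchmarks
warmup_minutes = 5     # minimum warm-up time per execution
executions = 10        # minimum executions per benchmark

total_minutes = (commits_per_day * jdk_versions * benchmarks
                 * warmup_minutes * executions)
total_days = total_minutes / (60 * 24)

print(f"{total_minutes} minutes = ~{total_days:.0f} days")
```

On a single machine, one day of commits therefore takes about three weeks to benchmark, which is why the slide lists more machines, fewer commits, or fewer benchmarks as the ways out.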
Where to Go for More Machines? … to the Cloud!
Cloud Resource Sharing
Amazon Elastic Compute Cloud (EC2) instance types
• t3.nano: 2 vCPU @ 5% power, 512 MB RAM
• t3.medium: 2 vCPU @ 20% power, 4 GB RAM
• m5.large: 2 vCPU, 8 GB RAM
• ...
Or you can forgo the virtualization
• m5.metal: 96 threads, 48 cores, 384 GB RAM
• Likely the same Intel Xeon Platinum 8175M
Envelope estimate (t3.nano-sized instances that fit on one m5.metal-class host)
• CPU: 48 cores / 5% = 960 instances
• RAM: 384 GB / 512 MB = 768 instances
This might perhaps somewhat disrupt measurements.
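The envelope estimate can be checked as follows. The sketch mirrors the slide's simplification (each small instance is charged 5% of one core and 512 MB of RAM); it is not a statement about how Amazon actually places instances, and the variable names are illustrative.

```python
# Envelope estimate: how many t3.nano-sized instances fit on one
# m5.metal-class host, following the slide's simplified arithmetic.
host_cores = 48
host_ram_gb = 384
nano_cpu_percent = 5   # baseline share of one core, per the slide
nano_ram_mb = 512

# Integer arithmetic avoids floating-point rounding in 48 / 0.05.
instances_by_cpu = host_cores * 100 // nano_cpu_percent
instances_by_ram = host_ram_gb * 1024 // nano_ram_mb

print(f"CPU-bound estimate: {instances_by_cpu} instances")
print(f"RAM-bound estimate: {instances_by_ram} instances")
```

Either way, the estimate suggests that hundreds of tenants may share one physical host.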
… Effect of Resource Sharing: 99% CI for the mean is ~61% bigger
… Effect of Resource Sharing: 99% CI for the mean is ~1800% bigger
Resource Management … Should Be Fair!
Is Resource Management Fair?
Hyperthreading
• Intel says it “maximizes use of execution units”
Bursty processor scheduling
• Amazon says “one CPU credit is equal to 100% utilization for one minute” (in any combination) and “credits are accrued and spent at millisecond resolution”
Memory caches?
Memory bandwidth?
Thermal budget?
Would it be fine if some instances were systematically disadvantaged?
Two Measurements In Parallel: both workloads fluctuate together (measured on GitLab CI)
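Why synchronized fluctuation helps can be illustrated with a small simulation. The data below is synthetic and every parameter is an assumption made for illustration (a shared multiplicative interference factor plus small independent noise); it is not the measurement from the slide.

```python
import random
import statistics

random.seed(1)

# Hypothetical noise-free run times; workload B is 5% slower than A.
BASE_A, BASE_B = 100.0, 105.0

def cv(xs):
    """Coefficient of variation: standard deviation relative to the mean."""
    return statistics.stdev(xs) / statistics.mean(xs)

times_a, times_b = [], []
for _ in range(200):
    # Interference that hits both workloads at the same moment (assumed
    # multiplicative), plus a small independent disturbance for each one.
    shared = random.lognormvariate(0.0, 0.3)
    times_a.append(BASE_A * shared * random.lognormvariate(0.0, 0.02))
    times_b.append(BASE_B * shared * random.lognormvariate(0.0, 0.02))

# Pairing the runs lets the shared factor cancel out in the ratio.
ratios = [b / a for a, b in zip(times_a, times_b)]

print(f"CV of A alone:   {cv(times_a):.3f}")
print(f"CV of B alone:   {cv(times_b):.3f}")
print(f"CV of B/A ratio: {cv(ratios):.3f}")
print(f"mean B/A ratio:  {statistics.mean(ratios):.3f}")
```

Under these assumptions the individual times vary widely while the paired ratio stays close to the true 1.05, because the shared factor appears in both numerator and denominator.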
How To Use This?
Look at ratios instead of absolute values
• Assumes effects are multiplicative
• Ratios are what people want to know: “We want to reliably detect 5% slowdowns ...”
Confidence intervals using bootstrap
Compare with sequential measurements
• Confidence interval width relative to mean
• Not quite apples-to-apples but gives some intuition
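A minimal sketch of the ratio-plus-bootstrap idea follows. It uses a plain percentile bootstrap of the mean, which may differ from the exact procedure in the paper; the helper `bootstrap_ci` and the paired timings are hypothetical and exist only for illustration.

```python
import random
import statistics

def bootstrap_ci(samples, level=0.99, resamples=2000, rng=None):
    """Percentile bootstrap confidence interval for the mean of samples."""
    rng = rng or random.Random(0)
    n = len(samples)
    means = sorted(
        statistics.mean(rng.choices(samples, k=n)) for _ in range(resamples)
    )
    lo_idx = int((1 - level) / 2 * resamples)
    hi_idx = int((1 + level) / 2 * resamples) - 1
    return means[lo_idx], means[hi_idx]

def relative_width(samples):
    """Confidence interval width relative to the sample mean."""
    lo, hi = bootstrap_ci(samples)
    return (hi - lo) / statistics.mean(samples)

# Hypothetical paired timings (seconds) from runs executed in parallel;
# both columns fluctuate together, B is roughly 5% slower than A.
time_a = [10.1, 12.3, 9.8, 15.2, 10.5, 11.9, 13.4, 10.0, 14.1, 11.2]
time_b = [10.6, 12.9, 10.3, 16.0, 11.0, 12.5, 14.1, 10.5, 14.8, 11.8]

ratios = [b / a for a, b in zip(time_a, time_b)]
lo, hi = bootstrap_ci(ratios)

print(f"99% CI for mean B/A ratio: [{lo:.4f}, {hi:.4f}]")
print(f"relative CI width, ratio:   {relative_width(ratios):.4f}")
print(f"relative CI width, A alone: {relative_width(time_a):.4f}")
```

Comparing the relative width for the ratio with the one for a single workload is, as the slide puts it, not quite apples-to-apples, but it gives some intuition for how much tighter the paired estimate can be.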
How Much More Accurate?
[Plots: accuracy improvement per benchmark suite] ScalaBench ~2.3x · SPEC CPU ~27x · ScalaBench ~9.1x · ScalaBench ~12x · SPEC CPU ~24x
… More Done
• Does duet benchmarking work because of synchronized interference?
• Does duet benchmarking address interference due to resource sharing?
• Does duet benchmarking measure performance differences accurately?
• …
Thank You!
Complete paper at https://arxiv.org/abs/2001.05811
For more information visit http://d3s.mff.cuni.cz