COMPILER-ASSISTED TEST ACCELERATION ON GPUS FOR EMBEDDED SOFTWARE
Vanya Yaneva, Ajitha Rajan, Christophe Dubach
ISSTA 2017, 10 July 2017, Santa Barbara, CA
EMBEDDED SOFTWARE IS EVERYWHERE
ITS SAFETY AND CORRECTNESS ARE CRUCIAL
FUNCTIONAL TESTING IS CRITICAL
FUNCTIONAL TESTING CAN BE EXTREMELY TIME CONSUMING
Test suite: Test case 1 … Test case n, each run through the Application and compared with Expected result 1 … Expected result n
TESTING IS AN IDEAL CANDIDATE FOR PARALLELISATION
CPU SERVERS
Expensive
Do not scale easily as test suites grow
Can be extremely underutilised
GPUS
Cheap and widely available
Large-scale parallelism, thousands of threads
SIMD architecture suited to functional testing
EXECUTE TESTS IN PARALLEL ON THE GPU THREADS
Test suite: Test case 1 … Test case n, with Expected result 1 … Expected result n
1. Read test cases: INPUT[] = {test case 1 … test case n}
2. Transfer INPUT[] to GPU memory
3. Build and launch tested program on the GPU threads; thread th_id (0, 1, 2, 3, …, n-1) computes OUTPUT[th_id] = program ( INPUT[th_id] )
4. Transfer OUTPUT[] to CPU memory
A. Rajan, S. Sharma, P. Schrammel, D. Kroening. Accelerated test execution using GPUs. In Proceedings of ASE 2014, pages 97-102, Sweden, Nov 2014.
CHALLENGES
Usability ✘
Scope ✘
Performance ?
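The execution model on this slide can be pictured with a small CPU-only C sketch. It is an illustration under assumptions, not code from the cited work: `program` and `run_tests` are hypothetical names, the tested program is a made-up one-liner, and the loop runs sequentially where a GPU would give each index its own thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical program under test: one test input in, one result out. */
int program(int input) {
    return input * 2 + 1;
}

/* Sequential stand-in for the GPU launch. Each loop iteration plays the
 * role of one GPU thread with id th_id. The iterations do not depend on
 * each other, which is what makes the work parallelisable. */
void run_tests(const int *input, int *output, size_t n) {
    assert(input != NULL && output != NULL);
    for (size_t th_id = 0; th_id < n; ++th_id) {
        output[th_id] = program(input[th_id]);
    }
}
```

On a real GPU the loop disappears: the body is launched once per thread, and the arrays have to be copied to and from device memory around the launch.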
INTRODUCING PARTECL
Unmodified source files + Config file -> ParTeCL CodeGen -> OpenCL -> ParTeCL Runtime -> Execution on the GPU
Test cases (CSV format) -> ParTeCL Runtime
INPUTS
Example:
    #include <stdio.h>
    #include <stdlib.h>

    int c;

    int addc(int a, int b){
      return a + b + c;
    }

    int main(int argc, char* argv[]){
      int a = atoi(argv[1]);
      int b = atoi(argv[2]);
      c = 3;

      int sum = addc(a, b);
      printf("%d + %d + %d = %d\n", a, b, c, sum);
    }
Configuration:
    input: int a 1
    input: int b 2
    result: int sum variable: sum
Test cases:
    1 13 7
    2 50 22
    3 1000 0
    4 0 1000
    5 0 0
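As a rough illustration of how one test-case line might map onto the configured inputs, here is a minimal C sketch. It assumes the first column is the test case number and the next two columns are `a` and `b`, in the argument order given by the configuration. `struct test_case` and `parse_test_case` are hypothetical names, not part of ParTeCL.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical in-memory form of one test case for the addc example. */
struct test_case {
    int test_case_num; /* assumed: first column is the test case id */
    int a;             /* "input: int a 1": first program argument  */
    int b;             /* "input: int b 2": second program argument */
};

/* Parse one whitespace-separated line such as "1 13 7".
 * Returns 1 on success, 0 if the line does not hold three integers. */
int parse_test_case(const char *line, struct test_case *out) {
    assert(line != NULL && out != NULL);
    return sscanf(line, "%d %d %d",
                  &out->test_case_num, &out->a, &out->b) == 3;
}
```

Parsing on the host keeps string handling such as `atoi(argv[1])` out of the code that has to run on the GPU.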
PARTECL CODEGEN
Example:
    #include <stdio.h>
    #include <stdlib.h>

    int c;

    int addc(int a, int b){
      return a + b + c;
    }

    int main(int argc, char* argv[]){
      int a = atoi(argv[1]);
      int b = atoi(argv[2]);
      c = 3;

      int sum = addc(a, b);
      printf("%d + %d + %d = %d\n", a, b, c, sum);
    }
OpenCL:
    #include "structs.h"
    //#include <stdio.h>
    //#include <stdlib.h>

    /*int c;*/

    int addc(int a, int b, int *c){
      return a + b + (*c);
    }

    kernel void main_kernel(
        global struct test_input* inputs,
        global struct test_result* results){
      int idx = get_global_id(0);
      struct test_input input_gen = inputs[idx];
      global struct test_result *result_gen = &results[idx];
      int argc = input_gen.argc;
      result_gen->test_case_num = input_gen.test_case_num;

      int c;
      int a = input_gen.a;
      int b = input_gen.b;
      c = 3;

      int sum = addc(a, b, &c);
      /*printf("%d + %d + %d = %d\n", a, b, c, sum);*/
      result_gen->sum = sum;
    }
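The generated kernel refers to a `structs.h` that the slide does not show. The sketch below guesses at a layout and replays one work-item of the kernel in plain C, so the transformation can be tried without an OpenCL device. Only the field names come from the kernel above. The field types, the field order, and the name `main_kernel_emulated` are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed contents of "structs.h": field names are taken from the
 * generated kernel, types and order are guesses. */
struct test_input {
    int test_case_num;
    int argc;
    int a;
    int b;
};

struct test_result {
    int test_case_num;
    int sum;
};

/* Transformed addc: the former global c is now passed by pointer. */
int addc(int a, int b, int *c) {
    return a + b + (*c);
}

/* Plain-C stand-in for one work-item of main_kernel.
 * The idx parameter replaces get_global_id(0). */
void main_kernel_emulated(const struct test_input *inputs,
                          struct test_result *results, int idx) {
    assert(inputs != NULL && results != NULL && idx >= 0);
    struct test_input input_gen = inputs[idx];
    struct test_result *result_gen = &results[idx];
    result_gen->test_case_num = input_gen.test_case_num;

    int c; /* was a global variable in the original program */
    int a = input_gen.a;
    int b = input_gen.b;
    c = 3;

    result_gen->sum = addc(a, b, &c);
}
```

Making `c` local to each work-item means no two test cases share mutable state, so they can run concurrently without interfering.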
CODE TRANSFORMATIONS
global scope variables
command line arguments
standard in/out
standard library (partial support): clClibc
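The slides do not show how standard output is handled beyond the commented-out `printf` in the generated kernel. One possible approach, sketched here purely as an assumption, is to redirect the formatted text into a per-test buffer that travels back with the result. `captured_result`, `emit_sum`, and `OUT_CAPACITY` are hypothetical names, and the sketch is host-side C, not OpenCL.

```c
#include <assert.h>
#include <stdio.h>

#define OUT_CAPACITY 64 /* hypothetical per-test output buffer size */

/* Hypothetical per-test result record carrying captured output. */
struct captured_result {
    char out[OUT_CAPACITY];
    int out_len; /* number of characters stored, excluding the '\0' */
};

/* Stand-in for a call such as printf("%d + %d = %d\n", a, b, sum):
 * the text goes into the result record instead of the terminal.
 * snprintf truncates safely if the text does not fit. */
void emit_sum(struct captured_result *res, int a, int b, int sum) {
    assert(res != NULL);
    int n = snprintf(res->out, sizeof res->out, "%d + %d = %d\n", a, b, sum);
    if (n < 0) {
        res->out[0] = '\0'; /* formatting failed: store empty output */
        n = 0;
    }
    res->out_len = (n < OUT_CAPACITY) ? n : OUT_CAPACITY - 1;
}
```

A fixed-size buffer fits the constraint listed later in the deck that dynamic memory allocation is not yet supported.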
PARTECL RUNTIME
1. Read test cases: INPUT[] = {test case 1 … test case n}
2. Transfer INPUT[] to GPU memory
3. Build and launch tested program (automatically generated OpenCL) on the GPU threads; thread th_id (0, 1, 2, 3, …, n-1) computes OUTPUT[th_id] = program ( INPUT[th_id] )
4. Transfer OUTPUT[] to CPU memory
CHALLENGES Usability ✔ Scope ✔ Performance ?
Unmodified source files + Config file -> ParTeCL CodeGen -> OpenCL -> ParTeCL Runtime -> Execution on the GPU
Test cases (CSV format) -> ParTeCL Runtime
EVALUATION
1. Speedup against CPU
2. Data transfer overhead
3. Comparison to a multi-core CPU
4. Correctness
EXPERIMENT
Subjects: EEMBC, an industry-standard benchmark suite for embedded software
Hardware: GPU: NVIDIA Tesla K40m; CPU: Intel Xeon, 8 cores
Test suite size: 130K tests
SPEEDUP AGAINST CPU
DATA TRANSFER OVERHEAD
[Figure: one plot per benchmark (viterb00, fbital00, a2time01, autcor00, tblook01, fft00, conven00, puwmod01, rspeed01). x-axis: number of tests (log base 2 scale, 2^8 to 2^17). y-axis: execution time [ms], broken down into input transfer, output transfer, and kernel execution.]
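The three components in these plots can be combined into a single overhead figure. The sketch below shows one way to compute the share of total time spent on data transfer. `gpu_timing` and `transfer_fraction` are hypothetical names, and no measured values from the talk are reproduced here.

```c
#include <assert.h>

/* Timings for one run in milliseconds, matching the three plotted parts. */
struct gpu_timing {
    double input_transfer_ms;
    double output_transfer_ms;
    double kernel_ms;
};

/* Fraction of the total time spent moving data between CPU and GPU
 * memory, as opposed to executing the tests. Result is in [0, 1]. */
double transfer_fraction(struct gpu_timing t) {
    double total = t.input_transfer_ms + t.output_transfer_ms + t.kernel_ms;
    assert(total > 0.0);
    return (t.input_transfer_ms + t.output_transfer_ms) / total;
}
```

A value close to 0 means kernel execution dominates. A value close to 1 means the run is bound by transfers.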
COMPARISON TO A MULTI-CORE CPU
CHALLENGES Usability ✔ Scope ✔ Performance ✔
CORRECTNESS
For all 9 benchmarks, the test results produced on the GPU exactly match the test results produced on the CPU.
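A check of this kind can be sketched in a few lines of C. This is an assumed illustration rather than the authors' comparison code: `count_mismatches` is a hypothetical helper, and results are simplified to one `int` per test case.

```c
#include <assert.h>
#include <stddef.h>

/* Count the test cases whose GPU result differs from the CPU reference.
 * A return value of zero corresponds to an exact match. */
size_t count_mismatches(const int *cpu_results, const int *gpu_results,
                        size_t n) {
    assert(cpu_results != NULL && gpu_results != NULL);
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        if (cpu_results[i] != gpu_results[i]) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Counting mismatches instead of stopping at the first one makes it easier to see whether a discrepancy is isolated or systematic.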
SUMMARY
Automatic GPU code generation
Automatic test execution on the GPU threads
Speedup of up to 53x (avg 16x) on EEMBC benchmarks
Correct testing results
FUTURE WORK
Extend evaluation & scope
Analyse & improve performance
THANKS
ParTeCL CodeGen: github.com/wyaneva/partecl-codegen
ParTeCL Runtime: github.com/wyaneva/partecl-runtime
clClibc: github.com/wyaneva/clClibc
C FEATURES
Out of the box: pure functions, function calls, double precision (for OpenCL 1.2)
With transformations: standard in/out, global scope variables, standard library calls (partial support)
Not yet supported: dynamic memory allocation, file I/O, recursion