COMPILER-ASSISTED TEST ACCELERATION ON GPUS FOR EMBEDDED SOFTWARE
Vanya Yaneva, Ajitha Rajan, Christophe Dubach
ISSTA 2017, 10 July 2017, Santa Barbara, CA

EMBEDDED SOFTWARE IS EVERYWHERE
ITS SAFETY AND CORRECTNESS ARE CRUCIAL
FUNCTIONAL TESTING IS CRITICAL

FUNCTIONAL TESTING CAN BE EXTREMELY TIME CONSUMING
Test suite for the Application:
  Test case 1: Expected result 1
  Test case 2: Expected result 2
  Test case 3: Expected result 3
  …
  Test case n: Expected result n
TESTING IS AN IDEAL CANDIDATE FOR PARALLELISATION

CPU SERVERS
  Expensive
  Do not scale easily as test suites grow
  Can be extremely underutilised
GPUS
  Cheap and widely available
  Large-scale parallelism, thousands of threads
  SIMD architecture suited to functional testing

EXECUTE TESTS IN PARALLEL ON THE GPU THREADS
Test suite: Test case 1 … Test case n, each with its Expected result
  1. Read test cases: INPUT[] = {test case 1 … test case n}
  2. Transfer INPUT[] to GPU memory
  3. Build and launch tested program on the GPU threads
     th_id = 0, 1, 2, 3, …, n-1: OUTPUT[th_id] = program ( INPUT[th_id] )
  4. Transfer OUTPUT[] to CPU memory
A. Rajan, S. Sharma, P. Schrammel, D. Kroening. Accelerated test execution using GPUs. In Proceedings of ASE 2014, pages 97-102, Sweden, Nov 2014.
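The execution model on this slide can be emulated on the CPU as a rough sketch in plain C. Each loop iteration stands in for one GPU thread. Only INPUT, OUTPUT and th_id come from the slide; `program` (here it just doubles its input), `run_tests` and `count_failures` are hypothetical names chosen for illustration, and real test inputs would rarely be single integers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tested program: a pure function of one test input. */
static int program(int input) {
    return 2 * input;
}

/* Emulates OUTPUT[th_id] = program(INPUT[th_id]).
 * On a GPU, every iteration would run as an independent thread. */
static void run_tests(const int *INPUT, int *OUTPUT, size_t n) {
    assert(INPUT != NULL && OUTPUT != NULL);
    for (size_t th_id = 0; th_id < n; ++th_id) {
        OUTPUT[th_id] = program(INPUT[th_id]);
    }
}

/* Compares actual outputs with expected results.
 * Returns the number of failing test cases. */
static size_t count_failures(const int *OUTPUT, const int *expected, size_t n) {
    size_t failures = 0;
    for (size_t i = 0; i < n; ++i) {
        if (OUTPUT[i] != expected[i]) {
            ++failures;
        }
    }
    return failures;
}
```

Because no iteration reads another iteration's output, the loop body maps directly onto one thread per test case, which is what makes the approach a fit for the GPU.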

EXECUTE TESTS IN PARALLEL ON THE GPU THREADS: CHALLENGES
  Usability ✘
  Scope ✘
  Performance ?

INTRODUCING PARTECL
Inputs: Test cases (CSV format), Unmodified source files, Config file
Pipeline: ParTeCL CodeGen → OpenCL → ParTeCL Runtime → Execution on the GPU

INPUTS
Example:
  #include <stdio.h>
  #include <stdlib.h>

  int c;

  int addc(int a, int b){
    return a + b + c;
  }

  int main(int argc, char* argv[]){
    int a = atoi(argv[1]);
    int b = atoi(argv[2]);
    c = 3;

    int sum = addc(a, b);
    printf("%d + %d + %d = %d\n", a, b, c, sum);
  }
Configuration:
  input: int a 1
  input: int b 2
  result: int sum variable: sum
Test cases:
  1 13 7
  2 50 22
  3 1000 0
  4 0 1000
  5 0 0
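As a minimal sketch of how one test case line could be read on the host, the C snippet below parses a line into a record. It assumes the three columns are the test case number followed by the values for `a` and `b` (matching positions 1 and 2 in the configuration), and that the columns are whitespace-separated as displayed on the slide; a comma-separated file would need a different format string. `struct test_case` and `parse_test_case` are hypothetical names, not part of ParTeCL.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one test case of the addc example. */
struct test_case {
    int test_case_num; /* first column */
    int a;             /* "input: int a 1", i.e. argv[1] */
    int b;             /* "input: int b 2", i.e. argv[2] */
};

/* Parses one line such as "1 13 7".
 * Returns 1 if all three fields were read, 0 otherwise. */
static int parse_test_case(const char *line, struct test_case *tc) {
    assert(line != NULL && tc != NULL);
    return sscanf(line, "%d %d %d", &tc->test_case_num, &tc->a, &tc->b) == 3;
}
```

Rejecting lines with missing fields keeps a malformed test case from silently running with uninitialised inputs.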

PARTECL CODEGEN
Example:
  #include <stdio.h>
  #include <stdlib.h>

  int c;

  int addc(int a, int b){
    return a + b + c;
  }

  int main(int argc, char* argv[]){
    int a = atoi(argv[1]);
    int b = atoi(argv[2]);
    c = 3;

    int sum = addc(a, b);
    printf("%d + %d + %d = %d\n", a, b, c, sum);
  }
OpenCL:
  #include "structs.h"
  //#include <stdio.h>
  //#include <stdlib.h>

  /*int c;*/

  int addc(int a, int b, int *c){
    return a + b + (*c);
  }

  kernel void main_kernel(
      global struct test_input* inputs,
      global struct test_result* results){

    int idx = get_global_id(0);
    struct test_input input_gen = inputs[idx];
    global struct test_result *result_gen = &results[idx];
    int argc = input_gen.argc;
    result_gen->test_case_num = input_gen.test_case_num;

    int c;
    int a = input_gen.a;
    int b = input_gen.b;
    c = 3;

    int sum = addc(a, b, &c);
    /*printf("%d + %d + %d = %d\n", a, b, c, sum);*/
    result_gen->sum = sum;
  }
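The contents of structs.h are not shown on the slide. The sketch below guesses a plausible layout from the fields the kernel accesses, and mirrors the kernel body in plain C so it can be checked on the CPU. The field order, the function name `main_kernel_cpu` and the explicit `idx` parameter (replacing get_global_id(0)) are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout of structs.h, inferred from the fields used in main_kernel. */
struct test_input {
    int test_case_num;
    int argc;
    int a;
    int b;
};

struct test_result {
    int test_case_num;
    int sum;
};

/* The former global c is passed by pointer, as in the generated code. */
static int addc(int a, int b, int *c) {
    return a + b + (*c);
}

/* Plain-C stand-in for main_kernel; idx plays the role of get_global_id(0). */
static void main_kernel_cpu(const struct test_input *inputs,
                            struct test_result *results, size_t idx) {
    assert(inputs != NULL && results != NULL);
    struct test_input input_gen = inputs[idx];
    struct test_result *result_gen = &results[idx];
    result_gen->test_case_num = input_gen.test_case_num;

    int c; /* local copy of the former global, one per test case */
    int a = input_gen.a;
    int b = input_gen.b;
    c = 3;

    result_gen->sum = addc(a, b, &c);
}
```

Making `c` local to each invocation is what lets thousands of test cases run side by side without sharing mutable state. For test case 1 (a = 13, b = 7) the stored sum is 23.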

CODE TRANSFORMATIONS
  global scope variables
  command line arguments
  standard in/out
  standard library (partial support): clClibc

PARTECL RUNTIME
  1. Read test cases: INPUT[] = {test case 1 … test case n}
  2. Transfer INPUT[] to GPU memory
  3. Build and launch tested program (automatically generated OpenCL) on the GPU threads
     th_id = 0, 1, 2, 3, …, n-1: OUTPUT[th_id] = program ( INPUT[th_id] )
  4. Transfer OUTPUT[] to CPU memory

CHALLENGES
  Usability ✔
  Scope ✔
  Performance ?
Test cases (CSV format), Unmodified source files, Config file → ParTeCL CodeGen → OpenCL → ParTeCL Runtime → Execution on the GPU

EVALUATION
  1. Speedup against CPU
  2. Data transfer overhead
  3. Comparison to a multi-core CPU
  4. Correctness

EXPERIMENT
  Subjects: EEMBC, an industry-standard benchmark suite for embedded software
  Hardware: GPU: NVIDIA Tesla K40m; CPU: Intel Xeon, 8 cores
  Test suite size: 130K

SPEEDUP AGAINST CPU

DATA TRANSFER OVERHEAD
[Figure: one chart per benchmark (viterb00, fbital00, a2time01, autcor00, tblook01, fft00, conven00, puwmod01, rspeed01). Each shows execution time [ms], split into input transfer, output transfer and kernel execution, against the number of tests (log base 2 scale).]

COMPARISON TO A MULTI-CORE CPU

CHALLENGES
  Usability ✔
  Scope ✔
  Performance ✔

CORRECTNESS
For all 9 benchmarks, the testing results from the GPU exactly match the testing results from the CPU.

SUMMARY
  Automatic GPU code generation
  Automatic test execution on the GPU threads
  Speedup of up to 53x (avg 16x) on EEMBC benchmarks
  Correct testing results
FUTURE WORK
  Extend evaluation & scope
  Analyse & improve performance

THANKS
  ParTeCL CodeGen: github.com/wyaneva/partecl-codegen
  ParTeCL Runtime: github.com/wyaneva/partecl-runtime
  clClibc: github.com/wyaneva/clClibc

C FEATURES
  Out of the box: pure functions, function calls, double precision (for OpenCL 1.2)
  With transformations: standard in/out, global scope variables, standard library calls (partial support)
  Unsupported (yet): dynamic memory allocation, file I/O, recursion
