QUASAR (GPU Programming Language) on GDaaS Accelerates Coding from Months to Days
OUTLINE
1 THE CAUSE
2 THE OFFER
3 HOW DOES IT WORK
4 DEMO
5 RESULTS
6 CONCLUSION
GPUs are everywhere
… ALMOST EVERYWHERE ???
Low level coding experts are required
Long development lead times
Strong coupling between algorithm development and implementation
Each HW platform requires a new implementation
OBSERVATION
While breakthrough results are achieved, GPU usage in research is still limited
— Scientific articles mentioning CUDA: 90K
— Scientific articles mentioning a specific scripting language: 400K-1900K
A continuous investment in easy access:
— Optimized libraries: cuFFT, cuDNN, cuBLAS, …
— Tools: DIGITS
High level access increases the potential market by a factor of 7, but is limited to specific applications
QUASAR on GPU desktop as a service (GDaaS)
— High level programming language
— IDE & runtime
— Optimization knowledge base & libraries
QUASAR'S VALUE PROPOSITION
Lowering barrier of entry: days instead of months to get started*
Faster development: development cycle reduction by a factor of 3 to 10
Efficient code: #lines of code reduced by a factor of 2 to 3, same performance
Future proof: code can also target other GPU models
Better algorithms: distinctive tools for coding, algorithm design and analysis; development and exploration decoupled from implementation
Resulting benefits: larger market and faster take-up, reduced maintenance costs, earlier product launch, early access to highest performance, better R&D and products
HOW DOES IT WORK?
Y = sum(A + B .* C + D)
Code analysis and target-dependent lowering
HOW DOES IT WORK?
Y = sum(A + B .* C + D)
Code analysis and target-dependent lowering:
— Automatic generation of a kernel function
— Parallel reduction algorithm using shared memory
— Compile-time handling of boundary checks

function $out:scalar = __kernel__ kernel$1(A:vec'col'unchecked, B:vec'unchecked, C:scalar, D:vec'unchecked, $datadims:int, blkpos:int, blkdim:int, blkidx:int)
    $bins:vec'unchecked = shared(blkdim)
    $accum0 = 0.
    for $m = (blkpos+(blkidx*blkdim))..(64*blkdim)..($datadims-1)
        pos = $m
        $accum0 += (A[pos] + (B[pos].*C) + D[pos])
    end
    $bins[blkpos] = $accum0
    syncthreads
    $bit = 1
    while ($bit < blkdim)
        if (mod(blkpos, (2*$bit)) == 0)
            $bins[blkpos] = ($bins[blkpos] + $bins[blkpos+$bit])
        endif
        syncthreads
        $bit *= 2
        continue
    end
    if (blkpos == 0)
        $out += $bins[0]
    endif
end
$out = parallel_do([($blksz.*[1,64,1]), $blksz], A, B, C, D, numel(A), kernel$1)
HOW DOES IT WORK?
Y = sum(A + B .* C + D)
Automatic generation of CUDA/OpenCL/C++ code from the kernel function above:

__global__ void kernel(scalar *ret, Vector _PA, Vector _PB, scalar _PC, Vector _PD, int _P_datadims)
{
    shmem shmem; shmem_init(&shmem);
    int blkpos = threadIdx.x, blkdim = blockDim.x, blkidx = blockIdx.x;
    Matrix o35, bins;
    scalar accum0;
    int m, _Lpos, bit;
    bins = shmem_alloc<scalar>(&shmem, blkdim);
    accum0 = 0.0f;
    for (m = (blkpos + (blkidx * blkdim)); m <= _P_datadims - 1; m += (64 * blkdim)) {
        accum0 += vector_get_at<scalar>(_PA, m) + vector_get_at_checked<scalar>(_PB, _Lpos) * _PC + vector_get_at_checked<scalar>(_PD, m);
    }
    vector_set_at<scalar>(bins, blkpos, accum0);
    __syncthreads();
    for (bit = 1; bit < blkdim; bit *= 2) {
        if (mod(blkpos, (2 * bit)) == 0) {
            scalar t05 = vector_get_at_safe<scalar>(bins, blkpos + bit);
            scalar t15 = vector_get_at<scalar>(bins, blkpos);
            vector_set_at<scalar>(bins, blkpos, (t15 + t05));
        }
    }
    if (blkpos == 0)
        atomicAdd(ret, vector_get_at<scalar>(bins, 0));
}
QUASAR'S WORKFLOW
DEVELOPMENT (SCRIPTING LANGUAGE) → CODE ANALYSIS & LOWERING → RUNTIME COMPILATION → OPTIMAL EXECUTION
Inputs: input data, hardware, runtime information
High level scripting: compact, readable code; ideal for rapid prototyping
QUASAR'S WORKFLOW
DEVELOPMENT (SCRIPTING LANGUAGE) → CODE ANALYSIS & LOWERING → RUNTIME COMPILATION → OPTIMAL EXECUTION
Code analysis & lowering: optimization hints; automatic detection of parallelism
QUASAR'S WORKFLOW
DEVELOPMENT (SCRIPTING LANGUAGE) → CODE ANALYSIS & LOWERING → RUNTIME COMPILATION → OPTIMAL EXECUTION
Runtime information: HW setup, load, memory state, scheduling
QUASAR ON GDaaS BENEFITS
1 Anyness (screen, device, GPU power)
2 Hourly model (1-4 GPUs); monthly Quasar licenses; instant app distribution
3 Today: M60; coming: multi-GPU
DEMO
RESULTS
[Charts: lines of code, execution time (ms) and development time for three benchmarks — 32-tap filter with global memory, 32x32 separable 2D spatial filter with global memory, and wavelet filter with global memory — comparing CUDA against QUASAR (CUDA-LOC vs QUASAR-LOC, CUDA-time vs QUASAR-time, CUDA-Dev time vs QUASAR-Dev time)]
Development time: implementation of an MRI reconstruction algorithm in <14 days using QUASAR versus 3 months using CUDA
More efficient code and shorter development times while keeping the same performance
QUASAR APPLICATIONS
CONCLUSION
High level scripting language: ideal for rapid prototyping; fast development; maintainable, compact code
Optimal usage of heterogeneous hardware (multi-core, GPUs)
Context aware execution: build once, execute on any system; different hardware, different optimization; future proof code
Better, faster and smarter development thanks to GDaaS
www.gdaas.com/quasar
www.gepura.io
Visit us at booth 826
LEAVE YOUR BUSINESS CARD TO REQUEST YOUR FREE TRIAL OR GO TO www.gdaas.com