QUASAR (GPU Programming Language) on GDaaS Accelerates Coding from Months to Days
OUTLINE
1 THE CAUSE
2 THE OFFER
3 HOW DOES IT WORK
4 DEMO
5 RESULTS
6 CONCLUSION
GPUs are everywhere
… ALMOST EVERYWHERE ???
Low level coding experts are required
Long development lead times
Strong coupling between algorithm development and implementation
Each HW platform requires a new implementation
OBSERVATION
While breakthrough results are achieved, GPU usage in research is still limited
— Scientific articles mentioning CUDA: 90K
— Scientific articles mentioning a specific scripting language: 400K-1900K
A continuous investment in easy access:
— Optimized libraries: cuFFT, cuDNN, cuBLAS, …
— Tools: DIGITS
High level access increases the potential market by a factor of 7, but is limited to specific applications
QUASAR on GPU desktop as a service (GDaaS)
— High level programming language
— IDE & runtime
— Optimization knowledge base & libraries
QUASAR'S VALUE PROPOSITION
Lowering barrier of entry: days instead of months to get started*
Faster development: development cycle reduction by a factor of 3 to 10
Efficient code: #lines of code reduced by a factor of 2 to 3, same performance
Future proof: code can also target other GPU models
Better algorithms: distinctive tools for coding, algorithm design and analysis; development and exploration decoupled from implementation
Resulting benefits: larger market and faster take-up, reduced maintenance costs, earlier product launch, early access to highest performance, better R&D and products
HOW DOES IT WORK?
Y = sum(A + B .* C + D)
Code analysis and target-dependent lowering
HOW DOES IT WORK?
Y = sum(A + B .* C + D)
Code analysis and target-dependent lowering:
— Automatic generation of a kernel function
— Parallel reduction algorithm using shared memory
— Compile-time handling of boundary checks

function $out:scalar = __kernel__ kernel$1(A:vec'col'unchecked, B:vec'unchecked, C:scalar, D:vec'unchecked, $datadims:int, blkpos:int, blkdim:int, blkidx:int)
    $bins:vec'unchecked = shared(blkdim)
    $accum0 = 0.
    for $m = (blkpos+(blkidx*blkdim))..(64*blkdim)..($datadims-1)
        pos = $m
        $accum0 += (A[pos] + (B[pos].*C) + D[pos])
    end
    $bins[blkpos] = $accum0
    syncthreads
    $bit = 1
    while ($bit < blkdim)
        if (mod(blkpos, (2*$bit)) == 0)
            $bins[blkpos] = ($bins[blkpos] + $bins[blkpos+$bit])
        endif
        syncthreads
        $bit *= 2
        continue
    end
    if (blkpos == 0)
        $out += $bins[0]
    endif
end
$out = parallel_do([($blksz.*[1,64,1]), $blksz], A, B, C, D, numel(A), kernel$1)
HOW DOES IT WORK?
Y = sum(A + B .* C + D)
Automatic generation of CUDA/OpenCL/C++ code from the kernel function above:

__global__ void kernel(scalar *ret, Vector _PA, Vector _PB, scalar _PC, Vector _PD, int _P_datadims)
{
    shmem shmem; shmem_init(&shmem);
    int blkpos = threadIdx.x, blkdim = blockDim.x, blkidx = blockIdx.x;
    Matrix o35, bins;
    scalar accum0;
    int m, _Lpos, bit;
    bins = shmem_alloc<scalar>(&shmem, blkdim);
    accum0 = 0.0f;
    for (m = (blkpos + (blkidx * blkdim)); m <= _P_datadims - 1; m += (64 * blkdim)) {
        accum0 += vector_get_at<scalar>(_PA, m) + vector_get_at_checked<scalar>(_PB, _Lpos) * _PC + vector_get_at_checked<scalar>(_PD, m);
    }
    vector_set_at<scalar>(bins, blkpos, accum0);
    __syncthreads();
    for (bit = 1; bit < blkdim; bit *= 2) {
        if (mod(blkpos, (2 * bit)) == 0) {
            scalar t05 = vector_get_at_safe<scalar>(bins, blkpos + bit);
            scalar t15 = vector_get_at<scalar>(bins, blkpos);
            vector_set_at<scalar>(bins, blkpos, (t15 + t05));
        }
    }
    if (blkpos == 0)
        atomicAdd(ret, vector_get_at<scalar>(bins, 0));
}
QUASAR'S WORKFLOW
DEVELOPMENT (SCRIPTING LANGUAGE) → CODE ANALYSIS & LOWERING → RUNTIME COMPILATION → OPTIMAL EXECUTION
Inputs: input data, hardware, runtime information
High level scripting: compact, readable code; ideal for rapid prototyping
QUASAR'S WORKFLOW
DEVELOPMENT (SCRIPTING LANGUAGE) → CODE ANALYSIS & LOWERING → RUNTIME COMPILATION → OPTIMAL EXECUTION
Code analysis & lowering: optimization hints; automatic detection of parallelism
QUASAR'S WORKFLOW
DEVELOPMENT (SCRIPTING LANGUAGE) → CODE ANALYSIS & LOWERING → RUNTIME COMPILATION → OPTIMAL EXECUTION
Runtime information: HW setup, load, memory state, scheduling
QUASAR ON GDaaS BENEFITS
1 Anyness (screen, device, GPU power)
2 Hourly model (1-4 GPUs); monthly Quasar licenses; instant app distribution
3 Today: M60; coming: multi-GPU
DEMO
RESULTS
[Charts: lines of code, execution time (ms) and development time for three benchmarks — 32-tap filter with global memory, 32x32 separable 2D spatial filter with global memory, and wavelet filter with global memory — comparing CUDA against QUASAR (CUDA-LOC vs QUASAR-LOC, CUDA-time vs QUASAR-time, CUDA-Dev time vs QUASAR-Dev time)]
Development time: implementation of an MRI reconstruction algorithm in <14 days using QUASAR versus 3 months using CUDA
More efficient code and shorter development times while keeping the same performance
QUASAR APPLICATIONS
CONCLUSION
High level scripting language: ideal for rapid prototyping; fast development; maintainable, compact code
Optimal usage of heterogeneous hardware (multi-core, GPUs)
Context aware execution: build once, execute on any system; different hardware, different optimization; future proof code
Better, faster and smarter development thanks to GDaaS
www.gdaas.com/quasar
www.gepura.io
Visit us at booth 826
LEAVE YOUR BUSINESS CARD TO REQUEST YOUR FREE TRIAL OR GO TO www.gdaas.com