April 4-7, 2016 | Silicon Valley PARALLEL SILENCE CODING ALGORITHMS ON GPUS John Cheng and Nanxun Dai BGP International Inc, R&D Center April 5, 2016
CONTENTS: 1 Silence Encoding; 2 Algorithm Conversion from Serial to Parallel; 3 Implementation on GPUs with CUB
SEISMIC DATA COMPRESSION ALGORITHM: Wavelet Transformation, Quantization, Prefix Encoding, Silence Encoding, Huffman Encoding
[Figure: a typical data set in wave propagation]
AN ILLUSTRATION OF SILENCE ENCODING: each run of zeros is stored as a data pair (zero, its length)
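A serial reference makes the pair format concrete. The sketch below is illustrative only: the function name `silence_encode_serial` is hypothetical, and it assumes values are held in `int` for readability, whereas a real encoder would likely use a byte type and cap the run length to fit it.

```cpp
#include <cstddef>
#include <vector>

// Serial silence encoding (illustrative sketch, not the authors' code):
// non-zero values are copied unchanged, and each maximal run of zeros is
// replaced by the pair (0, run length).
std::vector<int> silence_encode_serial(const std::vector<int>& in) {
    std::vector<int> out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] != 0) {
            out.push_back(in[i]);  // non-zero value: copy as-is
            ++i;
        } else {
            int run = 0;           // count the zeros in this run
            while (i < in.size() && in[i] == 0) {
                ++run;
                ++i;
            }
            out.push_back(0);      // marker
            out.push_back(run);    // length of the zero run
        }
    }
    return out;
}
```

For example, the input 5 0 0 0 7 0 3 would encode to 5 (0,3) 7 (0,1) 3. The loop carries state from one element to the next, which is what makes the serial form hard to parallelize directly.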
HOW TO MAKE IT RUN IN PARALLEL: Which thread has the right to write? Where should the thread write to? What should the thread write?
FOR A NON-ZERO THREAD: How many zero elements precede it? How many zero segments precede it?
FOR A ZERO THREAD: How many zero elements precede it? How many zero segments precede it? How many zero elements are in its own zero segment?
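The two counts answer the "where" question. As a hedged sketch (the helper `write_position` and its parameter names are hypothetical, and it assumes one thread per input element with a single designated writer per zero segment), the output index could be derived like this:

```cpp
// Output index for the thread that handles input element i (sketch).
//   zeros_before:    zero elements strictly before i
//   segments_before: zero segments that ended strictly before i
// Every preceding non-zero value occupies one output slot, and every
// preceding zero segment occupies two slots: (0, length).
int write_position(int i, int zeros_before, int segments_before) {
    const int nonzeros_before = i - zeros_before;
    return nonzeros_before + 2 * segments_before;
}
```

With the input 5 0 0 0 7 0 3, the thread holding 7 (index 4) sees three zeros and one zero segment before it, so it would write to index 3 of the encoded output 5 0 3 7 0 1 3. A non-zero thread writes its own value there; a designated zero thread writes the pair (0, length) at that index and the next one.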
PREFIX SCAN: Prefix scan is a key primitive in parallel computation. All the information we need can be calculated in parallel using prefix sums. Therefore, we can convert the algorithm from serial to parallel.
AN ILLUSTRATION OF PREFIX SCAN: partial sum up to index 5
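To show why a scan parallelizes, here is a CPU emulation of the Hillis-Steele scan, one well-known data-parallel formulation. It is a sketch for illustration and is not claimed to be what the talk or CUB uses internally; both function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Inclusive prefix sum, out[i] = in[0] + ... + in[i], computed in
// about log2(n) passes. Within one pass every element is updated
// independently, so on a GPU each inner-loop iteration could be one thread.
std::vector<int> inclusive_scan_steps(std::vector<int> data) {
    const std::size_t n = data.size();
    std::vector<int> next(n);
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 0; i < n; ++i) {
            // Add the partial sum that sits `stride` positions to the left.
            next[i] = data[i] + (i >= stride ? data[i - stride] : 0);
        }
        data.swap(next);  // double buffering stands in for a barrier
    }
    return data;
}

// Exclusive prefix sum: shift the inclusive result right by one slot,
// so out[i] = in[0] + ... + in[i-1] and out[0] = 0.
std::vector<int> exclusive_from_inclusive(const std::vector<int>& inc) {
    std::vector<int> exc(inc.size(), 0);
    for (std::size_t i = 1; i < inc.size(); ++i) {
        exc[i] = inc[i - 1];
    }
    return exc;
}
```

For the input 3 1 4 1 5 9, the inclusive scan gives 3 4 8 9 14 23 and the exclusive scan gives 0 3 4 8 9 14.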
CALCULATING PRECEDING ZERO ELEMENTS: auxiliary variable, inclusive prefix sum
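The slide does not define the auxiliary variable. A plausible reading, assumed here, is a 0/1 flag marking each zero element; the function name `inclusive_zero_count` is hypothetical.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive prefix sum over a zero flag (assumed auxiliary variable).
// result[i] = number of zero elements in in[0..i].
std::vector<int> inclusive_zero_count(const std::vector<int>& in) {
    std::vector<int> flag(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        flag[i] = (in[i] == 0) ? 1 : 0;  // 1 marks a zero element
    }
    std::vector<int> count(in.size());
    // std::partial_sum is the serial stand-in for a parallel inclusive scan.
    std::partial_sum(flag.begin(), flag.end(), count.begin());
    return count;
}
```

For a non-zero element the result is already the number of zeros before it; for a zero element, subtracting one removes the element itself. The input 5 0 0 0 7 0 3 yields 0 1 2 3 3 4 4.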
CALCULATING PRECEDING ZERO SEGMENTS: auxiliary variable, exclusive prefix sum
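Again the auxiliary variable is not spelled out on the slide. One choice that works, assumed in this sketch, is a flag on the last zero of each segment; an exclusive sum of that flag then counts the segments that ended before each element. The name `segments_before` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum over a segment-end flag (assumed auxiliary variable).
// result[i] = number of zero segments that ended strictly before index i.
std::vector<int> segments_before(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> end_flag(n, 0), exc(n, 0);
    for (std::size_t i = 0; i < n; ++i) {
        // A zero is the last of its run if the next element is non-zero
        // or the input ends here.
        const bool last_of_run = (i + 1 == n) || (in[i + 1] != 0);
        end_flag[i] = (in[i] == 0 && last_of_run) ? 1 : 0;
    }
    // Serial stand-in for a parallel exclusive scan.
    for (std::size_t i = 1; i < n; ++i) {
        exc[i] = exc[i - 1] + end_flag[i - 1];
    }
    return exc;
}
```

The exclusive form matters for the zero that closes a segment: its own flag must not be counted, otherwise its segment would be treated as already written. The input 5 0 0 0 7 0 3 yields 0 0 0 0 1 1 2.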
PARALLEL SILENCE ENCODING ALGORITHM
Step 1: Read global data into shared memory
Step 2: Calculate preceding zero elements with an inclusive prefix sum
Step 3: Calculate preceding zero segments with an exclusive prefix sum
Step 4: Calculate write positions for each thread
Step 5: Write the encoded string to shared memory
Step 6: Write the encoded string to global memory
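Steps 2 to 5 can be emulated on the CPU to check the index arithmetic. This is a sketch under several assumptions that the slides do not confirm: one loop iteration stands for one thread, plain vectors stand for shared memory, the last zero of each segment is the designated writer, and the run length is obtained by having the first zero of each segment record how many zeros precede it. The name `silence_encode_scan` is hypothetical.

```cpp
#include <vector>

std::vector<int> silence_encode_scan(const std::vector<int>& in) {
    const int n = static_cast<int>(in.size());
    if (n == 0) return {};
    std::vector<int> zero(n), last(n), zeros_inc(n), segs_exc(n);

    // Auxiliary flags: zero[i] marks a zero, last[i] the last zero of a run.
    for (int i = 0; i < n; ++i) {
        zero[i] = (in[i] == 0) ? 1 : 0;
        last[i] = (zero[i] && (i + 1 == n || in[i + 1] != 0)) ? 1 : 0;
    }
    // Step 2: inclusive prefix sum of the zero flags.
    zeros_inc[0] = zero[0];
    for (int i = 1; i < n; ++i) zeros_inc[i] = zeros_inc[i - 1] + zero[i];
    // Step 3: exclusive prefix sum of the segment-end flags.
    segs_exc[0] = 0;
    for (int i = 1; i < n; ++i) segs_exc[i] = segs_exc[i - 1] + last[i - 1];

    const int segments = segs_exc[n - 1] + last[n - 1];
    std::vector<int> out(n - zeros_inc[n - 1] + 2 * segments);
    std::vector<int> zeros_before_seg(segments);

    // The first zero of each segment records the zeros preceding its segment.
    for (int i = 0; i < n; ++i) {
        if (zero[i] && (i == 0 || in[i - 1] != 0)) {
            zeros_before_seg[segs_exc[i]] = zeros_inc[i] - 1;
        }
    }
    // (A GPU kernel would need a block-wide barrier at this point.)

    // Steps 4 and 5: each "thread" computes its position and writes.
    for (int i = 0; i < n; ++i) {
        const int zeros_before = zeros_inc[i] - zero[i];
        const int pos = (i - zeros_before) + 2 * segs_exc[i];
        if (!zero[i]) {
            out[pos] = in[i];  // non-zero thread copies its value
        } else if (last[i]) {
            out[pos] = 0;      // designated zero thread writes (0, length)
            out[pos + 1] = zeros_inc[i] - zeros_before_seg[segs_exc[i]];
        }                      // other zero threads write nothing
    }
    return out;
}
```

Every write position depends only on scan results, so no thread needs to wait for another thread's output, and the two loops over `i` can each run fully in parallel.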
IMPLEMENTATION WITH CUB: warp-wide primitives, block-wide primitives, device-wide primitives. More info: https://nvlabs.github.io/cub/
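HOW TO WRAP CUB PRIMITIVES
// Block-wide exclusive prefix sum: each thread passes one value in `in` and
// receives its exclusive sum in `out`; `aggregate` receives the block total.
// Assumes `using namespace cub;` and a compile-time block size DIM defined elsewhere.
__device__ __forceinline__ void cub_prefix_sum_exclusive(char in, char& out, char& aggregate)
{
    typedef BlockScan<char, DIM, BLOCK_SCAN_RAKING> BlockScanT;
    __shared__ typename BlockScanT::TempStorage iscan;  // scratch space for the scan
    char data[1];
    data[0] = in;
    __syncthreads();
    BlockScanT(iscan).ExclusiveSum(data, data, aggregate);
    __syncthreads();  // make iscan safe to reuse by a later call
    out = data[0];
}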
PERFORMANCE OF DIFFERENT KERNELS: [bar chart of elapsed time in ms (axis 0 to 2500) for the naive kernel and the parallel kernel at data sizes 1024x1024x1, 1024x1024x100, and 1024x1024x1000]
DIFFERENT CUB ALGORITHMS (data size 1024x1024x1000): BLOCK_SCAN_WARP_SCANS 352.57; BLOCK_SCAN_RAKING_MEMOIZE 303.05; BLOCK_SCAN_RAKING 323.81 [bar chart, axis 270 to 360]
CONCLUSION: Prefix sum is an efficient way to convert serial computations to parallel computations. It is convenient to integrate CUB parallel primitives into your implementation.
Each subject in the book is treated with a profile-driven approach
THANK YOU: John Cheng and Nanxun Dai, BGP International Inc, R&D Center, 10630 Haddington Dr., Houston, Texas 77043, rwcheng@bgprdc.com