
PARALLEL SILENCE CODING ALGORITHMS ON GPUS
John Cheng and Nanxun Dai, BGP International Inc, R&D Center

April 4-7, 2016 | Silicon Valley. April 5, 2016.


  1. April 4-7, 2016 | Silicon Valley. PARALLEL SILENCE CODING ALGORITHMS ON GPUS. John Cheng and Nanxun Dai, BGP International Inc, R&D Center. April 5, 2016.

  2. CONTENTS
     (1) Silence Encoding
     (2) Algorithm Conversion from Serial to Parallel
     (3) Implementation on GPUs with CUB

  3. SEISMIC DATA COMPRESSION ALGORITHM
     - Wavelet Transformation
     - Quantization
     - Prefix Encoding
     - Silence Encoding
     - Huffman Encoding

  4. A typical data set in wave propagation (figure)

  5. AN ILLUSTRATION OF SILENCE ENCODING
     Each run of zeros is replaced by a data pair: (zero, its length)
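
The pair encoding on this slide can be sketched serially as follows. This is a minimal host-side reference, not the authors' code: the function name `silence_encode_serial`, the `int` element type, and the choice to store the run length directly after the zero are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference sketch (assumed layout): non-zero values are copied
// unchanged and each run of zeros becomes the pair (0, run length).
std::vector<int> silence_encode_serial(const std::vector<int>& in) {
    std::vector<int> out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] != 0) {
            out.push_back(in[i]);
            ++i;
        } else {
            const std::size_t start = i;
            while (i < in.size() && in[i] == 0) ++i;  // advance to the end of the run
            assert(i > start);                        // a run holds at least one zero
            out.push_back(0);
            out.push_back(static_cast<int>(i - start));
        }
    }
    return out;
}
```

With this layout, the input `{3, 0, 0, 0, 5, 0, 7}` encodes to `{3, 0, 3, 5, 0, 1, 7}`. A run of a single zero grows from one value to two, so the scheme only pays off when zero runs are long.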

  6. HOW TO MAKE IT RUN IN PARALLEL
     - Which thread has the right to write
     - Where should the thread write to
     - What should the thread write

  7. FOR A NON-ZERO-THREAD
     - How many zero elements come before it
     - How many zero segments come before it

  8. FOR A ZERO-THREAD
     - How many zero elements come before it
     - How many zero segments come before it
     - How many zero elements are in its own zero segment

  9. PREFIX SCAN
     - Prefix scan can be considered a key primitive in parallel computation
     - All the information we need can be calculated in parallel by using prefix-sum
     - Therefore, we can convert the algorithm from serial to parallel

  10. THE ILLUSTRATION OF PREFIX SCAN (figure label: partial sum up to index 5)
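
For readers without the figure, the two flavors of prefix sum used in the following slides can be sketched on the host as below. The function names are illustrative assumptions, and the loops are serial stand-ins for what a parallel scan would compute.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive scan: out[i] = in[0] + ... + in[i].
std::vector<int> inclusive_prefix_sum(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    assert(out.empty() || out.front() == in.front());  // first partial sum is the first element
    return out;
}

// Exclusive scan: out[i] = in[0] + ... + in[i-1], so out[0] = 0.
std::vector<int> exclusive_prefix_sum(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // sum of everything strictly before index i
        running += in[i];
    }
    return out;
}
```

For the input `{1, 2, 3, 4, 5, 6}`, the inclusive partial sum at index 5 is 21 and the exclusive one is 15.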

  11. CALCULATING PRECEDING ZERO ELEMENTS (figure labels: auxiliary variable; inclusive prefix-sum)
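
One plausible reading of this slide, sketched on the host: the auxiliary variable is a flag that is 1 where the element is zero, and its inclusive prefix sum gives the number of zeros at indices 0 through i. The flag definition and the name `zeros_up_to` are assumptions, since the slide's figure is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed auxiliary variable: flag[i] = 1 if in[i] == 0, else 0.
// Returns the inclusive prefix sum of that flag.
std::vector<int> zeros_up_to(const std::vector<int>& in) {
    std::vector<int> count(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += (in[i] == 0) ? 1 : 0;           // add this element's flag
        assert(running <= static_cast<int>(i) + 1); // cannot exceed elements seen so far
        count[i] = running;
    }
    return count;
}
```

For a non-zero element, `count[i]` is exactly the number of zeros before it. For a zero element, the count includes the element itself, so the number of preceding zeros is `count[i] - 1`.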

  12. CALCULATING PRECEDING ZERO SEGMENTS (figure labels: auxiliary variable; exclusive prefix-sum)
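
A host-side sketch of one way to count preceding zero segments. It assumes the auxiliary variable marks the first zero of each segment (a zero whose left neighbor is non-zero or missing); the slide's figure is not available, so this definition and the name `zero_segments_before` are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed auxiliary variable: head[i] = 1 if in[i] is the first zero of a run.
// Returns the exclusive prefix sum of head, i.e. the number of zero
// segments that start strictly before index i.
std::vector<int> zero_segments_before(const std::vector<int>& in) {
    std::vector<int> before(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const bool head = (in[i] == 0) && (i == 0 || in[i - 1] != 0);
        before[i] = running;   // exclusive: this element's own flag is not counted yet
        running += head ? 1 : 0;
        assert(running <= static_cast<int>(i) + 1);  // at most one segment start per element
    }
    return before;
}
```

The value is the true count of preceding segments for non-zero elements and for the first zero of each run. For later zeros in a run it also counts the run they belong to, which is harmless if only the first zero of a run writes output.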

  13. PARALLEL SILENCE ENCODING ALGORITHM
      Step 1: Read global data to shared memory
      Step 2: Calculate preceding zero elements with inclusive prefix-sum
      Step 3: Calculate preceding zero segments with exclusive prefix-sum
      Step 4: Calculate write positions for each thread
      Step 5: Write the encoded string to shared memory
      Step 6: Write the encoded string to global memory
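
The steps above can be simulated on the host to check the write-position arithmetic. This is a sketch under stated assumptions, not the authors' kernel: the first zero of each run is assumed to be the writing thread, the run length is recovered from the zero counts at the run's first and last element, memory staging (steps 1 and 6) is omitted, and all names are hypothetical. Every loop marked "per thread" has independent iterations, so on a GPU each index would map to one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<int> silence_encode_scan(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> zeros_incl(n), segs_excl(n), is_head(n);

    // Steps 2 and 3: the two prefix sums (serial here; a block-wide scan
    // would replace this loop on the GPU).
    int zeros = 0, segs = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const bool zero = (in[i] == 0);
        is_head[i] = (zero && (i == 0 || in[i - 1] != 0)) ? 1 : 0;
        segs_excl[i] = segs;        // zero segments started before index i
        segs += is_head[i];
        zeros += zero ? 1 : 0;
        zeros_incl[i] = zeros;      // zeros at indices 0..i
    }

    // Run lengths. Per thread: the first zero of a run records how many
    // zeros precede the run.
    std::vector<int> zeros_before_run(segs), run_len(segs);
    for (std::size_t i = 0; i < n; ++i)
        if (is_head[i]) zeros_before_run[segs_excl[i]] = zeros_incl[i] - 1;
    // Per thread (after a barrier): the last zero of a run derives the length.
    for (std::size_t i = 0; i < n; ++i) {
        const bool tail = (in[i] == 0) && (i + 1 == n || in[i + 1] != 0);
        if (tail) {
            const int seg = segs_excl[i] + is_head[i] - 1;  // index of this run
            run_len[seg] = zeros_incl[i] - zeros_before_run[seg];
        }
    }

    // Steps 4 and 5, per thread: each zero run shrinks to two slots.
    std::vector<int> out(n - static_cast<std::size_t>(zeros)
                           + 2 * static_cast<std::size_t>(segs));
    for (std::size_t i = 0; i < n; ++i) {
        const int zeros_excl = zeros_incl[i] - (in[i] == 0 ? 1 : 0);
        const std::size_t pos = i - static_cast<std::size_t>(zeros_excl)
                                  + 2 * static_cast<std::size_t>(segs_excl[i]);
        if (in[i] != 0) {
            assert(pos < out.size());
            out[pos] = in[i];                      // non-zero thread copies its value
        } else if (is_head[i]) {
            assert(pos + 1 < out.size());
            out[pos] = 0;                          // first zero of a run writes the pair
            out[pos + 1] = run_len[segs_excl[i]];
        }                                          // other zero threads write nothing
    }
    return out;
}
```

For `{3, 0, 0, 0, 5, 0, 7}` this produces `{3, 0, 3, 5, 0, 1, 7}`, the same result as encoding the sequence serially. The write position is the element's index, minus the zeros before it, plus two slots per preceding zero segment.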

  14. IMPLEMENTATION WITH CUB
      - Warp-wide primitives
      - Block-wide primitives
      - Device-wide primitives
      - More info: https://nvlabs.github.io/cub/

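  15. HOW TO WRAP CUB PRIMITIVES

          #include <cub/cub.cuh>
          using namespace cub;  // the slide uses BlockScan and BLOCK_SCAN_RAKING unqualified

          // DIM is the thread-block size, a compile-time constant defined elsewhere.
          // Block-wide exclusive prefix sum: each thread passes in one value and
          // receives the sum of the values held by all lower-ranked threads;
          // aggregate receives the block-wide total.
          // Note: with char, every partial sum must fit in the range of char.
          __device__ __forceinline__ void cub_prefix_sum_exclusive(char in, char& out, char& aggregate)
          {
              typedef BlockScan<char, DIM, BLOCK_SCAN_RAKING> BlockScanT;
              __shared__ typename BlockScanT::TempStorage iscan;  // scratch space for the scan
              char data[1];  // one item per thread
              data[0] = in;
              __syncthreads();
              BlockScanT(iscan).ExclusiveSum(data, data, aggregate);
              __syncthreads();  // iscan can be reused safely after this barrier
              out = data[0];
          }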

  16. PERFORMANCE OF DIFFERENT KERNELS
      [Bar chart: elapsed time in ms (axis 0 to 2500) for the Naive Kernel and the Parallel Kernel at data sizes 1024x1024x1, 1024x1024x100, and 1024x1024x1000]

  17. DIFFERENT CUB ALGORITHMS (data size 1024x1024x1000; chart axis 270 to 360)
      - BLOCK_SCAN_WARP_SCANS: 352.57
      - BLOCK_SCAN_RAKING_MEMOIZE: 303.05
      - BLOCK_SCAN_RAKING: 323.81

  18. CONCLUSION
      - Prefix-sum is an efficient way to convert serial computations to parallel computations
      - It is convenient to integrate CUB parallel primitives into your implementation

  19. Each subject in the book is treated with a profile-driven approach

  20. April 4-7, 2016 | Silicon Valley. THANK YOU. John Cheng and Nanxun Dai, BGP International Inc, R&D Center, 10630 Haddington Dr., Houston, Texas 77043. rwcheng@bgprdc.com
