High Capability Multidimensional Data Compression on GPUs


  1. S5455, GTC 2015: High Capability Multidimensional Data Compression on GPUs. Sergio E. Zarantonello (szarantonello@scu.edu), Ed Karrels (ed.karrels@gmail.com). School of Engineering, Santa Clara University

  2. Part 1: Theory and Applications. Challenge: • Massive amounts of multidimensional data are being generated by scientific simulations, monitoring devices, and high-end imaging applications. • Current networks and conventional computer hardware and software are increasingly unable to transmit, store, and analyze this data. Solution: • Fast and effective lossy data compression. • Compression ratios optimized subject to a priori error bounds, requiring several iterations of the compress/decompress cycle. • GPUs to make the above feasible.

  3. Part 1: Theory and Applications. Our goal: • A multidimensional wavelet-based CODEC for large data. • A discrete optimization procedure to provide the best compression ratios subject to error tolerances and error metrics specified by the user. • A high-performance CUDA implementation for large data, exploiting parallelism at various levels. • A flexible design allowing for future enhancements (redundant bases, adaptive dictionaries, compressive sensing, sparse representations, etc.). • An initial focus on medical computed tomography, seismic imaging, and non-destructive testing of materials.

  4. Part 1: Theory and Applications. Why wavelets? • Wavelets are “short” waves, “localized” in both the spatial and frequency domains. • They can be used as basis functions for sparse representations of data. • They give compact representations of well-behaved data and point singularities. • Multidimensional wavelets take advantage of data correlation along all coordinate axes. • Wavelet encoding/decoding can be implemented with fast algorithms.

  5. Part 1: Theory and Applications. Conventional 2d procedure (diagram).

  6. Part 1: Theory and Applications. Design: • Data is decomposed into overlapping cubelets. • Cubelets are encoded via biorthogonal wavelet filters along each coordinate axis. • Wavelet coefficients are thresholded, then quantized. • Quantized cubelets are Huffman encoded. • The process is “reversed” to reconstruct the data. • A “hill climbing” algorithm delivers the highest compression possible subject to the error constraint(s).
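A minimal host-side sketch of the encode path just described. All four stage names are placeholders for illustration (the deck does not give an API), and the real CODEC also splits the volume into overlapping cubelets and implements the matching decode path.

#include <cstddef>

// placeholder declarations for the four stages listed above
void forwardWavelet3d(float *d_cube, int n);
float findThreshold(float *d_coeffs, size_t count);
void quantizeCoefficients(float *d_coeffs, size_t count, float threshold);
void huffmanEncode(const float *d_coeffs, size_t count);

// one cubelet through the encode pipeline
void encodeCubelet(float *d_cube, int n /* cubelet edge length */) {
    size_t count = (size_t)n * n * n;
    forwardWavelet3d(d_cube, n);              // biorthogonal filters along each axis
    float t = findThreshold(d_cube, count);   // smallest x% of coefficients -> 0
    quantizeCoefficients(d_cube, count, t);   // log or Lloyd bins (slide 20)
    huffmanEncode(d_cube, count);             // entropy-code the bin indices
}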

  7. Part 1: Theory and Applications. 2d (frame-by-frame) versus 3d procedure. Test data: X-ray CT scan, 10 steps of the cardiac cycle, 512 x 512 x 96 cube (http://www.osirix-viewer.com). At the same error rate (PSNR = 46): 2d procedure: compression ratio 6.6, cutoff 88%, bins 1106, max error 9.45. 3d procedure: compression ratio 10.2, cutoff 92%, bins 850, max error 5.68.
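PSNR and max error are the two metrics quoted above; a minimal host-side sketch of how they might be computed (the peak value used for PSNR is an assumption, since the slide does not say how it is chosen):

#include <cmath>
#include <cstddef>

// PSNR = 10 * log10(peak^2 / MSE), in dB; maxErr is the largest absolute error
void errorMetrics(const float *orig, const float *recon, size_t n,
                  float peak, float *psnr, float *maxErr) {
    double sumSq = 0;
    float worst = 0;
    for (size_t i = 0; i < n; i++) {
        float d = fabsf(orig[i] - recon[i]);
        if (d > worst) worst = d;
        sumSq += (double)d * d;
    }
    *maxErr = worst;
    *psnr = 10.0f * log10f(peak * peak / (float)(sumSq / n));
}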

  8. Part 1: Theory and Applications. Outline of 3d procedure (diagram: the transform is applied along the x, y, and z axes in turn).

  9. Part 1: Theory and Applications. Optimized compression for a given error tolerance: 1. Calculate wavelet coefficients. 2. Find starting compression parameters. 3. Calculate the reconstruction error (error check). 4. Calculate the compression ratio. 5. Run “hill climbing” iterations to find the maximum compression ratio subject to the error tolerance.
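A minimal sketch of the hill-climbing loop in steps 2-5. The two parameters match the knobs shown on slide 10 (% cutoff and number of quantization bins); the step sizes and the evaluate() helper (one compress/decompress cycle per candidate) are assumptions, and the starting point is assumed to satisfy the tolerance.

struct Params { float cutoff; int bins; };

// assumed helper: runs one compress/decompress cycle and reports
// the reconstruction error and the compression ratio
void evaluate(Params p, float *error, float *ratio);

Params hillClimb(Params start, float errorTolerance) {
    Params best = start;
    float err, bestRatio;
    evaluate(best, &err, &bestRatio);
    bool improved = true;
    while (improved) {
        improved = false;
        // try a small move in each parameter direction
        Params moves[4] = {
            {best.cutoff + 1, best.bins}, {best.cutoff - 1, best.bins},
            {best.cutoff, best.bins + 50}, {best.cutoff, best.bins - 50}};
        for (Params q : moves) {
            float ratio;
            evaluate(q, &err, &ratio);
            // accept the neighbor with the best ratio that stays in tolerance
            if (err <= errorTolerance && ratio > bestRatio) {
                bestRatio = ratio;
                best = q;
                improved = true;
            }
        }
    }
    return best;  // highest-ratio parameters within the error tolerance
}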

  10. Part 1: Theory and Applications. Optimized compression for a given error tolerance (plots: PSNR and compression ratio as functions of % cutoff and number of bins). Begin: cutoff 76%, bins 1850, PSNR 52.4, ratio 3.6×. End: cutoff 92%, bins 850, PSNR 46.2, ratio 10.2×.

  11. Part 1: Theory and Applications. Applications: Optical Coherence Tomography. Objective: efficient transfer over the internet of high-resolution 3d images of the retina for diagnosis. Dataset courtesy of Quinglin Wang, Carl Zeiss Meditec Inc.

  12. Part 1: Theory and Applications. Applications: Reverse Time Migration (seismic image).

  13. Part 2: Implementation. • Stages of compression: wavelet transform, threshold, quantization, Huffman coding. • Overall speedup. Example dataset: 3D CT X-ray inspection of a carburetor, courtesy of North Star Imaging.

  14. Part 2: Implementation. Wavelet transform on GPU: • Apply a convolution. • Each row is independent: 1 row == 1 thread block. • Within each row, multiple read/write passes; synchronize between the read and the write. • Before/after diagram: each pass splits a row into evens and odds.
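A minimal sketch of one such row pass, assuming the one-row-per-thread-block mapping above. The low-pass taps reuse the FILTER_0..FILTER_2 constants from slide 18 (a real biorthogonal filter has more taps), and the high-pass line is a placeholder.

#define FILTER_0 .852f
#define FILTER_1 .377f
#define FILTER_2 -.110f

__global__ void waveletRowPass(float *data, int rowLen) {
    extern __shared__ float row[];               // one full row per block
    float *src = data + (size_t)blockIdx.x * rowLen;

    // read phase: stage the row in shared memory
    for (int i = threadIdx.x; i < rowLen; i += blockDim.x)
        row[i] = src[i];
    __syncthreads();                             // synchronize between read and write

    // write phase: evens (low-pass) to the first half, odds (high-pass) to the second
    int half = rowLen / 2;
    for (int i = threadIdx.x; i < half; i += blockDim.x) {
        int c = 2 * i;
        int l2 = abs(c - 2);                     // mirrored left boundary
        int h2 = (c + 2 < rowLen) ? c + 2 : 2 * rowLen - 2 - (c + 2);  // mirrored right
        float low  = FILTER_0 * row[c]
                   + FILTER_1 * (row[abs(c - 1)] + row[c + 1])
                   + FILTER_2 * (row[l2] + row[h2]);
        float high = 0.5f * (row[c + 1] - row[c]);  // placeholder high-pass
        src[i]        = low;
        src[half + i] = high;
    }
}

// launch: one thread block per row, shared memory sized to the row
// waveletRowPass<<<numRows, 256, rowLen * sizeof(float)>>>(d_data, rowLen);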

  15. Part 2: Implementation. 3d wavelet transform: height × depth rows, each one an independent thread block (diagram: thread blocks (0,0), (1,0), (2,0), (3,0), (0,1), (0,2), ...).

  16. Part 2: Implementation. Transform along each axis: transform along X; transpose XYZ → YZX; transform along Y; transpose YZX → ZXY; transform along Z.

  17. Part 2: Implementation. GPU transpose: access global memory in contiguous order. The tile is staged in shared memory: reads from and writes to global memory stay contiguous (in thread-index order), while the noncontiguous, transposed access pattern is confined to shared memory.
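A minimal 2d sketch of this tiled transpose (the deck's actual transposes are 3d axis rotations, XYZ → YZX, but the staging idea is the same). The +1 padding is a standard trick to avoid shared-memory bank conflicts.

#define TILE 32

__global__ void transpose2d(const float *in, float *out, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];       // +1 avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // contiguous read
    __syncthreads();

    // swapped block indices so the write is also contiguous
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // contiguous write
}

// launch: dim3 block(TILE, TILE);
//         dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);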

  18. Part 2: Implementation. Optimizations:
Version 1: global memory.
Version 2: shared memory: 2.5× speedup.
Version 3: constant factors double → float: 1.6× speedup.

Before:
#define FILTER_0 .852
#define FILTER_1 .377
#define FILTER_2 -.110

After:
#define FILTER_0 .852f
#define FILTER_1 .377f
#define FILTER_2 -.110f

Speedup over CPU version: 105× (860 ms → 8.2 ms for a 256×256×256 cubelet, 8 levels of transforms along each axis).

  19. Part 2: Implementation. Threshold: • Trim the smallest x% of values (round to 0), e.g. over the range [-5 .. +5]. • Just sort the absolute values using the Thrust library. • Speedup over CPU sort: 112× (the 7.0 toolkit is 35% faster than 6.5).
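A minimal sketch of that step with Thrust, assuming a cutoff fraction such as 0.92 for a 92% cutoff (compile with nvcc --extended-lambda for the device lambdas):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sort.h>

// zero the smallest cutoffFraction of coefficients (by magnitude)
// and return the threshold value that was used
float applyThreshold(thrust::device_vector<float> &coeffs, float cutoffFraction) {
    // sort a copy of the absolute values
    thrust::device_vector<float> absVals(coeffs.size());
    thrust::transform(coeffs.begin(), coeffs.end(), absVals.begin(),
                      [] __device__ (float x) { return fabsf(x); });
    thrust::sort(absVals.begin(), absVals.end());

    // the threshold sits at the cutoff percentile
    float t = absVals[(size_t)(cutoffFraction * (coeffs.size() - 1))];

    // round everything below the threshold to 0
    thrust::transform(coeffs.begin(), coeffs.end(), coeffs.begin(),
                      [t] __device__ (float x) { return fabsf(x) < t ? 0.0f : x; });
    return t;
}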

  20. Part 2: Implementation. Quantization: map floating-point values to a small set of integers. • Log: bin size near x proportional to x; matches the data distribution well; simple function, fast. • Lloyd's algorithm: given starting bins, fine-tune them to minimize overall error; start with the log quantization bins; multiple passes over the full data set, time-consuming.
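A minimal sketch of the log quantizer described above. The numBins and maxVal inputs, the reservation of bin 0 for thresholded-away values, and carrying the sign in the bin index are all assumptions; the slide gives only the bin-size rule.

#include <cmath>

// map x to a signed bin index; bin width near |x| grows in proportion
// to |x|, i.e. bins are uniform in log(|x|)
__host__ __device__ inline int logQuantize(float x, float threshold,
                                           float maxVal, int numBins) {
    if (fabsf(x) < threshold) return 0;          // zeroed by thresholding
    float pos = logf(fabsf(x) / threshold)
              / logf(maxVal / threshold);        // 0..1 on a log scale
    int bin = 1 + (int)(pos * (numBins - 1));
    if (bin >= numBins) bin = numBins - 1;
    return (x < 0) ? -bin : bin;
}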

  21. Part 2: Implementation. Log / Lloyd quantization: • Log quantization: pSNR 38.269; GPU speedup 97× (thrust::transform()). • Lloyd quantization: pSNR 45.974; GPU speedup: create bins 13×, apply 48×.

  22. Part 2: Implementation. Huffman encoding: optimal bit encoding based on value frequencies. • Compute histogram on CPU: copy data GPU → CPU, 17 ms; compute on CPU, 27 ms. • Compute histogram on GPU: no copy needed; compute, 0.61 ms; optimization: per-thread counter for the most common value. Sample Huffman table:

Value   Count      Encoding
9       16609445   1
8       46198      011
10      42896      001
11      32594      000
7       30831      0101
12      6942       01000
6       5388       010011
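A minimal sketch of the GPU histogram with the per-thread-counter optimization from the last bullet. Which value dominates is assumed known in advance (bin 0 here), bin indices are assumed non-negative, and each block accumulates into a shared-memory copy before flushing to the global histogram.

__global__ void histogramKernel(const int *bins, size_t n,
                                unsigned int *hist, int numBins) {
    extern __shared__ unsigned int sharedHist[];
    for (int i = threadIdx.x; i < numBins; i += blockDim.x)
        sharedHist[i] = 0;
    __syncthreads();

    unsigned int commonCount = 0;  // per-thread counter for the common value
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        int b = bins[i];
        if (b == 0) commonCount++;               // no atomic contention
        else atomicAdd(&sharedHist[b], 1u);
    }
    atomicAdd(&sharedHist[0], commonCount);      // flush the counter once
    __syncthreads();

    for (int i = threadIdx.x; i < numBins; i += blockDim.x)
        atomicAdd(&hist[i], sharedHist[i]);      // merge into global histogram
}

// launch with shared memory sized to the histogram:
// histogramKernel<<<blocks, 256, numBins * sizeof(unsigned int)>>>(d_bins, n, d_hist, numBins);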

  23. Part 2: Implementation. Overall CPU → GPU speedup (chart: wavelet transform, sort, quantize, and histogram times, roughly 0-2500 ms on CPU versus 0-25 ms on GPU). GPU: GTX 980; CPU: Intel Core i7, 3.5 GHz. Time to process a 256³ cubelet, in milliseconds:

                MATLAB   CPU    GPU
Compress        43000    2300   21
Error control   39000    1400   18

  24. Part 2: Implementation. Future directions: • Improve performance: use a subsample for training Lloyd's algorithm; use Quickselect to find the threshold value; multiple GPUs. • Improve accuracy: weighted values in Lloyd's algorithm; normalize values in each quadrant.

  25. Our Team: Ed Karrels, Santa Clara University, ed.karrels@gmail.com; Drazen Fabris, Santa Clara University, dfabris@scu.edu; Sergio Zarantonello, Santa Clara University, szarantonello@scu.edu; David Concha, Universidad Rey Juan Carlos, Spain, david.concha@urjc.es; Bonnie Smithson, Santa Clara University, Bonnie@DejaThoris.com; Anupam Goyal, Algorithmica LLC, anupam@rithmica.com.
