Optimization techniques for 3D-FWT on systems with manycore GPUs and multicore CPUs

  1. International Conference on Computational Science (ICCS 2013), 5-7 June, 2013
     Optimization techniques for 3D-FWT on systems with manycore GPUs and multicore CPUs
     G. Bernabé †, J. Cuenca † and D. Giménez ‡
     † Computer Engineering Department, University of Murcia
     ‡ Computer Science and Systems Department, University of Murcia

  2. Outline
     • Introduction
     • The Wavelet Transform
     • Optimization techniques for 3D-FWT on a single GPU system
     • Optimization techniques for 3D-FWT on hybrid systems
     • Conclusions and Future work
     ICCS’13 – Optimization techniques for 3D-FWT on systems with manycore GPUs and multicore CPUs

  4. Introduction
     • The application of the Wavelet Transform
       – Important development: mainly applied to image and video compression
       – Optimal tiled 2D and 3D FWT: reduction of almost an order of magnitude in the overall execution time (with respect to a baseline version on a CPU)
       – CUDA and OpenCL provide mechanisms to optimize general-purpose applications on GPUs (GPGPU)
       – Several implementations of the 3D-FWT in CUDA and OpenCL for acceleration on GPUs
     • A method to automatically compute the parameters of the 3D-FWT running on systems with multicore CPUs and manycore GPUs

  7. The Wavelet Transform: 1D-FWT
     • The wavelet transform uses simple filters for fast computation
     • The filters are applied to the signal, and each filter output is downsampled by two, generating two bands
     • The amount of data is maintained at each additional level, with minimal information loss
     • The access pattern is determined by the mother wavelet function
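The filter-and-downsample step can be illustrated with a minimal sketch. It assumes a Haar wavelet in its averaging form, which the slides do not name as the mother wavelet; the function names `haar_1d` and `haar_multilevel` are hypothetical.

```python
def haar_1d(signal):
    """One level of an averaging Haar transform (assumed wavelet).

    Each pair of samples yields one low-band and one high-band coefficient,
    so both filter outputs are effectively downsampled by two.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) / 2 for a, b in pairs]   # reference (approximation) band
    high = [(a - b) / 2 for a, b in pairs]  # detail band
    return low, high


def haar_multilevel(signal, levels):
    """Apply haar_1d repeatedly to the low band.

    Returns [final_low, coarsest_high, ..., finest_high]; the total number
    of coefficients equals the length of the input signal.
    """
    low = list(signal)
    detail_bands = []
    for _ in range(levels):
        low, high = haar_1d(low)
        detail_bands.insert(0, high)
    return [low] + detail_bands
```

For a four-sample signal, two levels produce one low coefficient and three detail coefficients, so the amount of data stays at four values.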

  8. The Wavelet Transform: 2D-FWT
     • Generalizes the 1D-FWT to an image (2D)
     • Applies the 1D-FWT to each row and to each column of the image
     [Figure: original image; rows transformed; columns transformed; result after a three-level application of the filters]

  9. The Wavelet Transform: 3D-FWT with tiling
     • Generalizes the 1D-FWT to a video sequence (3D)
       1. N rows x N columns calls to the 1D-FWT across frames (time dimension)
       2. For each of the N frames, a call to the 2D-FWT with tiling
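The two stages can be sketched on nested Python lists as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses a single-level averaging Haar filter (the slides do not name the wavelet), omits tiling entirely, and the names `haar_1d`, `fwt_2d` and `fwt_3d` are hypothetical.

```python
def haar_1d(signal):
    """One level of an averaging Haar transform: low band followed by high band."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) / 2 for a, b in pairs]
    high = [(a - b) / 2 for a, b in pairs]
    return low + high


def fwt_2d(frame):
    """Apply the 1D transform to every row, then to every column."""
    rows = [haar_1d(row) for row in frame]
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major


def fwt_3d(video):
    """video[t][r][c]: 1D transform over time per pixel, then 2D per frame."""
    n_frames, n_rows, n_cols = len(video), len(video[0]), len(video[0][0])
    temporal = [[[0.0] * n_cols for _ in range(n_rows)] for _ in range(n_frames)]
    # Stage 1: one 1D-FWT call per pixel position (rows x columns calls).
    for r in range(n_rows):
        for c in range(n_cols):
            series = haar_1d([video[t][r][c] for t in range(n_frames)])
            for t in range(n_frames):
                temporal[t][r][c] = series[t]
    # Stage 2: one 2D-FWT call per frame (no tiling in this sketch).
    return [fwt_2d(frame) for frame in temporal]
```

With two identical frames, the temporal detail frame comes out as all zeros, which is a quick sanity check on the time-axis stage.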

  10. The Wavelet Transform: 3D-FWT on CUDA and OpenCL
      • Our 3D-FWT implementation in CUDA and OpenCL consists of the following three steps:
        1. The host (CPU) allocates the first four video frames in memory
        2. The first four frames are transferred from host to device
           – The 1D-FWT is then applied to the first four frames over the time dimension
        3. The 2D-FWT is applied to the detailed and reference video, and the results are sent back to the CPU

  11. The Wavelet Transform: 3D-FWT on CUDA and OpenCL
      • Two more frames are read (interleaved) to complete each new step
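One way to picture the frame schedule is a sliding window of four frames that advances by two frames per step. The sketch below assumes that the two oldest frames are dropped when two new ones are read, which the slides do not state explicitly; `frame_windows` and its parameters are hypothetical names.

```python
def frame_windows(n_frames, window=4, step=2):
    """Yield the frame indices held on the device at each step.

    Assumption: the first step loads `window` frames, and every later step
    reads `step` new frames while discarding the `step` oldest ones.
    """
    start = 0
    while start + window <= n_frames:
        yield list(range(start, start + window))
        start += step


if __name__ == "__main__":
    for indices in frame_windows(8):
        print(indices)
```

Under this assumption, only the first step transfers four frames from host to device; every later step transfers two.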

  14. Optimization techniques for 3D-FWT on a single GPU system
      The method consists mainly of three stages:
      1. Automatically detect the available GPU in the system
      2. Depending on the GPU (Nvidia or ATI), use the CUDA or OpenCL 3D-FWT
      3. Automatically select the key parameter: the block or work-group size
         • The remaining parameters (grid size, shared memory occupancy, etc.) are also calculated automatically

  15. Optimization techniques for 3D-FWT on a single GPU system
      The method consists mainly of three stages:
      1. Automatically detect the available GPU in the system
      2. Depending on the GPU (Nvidia or ATI), use the CUDA or OpenCL 3D-FWT
      3. Automatically select the key parameter: the block or work-group size
         • The block size is based on the CUDA occupancy calculator
           1. Select the block size that maximizes the occupancy of each multiprocessor
           2. If two or more values obtain the same occupancy, select the one with the maximum number of active thread blocks per multiprocessor
         • The work-group size is equal to the value of CL_DEVICE_MAX_WORK_GROUP_SIZE
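The block-size rule (maximize occupancy, then break ties by the number of active blocks) can be sketched with a deliberately simplified occupancy model. The model below only accounts for the warp and block limits of a multiprocessor and ignores registers and shared memory, which the real CUDA occupancy calculator also considers. The default limits (48 resident warps, 8 resident blocks, warp size 32) are illustrative assumptions, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiprocessorLimits:
    """Assumed per-multiprocessor limits; not queried from a real device."""
    warp_size: int = 32
    max_warps: int = 48
    max_blocks: int = 8


def active_blocks(block_size, limits):
    """Resident blocks per multiprocessor under the warp and block caps."""
    warps_per_block = -(-block_size // limits.warp_size)  # ceiling division
    return min(limits.max_blocks, limits.max_warps // warps_per_block)


def active_warps(block_size, limits):
    """Resident warps; occupancy is this value divided by limits.max_warps."""
    warps_per_block = -(-block_size // limits.warp_size)
    return active_blocks(block_size, limits) * warps_per_block


def select_block_size(candidates, limits):
    """Pick the highest occupancy; on ties prefer more active blocks."""
    return max(
        candidates,
        key=lambda b: (active_warps(b, limits), active_blocks(b, limits)),
    )
```

With these assumed limits and candidates 64, 128, 192 and 256, both 192 and 256 reach full occupancy, and the tie-break picks 192 because it keeps 8 blocks resident instead of 6.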

  16. Optimization techniques for 3D-FWT on a single GPU system
      Experiments with 3D-FWT parameters for 3 GPUs
      Execution times for a run on 64 frames, by frame size and block size:

      Frame size       512x512                             1024x1024                           2048x2048
      Block size       64       128      192      256      64       128      192      256      64       128      192      256
      Tesla C870       58.68    56.28    53.51    58.68    225.74   214.36   209.01   217.21   889.83   841.47   840.14   850.26
      Tesla C2050      35.33    53.17    32.13    33.59    122.12   115.02   110.88   113.32   467.50   438.46   427.69   433.84
      FirePro V5800    130.06   135.87   131.29   114.87   452.95   346.29   313.35   307.54   2123.60  1496.27  1284.56  1217.59

      • The optimization engine studies the problem for different block or work-group sizes
      • Selects 192 for the Tesla C870 and the Fermi-based Tesla C2050 (optimal)
      • Selects 256 for the ATI FirePro (optimal)
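The "(optimal)" annotations can be checked directly against the reported execution times. The sketch below copies the values from the table into a dictionary and finds the fastest block size for each GPU and frame size; the names `TIMES` and `fastest_block_size` are hypothetical.

```python
BLOCK_SIZES = (64, 128, 192, 256)

# Execution times as reported on the slide, per GPU and frame size,
# listed in the order of BLOCK_SIZES.
TIMES = {
    "Tesla C870": {
        "512x512": (58.68, 56.28, 53.51, 58.68),
        "1024x1024": (225.74, 214.36, 209.01, 217.21),
        "2048x2048": (889.83, 841.47, 840.14, 850.26),
    },
    "Tesla C2050": {
        "512x512": (35.33, 53.17, 32.13, 33.59),
        "1024x1024": (122.12, 115.02, 110.88, 113.32),
        "2048x2048": (467.50, 438.46, 427.69, 433.84),
    },
    "FirePro V5800": {
        "512x512": (130.06, 135.87, 131.29, 114.87),
        "1024x1024": (452.95, 346.29, 313.35, 307.54),
        "2048x2048": (2123.60, 1496.27, 1284.56, 1217.59),
    },
}


def fastest_block_size(times):
    """Return the block size with the lowest execution time."""
    return min(zip(times, BLOCK_SIZES))[1]


if __name__ == "__main__":
    for gpu, by_frame_size in TIMES.items():
        for frame_size, times in by_frame_size.items():
            print(gpu, frame_size, fastest_block_size(times))
```

For every frame size in the table, the fastest block size is 192 on both Tesla cards and 256 on the FirePro V5800, matching the values the optimization engine selects.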
