
DYNAMIC PARTITIONING-BASED JPEG DECOMPRESSION ON HETEROGENEOUS MULTICORE ARCHITECTURES
Wasuwee Sodsong¹, Jingun Hong¹, Seongwook Chung¹, Yeongkyu Lim², Shin-Dug Kim¹ and Bernd Burgstaller¹
¹Yonsei University, ²LG Electronics


  1. DYNAMIC PARTITIONING-BASED JPEG DECOMPRESSION ON HETEROGENEOUS MULTICORE ARCHITECTURES. Wasuwee Sodsong¹, Jingun Hong¹, Seongwook Chung¹, Yeongkyu Lim², Shin-Dug Kim¹ and Bernd Burgstaller¹ (¹Yonsei University, ²LG Electronics)

  2. JPEG Decompression. [Diagram: entropy-coded data shown as a raw bitstream of variable-length codewords.]

  3. JPEG Decompression. [Diagram: Huffman decoding turns the entropy-coded data into 8x8 blocks of frequency-domain coefficients.]

  4. JPEG Decompression. [Diagram: the IDCT transforms each 8x8 frequency-domain block into the spatial domain (YCbCr).]

  5. JPEG Decompression. [Diagram: upsampling restores the subsampled chroma components of the YCbCr data.]

  6. JPEG Decompression. [Diagram: color conversion maps YCbCr to RGB.]

  7. JPEG Decompression. [Diagram, full pipeline: entropy-coded data, Huffman decoding, IDCT, upsampling, color conversion, output of the RGB bitmap image.]
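The color-conversion step at the end of this pipeline maps each YCbCr sample to RGB. A minimal sketch using the common BT.601 full-range coefficients (libjpeg-turbo's actual fixed-point SIMD implementation differs in detail):

```c
/* Clamp a sample into the displayable 0..255 range. */
static unsigned char clamp8(double v) {
    if (v < 0.0)   return 0;
    if (v > 255.0) return 255;
    return (unsigned char)(v + 0.5);
}

/* Convert one YCbCr sample (full range, BT.601 coefficients) to RGB.
 * JPEG stores the chroma components (Cb, Cr) offset by 128. */
static void ycbcr_to_rgb(double y, double cb, double cr,
                         unsigned char *r, unsigned char *g, unsigned char *b)
{
    *r = clamp8(y + 1.402    * (cr - 128.0));
    *g = clamp8(y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0));
    *b = clamp8(y + 1.772    * (cb - 128.0));
}
```

A neutral-chroma sample (Cb = Cr = 128) maps to a gray value equal to its luma, which is a quick sanity check for any implementation.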

  8. Sequential JPEG Decompression
   JPEG is an asymmetric compression scheme: compression is performed once per image, while decompression is performed once per use.
   463 of the 500 most popular websites use JPEG images.
   JPEG operates on blocks of 8x8 pixels.
   Sequential JPEG decoders apply the IDCT, upsampling and color conversion block by block.

  9. Parallelism in JPEG Decompression
  [Diagram: .jpg, then Huffman decoding (sequential part), then IDCT, upsampling and color conversion (parallelizable part).]
   Sequential part: Huffman decoding
   NOT suitable for data parallelism:
   Codewords have variable lengths.
   The starting bit of a codeword in the encoded bitstream is only known once the previous codeword has been decoded.

  10. Parallelism in JPEG Decompression
  [Diagram as on slide 9.]
   Sequential part: Huffman decoding
   NOT suitable for data parallelism: codewords have variable lengths, and the starting bit of a codeword in the encoded bitstream is only known once the previous codeword has been decoded.
   Parallelizable part: IDCT, upsampling and color conversion
   Suitable for GPU computing and SIMD operations on the CPU:
   low data dependency
   performs the same instructions repeatedly
   has fixed input and output sizes
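The sequential dependency can be seen even with a toy prefix code (a hypothetical three-symbol table, not a real JPEG Huffman table): the decoder must consume codewords strictly left to right, because the start of each codeword is only known after the previous one has been decoded.

```c
#include <stddef.h>

/* Toy prefix code: "0" -> A, "10" -> B, "11" -> C.
 * Decoding walks the bitstream strictly left to right: until codeword
 * i is decoded, the starting bit of codeword i+1 is unknown, so the
 * stream cannot simply be split among threads at arbitrary offsets. */
static size_t toy_decode(const char *bits, char *out, size_t max)
{
    size_t n = 0;
    while (*bits != '\0' && n < max) {
        if (bits[0] == '0')      { out[n++] = 'A'; bits += 1; }
        else if (bits[1] == '0') { out[n++] = 'B'; bits += 2; }
        else                     { out[n++] = 'C'; bits += 2; }
    }
    out[n] = '\0';
    return n;
}
```

Decoding "0101110" yields A, B, C, B; changing any single bit near the front can shift every codeword boundary after it, which is exactly what rules out data-parallel splitting of the bitstream.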

  11. Research Question
  [Diagram: .jpg, then Huffman decoding (sequential part), then IDCT, upsampling and color conversion (parallelizable part).]
  How to orchestrate JPEG decompression on CPU+GPU architectures?
   Input image characterized by width, height and entropy
   Need: work partitioning, a schedule, and an execution infrastructure

  12. Our Contributions
   Heterogeneous JPEG decoder for CPU+GPU architectures
   profiling-based performance model
   dynamic partitioning scheme that automatically distributes the workload at run-time
   Pipelined execution model overlaps sequential Huffman decoding with GPU computations
   The parallelizable part is distributed across CPU and GPU
   data-, task- and pipeline-parallelism
   GPU kernels designed to minimize memory access overhead
   Implementation and experimental evaluation based on the libjpeg-turbo library

  13. libjpeg & libjpeg-turbo
   libjpeg is the sequential JPEG compression reference implementation by the Independent JPEG Group
   First version released in 1991
   libjpeg-turbo is a re-implementation of libjpeg
   Utilizes SIMD instructions on x86 and ARM platforms
   Used by Google Chrome, Firefox, WebKit, Ubuntu, Fedora and openSUSE
   Both libraries are strictly designed to conserve memory
   This inhibits coarse-grained parallelism
   Conserving memory is a non-goal on today's target architectures

  14. Re-engineering libjpeg-turbo
  libjpeg-turbo:
   To conserve memory, libjpeg-turbo decodes images in units of 8 pixel rows.
   8 rows at a time do not contain enough computation to keep the data-parallel execution units of a GPU busy.
   There is a significant constant overhead per kernel invocation and data transfer (host to device to host).
  Our approach:
   Store the entire image in memory.
   Fully utilize all GPU cores by processing several larger image chunks.
   Reduce the number of kernel invocations and the data transfer overhead.

  15. Heterogeneous JPEG Decompression Overview
  [Diagram, GPU-only scheme: the CPU performs Huffman decoding and dispatches the kernel while the GPU sits idle; the GPU then runs the kernel while the CPU sits idle.]
   Motivation: one architecture is unutilized while the other is processing.
   Observation: there are no dependencies among 8x8 pixel blocks, so the CPU and the GPU can compute in parallel.
   Goal: find a partitioning size at runtime such that the load on the CPU and the GPU is balanced.
   Requirement: a performance model obtained through offline profiling.

  16. Performance Model
   Offline profiling step on an image training set
   19 master images cropped to various sizes
   Maximum image size is 25 megapixels
   Profile the execution time of the sequential part and the parallelizable part on CPU and GPU
   Model all decompression steps using multivariate polynomial regression up to degree 7
   Select the best-fit model by comparing Akaike information criterion (AIC) values

  17. Performance Model for the Parallelizable Part
   Execution time scales linearly with image size.
   The image dimensions are known at the beginning of the decompression step.
   Parameters: width and height
  [Plot: time (ms) vs. pixels (0 to 20M) for 4:2:2 and 4:4:4 subsampling; both trends are linear.]

  18. Performance Model for the Sequential Part
   Unlike the parallelizable part, Huffman decoding time does NOT correlate strongly with image width and height.
  [Plot: time (ms) vs. pixels (0 to 20M); the points are widely scattered.]

  19. Performance Model for the Sequential Part
   Huffman decoding time correlates strongly with the size of the entropy-coded data.
   We observed a linear trend as entropy density (entropy size in bytes per pixel) increases.
   Parameters: width, height and entropy size
   The entropy size can be roughly approximated from the JPEG file size.
  [Plot: time (ns) vs. entropy density (bytes/pixel, 0.0 to 0.4) for 4:2:2 and 4:4:4 subsampling; both trends are linear.]
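Putting slides 17 to 19 together, a sketch of the two predictors: the parallelizable part is linear in the pixel count, while the Huffman part is roughly linear in entropy density times pixel count. All coefficients below are hypothetical placeholders; the real values come from the offline profiling step.

```c
/* Hypothetical model coefficients (the real values are produced by
 * the offline profiling step and depend on hardware and subsampling). */
#define PAR_NS_PER_PIXEL   1.5   /* parallelizable part, ns per pixel        */
#define HUF_NS_BASE        0.5   /* Huffman part, ns per pixel at density 0  */
#define HUF_NS_PER_DENSITY 12.0  /* Huffman part, extra ns per (byte/pixel)  */

/* Parallelizable part: linear in the number of pixels (slide 17). */
static double par_time_ns(long w, long h) {
    return PAR_NS_PER_PIXEL * (double)w * (double)h;
}

/* Sequential part: entropy density in bytes per pixel, roughly
 * approximated from the JPEG file size (slide 19). */
static double huf_time_ns(long w, long h, long file_bytes) {
    double pixels  = (double)w * (double)h;
    double density = (double)file_bytes / pixels;
    return (HUF_NS_BASE + HUF_NS_PER_DENSITY * density) * pixels;
}
```

Width, height and file size are all available before decoding starts, so both estimates can be evaluated at runtime before any work is partitioned.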

  20. Overlapped Partitioning Scheme
   Share the workload of the parallelizable part between the CPU and the GPU.
  [Diagram: in the GPU-only scheme, the GPU idles during Huffman decoding and kernel dispatch; in the overlapped scheme, the CPU processes its share with SIMD while the GPU runs the kernel.]

  21. Overlapped Partitioning Scheme
  [Diagram: the CPU Huffman-decodes, dispatches the kernel, then processes its own rows with SIMD while the GPU runs the kernel; the GPU is idle only during Huffman decoding.]
   Idea: share the workload of the parallelizable part between the CPU and the GPU.
   The partitioning can be formulated as T_cpu(w, r) = T_gpu(w, h - r), where r is the number of rows given to the CPU, and w and h are the image width and height.
   When the equation holds, the time spent on the CPU and on the GPU is equal.
   w and h are known at runtime. We can use Newton's method to solve for r.
   Problem: the GPU is unutilized during Huffman decoding.
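A minimal sketch of solving the balance condition with Newton's method. The two linear cost models below are hypothetical stand-ins for the profiled ones; the structure follows the slide: the CPU gets r rows, the GPU gets the remaining h - r, and we iterate on the imbalance.

```c
#include <math.h>

/* Hypothetical linear cost models in ms (real ones come from the
 * profiling step): the CPU processes r rows of width w, the GPU
 * processes its rows plus a constant dispatch/transfer overhead. */
static double t_cpu(double w, double r) { return 2.0e-6 * w * r; }
static double t_gpu(double w, double r) { return 0.5e-6 * w * r + 1.0; }

/* Load imbalance f(r) = T_cpu(r) - T_gpu(h - r); the balanced
 * partition is the root f(r) = 0, found by Newton's method with a
 * forward-difference derivative. */
static double balance_rows(double w, double h) {
    double r = h / 2.0, eps = 1e-3;
    for (int i = 0; i < 50; i++) {
        double f  = t_cpu(w, r) - t_gpu(w, h - r);
        double df = (t_cpu(w, r + eps) - t_gpu(w, h - r - eps) - f) / eps;
        double rn = r - f / df;
        if (fabs(rn - r) < 1e-6) return rn;
        r = rn;
    }
    return r;
}
```

With these toy models the iteration converges almost immediately; with the profiled polynomial models a few iterations per image suffice, so the runtime cost of partitioning is negligible.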

  22. Pipelined Partitioning Scheme
   Overlapped scheme (recap): share the workload of the parallelizable part between the CPU and the GPU.
   Pipelined scheme: increase parallelism by performing Huffman decoding and the GPU kernels in a pipelined fashion.
  [Diagram: GPU-only, overlapped and pipelined schemes side by side; in the pipelined scheme the CPU decodes chunk 2 while the GPU runs the kernel for chunk 1, and so on.]

  23. Pipelined Partitioning Scheme
  [Diagram: the image is split into chunks of D rows; the CPU decodes chunk i while the GPU runs the kernel for chunk i-1.]
   Idea: execute Huffman decoding in a pipelined fashion with the GPU kernels.
   Split the image into several chunks of rows.
   An optimal chunk size is found through profiling.
   A kernel invocation can start as soon as an image chunk has been decoded.
   On a fast GPU, only the execution time of the last chunk is visible to users.
   Problem: this does NOT guarantee an improvement over CPU computation.
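The claim that only the last chunk's kernel time remains visible can be checked with a toy timing model (the millisecond numbers in the test are illustrative, not measurements): the CPU decodes chunk i while the GPU runs the kernel for chunk i-1.

```c
/* Toy pipeline simulation: the CPU Huffman-decodes chunks back to
 * back; each kernel can start only once its chunk is decoded.  If
 * kernel_ms <= decode_ms, every kernel except the last one is hidden
 * behind the decoding of the next chunk. */
static double pipelined_ms(int chunks, double decode_ms, double kernel_ms) {
    double cpu_done = 0.0, gpu_done = 0.0;
    for (int i = 0; i < chunks; i++) {
        cpu_done += decode_ms;                   /* finish decoding chunk i    */
        if (gpu_done < cpu_done)
            gpu_done = cpu_done;                 /* GPU waits for the chunk    */
        gpu_done += kernel_ms;                   /* run the kernel for chunk i */
    }
    return gpu_done;                             /* time the last kernel ends  */
}
```

With 4 chunks of 10 ms decoding and 5 ms kernels, the pipelined total is 4*10 + 5 = 45 ms, versus 4*10 + 4*5 = 60 ms for decoding everything first and only then running the kernels.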

  24. Combined Partitioning Scheme
  [Diagram: GPU-only, overlapped, pipelined and combined schemes side by side. The combined scheme pipelines Huffman decoding (chunks 1 to 4) with the GPU kernels and, once decoding finishes, the CPU processes its own partition with SIMD while the GPU keeps working.]
