DYNAMIC PARTITIONING-BASED JPEG DECOMPRESSION ON HETEROGENEOUS MULTICORE ARCHITECTURES
Wasuwee Sodsong (1), Jingun Hong (1), Seongwook Chung (1), Yeongkyu Lim (2), Shin-Dug Kim (1) and Bernd Burgstaller (1)
(1) Yonsei University, (2) LG Electronics
JPEG Decompression
The decompression pipeline, stage by stage:
Entropy coded data (variable-length bitstream)
-> Huffman decoding: recovers 8x8 blocks of frequency-domain coefficients
-> IDCT: transforms each 8x8 block to the spatial domain (YCbCr)
-> Upsampling (YCbCr)
-> Color conversion: YCbCr to RGB
-> Output bitmap image
Sequential JPEG Decompression
JPEG is an asymmetric compression scheme:
- Compression is performed once per image
- Decompression is performed once per use
463 out of the 500 most popular websites use JPEG images.
JPEG operates in blocks of 8x8 pixels; sequential JPEG decoders apply IDCT, upsampling and color conversion block-by-block.
Parallelism in JPEG Decompression
Pipeline: .jpg -> Huffman decoding (sequential part) -> IDCT, upsampling, color conversion (parallelizable part)
Sequential part: Huffman decoding
- NOT suitable for data-parallelism
- Codewords have variable lengths: the starting bit of a codeword in the encoded bitstream is only known once the previous codeword has been decoded
Parallelizable part: IDCT, upsampling and color conversion
- Suitable for GPU computing and SIMD operations on the CPU
- Low data dependency
- Applies the same instructions repeatedly
- Has fixed input and output sizes
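Why Huffman decoding resists data-parallelism can be seen with a toy prefix code (the codebook below is hypothetical, not a real JPEG table): the start of codeword i+1 in the bitstream is unknown until codeword i has been consumed, so decoding cannot begin at an arbitrary bit offset.

```python
# Toy prefix code; real JPEG Huffman tables are built from the DHT segment.
CODEBOOK = {"0": "A", "10": "B", "110": "C", "111": "D"}

def decode(bits):
    """Sequentially decode a bitstring: each codeword boundary is known
    only after the previous (variable-length) codeword has been decoded."""
    out, current = [], ""
    for b in bits:
        current += b
        if current in CODEBOOK:          # codeword boundary found
            out.append(CODEBOOK[current])
            current = ""                 # the next codeword starts here
    return "".join(out)
```

Starting the loop in the middle of the bitstring would misalign every subsequent codeword, which is exactly why this stage stays on a single CPU thread.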
Research Question
Pipeline: .jpg -> Huffman decoding (sequential part) -> IDCT, upsampling, color conversion (parallelizable part)
How to orchestrate JPEG decompression on CPU+GPU architectures?
Input image characterized by: width, height, entropy
Need: work partitioning, schedule, execution infrastructure
Our Contributions
Heterogeneous JPEG decoder on CPU+GPU architectures:
- Profiling-based performance model
- Dynamic partitioning scheme that automatically distributes the workload at run-time
- Pipelined execution model that overlaps sequential Huffman decoding with GPU computations
- Parallelizable part distributed across CPU and GPU (data-, task- and pipeline-parallelism)
- GPU kernels designed to minimize memory access overhead
- Implementation and experimental evaluation for the libjpeg-turbo library
libjpeg & libjpeg-turbo
libjpeg is the sequential JPEG compression reference implementation by the Independent JPEG Group; first version released in 1991.
libjpeg-turbo is a re-implementation of libjpeg:
- Utilizes SIMD instructions on x86 and ARM platforms
- Used by Google Chrome, Firefox, WebKit, Ubuntu, Fedora and openSUSE
Both libraries are strictly designed to conserve memory, which inhibits coarse-grained parallelism; conserving memory is a non-goal with today's target architectures.
Re-engineering libjpeg-turbo
libjpeg-turbo: to conserve memory, libjpeg-turbo decodes images in units of 8 pixel rows.
- 8 rows at a time do not contain enough computation to keep the data-parallel execution units of a GPU busy
- Significant constant overhead per kernel invocation and data transfer (host -> device -> host)
Our approach: store an entire image in memory.
- Fully utilizes all GPU cores by processing several larger image chunks
- Reduces the number of kernel invocations and the data transfer overhead
Heterogeneous JPEG Decompression Overview
[Diagram: GPU-only baseline; the GPU is idle during Huffman decoding on the CPU, then the CPU dispatches the kernel to the GPU]
Motivation: one architecture is unutilized while the other is processing.
Observation: there are no dependencies among 8x8 pixel blocks; thus the CPU and the GPU can compute in parallel.
Goal: find the partitioning size at runtime such that the load on the CPU and the GPU is balanced.
Requirement: a performance model obtained through offline profiling.
Performance Model
Offline profiling step on an image training set:
- 19 master images cropped to various sizes; maximum image size is 25 megapixels
- Profile execution time of the sequential part and the parallelizable part on CPU and GPU
- Model all decompression steps using multivariate polynomial regression up to degree 7
- Select the best-fit model by comparing Akaike information criterion (AIC) values
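The model-selection step can be sketched as follows; this is a simplified single-variable version (the deck's model is multivariate in width, height and entropy), and `best_degree` with its synthetic inputs is an assumption for illustration, not the authors' profiling code.

```python
import numpy as np

def aic(rss, n, k):
    """Akaike information criterion for a least-squares fit:
    n samples, k fitted parameters, residual sum of squares rss."""
    return n * np.log(rss / n) + 2 * k

def best_degree(x, y, max_degree=7):
    """Fit polynomials of increasing degree, keep the lowest-AIC model."""
    n = len(x)
    best = None
    for d in range(1, max_degree + 1):
        coeffs, res, *_ = np.polyfit(x, y, d, full=True)
        rss = res[0] if res.size else 0.0
        rss = max(rss, 1e-12)            # guard against a perfect fit
        score = aic(rss, n, d + 1)       # d+1 coefficients
        if best is None or score < best[0]:
            best = (score, d)
    return best[1]
```

AIC trades goodness of fit against model size, which is why a degree-7 fit is not automatically preferred over a simpler one.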
Performance Model for the Parallelizable Part
Execution time scales linearly with image size; the image dimensions are known at the beginning of the decompression step.
Parameters: width and height.
[Plot: time (ms) vs. image size (pixels) for 4:2:2 and 4:4:4 subsampling]
Performance Model for the Sequential Part
Unlike the parallelizable part, Huffman decoding time does NOT have a high correlation with image width and height.
[Plot: time (ms) vs. image size (pixels); no clear trend]
Performance Model for the Sequential Part
Huffman decoding time has a high correlation with the size of the entropy coded data: we observed a linear trend as entropy density (entropy size in bytes per pixel) increases.
Parameters: width, height and entropy size. Entropy size can be roughly approximated from the JPEG file size.
[Plot: time (ns) vs. entropy density (bytes/pixel) for 4:2:2 and 4:4:4 subsampling]
Overlapped Partitioning Scheme
Overlapped: share the workload of the parallelizable part between CPU and GPU.
[Diagram: GPU-only (Huffman decoding, dispatch, GPU kernel while the CPU idles) vs. overlapped (Huffman decoding, dispatch, then CPU SIMD and GPU kernel run in parallel)]
Overlapped Partitioning Scheme
Idea: share the workload of the parallelizable part between the CPU and the GPU.
The partitioning equation can be formulated as T_cpu(w, n) = T_gpu(w, h - n), where n is the number of rows given to the CPU, and w and h are the image width and height. When the equation holds, the time spent on the CPU and the GPU is equal.
w and h are known at runtime, so we can use Newton's method to solve for n.
Problem: the GPU is unutilized during Huffman decoding.
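The Newton step can be sketched as follows; the cost models `t_cpu` and `t_gpu` stand in for the fitted performance models and are hypothetical, as is the numerical derivative (a closed-form derivative of the fitted polynomial would also work).

```python
def solve_partition(t_cpu, t_gpu, h, n0=None, tol=1e-6, max_iter=50):
    """Find n (rows given to the CPU) such that t_cpu(n) == t_gpu(h - n),
    i.e. the root of f(n) = t_cpu(n) - t_gpu(h - n), by Newton's method.
    t_cpu and t_gpu are per-architecture cost models for n rows."""
    f = lambda n: t_cpu(n) - t_gpu(h - n)
    df = lambda n, eps=1e-3: (f(n + eps) - f(n - eps)) / (2 * eps)
    n = h / 2 if n0 is None else n0      # start from an even split
    for _ in range(max_iter):
        step = f(n) / df(n)
        n -= step
        if abs(step) < tol:
            break
    return min(max(n, 0.0), h)           # clamp to a valid row count
```

With hypothetical linear models t_cpu(n) = 0.4n and t_gpu(m) = 0.1m for a 1000-row image, the balanced split gives the faster GPU 800 rows and the CPU 200.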
Pipelined Partitioning Scheme
Overlapped: share the workload of the parallelizable part between CPU and GPU.
Pipelined: increase parallelism by performing Huffman decoding and GPU kernels in a pipelined fashion.
[Diagram: for each chunk i, the CPU Huffman-decodes chunk i and dispatches kernel (Huffman i) to the GPU while decoding chunk i+1]
Pipelined Partitioning Scheme
Idea: execute Huffman decoding in a pipelined fashion with the GPU kernel.
- Split an image into several chunks of D rows each; an optimal chunk size is found through profiling
- We can start a kernel invocation as soon as an image chunk is decoded
- On a fast GPU, only the execution time of the last chunk is visible to users
Problem: does NOT guarantee improvement over CPU computation.
[Diagram: CPU Huffman-decodes chunks 1..3 back-to-back; each decoded chunk is immediately dispatched as a kernel to the GPU]
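The chunk pipeline can be sketched as a producer/consumer pair; `huffman_decode` and `gpu_kernel` below are stand-ins for the real stages, and the threading setup is illustrative (the actual decoder dispatches asynchronous GPU kernels rather than a Python consumer thread).

```python
import threading, queue

def pipelined_decode(chunks, huffman_decode, gpu_kernel):
    """Overlap sequential Huffman decoding (CPU) with kernel execution:
    as soon as a chunk is decoded it is handed to the kernel while the
    CPU moves on to decoding the next chunk."""
    q = queue.Queue()
    results = []

    def producer():
        for c in chunks:
            q.put(huffman_decode(c))   # decode chunk i ...
        q.put(None)                    # sentinel: no more chunks

    t = threading.Thread(target=producer)
    t.start()
    while (item := q.get()) is not None:
        results.append(gpu_kernel(item))  # ... while the kernel runs on an earlier chunk
    t.join()
    return results
```

The queue preserves chunk order, so the output rows land in the same order a non-pipelined decoder would produce them.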
Combined Partitioning Scheme
Combined: the overlapped and pipelined schemes together. Huffman decoding is pipelined with GPU kernels, and the parallelizable work of the final chunk is computed on the CPU with SIMD while the GPU processes the earlier chunks.
[Diagram: timelines of GPU-only, overlapped, pipelined and combined execution; in the combined scheme the CPU Huffman-decodes chunks 1..4, dispatches kernels (Huffman 1..3) to the GPU, and runs SIMD on (Huffman 4) itself]