Synthesizing Effective Data Compression Algorithms for GPUs
Annie Yang and Martin Burtscher*
Department of Computer Science
Highlights
  MPC compression algorithm
    Brand-new lossless compression algorithm for single- and double-precision floating-point data
    Systematically derived to work well on GPUs
  MPC features
    Compression ratio is similar to that of the best CPU algorithms
    Throughput is much higher
    Requires little internal state (no tables or dictionaries)
Introduction
  High-performance computing systems
    Depend increasingly on accelerators
    Process large amounts of floating-point (FP) data
    [Figure: floating-point format with sign bit (S), exponent, and mantissa fields]
    Moving this data is often the performance bottleneck
  Data compression
    Can increase transfer throughput
    Can reduce storage requirements
    But only if it is effective, fast (real-time), and lossless
Problem Statement
  Existing FP compression algorithms for GPUs
    Fast but compress poorly
  Existing FP compression algorithms for CPUs
    Compress much better but are slow
    Parallel codes run serial algorithms on multiple chunks
    Too much state per thread for a GPU implementation
  Best serial algorithms may not be scalably parallelizable
  Do effective FP compression algorithms for GPUs exist?
    And if so, how can we create such an algorithm?
Our Approach
  Need a brand-new massively parallel algorithm
    Study existing FP compression algorithms
    Break them down into their constituent parts
    Keep only the GPU-friendly parts
    Generalize them as much as possible
    This resulted in a set of algorithmic components
  CUDA implementation: each component takes a sequence of values as input and outputs a transformed sequence
  Components operate on the integer representation of the data
  [Image credit: Charles Trevelyan for http://plus.maths.org/]
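To make "integer representation" concrete, the following minimal Python sketch reinterprets a double as a 64-bit unsigned integer and splits it into the sign, exponent, and mantissa fields of the IEEE 754 format. The helper names (`double_to_bits`, `bits_to_double`, `split_fields`) are hypothetical and chosen for illustration only; the slides describe a CUDA implementation, which this does not reproduce.

```python
import struct

def double_to_bits(x: float) -> int:
    """Reinterpret a 64-bit IEEE 754 double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_to_double(b: int) -> float:
    """Inverse of double_to_bits: reinterpret the integer as a double."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def split_fields(b: int):
    """Return (sign, exponent, mantissa) of a 64-bit pattern."""
    sign = b >> 63                      # 1 bit
    exponent = (b >> 52) & 0x7FF        # 11 bits, biased by 1023
    mantissa = b & ((1 << 52) - 1)      # 52 bits
    return sign, exponent, mantissa

bits = double_to_bits(1.0)
print(hex(bits))           # 0x3ff0000000000000
print(split_fields(bits))  # (0, 1023, 0)
```

Because the reinterpretation is lossless in both directions, any reversible transformation of the integers yields a lossless transformation of the floating-point data.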
Our Approach (cont.)
  Automatically synthesize compression algorithms by chaining components
    Use exhaustive search to find the best four-component chains
  Synthesize the decompressor
    Employ inverse components
    Each performs the opposite transformation on the data
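The chaining idea can be sketched as follows: if every component is stored together with its inverse, a decompressor falls out by applying the inverses in reverse order. This is a sequential Python illustration under the assumption of 32-bit words; the `(forward, inverse)` pairing and the function names are hypothetical, not taken from the authors' code.

```python
MASK = 0xFFFFFFFF  # assumption: components work on 32-bit words

def inv(values):
    """Flip all bits of each word; this component is its own inverse."""
    return [v ^ MASK for v in values]

def lnv1s(values):
    """Subtract the previous value (0 for the first element), modulo 2**32."""
    return [(v - (values[i - 1] if i else 0)) & MASK for i, v in enumerate(values)]

def lnv1s_inverse(residuals):
    """Undo lnv1s with a running sum, modulo 2**32."""
    out, prev = [], 0
    for r in residuals:
        prev = (prev + r) & MASK
        out.append(prev)
    return out

def compress(chain, data):
    """Apply the forward half of each (forward, inverse) pair in order."""
    for forward, _ in chain:
        data = forward(data)
    return data

def decompress(chain, data):
    """Apply the inverse half of each pair in reverse order."""
    for _, inverse in reversed(chain):
        data = inverse(data)
    return data

chain = [(inv, inv), (lnv1s, lnv1s_inverse)]
original = [10, 12, 15, 15, 4000000000]
assert decompress(chain, compress(chain, original)) == original
```

With this structure, an exhaustive search only has to enumerate lists of components; the matching decompressor needs no separate design.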
Mutator Components
  Mutators computationally transform each value
    Do not use information about any other value
  NUL outputs the input block (identity)
  INV flips all the bits
  |, called cut, is a singleton pseudo component that converts a block of words into a block of bytes
    Merely a type cast, i.e., no computation or data copying
    Byte granularity can be better for compression
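A minimal sketch of the mutators and the cut, assuming 32-bit words and little-endian byte order (both are assumptions for illustration). Note that the slide describes the cut as a pure type cast without copying; `struct.pack` below does copy and is used only to make the word-to-byte reinterpretation visible.

```python
import struct

WORD_MASK = 0xFFFFFFFF  # assumption: 32-bit words

def nul(words):
    """NUL: identity, outputs the input block unchanged."""
    return list(words)

def inv(words):
    """INV: flip all bits of every word."""
    return [w ^ WORD_MASK for w in words]

def cut(words) -> bytes:
    """Cut: view a block of 32-bit words as a block of bytes (little-endian here)."""
    return struct.pack(f"<{len(words)}I", *words)

def uncut(data: bytes):
    """Inverse of cut: view the bytes as 32-bit words again."""
    return list(struct.unpack(f"<{len(data) // 4}I", data))

print(inv([0x0F0F0F0F]) == [0xF0F0F0F0])  # True
print(cut([0x11223344]).hex())            # 44332211
```

After the cut, later components in the chain see four times as many values, each one byte wide.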
Shuffler Components
  Shufflers reorder whole values or bits of values
    Do not perform any computation
    Each thread block operates on a chunk of values
  BIT emits the most significant bits of all values, followed by the second most significant bits, etc.
  DIMn groups values by dimension n
    Tested n = 2, 3, 4, 5, 8, 16, and 32
    For example, DIM2 turns the sequence A, B, C, D, E, F into A, C, E, B, D, F
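The two shufflers can be sketched in a few lines of sequential Python. The function names and the `width` parameter are hypothetical; the whole input is treated as a single chunk, whereas the slides state that each thread block handles its own chunk.

```python
def dim(values, n):
    """DIMn: emit the values at offsets 0, n, 2n, ..., then 1, n+1, ..., and so on."""
    return [x for start in range(n) for x in values[start::n]]

def dim_inverse(shuffled, n):
    """Undo dim by recomputing where each output position came from."""
    order = [j for start in range(n) for j in range(start, len(shuffled), n)]
    out = [None] * len(shuffled)
    for src, dst in enumerate(order):
        out[dst] = shuffled[src]
    return out

def bit_shuffle(values, width=32):
    """BIT: all most significant bits first, then all second most significant bits, etc."""
    bits = [(v >> pos) & 1 for pos in range(width - 1, -1, -1) for v in values]
    words = []
    for i in range(0, len(bits), width):      # repack the bit stream into words
        w = 0
        for b in bits[i:i + width]:
            w = (w << 1) | b
        words.append(w)
    return words

def bit_unshuffle(words, width=32):
    """Undo bit_shuffle by scattering each bit back to its value and position."""
    n = len(words)
    bits = [(w >> pos) & 1 for w in words for pos in range(width - 1, -1, -1)]
    values = [0] * n
    for k, b in enumerate(bits):
        row, col = divmod(k, n)               # row = bit plane, col = value index
        values[col] |= b << (width - 1 - row)
    return values

print(dim(list("ABCDEF"), 2))                                  # ['A', 'C', 'E', 'B', 'D', 'F']
print(bit_shuffle([0b1000, 0b1001, 0b1000, 0b1001], width=4))  # [15, 0, 0, 5]
```

In the second example the four 4-bit values share their upper three bits, so BIT turns them into one all-ones word, two zero words, and one word holding the differing low bits.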
Predictor Components
  Predictors guess values based on previous values and compute residuals (true minus guessed value)
    Residuals tend to cluster around zero, making them easier to compress than the original sequence
    Each thread block operates on a chunk of values
  LNVns subtracts the nth prior value from the current value
    Tested n = 1, 2, 3, 5, 6, and 8
  LNVnx XORs the current value with the nth prior value
    Tested n = 1, 2, 3, 5, 6, and 8
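A minimal sketch of both predictor variants and their inverses, assuming 32-bit words, wrap-around arithmetic, and a prediction of 0 for the first n values of a chunk (the slides do not say how the chunk start is handled, so that choice is an assumption). Function names are hypothetical.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit words, arithmetic modulo 2**32

def lnv_sub(values, n):
    """LNVns: residual = current value minus the nth prior value."""
    return [(v - (values[i - n] if i >= n else 0)) & MASK for i, v in enumerate(values)]

def lnv_sub_inverse(residuals, n):
    """Rebuild the values by adding each residual to the nth prior output."""
    out = []
    for i, r in enumerate(residuals):
        out.append((r + (out[i - n] if i >= n else 0)) & MASK)
    return out

def lnv_xor(values, n):
    """LNVnx: residual = current value XOR the nth prior value."""
    return [v ^ (values[i - n] if i >= n else 0) for i, v in enumerate(values)]

def lnv_xor_inverse(residuals, n):
    """Rebuild the values by XORing each residual with the nth prior output."""
    out = []
    for i, r in enumerate(residuals):
        out.append(r ^ (out[i - n] if i >= n else 0))
    return out

data = [100, 102, 104, 107, 107]
print(lnv_sub(data, 1))  # [100, 2, 2, 3, 0]
print(lnv_xor(data, 1))  # [100, 2, 14, 3, 0]
```

On this small, slowly growing sequence the subtraction residuals stay closer to zero than the XOR residuals. The inverses here are sequential loops written for clarity only.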
Reducer Components
  Reducers eliminate redundancies in the value sequence
    No other component can change the length of the sequence, i.e., only reducers can compress it
    Each thread block operates on a chunk of values
  ZE emits a bitmap marking the 0s, followed by the non-zero values
    Effective if the input sequence contains many zeros
  RLE performs run-length encoding, i.e., replaces repeating values by a count and a single value
    Effective if the input contains many repeating values
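The two reducers can be sketched as below. The output layouts are assumptions made for illustration: ZE returns a list of flags (1 where the input value is zero) plus the surviving non-zero values, and RLE returns `(count, value)` pairs. The slides do not specify the actual encodings.

```python
def ze_encode(values):
    """ZE: a bitmap with 1 for every zero value, plus the non-zero values in order."""
    bitmap = [1 if v == 0 else 0 for v in values]
    nonzeros = [v for v in values if v != 0]
    return bitmap, nonzeros

def ze_decode(bitmap, nonzeros):
    """Re-insert a zero wherever the bitmap has a 1."""
    it = iter(nonzeros)
    return [0 if flag else next(it) for flag in bitmap]

def rle_encode(values):
    """RLE: collapse each run of equal values into a (count, value) pair."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1
        else:
            runs.append([1, v])
    return [(count, value) for count, value in runs]

def rle_decode(pairs):
    """Expand each (count, value) pair back into a run."""
    return [value for count, value in pairs for _ in range(count)]

data = [0, 0, 7, 0, 0, 0, 3, 3]
print(ze_encode(data))   # ([1, 1, 0, 1, 1, 1, 0, 0], [7, 3, 3])
print(rle_encode(data))  # [(2, 0), (1, 7), (3, 0), (2, 3)]
```

ZE pays one bit per input value regardless of how the zeros are arranged, while RLE only gains when equal values are adjacent. For an input such as 0, 5, 0, 6, 0, 7, ZE still drops half of the values but RLE produces one pair per value.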
Algorithm Synthesis
  Determine the best four-stage algorithms with a cut
    Exhaustive search of all 138,240 possible combinations
    13 double-precision data sets (19 to 277 MB)
      Observational data, simulation results, MPI messages
    Single-precision data derived from the double-precision data
  Create a general GPU-friendly compression algorithm
    Analyze the best algorithm for each data set and precision
    Find commonalities and generalize into one algorithm
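The slides do not state how the figure of 138,240 arises. One decomposition that reproduces it, offered here as an inference rather than the authors' stated method, counts 24 components (2 mutators, 8 shufflers, 12 predictors, 2 reducers, using the tested values of n), assumes the last of the four stages must be a reducer, and allows the cut in any of five positions (before, between, or after the stages): 24^3 x 2 x 5 = 138,240.

```python
# Component inventory taken from the component slides and their tested n values.
mutators = ["NUL", "INV"]
shufflers = ["BIT"] + [f"DIM{n}" for n in (2, 3, 4, 5, 8, 16, 32)]
predictors = [f"LNV{n}{kind}" for n in (1, 2, 3, 5, 6, 8) for kind in "sx"]
reducers = ["ZE", "RLE"]
components = mutators + shufflers + predictors + reducers

stages = 4
cut_positions = stages + 1  # assumption: the cut may sit before, between, or after stages

# Assumption: the first three stages are unrestricted, the last must be a reducer.
total = len(components) ** (stages - 1) * len(reducers) * cut_positions

print(len(components))  # 24
print(total)            # 138240
```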
Best of 138,240 Algorithms

data set       double precision          single precision
msg_bt         LNV1s BIT LNV1s ZE |      DIM5 ZE LNV6x | ZE
msg_lu         LNV5s | DIM8 BIT RLE      LNV5s LNV5s LNV5x | ZE
msg_sp         DIM3 LNV5x BIT ZE |       DIM3 LNV5x BIT ZE |
msg_sppm       DIM5 LNV6x ZE | ZE        RLE DIM5 LNV6s ZE |
msg_sweep3d    LNV1s DIM32 | DIM8 RLE    LNV1s DIM32 | DIM4 RLE
num_brain      LNV1s BIT LNV1s ZE |      LNV1s BIT LNV1s ZE |
num_comet      LNV1s BIT LNV1s ZE |      LNV1s | DIM4 BIT RLE
num_control    LNV1s BIT LNV1s ZE |      LNV1s BIT LNV1s ZE |
num_plasma     LNV2s LNV2s LNV2x | ZE    LNV2s LNV2s LNV2x | ZE
obs_error      LNV1x ZE LNV1s ZE |       LNV6s BIT LNV1s ZE |
obs_info       LNV2s | DIM8 BIT RLE      LNV8s DIM2 | DIM4 RLE
obs_spitzer    ZE BIT LNV1s ZE |         ZE BIT LNV1s ZE |
obs_temp       LNV8s BIT LNV1s ZE |      BIT LNV1x DIM32 | RLE
overall best   LNV6s BIT LNV1s ZE |      LNV6s BIT LNV1s ZE |
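As an illustration of how the "overall best" chain LNV6s BIT LNV1s ZE | fits together, the sketch below runs it sequentially on synthetic data. Everything beyond the component order is an assumption: 32-bit words, a single chunk, 0 as the prediction at the chunk start, and a ZE output of (flags, non-zero values). The trailing cut is omitted because nothing follows it. This is not the authors' CUDA implementation.

```python
MASK, WIDTH = 0xFFFFFFFF, 32  # assumption: 32-bit words

def lnv_sub(values, n):
    return [(v - (values[i - n] if i >= n else 0)) & MASK for i, v in enumerate(values)]

def lnv_sub_inverse(residuals, n):
    out = []
    for i, r in enumerate(residuals):
        out.append((r + (out[i - n] if i >= n else 0)) & MASK)
    return out

def bit_shuffle(values):
    # All most significant bits first, then the next bit plane, and so on.
    bits = [(v >> pos) & 1 for pos in range(WIDTH - 1, -1, -1) for v in values]
    words = []
    for i in range(0, len(bits), WIDTH):
        w = 0
        for b in bits[i:i + WIDTH]:
            w = (w << 1) | b
        words.append(w)
    return words

def bit_unshuffle(words):
    n = len(words)
    bits = [(w >> pos) & 1 for w in words for pos in range(WIDTH - 1, -1, -1)]
    values = [0] * n
    for k, b in enumerate(bits):
        row, col = divmod(k, n)
        values[col] |= b << (WIDTH - 1 - row)
    return values

def ze_encode(values):
    return [1 if v == 0 else 0 for v in values], [v for v in values if v != 0]

def ze_decode(bitmap, nonzeros):
    it = iter(nonzeros)
    return [0 if flag else next(it) for flag in bitmap]

def compress(values):
    stage = lnv_sub(values, 6)   # LNV6s
    stage = bit_shuffle(stage)   # BIT
    stage = lnv_sub(stage, 1)    # LNV1s
    return ze_encode(stage)      # ZE

def decompress(bitmap, nonzeros):
    stage = ze_decode(bitmap, nonzeros)
    stage = lnv_sub_inverse(stage, 1)
    stage = bit_unshuffle(stage)
    return lnv_sub_inverse(stage, 6)

# Synthetic input: six interleaved series, each growing by 3 per step.
data = [1000 * (i % 6) + 3 * (i // 6) for i in range(48)]
bitmap, nonzeros = compress(data)
assert decompress(bitmap, nonzeros) == data
print(f"{len(data)} words in, {len(nonzeros)} non-zero words out, {len(bitmap)} bitmap bits")
```

The first predictor turns the interleaved series into small residuals, BIT gathers their all-zero upper bit planes into zero words, and ZE then drops those words.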
Analysis of Reducers
  (Shown next to the double-precision column of the "Best of 138,240 Algorithms" table)
  Double-precision results only
    Single-precision results are similar
  ZE or RLE required at end
    Not counting the cut (encoder)
  ZE dominates
    Many 0s but not in a row
  First three stages
    Contain almost no reducers
    Transformations are key to making the reducer effective
    Chaining whole compression algorithms may be futile
Analysis of Mutators
  (Shown next to the double-precision column of the "Best of 138,240 Algorithms" table)
  NUL and INV never used
    No need to invert bits
    Fewer stages perform worse
  Cut often at end (not used)
    Word granularity suffices
    Easier/faster to implement
  DIM8 right after cut
    DIM4 with single precision
    Used to separate the byte positions of each word
    Synthesis yielded an unforeseen use of the DIM component
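The unforeseen use of DIM8 after the cut can be illustrated as follows: once 64-bit words are viewed as bytes, DIM8 gathers byte 0 of every word, then byte 1 of every word, and so on. The sketch assumes little-endian byte order and uses made-up sample words; `cut64` and `dim` are hypothetical helper names.

```python
import struct

def cut64(words) -> bytes:
    """Cut for 64-bit words: view the block as bytes (little-endian assumed)."""
    return struct.pack(f"<{len(words)}Q", *words)

def dim(values, n):
    """DIMn: emit the values at offsets 0, n, 2n, ..., then 1, n+1, ..., and so on."""
    return [x for start in range(n) for x in values[start::n]]

# Three made-up words that differ only in their lowest byte.
words = [0x1122334455667788, 0x11223344556677AA, 0x11223344556677BB]
shuffled = dim(list(cut64(words)), 8)

print([hex(b) for b in shuffled[:3]])   # ['0x88', '0xaa', '0xbb']
print([hex(b) for b in shuffled[3:6]])  # ['0x77', '0x77', '0x77']
```

Bytes that agree across neighboring words end up adjacent, which produces the runs that a following RLE stage can collapse.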
Analysis of Shufflers
  (Shown next to the double-precision column of the "Best of 138,240 Algorithms" table)
  Shufflers are important
    Almost always included
  BIT used very frequently
    FP bit positions correlate more strongly than values
  DIM has two uses
    Separate bytes (see before)
      Right after cut
    Separate values of multi-dimensional data sets (intended use)
      Early stages
Analysis of Predictors
  (Shown next to the double-precision column of the "Best of 138,240 Algorithms" table)
  Predictors very important (data model)
    Used in every case
    Often 2 predictors used
  LNVns dominates LNVnx
    Arithmetic (sub) difference superior to bit-wise (xor) difference in residual
  Dimension n
    Separates values of multi-dimensional data sets (in 1st stage)