
Compression Algorithms for GPUs, by Annie Yang and Martin Burtscher* (PowerPoint presentation)



  1. Synthesizing Effective Data Compression Algorithms for GPUs
     Annie Yang and Martin Burtscher*
     Department of Computer Science

  2. Highlights
      MPC compression algorithm
        Brand-new lossless compression algorithm for single- and double-precision floating-point data
        Systematically derived to work well on GPUs
      MPC features
        Compression ratio is similar to the best CPU algorithms
        Throughput is much higher
        Requires little internal state (no tables or dictionaries)

  3. Introduction
      High-performance computing systems
        Depend increasingly on accelerators
        Process large amounts of floating-point (FP) data
        [Figure: FP word layout: sign (S), exponent, mantissa]
      Moving this data is often the performance bottleneck
      Data compression
        Can increase transfer throughput
        Can reduce storage requirements
        But only if effective, fast (real-time), and lossless

  4. Problem Statement
      Existing FP compression algorithms for GPUs
        Fast but compress poorly
      Existing FP compression algorithms for CPUs
        Compress much better but are slow
        Parallel codes run serial algorithms on multiple chunks
        Too much state per thread for a GPU implementation
        Best serial algorithms may not be scalably parallelizable
      Do effective FP compression algorithms for GPUs exist?
        And if so, how can we create such an algorithm?

  5. Our Approach
      Need a brand-new massively parallel algorithm
      Study existing FP compression algorithms
        Break them down into constituent parts
        Only keep the GPU-friendly parts
        Generalize them as much as possible
        Resulted in a library of algorithmic components
      CUDA implementation: each component takes a sequence of values as input and outputs a transformed sequence
      Components operate on the integer representation of the data
     (Image credit: Charles Trevelyan for http://plus.maths.org/)
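Because every component works on the integer representation of the data, a reinterpreting cast is the entry point of the pipeline. A minimal sketch (not the authors' code; Python's `struct` stands in for what is a simple device-side type cast in CUDA):

```python
import struct

def double_to_u64(x: float) -> int:
    """Reinterpret the bits of an IEEE-754 double as an unsigned 64-bit int."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def u64_to_double(bits: int) -> float:
    """Inverse reinterpretation: a 64-bit pattern back to a double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

# The components operate on such integer words, never on float values,
# so every transformation is exactly invertible (lossless).
bits = double_to_u64(1.0)
# 1.0 = sign 0, biased exponent 0x3FF, mantissa 0
assert bits == 0x3FF0000000000000
assert u64_to_double(bits) == 1.0
```

Working on raw bit patterns rather than float values is what makes the whole chain lossless: no rounding can occur at any stage.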

  6. Our Approach (cont.)
      Automatically synthesize compression algorithms by chaining components
        Use exhaustive search to find the best four-component chains
      Synthesize the decompressor
        Employ inverse components
        Perform the opposite transformation on the data
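The chain-plus-inverse idea can be sketched as follows; the component implementations and the 8-bit toy word size are illustrative assumptions, not the authors' code:

```python
# A component is a pair (forward, inverse); a compressor is a chain of
# forward components, and the matching decompressor applies the inverse
# components in reverse order.

def lnv1s(vals):
    """Residual vs. the previous value (toy 8-bit words)."""
    return [(v - (vals[i - 1] if i else 0)) & 0xFF for i, v in enumerate(vals)]

def lnv1s_inv(res):
    """Inverse of lnv1s: running prefix sum reconstructs the values."""
    out = []
    for r in res:
        out.append((r + (out[-1] if out else 0)) & 0xFF)
    return out

def inv_bits(vals):
    """INV mutator: flip all bits; it is its own inverse."""
    return [v ^ 0xFF for v in vals]

compress_chain   = [lnv1s, inv_bits]
decompress_chain = [inv_bits, lnv1s_inv]   # inverses, in reverse order

def run(chain, data):
    for stage in chain:
        data = stage(data)
    return data

data = [10, 12, 12, 15]
assert run(decompress_chain, run(compress_chain, data)) == data
```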

  7. Mutator Components
      Mutators computationally transform each value
        Do not use information about any other value
      NUL outputs the input block unchanged (identity)
      INV flips all the bits
      │, called cut, is a singleton pseudo-component that converts a block of words into a block of bytes
        Merely a type cast, i.e., no computation or data copying
        Byte granularity can be better for compression
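A toy sketch of the three mutators (assumed Python rendering; on the GPU the cut is literally a pointer cast, whereas here the bytes are serialized explicitly):

```python
import struct

def nul(words):
    """NUL mutator: identity."""
    return list(words)

def inv(words, width=32):
    """INV mutator: flip every bit of each word."""
    mask = (1 << width) - 1
    return [w ^ mask for w in words]

def cut(words):
    """Cut '|': reinterpret a block of 32-bit words as a block of bytes.
    In CUDA this is just a type cast; no computation or copying."""
    return list(struct.pack("<%dI" % len(words), *words))

words = [0x01020304, 0xAABBCCDD]
assert inv(inv(words)) == words   # INV is its own inverse
# Little-endian byte view of the two words:
assert cut(words) == [0x04, 0x03, 0x02, 0x01, 0xDD, 0xCC, 0xBB, 0xAA]
```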

  8. Shuffler Components
      Shufflers reorder whole values or the bits of values
        Do not perform any computation
        Each thread block operates on a chunk of values
      BIT emits the most significant bits of all values, followed by the second most significant bits, etc.
      DIMn groups values by dimension n
        Tested n = 2, 3, 4, 5, 8, 16, and 32
        For example, DIM2 has the following effect: the sequence A, B, C, D, E, F becomes A, C, E, B, D, F
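Both shufflers can be sketched in a few lines (toy serial Python; the paper's versions run per thread block on a chunk, and BIT packs the emitted bits into words rather than listing them):

```python
def dim(n, vals):
    """DIMn shuffler: group values by position modulo n (dimension)."""
    return [v for d in range(n) for v in vals[d::n]]

# DIM2 example from the slide: A,B,C,D,E,F -> A,C,E,B,D,F
assert dim(2, list("ABCDEF")) == list("ACEBDF")

def bit_shuffle(vals, width=8):
    """BIT shuffler: the most significant bit of every value first, then the
    second most significant bit of every value, and so on."""
    return [(v >> b) & 1 for b in range(width - 1, -1, -1) for v in vals]

# Two 8-bit values: both MSBs come first, then the next bit position, ...
assert bit_shuffle([0b10000001, 0b00000001]) == \
       [1,0, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0, 1,1]
```

Grouping by bit position pays off because, after a good predictor, the high-order bits of consecutive residuals are almost all zero, producing long runs for a reducer.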

  9. Predictor Components
      Predictors guess values based on previous values and compute residuals (true minus guessed value)
        Residuals tend to cluster around zero, making them easier to compress than the original sequence
        Each thread block operates on a chunk of values
      LNVns subtracts the n-th prior value from the current value
        Tested n = 1, 2, 3, 5, 6, and 8
      LNVnx XORs the current value with the n-th prior value
        Tested n = 1, 2, 3, 5, 6, and 8
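A sketch of the two predictor families (assumed toy form; missing prior values are treated as zero, and subtraction wraps modulo the word width so the transform stays invertible):

```python
def lnv_s(n, vals, width=32):
    """LNVns: residual = value minus the n-th prior value (mod 2^width)."""
    mask = (1 << width) - 1
    return [(v - (vals[i - n] if i >= n else 0)) & mask
            for i, v in enumerate(vals)]

def lnv_x(n, vals):
    """LNVnx: residual = value XOR the n-th prior value."""
    return [v ^ (vals[i - n] if i >= n else 0) for i, v in enumerate(vals)]

# A slowly varying sequence yields residuals clustered near zero:
seq = [1000, 1001, 1003, 1002, 1002]
assert lnv_s(1, seq) == [1000, 1, 2, 0xFFFFFFFF, 0]
assert lnv_x(1, seq) == [1000, 1, 2, 1, 0]
```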

  10. Reducer Components
      Reducers eliminate redundancies in the value sequence
        No other component can change the length of the sequence, i.e., only reducers can compress
        Each thread block operates on a chunk of values
      ZE emits a bitmap of the zeros followed by the non-zero values
        Effective if the input sequence contains many zeros
      RLE performs run-length encoding, i.e., replaces repeating values by a count and a single value
        Effective if the input contains many repeating values
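Toy versions of the two reducers (assumed Python rendering; a real implementation would pack the ZE bitmap into words and choose compact count encodings for RLE):

```python
def ze(vals):
    """ZE reducer: a bitmap marking non-zero positions, followed by the
    non-zero values only (bitmap kept as a list of 0/1 flags here)."""
    bitmap = [1 if v != 0 else 0 for v in vals]
    return bitmap, [v for v in vals if v != 0]

def rle(vals):
    """RLE reducer: replace each run of repeating values by (count, value)."""
    out = []
    for v in vals:
        if out and out[-1][1] == v:
            out[-1] = (out[-1][0] + 1, v)
        else:
            out.append((1, v))
    return out

assert ze([0, 7, 0, 0, 3]) == ([0, 1, 0, 0, 1], [7, 3])
assert rle([5, 5, 5, 2, 2, 9]) == [(3, 5), (2, 2), (1, 9)]
```

The contrast matches the slide: ZE wins when zeros are frequent but scattered, RLE when identical values actually repeat back-to-back.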

  11. Algorithm Synthesis
      Determine the best four-stage algorithms with a cut
        Exhaustive search of all 138,240 possible combinations
        13 double-precision data sets (19 – 277 MB)
        Observational data, simulation results, MPI messages
        Single-precision data derived from the double-precision data
      Create a general GPU-friendly compression algorithm
        Analyze the best algorithm for each data set and precision
        Find commonalities and generalize into one algorithm
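The search itself is conceptually simple: enumerate every chain over the component library, compress a training block with each, and keep the chain that yields the smallest output. A hedged sketch (not the authors' harness; this toy library of two components and a fixed final reducer stands in for the real 138,240-combination search):

```python
from itertools import product

def identity(vals):                  # NUL-style component
    return list(vals)

def delta(vals):                     # LNV1s-style predictor (toy, 8-bit)
    return [(v - (vals[i - 1] if i else 0)) & 0xFF for i, v in enumerate(vals)]

def rle_size(vals):                  # RLE reducer, returning only the size
    runs = 1 + sum(1 for a, b in zip(vals, vals[1:]) if a != b)
    return 2 * runs                  # one (count, value) pair per run

library = {"NUL": identity, "LNV1s": delta}

def best_chain(data, stages=2):
    """Exhaustively try every chain and return (best names, encoded size)."""
    best = None
    for names in product(library, repeat=stages):
        vals = list(data)
        for name in names:
            vals = library[name](vals)
        size = rle_size(vals)        # fixed final reducer stage
        if best is None or size < best[1]:
            best = (names, size)
    return best

# A ramp compresses far better after a delta predictor: residuals repeat.
chain, size = best_chain(list(range(100)))
assert "LNV1s" in chain and size < 2 * 100
```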

  12. Best of 138,240 Algorithms (│ marks the position of the cut in each chain)

      data set       double precision          single precision
      msg_bt         LNV1s BIT LNV1s ZE │      DIM5 ZE LNV6x │ ZE
      msg_lu         LNV5s │ DIM8 BIT RLE      LNV5s LNV5s LNV5x │ ZE
      msg_sp         DIM3 LNV5x BIT ZE │       DIM3 LNV5x BIT ZE │
      msg_sppm       DIM5 LNV6x ZE │ ZE        RLE DIM5 LNV6s ZE │
      msg_sweep3d    LNV1s DIM32 │ DIM8 RLE    LNV1s DIM32 │ DIM4 RLE
      num_brain      LNV1s BIT LNV1s ZE │      LNV1s BIT LNV1s ZE │
      num_comet      LNV1s BIT LNV1s ZE │      LNV1s │ DIM4 BIT RLE
      num_control    LNV1s BIT LNV1s ZE │      LNV1s BIT LNV1s ZE │
      num_plasma     LNV2s LNV2s LNV2x │ ZE    LNV2s LNV2s LNV2x │ ZE
      obs_error      LNV1x ZE LNV1s ZE │       LNV6s BIT LNV1s ZE │
      obs_info       LNV2s │ DIM8 BIT RLE      LNV8s DIM2 │ DIM4 RLE
      obs_spitzer    ZE BIT LNV1s ZE │         ZE BIT LNV1s ZE │
      obs_temp       LNV8s BIT LNV1s ZE │      BIT LNV1x DIM32 │ RLE
      overall best   LNV6s BIT LNV1s ZE │      LNV6s BIT LNV1s ZE │
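The overall-best double-precision chain from the table (LNV6s BIT LNV1s ZE, cut at the end) can be sketched end to end. This is a hedged toy on 8-bit words, reusing the component definitions from above; the real algorithm operates on 32/64-bit words, one chunk per GPU thread block:

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1

def lnv_s(n, vals):
    """LNVns predictor: residual vs. the n-th prior value (mod 2^WIDTH)."""
    return [(v - (vals[i - n] if i >= n else 0)) & MASK
            for i, v in enumerate(vals)]

def bit_shuffle(vals):
    """BIT shuffler: all MSBs first, then the next bit position, etc."""
    return [(v >> b) & 1 for b in range(WIDTH - 1, -1, -1) for v in vals]

def ze(vals):
    """ZE reducer: non-zero bitmap plus the non-zero values."""
    return ([1 if v else 0 for v in vals], [v for v in vals if v])

def compress(vals):
    stage1 = lnv_s(6, vals)          # LNV6s
    stage2 = bit_shuffle(stage1)     # BIT
    stage3 = lnv_s(1, stage2)        # LNV1s
    return ze(stage3)                # ZE

# Smooth periodic data: LNV6s zeroes most words, BIT groups the zero bits,
# and ZE then stores mostly bitmap. (Sizes here are illustrative only.)
bitmap, nonzero = compress([100 + (i % 6) for i in range(48)])
assert len(nonzero) < len(bitmap)
```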

  13. Analysis of Reducers (double-precision results only; single-precision results are similar)
      ZE or RLE is required at the end (not counting the cut)
        ZE dominates
        Many zeros, but not in a row
      The first three stages contain almost no reducers
        Transformations are key to making the reducer effective
        Chaining whole compression algorithms may be futile

  14. Analysis of Mutators
      NUL and INV are never used
        No need to invert bits
        Fewer stages perform worse
      The cut is often at the end (i.e., not used)
        Word granularity suffices
        Easier/faster to implement
      DIM8 right after the cut (DIM4 with single precision)
        Used to separate the byte positions of each word
        Synthesis yielded this unforeseen use of the DIM component

  15. Analysis of Shufflers
      Shufflers are important
        Almost always included
      BIT is used very frequently
        FP bit positions correlate more strongly than values
      DIM has two uses
        Separate the byte positions of each word (see before): right after the cut
        Separate the values of multi-dimensional data sets (intended use): early stages

  16. Analysis of Predictors
      Predictors are very important (data model)
        Used in every case
        Often two predictors are used
      LNVns dominates LNVnx
        Arithmetic (subtraction) differences yield better residuals than bit-wise (XOR) differences
      Dimension n
        Separates the values of multi-dimensional data sets (in the 1st stage)
