
Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets - PowerPoint PPT Presentation



  1. Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets
  Xin Liang (University of California, Riverside), Sheng Di (Argonne National Laboratory), Dingwen Tao (University of Alabama), Sihuan Li (University of California, Riverside), Shaomeng Li (National Center for Atmospheric Research), Hanqi Guo (Argonne National Laboratory), Zizhong Chen (University of California, Riverside), Franck Cappello (Argonne National Laboratory & UIUC)

  2. Outline
  - Introduction
    - Large amount of scientific data
    - Limitations of lossless compression
  - Related Works and Limitations
  - Proposed Predictors
    - Mean-integrated Lorenzo predictor
    - Regression predictor
  - Adaptive Predictor Selection
  - Evaluation
  - Conclusion

  3. Introduction – Large Amount of Data
  - Extremely large amounts of data are produced by scientific simulations and instruments
    - CESM (climate simulation)
      - 2.5 PB of raw data produced
      - 170 TB of post-processed data
    - HACC (cosmology simulation)
      - 20 PB of data from a single 1-trillion-particle simulation
      - Mira at ANL: 26 PB of file system storage
      - 20 PB / 26 PB ≈ 80% of the file system
  (Figure: two partial visualizations of HACC simulation data: coarse grain on the full volume, or full resolution on small sub-volumes.)

  4. Introduction – Large Amount of Data
  - APS-U: the next-generation APS (Advanced Photon Source) project at ANL
    - 15 PB of data for storage
    - 35 TB of post-processed floating-point data
    - 100 GB/s bandwidth between APS and Mira
    - Transfer time: 15 PB / 100 GB/s ≈ 1.5 × 10^5 seconds (about 42 hours)

  5. Introduction – I/O Bottleneck
  - I/O has improved far less than the other parts of the system
  - From 1960 to 2014:
    - Supercomputer speed increased by 11 orders of magnitude
    - I/O capacity increased by 6 orders of magnitude
    - Internal drive access rate increased by 3~4 orders of magnitude
  - We are producing more data than we can store!
  (Source: Parallel I/O introductory tutorial, online.)

  6. Introduction – Limitations of Existing Lossless Compressors
  - Existing lossless compressors do not work efficiently on large-scale scientific data (compression ratios of only up to about 2)
  - Table 1: compression ratios of lossless compressors on large-scale simulations
  - Compression ratio = original data size / compressed data size

  7. Introduction – Lossy Compressors
  - Lossy compression trades accuracy for compression ratio
  - Error-bounded lossy compression, in addition, gives users a means to control the error

  8. Introduction – Error-Bounded Lossy Compressors
  - Common compression modes for error-bounded lossy compressors (see the sketch after this list)
    - Point-wise absolute error bound, |d_i − d_i'| ≤ ε: SZ, ZFP, TTHRESH, etc.
    - Point-wise relative error bound, |d_i − d_i'| / |d_i| ≤ ε: ISABELA, FPZIP, SZ, etc.
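
  A minimal sketch of how the two error-bound modes constrain the decompressed data; the function name check_error_bounds and its signature are illustrative and not part of any compressor listed above:

    import numpy as np

    def check_error_bounds(original, decompressed, abs_eb=None, rel_eb=None):
        """Verify point-wise error bounds between original and decompressed arrays."""
        diff = np.abs(original - decompressed)
        ok = True
        if abs_eb is not None:                                 # point-wise absolute error bound
            ok &= bool(np.all(diff <= abs_eb))
        if rel_eb is not None:                                 # point-wise relative error bound
            ok &= bool(np.all(diff <= rel_eb * np.abs(original)))
        return ok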

  9. Introduction – Common Assessments
  - Common metrics for assessing error-bounded lossy compressors
    - Compression ratio: cratio = uncompressed size / compressed size
    - Compression/decompression rate (speed): crate = uncompressed size / compression time, drate = uncompressed size / decompression time
    - RMSE (root mean squared error) and PSNR (peak signal-to-noise ratio): RMSE = sqrt((1/N) · Σ_{i=1..N} (d_i − d_i')²), PSNR = 20 · log10((d_max − d_min) / RMSE)
    - Rate distortion: bit rate = sizeof(data type) × 8 / cratio (bits per value)
  Tao et al., Z-checker: A framework for assessing lossy compression of scientific data, IJHPCA
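
  The same metrics in a short sketch, assuming original and decompressed are NumPy arrays of the same dtype and shape and compressed_nbytes is the size of the compressed stream; the function name is illustrative:

    import numpy as np

    def lossy_metrics(original, decompressed, compressed_nbytes):
        """Compute the assessment metrics listed above for one field."""
        cratio = original.nbytes / compressed_nbytes                 # compression ratio
        rmse = np.sqrt(np.mean((original - decompressed) ** 2))      # root mean squared error
        psnr = 20 * np.log10((original.max() - original.min()) / rmse)
        bit_rate = original.itemsize * 8 / cratio                    # bits per value after compression
        return {"cratio": cratio, "rmse": rmse, "psnr": psnr, "bit_rate": bit_rate}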

  10. Introduction – Existing Error-Bounded Lossy Compressors
  - Existing state-of-the-art error-bounded lossy compressors
    - ISABELA (NCSU): sorting preconditioner; B-spline interpolation
    - FPZIP (LLNL): Lorenzo prediction with truncation of fixed-point residuals; entropy encoding
    - ZFP (LLNL): block-wise exponent alignment and a customized orthogonal block transform; block-wise embedded encoding
    - SZ-1.4 (ANL): multi-layer prediction with linear-scaling quantization; Huffman encoding followed by lossless compression

  11. Introduction – SZ Decorrelation (Prediction)
  - SZ has four key stages (input → prediction → quantization → encoding → lossless compression → output), among which prediction and quantization are the most important two.
  - SZ uses a multi-layer predictor, by default one layer (the Lorenzo predictor).
  (Figure: 3D Lorenzo predictor; the current data point f_111 is predicted from its seven previously processed neighbors f_000 through f_110.)
  Tao et al., Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization, IPDPS17
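
  A minimal sketch of the first-order 3D Lorenzo prediction shown in the figure, assuming recon is a 3D NumPy array holding the already reconstructed (decompressed) values and that neighbors outside the array are treated as zero; the function name is illustrative:

    def lorenzo_predict_3d(recon, i, j, k):
        """Predict point (i, j, k) from its seven previously processed neighbors."""
        def f(a, b, c):                      # neighbors outside the array count as 0
            return recon[a, b, c] if min(a, b, c) >= 0 else 0.0
        return (f(i-1, j, k) + f(i, j-1, k) + f(i, j, k-1)
                - f(i-1, j-1, k) - f(i-1, j, k-1) - f(i, j-1, k-1)
                + f(i-1, j-1, k-1))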

  12. Introduction – SZ Linear-Scaling Quantization
  - SZ-1.1 used quantization with a single interval; SZ-1.4 uses uniform quantization with linear scaling (a small sketch follows).
    (i) Expand the quantization intervals around the predicted value (made by the prediction model) by linear scaling of the error bound, so each interval is 2 × error bound wide.
    (ii) Encode the real value by the index of the interval that contains it (the quantization code).
  (Figure: quantization with one interval in SZ-1.1 vs. uniform quantization with linear scaling in SZ-1.4.)
  Tao et al., Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization, IPDPS17
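
  A minimal sketch of linear-scaling quantization under the stated interval width of 2 × error bound; the radius parameter (number of intervals on each side) and the fallback for unpredictable data are illustrative assumptions, not the exact SZ implementation:

    def quantize(real, pred, eb, radius=32768):
        """Map the prediction error to an interval index so that the
        reconstructed value stays within the absolute error bound eb."""
        code = round((real - pred) / (2 * eb))
        if abs(code) >= radius:          # too far from the prediction: store the value losslessly
            return None, real
        recon = pred + 2 * eb * code     # the decompressor reproduces exactly this value
        return code, recon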

  13. Introduction – Limitation of the Lorenzo Predictor
  - The Lorenzo predictor has to use decompressed data for prediction
    - Low prediction accuracy for large error bounds, because it predicts from inaccurate decompressed data
    - The higher the dimensionality of the data, the larger the error
    - Leads to unexpected artifacts
  (Figure: SZ-1.4 decompressed data (compression ratio 111:1) vs. the original data.)

  14. Introduction – Limitation of the Lorenzo Predictor
  - The Lorenzo predictor has to use decompressed data for prediction
    - Low prediction accuracy for large error bounds, because it predicts from inaccurate decompressed data
    - The higher the dimensionality of the data, the larger the error
    - Leads to unexpected artifacts
  - Hard to parallelize
    - Frequent data dependencies between points
  - Prone to error
    - One error may propagate to all subsequent data
  (SZ multi-layer predictor, default one layer: the Lorenzo predictor.)
  Tao et al., Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization, IPDPS17

  15. New Design – Mean-Integrated Lorenzo Predictor
  - Use the mean to approximate the data in the densest interval (see the sketch after this list)
    - Advantages
      - Predicts without decompressed data
      - Reduced artifacts
      - Works well when the data has an obvious background value
    - Limitations
      - Only covers the data in the densest interval
      - Degraded performance on more uniform data
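
  A simplified sketch of the idea, assuming the densest 2 × error-bound-wide interval is found with a histogram and that a value close to that interval's mean is predicted by the mean while every other value falls back to the Lorenzo prediction; the function names, the binning, and the fallback logic are illustrative, not the exact compressor code:

    import numpy as np

    def densest_interval_mean(data, eb):
        """Mean of the densest 2*eb-wide interval (e.g. a constant background)."""
        nbins = max(1, int((data.max() - data.min()) / (2 * eb)))
        hist, edges = np.histogram(data, bins=nbins)
        b = int(np.argmax(hist))
        in_bin = data[(data >= edges[b]) & (data <= edges[b + 1])]
        return in_bin.mean()

    def mean_integrated_predict(value, mean, eb, lorenzo_pred):
        """Predict with the precomputed mean inside the densest interval,
        otherwise fall back to the plain Lorenzo prediction."""
        return mean if abs(value - mean) <= eb else lorenzo_pred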

  16. New Design – Mean-Integrated Lorenzo Predictor
  - Adaptive selection: choose the plain Lorenzo or the mean-integrated Lorenzo (M-Lorenzo) predictor according to the maximum data density in the densest interval (see the sketch after this slide)
    - Sample dataset: Hurricane, 13 data fields
  (Figures: overall rate distortion (PSNR vs. bit rate) on Hurricane, and the percentage of fields selecting M-Lorenzo vs. bit rate.)
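
  A small, self-contained sketch of such a density-based selection, assuming a simple histogram estimate of the densest interval; the 25% threshold is a made-up value for illustration, not the threshold used by the compressor:

    import numpy as np

    def densest_interval_fraction(data, eb):
        """Fraction of points falling into the densest 2*eb-wide interval."""
        nbins = max(1, int((data.max() - data.min()) / (2 * eb)))
        hist, _ = np.histogram(data, bins=nbins)
        return hist.max() / data.size

    def select_lorenzo_variant(data, eb, density_threshold=0.25):
        """Use M-Lorenzo only when the densest interval covers enough of the data."""
        dense = densest_interval_fraction(data, eb)
        return "mean-integrated Lorenzo" if dense >= density_threshold else "Lorenzo"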

  17. New Design – Regression Predictor
  - Divide the data into blocks and predict each data point in a block with coefficients from a linear regression model (a block-wise sketch follows)
    - Compute the regression coefficients per block
    - Predict the data from the regression coefficients
  (Figure: 2D regression.)
  - 3D case: f(x, y, z) = β0·x + β1·y + β2·z + β3
  Di et al., SZ tutorial, SC18
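
  A minimal sketch of the per-block fit and prediction for the 3D case above; a general least-squares solve is used here for clarity, whereas a production compressor would compute the coefficients in closed form and store (and possibly further compress) them per block. The function names and the example block size are illustrative:

    import numpy as np

    def fit_block_regression(block):
        """Fit f(x, y, z) = b0*x + b1*y + b2*z + b3 to one data block by least squares."""
        nx, ny, nz = block.shape
        x, y, z = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz), indexing="ij")
        A = np.column_stack([x.ravel(), y.ravel(), z.ravel(), np.ones(block.size)])
        coeffs, *_ = np.linalg.lstsq(A, block.ravel(), rcond=None)
        return coeffs                                  # b0, b1, b2, b3 (stored per block)

    def predict_block(shape, coeffs):
        """Predict every point of a block from its four stored coefficients."""
        nx, ny, nz = shape
        x, y, z = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz), indexing="ij")
        b0, b1, b2, b3 = coeffs
        return b0 * x + b1 * y + b2 * z + b3

    # Example on a random 6x6x6 block:
    # block = np.random.rand(6, 6, 6); pred = predict_block(block.shape, fit_block_regression(block))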

  18. New Design – Regression Predictor: Advantages
  - Does not need decompressed data for prediction
    - Prediction accuracy holds for different error bounds, because prediction uses the stored coefficients
    - Fewer visible artifacts even when the error bound is large
  (Figures: our solution decompressed (compression ratio 117:1), the original data, and SZ-1.4 decompressed (compression ratio 111:1).)

  19. New Design – Regression Predictor: Advantages
  - Does not need decompressed data for prediction
    - Prediction accuracy holds for different error bounds, because prediction uses the stored coefficients
    - Fewer visible artifacts even when the error bound is large
  - Highly parallelizable, both inter-block and intra-block
    - No data dependencies between data points
  - Controlled error propagation
    - An error can propagate to at most one block

  20. New Design – Regression Predictor: Limitations
  - Regression works better under large error bounds
    - It cannot model sharp changes, because the predicted hyperplane is always linear (Lorenzo prediction is constant for 1D data, linear for 2D data, quadratic for 3D data, etc.)
    - High storage cost for the regression coefficients: 1/54 overhead for a 6×6×6 block (4 coefficients per 216 values), although the coefficients can be compressed further
    - Higher computational cost: more multiplications than the Lorenzo predictor
