waveSZ: A Hardware-Algorithm Co-Design of Efficient Lossy Compression for Scientific Data


  1. waveSZ: A Hardware-Algorithm Co-Design of Efficient Lossy Compression for Scientific Data
Jiannan Tian (The University of Alabama), Sheng Di (Argonne National Laboratory), Chengming Zhang (The University of Alabama), Xin Liang (University of California, Riverside), Sian Jin (The University of Alabama), Dazhao Cheng (University of North Carolina at Charlotte), Dingwen Tao (The University of Alabama), Franck Cappello (Argonne National Laboratory)
February 24, 2020 · PPoPP '20 at San Diego, California, USA

  2. Outline: Background Introduction · Proposed Design of waveSZ · Experimental Evaluation · Conclusion and Future Work
Trend of Supercomputing Systems
Storage capacity and bandwidth are developing more slowly than computational capability.

supercomputer      year  class       PF           MS         SB          MS/SB  PF/SB
Cray Jaguar        2008  1 PFLOPS    1.75 PFLOPS  360 TB     240 GB/s    1.5k   7.3k
Cray Blue Waters   2012  10 PFLOPS   13.3 PFLOPS  1.5 PB     1.1 TB/s    1.3k   13k
Cray CORI          2017  10 PFLOPS   30 PFLOPS    1.4 PB     1.7 TB/s⋆   0.8k   17k
IBM Summit         2018  100 PFLOPS  200 PFLOPS   >10 PB⋆⋆   2.5 TB/s    >4k    80k

PF: peak FLOPS; MS: memory size; SB: storage bandwidth. ⋆ when using burst buffer. ⋆⋆ counting only DDR4. Source: F. Cappello (ANL).
Table 1: Three classes of supercomputers showing their performance, MS, and SB.
Feb. 24, 2020 · PPoPP '20, San Diego, California, USA · waveSZ · 2 / 17
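As a sanity check on Table 1, the MS/SB and PF/SB columns follow from simple division. The sketch below is illustrative only (values transcribed from the slide; Summit's ">10 PB" is taken as exactly 10 PB for the arithmetic):

```python
# Sanity check of the MS/SB (seconds to flush memory to storage) and
# PF/SB (FLOP per byte of storage I/O) ratios from Table 1.
# Units: FLOPS, bytes, bytes/s. Values are transcribed from the slide.
systems = [
    # (name, peak FLOPS, memory size, storage bandwidth)
    ("Jaguar",      1.75e15, 360e12, 240e9),
    ("Blue Waters", 13.3e15, 1.5e15, 1.1e12),
    ("CORI",        30e15,   1.4e15, 1.7e12),
    ("Summit",      200e15,  10e15,  2.5e12),   # ">10 PB" taken as 10 PB
]
for name, pf, ms, sb in systems:
    print(f"{name:11s}  MS/SB = {ms/sb:7.0f} s   PF/SB = {pf/sb:9.0f} FLOP/B")
```

The widening PF/SB gap (7.3k to 80k FLOP/B in a decade) is the slide's point: compute is outpacing storage bandwidth.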

  3. Current Status of Scientific Applications
Today's scientific research is data-driven at a large scale (simulations or instruments): there are PBs to process and analyze, and larger PB-scale datasets are coming. Data reduction is in demand.
◮ Cosmology simulation HACC (a): (1) generates 20 PB of data per one-trillion-particle (10^12) simulation, (2) exhausting the file system (b) and (3) taking a long time to store (c). (4) A reduction rate of 10 is needed.
◮ Climate simulation CESM: (1) generates 1 TB of data per compute day, (2) raising NCAR's hardware budget for storage from 20% (2013) to 50% (2017). (3) A reduction rate of 10+ is needed [A. Baker et al., HPDC '16].
◮ APS-U Project (high-energy X-ray beam experiments, brain initiatives): (1) multi-hundred PB of storage (150 TB/specimen, ×100 specimens). (2) Data analysis performed off-site at the Argonne Leadership Computing Facility on Mira, with a connection at 100 GB/s (d)(e). (3) A reduction rate of 100 is needed.
(a) Hardware/Hybrid Accelerated Cosmology Code. (b) Mira at ANL has a 26 PB file system; 20 PB / 26 PB ≈ 80%. (c) On NSF Blue Waters (1 TB/s I/O bandwidth), 5h30m to store the data. (d) It would take ∼115 days to transfer the data. (e) There is no 100 PB buffer at the APS :(
[Figure: Advanced Photon Source Upgrade, mouse-brain X-ray connectome — storage and analysis pipeline.]
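The HACC storage figure above can be checked with back-of-envelope arithmetic (a sketch; the 1 TB/s value is the Blue Waters I/O bandwidth cited in footnote (c)):

```python
# Back-of-envelope check of the HACC numbers on this slide: writing 20 PB
# of simulation output over a 1 TB/s file system (NSF Blue Waters figure).
data_bytes = 20e15       # 20 PB per one-trillion-particle run
io_bandwidth = 1e12      # 1 TB/s aggregate I/O bandwidth
seconds = data_bytes / io_bandwidth
print(f"{seconds:.0f} s = {seconds/3600:.1f} h")    # ~5.6 h, consistent with the slide's 5h30m
# With the 10:1 reduction rate the slide calls for, the same write shrinks to ~33 min.
print(f"with 10:1 reduction: {seconds/10/60:.0f} min")
```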

  4. (Error-Bounded) Lossy Compression Matters
◮ Scientific datasets, represented in floating point, are lossless-compressed at only about a 2:1 rate [Son et al. 2014]; we need 10:1 or even higher!
◮ Industrial lossy compressors offer much higher reduction rates, but they are designed/optimized for human perception and are not suitable for supercomputer applications.
◮ Strict error control is required for scientific discovery and accurate post-analysis: data analysis with lossy datasets (after or during simulation), execution restart from failures, and calculation from lossy data in memory.
◮ Diverse compression modes are needed: absolute error bound (L∞-norm error), pointwise relative error bound, RMSE error bound (L2-norm error), and fixed bit rate.
◮ SZ [Di and Cappello 2016; Tao et al. 2017; Xin et al. 2018]: a prediction-based lossy compression framework for scientific data that strictly controls the global upper bound of compression error at varying reduction rates.
[Figure: JPEG at reduction rates from 10:1 to 250:1; reduction rate decreasing and hence quality increasing, left to right. Figure from Peter Lindstrom, LLNL.]
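The three error metrics behind these compression modes can be illustrated on a toy original/decompressed pair. This is illustrative only, not SZ's implementation; the data values are made up:

```python
import math

# Toy original and lossily decompressed arrays (made-up values).
orig  = [1.0, 2.0, 4.0, 8.0]
lossy = [1.01, 1.98, 4.03, 7.96]
err = [abs(a - b) for a, b in zip(orig, lossy)]

abs_err = max(err)                                     # L-infinity (absolute) error
rel_err = max(e / abs(a) for e, a in zip(err, orig))   # pointwise relative error
rmse = math.sqrt(sum(e * e for e in err) / len(err))   # L2-norm (RMSE) error

# An absolute-error-bounded compressor guarantees abs_err <= eb; the
# pointwise-relative mode bounds rel_err; the RMSE mode bounds rmse.
print(f"abs: {abs_err:.3f}  rel: {rel_err:.4f}  rmse: {rmse:.4f}")
```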

  5. How SZ Works
Pipeline (input: initial data → output: lossily compressed data):
1. Decorrelation: prediction, linear (1D) or multidimensional.
2. Approximation with strict error control: linear-scaling quantization of the prediction errors, parameterized by the error bound eb.
3. Coding: variable-length (Huffman) coding of the low-entropy quantization codes, × lossless compression of the coded bitstream.
◮ The Lorenzo predictor allows arbitrary-dimensional prediction:
  \ell(D_{x_1,\dots,x_d}) = \sum_{\substack{0 \le k_1,\dots,k_d \le n \\ (k_1,\dots,k_d) \neq 0}} (-1)^{k_1+\cdots+k_d+1} \prod_{j=1}^{d} \binom{n}{k_j} \, D_{x_1-k_1,\dots,x_d-k_d}
◮ The single-layer Lorenzo predictor generally works the best [Tao et al. 2017]. 2D form:
  \ell(D_{0,0}) = \mathrm{dot}\left( \begin{pmatrix} D_{-1,-1} & D_{0,-1} \\ D_{-1,0} & D_{0,0} \end{pmatrix}, \begin{pmatrix} -1 & 1 \\ 1 & 0 \end{pmatrix} \right) = D_{0,-1} + D_{-1,0} - D_{-1,-1}
◮ Customized Huffman encoding: each sizeof(T)-byte-long symbol maps to a Huffman code; high quantization quality (codes aggregated at the center) makes the Huffman-coded bitstream more amenable to further gzip.
[Figure: linear-scaling quantization (true vs. predicted value, bin code with offset, error bound eb, along dim 0/dim 1) and 1st/2nd/3rd-layer Lorenzo (ℓ) prediction; processed/processing/unprocessed cells.]
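The prediction-plus-quantization steps on this slide can be sketched as follows. This is a minimal illustration, not SZ's actual code (function names and test data are mine); it uses the 2D single-layer Lorenzo predictor and quantizes prediction errors into bins of width 2·eb, which keeps every reconstructed value within eb of the original:

```python
import numpy as np

def compress(data, eb):
    """Sketch of SZ-style compression: 2D single-layer Lorenzo prediction
    l(D[i,j]) = D[i-1,j] + D[i,j-1] - D[i-1,j-1], then linear-scaling
    quantization of the prediction error into integer bins of width 2*eb."""
    rows, cols = data.shape
    recon = np.zeros_like(data)   # predict from reconstructed values, as the decompressor must
    quant = np.zeros(data.shape, dtype=np.int64)
    for i in range(rows):
        for j in range(cols):
            pred = (recon[i - 1, j] if i else 0.0) \
                 + (recon[i, j - 1] if j else 0.0) \
                 - (recon[i - 1, j - 1] if i and j else 0.0)
            code = int(round((data[i, j] - pred) / (2 * eb)))
            quant[i, j] = code                    # low-entropy codes go to Huffman coding
            recon[i, j] = pred + code * 2 * eb    # write-back: needed by later predictions
    return quant, recon

data = np.fromfunction(lambda i, j: np.sin(0.1 * i) + np.cos(0.1 * j), (32, 32))
eb = 1e-3
quant, recon = compress(data, eb)
print(np.max(np.abs(data - recon)) <= eb)   # True: error strictly bounded by eb
```

Note the write-back into `recon`: it is exactly the loop-carried dependency that the next slides identify as the obstacle to parallelizing SZ.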

  6. Issues with SZ and Its Current FPGA Implementation
◮ Low throughput of SZ: lack of parallelism — SIMD and SIMT cannot apply because of the loop-carried dependencies due to write-back (Figure 1).
◮ Limitations of the FPGA implementation GhostSZ: a totally performance-driven design; 3 predictors in use, which need extra bits to encode (2 predictor bits plus a 14-bit quantized prediction error, vs. the single 16-bit quantized prediction error of waveSZ/SZ-1.4); more "workflow pipelines" (more resources); lower compression ratio.
◮ New use scenarios for adopting FPGAs: real-time and "inline" processing (Intel, 2018); ExaNet, an FPGA-based direct network architecture for the European exascale systems [Ammendola et al. 2018].
Figure 1: Loop-carried dependencies due to write-back. Figure 2: General distribution pattern of quantization codes — codes near the center (within the radius) are "easy" to encode, outliers are "hard". Figure 3: Distribution of prediction errors for CESM-ATM CLDLOW (SZ-1.0, SZ-1.4, GhostSZ).
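Why the center-peaked code distribution of Figures 2 and 3 favors Huffman coding: its Shannon entropy, a lower bound on the achievable bits per symbol, sits far below the fixed 16 bits/code. A toy illustration with made-up counts (not the CLDLOW data):

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy in bits per symbol: a lower bound on the average
    code length any Huffman code can achieve for this distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

peaked = [0] * 900 + [1] * 40 + [-1] * 40 + [2] * 10 + [-2] * 10  # good prediction
flat = list(range(1000))                                          # no prediction skill
print(f"peaked: {entropy(peaked):.2f} bits/code, flat: {entropy(flat):.2f} bits/code")
```

With a sharply peaked distribution the entropy drops below 1 bit/code, so the Huffman-coded bitstream is both short and, as the slide notes, still amenable to further gzip.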

  7. Memory Access Pattern and Dependency
◮ Dependencies are denoted by the Manhattan distance from the zero point (•).
◮ SZ-1.4: iterates against the dependencies (see Figure 4(c)); a read-after-write (RAW) dependency at the last cycle makes it impossible to extract parallelism.
◮ GhostSZ: overlooks multidimensional smoothness; slices data of any dimensionality into 1D, hence multiple zero points and no dependency "vertically".
Figure 4: SZ-1.4 and GhostSZ: memory access pattern ((a), (b)) and data dependency in Manhattan distance ((c), (d)).

  8. Memory Access Pattern and Dependency (cont'd)
◮ Dependencies are denoted by the Manhattan distance from the zero point (•).
◮ waveSZ: iterates along the aligned, dependency-free points and exploits the parallelism by pipelining.
◮ Pipelining: changes the original algorithm as little as possible; relies on platform-supported pipelining control.
Figure 5: SZ-1.4 and waveSZ: memory access pattern ((a), (b)) and data dependency in Manhattan distance ((c), (d)).
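The wavefront traversal can be sketched as follows (an illustrative sketch, not waveSZ's HLS code). In the 2D Lorenzo prediction, cell (i, j) depends only on (i−1, j), (i, j−1), and (i−1, j−1), all at a strictly smaller Manhattan distance i+j from the zero point; every cell on one anti-diagonal is therefore dependency-free with respect to the others and can occupy the same pipeline step:

```python
def wavefronts(rows, cols):
    """Yield the cells of a rows x cols grid grouped by anti-diagonal,
    i.e. by Manhattan distance d = i + j from the zero point (0, 0).
    Cells within one group have no dependency on each other."""
    for d in range(rows + cols - 1):
        yield [(i, d - i) for i in range(max(0, d - cols + 1), min(rows, d + 1))]

for step, cells in enumerate(wavefronts(4, 4)):
    print(step, cells)   # step 0: [(0, 0)]; step 1: [(0, 1), (1, 0)]; ...
```

Processing one anti-diagonal per cycle is what lets waveSZ pipeline the otherwise loop-carried prediction, without changing the predictor itself.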
