
CUSZ: A High-Performance GPU Based Lossy Compression Framework for Scientific Data (Sian Jin, October 5, 2020)


  1. CUSZ: A High-Performance GPU Based Lossy Compression Framework for Scientific Data
     Jiannan Tian (Washington State University), Sheng Di (Argonne National Laboratory), Kai Zhao (University of California, Riverside), Cody Rivera (The University of Alabama), Megan Hickman Fulp (Clemson University), Robert Underwood (Clemson University), Sian Jin (Washington State University), Xin Liang (Oak Ridge National Laboratory), Jon Calhoun (Clemson University), Dingwen Tao (Washington State University), Franck Cappello (Argonne National Laboratory)
     October 5, 2020 · PACT ’20, Virtual Event

  2. Trend of Supercomputing Systems: Gap Between Compute and I/O
     Sections: Background · Introduction · Design · Evaluation · Conclusion

     The compute capability is ever growing, while storage capacity and bandwidth are developing more slowly and not matching the pace.

     Table 1: Three classes of supercomputers showing their performance, MS, and SB.

     supercomputer     year  class       PF           MS         SB          PF/SB  MS/SB
     Cray Jaguar       2008  1 PFLOPS    1.75 PFLOPS  360 TB     240 GB/s    7.3k   1.5k
     Cray Blue Waters  2012  10 PFLOPS   13.3 PFLOPS  1.5 PB     1.1 TB/s    13k    1.3k
     Cray CORI         2017  10 PFLOPS   30 PFLOPS    1.4 PB     1.7 TB/s •  17k    0.8k
     IBM Summit        2018  100 PFLOPS  200 PFLOPS   > 10 PB •• 2.5 TB/s    80k    > 4k

     PF: peak FLOPS; MS: memory size; SB: storage bandwidth. Source: F. Cappello (ANL).
     • when using burst buffer
     •• counting only DDR4

     October 5, 2020 · PACT ’20, Virtual Event · CUSZ · 2 / 20

  3. Current Status of Scientific Applications: Big Data

     application | data scale | to reduce
     HACC (cosmology simulation) | 20 PB per one-trillion-particle simulation; 26 PB for Mira@ANL (use up FS) | 10x in need
     CESM (climate simulation) | 5h30m to store (NSF Blue Waters, I/O at 1 TBps); 20% vs 50% of h/w budget for storage, 2013 vs 2017 | 10x in need
     APS-U (High-Energy X-Ray Beams Experiments) | brain initiatives: hundreds of PB; 100-PB buffer or connection at 100 GBps in need (passive solution?) | 100x in need

  4. Error-Bounded Lossy Compression Matters
     ▶ Lossless compression of scientific datasets: reduction ratio of 2:1 (FP-type); 10:1 or higher is in need.
     ▶ Industry lossy compressors (e.g., JPEG, MPEG): distinct in design goals; despite the high reduction rate, not suitable for HPC.
     ▶ Diverse compression modes are needed:
        1 absolute error bound (L∞ norm)
        2 pointwise relative error bound
        3 RMSE error bound (L2 norm)
        4 fixed bitrate
     [Figure from Peter Lindstrom (LLNL): results at varying reduction rate, 10:1 to 250:1, left to right]

     SZ: lossy compression for scientific data [Di and Cappello 2016; Tao et al. 2017; Xin et al. 2018], github.com/szcompressor/SZ
     ▶ prediction-based lossy compressor framework for scientific data
     ▶ strictly controls the global upper bound of compression error
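To make three of the error-bound modes listed on this slide concrete, the following is a minimal Python sketch of how each quantity could be measured between an original array and its decompressed counterpart. The function names and the sample values are illustrative assumptions and are not taken from SZ or CUSZ; the RMSE here is the L2 norm of the error divided by the square root of the number of points.

```python
import math

def max_abs_error(original, decompressed):
    """Largest pointwise absolute error (the L-infinity norm of the error)."""
    return max(abs(a - b) for a, b in zip(original, decompressed))

def max_pointwise_relative_error(original, decompressed):
    """Largest |error| / |original value|; zero-valued points are skipped (an assumption)."""
    return max(abs(a - b) / abs(a)
               for a, b in zip(original, decompressed) if a != 0)

def rmse(original, decompressed):
    """Root-mean-square error over all points."""
    n = len(original)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, decompressed)) / n)

# Hypothetical data: a compressor run with absolute error bound eb = 0.25
# would have to keep max_abs_error(...) at or below 0.25.
original = [1.0, 2.0, 4.0, 8.0]
decompressed = [1.1, 1.9, 4.0, 8.2]
eb = 0.25
within_bound = max_abs_error(original, decompressed) <= eb
```

An absolute error bound constrains the worst single point, whereas an RMSE bound constrains the error on average, so a dataset can satisfy the latter while violating the former.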

  5. SZ Framework (Error-Bound Workflow)
     input: initial data + parameters
     ▶ DECORRELATION: prediction, linear (1D) or multidimensional
     ▶ APPROXIMATION (lossy): linear-scaling quantization of prediction errors, with strict error control
     ▶ CODING (lossless): variable-length code (Huffman code), low entropy
     output: lossily compressed data

  6. Research Objective and Contribution: Motivation, Challenge, and Contribution
     ▶ CUSZ is the first strictly error-bounded lossy compressor on GPU for scientific data.
     ▶ Challenge: tight data dependency (loop-carried RAW) hinders parallelization.
        Solution: eliminate the dependency and parallelize it. DUAL-QUANTIZATION: {PRE, POST}QUANTIZATION.
     ▶ Challenge: host-device communications arise when tasks are placed only according to CPU/GPU suitability.
        Solution: all tasks are done on GPU. Histogramming [Gómez-Luna et al.]; customized Huffman codec (coarse-grained).
     [Figure: loop iterations i−1 and i−0 over elements j+0 to j+3 and m+0 to m+2]

  7. System Workflow: Diagram of CUSZ
     DUAL-QUANTIZATION AND PREDICTION
     ▶ original data (floating-point representation)
     ▶ PREQUANTIZATION (no RAW), in units of eb
     ▶ ℓ-prediction (no RAW), unit weight prediction
     ▶ POSTQUANTIZATION (no RAW), in units of eb (unchanged); produces the quant. code in a fixed-length representation
     CUSTOMIZED HUFFMAN ENCODING
     ▶ histogramming
     ▶ build and canonize Huffman codebook
     ▶ Huffman encoding into a fixed-length Huffman code (bitwidth and Huff-code, MSB to LSB)
     ▶ deflating Huffman codes: concatenating to dense format (memcpy), giving the DEFLATED output

     Example histogram of quant. codes shown on the slide:
     range      freq.
     442-512    76%
     512-582    24%
     582-652    0.14%
     652-722    0.073%
     722-793    0.026%
     793-863    0.0095%
     863-933    0.0021%
     933-1024   0.00014%

     [Figure: sample grid of prediction results, fixed-length codebook entries for quant. codes 508 to 515, and a Huffman tree with root and leaves t0, t1, t2, ..., tn]
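The deflating step in this workflow, which concatenates variable-length Huffman codes into a dense format, can be sketched as follows. This is a minimal CPU-side Python illustration under assumed conventions: each code is given as a hypothetical `(bitwidth, value)` pair, bits are packed MSB first, and the tail is zero-padded to a whole byte. It is not CUSZ's GPU kernel, where per the talk each thread handles one data chunk.

```python
def deflate(codes):
    """Pack (bitwidth, value) pairs into a dense, MSB-first byte string.

    Returns (packed_bytes, total_bits); total_bits lets a decoder ignore the padding.
    """
    acc = 0          # bit accumulator holding all codes packed so far
    total_bits = 0
    for width, value in codes:
        # Append the low `width` bits of `value` to the accumulator.
        acc = (acc << width) | (value & ((1 << width) - 1))
        total_bits += width
    pad = (-total_bits) % 8          # zero bits needed to reach a byte boundary
    acc <<= pad
    return acc.to_bytes((total_bits + pad) // 8, "big"), total_bits

# Four codes of widths 1, 2, 3, 3 (bit strings 0, 10, 110, 111) occupy 9 bits,
# so they pack into 2 bytes instead of four fixed-length words.
packed, nbits = deflate([(1, 0b0), (2, 0b10), (3, 0b110), (3, 0b111)])
```

Keeping `total_bits` alongside the packed bytes is one simple way for a decoder to know where a chunk's meaningful bits end.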

  8. Loop-Carried Read-After-Write: (P+Q) Procedure in SZ
     ▶ Lossless compression and decompression (codec) are mutually reversed procedures.
     ▶ Similarly, SZ makes the to-be-decompressed (reconstructed) data show up during compression and keeps it under error control.
     ▶ Error control is conducted during quantization and reconstruction:
        $\left| \mathrm{round}\!\left( e^\circ / (2 \cdot eb) \right) \times (2 \cdot eb) - e^\circ \right| \le eb$.
     ▶ This introduces a loop-carried read-after-write dependency.

     SZ COMPRESSION, shown for elements $k-2$, $k-1$, and $k$:
        prediction: $d_k - p^\circ_k = e^\circ_k$
        quantization: $e^\circ_k \mapsto q^\circ_k$
        reconstruction (w/ loop-carried RAW): $q^\circ_k \mapsto e^{\circ\star}_k \mapsto d^{\circ\star}_k$
     DECOMPRESSION:
        $q^\bullet_k \mapsto e^\bullet_k \mapsto d^\bullet_k$, with $q^\circ_k \equiv q^\bullet_k$, $e^{\circ\star}_k \equiv e^\bullet_k$, $d^{\circ\star}_k \equiv d^\bullet_k$
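The loop-carried dependency described on this slide can be seen in a minimal sequential sketch. The code below assumes the simplest possible 1D predictor (the prediction is the previous reconstructed value, starting from 0.0); the predictor choice and the function names are illustrative assumptions, not SZ's actual implementation.

```python
def sz_like_compress(data, eb):
    """Sequential predict-and-quantize with an assumed previous-value predictor."""
    quant_codes = []
    prev_recon = 0.0                      # assumed initial prediction
    for d in data:
        pred = prev_recon                 # p_k reads the reconstruction written by iteration k-1
        err = d - pred                    # e_k = d_k - p_k
        q = round(err / (2 * eb))         # linear-scaling quantization in steps of 2*eb
        recon = pred + q * (2 * eb)       # the value the decompressor will later see
        quant_codes.append(q)
        prev_recon = recon                # write consumed by iteration k+1 (the RAW dependency)
    return quant_codes

def sz_like_decompress(quant_codes, eb):
    """Mirror of the compression loop: rebuild each value from the previous one."""
    out = []
    prev = 0.0
    for q in quant_codes:
        prev = prev + q * (2 * eb)
        out.append(prev)
    return out

data = [1.0, 1.3, 1.25, 2.0]              # hypothetical input
eb = 0.1
codes = sz_like_compress(data, eb)
recon = sz_like_decompress(codes, eb)
```

Because iteration k cannot start until iteration k-1 has written `prev_recon`, the loop body cannot simply be distributed across GPU threads.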

  9. Fully Parallelized (P+Q) Procedure in CUSZ
     ▶ Prioritize error control.
     ▶ Error control happens at the very beginning, in prequantization:
        $\left| \mathrm{round}\!\left( d^\circ / (2 \cdot eb) \right) \times (2 \cdot eb) - d^\circ \right| \le eb$.
     ▶ Postquantization corresponds to quantization in SZ.

     CUSZ COMPRESSION, shown for elements $k-2$, $k-1$, and $k$:
        PREQUANT: $d_k \mapsto d^\circ_k$
        POSTQUANT: $d^\circ_k - p^\circ_k = \delta^\circ_k \equiv q^\circ_k$
        reconstruction (unnecessary): $q^\circ_k \equiv \delta^{\circ\star}_k \mapsto d^{\circ\star}_k$
     DECOMPRESSION:
        $q^\bullet_k \equiv \delta^\bullet_k \mapsto d^\bullet_k$
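A minimal sketch of the dual-quantization idea follows, again assuming a previous-value predictor and a quantization step of 2·eb as in the slide's formula. The function names are hypothetical and the code runs sequentially on the CPU; the point is only that each output element depends on the prequantized inputs and never on another iteration's output.

```python
def prequantize(data, eb):
    """Error control up front: map each value to an integer multiple of 2*eb."""
    return [round(d / (2 * eb)) for d in data]

def postquantize(prequant):
    """Integer prediction error against the previous prequantized value.

    Every element reads only `prequant`, so all elements could be computed
    independently (e.g. one per GPU thread); there is no loop-carried RAW.
    """
    return [prequant[k] - (prequant[k - 1] if k > 0 else 0)
            for k in range(len(prequant))]

def decompress(quant_codes, eb):
    """Undo the prediction with a running sum, then scale back by 2*eb."""
    out, acc = [], 0
    for q in quant_codes:
        acc += q
        out.append(acc * 2 * eb)
    return out

data = [1.0, 1.3, 1.25, 2.0]              # hypothetical input
eb = 0.1
pre = prequantize(data, eb)
post = postquantize(pre)
recon = decompress(post, eb)
```

Since the prediction operates on integers that are already error-controlled, the prediction error is exact and no reconstruction is needed during compression, which matches the "(unnecessary)" label on the slide.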

  10. Original SZ (Loop-Carried RAW) vs Fully Parallelized CUSZ
     Both procedures are shown for elements $k-2$, $k-1$, and $k$.

     SZ COMPRESSION:
        prediction: $d_k - p^\circ_k = e^\circ_k$
        quantization: $e^\circ_k \mapsto q^\circ_k$
        reconstruction (w/ loop-carried RAW): $q^\circ_k \mapsto e^{\circ\star}_k \mapsto d^{\circ\star}_k$
     SZ DECOMPRESSION:
        $q^\bullet_k \mapsto e^\bullet_k \mapsto d^\bullet_k$

     CUSZ COMPRESSION:
        PREQUANT: $d_k \mapsto d^\circ_k$
        POSTQUANT: $d^\circ_k - p^\circ_k = \delta^\circ_k \equiv q^\circ_k$
        reconstruction (unnecessary): $q^\circ_k \equiv \delta^{\circ\star}_k \mapsto d^{\circ\star}_k$
     CUSZ DECOMPRESSION:
        $q^\bullet_k \equiv \delta^\bullet_k \mapsto d^\bullet_k$

  11. Canonical Codebook and Huffman Encoding
     ca·non·i·cal, adj. [Schwartz and Kallick 1964]
     ▶ codebook transformed to a compact manner
     ▶ no tree in decoding
     ▶ tree build time: 4-7 ms (for now)
     ▶ canonize for 200 us (1024 symbols)

     ▶ Encoding/decoding is done in a coarse-grained manner.
     ▶ A GPU thread is assigned to a data chunk.
     ▶ Tune the degree of parallelism to keep every thread busy.
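As an illustration of why a canonical codebook is compact and needs no tree for decoding, here is a minimal Python sketch of the textbook construction: codes are assigned in order of (bitwidth, symbol), so the code lengths alone determine the whole codebook. The symbol names, the tie-breaking order, and the dictionary-based decoder are assumptions made for illustration and are not taken from CUSZ's codec.

```python
def canonical_codebook(bitwidths):
    """Build canonical codes from code lengths: {symbol: bitwidth} -> {symbol: (bitwidth, code)}."""
    book = {}
    code = 0
    prev_width = 0
    # Shorter codes first; ties are broken by symbol order (an assumed convention).
    for symbol, width in sorted(bitwidths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (width - prev_width)     # extend the running code when the length grows
        book[symbol] = (width, code)
        code += 1                         # next code of the same length is the successor
        prev_width = width
    return book

def decode(bits, book):
    """Decode a string of '0'/'1' characters by table lookup, with no Huffman tree."""
    lookup = {wc: s for s, wc in book.items()}
    out, width, code = [], 0, 0
    for b in bits:
        code = (code << 1) | int(b)
        width += 1
        if (width, code) in lookup:       # a complete codeword has been read
            out.append(lookup[(width, code)])
            width, code = 0, 0
    return out

# Code lengths as they might come out of a Huffman tree build (hypothetical values).
book = canonical_codebook({"A": 1, "B": 2, "C": 3, "D": 3})
symbols = decode("010110111", book)
```

Only the bitwidth per symbol has to be stored or transmitted, since the codes themselves can be regenerated from the lengths.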
