
DSS Data & Storage Services: LZ4 HC COMPRESSION for ROOT and IO



  1. DSS Data & Storage Services
     LZ4 HC COMPRESSION for ROOT and IO: Baseline Evaluation (= compression x speed x size)
     ROOT IO Workshop, 6.12.2013 (Friday, December 6, 13)
     Andreas-Joachim Peters, IT-DSS-TD
     CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it

  2. Contents
     • Overview of Compression Algorithms
       • baseline for expectation
     • Implementation
       • format for multithreaded encoding and single-threaded decoding
       • implementation in ROOT
     • Results
       • benchmarks for various trees
       • IO baseline measurement

  3. Comparison of Compression Algorithms

     Core i5-3340M @2.7GHz, using the Open-Source Benchmark by m^2 (v0.14.2) compiled with GCC v4.6.1

     | Algorithm          | Compr. Ratio | Encoding Speed [MB/s] | Decoding Speed [MB/s] |
     |--------------------|--------------|-----------------------|-----------------------|
     | LZ4                | 2.084        | 422                   | 1820                  |
     | Snappy             | 2.091        | 323                   | 1070                  |
     | ZLIB 1.2.8 level=1 | 2.730        | 65                    | 280                   |
     | ZLIB 1.2.8 level=6 | 3.099        | 21                    | 300                   |
     | LZ4HC              | 2.720        | 25                    | 2080                  |

     Looking at this table: LZ4HC looks very interesting for workflows where n(read) > n(write) ... but the story is not at its end ...
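
The n(read) > n(write) argument can be made concrete with a small cost model. The sketch below reuses the speeds quoted above and assumes they refer to MB of uncompressed data per CPU second; the function names `cpu_seconds` and `break_even_reads` are hypothetical, and IO volume is ignored.

```python
# Encoding/decoding speeds in MB/s, copied from the benchmark table above.
# Assumption of this sketch: the speeds count uncompressed megabytes.
CODECS = {
    "LZ4": (422, 1820),
    "Snappy": (323, 1070),
    "ZLIB level=1": (65, 280),
    "ZLIB level=6": (21, 300),
    "LZ4HC": (25, 2080),
}


def cpu_seconds(codec, size_mb, n_writes, n_reads):
    """CPU time spent compressing n_writes times and decompressing n_reads times."""
    enc, dec = CODECS[codec]
    return n_writes * size_mb / enc + n_reads * size_mb / dec


def break_even_reads(a, b):
    """Reads per write above which codec a costs less CPU than codec b.

    Only meaningful when a decodes faster than b; a negative result means
    a is already cheaper with zero reads.
    """
    enc_a, dec_a = CODECS[a]
    enc_b, dec_b = CODECS[b]
    return (1 / enc_a - 1 / enc_b) / (1 / dec_b - 1 / dec_a)


for n_reads in (1, 10, 100):
    costs = {name: cpu_seconds(name, 1000, 1, n_reads) for name in CODECS}
    line = ", ".join(f"{name}: {sec:.1f}s" for name, sec in costs.items())
    print(f"1 write + {n_reads} reads of 1000 MB -> {line}")

print("LZ4HC vs ZLIB level=1 break-even reads per write:",
      round(break_even_reads("LZ4HC", "ZLIB level=1"), 2))
```

Under these assumptions LZ4HC overtakes ZLIB level=1 in total CPU time after roughly eight reads per write, while plain LZ4 is cheaper still but compresses less.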

  4. ROOT IO Performance Composition

     [Diagram: the read path as a chain of stages, each a possible bottleneck ("various bottlenecks").]
     • Local/Remote IO: uncached 0-500 MB/s, cached 2 GB/s
     • Decompression: 25-300 MB/s
     • Event Assembly: 5-2200 MB/s
     • User Analysis: anything (Select, Compute, Draw)

     Event assembly as shown on the slide: TTree::GetEntry(i) { [ read ] [ unzip ] for ( branches ) { TBranch::GetEntry() } }

     Optimization labels on the slide: async IO, TTreeCache, vector IO, Async Prefetch, readahead (IO Optimization); parallel unzip; Parallelization.

  5. Simple approach for multithreaded encoding ...

     LZ4HC decoding is extremely fast [2 GB/s] and does not need parallelism, while encoding is slow => try a parallel approach with low code-change impact in ROOT, at the lowest level (inside R__zip) ...

     [Diagram: ZLIB/LZMA compression in ROOT next to LZ4(HC) compression in ROOT, both between R__zip and R__unzip.]
     • Input buffer: defined by basket size; buffer size: [bytes .. MB]
     • Compressed buffer: defined by contents; compr. ratio: [1:1 .. 1:10]
     • LZ4(HC) path: adaptive chunking, encoding thread pool (8 threads), memmove
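
The idea of encoding chunks in a thread pool while decoding serially can be sketched as follows. This is not the ROOT implementation: `zlib` from the Python standard library stands in for LZ4HC (which is not in the standard library), and the helper names are hypothetical. Only the chunk size of 64k and the pool of eight threads are taken from the slides.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 64 * 1024  # smallest chunk size listed in the talk
N_WORKERS = 8           # the talk uses a pool of eight encoder threads


def split_chunks(buf, chunk_size=CHUNK_SIZE):
    """Cut the input buffer into consecutive chunks of at most chunk_size bytes."""
    return [buf[i:i + chunk_size] for i in range(0, len(buf), chunk_size)]


def compress_parallel(buf, pool, chunk_size=CHUNK_SIZE):
    """Compress every chunk independently on the pool; order is preserved.

    zlib is a stand-in for LZ4HC in this sketch.
    """
    return list(pool.map(zlib.compress, split_chunks(buf, chunk_size)))


def decompress_serial(chunks):
    """Decode the chunks one after another on the calling thread."""
    return b"".join(zlib.decompress(chunk) for chunk in chunks)


basket = bytes(range(256)) * 2048  # 512 KiB of compressible test data
with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    encoded = compress_parallel(basket, pool)

assert decompress_serial(encoded) == basket
print(len(encoded), "chunks,", sum(map(len, encoded)), "compressed bytes")
```

A buffer smaller than one chunk yields a single task, so the pool cannot help there; this mirrors the remark in the summary that baskets need to span several 64k chunks before multithreading pays off.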

  6. LZ4 HC compressed buffer format

     Header (9 bytes, identical to ZLIB/LZMA):

     | Field       | Width  |
     |-------------|--------|
     | 'X'         | 8 bit  |
     | 'Y'         | 8 bit  |
     | <chunktype> | 8 bit  |
     | enc. size   | 24 bit |
     | dec. size   | 24 bit |

     The header is followed by the chunks, each prefixed with its encoded size:
     • Chunk 1 enc. size (24 bit, 3 bytes), then Chunk 1 BODY
     • Chunk 2 enc. size (24 bit), then Chunk 2 BODY
     • ...

     <chunktype> values:

     | ID | Size |
     |----|------|
     | 1  | 64k  |
     | 2  | 128k |
     | 3  | 256k |
     | 4  | 512k |
     | 5  | 1M   |
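
A minimal sketch of packing and unpacking a buffer with this layout is given below. Several details are assumptions, since the slide does not state them: the bytes 'X' and 'Y' are used literally as the magic, the 24-bit sizes are written little-endian, and the header's enc. size is taken to cover everything after the 9-byte header. The function names are hypothetical.

```python
# Chunk type ID -> chunk size in bytes, as listed on the slide.
CHUNK_TYPES = {1: 64 * 1024, 2: 128 * 1024, 3: 256 * 1024,
               4: 512 * 1024, 5: 1024 * 1024}
MAGIC = b"XY"      # assumption: the slide's 'X' and 'Y' taken literally
HEADER_SIZE = 9


def u24(value):
    """Encode an unsigned 24-bit integer (little-endian is an assumption)."""
    if not 0 <= value < 1 << 24:
        raise ValueError("value does not fit into 24 bits")
    return value.to_bytes(3, "little")


def pack(chunktype, dec_size, bodies):
    """Build header + (3-byte enc size, body) for each already-encoded chunk."""
    if chunktype not in CHUNK_TYPES:
        raise ValueError("unknown chunk type")
    payload = b"".join(u24(len(body)) + body for body in bodies)
    header = MAGIC + bytes([chunktype]) + u24(len(payload)) + u24(dec_size)
    return header + payload


def unpack(buf):
    """Return (chunktype, dec_size, list of encoded chunk bodies)."""
    if buf[:2] != MAGIC:
        raise ValueError("bad magic")
    chunktype = buf[2]
    enc_size = int.from_bytes(buf[3:6], "little")
    dec_size = int.from_bytes(buf[6:9], "little")
    bodies, pos, end = [], HEADER_SIZE, HEADER_SIZE + enc_size
    while pos < end:
        length = int.from_bytes(buf[pos:pos + 3], "little")
        pos += 3
        bodies.append(buf[pos:pos + length])
        pos += length
    return chunktype, dec_size, bodies


packed = pack(1, 10, [b"abc", b"defgh"])
print(len(packed), unpack(packed))
```

Because each chunk carries its own length prefix, a decoder can walk the buffer sequentially without an index, which fits single-threaded decoding.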

  7. LZ4 HC ROOT Implementation
     • ROOT compression is part of libCore, which has no access to threading. The first prototype used C++11 threads; it now uses native ROOT threads + semaphores.
     • Single-threaded encoding and decoding are implemented in class ZipLZ4 under root/core/lz4/ in libCore.
     • Multithreaded encoding is implemented in class ZipLZ4mt in root/io/io/src/, which installs a singleton pointer on load and starts eight worker threads. [ the construction and destruction of the thread pool is currently tied to libRIO - to be reviewed ]
     • The compression code is ~2.5k lines of C (6 files) from https://code.google.com/p/lz4/ [LZ4 r108].
     • A Javascript implementation is available at https://github.com/pierrec/node-lz4

  8. LZ4 HC ROOT Benchmarks

     Event.root (eventexe 100000 1), Basket-Size: 50kb - 2.5M; 21 Leaves

     | Compressor       | Level | Ratio                                | Compression Speed            | Decompression Speed |
     |------------------|-------|--------------------------------------|------------------------------|---------------------|
     | GZIP (Reference) | 1     | 2.15 (L=1), 2.27 (L=6), 2.35 (L=9)   | 21 (L=1), 10 (L=6), 1.9 (L=9) | 88                  |
     | LZ4              | >1    | 1.68                                 | 52                           | 178                 |
     | LZ4HC            | 1     | 2.02                                 | 12                           | 188                 |
     | LZ4HC (mt)       | 1     | 2.02                                 | 32                           | 188                 |
     | LZMA             | 1     | 2.71                                 | 7                            | 24                  |
     | uncompressed     | 0     | 1.0                                  | 57                           | 200                 |

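  9. LZ4 HC ROOT Benchmarks

     | NTUPLE            | File Size [b] | Default Compr. Type | Default Compr. Ratio | Branch/Leafs | #Events | Default Read [s]       | LZ4HC Read [s]       | LZ4HC IO rate [MB/s] | Size change       | CPU usage change   | IO rate change      |
     |-------------------|---------------|---------------------|----------------------|--------------|---------|------------------------|----------------------|----------------------|-------------------|--------------------|---------------------|
     | ATLAS SUSY        | 4.5G          | ZLIB                | 3.84                 | 7K           | 55K     | 496                    | 445                  | 11.8                 | +17.0%            | -10.5%             | +28%                |
     | ATLAS SUSY, 10%   |               |                     |                      |              |         | 138                    | 64                   | 82.2                 |                   | -53.7%             | +583%               |
     | ATLAS HIGGS       | 856M          | ZLIB                | 2.47                 | 5.8K         | 12K     | 62                     | 54                   | 17.8                 | +11.4%            | -13%               | +12%                |
     | ATLAS HIGGS, 10%  |               |                     |                      |              |         | 22                     | 8.8                  | 110                  |                   | -60%               | +130%               |
     | ALICE             | 230M          | ZLIB                | 5.4                  | 423          | 657     | 12.4                   | 9.1                  | 26.2                 | +10%              | -26%               | +45%                |
     | CMS Higgs Events  | 2.5 GB        | ZLIB                | 5.04                 | 305          | 1.4M    | 229                    | 213 (L=9), 229 (L=1) | 13.3                 | see note          | see note           | see note            |
     | CMS PHOTON        | 3 GB          | ZLIB                | 4.21                 | 5k           | 14K     | 570                    |                      |                      | +21%              | N.N                | N.N                 |
     | CMS USER NTUPLE   | 1.8G          | ZLIB                | 2.6                  | 14           | 81M     | 110                    | 90                   | 24                   | +22%              | -18%               | +50%                |
     | LHCB              | 1.5G          | LZMA                | 3.0                  | 76           | 232k    | 264 (LZMA), 190 (zlib) | 119                  | 19.1                 | +50%, +5% (zlib)  | -55%, -37% (zlib)  | +230%, +71% (zlib)  |

     Note on CMS Higgs Events: the change values on the slide read "+4%", "+20 % (L=9)", "+13% (L=9)", "+-0 % (L=1)", "+0%", "+0%"; their assignment to the three change columns is not clear from the slide text.

     For the CMS Photon file I was missing the class libraries needed to read it after conversion.

     In the ATLAS cases, running CloneTree produced 5-10% larger files using the default ZIP compression [basket size optimization?]. CloneTree is incredibly slow!!!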

  10. LZ4 HC ROOT Benchmarks
     • LZ4HC-compressed trees are not always faster to read than ZLIB ones. If overall network IO is the bottleneck, LZ4HC is already ruled out.
     • It is not entirely clear why LZ4HC and ZLIB behave differently for the tested trees: the usability of LZ4HC depends strongly on the input data.
     • In general it would be good to have a fast conversion function in ROOT that re-compresses baskets (in parallel) with a different algorithm, e.g. it takes 20 minutes to convert the Higgs Event tree (2.5 MB/s).
     • ROOT should implement a fast benchmark probe function showing the performance results for a subset of events in a given tree.
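
What such a probe function might report can be sketched outside ROOT. The example below is only an illustration: it works on a plain byte sample instead of baskets from a tree, it uses the standard-library `zlib` and `lzma` codecs because LZ4 is not part of the Python standard library, and the name `probe` is hypothetical.

```python
import lzma
import time
import zlib

# Codec name -> (encode, decode). LZ4/LZ4HC would be added here if a binding
# were available; that is left out to keep the sketch self-contained.
CODECS = {
    "zlib-1": (lambda raw: zlib.compress(raw, 1), zlib.decompress),
    "zlib-6": (lambda raw: zlib.compress(raw, 6), zlib.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def probe(sample, codecs=CODECS):
    """Measure ratio and encode/decode speed of each codec on one sample."""
    results = {}
    megabytes = len(sample) / 1e6
    for name, (encode, decode) in codecs.items():
        t0 = time.perf_counter()
        packed = encode(sample)
        t1 = time.perf_counter()
        restored = decode(packed)
        t2 = time.perf_counter()
        if restored != sample:
            raise RuntimeError(f"{name}: round trip failed")
        results[name] = {
            "ratio": len(sample) / len(packed),
            "enc_mb_s": megabytes / max(t1 - t0, 1e-9),
            "dec_mb_s": megabytes / max(t2 - t1, 1e-9),
        }
    return results


sample = b"pt=12.5 eta=0.3 phi=1.2 charge=-1;" * 20000
for name, row in probe(sample).items():
    print(f"{name:7s} ratio {row['ratio']:.2f}  "
          f"enc {row['enc_mb_s']:.0f} MB/s  dec {row['dec_mb_s']:.0f} MB/s")
```

Running the probe on a small subset of real data before converting a whole tree would show early whether a given codec suits that input, which is the dependence on input data noted above.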

  11. ROOT IO Baseline

     Setup: Xeon 2.27GHz 8-core, 16GB DDR3. Event class (no split mode) with one binary blob member whose size is varied; basketsize=64k; 5000 events. Events are filled with random bytes and compress 1:1.45.

     No-ROOT reference: open(file); while(read(ev-size)); close(file). Certainly one would read at least 32k and not small single events with a single read call!

     [Plot 1: read rate in MB/s (0 to 4500) vs. event size (128 bytes to 512M) for No ROOT, ROOT no Compr. and ROOT LZ4. Annotations: "read sys-call limited" at small event sizes; "1 x memcpy(ev-size) = 50%"; "~2 x memcpy(ev-size) = 25%".]

     [Plot 2: performance relative to non-ROOT IO (0% to 60%) vs. event size (128 bytes to 64M) for ROOT no comp. and ROOT LZ4.]
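
The No-ROOT reference loop can be sketched with plain system calls. This is an illustration under assumptions, not the benchmark used for the plots: the file is a small temporary file rather than a 5000-event sample, and the function name `raw_read_rate` is hypothetical.

```python
import os
import tempfile
import time


def raw_read_rate(path, event_size):
    """Read a file with one read() call per event; return (bytes, MB/s)."""
    flags = os.O_RDONLY | getattr(os, "O_BINARY", 0)  # O_BINARY exists on Windows only
    total = 0
    start = time.perf_counter()
    fd = os.open(path, flags)
    try:
        while True:
            block = os.read(fd, event_size)
            if not block:       # empty result signals end of file
                break
            total += len(block)
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    return total, total / 1e6 / max(elapsed, 1e-9)


# Demo on a 4 MiB temporary file of random bytes (likely served from page cache).
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(os.urandom(4 * 1024 * 1024))
    demo_path = handle.name

try:
    for event_size in (128, 4096, 32 * 1024, 1024 * 1024):
        nbytes, rate = raw_read_rate(demo_path, event_size)
        print(f"event size {event_size:>8d} B: {nbytes} bytes at {rate:.0f} MB/s")
finally:
    os.remove(demo_path)
```

With very small event sizes the loop is dominated by the number of read calls, which is the "read sys-call limited" regime marked on the plot.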

  12. IO Efficiency for example trees using default compression

     [Chart 1: for ALICE AOD, ATLAS SUSY, ATLAS HIGGS, CMS HIGGS, CMS USER, CMS PHOTON and LHCB MDST (LZMA!), at 100% CPU: Read IO in MB/s of uncompressed event data per CPU second, and event size in kb.]

     [Chart 2: CPU efficiency of IO in % compared to the 64MB read size speed, for the same trees, normalized to non-ROOT and to LZ4.]

     This is an area to invest more work. This inefficiency wastes CPU cycles on arranging data in memory rather than analyzing them => re-evaluate the cost of minimal data size and framework flexibility.

  13. Summary
     • LZ4HC compression trades inferior compression for lower CPU usage. It seems to be a good choice in use cases with a lower number of branches or partial event reads.
     • Less compression and faster decompression result in higher bandwidth requirements to reach 100% CPU usage.
     • To gain from multithreaded LZ4HC encoding, basket sizes must be at least of the order of several 64kb; otherwise multithreading does not result in faster compression.
     • It might be interesting to apply multithreaded encoding & decoding to LZMA, which gives the best compression but at low speed. It helps single-client performance but certainly costs CPU.
     • With LZ4HC compression the inefficiency in the event assembly becomes more evident. This should be a focus for the future, with the goal not just to parallelize it (using even more CPU) but to reduce the CPU needed to assemble events in memory (maybe never really convert them into C++ objects - just proxy).
