Compressing Coldbox Data
Ivan K. Furic, Remington Gerras
University of Florida

ProtoDUNE-SP TDR:
• Lossless compression factor = 4
• Implies reduction from 12 bits per ADC readout to 3 bits per ADC readout
• The rest of this talk does not quote compression factors, only average bits per ADC readout
• Hence, keep in mind:
  • “3 bits” = TDR spec (compression factor 4)
  • “4 bits” = compression factor 3
  • “6 bits” = compression factor 2

How well does a generic algorithm work?
• ROOT’s native compression for 10 events, 1536 channels
• 10k ADC readouts per channel per event, 2 bytes per ADC readout
• Compressed: avg 5.73 bits per ADC readout (effective compression factor 2.1 relative to 12 bits, about half of the TDR spec)

Using “gzip -9” explicitly
• Store the data for a single channel in a file, then compress the file
• Performance depends on how the bits are packed in the file
• Convention in figures below: 12 bits = 3 nibbles: H, M, L
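The dependence on bit packing can be explored with a small sketch. It uses Python's `zlib` at level 9, which applies the same DEFLATE algorithm as `gzip -9` but with a different container header. The synthetic pedestal (mean 2000, RMS 7) and the two packing layouts are assumptions made for illustration; they are not the layouts or data used in the talk.

```python
import random
import struct
import zlib

def pack_words(samples):
    """Store each 12-bit sample in its own 16-bit little-endian word."""
    return struct.pack(f"<{len(samples)}H", *samples)

def pack_nibble_planes(samples):
    """Store all H nibbles, then all M nibbles, then all L nibbles (two per byte)."""
    out = bytearray()
    for shift in (8, 4, 0):  # H, M, L
        nibbles = [(s >> shift) & 0xF for s in samples]
        if len(nibbles) % 2:
            nibbles.append(0)  # pad the plane to a whole number of bytes
        for i in range(0, len(nibbles), 2):
            out.append((nibbles[i] << 4) | nibbles[i + 1])
    return bytes(out)

def bits_per_readout(packed, n_samples):
    """Average compressed size per ADC readout, in bits, at zlib level 9."""
    return 8 * len(zlib.compress(packed, 9)) / n_samples

# Synthetic single-channel waveform: Gaussian pedestal noise (assumed values).
rng = random.Random(1)
samples = [min(4095, max(0, round(rng.gauss(2000, 7)))) for _ in range(10_000)]

for name, packer in (("16-bit words", pack_words), ("nibble planes", pack_nibble_planes)):
    print(f"{name}: {bits_per_readout(packer(samples), len(samples)):.2f} bits per readout")
```

Running it on real coldbox waveforms instead of the synthetic list would show how much of the difference between layouts survives with realistic noise.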

What RMS will compress into 3 bits?
• Consider the “ideal” case for compression: a uniform distribution of values
• A uniform distribution across D consecutive discrete values has an RMS of $\sigma = D/\sqrt{12}$; $D = \sigma\sqrt{12}$ is the width of a flat distribution needed for a given $\sigma$
• To encode D discrete values, one requires $\log_2(D)$ bits:
• $N_{\text{bits}} = \log_2(D) = \log_2(\sigma\sqrt{12}) = \log_2(\sigma) + \tfrac{1}{2}\log_2(12) = \log_2(\sigma) + 1.8$
• In order to encode into 3 bits of data, the RMS of the distribution can’t be more than 2.3 ADC counts
• Observed pedestal RMS’s are 6-8 ADC counts
• Encoding raw values will not provide the desired compression
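The flat-distribution estimate on this slide can be checked numerically. The sketch below only evaluates the slide's formula; the function names are made up for illustration.

```python
import math

def bits_for_flat_rms(sigma):
    """Bits needed to index a flat distribution of RMS sigma: log2(sigma * sqrt(12))."""
    return math.log2(sigma * math.sqrt(12))

def max_rms_for_bits(n_bits):
    """Largest RMS (in ADC counts) of a flat distribution that fits into n_bits."""
    return 2 ** n_bits / math.sqrt(12)

print(f"max RMS for 3 bits: {max_rms_for_bits(3):.1f} ADC counts")
for sigma in (6, 8):
    print(f"RMS {sigma}: {bits_for_flat_rms(sigma):.1f} bits")
```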

Information Theory limits on compression
• For a stochastic noiseless source emitting a set of symbols with frequencies p_i, the minimum average number of bits per symbol is the (Shannon) entropy: $H = -\sum_i p_i \log_2 p_i$
• Shannon, Claude E. (July–October 1948). "A Mathematical Theory of Communication". Bell System Technical Journal. 27 (3): 379–423.
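As a minimal sketch, the entropy of an observed symbol stream can be estimated from its symbol frequencies. The function name is an illustrative choice, and the estimate treats symbols as independent, so it ignores any correlation between consecutive readouts.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Entropy in bits per symbol, using observed frequencies as p_i."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Four equally likely symbols need 2 bits each.
print(shannon_entropy([0, 1, 2, 3]))
```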

Gaussian distributed discrete random values
• Huffman compression achieves Shannon entropy level of performance
• Need RMS of 2 bins to compress into 3 bits
• RMS of 4 bins should compress into 4 bits
• RMS’s of 6-8 bins should compress into 4.6-5.0 bits
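These numbers can be reproduced approximately with a short sketch that bins a zero-mean Gaussian into integer ADC counts, then compares the Shannon entropy with the average code length of a Huffman code built for the same probabilities. The binning range (`half_width`) and all function names are assumptions for illustration.

```python
import heapq
import math

def gaussian_bin_probs(sigma, half_width=64):
    """Probability of each integer bin in [-half_width, half_width] for a zero-mean Gaussian."""
    def cdf(x):
        return 0.5 * (1 + math.erf(x / (sigma * math.sqrt(2))))
    probs = [cdf(k + 0.5) - cdf(k - 0.5) for k in range(-half_width, half_width + 1)]
    total = sum(probs)
    return [p / total for p in probs if p > 0]  # drop bins that underflow to zero

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs)

def huffman_avg_length(probs):
    """Average code length of a Huffman code for the given probabilities."""
    heap = list(probs)
    heapq.heapify(heap)
    avg = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        avg += a + b  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, a + b)
    return avg

for sigma in (2, 4, 6, 8):
    probs = gaussian_bin_probs(sigma)
    print(f"RMS {sigma}: entropy {entropy(probs):.2f} bits, "
          f"Huffman {huffman_avg_length(probs):.2f} bits")
```

A Huffman code is guaranteed to stay within one bit of the entropy; the printout shows how close it gets for these distributions.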

Variable Distributions, Run #1287
• Consider three variables as targets to encode using a compression algorithm:
  • Raw ADC counts: $X_n$
  • Difference wrt previous count: $X_n - X_{n-1}$
  • Difference wrt linear prediction (based on previous two counts): $X_n - 2X_{n-1} + X_{n-2}$
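A minimal sketch of the two derived variables, with an inverse for the linear-prediction residual to show that the transformation is lossless. Function names and the sample waveform are illustrative, not taken from the analysis code.

```python
def differences(x):
    """X_n - X_{n-1}, defined from the second sample onward."""
    return [x[n] - x[n - 1] for n in range(1, len(x))]

def linear_prediction_residuals(x):
    """X_n - (2*X_{n-1} - X_{n-2}), defined from the third sample onward."""
    return [x[n] - 2 * x[n - 1] + x[n - 2] for n in range(2, len(x))]

def reconstruct_from_residuals(first_two, residuals):
    """Invert linear_prediction_residuals given the first two raw samples."""
    x = list(first_two)
    for r in residuals:
        x.append(r + 2 * x[-1] - x[-2])
    return x

waveform = [2000, 2003, 2001, 2004]  # made-up ADC counts
print(differences(waveform))                  # [3, -2, 3]
print(linear_prediction_residuals(waveform))  # [-5, 5]
```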

Variable Distribution RMS’s
[Figure: per-variable RMS distributions. Panels: Raw ADC Counts, Difference, Linear Prediction]

Truncated Huffman compression
• Raw ADC counts: tree encodes values seen in event
• For target variables, expect most values are in the range [-16,16]
• Huffman-encode only this window
• RAW + target: have additional (13-14 bit) Huffman code for “value outside range”, followed by full 12-bit value
• 25 bit penalty for data not under control
• Compression performance will be worse than Shannon entropy
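The scheme described on this slide can be sketched as follows, here applied to the difference wrt the previous count. This is an illustration under stated assumptions, not the authors' implementation: the `ESCAPE` marker, the function names, sending the first sample raw, and giving every window symbol a starting count of 1 are all choices made for the example. The escape code length depends on the trained frequencies, so it will not be 13-14 bits on a toy sample.

```python
import heapq
from collections import Counter
from itertools import count

WINDOW = 16      # differences in [-WINDOW, WINDOW] get their own Huffman code
ESCAPE = "ESC"   # hypothetical marker symbol for "value outside range"
RAW_BITS = 12    # an escape is followed by the full 12-bit ADC value

def build_codes(freqs):
    """Return {symbol: bitstring} for a Huffman code over the given frequencies."""
    tiebreak = count()  # keeps heap entries comparable when frequencies are equal
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def train(diffs):
    """Build a tree from observed differences; every symbol starts with a count of 1."""
    freqs = Counter({v: 1 for v in range(-WINDOW, WINDOW + 1)})
    freqs[ESCAPE] = 1
    for d in diffs:
        freqs[d if -WINDOW <= d <= WINDOW else ESCAPE] += 1
    return build_codes(freqs)

def encode(samples, codes):
    """Encode 12-bit samples; the first sample is sent raw."""
    bits = [format(samples[0], "012b")]
    for prev, cur in zip(samples, samples[1:]):
        d = cur - prev
        if -WINDOW <= d <= WINDOW:
            bits.append(codes[d])
        else:
            bits.append(codes[ESCAPE] + format(cur, "012b"))
    return "".join(bits)

def decode(bitstring, codes, n_samples):
    """Invert encode(); n_samples tells the decoder when to stop."""
    lookup = {b: s for s, b in codes.items()}
    samples = [int(bitstring[:RAW_BITS], 2)]
    pos = RAW_BITS
    while len(samples) < n_samples:
        end = pos + 1
        while bitstring[pos:end] not in lookup:  # grow until a full code is read
            end += 1
        sym = lookup[bitstring[pos:end]]
        pos = end
        if sym == ESCAPE:
            samples.append(int(bitstring[pos:pos + RAW_BITS], 2))
            pos += RAW_BITS
        else:
            samples.append(samples[-1] + sym)
    return samples

samples = [2000, 2003, 2001, 2004, 2600, 2598, 2601]  # made-up waveform with one jump
codes = train([b - a for a, b in zip(samples, samples[1:])])
encoded = encode(samples, codes)
print(f"{len(encoded) / len(samples):.2f} bits per readout")
```

Representing the bitstream as a Python string of "0" and "1" characters keeps the sketch readable; a firmware or production decoder would operate on packed bits and a lookup table instead.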

Performance on Run #1287
• Green = Shannon entropy
• Blue = Channel+Event specific Huffman Trees
• Red = Use one (random) Huffman Tree for all data
[Figure: distributions of avg bits per ADC word observed per channel, per event. Panels: Encode Raw Values, Encode Differences, Encode wrt Linear Prediction]
• Raw data requires lots of custom Huffman Trees
• Encoding diff wrt linear prediction works best (avg less than 4 bits per ADC word)

Performance Loss For Generic Trees
[Figure panels: Encode Differences, Encode wrt Linear Prediction]
• For the two target variables, lose a fraction of a bit in performance
• Linear predictor loss is better contained, i.e. performance is more predictable

Raw ADC Value Correlation Factors
• Reproduced correlations observed by Tom in run 973
• Data in run 1287 appears to be much less correlated

What’s different between the two runs?
[Figure panels: Run #973 and Run #1287, each showing the Raw ADC Channel-Channel Correlation Factor]
• Run 1287 has no correlation factors greater than ~10%
• Run 973 has a significant tail in the RMS distribution
• Possibly due to slow noise in the electronics?
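For reference, a channel-channel correlation factor can be computed as the Pearson correlation coefficient between two waveforms. Whether the plots use exactly this definition is an assumption; the function name and the sample values are illustrative.

```python
import math

def correlation(a, b):
    """Pearson correlation coefficient between two equal-length waveforms."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Two made-up channels drifting in opposite directions are anti-correlated.
print(correlation([2000, 2002, 2004, 2006], [2006, 2004, 2002, 2000]))
```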

Example: Anti-correlation from slow noise
• Waveform for first event, channels 1199 vs 1216
• Causes significant increase in RMS; the two channels are almost 100% anti-correlated

Comparison of variable RMS’s per channel:
• Run 973 overall behavior of target variables is “better” than 1287
• Expect run 973 to compress better than run 1287

Compression performance on run 1287 vs 973
• Encoding difference wrt previous ADC count

Compression Performance, run 1287 vs 973, cont’d
• Encoding difference wrt linear prediction

Estimated Event Size
• ProtoDUNE-SP TDR spec is to compress 230.4 MB of TPC data into 57.6 MB
• Run compression test on 10 events, for both runs, record number of bits used
• Run 1287 conveniently reads out 1536 channels, 1/10th of full ProtoDUNE-SP
• Run 973 has 2304 channels reading out, scale numbers by 1536/2304

Run Number      Difference,    Difference,   Linear Prediction,  Linear Prediction,  Size wrt
                Custom Trees   Single Tree   Custom Trees        Single Tree         TDR Spec
1287            72.5 MB        73.4 MB       71.5 MB             72.2 MB             +25%
0973 (scaled)   70.3 MB        71.1 MB       70.3 MB             70.4 MB             +22%

• 25% larger event size than required by TDR spec
• ADC readout encoded on avg in 3.75 bits (TDR spec is 3)
• Compression factor 3.20 (TDR spec is 4)
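The conversion between compressed event size, compression factor, and average bits per readout is simple arithmetic, sketched below. The constants are the figures quoted on this slide; the function names are illustrative, and rounding of the inputs means the outputs only approximately match the quoted 3.75 bits and factor 3.20.

```python
RAW_BITS = 12          # bits per uncompressed ADC readout
RAW_EVENT_MB = 230.4   # uncompressed event size quoted from the TDR spec
TDR_FACTOR = 4         # compression factor required by the TDR spec

def scale_to_reference(size_mb, n_channels, ref_channels=1536):
    """Scale a measured size to the 1536-channel reference used for run 1287."""
    return size_mb * ref_channels / n_channels

def summarize(compressed_mb):
    """Return (compression factor, avg bits per readout, fractional excess over spec)."""
    factor = RAW_EVENT_MB / compressed_mb
    avg_bits = RAW_BITS / factor
    excess = compressed_mb / (RAW_EVENT_MB / TDR_FACTOR) - 1
    return factor, avg_bits, excess

for run, size_mb in (("1287", 72.2), ("0973 (scaled)", 70.4)):
    factor, avg_bits, excess = summarize(size_mb)
    print(f"run {run}: factor {factor:.2f}, {avg_bits:.2f} bits per readout, {excess:+.0%} wrt spec")
```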

Conclusions, so far
• Evaluated compression performance on coldbox data
• Found two good candidate variables for encoding
• Evaluated encoding with “truncated” Huffman compression
• Found approach to be generic and robust
• ~1% penalty for sub-optimal encoding tree, even across events
• Expect similar performance for hard-coded common tree for all channels, all events (simplifies firmware implementation)
• No performance loss in presence of “slow” noise
• Estimate compressed event size to be 25% larger than TDR spec
• No significant channel noise cross-correlation observed (in run #1287)
• Likely not much to gain from combining information across channels
• Found promising correlations with ADC counts earlier in the stream (further reduce avg RMS by 10%, i.e. 5% better compression)

Plans
• Check cross-channel correlation between encoding variables
• Re-check gzip performance on larger sample of events
• Attempt to utilize information from earlier in the stream to further shrink target variable RMS
• Choose single, hardcoded compression tree
• Optimize decompression algorithm for speed, report performance
• Study per-event compression performance on larger sample (e.g. entire run 1287)
• Try “gzip -9” on compressed output
• Any other tests?
• Report back with final findings, document
