
Compressing Coldbox Data Ivan K. Furic, Remington Gerras University - PowerPoint PPT Presentation



  1. Compressing Coldbox Data Ivan K. Furic, Remington Gerras University of Florida

  2. ProtoDUNE-SP TDR: • Lossless compression factor = 4 • Implies reduction from 12 bits per ADC readout to 3 bits per ADC readout • The rest of this talk does not discuss factors, only average bits per ADC readout • Hence, keep in mind: • “3 bits” = TDR spec (compression factor 4) • “4 bits” = compression factor 3 • “6 bits” = compression factor 2
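
The conversion between average bits per ADC readout and compression factor used throughout can be sketched as below. The function name is illustrative; the only input taken from the slide is the 12-bit raw readout size.

```python
RAW_BITS = 12  # bits per uncompressed ADC readout, as stated on the slide


def compression_factor(avg_bits_per_readout):
    """Compression factor relative to a raw 12-bit readout."""
    return RAW_BITS / avg_bits_per_readout


for bits in (3, 4, 6):
    print(f"{bits} bits per readout -> compression factor {compression_factor(bits):g}")
```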

  3. How well does a generic algorithm work? • ROOT’s native compression for 10 events, 1536 channels • 10k ADC readouts per channel per event, 2 bytes per ADC readout • Compressed: avg 5.73 bits per ADC readout [effective compression factor 2.1, half of the TDR spec]

  4. Using “gzip -9” explicitly • Store data for a single channel in a file, compress • Performance depends on how the bits are packed in the file • Convention in figures below: 12 bits = 3 nibbles: H,M,L
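
A rough sketch of why packing matters, using Python's `zlib` at level 9 (the same DEFLATE algorithm as `gzip -9`, in a different container). The two packings and the synthetic waveform (pedestal 900, Gaussian noise with an RMS of 7 counts) are assumptions made for illustration; they are not the packings or data shown in the original figures.

```python
import random
import zlib


def pack_16bit(samples):
    """Store each 12-bit sample in its own 2-byte big-endian word (0H, ML)."""
    out = bytearray()
    for s in samples:
        out += s.to_bytes(2, "big")
    return bytes(out)


def pack_12bit(samples):
    """Pack two 12-bit samples into three bytes (HM, LH, ML); needs an even count."""
    out = bytearray()
    for a, b in zip(samples[0::2], samples[1::2]):
        out += bytes([a >> 4, ((a & 0xF) << 4) | (b >> 8), b & 0xFF])
    return bytes(out)


def bits_per_readout(packed, n_samples):
    """Average compressed bits per ADC readout after DEFLATE at level 9."""
    return 8 * len(zlib.compress(packed, 9)) / n_samples


# Synthetic single-channel waveform (hypothetical pedestal and noise level).
rng = random.Random(1)
samples = [min(4095, max(0, round(rng.gauss(900, 7)))) for _ in range(10000)]

for name, packer in (("16-bit words", pack_16bit), ("packed 12-bit", pack_12bit)):
    print(name, round(bits_per_readout(packer(samples), len(samples)), 2))
```

Byte-aligned words tend to suit a byte-oriented compressor better than samples that straddle byte boundaries, which is one way the packing can change the result.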

  5. What RMS will compress into 3 bits? • Consider “ideal” case for compression: a uniform distribution of values • A uniform distribution across $D$ consecutive discrete values has an RMS of $\sigma = D/\sqrt{12}$; $D = \sigma\sqrt{12}$ is the width of a flat distribution needed for a given $\sigma$ • To encode $D$ discrete values, one requires $\log_2(D)$ bits: $N_{\mathrm{bits}} = \log_2(D) = \log_2(\sigma\sqrt{12}) = \log_2(\sigma) + \tfrac{1}{2}\log_2(12) = \log_2(\sigma) + 1.8$ • In order to encode into 3 bits of data, the RMS of the distribution can’t be more than 2.3 ADC counts • Observed pedestal RMS’s are 6-8 ADC counts • Encoding raw values will not provide desired compression
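
The flat-distribution estimate above can be checked numerically. This is a minimal sketch using the same continuous approximation (RMS = width / sqrt(12)); the function names are illustrative.

```python
import math


def bits_for_uniform_rms(sigma):
    """Bits needed to index a flat distribution whose RMS is sigma ADC counts."""
    return math.log2(sigma * math.sqrt(12))


def max_rms_for_bits(n_bits):
    """Largest RMS of a flat distribution that still fits in n_bits."""
    return 2 ** n_bits / math.sqrt(12)


print(round(max_rms_for_bits(3), 2))  # RMS limit for 3 bits
for sigma in (6, 8):
    print(sigma, round(bits_for_uniform_rms(sigma), 2))
```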

  6. Information Theory limits on compression • For a stochastic noiseless source emitting a set of symbols with frequencies $p_i$, the minimum average number of bits per symbol is the (Shannon) entropy: $H = -\sum_i p_i \log_2 p_i$ • Shannon, Claude E. (July–October 1948). "A Mathematical Theory of Communication". Bell System Technical Journal. 27 (3): 379–423.
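
As a sketch of how this limit can be estimated from data, the entropy can be computed from the observed symbol frequencies. The helper name is illustrative, and the estimate treats symbols as independent draws.

```python
import math
from collections import Counter


def shannon_entropy_bits(symbols):
    """Entropy in bits per symbol, estimated from observed symbol frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Four equally likely symbols need 2 bits each; a constant stream needs none.
print(shannon_entropy_bits([0, 1, 2, 3]))
print(shannon_entropy_bits([5, 5, 5, 5]))
```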

  7. Gaussian distributed discrete random values • Huffman compression achieves Shannon entropy level of performance • Need RMS of 2 bins to compress into 3 bits • RMS of 4 bins should compress into 4 bits • RMS’s of 6-8 bins should compress into 4.6-5.0 bits
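
The quoted bit counts can be cross-checked by computing the Shannon entropy of a Gaussian binned into unit-width ADC bins. This is a sketch assuming a zero-mean Gaussian centered on a bin; the helper names and the ±10 sigma summation range are choices made here, not taken from the slides.

```python
import math


def gaussian_bin_probs(sigma):
    """Probabilities of unit-width bins for a zero-mean Gaussian of given RMS."""
    half_range = int(10 * sigma) + 1  # assumed cut-off, far into the tails

    def cdf(x):
        return 0.5 * (1 + math.erf(x / (sigma * math.sqrt(2))))

    return [cdf(k + 0.5) - cdf(k - 0.5) for k in range(-half_range, half_range + 1)]


def entropy_bits(probs):
    """Shannon entropy in bits, skipping empty bins."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


for sigma in (2, 4, 6, 8):
    print(sigma, round(entropy_bits(gaussian_bin_probs(sigma)), 2))
```

For RMS values well above one bin, the result approaches $\log_2(\sigma\sqrt{2\pi e}) \approx \log_2(\sigma) + 2.05$, about a quarter of a bit more than the flat-distribution estimate for the same RMS.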

  8. Variable Distributions, Run #1287 • Consider three variables as targets to encode using a compression algorithm • Raw ADC counts: $X_n$ • Difference wrt previous count: $X_n - X_{n-1}$ • Difference wrt linear prediction (based on previous two counts): $X_n - 2X_{n-1} + X_{n-2}$
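
A minimal sketch of the two derived target variables for one channel's waveform. The function names are illustrative, and how the first one or two samples are stored (they have no full history) is left open here.

```python
def differences(x):
    """X_n - X_{n-1}: difference with respect to the previous count."""
    return [x[n] - x[n - 1] for n in range(1, len(x))]


def linear_prediction_residuals(x):
    """X_n - 2*X_{n-1} + X_{n-2}: difference with respect to a straight-line
    extrapolation through the previous two counts."""
    return [x[n] - 2 * x[n - 1] + x[n - 2] for n in range(2, len(x))]


waveform = [900, 902, 904, 907]
print(differences(waveform))
print(linear_prediction_residuals(waveform))
```

A waveform that drifts along a straight line gives constant differences and zero linear-prediction residuals, so slow baseline movement is absorbed by both variables.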

  9. Variable Distribution RMS’s • [Figure panels: Raw ADC Counts, Difference, Linear Prediction]

  10. Truncated Huffman compression • Raw ADC counts: tree encodes values seen in event • For target variables, expect most values are in the range [-16,16] • Huffman-encode only this window • RAW + target: have additional (13-14 bit) Huffman code for “value outside range”, followed by full 12-bit value • 25 bit penalty for data not under control • compression performance will be worse than Shannon entropy
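
The truncated scheme can be sketched as follows: Huffman-code only the values in [-16, 16] plus one escape symbol, and charge the escape code plus a full 12-bit value for anything outside the window. This is an illustrative sketch that only computes code lengths; the names `ESCAPE`, `huffman_code_lengths`, and `average_bits` are hypothetical, and the actual tree construction used for the talk is not described in the slides.

```python
import heapq
from collections import Counter
from itertools import count

ESCAPE = "ESC"   # hypothetical marker for "value outside range"
WINDOW = 16      # encode only values in [-16, 16], as on the slide
RAW_BITS = 12    # full value appended after the escape code


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman tree built from freqs."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: 1 for sym in heap[0][2]}
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def truncated_symbol(value):
    """Map a value to itself if inside the window, else to the escape symbol."""
    return value if -WINDOW <= value <= WINDOW else ESCAPE


def average_bits(values, lengths):
    """Average encoded bits per value, including the escape penalty."""
    total = 0
    for v in values:
        sym = truncated_symbol(v)
        total += lengths[sym] + (RAW_BITS if sym == ESCAPE else 0)
    return total / len(values)


values = [0] * 8 + [1] * 4 + [-1] * 2 + [100] * 2
lengths = huffman_code_lengths(Counter(truncated_symbol(v) for v in values))
print(lengths, average_bits(values, lengths))
```

A tree meant to be reused across channels and events would also need a code for every window value it has not yet seen, for example by seeding each symbol with a small count; that step is omitted here.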

  11. Performance on Run #1287 • [Figure panels: Encode Raw Values, Encode Differences, Encode wrt Linear Prediction; distributions of avg bits per ADC word observed per channel, per event] • Green = Shannon entropy • Blue = Channel+Event specific Huffman Trees • Red = Use one (random) Huffman Tree for all data • Raw data requires lots of custom Huffman Trees • Encoding diff wrt linear prediction works best (avg less than 4 bits per ADC word)

  12. Performance Loss For Generic Trees • [Figure panels: Encode Differences, Encode wrt Linear Prediction] • For two target variables, lose fraction of a bit in performance • Linear predictor loss is better contained, i.e. performance more predictable

  13. Raw ADC Value Correlation Factors • Reproduced correlations observed by Tom in run 973 • Data in run 1287 appears to be much less correlated

  14. What’s different between the two runs? • [Figure panels: Run #973, Run #1287; axis: Raw ADC Channel-Channel Correlation Factor] • Run 1287 has no correlation factors greater than ~10% • Run 973 has a significant tail in the RMS distribution • Possibly due to slow noise in the electronics?

  15. Example: Anti-correlation from slow noise • Waveform for first event, channels 1199 vs 1216 • Causes significant increase in RMS, almost 100% anti-correlated
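
To illustrate how a shared slow drift can dominate the channel-to-channel correlation factor, here is a sketch with two synthetic waveforms that carry the same slow oscillation with opposite sign. All numbers (pedestal, drift amplitude and period, noise RMS) are made up for the example and are not taken from channels 1199 and 1216.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation factor between two equally long waveforms."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
drift = [20 * math.sin(2 * math.pi * n / 500) for n in range(1000)]
chan_a = [900 + d + rng.gauss(0, 2) for d in drift]  # drift enters with + sign
chan_b = [900 - d + rng.gauss(0, 2) for d in drift]  # drift enters with - sign

print(round(pearson(chan_a, chan_b), 3))
```

The drift also inflates the raw RMS of each channel well beyond the 2-count noise, while sample-to-sample differences are barely affected.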

  16. Comparison of variable RMS’s per channel: • Run 973 overall behavior of target variables is “better” than 1287 • Expect run 973 to compress better than run 1287

  17. Compression performance on run 1287 vs 973 • Encoding Difference wrt previous ADC count

  18. Compression Performance, run 1287 vs 973, cont’d • Encoding difference wrt Linear Prediction

  19. Estimated Event Size • ProtoDUNE-SP TDR spec is to compress 230.4 MB of TPC data into 57.6 MB • Run compression test on 10 events, for both runs, record #bits used • Run 1287 conveniently reads out 1536 channels, 1/10th of full protoDUNE-SP • Run 973 has 2304 channels reading out, scale numbers by 1536/2304

      Run Number    | Difference,  | Difference, | Linear Prediction, | Linear Prediction, | Size wrt
                    | Custom Trees | Single Tree | Custom Trees       | Single Tree        | TDR Spec
      1287          | 72.5 MB      | 73.4 MB     | 71.5 MB            | 72.2 MB            | +25%
      0973 (scaled) | 70.3 MB      | 71.1 MB     | 70.3 MB            | 70.4 MB            | +22%

      • 25% larger event size than required by TDR spec • ADC readout encoded on avg in 3.75 bits (TDR spec is 3) • Compression factor 3.20 (TDR spec is 4)
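
The summary numbers follow from the event sizes by simple arithmetic, sketched below. The 72.0 MB input is an illustrative value corresponding to exactly +25% over the spec; it sits within the range of the run 1287 entries but is not itself a measured result.

```python
TDR_RAW_MB = 230.4     # uncompressed event size from the TDR spec
TDR_TARGET_MB = 57.6   # compressed event size required by the TDR spec
RAW_BITS = 12          # bits per uncompressed ADC readout


def summarize(compressed_mb):
    """Average bits per readout, compression factor, and excess over the spec."""
    return {
        "avg_bits": RAW_BITS * compressed_mb / TDR_RAW_MB,
        "factor": TDR_RAW_MB / compressed_mb,
        "excess": compressed_mb / TDR_TARGET_MB - 1,
    }


print(summarize(72.0))
```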

  20. Conclusions, so far • Evaluated compression performance on coldbox data • Found two good candidate variables for encoding • Evaluated encoding with “truncated” Huffman compression • Found approach to be generic and robust • ~1% penalty for sub-optimal encoding tree, even across events • Expect similar performance for hard-coded common tree for all channels, all events (simplifies firmware implementation) • No performance loss in presence of “slow” noise • Estimate compressed event size to be 25% larger than TDR spec • No significant channel noise cross-correlation observed (in run #1287) • Likely not much to gain from combining information across channels • Found promising correlations with ADC counts earlier in the stream (further reduce avg RMS by 10%, i.e. 5% better compression)

  21. Plans • Check cross-channel correlation between encoding variables • Re-check gzip performance on larger sample of events • Attempt to utilize information from earlier in the stream to further shrink target variable RMS • Choose single, hardcoded compression tree • Optimize decompression algorithm for speed, report performance • Study per-event compression performance on larger sample (e.g. entire run 1287) • Try “gzip -9” on compressed output • Any other tests? • Report back with final findings, document
