Fast Dictionary-based Compression for Inverted Indexes
Giulio Ermanno Pibiri (The University of Pisa and ISTI-CNR, Pisa, Italy)
Matthias Petri (The University of Melbourne, Melbourne, Australia)
Alistair Moffat (The University of Melbourne, Melbourne, Australia)
12/02/2019
Context: Inverted Indexes
We focus on compression effectiveness and decoding speed for inverted indexes. The inverted index is the de facto data structure at the core of every large-scale retrieval system.
Example: a collection of five documents (docIDs 1 to 5) over the vocabulary V = {always, boy, good, house, hungry, is, red, the}, with terms t1, ..., t8 in that order. The posting lists are:
L_t1 = [1, 3]
L_t2 = [4, 5]
L_t3 = [1]
L_t4 = [2, 3]
L_t5 = [3, 5]
L_t6 = [1, 2, 3, 4, 5]
L_t7 = [1, 2, 4]
L_t8 = [2, 3, 5]
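The example above can be reproduced with a few lines of Python. This is a minimal sketch: the document texts in `docs` are hypothetical, chosen only so that the resulting posting lists match the ones on the slide, and `build_inverted_index` is an illustrative name, not part of the authors' code.

```python
from collections import defaultdict

# Hypothetical document texts (assumption): picked so that the posting
# lists agree with the example lists L_t1 ... L_t8.
docs = {
    1: "red is always good",
    2: "the house is red",
    3: "the house is always hungry",
    4: "boy is red",
    5: "the boy is hungry",
}

def build_inverted_index(documents):
    """Map each term to the sorted list of docIDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # set() so a term repeated inside one document is recorded once
        for term in set(documents[doc_id].split()):
            index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Because documents are visited in increasing docID order, every posting list comes out sorted, which is the property the compressors discussed next rely on.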
Many solutions
A large body of research describes different space/time trade-offs.
• Elias gamma/delta
• Variable-Byte family
• Binary Interpolative Coding
• Simple family
• PForDelta
• Optimized PForDelta
• Elias-Fano
• Partitioned Elias-Fano
• Clustered Elias-Fano
• Asymmetric Numeral Systems
Space/time spectrum: at the space end, Interpolative is ~3X smaller; at the time end, Variable-Byte + SIMD is ~4.5X faster.
RQ: Can we inherit both advantages?
A crucial fact
Patterns of d-gaps are repetitive.
[Figure: d-gap patterns in the Gov2 dataset.]
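For readers unfamiliar with the term, a d-gap is the difference between consecutive docIDs in a sorted posting list. The sketch below assumes the common convention that the first gap is the first docID itself (as if the previous docID were 0); `to_gaps` and `from_gaps` are illustrative names.

```python
from itertools import accumulate

def to_gaps(postings):
    """Turn a strictly increasing posting list into d-gaps."""
    if not postings:
        return []
    # Assumption: the first gap is taken relative to an implicit docID 0.
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Invert to_gaps with a running (prefix) sum."""
    return list(accumulate(gaps))

print(to_gaps([1, 2, 3, 4, 5]))
print(to_gaps([1, 2, 4]))
```

Dense lists such as [1, 2, 3, 4, 5] become runs of small, equal gaps, which is why the same short patterns recur so often.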
DINT: Dictionary of INTegers
• Encode a whole pattern with a single dictionary reference of b bits
• Decode a whole pattern with a single dictionary access
[Figure: fixed-to-fixed arrangement. A dictionary with 2^b entries of width l + 1; the input stream is encoded as a sequence of b-bit codewords c1, c2, ..., with an exception marked e.]
1/3 of the time is saved.
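The two bullets above can be illustrated with a toy encoder and decoder. This is a sketch under stated assumptions, not the authors' implementation: the frequency heuristic in `build_dictionary`, the reserved `ESCAPE` codeword for integers that match no pattern, and the greedy longest-match parse are all simplifications introduced here, and codewords are kept as Python integers rather than packed into b bits.

```python
from collections import Counter

ESCAPE = 0  # hypothetical reserved codeword: the next item is a raw integer

def build_dictionary(gaps, b=4, max_len=4):
    """Pick up to 2**b - 1 patterns of length 1..max_len (heuristic)."""
    counts = Counter(
        tuple(gaps[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(gaps) - n + 1)
    )
    # Rank by occurrences times pattern length, i.e. integers covered.
    ranked = sorted(counts, key=lambda p: (-counts[p] * len(p), p))
    return [None] + ranked[: 2 ** b - 1]  # slot 0 is reserved for ESCAPE

def encode(gaps, dictionary, max_len=4):
    """Greedy parse: one codeword per longest matching pattern."""
    index = {p: c for c, p in enumerate(dictionary) if p is not None}
    out, i = [], 0
    while i < len(gaps):
        for n in range(min(max_len, len(gaps) - i), 0, -1):
            code = index.get(tuple(gaps[i:i + n]))
            if code is not None:
                out.append(code)
                i += n
                break
        else:  # no pattern matched: emit the integer as an exception
            out.extend([ESCAPE, gaps[i]])
            i += 1
    return out

def decode(codes, dictionary):
    """One dictionary access per codeword restores a whole pattern."""
    out, i = [], 0
    while i < len(codes):
        if codes[i] == ESCAPE:
            out.append(codes[i + 1])
            i += 2
        else:
            out.extend(dictionary[codes[i]])
            i += 1
    return out

gaps = [1, 1, 1, 1] * 8
dictionary = build_dictionary(gaps, b=2)
codes = encode(gaps, dictionary)
print(len(gaps), "integers ->", len(codes), "codewords")
```

The decoder does no bit-level arithmetic per integer: each codeword is a single table lookup that copies out several integers at once, which is where the speed of a dictionary scheme comes from.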
Refinements
1. Packed dictionary structure
   Exploiting string overlap
2. Optimal block parsing
3. Multiple dictionaries
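The slide only names the "optimal block parsing" refinement. One way to read it, assumed here, is as a minimum-cost segmentation of a block into dictionary patterns, solved by dynamic programming instead of a greedy longest match. In the sketch below, `optimal_parse`, the unit cost per codeword, and `exception_cost` are illustrative choices, not taken from the talk.

```python
def optimal_parse(block, patterns, exception_cost=2):
    """Cover `block` with patterns at minimum total cost.

    `patterns` is a set of tuples. Each matched pattern costs 1 codeword;
    an integer that is emitted raw costs `exception_cost` (assumption).
    Returns (total_cost, list_of_segments).
    """
    n = len(block)
    max_len = max((len(p) for p in patterns), default=1)
    best = [0] + [float("inf")] * n  # best[i]: min cost to encode block[:i]
    back = [0] * (n + 1)             # back[i]: start of the last segment
    for i in range(1, n + 1):
        # Fallback: encode block[i-1] alone as an exception.
        best[i] = best[i - 1] + exception_cost
        back[i] = i - 1
        # Try every dictionary pattern that ends exactly at position i.
        for k in range(1, min(max_len, i) + 1):
            if tuple(block[i - k:i]) in patterns and best[i - k] + 1 < best[i]:
                best[i] = best[i - k] + 1
                back[i] = i - k
    # Walk the back-pointers to recover the chosen segments.
    segments, i = [], n
    while i > 0:
        segments.append(tuple(block[back[i]:i]))
        i = back[i]
    segments.reverse()
    return best[n], segments

cost, segments = optimal_parse([1, 1, 2, 3], {(1,), (1, 1), (1, 2, 3)})
print(cost, segments)
```

On this input a greedy longest match would take (1, 1) first and then need two exceptions for 2 and 3, while the dynamic program finds the two-codeword parse (1,), (1, 2, 3).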
Experimental results: setting
Datasets: [table]
Machine: Intel Xeon 6144 processor, 512 GiB RAM, Linux 4.13.0
Compiler: gcc 7.2.0 (with all optimizations)
C++ code available at https://github.com/jermp/dint
Experimental results: compression effectiveness
Parameters: l = 16, b = 16
Experimental results: effectiveness/efficiency plot
Further readings
Chapters 6 and 7 of my Ph.D. thesis (more datasets, comparisons, query timings): http://pages.di.unipi.it/pibiri/papers/phd_thesis.pdf
Thanks for your attention, time, patience! Any questions?