Fast Dictionary-based Compression for Inverted Indexes


  1. Fast Dictionary-based Compression for Inverted Indexes. Giulio Ermanno Pibiri (The University of Pisa and ISTI-CNR, Pisa, Italy), Matthias Petri (The University of Melbourne, Melbourne, Australia), Alistair Moffat (The University of Melbourne, Melbourne, Australia). 12/02/2019

  2. Context: Inverted Indexes. We focus on compression effectiveness and decoding speed for inverted indexes. The inverted index is the de facto data structure underlying every large-scale retrieval system.

  3. Context: Inverted Indexes (example). A collection of five documents, numbered 1 to 5, over the vocabulary V = {always, boy, good, house, hungry, is, red, the}, whose terms are labelled t1, ..., t8 in that order. The posting list L_t of a term t records the documents that contain it:
     L_t1 = [1, 3]            (always)
     L_t2 = [4, 5]            (boy)
     L_t3 = [1]               (good)
     L_t4 = [2, 3]            (house)
     L_t5 = [3, 5]            (hungry)
     L_t6 = [1, 2, 3, 4, 5]   (is)
     L_t7 = [1, 2, 4]         (red)
     L_t8 = [2, 3, 5]         (the)
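As a rough illustration of how the posting lists in this example arise, the sketch below builds an inverted index from five short documents. The document texts are an assumption: the word order is hypothetical and was chosen only so that the resulting lists agree with L_t1, ..., L_t8. The function name `build_inverted_index` is likewise made up for this sketch.

```python
from collections import defaultdict

# Hypothetical documents (word order assumed), consistent with the
# posting lists of the example.
docs = {
    1: "red is always good",
    2: "the house is red",
    3: "the house is always hungry",
    4: "boy is red",
    5: "the boy is hungry",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() so that a term repeated in a document is recorded once
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    # Sort the vocabulary so that terms come out in the order t1, ..., t8
    return dict(sorted(index.items()))

index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```

Because documents are visited in increasing docID order, every posting list comes out sorted, which is what makes gap-based compression applicable.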

  4. Many solutions. A large body of research describes different space/time trade-offs:
     • Elias gamma/delta
     • Variable-Byte family
     • Binary Interpolative Coding
     • Simple family
     • PForDelta
     • Optimized PForDelta
     • Elias-Fano
     • Partitioned Elias-Fano
     • Clustered Elias-Fano
     • Asymmetric Numeral Systems

  5. Many solutions: the space/time spectrum. Interpolative sits at the space end of the spectrum (~3X smaller); Variable-Byte + SIMD sits at the time end (~4.5X faster).

  6. Many solutions: research question (RQ). Can we inherit both advantages?

  7. A crucial fact. Patterns of d-gaps are repetitive. (A d-gap is the difference between two consecutive document identifiers in a posting list.)

  8. A crucial fact (continued). [Figure: data for the Gov2 collection.]
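To make the notion of repeated d-gap patterns concrete, here is a minimal sketch that turns a posting list into d-gaps and counts how often each window of consecutive gaps occurs. The posting list, the window length of 4, and the helper names `d_gaps` and `pattern_counts` are all assumptions made for illustration; nothing here reproduces the Gov2 measurements.

```python
from collections import Counter

def d_gaps(postings):
    """Differences between consecutive docIDs (the first gap is the first docID)."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def pattern_counts(gaps, length):
    """Count every window of `length` consecutive gaps."""
    return Counter(
        tuple(gaps[i:i + length]) for i in range(len(gaps) - length + 1)
    )

# Hypothetical posting list with runs of consecutive docIDs
postings = [1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 13, 14, 15]
gaps = d_gaps(postings)
counts = pattern_counts(gaps, 4)
for pattern, freq in counts.most_common(3):
    print(pattern, freq)
```

Even on this tiny list every pattern of length 4 occurs more than once, which is the kind of redundancy a dictionary of patterns can exploit.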

  9. DINT (Dictionary of INTegers).
     • Encode a whole pattern with a single dictionary reference of b bits.
     • Decode a whole pattern with a single dictionary access.
     [Diagram: the dictionary is a fixed-to-fixed arrangement with 2^b entries, each l + 1 wide; the input stream is a sequence of b-bit codewords … c1 c2 c3 c4 c5 c6 c6 c7 e c8 c9 c10 c11 …]

  12. DINT (continued). 1/3 of the time is saved.
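The two bullet points of the DINT slides (one codeword per pattern when encoding, one dictionary access per codeword when decoding) can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: the dictionary is filled with the patterns that cover the most integers, pattern lengths are restricted to powers of two up to `max_len` (playing the role of l), codeword 0 is reserved as an exception marker followed by the raw integer, and encoding is greedy longest-match. The names `build_dictionary`, `encode`, and `decode` are hypothetical.

```python
from collections import Counter

EXCEPTION = 0  # assumption of this sketch: codeword 0 marks a raw integer

def build_dictionary(gaps, max_len, b):
    """Keep the 2^b - 1 patterns that cover the most integers (assumed policy)."""
    counts = Counter()
    length = 1
    while length <= max_len:  # pattern lengths 1, 2, 4, ..., max_len
        for i in range(len(gaps) - length + 1):
            counts[tuple(gaps[i:i + length])] += 1
        length *= 2
    ranked = sorted(counts, key=lambda p: (-counts[p] * len(p), p))
    return [None] + ranked[: (1 << b) - 1]  # slot 0 is the exception marker

def encode(gaps, dictionary, max_len):
    """Greedy longest-match: one codeword per matched pattern."""
    index = {p: c for c, p in enumerate(dictionary) if p is not None}
    out, i = [], 0
    while i < len(gaps):
        for length in range(min(max_len, len(gaps) - i), 0, -1):
            code = index.get(tuple(gaps[i:i + length]))
            if code is not None:
                out.append(code)
                i += length
                break
        else:  # no pattern matches: emit the marker and the raw integer
            out.extend([EXCEPTION, gaps[i]])
            i += 1
    return out

def decode(codes, dictionary):
    """One dictionary access per codeword copies a whole pattern."""
    gaps, i = [], 0
    while i < len(codes):
        if codes[i] == EXCEPTION:
            gaps.append(codes[i + 1])
            i += 2
        else:
            gaps.extend(dictionary[codes[i]])
            i += 1
    return gaps

gaps = [1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 7, 1, 1, 1, 1]  # hypothetical d-gaps
dictionary = build_dictionary(gaps, max_len=4, b=3)
codes = encode(gaps, dictionary, max_len=4)
print(len(gaps), "integers ->", len(codes), "codewords")
```

In a real encoder each codeword would occupy exactly b bits and exceptions would need a fixed-width representation of their own; the Python list of integers used here only stands in for that bit stream.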

  13. Refinements.
     (1) Packed dictionary structure: exploiting string overlap
     (2) Optimal block parsing
     (3) Multiple dictionaries
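The slide only names the refinements, so the following is one plausible reading of "optimal block parsing" rather than the authors' method: choose the split of a block of d-gaps into dictionary patterns that minimises the total cost, using dynamic programming (a shortest path over positions in the block). The cost model is an assumption: every dictionary codeword costs 1 and an exception costs 2. The function name `optimal_parse` and the pattern set are hypothetical.

```python
def optimal_parse(gaps, patterns, max_len, exception_cost=2):
    """Cheapest split of `gaps` into dictionary patterns and exceptions."""
    n = len(gaps)
    cost = [0] + [float("inf")] * n   # cost[i]: cheapest encoding of gaps[:i]
    back = [None] * (n + 1)           # back[i]: (start, pattern or None)
    for i in range(n):
        # An exception always covers exactly one integer.
        if cost[i] + exception_cost < cost[i + 1]:
            cost[i + 1] = cost[i] + exception_cost
            back[i + 1] = (i, None)
        # A dictionary pattern covers `length` integers at unit cost.
        for length in range(1, min(max_len, n - i) + 1):
            pattern = tuple(gaps[i:i + length])
            if pattern in patterns and cost[i] + 1 < cost[i + length]:
                cost[i + length] = cost[i] + 1
                back[i + length] = (i, pattern)
    # Walk the back-pointers to recover the chosen phrases.
    phrases, i = [], n
    while i > 0:
        start, pattern = back[i]
        phrases.append(pattern if pattern is not None else ("exception", gaps[start]))
        i = start
    phrases.reverse()
    return cost[n], phrases

# Hypothetical dictionary and block on which greedy longest-match is suboptimal
patterns = {(1,), (1, 1, 1), (1, 2, 2, 2)}
total, phrases = optimal_parse([1, 1, 1, 2, 2, 2], patterns, max_len=4)
print(total, phrases)
```

On this block a greedy longest-match parser would take (1, 1, 1) first and then need three exceptions, for a cost of 7 under the same cost model, whereas the dynamic program finds a parse of cost 3.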

  14. Experimental results: setting.
     Datasets: [table]
     Machine: Intel Xeon 6144 processor, 512 GiB RAM, Linux 4.13.0
     Compiler: gcc 7.2.0 (with all optimizations)
     C++ code available at https://github.com/jermp/dint

  15. Experimental results: compression effectiveness

  16. Experimental results: compression effectiveness, with l = 16 and b = 16

  17. Experimental results: effectiveness/efficiency plot

  20. Further reading. Chapters 6 and 7 of my Ph.D. thesis (more datasets, comparisons, query timings): http://pages.di.unipi.it/pibiri/papers/phd_thesis.pdf

  21. Thanks for your attention, time, patience! Any questions?
