Space- and Time-Efficient Data Structures for Massive Datasets

Giulio Ermanno Pibiri. Supervisor: Rossano Venturini. Referees: Daniel Lemire and Simon Gog. Department of Computer Science, University of Pisa, 08/03/2019.


  1. Space- and Time-Efficient Data Structures for Massive Datasets. Giulio Ermanno Pibiri. Supervisor: Rossano Venturini. Referees: Daniel Lemire and Simon Gog. Department of Computer Science, University of Pisa, 08/03/2019.

  4. Evidence: The increase of data and, hence, information does not scale with technology. “Software is getting slower more rapidly than hardware becomes faster.” (Niklaus Wirth, A Plea for Lean Software.) Even more relevant today!

  7. Achieved results
     Integer sequences:
     • Clustered Elias-Fano Indexes. Giulio Ermanno Pibiri and Rossano Venturini. Journal paper, ACM Transactions on Information Systems (TOIS). Full paper, 34 pages, 2017.
     • Dynamic Elias-Fano Representation. Giulio Ermanno Pibiri and Rossano Venturini. Conference paper, Annual Symposium on Combinatorial Pattern Matching (CPM). Full paper, 14 pages, 2017.
     • On Optimally Partitioning Variable-Byte Codes. Giulio Ermanno Pibiri and Rossano Venturini. Journal paper, IEEE Transactions on Knowledge and Data Engineering (TKDE). To appear. Full paper, 12 pages, 2019.
     • Fast Dictionary-based Compression for Inverted Indexes. Giulio Ermanno Pibiri, Matthias Petri and Alistair Moffat. Conference paper, ACM Conference on Web Search and Data Mining (WSDM). Full paper, 9 pages, 2019.
     Short strings:
     • Efficient Data Structures for Massive N-Gram Datasets. Giulio Ermanno Pibiri and Rossano Venturini. Conference paper, ACM Conference on Research and Development in Information Retrieval (SIGIR). Full paper, 10 pages, 2017.
     • Handling Massive N-Gram Datasets Efficiently. Giulio Ermanno Pibiri and Rossano Venturini. Journal paper, ACM Transactions on Information Systems (TOIS). To appear. Full paper, 41 pages, 2019.

  9. Problem 1: Consider a sorted integer sequence. How can it be represented as a bit-vector in which each original integer is uniquely decodable, using as few bits as possible? How can fast decompression speed be maintained?
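
To make the problem statement concrete, here is a minimal sketch of one classic answer: store the gaps between consecutive integers and write each gap with Elias' gamma code, which is uniquely decodable. The function names are illustrative and not taken from the talk, the sequence is assumed to be strictly increasing and non-negative, and the bit-vector is modelled as a Python string of '0' and '1' for readability rather than speed.

```python
def gamma_encode(x: int) -> str:
    """Elias gamma code of x >= 1: (len - 1) zeros, then x in binary."""
    if x < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_sorted(seq: list[int]) -> str:
    """Encode a strictly increasing sequence of non-negative integers."""
    bits = []
    prev = -1  # so that the first gap is value + 1 >= 1
    for value in seq:
        bits.append(gamma_encode(value - prev))  # each gap must be >= 1
        prev = value
    return "".join(bits)


def decode_sorted(bits: str) -> list[int]:
    """Invert encode_sorted by reading one gamma code at a time."""
    out = []
    prev = -1
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # the run of zeros gives the code length
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        prev += gap
        out.append(prev)
    return out
```

Small gaps get short codes (a gap of 1 costs a single bit), which is why gap-based codes work well on dense sequences. Decoding is bit-by-bit, which illustrates the space/time tension the question raises.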

  10. Ubiquity: inverted indexes, databases, e-commerce, graph compression, semantic data, geo-spatial data.

  13. Inverted indexes: The inverted index is the de facto data structure at the basis of every large-scale retrieval system. Example over five documents (numbered 1 to 5) with terms {t1, ..., t8} = {always, boy, good, house, hungry, is, red, the}:
      • L_t1 = [1, 3] (always)
      • L_t2 = [4, 5] (boy)
      • L_t3 = [1] (good)
      • L_t4 = [2, 3] (house)
      • L_t5 = [3, 5] (hungry)
      • L_t6 = [1, 2, 3, 4, 5] (is)
      • L_t7 = [1, 2, 4] (red)
      • L_t8 = [2, 3, 5] (the)
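
The example above can be reproduced with a short sketch that builds the posting lists from a document collection. The document texts below are an assumption: only the set of words in each document follows from the slide's posting lists, and the word order is a guess. The function name is illustrative.

```python
from collections import defaultdict

# Assumed documents: word sets follow from the posting lists on the slide,
# word order is a guess made for readability.
docs = {
    1: "red is always good",
    2: "the house is red",
    3: "the house is always hungry",
    4: "boy is red",
    5: "the boy is hungry",
}


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # ascending ids keep each list sorted
        for term in set(docs[doc_id].split()):  # one posting per document
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(docs)
```

Each posting list comes out as a sorted integer sequence, which is exactly the input of Problem 1.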

  15. Many solutions: A large body of research describes different space/time trade-offs. Timeline, from ~1970 to 2014:
      • Elias’ Gamma and Delta
      • Variable-Byte Family
      • Binary Interpolative Coding
      • Simple Family
      • PForDelta
      • QMX
      • Elias-Fano
      • Partitioned Elias-Fano (2014)
      Spectrum: Binary Interpolative Coding sits at the space end (~3X smaller); the Variable-Byte family sits at the time end (~4.5X faster).
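
As an illustration of the time end of the spectrum, here is a minimal sketch of a Variable-Byte code. Several byte-layout conventions exist; this one is an assumption chosen for the example: 7 payload bits per byte, least-significant group first, and the high bit set on the last byte of each integer. The function names are illustrative.

```python
def vbyte_encode(x: int) -> bytes:
    """Encode one non-negative integer, 7 payload bits per byte."""
    if x < 0:
        raise ValueError("VByte encodes non-negative integers")
    out = bytearray()
    while x >= 128:
        out.append(x & 0x7F)  # low 7 bits, high bit clear: more bytes follow
        x >>= 7
    out.append(x | 0x80)      # high bit set: last byte of this integer
    return bytes(out)


def vbyte_encode_all(values: list[int]) -> bytes:
    """Concatenate the codes of all values."""
    return b"".join(vbyte_encode(v) for v in values)


def vbyte_decode_all(data: bytes) -> list[int]:
    """Decode a concatenation of VByte codes."""
    values = []
    x = 0
    shift = 0
    for byte in data:
        if byte & 0x80:  # terminator byte: emit the finished integer
            values.append(x | ((byte & 0x7F) << shift))
            x, shift = 0, 0
        else:
            x |= byte << shift
            shift += 7
    return values
```

Decoding touches whole bytes and needs no bit-level parsing, which is the usual explanation for its speed; the price is at least 8 bits per integer, however small.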

  20. Key research questions. On the space/time spectrum, Binary Interpolative Coding (BIC) is ~3X smaller, while the Variable-Byte (VByte) family is ~4.5X faster.
      1. Is it possible to design an encoding that is as small as BIC and much faster? (TOIS 2017)
      2. Is it possible to design an encoding that is as fast as VByte and much smaller? (TKDE 2019)
      3. What about both objectives at the same time?! (WSDM 2019)

  25. 1 - Clustered inverted indexes (TOIS 2017). Every encoder represents each sequence individually. Idea: encode clusters of (similar) inverted lists. [Figure: a cluster of lists and its reference list]
      • Space: always better than PEF (Partitioned Elias-Fano), by up to 11%, and better than BIC, by up to 6.25%.
      • Time: slightly slower than PEF (~20%) and much faster than BIC (2X).

  29. 2 - Optimally-partitioned Variable-Byte codes (TKDE 2019). The majority of values are small (very small indeed). Idea: encode dense regions with unary codes, sparse regions with VByte.
      • Optimal partitioning in linear time and constant space.
      • Compression ratio improves by 2X.
      • Query processing speed and sequential decoding are (almost) not affected.
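
The dense/sparse idea can be illustrated with a simplified sketch. It is not the optimal linear-time partitioning algorithm from the paper: it cuts the sequence into fixed-size blocks (`block_size` is a hypothetical parameter) and, for each block, picks whichever of unary or VByte coding of the gaps costs fewer bits. The function names and cost model are assumptions made for this example.

```python
def vbyte_bits(gap: int) -> int:
    """Bits VByte spends on a gap: 8 bits per 7-bit payload group."""
    nbytes = 1
    while gap >= 128:
        gap >>= 7
        nbytes += 1
    return 8 * nbytes


def unary_bits(gap: int) -> int:
    """Unary code of a gap >= 1: gap - 1 zeros followed by a one."""
    return gap


def choose_encoders(seq: list[int], block_size: int = 4) -> list[tuple[str, int]]:
    """For each fixed-size block, return (cheaper encoder, cost in bits)."""
    choices = []
    prev = -1  # so that the first gap is value + 1 >= 1
    for start in range(0, len(seq), block_size):
        gaps = []
        for value in seq[start:start + block_size]:
            gaps.append(value - prev)
            prev = value
        cost_unary = sum(unary_bits(g) for g in gaps)
        cost_vbyte = sum(vbyte_bits(g) for g in gaps)
        if cost_unary <= cost_vbyte:
            choices.append(("unary", cost_unary))
        else:
            choices.append(("vbyte", cost_vbyte))
    return choices
```

On a run of consecutive integers every gap is 1, so unary coding costs one bit per element where VByte costs eight. On widely spaced integers the unary cost grows with the gap itself and VByte wins.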
