
Ordered Set Problems Giulio Ermanno Pibiri - PowerPoint PPT Presentation



  1. Ordered Set Problems Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it http://pages.di.unipi.it/pibiri 07/06/2019

  5. The Static Ordered Set Problem
 Given a set of n items and an order relation defined on them,
 we are asked to design a data structure that supports
 Access, Contains, Successor, Predecessor efficiently.
 Let us assume our items are integers drawn from some universe of size u ≥ n.
 If the integers are not to be compressed: use an array.
 Operations are made efficient by binary search with loop unrolling,
 with cut-off to SSE/AVX (SIMD) linear search on small segments.
 If the keys are uniformly distributed, interpolation search can help:
 O(log log n) time with high probability.
 Let us also assume n is so big that we must compress the set.
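
The uncompressed case can be sketched as follows. This is a minimal illustration, not the presenter's code: the name `successor_index` and the cut-off of 8 elements are assumptions, and a plain scalar loop stands in for the unrolled/SIMD linear search mentioned on the slide.

```cpp
#include <cassert>   // used by the usage checks
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: index of the smallest element >= x in the sorted
// vector `a`, or a.size() if every element is smaller than x.
std::size_t successor_index(const std::vector<std::uint32_t>& a, std::uint32_t x) {
    constexpr std::size_t kCutoff = 8;  // assumed segment size, not from the slides
    std::size_t lo = 0, hi = a.size();  // invariant: the answer lies in [lo, hi]
    // Binary search until the candidate segment is small.
    while (hi - lo > kCutoff) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < x) lo = mid + 1;
        else hi = mid;
    }
    // Linear search on the small segment (a SIMD scan would go here).
    while (lo < hi && a[lo] < x) ++lo;
    return lo;
}
```

Contains(x) then reduces to checking whether the returned index is in range and holds x, and Access(i) is a plain array lookup.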

  6. Sorted integer sets are ubiquitous
 Inverted indexes • Databases • E-Commerce • Graph compression • Semantic data • Geospatial data

  7. The Static Compressed Ordered Set Problem
 A large body of research describes different space/time trade-offs:
 Elias' Gamma and Delta (~1970) • Elias-Fano • Variable-Byte Family • Binary Interpolative Coding • Simple Family • PForDelta • QMX • Quasi-Succinct • Partitioned Elias-Fano • Clustered Elias-Fano • Optimal Variable-Byte • DINT (2019)
 + set intersection, union and decode

  13. Partitioning by Cardinality
 The problem of (almost all) such representations is that
 Access, Contains, Predecessor/Successor are not natively supported:
 we can only decode sequentially.
 Solution 1
 Introduce some redundancy to accelerate queries: the so-called skip pointers.
 Split the sequence into blocks of B integers and store the upper bound of each block:
 3 9 10 14 | 23 24 25 34 | 38 42 44 49 | 50 65 71 98   (B = 4)
 Upperbounds: 14 34 49 98
 Layout: Upperbounds | Offsets | Bits
 Solution 2
 Redesign the data structure.
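
A rough sketch of Solution 1, under stated assumptions: the type `BlockedSequence` is hypothetical, and the blocks are kept as plain integers so the example stays short. In a real index each block would be compressed, and the Offsets array would tell where its bits start; here the offset of block k is simply k * B.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical structure: a sorted sequence cut into blocks of B values,
// plus one skip pointer (the block's upper bound) per block.
struct BlockedSequence {
    std::size_t block_size;
    std::vector<std::uint32_t> upperbounds;  // last value of each block
    std::vector<std::uint32_t> values;       // stand-in for the compressed bits

    BlockedSequence(const std::vector<std::uint32_t>& sorted, std::size_t B)
        : block_size(B), values(sorted) {
        assert(B > 0);
        for (std::size_t begin = 0; begin < sorted.size(); begin += B) {
            std::size_t end = std::min(begin + B, sorted.size());
            upperbounds.push_back(sorted[end - 1]);
        }
    }

    // Smallest value >= x, or -1 if there is none.
    std::int64_t successor(std::uint32_t x) const {
        // Step 1: binary search on the skip pointers to locate the block.
        auto it = std::lower_bound(upperbounds.begin(), upperbounds.end(), x);
        if (it == upperbounds.end()) return -1;
        std::size_t begin = static_cast<std::size_t>(it - upperbounds.begin()) * block_size;
        std::size_t end = std::min(begin + block_size, values.size());
        // Step 2: sequential decode, limited to that single block.
        for (std::size_t i = begin; i < end; ++i)
            if (values[i] >= x) return values[i];
        return -1;  // not reached: the block's upper bound is >= x
    }
};
```

With the 16 integers of the slide and B = 4, a query touches 4 upper bounds and at most 4 values instead of decoding the whole sequence.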

  17. Partitioning by Universe
 Does this remind you of something?
 [Figure: a bitmap split into clusters of √u bits each, with a summary of √u bits on top]
 summary:  1 1 0 1
 clusters: 1 0 0 1 | 1 1 0 1 | 0 0 0 0 | 0 1 1 1
 [Elias-Fano 1971-1975] [van Emde Boas 1974-1975]

  24. Partitioning by Universe
 Assume a slice size of 2^3.
 Contains(x):
 i = x >> 3
 search for x - (i << 3) in the i-th slice
 Example: x = 010101 (21), i = 010 (2), x - 16 = 5: search for 5 in slice 2.
 Successor(x):
 i = x >> 3
 search for successor of x - (i << 3) in the i-th slice
 (if i-th slice is empty or x - (i << 3) > max_value in i-th slice,
 then return first value on the right)
 Intersection between lists has to intersect only the slices in common between the lists.
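
The two queries above could look like this in code. It is only a sketch: `SlicedSet` and `kSliceBits` are made-up names, each slice is stored as a sorted array of 3-bit offsets, and the walk to "the first value on the right" is a plain linear scan over the following slices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr unsigned kSliceBits = 3;  // slice size 2^3, as on the slide

// Hypothetical structure: slice i holds the low 3 bits of every x with x >> 3 == i.
struct SlicedSet {
    std::vector<std::vector<std::uint8_t>> slices;

    SlicedSet(const std::vector<std::uint32_t>& sorted, std::uint64_t universe)
        : slices((universe + (1u << kSliceBits) - 1) >> kSliceBits) {
        for (std::uint32_t x : sorted) {
            assert(x < universe);
            std::size_t i = x >> kSliceBits;
            slices[i].push_back(static_cast<std::uint8_t>(x - (i << kSliceBits)));
        }
    }

    bool contains(std::uint32_t x) const {
        std::size_t i = x >> kSliceBits;
        if (i >= slices.size()) return false;
        auto low = static_cast<std::uint8_t>(x - (i << kSliceBits));
        return std::binary_search(slices[i].begin(), slices[i].end(), low);
    }

    // Smallest value >= x, or -1 if there is none.
    std::int64_t successor(std::uint32_t x) const {
        std::size_t i = x >> kSliceBits;
        if (i >= slices.size()) return -1;
        auto low = static_cast<std::uint8_t>(x - (i << kSliceBits));
        auto it = std::lower_bound(slices[i].begin(), slices[i].end(), low);
        if (it != slices[i].end())
            return static_cast<std::int64_t>((i << kSliceBits) + *it);
        // Slice i is empty or low exceeds its maximum: first value on the right.
        for (std::size_t j = i + 1; j < slices.size(); ++j)
            if (!slices[j].empty())
                return static_cast<std::int64_t>((j << kSliceBits) + slices[j].front());
        return -1;
    }
};
```

For x = 21 the code computes i = 2 and searches for offset 5 in slice 2, matching the worked example on the slide.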

  28. Bitmaps
 Good old data structure for storing dense sets:
 x-th bit is set if integer x is in the set.
 S = {0,1,5,7,8,10,11,14,18,21,22,28,29,30}
 bits 0-7:   1 1 0 0 0 1 0 1
 bits 8-15:  1 0 1 1 0 0 1 0
 bits 16-23: 0 0 1 0 0 1 1 0
 bits 24-31: 0 0 0 0 1 1 1 0
 Contains: testing a bit
 Successor/Predecessor: __builtin_ctzll
 Select: __builtin_ctzll
 Max: __builtin_clzll
 Min: __builtin_ctzll
 Decode: __builtin_ctzll
 Insertion: setting a bit
 Deletion: clearing a bit
 Nothing is better than a bitmap for dense sets.
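
A possible rendering of some of these operations over 64-bit words, as a sketch only: the `Bitmap` type and its method names are assumptions, and `__builtin_ctzll` / `__builtin_clzll` are GCC/Clang builtins, so the code assumes one of those compilers. Select and Predecessor are left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bitmap over the universe [0, universe).
struct Bitmap {
    std::vector<std::uint64_t> words;
    explicit Bitmap(std::size_t universe) : words((universe + 63) / 64, 0) {}

    void insert(std::uint64_t x) {            // Insertion: setting a bit
        assert((x >> 6) < words.size());
        words[x >> 6] |= 1ULL << (x & 63);
    }
    void erase(std::uint64_t x) {             // Deletion: clearing a bit
        assert((x >> 6) < words.size());
        words[x >> 6] &= ~(1ULL << (x & 63));
    }
    bool contains(std::uint64_t x) const {    // Contains: testing a bit
        if ((x >> 6) >= words.size()) return false;
        return (words[x >> 6] >> (x & 63)) & 1;
    }
    // Smallest element >= x, or -1 if there is none.
    std::int64_t successor(std::uint64_t x) const {
        std::size_t w = x >> 6;
        if (w >= words.size()) return -1;
        std::uint64_t word = words[w] & (~0ULL << (x & 63));  // drop bits below x
        while (true) {
            if (word != 0)  // ctz is only called on a non-zero word
                return static_cast<std::int64_t>(w * 64 + __builtin_ctzll(word));
            if (++w == words.size()) return -1;
            word = words[w];
        }
    }
    std::int64_t min() const { return successor(0); }
    std::int64_t max() const {
        for (std::size_t w = words.size(); w-- > 0;)
            if (words[w] != 0)
                return static_cast<std::int64_t>(w * 64 + 63 - __builtin_clzll(words[w]));
        return -1;
    }
    std::vector<std::uint64_t> decode() const {
        std::vector<std::uint64_t> out;
        for (std::size_t w = 0; w < words.size(); ++w) {
            std::uint64_t word = words[w];
            while (word != 0) {
                out.push_back(w * 64 + __builtin_ctzll(word));
                word &= word - 1;  // clear the lowest set bit
            }
        }
        return out;
    }
};
```

Both builtins are undefined for a zero argument, which is why every call is guarded by a non-zero check.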

  30. Roaring [Lemire et al. 2013]
 Assume u = 2^32: the universe is split into ≤ 2^16 spans of 2^16 values each.
 Dense: cardinality > 4096
 Sparse: otherwise
 This ensures at most 16 bits per key (excluding overhead).
 Dense spans are represented with bitmaps of 2^16 bits.
 Sparse spans are represented with sorted-arrays of 16-bit integers.
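
The 4096 threshold follows from simple arithmetic, sketched below. The names `SpanKind`, `classify_span` and `span_bits` are invented for this illustration and are not part of any Roaring library API; overhead is ignored, as on the slide.

```cpp
#include <cassert>   // used by the usage checks
#include <cstddef>

enum class SpanKind { Sparse, Dense };

constexpr std::size_t kSpanSize = 1u << 16;     // values per span
constexpr std::size_t kDenseThreshold = 4096;   // threshold from the slide

// Dense if the span holds more than 4096 keys, sparse otherwise.
constexpr SpanKind classify_span(std::size_t cardinality) {
    return cardinality > kDenseThreshold ? SpanKind::Dense : SpanKind::Sparse;
}

// Bits used by a span: a 2^16-bit bitmap if dense,
// one 16-bit integer per key if sparse.
constexpr std::size_t span_bits(std::size_t cardinality) {
    return classify_span(cardinality) == SpanKind::Dense ? kSpanSize
                                                         : 16 * cardinality;
}

// At the threshold the two encodings cost the same: 4096 * 16 = 2^16 bits.
static_assert(span_bits(kDenseThreshold) == kSpanSize, "break-even point");
```

Below the threshold the sorted array costs exactly 16 bits per key; above it the bitmap costs 2^16 / cardinality < 16 bits per key, so neither case exceeds 16 bits per key.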

  31. Slicing
 Assume u = 2^32: the universe is split into ≤ 2^16 slices of 2^16 values each.
 Dense: cardinality > 2^16/2 (ensure at most 2 bits per key)
 Each sparse slice is in turn split into ≤ 2^8 slices of 2^8 values each.
 Dense: cardinality ≥ 31 (ensure at most 8 bits per key)
 Dense slices are represented with bitmaps of 2^16 or 2^8 bits.
 Sparse slices are represented with sorted-arrays of 8-bit integers.
