
The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. Paolo Ferragina and Giorgio Vinciguerra. pgm.di.unipi.it


  1. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. Paolo Ferragina, Giorgio Vinciguerra

  2. The predecessor search problem
     • Given n sorted input keys (e.g. integers), implement predecessor(x) = “largest key ≤ x”
     • Range queries and joins in DBs, conjunctive queries in search engines, IP routing…
     • Lookups alone are much easier; just use Cuckoo hashing for lookups in at most 2 memory accesses (without sorting data!)
     Keys (positions 1 to n): 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
     predecessor(36) = 36, predecessor(50) = 48
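The definition on this slide can be made concrete with a minimal sketch in Python. It uses the standard-library `bisect` module on the slide's example keys; the function name `predecessor` mirrors the slide, while returning `None` when no key is ≤ x is an assumption of this sketch.

```python
from bisect import bisect_right

def predecessor(keys, x):
    """Return the largest key <= x in the sorted list `keys`, or None if there is none."""
    i = bisect_right(keys, x)  # number of keys that are <= x
    return keys[i - 1] if i > 0 else None

keys = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95]

print(predecessor(keys, 36))  # 36
print(predecessor(keys, 50))  # 48
```

A plain binary search like this takes O(log n) steps; the indexes on the following slides aim to narrow the searched range first.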

  3. Indexes: a B-tree maps key = 36 to position = 11 in the sorted keys 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 (positions 1 to n; values associated to keys are not shown)

  4. Input data as pairs (key, position): plot of positions against keys for 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 (positions 1 to n). Ao et al. [VLDB 2011]

  5. Input data as pairs (key, position): the same plot, with the first keys 2, 11, 13, 15 marked at positions 1, 2, 3, 4. Ao et al. [VLDB 2011]

  6. Learned indexes: a black-box model, trained on a dataset of pairs (key, pos) D = {(2, 1), (11, 2), …, (95, n)}, maps a key to an (approximate) position in the sorted keys 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95. The search is completed by a binary search in [position − error, position + error]. Ao et al. [VLDB 2011], Kraska et al. [SIGMOD 2018]
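The predict-then-correct idea on this slide can be sketched as follows. This is only an illustration: the "model" is a single straight line through the first and last (key, position) pairs, which is an assumption of this sketch and not the model used by Ao et al. or Kraska et al. Positions are 0-based here, and the window is written in terms of insertion points, so it is one slot wider on the right than [position − error, position + error] to cover queries that are not indexed keys.

```python
from bisect import bisect_right

keys = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95]
n = len(keys)

# Toy model (assumption): the line through the first and last (key, position) pairs.
slope = (n - 1) / (keys[-1] - keys[0])

def predict(key):
    """Approximate 0-based position of `key`, clamped to [0, n - 1]."""
    return min(max(int(slope * (key - keys[0])), 0), n - 1)

# Maximum prediction error over the indexed keys, measured once after "training".
error = max(abs(predict(k) - i) for i, k in enumerate(keys))

def predecessor(x):
    """Largest key <= x, found by a binary search restricted to a window around predict(x)."""
    if x < keys[0]:
        return None
    pos = predict(x)
    lo = max(pos - error, 0)
    hi = min(pos + error + 1, n)
    i = bisect_right(keys, x, lo, hi)  # insertion point, searched only inside [lo, hi]
    return keys[i - 1]

print(predecessor(36))  # 36
print(predecessor(50))  # 48
```

The window is correct because the prediction is non-decreasing in the key, so the prediction for a query lies between the predictions for its two neighbouring indexed keys. The cost of the final binary search depends on `error`, which this toy model does not control.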

  7. The problem with learned indexes
     Fast query time and excellent space usage in practice, but no worst-case guarantees
     • Too much I/O when data is on disk
     • Unpredictable latency
     • Very slow to train
     • Unscalable to big data
     • Must be tuned for each new dataset
     • Vulnerable to adversarial inputs and queries
     • Blind to the query distribution

  8. Introducing the PGM-index
     Fast query time and excellent space usage in practice, and guaranteed worst-case bounds
     • Constant I/O when data is on disk
     • Predictable latency
     • Very fast to build
     • Scalable to big data
     • No additional tuning needed
     • Resistant to adversarial inputs and queries
     • Query distribution aware

  9. Ingredients of the PGM-index
     • Opt. piecewise linear model: fast to construct, best space usage for linear learned indexes
     • Fixed model “error” ε: control the size of the search range (like the page size in a B-tree)
     • Recursive design: adapt to the memory hierarchy and enable query-time guarantees

  10. PGM-index construction
      Step 1. Compute the optimal piecewise linear ε-approximation in O(n) time
      Keys (positions 1 to n): 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 128 140 145 146

  11. PGM-index construction
      Step 1. Compute the optimal piecewise linear ε-approximation in O(n) time
      Step 2. Store the segments as triples s_i = (key, slope, intercept)

  12. Partial memory layout of the PGM-index
      Each segment indexes a variable and potentially large sequence of keys while guaranteeing a search range size of 2ε + 1
      Segments: (2, sl, ic) (23, sl, ic) (31, sl, ic) (48, sl, ic) (71, sl, ic) (88, sl, ic) (122, sl, ic) (145, sl, ic)
      Keys (positions 1 to n): 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 128 140 145 146
      Binary search in [pos − ε, pos + ε]

  13. PGM-index construction
      Step 1. Compute the optimal piecewise linear ε-approximation in O(n) time
      Step 2. Store the segments as triples s_i = (key, slope, intercept)
      Step 3. Keep only s_i.key

  14. PGM-index construction
      Steps 1 to 3 as above; the keys kept are 2 23 31 48 71 88 122 145

  15. PGM-index construction
      Step 1. Compute the optimal piecewise linear ε-approximation in O(n) time
      Step 2. Store the segments as triples s_i = (key, slope, intercept)
      Step 3. Keep only s_i.key
      Step 4. Repeat recursively on the kept keys 2 23 31 48 71 88 122 145
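Steps 1 to 4 can be sketched in Python as follows. Note the swap: for brevity this sketch uses a greedy "shrinking cone" segmentation, not the optimal O(n) algorithm the slides refer to, so it may produce more segments than the optimum (and different ones from those drawn on the slides). The names `build_segments` and `build_pgm` are hypothetical, keys are assumed distinct and sorted, ε ≥ 1, positions are 0-based, and each segment stores as its "intercept" the position of its first key.

```python
def build_segments(keys, eps):
    """Greedy piecewise linear eps-approximation (NOT the optimal algorithm).

    Returns triples (key, slope, intercept) such that, for every key k covered
    by a segment, intercept + slope * (k - key) is within eps of k's position.
    """
    segments, n, start = [], len(keys), 0
    while start < n:
        k0 = keys[start]
        lo, hi = 0.0, float("inf")  # range of slopes still feasible for this segment
        end = start + 1
        while end < n:
            dx, dy = keys[end] - k0, end - start
            new_lo = max(lo, (dy - eps) / dx)
            new_hi = min(hi, (dy + eps) / dx)
            if new_lo > new_hi:  # no slope fits keys[end] too: close the segment
                break
            lo, hi = new_lo, new_hi
            end += 1
        slope = 0.0 if end == start + 1 else (lo + hi) / 2
        segments.append((k0, slope, start))  # Step 2: store the triple
        start = end
    return segments

def build_pgm(keys, eps):
    """Steps 1-4: segment the keys, keep the segments' first keys, repeat."""
    levels = [build_segments(keys, eps)]
    while len(levels[-1]) > 1:
        first_keys = [s[0] for s in levels[-1]]          # Step 3
        levels.append(build_segments(first_keys, eps))   # Step 4
    return levels  # levels[0] is the bottom level, levels[-1] the root

keys = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
        60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123, 128, 140, 145, 146]
levels = build_pgm(keys, eps=1)
print([len(level) for level in levels])  # number of segments per level, bottom to root
```

The recursion terminates because, with ε ≥ 1, every segment except possibly the last one at a level covers at least two keys, so each level is strictly smaller than the one below it.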

  16. Memory layout of the PGM-index
      Very fast construction, a couple of seconds for 1 billion keys. It can also be constructed in a single pass.
      Root level: (2, sl, ic)
      Middle level: (2, sl, ic) (31, sl, ic) (88, sl, ic) (145, sl, ic)
      Bottom level: (2, sl, ic) (23, sl, ic) (31, sl, ic) (48, sl, ic) (71, sl, ic) (88, sl, ic) (122, sl, ic) (145, sl, ic)
      Keys (positions 1 to n): 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 128 140 145 146

  17. Predecessor search with ε = 1: predecessor(57)?
      The search descends the same three levels as in the memory layout, examining a range of size 2ε + 1 at each level and finally in the keys.
      B = disk page-size. The PGM-index is never worse in time and space than a B-tree: set ε = Θ(B) for queries in O(log_B n) I/Os and O(n/ε) space.
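The bounds on this slide can be traced with a short calculation. It assumes the property, shown in the PGM-index paper, that every segment of an optimal ε-approximation covers at least 2ε keys; the symbols m_ℓ (number of segments at level ℓ, counted from the bottom) and L (number of levels) are introduced only for this sketch.

```latex
m_1 \le \frac{n}{2\varepsilon}, \qquad
m_\ell \le \frac{n}{(2\varepsilon)^{\ell}}, \qquad
L = O(\log_{2\varepsilon} n)
\\[4pt]
\text{space} \;=\; \sum_{\ell=1}^{L} m_\ell
\;\le\; \sum_{\ell \ge 1} \frac{n}{(2\varepsilon)^{\ell}}
\;=\; \frac{n}{2\varepsilon - 1}
\;=\; O(n/\varepsilon)
\\[4pt]
\text{I/Os} \;=\; L \cdot O\!\left(\left\lceil \frac{2\varepsilon + 1}{B} \right\rceil\right)
\;=\; O(\log_B n) \quad \text{when } \varepsilon = \Theta(B)
```

Each level costs a constant number of page reads because the range of 2ε + 1 entries examined there spans O(1) pages of size B.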

  18. Experiments

  19. Experiments
      Fastest CSS-tree (128-byte pages): ≈ 350 MB
      Matched by PGM with 2ε set to 256: ≈ 4 MB (−83×)
      Plot labels: avg search range, page size, 2ε
      Intel Xeon Gold 5118 CPU @ 2.30GHz, data held in main memory

  20. Experiments on updates (plot)

  21. Experiments on updates
      B+-tree page size, index size:
      • 128-byte: 5.65 GB (3891×)
      • 256-byte: 2.98 GB (2051×)
      • 512-byte: 1.66 GB (1140×)
      • 1024-byte: 0.89 GB (611×)
      Dynamic PGM-index: 1.45 MB

  22. Why is the PGM so effective?
      • A B-tree node (page size B) stores keys k_1 k_2 … k_B. In one I/O and O(log_2 B) steps the search range is reduced by 1/B.
      • A PGM-index node (2ε = B) stores a segment (k, sl, ic). Here the search range is reduced by at least 1/B w.h.p.
      Ferragina et al. [ICML 2020]

  23. New experiments with tuned Linear RMI
      • 8-byte keys, 8-byte payload
      • Tuned Linear RMI and PGM have the same size
      • 10M predecessor searches, uniform query workload
      PGM improved the empirical performance of a tuned Linear RMI. They tested positive lookups; here we test predecessor queries.
      Each PGM took about 2 seconds to construct. RMI took 30× more!
      New tuned Linear RMI implementation and datasets from Marcus et al., 2020 [arXiv:2006.12804]

  24. New experiments with tuned Hybrid RMI
      • 8-byte keys, 8-byte payload
      • RMI with non-linear models, tuned via grid search
      • 10M predecessor searches, uniform query workload
      Plot annotations: avg search range 2^8, max search range 2^8; avg 2^15, max 2^29
      Each PGM took about 2 seconds to construct. Hybrid RMI took 40× (90× with tuning) more!
      New tuned Hybrid RMI implementation and datasets from Marcus et al., 2020 [arXiv:2006.12804]

  25. New experiments
      • 8-byte keys, 8-byte payload
      • RMI with non-linear models, tuned via grid search
      • 10M predecessor searches, adversarial query workload
      About adversarial data inputs, see Kornaropoulos et al., 2020 [arXiv:2008.00297]
      New tuned Hybrid RMI implementation and datasets from Marcus et al., 2020 [arXiv:2006.12804]

  26. More results in the paper
      • Query-distribution aware: minimise average query time wrt a given query workload
      • Index compression: reduce the space of the index by a further 52% via the compression of slopes and intercepts
      • Multicriteria tuner: minimise query time under a given space constraint and vice versa in a few dozens of seconds
