Superseding Traditional Indexes with Multicriteria Data Structures
Giorgio Vinciguerra, PhD student in Computer Science
giorgio.vinciguerra@phd.unipi.it
Outline
1. Multicriteria data structures
2. The dictionary problem
   • External memory model
   • Multiway trees
   • Novel approaches
   • Our results
3. Bonus slides
Motivation
1. Algorithms and data structures often offer a collection of different trade-offs (e.g. time, space occupancy, energy consumption, ...)
2. Software engineers have to choose the one that best fits the needs of their application
3. These needs change with time, data, devices, and users
Multicriteria data structures
A multicriteria data structure selects the best data structure within some performance and computational constraints.
FAMILY: of data structures
CONSTRAINTS: space, time, energy, ...
OPTIMISATION: find the best structure
The dictionary problem
We are given a set of "objects", and we are asked to store them succinctly and to support efficient retrieval.
Examples: databases, file systems, search engines, social networks.
Memory hierarchy
[Figure: the memory hierarchy from CPU caches to external storage, with typical capacities and access latencies: L1 32 KB, L2 256 KB, L3 3 MB, RAM 8 GB (~100 ns), SSD 256 GB (~16 µs), HDD (~3 ms), slowest "∞ TB" tier (~150 ms).]
The External Memory (aka I/O) model
1. Internal memory (RAM) of capacity M
2. External memory (disk) of unlimited capacity
3. RAM and disk exchange blocks of size B
4. Count the number of block transfers in big-O notation instead of the number of operations
The same model applies between any two levels of the hierarchy: B ≈ 4 KiB for a disk block, B = 64 bytes for a cache line (LLC).
Back to the dictionary problem
We are given a set of "objects" (integers or reals), and we are asked to store them succinctly and to support efficient retrieval (e.g. range and point queries).
[Figure: an unsorted array of keys: 61 71 12 15 18 1 24 22 88 34 3 10 5 13 55 44 60 2 5 74 90 81]
Predecessor search & range queries
Sorted keys (positions 1..n): 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123
pred(36) = 36
pred(50) = 48
range(67, 110) returns all the keys between 67 and 110
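The two queries on this slide can be sketched with Python's bisect module on the slide's own array (a didactic sketch, not part of the talk):

```python
from bisect import bisect_left, bisect_right

A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
     60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]

def pred(a, q):
    """Largest key in the sorted list a that is <= q (predecessor search)."""
    i = bisect_right(a, q)
    return a[i - 1] if i > 0 else None

def range_query(a, lo, hi):
    """All keys x in the sorted list a with lo <= x <= hi."""
    return a[bisect_left(a, lo):bisect_right(a, hi)]

print(pred(A, 36))              # -> 36
print(pred(A, 50))              # -> 48
print(range_query(A, 67, 110))  # -> [71, 73, 74, 76, 88, 95, 99, 102]
```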
Baseline solutions for predecessor search
Sorted keys (n keys, block size B = 4): 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

Solution       | RAM model, worst-case time | EM model, worst-case I/Os | EM model, best-case I/Os
Scan           | O(n)                       | O(n/B)                    | O(1)
Binary search  | O(log n)                   | O(log(n/B))               | O(log(n/B))
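To make the table concrete, a back-of-the-envelope count of block transfers. The 4 KiB block and 8-byte keys are assumptions, not from the slide; 715M is the size of the web-logs dataset used later in the talk:

```python
import math

n = 715_000_000          # number of keys (web-logs dataset from the talk)
key_bytes = 8            # assumed key size
B = 4096 // key_bytes    # keys per 4 KiB disk block -> 512

scan_ios = math.ceil(n / B)               # O(n/B): every block is read once
binary_ios = math.ceil(math.log2(n / B))  # O(log(n/B)): the last log B halvings
                                          # stay inside a single block
print(scan_ios, binary_ios)
```

Even in the worst case, binary search touches about 21 blocks while a scan touches about 1.4 million, which is why the I/O model, not the operation count, predicts real performance here.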
B+-trees
[Figure: a B+-tree with fanout B = 3. Root routers: 31 76 ∞; second level: 12 23 31 | 55 71 76 | 122 ∞ ∞; the leaves hold the n sorted keys 2 11 12 ... 123. A search for 48 follows a single root-to-leaf path; each node holds B routers and B + 1 child pointers.]

Solution       | Space | RAM model, worst-case time | EM model, worst-case I/Os | EM model, best-case I/Os
Scan           | O(1)  | O(n)                       | O(n/B)                    | O(1)
Binary search  | O(1)  | O(log n)                   | O(log(n/B))               | O(log(n/B))
B+-tree        | O(n)  | O(log n)                   | O(log_B n)                | O(log_B n)
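A minimal static sketch of the idea, assuming one router level over fixed-size leaves; a real B+-tree has multiple router levels and supports updates:

```python
from bisect import bisect_left, bisect_right

def build_index(keys, B):
    """Pack sorted keys into leaves of B keys; keep one router (max key) per leaf."""
    leaves = [keys[i:i + B] for i in range(0, len(keys), B)]
    routers = [leaf[-1] for leaf in leaves]
    return routers, leaves

def btree_pred(routers, leaves, q):
    """Predecessor of q: route to the candidate leaf, then search inside it."""
    j = min(bisect_left(routers, q), len(leaves) - 1)
    i = bisect_right(leaves[j], q)
    if i > 0:
        return leaves[j][i - 1]
    # q precedes this leaf entirely: the answer is the previous leaf's last key
    return leaves[j - 1][-1] if j > 0 else None

A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
     60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]
routers, leaves = build_index(A, 4)
print(btree_pred(routers, leaves, 48))  # -> 48
```

With fanout B, each query inspects one router array and one leaf, mirroring the O(log_B n) root-to-leaf paths of the slide's tree.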
B-trees are everywhere
1. "B-trees have become, de facto, a standard for file organization" (Comer, The Ubiquitous B-Tree, ACM Computing Surveys, 1979)
2. This is still true today
B-trees are everywhere ˜ 31 76 ∞ 12 23 31 122 ∞ ∞ 55 71 76 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 1 𝑜 25
B-trees are machine learning models
"All existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes." (Kraska et al., The Case for Learned Index Structures, 2018)
A model trained on the dataset {(key_i, i)}_{i=1,...,n} maps a key to a predicted position; the true position is guaranteed to lie in [position - ε, position + ε].
The Recursive Model Index (RMI)
Stage 1: Model 1.1
Stage 2: Model 2.1, Model 2.2, Model 2.3
Stage 3: Model 3.1, Model 3.2, Model 3.3, Model 3.4
A key descends through one model per stage; the last model predicts a position pos, and the key is searched in [pos - ε, pos + ε].
Construction of the RMI
1. Train the root model on the dataset
2. Use it to distribute the keys to the models in the next stage
3. Repeat for each model in the next stage (on smaller and smaller datasets)
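The three steps above can be sketched as a toy two-stage RMI with linear models only. The model count, the least-squares training, and the unbounded final correction walk are illustrative assumptions; the real RMI bounds the final search by the error recorded at build time:

```python
def fit_linear(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var) if var else 0.0
    return slope, my - slope * mx

def build_rmi(keys, n_models):
    """Step 1: train the root. Step 2: distribute keys. Step 3: train each leaf."""
    n = len(keys)
    root = fit_linear(keys, range(n))
    buckets = [([], []) for _ in range(n_models)]
    for i, k in enumerate(keys):
        b = min(max(int((root[0] * k + root[1]) * n_models / n), 0), n_models - 1)
        buckets[b][0].append(k)
        buckets[b][1].append(i)
    leaves = [fit_linear(xs, ys) if xs else (0.0, 0.0) for xs, ys in buckets]
    return root, leaves

def rmi_pred(keys, root, leaves, q):
    """Follow root -> leaf model, then correct the predicted position."""
    n = len(keys)
    b = min(max(int((root[0] * q + root[1]) * len(leaves) / n), 0), len(leaves) - 1)
    s, t = leaves[b]
    i = min(max(int(s * q + t), 0), n - 1)
    while i > 0 and keys[i] > q:           # walk left if we overshot
        i -= 1
    while i < n - 1 and keys[i + 1] <= q:  # walk right if we undershot
        i += 1
    return keys[i] if keys[i] <= q else None

A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
     60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]
root, leaves = build_rmi(A, 3)
print(rmi_pred(A, root, leaves, 50))  # -> 48
```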
Performance of the RMI
[Figure only.]
Limitations of the RMI
1. Fixed structure with many hyperparameters: number of stages, number of models in each stage, kinds of regression models
2. No a priori error guarantees, so it is difficult to predict query latencies
3. Models are agnostic to the power of the models below, which can result in underused models (a waste of space)
Our idea (submitted)
Compute the optimal piecewise linear approximation with guaranteed error ε in O(n) time.
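The optimal O(n) algorithm is not reproduced here; the following greedy sketch gives the same guarantee (every key's rank is predicted within ±ε) but may emit more segments than the optimum. Distinct keys and 0-based ranks are assumed:

```python
def greedy_segments(keys, eps):
    """Split sorted keys into segments whose linear model predicts each key's
    rank within +/- eps. Each segment's line passes through its first point;
    the feasible slope interval [lo, hi] shrinks as points are added."""
    segs = []  # (first_key, slope, intercept) per segment
    i, n = 0, len(keys)
    while i < n:
        x0, y0 = keys[i], i
        lo, hi = float("-inf"), float("inf")
        j = i + 1
        while j < n:
            dx = keys[j] - x0
            new_lo = max(lo, (j - eps - y0) / dx)
            new_hi = min(hi, (j + eps - y0) / dx)
            if new_lo > new_hi:   # no slope fits all points: close the segment
                break
            lo, hi = new_lo, new_hi
            j += 1
        slope = 0.0 if j == i + 1 else (lo + hi) / 2
        segs.append((x0, slope, y0 - slope * x0))
        i = j
    return segs

A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
     60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]
segs = greedy_segments(A, 2.0)
print(len(segs), "segments for", len(A), "keys")
```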
Our idea (submitted)
Save the m segments in a vector as triples s_i = (key, slope, intercept).
Our idea (submitted)
Drop all the points except the segments' first keys s_i.key.
Our idea (submitted)
...and repeat on the segment keys!
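A single-level query sketch using the segment triples from the previous slides (the final correction walk below stands in for the binary search in [pos - ε, pos + ε]; a full PGM-index repeats the segment lookup level by level):

```python
from bisect import bisect_right

def pgm_pred(keys, segs, q):
    """Predecessor of q given segment triples (key, slope, intercept):
    pick the segment covering q, predict a position, then correct it."""
    firsts = [s[0] for s in segs]
    x0, slope, icpt = segs[max(bisect_right(firsts, q) - 1, 0)]
    n = len(keys)
    i = min(max(int(slope * q + icpt), 0), n - 1)
    while i > 0 and keys[i] > q:
        i -= 1
    while i < n - 1 and keys[i + 1] <= q:
        i += 1
    return keys[i] if keys[i] <= q else None

# One hand-made segment over the slide's array, just for illustration.
A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
     60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]
slope = (len(A) - 1) / (A[-1] - A[0])
segs = [(A[0], slope, -slope * A[0])]
print(pgm_pred(A, segs, 50))  # -> 48
```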
Memory layout of the PGM-index
Some asymptotic bounds
Data structure      | Space of index | RAM model, worst-case time | EM model, worst-case I/Os | EM model, best-case I/Os
Plain sorted array  | O(1)           | O(log n)                   | O(log(n/B))               | O(log(n/B))
Multiway tree       | Θ(n)           | O(log n)                   | O(log_B n)                | O(log_B n)
RMI                 | Fixed          | O(?)                       | O(?)                      | O(1)
PGM-index           | Θ(m)           | O(log m)                   | O(log_c m)                | O(1)
Here n is the number of keys, m the number of segments, ε the error, and c ≥ 2ε = Ω(B).
The PGM-index in practice
Takes 3 seconds to compute. Datasets: Web logs (715M points), Longitude (166M points), IoT (26M points).
[Figure: number of segments as a function of the error ε of the position estimate, on the whole datasets and on their first 25M entries.]
Space-time performance
How to explore this space of trade-offs?
Given a space bound S, efficiently find the index that minimises the query time within space S, and vice versa.
Back to multicriteria data structures
A multicriteria data structure is defined by a family of data structures and an optimisation algorithm that selects the best data structure in the family within some computational constraints.
FAMILY: PGM-indexes, for all ε
CONSTRAINTS: space & time
OPTIMISATION: ???
The Multicriteria PGM-index
1. We designed a cost model for the space s(ε) and the time t(ε)
2. ...but we don't have a closed formula for s(ε): it depends on the input array
3. We fit s(ε) with a power law of the form a·ε^(-b)
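Fitting s(ε) = a·ε^(-b) reduces to a line fit in log-log space, since log s = log a - b·log ε. A sketch on synthetic data (the measurements from the talk are not reproduced):

```python
import math

def fit_power_law(eps_values, sizes):
    """Fit sizes ~ a * eps**(-b) by least squares on (log eps, log size)."""
    xs = [math.log(e) for e in eps_values]
    ys = [math.log(s) for s in sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope  # (a, b)

# Synthetic check: data generated from s(eps) = 1000 * eps**-1.5
eps_values = [2, 4, 8, 16, 32, 64]
sizes = [1000 * e ** -1.5 for e in eps_values]
a, b = fit_power_law(eps_values, sizes)
print(round(a, 3), round(b, 3))  # -> 1000.0 1.5
```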
Under the hood
1. A sort of interpolation search over ε values
2. Each iteration improves the fit of a·ε^(-b), updating a and b
3. The ε-iterate is biased towards the midpoint of a binary search step
4. In practice, given a space (time) bound, it finds the fastest (most compact) index for 715M keys in under 1 minute
[Figure: the sequence of ε-iterates converging to the optimum ε* in the space-ε plane.]
Future work
1. Insertions and deletions
2. Non-linear models
3. Compression
Bonus slides
Tools that you may find useful
3× faster than py_distance, 117× faster than scipy.spatial.distance.euclidean