Superseding Traditional Indexes with Multicriteria Data Structures
Giorgio Vinciguerra, PhD student in Computer Science
giorgio.vinciguerra@phd.unipi.it
Outline
1. Multicriteria data structures
2. The dictionary problem
   • External memory model
   • Multiway trees
   • Novel approaches
   • Our results
3. Bonus slides
Motivation
1. Algorithms and data structures often offer a collection of different trade-offs (e.g. time, space occupancy, energy consumption, ...)
2. Software engineers have to choose the one that best fits the needs of their application
3. These needs change with time, data, devices, and users
Multicriteria data structures
A multicriteria data structure selects the best data structure within some performance and computational constraints.
FAMILY: of data structures
CONSTRAINTS: space, time, energy, ...
OPTIMISATION: find the best structure
The dictionary problem
We are given a set of "objects", and we are asked to store them succinctly and to support efficient retrieval.
Examples: databases, file systems, search engines, social networks.
Memory hierarchy
[Figure: the memory hierarchy from CPU caches to external storage, with typical capacities and access latencies: L1 32 KB, L2 256 KB, L3 3 MB, RAM 8 GB (~100 ns), SSD 256 GB (~16 µs), HDD (~3 ms), slowest "∞ TB" tier (~150 ms).]
The External Memory (aka I/O) model
1. Internal memory (RAM) of capacity M
2. External memory (disk) of unlimited capacity
3. RAM and disk exchange blocks of size B
4. Count the number of block transfers in big-O notation instead of the number of operations
The same model applies between any two levels of the hierarchy: B ≈ 4 KiB for a disk block, B = 64 bytes for a cache line (LLC).
Back to the dictionary problem
We are given a set of "objects" (integers or reals), and we are asked to store them succinctly and to support efficient retrieval (e.g. range and point queries).
[Figure: an unsorted array of keys: 61 71 12 15 18 1 24 22 88 34 3 10 5 13 55 44 60 2 5 74 90 81]
Predecessor search & range queries
Sorted keys (positions 1..n): 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123
pred(36) = 36
pred(50) = 48
range(67, 110) returns all the keys between 67 and 110
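The two queries on this slide can be sketched with Python's bisect module on the slide's own array (a didactic sketch, not part of the talk):

```python
from bisect import bisect_left, bisect_right

A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
     60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]

def pred(a, q):
    """Largest key in the sorted list a that is <= q (predecessor search)."""
    i = bisect_right(a, q)
    return a[i - 1] if i > 0 else None

def range_query(a, lo, hi):
    """All keys x in the sorted list a with lo <= x <= hi."""
    return a[bisect_left(a, lo):bisect_right(a, hi)]

print(pred(A, 36))              # -> 36
print(pred(A, 50))              # -> 48
print(range_query(A, 67, 110))  # -> [71, 73, 74, 76, 88, 95, 99, 102]
```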
Baseline solutions for predecessor search
Sorted keys (n keys, block size B = 4): 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

Solution       | RAM model, worst-case time | EM model, worst-case I/Os | EM model, best-case I/Os
Scan           | O(n)                       | O(n/B)                    | O(1)
Binary search  | O(log n)                   | O(log(n/B))               | O(log(n/B))
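To make the table concrete, a back-of-the-envelope count of block transfers. The 4 KiB block and 8-byte keys are assumptions, not from the slide; 715M is the size of the web-logs dataset used later in the talk:

```python
import math

n = 715_000_000          # number of keys (web-logs dataset from the talk)
key_bytes = 8            # assumed key size
B = 4096 // key_bytes    # keys per 4 KiB disk block -> 512

scan_ios = math.ceil(n / B)               # O(n/B): every block is read once
binary_ios = math.ceil(math.log2(n / B))  # O(log(n/B)): the last log B halvings
                                          # stay inside a single block
print(scan_ios, binary_ios)
```

Even in the worst case, binary search touches about 21 blocks while a scan touches about 1.4 million, which is why the I/O model, not the operation count, predicts real performance here.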
B+-trees
[Figure: a B+-tree with fanout B = 3. Root routers: 31 76 ∞; second level: 12 23 31 | 55 71 76 | 122 ∞ ∞; the leaves hold the n sorted keys 2 11 12 ... 123. A search for 48 follows a single root-to-leaf path; each node holds B routers and B + 1 child pointers.]

Solution       | Space | RAM model, worst-case time | EM model, worst-case I/Os | EM model, best-case I/Os
Scan           | O(1)  | O(n)                       | O(n/B)                    | O(1)
Binary search  | O(1)  | O(log n)                   | O(log(n/B))               | O(log(n/B))
B+-tree        | O(n)  | O(log n)                   | O(log_B n)                | O(log_B n)
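A minimal static sketch of the idea, assuming one router level over fixed-size leaves; a real B+-tree has multiple router levels and supports updates:

```python
from bisect import bisect_left, bisect_right

def build_index(keys, B):
    """Pack sorted keys into leaves of B keys; keep one router (max key) per leaf."""
    leaves = [keys[i:i + B] for i in range(0, len(keys), B)]
    routers = [leaf[-1] for leaf in leaves]
    return routers, leaves

def btree_pred(routers, leaves, q):
    """Predecessor of q: route to the candidate leaf, then search inside it."""
    j = min(bisect_left(routers, q), len(leaves) - 1)
    i = bisect_right(leaves[j], q)
    if i > 0:
        return leaves[j][i - 1]
    # q precedes this leaf entirely: the answer is the previous leaf's last key
    return leaves[j - 1][-1] if j > 0 else None

A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
     60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]
routers, leaves = build_index(A, 4)
print(btree_pred(routers, leaves, 48))  # -> 48
```

With fanout B, each query inspects one router array and one leaf, mirroring the O(log_B n) root-to-leaf paths of the slide's tree.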
B-trees are everywhere
1. "B-trees have become, de facto, a standard for file organization" (Comer, The Ubiquitous B-Tree, ACM Computing Surveys, 1979)
2. This is still true today
B-trees are everywhere ˜ 31 76 ∞ 12 23 31 122 ∞ ∞ 55 71 76 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 1 𝑜 25
B-trees are machine learning models
"All existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes." (Kraska et al., The Case for Learned Index Structures, 2018)
A model trained on the dataset {(key_i, i)}_{i=1,...,n} maps a key to a predicted position; the true position is guaranteed to lie in [position - ε, position + ε].
The Recursive Model Index (RMI)
Stage 1: Model 1.1
Stage 2: Model 2.1, Model 2.2, Model 2.3
Stage 3: Model 3.1, Model 3.2, Model 3.3, Model 3.4
A key descends through one model per stage; the last model predicts a position pos, and the key is searched in [pos - ε, pos + ε].
Construction of the RMI
1. Train the root model on the dataset
2. Use it to distribute the keys to the models in the next stage
3. Repeat for each model in the next stage (on smaller and smaller datasets)
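The three steps above can be sketched as a toy two-stage RMI with linear models only. The model count, the least-squares training, and the unbounded final correction walk are illustrative assumptions; the real RMI bounds the final search by the error recorded at build time:

```python
def fit_linear(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var) if var else 0.0
    return slope, my - slope * mx

def build_rmi(keys, n_models):
    """Step 1: train the root. Step 2: distribute keys. Step 3: train each leaf."""
    n = len(keys)
    root = fit_linear(keys, range(n))
    buckets = [([], []) for _ in range(n_models)]
    for i, k in enumerate(keys):
        b = min(max(int((root[0] * k + root[1]) * n_models / n), 0), n_models - 1)
        buckets[b][0].append(k)
        buckets[b][1].append(i)
    leaves = [fit_linear(xs, ys) if xs else (0.0, 0.0) for xs, ys in buckets]
    return root, leaves

def rmi_pred(keys, root, leaves, q):
    """Follow root -> leaf model, then correct the predicted position."""
    n = len(keys)
    b = min(max(int((root[0] * q + root[1]) * len(leaves) / n), 0), len(leaves) - 1)
    s, t = leaves[b]
    i = min(max(int(s * q + t), 0), n - 1)
    while i > 0 and keys[i] > q:           # walk left if we overshot
        i -= 1
    while i < n - 1 and keys[i + 1] <= q:  # walk right if we undershot
        i += 1
    return keys[i] if keys[i] <= q else None

A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
     60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]
root, leaves = build_rmi(A, 3)
print(rmi_pred(A, root, leaves, 50))  # -> 48
```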
Performance of the RMI
[Figure only.]
Limitations of the RMI
1. Fixed structure with many hyperparameters: number of stages, number of models in each stage, kinds of regression models
2. No a priori error guarantees, so it is difficult to predict query latencies
3. Models are agnostic to the power of the models below, which can result in underused models (a waste of space)
Our idea (submitted)
Compute the optimal piecewise linear approximation with guaranteed error ε in O(n) time.
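The optimal O(n) algorithm is not reproduced here; the following greedy sketch gives the same guarantee (every key's rank is predicted within ±ε) but may emit more segments than the optimum. Distinct keys and 0-based ranks are assumed:

```python
def greedy_segments(keys, eps):
    """Split sorted keys into segments whose linear model predicts each key's
    rank within +/- eps. Each segment's line passes through its first point;
    the feasible slope interval [lo, hi] shrinks as points are added."""
    segs = []  # (first_key, slope, intercept) per segment
    i, n = 0, len(keys)
    while i < n:
        x0, y0 = keys[i], i
        lo, hi = float("-inf"), float("inf")
        j = i + 1
        while j < n:
            dx = keys[j] - x0
            new_lo = max(lo, (j - eps - y0) / dx)
            new_hi = min(hi, (j + eps - y0) / dx)
            if new_lo > new_hi:   # no slope fits all points: close the segment
                break
            lo, hi = new_lo, new_hi
            j += 1
        slope = 0.0 if j == i + 1 else (lo + hi) / 2
        segs.append((x0, slope, y0 - slope * x0))
        i = j
    return segs

A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
     60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]
segs = greedy_segments(A, 2.0)
print(len(segs), "segments for", len(A), "keys")
```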
Our idea (submitted)
Save the m segments in a vector as triples s_i = (key, slope, intercept).
Our idea (submitted)
Drop all the points except the segments' first keys s_i.key.
Our idea (submitted)
...and repeat on the segment keys!
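A single-level query sketch using the segment triples from the previous slides (the final correction walk below stands in for the binary search in [pos - ε, pos + ε]; a full PGM-index repeats the segment lookup level by level):

```python
from bisect import bisect_right

def pgm_pred(keys, segs, q):
    """Predecessor of q given segment triples (key, slope, intercept):
    pick the segment covering q, predict a position, then correct it."""
    firsts = [s[0] for s in segs]
    x0, slope, icpt = segs[max(bisect_right(firsts, q) - 1, 0)]
    n = len(keys)
    i = min(max(int(slope * q + icpt), 0), n - 1)
    while i > 0 and keys[i] > q:
        i -= 1
    while i < n - 1 and keys[i + 1] <= q:
        i += 1
    return keys[i] if keys[i] <= q else None

# One hand-made segment over the slide's array, just for illustration.
A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
     60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]
slope = (len(A) - 1) / (A[-1] - A[0])
segs = [(A[0], slope, -slope * A[0])]
print(pgm_pred(A, segs, 50))  # -> 48
```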
Memory layout of the PGM-index
Some asymptotic bounds
Data structure      | Space of index | RAM model, worst-case time | EM model, worst-case I/Os | EM model, best-case I/Os
Plain sorted array  | O(1)           | O(log n)                   | O(log(n/B))               | O(log(n/B))
Multiway tree       | Θ(n)           | O(log n)                   | O(log_B n)                | O(log_B n)
RMI                 | Fixed          | O(?)                       | O(?)                      | O(1)
PGM-index           | Θ(m)           | O(log m)                   | O(log_c m)                | O(1)
Here n is the number of keys, m the number of segments, ε the error, and c ≥ 2ε = Ω(B).
The PGM-index in practice
Takes 3 seconds to compute. Datasets: Web logs (715M points), Longitude (166M points), IoT (26M points).
[Figure: number of segments as a function of the error ε of the position estimate, on the whole datasets and on their first 25M entries.]
Space-time performance
How to explore this space of trade-offs?
Given a space bound S, efficiently find the index that minimises the query time within space S, and vice versa.
Back to multicriteria data structures
A multicriteria data structure is defined by a family of data structures and an optimisation algorithm that selects the best data structure in the family within some computational constraints.
FAMILY: PGM-indexes, for all ε
CONSTRAINTS: space & time
OPTIMISATION: ???
The Multicriteria PGM-index
1. We designed a cost model for the space s(ε) and the time t(ε)
2. ...but we don't have a closed formula for s(ε): it depends on the input array
3. We fit s(ε) with a power law of the form a·ε^(-b)
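Fitting s(ε) = a·ε^(-b) reduces to a line fit in log-log space, since log s = log a - b·log ε. A sketch on synthetic data (the measurements from the talk are not reproduced):

```python
import math

def fit_power_law(eps_values, sizes):
    """Fit sizes ~ a * eps**(-b) by least squares on (log eps, log size)."""
    xs = [math.log(e) for e in eps_values]
    ys = [math.log(s) for s in sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope  # (a, b)

# Synthetic check: data generated from s(eps) = 1000 * eps**-1.5
eps_values = [2, 4, 8, 16, 32, 64]
sizes = [1000 * e ** -1.5 for e in eps_values]
a, b = fit_power_law(eps_values, sizes)
print(round(a, 3), round(b, 3))  # -> 1000.0 1.5
```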
Under the hood
1. A sort of interpolation search over ε values
2. Each iteration improves the fit of a·ε^(-b), updating a and b
3. The ε-iterate is biased towards the midpoint of a binary search step
4. In practice, given a space (time) bound, it finds the fastest (most compact) index for 715M keys in under 1 minute
[Figure: the sequence of ε-iterates converging to the optimum ε* in the space-ε plane.]
Future work
1. Insertions and deletions
2. Non-linear models
3. Compression
Bonus slides
Tools that you may find useful
3× faster than py_distance, 117× faster than scipy.spatial.distance.euclidean