The Case for Learned Index Structures
Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N.
R244
Michael Chi Ian Tang
Background
Index Structures
• Index structures are built for efficient data access
• E.g. B-Trees
Index Structures as Models
Range Index
Range Index Models = CDF Models
True position: $p^{*} = |\{x : x \le \text{key}\}| = P(X \le \text{key}) \cdot N$
$P(X \le \text{key})$ is the CDF of the keys and $N$ is the number of keys
Range Index Models = CDF Models
Model: $\text{pos} = F(\text{key}) \cdot N \approx p^{*}$, where $F(\text{key}) \approx P(X \le \text{key})$
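The CDF view of a range index can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the model F is a single least-squares line, the key set is non-empty, and the names `fit_linear_cdf`, `LearnedRangeIndex`, and `lookup` are hypothetical. The smallest and largest prediction errors seen on the stored keys bound the final binary search.

```python
import bisect


def fit_linear_cdf(keys):
    """Least-squares line mapping a key to its position in the sorted array."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    slope = cov / var if var else 0.0
    return slope, mean_p - slope * mean_k


class LearnedRangeIndex:
    """Hypothetical single-model range index (assumes a non-empty key set)."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.slope, self.intercept = fit_linear_cdf(self.keys)
        # Signed error (true position minus prediction) for every stored key.
        errs = [i - self._predict(k) for i, k in enumerate(self.keys)]
        self.min_err, self.max_err = min(errs), max(errs)

    def _predict(self, key):
        # pos = F(key) * N, clamped to a valid array position.
        pos = int(self.slope * key + self.intercept)
        return max(0, min(len(self.keys) - 1, pos))

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        pos = self._predict(key)
        lo = max(0, pos + self.min_err)
        hi = min(len(self.keys), pos + self.max_err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None


keys = [3, 8, 15, 16, 23, 42, 99]
index = LearnedRangeIndex(keys)
positions = [index.lookup(k) for k in keys]
missing = index.lookup(17)
```

The search window `[pos + min_err, pos + max_err]` is what replaces the page scan of a B-Tree in this sketch: every stored key is guaranteed to lie inside it because the bounds were measured on the stored keys themselves.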
The Recursive Model Index (RMI)
• Prediction from previous stage chooses the next model
• Progressively refine the prediction
The Recursive Model Index
• Benefits
  • Decouple execution cost & model size
  • Notion of progressively learning the shape of the CDF
  • Divide the space into smaller ranges, easier to refine the final prediction
The Recursive Model Index
• Worst case performance
  • If last-stage models do not meet the error requirement, replace them with B-Trees
  • Have the same worst-case guarantee as B-Trees
The Recursive Model Index - Training
• Loss defined as: $L = \sum_{(x,y)} (f(x) - y)^{2}$
• Simple models trained in seconds, neural nets in minutes
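The staged training can be illustrated with a small two-stage sketch. It assumes linear models at both stages, each fitted by minimising the squared-error loss, and routes a key to a second-stage model by scaling the first-stage prediction. The class name `TwoStageRMI`, the helper `fit_line`, and the parameter `num_leaves` are hypothetical, and a non-empty key set is assumed. The per-leaf maximum error recorded here is the kind of quantity a hybrid index could compare against a threshold before falling back to a B-Tree.

```python
import bisect


def fit_line(xs, ys):
    """Least-squares fit minimising sum((a * x + b - y) ** 2)."""
    n = len(xs)
    if n == 0:
        return 0.0, 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


class TwoStageRMI:
    """Hypothetical two-stage recursive model index over a non-empty key set."""

    def __init__(self, keys, num_leaves=4):
        self.keys = sorted(keys)
        self.n = len(self.keys)
        self.m = num_leaves
        # Stage 1: one model trained on all (key, position) pairs.
        self.root = fit_line(self.keys, list(range(self.n)))
        # Stage 2: each leaf is trained only on the keys routed to it.
        buckets = [([], []) for _ in range(self.m)]
        for pos, key in enumerate(self.keys):
            xs, ys = buckets[self._leaf_id(key)]
            xs.append(key)
            ys.append(pos)
        self.leaves = [fit_line(xs, ys) for xs, ys in buckets]
        # Worst absolute prediction error observed per leaf.
        self.max_err = [0] * self.m
        for pos, key in enumerate(self.keys):
            leaf = self._leaf_id(key)
            err = abs(self._predict(key) - pos)
            self.max_err[leaf] = max(self.max_err[leaf], err)

    def _leaf_id(self, key):
        a, b = self.root
        leaf = int((a * key + b) * self.m / self.n)
        return max(0, min(self.m - 1, leaf))

    def _predict(self, key):
        a, b = self.leaves[self._leaf_id(key)]
        return max(0, min(self.n - 1, int(round(a * key + b))))

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        err = self.max_err[self._leaf_id(key)]
        pos = self._predict(key)
        lo, hi = max(0, pos - err), min(self.n, pos + err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < self.n and self.keys[i] == key else None


keys = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
rmi = TwoStageRMI(keys, num_leaves=4)
found = [rmi.lookup(k) for k in keys]
absent = rmi.lookup(3)
```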
Experiments
• Integer Datasets
  • Weblogs dataset contains 200M log entries
  • Maps dataset indexed the longitude of ≈ 200M user-maintained features
  • Log-normal dataset synthesized by sampling 190M unique values
• Models
  • 2-stage RMI model having second-stage sizes (10k, 50k, 100k, and 200k)
  • Read-optimized B-Tree with different page sizes
Results
Point Index
Point Index
• Example: hash-map
• Deterministically map keys to positions inside an array
The Hash-Model Index
• Build a hash function based on the CDF of the data ($M$ is the size of the hash-map):
  $h(\text{key}) = F(\text{key}) \cdot M$, where $F(\text{key}) \approx P(X \le \text{key})$
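A CDF-based hash function can be sketched as follows. This is an illustration only: it assumes distinct numeric keys and uses a piecewise-linear interpolation through a fixed number of sampled keys as the CDF model, rather than the models used in the paper. The names `CDFHash`, `baseline_hash`, `conflicts`, and `num_knots` are hypothetical, and the BLAKE2-based baseline merely stands in for a generic randomising hash function.

```python
import bisect
import hashlib


class CDFHash:
    """h(key) = F(key) * M, with F a piecewise-linear model of the key CDF."""

    def __init__(self, keys, num_slots, num_knots=100):
        ks = sorted(set(keys))
        n = len(ks)
        self.m = num_slots
        # Knots at evenly spaced ranks, always including the first and last key.
        ranks = sorted({round(j * (n - 1) / (num_knots - 1))
                        for j in range(num_knots)})
        self.knot_keys = [ks[r] for r in ranks]
        # Fraction of keys strictly below each knot, so slots start at 0.
        self.knot_cdf = [r / n for r in ranks]

    def __call__(self, key):
        j = bisect.bisect_right(self.knot_keys, key) - 1
        j = max(0, min(len(self.knot_keys) - 2, j))
        k0, k1 = self.knot_keys[j], self.knot_keys[j + 1]
        f0, f1 = self.knot_cdf[j], self.knot_cdf[j + 1]
        f = f0 + (f1 - f0) * (key - k0) / (k1 - k0)
        return max(0, min(self.m - 1, int(f * self.m)))


def baseline_hash(key, num_slots):
    """Stand-in for a conventional randomising hash function."""
    digest = hashlib.blake2b(str(key).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_slots


def conflicts(keys, hash_fn):
    """Number of keys that land in a slot already taken by another key."""
    return len(keys) - len({hash_fn(k) for k in keys})


keys = [i * i for i in range(1000)]  # skewed, non-uniform key distribution
num_slots = 1000
learned = CDFHash(keys, num_slots)
learned_conflicts = conflicts(keys, learned)
random_conflicts = conflicts(keys, lambda k: baseline_hash(k, num_slots))
```

On this synthetic key set the learned hash spreads keys more evenly than the randomising baseline because it follows the key distribution; how large the gap is on real data depends on how well the model captures the CDF.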
The Hash-Model Index
• Main objective is to reduce the number of conflicts
• Conflicts could induce high cost depending on architecture (e.g. distributed)
Experiments
• Learned models with same settings as in range index
• Compared against MurmurHash3-like hash-function
Existence Index
Existence Index
• Example: Bloom filters
• Return whether a key exists in a dataset
• No false negatives, but has potential false positives
Bloom filters as a Classification Problem
• Binary probabilistic classification task: whether a key exists in the dataset
[Diagram: key → Model → Exists / Does not exist]
Bloom filters as a Classification Problem
• Guarantee of no false negatives
  • Overflow Bloom filter: remembers the false negatives from the model
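The combination of a classifier and an overflow Bloom filter can be sketched as below. The sketch is hedged throughout: `toy_score` is a hypothetical keyword rule standing in for a trained model, the example URLs are made up, and `BloomFilter`, `LearnedBloomFilter`, and the 0.5 threshold are illustrative choices. The point it shows is only the mechanism: every key the model would reject is inserted into the overflow filter, so no stored key is ever reported as absent.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter; one byte per bit for clarity, not compactness."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            data = f"{seed}:{key}".encode()
            digest = hashlib.blake2b(data, digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def __contains__(self, key):
        return all(self.bits[p] for p in self._positions(key))


class LearnedBloomFilter:
    """Classifier first, overflow Bloom filter for the model's false negatives."""

    def __init__(self, keys, score, threshold):
        self.score, self.threshold = score, threshold
        self.overflow = BloomFilter()
        self.num_overflow = 0
        for key in keys:
            if score(key) < threshold:  # the model would miss this key
                self.overflow.add(key)
                self.num_overflow += 1

    def __contains__(self, key):
        return self.score(key) >= self.threshold or key in self.overflow


def toy_score(url):
    """Hypothetical stand-in for a trained classifier's probability output."""
    return 0.9 if "login" in url or "verify" in url else 0.1


keys = ["paypa1-login.example", "bank-verify.example", "free-prizes.example"]
lbf = LearnedBloomFilter(keys, toy_score, threshold=0.5)
all_found = all(k in lbf for k in keys)
```

A non-key that the model scores above the threshold is still reported as present, which is the false-positive case the slides mention.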
Experiments
• Data
  • 1.7M blacklisted phishing URLs
  • Negative set: random URLs + whitelisted URLs
• Comparison
  • Learned filter: RNN with GRU
  • Normal Bloom filter
Critique
Major Contributions
1. Proposed the idea of applying machine learning to index structures
2. Offered solutions for guaranteeing performance and determinism with ML models
3. Showed significant performance improvements (time and space)
4. Inspired a new research direction (27 citations since June 2018)
Criticism
1. Details of the platform used for the experiments are not given
2. Little discussion of training time
3. Experiments on CPU only
Conclusion & Future Direction
• Proposed a new direction in database research that
  • Makes effective use of machine learning methods
  • Shows promising preliminary results
  • Inspired new research work
• Requires more details on performance evaluation
• Potential in learned algorithms, multi-dimensional indexes