
The Case for Learned Index Structures (R244), Michael Chi Ian Tang



  1. The Case for Learned Index Structures (R244)
     Michael Chi Ian Tang
     Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N.

  2. Background

  3. Index Structures
     • Index structures are built for efficient data access
     • E.g. B-Trees

  4. Index Structures as Models

  5. Index Structures as Models

  6. Range Index

  7. Range Index Models = CDF Models
     True position: $p^* = \mathrm{rank}(\mathit{key}) = |\{k \mid k \le \mathit{key}\}| = P(X \le \mathit{key}) \cdot N$
     $P(X \le \mathit{key})$ is the CDF of the keys

  8. Range Index Models = CDF Models
     Model: $\mathit{pos} = F(\mathit{key}) \cdot N \approx p^*$, with $F(\mathit{key}) \approx P(X \le \mathit{key})$
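
The idea on slides 7 and 8 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a single least-squares line as the stand-in for $F(\mathit{key}) \cdot N$, and the names `fit_linear_cdf`, `build_index`, `predict`, and `lookup` are hypothetical. The model's worst under- and over-prediction on the stored keys are recorded so that a lookup only needs a binary search inside that error window.

```python
import bisect

def fit_linear_cdf(keys):
    """Least-squares line pos ~ a*key + b over sorted keys (stand-in for F(key) * N)."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    a = cov / var if var else 0.0
    return a, mean_p - a * mean_k

def predict(a, b, key, n):
    """Predicted position, clamped to a valid array index."""
    return min(n - 1, max(0, int(a * key + b)))

def build_index(keys):
    """Fit the model and record the smallest and largest (true - predicted) error."""
    a, b = fit_linear_cdf(keys)
    errs = [i - predict(a, b, k, len(keys)) for i, k in enumerate(keys)]
    return a, b, min(errs), max(errs)

def lookup(keys, index, key):
    """Return the position of key in keys, or None if it is not stored."""
    a, b, min_err, max_err = index
    n = len(keys)
    p = predict(a, b, key, n)
    hi = max(0, min(n, p + max_err + 1))
    lo = min(hi, max(0, p + min_err))
    # Every stored key lies inside [lo, hi), so a bounded search is enough.
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < n and keys[i] == key else None

keys = [i * i for i in range(200)]      # sorted, unique, non-uniform keys
index = build_index(keys)
found = lookup(keys, index, 49)         # position of the key 49
missing = lookup(keys, index, 50)       # 50 is not stored
```

The closer the model tracks the CDF, the narrower the window `[lo, hi)` becomes, which is where the speed-up over a full search would come from.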

  9. The Recursive Model Index (RMI)
     • Prediction from previous stage chooses the next model
     • Progressively refine the prediction

  10. The Recursive Model Index
     • Benefits
        • Decouples execution cost from model size
        • Notion of progressively learning the shape of the CDF
        • Divides the space into smaller ranges, making it easier to refine the final prediction

  11. The Recursive Model Index
     • Worst-case performance
        • If last-stage models do not meet the error requirement, replace them with B-Trees
        • Has the same worst-case guarantee as B-Trees

  12. The Recursive Model Index - Training
     • Loss defined as: $L = \sum_{(x,y)} (f(x) - y)^2$
     • Simple models trained in seconds, neural nets in minutes
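
Slides 9 to 12 can be illustrated with a small two-stage RMI. The sketch below rests on several assumptions that are not in the slides: every model is a least-squares line (the closed-form minimiser of the squared loss above), the class name `TwoStageRMI` and its parameters `num_leaves` and `max_error` are hypothetical, and a plain binary search over the leaf's key range stands in for the B-Tree fallback.

```python
import bisect

def fit_line(xs, ys):
    """Least-squares fit y ~ a*x + b, the minimiser of sum((a*x + b - y)^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return a, my - a * mx

class TwoStageRMI:
    def __init__(self, keys, num_leaves=16, max_error=32):
        self.keys = keys                      # sorted, unique keys
        self.n = len(keys)
        self.num_leaves = num_leaves
        self.root = fit_line(keys, range(self.n))
        buckets = [[] for _ in range(num_leaves)]
        for pos, key in enumerate(keys):      # stage 1 routes each key to a leaf
            buckets[self._route(key)].append((key, pos))
        self.leaves = [self._train_leaf(b, max_error) for b in buckets]

    def _route(self, key):
        a, b = self.root
        leaf = int((a * key + b) * self.num_leaves / self.n)
        return min(self.num_leaves - 1, max(0, leaf))

    def _train_leaf(self, bucket, max_error):
        if not bucket:
            return ("empty",)
        xs = [k for k, _ in bucket]
        ys = [p for _, p in bucket]
        a, b = fit_line(xs, ys)
        errs = [p - int(a * k + b) for k, p in bucket]
        if max(abs(e) for e in errs) > max_error:
            # Error requirement not met: fall back to searching the leaf's range.
            return ("fallback", ys[0], ys[-1] + 1)
        return ("model", a, b, min(errs), max(errs))

    def lookup(self, key):
        leaf = self.leaves[self._route(key)]
        if leaf[0] == "empty":
            return None                       # no stored key routes to this leaf
        if leaf[0] == "fallback":
            lo, hi = leaf[1], leaf[2]
        else:
            _, a, b, min_err, max_err = leaf
            p = int(a * key + b)
            hi = max(0, min(self.n, p + max_err + 1))
            lo = min(hi, max(0, p + min_err))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < self.n and self.keys[i] == key else None

keys = sorted({(i * i * i) // 7 for i in range(2000)})
rmi = TwoStageRMI(keys)
position = rmi.lookup(keys[1234])
```

Because the first stage only selects a leaf, the number of leaves can grow without adding model evaluations per lookup, which is one way to read the "decouple execution cost and model size" point on slide 10.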

  13. Experiments
     • Integer datasets
        • Weblogs dataset contains 200M log entries
        • Maps dataset indexes the longitude of ≈ 200M user-maintained features
        • Log-normal dataset synthesized by sampling 190M unique values
     • Models
        • 2-stage RMI model with second-stage sizes of 10k, 50k, 100k, and 200k
        • Read-optimized B-Tree with different page sizes

  14. Results

  15. Point Index

  16. Point Index
     • Example: hash-map
     • Deterministically map keys to positions inside an array

  17. The Hash-Model Index
     • Build a hash function based on the CDF of the data ($M$ is the size of the hash-map):
       $h(\mathit{key}) = F(\mathit{key}) \cdot M$, with $F(\mathit{key}) \approx P(X \le \mathit{key})$

  18. The Hash-Model Index
     • Main objective is to reduce the number of conflicts
     • Conflicts could induce high cost depending on the architecture (e.g. distributed)
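
A rough sketch of the hash-model idea from slides 17 and 18 follows. It makes assumptions beyond the slides: $F$ is approximated by piecewise-linear interpolation between sampled quantiles rather than by a trained model, the names `CdfModel`, `learned_hash`, `random_hash`, and `count_conflicts` are hypothetical, and a SHA-256 based function stands in for a conventional hash (it is not MurmurHash3).

```python
import bisect
import hashlib

class CdfModel:
    """Piecewise-linear approximation of the key CDF from evenly spaced quantiles."""

    def __init__(self, sorted_keys, num_knots=64):
        n = len(sorted_keys)
        idx = sorted({round(j * (n - 1) / (num_knots - 1)) for j in range(num_knots)})
        self.xs = [sorted_keys[i] for i in idx]   # knot keys, strictly increasing
        self.ys = [i / n for i in idx]            # CDF value at each knot, in [0, 1)

    def __call__(self, key):
        if key <= self.xs[0]:
            return self.ys[0]
        if key >= self.xs[-1]:
            return self.ys[-1]
        j = bisect.bisect_right(self.xs, key) - 1
        x0, x1 = self.xs[j], self.xs[j + 1]
        y0, y1 = self.ys[j], self.ys[j + 1]
        return y0 + (y1 - y0) * (key - x0) / (x1 - x0)

def learned_hash(model, key, m):
    """h(key) = F(key) * M, truncated to a slot index in [0, m)."""
    return min(m - 1, int(model(key) * m))

def random_hash(key, m):
    """Conventional hash used as a baseline (SHA-256 stand-in)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

def count_conflicts(slots):
    """Number of keys that land in a slot already taken by an earlier key."""
    return len(slots) - len(set(slots))

keys = sorted({(i * i * i) // 7 for i in range(2000)})
m = len(keys)
model = CdfModel(keys)
learned_conflicts = count_conflicts([learned_hash(model, k, m) for k in keys])
baseline_conflicts = count_conflicts([random_hash(k, m) for k in keys])
```

Comparing `learned_conflicts` with `baseline_conflicts` on a given dataset shows whether the approximated CDF spreads keys more evenly than the baseline; the outcome depends on how well the model fits the key distribution.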

  19. Experiments
     • Learned models with the same settings as in the range index
     • Compared against a MurmurHash3-like hash function

  20. Existence Index

  21. Existence Index
     • Example: Bloom filters
     • Return whether a key exists in a dataset
     • No false negatives, but potential false positives

  22. Bloom filters as a Classification Problem
     • Binary probabilistic classification task: whether a key exists in the dataset
     • Diagram: key → Model → Exists / Does not exist

  23. Bloom filters as a Classification Problem
     • Guarantee of no false negatives
     • Overflow Bloom filter: remember false negatives from the model
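
The combination of a classifier with an overflow Bloom filter (slides 22 and 23) might look like the sketch below. It is illustrative only: `toy_score` is a hypothetical hand-written stand-in for a trained model, the threshold and filter sizes are arbitrary, and the class names are not taken from the paper.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

class LearnedBloomFilter:
    def __init__(self, keys, score, threshold, num_bits=1024, num_hashes=3):
        self.score = score
        self.threshold = threshold
        self.overflow = BloomFilter(num_bits, num_hashes)
        for key in keys:
            if score(key) < threshold:        # the model would miss this key
                self.overflow.add(key)

    def __contains__(self, key):
        # Model says "exists", or the overflow filter remembers a missed key.
        return self.score(key) >= self.threshold or key in self.overflow

def toy_score(url):
    """Hypothetical classifier: pretends URLs containing 'login' are phishing."""
    return 0.9 if "login" in url else 0.1

blacklist = ["evil.example/login", "bad.example/x", "phish.example/login-now"]
lbf = LearnedBloomFilter(blacklist, toy_score, threshold=0.5)
all_found = all(url in lbf for url in blacklist)
```

Every blacklisted key is either scored above the threshold or stored in the overflow filter, so the structure keeps the no-false-negative property; false positives can come from the model or from the overflow filter.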

  24. Experiments
     • Data
        • 1.7M blacklisted phishing URLs
        • Negative set: random URLs + whitelisted URLs
     • Comparison
        • Learned filter: RNN with GRU
        • Normal Bloom filter

  25. Critique

  26. Major Contributions
     1. Proposed the idea of applying machine learning to index structures
     2. Offered solutions for guaranteeing performance and determinism with ML models
     3. Showed significant performance improvements (time and space)
     4. Inspired a new research direction (27 citations since June 2018)

  27. Criticism
     1. Details of the platform used for the experiments are not given
     2. Little discussion of training time
     3. Experiments on CPU only

  28. Conclusion & Future Direction
     • Proposed a new direction in database research that
        • Makes effective use of machine learning methods
        • Shows promising preliminary results
     • Inspired new research work
     • Requires more details on performance evaluation
     • Potential in learned algorithms and multi-dimensional indexes
