
Introducing LSM-tree into PostgreSQL, making it a data gobbler - PowerPoint PPT Presentation



  1. Introducing LSM-tree into PostgreSQL, making it a data gobbler. By Shichao Jin. From https://zhuanlan.zhihu.com/p/56259025

  2. About Me
  ● 8 years of experience in DBMS R&D: AWS Redshift and Facebook RocksDB
  ● Ex-PhD student at UWaterloo
  ● Fan of PostgreSQL
  ● Founder of VidarDB

  3. Outline
  ● Experiment & Analysis
  ● LSM-tree Introduction
  ● Our Implementation

  4. Background
  ● The storage engine accounts for a big chunk of a DBMS's performance.
  ● What is a "storage engine"? The incarnation of a data structure.
  (diagram: Data Structure → Storage Engine → DBMS)

  5. Background
  ● B+tree/B-tree: InnoDB, BerkeleyDB, WiredTiger
  ● LSM-tree: LevelDB, RocksDB

  6. Experiment Setting
  Hardware: CPU: Xeon E3-1265L, 2.5GHz, 4 cores; Disk: 7200 RPM; RAM: 16GB DDR3
  Software: Ubuntu 18.04, PostgreSQL 11.6, RocksDB 6.2 (default settings), libpqxx
  Workload: TPC-H 10GB, shuffled!

  7. Experiment Results

  8. Why the huge difference? The data structure! PostgreSQL uses a B+tree. So what's wrong with the B+tree?

  9. B+tree Reading Pattern
  Rare case: data arrives in order. Common case: data arrives in random order.
  (diagram: a red rectangle marks a disk page that is cached)

  10. Problem with B+tree
  I only discuss the insertion case here, but the root cause is the reading pattern of the B+tree: every insertion must trigger random reads to locate the exact insert position in the tree. Note: reads go through the cache!

  11. Problem with B+tree (Cont.)
  When the data is large and does not arrive in primary-key order, the previously cached disk blocks are useless. B+tree insertion causes too many disk accesses!
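The cache-thrashing claim above can be made concrete with a toy model: a minimal sketch (all sizes are assumed for illustration, not PostgreSQL's real parameters) that counts leaf-page cache misses for in-order versus shuffled inserts under an LRU buffer cache.

```python
# Toy model of B+tree leaf-page access during insertion. Page size and
# cache size below are illustrative assumptions, not real PostgreSQL values.
from collections import OrderedDict
import random

KEYS_PER_PAGE = 100   # assumed leaf fan-out
CACHE_PAGES = 10      # assumed buffer cache capacity, in pages

def page_misses(keys):
    """Count cache misses when each insert must read its target leaf page."""
    cache = OrderedDict()            # page id -> None, in LRU order
    misses = 0
    for k in keys:
        page = k // KEYS_PER_PAGE    # leaf page that would hold this key
        if page in cache:
            cache.move_to_end(page)  # cache hit: refresh LRU position
        else:
            misses += 1
            cache[page] = None
            if len(cache) > CACHE_PAGES:
                cache.popitem(last=False)  # evict least recently used page
    return misses

keys = list(range(5000))
in_order = page_misses(keys)   # sequential inserts: one miss per page
random.seed(0)
random.shuffle(keys)
shuffled = page_misses(keys)   # shuffled inserts: most inserts miss
print(in_order, shuffled)
```

With 50 distinct leaf pages and a 10-page cache, sequential inserts miss once per page, while shuffled inserts miss on roughly four out of five accesses.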

  12. Disk vs. SSD vs. RAM in my laptop
  (chart: sequential vs. random access speed for magnetic disk, SSD, and RAM)

  13. Solution
  No disk access on insertion (excluding the WAL)! A tiered design: RAM on top, SSD/disk below. The LSM-tree (log-structured merge-tree) comes to the rescue!

  14. LSM-tree
  (diagram: a new tuple enters the mutable table in RAM; a full mutable table becomes an immutable table; on disk, L0 holds overlapping SSTs each covering A~Z, while L1 [A~F, G~N, O~Z] and L2 [A~G, H~O, P~Z] hold partitioned SSTs; an index and cache sit in RAM)
  SST: Sorted String Table

  15. LSM-tree
  (same diagram, annotated: the mutable table is a skiplist with O(logN) insertion; SSTs store sorted key-value pairs)
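As a rough illustration of the in-RAM write path: a minimal memtable sketch. RocksDB's mutable table is a skiplist with O(logN) sorted inserts; here a plain Python dict sorted at flush time stands in for it, and all names (`MemTable`, `put`, `flush`, the size limit) are illustrative, not RocksDB's API.

```python
# Minimal LSM memtable sketch: buffer writes in RAM, emit a Sorted
# String Table (SST) when full. A dict sorted at flush time stands in
# for RocksDB's skiplist; names and thresholds are illustrative.
class MemTable:
    def __init__(self, limit=4):
        self.data = {}        # key -> value (skiplist stand-in)
        self.limit = limit    # flush threshold, in entries

    def put(self, key, value):
        self.data[key] = value
        return len(self.data) >= self.limit   # True -> time to flush

    def flush(self):
        """Emit an SST: all entries in key order, then reset the buffer."""
        sst = sorted(self.data.items())
        self.data = {}
        return sst

mt = MemTable()
for k, v in [("g", 1), ("a", 2), ("z", 3), ("m", 4)]:
    full = mt.put(k, v)
sst = mt.flush()
print(sst)   # [('a', 2), ('g', 1), ('m', 4), ('z', 3)]
```

This is also the essence of Flush() on slide 19: the sorted run produced here is what gets written to L0 on disk.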

  16. Basic Operations
  Put(key, value), Delete(key)
  (diagram: a new entry K|2020 written through the WAL into the mutable table overwrites, i.e. shadows, the older K|2019 stored in an L1 SST)
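The overwrite semantics on this slide can be sketched as follows. The two-component layout (one memtable over one SST, both plain dicts) and the `TOMBSTONE` marker are illustrative assumptions, not RocksDB internals; the point is that writes never touch old data in place, and reads consult components newest-first.

```python
# Sketch of LSM write/read semantics: newer components shadow older
# ones, and Delete writes a tombstone instead of erasing in place.
TOMBSTONE = object()                   # marker meaning "key was deleted"

memtable = {}                          # newest data, in RAM
sst = {"K": 2019, "old": 1}            # older data on disk (dict stand-in)

def put(key, value):
    memtable[key] = value              # the old SST entry is not rewritten;
                                       # it is simply never seen again
def delete(key):
    memtable[key] = TOMBSTONE          # tombstone shadows all older versions

def get(key):
    for component in (memtable, sst):  # check newest component first
        if key in component:
            v = component[key]
            return None if v is TOMBSTONE else v
    return None

put("K", 2020)                # shadows K|2019 in the SST
delete("old")
print(get("K"), get("old"))   # 2020 None
```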

  17. Basic Operations
  Get(key): return the value
  (diagram: the lookup consults the mutable table, then the immutable table, then the SST levels from L0 downward)

  18. Basic Operations
  Iterator(): seek & next
  (diagram: the iterator visits positions labeled A through G across the mutable table, immutable table, cache, and SST levels)
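A hedged sketch of how such an iterator can work: merge all sorted components with a heap, letting the newest component win when the same key appears at several levels. The component layout and the function name `lsm_iterator` are assumptions for illustration, not RocksDB's actual merging-iterator code.

```python
# Sketch of Iterator(): seek to a key, then yield (key, value) pairs in
# key order by merge-sorting every sorted component with a heap.
import heapq

def lsm_iterator(components, seek_key=""):
    """components are listed newest-first; newest version of a key wins."""
    heap = []
    for age, comp in enumerate(components):   # age 0 = newest component
        for key, value in comp:
            if key >= seek_key:               # seek: skip keys before start
                heap.append((key, age, value))
    heapq.heapify(heap)
    last = None
    while heap:
        key, age, value = heapq.heappop(heap)
        if key != last:                       # drop shadowed duplicates
            yield key, value
            last = key

memtable = [("b", "new-b"), ("d", "d1")]
l0_sst = [("a", "a1"), ("b", "old-b")]
l1_sst = [("c", "c1"), ("e", "e1")]
out = list(lsm_iterator([memtable, l0_sst, l1_sst], seek_key="b"))
print(out)   # [('b', 'new-b'), ('c', 'c1'), ('d', 'd1'), ('e', 'e1')]
```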

  19. Advanced Operations
  Flush(): RAM → L0, usually done by a background thread
  (diagram: the immutable table in RAM is written out as an L0 SST)

  20. Advanced Operations
  Compact(k1, k2): L0 → L1, L1 → L2, usually called by a background thread
  (diagram: compaction reads SSTs from adjacent levels and writes merged SSTs back to the lower level)
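A minimal sketch of the merge step, assuming two plain in-memory sorted runs: the newer level's version of each key survives and shadowed versions are dropped. Real compaction streams SST files from disk rather than loading dicts; this only shows the merge semantics.

```python
# Sketch of Compact(): merge an upper (newer) SST run with a lower
# (older) one into a single sorted run, dropping shadowed versions.
def compact(upper_sst, lower_sst):
    merged = dict(lower_sst)         # start from the older data
    merged.update(dict(upper_sst))   # newer level wins on key collisions
    return sorted(merged.items())    # result is again a sorted run

l0 = [("b", "b-new"), ("x", "x1")]
l1 = [("a", "a1"), ("b", "b-old"), ("m", "m1")]
l1_after = compact(l0, l1)
print(l1_after)   # [('a', 'a1'), ('b', 'b-new'), ('m', 'm1'), ('x', 'x1')]
```

Note that compaction both reads and writes whole runs: this background rewriting is the price the LSM-tree pays for cheap inserts.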

  21. Advanced Operations
  Snapshot(seq): seq is a sequence number, which is always increasing
  (diagram: each tuple is a multi-version component tagged with sequence numbers)
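The sequence-number idea can be sketched like this: every write is stamped with an ever-increasing sequence number, and a reader holding snapshot S sees the newest version with seq ≤ S. The class below is a minimal multi-version store; its names and structure are illustrative, not RocksDB's.

```python
# Sketch of Snapshot(seq): multi-version storage keyed by an
# always-increasing sequence number.
class VersionedStore:
    def __init__(self):
        self.seq = 0
        self.versions = {}   # key -> [(seq, value), ...], newest last

    def put(self, key, value):
        self.seq += 1        # sequence numbers only ever increase
        self.versions.setdefault(key, []).append((self.seq, value))
        return self.seq

    def snapshot(self):
        return self.seq      # a snapshot is just a frozen sequence number

    def get(self, key, snap=None):
        snap = self.seq if snap is None else snap
        for s, v in reversed(self.versions.get(key, [])):
            if s <= snap:    # newest version visible at this snapshot
                return v
        return None

db = VersionedStore()
db.put("K", 2019)
snap = db.snapshot()                    # freeze the view here
db.put("K", 2020)
print(db.get("K", snap), db.get("K"))   # 2019 2020
```

A reader that took its snapshot before the second put keeps seeing 2019, even though the current value is 2020.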

  22. Implementation
  ● Foreign data wrapper (FDW)
  ● RocksDB is based on the multi-thread model
  ● Shared memory
  (diagram: PostgreSQL processes issuing Select, Update, … communicate with RocksDB threads through shared memory)

  23. More To Do
  ● Make the engine recognize PostgreSQL data types
  ● Transactions?
  ● Migrate to the pluggable storage engine interface (table AM)

  24. Motivation for designing a new data structure
  1. Random R/W on RAM is 2000X faster than on magnetic disk, but 120X more expensive
  2. The RAM/disk size ratio is 40~200X larger than in the 1970s, when the B+tree was designed
  3. New storage choices: NVRAM
  4. Choosing among different engines and data systems requires expertise

  25. Thanks!
  Shichao@vidardb.com
  Twitter: @jsc0218
  github.com/vidardb/vidardb
  github.com/vidardb/PostgresForeignDataWrapper
