
PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees
Pandian Raju¹, Rohan Kadekodi¹, Vijay Chidambaram¹,², Ittai Abraham²
¹ The University of Texas at Austin   ² VMware Research
What is a key-value store?


  1. FLSM structure. [Diagram: an in-memory memtable sits above the on-disk levels; Level 0 holds files 2….37 and 23….48; Level 1 has guards 15 and 70; Level 2 has guards 15, 40, 70, and 95, each guard holding one or more sstables.] Note how files are logically grouped within guards.

  2. FLSM structure. [Diagram: the same FLSM layout; Level 1 has two guards while Level 2 has four.] Guards get more fine-grained deeper into the tree.
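To make the layout concrete, here is a minimal sketch (illustrative Python, not the actual PebblesDB implementation, which is C++ built on HyperLevelDB) of the FLSM shape: each level keeps a sorted list of guard keys, and each guard holds a list of sstable fragments whose ranges may overlap one another but never cross a guard boundary.

```python
import bisect
from dataclasses import dataclass, field

@dataclass
class SSTable:
    # One fragment: a sorted run of key-value pairs (held in memory
    # here for illustration; on disk in a real store).
    entries: dict

@dataclass
class Guard:
    key: int                                    # lower bound of this partition
    files: list = field(default_factory=list)   # fragments; may overlap each other

@dataclass
class Level:
    guards: list = field(default_factory=list)    # sorted by guard key
    sentinel: list = field(default_factory=list)  # files whose keys precede the first guard

    def guard_for(self, key):
        """Return the guard whose range [g.key, next_g.key) contains key, else None."""
        keys = [g.key for g in self.guards]
        i = bisect.bisect_right(keys, key) - 1
        return self.guards[i] if i >= 0 else None
```

Locating the guard responsible for a key is a binary search over the guard keys; the files inside that guard may still each need to be examined, which is the trade-off explored later.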

  3. How does FLSM reduce write amplification?

  4. How does FLSM reduce write amplification? [Diagram: a new file 30….68 is flushed toward Level 0, which already holds 2….37 and 23….48.] The maximum number of files in Level 0 is configured to be 2.

  5. How does FLSM reduce write amplification? [Diagram: compacting Level 0: files 2….37, 23….48, and 30….68 are merged into a sorted run 2….68, which is split at guard key 15 into fragments 2….14 and 15….68.]

  6. How does FLSM reduce write amplification? [Diagram: the fragments 2….14 and 15….68 leave Level 0.] Fragmented files are just appended to the next level.

  7. How does FLSM reduce write amplification? [Diagram: Level 1 now holds 2….14 alongside 1….12 before guard 15, and 15….68 alongside 15….59 under guard 15.] Guard 15 in Level 1 is to be compacted.

  8. How does FLSM reduce write amplification? [Diagram: the files under guard 15 in Level 1 (15….59 and 15….68) are combined, sorted, and fragmented at the next level's guard key 40 into 15….39 and 40….68.]

  9. How does FLSM reduce write amplification? [Diagram: the fragments 15….39 and 40….68 land in Level 2 under guards 15 and 40 respectively.] Fragmented files are just appended to the next level.

  10. How does FLSM reduce write amplification? FLSM does not re-write data to the same level in most cases. How does FLSM maintain read performance? FLSM maintains partially sorted levels to efficiently reduce the search space.
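The walkthrough above fits in a few lines of the running sketch. The function below (illustrative, continuing the Python sketch; guard creation and file metadata are omitted) captures the property the slides emphasize: compaction merges the files inside one guard, splits the result at the next level's guard keys, and appends the fragments without reading or rewriting anything already in the next level.

```python
import bisect

def compact_guard(guard, next_level):
    """Merge one guard's files and append the fragments to the next level."""
    # 1. Merge all entries in this guard; iterating oldest-to-newest
    #    lets newer values overwrite older ones.
    merged = {}
    for sst in guard.files:
        merged.update(sst.entries)
    guard.files.clear()

    # 2. Partition the merged run at the next level's guard keys.
    boundaries = [g.key for g in next_level.guards]
    fragments = {}
    for key in sorted(merged):
        i = bisect.bisect_right(boundaries, key) - 1
        fragments.setdefault(i, {})[key] = merged[key]

    # 3. Append each fragment under its guard; existing files in the
    #    next level are never touched, which is the write-I/O saving.
    for i, entries in fragments.items():
        bucket = next_level.guards[i].files if i >= 0 else next_level.sentinel
        bucket.append(SSTable(entries))
```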

  11. Selecting Guards • Guards are chosen randomly and dynamically • Dependent on the distribution of data. [Figure: guards being picked across a keyspace spanning 1 to 1e+9; a sketch of the selection idea follows below.]
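The paper chooses guards probabilistically from the inserted keys themselves, much as skip lists choose node heights, so deeper levels get exponentially more guards and guard density follows the data distribution. A minimal sketch of that idea (the hash choice and the BITS_PER_LEVEL knob are illustrative assumptions, not PebblesDB's actual parameters):

```python
import hashlib

BITS_PER_LEVEL = 2  # illustrative: how much rarer guards get per shallower level

def is_guard(key: bytes, level: int, num_levels: int) -> bool:
    """Decide whether `key` serves as a guard at `level`.

    Shallower levels demand more trailing zero bits in the key's hash,
    so they get exponentially fewer guards, and a guard at level i is
    automatically also a guard at every deeper level.
    """
    h = int.from_bytes(hashlib.sha1(key).digest(), "big")
    required = (num_levels - level) * BITS_PER_LEVEL
    return h & ((1 << required) - 1) == 0
```

Because guards are drawn from keys actually inserted, dense regions of the keyspace naturally end up with more guards, which is what "dependent on the distribution of data" means here.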

  15. Operations: Write. Put(1, “abc”): Write(key, value) goes to the in-memory memtable. [Diagram: the FLSM structure, with the write landing in memory above the storage levels.]

  16. Operations: Get. Get(23). [Diagram: the FLSM structure.]

  17. Operations: Get. Get(23). [Diagram.] Search level by level, starting from memory.

  18. Operations: Get. Get(23). [Diagram.] All Level 0 files need to be searched.

  19. Operations: Get. Get(23). [Diagram.] Level 1: the file under guard 15 is searched.

  20. Operations: Get. Get(23). [Diagram.] Level 2: both files under guard 15 are searched.
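Continuing the sketch, a lookup proceeds newest to oldest: memtable first, then every Level 0 file, then exactly one guard per deeper level, all of whose files are candidates (illustrative code, not the PebblesDB implementation):

```python
def get(key, memtable, level0_files, levels):
    """Search newest to oldest; the first hit is the current value."""
    if key in memtable:
        return memtable[key]
    # Level 0 files can overlap arbitrarily, so all of them are checked.
    for sst in reversed(level0_files):            # newest file first
        if key in sst.entries:
            return sst.entries[key]
    # Deeper levels: binary-search for the one responsible guard, then
    # check each of its files, newest first.
    for level in levels:
        guard = level.guard_for(key)
        files = guard.files if guard else level.sentinel
        for sst in reversed(files):
            if key in sst.entries:
                return sst.entries[key]
    return None
```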

  21. High write throughput in FLSM • Compaction from memory to Level 0 is stalled • Writes to memory are then also stalled. [Diagram: Write(key, value) arrives while Level 0 holds files 2….98, 23….48, 1….37, and 18….48.] If the rate of insertion is higher than the rate of compaction, write throughput depends on the rate of compaction.

  22. High write throughput in FLSM • Compaction from memory to Level 0 is stalled • Writes to memory are then also stalled. [Diagram: same as above.] FLSM compacts faster because it does less I/O, and hence sustains higher write throughput. If the rate of insertion is higher than the rate of compaction, write throughput depends on the rate of compaction.

  23. Challenges in FLSM • Every read/range-query operation needs to examine multiple files per level • For example, if every guard has 5 files, read latency increases by 5x (assuming no cache hits). Trade-off between write I/O and read performance.

  24. Outline • Log-Structured Merge Tree (LSM) • Fragmented Log-Structured Merge Tree (FLSM) • Building PebblesDB using FLSM • Evaluation • Conclusion

  25. PebblesDB • Built by modifying HyperLevelDB (roughly 9,100 lines of code changed) to use FLSM • HyperLevelDB builds on LevelDB to provide improved parallelism and compaction • API-compatible with LevelDB, but not with RocksDB

  26. Optimizations in PebblesDB • Challenge (get/range query): multiple files in a guard • Get() performance is improved using file-level bloom filters

  27. Optimizations in PebblesDB • Challenge (get/range query): multiple files in a guard • Get() performance is improved using file-level bloom filters. [Diagram: a bloom filter answers “is key 25 present?” with either “definitely not” or “possibly yes”.]

  28. Optimizations in PebblesDB • Challenge (get/range query): multiple files in a guard • Get() performance is improved using file-level bloom filters. [Diagram: Level 1 with guards 15 and 70; each sstable (1….12, 15….39, 82….95, 77….97) has a bloom filter maintained in-memory.]

  29. Optimizations in PebblesDB • Challenge (get/range query): multiple files in a guard • Get() performance is improved using file-level bloom filters. [Diagram: same as above.] With file-level bloom filters, PebblesDB reads the same number of files as any LSM-based store.
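A minimal sketch of the idea (a toy simulated bloom filter, not PebblesDB's actual filter implementation): consult a file's in-memory filter before paying for the file read, and skip the file whenever the filter answers “definitely not”.

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: k hash probes into a bit array.

    False positives are possible ("possibly yes"); false negatives are
    not ("definitely not"), which is what makes skipping files safe.
    """
    def __init__(self, num_bits=1024, num_hashes=4):
        self.bits = bytearray(num_bits // 8)
        self.num_bits, self.num_hashes = num_bits, num_hashes

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def get_in_guard(key, guard, filters):
    """Probe each file's filter first; only "possibly yes" files are read."""
    for sst in reversed(guard.files):                   # newest file first
        if filters[id(sst)].maybe_contains(str(key).encode()):
            if key in sst.entries:                      # the actual, expensive read
                return sst.entries[key]
    return None
```

Here `filters` maps each sstable to its in-memory filter; with well-sized filters, at most one file per guard is read with high probability.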

  30. Optimizations in PebblesDB • Challenge (get/range query): multiple files in a guard • Get() performance is improved using file-level bloom filters • Range-query performance is improved using parallel threads and better compaction

  31. Outline • Log-Structured Merge Tree (LSM) • Fragmented Log-Structured Merge Tree (FLSM) • Building PebblesDB using FLSM • Evaluation • Conclusion

  32. Evaluation: real-world workloads (YCSB), crash recovery, micro-benchmarks, CPU and memory, low memory usage, small dataset, aged file system, NoSQL applications

  34. Real world workloads - YCSB • Yahoo! Cloud Serving Benchmark: the industry-standard macro-benchmark • Insertions: 50M, operations: 10M, key size: 16 bytes, value size: 1 KB. [Chart: throughput ratio relative to HyperLevelDB, y-axis 0 to 2.5, for Load A, Run A, Run B, Run C, Run D, Load E, Run E, Run F, and total IO.] Workloads: Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes.

  35. Real world workloads - YCSB. [Same chart, annotated with PebblesDB's absolute throughputs: 35.08, 33.98, 22.41, 57.87, 34.06, 32.09, 25.8, and 5.8 Kops/s across the workloads, and 952.93 GB total IO.]


  40. NoSQL stores - MongoDB • YCSB on MongoDB, a widely used NoSQL store • Inserted 20M key-value pairs with 1 KB values; 10M operations. [Chart: throughput ratio relative to WiredTiger, y-axis 0 to 2.5, over the same YCSB workloads plus total IO.]

  41. NoSQL stores - MongoDB. [Same chart, annotated with absolute throughputs: 20.73, 15.52, 19.69, 23.53, 20.68, 9.95, 0.65, and 9.78 Kops/s across the workloads, and 426.33 GB total IO.]


  45. NoSQL stores - MongoDB. [Same chart.] PebblesDB combines the low write IO of WiredTiger with the high performance of RocksDB.

  46. Outline • Log-Structured Merge Tree (LSM) • Fragmented Log-Structured Merge Tree (FLSM) • Building PebblesDB using FLSM • Evaluation • Conclusion

  47. Conclusion • PebblesDB: a key-value store built on Fragmented Log-Structured Merge Trees • Increases write throughput and reduces write IO at the same time • Obtains 6x the write throughput of RocksDB • As key-value stores become more widely used, there have been several attempts to optimize them • PebblesDB combines algorithmic innovation (the FLSM data structure) with careful systems building

  48. https://github.com/utsaslab/pebblesdb

  50. Backup slides

  51. Operations: Seek • Seek(target): returns the smallest key in the database that is >= target • Used for range queries (for example, return all entries between 5 and 18). Example: Seek(1) over Level 0 – 1, 2, 100, 1000; Level 1 – 1, 5, 10, 2000; Level 2 – 5, 300, 500 returns 1.

  52. Operations: Seek • Seek(target): returns the smallest key in the database that is >= target • Used for range queries (for example, return all entries between 5 and 18). Example: Seek(200) over the same levels returns 300, the smallest key >= 200.

  54. Operations: Seek. Seek(23). [Diagram: the FLSM structure.]

  55. Operations: Seek. Seek(23). [Diagram.] All levels and the memtable need to be searched.
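A sketch of seek over the running layout (illustrative; a real store drives iterators over on-disk sstables rather than scanning dicts): every level can hold the successor of the target, so each contributes a candidate and the smallest one wins. Note the successor can fall past the responsible guard, so the scan may move forward to later guards.

```python
import bisect

def seek(target, memtable, level0_files, levels):
    """Return the smallest key >= target, or None if no such key exists."""
    candidates = []

    def successor_in(entries):
        keys = [k for k in entries if k >= target]
        if keys:
            candidates.append(min(keys))

    successor_in(memtable)
    for sst in level0_files:              # every Level 0 file may hold it
        successor_in(sst.entries)
    for level in levels:
        guard_keys = [g.key for g in level.guards]
        i = bisect.bisect_right(guard_keys, target) - 1
        if i < 0:                         # target precedes the first guard
            for sst in level.sentinel:
                successor_in(sst.entries)
            i = 0
        # Scan forward from the responsible guard until one yields a candidate.
        for guard in level.guards[i:]:
            before = len(candidates)
            for sst in guard.files:
                successor_in(sst.entries)
            if len(candidates) > before:
                break
    return min(candidates) if candidates else None
```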

  56. Optimizations in PebblesDB • Challenge with reads: multiple sstable reads per level • Optimized using sstable-level bloom filters • Bloom filter: determines whether an element is in a set. [Diagram: a bloom filter answers “is key 25 present?” with “definitely not” or “possibly yes”.]

  57. Optimizations in PebblesDB • Challenge with reads: multiple sstable reads per level • Optimized using sstable-level bloom filters • Bloom filter: determines whether an element is in a set. [Diagram: Get(97) probes the in-memory bloom filters of the Level 1 sstables under guards 15 and 70.]

  58. Optimizations in PebblesDB • Challenge with reads: multiple sstable reads per level • Optimized using sstable-level bloom filters. [Diagram: for Get(97), the filter for 82….95 answers False and the filter for 77….97 answers True, so only one file is actually read.]

  59. Optimizations in PebblesDB • Challenge with reads: multiple sstable reads per level • Optimized using sstable-level bloom filters. [Diagram: same Level 1.] PebblesDB reads at most one file per guard with high probability.

  60. Optimizations in PebblesDB • Challenge with seeks: multiple sstable reads per level • Parallel seeks: parallel threads seek() on the files in a guard. [Diagram: Seek(85) on guard 70 in Level 1; Thread 1 and Thread 2 seek files 77….97 and 82….95 concurrently.]

  61. Optimizations in PebblesDB • Challenge with seeks: multiple sstable reads per level • Parallel seeks: parallel threads seek() on the files in a guard • Seek-based compaction: triggers compaction for a level during a seek-heavy workload, reducing the average number of sstables per guard and the number of active levels. Seek-based compaction increases write I/O as a trade-off to improve seek performance. A sketch of the parallel-seek idea follows below.
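A minimal sketch of parallel seeks over the running layout, using a thread pool so the per-file probes in one guard run concurrently (Python threads stand in for PebblesDB's C++ worker threads; with real disk-bound seeks it is the I/O that overlaps):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_seek_in_guard(target, guard, max_workers=4):
    """Seek every file in the guard on its own thread; smallest result wins."""
    def successor(sst):
        keys = [k for k in sst.entries if k >= target]
        return min(keys) if keys else None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = [r for r in pool.map(successor, guard.files) if r is not None]
    return min(results) if results else None
```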
