SLIDE 1

PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees

Pandian Raju¹, Rohan Kadekodi¹, Vijay Chidambaram¹,², Ittai Abraham²

¹The University of Texas at Austin   ²VMware Research

SLIDE 2

What is a key-value store?

  • Store any arbitrary value for a given key

[Diagram: keys 123 and 124 map to the values {"name": "John Doe", "age": 25} and {"name": "Ross Gel", "age": 28}]

SLIDE 3

What is a key-value store?

  • Store any arbitrary value for a given key
  • Insertions:
  • Point lookups:
  • Range Queries:

SLIDE 4

What is a key-value store?

  • Store any arbitrary value for a given key
  • Insertions: put(key, value)
  • Point lookups:
  • Range Queries:

SLIDE 5

What is a key-value store?

  • Store any arbitrary value for a given key
  • Insertions: put(key, value)
  • Point lookups: get(key)
  • Range Queries:

SLIDE 6

What is a key-value store?

  • Store any arbitrary value for a given key
  • Insertions: put(key, value)
  • Point lookups: get(key)
  • Range Queries: get_range(key1, key2)
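
A toy in-memory store illustrating these three operations (illustration only; the class and names below are made up, not PebblesDB code):

```python
# A minimal sketch of the key-value store interface described above.
class TinyKVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Insertion: store an arbitrary value under the given key.
        self._data[key] = value

    def get(self, key):
        # Point lookup: return the value for one key (None if absent).
        return self._data.get(key)

    def get_range(self, key1, key2):
        # Range query: all pairs with key1 <= key <= key2, in key order.
        return [(k, self._data[k]) for k in sorted(self._data) if key1 <= k <= key2]

store = TinyKVStore()
store.put(123, {"name": "John Doe", "age": 25})
store.put(124, {"name": "Ross Gel", "age": 28})
print(store.get(123))              # {'name': 'John Doe', 'age': 25}
print(store.get_range(123, 124))   # both pairs, in key order
```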

SLIDE 7

Key-Value Stores - widely used

  • Google’s BigTable powers Search, Analytics, Maps and Gmail
  • Facebook’s RocksDB is used as the storage engine in the production systems of many companies

SLIDE 8

Write-optimized data structures

  • Log-Structured Merge Tree (LSM) is a write-optimized data structure used in key-value stores
  • Provides high write throughput with good read throughput, but suffers from high write amplification

SLIDE 9

Write-optimized data structures

  • Log-Structured Merge Tree (LSM) is a write-optimized data structure used in key-value stores
  • Provides high write throughput with good read throughput, but suffers from high write amplification
  • Write amplification: ratio of the amount of write I/O to the amount of user data

[Diagram: a client writes 10 GB of user data to the KV store; if the total write I/O is 200 GB, the write amplification is 20]

SLIDE 10
  • Inserted 500M key-value pairs
  • Key: 16 bytes, Value: 128 bytes
  • Total user data: ~45 GB

[Bar chart: write amplification in LSM-based KV stores — total write I/O (GB) for RocksDB, LevelDB and PebblesDB against ~45 GB of user data]

SLIDE 11
  • Inserted 500M key-value pairs
  • Key: 16 bytes, Value: 128 bytes
  • Total user data: ~45 GB

[Bar chart: write amplification in LSM-based KV stores — for ~45 GB of user data, total write I/O is 1868 GB (42x) for RocksDB, 1222 GB (27x) for LevelDB and 756 GB (17x) for PebblesDB]

Write amplification in LSM based KV stores

SLIDE 12

Why is write amplification bad?

  • Reduces the write throughput
  • Flash devices wear out after a limited number of write cycles

(Intel SSD DC P4600: can last ~5 years assuming ~5 TB of writes per day)

With RocksDB, writing ~500 GB of user data per day wears the SSD out in about 1.25 years

Data source: https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-1-6tb-2-5inch-3d1.html

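A back-of-envelope check of that lifetime claim, using only the numbers quoted above (the arithmetic below is my own, not taken from the slides):

```python
# Endurance implied by the P4600 figures, divided by the device writes RocksDB
# generates for ~500 GB/day of user data.
rated_years, rated_tb_per_day = 5, 5                   # Intel P4600 figures cited above
endurance_tb = rated_years * 365 * rated_tb_per_day    # ~9125 TB of total device writes

write_amplification = 42                               # RocksDB, from the earlier chart
user_data_tb_per_day = 0.5                             # ~500 GB of user data per day
device_writes_tb_per_day = user_data_tb_per_day * write_amplification

lifetime_years = endurance_tb / device_writes_tb_per_day / 365
print(f"~{lifetime_years:.2f} years")                  # ~1.19 years, i.e. roughly 1.25
```
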
SLIDE 13

PebblesDB

High-performance, write-optimized key-value store built using a new data structure, the Fragmented Log-Structured Merge Tree

Achieves 3-6.7x higher write throughput and 2.4-3x lower write amplification compared to RocksDB

Achieves the highest write throughput and the lowest write amplification when used as a backend store for MongoDB

SLIDE 14

Outline

  • Log-Structured Merge Tree (LSM)
  • Fragmented Log-Structured Merge Tree (FLSM)
  • Building PebblesDB using FLSM
  • Evaluation
  • Conclusion

SLIDE 16

Log-Structured Merge Tree (LSM)

Data is stored both in memory and storage

[Diagram: an in-memory table in memory and File 1 on storage]

SLIDE 17

Log-Structured Merge Tree (LSM)

Writes are directly put to memory

[Diagram: Write(key, value) goes to the in-memory table; File 1 is on storage]

SLIDE 18

Log-Structured Merge Tree (LSM)

In-memory data is periodically written as files to storage (sequential I/O)

[Diagram: the in-memory table in memory; File 1 and File 2 on storage]

SLIDE 19

Log-Structured Merge Tree (LSM)

Files on storage are logically arranged in different levels

[Diagram: the in-memory table in memory; Level 0, Level 1, …, Level n on storage]

SLIDE 20

Log-Structured Merge Tree (LSM)

Compaction pushes data to higher-numbered levels

[Diagram: compaction moves data from the in-memory table down through Level 0, Level 1, …, Level n]

SLIDE 21

Log-Structured Merge Tree (LSM)

Files within a level are sorted and have non-overlapping key ranges, so a level can be searched using binary search

[Diagram: a level with files covering key ranges 1–12, 15–19, 25–75 and 79–99]

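A small sketch of that search (for illustration only, not LevelDB/PebblesDB code): because the files are sorted and non-overlapping, one binary search over the files' smallest keys finds the only file that can contain a key:

```python
import bisect

# Each file is summarized by its (smallest, largest) key; files in a level are
# sorted and non-overlapping, so one binary search finds the single candidate.
level = [(1, 12), (15, 19), (25, 75), (79, 99)]

def file_for_key(level, key):
    smallest_keys = [lo for lo, _ in level]
    i = bisect.bisect_right(smallest_keys, key) - 1    # last file starting <= key
    if i >= 0 and level[i][0] <= key <= level[i][1]:
        return i                                       # index of the candidate file
    return None                                        # key falls in a gap

print(file_for_key(level, 30))   # 2  -> the file covering 25-75
print(file_for_key(level, 13))   # None -> no file can contain key 13
```
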
SLIDE 22

Log-Structured Merge Tree (LSM)

Level 0 can have files with overlapping (but internally sorted) key ranges; there is a limit on the number of level-0 files

[Diagram: Level 0 holds overlapping files with key ranges 2–57 and 23–78]

SLIDE 23

Write amplification: Illustration

Max files in level 0 is configured to be 2

[Diagram: Level 0 holds files 2–37 and 23–48, and a new file 58–68 is flushed from memory; Level 1 holds files 1–12, 15–25, 39–62 and 77–95]

Level 1 re-write counter: 1

SLIDE 24

Write amplification: Illustration

Level 0 has 3 files (> 2), which triggers a compaction

[Diagram: same state as the previous slide]

Level 1 re-write counter: 1

SLIDE 25

Write amplification: Illustration

Files are immutable; each level keeps sorted, non-overlapping files

[Diagram: same state as the previous slide]

Level 1 re-write counter: 1

SLIDE 26

Write amplification: Illustration

Set of overlapping files between levels 0 and 1

[Diagram: the overlapping files across levels 0 and 1 are highlighted]

Level 1 re-write counter: 1

SLIDE 29

Write amplification: Illustration

Compacting level 0 with level 1

[Diagram: the level-0 files (2–37, 23–48, 58–68) and the overlapping level-1 files (1–12, 15–25, 39–62) are merged into a run covering keys 1–68 and rewritten as new level-1 files 1–23, 24–46 and 47–68]

Level 1 re-write counter: 1 → 2

SLIDE 30

Write amplification: Illustration

Level 0 is compacted

[Diagram: Level 0 is now empty; Level 1 holds files 1–23, 24–46, 47–68 and 77–95]

Level 1 re-write counter: 2

SLIDE 31

Write amplification: Illustration

Data is flushed as level-0 files after some further write operations

[Diagram: Level 0 now holds files 10–33, 17–53 and 1–121; Level 1 holds files 1–23, 24–46, 47–68 and 77–95]

Level 1 re-write counter: 2

SLIDE 32

Write amplification: Illustration

Compacting level 0 with level 1

[Diagram: same state as the previous slide; the level-0 and level-1 files are selected for compaction]

Level 1 re-write counter: 2

SLIDE 33

Write amplification: Illustration

Compacting level 0 with level 1

[Diagram: all level-0 and level-1 files are merged into a run covering keys 1–121 and rewritten as new level-1 files 1–30, 31–60, 62–90 and 92–121]

Level 1 re-write counter: 2 → 3

SLIDE 34

Write amplification: Illustration

Existing data is re-written to the same level (1) 3 times

[Diagram: Level 0 is empty; Level 1 holds files 1–30, 31–60, 62–90 and 92–121]

Level 1 re-write counter: 3

SLIDE 35

Root cause of write amplification: rewriting data to the same level multiple times, in order to maintain sorted, non-overlapping files in each level

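A toy sketch of that root cause (my own illustration, not LevelDB's compaction code): merging level-0 files into level 1 rewrites every overlapping level-1 file, so data already on disk is written again:

```python
# Each "file" is a sorted list of keys. Compacting level 0 into level 1 merges
# the new data with the overlapping level-1 files and writes everything again,
# purely to keep level 1 sorted and non-overlapping.
def compact(level0_files, level1_files):
    merged = sorted({k for f in level0_files + level1_files for k in f})
    rewritten = sum(len(f) for f in level1_files)    # level-1 data written again
    new_data = sum(len(f) for f in level0_files)     # data written for the first time
    new_level1 = [merged[i:i + 4] for i in range(0, len(merged), 4)]   # disjoint files
    return new_level1, new_data, rewritten

level0 = [[2, 23, 37], [23, 40, 48]]
level1 = [[1, 5, 12], [15, 20, 25], [39, 50, 62]]
new_level1, new, rewritten = compact(level0, level1)
print(rewritten, "existing keys rewritten while adding", new, "new keys")
```
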
SLIDE 36

Outline

  • Log-Structured Merge Tree (LSM)
  • Fragmented Log-Structured Merge Tree (FLSM)
  • Building PebblesDB using FLSM
  • Evaluation
  • Conclusion

SLIDE 37

Naïve approach to reduce write amplification

  • Just append the file to the end of next level
  • Many (possibly all) overlapping files within a level
  • Affects the read performance

[Diagram: Level i with files 1–89, 6–91, 5–65, 9–99, 1–102, 1–271 and 8–95; all files have overlapping key ranges]

SLIDE 38

Partially sorted levels

  • Hybrid between all non-overlapping files and all overlapping files
  • Inspired by the skip list data structure
  • Concrete boundaries (guards) group together overlapping files

[Diagram: Level i with guards 13, 35 and 70 and files 1–12, 18–31, 13–34, 42–65, 72–87, 45–56 and 40–47; files within the same guard can have overlapping key ranges]

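A sketch of how guards partition a level (for illustration; the file-to-guard rule below is assumed, not taken verbatim from the paper): a file is placed in the guard whose range contains its smallest key, and files within a guard may overlap:

```python
import bisect

# Guards split the key space of a level; index 0 is the sentinel guard that
# precedes the first guard key.
guards = [13, 35, 70]

def guard_index(guards, smallest_key):
    return bisect.bisect_right(guards, smallest_key)

level_files = [(1, 12), (18, 31), (13, 34), (42, 65), (72, 87), (45, 56), (40, 47)]
by_guard = {}
for f in level_files:
    by_guard.setdefault(guard_index(guards, f[0]), []).append(f)

for g in sorted(by_guard):
    print("guard", g, "->", by_guard[g])
# guard 0 -> [(1, 12)]
# guard 1 -> [(18, 31), (13, 34)]
# guard 2 -> [(42, 65), (45, 56), (40, 47)]
# guard 3 -> [(72, 87)]
```
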
SLIDE 39

Fragmented Log-Structured Merge Tree

Novel modification of the LSM data structure: uses guards to maintain partially sorted levels, and writes data only once per level in most cases

SLIDE 40

FLSM structure

Note how files are logically grouped within guards

[Diagram: Level 0 holds files 2–37 and 23–48; Level 1 (guards 15, 70) holds files 1–12, 15–59, 77–87 and 82–95; Level 2 (guards 15, 40, 70, 95) holds files 2–8, 15–23, 16–32, 45–65, 70–90 and 96–99]

SLIDE 41

FLSM structure

Guards get more fine-grained deeper into the tree

[Diagram: the same FLSM state as the previous slide; Level 1 has 2 guards, Level 2 has 4]

SLIDE 42

How does FLSM reduce write amplification?

SLIDE 43

How does FLSM reduce write amplification?

Max files in level 0 is configured to be 2

[Diagram: the FLSM state from before, with a new file 30–68 flushed to Level 0, which already has 2 files]

SLIDE 44

How does FLSM reduce write amplification?

Compacting level 0

[Diagram: the level-0 files (2–37, 23–48, 30–68) are merged into a run covering keys 2–68, then fragmented at guard 15 of Level 1 into pieces 2–14 and 15–68]

SLIDE 45

How does FLSM reduce write amplification?

Fragmented files are just appended to the next level

[Diagram: the fragments 2–14 and 15–68 are appended to Level 1 without rewriting its existing files; guard 15 of Level 1 now holds both 15–59 and 15–68]

SLIDE 46

How does FLSM reduce write amplification?

Guard 15 in Level 1 is to be compacted

[Diagram: guard 15 of Level 1 currently holds files 15–59 and 15–68]

SLIDE 47

How does FLSM reduce write amplification?

Files are combined, sorted and fragmented

[Diagram: the files under guard 15 of Level 1 are merged and fragmented at guard 40 of Level 2 into pieces 15–39 and 40–68]

SLIDE 48

How does FLSM reduce write amplification?

Fragmented files are just appended to the next level

[Diagram: the fragments 15–39 and 40–68 are appended under guards 15 and 40 of Level 2 without rewriting the existing level-2 files]

SLIDE 49

How does FLSM reduce write amplification?

FLSM doesn’t re-write data to the same level in most cases

How does FLSM maintain read performance?

FLSM maintains partially sorted levels to efficiently reduce the search space

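A sketch of FLSM-style compaction of one guard (for illustration only, not PebblesDB code): the guard's files are merged once, cut at the next level's guard boundaries, and the fragments are appended there without touching the next level's existing files:

```python
import bisect

# Compact one guard: merge and sort its files, fragment the result at the next
# level's guard keys, and append the fragments; nothing in the next level is
# read or rewritten.
def compact_guard(guard_files, next_level_guards, next_level):
    merged = sorted(k for f in guard_files for k in f)
    cuts = [bisect.bisect_left(merged, g) for g in next_level_guards]
    pieces = [merged[a:b] for a, b in zip([0] + cuts, cuts + [len(merged)]) if merged[a:b]]
    for piece in pieces:
        g = bisect.bisect_right(next_level_guards, piece[0])
        next_level.setdefault(g, []).append(piece)     # append; nothing is rewritten
    return next_level

guard15_files = [[15, 20, 59], [15, 33, 68]]           # two overlapping files under guard 15
level2_guards = [15, 40, 70, 95]
level2 = {1: [[15, 23], [16, 32]], 2: [[45, 65]]}      # existing level-2 files stay untouched
print(compact_guard(guard15_files, level2_guards, level2))
```
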
SLIDE 50

Selecting Guards

  • Guards are chosen randomly and dynamically
  • Dependent on the distribution of data

[Diagram: guards are selected at points across the keyspace from 1 to 1e+9]

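One plausible way to pick guards probabilistically, in the spirit of skip lists (the slides only say guards are chosen randomly and dynamically; the hashing scheme and parameters below are assumptions, not necessarily PebblesDB's exact rule):

```python
import hashlib

# A key becomes a guard for a level if its hash ends in enough zero bits;
# deeper levels require fewer bits, so they get exponentially more guards,
# and a guard for level i is by construction a guard for every deeper level.
def is_guard(key, level, max_level=5, bits_per_level=3):
    h = int(hashlib.sha256(str(key).encode()).hexdigest(), 16)
    required_zero_bits = (max_level - level + 1) * bits_per_level
    return h % (1 << required_zero_bits) == 0

sample = range(0, 1_000_000, 997)          # a sample of the inserted keys
counts = {lvl: sum(is_guard(k, lvl) for k in sample) for lvl in range(1, 6)}
print(counts)                              # few guards in level 1, ~8x more per deeper level
```
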
SLIDE 54

Operations: Write

Write(key, value), e.g. Put(1, “abc”), goes directly to the in-memory table

[Diagram: the FLSM structure from before; the new entry is added in memory]

SLIDE 55

Operations: Get

Get(23) on the FLSM structure

[Diagram: the FLSM structure from before]

SLIDE 56

Operations: Get

Get(23): search level by level, starting from memory

[Diagram: the FLSM structure from before]

SLIDE 57

Operations: Get

Get(23): all level-0 files need to be searched

[Diagram: both level-0 files, 2–37 and 23–48, overlap key 23]

SLIDE 58

Operations: Get

Get(23): in Level 1, the file under guard 15 is searched

[Diagram: guard 15 of Level 1 holds the file 15–59]

SLIDE 59

Operations: Get

Get(23): in Level 2, both files under guard 15 are searched

[Diagram: guard 15 of Level 2 holds the files 15–23 and 16–32]

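A sketch of the whole lookup (for illustration only, not PebblesDB code): levels are searched from newest to oldest; within a level, binary search over the guard keys picks the one guard that can hold the key, and every file inside that guard is checked:

```python
import bisect

# levels = [(guard_keys, {guard_index: [file_dict, ...]}), ...], newest first.
def flsm_get(key, memtable, levels):
    if key in memtable:                        # memory is checked first
        return memtable[key]
    for guards, files_by_guard in levels:      # then each level on storage
        g = bisect.bisect_right(guards, key)   # the only guard that can hold `key`
        for f in files_by_guard.get(g, []):    # but every file under that guard
            if key in f:
                return f[key]
    return None

memtable = {}
level1 = ([15, 70], {0: [{1: "a", 12: "b"}], 1: [{23: "v1", 59: "c"}], 2: [{77: "d"}]})
level2 = ([15, 40, 70, 95], {1: [{16: "e"}, {23: "v0"}], 2: [{45: "f"}]})
print(flsm_get(23, memtable, [level1, level2]))   # "v1" -- the newer level wins
```
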
SLIDE 61

High write throughput in FLSM

  • If the rate of insertion is higher than the rate of compaction, write throughput depends on the rate of compaction
  • When compaction from memory to level 0 stalls, writes to memory also stall

[Diagram: an incoming Write(key, value) waits while in-memory tables are compacted to level 0]

FLSM has faster compaction because of lower I/O, and hence higher write throughput

SLIDE 62

Challenges in FLSM

  • Every read/range query operation needs to examine multiple files per level
  • For example, if every guard has 5 files, read latency is increased by 5x (assuming no cache hits)

Trade-off between write I/O and read performance

SLIDE 63

Outline

  • Log-Structured Merge Tree (LSM)
  • Fragmented Log-Structured Merge Tree (FLSM)
  • Building PebblesDB using FLSM
  • Evaluation
  • Conclusion

SLIDE 64

PebblesDB

  • Built by modifying HyperLevelDB (~9100 LOC) to use FLSM
  • HyperLevelDB builds on LevelDB to provide improved parallelism and compaction
  • API-compatible with LevelDB, but not with RocksDB

SLIDE 66

Optimizations in PebblesDB

  • Challenge (get/range query): multiple files in a guard
  • Get() performance is improved using file-level Bloom filters

[Diagram: a Bloom filter answers “Is key 25 present?” with either “definitely not” or “possibly yes”]

SLIDE 68

Optimizations in PebblesDB

  • Challenge (get/range query): multiple files in a guard
  • Get() performance is improved using file-level Bloom filters

[Diagram: Level 1 with guards 15 and 70 and files 1–12, 15–39, 77–97 and 82–95, each with its own in-memory Bloom filter]

With file-level Bloom filters, PebblesDB reads the same number of files as any LSM-based store

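A sketch of how a per-file Bloom filter keeps the number of file reads down when a guard holds several files (the tiny Bloom filter below is hand-rolled for illustration and is not PebblesDB's implementation):

```python
import hashlib

# A tiny Bloom filter: a few hash positions per key are set in a bit array; a
# lookup that finds any of them unset proves the key is absent.
class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.array = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        return all(self.array & (1 << p) for p in self._positions(key))

# One filter per file in a guard; get() only opens files whose filter says "maybe".
guard_files = [{15: "a", 23: "b", 39: "c"}, {16: "d", 32: "e"}]
filters = []
for f in guard_files:
    bf = BloomFilter()
    for k in f:
        bf.add(k)
    filters.append(bf)

def get_in_guard(key):
    for bf, f in zip(filters, guard_files):
        if bf.might_contain(key) and key in f:   # usually at most one file is read
            return f[key]
    return None

print(get_in_guard(23))   # "b" -- found without opening the second file
```
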
SLIDE 69

Optimizations in PebblesDB

  • Challenge (get/range query): multiple files in a guard
  • Get() performance is improved using file-level Bloom filters
  • Range query performance is improved using parallel threads and better compaction

SLIDE 70

Outline

  • Log-Structured Merge Tree (LSM)
  • Fragmented Log-Structured Merge Tree (FLSM)
  • Building PebblesDB using FLSM
  • Evaluation
  • Conclusion

SLIDE 71

Evaluation

  • Micro-benchmarks
  • Low memory
  • Small dataset
  • Crash recovery
  • CPU and memory usage
  • Aged file system
  • Real-world workloads - YCSB
  • NoSQL applications

SLIDE 74

Real-world workloads - YCSB

  • Yahoo! Cloud Serving Benchmark - industry-standard macro-benchmark
  • Insertions: 50M, operations: 10M, key size: 16 bytes, value size: 1 KB

[Bar chart: throughput ratio w.r.t. HyperLevelDB for each workload, plus total write I/O; annotated values: Load A 35.08 Kops/s, Run A 25.8 Kops/s, Run B 33.98 Kops/s, Run C 22.41 Kops/s, Run D 57.87 Kops/s, Load E 34.06 Kops/s, Run E 5.8 Kops/s, Run F 32.09 Kops/s, Total IO 952.93 GB]

Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes

SLIDE 84

NoSQL stores - MongoDB

  • YCSB on MongoDB, a widely used NoSQL store
  • Inserted 20M key-value pairs with 1 KB value size; 10M operations

[Bar chart: throughput ratio w.r.t. WiredTiger for each workload, plus total write I/O; annotated values: Load A 20.73 Kops/s, Run A 9.95 Kops/s, Run B 15.52 Kops/s, Run C 19.69 Kops/s, Run D 23.53 Kops/s, Load E 20.68 Kops/s, Run E 0.65 Kops/s, Run F 9.78 Kops/s, Total IO 426.33 GB]

Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes

PebblesDB combines the low write IO of WiredTiger with the high performance of RocksDB

SLIDE 85

Outline

  • Log-Structured Merge Tree (LSM)
  • Fragmented Log-Structured Merge Tree (FLSM)
  • Building PebblesDB using FLSM
  • Evaluation
  • Conclusion

SLIDE 86

Conclusion

  • PebblesDB: a key-value store built on Fragmented Log-Structured Merge Trees
  • Increases write throughput and reduces write IO at the same time
  • Obtains up to 6x the write throughput of RocksDB
  • As key-value stores become more widely used, there have been several attempts to optimize them
  • PebblesDB combines algorithmic innovation (the FLSM data structure) with careful systems building

SLIDE 87

https://github.com/utsaslab/pebblesdb

SLIDE 89

Backup slides

SLIDE 90

Operations: Seek

  • Seek(target): returns the smallest key in the database which is >= target
  • Used for range queries (for example, return all entries between 5 and 18)

Get(1)

Level 0 – 1, 2, 100, 1000; Level 1 – 1, 5, 10, 2000; Level 2 – 5, 300, 500

SLIDE 91

Operations: Seek

  • Seek(target): returns the smallest key in the database which is >= target
  • Used for range queries (for example, return all entries between 5 and 18)

Seek(200)

Level 0 – 1, 2, 100, 1000; Level 1 – 1, 5, 10, 2000; Level 2 – 5, 300, 500

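A sketch of Seek over multiple levels (my own illustration): every level is positioned at its first key >= target and the smallest candidate wins, which is why each level has to be consulted:

```python
import bisect

levels = [
    [1, 2, 100, 1000],    # Level 0
    [1, 5, 10, 2000],     # Level 1
    [5, 300, 500],        # Level 2
]

def seek(target):
    # Position every level at its first key >= target; the smallest wins.
    candidates = []
    for level in levels:
        i = bisect.bisect_left(level, target)
        if i < len(level):
            candidates.append(level[i])
    return min(candidates) if candidates else None

print(seek(200))   # 300 -- no level holds a key in [200, 300)
print(seek(5))     # 5
```
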
SLIDE 94

Operations: Seek

Seek(23): all levels and the memtable need to be searched

[Diagram: the FLSM structure from before]

SLIDE 95

Optimizations in PebblesDB

  • Challenge with reads: multiple sstable reads per level
  • Optimized using sstable-level Bloom filters
  • Bloom filter: determines whether an element is in a set

[Diagram: a Bloom filter answers “Is key 25 present?” with either “definitely not” or “possibly yes”]

SLIDE 97

Optimizations in PebblesDB

  • Challenge with reads: multiple sstable reads per level
  • Optimized using sstable-level Bloom filters
  • Bloom filter: determines whether an element is in a set

[Diagram: Level 1 with guards 15 and 70 and files 1–12, 15–39, 77–97 and 82–95, each with an in-memory Bloom filter; for Get(97), the filters under guard 70 return False for file 82–95 and True for file 77–97, so only one file is read]

SLIDE 98

Optimizations in PebblesDB

With sstable-level Bloom filters, PebblesDB reads at most one file per guard with high probability

SLIDE 99

Optimizations in PebblesDB

  • Challenge with seeks: Multiple sstable reads per level
  • Parallel seeks: Parallel threads to seek() on files in a guard

[Diagram: for Seek(85), the two files under guard 70 of Level 1 (77–97 and 82–95) are seeked by Thread 1 and Thread 2 in parallel]

SLIDE 100

Optimizations in PebblesDB

  • Challenge with seeks: multiple sstable reads per level
  • Parallel seeks: parallel threads to seek() on the files in a guard
  • Seek-based compaction: triggers compaction for a level during a seek-heavy workload
  • Reduces the average number of sstables per guard
  • Reduces the number of active levels

Seek-based compaction increases write I/O as a trade-off to improve seek performance

SLIDE 101

Tuning PebblesDB

  • PebblesDB characteristics (the increase in write throughput, the decrease in write amplification, and the overhead of read/seek operations) all depend on one parameter, maxFilesPerGuard (default 2 in PebblesDB)
  • Setting it to a very high value favors write throughput
  • Setting it to a very low value favors read throughput

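A sketch of how this single knob trades write I/O against read cost (an assumed policy for illustration, not PebblesDB's actual trigger logic): a guard is compacted only once it holds more than the allowed number of files:

```python
# A larger limit means fewer rewrites (better write throughput and lower write
# amplification) but more files to examine per guard (costlier reads/seeks).
def needs_compaction(files_in_guard, max_files_per_guard=2):
    return len(files_in_guard) > max_files_per_guard

guard = [["file-a"], ["file-b"]]
print(needs_compaction(guard))                           # False: within the default limit
guard.append(["file-c"])
print(needs_compaction(guard))                           # True: exceeding it triggers compaction
print(needs_compaction(guard, max_files_per_guard=4))    # False: a higher limit defers write I/O
```
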
SLIDE 102

Horizontal compaction

  • Files are compacted within the same level for the last two levels in PebblesDB
  • Some optimizations prevent a huge increase in write IO

SLIDE 103

Experimental setup

  • Intel Xeon 2.8 GHz processor
  • 16 GB RAM
  • Running Ubuntu 16.04 LTS with the Linux 4.4 kernel
  • Software RAID0 over 2 Intel 750 SSDs (1.2 TB each)
  • Datasets in experiments 3x bigger than DRAM size

SLIDE 104

Write amplification

  • Inserted different numbers of keys with key size 16 bytes and value size 128 bytes

[Bar chart: write IO ratio w.r.t. PebblesDB for 10M, 100M and 500M keys inserted; PebblesDB's write IO is 7.2 GB, 100.7 GB and 756 GB respectively]

SLIDE 106

Micro-benchmarks

  • Used the db_bench tool that ships with LevelDB
  • Inserted 50M key-value pairs with key size 16 bytes and value size 1 KB
  • Number of read/seek operations: 10M

[Bar chart: throughput ratio w.r.t. HyperLevelDB for Seq-Writes, Random-Writes, Reads, Range-Queries and Deletes; annotated values: 239.05, 11.72, 6.89, 7.5 and 126.2 Kops/s respectively]

SLIDE 107

Multi-threaded micro-benchmarks

  • Writes – 4 threads, each writing 10M
  • Reads – 4 threads, each reading 10M
  • Mixed – 2 threads writing and 2 threads reading (10M each)

[Bar chart: throughput ratio w.r.t. HyperLevelDB for Writes, Reads and Mixed; annotated values: 44.4, 40.2 and 38.8 Kops/s respectively]

SLIDE 108

Small cached dataset

  • Insert 1M key-value pairs with 16 bytes key and 1 KB value
  • Total data set (~1 GB) fits within memory
  • PebblesDB-1: with maximum one file per guard

[Bar chart: throughput ratio w.r.t. HyperLevelDB for Writes, Reads and Range-Queries; annotated values: 45.25, 205.76 and 205.34 Kops/s respectively]

SLIDE 109

Small key-value pairs

  • Inserted 300M key-value pairs
  • Key: 16 bytes, value: 128 bytes

[Bar chart: throughput ratio w.r.t. HyperLevelDB for Writes, Reads and Range-Queries; annotated values: 44.48, 6.34 and 6.31 Kops/s respectively]

SLIDE 110

Aged FS and KV store

  • File system aging: fill up 89% of the file system
  • KV store aging: insert 50M, delete 20M and update 20M key-value pairs in random order

[Bar chart: throughput ratio w.r.t. HyperLevelDB for Writes, Reads and Range-Queries; annotated values: 17.37, 5.65 and 6.29 Kops/s respectively]

SLIDE 111

Low memory micro-benchmark

  • 100M key-value pairs with 1 KB values (~65 GB data set)
  • DRAM was limited to 4 GB

[Bar chart: throughput ratio w.r.t. HyperLevelDB for Writes, Reads and Range-Queries; annotated values: 27.78, 2.86 and 4.37 Kops/s respectively]

SLIDE 112

Impact of empty guards

  • Inserted 20M key-value pairs (0 to 20M) in random order with value size 512 bytes
  • Incrementally inserted 20M new keys after deleting the older keys

  • Around 9000 empty guards at the start of the last iteration
  • Read latency did not increase with the increase in empty guards

SLIDE 113

NoSQL stores - HyperDex

  • HyperDex – a distributed key-value store from Cornell
  • Inserted 20M key-value pairs with 1 KB value size; 10M operations

[Bar chart: throughput ratio w.r.t. HyperLevelDB for each workload, plus total write I/O; annotated values: Load A 22.08 Kops/s, Run A 21.85 Kops/s, Run B 31.17 Kops/s, Run C 32.75 Kops/s, Run D 38.02 Kops/s, Load E 7.62 Kops/s, Run E 0.37 Kops/s, Run F 19.11 Kops/s, Total IO 1349.5 GB]

Load A - 100% writes; Run A - 50% reads, 50% writes; Run B - 95% reads, 5% writes; Run C - 100% reads; Run D - 95% reads (latest), 5% writes; Load E - 100% writes; Run E - 95% range queries, 5% writes; Run F - 50% reads, 50% read-modify-writes

SLIDE 114

CPU usage

  • Median CPU usage while inserting 30M keys and reading 10M keys
  • PebblesDB: ~171%
  • Other key-value stores: 98-110%
  • Higher due to aggressive compaction and the CPU work of merging multiple files in a guard

SLIDE 115

Memory usage

  • 100M records (16 bytes key, 1 KB value) – 106 GB data set
  • ~300 MB of memory, i.e. 0.3% of the data set size
  • Worst case: 100M records (16 bytes key, 16 bytes value) – ~3.2 GB data set
  • Memory is then 9% of the data set size

SLIDE 116

Bloom filter calculation cost

  • 1.2 sec per GB of sstable
  • 3200 files – 52 GB – 62 seconds

SLIDE 117

Impact of different optimizations

  • Sstable-level Bloom filters improve read performance by 63%
  • PebblesDB without the optimizations for seek – 66%

SLIDE 118

Thank you!

Questions?
