MatrixKV: Reducing Write Stalls and Write Amplification in LSM-tree Based KV Stores with a Matrix Container in NVM
Ting Yao (1), Yiwen Zhang (1), Jiguang Wan (1), Qiu Cui (2), Liu Tang (2), Hong Jiang (3), Changsheng Xie (1), and Xubin He (4)
(1) Huazhong University of Science and Technology, China; (2) PingCAP, China; (3) University of Texas at Arlington, USA; (4) Temple University, USA
Outline
• Background and Motivations
• MatrixKV
• Evaluation
• Conclusion
LSM-tree based key-value stores
Log-structured merge tree (LSM-tree)
Applications:
• Write-intensive scenarios
Properties:
• Batched sequential writes: high write throughput
• Fast reads
• Fast range queries
LSM-tree and RocksDB
Systems with DRAM-SSD storage
Level sizes increase exponentially (by the amplification factor, AF)
Operations:
1. Insert
2. Flush
3. Compaction between L_i and L_{i+1}
   ◦ L0-L1 compaction
   ◦ L1-L2 compaction
   ◦ ...
[Figure: SSD-based RocksDB. Inserts go to the MemTable in DRAM; the immutable MemTable is flushed to L0 on SSD; compaction moves data from L0 to L1 and on to Ln.]
Challenge 1: Write stalls
Experiment: randomly writing an 80 GB dataset to an SSD-based RocksDB (20 million KV items, 16-byte keys and 4 KB values).
Write stall: application throughput periodically drops to nearly zero.
• Unpredictable performance
• Long tail latency
Cause: L0-L1 compaction, with 3.1 GB of compaction data.
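Root cause of write stalls: L0-L1 compaction
L0-L1 compaction is an all-to-all, coarse-grained compaction: the SSTables are read from disk, then merged and sorted in memory.
It consumes CPU cycles and SSD bandwidth.
[Figure: LSM-tree with C_m in memory and SSTables in L0, L1, L2, ..., Ln on disk.]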
Challenge 2: Write amplification
Experiment: randomly writing an 80 GB dataset to an SSD-based RocksDB (20 million KV items, 16-byte keys and 4 KB values).
Write amplification (WA): average throughput decreases gradually.
• Decreased performance
Cause: increased LSM-tree depth, which leads to more compactions and higher WA.
Root cause of increased write amplification
Level-by-level compactions: write amplification increases with the depth of the LSM-tree.
WA = AF * N
• AF is the amplification factor between two adjacent levels (AF = 10).
• N is the number of levels.
[Figure: LSM-tree with C_m in memory and SSTables in L0, L1, L2, ..., Ln on disk.]
State-of-the-art solution with NVM
NVM is byte-addressable, persistent, and fast.
NoveLSM*: adopts NVM to store a large mutable MemTable.
• 1.7x higher random write performance, but more severe write stalls
[Figure: NoveLSM. MemTable and immutable MemTable in both DRAM and NVM; L0, L1, ..., Ln on SSD.]
* Sudarsun Kannan, Nitish Bhat, Ada Gavrilovska, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. Redesigning LSMs for Nonvolatile Memory with NoveLSM. In 2018 USENIX Annual Technical Conference (ATC 18), 2018.
Motivation
• All-to-all L0-L1 compaction -> write stalls -> unstable performance
• Increased depth -> higher write amplification -> decreased performance
MatrixKV: reducing write stalls and write amplification in LSM-tree based KV stores by exploiting NVM
Overall Architecture
1. Matrix container in NVM: manages L0's data on NVM (the L0 of the LSM-tree).
2. Column compaction: a fine-granularity column compaction to reduce write stalls.
3. Reducing levels on SSD: reduces the number of LSM-tree levels to decrease WA on SSD.
4. Cross-row hint search: a hint search algorithm in the matrix container to improve read performance.
[Figure: Put -> mem and imm (MemTables) in DRAM -> Flush -> matrix container in NVM (receiver and compactor, with cross-row hints, accessed via PMDK) -> column compaction -> LSM-tree with reduced levels (L1, L2, ...) on SSD (accessed via POSIX).]
Matrix Container
The matrix container includes a receiver and a compactor.
• The receiver stores data flushed from DRAM row by row, organized as RowTables.
• A: a receiver turns into a compactor once it is filled with RowTables.
• The compactor compacts data from L0 to L1 on SSD column by column.
• B: after a column compaction, the NVM pages of that column are freed and become available for the receiver to accept new data.
[Figure: receiver and compactor in NVM, each holding RowTables 0-3; the compactor is split into key-range columns (a-c, c-e, e-n, n-o, u-z) that are merged with SSTables on L1 on SSD.]
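The receiver/compactor hand-off described above can be sketched as follows. This is a minimal in-memory illustration, not MatrixKV's implementation: the class name `MatrixContainer`, the parameter `rows_per_receiver`, and the choice to keep accepting rows while a compactor is still draining are all assumptions, and rows hold bare keys instead of KV items in NVM pages.

```python
class MatrixContainer:
    """Toy model of the receiver/compactor pair (hypothetical names)."""

    def __init__(self, rows_per_receiver):
        self.rows_per_receiver = rows_per_receiver
        self.receiver = []   # RowTables currently accepting flushes
        self.compactor = []  # RowTables waiting for column compaction

    def flush(self, sorted_keys):
        """Store one flushed immutable MemTable as a new row."""
        self.receiver.append(list(sorted_keys))
        # Step A: a full receiver becomes the compactor
        # (assumption: only when no compactor is still being drained).
        if len(self.receiver) >= self.rows_per_receiver and not self.compactor:
            self.compactor, self.receiver = self.receiver, []

    def compact_column(self, lo, hi):
        """Take every key in [lo, hi] out of all compactor rows.

        Returns the keys sorted, as they would be merged into L1.
        Duplicate keys are kept; a real merge would keep the newest version.
        """
        column = []
        for row in self.compactor:
            column.extend(k for k in row if lo <= k <= hi)
            row[:] = [k for k in row if not (lo <= k <= hi)]
        # Step B: space of emptied rows is released.
        self.compactor = [row for row in self.compactor if row]
        return sorted(column)


mc = MatrixContainer(rows_per_receiver=2)
mc.flush([1, 4, 9])
mc.flush([3, 8, 11])
merged = mc.compact_column(0, 8)
```

After the two flushes the receiver is full and becomes the compactor; compacting the column 0-8 leaves only the keys above 8 in NVM.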
RowTable
A RowTable consists of data and metadata.
Data region: serialized KV items from the immutable MemTable, sorted by key (k0 v0, k1 v1, ..., kn vn).
Metadata region: a sorted array with one entry per key:
• Key
• Page number
• Offset in the page
• Forward pointer (i.e., $p_n$)
[Figure (a): RowTable structure. The metadata array holds k0 ... kn, each with its page, offset, and pointer P0 ... Pn; the data region spans Page 0 ... Page n.]
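As an illustration of the data/metadata split, the sketch below serializes sorted KV items into a byte buffer and records a page number and in-page offset for each key. The 4 KB page size, the length-prefixed record encoding, and the names `MetaEntry`, `build_rowtable`, and `read_item` are assumptions made for this example; the slides do not specify the on-NVM format. The forward pointer is left unset here.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size, not taken from the slides


@dataclass
class MetaEntry:
    key: bytes
    page: int      # page number holding the start of the record
    offset: int    # byte offset inside that page
    forward: int   # index into the previous RowTable's metadata; -1 = unset


def build_rowtable(items, page_size=PAGE_SIZE):
    """Serialize (key, value) byte pairs in key order; return (data, metadata)."""
    data = bytearray()
    meta = []
    for key, value in sorted(items):
        pos = len(data)
        meta.append(MetaEntry(key, pos // page_size, pos % page_size, -1))
        # Assumed record layout: 4-byte length prefix before key and value.
        data += len(key).to_bytes(4, "little") + key
        data += len(value).to_bytes(4, "little") + value
    return bytes(data), meta


def read_item(data, entry, page_size=PAGE_SIZE):
    """Fetch one KV item using only its metadata entry."""
    pos = entry.page * page_size + entry.offset
    klen = int.from_bytes(data[pos:pos + 4], "little")
    key = data[pos + 4:pos + 4 + klen]
    pos += 4 + klen
    vlen = int.from_bytes(data[pos:pos + 4], "little")
    return key, data[pos + 4:pos + 4 + vlen]


items = [(b"k2", b"v2"), (b"k1", b"v" * 5000), (b"k3", b"v3")]
data, meta = build_rowtable(items)
```

Because the metadata array is sorted by key, a lookup can binary-search it and then jump straight to the page and offset of the item.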
Fine-grained column compaction
The non-overlapped L1 is a key space with multiple contiguous key ranges.
Example:
1. Start with range 0-3.
2. Compare the amount of compaction data with the compaction threshold.
3. Add the next subrange 3-5 -> range 0-5.
4. Add the next subrange 5-8 -> range 0-8.
5. The compaction threshold is reached; start the column compaction.
[Figure: keys in the compactor (NVM) and key-range boundaries of L1 (SSD).]
Compactor (NVM), one row per RowTable:
  3  5  7 10 23 28 35
  3  6  8 13 30 45 51
  1  4  9 10 13 38 42
  3  8 11 12 14 40 48
L1 (SSD) boundaries: 0 3 5 8 10 15 20 26 30 ...
Resulting columns: range [0-8], range (8-30], range ...
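The range-growing steps of the example can be sketched as below, using the compactor keys and L1 boundaries from the figure. Two simplifications are assumptions of this sketch: the amount of compaction data is approximated by the number of compactor keys in the range (the slide speaks of a data amount), and the threshold of 10 keys is chosen only so that the result matches the example. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def keys_in_range(row, lo, hi):
    """Count keys k in a sorted row with lo <= k <= hi."""
    return bisect_right(row, hi) - bisect_left(row, lo)


def select_column(rows, l1_bounds, threshold):
    """Grow the key range one L1 subrange at a time.

    Stops as soon as the compactor keys covered by the range reach
    `threshold` (a stand-in for the compaction-size threshold), or when
    the L1 boundaries run out.
    """
    lo = l1_bounds[0]
    hi = lo
    for hi in l1_bounds[1:]:
        covered = sum(keys_in_range(row, lo, hi) for row in rows)
        if covered >= threshold:
            break
    return lo, hi


rows = [
    [3, 5, 7, 10, 23, 28, 35],
    [3, 6, 8, 13, 30, 45, 51],
    [1, 4, 9, 10, 13, 38, 42],
    [3, 8, 11, 12, 14, 40, 48],
]
l1_bounds = [0, 3, 5, 8, 10, 15, 20, 26, 30]
column = select_column(rows, l1_bounds, threshold=10)
```

With these inputs the range covers 4 keys at 0-3, 6 keys at 0-5, and 10 keys at 0-8, so the selected column is the range 0-8, as in the example.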
Reducing LSM-tree depth
WA = AF * N
Flattening LSM-trees with wider levels:
• Keep the AF unchanged
• Reduce N
Level sizes:
  Level   Conventional LSM-tree (SSD)   Flattened LSM-tree in MatrixKV
  L0      256 MB                        8 GB (NVM)
  L1      256 MB                        8 GB (SSD)
  L2      2.56 GB                       80 GB (SSD)
  L3      25.6 GB                       800 GB (SSD)
  L4      256 GB                        8 TB (SSD)
  L5      2.56 TB                       -
Side effects and how MatrixKV addresses them:
• Increased unsorted L0 -> column compaction
• Decreased search efficiency in L0 -> cross-row hint search
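To make the effect of wider levels concrete, the sketch below counts how many SSD levels are needed to hold a dataset when L1 starts at 256 MB versus 8 GB, and plugs the result into WA = AF * N. The helper `levels_needed`, the rule that a tree is deep enough once the cumulative capacity of L1..LN covers the dataset, and the 80 GB dataset size are assumptions for illustration; the resulting numbers are estimates from the slide's formula, not measured results.

```python
def levels_needed(l1_size_mb, af, dataset_mb):
    """Number of levels (L1, L2, ...) whose total capacity covers the dataset."""
    levels, capacity, size = 0, 0, l1_size_mb
    while capacity < dataset_mb:
        capacity += size   # add this level's capacity
        size *= af         # the next level is AF times larger
        levels += 1
    return levels


AF = 10
DATASET_MB = 80_000  # 80 GB, in decimal megabytes

conventional = levels_needed(256, AF, DATASET_MB)    # L1 = 256 MB
flattened = levels_needed(8_000, AF, DATASET_MB)     # L1 = 8 GB

wa_conventional = AF * conventional
wa_flattened = AF * flattened
```

Under these assumptions the conventional layout needs four levels and the flattened layout two, so the formula's estimate of WA is halved while AF stays at 10.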
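Cross-row hint search
Constructing forward pointers:
• For a key x in RowTable i,
• the forward pointer refers to the first key y in RowTable i-1
• such that y ≥ x.
Search process with forward pointers:
• E.g., fetch key = 12
[Figure: four RowTables; the highlighted keys are the ones visited while searching for 12.]
  RowTable3:  3  5  7 10 23 28 35   (visited: 10, 23)
  RowTable2:  3  6  8 13 30 45 51   (visited: 8, 13, 30)
  RowTable1:  1  4  9 10 13 38 42   (visited: 9, 10, 13)
  RowTable0:  3  8 11 12 14 40 48   (visited: 11, 12, 14)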
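The search for key 12 can be sketched with index-based forward pointers. This is a simplified reading of the slide, assuming that each pointer refers to the first key in the next-older row that is not less than the pointing key, and that the search proceeds from the newest row (RowTable3) to the oldest. The function names are hypothetical, and the rows hold bare keys.

```python
from bisect import bisect_left


def build_forward_pointers(rows):
    """fwd[i][j] = index in rows[i-1] of the first key >= rows[i][j]."""
    fwd = [None] * len(rows)
    for i in range(1, len(rows)):
        fwd[i] = [bisect_left(rows[i - 1], x) for x in rows[i]]
    return fwd


def hint_search(rows, fwd, target):
    """Search from the newest row (last in `rows`) down to row 0.

    Returns (row_index, position) of the first hit, or None.
    """
    lo, hi = 0, len(rows[-1])
    for i in range(len(rows) - 1, -1, -1):
        row = rows[i]
        # Binary search only inside the window narrowed by the hints.
        p = bisect_left(row, target, lo, hi)
        if p < len(row) and row[p] == target:
            return i, p
        if i == 0:
            return None
        # Neighbours row[p-1] < target < row[p] bound the window below.
        lo = fwd[i][p - 1] if p > 0 else 0
        hi = fwd[i][p] if p < len(row) else len(rows[i - 1])
    return None


rows = [
    [3, 8, 11, 12, 14, 40, 48],   # RowTable0 (oldest)
    [1, 4, 9, 10, 13, 38, 42],    # RowTable1
    [3, 6, 8, 13, 30, 45, 51],    # RowTable2
    [3, 5, 7, 10, 23, 28, 35],    # RowTable3 (newest)
]
fwd = build_forward_pointers(rows)
hit = hint_search(rows, fwd, 12)
```

For key 12 the window shrinks to a single slot in RowTable2 and two slots each in RowTable1 and RowTable0, so only the first row needs a full binary search.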
Evaluation Setup
Comparisons:
• RocksDB-SSD: SSD-based RocksDB
• RocksDB-L0-NVM: RocksDB with L0 placed in NVM, on a system with DRAM, NVM, and SSD (8 GB NVM)
• NoveLSM: a heterogeneous system of DRAM, NVM, and SSD (8 GB NVM)
• MatrixKV: a heterogeneous system of DRAM, NVM, and SSD (8 GB NVM)
Test environment:
  Linux    64-bit Linux 4.13.9
  CPU      2 * Genuine Intel(R) 2.20 GHz processors
  Memory   32 GB
  NVM      128 GB * 2 Intel Optane DC PMM
           FIO 4 KB (MB/s): Random 2346 (R), 1363 (W); Sequential 2567 (R), 1444 (W)
  SSD      800 GB Intel SSDSC2BB800G7
           FIO 4 KB (MB/s): Random 250 (R), 68 (W); Sequential 445 (R), 354 (W)
Random Write Throughput
MatrixKV obtains the best performance across the tested value sizes.
E.g., with a 4 KB value size, MatrixKV outperforms RocksDB-L0-NVM and NoveLSM by 3.6x and 2.6x, respectively.
Write stalls
1. MatrixKV has better random write throughput.
2. MatrixKV has more stable throughput.
Result: reduced write stalls.
Tail Latency
  Latency (us)      avg.    90%     99%     99.9%
  RocksDB-SSD       974     566     11055   17983
  NoveLSM           450     317     2080    2169
  RocksDB-L0-NVM    477     786     1112    528
  MatrixKV          263     247     405     663
MatrixKV obtains the shortest latency in all cases.
E.g., the 99% latency of MatrixKV is 27x, 5x, and 1.9x lower than that of RocksDB-SSD, NoveLSM, and RocksDB-L0-NVM, respectively.
Fine-granularity column compaction
Why does MatrixKV reduce write stalls?
• 467 column compactions
• 0.33 GB each
Write amplification
The WA of randomly writing an 80 GB dataset.
WA = amount of data written to SSDs / amount of data written by users
MatrixKV's WA is 3.43x.
MatrixKV reduces the number of compactions with flattened LSM-trees.
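As a worked instance of the definition above, the reported WA of 3.43 and the 80 GB user dataset imply the following volume of SSD writes. The 274.4 GB figure is derived here from those two numbers and is not reported on the slide.

```latex
\[
\mathrm{WA} = \frac{\text{data written to SSDs}}{\text{data written by users}}
\quad\Longrightarrow\quad
\text{data written to SSDs} \approx 3.43 \times 80\ \mathrm{GB} = 274.4\ \mathrm{GB}
\]
```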
Summary
Conventional SSD-based KV stores:
• Unpredictable performance due to write stalls
• Sacrificed performance due to WA
MatrixKV: an LSM-tree based KV store for systems with DRAM, NVM, and SSD storage
• Matrix container in NVM
• Column compaction
• Hint search
• Reducing levels on SSD
MatrixKV reduces write stalls and improves write performance.
Thanks!
Open-source code: https://github.com/PDS-Lab/MatrixKV
Email: tingyao@hust.edu.cn