LOG-STRUCTURED MERGE-TRIE, PART 1
Xingbo Wu and Yuehai Xu, Wayne State University; Zili Shao, The Hong Kong Polytechnic University; Song Jiang, Wayne State University
Presented by: Joel Friberg

LSM-Trie Overview
■ KV items are organized in 32 MB HTables
■ Almost no index; items are placed by hash
■ Fixed-size buckets that match the disk block size (4 KB)
■ Linear and exponential growth patterns in the trie (112 sublevels in total)
■ Bloom filters with 16 bits per item (5% false positive rate achieved)
■ One disk read is enough to fetch the Bloom filters for a lookup (BloomCluster)
■ Optimized for stores of up to 10 TB
https://www.researchgate.net/profile/Pasi_Fraenti/publication/321323711/figure/fig8/AS:576074708525076@1514358321712/Prefix-tree-example.png
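As a rough check on the 5% figure, the textbook Bloom filter estimate can be applied to one filter and then summed over all 112 sublevels. This is a back-of-the-envelope sketch, not the paper's own derivation. It assumes an optimally tuned filter (false positive rate of 0.5 raised to the number of hash functions) and that a lookup may consult one filter in every sublevel.

```python
import math

BITS_PER_ITEM = 16   # from the slide
SUBLEVELS = 112      # from the slide

def bloom_false_positive_rate(bits_per_item: float) -> float:
    """Textbook estimate for an optimally tuned Bloom filter."""
    # Optimal number of hash functions is (m/n) * ln 2.
    k = bits_per_item * math.log(2)
    return 0.5 ** k

per_filter = bloom_false_positive_rate(BITS_PER_ITEM)

# Assumption: one filter is checked per sublevel, so the expected number
# of false positives per lookup is the per-filter rate times 112.
expected_false_positives = per_filter * SUBLEVELS

print(f"per filter: {per_filter:.6f}")
print(f"per lookup: {expected_false_positives:.4f}")
```

Under these assumptions the per-lookup figure lands close to the 5% quoted above.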

Question 1
“In the meantime, for some KV stores, such as SILT [24], major efforts are made to optimize reads by minimizing metadata size, while write performance can be compromised without conducting multi-level incremental compactions”
Explain how high write amplifications are produced in SILT.
■ A single SortedStore on disk holds everything
■ Entries in a HashStore can cover a large key range
■ Each merge rewrites far more existing data than the new data actually being written
http://ranger.uta.edu/~sjiang/CSE6350-spring-19/lecture-7.pdf
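The ratio in the last bullet can be made concrete with a small calculation. This is a simplified sketch that assumes a merge rewrites the whole SortedStore plus the incoming data. The 100 GB and 1 GB sizes are hypothetical and do not come from the SILT paper.

```python
GB = 1024 ** 3

def merge_write_amplification(sorted_store_bytes: int, new_bytes: int) -> float:
    """Bytes written to disk per byte of new data, for one full merge."""
    # Assumption: the merge rewrites the entire sorted store plus the new data.
    return (sorted_store_bytes + new_bytes) / new_bytes

# Hypothetical sizes: a 100 GB SortedStore absorbing 1 GB of new entries.
amplification = merge_write_amplification(100 * GB, 1 * GB)
print(amplification)  # 101.0
```

The larger the single sorted store grows relative to each incoming batch, the higher this ratio becomes.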

Question 2
“Note that LSM-trie uses hash functions to organize its data and accordingly does not support range search.”
Do FAWN and LevelDB support range search?
■ FAWN is hash based: no range search
■ LevelDB stores sorted KV pairs and indexes blocks by key range: range search is supported
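The difference can be illustrated with a toy example. The sketch below is not FAWN or LevelDB code. It only contrasts keys kept in sorted order, where a range is one contiguous slice, with keys ordered by their SHA-1 hash, where every key has to be inspected. The key names are made up.

```python
import hashlib
from bisect import bisect_left, bisect_right

keys = ["apple", "banana", "cherry", "date", "fig"]

# Sorted layout (LevelDB-like): a range is a contiguous slice.
sorted_keys = sorted(keys)

def range_scan_sorted(lo: str, hi: str) -> list:
    start = bisect_left(sorted_keys, lo)
    end = bisect_right(sorted_keys, hi)
    return sorted_keys[start:end]

# Hash layout (FAWN- or LSM-trie-like): neighbours in key order are
# scattered, so the only option is to test every key.
hash_order = sorted(keys, key=lambda k: hashlib.sha1(k.encode()).digest())

def range_scan_hashed(lo: str, hi: str) -> list:
    return sorted(k for k in hash_order if lo <= k <= hi)

print(range_scan_sorted("b", "d"))
print(range_scan_hashed("b", "d"))
```

Both calls return the same keys, but the sorted version touches only the matching slice while the hashed version scans the whole store.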

Question 3
Use Figure 1 to explain the difference between linear and exponential growth patterns.

Question 4
“Because 4KB block is a disk access unit, it is not necessary to maintain a larger index to determine byte offset of each item in a block.”
Show how a lookup with a given key is carried out in LevelDB.
■ Search the MemTable (a sorted skip list)
■ On each level, binary search the index for the SSTables whose key range covers the key, then check each candidate's Bloom filter
■ Retrieve the value from the matching SSTable
http://ranger.uta.edu/~sjiang/CSE6350-spring-19/lecture-7.pdf
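The steps above can be sketched as a toy lookup. This is not LevelDB's API. The `SSTable` class and `lookup` function are hypothetical, a dict stands in for the MemTable, a Python set stands in for the Bloom filter, and every level is assumed to hold non-overlapping tables sorted by key range (real LevelDB allows overlap on level 0).

```python
from bisect import bisect_left

class SSTable:
    """Toy SSTable: sorted keys, a key range, and a filter stand-in."""
    def __init__(self, items: dict):
        self.keys = sorted(items)
        self.values = [items[k] for k in self.keys]
        self.min_key, self.max_key = self.keys[0], self.keys[-1]
        self.filter = set(self.keys)  # stand-in for a Bloom filter

    def get(self, key):
        if key not in self.filter:      # filter says "definitely absent"
            return None
        i = bisect_left(self.keys, key)  # binary search inside the table
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

def lookup(key, memtable: dict, levels: list):
    # 1. Newest data first: the in-memory table.
    if key in memtable:
        return memtable[key]
    # 2. Then each level, from newest to oldest.
    for tables in levels:
        max_keys = [t.max_key for t in tables]
        i = bisect_left(max_keys, key)   # first table whose range may cover key
        if i < len(tables) and tables[i].min_key <= key:
            value = tables[i].get(key)
            if value is not None:
                return value             # 3. Retrieve the value.
    return None

memtable = {"a": "1"}
levels = [
    [SSTable({"b": "2", "d": "4"}), SSTable({"m": "13"})],
    [SSTable({"b": "old", "z": "26"})],
]
print(lookup("b", memtable, levels))  # 2
```

Searching the levels in order means the newer value for "b" is returned before the older one on the lower level is ever read.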

Question 5
“Instead, we first apply a cryptographic hash function, such as SHA-1, on the key, and then use the hashed key, or hashkey in short, to make the determination.”
Assuming a user-provided key has 160 bits, what’s the issue if LSM-trie used the user keys, instead of hashed keys, in its data structure and operations?
■ Cryptographic hash output is uniformly distributed, so the trie stays balanced
■ User keys may be skewed, which would leave the trie unbalanced
https://appliedgo.net/balancedtree/
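A small experiment shows the skew. The sketch below picks a child of the trie root from the leading bits of a key, first using raw user keys and then using their SHA-1 hashkeys. The 3-bit fanout and the `user:NNNNNNNN` key format are assumptions chosen for illustration.

```python
import hashlib
from collections import Counter

def top_bits(data: bytes, bits: int = 3) -> int:
    """Child index taken from the leading bits of the first byte."""
    return data[0] >> (8 - bits)

# Hypothetical user keys that all share a common prefix.
user_keys = [f"user:{i:08d}".encode() for i in range(8000)]

# Raw keys: every key starts with the same byte, so one child gets everything.
raw_children = Counter(top_bits(k) for k in user_keys)

# Hashkeys: SHA-1 output is close to uniform, so the children fill evenly.
hashed_children = Counter(top_bits(hashlib.sha1(k).digest()) for k in user_keys)

print("raw:   ", dict(raw_children))
print("hashed:", dict(sorted(hashed_children.items())))
```

With raw keys a single subtree receives all 8000 items, while the hashed keys spread across all eight children in roughly equal shares.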

Question 6
“Among all compactions moving data from Lk to Lk+1, we must make sure their key ranges are not overlapped to keep any two SSTables at Level Lk+1 from having overlapped key ranges. However, this cannot be achieved with the LevelDB data organization …”
Please explain why LevelDB cannot achieve it.
■ Each SSTable has limited capacity
■ The key range covered by an SSTable is highly variable
■ SSTables cover different key ranges at each sublevel
http://ranger.uta.edu/~sjiang/CSE6350-spring-19/lecture-7.pdf

Question 7
Use Figures 2 and 3 to describe the LSM-trie’s structure and how compaction is performed in the trie.

Conclusion
■ Optimized for many small items
■ High read and write performance
■ Hash based, with some indices used for large items
■ No range search
■ Uses 5 exponential levels, each made of linear sublevels (8 per level on levels 1-4, 80 on level 5), to store up to 10 TB of data
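The numbers in the last bullet can be cross-checked with a short calculation. This sketch assumes a trie fanout of 8, a single root node on level 1, and one 32 MB HTable per trie node per sublevel. Those three assumptions are not stated on the slides.

```python
HTABLE_MB = 32                 # from the overview slide
FANOUT = 8                     # assumption: each trie node has 8 children
SUBLEVELS = [8, 8, 8, 8, 80]   # sublevels on levels 1-5, from the slide

total_sublevels = sum(SUBLEVELS)

# Assumption: level 1 has one node, so level 5 has FANOUT ** 4 nodes.
last_level_nodes = FANOUT ** (len(SUBLEVELS) - 1)

# Assumption: one HTable per node per sublevel.
last_level_mb = last_level_nodes * SUBLEVELS[-1] * HTABLE_MB
last_level_tb = last_level_mb / (1024 * 1024)

print(total_sublevels)   # 112
print(last_level_tb)     # 10.0
```

Under these assumptions the last level alone accounts for the 10 TB target, and the sublevel count matches the 112 quoted in the overview.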