Scaling Log-Structured KV-Stores featuring Monkey and Dostoevsky SIGMOD17 / SIGMOD18 Niv Dayan
Log-Structured KV-Stores
Why Log-Structured KV-Stores?
fast writes
memory: byte-addressable
storage: block-addressable
In-Place Writes (B-trees): write data
Log-Structured Writes: buffer writes
Log-Structured KV-Stores: fast writes (buffer writes), fast reads, massive data
Background: The Log-Structured Merge-Tree (LSM-tree)
Writes go into the buffer as key-value pairs, for example:
Sherlock: a fictional detective
Waldo: an inconspicuous traveler
The buffer gets full.
Level 0 is the buffer. When it fills: sort & flush it to level 1 as a sorted run.
Sorted runs are sort-merged into level 2 and beyond.
Levels 1, 2, 3, … have exponentially increasing capacities.
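The buffer, sort & flush, and sort-merge steps above can be sketched as a toy in-memory structure. This is a minimal illustration, not the design of any real store: the class name `TinyLSM`, its parameters, and the rule that level i holds `buffer_capacity * size_ratio**i` entries are assumptions made for the example, and every level keeps a single sorted run.

```python
class TinyLSM:
    """Toy LSM-tree (hypothetical): one sorted run per level, all in memory."""

    def __init__(self, buffer_capacity=2, size_ratio=2):
        self.buffer_capacity = buffer_capacity
        self.size_ratio = size_ratio
        self.buffer = {}   # level 0: unsorted, in memory
        self.levels = []   # levels[i] is level i+1: sorted list of (key, value)

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_capacity:
            run = sorted(self.buffer.items())   # sort ...
            self.buffer = {}
            self._push(run, 0)                  # ... and flush

    def _capacity(self, i):
        # Assumed capacities: they grow exponentially with the level number.
        return self.buffer_capacity * self.size_ratio ** (i + 1)

    def _push(self, run, i):
        if i == len(self.levels):
            self.levels.append([])
        merged = dict(self.levels[i])   # older entries first
        merged.update(run)              # newer entries overwrite older ones
        merged_run = sorted(merged.items())
        if len(merged_run) > self._capacity(i):
            self.levels[i] = []         # level overflows: move the run down
            self._push(merged_run, i + 1)
        else:
            self.levels[i] = merged_run

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for run in self.levels:         # smaller (newer) levels first
            for k, v in run:
                if k == key:
                    return v
        return None


db = TinyLSM(buffer_capacity=2, size_ratio=2)
db.put("Sherlock", "a fictional detective")
db.put("Waldo", "an inconspicuous traveler")   # buffer is full: sort & flush
print(db.get("Waldo"))
```

The linear scan in `get` stands in for a real search within a run; it keeps the sketch short and is not how a run would be searched on storage.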
Where's Waldo?
Binary searching each run.
Pointers: one I/O per run.
Bloom filters: one per run, checked before the run is read.
level 1: true negative
level 2: false positive
level 3: true positive
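The three lookup outcomes above (true negative, false positive, true positive) can be reproduced with a small Bloom filter. This is a sketch under stated assumptions: `BloomFilter` and `lookup` are hypothetical names, hashing uses SHA-256 with a seed prefix purely for convenience, and a run is modeled as a Python dict so that "one I/O" is just a counter increment.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter (illustrative; real stores use faster hashes)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def lookup(runs, key):
    """runs: list of (bloom_filter, data_dict), smallest level first.

    Returns (value, ios), where ios counts the runs that were actually read.
    """
    ios = 0
    for bloom, data in runs:
        if not bloom.might_contain(key):
            continue                 # negative: skip the run, no I/O
        ios += 1                     # positive: pay one I/O to read the run
        if key in data:
            return data[key], ios    # true positive
        # otherwise it was a false positive and the I/O was wasted
    return None, ios
```

A filter never returns a false negative, so the only wasted I/Os come from false positives, which is why the false positive rate drives lookup cost in the analysis that follows.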
Merging frequency: a trade-off between writes and reads.
Merging: Leveling (read-optimized) vs. Tiering (write-optimized)
Tiering: gather runs at a level; when the level fills, merge them and flush the result to the next level.
Leveling: merge each incoming run into the level's existing run; flush to the next level when the level fills.
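To make the write-cost difference between the two merge policies concrete, the sketch below counts how many entries get written to storage under each one. It is a deliberately simplified model with assumptions that are not from the slides: each buffer flush carries 1 unit of entries, keys never overlap, level i is considered full at `size_ratio**i` units, and the function name `write_cost` is hypothetical.

```python
def write_cost(num_flushes, size_ratio, policy):
    """Count units of entries written to storage after num_flushes flushes.

    levels[i] is the list of run sizes currently sitting at level i+1.
    """
    levels = []
    written = 0

    def push(size, i):
        nonlocal written
        if i == len(levels):
            levels.append([])
        if policy == "leveling":
            # Merge the incoming run with the level's single run (rewrite both).
            merged = size + sum(levels[i])
            written += merged
            if merged >= size_ratio ** (i + 1):    # level full: flush down
                levels[i] = []
                push(merged, i + 1)
            else:
                levels[i] = [merged]
        else:  # tiering
            written += size                         # gather: write the run as-is
            levels[i].append(size)
            if len(levels[i]) == size_ratio:        # level full: merge & flush
                merged = sum(levels[i])
                levels[i] = []
                push(merged, i + 1)

    for _ in range(num_flushes):
        push(1, 0)
    return written


for policy in ("leveling", "tiering"):
    print(policy, write_cost(64, 4, policy))
```

In this model both policies move runs between levels at the same moments; leveling writes more because every arrival rewrites the run already at that level, which is the price it pays for keeping one run per level.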
Size ratio R: the number of levels is $\log_R(N)$.
Leveling: 1 run per level. Tiering: R runs per level.
At the smallest size ratio: 1 run per level under both.
At the largest size ratio: leveling is a sorted array (1 run per level); tiering is a log ($O(N)$ runs per level).
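The level and run counts above can be checked with a few lines of arithmetic. This is a rough sketch: the helper names `num_levels` and `max_runs` are hypothetical, sizes are counted in entries, and the largest level alone is assumed to be big enough to hold all N entries (smaller levels are ignored when sizing).

```python
def num_levels(n_entries, buffer_entries, size_ratio):
    """Smallest L such that buffer_entries * size_ratio**L >= n_entries."""
    levels = 1
    capacity = buffer_entries * size_ratio
    while capacity < n_entries:
        capacity *= size_ratio
        levels += 1
    return levels


def max_runs(n_entries, buffer_entries, size_ratio, policy):
    """Upper bound on runs a lookup may have to probe."""
    levels = num_levels(n_entries, buffer_entries, size_ratio)
    runs_per_level = 1 if policy == "leveling" else size_ratio
    return levels * runs_per_level


n, buf = 1_000_000, 1_000
for r in (2, 10):
    print(r, num_levels(n, buf, r),
          max_runs(n, buf, r, "leveling"),
          max_runs(n, buf, r, "tiering"))
```

Integer multiplication is used instead of a floating-point logarithm so that exact powers of R do not round to the wrong level count.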
Figure: spectrum from log (tiering) to sorted array (leveling), controlled by the size ratio R.
Monkey Dostoevsky
Monkey: Optimal Navigable Key-Value Store (SIGMOD17)
Niv Dayan, Manos Athanassoulis, Stratos Idreos
Bloom filters and data
Every filter gets the same number of bits/entry, x, so every filter has a false positive rate of $O(e^{-x})$.
Summed over the levels: $O(e^{-x}) + O(e^{-x}) + \dots = O(e^{-x} \cdot \log_R(N))$ I/O.
The largest level's filter takes the most memory, yet it saves at most 1 I/O!
Reallocate: same memory, fewer false positives.
Relax: let each level have its own false positive rate, $0 < p_i < 1$.
Model: express read cost and memory footprint as functions of $p_0, p_1, \dots$

$$\text{read cost} = \sum_{i=1}^{L} p_i$$

$$\text{memory footprint} = -\sum_{i=1}^{L} \frac{N}{R^{L-i}} \cdot \frac{\ln(p_i)}{\ln(2)^2}$$

Optimize: minimize read cost for a given memory footprint, in terms of $p_0, p_1, \dots$
Resulting false positive rates: $p_0 \approx O(e^{-x}/R^2)$, $p_1 \approx O(e^{-x}/R^1)$, $p_2 \approx O(e^{-x}/R^0)$.
They form a geometric progression, so their sum is $O(e^{-x})$ I/O.
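A small numeric check can show why the geometric progression helps. The sketch below compares a uniform allocation (same bits/entry for every filter) with a geometric one that uses the same total memory. It relies on the standard Bloom filter relation, bits = -n · ln(p) / ln(2)², and on assumptions made only for the example: level i of L holds N / R^(L-i) entries, and the function names are hypothetical.

```python
import math

LN2_SQ = math.log(2) ** 2


def level_sizes(n_largest, size_ratio, num_levels):
    # Level i (1..L) holds n_largest / R^(L-i) entries; level L is the largest.
    return [n_largest / size_ratio ** (num_levels - i)
            for i in range(1, num_levels + 1)]


def memory_bits(rates, sizes):
    # Total filter memory for the given per-level false positive rates.
    return -sum(n * math.log(p) for n, p in zip(sizes, rates)) / LN2_SQ


def uniform_rates(bits_per_entry, num_levels):
    p = math.exp(-bits_per_entry * LN2_SQ)
    return [p] * num_levels


def geometric_rates(bits_per_entry, sizes, size_ratio):
    num_levels = len(sizes)
    budget = bits_per_entry * sum(sizes)   # same memory as the uniform case
    # Set p_i = p_L / R^(L-i) and solve memory_bits(...) == budget for p_L.
    weighted = sum(n * (num_levels - i) for i, n in enumerate(sizes, start=1))
    ln_p_last = (math.log(size_ratio) * weighted - budget * LN2_SQ) / sum(sizes)
    p_last = math.exp(ln_p_last)
    return [p_last / size_ratio ** (num_levels - i)
            for i in range(1, num_levels + 1)]


sizes = level_sizes(1_000_000, 10, 5)
uniform = uniform_rates(10, 5)
geometric = geometric_rates(10, sizes, 10)
print("uniform   sum of FPRs:", sum(uniform))
print("geometric sum of FPRs:", sum(geometric))
```

With the same memory, the geometric allocation gives the small levels much lower false positive rates at the cost of a slightly higher rate on the largest level, and the summed rate no longer grows with the number of levels.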
$O(e^{-x} \cdot \log_R(N)) > O(e^{-x})$ I/O
Figure: read latency (ms) vs. number of entries (log scale). RocksDB: $O(e^{-x} \cdot \log_R(N))$. Monkey: $O(e^{-x})$.
Figure: Existing, Monkey, Dostoevsky. Monkey: from tiering to leveling.
I/O overheads with leveling: point lookups, long range lookups, short range lookups, writes
Point lookups: false positive rates decrease exponentially toward the smaller levels: $O(e^{-x})$ at the largest level, then $O(e^{-x}/R)$, then $O(e^{-x}/R^2)$.
Point lookup cost comes mostly from the largest level: $O(e^{-x})$.