Castle: Reinventing Storage for Big Data
Tom Wilkie, Founder & VP Engineering
Before the Flood (1990)
- Small databases
- BTree indexes
- BTree file systems
- RAID
- Old hardware
Two Revolutions (2010)
- Distributed, shared-nothing databases
- Write-optimised indexes
- BTree file systems
- RAID
- New hardware
Bridging the Gap (2011)
- Distributed, shared-nothing databases
- Castle
- New hardware
With Big Data, how do I... SNAPSHOTS?
What’s in the Castle?
[Architecture diagram: the Castle stack, top to bottom]
- Shared memory interface: keys/values cross the userspace / Acunu kernel boundary via async, shared-memory rings and shared buffers
- In-kernel interface: buffered insert, buffered get, range queries; streaming interface
- Doubling Arrays: doubling array mapping layer, Bloom filters, key queues, array management, merges
- Arrays mapping layer: modlist btree, version tree (btree insert, get, range queries)
- Cache ("Extent" layer): block mapping & caching, prefetcher, block cache, freespace allocator, extent manager, flusher & mapper, page cache
- Linux kernel: block layer and memory manager (MM)
Castle
- Open source (GPLv2; MIT for user libraries)
  http://bitbucket.org/acunu
- Loadable kernel module, targeting CentOS's 2.6.18 kernel
  http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/
The Interface
- Shared memory interface: keys/values via async, shared-memory rings and buffers between userspace and the Acunu kernel
- Streaming interface: buffered insert, buffered get, range queries
- Code: castle_{back,objects}.c
Doubling Array
- Doubling array mapping layer: Bloom filters, key queues, array management, merges
- Code: castle_{da,bloom}.c
             Update                   Range Query (size Z)
B-Tree       O(log_B N) random IOs    O(Z/B) random IOs

B = "block size", say 8KB; at 100 bytes/entry, that's ~100 entries
Doubling Array Inserts
[diagram: inserts 2, 9 buffered as small in-memory arrays]
- Buffer arrays in memory until we have > B of them
Doubling Array Inserts
[diagram: equal-sized sorted arrays merged as they fill, e.g. (2, 9) + (8, 11) → (2, 8, 9, 11), etc.]
- Similar to log-structured merge trees (LSM), cache-oblivious lookahead arrays (COLA), ...
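The merge-on-insert behaviour above can be sketched in a few lines of Python. This is an illustrative toy, not Castle's castle_da.c: level sizes, duplicate handling and on-disk layout are all simplified.

```python
import heapq

class DoublingArray:
    """Toy doubling array / COLA-style store: sorted arrays of size
    1, 2, 4, ...; an insert that collides with an occupied level
    merges the two equal-sized arrays, like binary carry propagation.
    Stale duplicates are kept (newest-first) rather than dropped."""

    def __init__(self):
        self.levels = {}  # array size -> sorted list of (key, value)

    def insert(self, key, value):
        arr, size = [(key, value)], 1
        while size in self.levels:
            # newer array first, so heapq.merge's stability keeps the
            # newest entry for a key ahead of older ones
            arr = list(heapq.merge(arr, self.levels.pop(size),
                                   key=lambda kv: kv[0]))
            size *= 2
        self.levels[size] = arr

    def get(self, key):
        # smaller arrays are newer; first hit wins
        for size in sorted(self.levels):
            for k, v in self.levels[size]:
                if k == key:
                    return v
        return None
```

Every write is a sequential append or merge, which is where the update-cost advantage over a B-tree comes from.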
Demo https://acunu-videos.s3.amazonaws.com/dajs.html
                 Update                        Range Query (size Z)
B-Tree           O(log_B N) random IOs         O(Z/B) random IOs
Doubling Array   O((log N)/B) sequential IOs

B = "block size", say 8KB; at 100 bytes/entry, that's ~100 entries
Doubling Array Queries
query(k)
- Add an index to each array to do lookups
- query(k) searches each array independently
Doubling Array Queries
query(k)
- Bloom filters can help exclude arrays from search
- ... but don't help with range queries
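A minimal sketch of how per-array Bloom filters cut point-query work (the hash count and filter size here are made-up illustrative parameters; Castle's real filters live in castle_bloom.c):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions in an m-bit field.
    A negative answer is definite; a positive may be a false positive."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _hashes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for h in self._hashes(key):
            self.bits |= 1 << h

    def might_contain(self, key):
        return all(self.bits >> h & 1 for h in self._hashes(key))

def point_query(arrays, filters, key):
    # consult each array's Bloom filter before searching it;
    # arrays whose filter says "no" are skipped entirely
    for arr, bf in zip(arrays, filters):
        if bf.might_contain(key):
            for k, v in arr:
                if k == key:
                    return v
    return None
```

Range queries get no such help: a filter answers "is this exact key possibly here?", not "does this array overlap [k1, k2)?", so every array must still be consulted.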
B-Tree: 8KB @ 100MB/s, w/ 8ms seek → 100 IOs/s
  ~ log(2^30)/log(100) = 5 IOs/update → 100/5 = 20 updates/s

Doubling Array: 8KB @ 100MB/s → 13k IOs/s
  ~ log(2^30)/100 = 0.2 IOs/update → 13k/0.2 = 65k updates/s

                 Update                        Range Query (size Z)
B-Tree           O(log_B N) random IOs         O(Z/B) random IOs
Doubling Array   O((log N)/B) sequential IOs   O(Z/B) sequential IOs

B = "block size", say 8KB; at 100 bytes/entry, that's ~100 entries
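The slide's back-of-envelope numbers can be reproduced directly (figures rounded as on the slide; assumes a disk doing 8KB blocks at 100MB/s with ~8ms seeks):

```python
import math

N = 2 ** 30        # number of entries
FANOUT = 100       # ~100 entries per 8KB block at 100 bytes/entry

# B-tree: each update walks ~log_100(2^30) levels, one random IO each
btree_ios_per_update = math.log(N, FANOUT)   # ~4.5, slide rounds to 5
random_ios_per_s = 100                       # ~8ms per seek
btree_updates_per_s = random_ios_per_s / round(btree_ios_per_update)

# Doubling array: update cost is amortised over sequential merges
seq_ios_per_s = 100e6 / (8 * 1024)           # ~13k IOs/s at 100MB/s
da_ios_per_update = math.log(N) / FANOUT     # ~0.2 (natural log, as on the slide)
da_updates_per_s = seq_ios_per_s / da_ios_per_update   # ~60k, slide rounds to 65k
```

So the roughly three-thousand-fold gap between 20 and 65k updates/s comes from two multiplied effects: sequential IOs are ~130x cheaper than seeks, and each update needs ~25x fewer of them.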
"Mod-list" B-Tree
- Arrays mapping layer: modlist btree, version tree (btree insert, get, range queries)
- Code: castle_{btree,versions}.c
Copy-on-Write BTree

Idea:
- Apply path-copying [DSST] to the B-tree

Problems:
- Space blowup: each update may rewrite an entire path
- Slow updates: as above

A log file system makes updates sequential, but relies on random access and garbage collection (its achilles heel!)
             Update                     Range Query (size Z)   Space
CoW B-Tree   O(log_B N_v) random IOs    O(Z/B) random IOs      O(N B log_B N_v)

N_v = #keys live (accessible) at version v
“BigTable” snapshots
[diagram: versions v1 and v2 sharing reference-counted arrays]
- Inserts produce arrays
- Snapshots increment ref counts on arrays
- Merges produce more arrays, decrement ref count on old arrays
- Space blowup
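The ref-counting scheme can be sketched as follows (hypothetical names, not any real BigTable API; arrays are plain Python lists):

```python
class RefCountedArray:
    """A sorted array of (key, value) entries shared between versions."""
    def __init__(self, entries):
        self.entries = entries
        self.refs = 0

class SnapshotStore:
    def __init__(self):
        self.versions = {}            # version name -> list of arrays

    def insert_array(self, version, entries):
        arr = RefCountedArray(sorted(entries))
        arr.refs = 1
        self.versions.setdefault(version, []).append(arr)

    def snapshot(self, parent, child):
        # a snapshot just shares the parent's arrays, bumping ref counts
        shared = list(self.versions.get(parent, []))
        for a in shared:
            a.refs += 1
        self.versions[child] = shared

    def merge(self, version):
        # merging produces a new array and drops refs on the old ones;
        # arrays still referenced by other versions survive in full,
        # which is where the space blowup comes from
        old = self.versions[version]
        merged = RefCountedArray(sorted(kv for a in old for kv in a.entries))
        merged.refs = 1
        for a in old:
            a.refs -= 1
        self.versions[version] = [merged]
```

After a snapshot and a merge, both the merged array and the complete pre-merge arrays exist on disk, so heavily-snapshotted data can approach O(VN) space.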
                      Update                        Range Query (size Z)    Space
CoW B-Tree            O(log_B N_v) random IOs       O(Z/B) random IOs       O(N B log_B N_v)
“BigTable”-style DA   O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(VN)

N_v = #keys live (accessible) at version v
“Mod-list” BTree

Idea:
- Apply fat nodes [DSST] to the B-tree
- i.e. insert (key, version, value) tuples, with special operations

Problems:
- Similar performance to a BTree

If you limit the #versions, it can be constructed sequentially and embedded into a DA
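The fat-node idea can be sketched like this: keep (key, version, value) tuples sorted and resolve a lookup to the newest entry at or before the requested version. This toy assumes versions are linearly ordered integers; the real structure answers lookups against a version tree.

```python
import bisect

class ModList:
    """Toy mod-list: fat nodes flattened into one sorted sequence of
    (key, version) entries with parallel values. Illustrative only."""
    def __init__(self):
        self.index = []   # sorted list of (key, version)
        self.values = []  # values aligned with self.index

    def insert(self, key, version, value):
        i = bisect.bisect_right(self.index, (key, version))
        self.index.insert(i, (key, version))
        self.values.insert(i, value)

    def get(self, key, version):
        # the entry just before the insertion point of (key, version)
        # is the newest write to `key` at or before `version`
        i = bisect.bisect_right(self.index, (key, version))
        if i and self.index[i - 1][0] == key:
            return self.values[i - 1]
        return None
```

Because the tuples are plain sorted entries, a bounded-#versions mod-list can be built by one sequential pass, which is what lets it be embedded as an array inside a DA.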
                                Update                        Range Query (size Z)    Space
CoW B-Tree                      O(log_B N_v) random IOs       O(Z/B) random IOs       O(N B log_B N_v)
“BigTable”-style DA (LevelDB)   O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(VN)
“Mod-list” in a DA (CASTLE)     O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(N)

N_v = #keys live (accessible) at version v
Stratified BTree

Problem:
- The embedded “Mod-list” has a #versions limit
- Merging newer with older arrays (duplicates removed)

Solution:
- Version-split arrays: v-split during merges
[diagram: keys k1..k5 across versions v0, v1, v2; after a v-split one array holds the entries for {v2}, another those for {v1, v0}, so duplicate entries can be dropped]
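The core of a v-split can be sketched as partitioning a versioned array by version set (illustrative only: the real split works over subtrees of the version tree and also carries forward ancestor entries still live in the split-off subtree):

```python
def v_split(entries, subtree_versions):
    """Partition (key, version, value) entries so one output array
    covers only `subtree_versions` and the other the remaining
    versions; each entry lands in exactly one output."""
    inside = [e for e in entries if e[1] in subtree_versions]
    outside = [e for e in entries if e[1] not in subtree_versions]
    return inside, outside
```

Because every output array then covers a disjoint set of versions, a later merge never has to keep two copies of the same (key, version) entry.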
Disk Layout: RDA
- Cache ("Extent" layer): block mapping & caching, prefetcher, block cache, freespace allocator, extent manager, flusher & mapper
- Code: castle_{cache,extent,freespace,rebuild}.c
Disk Layout: RDA (random duplicate allocation)
[diagram: blocks 1..16, each written twice, spread at random across the disks]
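Random duplicate allocation can be sketched like this (the copy count, disk count, and function name are illustrative parameters, not Castle's actual layout code):

```python
import random

def rda_layout(num_blocks, num_disks, copies=2, seed=0):
    """Toy RDA: each block's `copies` replicas are placed on distinct
    disks chosen uniformly at random. Returns block -> list of disks."""
    rng = random.Random(seed)
    return {block: rng.sample(range(num_disks), copies)
            for block in range(num_blocks)}
```

Random placement spreads load evenly without a global layout table: reads can go to whichever replica's disk is least loaded, and after a disk failure every surviving disk holds some of the lost copies, so rebuild reads parallelise across the whole array.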
Performance Comparison