
Castle: Reinventing Storage for Big Data, by Tom Wilkie



  1. Castle: Reinventing Storage for Big Data. Tom Wilkie, Founder & VP Engineering

  2. Before the Flood (1990): small databases; BTree indexes; BTree file systems; RAID; old hardware

  3. Two Revolutions (2010): distributed, shared-nothing databases; write-optimised indexes; BTree file systems; ...; RAID; new hardware

  4. Bridging the Gap (2011): distributed, shared-nothing databases; Castle; ...; new hardware

  5. With Big Data, how do I... [rotated label: SNAPSHOTS]

  6. What’s in the Castle?

  7. [Architecture diagram: the Acunu Kernel stack, top to bottom] Userspace interface (shared memory interface: async shared memory ring, shared buffers for keys and values; streaming interface; in-kernel workloads) • Kernelspace interface (key insert, key get, buffered value insert, buffered value get, range queries) • Doubling Arrays (doubling array mapping layer, Bloom filters, insert queues, arrays management, merges; key insert, key get, range queries) • Arrays (mapping layer, modlist btree, version tree, value arrays; btree key insert, btree key get, btree range queries) • Cache (block mapping & caching layer: prefetcher, block cache, flusher, page cache) • "Extent" layer (extent allocator, extent manager & mapper, freespace manager) • Linux Kernel (Linux's block & MM layers: block layer, memory manager)

  8. Castle [architecture diagram repeated] • Open source (GPLv2, MIT for user libraries) • http://bitbucket.org/acunu • Loadable Kernel Module, targeting CentOS’s 2.6.18 • http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/

  9. The Interface [architecture diagram, cropped to the userspace and kernelspace interface layers] Source files: castle_{back,objects}.c

  10. Doubling Array [architecture diagram, cropped to the interface, Doubling Arrays and Arrays layers] Source files: castle_{da,bloom}.c

  11. $B$ = “block size”, say 8KB; at 100 bytes/entry ≈ 100 entries

      |        | Update                   | Range Query (Size Z) |
      | ------ | ------------------------ | -------------------- |
      | B-Tree | $O(\log_B N)$ random IOs | $O(Z/B)$ random IOs  |

  12. Doubling Array Inserts [diagram: keys 2 and 9 are inserted and combined into the sorted array (2, 9)] Buffer arrays in memory until we have > B of them

  13. Doubling Array Inserts [diagram: keys 8 and 11 are inserted, combined into (8, 11), then merged with (2, 9) to give (2, 8, 9, 11), etc...] Similar to log-structured merge trees (LSM), cache-oblivious lookahead array (COLA), ...
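
The insert sequence on the two slides above can be sketched as a toy in-memory structure. This is a minimal illustration of the general doubling-array idea, not Castle's implementation: the class name `DoublingArray` and its `levels` layout are hypothetical, and the sketch ignores in-memory buffering, disk IO, values and duplicate keys.

```python
from heapq import merge


class DoublingArray:
    """Toy doubling array: level i is either empty (None) or a sorted run of 2**i keys.

    Hypothetical names; illustrates the merge cascade only.
    """

    def __init__(self):
        self.levels = []  # levels[i] is None or a sorted list of length 2**i

    def insert(self, key):
        run = [key]  # a new key starts life as a sorted run of length 1
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                # Free slot: the run settles here.
                self.levels[i] = run
                return
            # Slot occupied: merge the two equal-sized runs in one sequential
            # pass and carry the doubled run to the next level.
            run = list(merge(self.levels[i], run))
            self.levels[i] = None
            i += 1


da = DoublingArray()
for k in (2, 9, 8, 11):
    da.insert(k)
print(da.levels)
```

Each merge reads and writes whole runs in order, which is where the sequential IO pattern of this family of structures comes from.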

  14. Demo https://acunu-videos.s3.amazonaws.com/dajs.html

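  15. $B$ = “block size”, say 8KB; at 100 bytes/entry ≈ 100 entries

      |                | Update                         | Range Query (Size Z) |
      | -------------- | ------------------------------ | -------------------- |
      | B-Tree         | $O(\log_B N)$ random IOs       | $O(Z/B)$ random IOs  |
      | Doubling Array | $O((\log N)/B)$ sequential IOs |                      |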

  16. Doubling Array Queries query(k) • Add an index to each array to do lookups • query(k) searches each array independently

  17. Doubling Array Queries query(k) • Bloom Filters can help exclude arrays from search • ... but don’t help with range queries
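
The two query slides above can be sketched as follows. This is an illustrative sketch only: `BloomFilter`, `build_array`, `query` and `range_query` are hypothetical names, the per-array "index" is stood in for by a binary search over an in-memory sorted list, and the filter parameters are arbitrary.

```python
import bisect
import hashlib


class BloomFilter:
    """Small Bloom filter over a Python int used as a bit set (illustrative sizes)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_array(pairs):
    """Build one sorted array plus its Bloom filter from (key, value) pairs."""
    pairs = sorted(pairs)
    bloom = BloomFilter()
    for k, _ in pairs:
        bloom.add(k)
    return {"keys": [k for k, _ in pairs], "values": [v for _, v in pairs], "bloom": bloom}


def query(arrays, key):
    """Point lookup; `arrays` is ordered newest first. Returns (value, arrays_searched)."""
    searched = 0
    for arr in arrays:
        if not arr["bloom"].might_contain(key):
            continue  # the filter excludes this array from the search
        searched += 1
        i = bisect.bisect_left(arr["keys"], key)
        if i < len(arr["keys"]) and arr["keys"][i] == key:
            return arr["values"][i], searched
    return None, searched


def range_query(arrays, lo, hi):
    """Range lookup over [lo, hi]; Bloom filters cannot help, so every array is consulted."""
    result = {}
    for arr in reversed(arrays):  # oldest first, so newer values overwrite older ones
        i = bisect.bisect_left(arr["keys"], lo)
        j = bisect.bisect_right(arr["keys"], hi)
        result.update(zip(arr["keys"][i:j], arr["values"][i:j]))
    return sorted(result.items())


arrays = [
    build_array([(8, "h"), (11, "k")]),              # newer array
    build_array([(2, "b"), (8, "old"), (9, "i")]),   # older array
]
print(query(arrays, 8), query(arrays, 5), range_query(arrays, 2, 9))
```

A Bloom filter answers membership for a single key, which is why it can skip arrays on point lookups but gives no information about which arrays hold keys inside a range.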

  18. $B$ = “block size”, say 8KB; at 100 bytes/entry ≈ 100 entries

      |                | Update                         | Range Query (Size Z)    |
      | -------------- | ------------------------------ | ----------------------- |
      | B-Tree         | $O(\log_B N)$ random IOs       | $O(Z/B)$ random IOs     |
      | Doubling Array | $O((\log N)/B)$ sequential IOs | $O(Z/B)$ sequential IOs |

      B-Tree: 8KB @ 100MB/s, w/ 8ms seek = 100 IOs/s; ~ log(2^30)/log 100 = 5 IOs/update; 100 / 5 = 20 updates/s

      Doubling Array: 8KB @ 100MB/s = 13k IOs/s; ~ log(2^30)/100 = 0.2 IOs/update; 13k / 0.2 = 65k updates/s
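
The back-of-the-envelope numbers on this slide can be reproduced with a few lines of arithmetic. The constants below are the slide's own; two choices are assumptions made here: 100MB/s is read as 10^8 bytes/s, and the logarithms are natural logs (which is what matches the slide's 0.2 IOs/update). The variable names are illustrative.

```python
import math

BLOCK_BYTES = 8 * 1024     # "8KB" block
BANDWIDTH = 100 * 10**6    # "100MB/s", read here as 10^8 bytes/s (assumption)
SEEK_SECONDS = 0.008       # "8ms seek"
ENTRIES_PER_BLOCK = 100    # B, entries per block
NUM_KEYS = 2**30           # N

transfer_seconds = BLOCK_BYTES / BANDWIDTH

# Random IO pays a seek per block; sequential IO pays only the transfer.
random_iops = 1 / (SEEK_SECONDS + transfer_seconds)
sequential_iops = 1 / transfer_seconds

# B-tree update: about log_B(N) random IOs.
btree_ios_per_update = math.log(NUM_KEYS) / math.log(ENTRIES_PER_BLOCK)
# Doubling array update: about (log N) / B sequential IOs, amortised.
da_ios_per_update = math.log(NUM_KEYS) / ENTRIES_PER_BLOCK

btree_updates_per_s = random_iops / btree_ios_per_update
da_updates_per_s = sequential_iops / da_ios_per_update

print(f"B-tree:         {random_iops:.0f} IOs/s, {btree_ios_per_update:.1f} IOs/update, "
      f"{btree_updates_per_s:.0f} updates/s")
print(f"Doubling array: {sequential_iops:.0f} IOs/s, {da_ios_per_update:.2f} IOs/update, "
      f"{da_updates_per_s:.0f} updates/s")
```

The unrounded figures differ somewhat from the slide's rounded ones but stay in the same order of magnitude, and the gap between the two structures is roughly three orders of magnitude either way.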

  19. Doubling Array [architecture diagram, cropped to the interface, Doubling Arrays and Arrays layers] Source files: castle_{da,bloom}.c

  20. “Mod-list” B-Tree [architecture diagram, cropped to the Doubling Arrays, Arrays, cache and "Extent" layers] Source files: castle_{btree,versions}.c

  21. Copy-on-Write BTree. Idea: • Apply path-copying [DSST] to the B-tree. Problems: • Space blowup: each update may rewrite an entire path • Slow updates: as above. A log file system makes updates sequential, but relies on random access and garbage collection (Achilles' heel!)
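
Path-copying and its space cost can be illustrated on a much simpler structure. The sketch below uses an unbalanced binary search tree rather than a B-tree, purely to keep it short; `Node`, `insert` and `get` are hypothetical names, and nothing here reflects how a real copy-on-write B-tree lays out blocks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Immutable tree node: an update never modifies an existing node."""
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return (new_root, nodes_written); the tree under the old root is untouched."""
    if root is None:
        return Node(key, value), 1
    if key < root.key:
        child, n = insert(root.left, key, value)
        return Node(root.key, root.value, child, root.right), n + 1
    if key > root.key:
        child, n = insert(root.right, key, value)
        return Node(root.key, root.value, root.left, child), n + 1
    # Same key: replace the value by writing a fresh copy of this node.
    return Node(key, value, root.left, root.right), 1


def get(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


v1 = None
for k, v in [(5, "five"), (3, "three"), (8, "eight")]:
    v1, _ = insert(v1, k, v)

# Updating one key in a new version rewrites every node on the root-to-key path.
v2, copied = insert(v1, 3, "new")
print(get(v1, 3), get(v2, 3), copied)
```

Both versions stay readable and untouched subtrees are shared, but every update writes a whole root-to-leaf path, which is the space blowup the slide refers to.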

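  22. $N_v$ = #keys live (accessible) at version $v$

      |            | Update                     | Range Query         | Space                   |
      | ---------- | -------------------------- | ------------------- | ----------------------- |
      | CoW B-Tree | $O(\log_B N_v)$ random IOs | $O(Z/B)$ random IOs | $O(N B \log_B N_v)$     |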

  23. “BigTable” snapshots • Inserts produce arrays [diagram: version v1 with arrays a and b, each labelled with ref count 1]

  24. “BigTable” snapshots • Inserts produce arrays • Snapshots increment ref counts on arrays • Merges produce more arrays, decrement ref count on old arrays [diagram: versions v1 and v2; arrays a and b labelled with ref counts 1 and 2, array c with ref count 1]

  25. “BigTable” snapshots • Inserts produce arrays • Snapshots increment ref counts on arrays • Merges produce more arrays, decrement ref count on old arrays [diagram: versions v1 and v2 after a merge; arrays a, b and a merged array of a, b, c, with ref counts of 1]

  26. “BigTable” snapshots • Inserts produce arrays • Snapshots increment ref counts on arrays • Merges produce more arrays, decrement ref count on old arrays • Space blowup [diagram: versions v1 and v2 after a merge; arrays a, b and a merged array of a, b, c, with ref counts of 1]

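  27. $N_v$ = #keys live (accessible) at version $v$

      |                     | Update                         | Range Query             | Space               |
      | ------------------- | ------------------------------ | ----------------------- | ------------------- |
      | CoW B-Tree          | $O(\log_B N_v)$ random IOs     | $O(Z/B)$ random IOs     | $O(N B \log_B N_v)$ |
      | “BigTable” style DA | $O((\log N)/B)$ sequential IOs | $O(Z/B)$ sequential IOs | $O(VN)$             |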

  28. “Mod-list” BTree. Idea: • Apply fat-nodes [DSST] to the B-tree • i.e. insert (key, version, value) tuples, with special operations. Problems: • Similar performance to a BTree. If you limit the #versions, it can be constructed sequentially and embedded into a DA
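
The (key, version, value) tuples can be sketched as a sorted list plus a version tree. This is an assumption-laden illustration, not Castle's on-disk format: `ModList`, `ancestors` and the `parent` mapping are hypothetical, and the "special operation" shown is only one plausible reading, namely that a lookup at version v returns the entry written at the closest ancestor of v.

```python
import bisect


def ancestors(version, parent):
    """Yield `version`, then its parent, grandparent, ... up to the root."""
    while version is not None:
        yield version
        version = parent[version]


class ModList:
    """Sorted list of (key, version, value) tuples; one tuple per update."""

    def __init__(self, parent):
        self.parent = parent   # version tree as a child -> parent mapping
        self.entries = []      # kept sorted by (key, version)

    def insert(self, key, version, value):
        # A (key, version) prefix sorts just before any (key, version, value).
        i = bisect.bisect_left(self.entries, (key, version))
        if i < len(self.entries) and self.entries[i][:2] == (key, version):
            self.entries[i] = (key, version, value)
        else:
            self.entries.insert(i, (key, version, value))

    def get(self, key, version):
        # Collect all versions of this key, then walk up the version tree.
        i = bisect.bisect_left(self.entries, (key,))
        by_version = {}
        while i < len(self.entries) and self.entries[i][0] == key:
            by_version[self.entries[i][1]] = self.entries[i][2]
            i += 1
        for v in ancestors(version, self.parent):
            if v in by_version:
                return by_version[v]
        return None


# Hypothetical version tree: v1 and v2 are both snapshots (children) of v0.
parent = {0: None, 1: 0, 2: 0}
m = ModList(parent)
m.insert("k1", 0, "a")
m.insert("k1", 1, "b")
print(m.get("k1", 0), m.get("k1", 1), m.get("k1", 2))
```

Each update adds exactly one tuple regardless of how many versions can see it, which is the intuition behind the linear space bound quoted for this approach.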

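  29. $N_v$ = #keys live (accessible) at version $v$

      |                               | Update                         | Range Query             | Space               |
      | ----------------------------- | ------------------------------ | ----------------------- | ------------------- |
      | CoW B-Tree                    | $O(\log_B N_v)$ random IOs     | $O(Z/B)$ random IOs     | $O(N B \log_B N_v)$ |
      | “BigTable” style DA (LevelDB) | $O((\log N)/B)$ sequential IOs | $O(Z/B)$ sequential IOs | $O(VN)$             |
      | “Mod-list” in a DA (CASTLE)   | $O((\log N)/B)$ sequential IOs | $O(Z/B)$ sequential IOs | $O(N)$              |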

  30. Stratified BTree. Problem: the embedded “Mod-list” has a #versions limit. Solution: version-split arrays during merges. [Diagram: a newer and an older array of (key, version) entries over keys k1 to k5 and versions v0 to v2 are merged (duplicates removed); the merged array is then v-split into one array for {v2} and one for {v1,v0}; the v0 entries in the {v2} array are marked as duplicates]

  31. “Mod-list” B-Tree [architecture diagram, cropped to the Doubling Arrays, Arrays, cache and "Extent" layers] Source files: castle_{btree,versions}.c

  32. Disk Layout: RDA [architecture diagram, cropped to the Arrays, cache, "Extent" and Linux Kernel layers] Source files: castle_{cache,extent,freespace,rebuild}.c

  33. Disk Layout: RDA (random duplicate allocation) [diagram: numbered chunks 1 to 16, each appearing more than once across the disks]
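
The name "random duplicate allocation" suggests placing each chunk on more than one randomly chosen disk. The sketch below shows only that generic idea under stated assumptions: the function `rda_layout`, the choice of two copies, the four disks and the fixed seed are all hypothetical, and Castle's actual placement and rebuild logic may differ.

```python
import random
from collections import Counter


def rda_layout(num_chunks, num_disks, copies=2, seed=0):
    """Place `copies` replicas of each chunk on distinct, randomly chosen disks.

    Returns a list with one entry per disk, each a list of chunk numbers.
    """
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    disks = [[] for _ in range(num_disks)]
    for chunk in range(1, num_chunks + 1):
        # sample() picks distinct disks, so replicas never share a disk.
        for d in rng.sample(range(num_disks), copies):
            disks[d].append(chunk)
    return disks


layout = rda_layout(num_chunks=16, num_disks=4)
for i, chunks in enumerate(layout):
    print(f"disk {i}: {chunks}")

# Every chunk should appear exactly `copies` times across all disks.
replica_counts = Counter(c for chunks in layout for c in chunks)
```

With replicas scattered at random, the data needed to rebuild a failed disk is spread over all the remaining disks rather than concentrated on a single mirror.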

  34. Performance Comparison
