The Power of the Log: LSM & Append Only Data Structures



  1. The Power of the Log LSM & Append Only Data Structures Ben Stopford Confluent Inc

  2. @benstopford

  3. Kafka: a Streaming Platform (diagram: Producer; Consumer; Connectors; Connectors; The Log; Streaming Engine)

  4. Kafka’s Distributed Log: Append Only, Linear Scans

  5. Messaging is a Log-Shaped Problem: Append Only, Linear Scans

  6. Not all problems are Log-Shaped

  7. Many problems benefit from being addressed in a “log-shaped” way

  8. Supporting Lookups

  9. Lookups in a log (diagram: Head; Tail)

  10. Trees provide Selectivity (diagram: Index over bob, hary, mike, steve, vince, dave, fred)

  11. But the overarching structure implies Dispersed Writes, i.e. Random IO (diagram: bob, hary, mike, steve, vince, dave, fred)

  12. Log Structured Merge Trees 1996

  13. Used in a range of modern databases • BigTable • MongoDB • HBase • WiredTiger • LevelDB • Cassandra • SQLite4 • MySQL • RocksDB • InfluxDB ...

  14. If a system has a natural grain, it is one formed of sequential operations which favour locality

  15. Caching & Prefetching: Application-level caching, Page Cache, Disk Controller, CPU Caches (L3 cache, L2 cache, L1 cache). Pre-fetch is your friend

  16. Write efficiency comes from amortising writes into sequential operations

  17. Taken from ACMQueue: The Pathologies of Big Data

  18. So if we go against the grain of the system, RAM can actually be slower than disk

  19. Going against the grain means dispersed operations that break locality (diagram: Good Locality; Poor Locality)

  20. The beauty of the log lies in its sequentiality: Append Only, Linear Scans

  21. LSM is about re-imagining search as a “log-shaped” problem

  22. Arrange writes to be Append Only. Update in Place, Ordered File (Random IO): Bob = Carpenter is overwritten by Bob = Cabinet Maker. Append Only Journal (Sequential IO): Bob = Carpenter, then Bob = Cabinet Maker
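
The contrast on this slide can be sketched in a few lines. This is a minimal illustration, not anything from the talk: the JSON-lines file format and the helper names `append_record` and `latest_value` are assumptions made for the example. A new value for "bob" is appended to the end of the journal rather than overwriting the old one, and a reader treats the last record for a key as current.

```python
import json
import os
import tempfile


def append_record(path, key, value):
    """Append one record to the end of the journal (sequential IO)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")


def latest_value(path, key):
    """Scan the journal from start to end; the last record for a key wins."""
    result = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["key"] == key:
                result = record["value"]
    return result


# Hypothetical journal file in a temporary directory.
journal = os.path.join(tempfile.mkdtemp(), "journal.log")
append_record(journal, "bob", "Carpenter")
append_record(journal, "bob", "Cabinet Maker")  # supersedes, never overwrites
print(latest_value(journal, "bob"))
```

The write is cheap because it only ever touches the end of the file; the cost moves to the read, which has to scan.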

  23. Avoid dispersed writes

  24. Simple LSM

  25. Writes are collected in memory (diagram: Writes; sort; RAM; write to disk; small index file; older files)

  26. When enough have buffered, sort. (diagram: Writes; sorted; Batched; RAM; write to disk; small index file; older files)

  27. Write the sorted file to disk (diagram: Writes; sorted; Batched; write to disk; Small, sorted immutable file; older files)

  28. Repeat... (diagram: Writes; sorted; Batched; write to disk; New files; Older files)

  29. Batching -> Fast Sequential IO (diagram: Writes; Sorted memtable; Batched; write to disk; New files; Older files)

  30. That’s the core write path
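
The write path described on the preceding slides can be sketched as follows. This is an in-memory illustration under stated assumptions: the class name `SimpleLSM`, the `memtable_limit` threshold and the use of tuples as stand-ins for on-disk files are all hypothetical and chosen for brevity.

```python
class SimpleLSM:
    """Write path only: buffer in memory, sort, flush as an immutable batch."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}                    # mutable, in-memory buffer
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.segments = []                    # oldest first; stand-ins for files

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        # Sort once, then emit the whole batch as one immutable segment.
        self.segments.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}


lsm = SimpleLSM(memtable_limit=2)
lsm.put("steve", "Painter")
lsm.put("bob", "Carpenter")      # second write triggers a flush
lsm.put("bob", "Cabinet Maker")  # lands in a fresh memtable
print(lsm.segments)
print(lsm.memtable)
```

Note that the older value for "bob" stays in its flushed segment untouched; nothing on "disk" is ever modified in place.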

  31. What about reads?

  32. Search reverse-chronologically, from newer files to older files: (1) Is “bob” here? (2) Is “bob” here? (3) Is “bob” here? (4) Is “bob” here?

  33. Worst Case We consult every file
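
A small sketch of that read path, assuming each file is a sorted list of (key, value) pairs held in memory. The function names and sample data are hypothetical. The second return value counts how many files were consulted, which makes the worst case visible.

```python
from bisect import bisect_left


def search_segment(segment, key):
    """Binary search one sorted segment of (key, value) pairs."""
    i = bisect_left(segment, (key,))
    if i < len(segment) and segment[i][0] == key:
        return segment[i][1]
    return None


def get(memtable, segments, key):
    """Check the memtable, then segments from newest to oldest.

    Returns (value, files_consulted); value is None if the key is absent.
    """
    if key in memtable:
        return memtable[key], 0
    consulted = 0
    for segment in reversed(segments):
        consulted += 1
        value = search_segment(segment, key)
        if value is not None:
            return value, consulted
    return None, consulted


segments = [
    [("bob", "Carpenter"), ("dave", "Plumber")],       # oldest
    [("fred", "Baker"), ("mike", "Driver")],
    [("bob", "Cabinet Maker"), ("steve", "Painter")],  # newest
]
memtable = {"vince": "Tailor"}

print(get(memtable, segments, "bob"))   # found in the newest file
print(get(memtable, segments, "dave"))  # only in the oldest file
print(get(memtable, segments, "zed"))   # absent: every file is consulted
```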

  34. We might have a lot of files!

  35. LSM naturally optimises for writes over reads. This is a reasonable tradeoff to make

  36. Optimising reads is easier than optimising writes

  37. Optimisation 1 Bound the number of files

  38. Create levels (diagram: Level-1; Level-0)

  39. Separate thread merges old files, de-duplicating them. (diagram: Level-1; Level-0)

  40. Separate thread merges old files, de-duplicating them. (diagram: Level-1; Level-0)

  41. Merging process is reminiscent of merge sort
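
The merge step can be sketched with a k-way merge from the standard library. This is an illustration only: the `compact` function and the age-tagging trick are assumptions for the example, not a description of any particular database's compaction. Each input segment is already sorted, so the merge is a single sequential pass, and for duplicate keys the newest value is kept.

```python
import heapq


def compact(segments):
    """Merge sorted segments (oldest first) into one sorted, de-duplicated list."""
    # Tag each record with the negated age of its segment so that, for equal
    # keys, the record from the newer segment sorts first.
    tagged = [
        [(key, -age, value) for key, value in segment]
        for age, segment in enumerate(segments)
    ]
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first occurrence is the newest
    return merged


old = [("bob", "Carpenter"), ("dave", "Plumber")]
new = [("bob", "Cabinet Maker"), ("fred", "Baker")]
print(compact([old, new]))
```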

  42. Take this further with levels (diagram: Memtable; Level-0; Level-1; Level-2; Level-3)

  43. But single reads still require many individual lookups: • Number of searches: – 1 per base level – 1 per level above

  44. Optimisation 2 Caching & Friends

  45. Add Memory i.e. More Caching / Pre-fetch

  46. Read Ahead & Prefetch: Disk Controller, Page Cache, L3 cache, L2 cache, L1 cache. Pre-fetch is your friend

  47. If only there was a more efficient way to avoid searching each file!

  48. Elven Magic?

  49. Bloom Filters. Answers the question: Do I need to look in this file to find the value for this key? (diagram: Key -> Hash Function -> Bit Set; Size -> probability of false positive)
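
A minimal Bloom filter sketch for the question on this slide. The class name, the default sizes and the choice of salted SHA-256 as the hash function are assumptions for illustration; production filters typically use faster non-cryptographic hashes. A `False` answer means the key is definitely not in the file, so the file can be skipped; a `True` answer means it might be, so the file still has to be searched.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits      # size of the bit set
        self.num_hashes = num_hashes  # bit positions set per key
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


file_filter = BloomFilter()
for name in ["bob", "dave", "fred"]:
    file_filter.add(name)

print(file_filter.might_contain("bob"))  # True: added keys are never missed
```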

  50. Bloom Filters • Space efficient, probabilistic data structure • As keyspace grows: – p(collision) increases – Index size is fixed
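
The two sub-points above follow from the standard approximation for a Bloom filter's false-positive rate. The symbols are not from the slides and are introduced here for illustration: $m$ is the number of bits in the filter, $k$ the number of hash functions and $n$ the number of keys inserted.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k}
```

With $m$ and $k$ held fixed, $p$ rises towards 1 as $n$ grows, which is the "index size is fixed, p(collision) increases" behaviour.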

  51. Many more degrees of freedom for optimising reads (diagram: RAM holds file metadata & bloom filter; Disk)

  52. Log Structured Merge Trees • A collection of small, immutable indexes • All sequential operations, de-duplicate by merging files • Index/Bloom in RAM to increase read performance

  53. Subtleties • Writes are 1 x IO (blind writes), rather than 2 x IOs (read + modify) • Batching writes decreases write amplification. In trees, leaf pages must be updated.

  54. Immutability => Simpler locking semantics. Only the memtable is mutable

  55. Does it work? Lots of real world examples

  56. Measurable in the real world • InnoDB vs MyRocks results, taken from Mark Callaghan’s blog: http://bit.ly/2mhWT7p • There are many subtleties. Take all benchmarks with a pinch of salt.

  57. Elements of Beauty • Reframing the problem to be Log-Centric. To go with the grain of the system. • Optimise for the harder problem • Compartmentalises writes (coordination) to a single point. Reads -> immutable structures.

  58. Applies in many other areas • Sequentiality – Databases: write ahead logs – Columnar databases: Merge Joins – Kafka • Immutability – Snapshot isolation over explicit locking. – Replication (state machine replication)

  59. Log-Centric Approaches Work in Applications too

  60. Event Sourcing • Journaling of state changes • No “update in place” (diagram: Journal entries + 10.36, - 12.12, + 23.70, + 13.33; Object)

  61. CQRS (diagram: Client; Command; Query; log; Write Optimised; Read Optimised)
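
The journal on this slide can be replayed to derive the object's state. The sketch below uses the four amounts shown on the slide; treating them as changes to an account balance, and the helper names, are assumptions for illustration.

```python
from decimal import Decimal

# State changes are only ever appended; no entry is updated in place.
journal = [
    Decimal("10.36"),
    Decimal("-12.12"),
    Decimal("23.70"),
    Decimal("13.33"),
]


def balance(entries):
    """Derive current state by replaying every journalled change."""
    return sum(entries, Decimal("0"))


def balance_after(entries, count):
    """Derive a historical state by replaying only the first `count` changes."""
    return balance(entries[:count])


print(balance(journal))           # 35.27
print(balance_after(journal, 2))  # -1.76
```

Because the journal is never rewritten, any past state can be reconstructed by replaying a prefix of it.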

  62. How Applications or Services share state

  63. Log-Centric Services: Writes are localised to a single service (diagram: Writer; three Read-Replicas)

  64. Log-Centric Services: Immutable log (diagram: Writer; three Read-Replicas)

  65. Log-Centric Services: Many, independent read replicas (diagram: Writer; three Read-Replicas)
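
The arrangement on these three slides can be sketched with one writer appending to a shared log and replicas that each consume it at their own pace. The classes below are an in-memory illustration with hypothetical names; they are not Kafka's API.

```python
class Log:
    """Append-only, in-memory stand-in for a shared, immutable log."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        self._entries.append(entry)
        return len(self._entries) - 1  # offset of the new entry

    def read_from(self, offset):
        return self._entries[offset:]


class ReadReplica:
    """Builds its own view by replaying the log from its own offset."""

    def __init__(self, log):
        self.log = log
        self.offset = 0
        self.view = {}

    def catch_up(self):
        for key, value in self.log.read_from(self.offset):
            self.view[key] = value
            self.offset += 1


log = Log()  # only the writer appends
fast, slow = ReadReplica(log), ReadReplica(log)

log.append(("bob", "Carpenter"))
fast.catch_up()
log.append(("bob", "Cabinet Maker"))
slow.catch_up()

stale = dict(fast.view)  # fast has not yet seen the second write
fast.catch_up()
print(stale, fast.view, slow.view)
```

The replicas never coordinate with each other; each one only tracks how far through the log it has read.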

  66. Elements of Beauty • Reframing the problem to be Log-Centric. To go with the grain of the system. • Optimise for the harder problem • Compartmentalises writes (coordination) to a single point. Reads -> immutable structures.

  67. Decentralised Design: in both database design and application development

  68. The Log is the central building block. It pushes us towards the natural grain of the system

  69. The Log: a single unifying abstraction

  70. References. LSM: • benstopford.com/2015/02/14/log-structured-merge-trees/ • smalldatum.blogspot.co.uk/2017/02/using-modern-sysbench-to-compare.html • www.quora.com/How-does-the-Log-Structured-Merge-Tree-work • bLSM paper: http://bit.ly/2mT7Vje Other: • Pat Helland (Immutability): cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf • Peter Bailis (Coordination Avoidance): http://bit.ly/2m7XxnI • Jay Kreps: I Heart Logs (O’Reilly 2014) • The Data Dichotomy: http://bit.ly/2hk9c2K

  71. Thank you @benstopford http://benstopford.com ben@confluent.io
