Nitro: A Fast, Scalable In-Memory Storage Engine for NoSQL Global Secondary Index Sarath Lakshman, Sriram Melkote, John Liang, Ravi Mayuram Couchbase, Inc Presenter: Xiaoyao Qian • 04.04.2017
4 million entries/sec, 10 million lookups/sec
https://www.mysql.com/why-mysql/benchmarks/
Motivation
Outline: Lock-Free Skiplist, MVCC, GC, Backup & Recovery, Memory Reclamation, Evaluation, GSI
Ordered Linked List
- n: #nodes in next level
- f: fanout factor
- Average O(log N): insert, lookup, delete
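To make the fanout factor and the O(log N) operations concrete, here is a minimal single-threaded skiplist sketch. It is not Nitro's implementation: the names (SkipList, randomLevel, findPreds, maxLevel) are hypothetical, keys are plain integers, and the sketch assumes a fanout of at least 2, with each node promoted to the next level with probability 1/f.

```go
package main

import "math/rand"

const maxLevel = 16 // hypothetical cap on tower height

type node struct {
	key  int
	next []*node // next[i] is the successor at level i
}

type SkipList struct {
	head   *node
	fanout int // f: on average 1 in f nodes also appears in the next level
	rng    *rand.Rand
}

func NewSkipList(fanout int, seed int64) *SkipList {
	return &SkipList{
		head:   &node{next: make([]*node, maxLevel)},
		fanout: fanout,
		rng:    rand.New(rand.NewSource(seed)),
	}
}

// randomLevel returns a height >= 1; each extra level is taken with probability 1/f.
func (s *SkipList) randomLevel() int {
	h := 1
	for h < maxLevel && s.rng.Intn(s.fanout) == 0 {
		h++
	}
	return h
}

// findPreds returns, for every level, the last node whose key is smaller than key.
func (s *SkipList) findPreds(key int) []*node {
	preds := make([]*node, maxLevel)
	cur := s.head
	for lvl := maxLevel - 1; lvl >= 0; lvl-- {
		for cur.next[lvl] != nil && cur.next[lvl].key < key {
			cur = cur.next[lvl]
		}
		preds[lvl] = cur
	}
	return preds
}

func (s *SkipList) Lookup(key int) bool {
	n := s.findPreds(key)[0].next[0]
	return n != nil && n.key == key
}

// Insert links a new node into levels 0..h-1; it returns false for a duplicate key.
func (s *SkipList) Insert(key int) bool {
	preds := s.findPreds(key)
	if n := preds[0].next[0]; n != nil && n.key == key {
		return false
	}
	h := s.randomLevel()
	n := &node{key: key, next: make([]*node, h)}
	for lvl := 0; lvl < h; lvl++ {
		n.next[lvl] = preds[lvl].next[lvl]
		preds[lvl].next[lvl] = n
	}
	return true
}

// Delete unlinks the node from every level it appears in.
func (s *SkipList) Delete(key int) bool {
	preds := s.findPreds(key)
	n := preds[0].next[0]
	if n == nil || n.key != key {
		return false
	}
	for lvl := range n.next {
		preds[lvl].next[lvl] = n.next[lvl]
	}
	return true
}
```

All three operations share the same top-down search, which is where the average O(log N) cost comes from: each level skips roughly f nodes of the level below.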
Lock-free List Operations
[Figure: lock-free list over nodes 1, 4, 8, 6, with isdeleted=0 / isdeleted=1 markers; DoubleCAS]
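The isdeleted flag and DoubleCAS idea can be sketched in Go as follows. This is an assumption-laden illustration, not Nitro's code: instead of a hardware double-word CAS, each node holds one atomic pointer to an immutable (next, isdeleted) pair, so a single pointer CAS changes both fields together. All type and function names are hypothetical, and the sketch needs Go 1.19 or later for atomic.Pointer.

```go
package main

import "sync/atomic"

// link is an immutable (next, isdeleted) pair. Swapping the pointer to a link
// with one CAS stands in for a DoubleCAS over both fields.
type link struct {
	next    *lnode
	deleted bool // isdeleted flag of the node that owns this link
}

type lnode struct {
	key int
	ref atomic.Pointer[link]
}

type List struct{ head *lnode } // head is a sentinel and is never deleted

func NewList() *List {
	h := &lnode{}
	h.ref.Store(&link{})
	return &List{head: h}
}

// search returns pred and curr with pred.key < key <= curr.key (curr may be nil),
// plus the link observed on pred. Marked nodes met on the way are unlinked.
func (l *List) search(key int) (*lnode, *link, *lnode) {
retry:
	for {
		pred := l.head
		predLink := pred.ref.Load()
		curr := predLink.next
		for curr != nil {
			currLink := curr.ref.Load()
			if currLink.deleted {
				unlinked := &link{next: currLink.next}
				if !pred.ref.CompareAndSwap(predLink, unlinked) {
					continue retry // pred changed underneath us
				}
				predLink, curr = unlinked, currLink.next
				continue
			}
			if curr.key >= key {
				break
			}
			pred, predLink, curr = curr, currLink, currLink.next
		}
		return pred, predLink, curr
	}
}

func (l *List) Insert(key int) bool {
	for {
		pred, predLink, curr := l.search(key)
		if curr != nil && curr.key == key {
			return false
		}
		n := &lnode{key: key}
		n.ref.Store(&link{next: curr})
		// Succeeds only if pred is still unmarked and still points at curr.
		if pred.ref.CompareAndSwap(predLink, &link{next: n}) {
			return true
		}
	}
}

// Delete first marks the node (logical delete), then lets search unlink it.
func (l *List) Delete(key int) bool {
	for {
		_, _, curr := l.search(key)
		if curr == nil || curr.key != key {
			return false
		}
		old := curr.ref.Load()
		if old.deleted {
			continue // another deleter won; search again
		}
		if curr.ref.CompareAndSwap(old, &link{next: old.next, deleted: true}) {
			l.search(key) // best-effort physical removal
			return true
		}
	}
}

func (l *List) Contains(key int) bool {
	_, _, curr := l.search(key)
	return curr != nil && curr.key == key
}

// Keys returns the keys of all unmarked nodes in list order.
func (l *List) Keys() []int {
	var out []int
	for n := l.head.ref.Load().next; n != nil; {
		ln := n.ref.Load()
		if !ln.deleted {
			out = append(out, n.key)
		}
		n = ln.next
	}
	return out
}
```

Because the flag and the next pointer change together, an insert can never succeed behind a node that has already been marked as deleted.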
MVCC: Multi-Version Concurrency Control
- Immutable snapshots
- Fast and low overhead snapshots
- Avoid phantom reads
- Memory efficiency
- Fast and scalable garbage collection
MVCC primitives: lifetime and descriptor
[Figure: Descriptor: refcount = x; Descriptor: refcount = y]
Snapshot Iteration
filter with bornSn>termSn && deadSn>=termSn
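A snapshot iterator built on the lifetime fields might look like the sketch below. The visibility rule is an assumption: an item belongs to the snapshot with number termSn if it was born at or before termSn and has not died at or before termSn, with deadSn = 0 meaning "still alive". The comparison operators may not match the filter expression on the slide exactly, and the item type and function names are hypothetical.

```go
package main

// item carries a hypothetical lifetime: the snapshot number in which it was
// inserted (bornSn) and the one in which it was deleted (deadSn, 0 while alive).
type item struct {
	key    string
	bornSn uint32
	deadSn uint32
}

// visibleAt reports whether it belongs to the snapshot numbered termSn.
// Assumed rule: born at or before termSn, and not dead at or before termSn.
func visibleAt(it item, termSn uint32) bool {
	if it.bornSn > termSn {
		return false // inserted after the snapshot was taken
	}
	return it.deadSn == 0 || it.deadSn > termSn
}

// iterateSnapshot walks items in order and keeps only the keys visible at termSn.
func iterateSnapshot(items []item, termSn uint32) []string {
	var keys []string
	for _, it := range items {
		if visibleAt(it, termSn) {
			keys = append(keys, it.key)
		}
	}
	return keys
}
```

Since the filter only reads immutable lifetime fields, an iterator over an old snapshot can run alongside writers without taking locks.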
Comparison with Copy-On-Write B+ Tree (COW B+)
1. The snapshot Sn(x) descriptor shows refcount = 0
2. The previous snapshot Sn(x-1) has been garbage collected, i.e., garbage collection of snapshots can only be performed in the sequential order of the snapshot termSn
3. #gc_workers = #concurrent_writers
4. Writers keep track of a deadList, which is attached to the snapshot descriptor. Whenever a node is marked as deleted, it is added to the deadList.
5. GC workers use the deadList of a snapshot to perform physical node removal from the skiplist
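The ordering rule in points 1 and 2 can be sketched as a small single-threaded collector. This is only an illustration under stated assumptions: termSn values are consecutive starting at 1, the deadList holds node names rather than real nodes, and the worker pool of point 3 is omitted. The collector type and its methods are hypothetical.

```go
package main

// snapshot is a simplified descriptor: its number, a reference count, and the
// deadList of nodes that writers marked as deleted while it was current.
type snapshot struct {
	termSn   uint32
	refcount int
	deadList []string
}

// collector frees snapshots strictly in termSn order.
type collector struct {
	lastCollected uint32 // termSn of the most recently collected snapshot
	snapshots     map[uint32]*snapshot
	removed       []string // nodes physically removed so far
}

func newCollector() *collector {
	return &collector{snapshots: map[uint32]*snapshot{}}
}

// open registers a snapshot with one reference held by its reader.
func (c *collector) open(termSn uint32, deadList ...string) *snapshot {
	s := &snapshot{termSn: termSn, refcount: 1, deadList: deadList}
	c.snapshots[termSn] = s
	return s
}

// release drops one reference and then tries to collect.
func (c *collector) release(s *snapshot) {
	s.refcount--
	c.collect()
}

// collect frees Sn(x) only if refcount = 0 and Sn(x-1) is already collected.
func (c *collector) collect() {
	for {
		s, ok := c.snapshots[c.lastCollected+1]
		if !ok || s.refcount != 0 {
			return
		}
		c.removed = append(c.removed, s.deadList...) // stands in for node removal
		delete(c.snapshots, s.termSn)
		c.lastCollected = s.termSn
	}
}
```

Releasing a newer snapshot before an older one frees nothing until the older one is released as well, which is the sequential-order constraint stated above.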
1. Traverse the level 0 linked list of the skiplist and write out the entries into data files
2. All entries that don’t belong to the snapshot are ignored
3. Node metadata (i.e., lifetime) is not serialized. It can be recreated during recovery
✓ Minimum backup file size
✓ Compression friendly
✓ Since the skiplist is ordered, the data written to disk is also ordered
❌ Could block garbage collection
[Figure: Backup shard1, Backup shard2, Backup shard3]
Recovery
Buf: [nil, nil, nil, nil]
Buf: [nil, nil, nil, nil] -> [n1, n1, n1, n1]
Buf: [n1, n1, n1, n1] -> [n2, n2, n1, n1]
Buf: [n2, n2, n1, n1] -> [n3, n3, n3, n3]
Buf: [n3, n3, n3, n3] -> [n4, n3, n3, n3]
Buf: [n4, n3, n3, n3] -> [n5, n5, n5, n5]
Buf: [n5, n5, n5, n5] -> [n6, n6, n6, n5]
Buf: [n6, n6, n6, n5] -> [n7, n6, n6, n5]
Buf: [n7, n6, n6, n5] -> [nil, nil, nil, nil]
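The Buf sequence above is consistent with a bottom-up build in which Buf[i] remembers the last node linked at level i, so sorted input can be appended without any search. The sketch below reproduces that sequence under this assumption; the builder type, its methods, and the node heights (4, 2, 4, 1, 4, 3, 1 for n1 to n7, read off the Buf transitions) are illustrative rather than taken from Nitro's code.

```go
package main

const levels = 4

type rnode struct {
	name string
	next []*rnode // next[i] is the successor at level i
}

// builder appends nodes that arrive in sorted order, as they would when
// read back from an ordered backup file.
type builder struct {
	head *rnode
	buf  [levels]*rnode // last node linked at each level; nil means head
}

func newBuilder() *builder {
	return &builder{head: &rnode{name: "head", next: make([]*rnode, levels)}}
}

// add links a node of the given height after the last node of each level it
// occupies, then records it in buf for those levels. Cost is O(height).
func (b *builder) add(name string, height int) {
	n := &rnode{name: name, next: make([]*rnode, height)}
	for lvl := 0; lvl < height; lvl++ {
		prev := b.buf[lvl]
		if prev == nil {
			prev = b.head
		}
		prev.next[lvl] = n
		b.buf[lvl] = n
	}
}

// finish clears the buffer once the last node has been appended.
func (b *builder) finish() {
	b.buf = [levels]*rnode{}
}

// state renders buf the way the slides print it.
func (b *builder) state() [levels]string {
	var out [levels]string
	for i, n := range b.buf {
		if n == nil {
			out[i] = "nil"
		} else {
			out[i] = n.name
		}
	}
	return out
}
```

Calling add for n1 through n7 with the heights listed above yields the same Buf states as the slides, and finish returns the buffer to [nil, nil, nil, nil].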
Non-intrusive Backup
[Figure: handshake between backup worker and garbage collector. INIT: backing up termSn, ack. ACTIVE: unlink, and write eligible data to delta backup files. TERMINATE: "Are you done?", close delta backup files, ack.]
[Figure: AccessBarrier with BarrierSession: liveCount = 2; threads t1, t2, t3]
[Figure: BarrierSessionClose; BarrierSession: liveCount = 2; threads t1, t2, t3]
[Figure: BarrierSession terminated; threads t1, t2, t3]
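One way to read the AccessBarrier figures is as epoch-style reclamation: accessors join the current BarrierSession, closing the session hands it the nodes unlinked so far, and those nodes are freed once liveCount drops to zero. The sketch below follows that reading with a mutex for brevity, so it is not lock-free and not Nitro's implementation. Apart from AccessBarrier, BarrierSession, and liveCount, every name is hypothetical, and garbage is represented by node names.

```go
package main

import "sync"

// BarrierSession counts the accessors that entered while it was current.
type BarrierSession struct {
	liveCount int
	garbage   []string // nodes unlinked before the session was closed
}

type AccessBarrier struct {
	mu      sync.Mutex
	current *BarrierSession
	closed  []*BarrierSession // closed sessions, oldest first
	freed   []string          // nodes whose memory has been reclaimed
}

func NewAccessBarrier() *AccessBarrier {
	return &AccessBarrier{current: &BarrierSession{}}
}

// Acquire registers an accessor with the current session.
func (ab *AccessBarrier) Acquire() *BarrierSession {
	ab.mu.Lock()
	defer ab.mu.Unlock()
	ab.current.liveCount++
	return ab.current
}

// Release is called when the accessor no longer holds node references.
func (ab *AccessBarrier) Release(s *BarrierSession) {
	ab.mu.Lock()
	defer ab.mu.Unlock()
	s.liveCount--
	ab.reclaim()
}

// CloseSession attaches the unlinked nodes to the current session, closes it,
// and opens a fresh session for accessors that arrive later.
func (ab *AccessBarrier) CloseSession(garbage ...string) {
	ab.mu.Lock()
	defer ab.mu.Unlock()
	ab.current.garbage = garbage
	ab.closed = append(ab.closed, ab.current)
	ab.current = &BarrierSession{}
	ab.reclaim()
}

// reclaim terminates closed sessions oldest first. A session is terminated
// only when it and every earlier session have no live accessors, because an
// older accessor may still hold a node that was unlinked later.
func (ab *AccessBarrier) reclaim() {
	for len(ab.closed) > 0 && ab.closed[0].liveCount == 0 {
		ab.freed = append(ab.freed, ab.closed[0].garbage...)
		ab.closed = ab.closed[1:]
	}
}
```

With t1 and t2 inside a session when it is closed, a later t3 joins the new session, and the closed session's nodes are freed only after both t1 and t2 release.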
Global Secondary Index architecture
“TALK IS CHEAP, SHOW ME THE CODE”
https://github.com/couchbase/nitro
~15,000 lines of code, mainly in Golang, with a little C/C++
Apache 2.0 License
Questions & Discussions
1. #GC_workers = #writers? Wouldn’t that be too intense?
2. A skiplist may have poor cache utilization because its nodes are not contiguous in memory. Can this be optimized?
3. How can a single large index be distributed?