Deukyeon Hwang (UNIST), Wook-Hee Kim (UNIST), Youjip Won (Hanyang Univ.), Beomseok Nam (UNIST/SKKU)
▪ Non-Volatility
▪ Byte-Addressability
▪ Large Capacity
▪ Fast but Asymmetric Access Latency
[Figure: a tree node (10 20 30 40) is updated in the CPU caches (volatile) and reaches persistent memory (non-volatile) only by cache line FLUSH; a key that was not flushed is lost on a crash ("LOST 40!").]
Inserting 25 into a node
(0) 10 20 30 40
(1) 10 20 30 40 40
(2) 10 20 30 30 40
(3) 10 20 25 30 40
A partially updated tree node is inconsistent → Append-Only Update
Node Split
[Figure: Node A (keys 10 20 30 40 60, pointers P1 P2 P3 P4 P6) is split into Node A (keys 10 20 30, pointers P1 P2 P3) and Node B (keys 40 60, pointers P4 P6).]
Logging → Selective Persistence (internal nodes in DRAM)
▪ Append-Only
  • Unsorted keys
▪ Selective Persistence
  • Internal node → DRAM
  • Internal nodes have to be reconstructed from leaf nodes after failures
  • Logging for leaf nodes
▪ Previous solutions
  • NV-Tree [FAST'15]: Append-Only leaf update + Selective Persistence
  • wB+-Tree [VLDB'15]: Append-Only node update + bitmap/slot array metadata
  • FP-Tree [SIGMOD'16]: Append-Only leaf update + fingerprints + Selective Persistence
▪ Append-Only (unsorted keys) → Failure-Atomic ShifT (FAST)
▪ Selective Persistence (DRAM + PM) → Failure-Atomic In-place Rebalancing (FAIR)
▪ Lock-Free Search
▪ Modern processors reorder instructions to utilize the memory bandwidth
▪ Memory ordering in x86 and ARM

                         x86   ARM
  stores-after-stores     Y     N
  stores-after-loads      N     N
  loads-after-stores      N     N
  loads-after-loads       N     N
  Inst. w/ dependency     Y     Y

▪ x86 guarantees Total Store Ordering (TSO)
▪ Dependent instructions are not reordered
▪ Pointers in a B+-tree store unique memory addresses
▪ An 8-byte pointer can be updated atomically
▪ Read transactions detect transient inconsistency between duplicate pointers
▪ Transient inconsistency
  • In-flight state partially updated by a write transaction

  Keys:     10 20 30 40 40
  Pointers: P1 P2 P3 P4 P5 P5

[Figure: the pointer store (keys 10 20 30 40, pointers P1 P2 P3 P4 P5 P5) and the key store (keys 10 20 30 40 40, pointers P1 P2 P3 P4 P5 P5) are each followed by mfence(); the slide labels this ordering "TSO".]
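The duplicate-pointer rule above can be sketched in C. This is a minimal, single-threaded illustration and not the authors' implementation: the `node_t` layout, the capacity `CAP`, and the name `node_lookup` are assumptions made for this example. It assumes that `key[i]` is paired with `ptr[i + 1]`, that `ptr[0]` is the leftmost pointer, and that a NULL pointer terminates the node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP 6  /* hypothetical number of key slots per node */

typedef struct {
    int64_t key[CAP];
    void   *ptr[CAP + 2];  /* ptr[0] is the leftmost pointer; NULL ends the node */
} node_t;

/* Return the pointer stored with key k, or NULL if k is not visible.
 * A key that sits between two identical pointers belongs to an
 * in-flight shift and is skipped. A real lock-free reader would also
 * need atomic 8-byte loads; that is omitted here. */
void *node_lookup(const node_t *n, int64_t k) {
    assert(n != NULL);
    for (int i = 0; i < CAP && n->ptr[i + 1] != NULL; i++) {
        if (n->ptr[i] == n->ptr[i + 1])
            continue;                 /* transient entry: ignore the key */
        if (n->key[i] == k)
            return n->ptr[i + 1];
    }
    return NULL;
}
```

With keys 10 20 30 40 40 and pointers P1 P2 P3 P4 P5 P5, a lookup of 40 returns P5 from the fourth slot, and whatever key occupies the fifth slot is ignored because its two neighbouring pointers are equal.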
Insert (25, P6) into a node using FAST (g: garbage, ʌ: null)
Read transactions can succeed in finding a key even if a system crashes in any step.

Step  Keys               Pointers               Note
0     10 20 30 40 g  g   P1 P2 P3 P4 P5 ʌ  ʌ    initial node
1     10 20 30 40 g  g   P1 P2 P3 P4 P5 P5 ʌ    P5 is duplicated
2     10 20 30 40 40 g   P1 P2 P3 P4 P5 P5 ʌ    key 40 between duplicate pointers is ignored by a read transaction
3     10 20 30 40 40 g   P1 P2 P3 P4 P4 P5 ʌ    shifting P4 invalidates the left 40
4     10 20 30 30 40 g   P1 P2 P3 P4 P4 P5 ʌ    key 30 is shifted
5     10 20 30 30 40 g   P1 P2 P3 P3 P4 P5 ʌ    P3 is duplicated
6     10 20 25 30 40 g   P1 P2 P3 P3 P4 P5 ʌ    key 25 is written but not yet visible
7     10 20 25 30 40 g   P1 P2 P3 P6 P4 P5 ʌ    storing P6 validates 25
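The store order of the FAST insertion above can be written down as a short C sketch. It is an illustration under assumptions, not the authors' code: `node_t`, `CAP`, `fence`, and `fast_insert` are hypothetical names, `key[i]` is assumed to pair with `ptr[i + 1]`, and `ptr[0]` is assumed to be non-NULL. The sketch places a fence after every store, which is the conservative choice for a processor without TSO; `atomic_thread_fence` stands in for mfence(), and cache line flushes are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CAP 6  /* hypothetical number of key slots per node */

typedef struct {
    int64_t key[CAP];
    void   *ptr[CAP + 2];  /* ptr[0] is the leftmost pointer; NULL ends the node */
} node_t;

/* Stand-in for mfence(): orders the preceding store before the next one. */
static inline void fence(void) { atomic_thread_fence(memory_order_seq_cst); }

static int node_count(const node_t *n) {
    int c = 0;
    while (c < CAP && n->ptr[c + 1] != NULL) c++;
    return c;
}

/* Insert (k, p) by shifting entries one slot to the right, one 8-byte
 * store at a time. A pointer is always copied before its key, so the
 * key being moved is bracketed by duplicate pointers until it is valid. */
void fast_insert(node_t *n, int64_t k, void *p) {
    int cnt = node_count(n);
    assert(cnt < CAP);                /* a full node would need a split */
    int pos = 0;
    while (pos < cnt && n->key[pos] < k) pos++;

    for (int i = cnt - 1; i >= pos; i--) {
        n->ptr[i + 2] = n->ptr[i + 1]; fence();  /* duplicate the pointer */
        n->key[i + 1] = n->key[i];     fence();  /* then copy the key     */
    }
    n->ptr[pos + 1] = n->ptr[pos]; fence();      /* open an invalid slot  */
    n->key[pos] = k;               fence();      /* new key, still hidden */
    n->ptr[pos + 1] = p;           fence();      /* new pointer validates */
}
```

Calling `fast_insert(&n, 25, P6)` on keys 10 20 30 40 with pointers P1 to P5 performs the seven stores of the walkthrough in the same order and ends with keys 10 20 25 30 40 and pointers P1 P2 P3 P6 P4 P5.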
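▪ It is necessary to call clflush at the boundary of a cache line
[Figure: a node (keys 10 20 30 40 g g, pointers P1 P2 P3 P4 P5 ʌ ʌ) spans Cache Line 1 and Cache Line 2. While the shift produces keys 10 20 30 30 40 g and pointers P1 P2 P3 P3 P4 P5 ʌ, the writer calls mfence(), clflush(Cache Line 2), mfence() when the shift moves from Cache Line 2 into Cache Line 1.]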
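Where the flush falls during a shift can be illustrated with a small C sketch. Everything here is an assumption made for the example: the 16-byte `rec_t` record, the 64-byte `CACHE_LINE`, and the stub `flush_line`, which only counts calls instead of issuing mfence(), clflush(), mfence(). The sketch shows flush placement only; it copies whole records and does not reproduce the pointer-then-key store order of FAST.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

typedef struct { int64_t key; uint64_t ptr; } rec_t;  /* hypothetical 16-byte record */

static int flush_calls;  /* counts stub flushes, for illustration only */

/* Stub: a persistent-memory build would call mfence(), clflush(addr),
 * mfence() here. */
static void flush_line(const void *addr) { (void)addr; flush_calls++; }

static int same_line(const void *a, const void *b) {
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}

/* Shift records [pos, cnt) one slot to the right. The line that was
 * just written is flushed only when the next store would land in a
 * different cache line, or when the shift is finished. */
void shift_right(rec_t *r, int pos, int cnt) {
    assert(pos >= 0 && pos <= cnt);
    for (int i = cnt - 1; i >= pos; i--) {
        r[i + 1] = r[i];
        if (i == pos || !same_line(&r[i + 1], &r[i]))
            flush_line(&r[i + 1]);
    }
}
```

For eight records aligned to a cache line, four records share one line, so shifting seven records issues two flushes: one when the shift leaves the second line and one at the end.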
▪ Let's avoid expensive logging by making read transactions aware of rebalancing operations
▪ B-link tree
[Figure: B-link tree nodes holding keys 10 20 30 40 70 80 90.]
FAIR split of a node (ʌ: null)

Step 1: Node A: keys 10 20 30 40 60, pointers P1 P2 P3 P4 P6 → sibling Node B: keys 40 60, pointers P4 P6
  A read transaction can detect transient inconsistency if keys are out of order.
Step 2: Node A: keys 10 20 30, pointers P1 P2 P3 ʌ → sibling Node B: keys 40 60, pointers P4 P6
  Setting the NULL pointer validates Node B. Node A and Node B are virtually a single node.
  Migrated keys can be accessed via the sibling pointer.
Step 3: Node A: keys 10 20 30 → sibling Node B: keys 40 50 60, pointers P4 P5 P6
  The new entry (50, P5) is inserted into Node B.
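The split steps above can be sketched in C. This is a simplified illustration under assumptions, not the authors' implementation: `rec_t`, `node_t`, `fair_split`, and `lookup` are hypothetical names, a record whose pointer is NULL ends a node, `atomic_thread_fence` stands in for mfence(), and cache line flushes and the parent update are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define CAP 6  /* hypothetical number of records per node */

typedef struct { int64_t key; void *ptr; } rec_t;  /* ptr == NULL ends the node */
typedef struct node {
    rec_t r[CAP + 1];
    struct node *sibling;
} node_t;

static void fence(void) { atomic_thread_fence(memory_order_seq_cst); }

static int count(const node_t *n) {
    int c = 0;
    while (c < CAP && n->r[c].ptr != NULL) c++;
    return c;
}

/* Split a in place: (1) build the sibling, (2) link it, (3) truncate a.
 * After step 2 the upper half exists in both nodes; after step 3 the
 * two nodes behave as one node connected by the sibling pointer. */
node_t *fair_split(node_t *a) {
    int cnt = count(a), mid = (cnt + 1) / 2;
    node_t *b = calloc(1, sizeof *b);
    assert(b != NULL);
    for (int i = mid; i < cnt; i++) b->r[i - mid] = a->r[i];
    b->sibling = a->sibling;
    fence();                        /* sibling is complete before it is linked */
    a->sibling = b;        fence(); /* 8-byte store makes the sibling reachable */
    a->r[mid].ptr = NULL;  fence(); /* NULL pointer drops the migrated records */
    return b;
}

/* Search a node and move to the sibling when k may have migrated. */
void *lookup(const node_t *n, int64_t k) {
    while (n != NULL) {
        for (int i = 0; i < CAP && n->r[i].ptr != NULL; i++)
            if (n->r[i].key == k) return n->r[i].ptr;
        const node_t *s = n->sibling;
        if (s == NULL || s->r[0].ptr == NULL || k < s->r[0].key) return NULL;
        n = s;
    }
    return NULL;
}
```

Splitting a node with keys 10 20 30 40 60 leaves 10 20 30 in the original node and 40 60 in the sibling, and a lookup of 60 that starts at the original node still succeeds through the sibling pointer.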
Insert a key into the parent node using FAST after FAIR split
Leaf nodes: Node A (10 20 30) → Node B (40 50 60) → Node C (70 80 90)

Step  Root (Node R) keys   Root pointers   Note
1     10 70 70             C2 C3 C3        key 70 and pointer C3 are shifted
2     10 70 70             C2 C2 C3        Node B can be accessed from Node A
3     10 40 70             C2 C4 C3        FAST insertion makes Node B visible atomically

➢ Searching the key 50 from the root after a system crash in step 2: the read transaction reaches Node A from the root and then Node B through the sibling pointer of Node A.
Read transactions can tolerate any inconsistency caused by write transactions
→ Read transactions can access a transiently inconsistent tree node that is being modified by a write transaction
→ Lock-Free Search
[Example 1] Searching 30 while inserting (15, P6)
The read transaction scans to the right (read →) while the write transaction shifts to the right (shift →).

Step  Keys               Pointers
0     10 20 30 40 g  g   P1 P2 P3 P4 P5 ʌ  ʌ
1     10 20 30 40 g  g   P1 P2 P3 P4 P5 P5 ʌ
2     10 20 30 40 40 g   P1 P2 P3 P4 P5 P5 ʌ
3     10 20 30 40 40 g   P1 P2 P3 P4 P4 P5 ʌ
4     10 20 30 30 40 g   P1 P2 P3 P4 P4 P5 ʌ
5     10 20 30 30 40 g   P1 P2 P3 P3 P4 P5 ʌ
6     10 20 20 30 40 g   P1 P2 P3 P3 P4 P5 ʌ
7     10 20 20 30 40 g   P1 P2 P2 P3 P4 P5 ʌ

FOUND! The read transaction finds the key 30.
[Example 2] Searching 30 while deleting (20, P2)
The read transaction scans to the right (read →) while the write transaction shifts entries to the left.

Step  Keys               Pointers
0     10 20 30 40 g  g   P1 P2 P3 P4 P5 ʌ  ʌ
1     10 20 30 40 g  g   P1 P3 P3 P4 P5 ʌ  ʌ
2     10 30 30 40 g  g   P1 P3 P3 P4 P5 ʌ  ʌ
3     10 30 30 40 g  g   P1 P3 P4 P4 P5 ʌ  ʌ
4     10 30 40 40 g  g   P1 P3 P4 P4 P5 ʌ  ʌ
5     10 30 40 40 g  g   P1 P3 P4 P5 P5 ʌ  ʌ

30 NOT FOUND. The read transaction cannot find the key 30 due to the shift operation: the key moves left into a slot that the read transaction has already passed.
▪ Direction flag (counter):
  • Even Number
    – Insertion shifts to the right.
    – Search must scan from Left to Right
  • Odd Number
    – Deletion shifts to the left.
    – Search must scan from Right to Left

Example: Search 40 in a node with keys 10 20 30 40 g g and pointers P1 P2 P3 P4 P5 ʌ ʌ
  • counter 2, Insert 25: the write shifts to the right (shift →) and the read scans to the right (read →)
  • counter 3, Delete 25: the write shifts to the left and the read scans to the left
  • counter changes from 2 to 3 because Delete 25 starts while a search is scanning to the right (read →)

The read transaction has to check the counter once again to make sure the counter has not changed. If it has changed, the read transaction searches the node again.
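The counter protocol above can be sketched in C. This is a minimal illustration under assumptions, not the authors' code: the field `switch_counter` and the functions `begin_insert`, `begin_delete`, and `node_search` are hypothetical names, writers are assumed to be serialized by some other mechanism, and the node layout (`key[i]` paired with `ptr[i + 1]`, NULL-terminated pointers) is the same assumption used for illustration elsewhere.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CAP 6  /* hypothetical number of key slots per node */

typedef struct {
    atomic_uint switch_counter;  /* even: insertion (shift right), odd: deletion (shift left) */
    int64_t key[CAP];
    void   *ptr[CAP + 2];        /* ptr[0] is the leftmost pointer; NULL ends the node */
} node_t;

/* A writer bumps the counter only when the shift direction changes. */
void begin_insert(node_t *n) {
    if (atomic_load(&n->switch_counter) % 2 != 0) atomic_fetch_add(&n->switch_counter, 1);
}
void begin_delete(node_t *n) {
    if (atomic_load(&n->switch_counter) % 2 == 0) atomic_fetch_add(&n->switch_counter, 1);
}

static void *match(const node_t *n, int i, int64_t k) {
    if (n->ptr[i] != n->ptr[i + 1] && n->key[i] == k) return n->ptr[i + 1];
    return NULL;
}

/* Scan in the direction given by the counter, then re-read the counter
 * and retry if a writer changed direction during the scan. */
void *node_search(node_t *n, int64_t k) {
    assert(n != NULL);
    for (;;) {
        unsigned before = atomic_load(&n->switch_counter);
        void *hit = NULL;
        int cnt = 0;
        while (cnt < CAP && n->ptr[cnt + 1] != NULL) cnt++;
        if (before % 2 == 0) {
            for (int i = 0; i < cnt && hit == NULL; i++) hit = match(n, i, k);
        } else {
            for (int i = cnt - 1; i >= 0 && hit == NULL; i--) hit = match(n, i, k);
        }
        if (atomic_load(&n->switch_counter) == before) return hit;
    }
}
```

A deletion that starts while the counter is 2 moves it to 3, so a search that began with the counter at 2 sees a different value on its second read and scans the node again, this time from right to left.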
Dirty reads problem

  Transaction A        Transaction B
  BEGIN
  INSERT 10
  SUSPENDED
                       BEGIN
                       SEARCH 10 (FOUND)
                       COMMIT
  WAKE UP
  ABORT

The ordering of Transaction A and Transaction B cannot be determined.
Isolation Level (highest to lowest)
  1. Serializable (highest)
  2. Repeatable reads
  3. Read committed
  4. Read uncommitted (lowest)
Our Lock-Free Search supports a low isolation level.