WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems
Se Kwon Lee, K. Hyun Lim¹, Hyunsub Song, Beomseok Nam, Sam H. Noh
UNIST, ¹Hongik University
Persistent Memory (PM)
§ Persistent memory is expected to replace both DRAM & NAND
Property | NAND | STT-MRAM | PCM | DRAM
Non-volatility | Yes | Yes | Yes | No
Read (ns) | 2.5 × 10^4 | 5 - 30 | 20 - 70 | 10
Write (ns) | 2 × 10^5 | 10 - 100 | 150 - 220 | 10
Byte-addressable | No | Yes | Yes | Yes
Density (Gbit/cm²) | 185.8 | 0.36 | 13.5 | 9.1
Source: K. Suzuki and S. Swanson, "A Survey of Trends in Non-Volatile Memory Technologies: 2000-2014", IMW 2015
Persistent memory: non-volatile and high performance
Indexing Structure for PM Storage Systems
[Figure: example B+Tree]
Consistency Issue of B+tree in PM
§ B+tree is a block-based index
• Key sorting → Block granularity write
• Rebalancing → Multi-block granularity write
§ Persistent memory
• Byte-addressable → Byte granularity write
• Write reordering
→ Can result in a consistency problem
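Consistency Issue of B+tree in PM
§ Traditional case
[Figure: key 31 is inserted into node (30, 35) and the count goes from 2 to 3. The volatile CPU caches and DRAM hold (30, 31, 35) with count 3; writes may be reordered there, but that data is not persistent. The non-volatile block-based storage holds (30, 35) with count 2 and changes only through a block granularity update.]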
Consistency Issue of B+tree in PM
§ PM case
[Figure: the same insert of key 31 is made in the volatile CPU caches. Byte granularity updates reach the non-volatile persistent memory and may be reordered on the way. After a crash, the partially updated node is persistent data, so garbage data is persistently stored.]
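The torn update on this slide can be mimicked with a small simulation. This is a minimal sketch, not code from the talk: `SortedNode`, `sorted_insert`, the `count_first` flag (which models the count reaching PM before the keys because of write reordering), and the `max_stores` crash point are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NODE_CAP 4

/* Hypothetical sorted node: keys[] kept in order, count = number of live keys. */
typedef struct {
    uint64_t keys[NODE_CAP];
    uint64_t count;
} SortedNode;

/* Insert `key` in sorted position using individual 8-byte stores.
 * count_first != 0 models reordering: the count store reaches PM first.
 * The function stops after `max_stores` stores to model a crash and
 * returns the number of stores that were actually performed. */
int sorted_insert(SortedNode *n, uint64_t key, int count_first, int max_stores)
{
    assert(n->count < NODE_CAP);
    int stores = 0;
    size_t pos = (size_t)n->count;

    if (count_first) {
        if (stores == max_stores) return stores;   /* simulated crash */
        n->count = pos + 1;
        stores++;
    }
    /* Shift larger keys one slot to the right, one store per key. */
    while (pos > 0 && n->keys[pos - 1] > key) {
        if (stores == max_stores) return stores;   /* simulated crash */
        n->keys[pos] = n->keys[pos - 1];
        stores++;
        pos--;
    }
    if (stores == max_stores) return stores;       /* simulated crash */
    n->keys[pos] = key;
    stores++;
    if (!count_first) {
        if (stores == max_stores) return stores;   /* simulated crash */
        n->count = n->count + 1;
        stores++;
    }
    return stores;
}
```

Inserting 31 into (30, 35) takes three separate stores. If the count store lands first and the crash follows, the node claims three keys while its third slot still holds whatever was there before.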
Primitives for Data Consistency in PM
§ Durability
• CLFLUSH (Flush cache line)
− Can be reordered
§ Ordering
• MFENCE (Load and Store fence)
− Order CPU cache line flush instructions
Serialization of CLFLUSH and MFENCE is known to cause large overhead
[Figure: volatile CPU caches above non-volatile persistent memory]
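A common way to combine the two primitives is a small "persist" helper that fences, flushes every cache line of a range, and fences again. The sketch below is an illustration under stated assumptions: `persist` and the 64-byte `CACHE_LINE` are hypothetical, and on non-x86-64 targets the flush is skipped and only a generic compiler-provided barrier is issued, so the code still compiles but does not make data durable there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <emmintrin.h>              /* _mm_clflush, _mm_mfence */
#define FENCE()     _mm_mfence()
#define FLUSH(p)    _mm_clflush(p)
#else
#define FENCE()     __sync_synchronize()   /* GCC/Clang builtin barrier */
#define FLUSH(p)    ((void)(p))            /* no flush on this target */
#endif

#define CACHE_LINE 64               /* assumed cache line size in bytes */

/* Flush every cache line that overlaps [addr, addr + len).
 * Returns the number of cache lines that were flushed. */
size_t persist(const void *addr, size_t len)
{
    assert(len > 0);
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;

    FENCE();                        /* order earlier stores before the flushes */
    for (; p < end; p += CACHE_LINE) {
        FLUSH((const void *)p);
        lines++;
    }
    FENCE();                        /* order the flushes before later stores */
    return lines;
}
```

The returned line count makes the cost visible: every extra cache line touched by an update is another flush inside the fenced region, which is the overhead the slide refers to.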
Primitives for Data Consistency in PM
§ Atomicity
• 8-byte failure atomicity
− Need only CLFLUSH
• Logging or CoW based atomicity (more than 8 bytes)
− Requires duplicate copies
Logging increases cache line flush overhead
[Figure: an update of at most 8 bytes is applied in place, while a larger update of node (30, 35) to (30, 31, 35) is written to both a log area and a data area in non-volatile memory]
B+tree Variants for Persistent Memory
How can we ensure consistency using failure-atomic writes without logging?
• Unsorted keys → Append-only with metadata
• Failure-atomic update of metadata
Variants: wB+Tree (VLDB '15), NVTree (FAST '15), FPTree (SIGMOD '16)
[Figure: node layouts of the three variants, showing a slot array, bitmaps (bmp), per-entry flags (+/-), an entry count (Cnt.), fingerprints, key/pointer pairs (K1..Kn, P1..Pn), and a next pointer (P next)]
Unsorted keys → Decreases search performance
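The general idea of "append-only with failure-atomic metadata" can be sketched as follows. This is not the actual node layout of wB+Tree, NVTree, or FPTree; `AppendLeaf`, `leaf_append`, `leaf_search`, and `persist_stub` are hypothetical names, and `persist_stub` stands in for a CLFLUSH plus MFENCE sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEAF_CAP 8

typedef struct { uint64_t key; uint64_t val; } Entry;

/* Hypothetical append-only leaf: entries are unsorted, and the 8-byte
 * count is the only metadata that makes an entry visible. */
typedef struct {
    uint64_t count;
    Entry    entries[LEAF_CAP];
} AppendLeaf;

/* Placeholder for "flush these bytes and fence". */
void persist_stub(const void *addr, size_t len) { (void)addr; (void)len; }

/* Returns 0 on success, -1 if the leaf is full (a split would be needed). */
int leaf_append(AppendLeaf *l, uint64_t key, uint64_t val)
{
    assert(l != NULL);
    if (l->count >= LEAF_CAP) return -1;
    Entry *e = &l->entries[l->count];
    e->key = key;                     /* written beyond count: not yet visible */
    e->val = val;
    persist_stub(e, sizeof *e);
    l->count = l->count + 1;          /* single 8-byte store commits the entry */
    persist_stub(&l->count, sizeof l->count);
    return 0;
}

/* Keys are unsorted, so lookup is a linear scan; the newest entry wins. */
int leaf_search(const AppendLeaf *l, uint64_t key, uint64_t *val_out)
{
    for (size_t i = (size_t)l->count; i-- > 0; ) {
        if (l->entries[i].key == key) {
            *val_out = l->entries[i].val;
            return 1;
        }
    }
    return 0;
}
```

A crash before the count store leaves the new entry invisible, so no log is needed for a plain insert. The price is the linear scan in `leaf_search`, which is the search slowdown noted on the slide.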
B+tree Variants for Persistent Memory
§ Logging still necessary
• Multi-block granularity updates due to node splits and merges
− Cannot update atomically
• Logging-based solution
− wB+Tree, FPTree
• Tree reconstruction based solution
− NVTree
→ Large overhead
[Figure: inserting new key 38 overflows node (30, 32, 35), which is split into (30, 32) and (35, 38)]
B+tree Variants for Persistent Memory
Fundamental characteristics of B+tree cause problems
• Key sorting
• Rebalancing
Why use B+ trees in the first place?
Perhaps there is a better tree data structure more suited for PM?
[Figure: key sorting (31 inserted into node (30, 35), count 2 → 3) and rebalancing (new key 38 overflows and splits node (30, 32, 35))]
Our Contributions
§ Show that the radix tree is a suitable data structure for PM
§ Propose optimal radix tree variants WORT and WOART
• WORT: Write Optimal Radix Tree
• WOART: Write Optimal redesigned Adaptive Radix Tree (ART)
− Optimal: maintains consistency with only a single failure-atomic write, without any duplicate copies
Radix Tree
§ Deterministic structure
• No key comparison
− Only 8-byte pointer entries
− Implicitly stored keys
− No problem caused by key sorting
• No modification of other keys
− Single 8-byte pointer write per node
− Easy to use failure-atomic write
[Figure: radix tree that indexes keys ACA, ACC, ACZ, and CAC, one character per level]
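The "single 8-byte pointer write" property can be illustrated with a minimal byte-wise radix tree. This is a sketch under simplifying assumptions, not the WORT implementation: keys have a fixed length `KEY_LEN`, every node has 256 pointer slots, nodes come from `calloc` rather than PM, and the names `Node`, `node_new`, `radix_insert`, and `radix_search` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define KEY_LEN 3      /* assumed fixed key length in bytes */
#define FANOUT  256    /* one child slot per byte value */

/* A node holds only 8-byte pointers; the key bytes are implied by the path. */
typedef struct Node { void *child[FANOUT]; } Node;

Node *node_new(void)
{
    Node *n = calloc(1, sizeof *n);
    assert(n != NULL);
    return n;
}

void radix_insert(Node *root, const unsigned char key[KEY_LEN], void *value)
{
    Node *n = root;
    int d = 0;
    /* Follow the part of the path that already exists. */
    while (d < KEY_LEN - 1 && n->child[key[d]] != NULL) {
        n = n->child[key[d]];
        d++;
    }
    /* Build the missing part bottom-up; it is unreachable from the root,
     * so a crash here leaves the tree unchanged. In PM this new subtree
     * would be flushed before the commit below. */
    void *sub = value;
    for (int i = KEY_LEN - 1; i > d; i--) {
        Node *m = node_new();
        m->child[key[i]] = sub;
        sub = m;
    }
    /* Commit: one 8-byte pointer store into an existing node. */
    n->child[key[d]] = sub;
}

void *radix_search(const Node *root, const unsigned char key[KEY_LEN])
{
    const void *p = root;
    for (int d = 0; d < KEY_LEN && p != NULL; d++)
        p = ((const Node *)p)->child[key[d]];
    return (void *)p;
}
```

No existing key is moved or compared during the insert. The only change visible to readers is the final pointer store, which is why a failure-atomic 8-byte write is enough here.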
Problem of Deterministic Structure
§ For sparse key distribution
• Wastes excessive memory space → Optimized through path compression
[Figure: nodes with high utilization compared with nodes with low utilization]
Path Compression in Radix Tree
§ Path compression
• Search paths that do not need to be distinguished can be removed
[Figure: radix tree for keys ACA, ACC, ACZ, and CAC with the unnecessary search path marked]
Path Compression in Radix Tree
§ Path compression
• Common search path is compressed in header
• Improves memory utilization & indexing performance
[Figure: a single node with children A, C, Z for keys ACA, ACC, ACZ; the common search path is stored in its compression header]
Node Split with Path Compression
§ Path compression split
• AZA to be inserted: prefix keys are not equal (AZ != AC)
① New parent allocation
② Decompression of old common prefix (old common prefix update)
However, this split process causes a consistency problem in PM.
[Figure: a new parent with prefix A and children C and Z is allocated above the old node; the old node, with children A, C, Z for keys ACA, ACC, ACZ, then has its common prefix AC updated]
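The check that triggers a split ("AZ != AC") amounts to comparing the compressed prefix with the corresponding bytes of the key. A minimal sketch, in which `prefix_match_len` and its parameters are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many leading bytes of `prefix` match `key` starting at
 * key[depth]. A result smaller than prefix_len means the node has to be
 * split: the matching bytes stay as the new parent's prefix, and the
 * first differing bytes become the two branch bytes in the new parent. */
size_t prefix_match_len(const unsigned char *prefix, size_t prefix_len,
                        const unsigned char *key, size_t key_len,
                        size_t depth)
{
    assert(depth <= key_len);
    size_t i = 0;
    while (i < prefix_len && depth + i < key_len &&
           prefix[i] == key[depth + i])
        i++;
    return i;
}
```

For prefix AC and key AZA the function returns 1: the new parent keeps prefix A and branches on C (the old node) and Z (the new key). For key ACC it returns 2, so the search simply continues below the node.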
Path Compression Problem in PM
Consistency Issue of Path Compression
§ Path compression split
• Causes updates of multiple nodes
• Has to employ expensive logging methods
[Figure: AZA is inserted. The new parent with children C and Z is in a consistent state; a crash before the old node's prefix AC is updated leaves the old node in an inconsistent state.]
Path Compression Solution
WORT (Write-Optimal Radix Tree) for PM
§ Failure-atomic path compression
• Add node depth field to compression header
Compression header (8 bytes):
    struct Header {
        unsigned char depth;
        unsigned char PrefixArr[7];
    };
[Figure: node with header (depth 0, prefix AC) and children A, C, Z for keys ACA, ACC, ACZ]
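Because the header is exactly 8 bytes, it can be replaced with one aligned 8-byte store. The sketch below uses the struct from the slide; the helpers `header_pack`, `header_unpack`, and `header_store` are hypothetical, `__atomic_store_n` is a GCC/Clang builtin, and the cache line flush that would follow in PM is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Layout as shown on the slide. */
struct Header {
    unsigned char depth;
    unsigned char PrefixArr[7];
};

_Static_assert(sizeof(struct Header) == 8,
               "header must fit in one 8-byte word");

/* Copy the header bytes into a 64-bit word and back. */
uint64_t header_pack(const struct Header *h)
{
    uint64_t w;
    memcpy(&w, h, sizeof w);
    return w;
}

struct Header header_unpack(uint64_t w)
{
    struct Header h;
    memcpy(&h, &w, sizeof h);
    return h;
}

/* Overwrite the whole header with one aligned 8-byte store, so a crash
 * leaves either the old header or the new one, never a mix of both. */
void header_store(uint64_t *slot, const struct Header *h)
{
    __atomic_store_n(slot, header_pack(h), __ATOMIC_RELEASE);
}
```

Keeping depth and prefix in the same word is the point of the design: both fields change together or not at all.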
WORT (Write-Optimal Radix Tree) for PM
§ Failure-atomic path compression
• Add node depth field to compression header
• AZA to be inserted
② Decompression of old common prefix
[Figure: each node carries an 8-byte compression header. The new parent (depth 0) with children C and Z is in a consistent state. The old node with children A, C, Z should receive a header with depth 2; a crash before this update leaves its header at (depth 0, prefix AC), an inconsistent state.]
WORT (Write-Optimal Radix Tree) for PM
§ Failure-atomic path compression
• Failure detection in WORT
− Depth in a header ≠ Counted depth → Crashed header
[Figure: after the crash, the old node's header still holds (depth 0, prefix AC), which is not equal to the expected tree depth (2)]
WORT (Write-Optimal Radix Tree) for PM
§ Failure-atomic path compression
• Failure recovery in WORT
− Compression header can be reconstructed → Atomically overwrite
[Figure: the old node's inconsistent header (depth 0, prefix AC) is overwritten with a reconstructed header of depth 2, which returns the tree to a consistent state]
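Detection and recovery can be sketched as below. This is one plausible reading of the slides, not the authors' code. It assumes a header layout with an explicit prefix length (`depth`, `prefix_len`, `prefix[6]`), which differs from the struct shown on the earlier slide, and it assumes the stale prefix still covers the key bytes that the new parent now consumes, so the remaining prefix is the stale one with its first bytes cut off. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PREFIX 6

/* Assumed layout for this sketch; prefix_len is not on the slide. */
typedef struct {
    unsigned char depth;        /* key bytes consumed above this node */
    unsigned char prefix_len;   /* number of valid bytes in prefix[] */
    unsigned char prefix[MAX_PREFIX];
} Header;

_Static_assert(sizeof(Header) == 8, "header must fit in one 8-byte word");

/* Detection: the stored depth must equal the depth counted while
 * walking down from the root. */
int header_is_stale(const Header *h, unsigned counted_depth)
{
    return h->depth != counted_depth;
}

/* Reconstruction: drop the prefix bytes now consumed by the new parent. */
Header header_rebuild(const Header *stale, unsigned counted_depth)
{
    assert(counted_depth >= stale->depth);
    unsigned cut = counted_depth - stale->depth;
    Header fresh;
    memset(&fresh, 0, sizeof fresh);
    fresh.depth = (unsigned char)counted_depth;
    if (cut < stale->prefix_len) {
        fresh.prefix_len = (unsigned char)(stale->prefix_len - cut);
        memcpy(fresh.prefix, stale->prefix + cut, fresh.prefix_len);
    }
    return fresh;
}

/* Recovery: rebuild a stale header and overwrite it with one 8-byte
 * store (GCC/Clang builtin); a flush would follow in PM. */
void header_recover(uint64_t *slot, unsigned counted_depth)
{
    Header h;
    memcpy(&h, slot, sizeof h);
    if (!header_is_stale(&h, counted_depth)) return;
    Header fresh = header_rebuild(&h, counted_depth);
    uint64_t w;
    memcpy(&w, &fresh, sizeof w);
    __atomic_store_n(slot, w, __ATOMIC_RELEASE);
}
```

With the slide's example, a stale header (depth 0, prefix AC) found at counted depth 2 becomes (depth 2, empty prefix). Recovery needs no log because everything it requires is in the stale header and the traversal itself.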
Write Optimal Data Structure for PM
§ Our proposed radix tree variant is optimal for PM
• Consistency is always guaranteed with a single 8-byte failure-atomic write without any additional copies for logging or CoW
WORT (Write Optimal Radix Tree)
WOART (Write Optimal Adaptive Radix Tree)
1. Failure-atomic path compression
2. Redesigned adaptive node
Evaluation
§ Experimental environment
System configuration | Description
CPU | Intel Xeon E5-2620V3 X 2
OS | Linux CentOS 6.6 (64bit), kernel v4.7.0
PM | Emulated with 256GB DRAM; write latency: injecting additional stall cycles
Evaluation
§ Experimental environment
• Comparison group
− Radix tree variants: WORT
− B+tree variants: wB+Tree (VLDB '15), NVTree (FAST '15), FPTree (SIGMOD '16)
[Figure: DRAM and PM placement labels for the compared trees]