Proteus: A Flexible and Fast Software Supported Hardware Logging Approach for NVM
Seunghee Shin, Satish Tirukkovalluri, James Tuck, and Yan Solihin
North Carolina State University
The 2018 Non-Volatile Memories Workshop (NVMW 2018)

Background
Disk / Flash: (+) Non-volatile, (−) Slow, (−) Block-addressable
DRAM: (+) Fast, (+) Byte-addressable, (−) Volatile
NVM: (+) Fast, (+) Byte-addressable, (+) Non-volatile
• Use NVM as storage or main memory?
• We assume NV main memory (NVMM)
  – Keep important data in memory instead of in files
  – Need to ensure failure safety

Failure Safety through Durable Transactions
• Durable transaction
  - Needed to ensure failure safety
  - All updates in a transaction are atomically durable
  - Atomicity can be achieved through HW or SW undo logging
[Figure: inserting node X among nodes A, B, C, D; a system failure during the insert is handled by undo-logging]

Transaction with Software Undo-Logging
• Step 1 - Create undo log and make it durable
• Step 2 - Set log-flag and make it durable, indicating transaction start
• Step 3 - Perform data updates and make them durable
• Step 4 - Unset log-flag and make it durable, indicating transaction end
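
The four steps above can be sketched as a small Python simulation. This is a toy model under stated assumptions, not the authors' implementation: `SimulatedNVMM`, `persist`, `LOG_FLAG`, `log_addr`, and the `crash_mid_update` switch are hypothetical names, and `persist` merely stands in for a cache-line write-back followed by a fence.

```python
class SimulatedNVMM:
    """Toy memory: stores land in a volatile cache until persisted."""

    def __init__(self):
        self.cache = {}    # volatile, lost on a crash
        self.durable = {}  # survives a crash

    def store(self, addr, value):
        self.cache[addr] = value

    def load(self, addr):
        return self.cache.get(addr, self.durable.get(addr))

    def persist(self, *addrs):
        # Assumption: stands in for a write-back of each address plus a fence.
        for addr in addrs:
            if addr in self.cache:
                self.durable[addr] = self.cache[addr]

    def crash(self):
        self.cache = {}


LOG_FLAG = "log_flag"  # hypothetical location of the log-flag


def log_addr(addr):
    """Hypothetical mapping from a data address to its undo-log entry."""
    return ("log", addr)


def durable_transaction(mem, updates, crash_mid_update=False):
    # Step 1: create the undo log (old values) and make it durable.
    for addr in updates:
        mem.store(log_addr(addr), mem.load(addr))
    mem.persist(*[log_addr(a) for a in updates])
    # Step 2: set the log-flag and make it durable (transaction start).
    mem.store(LOG_FLAG, tuple(updates))
    mem.persist(LOG_FLAG)
    # Step 3: perform the data updates and make them durable.
    for addr, value in updates.items():
        mem.store(addr, value)
    if crash_mid_update:
        # Simulate a failure after only the first update became durable.
        mem.persist(next(iter(updates)))
        mem.crash()
        return
    mem.persist(*updates)
    # Step 4: unset the log-flag and make it durable (transaction end).
    mem.store(LOG_FLAG, None)
    mem.persist(LOG_FLAG)


def recover(mem):
    """Roll back a transaction whose log-flag is still set."""
    logged = mem.load(LOG_FLAG)
    if logged:
        for addr in logged:
            mem.store(addr, mem.load(log_addr(addr)))
        mem.persist(*logged)
        mem.store(LOG_FLAG, None)
        mem.persist(LOG_FLAG)
```

Calling `durable_transaction(mem, {"A": 10, "B": 20}, crash_mid_update=True)` leaves a torn state in `mem.durable`; `recover(mem)` then restores both old values because the log-flag was still set when the failure hit.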

Memory Persistency
• Unpredictable persist ordering
  - Persist: operation which makes NVMM writes durable
  - NVMM persist order is determined by LLC writebacks, instead of program order
• Persistency Model
  - Defines when stores become durable (i.e. placed in the persistence domain)
  - E.g. Intel PMEM persistency model, strict persistency model, epoch persistency model, buffered epoch persistency model, strand persistency model, etc.
[Figure: Private Cache, Shared Cache (LLC), MC, NVMM; writebacks reach NVMM in unpredictable order]

Intel PMEM Instruction and ADR
• Asynchronous DRAM Refresh (ADR)
  - Added write pending queue (WPQ) in MC to persistence domain
  - Flush data in WPQ to NVMM automatically on system failure
• CLWB
  - Write back a dirty block from caches to WPQ
  - A fence is needed for ordering
[Figure: two diagrams of L1, L2, Shared Cache, MC, and NVMM, with the persistence domain covering the MC and NVMM. One runs st A; st B. The other runs st A; clwb A; sfence; st B.]

Let’s Revisit Software Logging
• Software logging (SL)
  - Software performs log creation, maintenance, and truncation
  - (+) Flexible (e.g. no OS support needed)
  - (−) High performance overheads (~50% slowdown)
• Hardware logging (HL)
  - Hardware creates and manages logs automatically (e.g. ATOM [HPCA’17])
  - (+) Low performance overheads
    a. Memory fence is not required between logging and data modification
    b. New logging optimizations possible
  - (−) Not flexible
[Figure: program order vs. time for log A-D and st A-D under Software Logging (with FENCE) and Hardware Logging]

Software Supported Hardware Logging
SL: Flexible, but not fast
HL: Fast, but not flexible
SSHL: Fast and flexible
• Software Supported Hardware Logging (SSHL)
  - Hardware provides logging instructions
  - Software performs logging operations using logging instructions
  - Hardware applies optimizations

Proteus: SSHL Design
• Flexibility: Software involvement in logging
  - Add instructions which start logging operations in hardware
  - Two instructions are required: log-load and log-flush
• Performance Optimizations
  - Parallel logging: process multiple loggings concurrently
  - Redundant logging detection and removal
• Endurance Optimization (log write removal)
  - With the introduction of ADR, WPQ is considered non-volatile
  - Key insight: logs are no longer needed when a transaction commits
  - Remove logs without flushing to NVMM
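
The endurance optimization can be illustrated with a toy model of a queue that sits inside the persistence domain, such as the Log Pending Queue (LPQ) from the hardware design slide. This is a sketch under assumptions: the class and method names are hypothetical, and draining the oldest entry when the queue is full is an assumed policy, not one stated in the slides.

```python
from collections import deque


class LogPendingQueue:
    """Toy model of a log queue held in the ADR-protected memory controller."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()    # (tx_id, log_to_addr, data)
        self.nvmm_log_writes = 0  # log entries that actually reached NVMM

    def push(self, tx_id, log_to_addr, data):
        if len(self.entries) == self.capacity:
            # No free entry: drain the oldest log to NVMM (assumed policy).
            self.entries.popleft()
            self.nvmm_log_writes += 1
        self.entries.append((tx_id, log_to_addr, data))

    def commit(self, tx_id):
        """Drop the committed transaction's logs without writing them to NVMM."""
        before = len(self.entries)
        self.entries = deque(e for e in self.entries if e[0] != tx_id)
        return before - len(self.entries)
```

In this model a transaction whose logs all fit in the queue costs zero log writes to NVMM; only overflow forces a log entry out to the memory cells.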

Proteus: New Logging Instructions
Source transaction:
  tx_begin
  A = …
  B = …
  tx_end
Code generation:
  i1: tx_begin
  i2: log-load LR1, A
  i3: log-flush LR1, (LTA)+
  i4: st A
  i5: log-load LR2, B
  i6: log-flush LR2, (LTA)+
  i7: st B
  i8: tx_end
Instruction semantics:
  log-load $LR1 M1: LR1 = Mem[M1]
  log-flush $LR1 M2: Mem[M2] = LR1
- log-from address (M1): address of original data
- log-to address (M2): address of log entry
- Log data register (LR#): register holding logging data
[Figure: L1, L2, Shared Cache, MC, and NVM holding M1 and M2]
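
The code generation and the two instruction semantics above can be mimicked in a few lines of Python. This is only an illustrative sketch: `generate`, `execute`, and `entry_size` are hypothetical, `(LTA)+` is read here as "use the log-to address, then advance it", and cycling through eight log registers is an assumption borrowed from the LDR size on the methodology slide.

```python
def generate(store_targets, num_log_regs=8):
    """Emit the instruction sequence for a transaction that stores to each target."""
    code = ["tx_begin"]
    for i, target in enumerate(store_targets):
        reg = f"LR{i % num_log_regs + 1}"
        code.append(f"log-load {reg}, {target}")   # LR = Mem[M1]
        code.append(f"log-flush {reg}, (LTA)+")    # Mem[M2] = LR, then advance
        code.append(f"st {target}")
    code.append("tx_end")
    return code


def execute(mem, writes, log_start, entry_size=1):
    """Apply log-load, log-flush, and st for each (address, new_value) pair."""
    lta = log_start
    for addr, new_value in writes:
        lr = mem[addr]        # log-load: read the original data
        mem[lta] = lr         # log-flush: write it to the log entry
        lta += entry_size     # post-increment of the log-to address (assumed)
        mem[addr] = new_value # st: the actual data update
    return lta
```

For example, `generate(["A", "B"])` yields the eight instructions i1 to i8 listed on the slide, and `execute({0x800: "A"}, [(0x800, "B")], log_start=0x200)` leaves the old value at 0x200 and the new value at 0x800.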

Proteus Hardware Design
• Pipeline (alongside LoadQ and StoreQ, with dependency checks)
  - Log Queue (LogQ), fields: from, to, data
    Maintains log-to-store dependencies
    Keeps track of logging executions (parallel loggings)
  - Log data register (LDR)
    Keeps log data while logging instructions are in the pipeline
• Register File
  - txID: holds the ID of the transaction currently executing in the core
  - log-start: the start address of the log area
  - log-end: the end address of the log area
  - cur-log: tracks the current free log entry
• Cache
  - Log Look-up Table (LLT), fields: tag, LRU, txID
    Prevents redundant loggings in a transaction
• Memory Controller with ADR (Router, WPQ, LPQ, Arbiter)
  - Log Pending Queue (LPQ), fields: txID, coreID, loginfo, data
    Holds logs until the transaction ends or there are no free entries
    Keeps logs separate from the WPQ so that incoming read requests need not check log entries
  - Arbiter
    Prioritizes writes from the WPQ unless the LPQ has no free entries (fewer than a threshold)
• NVMM
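
The role of the LLT can be sketched as a small set-associative table with LRU replacement. This is a minimal model under assumptions: the class and method names are hypothetical, and tracking at 64-byte block granularity is assumed rather than stated on the slide. The default sizes follow the methodology slide (64 entries, 8-way).

```python
from collections import OrderedDict


class LogLookupTable:
    """Toy LLT: remembers which blocks the current transaction already logged."""

    def __init__(self, entries=64, ways=8, block_size=64):
        self.ways = ways
        self.num_sets = entries // ways
        self.block_size = block_size  # assumed logging granularity
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def needs_logging(self, tx_id, addr):
        block = addr // self.block_size
        entries = self.sets[block % self.num_sets]
        if entries.get(block) == tx_id:
            entries.move_to_end(block)  # hit: mark as most recently used
            return False                # already logged in this transaction
        entries[block] = tx_id          # miss, or entry from another transaction
        entries.move_to_end(block)
        if len(entries) > self.ways:
            entries.popitem(last=False) # evict the least recently used entry
        return True
```

An eviction only makes the table forget a block, so a later store to that block is logged again. In this model that costs a redundant log but never skips a required one.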

Proteus Hardware Design
Instruction sequence:
  tx_begin
  log-load LR1, (0x800)
  log-flush LR1, (LTA)+
  store B, (0x800)
  clwb (0x800)
  sfence
  tx_end
Register file: txID 01, 02; log-start 0x100; log-end 0x300; cur-log 0x200
[Figure: step-by-step walkthrough. LogQ entry: from 0x800, to 0x200, data A. LPQ entry: txID 02, coreID 1, loginfo 0x200, data A. Cache block 0x800 changes from A to B.]

Methodology
System Configuration
  Processor:     OOO, 3.4GHz, 4 cores
  L1 I/D Cache:  32KB, 8-way, 64B block, 4 cycles, private per core
  L2 Cache:      256KB, 8-way, 64B block, 12 cycles, private per core
  L3 Cache:      8MB, 16-way, 64B block, 42 cycles, shared by all cores
  NVM:           DDR3 like interface, 800MHz, 8GB, 1 channel
                 16 banks per rank, 2KB row-buffer
                 tCAS-tRCD-tRP-tRAS-tRC-tWR-tWTR-tRTP-tRRD-tFAW
                 11-29(109)-11-28-39-12-6-6-5-24 (tRCD 29 for read, 109 for write)
  Proteus:       LDR: 8 registers, LogQ: 8 entries
                 LLT: 64 entries (8-way), LPQ: 256 entries
- MarssX86 + DRAMsim2 simulator is used
- NVM has 50ns read latency and 150ns write latency

Evaluation (1) - Speedup
- Baseline: software logging using Intel PMEM instructions
- Proteus performs 46% better than baseline, 10% better than ATOM
[Figure: speedup per benchmark: RB tree, Btree, Queue, AvlTree, Hashmap, StringSwap]

Evaluation (2) - Number of writes
- Baseline: no logging (not failure safe)
- ATOM incurs 350% more writes than baseline
- Proteus has similar writes to baseline (only 2% higher)
- ATOM introduces 3.4x more writes than Proteus
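
The write figures above are consistent with one another if "N% more" is read as additive over the baseline, which is an assumption about how the slide's numbers were derived. A short check, normalizing baseline writes to 1.0:

```python
baseline_writes = 1.0
atom_writes = baseline_writes * (1 + 3.50)     # 350% more than baseline
proteus_writes = baseline_writes * (1 + 0.02)  # 2% more than baseline

atom_vs_proteus = atom_writes / proteus_writes  # total writes, ATOM relative to Proteus
extra_vs_proteus = atom_vs_proteus - 1          # additional writes relative to Proteus

print(f"ATOM writes {atom_vs_proteus:.2f}x as much as Proteus "
      f"({extra_vs_proteus:.1f}x more)")
```

Under this reading, ATOM performs roughly 4.4 times the writes of Proteus, which matches the "3.4x more" phrasing.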

Conclusions
• Software logging is expensive but flexible
• Hardware logging is fast but inflexible
• Proteus: Software Supported Hardware Logging (SSHL)
  - Fast and flexible
  - New logging instructions allow software to manage logging
  - Performance optimizations: parallel logging, redundant logging removal
  - Endurance optimization: remove logs before flushing to NVMM
• Results
  - Performance: 46% better vs. SW logging (10% better vs. ATOM)
  - Endurance: 2% more writes to NVMM vs. 350% with ATOM

Thank you