Libnvmmio: Reconstructing SW IO Path with Failure-Atomic Memory-Mapped Interface
Jungsik Choi (Sungkyunkwan University), Jaewan Hong (KAIST), Youngjin Kwon (KAIST), Hwansoo Han (Sungkyunkwan University)
USENIX ATC '20
SW Overhead Greater than Storage Latency
[Figure: per-device IO time split into SW overhead and device latency, from ms-scale HDD and SSD, through µs-scale TLC 3D NAND SSD, XL-Flash SSD, and Optane SSD, down to ns-scale DCPMM and NVDIMM-N persistent memory; as devices get faster, SW overhead exceeds the storage latency itself.]
Reconstruct SW IO Path with Libnvmmio
• Libnvmmio
  - A library that runs on any POSIX FS providing DAX-mmap
  - Transparent MMIO with logging
  - Makes the common IO path efficient
• Handles data ops (read, write, fsync) at user level (sketched below)
• Routes metadata ops (open, close, mmap, munmap) to the kernel FS
  - Low-latency & scalable IO
  - Data-atomicity
[Figure: architecture; the application's read/write/fsync calls are served by Libnvmmio from memory-mapped files and atomic per-block logs, while open/close/mmap/munmap go to the kernel NVM-aware FS, which manages files on NVMM.]
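To make the split concrete, here is a hypothetical sketch of the user-level data path. The struct and function names (nv_file, nv_write) are illustrative assumptions, not Libnvmmio's actual API; the real library additionally logs every write for atomicity, and metadata ops still reach the kernel FS through the normal syscalls.

```c
/* Hypothetical sketch of the split IO path (names assumed,
 * not Libnvmmio's actual API). */
#include <string.h>
#include <sys/types.h>

struct nv_file {
    char  *map;   /* DAX-mmap'ed view of the file */
    size_t pos;   /* current file position */
};

/* Data op: served entirely in user space with a store into the
 * mapping; no kernel crossing. A real implementation would also
 * log the write so it can be applied atomically. */
ssize_t nv_write(struct nv_file *f, const void *buf, size_t len) {
    memcpy(f->map + f->pos, buf, len);
    f->pos += len;
    return (ssize_t)len;
}
```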
User-Level IO Is Suitable for NVMM Systems
• The kernel's IO stacks introduce SW overhead
• User-level IO with mmap
  - Accesses files directly with load/store
  - Reduces user/kernel mode switches
  - Avoids complex IO stacks: no indexing, no permission checks
• MMIO is the fastest way to access files (a minimal sketch follows)
[Figure: read/write calls traverse the VFS, file system, and device driver inside the OS kernel, while load/store through mmap reaches NVMM directly.]
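A minimal sketch of this fast path, assuming a file on a DAX-capable filesystem (the path /mnt/pmem/data is made up). MAP_SYNC (Linux 4.15+) guarantees stores become durable once they are flushed from the CPU caches, with no fsync needed:

```c
/* Minimal user-level IO over DAX-mmap. Assumes an ext4/xfs mount
 * with -o dax; the file path is illustrative. */
#define _GNU_SOURCE          /* for MAP_SHARED_VALIDATE / MAP_SYNC */
#include <fcntl.h>
#include <immintrin.h>       /* _mm_clflush, _mm_sfence */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/mnt/pmem/data", O_RDWR);
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    memcpy(p, "hello", 5);   /* plain store: no read()/write() syscall */

    /* Persist without fsync: flush the cache line, then fence. */
    _mm_clflush(p);
    _mm_sfence();

    munmap(p, 4096);
    close(fd);
    return 0;
}
```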
Logging Is More Efficient than CoW
• CoW (or shadow paging)
  - High write amplification
  - Hugepages make CoW more expensive
  - Frequent TLB shootdowns
• Logging (or journaling)
  - Writes data twice: to logs and to files
  - Differential logging reduces the amount written (sketched below)
  - Checkpointing can be postponed
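A minimal sketch of the idea, with assumed struct and field names: only the modified byte range of a block is copied into its log, so a small write does not amplify into a full 4KB log write.

```c
/* Differential logging sketch (names assumed). Only the dirty
 * byte range is logged, not the whole 4KB block. */
#include <stdint.h>
#include <string.h>

struct delta_log {
    char     data[4096];   /* log block shadowing one file block */
    uint32_t offset, len;  /* dirty byte range within the block  */
};

static void log_delta(struct delta_log *log, const void *src,
                      uint32_t off, uint32_t len) {
    memcpy(log->data + off, src, len);  /* copy only the delta */
    log->offset = off;                  /* checkpointing later applies */
    log->len    = len;                  /* just this range to the file */
}
```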
Redo vs. Undo
• Most logging systems use only one policy (redo or undo)
• The two have different pros & cons depending on access type
  - Redo is better for writing; undo is better for reading
[Figure: with undo logging, writes and reads go to the file in place after the old data is saved to the log; with redo logging, writes go to the log and are applied to the file asynchronously, so reads must consult the log as well.]
The contrast in write paths is sketched below.
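Continuing the delta_log sketch above (all names remain illustrative), the two write paths differ in where the new data lands:

```c
/* REDO: new data goes to the log; the file is patched later at
 * checkpoint time, so reads must merge the log with the file. */
static void write_redo(struct delta_log *log, char *file_blk,
                       const char *buf, uint32_t off, uint32_t len) {
    (void)file_blk;                    /* untouched until checkpoint */
    log_delta(log, buf, off, len);     /* new data -> log */
}

/* UNDO: old data is saved to the log first, then new data is
 * written in place, so reads can hit the file directly. */
static void write_undo(struct delta_log *log, char *file_blk,
                       const char *buf, uint32_t off, uint32_t len) {
    log_delta(log, file_blk + off, off, len);  /* old data -> log */
    memcpy(file_blk + off, buf, len);          /* new data in place */
}
```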
Hybrid Logging
• Uses an adaptive policy depending on a file's access type
  - Read-intensive file → undo logging
  - Write-intensive file → redo logging
• Maintains per-file read/write counters
• Determines the logging policy on each fsync (sketched below)
• Achieves the best-case performance of the two logging policies
  - Reduces SW overhead and improves logging efficiency
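A sketch of the per-file decision, with assumed names: the counters live in the file's metadata and the switch is evaluated when the application calls fsync.

```c
/* Hybrid-logging policy switch (names assumed, not the actual API). */
enum log_policy { POLICY_UNDO, POLICY_REDO };

struct file_meta {
    unsigned long   nr_reads;   /* reads since the last fsync  */
    unsigned long   nr_writes;  /* writes since the last fsync */
    enum log_policy policy;     /* policy for the next interval */
};

static void update_policy_on_fsync(struct file_meta *f) {
    /* Write-intensive -> redo (fast writes, reads consult the log);
     * read-intensive -> undo (reads hit the file directly). */
    f->policy = (f->nr_writes > f->nr_reads) ? POLICY_REDO
                                             : POLICY_UNDO;
    f->nr_reads = f->nr_writes = 0;  /* start a fresh sampling window */
}
```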
Centralized Logging with Fine-Grained Locks
• Decentralized logging was designed for transactions
  - e.g., per-thread logging, per-transaction logging
• Centralized logging is appropriate for file IO, but not scalable by itself
  - Requires fine-grained locks for scalable file IO
[Figure: decentralized logging keeps one log per thread or transaction feeding a file; centralized logging keeps a single log per file.]
Per-Block Logging
[Figure: a multi-level radix tree indexes a separate per-block log for each block of the file.]
Lock-Free Radix Tree
[Figure: a file offset is split into 9-bit indices for the global (LGD), upper (LUD), and middle (LMD) directories, a 9-bit table index, and a 12-bit offset within a 4KB block; each log entry records fields such as entry size, rwlock, offset, len, dest, policy, skip, and epoch, and points to a per-block log, while per-file metadata keeps pointers such as radix_root and lgd.]
An illustrative layout of this index follows.
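The sketch below is an assumed C layout, not Libnvmmio's actual structures: a 48-bit file offset is consumed 9 bits per directory level, leaving a 12-bit offset inside a 4KB block, and each leaf slot holds a log entry.

```c
/* Four-level radix index sketch: LGD -> LUD -> LMD -> entry table,
 * 9 bits per level plus a 12-bit in-block offset (names assumed). */
#include <pthread.h>
#include <stdint.h>

#define IDX(off, shift) (((off) >> (shift)) & 0x1FFULL)  /* 9 bits */

struct log_entry {
    pthread_rwlock_t rwlock;  /* fine-grained, per-block lock */
    uint32_t offset, len;     /* dirty range (differential logging) */
    void    *dest;            /* mapped address of the file block */
    uint8_t  policy;          /* undo or redo for this entry */
    uint32_t epoch;           /* epoch the entry was written in */
    char    *data;            /* the per-block log itself (4KB) */
};

static struct log_entry *lookup(void **lgd, uint64_t off) {
    void **lud = lgd ? lgd[IDX(off, 39)] : 0;   /* global directory */
    void **lmd = lud ? lud[IDX(off, 30)] : 0;   /* upper directory  */
    void **tbl = lmd ? lmd[IDX(off, 21)] : 0;   /* middle directory */
    return tbl ? tbl[IDX(off, 12)] : 0;         /* entry table      */
}
```

Lock-free installation of a missing table would use an atomic compare-and-swap on the parent slot, so index lookups never take a lock; only the per-block rwlock serializes access to a block's log.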
Commit & Checkpoint Based on Epochs
• Per-block logs are atomically committed on fsync
• Libnvmmio commits by increasing the global epoch value (sketched below)
  - Committed logs have an epoch smaller than the global epoch
• Background checkpointing
[Figure: background threads walk each file's radix tree and checkpoint per-block logs whose epoch is older than the current epoch into the memory-mapped file.]
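A hedged sketch of the commit/checkpoint split, reusing the illustrative log_entry and policy enum from the sketches above (free_log_entry is a hypothetical helper): fsync only bumps an epoch counter, and a background thread later materializes committed entries.

```c
/* Epoch-based commit sketch (names assumed). */
#include <stdatomic.h>
#include <string.h>

static _Atomic uint32_t global_epoch = 1;

/* Commit on fsync: a single atomic increment. Every log entry
 * stamped with an older epoch is now committed and immutable. */
static void commit_on_fsync(void) {
    atomic_fetch_add(&global_epoch, 1);
}

/* Background checkpointing: apply committed redo entries to the
 * file and reclaim the log; undo entries already have the new data
 * in the file, so their logs are simply dropped. */
static void checkpoint(struct log_entry *e) {
    if (e->epoch < atomic_load(&global_epoch)) {
        if (e->policy == POLICY_REDO)
            memcpy((char *)e->dest + e->offset,
                   e->data + e->offset, e->len);
        free_log_entry(e);  /* hypothetical reclamation helper */
    }
}
```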
Design Summary
Libnvmmio provides low-latency and scalable IO while guaranteeing data-atomicity.
• Low-latency IO
  - User-level IO with mmap
  - Differential logging
  - Hybrid logging
  - Epoch-based committing
• Scalable IO
  - Per-block logging
  - Lock-free index data structure
  - Various log sizes
  - Background checkpointing
Experimental Setup
• Experimental machines
  - 32GB NVDIMM-N, 20 cores, 32GB DRAM
  - 256GB Optane DC, 16 cores, 128GB DRAM (in our paper)
• Comparison systems

  Filesystem    File IO   Data-Atomicity   Kernel
  Ext4-DAX      Kernel    X                5.1
  PMFS          Kernel    X                4.13
  NOVA          Kernel    O                5.1
  SplitFS       User      O                4.13
  Libnvmmio*    User      O                5.1
Hybrid Logging
[Figure: elapsed time (sec) of undo, redo, and hybrid logging as the read:write ratio varies from 0:100 to 100:0; hybrid tracks the faster of the two policies at every ratio.]
FIO: Different Access Patterns
• A single thread, file size = 4GB, block size = 4KB, time = 60s
[Figure: bandwidth (GiB/s) of Ext4-DAX, PMFS, NOVA, and Libnvmmio across access patterns (SR, RR, SW, RW).]
FIO: Different Write Sizes
[Figure: bandwidth (GiB/s) of Ext4-DAX, PMFS, NOVA, and Libnvmmio for write sizes of 128B, 1KB, 4KB, 64KB, and 1MB.]
FIO: Random Write with Multiple Threads
[Figure: bandwidth (GiB/s) of Ext4-DAX, PMFS, NOVA, and Libnvmmio with 1, 2, 4, 8, and 16 threads, for a private file and a shared file.]
TPC-C on SQLite
• SQLite uses its WAL on the underlying FS alone; with Libnvmmio, the WAL is turned off
[Figure: normalized tpmC on Ext4-DAX, PMFS, NOVA, and SplitFS, comparing each FS alone against Libnvmmio running on that FS.]
SQLite WAL vs. Libnvmmio
• SQLite WAL
  - Designed for block devices
  - Similar to redo logging
  - Reads both the WAL and the DB file
  - Only one writer at a time
  - Synchronous checkpointing
• Libnvmmio
  - Designed for NVMM
  - Hybrid logging
  - Reads the DB file directly (undo)
  - Concurrent writes
  - Background checkpointing
• Libnvmmio easily improves performance
  - Supports any FS, even one that does not provide data-atomicity
Conclusion
• It is important to minimize SW overhead in NVMM systems
• Libnvmmio is a simple and practical solution
  - Reconstructs the SW IO path
  - Runs on any filesystem that provides DAX-mmap
• Low-latency, scalable IO while guaranteeing data-atomicity
  - 2.2x better throughput
  - 13x better scalability
• https://github.com/chjs/libnvmmio
Q&A
chjs@skku.edu