NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
Jian Andiry Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson
Non-Volatile Systems Laboratory
Department of Computer Science and Engineering
University of California, San Diego
Non-volatile Memory and DAX
• Non-volatile main memory (NVMM)
  – PCM, STT-RAM, ReRAM, 3D XPoint technology
  – Resides on the memory bus, load/store interface
• Direct Access (DAX)
  – DAX file I/O bypasses the page cache
  – DAX-mmap() maps NVMM pages directly into the application address space and bypasses the file system copy
  – “Killer app”
[Figure: Application, DRAM, NVMM, File system, HDD / SSD; access paths labeled load/store, mmap(), DAX-mmap()]
Application expectations on NVMM File System
• Speed
• POSIX I/O
• Atomicity
• Fault Tolerance
• Direct Access (DAX)
[Figure series: file systems rated against the five expectations (Speed, POSIX I/O, Atomicity, Fault Tolerance, Direct Access)]
• ext4, xfs, BtrFS, F2FS: two ✔, three ❌
• PMFS, ext4-DAX, xfs-DAX: three ✔, two ❌
• Strata (SOSP ’17): three ✔, two ❌
• NOVA (FAST ’16): four ✔, one ❌
• NOVA-Fortis: five ✔
Challenges
NOVA: Log-structured FS for NVMM
• Per-inode logging
  – High concurrency
  – Parallel recovery
• High scalability
  – Per-core allocator, journal and inode table
• Atomicity
  – Logging for single inode update
  – Journaling for update across logs
  – Copy-on-Write for file data
[Figure: Per-inode logging: Inode (Head, Tail), Inode log, Data pages]
Jian Xu and Steven Swanson, NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories, FAST ’16.
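The "logging for single inode update" bullet can be illustrated with a toy model. This is a minimal sketch and not NOVA code: the class `InodeLog`, the `crash_before_commit` flag, and the use of a Python list as a stand-in for log pages are all assumptions made for illustration. It shows the general idea that an entry becomes visible only once the tail moves past it.

```python
class InodeLog:
    """Toy per-inode log: entries count only once the tail covers them."""

    def __init__(self):
        self.entries = []  # stand-in for log pages in NVMM
        self.tail = 0      # number of committed entries

    def append(self, entry, crash_before_commit=False):
        # Step 1: write the new entry past the current tail.
        self.entries.append(entry)
        if crash_before_commit:
            return  # simulated crash: entry written, tail never moved
        # Step 2: commit by moving the tail (a single small update).
        self.tail = len(self.entries)

    def recover(self):
        # Anything beyond the tail was never committed, so drop it.
        del self.entries[self.tail:]
        return list(self.entries)


log = InodeLog()
log.append("write A")
log.append("write B", crash_before_commit=True)
print(log.recover())
```

In this model a crash between the two steps leaves the inode exactly as it was before the append, which is the atomicity property the slide names.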
Snapshot
Snapshot support
• Snapshots are essential for file system backup
• Widely used in enterprise file systems
  – ZFS, Btrfs, WAFL
• Snapshots are not available in DAX file systems
Snapshot for normal file I/O
write(0, 4K);
take_snapshot();
write(0, 4K);
write(0, 4K);
take_snapshot();
write(0, 4K);
recover_snapshot(1);
[Figure: file log holding four write entries for Page 0, tagged with snapshot IDs 0, 1, 1, 2; the current snapshot ID is 2. Legend: File write entry, Data in snapshot, Reclaimed data, Current data]
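The sequence on this slide can be replayed in a small model. The sketch below is an illustration under assumptions, not the NOVA-Fortis implementation: the class `SnapshotLog`, its method names, and the rule that an entry is reclaimable when a later entry overwrites the same page within the same snapshot ID are hypothetical simplifications of what the figure suggests.

```python
class SnapshotLog:
    """Toy file log whose write entries carry a snapshot ID."""

    def __init__(self):
        self.current_id = 0
        self.log = []  # list of (snapshot_id, page, data)

    def write(self, page, data):
        self.log.append((self.current_id, page, data))

    def take_snapshot(self):
        taken = self.current_id
        self.current_id += 1  # later writes belong to the next ID
        return taken

    def read(self, page, snapshot=None):
        # Newest entry for the page that is not newer than the snapshot.
        limit = self.current_id if snapshot is None else snapshot
        for snap_id, p, data in reversed(self.log):
            if p == page and snap_id <= limit:
                return data
        return None

    def reclaimable(self):
        # Data overwritten within the same snapshot ID is in no snapshot.
        dead = []
        for i, (snap_id, page, data) in enumerate(self.log):
            if any(s == snap_id and p == page for s, p, _ in self.log[i + 1:]):
                dead.append(data)
        return dead


f = SnapshotLog()
f.write(0, "A")
f.take_snapshot()
f.write(0, "B")
f.write(0, "C")
f.take_snapshot()
f.write(0, "D")
print(f.read(0), f.read(0, snapshot=1), f.reclaimable())
```

Here the second write after the first snapshot plays the role of the reclaimed data in the figure, and reading with `snapshot=1` corresponds to `recover_snapshot(1)`.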
Memory Ordering With DAX-mmap()
D = 42;
Fence();
V = True;
States of D and V, and whether each satisfies the invariant:
Valid | D  | V
✓     | ?  | False
✓     | 42 | False
✓     | 42 | True
✗     | ?  | True
• Recovery invariant: if V == True, then D is valid
Memory Ordering With DAX-mmap()
D = 42;
Fence();
V = True;
• Recovery invariant: if V == True, then D is valid
• D and V live in two pages of a mmap()’d region.
[Figure: the application accesses D and V through DAX-mmap(); in NVMM they sit in Page 1 and Page 3]
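The role of the fence can be checked by enumerating crash states. The following is a hypothetical model written for this purpose, not real persistence code: `crash_states` and its step format are invented names, and the model assumes that stores issued since the last fence may reach NVMM in any subset, while a fence makes all earlier stores persistent.

```python
from itertools import combinations


def crash_states(program):
    """Return every (D, V) pair a crash could leave in NVMM.

    `program` is a list of ("store", name, value) or ("fence",) steps.
    """
    persisted = {"D": None, "V": False}  # None stands for the unknown '?'
    pending = []                         # stores not yet covered by a fence
    states = set()

    def record():
        # A crash at this point keeps any subset of the pending stores.
        for k in range(len(pending) + 1):
            for subset in combinations(pending, k):
                state = dict(persisted)
                for name, value in subset:
                    state[name] = value
                states.add((state["D"], state["V"]))

    record()
    for step in program:
        if step[0] == "store":
            pending.append((step[1], step[2]))
        else:  # fence: everything issued so far becomes persistent
            for name, value in pending:
                persisted[name] = value
            pending.clear()
        record()
    return states


with_fence = [("store", "D", 42), ("fence",), ("store", "V", True)]
without_fence = [("store", "D", 42), ("store", "V", True)]
print((None, True) in crash_states(with_fence))
print((None, True) in crash_states(without_fence))
```

With the fence, the state where V is True but D is still unknown never appears; without it, that state is reachable, which is the row marked ✗ on the slide.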
DAX Snapshot: Idea
• Set pages read-only, then copy-on-write
[Figure: Applications access file data through DAX-mmap() with no file system intervention; file data pages are marked RO]
DAX Snapshot: Incorrect implementation
• Application invariant: if V is True, then D is valid
Application thread   | NOVA snapshot           | Application (D, V) | Snapshot (D, V)
D = ?; V = False;    |                         | ?, F               |
                     | snapshot_begin();       |                    |
                     | set_read_only(page_d);  |                    | ?, (page_v not yet captured)
D = 42; (page fault) | copy_on_write(page_d);  | 42, F              |
V = True;            |                         | 42, T              |
                     | set_read_only(page_v);  |                    | ?, T
                     | snapshot_end();         |                    | ?, T
• The snapshot holds V == True while D is not valid, so it violates the invariant.
DAX Snapshot: Correct implementation
• Delay completion of CoW page faults until all pages are read-only
Application thread   | NOVA snapshot           | Application (D, V) | Snapshot (D, V)
D = ?; V = False;    |                         | ?, F               |
                     | snapshot_begin();       |                    |
                     | set_read_only(page_d);  |                    | ?, (page_v not yet captured)
D = 42; (page fault) |                         |                    |
                     | set_read_only(page_v);  |                    | ?, F
                     | snapshot_end();         |                    | ?, F
                     | copy_on_write(page_d);  | 42, F              |
V = True;            | copy_on_write(page_v);  | 42, T              |
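The two interleavings can be replayed in a few lines. This is a sketch under stated assumptions rather than the NOVA-Fortis code path: `run_snapshot` and `delay_cow` are hypothetical names, the application is a single thread that stalls on a delayed fault, and marking a page read-only is taken to freeze its current content for the snapshot.

```python
def run_snapshot(delay_cow):
    """Replay the slide's interleaving; return (snapshot, live) page values."""
    live = {"page_d": None, "page_v": False}  # None stands for '?'
    snapshot = {}
    read_only = set()
    app = [("page_d", 42), ("page_v", True)]  # D = 42; V = True, in order

    def run_app(snapshot_done):
        # The application runs until it finishes or stalls on a fault.
        while app:
            page, value = app[0]
            if page in read_only and delay_cow and not snapshot_done:
                return  # fault handler waits until all pages are read-only
            # Read-only page: CoW gives the application a fresh copy.
            # Writable page: the store lands directly in the page.
            read_only.discard(page)
            live[page] = value
            app.pop(0)

    for page in ("page_d", "page_v"):
        read_only.add(page)
        snapshot[page] = live[page]  # content frozen for the snapshot
        run_app(snapshot_done=False)
    run_app(snapshot_done=True)      # snapshot_end(): stalled faults complete
    return snapshot, live


print(run_snapshot(delay_cow=False)[0])
print(run_snapshot(delay_cow=True)[0])
```

Without the delay the snapshot captures V as True next to the old D; with the delay it captures the consistent pair, and the application still ends with D = 42 and V = True.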
Performance impact of snapshots
• Normal execution vs. taking snapshots every 10s
  – Negligible performance loss through read()/write()
  – Average performance loss of 3.7% through mmap()
[Figure: bar chart comparing runs without and with snapshots; panels: Filebench (read/write), WHISPER (DAX-mmap()); y-axis 0 to 1.2]
Protecting Metadata and Data
NVMM Failure Modes
• Detectable errors
  – Media errors detected by NVMM controller
  – Raises Machine Check Exception (MCE)
• Undetectable errors
  – Media errors not detected by NVMM controller
  – Software scribbles
[Figure, three cases]
• Detected media error on read: NVMM data has a media error; the NVMM controller detects the uncorrectable error and raises an exception; software receives an MCE.
• Undetected media error on read: NVMM data has a media error; the NVMM controller sees no error; software consumes corrupted data.
• Scribble on write: buggy code scribbles NVMM; the NVMM controller updates ECC; NVMM data holds a scribble error.
NOVA-Fortis Metadata Protection
• Detection
  – CRC32 checksums in all structures
  – Use memcpy_mcsafe() to catch MCEs
• Correction
  – Replicate all metadata: inodes, logs, superblock, etc.
  – Tick-tock: persist primary before updating replica
[Figure: inode and replica inode’ (Head, Tail, csum); log and replica log’ (ent1 c1 … entN cN); Data 1, Data 2]
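Checksum plus replica can be sketched as follows. This is an illustrative model and not the kernel code: `seal`, `is_valid`, and `ReplicatedRecord` are hypothetical names, Python's `zlib.crc32` stands in for the file system's CRC32, and the repair policy in `read` (trust a valid primary, fall back to the replica) is an assumption.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte CRC32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def is_valid(record: bytes) -> bool:
    payload, stored = record[:-4], record[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == stored


class ReplicatedRecord:
    """A metadata record kept as a checksummed primary and replica."""

    def __init__(self, payload: bytes):
        self.primary = seal(payload)
        self.replica = seal(payload)

    def update(self, payload: bytes):
        # Tick-tock: finish the primary first, then touch the replica,
        # so at least one copy is intact at any instant.
        self.primary = seal(payload)
        # (a persist barrier would go here on real NVMM)
        self.replica = seal(payload)

    def read(self) -> bytes:
        if is_valid(self.primary):
            if self.replica != self.primary:
                self.replica = self.primary  # repair or catch up the replica
            return self.primary[:-4]
        if is_valid(self.replica):
            self.primary = self.replica      # repair the primary
            return self.replica[:-4]
        raise IOError("both copies are corrupted")


rec = ReplicatedRecord(b"size=4096")
rec.primary = b"X" + rec.primary[1:]  # simulate a scribble on the primary
print(rec.read())
```

Because the primary is complete before the replica changes, a failure in the middle of `update` leaves one valid copy to recover from.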
NOVA-Fortis Data Protection
• Metadata
  – CRC32 + replication for all structures
• Data
  – RAID-4 style parity
  – Replicated checksums
1 Block (8 stripes): $S_0, S_1, \dots, S_7$ plus parity $P$
$P = \bigoplus_{i=0}^{7} S_i$
$C_i = \mathrm{CRC32C}(S_i)$ (replicated)
[Figure: inode and replica inode’ (Head, Tail, csum); log and replica log’ (ent1 c1 … entN cN); Data 1, Data 2]
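The parity and checksum formulas can be exercised on a single block. This is a minimal sketch under assumptions: the function names are hypothetical, `zlib.crc32` (plain CRC32) stands in for CRC32C because the Python standard library has no CRC32C, and only data stripes are checked, not the parity itself.

```python
import zlib
from functools import reduce

STRIPES = 8  # one block is split into 8 stripes, as on the slide


def split(block: bytes):
    n = len(block) // STRIPES
    return [block[i * n:(i + 1) * n] for i in range(STRIPES)]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def protect(block: bytes):
    """Return (parity, per-stripe checksums) for a block."""
    stripes = split(block)
    parity = reduce(xor, stripes)             # P = S0 xor S1 xor ... xor S7
    csums = [zlib.crc32(s) for s in stripes]  # stand-in for CRC32C(S_i)
    return parity, csums


def repair(block: bytes, parity: bytes, csums) -> bytes:
    """Rebuild at most one stripe whose checksum does not match."""
    stripes = split(block)
    bad = [i for i, s in enumerate(stripes) if zlib.crc32(s) != csums[i]]
    if len(bad) > 1:
        raise IOError("more than one bad stripe; parity cannot recover")
    if bad:
        others = [s for i, s in enumerate(stripes) if i != bad[0]]
        stripes[bad[0]] = reduce(xor, others, parity)
    return b"".join(stripes)


block = bytes(range(256)) * 16  # a 4 KB block
parity, csums = protect(block)
damaged = bytearray(block)
damaged[1000] ^= 0xFF           # corrupt one byte in stripe 1
print(repair(bytes(damaged), parity, csums) == block)
```

The checksums locate the damaged stripe and the parity rebuilds it; with a single parity stripe, two damaged stripes in the same block are beyond repair.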
File data protection with DAX-mmap
• Stores are invisible to the file systems
• The file systems cannot protect mmap’ed data
• NOVA-Fortis’ data protection contract:
  NOVA-Fortis protects pages from media errors and scribbles iff they are not mmap()’d for writing.
File data protection with DAX-mmap
• NOVA-Fortis logs mmap() operations
[Figure: user-space applications issue load/store; kernel-space NOVA-Fortis handles read/write and mmap(); on the NVDIMMs, file data is either protected or unprotected, and the file log holds an mmap log entry]
File data protection with DAX-mmap
• On munmap and during recovery, NOVA-Fortis restores protection
[Figure, two cases: (1) the application calls munmap() and protection is restored for the file data; (2) a system failure followed by recovery, after which NOVA-Fortis restores protection using the file log]
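The contract and its two restore paths can be modeled briefly. This sketch is hypothetical throughout: `ProtectedFile` and its methods are invented for illustration, a Python list stands in for the persistent file log, "restoring protection" is modeled as recomputing a page checksum, and `zlib.crc32` stands in for the real checksum.

```python
import zlib


class ProtectedFile:
    """Toy model: pages lose protection while mmap()'d for writing."""

    def __init__(self, pages):
        self.pages = list(pages)
        self.csums = [zlib.crc32(p) for p in self.pages]
        self.log = []        # stand-in for the persistent file log
        self.mapped = set()  # volatile state, lost on a crash

    def mmap(self, page_no):
        self.log.append(("mmap", page_no))  # record the mapping first
        self.mapped.add(page_no)

    def store(self, page_no, data):
        if page_no not in self.mapped:
            raise PermissionError("page is not mapped for writing")
        self.pages[page_no] = data  # file system does not see this store

    def _restore(self, page_no):
        self.csums[page_no] = zlib.crc32(self.pages[page_no])
        self.log.remove(("mmap", page_no))

    def munmap(self, page_no):
        self.mapped.discard(page_no)
        self._restore(page_no)

    def recover(self):
        # After a failure, the log says which pages were left unprotected.
        self.mapped.clear()
        for _, page_no in list(self.log):
            self._restore(page_no)

    def is_protected(self, page_no):
        return ("mmap", page_no) not in self.log

    def verify(self, page_no):
        return zlib.crc32(self.pages[page_no]) == self.csums[page_no]


f = ProtectedFile([b"aaaa", b"bbbb"])
f.mmap(0)
f.store(0, b"zzzz")
print(f.is_protected(0), f.is_protected(1))
f.munmap(0)
print(f.is_protected(0), f.verify(0))
```

Logging the mapping before handing out the page is what lets recovery find every page whose checksum may be stale.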
Performance
Latency breakdown
[Figure: stacked bars of latency in microseconds (axis 0 to 6) for Create, Append 4KB, Overwrite 4KB, Overwrite 512B, Read 4KB, Read 16KB. Components: VFS, alloc inode, journaling, memcpy_mcsafe, memcpy_nocache, append entry, free old data, calculate entry csum, verify entry csum, replicate inode, replicate log, verify data csum, update data csum, update data parity. Successive builds highlight the Metadata Protection and Data Protection components.]
Application performance
[Figure: normalized throughput (axis 0 to 1.2) for Fileserver, Varmail, MongoDB, SQLite, TPCC, and Average; systems: ext4-DAX, Btrfs, NOVA, w/ MP (metadata protection), w/ MP+DP (metadata and data protection)]
Conclusion
• Fault tolerance is critical for file systems, but existing DAX file systems don’t provide it
• We identify new challenges that NVMM file system fault tolerance poses
• NOVA-Fortis provides fault tolerance with high performance
  – Outperforms DAX-aware file systems without reliability features by 1.5x on average
  – Outperforms other reliable file systems by 3x on average
Give it a try: https://github.com/NVSL/linux-nova
Thanks!
Backup slides
Hybrid DRAM/NVMM system
• Non-volatile main memory (NVMM)
  – PCM, STT-RAM, ReRAM, 3D XPoint technology
• File system for NVMM
[Figure: Host CPU, NVMM FS, DRAM, NVMM]