Strata: A Cross Media File System
Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson
Let’s build a fast server
NoSQL store, database, file server, mail server, …
Requirements:
• Small updates (1 KB) dominate
• Dataset scales up to 10 TB
• Updates must be crash consistent
Storage diversification

               Latency   $/GB
  DRAM         100 ns    8.6
  NVM (soon)   300 ns    4.0
  SSD          10 us     0.25
  HDD          10 ms     0.02
  (top of table: better performance; bottom: higher capacity)

• NVM is byte-addressable: cache-line granularity IO
• SSD: large erasure blocks must be written sequentially; random writes suffer a 5–6x slowdown due to GC [FAST ’15]
A fast server on today’s file system
Requirement: small updates (1 KB) dominate
Kernel file system: NOVA [FAST ’16, SOSP ’17]
• Small, random IO is slow!
• [Chart: latency of a 1 KB write on NVM (0–6 us), split into kernel code vs. write to device; kernel code accounts for 91%]
• NVM is so fast that the kernel is the bottleneck
A fast server on today’s file system
Requirement: dataset scales up to 10 TB
• Need huge capacity, but NVM alone is too expensive ($40K for 10 TB)
• For low-cost capacity with high performance, must leverage multiple device types
A fast server on today’s file system
Requirement: dataset scales up to 10 TB
Kernel file system with block-level caching over NVM, SSD, and HDD:
• Block-level caching manages data in blocks, but NVM is byte-addressable
• Extra level of indirection
• [Chart: 1 KB IO latency (0–12 us), NOVA vs. block-level caching; lower is better]
• Block-level caching is too slow
A fast server on today’s file system
Requirement: updates must be crash consistent
• [Chart: crash vulnerabilities (0–10) found in SQLite, HDFS, ZooKeeper, LevelDB, HSQLDB, Mercurial, and Git; Pillai et al., OSDI 2014]
• Applications struggle for crash consistency
Problems in today’s file systems
• Kernel mediates every operation: NVM is so fast that the kernel is the bottleneck
• Tied to a single type of device: for low-cost capacity with high performance, must leverage multiple device types (NVM (soon), SSD, HDD)
• Aggressive caching in DRAM, writing to the device only when you must (fsync): applications struggle for crash consistency
Strata: A Cross Media File System
• Performance, especially for small, random IO: fast user-level device access
• Low-cost capacity: leverage NVM, SSD & HDD; transparent data migration across different storage media; efficiently handle device IO properties
• Simplicity: intuitive crash consistency model; in-order, synchronous IO; no fsync() required
Strata: main design principle
• LibFS: log operations to NVM at user level
  - Performance: kernel bypass, but the log is private
  - Simplicity: intuitive crash consistency
• KernelFS: digest and migrate data in kernel
  - Coordinates multi-process accesses
  - Applies log operations to shared data
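The split above can be sketched as a toy Python simulation: LibFS appends file operations to a private log without crossing into the kernel, and KernelFS later digests that log, applying the operations to shared file-system state. The names LibFS, KernelFS, and digest follow the talk; the data structures (a Python list for the log, a dict for shared files) are invented for illustration and are not Strata’s actual layout.

```python
# Toy sketch of Strata's LibFS/KernelFS split (structures invented here).

class LibFS:
    """User-level library: appends operations to a private log (stands in for NVM)."""
    def __init__(self):
        self.log = []  # private operation log; appends need no kernel call

    def write(self, path, data):
        self.log.append(("write", path, data))  # sequential, blind append

    def rename(self, old, new):
        self.log.append(("rename", old, new))   # metadata ops are logged too

class KernelFS:
    """Kernel component: digests logs, applying them to shared state."""
    def __init__(self):
        self.files = {}  # shared file-system state

    def digest(self, log):
        # Apply logged operations in order; afterwards the log can be truncated.
        for op, a, b in log:
            if op == "write":
                self.files[a] = self.files.get(a, b"") + b
            elif op == "rename":
                self.files[b] = self.files.pop(a)
        log.clear()

libfs, kernelfs = LibFS(), KernelFS()
libfs.write("/a", b"hello")       # fast path: private log append only
libfs.rename("/a", "/b")
kernelfs.digest(libfs.log)        # background: kernel applies the log
print(kernelfs.files)             # → {'/b': b'hello'}
```

Digest happens off the critical path, which is why the append-only fast path stays kernel-free.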
Outline
• LibFS: log operations to NVM at user level
  - Fast user-level access
  - In-order, synchronous IO
• KernelFS: digest and migrate data in kernel
  - Asynchronous digest
  - Transparent data migration
  - Shared file access
• Evaluation
Log operations to NVM at user level
Unmodified application → POSIX API → Strata LibFS → private operation log in NVM
(logged file operations, data & metadata: creat, write, rename, …)
• Fast writes
  - Directly access fast NVM (kernel bypass)
  - Sequentially append data
  - Cache-line granularity
  - Blind writes
• Crash consistency
  - On crash, kernel replays the log
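The replay-on-crash idea can be illustrated with a toy sketch. The assumption here (invented for illustration) is that each log entry carries a commit flag persisted after its payload, standing in for the ordered cache-line writes a real NVM log would use: recovery replays entries in order and stops at the first uncommitted one, so a tail torn by a crash is simply discarded.

```python
# Toy sketch of crash recovery by log replay (commit-flag scheme assumed).

def append(log, op, committed=True):
    # Real NVM logs persist the payload first, then set the commit flag,
    # so a crash mid-append leaves at most one uncommitted tail entry.
    log.append({"op": op, "commit": committed})

def replay(log):
    state = []
    for entry in log:
        if not entry["commit"]:
            break              # torn tail: ignore everything from here on
        state.append(entry["op"])
    return state

log = []
append(log, ("write", "/a", b"x"))
append(log, ("write", "/a", b"y"))
append(log, ("write", "/a", b"z"), committed=False)  # simulated crash mid-append
print(replay(log))  # only the two committed ops survive, in order
```

Because replay stops at the first uncommitted entry, the surviving operations are always an in-order prefix of what the application issued.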
Intuitive crash consistency
Synchronous IO (kernel bypass): when each system call returns,
• data/metadata is durable
• updates are applied in order
• writes are atomic, up to a limited size (the log size)
fsync() is a no-op
Fast synchronous IO: NVM and kernel bypass
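A hypothetical wrapper makes the resulting API contract concrete: every write is already durable when the call returns, so fsync() has nothing left to flush. The class and its persistence are invented for illustration; real Strata persists the log entry to NVM before the syscall returns.

```python
# Toy sketch of Strata's synchronous-IO contract (class invented here).

class StrataFile:
    def __init__(self):
        self.durable_log = []  # stands in for the persisted NVM log

    def write(self, data):
        # The whole entry is durable, atomic, and in program order
        # before this call returns (persistence is simulated).
        self.durable_log.append(bytes(data))
        return len(data)

    def fsync(self):
        return 0  # no-op: every prior write was already synchronous

f = StrataFile()
f.write(b"update-1")
f.write(b"update-2")
f.fsync()                  # nothing to flush
print(f.durable_log)       # → [b'update-1', b'update-2']
```

Contrast with a page-cache file system, where data is only guaranteed durable after an explicit fsync() succeeds.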