

  1. LightNVM: The Linux Open-Channel SSD Subsystem. Matias Bjørling (ITU, CNEX Labs), Javier González (CNEX Labs), Philippe Bonnet (ITU)

  2. 0% Writes - Read Latency: plot of 4K random read latency by percentile.

  3. 20% Writes - Read Latency: plot of 4K random read latency by percentile under a mixed 4K random read / 4K random write workload. Significant outliers: worst-case read latency is 30x higher, up to 4 ms.

  4. NAND Capacity Continues to Grow: diagram of multiple workloads (#1 through #4) consolidated onto a single solid-state drive, raising concerns about performance, endurance, and DRAM overheads. Source: William Tidwell, "The Harder Alternative: Managing NAND Capacity in the 3D Age".

  5. What contributes to outliers? Even if reads and writes do not collide at the application level, indirection and a narrow storage interface cause outliers.
     Host: log-on-log. A log-structured database (e.g., RocksDB) issues pread/pwrite through the VFS to a log-structured file system; each layer performs its own space and metadata management, address mapping, and garbage collection.
     Device: write indirection and unknown state. Behind the read/write/trim block interface, the SSD pipeline (write buffer, NAND controller, dies 0-3) again maps logical data to physical locations with buffered writes, on a best-effort basis, repeating space and metadata management, address mapping, and garbage collection in hardware.
     The host is oblivious to physical data placement due to the indirection and is unable to align data logically, which increases write amplification and adds extra garbage collection.

  6. Open-Channel SSDs
     • I/O Isolation: provide isolation between tenants by allocating independent parallel units.
     • Predictable Latency: I/Os are synchronous, and access times to parallel units are explicitly defined.
     • Data Placement & I/O Scheduling: manage the non-volatile memory as a block device, through a file system, or inside your application.

  7. Solid-State Drives
     • Host interface: read/write. Media interface: read/write/erase, with tens of parallel units spread across channels (read 50-100us, write 1-5ms, erase 3-15ms).
     • Media controller responsibilities: the Flash Translation Layer maps R/W/E to R/W, handles media errors, manages media retention, and manages media constraints (ECC, RAID, retention).

  8. Rebalance the Storage Interface
     Expose device parallelism
     • Parallel units (LUNs) are exposed as independent units to the host.
     • Can be a logical or a physical representation.
     • Explicit performance characteristics.
     Log-structured storage
     • Exposes storage as chunks that must be written sequentially.
     • Similar to the HDD Shingled Magnetic Recording (SMR) interface.
     • No need for internal garbage collection by the device.
     Integrates with file systems and databases, and can also implement I/O determinism, streams, barriers, and other new data management schemes without changing device firmware.

  9. Specification
     Device model
     • Defines parallel units and how they are laid out in the LBA address space.
     • Defines chunks. Each chunk is a range of LBAs where writes must be sequential. To write again, a chunk must be reset.
       - A chunk can be in one of four states (free/open/closed/offline).
       - If a chunk is open, a write pointer is associated with it.
       - The model is media-agnostic.
     Geometry and I/O commands
     • Read/Write/Reset, as scalars and vectors.
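
     As a rough illustration of what the host learns from the geometry report, the sketch below defines a hypothetical C structure for the reported dimensions; the field names and the capacity helper are illustrative, not taken from the specification or the kernel sources.

```c
/* Illustrative host-side view of the geometry an Open-Channel SSD might
 * report. Field names and layout are hypothetical; the actual geometry
 * command and format are defined by the OCSSD specification. */
#include <stdint.h>

struct ocssd_geometry {
    uint32_t num_groups;        /* groups (shared bus/channel) */
    uint32_t pus_per_group;     /* parallel units (LUNs) per group */
    uint32_t chunks_per_pu;     /* chunks per parallel unit */
    uint32_t sectors_per_chunk; /* logical blocks per chunk */
    uint32_t sector_size;       /* logical block size in bytes, e.g. 4096 */
    uint32_t min_write_sectors; /* minimum write size, in sectors */
};

/* Total capacity in bytes, derived from the reported geometry. */
static inline uint64_t ocssd_capacity(const struct ocssd_geometry *g)
{
    return (uint64_t)g->num_groups * g->pus_per_group *
           g->chunks_per_pu * g->sectors_per_chunk * g->sector_size;
}
```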

  10. Drive Model - Chunks
      The logical block address space is divided into chunks (Chunk 0 ... Chunk N-1), and each chunk spans LBA 0 ... LBA N-1.
      • Reads: logical block granularity (for example 4KB).
      • Writes: minimum write size granularity; synchronous; may fail. An error marks the write bad, not the whole SSD.
      • Resets: chunk granularity; synchronous; may fail. An error only marks the chunk bad, not the whole SSD.
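
      The chunk rules above can be summed up in a small state machine. The following C sketch is a hypothetical illustration of the free/open/closed/offline states, the write pointer, the append-only write constraint, and reset/error behavior; the struct and function names are my own, not from the specification.

```c
/* Minimal sketch of per-chunk bookkeeping implied by the drive model:
 * four states, a write pointer for open chunks, append-only writes, and
 * a reset that returns the chunk to the free state. Names are illustrative. */
#include <stdbool.h>
#include <stdint.h>

enum chunk_state { CHUNK_FREE, CHUNK_OPEN, CHUNK_CLOSED, CHUNK_OFFLINE };

struct chunk {
    enum chunk_state state;
    uint32_t write_pointer;   /* next sector to write, valid while open */
    uint32_t num_sectors;     /* sectors_per_chunk from the geometry */
};

/* Writes must land exactly at the write pointer (sequential within a chunk). */
static bool chunk_write(struct chunk *c, uint32_t start, uint32_t nsectors)
{
    if (c->state == CHUNK_FREE)
        c->state = CHUNK_OPEN;            /* first write opens the chunk */
    if (c->state != CHUNK_OPEN || start != c->write_pointer)
        return false;                     /* out-of-order or invalid write */
    c->write_pointer += nsectors;
    if (c->write_pointer == c->num_sectors)
        c->state = CHUNK_CLOSED;          /* fully written */
    return true;
}

/* A reset makes the chunk writable again; a failed reset only marks this
 * chunk bad (offline), not the whole drive. */
static bool chunk_reset(struct chunk *c, bool media_error)
{
    if (c->state == CHUNK_OFFLINE)
        return false;                     /* already retired */
    if (media_error) {
        c->state = CHUNK_OFFLINE;
        return false;
    }
    c->state = CHUNK_FREE;
    c->write_pointer = 0;
    return true;
}
```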

  11. Drive Model - Organization
      The host sees parallelism across groups (a shared bus) and across the parallel units (LUNs) within each group of the SSD; each parallel unit holds chunks.
      The logical block address space is laid out hierarchically: Group 0 ... Group N-1, then PU 0 ... PU N-1 within a group, then Chunk 0 ... Chunk N-1 within a PU, then LBA 0 ... LBA N-1 within a chunk.
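
      To show how the Group > PU > Chunk > LBA nesting maps onto a flat logical block address, here is a small sketch that flattens and unflattens an address. It reuses the hypothetical ocssd_geometry struct from the sketch under slide 9 and is purely illustrative, not the addressing scheme mandated by the specification.

```c
/* Sketch of flattening the Group > PU > Chunk > sector hierarchy into a
 * single logical block address, and back. Assumes the hypothetical
 * struct ocssd_geometry from the earlier sketch. */
#include <stdint.h>

struct ocssd_addr {
    uint32_t group, pu, chunk, sector;
};

static uint64_t addr_to_lba(const struct ocssd_geometry *g,
                            const struct ocssd_addr *a)
{
    return ((((uint64_t)a->group * g->pus_per_group + a->pu)
              * g->chunks_per_pu + a->chunk)
              * g->sectors_per_chunk + a->sector);
}

static struct ocssd_addr lba_to_addr(const struct ocssd_geometry *g,
                                     uint64_t lba)
{
    struct ocssd_addr a;

    a.sector = lba % g->sectors_per_chunk;  lba /= g->sectors_per_chunk;
    a.chunk  = lba % g->chunks_per_pu;      lba /= g->chunks_per_pu;
    a.pu     = lba % g->pus_per_group;      lba /= g->pus_per_group;
    a.group  = (uint32_t)lba;
    return a;
}
```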

  12. LightNVM Subsystem Architecture
      1. NVMe device driver: detection of the Open-Channel SSD, implements the specification, exposes the geometry, vectored R/W/E, and PPA addressing.
      2. LightNVM subsystem: generic layer providing core functionality and target management.
      3. High-level I/O interfaces: a block device through a target (pblk), application integration with liblightnvm, file systems, and so on.
      The stack spans user space (applications, scalar read/write, optional file system), kernel space (pblk, the LightNVM subsystem, the NVMe device driver), and the Open-Channel SSD hardware.
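
      The architecture distinguishes scalar read/write (a contiguous LBA range) from vectored commands that carry an explicit address list, letting the host target specific parallel units in one command. The sketch below is a hypothetical illustration of that distinction only; it does not reproduce the actual kernel or liblightnvm structures.

```c
/* Hypothetical illustration of scalar vs. vectored I/O requests, as
 * distinguished in the architecture above. Not the real kernel or
 * liblightnvm API. */
#include <stdint.h>

/* Scalar: one contiguous range in the logical address space. */
struct scalar_rw {
    uint64_t start_lba;
    uint32_t num_sectors;
    void    *buf;
};

/* Vectored: an explicit list of (possibly non-contiguous) addresses, so
 * the host can spread a single command across parallel units. */
struct vector_rw {
    uint64_t *addr_list;   /* one entry per sector */
    uint32_t  num_addrs;
    void     *buf;         /* num_addrs * sector_size bytes */
};

/* A scalar request is the special case of a vector whose addresses happen
 * to be contiguous; addrs must hold s->num_sectors entries. */
static void scalar_to_vector(const struct scalar_rw *s, struct vector_rw *v,
                             uint64_t *addrs)
{
    for (uint32_t i = 0; i < s->num_sectors; i++)
        addrs[i] = s->start_lba + i;
    v->addr_list = addrs;
    v->num_addrs = s->num_sectors;
    v->buf = s->buf;
}
```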

  13. pblk - Host-Side Flash Translation Layer
      Mapping table
      • Logical block granularity (L2P table).
      Write buffering
      • Lockless circular buffer.
      • Multiple producers, single consumer (the write thread).
      Error handling
      • Device write/reset errors.
      Garbage collection
      • Refreshes data and rewrites chunks, paced by a GC/rate-limiting thread.
      Read path: look up the L2P table; on a cache hit, serve the read from the write buffer, otherwise read from the device. Write path: make_rq adds an entry to the write buffer, and the write thread drains it through the LightNVM subsystem and NVMe device driver down to the Open-Channel SSD.
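
      pblk's write buffer is described above as a lockless circular buffer with multiple producers and a single consumer. The following is a simplified sketch of that pattern using C11 atomics; the names, the fixed sizes, and the omission of full-buffer handling and rate limiting are my own simplifications, not pblk's actual implementation.

```c
/* Simplified multi-producer, single-consumer ring buffer in the spirit of
 * the write buffer described above. Assumes the structure is zero-initialized;
 * overflow handling and rate limiting are omitted. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RB_ENTRIES  1024           /* hypothetical slot count */
#define SECTOR_SIZE 4096

struct rb_entry {
    atomic_bool ready;             /* producer sets this once the slot is filled */
    uint64_t lba;                  /* logical block address being buffered */
    uint8_t data[SECTOR_SIZE];
};

struct write_buffer {
    atomic_uint_fast64_t head;     /* next slot to reserve (many producers) */
    uint64_t tail;                 /* next slot to drain (single write thread) */
    struct rb_entry entries[RB_ENTRIES];
};

/* Producer: reserve a slot with one atomic increment, fill it, then publish. */
static void wb_produce(struct write_buffer *wb, uint64_t lba, const void *buf)
{
    uint64_t slot = atomic_fetch_add(&wb->head, 1) % RB_ENTRIES;
    struct rb_entry *e = &wb->entries[slot];

    e->lba = lba;
    memcpy(e->data, buf, SECTOR_SIZE);
    atomic_store_explicit(&e->ready, true, memory_order_release);
}

/* Consumer: the single write thread drains slots in order, waiting for each
 * one to be published before handing it to the device write path. */
static bool wb_consume(struct write_buffer *wb, uint64_t *lba, void *buf)
{
    struct rb_entry *e = &wb->entries[wb->tail % RB_ENTRIES];

    if (!atomic_load_explicit(&e->ready, memory_order_acquire))
        return false;              /* nothing committed at the tail yet */

    *lba = e->lba;
    memcpy(buf, e->data, SECTOR_SIZE);
    atomic_store_explicit(&e->ready, false, memory_order_relaxed);
    wb->tail++;
    return true;
}
```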

  14. Experimentation
      • Drive: CNEX Labs Open-Channel SSD, NVMe, Gen3 x8, 2TB MLC NAND, implementing the Open-Channel 1.2 specification.
      • Parallelism: 16 channels, 8 parallel units per channel (128 PUs in total).
      • Parallel unit characteristics: minimum write size 16K + 64B OOB; 1,067 chunks; chunk size 16MB.
      • Throughput per parallel unit: write 47MB/s; read 108MB/s (4K), 280MB/s (64K).

  15. Base Performance - Throughput and Latency: plots of throughput and latency versus request I/O size. Random read is slightly lower, and performance grows with parallelism.

  16. Limit # of Active Writers
      Limiting the number of writers improves read latency. The plots compare single read or write performance against a mixed read/write workload at a 200MB/s write rate (256K writes at QD1, 256K reads at QD16): write latency increases while read latency is reduced. This requires a priori knowledge of the workload.

  17. Multi-Tenant Workloads: NVMe SSD versus OCSSD with 2 tenants (1W/1R), 4 tenants (3W/1R), and 8 tenants (7W/1R). Source: Javier González and Matias Bjørling, "Multi-Tenant I/O Isolation with Open-Channel SSDs", NVMW '17.

  18. Lessons Learned
      1. Warranty to end users: users have direct access to the media.
      2. Media characterization is complex and must be performed for each type of NAND memory. Abstract the media behind a "clean" interface.
      3. Write buffering: for MLC/TLC media, write buffering is required. Decide whether it lives in the host or in the device.
      4. Application-agnostic wear leveling is mandatory. Expose statistics so the host can make appropriate decisions.

  19. Conclusion
      Contributions
      • A new storage interface between host and drive.
      • The Linux kernel LightNVM subsystem.
      • pblk: a host-side Flash Translation Layer for Open-Channel SSDs.
      • Demonstration of an Open-Channel SSD.
      LightNVM timeline
      • Initial release of the subsystem with Linux kernel 4.4 (January 2016).
      • User-space library (liblightnvm) support upstream in Linux kernel 4.11 (April 2017).
      • pblk available in Linux kernel 4.12 (July 2017).
      • Open-Channel SSD 2.0 specification released (January 2018), with support available from Linux kernel 4.17 (May 2018).

  20. Thank You
