NVMe-over-Fabrics Performance Characterization and the Path to Low-Overhead Flash Disaggregation
Zvika Guz, Harry Li, Anahita Shayesteh, and Vijay Balakrishnan
Memory Solution Lab, Samsung Semiconductor Inc.
Synopsis
Performance characterization of NVMe-oF in the context of Flash disaggregation
Overview
– NVMe and NVMe-over-Fabrics
– Flash disaggregation
Performance characterization
– Stress-testing remote storage
– Disaggregating RocksDB
Summary
Non-Volatile Memory Express (NVMe)
A storage protocol standard on top of PCIe:
– Standardizes access to local non-volatile memory over PCIe
The predominant protocol for PCIe-based SSD devices
– NVMe-SSDs connect through PCIe and support the standard
High performance through parallelization:
– Large number of deep submission/completion queues
NVMe-SSDs deliver high IOPS and bandwidth
– 1M IOPS, 6 GB/s from a single device
– 5x more than SAS-SSD, 20x more than SATA-SSD
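As a concrete illustration (not part of the original deck), a minimal sketch that enumerates local NVMe devices through nvme-cli's JSON output; the `nvme` binary and the `Devices`/`DevicePath`/`ModelNumber` field names are assumptions about the installed nvme-cli version.

```python
# Minimal sketch: enumerate local NVMe devices using nvme-cli's JSON output.
# Assumes nvme-cli is installed; JSON field names can vary across versions.
import json
import subprocess

def list_nvme_devices():
    out = subprocess.run(["nvme", "list", "-o", "json"],
                         capture_output=True, text=True, check=True).stdout
    devices = json.loads(out).get("Devices", [])
    for dev in devices:
        # "DevicePath"/"ModelNumber" are the usual nvme-cli keys (version-dependent)
        print(dev.get("DevicePath"), dev.get("ModelNumber"))

if __name__ == "__main__":
    list_nvme_devices()
```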
Storage Disaggregation
Separates compute and storage into different nodes
– Storage is accessed over a network rather than locally
Enables independent resource scaling
– Allows flexible infrastructure tuning to dynamic loads
– Reduces resource underutilization
– Improves cost-efficiency by eliminating waste
Remote access introduces overheads
– Additional interconnect latencies
– Network/protocol processing affects both storage and compute nodes
HDD disaggregation is common in datacenters
– HDDs are so slow that these overheads are negligible
Flash Disaggregation
NVMe disaggregation is more challenging
– ~90 μs device latency: network/protocol latencies are more pronounced
– ~1M IOPS: protocol overheads tax the CPU and degrade performance
Flash disaggregation via iSCSI is difficult:
– iSCSI "introduces 20% throughput drop at the application level"*
– Even then, it can still be a cost-efficiency win
We show that these overheads go away with NVMe-oF

* A. Klimovic, C. Kozyrakis, E. Thereska, B. John, and S. Kumar, "Flash Storage Disaggregation," EuroSys'16
NVMe-oF: NVMe-over-Fabrics
Recent extension of the NVMe standard
– Enables access to remote NVMe devices over different network fabrics
Maintains the current NVMe architecture, and:
– Adds support for message-based NVMe operations
Advantages:
– Parallelism: extends the multiple queue-pair design of NVMe
– Efficiency: eliminates protocol translations along the I/O path
– Performance
Supported fabrics:
– RDMA: InfiniBand, iWARP, RoCE
– Fibre Channel, FCoE
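For orientation, a hedged sketch of how a host could attach a remote NVMe-oF subsystem over RDMA with nvme-cli; the address, port, and NQN below are placeholders, not values from this setup.

```python
# Minimal sketch: attach a remote NVMe-oF subsystem over RDMA using nvme-cli.
# The target address and NQN are hypothetical; 4420 is the conventional
# NVMe-oF service port. Requires root and the nvme-rdma kernel module.
import subprocess

TARGET_ADDR = "192.168.0.10"                 # hypothetical storage-server IP
TARGET_NQN = "nqn.2016-06.example:subsys1"   # hypothetical subsystem NQN

def discover_and_connect():
    # Discover the subsystems exported by the target
    subprocess.run(["nvme", "discover", "-t", "rdma",
                    "-a", TARGET_ADDR, "-s", "4420"], check=True)
    # Connect; the remote namespace then appears as a local /dev/nvmeXnY device
    subprocess.run(["nvme", "connect", "-t", "rdma",
                    "-a", TARGET_ADDR, "-s", "4420",
                    "-n", TARGET_NQN], check=True)

if __name__ == "__main__":
    discover_and_connect()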
Methodology
Three configurations:
1. Baseline: local, direct-attached storage (DAS)
2. Remote storage with NVMe-oF over RoCEv2
3. Remote storage with iSCSI
– Followed best-known-methods for tuning
Hardware setup:
– 3 host servers (a.k.a. compute nodes, or datastore servers)
  • Dual-socket Xeon E5-2699
– 1 target server (a.k.a. storage server)
  • Quad-socket Xeon E7-8890
– 3x Samsung PM1725 NVMe-SSDs
  • Random: 750/120 KIOPS read/write
  • Sequential: 3000/2000 MB/s read/write
– Network:
  • ConnectX-4 100Gb Ethernet NICs with RoCE support
  • 100Gb top-of-rack switch
[Diagrams: baseline direct-attached (DAS) setup and remote storage setup]
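A minimal sketch of the kind of fio sweep used for the stress tests that follow: 4KB random I/O at several read/write mixes against a block device (local or remotely attached). The device path, queue depth, job count, and runtime are illustrative assumptions, not the deck's exact parameters.

```python
# Minimal sketch: sweep 4KB random read/write mixes with fio and report IOPS.
# Assumes fio is installed and /dev/nvme0n1 is the device under test.
import json
import subprocess

DEVICE = "/dev/nvme0n1"            # hypothetical device (DAS, NVMe-oF, or iSCSI)
READ_MIXES = [100, 80, 50, 20, 0]  # read percentage, as in the throughput charts

def run_mix(read_pct):
    cmd = ["fio", "--name=randmix", f"--filename={DEVICE}",
           "--ioengine=libaio", "--direct=1", "--rw=randrw",
           f"--rwmixread={read_pct}", "--bs=4k",
           "--iodepth=32", "--numjobs=8", "--group_reporting",
           "--time_based", "--runtime=60", "--output-format=json"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    job = json.loads(out)["jobs"][0]
    return job["read"]["iops"] + job["write"]["iops"]

if __name__ == "__main__":
    for mix in READ_MIXES:
        print(f"{mix}/{100 - mix} read/write: {run_mix(mix):,.0f} IOPS")
```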
Maximum Throughput
NVMe-oF throughput is the same as DAS
– iSCSI cannot keep up at high IOPS rates
[Chart: 4KB random traffic throughput — IOPS for DAS, NVMf, and iSCSI across read/write mixes of 100/0, 80/20, 50/50, 20/80, and 0/100; a 40% gap is annotated]
Host CPU Overheads
NVMe-oF CPU processing overheads are minimal
– iSCSI adds significant load on the host (30%)
  • Even when performance is on par with DAS
[Chart: host CPU utilization [%] for DAS, NVMf, and iSCSI across read/write mixes of 100/0, 80/20, 50/50, 20/80, and 0/100]
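One way host CPU utilization during a run could be sampled is from /proc/stat; this is an illustrative assumption (the deck does not state which tool was used), shown here as a short sketch.

```python
# Minimal sketch: sample aggregate CPU utilization from /proc/stat over an
# interval. Utilization = 1 - (idle + iowait delta) / total delta.
import time

def read_cpu_times():
    with open("/proc/stat") as f:
        values = list(map(int, f.readline().split()[1:]))  # aggregate "cpu" line
    idle = values[3] + values[4]                            # idle + iowait
    return idle, sum(values)

def cpu_utilization(interval_s=1.0):
    idle0, total0 = read_cpu_times()
    time.sleep(interval_s)
    idle1, total1 = read_cpu_times()
    busy = (total1 - total0) - (idle1 - idle0)
    return 100.0 * busy / (total1 - total0)

if __name__ == "__main__":
    print(f"CPU utilization: {cpu_utilization():.1f}%")
```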
Storage Server CPU Overheads
CPU processing on the target is limited
– 90% of DAS read-only throughput with 1/12th of the cores
Cost-efficiency win: fewer cores per NVMe-SSD in the storage server
[Chart: IOPS (bars) and target CPU utilization [%] (lines) at 100/0 and 80/20 read/write mixes for DAS, NVMf with 32/16/8 cores, and iSCSI with 32/16/8 cores; a 2.4x gap is annotated]
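One possible way to emulate a storage server with fewer cores is to offline CPUs through sysfs; this is an assumption for illustration only (the deck does not describe the mechanism used to restrict the target to 8/16/32 cores).

```python
# Minimal sketch: restrict the number of online CPUs via sysfs hot-offlining.
# Requires root; cpu0 typically has no 'online' file and stays online.
import glob

def set_online_cpus(num_cores):
    cpus = sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*"),
                  key=lambda p: int(p.rsplit("cpu", 1)[1]))
    for idx, path in enumerate(cpus):
        try:
            with open(f"{path}/online", "w") as f:
                f.write("1" if idx < num_cores else "0")
        except FileNotFoundError:
            pass  # cpu0 has no 'online' file on most systems

if __name__ == "__main__":
    set_online_cpus(8)   # e.g., emulate an 8-core storage server
```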
Latency Under Load
NVMe-oF latencies are the same as DAS for all practical loads
– Both average and tail
iSCSI:
– Saturates sooner
– 10x slower even under light loads
[Chart: 4KB random read load-latency curves — average and 95th-percentile latency [usec] vs. IOPS for DAS, NVMf, and iSCSI]
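For reference, a small sketch of how average and 95th-percentile latency can be computed from a fio per-I/O latency log (produced with fio's --write_lat_log option); the file name and the assumption that the second column holds the latency value depend on the fio version in use.

```python
# Minimal sketch: average and 95th-percentile latency from a fio latency log.
# fio lat logs are CSV-like: time, latency, direction, block size, ...
import numpy as np

def latency_stats(log_path="randread_lat.1.log"):
    lat = np.loadtxt(log_path, delimiter=",", usecols=1)
    return lat.mean(), np.percentile(lat, 95)

if __name__ == "__main__":
    avg, p95 = latency_stats()
    print(f"avg = {avg:.1f}, p95 = {p95:.1f} (units as logged by fio)")
```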
[Chart: the same 4KB random read load-latency data, zoomed to 0–1,200 usec]
KV-Store Disaggregation (1/3)
Evaluated using RocksDB, driven with db_bench
– 3 hosts
– 3 RocksDB instances per host
– 800B and 10KB objects
– 80/20 read/write mix
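A hedged sketch of driving one RocksDB instance with db_bench in a mixed 80/20 read/write workload, roughly mirroring the described setup; the database path, key count, duration, and thread count are illustrative, and flag names follow the stock db_bench tool but may differ across RocksDB versions.

```python
# Minimal sketch: run db_bench with an 80/20 read/write mix against one
# RocksDB instance stored on the device under test.
import subprocess

def run_db_bench(db_path="/mnt/nvme/rocksdb0", value_size=800):
    cmd = ["db_bench",
           f"--db={db_path}",
           "--benchmarks=readrandomwriterandom",
           "--readwritepercent=80",       # 80% reads, 20% writes
           f"--value_size={value_size}",  # 800B or 10KB objects
           "--num=100000000",             # illustrative key count
           "--threads=32",
           "--duration=300"]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_db_bench()                   # 800B objects
    run_db_bench(value_size=10240)   # 10KB objects
```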
KV-Store Disaggregation (2/3)
NVMe-oF performance is on par with DAS
– 2% throughput difference
  • vs. 40% performance degradation for iSCSI
[Charts: RocksDB operations per second for DAS, NVMf, and iSCSI at 800B and 10KB object sizes; disk bandwidth over time on the target]
KV-Store Disaggregation (3/3)
NVMe-oF performance is on par with DAS
– 2% throughput difference
  • vs. 40% performance degradation for iSCSI
– Average latency increases by 11%, tail latency by 2%
  • Average latency: 507 μs → 568 μs
  • 99th percentile: 3.6 ms → 3.7 ms
– 10% CPU utilization overhead on the host
[Chart: read latency CDF for DAS and NVMf — percentage vs. latency [us]]
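To illustrate how a comparison like the read-latency CDF on this slide can be produced, a short sketch that builds empirical CDFs from two sets of per-I/O latency samples; the input file names and one-value-per-line format are assumptions.

```python
# Minimal sketch: empirical latency CDFs for two configurations (e.g., DAS vs. NVMf).
import numpy as np
import matplotlib.pyplot as plt

def plot_cdf(path, label):
    lat = np.sort(np.loadtxt(path))                 # one latency (usec) per line
    cdf = np.arange(1, len(lat) + 1) / len(lat)
    plt.plot(lat, cdf * 100, label=label)

if __name__ == "__main__":
    plot_cdf("das_read_lat_us.txt", "DAS")          # hypothetical sample files
    plot_cdf("nvmf_read_lat_us.txt", "NVMf")
    plt.xlabel("Latency [us]")
    plt.ylabel("Percentage")
    plt.legend()
    plt.savefig("read_latency_cdf.png")
```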
Summary
NVMe-oF reduces remote storage overheads to a bare minimum
– Negligible throughput difference, similar latency
– Low processing overheads on both host and target
  • Applications (host) get the same performance
  • The storage server (target) can support more drives with fewer cores
NVMe-oF makes disaggregation more viable
– No need to offset iSCSI's >20% performance loss

Thank You!
zvika.guz@samsung.com
Backup
Unloaded Latency Breakdown
NVMe-oF adds 11.7 μs over the DAS access latency
– Close to the 10 μs spec target
4K unloaded read latency [usec]:
– NVMe DAS path: 81.6
– Others: 1.52
– NVMf target modules: 4.57
– NVMf host modules: 3.25
– Fabric: 2.43
[Diagram: host-side and target-side I/O paths — fio on /dev/nvmeXnY → VFS/file system → block layer → NVMe_Core → NVMe_RDMA → RDMA stack → fabric → NVMeT_RDMA → NVMeT_Core → NVMe_Core → NVMe_PCI on the target]
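As a quick sanity check on the breakdown (an observation, not part of the original deck), the NVMe-oF additions sum to roughly the quoted 11.7 μs:

```python
# The non-DAS components of the breakdown sum to ~11.8 us, matching the
# ~11.7 us overhead quoted on the slide.
overheads_us = {"Others": 1.52, "NVMf target modules": 4.57,
                "NVMf host modules": 3.25, "Fabric": 2.43}
print(sum(overheads_us.values()))  # 11.77
```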
FAQ #1: SPDK
Storage Performance Development Kit (SPDK)
– Provides user-mode storage drivers
  • NVMe, NVMe-oF target, and NVMe-oF host
– Better performance through:
  • Eliminating kernel context switches
  • Polling rather than interrupts
SPDK will improve NVMe-oF performance
– BUT, it was not stable enough for our setup
For unloaded latency:
– An SPDK target further reduces the latency overhead
– The SPDK-local → SPDK-target overhead is similar to the local → NVMe-oF overhead
[Chart: unloaded latency [usec] for DAS, SPDK DAS, NVMf, and NVMf with an SPDK target; gaps of 11.7 μs and 8.9 μs are annotated]
FAQ #2: Hyper-convergence vs. Disaggregation
Hyper-converged Infrastructure (HCI)
– Software-defined approach
– Bundles commodity servers into a clustered pool
– Abstracts the underlying hardware into a virtualized computing platform
We focus on web-scale data centers
– Disaggregation fits well within their deployment model
  • Several classes of servers, some of which are storage-centric
  • Already disaggregate HDDs
NVMe-oF, HCI, and disaggregation are not mutually exclusive
– HCI on top of NVMe-oF
– Hybrid architectures