NVMe-over-Fabrics Performance Characterization and the Path to Low-Overhead Flash Disaggregation



  1. NVMe-over-Fabrics Performance Characterization and the Path to Low-Overhead Flash Disaggregation
     Zvika Guz, Harry Li, Anahita Shayesteh, and Vijay Balakrishnan
     Memory Solution Lab, Samsung Semiconductor Inc.

  2. Synopsis
     Performance characterization of NVMe-oF in the context of Flash disaggregation
     - Overview
       – NVMe and NVMe-over-Fabrics
       – Flash disaggregation
     - Performance characterization
       – Stress-testing remote storage
       – Disaggregating RocksDB
     - Summary

  3. Non-Volatile Memory Express (NVMe)
     - A storage protocol standard on top of PCIe
       – Standardizes access to local non-volatile memory over PCIe
     - The predominant protocol for PCIe-based SSD devices
       – NVMe-SSDs connect through PCIe and support the standard
     - High performance through parallelization
       – Large number of deep submission/completion queues
     - NVMe-SSDs deliver high IOPS and bandwidth
       – 1M IOPS, 6 GB/s from a single device
       – 5x more than SAS-SSD, 20x more than SATA-SSD
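
The queue-depth requirement behind these numbers follows from Little's Law (outstanding I/Os = IOPS x latency). A minimal Python sketch, reusing the ~1M IOPS and ~90 μs figures that appear in this deck; the helper itself is illustrative and not part of the original material:

```python
# Little's Law: outstanding I/Os = IOPS * latency (in seconds).
# Sustaining ~1M IOPS at ~90 us per I/O needs ~90 requests in flight,
# which is why NVMe exposes many deep submission/completion queues.

def required_queue_depth(target_iops: float, latency_us: float) -> float:
    """Outstanding I/Os needed to sustain target_iops at the given per-I/O latency."""
    return target_iops * (latency_us / 1e6)

if __name__ == "__main__":
    print(required_queue_depth(1_000_000, 90))  # -> 90.0 outstanding I/Os
```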

  4. Storage Disaggregation
     - Separates compute and storage onto different nodes
       – Storage is accessed over a network rather than locally
     - Enables independent resource scaling
       – Allows flexible infrastructure tuning for dynamic loads
       – Reduces resource underutilization
       – Improves cost-efficiency by eliminating waste
     - Remote access introduces overheads
       – Additional interconnect latencies
       – Network/protocol processing loads both the storage and compute nodes
     - HDD disaggregation is already common in datacenters
       – HDDs are slow enough that these overheads are negligible

  5. Storage Flash Disaggregation
     - NVMe disaggregation is more challenging
       – ~90 μs latency → network/protocol latencies are more pronounced
       – ~1M IOPS → protocol overheads tax the CPU and degrade performance
     - Flash disaggregation via iSCSI is difficult
       – iSCSI "introduces 20% throughput drop at the application level"*
       – Even then, it can still be a cost-efficiency win
     - We show that these overheads go away with NVMe-oF

     * A. Klimovic, C. Kozyrakis, E. Thereska, B. John, and S. Kumar, "Flash Storage Disaggregation," EuroSys '16

  6. NVMe-oF: NVMe-over-Fabrics
     - Recent extension of the NVMe standard
       – Enables access to remote NVMe devices over different network fabrics
     - Maintains the current NVMe architecture, and:
       – Adds support for message-based NVMe operations
     - Advantages:
       – Parallelism: extends the multi-queue-pair design of NVMe
       – Efficiency: eliminates protocol translations along the I/O path
       – Performance
     - Supported fabrics:
       – RDMA: InfiniBand, iWARP, RoCE
       – Fibre Channel, FCoE
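
For orientation, this is roughly how a host attaches a remote namespace exported over NVMe-oF/RDMA using the standard nvme-cli tool. The address and subsystem NQN below are placeholders rather than values from the deck (a minimal sketch, assuming nvme-cli and the nvme-rdma module are available):

```python
import subprocess

TARGET_ADDR = "192.168.1.100"                      # placeholder target IP
TARGET_NQN = "nqn.2016-06.io.example:nvme-target"  # placeholder subsystem NQN

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Discover subsystems exported by the target over RDMA on the default port 4420.
run(["nvme", "discover", "-t", "rdma", "-a", TARGET_ADDR, "-s", "4420"])

# Connect; the remote namespace then appears locally as /dev/nvmeXnY.
run(["nvme", "connect", "-t", "rdma", "-n", TARGET_NQN, "-a", TARGET_ADDR, "-s", "4420"])

# Verify the new block device.
run(["nvme", "list"])
```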

  7. Methodology
     - Three configurations:
       1. Baseline: local, direct-attached storage (DAS)
       2. Remote storage with NVMe-oF over RoCEv2
       3. Remote storage with iSCSI
       – Followed best-known methods for tuning
     - Hardware setup:
       – 3 host servers (a.k.a. compute nodes, or datastore servers)
         • Dual-socket Xeon E5-2699
       – 1 target server (a.k.a. storage server)
         • Quad-socket Xeon E7-8890
       – 3x Samsung PM1725 NVMe-SSDs
         • Random: 750/120 KIOPS read/write
         • Sequential: 3000/2000 MB/s read/write
       – Network: ConnectX-4 100Gb Ethernet NICs with RoCE support; 100Gb top-of-rack switch
     [Figures: baseline direct-attached (DAS) setup and remote storage setup]
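
The latency breakdown in the backup slides names the kernel NVMeT modules, so the target side presumably ran the Linux kernel NVMe-oF target. The sketch below shows the generic way such a target exports a namespace over RDMA through configfs; the NQN, address, and device path are placeholders, not the authors' configuration:

```python
import pathlib

CFG = pathlib.Path("/sys/kernel/config/nvmet")
NQN = "nqn.2016-06.io.example:testsubsys"   # placeholder subsystem NQN

def export_namespace(device_path: str, traddr: str, trsvcid: str = "4420") -> None:
    """Export one block device over NVMe-oF/RDMA via the kernel target's configfs tree."""
    # Requires root and the nvmet + nvmet-rdma modules (e.g. `modprobe nvmet-rdma`).
    subsys = CFG / "subsystems" / NQN
    subsys.mkdir(parents=True)
    (subsys / "attr_allow_any_host").write_text("1")

    ns = subsys / "namespaces" / "1"
    ns.mkdir(parents=True)
    (ns / "device_path").write_text(device_path)
    (ns / "enable").write_text("1")

    port = CFG / "ports" / "1"
    port.mkdir(parents=True)
    (port / "addr_trtype").write_text("rdma")
    (port / "addr_adrfam").write_text("ipv4")
    (port / "addr_traddr").write_text(traddr)
    (port / "addr_trsvcid").write_text(trsvcid)

    # Linking the subsystem under the port makes it visible to hosts.
    (port / "subsystems" / NQN).symlink_to(subsys)

# export_namespace("/dev/nvme0n1", "192.168.1.100")   # placeholder device and address
```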

  8. Maximum Throughput
     - NVMe-oF throughput is the same as DAS
       – iSCSI cannot keep up at high IOPS rates
     [Chart: 4KB random traffic throughput (IOPS, up to ~2.5M) for DAS, NVMf, and iSCSI across read/write mixes 100/0, 80/20, 50/50, 20/80, and 0/100; a 40% gap between iSCSI and the others is annotated]
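
The deck does not list the exact load-generator settings behind this chart; the following hedged fio-based sketch shows one plausible way to sweep the same 4KB random read/write mixes and collect aggregate IOPS. Device path, queue depth, job count, and the JSON field names (which vary across fio versions) are assumptions:

```python
import json
import subprocess

def fio_randrw(device: str, read_pct: int, runtime_s: int = 60,
               iodepth: int = 32, numjobs: int = 8) -> list[str]:
    """Build a 4KB random read/write fio command for the given read percentage."""
    return [
        "fio", "--name=nvmf-sweep", f"--filename={device}",
        "--ioengine=libaio", "--direct=1",
        "--rw=randrw", f"--rwmixread={read_pct}", "--bs=4k",
        f"--iodepth={iodepth}", f"--numjobs={numjobs}",
        f"--runtime={runtime_s}", "--time_based",
        "--group_reporting", "--output-format=json",
    ]

def total_iops(cmd: list[str]) -> float:
    """Run fio and return aggregate read+write IOPS from its JSON output."""
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    job = json.loads(out)["jobs"][0]        # --group_reporting collapses all jobs into one entry
    return job["read"]["iops"] + job["write"]["iops"]

if __name__ == "__main__":
    # /dev/nvme1n1 is a placeholder for the local or NVMe-oF attached namespace.
    for mix in (100, 80, 50, 20, 0):        # read percentages matching the slide's mixes
        iops = total_iops(fio_randrw("/dev/nvme1n1", mix))
        print(f"{mix:3d}/{100 - mix:<3d}  {iops:,.0f} IOPS")
```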

  9. Host CPU Overheads
     - NVMe-oF CPU processing overheads on the host are minimal
       – iSCSI adds significant load on the host (~30%)
         • Even when its performance is on par with DAS
     [Chart: host CPU utilization (%), 0–45, for DAS, NVMf, and iSCSI across read/write mixes 100/0 through 0/100]
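
One simple way to record host CPU utilization alongside each benchmark run is to sample it in a background thread with the psutil package. This is an illustrative helper, not the measurement harness used for the slide (which is not described):

```python
import threading
import time

import psutil  # third-party: pip install psutil

def sample_cpu(stop: threading.Event, samples: list[float], interval_s: float = 1.0) -> None:
    """Append system-wide CPU utilization (%) once per interval until stopped."""
    while not stop.is_set():
        samples.append(psutil.cpu_percent(interval=interval_s))

stop = threading.Event()
samples: list[float] = []
sampler = threading.Thread(target=sample_cpu, args=(stop, samples))
sampler.start()

time.sleep(10)   # stand-in for the benchmark run, e.g. subprocess.run(fio_cmd)

stop.set()
sampler.join()
print(f"avg host CPU utilization: {sum(samples) / len(samples):.1f}%")
```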

  10. Storage Server CPU Overheads
      - CPU processing on the target is limited
        – 90% of DAS read-only throughput with 1/12th of the cores
      - Cost-efficiency win: fewer cores per NVMe-SSD in the storage server
      [Chart: IOPS for DAS and for NVMf/iSCSI with 32, 16, and 8 target cores, plus target CPU utilization (%) for NVMf and iSCSI, at the 100/0 and 80/20 read/write mixes; a 2.4x NVMf-over-iSCSI advantage is annotated]
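
The deck does not say how the target was restricted to 8, 16, or 32 cores; a common way to emulate such configurations on Linux is to offline the remaining CPUs through sysfs, as sketched below (an assumption about methodology, not a documented detail of the experiments; requires root):

```python
import pathlib

def set_online_cpus(num_online: int) -> None:
    """Keep CPUs 0..num_online-1 online and offline the rest (cpu0 usually cannot be offlined)."""
    for cpu_dir in pathlib.Path("/sys/devices/system/cpu").glob("cpu[0-9]*"):
        idx = int(cpu_dir.name[3:])
        online_file = cpu_dir / "online"    # absent for cpu0 on most systems
        if online_file.exists():
            online_file.write_text("1" if idx < num_online else "0")

# Example: emulate the 16-core target configuration before starting the NVMe-oF target.
# set_online_cpus(16)
```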

  11. Latency Under Load
      - NVMe-oF latencies are the same as DAS for all practical loads
        – Both average and tail
      - iSCSI:
        – Saturates sooner
        – 10x slower even under light loads
      [Chart: 4KB random read load latency; average and 95th-percentile latency (μs, up to 10,000) vs. IOPS for DAS, NVMf, and iSCSI]

  12. Latency Under Load (continued)
      - Same bullets and data as the previous slide; the chart's latency axis is capped at ~1,200 μs to zoom in on the region before iSCSI saturates.
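
Latency-under-load curves like the ones on the two slides above are typically produced by rate-limiting the load generator and recording average and tail completion latency at each offered IOPS level. A hedged fio-based sketch; the offered-load points, device path, and JSON field names are assumptions:

```python
import json
import subprocess

def latency_at_load(device: str, offered_iops_per_job: int,
                    numjobs: int = 4, runtime_s: int = 30) -> tuple[float, float]:
    """Return (avg, p95) 4KB random-read completion latency in usec at a fixed offered load."""
    cmd = [
        "fio", "--name=lat-under-load", f"--filename={device}",
        "--ioengine=libaio", "--direct=1",
        "--rw=randread", "--bs=4k",
        "--iodepth=32", f"--numjobs={numjobs}",
        f"--rate_iops={offered_iops_per_job}",   # per-job cap; total load = numjobs * rate_iops
        f"--runtime={runtime_s}", "--time_based",
        "--group_reporting", "--output-format=json",
    ]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    clat = json.loads(out)["jobs"][0]["read"]["clat_ns"]
    return clat["mean"] / 1000.0, clat["percentile"]["95.000000"] / 1000.0

if __name__ == "__main__":
    for per_job in (25_000, 50_000, 100_000, 150_000):   # illustrative offered-load points
        avg, p95 = latency_at_load("/dev/nvme1n1", per_job)
        print(f"{per_job * 4:>8,} IOPS offered: avg {avg:.0f} us, p95 {p95:.0f} us")
```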

  13. KV-Store Disaggregation (1/3)
      - Evaluated using RocksDB, driven with db_bench
        – 3 hosts, 3 RocksDB instances per host
        – 800B and 10KB objects
        – 80/20 read/write mix
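
The exact db_bench invocation is not given in the deck; a hedged approximation of a single instance of the workload described above might look like this (key count, duration, thread count, and paths are placeholders):

```python
import subprocess

def db_bench_mixed(db_path: str, value_size: int, read_pct: int = 80,
                   num_keys: int = 100_000_000, threads: int = 16,
                   duration_s: int = 600) -> list[str]:
    """Build a db_bench command for a mixed random read/write workload."""
    return [
        "db_bench", f"--db={db_path}",
        "--benchmarks=readrandomwriterandom",   # mixed point reads and writes
        f"--readwritepercent={read_pct}",       # 80 => 80% reads / 20% writes
        f"--value_size={value_size}",           # 800 or 10240 bytes, per the slide
        f"--num={num_keys}",
        f"--threads={threads}",
        f"--duration={duration_s}",
    ]

if __name__ == "__main__":
    # One of the three instances on a host; the mount point is a placeholder.
    subprocess.run(db_bench_mixed("/mnt/nvmf0/rocksdb-0", value_size=800), check=True)
```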

  14. KV-Store Disaggregation (2/3)
      - NVMe-oF performance is on par with DAS
        – 2% throughput difference
          • vs. a 40% performance degradation for iSCSI
      [Charts: RocksDB operations per second for DAS, NVMf, and iSCSI at 800B and 10KB object sizes; disk bandwidth over time on the target]

  15. KV-Store Disaggregation (3/3)
      - NVMe-oF performance is on par with DAS
        – 2% throughput difference
          • vs. a 40% performance degradation for iSCSI
        – Average latency increases by 11%, tail latency by 2%
          • Average latency: 507 μs → 568 μs
          • 99th percentile: 3.6 ms → 3.7 ms
        – ~10% CPU utilization overhead on the host
      [Chart: read latency CDF for DAS and NVMf]

  16. Summary
      - NVMe-oF reduces remote storage overheads to a bare minimum
        – Negligible throughput difference, similar latency
        – Low processing overheads on both host and target
          • Applications (host) get the same performance
          • The storage server (target) can support more drives with fewer cores
      - NVMe-oF makes disaggregation more viable
        – No need to offset iSCSI's 20%+ performance loss

      Thank You!
      zvika.guz@samsung.com

  17. Backup

  18. Unloaded Latency Breakdown
      - NVMe-oF adds 11.7 μs over the DAS access latency
        – Close to the 10 μs target in the spec
      [Diagram: 4K unloaded read latency along the I/O path; host side (fio on /dev/nvmeXnY → VFS/file system → block layer → NVMe_Core → NVMe_RDMA → RDMA stack → fabric) and target side (fabric → RDMA stack → NVMeT_RDMA → NVMeT_Core → block layer → NVMe_Core → NVMe_PCI)]

      4K unloaded read latency breakdown [μs]:
        NVMe DAS path         81.6
        NVMf host modules      3.25
        NVMf target modules    4.57
        Fabric                 2.43
        Others                 1.52
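
The components in the breakdown sum back to the headline overhead; a one-line check using the values copied from the slide:

```python
# Per-component latencies from the slide, in microseconds.
das_path = 81.6
overhead = {
    "NVMf host modules": 3.25,
    "NVMf target modules": 4.57,
    "Fabric": 2.43,
    "Others": 1.52,
}

added = sum(overhead.values())   # 11.77 us of NVMe-oF overhead
print(f"NVMe-oF adds {added:.2f} us on top of {das_path} us DAS latency "
      f"({das_path + added:.1f} us total)")
```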

  19. FAQ #1: SPDK
      - Storage Performance Development Kit (SPDK)
        – Provides user-mode storage drivers
          • NVMe, NVMe-oF target, and NVMe-oF host
        – Better performance through:
          • Eliminating kernel context switches
          • Polling rather than interrupts
      - Will improve NVMe-oF performance
        – BUT, it was not stable enough for our setup
      - For unloaded latency:
        – An SPDK target further reduces the latency overhead (8.9 μs vs. the 11.7 μs kernel-target overhead, per the chart annotations)
        – SPDK local → SPDK target behaves similarly to local → NVMe-oF
      [Chart: unloaded latency (μs, 0–100) for DAS, SPDK DAS, NVMf, and SPDK NVMf target]

  20. FAQ #1: SPDK (continued)
      [Repeat of the previous slide's bullets and chart]
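
For completeness, this is roughly how an SPDK user-space NVMe-oF target is configured through its JSON-RPC script in recent SPDK releases; the RPC method names may not match the SPDK version available when these experiments were run, and the PCIe address, NQN, and IP below are placeholders:

```python
import subprocess

RPC = "scripts/rpc.py"                   # run from an SPDK checkout, with nvmf_tgt already started
NQN = "nqn.2016-06.io.spdk:cnode1"       # placeholder subsystem NQN
TRADDR = "192.168.1.100"                 # placeholder RDMA-capable interface address

def rpc(*args: str) -> None:
    subprocess.run(["python3", RPC, *args], check=True)

# Create the RDMA transport, attach a local NVMe controller as a bdev,
# and export its first namespace through an NVMe-oF subsystem.
rpc("nvmf_create_transport", "-t", "RDMA")
rpc("bdev_nvme_attach_controller", "-b", "Nvme0", "-t", "PCIe", "-a", "0000:05:00.0")
rpc("nvmf_create_subsystem", NQN, "-a", "-s", "SPDK00000000000001")
rpc("nvmf_subsystem_add_ns", NQN, "Nvme0n1")
rpc("nvmf_subsystem_add_listener", NQN, "-t", "rdma", "-f", "ipv4", "-a", TRADDR, "-s", "4420")
```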

  21. FAQ #2: Hyper-convergence vs. Disaggregation
      - Hyper-converged infrastructure (HCI)
        – Software-defined approach
        – Bundles commodity servers into a clustered pool
        – Abstracts the underlying hardware into a virtualized computing platform
      - We focus on web-scale data centers
        – Disaggregation fits well within their deployment model
          • Several classes of server, some of which are storage-centric
          • They already disaggregate HDDs
      - NVMe-oF, HCI, and disaggregation are not mutually exclusive
        – HCI on top of NVMe-oF
        – Hybrid architectures
