FlashNet: Flash/Network Stack Co-Design
Animesh Trivedi, Nikolas Ioannou, Bernard Metzler, Patrick Stuedi, Jonas Pfefferle, Ioannis Koltsidas, Kornilios Kourtis, and Thomas R. Gross
IBM Research and ETH Zurich, Switzerland
Modern Distributed Systems
● data intensive
● run on 100-1000s of servers
● performance depends upon both network and storage

network:
● StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs, USENIX'16
● Network Stack Specialization for Performance, SIGCOMM'14
● mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems, NSDI'14
● MegaPipe: A New Programming Interface for Scalable Network I/O, OSDI'12
● …

storage:
● NVMeDirect: A User-space I/O Framework for Application-specific Optimization on NVMe SSDs, HotStorage'16
● OS I/O Path Optimizations for Flash Solid-state Drives, USENIX'14
● Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems, SYSTOR'13
● When Poll is Better Than Interrupt, FAST'12
● …

SYSTOR 2017, Haifa
The Cost of the Gap
[Figure: bar chart of K IOPS (y-axis, 0 to 1400) for spec., block IO, netperf, iSCSI, KV, NFS, and HDFS]
The Reason for the Gap
[Diagram: a client and a storage server with flash]
1. request
2. request processing
3. response

performance = network IO + server time + storage IO
● application involvement
● scheduling
● fs lookups and overheads
● ...
A Detailed Look: send
[Diagram: server I/O path, split into userspace and kernel]
1. TCP/IP processing (kernel)
2. receive processing (kernel)
3. receive request (userspace)
4. fs translation (kernel)
5. block I/O (kernel)
6. flash I/O completion (kernel)
7. response transmission (userspace)
8. send (kernel)
9. TX done (kernel)

A Detailed Look: sendfile
[Diagram: same userspace/kernel split, with the same step labels 1-9 as for send]
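
The two server paths named on these slides can be sketched with standard system calls. This is a minimal illustration in Python, not FlashNet code: `serve_with_send` reads file data into a userspace buffer and sends it, while `serve_with_sendfile` uses `socket.sendfile`, which relies on `os.sendfile` where the platform provides it so that the kernel moves the data. The function names, the `CHUNK` size, and the socket-pair test harness are assumptions made for this example.

```python
import os
import socket
import tempfile

CHUNK = 4096  # assumed I/O size, chosen only for illustration


def serve_with_send(sock, path):
    """Copy the file through a userspace buffer: read(), then send."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(CHUNK)      # kernel -> userspace copy
            if not buf:
                break
            sock.sendall(buf)        # userspace -> kernel copy
            sent += len(buf)
    return sent


def serve_with_sendfile(sock, path):
    """Let the kernel move file data to the socket (falls back to send if needed)."""
    with open(path, "rb") as f:
        return sock.sendfile(f)      # returns the number of bytes sent


def recv_exact(sock, n):
    """Receive exactly n bytes, or fewer if the peer closes early."""
    data = bytearray()
    while len(data) < n:
        part = sock.recv(n - len(data))
        if not part:
            break
        data.extend(part)
    return bytes(data)


def demo():
    """Serve the same small file over both paths and check what arrives."""
    payload = os.urandom(2 * CHUNK)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    server, client = socket.socketpair()
    try:
        results = []
        for serve in (serve_with_send, serve_with_sendfile):
            n = serve(server, path)
            results.append((n, recv_exact(client, n) == payload))
        return results
    finally:
        server.close()
        client.close()
        os.unlink(path)
```

Both functions deliver the same bytes; the difference is only in where the copy happens. In both cases the application still has to receive the request and issue the call.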
The FlashNet Approach
● eliminate application involvement: RDMA
● reduce file system overheads: simple fs layout
● enable direct network and storage interaction: use VM

[Diagram: resulting I/O path, split into userspace and kernel; step labels as shown on the slide]
1. TCP/IP processing
2. RDMA processing (v to LBA)
3. block I/O
6. flash I/O completion
6. response transmission
7. TX done
FlashNet: A Co-Designed Network and Storage Stack

flash controller (64-bit LBA space)
● flash virtualization
● I/O management
● ...

ContigFS
● contiguous file allocation
● supporting mmap & local file I/O
● …

RDMA controller
● lazy RDMA pinning
● resolving flash & file addresses
● …

FlashNet I/O stack: ContigFS, RDMA controller, flash controller

[Diagram: file server application on top of the FlashNet I/O stack; addresses shown: virtual address, LBA, STag, PBA]
● network control setup expanding to storage
● data path from a flash device to a client buffer

[Diagram labels: client, server, application, nfs, iSCSI, ContigFS, RDMA controller, flash controller, RNIC, SKB]
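
Contiguous file allocation is what makes resolving file addresses to flash addresses cheap: if a file occupies a single LBA range, translating a byte range is plain arithmetic rather than a file system lookup. The sketch below illustrates that idea only. `ContigFile`, `resolve`, and the 4096-byte block size are hypothetical names and values chosen for this example, not FlashNet's actual data structures.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4096  # assumed logical block size in bytes


@dataclass(frozen=True)
class ContigFile:
    """Hypothetical descriptor of a file stored as one contiguous LBA range."""
    start_lba: int
    num_blocks: int


def resolve(file, offset, length):
    """Translate a byte range within the file to (first_lba, block_count)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length > 0")
    if offset + length > file.num_blocks * BLOCK_SIZE:
        raise ValueError("range exceeds the file's extent")
    first_block = offset // BLOCK_SIZE
    last_block = (offset + length - 1) // BLOCK_SIZE
    return file.start_lba + first_block, last_block - first_block + 1


f = ContigFile(start_lba=1000, num_blocks=8)
aligned = resolve(f, 4096, 8192)    # two whole blocks starting at the second block
straddling = resolve(f, 4095, 2)    # two bytes that cross a block boundary
```

With a single extent per file, the translation needs no per-block metadata, which is the property a network-side component would need to resolve addresses without calling into a general-purpose file system.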
Performance Evaluation
● How efficient is FlashNet's IO path?
● Does it help with applications?
● ...more in the paper

9-machine cluster testbed
CPU: dual socket E5-2690, 2.9 GHz, 16 cores
DRAM: 256 GB, DDR3 1600 MHz
NIC: 40 Gbit/s Ethernet
Flash: 3x NVMe, 6.6 GB/sec (read), 2.7 GB/sec (write)
peak 4 kB read IOPS: 1.3 M
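
A back-of-the-envelope calculation relates the peak flash IOPS to the NIC line rate in this testbed. The sketch assumes that 4 kB means 4096 bytes and ignores all protocol and framing overheads; both are assumptions of this example, and `iops_to_gbit_per_s` is a hypothetical helper.

```python
def iops_to_gbit_per_s(iops, io_bytes):
    """Convert an I/O rate to the payload bandwidth it implies, in Gbit/s."""
    return iops * io_bytes * 8 / 1e9


PEAK_READ_IOPS = 1.3e6  # from the testbed description
IO_BYTES = 4096         # assumption: 4 kB = 4096 bytes
NIC_GBIT = 40.0         # from the testbed description

demand_gbit = iops_to_gbit_per_s(PEAK_READ_IOPS, IO_BYTES)
network_bound = demand_gbit > NIC_GBIT
```

Under these assumptions the flash peak corresponds to roughly 42.6 Gbit/s of payload, slightly more than the 40 Gbit/s link can carry, so a remote client could at best approach, not reach, the local 4 kB peak.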