Performance of RDMA-Capable Storage Protocols on Wide-Area Network
Weikuan Yu, Nageswara S.V. Rao, Pete Wyckoff*, Jeffrey S. Vetter
* Ohio Supercomputer Center
Managed by UT-Battelle for the Department of Energy
InfiniBand Clusters around the World
SGI (US), CEA (France), Ranger (US), Tsubame (Japan), Dawning (China), EKA (India)
PDSW'08, Austin, TX
The Problem of Computing Islands
• Islands of InfiniBand (IB) clusters
  – More IB clusters are deployed
  – Some already connected, e.g. through TeraGrid
    • But only via TCP/IP protocols
• Data transfer across these islands
  – Need ever-greater data movement capabilities
  – GridFTP, BBCP or other special storage configuration
  – TCP performance on long distance can be low
    • With 10GigE on UltraScience Net (no tuning)
      – 9.2 Gbps at 0.2 mile
      – 8.2 Gbps at 1400 miles
      – 2.3-2.5 Gbps at 6600+ miles
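
One common way to reason about why untuned TCP throughput falls with distance is the bandwidth-delay product: the amount of data that must be in flight to keep a link full. The sketch below is illustrative only and is not taken from the slides. It assumes a signal speed in fiber of about 2×10^8 m/s, counts propagation delay only, and treats each quoted distance as a one-way path length (if a figure is already a loop length, halve the result). The function names are hypothetical.

```python
# Bandwidth-delay product sketch. All constants are assumptions, not measurements.
MILES_TO_METERS = 1609.344
FIBER_SPEED_M_PER_S = 2.0e8  # assumed signal speed in optical fiber (~2/3 of c)


def round_trip_time_s(one_way_miles: float) -> float:
    """Propagation-only round-trip time for a given one-way distance."""
    return 2.0 * one_way_miles * MILES_TO_METERS / FIBER_SPEED_M_PER_S


def bdp_bytes(link_gbps: float, one_way_miles: float) -> float:
    """Bytes that must be in flight to keep a link of link_gbps full."""
    bytes_per_second = link_gbps * 1e9 / 8.0
    return bytes_per_second * round_trip_time_s(one_way_miles)


if __name__ == "__main__":
    # Distances quoted on the slide; 10 Gbps is the nominal 10GigE rate.
    for miles in (0.2, 1400, 6600):
        rtt_ms = round_trip_time_s(miles) * 1e3
        mib = bdp_bytes(10.0, miles) / 2**20
        print(f"{miles:>7} miles: RTT ~ {rtt_ms:8.3f} ms, BDP ~ {mib:8.2f} MiB")
```

Under these assumptions, the required in-flight data grows linearly with distance, so a window size that is adequate at 0.2 mile can be far too small at 6600 miles.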
RDMA (IB) in Clusters and Local Area Networks
• Sub-microsecond latency
• Superb bandwidth (32Gbps with IB QDR)
• Heavily used for clustering
• Getting popular in storage environment
  – NFS over RDMA (NFSoRDMA)
  – SCSI RDMA Protocol (SRP)
  – iSCSI over RDMA (iSER)
[Figure: software stack, top to bottom: Applications; MPI and NFS/iSER/SRP; Verbs; InfiniBand HCA. Latency annotation: 1 µsec]
Sample Performance of RDMA-based Storage
• RDMA enables good iSCSI bandwidth within LAN
• Nearly doubled the performance for iSCSI
Feasibility of RDMA (IB) on WAN
• Long-range extensions for InfiniBand available
  – Network Equipment Technologies (NET): NX5010
  – Obsidian Research: Longbow
• Long latency (10^4 to 10^5 µsec)
• High bandwidth yet feasible
  – Good distance scalability and tolerance to interfering traffic
  – Good network throughput and MPI-level performance
• Can RDMA provide a good transport protocol for storage on WAN?
[Figure: software stack, top to bottom: Applications; MPI and NFS/iSER/SRP; Verbs; InfiniBand HCA. Latency annotation: 10^4 to 10^5 µsec]
Experimental Environment
• Hardware
  – Long-range IB extension devices from NET (Network Equipment Technologies, Inc)
  – Mellanox PCI-Express 4x DDR HCAs (InfiniHost-III and Connect-X)
• Software packages
  – OFED-1.3 from openfabrics.org
  – Linux-2.6.25 with NFSoRDMA and iSER support
• Performance of RDMA-based storage protocols on WAN
  – NFS over RDMA
  – iSCSI over RDMA
UltraScience Net at ORNL
• Experimental WAN network
  – Oak Ridge, Atlanta, Chicago, Seattle, and Sunnyvale
  – OC192 backbone connections
  – 4300 miles one way, 8600 miles loop-back
RDMA-based Transport
• Requests and replies become pure control messages, and have to travel long distance on WAN
• Use of RDMA read (round-trip operations) for clients to write data
• Possible additional control messages for NFSoRDMA for long arguments
• Further fragmentation due to the use of page-based operations
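
The interaction between page-based fragmentation and round-trip RDMA reads can be illustrated with a rough latency-only estimate. This is a sketch under stated assumptions, not the behavior of any particular NFSoRDMA or iSER implementation: it assumes a 4 KB page size, one RDMA read per page, and a fixed number of reads outstanding at a time, and it ignores transmission time and control messages. The names `rdma_read_count`, `latency_bound_s`, and `reads_in_flight` are hypothetical.

```python
PAGE_BYTES = 4096  # assumed page size for page-based operations


def rdma_read_count(write_bytes: int, chunk_bytes: int = PAGE_BYTES) -> int:
    """Number of RDMA reads needed if a write is fetched in chunk_bytes pieces."""
    return (write_bytes + chunk_bytes - 1) // chunk_bytes  # integer ceiling


def latency_bound_s(write_bytes: int, rtt_s: float,
                    chunk_bytes: int = PAGE_BYTES,
                    reads_in_flight: int = 1) -> float:
    """Lower bound on fetch time when each batch of reads costs one round trip."""
    reads = rdma_read_count(write_bytes, chunk_bytes)
    rounds = (reads + reads_in_flight - 1) // reads_in_flight
    return rounds * rtt_s


if __name__ == "__main__":
    one_mib = 1 << 20
    rtt = 0.010  # hypothetical 10 ms round trip
    for in_flight in (1, 4, 16):
        t = latency_bound_s(one_mib, rtt, reads_in_flight=in_flight)
        print(f"{in_flight:>2} reads in flight: at least {t:.2f} s for 1 MiB")
```

In this model the round-trip count, not the link rate, dominates the cost of a write, which is why fewer, larger operations or more reads in flight would be expected to help at long distance.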
RDMA on WAN
• RDMA has good network-level performance within short-distance WAN
• High bandwidth at long distance is only possible for large messages
• Low RDMA-read performance for page-based messages (4KB), even at 0.2 mile when using InfiniHost-III HCAs
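
Why large messages are needed at long distance can be seen with a simple model in which each message costs one round trip plus its time on the wire. The sketch below is an illustration under assumptions, not a reproduction of the measured results: the 8 Gbps link rate and 10 ms round-trip time are hypothetical placeholder values, and `effective_gbps` and `in_flight` are hypothetical names.

```python
def effective_gbps(message_bytes: int, rtt_s: float, link_gbps: float,
                   in_flight: int = 1) -> float:
    """Estimated throughput when each message costs one RTT plus wire time.

    in_flight messages are assumed to overlap perfectly; the result is
    capped at the link rate.
    """
    bits = message_bytes * 8
    wire_s = bits / (link_gbps * 1e9)   # time to serialize one message
    per_message_s = rtt_s + wire_s      # round trip plus transfer
    rate_gbps = in_flight * bits / per_message_s / 1e9
    return min(rate_gbps, link_gbps)


if __name__ == "__main__":
    rtt = 0.010   # hypothetical round-trip time in seconds
    link = 8.0    # hypothetical usable link rate in Gbps
    for size in (4096, 64 * 1024, 1 << 20, 64 << 20):
        print(f"{size:>10} bytes: {effective_gbps(size, rtt, link):.4f} Gbps")
```

With one message outstanding, a 4 KB message spends almost all of its time waiting on the round trip, while a message of tens of megabytes amortizes that delay and approaches the link rate.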
NFS over RDMA
• NFS over RDMA achieves good bandwidth within short distance
• But significant optimizations are needed for long distance
NFS - Large block size
• NFS over IPoIB-CM benefits from large block size
• NFS over RDMA needs to support large block size for better fit on long-distance WAN
NFS over RDMA - using Connect-X
• Better RDMA read in Connect-X improves the performance of file write for NFS over RDMA
• Performance at long distance is yet to be determined
iSCSI over RDMA (iSER)
• RDMA enables high-performance iSCSI within short distance
• RDMA has good promise over long distance as shown with large messages