
NFS over RDMA



  1. NFS over RDMA (1 of 17)
     SIGCOMM 2003, NICELI Workshop
     Brent Callaghan, Theresa Lingutla-Raj, Alex Chiu, Peter Staubach, Omer Asad
     Sun Microsystems, Inc.

  2. Why RDMA as a Transport?
     • Nice to have at 1 Gb/sec but must have for 10 Gb/sec
     • Offload protocol processing from general purpose CPU to dedicated protocol hardware
     • Offload host memory/IO bus with direct data placement (DDP)

  3. NFS is an RDMA Sweet Spot
     • Clients and servers are close
       – Most commonly on a LAN
       – Often in the same server room or rack
       – Bandwidth high, latency low
     • NFS moves big chunks of data
       – 8 KB for NFS version 2
       – No limit for NFS version 3
     • Most clients read & write 32 KB chunks
     • Solaris servers accept up to 1 MB reads/writes

  4. RDMA as a new RPC Transport
     [Diagram: protocol stack. Labels: NFS ACL, NFS, NLM; NFS CHANGES; RPC/XDR; RDMA, UDP, TCP; IP]

  5. Small RPC Messages
     • Most NFS messages are quite small
       – Less than 1 KB
     • No RDMA needed, just use SENDs
     [Diagram: RPC Call, SEND, small pre-posted receive buffer; RPC Reply, SEND, small pre-posted receive buffer]
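The size-based choice on this slide can be sketched as below. This is only an illustration: the 1 KB `INLINE_THRESHOLD` is an assumption taken from the slide's "less than 1 KB" remark, and `choose_transfer` is a hypothetical helper, not a function from any NFS/RDMA implementation.

```python
INLINE_THRESHOLD = 1024  # assumed size of the small pre-posted receive buffer


def choose_transfer(message_len: int, threshold: int = INLINE_THRESHOLD) -> str:
    """Return "SEND" when the whole RPC message fits the pre-posted buffer,
    otherwise "RDMA" to signal that the bulk data should move as chunks."""
    if message_len < 0:
        raise ValueError("message length must be non-negative")
    return "SEND" if message_len <= threshold else "RDMA"
```

For example, `choose_transfer(200)` selects a plain SEND, while a 32 KB read reply would be moved with RDMA.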

  6. Moving NFS data with RDMA
     An NFS read reply or write request is a large chunk of data with a variable-length RPC & NFS header.
     [Diagram: RPC Header, Data]
     That large chunk of data could be moved more efficiently if we moved it with DDP instead.
     [Diagram: DDP Header, STag, Address, Length]

  7. XDR De-Chunking the Message
     • Encoded message for TCP transport
       [Diagram labels: XDR encoded RPC Message, TCP Conn, Chunk]
     • Encoded message for RDMA transport
       [Diagram labels: Non Chunks, Chunk list entry, RDMA Send, XDR Offset, Chunk, Address, RDMA Read or Write]
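A minimal sketch of the de-chunking idea: large pieces are pulled out of the XDR stream and replaced by chunk list entries that record where they belonged. Everything named here is hypothetical (`ChunkEntry`, `dechunk`, `rechunk`, the 1 KB threshold, the made-up `base_address`); real registered-memory addresses would come from the RDMA hardware, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ChunkEntry:
    xdr_offset: int  # position in the inline stream where the chunk was removed
    address: int     # stand-in for the registered-memory address of the chunk
    length: int      # chunk length in bytes


def dechunk(parts: List[bytes], threshold: int = 1024,
            base_address: int = 0x10000) -> Tuple[bytes, List[ChunkEntry], List[bytes]]:
    """Split encoded parts into an inline stream (sent with RDMA Send) and
    chunks (moved with RDMA Read or Write), plus the chunk list entries."""
    inline = bytearray()
    entries: List[ChunkEntry] = []
    chunks: List[bytes] = []
    next_address = base_address
    for part in parts:
        if len(part) > threshold:
            entries.append(ChunkEntry(len(inline), next_address, len(part)))
            chunks.append(part)
            next_address += len(part)
        else:
            inline += part
    return bytes(inline), entries, chunks


def rechunk(inline: bytes, entries: List[ChunkEntry], chunks: List[bytes]) -> bytes:
    """Receiver side: reinsert each chunk at its recorded XDR offset."""
    out = bytearray()
    pos = 0
    for entry, data in zip(entries, chunks):
        out += inline[pos:entry.xdr_offset]
        out += data
        pos = entry.xdr_offset
    out += inline[pos:]
    return bytes(out)
```

Rebuilding with `rechunk(*dechunk(parts))` yields the same bytes that a TCP transport would have carried as one contiguous stream.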

  8. RDMA Transport Header
     [Diagram labels: RPC Message sans chunks; Message XID, Version, Chunk List, Type; chunk list entry fields: XDR Stream Offset, Chunk Length, Source STag, Source Address, Next Chunk]
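To make the header fields concrete, here is a sketch that packs and unpacks them with Python's `struct` module. The slide gives only field names, so the field order, the 32-bit and 64-bit widths, big-endian encoding, and the use of a 1/0 marker for "Next Chunk" are all assumptions for illustration, not the wire format from the Internet Drafts.

```python
import struct
from typing import List, Tuple

# (xdr_stream_offset, chunk_length, source_stag, source_address)
Chunk = Tuple[int, int, int, int]


def encode_header(xid: int, version: int, msg_type: int, chunks: List[Chunk]) -> bytes:
    """Pack XID, version, type, then a linked list of chunk entries."""
    out = struct.pack(">III", xid, version, msg_type)
    for xdr_offset, length, stag, address in chunks:
        # Leading 1 means "another chunk follows" (assumed encoding of Next Chunk).
        out += struct.pack(">IIIIQ", 1, xdr_offset, length, stag, address)
    out += struct.pack(">I", 0)  # 0 terminates the chunk list
    return out


def decode_header(buf: bytes):
    """Return (xid, version, msg_type, chunks, remaining RPC message bytes)."""
    xid, version, msg_type = struct.unpack_from(">III", buf, 0)
    pos = 12
    chunks: List[Chunk] = []
    while True:
        (present,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        if not present:
            break
        chunks.append(struct.unpack_from(">IIIQ", buf, pos))
        pos += 20
    return xid, version, msg_type, chunks, buf[pos:]
```

The bytes left after the header are the RPC message sans chunks; the chunk data itself never appears in this buffer.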

  9. Read-Read Protocol
     [Diagram: message sequence between Client and Server]
     • RPC Call: SEND (Message + Chunk list)
     • Arg chunks: READ
     • RPC Reply: SEND (Message + Chunk list)
     • Result chunks: READ
     • RPC Done: SEND (Free chunks)
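The message sequence on this slide can be traced with a small in-memory simulation. The directions are an interpretation of the diagram labels (server reads argument chunks from the client, client reads result chunks from the server, and the final SEND lets the server free its chunks); `read_read_exchange` and the dictionaries standing in for registered memory are hypothetical.

```python
def read_read_exchange(arg_data: bytes, handler):
    """Trace one RPC using the read-read pattern; returns (result, trace)."""
    trace = []
    client_mem = {"arg": arg_data}   # client memory exposed for RDMA Read
    server_mem = {}                  # server memory exposed for RDMA Read

    trace.append("client->server SEND RPC Call + chunk list")
    fetched = client_mem["arg"]      # server pulls the argument chunk
    trace.append("server RDMA READ arg chunks")

    server_mem["result"] = handler(fetched)
    trace.append("server->client SEND RPC Reply + chunk list")
    result = server_mem["result"]    # client pulls the result chunk
    trace.append("client RDMA READ result chunks")

    trace.append("client->server SEND RPC Done")
    del server_mem["result"]         # server may now free the chunk
    return result, trace
```

The trace has five steps, three of them SENDs; the extra RPC Done message exists only so the server knows when its result chunks can be released.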

  10. NFS/TCP Throughput
      Peak throughput: 60 MB/sec @ 256 KB reads & 4 read-aheads

  11. NFS/RDMA Throughput
      Peak throughput: 102 MB/sec @ 256 KB reads & 8 read-aheads
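As a quick worked comparison of the two peak figures (noting that the peaks were reached at different read-ahead settings, 4 for TCP and 8 for RDMA, so this is not a like-for-like ratio):

```latex
\frac{102\ \text{MB/sec (NFS/RDMA)}}{60\ \text{MB/sec (NFS/TCP)}} = 1.7
```

That is, the RDMA peak is 70% higher than the TCP peak at 256 KB reads.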

  12. CPU Utilization (with no async read-ahead)

  13. Further Work
      • NFS/RDMA protocol Internet Drafts submitted to IETF
      • Extends basic “read-read” protocol to use RDMA write with ULP hooks: “read-write”
      • Includes receive buffer request/grant credit control
      • Support for alignment padding in RDMA SENDs
      • Receive buffer size negotiation protocol
      • Support in NFS version 4.1
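The credit control bullet can be illustrated with a small sender-side sketch. The slide only names "receive buffer request/grant credit control"; the semantics assumed here (each reply carries a grant that caps how many requests may be outstanding, so the sender never overruns the receiver's pre-posted buffers) and the `CreditedSender` class are illustrative assumptions, not the scheme defined in the drafts.

```python
class CreditedSender:
    """Queues requests and releases them only while credits allow."""

    def __init__(self, initial_credits: int = 1):
        self.credits = initial_credits  # receive buffers the peer has granted
        self.in_flight = 0              # requests sent but not yet answered
        self.queued = []                # requests waiting for a credit
        self.sent = []                  # requests actually put on the wire

    def submit(self, msg):
        self.queued.append(msg)
        self._drain()

    def on_reply(self, granted: int):
        """A reply frees one slot and carries the peer's latest grant."""
        self.in_flight -= 1
        self.credits = granted
        self._drain()

    def _drain(self):
        while self.queued and self.in_flight < self.credits:
            self.sent.append(self.queued.pop(0))
            self.in_flight += 1
```

With two initial credits, submitting five requests sends only the first two; a reply granting three credits then releases the next two.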

  14. Extended RDMA Transport Header
      [Diagram labels]
      • Old Header: Message XID, Version, Chunk List, Type
      • Extended Header: Message XID, Version, Credits, Type, Read List, Write List, Reply
      • Annotations: Receive Buffer Credit Control; Long replies from server; Direct write
      • Padding Control: Alignment, Threshold, Read List, Write List, Reply

  15. Read-Write Protocol
      [Diagram: message sequence between Client and Server]
      • RPC Call: SEND (Message + Write list)
      • Arg chunks: READ
      • Result chunks: WRITE
      • RPC Reply: SEND (Message + Write list)
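A matching in-memory sketch of the read-write sequence, under the same caveats: the directions are an interpretation of the diagram (the client advertises a result buffer in the write list and the server writes results straight into it), and `read_write_exchange` with its `result_capacity` parameter is hypothetical.

```python
def read_write_exchange(arg_data: bytes, handler, result_capacity: int):
    """Trace one RPC using the read-write pattern; returns (result, trace)."""
    trace = []
    client_result_buf = bytearray(result_capacity)  # advertised in the write list

    trace.append("client->server SEND RPC Call + write list")
    fetched = bytes(arg_data)        # server pulls the argument chunk
    trace.append("server RDMA READ arg chunks")

    result = handler(fetched)
    if len(result) > result_capacity:
        raise ValueError("result does not fit the advertised write chunk")
    client_result_buf[:len(result)] = result  # server pushes the result
    trace.append("server RDMA WRITE result chunks")

    trace.append("server->client SEND RPC Reply + write list")
    return bytes(client_result_buf[:len(result)]), trace
```

The trace has four steps and no RPC Done message: because the server pushes results into client memory, it holds no exposed result chunks that the client must later tell it to free.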

  16. Project Status
      • Solaris prototype
        – kVIPL with Emulex GN9000/VI, 1 Gb link
        – Like a normal NFS mount
        – Demonstrated good performance
      • InfiniBand
        – Implementing extended “read-write” protocol
        – Mellanox Tavor, 10 Gb (4x) link
        – Evaluating performance

