  1. Shawn Hall

  2. Hybrid RDMA

  3. RDMA/SR mix for data, SR otherwise. Client-side events: completion of sending request messages – polling; completion of incoming reply and control messages – interrupts. Server-side events: since the server is dedicated – polling.

  4. For small messages, memory (de)registration cost > zero-copy benefit, so data is piggybacked on SR with pre-registered buffers. The client caches the locations of preallocated/preregistered Fast RDMA buffers on the I/O server and uses RDMA Write with Immediate data. Large transfers are split into smaller ones, so client/server communication and disk I/O are pipelined.

  5. Internal buffer credit-based flow control: preallocated/prepinned buffers per connection. Server RDMA buffer management: most I/O server memory is allocated as RDMA buffers; buffers are grouped by size into “zones”; the server tries to fit a transfer into a contiguous buffer, otherwise it splits the transfer. Client RDMA buffer management: dynamic (de)registration is required for clients; a pin-down cache delays deregistration and caches registration info, but pin-down is not useful for I/O-intensive applications.
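The “zones” idea above can be sketched as a small placement policy. This is an illustrative sketch, not the paper's code: the zone sizes and the `plan_buffers` helper are assumptions, and a real server would also track free buffers per zone.

```python
# Hypothetical zone sizes; the slides do not give the actual values.
ZONE_SIZES = [4096, 65536, 1048576]

def plan_buffers(transfer_len, zone_sizes=ZONE_SIZES):
    """Return a list of buffer sizes covering transfer_len.

    Try to fit the whole transfer into one contiguous buffer
    (smallest zone that fits); otherwise split it across buffers
    drawn from the largest zone.
    """
    for size in sorted(zone_sizes):
        if transfer_len <= size:
            return [size]          # fits in a single contiguous buffer
    largest = max(zone_sizes)
    full, rest = divmod(transfer_len, largest)
    plan = [largest] * full        # as many largest-zone buffers as needed
    if rest:
        plan.append(rest)          # remainder goes in a final partial buffer
    return plan
```

A 1 KB transfer lands in one small buffer, while a transfer larger than the biggest zone is split, matching the fit-or-split rule on the slide.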

  6. Fast Memory Registration and Deregistration Uses pin-down cache and batched deregistration
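A minimal sketch of a pin-down cache with batched deregistration, under assumed interfaces (the `deregister` callback stands in for the HCA driver call; the LRU policy and batch eviction are illustrative, not taken from the slides):

```python
from collections import OrderedDict

class PinDownCache:
    def __init__(self, capacity, batch, deregister):
        self.cache = OrderedDict()    # addr -> registration handle, in LRU order
        self.capacity = capacity
        self.batch = batch
        self.deregister = deregister  # callback into the (hypothetical) HCA driver
        self.hits = self.misses = 0

    def lookup(self, addr, do_register):
        if addr in self.cache:
            self.cache.move_to_end(addr)   # hit: reuse the cached registration
            self.hits += 1
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                # Batched deregistration: evict several stale entries at once
                # instead of unpinning on every buffer release.
                for _ in range(self.batch):
                    _, handle = self.cache.popitem(last=False)
                    self.deregister(handle)
            self.cache[addr] = do_register(addr)
        return self.cache[addr]
```

Registrations survive past the operation that created them, so repeated I/O on the same buffer hits the cache; this is also why the hit rate (slide 7) is the natural metric for the scheme.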

  7. % of pin-down cache hits

  8. Chunk list – a multidimensional list that stores the locations of multiple buffers. RPC long call – long RPCs are broken into chunks; the first message contains the chunk list for the other messages. (Diagram: NFS Write exchange between client and server.)
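The long-call mechanism above can be sketched as follows. This is a simplified sketch: the `(offset, length)` entry format and the message dictionary are assumptions standing in for the real chunk-list encoding.

```python
def build_chunk_list(total_len, chunk_size):
    """Break a long RPC payload into chunks, returning [(offset, length), ...]."""
    chunks = []
    offset = 0
    while offset < total_len:
        length = min(chunk_size, total_len - offset)
        chunks.append((offset, length))
        offset += length
    return chunks

# The first message carries the chunk list describing the other messages.
first_message = {"op": "NFS_WRITE", "chunk_list": build_chunk_list(10000, 4096)}
```

The receiver can then post buffers for each listed chunk before the remaining messages arrive.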

  9. NFS Readdir and Readlink – similar to NFS Read. NFS Read: read-read design vs. read-write design. (Diagrams: client/server message exchange for each design.)

  10. Server buffers are exposed to client RDMA. Server resources are not freed until the client sends RDMA_DONE. Synchronous RDMA Read causes latency. The number of concurrent RDMA Reads is limited.

  11. RPC long replies and NFS READ data can come directly from the server. The client cannot initiate RDMA and try to access other buffers, so the design is more secure. The Mellanox HCA can issue many RDMA Write operations in parallel. No waiting for RDMA_DONE, and fewer server interrupts.

  12. Fast memory registration – registration steps that involve communication with the HCA are done at initialization rather than dynamically. Buffer registration cache – no information about server buffers is exposed. Physical memory registration – avoids virtual-to-physical address translation; the translation also does not need to be sent to the HCA.

  13. RDMA_DONE elimination. RDMA Write parallelism.

  14. No local scatter/gather, so more RDMA Reads are needed; simultaneous Reads are capped, though, so parallelism decreases.

  15. Server memory saturates.

  16. MPI-level checkpointing applies only to MPI and is not portable; a file-system-level approach is portable and transparent to MPI stacks and applications.

  17. FUSE – software that allows creating a user-level virtual file system. Berkeley Lab Checkpoint/Restart (BLCR) – writes a process image to a file for later restart. MPI checkpointing mechanisms – offered by MVAPICH2, MPICH2, and OpenMPI: the MPI library flushes the communication channel, the BLCR library is used to dump a memory snapshot, and the BLCR library is used to restart the job if needed.

  18. VFS cache needs work. Efficient sequential writes.

  19. File open – caught by FUSE; CRFS inserts/increments a value in a hash table, then passes the call to the underlying file system. File close – the buffer pool is flushed into the work queue, blocking until the operations complete. File sync – complete all writes on the file, then pass fsync() to the underlying file system. Other file operations – passed through to the underlying file system.
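The open/close dispatch described above can be sketched like this. The function names, the pass-through callbacks, and the per-path reference count are all assumptions, not CRFS's actual code:

```python
open_counts = {}  # path -> number of active opens (the hash table on the slide)

def crfs_open(path, passthrough_open):
    # Insert or increment the entry, then forward the call via FUSE
    # to the underlying file system.
    open_counts[path] = open_counts.get(path, 0) + 1
    return passthrough_open(path)

def crfs_close(path, flush_buffers, passthrough_close):
    # Flush this file's buffer pool into the work queue first
    # (the real code blocks until those operations complete).
    flush_buffers(path)
    open_counts[path] -= 1
    if open_counts[path] == 0:
        del open_counts[path]
    return passthrough_close(path)
```

All other operations would simply call their pass-through, matching the "passed to file system" case.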

  20. File write: data is copied into a chunk in the buffer pool until the chunk is full; the chunk is then enqueued into the work queue, which triggers an I/O thread to wake up and write the chunk. The number of I/O threads is limited to prevent contention.
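The write path above can be sketched with a chunked buffer, a work queue, and a small fixed pool of I/O threads. This is a toy sketch under assumed names: the tiny chunk size is for illustration only, and appending to a list stands in for the actual disk write.

```python
import queue
import threading

CHUNK_SIZE = 4            # tiny for illustration; a real chunk would be far larger
work_queue = queue.Queue()
written = []              # stand-in for data that reached disk

def io_thread():
    # I/O threads sleep in get() and wake when a full chunk is enqueued.
    while True:
        chunk = work_queue.get()
        if chunk is None:           # shutdown sentinel
            break
        written.append(bytes(chunk))  # the "write chunk to disk" step
        work_queue.task_done()

class WriteBuffer:
    def __init__(self):
        self.chunk = bytearray()

    def write(self, data):
        # Copy data into the current chunk; enqueue it once it fills up.
        for b in data:
            self.chunk.append(b)
            if len(self.chunk) == CHUNK_SIZE:
                work_queue.put(self.chunk)
                self.chunk = bytearray()

    def flush(self):
        if self.chunk:
            work_queue.put(self.chunk)
            self.chunk = bytearray()

# Limit the number of I/O threads to prevent contention.
threads = [threading.Thread(target=io_thread) for _ in range(2)]
for t in threads:
    t.start()

buf = WriteBuffer()
buf.write(b"hello world")
buf.flush()
work_queue.join()                 # wait for all enqueued chunks to be written
for _ in threads:
    work_queue.put(None)
for t in threads:
    t.join()
```

With two I/O threads the chunks may complete out of order, which is why a real implementation keeps offsets with each chunk.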
