Shawn Hall
Hybrid RDMA
RDMA/SR mix for data, SR otherwise
Client-side events
  Completion of sending request messages – polling
  Completion of incoming reply and control messages – interrupts
Server-side events
  Since the server is dedicated – polling
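The hybrid completion policy above can be sketched in a few lines. This is a toy model, not real verbs code: the two `queue.Queue` objects stand in for the send and receive completion queues, and `drain_completions` is a hypothetical name.

```python
import queue

def drain_completions(send_cq, recv_cq, timeout=0.1):
    """Model of the hybrid client-side policy: send completions are
    polled (non-blocking), while reply/control completions block on an
    interrupt-style wait."""
    events = []
    # Poll: reap any completed send requests without blocking.
    while True:
        try:
            events.append(("send_done", send_cq.get_nowait()))
        except queue.Empty:
            break
    # Interrupt-style: block until a reply or control message arrives.
    try:
        events.append(("recv", recv_cq.get(timeout=timeout)))
    except queue.Empty:
        pass
    return events
```

Polling the send queue avoids interrupt overhead on the fast path, while blocking on receives avoids spinning while waiting for the server.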
For small messages, memory (de)registration cost > zero-copy benefit
Data-piggybacked SR w/ pre-registered buffers
  Client caches locations of preallocated/preregistered fast RDMA buffers on the I/O server
  RDMA Write with Immediate Data
Large transfers split into smaller ones
  Client/server communication and disk I/O pipelined
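The benefit of splitting a large transfer and pipelining communication with disk I/O can be illustrated with the classic two-stage pipeline formula; `pipeline_schedule` is an illustrative name, not from the paper.

```python
def pipeline_schedule(n_chunks, t_comm, t_io):
    """Completion time of an n-chunk transfer when per-chunk network
    communication (t_comm) and disk I/O (t_io) are pipelined: while
    chunk i is written to disk, chunk i+1 is being transferred, so
    after the first chunk the slower stage dominates."""
    if n_chunks == 0:
        return 0
    return t_comm + (n_chunks - 1) * max(t_comm, t_io) + t_io
```

For 4 chunks with `t_comm = 2` and `t_io = 3`, the pipelined time is 14 versus 20 when each chunk's communication and I/O run back to back.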
Internal Buffer Credit-Based Flow Control
  Preallocated/prepinned buffers per connection
Server RDMA Buffer Management
  Most I/O server memory allocated as RDMA buffers
  Buffers are grouped by size into “zones”
  Try to fit into a contiguous buffer, otherwise split the transfer
Client RDMA Buffer Management
  Dynamic (de)registration required for clients
  Pin-down cache delays deregistration and caches registration info
  Pin-down cache not useful for I/O-intensive applications
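The zoned server-side buffer policy (contiguous fit first, split otherwise) can be sketched as follows. This is a simplified model under assumed semantics; `ZoneAllocator` and its methods are hypothetical names.

```python
class ZoneAllocator:
    """Toy model of the server's zoned RDMA buffer pool: buffers are
    grouped by size into zones; a request is served from the smallest
    free buffer that fits it, otherwise split across smaller buffers."""
    def __init__(self, zones):
        # zones: {buffer_size_in_bytes: count_of_free_buffers}
        self.zones = dict(zones)

    def allocate(self, nbytes):
        # Try a single contiguous buffer first.
        for size in sorted(self.zones):
            if size >= nbytes and self.zones[size] > 0:
                self.zones[size] -= 1
                return [size]
        # Otherwise split the transfer across the largest free buffers.
        pieces, remaining = [], nbytes
        for size in sorted(self.zones, reverse=True):
            while remaining > 0 and self.zones[size] > 0:
                self.zones[size] -= 1
                pieces.append(size)
                remaining -= size
        return pieces if remaining <= 0 else None
```

A `None` return models the flow-control case where the request must wait until credits (free buffers) are returned.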
Fast Memory Registration and Deregistration
  Uses pin-down cache and batched deregistration
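A pin-down cache with batched deregistration might look like the sketch below. The registration and deregistration callables stand in for the real verbs calls, and the class/field names are illustrative only.

```python
class PinDownCache:
    """Sketch of a pin-down cache: deregistration is deferred so that a
    buffer reused soon after is a cache hit; when the cache fills, all
    cached registrations are deregistered in one batch."""
    def __init__(self, capacity, deregister_fn):
        self.capacity = capacity
        self.deregister_fn = deregister_fn   # stand-in for batched dereg
        self.pinned = {}                     # (addr, length) -> handle
        self.hits = self.misses = 0

    def register(self, addr, length, register_fn):
        key = (addr, length)
        if key in self.pinned:               # buffer still pinned: reuse
            self.hits += 1
            return self.pinned[key]
        self.misses += 1
        if len(self.pinned) >= self.capacity:
            # Batched deregistration: flush every cached registration at once.
            self.deregister_fn(list(self.pinned.values()))
            self.pinned.clear()
        handle = register_fn(addr, length)
        self.pinned[key] = handle
        return handle
```

The hit rate of such a cache is exactly the "% of pin-down cache hits" metric; for I/O-intensive workloads that rarely reuse buffers, misses dominate and the cache stops paying off.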
(Figure: % of pin-down cache hits)
Chunk List – multidimensional lists that store the locations of multiple buffers
RPC Long Call – long RPCs are broken into chunks
  First message contains the chunk list for the other messages
(Figure: NFS Write, client/server message exchange)
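The long-call mechanism above can be sketched as building the chunk list that rides in the first message. Here the "locations" are simplified to (offset, length) pairs; `build_chunk_list` is an illustrative name.

```python
def build_chunk_list(payload, chunk_size):
    """Sketch of an RPC long call: a large payload is broken into
    chunks, and the first message carries a chunk list describing
    where the remaining data lives."""
    chunk_list = [(off, min(chunk_size, len(payload) - off))
                  for off in range(0, len(payload), chunk_size)]
    first_message = {"chunk_list": chunk_list, "nchunks": len(chunk_list)}
    chunks = [payload[off:off + n] for off, n in chunk_list]
    return first_message, chunks
```

The receiver uses the chunk list from the first message to know how many follow-on transfers to expect and where each one belongs.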
NFS Readdir and Readlink – similar to NFS Read
(Figures: NFS Read client/server message exchanges for the Read-Read and Read-Write designs)
Read-Read design drawbacks
  Server buffers exposed to client RDMA
  Server resources not freed until the client sends RDMA_DONE
  Synchronous RDMA Read causes latency
  Number of concurrent RDMA Reads is limited
Read-Write design advantages
  RPC long replies and NFS Read data can come directly from the server
  Client cannot initiate RDMA and try to access other buffers, so it is more secure
  Mellanox HCA can issue many RDMA Write operations in parallel
  No waiting for RDMA_DONE
  Fewer server interrupts
Fast Memory Registration – registration steps that involve communication with the HCA are done at initialization rather than dynamically
Buffer Registration Cache
  No information about server buffers is exposed
Physical Memory Registration – avoids virtual-to-physical address translation
  The translation also does not need to be sent to the HCA
RDMA_DONE elimination
RDMA Write parallelism
No local scatter/gather, so more RDMA Reads are required. Since simultaneous Reads are capped, parallelism decreases.
Server memory saturates.
MPI-level checkpointing – applies only to MPI; not portable
File-system-level checkpointing – portable and transparent to MPI implementations, stacks, and applications
FUSE – software that allows creating a user-level virtual file system
Berkeley Lab Checkpoint/Restart (BLCR) – writes a process image to a file for later restart
MPI Checkpointing Mechanisms – offered by MVAPICH2, MPICH2, and OpenMPI
  MPI library flushes the communication channel
  BLCR library used to dump a memory snapshot
  BLCR library used to restart the job if needed
VFS Cache Needs Work
Efficient Sequential Writes
File Open – caught by FUSE; CRFS inserts/increments a value in a hash table, then passes the call to the underlying file system
File Close – buffer pool flushed into the work queue; the call blocks until the operations complete
File Sync – complete all writes on the file, then pass fsync() to the underlying file system
Other File Operations – passed through to the underlying file system
File Write
  Data is copied from the write into a chunk in the buffer pool until the chunk is full
  The full chunk is enqueued into the work queue
  This triggers an I/O thread to wake up and write the chunk
  The number of I/O threads is limited to prevent contention
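The write path above can be sketched with a chunk buffer, a work queue, and a bounded I/O thread pool. This is a minimal model, not CRFS code; `ChunkedWriter` and `sink` are hypothetical names.

```python
import queue
import threading

class ChunkedWriter:
    """Sketch of the buffered write path: writes accumulate in a chunk
    until it fills, full chunks are enqueued, and a bounded pool of
    I/O threads drains the queue (bounded to limit contention)."""
    def __init__(self, chunk_size, n_io_threads, sink):
        self.chunk_size = chunk_size
        self.buf = bytearray()
        self.work = queue.Queue()
        self.sink = sink   # stand-in for the real write to the file system
        self.threads = [threading.Thread(target=self._io_loop, daemon=True)
                        for _ in range(n_io_threads)]
        for t in self.threads:
            t.start()

    def write(self, data):
        self.buf.extend(data)
        while len(self.buf) >= self.chunk_size:
            # Enqueueing a full chunk wakes a sleeping I/O thread.
            chunk = bytes(self.buf[:self.chunk_size])
            self.buf = self.buf[self.chunk_size:]
            self.work.put(chunk)

    def close(self):
        if self.buf:
            self.work.put(bytes(self.buf))   # flush the partial chunk
        for _ in self.threads:
            self.work.put(None)              # one stop marker per thread
        for t in self.threads:
            t.join()

    def _io_loop(self):
        while True:
            chunk = self.work.get()
            if chunk is None:
                return
            self.sink(chunk)
```

`close` models the File Close behavior above: the buffer pool is flushed into the work queue and the caller blocks until the I/O threads finish.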