Performance of HPC Middleware over Infiniband WAN • Designing Efficient FTP Mechanisms for High Performance Data-Transfer over Infiniband • High Performance Data Transfer in Grid Environment Using GridFTP over Infiniband • Presented by: Ashish Kumar Singh
Performance of HPC Middleware over Infiniband WAN S. Narravula, H. Subramoni, P. Lai, R. Noronha and D.K. Panda
Motivation • Multi-cluster needs of organizations • Advent of long haul Infiniband (IB WAN) – Infiniband range extenders like Intel Connects and Obsidian Longbows • IB applications and libraries such as MPI and NFS over RDMA were developed for intra-cluster environments
Contributions • Analyzes the general communication performance of HPC middleware • Proposes basic design optimizations for enhancing communication performance over WAN • Demonstrates the potential benefits obtained by enhancing internal protocols of middleware
IB Range Extension • Obsidian Longbows provide range extension for Infiniband fabrics over 10 Gigabits/s WAN
Verbs-level Performance (UD) • UD does not involve any acknowledgements from the remote side • UD is scalable with higher delays • Higher level protocols need to take care of reliability and flow control mechanisms
Verbs-level Performance (RC) • RC guarantees in-order delivery through ACKs and NACKs, which limits the number of messages that can be in flight to a maximum supported window size • Fewer large messages are needed to fill the pipeline, so large messages are less affected
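The window effect on this slide can be illustrated with a back-of-the-envelope bound: if at most W messages may be unacknowledged, throughput cannot exceed W × message size per round trip. The sketch below is an assumption-laden illustration, not a measurement from the paper; the window size (16) and the 2 ms round-trip time are hypothetical, and only the 10 Gb/s link rate comes from the slides.

```python
def rc_throughput_bps(msg_bytes, window_msgs, rtt_s, link_bps):
    """Upper bound on throughput when at most `window_msgs` messages
    may be in flight (unacknowledged) at any time."""
    window_limited_bps = window_msgs * msg_bytes * 8 / rtt_s
    return min(link_bps, window_limited_bps)


LINK_BPS = 10e9   # 10 Gb/s WAN link (from the range-extension slide)
WINDOW = 16       # hypothetical window size, for illustration only
RTT_S = 2e-3      # hypothetical 2 ms round trip

for size in (1024, 64 * 1024, 1024 * 1024):
    bound = rc_throughput_bps(size, WINDOW, RTT_S, LINK_BPS)
    print(f"{size:>8} B messages: at most {bound / 1e9:.3f} Gb/s")
```

Under these assumed numbers, small messages are capped far below the link rate while 1 MB messages are limited only by the link, which matches the trend stated on the slide.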
IPoIB Performance (UD) • TCP needs larger window sizes to achieve good bandwidth • More streams – more UD packets with independent flow control, so more outstanding packets can be pushed out from the source at any given time
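Why larger windows or more streams help can be seen from the bandwidth-delay product: the link is full only when the bytes in flight equal link rate × round-trip time. The following is a minimal sketch; the 10 ms round trip and the 1 MiB per-stream window are hypothetical values chosen for illustration.

```python
import math


def bdp_bytes(link_bps, rtt_s):
    """Bytes that must be in flight to keep the link busy."""
    return link_bps * rtt_s / 8


def streams_needed(link_bps, rtt_s, window_bytes):
    """Parallel streams needed if each stream is capped at `window_bytes`."""
    return math.ceil(bdp_bytes(link_bps, rtt_s) / window_bytes)


link = 10e9            # 10 Gb/s WAN link (from the slides)
rtt = 10e-3            # hypothetical 10 ms round trip
window = 1024 * 1024   # hypothetical 1 MiB TCP window per stream

print(f"bytes in flight needed: {bdp_bytes(link, rtt):,.0f}")
print(f"streams needed at 1 MiB each: {streams_needed(link, rtt, window)}")
```

With these assumed values a single stream would need a window of roughly 12.5 MB, or about a dozen 1 MiB streams would be needed to cover the same amount of outstanding data.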
IPoIB Performance (RC) • Advantage of RC transport mode over IPoIB is that RC can handle larger packet sizes. Larger packet sizes can achieve better bandwidth and per byte TCP processing decreases
MPI-level Performance (Delay) • Trends similar to basic verbs-level evaluation
MPI-level Performance (Tuning) • Protocol choice changes for medium sized messages in high delay scenario • Rendezvous protocol involves an additional message exchange
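One way to see why the protocol choice shifts with delay is a simplified cost model. It assumes (this is not taken from the paper) that the eager protocol pays an extra buffer copy, while rendezvous avoids the copy but pays one extra round trip for its handshake. The copy bandwidth and delays below are hypothetical.

```python
def eager_time(n_bytes, delay_s, net_Bps, copy_Bps):
    """One-way delay + wire time + an assumed extra buffer copy."""
    return delay_s + n_bytes / net_Bps + n_bytes / copy_Bps


def rendezvous_time(n_bytes, delay_s, net_Bps):
    """Handshake (request + reply) adds two one-way delays before the data."""
    return 3 * delay_s + n_bytes / net_Bps


def crossover_bytes(delay_s, copy_Bps):
    """Message size at which both protocols cost the same in this model."""
    return 2 * delay_s * copy_Bps


COPY_BPS = 2e9  # hypothetical memory-copy bandwidth, bytes/s

for delay in (10e-6, 100e-6, 1e-3):
    threshold = crossover_bytes(delay, COPY_BPS)
    print(f"one-way delay {delay * 1e6:6.0f} us: eager wins below ~{threshold:,.0f} B")
```

In this model the break-even size grows linearly with the one-way delay, which is consistent with moving medium-sized messages to the eager protocol when the WAN delay is high.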
MPI-level Performance (Streams) a) 100 us delay b) 1 ms delay c) 10 ms delay • For small messages, messaging rate increases proportionally with number of communicating streams • For higher delay networks, additional parallel streams are better for overall network bandwidth utilization
MPI-level Performance (Collective) a) 10 us delay b) 100 us delay c) 1000 us delay • Simple optimized broadcast that performs the bcast operation hierarchically over the two connected clusters, minimizing the traffic on the WAN • For small messages, as the WAN link is able to handle all the traffic, the congestion is very minor
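The saving in WAN traffic can be sketched by counting how many broadcast messages cross the WAN link. The example assumes a binomial-tree broadcast as the flat baseline and ranks placed contiguously per cluster; both are illustrative assumptions rather than details taken from the paper.

```python
def binomial_edges(ranks):
    """(sender, receiver) pairs of a binomial-tree broadcast rooted at ranks[0]."""
    edges = []
    have = 1  # number of ranks that already hold the data
    n = len(ranks)
    while have < n:
        for i in range(min(have, n - have)):
            edges.append((ranks[i], ranks[i + have]))
        have *= 2
    return edges


def hierarchical_edges(cluster_a, cluster_b):
    """One WAN message between cluster leaders, then a local tree per cluster."""
    wan_hop = [(cluster_a[0], cluster_b[0])]
    return wan_hop + binomial_edges(cluster_a) + binomial_edges(cluster_b)


def wan_crossings(edges, cluster_of):
    """Count messages whose sender and receiver sit in different clusters."""
    return sum(1 for src, dst in edges if cluster_of[src] != cluster_of[dst])


cluster_a = [0, 1, 2, 3]
cluster_b = [4, 5, 6, 7]
cluster_of = {r: "A" for r in cluster_a} | {r: "B" for r in cluster_b}

flat = binomial_edges(cluster_a + cluster_b)
hier = hierarchical_edges(cluster_a, cluster_b)
print("flat WAN crossings:", wan_crossings(flat, cluster_of))
print("hierarchical WAN crossings:", wan_crossings(hier, cluster_of))
```

Both schedules send the same total number of messages; the hierarchical one simply keeps all but one of them inside a cluster.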
Conclusions • Applications usually absorb smaller network delay fairly well • Many protocols get severely impacted in high delay scenarios • Protocols can be optimized for high delay scenarios to improve the performance • With long-haul IB WAN technology cluster-of-clusters architecture for HPC systems is feasible
Designing Efficient FTP Mechanisms for High Performance Data-Transfer over Infiniband P. Lai, H. Subramoni, S. Narravula, A. Mamidala and D.K. Panda
Motivation • FTP - most popular method to transfer bulk data • Typically used in applications like data staging, content replication and remote site backup • Advent of long haul Infiniband (IB WAN) made cluster-of-cluster architecture possible • IPoIB and SDP lose significant native performance
Possible Approaches • Existing sockets based FTP through intermediate drivers (#1, #2 and #3). IPoIB and SDP are the popular schemes for this choice. • #4, new FTP mechanism using the Native IB features.
Performance of Communication Protocols • Native IB verbs achieve much higher bandwidth than the other protocols. • Performance for FTP, e.g., GridFTP, using IPoIB and SDP is even worse.
Contributions • Design an Advanced Data Transfer Service (ADTS) that leverages zero-copy capabilities • Leverage ADTS to design a high performance zero-copy FTP library • Provide a robust and inter-operable mechanism to support zero-copy capable clients and the traditional TCP/UDP clients • Performance study
FTP-ADTS Architecture • Clients may be capable of performing zero-copy data transfer or may only support TCP/UDP based communication. • Once the transport protocol is negotiated, the Data Connection Management component initiates a connection.
Design of Zero-Copy Channel • Memory Semantics using RDMA vs. Channel semantics using Send-Recv • Drawbacks of Memory Semantics: – Pre-allocation, registration and communication of target RDMA buffers – Explicit flow control – Notification of completion – Latency benefits for small messages are marred by high network delay
Design of Zero-Copy Channel • Advantages of Send-Recv Semantics: – Identical zero-copy benefits – Simpler flow control, with use of SRQ – Sender is not throttled down due to lack of buffers on remote node – Both RC and UD transports available
Design Enhancements • Buffer/File Management component keeps a small set of pre-allocated and registered buffers • Memory Registration Cache and Persistent Sessions • Pipelined Data Transfers • Prefork Server to handle bursts of requests
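The idea behind a memory registration cache can be sketched as follows: registering a buffer with the adapter is expensive, so a handle is reused whenever the same buffer is seen again. This is a minimal illustration, not the ADTS implementation; `register` and `deregister` are hypothetical callables standing in for the real registration calls, and the LRU eviction policy is an assumption.

```python
from collections import OrderedDict


class RegistrationCache:
    """LRU cache of registration handles keyed by (address, length)."""

    def __init__(self, register, deregister, capacity=4):
        self._register = register      # hypothetical: (addr, length) -> handle
        self._deregister = deregister  # hypothetical: handle -> None
        self._capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, addr, length):
        key = (addr, length)
        if key in self._entries:
            # Reuse the existing registration and mark it most recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        handle = self._register(addr, length)
        self.misses += 1
        self._entries[key] = handle
        if len(self._entries) > self._capacity:
            # Evict the least recently used registration.
            _, old_handle = self._entries.popitem(last=False)
            self._deregister(old_handle)
        return handle


registered = []    # records every (simulated) expensive registration
deregistered = []


def fake_register(addr, length):
    registered.append((addr, length))
    return f"handle-{len(registered)}"


cache = RegistrationCache(fake_register, deregistered.append, capacity=2)
for _ in range(3):
    cache.get(0x1000, 4096)   # same buffer reused across transfers
print("registrations:", len(registered), "hits:", cache.hits)
```

Persistent sessions complement this: keeping the connection open between transfers is what allows cached registrations to be reused at all.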
Performance • Site Replication over IB WAN using FTP. • FTP-ADTS speeds up data transfer by up to 65%. • Much lower CPU utilization.
Conclusions • Existing TCP or UDP or SCTP based FTP implementations are not suitable for WAN capable interconnects like IB WAN • FTP-ADTS efficiently transfers data by leveraging zero-copy operations of modern interconnects
High Performance Data Transfer in Grid Environment Using GridFTP over Infiniband H. Subramoni, P. Lai, R. Kettimuthu and D.K. Panda
Overview • GridFTP is a high-performance, secure, reliable extension of the standard FTP optimized for WAN • Globus XIO framework, used to design GridFTP, offers easy-to-use interface • The framework hides the complications of communication semantics of underlying devices (network or disk)
Contribution • Combining the ease of use of Globus XIO framework and the high performance achieved through IB • Enhancing the disk I/O performance of the existing ADTS library – By decoupling the network processing from disk I/O operations • Evaluation of the design – micro-benchmark level – applications like Community Climate System Model and ultra scale visualization
Design Issues • Most HPC applications require movement of huge amounts of data – Needs slower hard disks and RAIDs for storage – With low bandwidth provided by TCP/UDP based FTP, this was not an issue – Will be an issue for Globus ADTS XIO • Solution – decoupling of network from disk I/O
Design Changes in ADTS • Introduction of: – multiple threads (read, write and network thread) – set of buffers to stage the data • Read thread prefetches a set of locations from the disk and keeps it ready for the network thread to send over the physical link • How to avoid frequent context switches? • Low and high water marks; the high water mark is set to the max size of the circular buffer • Read thread reads from disk only when the number of available buffers falls below the low water mark
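The water-mark scheme can be sketched with two threads and a shared staging queue. This is a simplified illustration under stated assumptions, not the ADTS code: only the sending side (read thread and network thread) is modeled, a Python list append stands in for the network send, and the names `StagingBuffers`, `reader`, `sender` and `transfer` are hypothetical. The low water mark must be at least 1.

```python
import threading
from collections import deque

_END = object()  # sentinel marking the end of the input


class StagingBuffers:
    def __init__(self, low, high):
        self.low, self.high = low, high   # water marks, in number of buffers
        self.filled = deque()             # buffers ready for the network thread
        self.done = False                 # set once the reader has hit end of file
        self.cond = threading.Condition()


def reader(chunks, st):
    it = iter(chunks)
    exhausted = False
    while not exhausted:
        with st.cond:
            # Sleep until the queue has drained below the low water mark.
            st.cond.wait_for(lambda: len(st.filled) < st.low)
            # Then refill in one burst, up to the high water mark.
            while len(st.filled) < st.high:
                chunk = next(it, _END)
                if chunk is _END:
                    exhausted = True
                    break
                st.filled.append(chunk)
            st.done = exhausted
            st.cond.notify_all()


def sender(st, out):
    while True:
        with st.cond:
            st.cond.wait_for(lambda: st.filled or st.done)
            if not st.filled:
                return  # reader finished and nothing is left to send
            chunk = st.filled.popleft()
            if len(st.filled) < st.low:
                st.cond.notify_all()  # wake the reader only when it has work
        out.append(chunk)  # stand-in for sending over the network


def transfer(chunks, low=2, high=8):
    st, out = StagingBuffers(low, high), []
    threads = [threading.Thread(target=reader, args=(chunks, st)),
               threading.Thread(target=sender, args=(st, out))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


print(len(transfer(list(range(100)))), "chunks transferred in order")
```

Because the reader is woken only when the queue drops below the low water mark and then fills several buffers at once, it is scheduled once per burst rather than once per buffer, which is the context-switch saving the slide refers to.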
Application Level Improvements