Performance of HPC Middleware over Infiniband WAN

Performance of HPC Middleware over Infiniband WAN; Designing Efficient FTP Mechanisms for High Performance Data-Transfer over Infiniband; High Performance Data Transfer in Grid Environment Using GridFTP over Infiniband. Presented by: Ashish Kumar Singh

Performance of HPC Middleware over Infiniband WAN S. Narravula, H. Subramoni, P. Lai, R. Noronha and D.K. Panda

Motivation • Multi-cluster needs of organizations • Advent of long-haul Infiniband (IB WAN) – Infiniband range extenders like Intel Connects and Obsidian Longbows • IB applications and libraries such as MPI and NFS over RDMA were developed for intra-cluster environments

Contributions • Analyzes the general communication performance of HPC middleware • Proposes basic design optimizations for enhancing communication performance over WAN • Demonstrates the potential benefits obtained by enhancing internal protocols of middleware

IB Range Extension • Obsidian Longbows provide range extension for Infiniband fabrics over 10 Gigabits/s WAN

Verbs-level Performance (UD) • UD does not involve any acknowledgements from the remote side • UD is scalable with higher delays • Higher level protocols need to take care of reliability and flow control mechanisms

Verbs-level Performance (RC) • RC guarantees in-order delivery using ACKs and NACKs, which limits the number of messages that can be in flight to the maximum supported window size • Fewer large messages are needed to fill the pipeline, so large messages are less affected
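The window effect on RC can be illustrated with a back-of-the-envelope bound: if at most W messages of size S may be unacknowledged, throughput cannot exceed W × S / RTT. The sketch below is a simplified model, not a measurement from the slides; the window of 16 messages and the 2 ms round trip are hypothetical values chosen only for illustration.

```python
def rc_throughput_bound(window_msgs, msg_bytes, rtt_s, link_bytes_per_s):
    """Upper bound on throughput (bytes/s) when at most `window_msgs`
    messages may be in flight before an acknowledgement arrives."""
    if rtt_s <= 0:
        return link_bytes_per_s
    window_limited = window_msgs * msg_bytes / rtt_s
    return min(link_bytes_per_s, window_limited)


LINK = 1.25e9   # 10 Gbit/s WAN link expressed in bytes/s
WINDOW = 16     # hypothetical in-flight message limit
RTT = 2e-3      # hypothetical 2 ms round-trip time

if __name__ == "__main__":
    for size in (1024, 64 * 1024, 4 * 1024 * 1024):
        bound = rc_throughput_bound(WINDOW, size, RTT, LINK)
        print(f"{size:>8} B messages: at most {bound / LINK:.1%} of link rate")
```

Under these assumed numbers, small messages are window-limited to a small fraction of the link, while multi-megabyte messages reach the link rate, which is the trend the slide describes.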

IPoIB Performance (UD) • TCP needs larger window sizes to achieve good bandwidth • More streams mean more UD packets under independent flow control, so more outstanding packets can be pushed out from the source at any given time
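Why larger windows or more streams help follows from the bandwidth-delay product (BDP): to keep a link busy, roughly bandwidth × RTT bytes must be outstanding. The following sketch computes the BDP and how many parallel streams with a fixed per-stream window would be needed to cover it. The function names and the 1 MiB window are assumptions made for this example.

```python
import math


def bdp_bytes(bandwidth_bits_per_s, rtt_s):
    """Bytes that must be in flight to keep the link full."""
    return bandwidth_bits_per_s * rtt_s / 8


def streams_needed(bandwidth_bits_per_s, rtt_s, window_bytes):
    """Parallel streams required if each stream keeps at most
    `window_bytes` outstanding (always at least one stream)."""
    return max(1, math.ceil(bdp_bytes(bandwidth_bits_per_s, rtt_s) / window_bytes))


if __name__ == "__main__":
    window = 1024 * 1024  # hypothetical 1 MiB TCP window per stream
    for rtt in (10e-6, 1e-3, 10e-3):
        n = streams_needed(10e9, rtt, window)
        print(f"RTT {rtt * 1e3:7.3f} ms: BDP {bdp_bytes(10e9, rtt):12.0f} B, streams {n}")
```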

IPoIB Performance (RC) • The advantage of running IPoIB over the RC transport mode is that RC can handle larger packet sizes • Larger packets achieve better bandwidth because per-byte TCP processing decreases

MPI-level Performance (Delay) • Trends similar to basic verbs-level evaluation

MPI-level Performance (Tuning) • Protocol choice changes for medium sized messages in high delay scenario • Rendezvous protocol involves an additional message exchange
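The shift in protocol choice can be seen in a toy cost model. Assume eager sends the payload immediately but pays an extra copy at the receiver, while rendezvous avoids the copy but needs a handshake (request and reply) before the data moves. Both formulas and the copy bandwidth are simplifying assumptions for illustration, not the model used by any particular MPI library.

```python
def eager_time(n_bytes, delay_s, link_Bps, copy_Bps):
    """One-way delay + wire time + receiver-side copy (assumed model)."""
    return delay_s + n_bytes / link_Bps + n_bytes / copy_Bps


def rendezvous_time(n_bytes, delay_s, link_Bps):
    """Handshake (two extra one-way delays) + zero-copy data transfer."""
    return 3 * delay_s + n_bytes / link_Bps


def crossover_bytes(delay_s, copy_Bps):
    """Message size at which both protocols cost the same in this model:
    the copy cost n/copy_Bps equals the extra 2 * delay of the handshake."""
    return 2 * delay_s * copy_Bps


if __name__ == "__main__":
    LINK, COPY = 1.25e9, 2e9  # bytes/s; COPY is a hypothetical memcpy rate
    for delay in (10e-6, 100e-6, 1e-3):
        kib = crossover_bytes(delay, COPY) / 1024
        print(f"delay {delay * 1e6:6.0f} us: eager wins below ~{kib:.0f} KiB")
```

In this model the eager/rendezvous threshold grows linearly with the WAN delay, so medium-sized messages that would use rendezvous inside a cluster are better sent eagerly over a high-delay link.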

MPI-level Performance (Streams) a) 100 us delay b) 1 ms delay c) 10 ms delay • For small messages, messaging rate increases proportionally with number of communicating streams • For higher delay networks, additional parallel streams are better for overall network bandwidth utilization

MPI-level Performance (Collective) a) 10 us delay b) 100 us delay c) 1000 us delay • Simple optimized broadcast that performs the bcast operation hierarchically over the two connected clusters, minimizing the traffic on the WAN • For small messages, the WAN link is able to handle all the traffic, so congestion is very minor
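To see how a hierarchical broadcast reduces WAN traffic, the sketch below counts the messages that cross between clusters. It compares a flat binomial-tree broadcast rooted at rank 0 with a scheme where the root sends one message to a leader in each remote cluster and everything else stays local. The binomial tree and the block placement of ranks are assumptions for illustration; the slides do not state which flat algorithm was the baseline.

```python
def binomial_bcast_edges(n):
    """(sender, receiver) pairs of a binomial-tree broadcast rooted at rank 0."""
    edges = []
    step = 1
    while step < n:
        for sender in range(step):
            receiver = sender + step
            if receiver < n:
                edges.append((sender, receiver))
        step *= 2
    return edges


def wan_messages_flat(n, cluster_of):
    """Messages of the flat broadcast whose endpoints are in different clusters."""
    return sum(1 for s, r in binomial_bcast_edges(n) if cluster_of(s) != cluster_of(r))


def wan_messages_hierarchical(n, cluster_of):
    """One WAN message per remote cluster; local broadcasts use no WAN."""
    clusters = {cluster_of(rank) for rank in range(n)}
    return len(clusters) - 1


if __name__ == "__main__":
    def block(rank):
        return rank // 8  # ranks 0-7 in cluster A, 8-15 in cluster B

    print("flat:", wan_messages_flat(16, block))
    print("hierarchical:", wan_messages_hierarchical(16, block))
```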

Conclusions • Applications usually absorb smaller network delays fairly well • Many protocols are severely impacted in high-delay scenarios • Protocols can be optimized for high-delay scenarios to improve performance • With long-haul IB WAN technology, a cluster-of-clusters architecture for HPC systems is feasible

Designing Efficient FTP Mechanisms for High Performance Data-Transfer over Infiniband P. Lai, H. Subramoni, S. Narravula, A. Mamidala and D.K. Panda

Motivation • FTP is the most popular method to transfer bulk data • Typically used in applications like data staging, content replication and remote site backup • The advent of long-haul Infiniband (IB WAN) made the cluster-of-clusters architecture possible • IPoIB and SDP lose a significant share of native IB performance

Possible Approaches • Existing sockets based FTP through intermediate drivers (#1, #2 and #3). IPoIB and SDP are the popular schemes for this choice. • #4, new FTP mechanism using the Native IB features.

Performance of Communication Protocols • Native IB verbs achieve much higher bandwidth than the other protocols. • Performance of FTP implementations, e.g., GridFTP, over IPoIB and SDP is even worse.

Contributions • Design an Advanced Data Transfer Service (ADTS) that leverages zero-copy capabilities • Leverage ADTS to design a high performance zero-copy FTP library • Provide a robust and inter-operable mechanism to support zero-copy capable clients and the traditional TCP/UDP clients • Performance study

FTP-ADTS Architecture • Clients may be capable of zero-copy data transfer or may only support TCP/UDP based communication. • Once the transport protocol is negotiated, the Data Connection Management component initiates a connection.

Design of Zero-Copy Channel • Memory semantics using RDMA vs. channel semantics using Send-Recv • Drawbacks of memory semantics: – Pre-allocation, registration and communication of target RDMA buffers – Explicit flow control – Notification of completion – Latency benefits for small messages are marred by high network delay

Design of Zero-Copy Channel • Advantages of Send-Recv Semantics: – Identical zero-copy benefits – Simpler flow control, with use of SRQ – Sender is not throttled down due to lack of buffers on remote node – Both RC and UD transports available

Design Enhancements • Buffer/File Management component keeps a small set of pre-allocated and registered buffers • Memory Registration Cache and Persistent Sessions • Pipelined Data Transfers • Prefork Server to handle bursts of requests
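The idea behind a memory registration cache can be sketched as a small LRU map from buffer ranges to registration handles, so that repeated transfers from the same buffers skip the expensive registration step. In the sketch below, `register` and `deregister` are stand-in callables supplied by the caller (in a verbs program they would wrap calls such as `ibv_reg_mr` and `ibv_dereg_mr`); the class name and its interface are hypothetical and are not taken from the ADTS library.

```python
from collections import OrderedDict


class RegistrationCache:
    """LRU cache of memory registrations keyed by (address, length)."""

    def __init__(self, register, deregister, capacity):
        self._register = register      # stand-in for the real registration call
        self._deregister = deregister  # stand-in for the real deregistration call
        self._capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, addr, length):
        key = (addr, length)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        handle = self._register(addr, length)
        self.misses += 1
        self._entries[key] = handle
        if len(self._entries) > self._capacity:
            _, evicted = self._entries.popitem(last=False)  # least recently used
            self._deregister(evicted)
        return handle


if __name__ == "__main__":
    released = []
    cache = RegistrationCache(lambda a, n: ("mr", a, n), released.append, capacity=2)
    for addr in (0x1000, 0x1000, 0x2000, 0x3000):
        cache.get(addr, 4096)
    print(cache.hits, cache.misses, released)
```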

Performance • Site replication over IB WAN using FTP. • FTP-ADTS speeds up data transfer by up to 65%. • Much lower CPU utilization.

Conclusions • Existing TCP or UDP or SCTP based FTP implementations are not suitable for WAN capable interconnects like IB WAN • FTP-ADTS efficiently transfers data by leveraging zero-copy operations of modern interconnects

High Performance Data Transfer in Grid Environment Using GridFTP over Infiniband H. Subramoni, P. Lai, R. Kettimuthu and D.K. Panda

Overview • GridFTP is a high-performance, secure, reliable extension of the standard FTP optimized for WAN • Globus XIO framework, used to design GridFTP, offers easy-to-use interface • The framework hides the complications of communication semantics of underlying devices (network or disk)

Contribution • Combining the ease of use of Globus XIO framework and the high performance achieved through IB • Enhancing the disk I/O performance of the existing ADTS library – By decoupling the network processing from disk I/O operations • Evaluation of the design – micro-benchmark level – applications like Community Climate System Model and ultra scale visualization

Design Issues • Most HPC applications require movement of huge amounts of data – The data is stored on comparatively slow hard disks and RAID arrays – With the low bandwidth provided by TCP/UDP based FTP, disk speed was not an issue – It becomes an issue for Globus ADTS XIO • Solution – decoupling network processing from disk I/O

Design Changes in ADTS • Introduction of: – multiple threads (read, write and network threads) – a set of buffers to stage the data • The read thread prefetches data from the disk and keeps it ready for the network thread to send over the physical link • How to avoid frequent context switches? – Low and high water marks; the high water mark is set to the maximum size of the circular buffer – The read thread refills only when the number of available buffers falls below the low water mark
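The water-mark scheme can be sketched with two threads sharing a bounded staging area. This is a minimal illustration under stated assumptions: the read thread sleeps until the number of staged buffers drops below the low water mark, then refills in one burst up to the high water mark (the ring capacity), so it wakes once per burst instead of once per buffer. The class and function names are hypothetical, and an in-memory list of chunks stands in for disk reads.

```python
import threading
from collections import deque

_END = object()  # sentinel marking the end of the input


class StagingRing:
    """Bounded staging area with low/high water marks (illustrative only)."""

    def __init__(self, low_water, high_water):
        assert 0 < low_water <= high_water
        self.low_water = low_water
        self.high_water = high_water   # also the ring capacity
        self._buffers = deque()
        self._cond = threading.Condition()
        self._done = False
        self.refill_bursts = 0         # times the read thread woke up to refill

    def fill_from(self, chunks):
        """Read thread: refill in bursts, only when below the low water mark."""
        source = iter(chunks)
        exhausted = False
        while not exhausted:
            with self._cond:
                self._cond.wait_for(lambda: len(self._buffers) < self.low_water)
                self.refill_bursts += 1
                while len(self._buffers) < self.high_water:
                    chunk = next(source, _END)
                    if chunk is _END:
                        exhausted = True
                        break
                    self._buffers.append(chunk)
                self._cond.notify_all()
        with self._cond:
            self._done = True
            self._cond.notify_all()

    def drain(self):
        """Network thread: yield staged chunks until the reader has finished."""
        while True:
            with self._cond:
                self._cond.wait_for(lambda: self._buffers or self._done)
                if not self._buffers:
                    return
                chunk = self._buffers.popleft()
                self._cond.notify_all()
            yield chunk  # "send" happens outside the lock


def transfer(chunks, low_water=2, high_water=8):
    ring = StagingRing(low_water, high_water)
    reader = threading.Thread(target=ring.fill_from, args=(chunks,))
    reader.start()
    received = list(ring.drain())
    reader.join()
    return received, ring.refill_bursts


if __name__ == "__main__":
    data = [bytes([i]) * 4 for i in range(50)]
    received, bursts = transfer(data)
    print(received == data, bursts)
```

With 50 chunks and marks of 2 and 8, the read thread wakes only a handful of times, since each wake-up stages several buffers at once.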

Application Level Improvements
