Cray I/O Software Enhancements



  1. Cray I/O Software Enhancements. Tom Edwards, tedwards@cray.com

  2. Overview
     ● The Cray Linux Environment and parallel libraries provide full support for common I/O standards:
       ● Serial POSIX I/O
       ● Parallel MPI I/O
       ● 3rd-party libraries built on top of MPI I/O (HDF5, NetCDF4)
     ● The Cray versions provide many enhancements over generic implementations and integrate directly with Cray XC30 and Cray Sonexion hardware:
       ● Cray MPI-IO collective buffering, aggregation and data sieving
       ● Automatic buffering and direct I/O for POSIX transfers via IOBUF
     ● This talk explains how to get the best from the enhanced capabilities of the Cray software stack.

  3. Cray MPI-IO Layer: Data Aggregation and Sieving

  4. MPI I/O
     ● The MPI-2.0 standard provides a standardised interface for reading and writing data to disk in parallel, commonly referred to as MPI I/O.
     ● Full integration with other parts of the MPI standard allows users to use derived types to complete complex tasks with relative ease.
     ● Can automatically handle portability concerns such as byte ordering via native and standardised data formats.
     ● Available as part of the cray-mpich library on the XC30, commonly referred to as Cray MPI-IO.
     ● Fully optimised and integrated with the underlying Lustre file system.
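
     For illustration (not from the original slide), a minimal sketch of the MPI I/O API in C, assuming each rank writes a fixed-size block of doubles to an arbitrarily named shared file; the collective write call used here is discussed further in later slides:

       #include <mpi.h>
       #include <stdlib.h>

       int main(int argc, char **argv)
       {
           MPI_File fh;
           int rank, i;
           const int n = 1024;                    /* doubles per rank (arbitrary) */
           double *buf;

           MPI_Init(&argc, &argv);
           MPI_Comm_rank(MPI_COMM_WORLD, &rank);

           buf = malloc(n * sizeof(double));
           for (i = 0; i < n; i++) buf[i] = rank; /* dummy data */

           /* Collective open: every rank in the communicator takes part */
           MPI_File_open(MPI_COMM_WORLD, "output.dat",
                         MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

           /* Each rank writes its block at a simple byte offset (collective call) */
           MPI_File_write_at_all(fh, (MPI_Offset)rank * n * sizeof(double),
                                 buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

           MPI_File_close(&fh);
           free(buf);
           MPI_Finalize();
           return 0;
       }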

  5. Step 1: MPI-IO Hints
     The MPI I/O interface provides a mechanism for passing additional information about how the MPI-IO layer should access files. These are controlled via MPI-IO hints, set either through calls in the MPI API or via an environment variable. All hints can be set on a file-by-file basis. On the Cray XC30 the first and most useful are:
     ● striping_factor – number of Lustre stripes
     ● striping_unit – size of Lustre stripes in bytes
     These set the file's Lustre properties when it is created by an MPI-IO API call.
     * Note: these require MPICH_MPIIO_CB_ALIGN to be set to its default value of 2.

  6. Example: setting Lustre hints in C
     Hints can be added to MPI calls via an Info object when the file is opened using the MPI I/O API. Below is an example in C:

       #include <mpi.h>
       #include <stdio.h>

       MPI_Info info;
       MPI_File fh;
       char factor_string[16], unit_string[16];
       int factor = 4;   // The number of stripes
       int unit   = 4;   // The stripe size in megabytes

       MPI_Info_create(&info);
       sprintf(factor_string, "%d", factor);
       // Multiply the unit up into bytes from megabytes
       sprintf(unit_string, "%d", unit * 1024 * 1024);
       MPI_Info_set(info, "striping_factor", factor_string);
       MPI_Info_set(info, "striping_unit", unit_string);

       // filename is assumed to be declared elsewhere
       MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE | MPI_MODE_RDWR,
                     info, &fh);

  7. Setting hints via environment variables
     Alternatively, hints can be passed externally via the environment variable MPICH_MPIIO_HINTS. Hints can be applied to all files, specific files, or files matching a pattern, e.g.

       # Set all MPI-IO files to 4 x 4m stripes
       MPICH_MPIIO_HINTS="*:striping_factor=4:striping_unit=4194304"

       # Set all .dat files to 8 x 1m stripes
       MPICH_MPIIO_HINTS="*.dat:striping_factor=8:striping_unit=1048576"

       # Set the default to 4 x 4m and all *.dat files to 8 x 1m
       MPICH_MPIIO_HINTS="*:striping_factor=4:striping_unit=4194304,\
       *.dat:striping_factor=8:striping_unit=1048576"

  8. Displaying hints
     The MPI-IO library can print out the "hint" values being used by each file when it is opened. This is controlled by setting the environment variable:

       export MPICH_MPIIO_HINTS_DISPLAY=1

     The report is generated by the PE with rank 0 in the relevant communicator and is printed to stderr:

       PE 0: MPICH/MPIIO environment settings:
       PE 0:   MPICH_MPIIO_HINTS_DISPLAY     = 1
       PE 0:   MPICH_MPIIO_HINTS             = NULL
       PE 0:   MPICH_MPIIO_ABORT_ON_RW_ERROR = disable
       PE 0:   MPICH_MPIIO_CB_ALIGN          = 2
       PE 0: MPIIO hints for file1:
         ...
         direct_io                   = false
         aggregator_placement_stride = -1
         ...

  9. Collective vs independent calls
     ● Opening a file via MPI I/O is a collective operation that must be performed by all members of the supplied communicator.
     ● However, many individual file operations have two versions:
       ● A collective version, which must be performed by all members of the supplied communicator.
       ● An independent version, which can be performed ad hoc by any processor at any time. This is akin to standard POSIX I/O, but with the MPI data-handling syntactic sugar.
     ● It is only during collective calls that the MPI-IO library can perform the required optimisations. Independent I/O is usually no more (or less) efficient than the POSIX equivalents.
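
     As a sketch of the difference (illustrative names, assuming the file was opened collectively elsewhere):

       #include <mpi.h>

       /* Assumes fh was opened collectively and buf holds n doubles on this rank. */
       void write_block(MPI_File fh, MPI_Offset offset, double *buf, int n)
       {
           /* Collective form: every rank in the file's communicator must call it,
              which is what allows the MPI-IO layer to aggregate and align writes. */
           MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

           /* Independent form: any rank may call it at any time, but the library
              cannot combine it with other ranks' requests:
              MPI_File_write_at(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE); */
       }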

  10. Collective Buffering & Data Sieving

  11. Writing a simple data structure to disk
     Consider a simple 1D parallel decomposition. MPI I/O allows parallel data structures distributed across ranks to be stored in a single file with a simple offset mapping. However, exactly matching this distribution to Lustre's stripe alignment is difficult to achieve.
     [Figure: header data followed by the data of Ranks 0-3 written contiguously into one file; the Lustre stripe boundaries striping the file across OSTs 0-2 do not line up with the per-rank blocks.]
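
     A hedged sketch of such an offset mapping in C, assuming a header of HEADER_BYTES (written separately by rank 0, not shown) and n doubles per rank:

       #include <mpi.h>

       #define HEADER_BYTES 256   /* size of the file header (assumed for illustration) */

       /* Each rank writes its contiguous block after the shared header. */
       void write_1d(MPI_Comm comm, const char *fname, double *local, int n)
       {
           MPI_File fh;
           int rank;

           MPI_Comm_rank(comm, &rank);
           MPI_File_open(comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                         MPI_INFO_NULL, &fh);

           /* Skip the header region; the data view starts HEADER_BYTES into the file. */
           MPI_File_set_view(fh, HEADER_BYTES, MPI_DOUBLE, MPI_DOUBLE,
                             "native", MPI_INFO_NULL);

           /* Simple offset mapping: rank i's block starts at element i*n. */
           MPI_File_write_at_all(fh, (MPI_Offset)rank * n, local, n,
                                 MPI_DOUBLE, MPI_STATUS_IGNORE);

           MPI_File_close(&fh);
       }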

  12. Recap: Optimising Lustre Performance
     Lustre's performance comes from parallelism, with many writers/readers to/from many Object Storage Targets (OSTs). MPI I/O offers good parallelism, with each rank potentially able to write its own data into a file. However, for large jobs #writers >> #OSTs, and each rank may write to more than one OST. This can cause Lustre lock contention that slows access.
     [Figure: many ranks writing into a single file, with potential lock contention points where writes from different ranks land on the same OST.]

  13. Collective Buffering and Lustre Stripe Alignment
     To limit the number of writers, the MPI-IO library will assign and automatically redistribute data to a subset of "collective buffering" or "aggregator" nodes during a collective file operation. By default, the number of collective buffering nodes will be the same as the Lustre striping factor, to get the maximum benefit from Lustre stripe alignment. Each collective buffering node will attempt to write data to only a single Lustre OST.
     [Figure: ranks forward their data to the collective buffering nodes, which then write the single shared file.]
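
     The aggregation can also be steered per file through further hints. The sketch below is illustrative and uses the ROMIO-style hint names (cb_nodes, romio_cb_write) that the hint display lists; check the MPICH_MPIIO_HINTS_DISPLAY output on your system before relying on them:

       #include <mpi.h>

       /* Illustrative sketch: build an Info object that requests 8 aggregator
          nodes and forces collective buffering for writes.  Values are examples. */
       MPI_Info make_cb_hints(void)
       {
           MPI_Info info;
           MPI_Info_create(&info);
           MPI_Info_set(info, "cb_nodes", "8");            /* number of aggregators        */
           MPI_Info_set(info, "romio_cb_write", "enable"); /* always collectively buffer writes */
           return info;                                    /* pass to MPI_File_open */
       }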

  14. Automatic Lustre stripe alignment
     [Figure: the data of Ranks 0-3 is gathered onto collective buffering ranks *0, *1 and *2, each of which writes only to its own OST (0, 1 or 2).]

  15. Writing structured data to disk
     However, switching to an even slightly more complex decomposition, like a 2D Cartesian one, results in ranks having to perform non-contiguous file operations.
     [Figure: a 2x2 Cartesian decomposition (Ranks 0-3) mapped onto a file striped across OSTs 0-2; each rank's rows are scattered through the file.]
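
     One standard way to hand such a non-contiguous pattern to the library in a single collective call is a subarray filetype. The sketch below is illustrative, assuming a GY x GX global array decomposed into LY x LX blocks, with this rank's block starting at (y0, x0):

       #include <mpi.h>

       /* Describe this rank's LY x LX block of a GY x GX global array as a
          filetype, so one collective write hands the whole non-contiguous
          pattern to the MPI-IO layer in a single operation. */
       void write_2d_block(MPI_Comm comm, const char *fname, double *local,
                           int GY, int GX, int LY, int LX, int y0, int x0)
       {
           MPI_File fh;
           MPI_Datatype filetype;
           int gsizes[2] = { GY, GX };   /* global array extent              */
           int lsizes[2] = { LY, LX };   /* this rank's block                */
           int starts[2] = { y0, x0 };   /* block origin in the global array */

           MPI_Type_create_subarray(2, gsizes, lsizes, starts, MPI_ORDER_C,
                                    MPI_DOUBLE, &filetype);
           MPI_Type_commit(&filetype);

           MPI_File_open(comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                         MPI_INFO_NULL, &fh);
           MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

           /* Collective write: the library can gather the scattered rows onto
              the aggregators and write large, stripe-aligned chunks. */
           MPI_File_write_all(fh, local, LY * LX, MPI_DOUBLE, MPI_STATUS_IGNORE);

           MPI_File_close(&fh);
           MPI_Type_free(&filetype);
       }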

  16. Data Sieving
     ● "Read/write gaps" occur when the data is not accessed contiguously from the file.
     ● This limits the total bandwidth, as each access requires a separate call and may cause additional seek time on HDD storage.
     ● Overall performance can be improved by minimising the number of read/write gaps.
     ● The Cray MPI-IO library will attempt to use data sieving to automatically combine multiple smaller operations into fewer larger operations.
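
     Data sieving can be tuned per file as well. The sketch below is illustrative and uses the ROMIO-style hints (romio_ds_write, ind_wr_buffer_size) that appear in the hint display; confirm the names and defaults via MPICH_MPIIO_HINTS_DISPLAY on your system:

       #include <mpi.h>

       /* Illustrative sketch: enable data sieving for writes and give it a
          4 MiB working buffer.  Values are examples, not recommendations. */
       MPI_Info make_ds_hints(void)
       {
           MPI_Info info;
           MPI_Info_create(&info);
           MPI_Info_set(info, "romio_ds_write", "enable");      /* sieve write accesses    */
           MPI_Info_set(info, "ind_wr_buffer_size", "4194304"); /* sieving buffer in bytes */
           return info;   /* pass to MPI_File_open or MPI_File_set_view */
       }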

  17. Strided file access
     Focusing on a single rank, we can see that it will potentially end up writing strided data to each OST. This is likely to incur penalties due to extent locking on each of the OSTs. It also prevents HDD block devices from performing optimally, since they are at their best writing contiguous blocks of data.
     [Figure: one rank's rows from the 2D decomposition land as strided accesses on each of OSTs 0-2.]

  18. Writing structured data to disk
     MPI-IO transposes the data to the optimal Lustre layout.
     [Figure, three panels: data held in the local 2D decomposition on Ranks 0-3; MPI-IO translates it to the 1D file layout on aggregator ranks *0-*2; the aggregators store it to OSTs 0-2.]

  19. Data Sieving
     Data sieving combines smaller operations into larger contiguous ones.
     [Figure, three panels: data held in the local 2D decomposition; MPI-IO translates it to 1D, merging the small accesses on the aggregators; the aggregators store large contiguous chunks to OSTs 0-2.]
