Extending Scalability of Collective IO Through Nessie and Staging

Parallel Data Storage Workshop, November 13, 2011
Jay Lofstead (SNL), Ron Oldfield (SNL), Todd Kordenbrock (HP), Charles Reiss (UCB)

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Motivation
• Collective and two-phase IO are proven beneficial
  – Relatively modest data volumes, 1-D and 2-D decompositions
• But the trade-off of inter-node communication for data reorganization to save IO is not always beneficial
  – Large datasets
  – 3-D domain decompositions are particularly bad

Jay Lofstead, www.lofstead.org, gflofst@sandia.gov
Motivation
• Problem: the technique is central to some IO APIs
  – netCDF4, HDF5
• Problem: changing IO techniques or file formats may not be an option for some applications
  – The CESM climate model is committed to the netCDF file format
• Problem: continued scaling of problem sizes and resolution is making things worse
Solution
• Use an efficient transport layer and data staging transparently in the IO stack
  – Re-implement the native IO API (link-time compatible)
  – Ensure the format on disk is identical
• Requirements
  – Efficient, portable transport layer
  – Staging-area functionality to reduce time to completion
Solution Architecture

[Diagram: two IO stacks compared. Native: Science Application → PnetCDF API → PnetCDF library implementation → Native Storage. Staging: Science Application → PnetCDF API → PnetCDF API Redirector (staging PnetCDF library) → Nessie (NSSI) → Staging Functionality → PnetCDF library implementation → Native Storage.]
Nessie Transport Layer
• Network Scalable Services Interface (Nessie)
  – Originally developed for the Lightweight File Systems project
  – RPC-like asynchronous API layer supporting RDMA
  – Physical-layer support: InfiniBand, Portals, Cray Gemini
  – Server-directed for bulk data
    • Writes: server pulls from the client
    • Reads: server pushes to the client
Staging Functionality
• Collect data packets prior to writing to storage
  – Cache data chunks to afford other optimizations
• Perform data rearrangement
  – Partial data rearrangement, like two-phase IO
• Use different techniques for writing to storage
  – Direct, caching, aggregation
Nessie Performance (Portals)

[Figure: NSSI scaling performance on the Red Storm SeaStar network; throughput (MB/s) and percentage of peak versus bytes per transfer for 1, 4, 16, and 64 clients. Performance of xfer_write_rdma on Red Storm.]
Nessie Performance (InfiniBand)

[Figure: NSSI scaling performance on the Thunderbird InfiniBand network; throughput (MB/s) and percentage of peak versus bytes per transfer for 1, 4, 16, and 64 clients. Performance of xfer_write_rdma on Thunderbird.]
NetCDF Staging Operation
1. Initiate request
2. Start data retrieval
3. Move data
4. Put completion/result to client
5. Process in staging area

[Diagram: request flow between the compute area and the staging area]
NetCDF Staging Functionality
• NetCDF4 and PnetCDF APIs supported
• Direct: synchronous with client calls
• Cache Independent: asynchronous with client calls
• Aggregate Independent: asynchronous with client calls, but aggregates data prior to writing (on node only)
NetCDF Staging Functionality
• Functionality untested for this paper
  – Collective IO versions of cache and aggregate: like the independent versions, but with a maximal number of collective IO calls made for writing
  – Other data manipulation
• A different implementation uses Nessie for hosting data analysis
NetCDF Staging Performance

[Figure: time in seconds versus process count (1024 to 8192 processes), comparing naive PnetCDF, direct via staging, aggregate independent, and caching independent.]

Testing on JaguarPF using the S3D IO kernel
Current Status
• Nessie and NetCDF staging are now part of Trilinos
  – Trios capability area
• Port to BlueGene underway
• Integration of accelerators for staging processing
Future Work
• Finish tests on RedSky to isolate Lustre issues
• Test collective IO routines
• Examine impact on reading performance
• For Nessie, other applications
  – 'In flight' data analysis routines
  – Transactions for resilience in data staging (see our poster!)
  – Hybrid, high-level IO routine complications
    • Exodus + NetCDF