Portable Parallel I/O Handling large datasets in heterogeneous parallel environments May 21, 2014 Michael Stephan
Portable Parallel I/O Part I: HDF5
Learning Objectives At the end of this lesson, you will be able to: get an idea of the HDF5 functionality, create a short example of HDF5 file I/O, and discuss the advantages and disadvantages of HDF5 file I/O. User's Guide (352 pages), Reference Guide (802 pages)
Outline Introduction Introduction to HDF5 Programming Model and APIs General Programming Paradigm Parallel HDF5
Outline Introduction Motivation Terms and Definitions Introduction to HDF5 Programming Model and APIs General Programming Paradigm Parallel HDF5
What is HDF5? I A unique technology suite that makes possible the management of extremely large and complex data collections
What is HDF5? II The HDF5 technology suite includes: A versatile data model that can represent very complex data objects and a wide variety of metadata. A completely portable file format with no limit on the number or size of data objects in the collection. A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces. A rich set of integrated performance features that allow for access time and storage space optimizations. Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection. The HDF5 data model, file format, API, library, and tools are open and distributed without charge.
What is HDF5? III Unlimited size, extensibility, and portability HDF5 does not limit the size of files or the size or number of objects in a file. The HDF5 format and library are extensible and designed to evolve gracefully to satisfy new demands. HDF5 functionality and data are portable across virtually all computing platforms and are distributed with C, C++, Java, and Fortran 90 programming interfaces.
What is HDF5? IV General data model HDF5 has a simple but versatile data model. The HDF5 data model supports complex data relationships and dependencies through its grouping and linking mechanisms. HDF5 accommodates many common types of metadata and arbitrary user-defined metadata.
What is HDF5? V Unlimited variety of datatypes HDF5 supports a rich set of pre-defined datatypes as well as the creation of an unlimited variety of complex user-defined datatypes. Datatype definitions can be shared among objects in an HDF5 file, providing a powerful and efficient mechanism for describing data. Datatype definitions include information such as byte order (endianness), size, and floating point representation, to fully describe how the data is stored, ensuring portability to other platforms.
What is HDF5? VI Flexible, efficient I/O HDF5, through its virtual file layer, offers extremely flexible storage and data transfer capabilities. Standard (POSIX), parallel, and network I/O file drivers are provided with HDF5. Application developers can write additional file drivers to implement customized data storage or transport capabilities. The parallel I/O driver for HDF5 reduces access times on parallel systems by reading/writing multiple data streams simultaneously.
What is HDF5? VII Flexible data storage HDF5 employs various compression, extensibility, and chunking strategies to improve access, management, and storage efficiency. HDF5 provides for external storage of raw data, allowing raw data to be shared among HDF5 files and/or applications, and often saving disk space.
What is HDF5? VIII Data transformation and complex subsetting HDF5 enables datatype and spatial transformation during I/O operations. HDF5 data I/O functions can operate on selected subsets of the data, reducing transferred data volume and improving access speed.
Who uses HDF5? Applications that deal with big or complex data Over 200 different types of apps 2+ million product users world-wide Academia, government agencies, industry
Outline Introduction Motivation Terms and Definitions Introduction to HDF5 Programming Model and APIs General Programming Paradigm Parallel HDF5
An HDF5 “file” is a container... ...into which you can put your data objects, together with structures to organize those objects
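The examples in this part use the HDF5 C API. A minimal sketch of creating and closing such a container; the file name is illustrative, and the program can be compiled with the h5cc wrapper or by linking against libhdf5:

```c
#include <hdf5.h>

int main(void)
{
    /* Create a new, empty HDF5 container (truncating any existing file). */
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    if (file < 0) return 1;

    /* ... create groups, datasets, and attributes here ... */

    H5Fclose(file);
    return 0;
}
```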
HDF5 model Groups – provide structure among objects Datasets – where the primary data goes Data arrays Rich set of datatype options Flexible, efficient storage and I/O Attributes, for metadata Other objects Links (point to data in a file or in another HDF5 file) Datatypes (can be stored for complex structures and reused by multiple datasets)
HDF5 Dataset
HDF5 Dataspace Two roles A dataspace contains spatial information about a dataset stored in a file (rank and dimensions); it is a permanent part of the dataset definition A dataspace describes the application’s data buffer and the data elements participating in I/O
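As a sketch of the first role, the following creates a rank-2 dataspace, uses it to define a dataset, and writes a whole application buffer; H5S_ALL selects the full dataspace both in memory and in the file. Names and sizes are illustrative:

```c
#include <hdf5.h>

int main(void)
{
    hsize_t dims[2] = {4, 6};
    int     data[4][6] = {{0}};          /* application data buffer          */

    hid_t file  = H5Fcreate("space.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);   /* rank 2, fixed dims   */
    hid_t dset  = H5Dcreate2(file, "matrix", H5T_NATIVE_INT, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* H5S_ALL: the whole dataspace participates, in memory and in the file. */
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```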
HDF5 Datatype I Datatype – how to interpret a data element Permanent part of the dataset definition Two classes: atomic and compound Can be stored in a file as an HDF5 object (HDF5 committed datatype) Can be shared among different datasets
HDF5 Datatype II HDF5 atomic types normal integer and float user-definable (e.g., 13-bit integer) variable length types (e.g., strings) references to objects/dataset regions enumeration - names mapped to integers array
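A small sketch of two of these variations, a variable-length string type and an enumeration; the type and member names are illustrative:

```c
#include <hdf5.h>

/* Sketch: build (and release) two atomic-type variations. */
static void make_atomic_types(void)
{
    /* Variable-length C string type. */
    hid_t str_type = H5Tcopy(H5T_C_S1);
    H5Tset_size(str_type, H5T_VARIABLE);

    /* Enumeration type: names mapped to integers. */
    hid_t color = H5Tenum_create(H5T_NATIVE_INT);
    int v;
    v = 0; H5Tenum_insert(color, "RED",   &v);
    v = 1; H5Tenum_insert(color, "GREEN", &v);
    v = 2; H5Tenum_insert(color, "BLUE",  &v);

    H5Tclose(color);
    H5Tclose(str_type);
}
```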
HDF5 Datatype III HDF5 compound types Comparable to C structs (“records”) Members can be atomic or compound types
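A sketch of a compound type mirroring a C struct; the struct, its members, and the committed type name are illustrative. Committing the type to the file makes it shareable among datasets, as noted on the previous slide:

```c
#include <hdf5.h>

/* Illustrative record type. */
typedef struct {
    int    id;
    double temperature;
    double pressure;
} sensor_t;

/* Build an HDF5 compound type mirroring the struct and commit it to the
 * file so other datasets can reuse it ("committed datatype"). */
static hid_t make_sensor_type(hid_t file)
{
    hid_t t = H5Tcreate(H5T_COMPOUND, sizeof(sensor_t));
    H5Tinsert(t, "id",          HOFFSET(sensor_t, id),          H5T_NATIVE_INT);
    H5Tinsert(t, "temperature", HOFFSET(sensor_t, temperature), H5T_NATIVE_DOUBLE);
    H5Tinsert(t, "pressure",    HOFFSET(sensor_t, pressure),    H5T_NATIVE_DOUBLE);

    /* Store the datatype itself as a named object in the file. */
    H5Tcommit2(file, "sensor_t", t, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    return t;
}
```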
HDF5 dataset: array of records
Special storage options for dataset
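The slide's figure is not reproduced here. As a sketch of two storage options mentioned earlier (chunking and compression), the following creates a chunked, gzip-compressed dataset through a dataset creation property list; all names and sizes are illustrative:

```c
#include <hdf5.h>

/* Sketch: create a chunked, gzip-compressed 2D dataset. */
static hid_t make_compressed_dataset(hid_t file)
{
    hsize_t dims[2]  = {1024, 1024};
    hsize_t chunk[2] = {64, 64};

    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);        /* store the data in 64x64 chunks   */
    H5Pset_deflate(dcpl, 6);             /* gzip (deflate) compression, level 6 */

    hid_t dset = H5Dcreate2(file, "image", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}
```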
HDF5 Attribute Attribute – data of the form “name = value”, attached to an object by the application Operations similar to dataset operations, but Not extendible No compression or partial I/O Can be overwritten, deleted, added during the “life” of a dataset
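A minimal sketch of attaching a “name = value” attribute to an existing object (dataset or group); the attribute name and value are illustrative:

```c
#include <hdf5.h>

/* Sketch: attach a small string attribute "units" = "kelvin" to an open
 * object (dataset or group) identified by obj. */
static void add_units_attribute(hid_t obj)
{
    hid_t str = H5Tcopy(H5T_C_S1);
    H5Tset_size(str, 7);                           /* "kelvin" + NUL */
    hid_t space = H5Screate(H5S_SCALAR);

    hid_t attr = H5Acreate2(obj, "units", str, space, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, str, "kelvin");

    H5Aclose(attr);
    H5Sclose(space);
    H5Tclose(str);
}
```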
HDF5 Group A mechanism for organizing collections of related objects Every file starts with a root group Similar to UNIX directories / (root) /X /Y /X/temp Can have attributes
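A sketch reproducing the directory-like layout above; the slide does not say whether /X/temp is a group or a dataset, so it is created here as a small dataset for illustration:

```c
#include <hdf5.h>

/* Sketch: build the layout / (root), /X, /Y, and a dataset /X/temp. */
static void build_layout(hid_t file)
{
    hid_t gx = H5Gcreate2(file, "/X", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t gy = H5Gcreate2(file, "/Y", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[1] = {100};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t temp  = H5Dcreate2(file, "/X/temp", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dclose(temp);
    H5Sclose(space);
    H5Gclose(gy);
    H5Gclose(gx);
}
```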
Partial I/O Move just part of a dataset
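A sketch of partial I/O using a hyperslab selection: only a 2x3 block is transferred into a larger 2D dataset, starting at offset (1, 2). The sizes, and the assumption that the dataset is already open, are illustrative:

```c
#include <hdf5.h>

/* Sketch: write a 2x3 block into an open, larger 2D integer dataset. */
static void write_block(hid_t dset)
{
    int     block[2][3] = {{1, 2, 3}, {4, 5, 6}};    /* data in memory        */
    hsize_t start[2] = {1, 2};                       /* offset in the file    */
    hsize_t count[2] = {2, 3};                       /* size of the selection */
    hsize_t mdims[2] = {2, 3};

    /* Select the target region in the file dataspace. */
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* Memory dataspace matching the buffer. */
    hid_t mspace = H5Screate_simple(2, mdims, NULL);

    /* Only the selected elements are transferred. */
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, block);

    H5Sclose(mspace);
    H5Sclose(fspace);
}
```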
Layers – parallel example
Virtual I/O layer
Virtual I/O layer A public API for writing I/O drivers Allows HDF5 to interface to disk, the network, memory, or a user-defined device
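The driver is selected through a file access property list. As a sketch, the following picks the MPI-IO driver used for parallel I/O; it assumes an HDF5 build with parallel support, and the file name is illustrative:

```c
#include <hdf5.h>
#include <mpi.h>

/* Sketch: all ranks open one shared file through the MPI-IO file driver. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);  /* pick the driver */

    hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... collective dataset creation and per-rank hyperslab writes ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```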
Portability and Robustness Runs almost anywhere Linux and UNIX workstations Windows, Mac OS X Big ASC machines, Crays, VMS systems TeraGrid and other clusters Source and binaries available from http://www.hdfgroup.org/HDF5/release/index.html
Other Software The HDF Group HDFView Java tools Command-line utilities Web browser plug-in Regression and performance testing software Parallel h5diff 3rd Party (IDL, MATLAB, Mathematica, PyTables, HDF Explorer, LabView) Communities (EOS, ASC, CGNS) Integration with other software (iRODS, OPeNDAP)
HDF5 software stack
Structure of HDF5 Library
Goals of HDF5 Library Provide flexible API to support a wide range of operations on data. Support high performance access in serial and parallel computing environments. Be compatible with common data models and programming languages. Because of these goals, the HDF5 API is rich and large
Operations Supported by the API Create groups, datasets, attributes, linkages Create complex data types Assign storage and I/O properties to objects Perform complex subsetting during read/write Use a variety of I/O “devices” (parallel, remote, etc.) Transform data during I/O Query about file structure and properties Query about object structure, content, properties
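As a sketch of the query side, the following opens a file and asks a dataset for its rank and dimensions; the file and dataset names are illustrative and match the earlier group example:

```c
#include <hdf5.h>
#include <stdio.h>

/* Sketch: query the rank and dimensions of an existing dataset. */
int main(void)
{
    hid_t file = H5Fopen("example.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/X/temp", H5P_DEFAULT);

    hid_t   space = H5Dget_space(dset);
    int     rank  = H5Sget_simple_extent_ndims(space);
    hsize_t dims[H5S_MAX_RANK];
    H5Sget_simple_extent_dims(space, dims, NULL);

    printf("rank = %d, dims[0] = %llu\n", rank, (unsigned long long)dims[0]);

    H5Sclose(space);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```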