Efficient Scientific Data Management on Supercomputers
Suren Byna, Scientific Data Management Group, LBNL
Scientific Data - Where is it coming from?
▪ Simulations
▪ Experiments
▪ Observations
Life of scientific data
Generation, In situ analysis, Processing, Storage, Analysis, Preservation, Sharing, Refinement
Supercomputing systems
Typical supercomputer architecture (Cori system)
▪ Compute nodes (CN) connected by the Aries high-speed network
▪ Burst buffer blades: 2 x burst buffer node (BB), each with 2x SSD
▪ I/O nodes (ION), each with 2x InfiniBand HCA, connecting to the storage fabric (InfiniBand)
▪ Storage servers: Lustre OSSs/OSTs behind the InfiniBand fabric
Scientific Data Management in supercomputers
▪ Data representation – Metadata, data structures, data models
▪ Data storage – Storing and retrieving data and metadata to and from file systems quickly
▪ Data access – Improving the performance of the data access patterns scientists need
▪ Facilitating analysis – Strategies for helping scientists find meaning in the data
▪ Data transfers – Moving data within a supercomputing system and between different systems
Focus of this presentation
▪ Storing and retrieving data
– Parallel I/O
– Software stack
– Modes of parallel I/O
– Tuning parallel I/O performance
▪ Autonomous data management system
– Proactive Data Containers (PDC) system
– Metadata management service
– Data management service
Trends – Storage system transformation
▪ Conventional: Memory → [I/O gap] → Parallel file system (Lustre, GPFS) → Archival storage (HPSS tape)
▪ Current (e.g., Cori @ NERSC): Memory → [I/O gap] → Shared burst buffer → Parallel file system (Lustre, GPFS) → Archival storage (HPSS tape)
▪ Upcoming (e.g., Aurora @ ALCF): Memory → Node-local storage → Shared burst buffer → Parallel file system → Campaign storage → Archival storage (HPSS tape)
• The I/O performance gap in HPC storage is a significant bottleneck because of slow disk-based storage
• SSDs and new memory technologies are trying to fill the gap, but they increase the depth of the storage hierarchy
Contemporary parallel I/O SW stack
Applications → High-level I/O libraries → I/O middleware → I/O forwarding → Parallel file system → I/O hardware
Parallel I/O software stack
§ I/O libraries: HDF5 (The HDF Group) [LBL, ANL], ADIOS (ORNL), PnetCDF (Northwestern, ANL), NetCDF-4 (UCAR)
• Middleware: POSIX-IO, MPI-IO (ANL)
• I/O forwarding
• File systems: Lustre (Intel), GPFS (IBM), DataWarp (Cray), ...
§ I/O hardware (disk-based, SSD-based, ...)
Stack (top to bottom): Applications → High-level I/O library (HDF5, NetCDF, ADIOS) → I/O middleware (MPI-IO) → I/O forwarding → Parallel file system (Lustre, GPFS, ...) → I/O hardware. A minimal usage sketch follows below.
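To make the layering concrete, here is a minimal sketch (not from the talk; the file name is a placeholder) of a parallel HDF5 program that routes its I/O through MPI-IO. HDF5 sits at the high-level library layer, MPI-IO is the middleware, and the resulting file lands on the parallel file system.

```c
/* Minimal sketch: creating a shared HDF5 file over MPI-IO.
 * File name is illustrative; build with an MPI compiler and parallel HDF5.
 */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Tell HDF5 to perform its I/O through MPI-IO (the middleware layer). */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* All ranks collectively create one shared file on the parallel file system. */
    hid_t file = H5Fcreate("stack_demo.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... create datasets and write data here ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```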
Parallel I/O – Application view
▪ Types of parallel I/O
• 1 writer/reader, 1 file
• N writers/readers, N files (file-per-process)
• N writers/readers, 1 shared file
• M writers/readers, 1 file
– Aggregators
– Two-phase I/O
• M aggregators, M files (file-per-aggregator)
– Variations of this mode
(The shared-file mode is illustrated in the sketch below.)
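Below is a minimal sketch (assumed file name and per-rank element count) of the "N writers, 1 shared file" mode with MPI-IO; the collective write call is what lets the middleware aggregate data through a subset of ranks (two-phase I/O) rather than having every process hit the file system independently.

```c
/* Sketch of the "N writers/readers, 1 shared file" pattern with MPI-IO.
 * Each rank writes a contiguous block at its own offset; the collective
 * call lets MPI-IO funnel the data through a few aggregator ranks.
 */
#include <mpi.h>
#include <stdlib.h>

#define N_ELEMS 1048576   /* elements per rank (assumed value) */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(N_ELEMS * sizeof(double));
    for (int i = 0; i < N_ELEMS; i++) buf[i] = rank;   /* dummy data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared_file.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank's block lands at a non-overlapping offset in the shared file. */
    MPI_Offset offset = (MPI_Offset)rank * N_ELEMS * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, N_ELEMS, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```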
Parallel I/O – System view
▪ Parallel file systems – Lustre and Spectrum Scale (GPFS)
▪ Typical building blocks of parallel file systems
– Storage hardware – HDD or SSD, typically in RAID
– Storage servers (in Lustre, Object Storage Servers [OSS] and Object Storage Targets [OST])
– Metadata servers
– Client-side processes and interfaces
▪ Management
– Stripe files for parallelism (logical view: one file; physical view: stripes spread across OST 0 ... OST 3 over the communication network)
– Tolerate failures
A sketch of requesting striping settings from an application follows below.
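Striping is usually controlled either with file-system tools (e.g., Lustre's lfs setstripe) or by passing hints through the I/O middleware. The sketch below uses ROMIO-style hint names (striping_factor, striping_unit); whether a given MPI implementation and file system honor them is an assumption, and on Lustre the stripe settings only take effect when the file is first created.

```c
/* Sketch: requesting Lustre striping through MPI-IO hints.
 * "striping_factor" / "striping_unit" are ROMIO-style hint names; support
 * depends on the MPI implementation and file system (assumption).
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "64");      /* stripe count (placeholder) */
    MPI_Info_set(info, "striping_unit", "1048576");   /* 1 MB stripe size (placeholder) */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "striped_file.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes as usual ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```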
How to achieve peak parallel I/O performance?
▪ The parallel I/O software stack provides options for performance optimization at each layer
– Application
– HDF5 (alignment, chunking, etc.)
– MPI-IO (enabling collective buffering, sieving buffer size, collective buffer size, collective buffer nodes, etc.)
– Parallel file system (number of I/O nodes, stripe size, enabling prefetching buffer, etc.)
– Storage hardware
▪ Challenge: complex inter-dependencies among SW and HW layers
A sketch of setting knobs at several layers for one file follows below.
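As an illustration of where the different knobs live, the sketch below (all values are placeholders, not recommendations) sets Lustre striping and MPI-IO collective-buffering hints through MPI_Info, and HDF5 alignment through the file access property list.

```c
/* Sketch: setting tunables at several layers of the stack for one file.
 * Hint names are ROMIO-style and implementation-dependent; all numbers
 * are placeholders chosen for illustration only.
 */
#include <mpi.h>
#include <hdf5.h>

hid_t open_tuned_file(const char *name)
{
    /* File-system / middleware layer: MPI-IO hints. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "128");      /* Lustre stripe count */
    MPI_Info_set(info, "striping_unit", "8388608");    /* 8 MB stripe size */
    MPI_Info_set(info, "romio_cb_write", "enable");    /* collective buffering */
    MPI_Info_set(info, "cb_nodes", "32");               /* aggregator count */
    MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16 MB collective buffer */

    /* HDF5 layer: align large objects to the stripe boundary. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
    H5Pset_alignment(fapl, 1048576, 8388608);  /* threshold 1 MB, alignment 8 MB */

    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}
```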
Tuning parameter space – the whole space visualized
– Stripe count: 4, 8, 16, 32, 64, 128
– Stripe size (MB): 1, 2, 4, 8, 16, 32, 64, 128
– cb_nodes: 1, 2, 4, 8, 16, 32, ...
– cb_buffer_size (MB): 1, 2, 4, 8, 16, 32, 64, 128
– alignment: 524288, 1048576
– siv_buf_size (KB): 64, 128, 256, 512
Tuning for writing trillion-particle datasets
▪ Simulation of magnetic reconnection (a space weather phenomenon) with the VPIC code
– 120,000 cores
– 8 arrays (HDF5 datasets)
– 32 TB to 42 TB files at 10 time steps
▪ Extracted I/O kernel
▪ M aggregators writing to 1 shared file
▪ Trial-and-error selection of Lustre file system parameters while scaling the problem size
▪ Reached peak performance in many instances in a real simulation
More details: SC12 and CUG 2013 papers
Tuning combinations are abundant
• Searching through all combinations manually is impractical
• Users, typically domain scientists, should not be burdened with tuning
• Performance auto-tuning has been explored heavily for optimizing matrix operations
• Auto-tuning for parallel I/O is challenging due to the shared I/O subsystem and slow I/O
• Need a strategy to reduce the search space using some knowledge of the system
Our solution: I/O auto-tuning
• Auto-tuning framework to search the parameter space with a reduced number of combinations
• HDF5 I/O library sets the optimization parameters
• H5Tuner: dynamic interception of HDF5 calls (see the interception sketch below)
• H5Evolve:
– Genetic algorithm based selection
– Model-based selection
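The sketch below shows the general LD_PRELOAD interception pattern that a tool like H5Tuner can use to inject tuning parameters without modifying the application; it is not H5Tuner's actual code, and the injected alignment values are placeholders.

```c
/* Sketch of LD_PRELOAD-style interception of H5Fcreate, in the spirit of
 * H5Tuner (not its actual implementation): the wrapper rewrites the
 * file-access property list before forwarding to the real library.
 *
 * Build (illustrative): cc -shared -fPIC -o libintercept.so intercept.c -ldl
 * Run:                  LD_PRELOAD=./libintercept.so ./app
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <hdf5.h>

hid_t H5Fcreate(const char *name, unsigned flags, hid_t fcpl_id, hid_t fapl_id)
{
    /* Look up the real H5Fcreate in the HDF5 library. */
    hid_t (*real_H5Fcreate)(const char *, unsigned, hid_t, hid_t) =
        (hid_t (*)(const char *, unsigned, hid_t, hid_t))
            dlsym(RTLD_NEXT, "H5Fcreate");

    /* Inject a tuning parameter (placeholder values); a real tuner would
     * read these from a configuration produced by the selection component
     * (e.g., H5Evolve). Property-list cleanup is omitted for brevity. */
    hid_t fapl = (fapl_id == H5P_DEFAULT) ? H5Pcreate(H5P_FILE_ACCESS)
                                          : H5Pcopy(fapl_id);
    H5Pset_alignment(fapl, 1048576, 8388608);

    return real_H5Fcreate(name, flags, fcpl_id, fapl);
}
```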
Dynamic model-driven auto-tuning
• Auto-tuning using empirical performance models of I/O, built from an I/O kernel
• Steps
– Training phase: run a training set (controlled by the user) and develop an I/O model
– Pruning phase: evaluate the model over all possible values and select the top-k configurations
– Exploration phase: run the top-k configurations on the HPC and storage system and select the best-performing configuration
– Refitting step: use the measured performance results to refine the performance model
Empirical performance model
• Non-linear regression model: $m(x;\beta) = \sum_{k=1}^{n_b} \beta_k \, \phi_k(x)$
• A linear combination of $n_b$ non-linear, low-degree polynomial basis functions $\phi_k$ and hyper-parameters $\beta$ (selected with a standard regression approach) for a parameter configuration $x$
• For example: $m(x) = \beta_1 + \beta_2 \frac{1}{s} + \beta_3 \frac{1}{a} + \beta_4 \frac{c}{s} + \beta_5 \frac{f}{c} + \beta_6 \frac{f}{s} + \beta_7 \frac{c f}{a}$
• $f$: file size; $a$: number of aggregators; $c$: stripe count; $s$: stripe size; a fit to the data yields $\beta = [10.59,\ 68.99,\ 59.83,\ -1.23,\ 2.26,\ 0.18,\ 0.01]$
A sketch of evaluating this fitted model over a grid of configurations follows below.
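The sketch below evaluates this fitted example model over a small grid of configurations and keeps the best prediction, in the spirit of the pruning step on the previous slide. It assumes, for illustration, that m(x) predicts I/O time (so lower is better) and uses made-up units for the file size.

```c
/* Sketch: evaluate the example fitted model over a grid of configurations
 * and keep the best prediction (model-based pruning).
 * Assumes m(x) predicts I/O time, so lower is better (assumption).
 */
#include <stdio.h>

/* beta = [10.59, 68.99, 59.83, -1.23, 2.26, 0.18, 0.01] from the slide */
static double model(double f, double a, double c, double s)
{
    return 10.59
         + 68.99 * (1.0 / s)
         + 59.83 * (1.0 / a)
         -  1.23 * (c / s)
         +  2.26 * (f / c)
         +  0.18 * (f / s)
         +  0.01 * (c * f / a);
}

int main(void)
{
    double f = 32768.0;                                 /* file size (assumed units) */
    double aggs[]   = {16, 32, 64, 128};                /* number of aggregators */
    double counts[] = {4, 8, 16, 32, 64, 128};          /* stripe count */
    double sizes[]  = {1, 2, 4, 8, 16, 32, 64, 128};    /* stripe size (MB) */

    double best = 1e300, ba = 0, bc = 0, bs = 0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 6; j++)
            for (int k = 0; k < 8; k++) {
                double m = model(f, aggs[i], counts[j], sizes[k]);
                if (m < best) { best = m; ba = aggs[i]; bc = counts[j]; bs = sizes[k]; }
            }

    printf("best predicted config: a=%g c=%g s=%g (m=%g)\n", ba, bc, bs, best);
    return 0;
}
```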
Performance improvement: 4K cores
[Chart: I/O bandwidth (GB/s), default vs. tuned, for the VPIC-IO, VORPAL-IO, and GCRM-IO kernels on Edison, Hopper, and Stampede]
Performance improvement: 8K cores
[Chart: I/O bandwidth (GB/s), default vs. tuned, for VPIC-IO, VORPAL-IO, and GCRM-IO on Edison and Hopper; up to 94x improvement over the default configuration]
Autonomous data management
Storage systems and I/O: current status
▪ Hardware: memory → node-local storage → shared burst buffer → parallel file system → campaign storage → archival storage (HPSS tape)
▪ Usage: data in memory; tune middleware; tune file systems; files in the file system
▪ Software: applications; high-level libraries (HDF5, etc.); I/O middleware (POSIX, MPI-IO); I/O forwarding; parallel file systems
• Challenges
– The multi-level hierarchy complicates data movement, especially if the user has to be involved
– POSIX-IO semantics hinder the scalability and performance of file systems and I/O software