
Scalable Parallel I/O Alternatives for Massively Parallel Partitioned Solver Systems
Jing Fu, Ning Liu, Onkar Sahni, Ken Jansen, Mark Shephard, Chris Carothers
Computer Science Department
Scientific Computation Research Center (SCOREC)
Rensselaer Polytechnic Institute
chrisc@cs.rpi.edu
Acknowledgments:
• Partners: Simmetrix, Acusim, Kitware, IBM
• NSF PetaApps, DOE INCITE, ITR, CTS; DOE: SciDAC-ITAPS, NERI; AFOSR
• Industry: IBM, Northrop Grumman, Boeing, Lockheed Martin, Motorola
• Computer Resources: TeraGrid, ANL, NERSC, RPI-CCNI
Scalable Parallel I/O Alternatives 1

Outline
• Motivating application: CFD
• Blue Gene Platforms
• I/O Alternatives
  – POSIX
  – PMPIO
  – syncIO
  – “reduced blocking” rbIO
• Blue Gene Results
• Summary

PHASTA Flow Solver Parallel Paradigm
• Time-accurate, stabilized FEM flow solver
• Input partitioned on a per-processor basis
• Unstructured mesh “parts” mapped to cores
• Two types of work:
  – Equation formation
    • O(40) peer-to-peer non-blocking comms
    • Overlapping comms with comp
    • Scales well on many machines
  – Implicit, iterative equation solution
    • Matrix assembled on processor ONLY
    • Each Krylov vector is:
      – q = Ap (matrix-vector product)
      – Same peer-to-peer comm of q PLUS
      – Orthogonalize against prior vectors
      – REQUIRES NORMS => MPI_Allreduce
    • This sets up a cycle of global comms. separated by a modest amount of work
[Figure labels: P1, P2, P3]
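The cycle of global communication described on this slide can be illustrated with a small sketch. It is not PHASTA code: the vector is split into "parts" held as plain Python lists, and the hypothetical `allreduce_sum` stands in for MPI_Allreduce. The point is only that every dot product or norm in the orthogonalization step costs one global reduction.

```python
import math

def allreduce_sum(partials):
    """Stand-in for MPI_Allreduce with a sum: combines one value per part."""
    return sum(partials)

def global_dot(x_parts, y_parts):
    # Each part computes a local dot product; one global reduction combines them.
    local = [sum(a * b for a, b in zip(xp, yp)) for xp, yp in zip(x_parts, y_parts)]
    return allreduce_sum(local)

def orthogonalize(q_parts, basis):
    """Gram-Schmidt of q against prior vectors, then normalization.

    Returns the new unit vector (still partitioned) and the number of
    global reductions that were needed.
    """
    reductions = 0
    for v_parts in basis:
        h = global_dot(q_parts, v_parts)  # one global reduction per prior vector
        reductions += 1
        q_parts = [[qi - h * vi for qi, vi in zip(qp, vp)]
                   for qp, vp in zip(q_parts, v_parts)]
    norm = math.sqrt(global_dot(q_parts, q_parts))  # the norm needs one more
    reductions += 1
    q_parts = [[qi / norm for qi in qp] for qp in q_parts]
    return q_parts, reductions

# Two parts, each owning two entries of the vector (illustrative values).
v1 = [[1.0, 0.0], [0.0, 0.0]]
q = [[1.0, 1.0], [0.0, 2.0]]
q_orth, n_reductions = orthogonalize(q, [v1])
```

With k prior vectors the sketch performs k + 1 reductions per new Krylov vector, which is the pattern of global comms separated by modest local work that the slide points to.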

Parallel Implicit Flow Solver – Incompressible Abdominal Aorta Aneurysm (AAA)
Platform: IBM BG/L, RPI-CCNI
Cores    Avg. elems./core   t (secs.)   Scale factor
512      204800             2119.7      1 (base)
1024     102400             1052.4      1.01
2048     51200              529.1       1.00
4096     25600              267.0       0.99
8192     12800              130.5       1.02
16384    6400               64.5        1.03
32768    3200               35.6        0.93
32K parts shows modest degradation due to 15% node imbalance
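The scale-factor column can be reproduced with a few lines of arithmetic. The sketch below assumes the factor is defined as base core-seconds divided by core-seconds at the larger core count; the slide does not state the formula, but this definition matches the listed values to two decimals.

```python
def scale_factors(runs):
    """runs: list of (cores, seconds); the first entry is the base case."""
    base_cores, base_time = runs[0]
    return [(base_cores * base_time) / (cores * t) for cores, t in runs]

# (cores, t in seconds) taken from the BG/L table on this slide.
bgl_runs = [
    (512, 2119.7),
    (1024, 1052.4),
    (2048, 529.1),
    (4096, 267.0),
    (8192, 130.5),
    (16384, 64.5),
    (32768, 35.6),
]

factors = [round(f, 2) for f in scale_factors(bgl_runs)]
```

A factor of 1.0 means perfect strong scaling relative to the 512-core run; values slightly above 1.0 indicate mildly superlinear behavior.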

AAA Adapted to 10^9 Elements: Scaling on Blue Gene/P
# of cores   Rgn imb   Vtx imb   Time (s)   Scaling
32k          1.72%     8.11%     112.43     0.987
128k         5.49%     17.85%    31.35      0.885
New: @ 294,912 cores → 82% scaling
But getting I/O done is a challenge…

Blue Gene/L Layout: CCNI “fen”
• 32K cores / 16 racks
• 12 TB / 8 TB usable RAM
• ~1 PB of disk over GPFS
• Custom OS kernel

Blue Gene/P Layout: ALCF/ANL “Intrepid”
• 163K cores / 40 racks
• ~80 TB RAM
• ~8 PB of disk over GPFS
• Custom OS kernel

Blue Gene/P (vs. BG/L)

Blue Gene I/O Architectures
• Blue Gene/L @ CCNI
  – One 2-core I/O node per 32 compute nodes
  – 32K system has 512 network interfaces at 1 Gbit/sec each
  – I/O nodes connected to 48 GPFS file servers
  – Servers 0, 2, 4, and 6 are metadata servers
  – Server 0 does RAS and other duties
  – 800 TB of storage from 26 IBM DS4200 storage arrays
  – Split into 240 LUNs, each server has 10 LUNs (7 @ 1 MB and 3 @ 128 KB)
  – Peak bandwidth is ~8 GB/sec read and 4 GB/sec write
• Blue Gene/P @ ALCF
  – Similar I/O node to compute node ratio
  – 128 dual-core file servers over Myrinet w/ 4 MB GPFS block size
  – Metadata can be done by any server
  – 16x DDN 9900 → 7.5 PB (raw) storage w/ peak bandwidth of 60 GB/sec
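The BG/L figure of 512 network interfaces follows from the node ratio. The short calculation below is a sketch that assumes two cores per BG/L compute node, which the slide does not state explicitly; the function name is illustrative.

```python
def io_node_count(total_cores, cores_per_node, compute_nodes_per_io_node):
    """Number of I/O nodes for a given core count and node ratio."""
    compute_nodes = total_cores // cores_per_node
    return compute_nodes // compute_nodes_per_io_node

# 32K cores, 2 cores per compute node (assumed), 1 I/O node per 32 compute nodes.
io_nodes = io_node_count(32 * 1024, 2, 32)

# Each I/O node has one 1 Gbit/sec interface; convert the aggregate to GB/sec.
aggregate_gbit_per_s = io_nodes * 1
aggregate_gbyte_per_s = aggregate_gbit_per_s / 8
```

Under these assumptions the aggregate link capacity of the I/O nodes is well above the quoted peak file system bandwidth, so the file servers and storage, rather than the I/O node links, would set the ceiling.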

Non-Parallel I/O: A Bad Approach…
• Sequential I/O:
  – All processes send data to rank 0, and rank 0 writes it to the file
[Figure: Data 0 … Data N-1 held by ranks 0 … N-1, written as Block 0 … Block N-1 of one file]
• Lacks scaling and results in excessive memory use on rank 0
• Must think parallel from the start, but that implies data/file partitioning…

1 POSIX File Per Processor (1PFPP)
• Pros:
  – parallelism, high performance at small core counts
• Cons:
  – lots of small files to manage
  – LOTS OF METADATA
  – stresses the parallel filesystem
  – difficult to read back data from a different number of processes
  – @ 300K cores yields 600K files
    • @ JSC → kernel panic!!
  – PHASTA currently uses this approach…
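A minimal sketch of the 1PFPP pattern makes the file-count problem concrete. Ranks are simulated by a loop rather than real MPI processes, the file naming scheme is hypothetical, and `files_per_rank=2` is an assumption taken from the slide's ratio of 600K files at 300K cores.

```python
import os
import tempfile

def write_one_file_per_rank(directory, num_ranks, files_per_rank, payload):
    """Each simulated rank writes its own files; returns how many files exist."""
    for rank in range(num_ranks):
        for step in range(files_per_rank):
            # Hypothetical naming scheme: one file per (rank, step) pair.
            path = os.path.join(directory, f"rank{rank:06d}.step{step}.dat")
            with open(path, "wb") as f:
                f.write(payload)
    return len(os.listdir(directory))

with tempfile.TemporaryDirectory() as d:
    files_created = write_one_file_per_rank(
        d, num_ranks=64, files_per_rank=2, payload=b"x" * 16
    )
```

The file count grows linearly with the number of ranks, so every create, open, and close becomes a metadata operation the file system must serve.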

New Partitioned Solver Parallel I/O Format
• Assumes data accessed in a coordinated manner
• File: master header + series of data blocks
• Each data block has header and data
• Ex: 4 parts w/ 2 fields per part
• Allows for different processor configs:
  – (1 core @ 4 parts)
  – (2 cores @ 2 parts)
  – (4 cores @ 1 part)
• Allows for 1 to many files to control metadata overheads
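One way to picture a "master header + series of data blocks" file is the sketch below. The byte layout, the magic string, and the offset table are hypothetical choices made for illustration; the slide does not specify the actual on-disk format. It uses the slide's example of 4 parts with 2 fields per part.

```python
import io
import struct

MAGIC = b"PSIOFMT1"               # hypothetical magic string
MASTER = struct.Struct("<8sII")   # magic, number of parts, fields per part
ENTRY = struct.Struct("<QQ")      # per block: file offset, size (header + data)
BLOCK = struct.Struct("<IIQ")     # per block header: part id, field id, data bytes

def write_file(f, blocks, num_parts, num_fields):
    """blocks maps (part, field) -> bytes. Writes header, offset table, blocks."""
    keys = [(p, fl) for p in range(num_parts) for fl in range(num_fields)]
    offset = MASTER.size + ENTRY.size * len(keys)
    f.write(MASTER.pack(MAGIC, num_parts, num_fields))
    table = []
    for key in keys:
        size = BLOCK.size + len(blocks[key])
        table.append((offset, size))
        offset += size
    for entry in table:
        f.write(ENTRY.pack(*entry))
    for part, field in keys:
        data = blocks[(part, field)]
        f.write(BLOCK.pack(part, field, len(data)))
        f.write(data)

def read_block(f, part, field):
    """Reads one block by looking up its offset in the master header table."""
    f.seek(0)
    magic, _num_parts, num_fields = MASTER.unpack(f.read(MASTER.size))
    assert magic == MAGIC
    f.seek(MASTER.size + ENTRY.size * (part * num_fields + field))
    offset, _size = ENTRY.unpack(f.read(ENTRY.size))
    f.seek(offset)
    p, fl, nbytes = BLOCK.unpack(f.read(BLOCK.size))
    assert (p, fl) == (part, field)
    return f.read(nbytes)

blocks = {(p, fl): f"part{p}-field{fl}".encode() for p in range(4) for fl in range(2)}
buf = io.BytesIO()
write_file(buf, blocks, num_parts=4, num_fields=2)
```

Because each block is located through the table, a reader with one, two, or four processes can pick up whichever parts it owns, which is the flexibility in processor configuration that the slide lists.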

MPI_File alternatives: PMPIO
• PMPIO → “poor man’s parallel I/O” from the “silo” mesh and field library
• Divides app into groups of writers
• w/i a group only 1 writer at a time to a file
• Passing of a “token” ensures synchronization w/i a group
• Support for HDF5 file format
• Uses MPI_File_read_at / MPI_File_write_at routines
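The token-passing idea can be sketched with threads standing in for MPI ranks. This is not the silo PMPIO API: `pmpio_write`, the in-memory files, and the use of `threading.Event` as the token are all illustrative assumptions, and the sketch assumes the rank count divides evenly into groups.

```python
import io
import threading

def pmpio_write(num_ranks, num_groups, make_payload):
    """One file per group; within a group the token lets one rank write at a time."""
    ranks_per_group = num_ranks // num_groups
    files = [io.BytesIO() for _ in range(num_groups)]
    tokens = [threading.Event() for _ in range(num_ranks)]

    def rank_main(rank):
        group, position = divmod(rank, ranks_per_group)
        tokens[rank].wait()                      # block until the token arrives
        files[group].write(make_payload(rank))   # exclusive access to the group file
        if position + 1 < ranks_per_group:
            tokens[rank + 1].set()               # pass the token to the next rank

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for group in range(num_groups):
        tokens[group * ranks_per_group].set()    # first rank of each group starts
    for t in threads:
        t.join()
    return [f.getvalue() for f in files]

group_files = pmpio_write(8, 2, lambda rank: f"[{rank}]".encode())
```

Groups proceed concurrently while writes inside a group are serialized, so the number of groups controls both the file count and the degree of parallelism.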

MPI_File alternatives: syncIO
• Flexible design allows a variable number of files and procs/writers per file
• Within a file, can be configured to write on “block size boundaries”, which are typically 1 to 4 MB
• Implemented using collective I/O routines, e.g., MPI_File_write_at_all_begin
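Writing on block size boundaries amounts to rounding each writer's starting offset up to a multiple of the file system block size. The helper below is a sketch of that arithmetic only; the function name and the `header_bytes` parameter are assumptions, and in an MPI program the resulting offsets would be the ones handed to a collective write-at call.

```python
def aligned_offsets(sizes, block_size, header_bytes=0):
    """Start offset for each writer, rounded up to a block_size boundary."""
    offsets = []
    cursor = header_bytes
    for size in sizes:
        start = -(-cursor // block_size) * block_size  # ceiling to a block multiple
        offsets.append(start)
        cursor = start + size
    return offsets

MB = 1024 * 1024
# Three writers with uneven data sizes and a 4 MB block size.
offsets = aligned_offsets([3 * MB, 5 * MB + 1, 100], block_size=4 * MB)
```

The trade-off is padding: the file grows by the gaps between writers, in exchange for no two writers touching the same file system block.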

MPI_File alternatives: rbIO
• rb → “reduced blocking”
• Targets “checkpointing”
• Divides the application into workers and writers, with 1 writer MPI task per group of workers
• Workers send their I/O data to writers with MPI_Isend and are free to continue
  – i.e., hides the latency of blocking parallel I/O
• Writers then perform a blocking MPI_File_write_at operation using the MPI_COMM_SELF communicator
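The worker/writer split can be sketched with threads and queues in place of MPI tasks and MPI_Isend. Everything here is illustrative: `rbio_checkpoint` is a hypothetical name, a queue `put` plays the role of the non-blocking send, and an in-memory buffer replaces the file written with MPI_File_write_at.

```python
import io
import queue
import threading

def rbio_checkpoint(num_workers, workers_per_writer, make_payload):
    """One writer per group of workers; workers hand off data and return at once."""
    num_writers = num_workers // workers_per_writer
    inboxes = [queue.Queue() for _ in range(num_writers)]
    files = [io.BytesIO() for _ in range(num_writers)]

    def writer_main(w):
        received = {}
        for _ in range(workers_per_writer):
            rank, data = inboxes[w].get()    # collect one message per worker
            received[rank] = data
        for rank in sorted(received):        # then do the blocking write
            files[w].write(received[rank])

    def worker_main(rank):
        # Hand-off only: the worker does not wait for the data to reach storage.
        inboxes[rank // workers_per_writer].put((rank, make_payload(rank)))

    writers = [threading.Thread(target=writer_main, args=(w,)) for w in range(num_writers)]
    workers = [threading.Thread(target=worker_main, args=(r,)) for r in range(num_workers)]
    for t in writers + workers:
        t.start()
    for t in workers + writers:
        t.join()
    return [f.getvalue() for f in files]

checkpoint_files = rbio_checkpoint(6, 3, lambda r: f"<{r}>".encode())
```

From a worker's point of view the checkpoint costs only the hand-off, while the writers absorb the time spent in the blocking write.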

BG/L: 1PFPP w/ 7.7 GB data

BG/L: PMPIO w/ 7.7 GB data
• HDF5: peak 600 MB/sec
• Raw MPI-IO: peak 900 MB/sec

BG/L: syncIO w/ 7.7 GB data
• Write performance: peak 1.3 GB/sec
• Read performance: peak 6.6 GB/sec

BG/P: syncIO w/ ~60 GB data

BG/L: rbIO actual BW w/ 7.7 GB data

BG/L: rbIO perceived BW w/ 7.7 GB data
[Chart annotations: ~22 TB/sec, ~11 TB/sec]

BG/P: rbIO actual BW w/ ~60 GB data
Actual write performance of rbIO with 32,768 procs on Intrepid
[Chart annotation: ~17.9 GB/sec]
Bandwidth (GB/s)   1024 files   512 files   256 files   128 files   64 files   32 files   16 files   1 file
1024 writers       17.91        14.71       13.45       12.85       13.02      12.79      12.79      3.16
512 writers        –            17.49       12.05       10.53       11.39      10.53      10.53      3.20

BG/P: rbIO perceived BW w/ ~60 GB data
Perceived write performance of rbIO with 32,768 procs on Intrepid
[Chart annotation: ~21 TB/sec]
Bandwidth (TB/s)   1024 files   512 files   256 files   128 files   64 files   32 files   16 files   1 file
1024 writers       20.74        19.69       19.69       19.69       19.69      19.69      19.70      19.69
512 writers        –            20.88       19.79       19.78       19.78      19.79      19.79      19.78
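To put the two bandwidth figures side by side, the sketch below converts them into elapsed time. It rests on two assumptions not stated on the slides: that perceived bandwidth means the data volume divided by the time workers spend handing data to the writers, and that the "~60 GB" checkpoint is exactly 60 GB with 1 TB = 1000 GB.

```python
def seconds_for(data_gb, bandwidth_gb_per_s):
    """Elapsed time to move data_gb at the given bandwidth."""
    return data_gb / bandwidth_gb_per_s

data_gb = 60.0  # approximate checkpoint size from the slide title

# 1024 writers, 1024 files: best values reported for Intrepid.
actual_s = seconds_for(data_gb, 17.91)             # writers' view, GB/sec
perceived_s = seconds_for(data_gb, 20.74 * 1000)   # workers' view, TB/sec as GB/sec
```

Under these assumptions the writers spend a few seconds committing the checkpoint while the workers are held up for only a few milliseconds, which is the latency hiding that rbIO targets.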

Related Work
• A. Nisar, W. Liao, and A. Choudhary, “Scaling Parallel I/O Performance through I/O Delegate and Caching System,” in Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 2008.
  – Performs “rbIO” inside MPI via threads, using up to 10% of compute cores as I/O workers
• Benchmark studies (highlighting just a few…)
  – H. Yu et al. [18] – BG/L: 2 GB/sec @ 1K
  – Saini et al. [19] – 512 NEC SX-8 cores – I/O was not scalable when all processors access a shared file
  – Larkin et al. [17] – large performance drop at 2K core count for Cray XT3/XT4
  – Lang et al. [30] – large I/O study across many benchmarks on Intrepid/BG-P. Found 60 GB/s read and 45 GB/s write. In practice, Intrepid has a peak I/O rate of around 35 GB/sec
