  1. Fast Forward I/O & Storage
     Eric Barton, Lead Architect, High Performance Data Division

  2. Department of Energy Fast Forward Challenge
     • FastForward RFP provided US Government funding for exascale research and development
     • Sponsored by 7 leading US national labs
     • Aims to solve the currently intractable problems of exascale to meet the 2020 goal of an
       exascale machine
     • RFP elements were CPU, Memory and Filesystem; Whamcloud won the Filesystem component
       – HDF Group – HDF5 modifications and extensions
       – EMC – Burst Buffer manager and I/O Dispatcher
       – Cray – Test
     • Contract renegotiated on Intel's acquisition of Whamcloud
       – Intel – Arbitrary Connected Graph Computation
       – DDN – Versioning OSD

  3. Exascale I/O technology drivers
                             2012            2020
     Nodes                   10-100K         100K-1M
     Threads/node            ~10             ~1000
     Total concurrency       100K-1M         100M-1B
     Object create           100K/s          100M/s
     Memory                  1-4 PB          30-60 PB
     FS size                 10-100 PB       600-3000 PB
     MTTI                    1-5 days        6 hours
     Memory dump             < 2000 s        < 300 s
     Peak I/O BW             1-2 TB/s        100-200 TB/s
     Sustained I/O BW        10-200 GB/s     20 TB/s
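
     A quick consistency check on these targets: the memory-dump times follow from dividing the
     memory size by the peak I/O bandwidth, e.g. 60 PB / 200 TB/s = 300 s for 2020 and
     4 PB / 2 TB/s = 2000 s for 2012, so checkpoint bandwidth has to scale roughly with machine
     memory even as MTTI shrinks from days to hours.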

  4. Exascale I/O technology drivers
     (Meta)data explosion
     • Many billions of entities – mesh elements / graph nodes
     • Complex relationships
     • UQ ensemble runs – data provenance + quality
     OODB
     • Read/Write -> Instantiate/Persist
     • Fast / ad-hoc search: "Where's the 100-year wave?"
       – Multiple indexes
       – Analysis shipping

  5. Exascale I/O requirements
     Constant failures are expected at exascale.
     • The filesystem must guarantee data and metadata consistency
       – Metadata at one level of abstraction is data to the level below
     • The filesystem must guarantee data integrity
       – Required end-to-end
     • The filesystem must always be available, with balanced recovery strategies
       – Transactional models – fast cleanup after failure
       – Scrubbing – repair / resource recovery that may take days to weeks

  6. Exascale I/O Architecture
     [Diagram: compute nodes of the exascale machine, with NVRAM, connect over the I/O network to
     NVRAM burst-buffer nodes, which connect over the storage network to shared storage – disk-based
     storage and metadata servers – and on to site storage.]

  7. Project Goals
     Storage as a tool of the scientist
     • Manage the explosive growth and complexity of application data and metadata at exascale
       – Support complex / flexible analysis to enable scientists to engage with their datasets
     • Overcome today's filesystem scaling limits
       – Provide the storage performance and capacity exascale science will require
     • Provide unprecedented fault tolerance
       – Designed from the ground up to handle failure as the norm rather than the exception
       – Guarantee data and application metadata consistency
       – Guarantee data and application metadata integrity

  8. I/O stack
     Features & requirements
     • Non-blocking APIs – asynchronous programming models (see the sketch below)
     • Transactional == consistent through failure
     • End-to-end application data & metadata integrity
     • Low latency / OS bypass
     • Fragmented / irregular data
     Layered stack (Application, Query Tools and Application I/O in userspace, then the
     I/O Dispatcher, then DAOS over kernel storage)
     • Application I/O – multiple top-level APIs to support general-purpose or application-specific
       I/O models
     • I/O Dispatcher – match conflicting application and storage object models; manage the NVRAM
       burst buffer / cache
     • DAOS – scalable, transactional global shared object storage
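
     The stack's own interfaces are not spelled out on this slide, so as a rough illustration of
     the non-blocking submit / overlap / reap pattern it calls for, the sketch below uses plain
     POSIX AIO rather than any FastForward API:

        /* Minimal sketch of the non-blocking submit / poll / reap pattern the
         * stack requires, shown with POSIX AIO (not the FastForward APIs). */
        #include <aio.h>
        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            static char buf[1 << 20];                 /* 1 MiB of checkpoint data */
            memset(buf, 0xab, sizeof(buf));

            int fd = open("ckpt.bin", O_CREAT | O_WRONLY, 0644);
            if (fd < 0) { perror("open"); return 1; }

            struct aiocb cb;
            memset(&cb, 0, sizeof(cb));
            cb.aio_fildes = fd;
            cb.aio_buf    = buf;
            cb.aio_nbytes = sizeof(buf);
            cb.aio_offset = 0;

            if (aio_write(&cb) != 0) { perror("aio_write"); return 1; }

            /* The application overlaps computation here instead of blocking. */

            while (aio_error(&cb) == EINPROGRESS)
                ;                                     /* or poll between compute steps */

            ssize_t done = aio_return(&cb);
            printf("wrote %zd bytes asynchronously\n", done);
            close(fd);
            return 0;
        }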

  9. Fast Forward I/O Architecture
     [Diagram: compute nodes run the application, HDF5 VOL, MPI-IO / POSIX and the I/O forwarding
     client; they reach the I/O nodes over the HPC fabric (MPI / Portals). The I/O nodes run the
     I/O forwarding server, I/O Dispatcher, NVRAM burst buffer and Lustre client, and reach the
     Lustre servers (DAOS + POSIX storage) over the SAN fabric (OFED).]

  10. Transactions: I/O Epochs
      Consistency and integrity
      • Guarantee required on any and all failures – a foundational component of system resilience
      • Required at all levels of the I/O stack – metadata at one level is data to the level below
      No blocking protocols
      • Non-blocking on each OSD
      • Non-blocking across OSDs
      I/O epochs demark globally consistent snapshots
      • Guarantee all updates in one epoch are atomic
      • Recovery == roll back to the last globally persistent epoch (see the toy model below)
        – Roll forward using client replay logs for transparent fault handling
      • Cull old epochs when the next epoch is persistent on all OSDs
      [Diagram: updates plotted against time, grouped into epochs.]
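
      As a toy model of the epoch rule above (assumed data structures, not the DAOS protocol): each
      OSD reports the highest epoch it has made durable, the globally persistent epoch is the
      minimum of those, and recovery keeps everything at or below it while rolling newer updates
      forward from client replay logs.

        /* Toy model of epoch-based recovery: updates are grouped by epoch, and an
         * epoch only "counts" once every OSD reports it persistent. */
        #include <stdint.h>
        #include <stdio.h>

        #define NUM_OSDS 4

        /* Highest epoch each OSD has made durable. */
        static uint64_t osd_persistent[NUM_OSDS] = { 7, 7, 6, 7 };

        /* The last globally persistent epoch is the minimum across OSDs. */
        static uint64_t last_global_epoch(void)
        {
            uint64_t min = osd_persistent[0];
            for (int i = 1; i < NUM_OSDS; i++)
                if (osd_persistent[i] < min)
                    min = osd_persistent[i];
            return min;
        }

        struct update { uint64_t epoch; const char *what; };

        int main(void)
        {
            struct update log[] = {
                { 6, "write A" }, { 7, "write B" }, { 8, "write C (in flight)" },
            };
            uint64_t safe = last_global_epoch();   /* 6 here: the third OSD lags behind */

            /* On failure: roll back to the last globally persistent epoch, then
             * roll forward newer updates from client replay logs. */
            for (size_t i = 0; i < sizeof(log) / sizeof(log[0]); i++) {
                if (log[i].epoch <= safe)
                    printf("keep   epoch %llu: %s\n",
                           (unsigned long long)log[i].epoch, log[i].what);
                else
                    printf("replay epoch %llu: %s\n",
                           (unsigned long long)log[i].epoch, log[i].what);
            }
            return 0;
        }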

  11. I/O stack: applications and tools
      Query, search and analysis
      • Index maintenance
      • Data browsers, visualizers, editors
      • Analysis shipping – move I/O-intensive operations to the data
      Application I/O
      • Non-blocking APIs
      • Function shipping CN/ION
      • End-to-end application data/metadata integrity
      • Domain-specific API styles
        – HDFS, POSIX, ...
        – OODB, HDF5, ...
        – Complex data models

  12. HDF5 Application I/O
      DAOS-native storage format
      • Built-for-HPC storage containers
      • Leverage I/O Dispatcher / DAOS capabilities
      • End-to-end metadata + data integrity
      New application capabilities (sketched below)
      • Asynchronous I/O
        – Create/modify/delete objects
        – Read/write dataset elements
      • Transactions
        – Group many API operations into a single transaction
      Data model extensions
      • Pluggable indexing + query language
      • Pointer datatypes
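
      The actual FastForward HDF5 extensions are not shown in the slides, so the following is only
      a hypothetical sketch of grouping asynchronous operations into one transaction; ff_tx_*,
      ff_dset_*_async and ff_request_wait are invented stand-ins, not real HDF5 or VOL functions:

        /* Hypothetical sketch only: every name below stands in for the
         * FastForward HDF5 extensions and is not a real API. */
        #include <stddef.h>
        #include <stdint.h>

        typedef struct ff_tx      ff_tx_t;       /* a transaction handle       */
        typedef struct ff_request ff_request_t;  /* an async completion handle */

        ff_tx_t      *ff_tx_begin(const char *container, uint64_t epoch);
        ff_request_t *ff_dset_create_async(ff_tx_t *tx, const char *name);
        ff_request_t *ff_dset_write_async(ff_tx_t *tx, const char *name,
                                          const void *buf, size_t nbytes);
        int           ff_request_wait(ff_request_t *req);
        int           ff_tx_commit(ff_tx_t *tx);  /* all-or-nothing at this epoch */

        void checkpoint(const double *field, size_t n)
        {
            /* Every operation below lands in one transaction: either the whole
             * group becomes visible at the epoch, or none of it does. */
            ff_tx_t *tx = ff_tx_begin("/projects/HPC/sim.h5", /*epoch=*/42);

            ff_request_t *r1 = ff_dset_create_async(tx, "pressure");
            ff_request_t *r2 = ff_dset_write_async(tx, "pressure",
                                                   field, n * sizeof(double));
            ff_request_wait(r1);
            ff_request_wait(r2);

            ff_tx_commit(tx);
        }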

  13. I/O Dispatcher
      I/O rate / latency / bandwidth matching
      Burst buffer / prefetch cache (see the model below)
      • Absorb peak application load
      • Sustain global storage performance
      • Layout optimization
        – Application object aggregation / sharding
        – Upper layers provide expected usage
      Higher-level resilience models
      • Exploit redundancy across storage objects
      Scheduler integration
      • Pre-staging / post-flushing
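
      A back-of-the-envelope model of the rate matching, using the bandwidth targets from slide 3;
      the split between what the application waits for and what is drained in the background is the
      point, not the exact numbers:

        /* Burst-buffer rate matching: the application only waits for the NVRAM
         * absorb, while the drain to global storage overlaps the next compute
         * phase.  Figures taken from the exascale targets on slide 3. */
        #include <stdio.h>

        int main(void)
        {
            double burst_tb     = 60000.0;  /* ~60 PB memory dump             */
            double nvram_tbps   = 200.0;    /* aggregate burst-buffer ingest  */
            double storage_tbps = 20.0;     /* sustained global storage BW    */
            double compute_s    = 3600.0;   /* time until the next checkpoint */

            double absorb_s = burst_tb / nvram_tbps;    /* what the app sees     */
            double drain_s  = burst_tb / storage_tbps;  /* hidden behind compute */

            printf("application stalls for %.0f s (absorb into NVRAM)\n", absorb_s);
            printf("drain to global storage takes %.0f s, %s the %.0f s compute phase\n",
                   drain_s, drain_s <= compute_s ? "hidden inside" : "longer than",
                   compute_s);
            return 0;
        }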

  14. DAOS Containers
      Distributed Application Object Storage
      • Sharded transactional object storage
      • Virtualizes underlying object storage
      • Private object namespace / schema
      Share-nothing create/destroy, read/write (see the placement sketch below)
      • 10s of billions of objects
      • Distributed over thousands of servers
      • Accessed by millions of application threads
      ACID transactions
      • Defined state on any/all combinations of failures
      • No scanning on recovery
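
      A minimal sketch of the share-nothing idea, assuming simple hash placement (the slide does
      not describe DAOS's real layout): each object ID maps to exactly one container shard, so
      creates and writes touch a single server without cross-server coordination.

        /* Minimal sketch of share-nothing object placement across container
         * shards; the hash and shard count are assumptions, not DAOS's layout. */
        #include <stdint.h>
        #include <stdio.h>

        #define NUM_SHARDS 4096   /* container shards spread over the servers */

        /* 64-bit mix (splitmix64 finalizer) so nearby object IDs spread out. */
        static uint64_t mix64(uint64_t x)
        {
            x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
            x ^= x >> 27; x *= 0x94d049bb133111ebULL;
            return x ^ (x >> 31);
        }

        static unsigned shard_of(uint64_t object_id)
        {
            return (unsigned)(mix64(object_id) % NUM_SHARDS);
        }

        int main(void)
        {
            /* Each thread can create/write its own objects against one shard,
             * independently of every other shard. */
            for (uint64_t oid = 1000; oid < 1005; oid++)
                printf("object %llu -> shard %u\n",
                       (unsigned long long)oid, shard_of(oid));
            return 0;
        }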

  15. DAOS Container
      [Diagram: a container inode (UID, perms, etc.) carries the container FID, its parent FID and
      the FIDs of the container shards; each shard holds its parent FID, shard metadata (space
      etc.) and an object index; each object holds its parent FID, object metadata (size etc.) and
      the data.]

  16. Versioning OSD
      DAOS container shards
      • Space accounting
      • Quota
      • Shard objects
      Transactions
      • Container shard versioned by epoch
        – Implicit commit – an epoch becomes durable when globally persistent
        – Explicit abort – rollback to a specific container version
      • Out-of-epoch-order updates
      • Version metadata aggregation

  17. Versioning with CoW
      • New epoch directed to a clone
      • Cloned extents freed when no longer referenced
      • Requires epochs to be written in order

  18. Versioning with an intent log
      • Out-of-order epoch writes are logged
      • The log is "flattened" into a CoW clone on epoch close (see the sketch below)
      • Keeps the storage system eager
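
      A small self-contained sketch of the intent-log idea (the structures are illustrative, not
      the versioning OSD's on-disk format): out-of-order writes for an open epoch are appended to a
      log, and on epoch close the log is sorted and applied onto the copy-on-write clone in one
      pass.

        /* Sketch of flattening an intent log into a CoW clone on epoch close. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        struct intent { unsigned epoch; size_t offset; char data; };

        static int by_epoch_then_offset(const void *a, const void *b)
        {
            const struct intent *x = a, *y = b;
            if (x->epoch != y->epoch)   return x->epoch < y->epoch ? -1 : 1;
            if (x->offset != y->offset) return x->offset < y->offset ? -1 : 1;
            return 0;
        }

        int main(void)
        {
            char clone[8];
            memset(clone, '.', sizeof(clone));     /* CoW clone of the prior epoch */

            /* Writes arrived out of epoch order and were only logged. */
            struct intent log[] = {
                { 9, 5, 'C' }, { 8, 1, 'A' }, { 9, 1, 'B' },
            };
            size_t n = sizeof(log) / sizeof(log[0]);

            /* Epoch close: flatten the log onto the clone, later epochs applied
             * last so they win, then the log can be discarded. */
            qsort(log, n, sizeof(log[0]), by_epoch_then_offset);
            for (size_t i = 0; i < n; i++)
                clone[log[i].offset] = log[i].data;

            printf("flattened clone: %.8s\n", clone);   /* ".B...C.." */
            return 0;
        }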

  19. Server Collectives
      • Collective client eviction
        – Enables non-local / derived attribute caching (e.g. SOM)
      • Collective client health monitoring
        – Avoids "ping" storms
      • Global epoch persistence
        – Enables distributed transactions (SNS)
      • Spanning tree (see the fan-out sketch below)
        – Scalable O(log n) latency for collectives and notifications
        – Discovery & establishment via gossip protocols and accrual failure detection
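
      A rough sketch of why the spanning tree gives O(log n) latency (the fan-out degree is an
      assumed value): each round, every server that already holds a notification forwards it
      onward, so coverage multiplies by the fan-out until all servers are reached.

        /* Spanning-tree fan-out: with degree k, roughly k^r servers are reached
         * after r rounds, so covering n servers takes about log_k(n) rounds. */
        #include <stdio.h>

        int main(void)
        {
            unsigned long long servers = 100000;  /* exascale-era server count */
            unsigned fanout = 8;                  /* illustrative tree degree  */

            unsigned rounds = 0;
            unsigned long long reached = 1;       /* the originating server */
            while (reached < servers) {
                reached *= fanout;                /* coverage multiplies each round */
                rounds++;
            }
            printf("%llu servers notified in %u rounds at fan-out %u\n",
                   servers, rounds, fanout);
            return 0;
        }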

  20. Exascale filesystem
      Integrated I/O stack
      • Epoch transaction model
      • Non-blocking scalable object I/O
      • High-level application object I/O model
      • I/O forwarding
      • Burst buffer management
      • Impedance-match application I/O performance to storage system capabilities
      • Conventional namespace for administration, security & accounting
      • DAOS container files for transactional, scalable object I/O
      [Diagram: a /projects namespace with /Legacy, /HPC and /BigData subtrees; the legacy tree
      holds a POSIX striped file (blocks a, b, c), the HPC tree an HDF5 container of simulation
      data with OODB metadata, and the BigData tree MapReduce block-sequence data; beneath them sit
      I/O Dispatcher data objects stored in DAOS.]

  21. Thank You
