DataMods: Programmable File System Services
Noah Watkins*, Carlos Maltzahn, Scott Brandt (UC Santa Cruz; *Inktank)
Adam Manzanares (California State University, Chico)
Talk Agenda
1. Middleware and modern I/O stacks
2. Services in middleware and parallel file systems
3. Avoid duplicating work with DataMods
4. Case study: checkpoint/restart
Why DataMods?
• Applications struggle to scale on POSIX I/O
• Parallel file systems rarely provide other interfaces
  – POSIX I/O was designed to prevent lock-in
• Open-source parallel file systems are now available
  – The ability to avoid lock-in
• Can we generalize PFS services to provide new behavior to new users?
Application Middleware
• Complex data models and interfaces
• Difficult to work directly with a simple byte stream
• Middleware maps the complex onto the simple
Middleware Complexity Bloat
• Hadoop and "Big Data" data models
  – Ordered key/value pairs stored in a file
  – Dictionary for random key-oriented access
  – Common table abstractions
Middleware Complexity Bloat
• Scientific data
  – Multi-dimensional arrays
  – Imaging
  – Genomics
Middleware Complexity Bloat
• I/O middleware
  – Low-level data models and I/O optimization
  – Transformative I/O avoids POSIX limitations
Middleware Scalability Challenges
• The storage system is scalable
• But it exposes only one data model
• Middleware must find a 'magic' alignment onto it
Data Model Modules
• Plug in new "file" interfaces and behavior
• Native support, built atop existing scalable services
[Figure: pluggable customization (a new programmer role) adds new behavior on top of generalized storage services]
What does middleware do?
• Metadata management
• Data placement
• Intelligent access
• Asynchronous services
Middleware: Metadata Management
• Byte stream layout
• Data type information
• Data model attributes
• Example: mesh data model
  – How is the mesh represented?
  – What does it represent?
[Figure: a file whose header records this metadata]
Middleware: Data Placement
• Serialization
• Placement index
• Physical alignment
  – Including the metadata
• Example: mesh data model
  – Vertex lists
  – Mesh elements
  – Metadata
[Figure: file laid out as interleaved header, data, and metadata segments]
Middleware: Intelligent Access
• Data-model-specific interfaces
• Rich access methods
  – Views, subsetting, filtering
• Write-time optimizations
• Locality and data movement
[Figure: an array-based application issues read(array-slice) through the HDF5 library, which resolves it against the file's header, data, and metadata segments]
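To make the mapping concrete, here is a minimal sketch, in Python, of how array-oriented middleware can turn an array-slice read into byte-extent reads on a flat file. The header size, element size, and row-major layout are illustrative assumptions, not the actual HDF5 format.

ELEM_SIZE = 8            # bytes per float64 element (assumed)
HEADER_BYTES = 512       # assumed fixed-size header holding the metadata

def slice_to_extents(shape, rows, cols):
    """Map array[rows, cols] of a 2-D row-major array to (offset, length) byte extents."""
    nrows, ncols = shape
    extents = []
    for r in range(*rows):
        start = HEADER_BYTES + (r * ncols + cols[0]) * ELEM_SIZE
        length = (cols[1] - cols[0]) * ELEM_SIZE
        extents.append((start, length))
    return extents

# A 1000x1000 array; read the sub-array [10:12, 0:5].
print(slice_to_extents((1000, 1000), (10, 12), (0, 5)))
# -> [(80512, 40), (88512, 40)]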
Middleware: Asynchronous Services
• Workflows
  – Regridding
• Compression
• Indexing
• Layout optimization
• Performed online
[Figure: a workflow driver operates on the file's data and metadata segments through the HDF5 library]
Middleware Challenges
• Inflexible byte stream abstraction
• Scalability rules are simple
  – But middleware is complex
• Applying the 'magic number'
  – Unnatural and difficult to propagate
• Loss of detail at lower levels
  – Difficult for in-transit / co-located compute
Storage System Services
• Scalable metadata
  – Clustered service
  – Scalability invariants
• Distributed object store
  – Local compute resources
  – Can define new behavior
• File operations
  – POSIX
• Fault tolerance
  – Scrubbing and replication
DataMods Abstraction
• File manifold (metadata and data placement)
• Typed and active storage
• Asynchronous services
DataMods Architecture
• Generalized file system services
• Exposed through a programming model
File Manifold
• Metadata management and data placement
  – Flexible, custom layouts
• Extensible interfaces
• Object namespace managed by the manifold
• Placement rules evaluated by the system (sketch below)
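A minimal sketch of a manifold placement rule: the manifold resolves a logical file extent to named objects in the object store. The striping policy, object-naming scheme, and manifold_map function are assumptions for illustration, not the DataMods API.

STRIPE_SIZE = 4 << 20   # 4 MiB objects (assumed policy)

def manifold_map(file_id, offset, length):
    """Resolve a logical extent to (object_name, object_offset, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        stripe = offset // STRIPE_SIZE
        obj_off = offset % STRIPE_SIZE
        n = min(end - offset, STRIPE_SIZE - obj_off)
        pieces.append((f"{file_id}.{stripe:08d}", obj_off, n))
        offset += n
    return pieces

# A 300-byte extent that spans two objects.
print(manifold_map("ckpt42", offset=(4 << 20) - 100, length=300))
# -> [('ckpt42.00000000', 4194204, 100), ('ckpt42.00000001', 0, 200)]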
Typed and Active Storage
• Active storage adoption has been slow
  – Code injection is scary
  – Security and QoS concerns
• Reading, writing, and checksums are not free
• Exposing scalable services is tractable (sketch below)
  – Well-defined data models support optimization
  – The programming model supports data model creation
  – Indexing and filtering
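One way to picture "typed rather than arbitrary": clients invoke operations from a small, vetted set that the system can reason about for security and QoS, instead of injecting code. The KeyValueObject class and its methods below are hypothetical.

class KeyValueObject:
    """An object whose type is 'key/value records', not raw bytes."""
    def __init__(self):
        self.records = {}

    def put(self, key, value):
        self.records[key] = value

    def filter(self, predicate_name, arg):
        # Only predicates from a fixed, vetted set run near the data.
        predicates = {
            "prefix": lambda k: k.startswith(arg),
            "range":  lambda k: arg[0] <= k < arg[1],
        }
        pred = predicates[predicate_name]
        return {k: v for k, v in sorted(self.records.items()) if pred(k)}

obj = KeyValueObject()
obj.put("host01.cpu", 0.9)
obj.put("host02.cpu", 0.4)
print(obj.filter("prefix", "host01"))   # -> {'host01.cpu': 0.9}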
Asynchronous Services
• Re-use of active / typed storage components
• Temporal relationship to the file manifold (sketch below)
  – Incremental processing
  – After the file is closed
  – On an object update trigger
• Scheduling
  – Exploit idle time
  – Integrate with the larger ecosystem
  – Work can be preempted or aborted
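A minimal sketch of binding an asynchronous service to a manifold event. The event name, decorator, and registry are assumptions; a real system would queue the work and preempt or abort it, rather than run it inline.

services = {}   # event -> list of (service_fn, preemptible?)

def on(event, preemptible=True):
    def register(fn):
        services.setdefault(event, []).append((fn, preemptible))
        return fn
    return register

@on("file_closed")
def compress_index(file_id):
    print(f"compacting index for {file_id} during idle time")

def fire(event, *args):
    for fn, _ in services.get(event, []):
        fn(*args)        # simplified: runs inline instead of queueing

fire("file_closed", "ckpt42")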
Case Study: PLFS Checkpoint/Restart
• Long-running simulations need fault tolerance
  – Checkpoint simulation state
• Simulations run on expensive machines
  – Very expensive machines. Really, very expensive.
• Goal: decrease the cost (time) of checkpoint/restart
• Translation: increase bulk I/O bandwidth
Overview of PLFS
• Middleware layer
  – Transforms the application's I/O pattern
• I/O pattern N-1 (N processes, one shared file)
  – Most common
• I/O pattern N-N (each process writes its own file)
  – File system friendly
• PLFS converts N-1 into N-N
• Applications see the same logical file
Simplified PLFS I/O Behavior
[Figure: three clients write through the Parallel Log-structured File System; each client gets its own log-structured file plus an index]
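The behavior in the figure can be sketched in a few lines of Python: each process appends to a private log while an index records how logical checkpoint offsets map to physical log locations. The data structures here are simplified assumptions, not PLFS's actual on-disk format.

logs = {}     # pid -> list of byte chunks (the per-process log)
index = []    # entries of (logical_off, length, pid, physical_off)

def plfs_write(pid, logical_off, data):
    log = logs.setdefault(pid, [])
    physical_off = sum(len(c) for c in log)      # logs are append-only
    log.append(data)
    index.append((logical_off, len(data), pid, physical_off))

def plfs_read(logical_off, length):
    # Simplified: assumes the request falls inside a single index entry.
    for off, n, pid, phys in index:
        if off <= logical_off < off + n:
            buf = b"".join(logs[pid])
            start = phys + (logical_off - off)
            return buf[start:start + length]

plfs_write(pid=0, logical_off=0, data=b"AAAA")
plfs_write(pid=1, logical_off=4, data=b"BBBB")
print(plfs_read(4, 4))   # -> b'BBBB'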
PLFS Scalability Challenges
• Index maintenance and volume
• Optimization happens above the file system
  – Compression and reorganization
[Figure: timeline in which the application alternates between compute and PLFS/file-system I/O, and the optimization process delays the return to compute]
Moving Overhead to the Storage System
• Checkpoints are not read immediately (if at all)
  – So do index maintenance and optimization in storage
[Figure: the same timeline, but the optimization process runs inside the file system after the write, and the application returns to compute sooner]
DataMods Module for PLFS
• File manifold
  – Logical file view
  – Per-process log-structured files
  – Index
• Hierarchical solution
  – Top-level manifold routes to logs
  – Inner manifold implements the log-structured file
  – Automatic namespace management (metadata)
PLFS Outer File Manifold
[Figure: the logical top-half file is not materialized; the outer manifold routes each write to a per-process log file]
PLFS Inner File Manifold
[Figure: the inner manifold stripes appends within the object namespace; index-enabled objects record the logical-to-physical mapping and expose an interface to index maintenance routines]
Active and Typed Objects
• Append-only object
• Automatic indexing
• Managed layout
• Built on existing services
• Logical view at the lowest level
• Index maintenance interface
Offline Index Optimization
• Extreme index fragmentation (per object)
• Exploit opportunities for optimization
  – Storage system idle time
  – Re-use of analysis I/O
  – Piggy-backed on scrubbing / healing
• Index compression
  – Merging contiguous entries
  – Pattern discovery and replacement
  – Consolidation
Offline Index Optimization
• Three-stage pipeline
  – Incremental compression and consolidation
• Incremental compression, stage 1: merging physically contiguous entries (sketch below)
  – Unlike merging inside PLFS, not subject to buffer size limits
• Applied the technique to 92 PLFS indexes published by LANL
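A minimal sketch of the merge stage, assuming index entries of the form (logical_offset, length, physical_offset) within one log: runs that are contiguous in both address spaces collapse into a single entry.

def merge_contiguous(entries):
    """Collapse runs of entries contiguous in both logical and physical space."""
    merged = []
    for off, n, phys in sorted(entries):
        if merged:
            lo, ln, po = merged[-1]
            if off == lo + ln and phys == po + ln:   # contiguous both ways
                merged[-1] = (lo, ln + n, po)
                continue
        merged.append((off, n, phys))
    return merged

# Four contiguous 1 MiB writes collapse into one entry.
MB = 1 << 20
entries = [(i * MB, MB, i * MB) for i in range(4)]
print(merge_contiguous(entries))   # -> [(0, 4194304, 0)]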
Merging Reduces PLFS Index Size
[Figure: number of index entries (log scale, 1 to 10,000,000) for each of the 92 PLFS map files, raw trace baseline vs. merge compression; maps dominated by contiguous writes shrink dramatically, while large strided maps barely improve]
Index Compression: Pattern
• Compactly represent extents using patterns
• Example pattern template
  – offset + stride * i, for low < i < high
• Fit data to this pattern to reduce index size
• Linear algorithm; parallel across logs (sketch below)
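A minimal sketch of pattern discovery for the template above: equally spaced offsets fold into one (base, stride, length, count) record. The greedy scan and the record layout are illustrative assumptions.

def compress_strided(offsets, length):
    """Greedily fold equally-spaced offsets into pattern records."""
    patterns, i = [], 0
    while i < len(offsets):
        j, stride = i + 1, None
        while j < len(offsets):
            step = offsets[j] - offsets[j - 1]
            if stride is None:
                stride = step
            if step != stride:
                break
            j += 1
        patterns.append((offsets[i], stride or 0, length, j - i))
        i = j
    return patterns

# 1000 strided 4 KiB writes become a single pattern record.
offs = [4096 + 65536 * i for i in range(1000)]
print(compress_strided(offs, length=4096))  # -> [(4096, 65536, 4096, 1000)]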
Pattern Compression Improves Over Merging
[Figure: the same 92 map files with a pattern-compress series added; where a strided pattern is identified, entry counts drop far below merge compression alone]
Index Consolidation
• Combines the indexes of all logs together (as in PLFS)
• Increases index read efficiency (sketch below)
[Figure: per-log indexes feed an index consolidation step that packs them into a single index]
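A minimal sketch of consolidation: pre-sorted per-log indexes are k-way merged into one packed, globally sorted index, so a restart scans one stream instead of probing every log. The tuple layout is an assumption; Python's heapq.merge performs the merge.

import heapq

# Per-log indexes, each already sorted by logical offset:
# entries of (logical_off, length, log_name, physical_off).
log_indexes = [
    [(0, 4096, "log.0", 0), (8192, 4096, "log.0", 4096)],
    [(4096, 4096, "log.1", 0), (12288, 4096, "log.1", 4096)],
]

consolidated = list(heapq.merge(*log_indexes))   # one sorted index stream
print(consolidated[0])   # -> (0, 4096, 'log.0', 0)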
Wrapping Up
• Implementing new data model plugins
  – Hadoop and visualization
  – Refining the API with more use cases
  – Constructing a specification language
• Thank you to supporters
  – DOE funding (DE-SC0005428); Gary Grider, John Bent, James Nunez
• Questions? jayhawk@cs.ucsc.edu
• Poster session
Extra Slides