Los Alamos National Laboratory
LA-UR-17-24107
MarFS and DeltaFS @ LANL: User Level FS Challenges and Opportunities
Brad Settlemyer, LANL HPC
May 16, 2017
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
Acknowledgements
• G. Grider, D. Bonnie, C. Hoffman, J. Inman, W. Vining, A. Torrez, H.B. Chen (LANL MarFS Team)
• Q. Zheng, G. Amvrosiadis, C. Cranor, S. Kadekodi, G. Gibson (CMU-LANL IRHPIT, CMU Mochi)
• P. Carns, R. Ross, S. Snyder, R. Latham, M. Dorier (Argonne Mochi)
• J. Soumagne, J. Lee (HDF Group Mochi)
• G. Shipman (LANL Mochi)
• F. Guo, B. Albright (LANL VPIC)
• J. Bent, K. Lamb (Seagate)
Overview
• Two special-purpose user-level file systems
• No free lunch!
  – Trade off one thing to get something else
• Campaign Storage and MarFS
  – MarFS is constrained to its use case
  – Addresses a data center problem
• VPIC and DeltaFS
  – VPIC is an open source, scalable, beautiful, modern C++ particle-in-cell plasma sim
  – The scientist has a needle-in-a-haystack problem
• Lessons learned
MarFS
Why MarFS?
• Science campaigns often stretch beyond 6 months
  – ScratchFS purge more like 12 weeks
  – Parallel tape $$$
  – High capacity PFS $$$
• Need new (to HPC) set of tradeoffs
  – Capacity growth over time (~500 PB in 2020)
  – Support petabyte files (Silverton uses N:1 I/O at huge scale)
  – Billions of “tiny” files
  – Scalable append-only workload (desire 1 GB/s/PB)
• Requirements blend PFS and object storage capabilities
  – LANL wants a compromise of both!
MarFS Access Model
• Simplify, simplify, simplify
• Data plane of object stores presents some issues
  – Object stores tend to have an ideal object size
  – Pack small files together
  – Split large files (see the layout sketch below)
• Only write access is via PFTool
• Users want to use traditional tools
  – Read-only mount via FUSE
  – We could make this read-write (e.g. O_APPEND only), but object store security is not the same as POSIX security
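To make the pack/split rules above concrete, here is a minimal layout-decision sketch. The 256 MiB ideal object size matches the chunksize shown on the next slide, but the packing threshold, the `Layout` names, and `choose_layout` are illustrative assumptions, not MarFS code.

```cpp
#include <cstdint>
#include <cstdio>

// Ideal object size for the backing object store; 256M matches the chunksize
// in the following diagram. The packing threshold is purely illustrative.
constexpr uint64_t kIdealObjSize  = 256ULL << 20;        // 256 MiB
constexpr uint64_t kPackThreshold = kIdealObjSize / 16;  // assumption

enum class Layout { Packed, Uni, Multi };

// Pick a representation for a file of `size` bytes: coalesce tiny files into
// a shared "packed" object, store mid-sized files as a single object, and
// split anything larger than the ideal object size across chunk objects.
Layout choose_layout(uint64_t size) {
    if (size < kPackThreshold) return Layout::Packed;
    if (size <= kIdealObjSize) return Layout::Uni;
    return Layout::Multi;
}

int main() {
    const uint64_t sizes[] = {4096, 64ULL << 20, 3ULL << 30};
    for (uint64_t size : sizes) {
        Layout l = choose_layout(size);
        std::printf("%12llu bytes -> %s\n", (unsigned long long)size,
                    l == Layout::Packed ? "packed" :
                    l == Layout::Uni    ? "uni-object" : "multi-object");
    }
}
```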
Uni-Object and Multi-Object Files
[Diagram: the /MarFS top-level namespace aggregates one or more GPFS MarFS metadata file systems (/GPFS-MarFS-md1, /GPFS-MarFS-md2), each holding lazy tree info, directories, and a trashdir; data lives in object repos (Scality, S3, erasure, etc.). All metadata is just normal GPFS metadata except mtime and size, which are set by pftool/fuse on close. A uni-object file (UniFile-1) carries xattrs such as objid, repo=1, chunksize=256M, and a restart marker, and maps to a single object (Obj001). A multi-object file (MultiFile-1) carries xattrs repo=2, chunksize=256M, objtype=Multi, meaning it is a multi-part file whose list of obj name space/objname/offset/length entries (Obj001, Obj002, …) is kept in the GPFS mdfile. A config file/db records the access methods for each object repo.]
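For a Multi file, the chunksize xattr is what lets a reader map a logical file offset to an entry in the object list. A minimal sketch of that mapping, assuming the 256M chunksize from the diagram (the names here are invented for illustration):

```cpp
#include <cstdint>
#include <cstdio>

// Chunk size recorded in the file's xattrs on the metadata file system
// (the diagram's example uses chunksize = 256M).
constexpr uint64_t kChunkSize = 256ULL << 20;

// For a multi-object file, map a logical file offset to the index of the
// object within the file's object list and the offset inside that object.
struct ObjLoc { uint64_t obj_index; uint64_t obj_offset; };

ObjLoc locate(uint64_t file_offset) {
    return { file_offset / kChunkSize, file_offset % kChunkSize };
}

int main() {
    uint64_t off = 5ULL << 30;            // read starting at 5 GiB
    ObjLoc loc = locate(off);
    std::printf("offset %llu -> object #%llu at offset %llu\n",
                (unsigned long long)off,
                (unsigned long long)loc.obj_index,
                (unsigned long long)loc.obj_offset);
}
```

The object index selects one of the obj name space/objname entries recorded in the GPFS mdfile; the remainder is the offset within that object.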
Pftool
• A highly parallel copy/rsync/compare/list tool
• Walks the tree in parallel; copy/rsync/compare in parallel
• Parallel readdirs, stats, and copy/rsync/compares
  – Dynamic load balancing
  – Repackaging: breaks up big files, coalesces small files
  – To/from NFS/POSIX/parallel FS/MarFS
[Diagram: a scheduler/load balancer dispatches work across a dirs queue (readdir), a stat queue (stat), and a cp/r/c queue (copy/rsync/compare); a done queue feeds a reporter.]
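Pftool itself runs as an MPI program with a scheduler rank doing dynamic load balancing across the queues in the diagram; the single-node, thread-based sketch below only illustrates the core pattern (a shared queue of directories, workers that readdir in parallel and hand regular files to a copy stage). Everything here is invented for illustration and is not pftool's actual code.

```cpp
#include <condition_variable>
#include <filesystem>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

namespace fs = std::filesystem;

std::queue<fs::path> dir_queue;   // directories waiting for a readdir
std::mutex mtx;
std::condition_variable cv;
int outstanding = 0;              // directories queued or in progress

void worker() {
    for (;;) {
        fs::path dir;
        {
            std::unique_lock<std::mutex> lk(mtx);
            cv.wait(lk, [] { return !dir_queue.empty() || outstanding == 0; });
            if (dir_queue.empty()) return;        // tree fully walked, exit
            dir = std::move(dir_queue.front());
            dir_queue.pop();
        }
        for (const auto& entry : fs::directory_iterator(dir)) {
            if (entry.is_directory()) {
                std::lock_guard<std::mutex> lk(mtx);
                dir_queue.push(entry.path());
                ++outstanding;
                cv.notify_one();
            } else {
                // pftool would stat and copy/rsync/compare here, repackaging
                // small files and splitting large ones along the way.
                std::cout << "file: " << entry.path() << "\n";
            }
        }
        {
            std::lock_guard<std::mutex> lk(mtx);
            if (--outstanding == 0) cv.notify_all();  // let idle workers exit
        }
    }
}

int main(int argc, char** argv) {
    dir_queue.push(argc > 1 ? argv[1] : ".");
    outstanding = 1;
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}
```

In the real tool, work items flow through separate readdir, stat, and copy/rsync/compare queues so that a slow stage does not stall the tree walk.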
Moving Data into MarFS efficiently
• User: submit a batch job from the front end (FE): pfcp -r /scratch1/fs1 /marfs/fs1
[Diagram: Scratch1 (78 PB) feeds an FTA cluster (FTA1–FTA6), a collection of pftool worker nodes capable of performing data movement in parallel, which writes into Store1 (~3 PB), Store2 (38 PB), and Store3 (38 PB).]
Current MarFS Deployment at LANL
• Successfully deployed during Trinity Open Science (6 PB of Scality storage (1 RING), 1 GPFS cluster)
  – Small file packing features weren't efficient (yet)
  – Cleanup tools needed better feature sets
  – It worked!
• PFTool cluster deployed for enclaves
  – Transfers instrumented with Darshan
• Deployed 32 PB of MarFS campaign storage for Trinity Phase 2
  – Plan to grow that by 50 – 100 PB every year during the Trinity deployment
  – SMR drive performance is challenging
Future MarFS Deployment at LANL
• 2020 – 350 PB
[Diagram: file transfer agents and metadata servers in front of a pool of storage nodes. Parity of 10+2 is computed over multiple ZFS pools, with data and parity round-robined across the storage nodes (D D D D D D D D D D P P). Each storage node hosts multiple JBODs in separate racks and exposes four zpools (Zpool 1–4) via NFS export to the FTAs; each zpool is a 17+3.]
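Assuming the two protection levels in the diagram are independent (10+2 across storage nodes layered on top of 17+3 inside each zpool), a back-of-envelope estimate of usable capacity looks like the sketch below; the real layout and overheads may differ.

```cpp
#include <cstdio>

// Rough usable-capacity estimate for the layered protection in the diagram:
// 10+2 parity across storage nodes, and each zpool inside a node is 17+3.
int main() {
    double node_level  = 10.0 / (10 + 2);   // data fraction across nodes
    double zpool_level = 17.0 / (17 + 3);   // data fraction within a zpool
    double usable = node_level * zpool_level;
    std::printf("usable fraction ~= %.1f%%\n", usable * 100.0);      // ~70.8%
    std::printf("350 PB usable needs ~%.0f PB raw\n", 350.0 / usable);
}
```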
MarFS at LANL
[Chart: MarFS growth over time, from Trinity Open Science to Trinity Production (current) to the future MarFS deployment (Crossroads 2020); key: 4 PB.]
• Successfully deployed during Trinity Open Science (6 PB of Scality storage (1 RING), 1 GPFS cluster)
  – Some of the small file packing features aren't efficient yet
  – Some of the cleanup tools need better feature sets
  – But it actually works
• All PFTool transfers were logged with Darshan during Open Science
  – Unfortunately, I haven't had the opportunity to parse that data yet
• Currently deploying 32 PB of MarFS campaign storage for Trinity Phase 2
  – Plan is to grow that by 50 – 100 PB every year during the Trinity deployment
DeltaFS
Why DeltaFS?
• DeltaFS provides an extremely scalable metadata plane
  – Don't want to provision the MDS to suit the most demanding application
  – Purchase MDS to address all objects in the data plane
  – DeltaFS allows applications to be as metadata-intensive as desired
• But this is not the DeltaFS vision talk
  – Instead I will talk about what exists
• VPIC is a good, but imperfect, match for DeltaFS
  – State of the art is a single h5part log for VPIC that has written greater than 1 trillion particles (S. Byna)
  – Not much opportunity to improve write performance
Vector Particle-in-Cell (VPIC) Setup
• 128 million particles/node
  – Particles/node should be higher
• Need to store a sample in the range of 0.1 – 10%
• Scientist is interested in the trajectories of the particles with the highest energy at the end of the simulation
  – At the end of the simulation, it is easy to dump the highest energy particle ids
  – Hard to get those trajectories
DeltaFS for VPIC
• Scientist wants the trajectories of the N highest energy particles
  – A trajectory is ~40 bytes/particle, once per timestep dump/cycle
• Current state of the art (HDF5 + H5Part + Byna):
  1. At each timestep dump/cycle, write particles in log order
  2. At the end of the simulation, identify the N highest energy particles
  3. Sort each timestep dump by particle id in parallel
  4. Extract the N high energy trajectories (open, seek, read, open, seek, read)
• DeltaFS experiments
  – Original goal: create a file+object per particle
  – New goal: 2500 Trinity nodes -> 1 – 4 T particles -> 40 – 160 TB per timestep
  – Speed up particle extraction, minimize slowdown during write (TANSTAAFL)
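The contrast between the two workflows above comes down to how the per-timestep records are keyed. The toy sketch below (in-memory only, with invented types, not the DeltaFS API) shows why keying the ~40-byte records by particle id at write time turns trajectory extraction into a handful of point lookups instead of a sort of every timestep dump.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <utility>
#include <vector>

struct Record {            // stand-in for the ~40-byte per-timestep record
    uint32_t timestep;
    float    pos[3], mom[3], energy;
};

// Index keyed by particle id; each entry is that particle's trajectory.
using TrajectoryIndex = std::unordered_map<uint64_t, std::vector<Record>>;

void dump_timestep(TrajectoryIndex& idx,
                   const std::vector<std::pair<uint64_t, Record>>& particles) {
    for (const auto& [id, rec] : particles)
        idx[id].push_back(rec);        // keyed by particle id, append-only
}

int main() {
    TrajectoryIndex idx;
    for (uint32_t step = 0; step < 3; ++step) {
        std::vector<std::pair<uint64_t, Record>> particles;
        particles.push_back({42, {step, {0, 0, 0}, {1, 1, 1}, 9.0f + step}});
        particles.push_back({7,  {step, {1, 2, 3}, {0, 0, 1}, 1.0f}});
        dump_timestep(idx, particles);
    }
    // Query: the trajectory of particle 42 is one lookup, no global sort.
    for (const Record& r : idx[42])
        std::printf("step %u energy %.1f\n", (unsigned)r.timestep, r.energy);
}
```

At VPIC scale the 40 bytes/particle figure is also where the slide's sizing comes from: 1 – 4 trillion particles times ~40 bytes is roughly 40 – 160 TB per timestep dump.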
Wait, What!?!
• We were always confident we could create 1T files
  – Key-value metadata organization
  – TableFS -> IndexFS -> BatchFS -> DeltaFS
• Planning papers assumed a scalable object store as the data plane
  – RADOS has some cool features that may have worked
  – No RADOS on Trinity without root/vendor assist!
  – Cannot create a data plane faster than Lustre, DataWarp
• Trillions of files open at one time
• 40-byte appends! (the dribble write problem)
• Bitten off far more than originally intended
  – DeltaFS scope increased to optimize the VPIC data plane
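The 40-byte appends are the crux of the dribble write problem: no backing store wants trillions of tiny writes. A standard mitigation, sketched below under the assumption of a simple per-writer buffer (this is not the actual DeltaFS write path), is to coalesce appends and flush in large, store-friendly chunks.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Coalesce many tiny appends into large writes before they hit the
// backing store. Threshold and class name are illustrative assumptions.
class BufferedAppender {
public:
    explicit BufferedAppender(size_t flush_threshold)
        : threshold_(flush_threshold) { buf_.reserve(flush_threshold); }

    void append(const void* data, size_t len) {
        const uint8_t* p = static_cast<const uint8_t*>(data);
        buf_.insert(buf_.end(), p, p + len);
        if (buf_.size() >= threshold_) flush();
    }

    void flush() {
        if (buf_.empty()) return;
        // A real implementation would write this buffer to the object
        // store or burst buffer here.
        std::printf("flushing %zu bytes\n", buf_.size());
        buf_.clear();
    }

private:
    size_t threshold_;
    std::vector<uint8_t> buf_;
};

int main() {
    BufferedAppender out(1 << 20);        // flush in ~1 MiB chunks (assumed)
    char record[40] = {};                 // a ~40-byte per-particle append
    for (int i = 0; i < 100000; ++i)      // 100k dribble writes
        out.append(record, sizeof(record));
    out.flush();
}
```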
FS Representation vs. Storage Representation
DeltaFS is a composed User-Level FS
Write Indexing Overhead
Trajectory Extraction Time
DeltaFS UDF (only the VPIC example works so far)
Summary