Structuring PLFS for Extensibility
Chuck Cranor, Milo Polte, Garth Gibson
Parallel Data Laboratory, Carnegie Mellon University
What is PLFS?
• Parallel Log Structured File System
  – Interposed filesystem between applications & backing storage
  – Los Alamos National Laboratory, CMU, EMC, …
  – Target: HPC checkpoint files
• PLFS transparently transforms a highly concurrent write access pattern into a pattern more efficient for distributed filesystems
  – First paper: Bent et al., Supercomputing 2009 (SC09)
  – http://github.com/plfs, http://institute.lanl.gov/plfs/
Checkpoint Write Patterns
• The two main checkpoint write patterns:
  – N-1: all N processes write to one shared file
    • Concurrent I/O to a single file is often unscalable
    • Small, unaligned, clustered traffic is problematic
  – N-N: each process writes to its own file
    • Overhead of inserting many files in a single directory
    • Easier for the DFS (after the files are created)
    • Archival and management are more difficult
• Initial PLFS focus: improve the N-1 case
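To make the N-1 pattern concrete, here is a minimal sketch of a strided shared-file checkpoint written through MPI-IO. The file name, stride count, and the 47001-byte block size (borrowed from the benchmark slides later in this deck) are illustrative assumptions, not part of the original slides.

```cpp
// Minimal sketch of an N-1 strided checkpoint: all N processes write into
// one shared file through MPI-IO. File name, stride count, and block size
// are illustrative only.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int blocksize = 47001;   // small, unaligned block (bytes)
    const int nstrides = 4;        // blocks written per process
    std::vector<char> block(blocksize, static_cast<char>(rank));

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.chk",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Rank r's s-th block lands at offset (s*nprocs + r)*blocksize, so every
    // process's data is interleaved within each stride -- the small,
    // unaligned, clustered traffic described above.
    for (int s = 0; s < nstrides; s++) {
        MPI_Offset off =
            (static_cast<MPI_Offset>(s) * nprocs + rank) * blocksize;
        MPI_File_write_at(fh, off, block.data(), blocksize, MPI_CHAR,
                          MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```

The N-N alternative is simply each rank opening its own file and writing sequentially, which is easy on the data path but creates the metadata and management problems listed above.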
PLFS Transforms Workloads
• PLFS improves N-1 performance by transforming it into an N-N workload
• FUSE/MPI: transparent solution, no application changes required
PLFS Converts N-1 to N-N
[Figure: processes on host1, host2, and host3 write to a single logical file /foo through the PLFS virtual layer; on the physical underlying parallel file system, /foo is a container with per-host subdirectories (hostdir.1/, hostdir.2/, hostdir.3/), each holding per-process data logs (data.131, data.132, data.148, data.152, data.279, data.281) and matching index logs (indx.131, …, indx.281).]
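The transformation in the figure can be summarized with a small sketch: each writing process appends everything it writes, regardless of logical offset, to its own data log under its host's hostdir, and records an index entry saying where those bytes belong in the logical file. The class and field names below are invented for exposition and are not PLFS's actual code.

```cpp
// Illustrative sketch of PLFS-style log-structured writes into a container.
// Class, field, and file-naming details are invented, not PLFS's real code.
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <string>

struct IndexEntry {        // one record per application write
    uint64_t logical_off;  // offset in the logical shared file (e.g. /foo)
    uint64_t length;       // number of bytes written
    uint64_t physical_off; // offset inside this writer's data log
    uint64_t timestamp;    // one way to resolve overlapping writes on read
};

class ContainerWriter {
public:
    ContainerWriter(const std::string &hostdir, int pid)
        : physical_off_(0),
          data_(std::fopen((hostdir + "/data." + std::to_string(pid)).c_str(), "ab")),
          index_(std::fopen((hostdir + "/indx." + std::to_string(pid)).c_str(), "ab")) {}

    // Every logical write, regardless of its logical offset, is appended
    // sequentially to this process's data log; the index log records where
    // the bytes belong so reads can reassemble the logical file later.
    void write(const void *buf, uint64_t len, uint64_t logical_off) {
        IndexEntry e{logical_off, len, physical_off_,
                     static_cast<uint64_t>(std::time(nullptr))};
        std::fwrite(buf, 1, len, data_);
        std::fwrite(&e, sizeof(e), 1, index_);
        physical_off_ += len;
    }

    ~ContainerWriter() {
        if (data_)  std::fclose(data_);
        if (index_) std::fclose(index_);
    }

private:
    uint64_t physical_off_;
    std::FILE *data_;
    std::FILE *index_;
};
```

On read, PLFS merges all processes' index logs to reassemble the logical file, which is why every writer's index must be located and parsed before serving a read.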
PLFS N-1 Bandwidth Speedups
[Figure: N-1 write bandwidth speedups of roughly 10X to 100X when writing through PLFS versus writing the same pattern directly to the underlying file system.]
The Price of Success
• Original PLFS was limited to one workload:
  – N-1 checkpoint on a mounted POSIX filesystem
  – All data stored in PLFS container logs
• Ported first to MPI-IO/ROMIO
  – Needed to deploy feasibly on leadership-class machines
• Success with LANL apps: actual adoption?
  – Requires maintainability & roadmap evolution
  – Develop a team: LANL, EMC, CMU, …
• Revisit the code with maintainability in mind
PLFS Extensibility Architecture
[Diagram: the HPC application calls libplfs through the PLFS high-level API; beneath it are three plug-in interfaces:
  – Logical FS interface: flat file, small file, container
  – Index API: byte-range, pattern, distributed (MDHIM w/LevelDB)
  – I/O Store interface: posix, pvfs, iofsl, hdfs (libhdfs/jvm, hdfs.jar)]
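As a rough illustration of the topmost plug-in point, the sketch below shows what a Logical FS interface of this shape might look like; the class names and method signatures are assumptions made for exposition, not PLFS's real interfaces.

```cpp
// Illustrative-only sketch of a "Logical FS" plug-in point: each
// implementation decides how a logical PLFS file is represented on the
// backend. Names and signatures are invented, not PLFS's actual API.
#include <sys/types.h>
#include <cstddef>
#include <string>

class LogicalFS {
public:
    virtual ~LogicalFS() {}
    virtual int create(const std::string &logical_path, mode_t mode) = 0;
    virtual ssize_t write(const std::string &logical_path, const void *buf,
                          size_t len, off_t logical_off) = 0;
    virtual ssize_t read(const std::string &logical_path, void *buf,
                         size_t len, off_t logical_off) = 0;
};

// container: N-1 checkpoints stored as per-process data logs plus indices
class ContainerFS : public LogicalFS { /* ... */ };
// flat file: one backend file per logical file (simple passthrough)
class FlatFileFS  : public LogicalFS { /* ... */ };
// small file: pack many tiny logical files into shared logs
class SmallFileFS : public LogicalFS { /* ... */ };
```

The Index API and the I/O Store interface shown on the slide sit below this layer; the I/O Store is sketched a couple of slides further on.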
Case Study: HPC in the Cloud
• Emergence of Hadoop: converged storage
• HDFS: Hadoop Distributed Filesystem
  – Key attributes:
    • Single sequential writer (not POSIX, no pwrite)
    • Not VFS mounted, access through Java API
    • Local storage on nodes (converged)
    • Data replicated ~3 times (local + remote1 + remote2)
• HPC in the Cloud: N-1 checkpoint on HDFS?
  – Observation: PLFS log I/O fits HDFS semantics
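PLFS's write path already only appends to per-process logs, which is exactly what HDFS's single-sequential-writer model allows. Below is a rough sketch of what a data-log write looks like against the libhdfs C API (the JNI bindings that drive the Java client); the namenode host, port, and container path are placeholders.

```cpp
// Sketch of a PLFS-style data-log write through the libhdfs C API:
// the file is opened write-only and only ever appended to, matching
// HDFS's single-sequential-writer rule. Host, port, path are placeholders.
#include <hdfs.h>      // libhdfs C bindings (drive the Java client via JNI)
#include <fcntl.h>
#include <cstdio>
#include <cstring>

int main() {
    hdfsFS fs = hdfsConnect("namenode.example.org", 8020);
    if (fs == nullptr) { std::perror("hdfsConnect"); return 1; }

    // No pwrite(): each open file gets one sequential write stream,
    // which is exactly how a PLFS data log is written anyway.
    hdfsFile log = hdfsOpenFile(fs, "/plfs/foo/hostdir.1/data.132",
                                O_WRONLY, 0 /*bufsize*/, 0 /*default repl*/,
                                0 /*default block size*/);
    if (log == nullptr) { std::perror("hdfsOpenFile"); return 1; }

    const char chunk[] = "checkpoint bytes";
    hdfsWrite(fs, log, chunk, (tSize)std::strlen(chunk));   // append-only
    hdfsCloseFile(fs, log);
    hdfsDisconnect(fs);
    return 0;
}
```

Reads, by contrast, can be served at arbitrary offsets with hdfsPread(), which is enough for fetching index records and data ranges back out of the logs.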
PLFS Backend Limitations
• PLFS was hardwired to the POSIX API:
  – Needs a kernel-mounted filesystem
  – Uses integer file descriptors
  – Memory maps index files to read them
• HDFS does not fit these assumptions
• Solution: I/O Store
  – Insert a layer of indirection above the PLFS backend
  – Model it after the POSIX API to minimize code changes
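A minimal sketch of what such an I/O Store layer can look like, assuming invented class and method names (the real PLFS interface differs in detail): backend objects are reached through handle objects instead of integer file descriptors, and the operations mirror the POSIX calls the container code already issues.

```cpp
// Illustrative sketch of an I/O Store abstraction modeled on the POSIX
// calls PLFS already used; names and signatures are invented for
// exposition and are not PLFS's actual I/O Store interface.
#include <sys/types.h>
#include <cstddef>
#include <memory>
#include <string>

// Handle for one open backend object (replaces a raw integer fd).
class IOSHandle {
public:
    virtual ~IOSHandle() {}
    virtual ssize_t pread(void *buf, size_t len, off_t off) = 0;
    virtual ssize_t append(const void *buf, size_t len) = 0; // log-style write
    virtual int close() = 0;
};

// One backend store (POSIX directory tree, HDFS, PVFS, IOFSL, ...).
class IOStore {
public:
    virtual ~IOStore() {}
    virtual std::unique_ptr<IOSHandle> open(const std::string &path,
                                            int flags, mode_t mode) = 0;
    virtual int mkdir(const std::string &path, mode_t mode) = 0;
    virtual int unlink(const std::string &path) = 0;
    // readdir / stat / rename omitted; the real interface mirrors most of
    // the POSIX calls the container code already issued.
};
```

A POSIX store forwards these calls to libc against a mounted filesystem; an HDFS store maps them onto libhdfs, and the index-reading code switches from mmap() to ordinary reads into a heap buffer because HDFS files cannot be memory mapped.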
PLFS I/O Store Architecture
[Diagram: PLFS FUSE and PLFS MPI I/O both sit on libplfs's PLFS container code, which now calls through the I/O Store layer; a posix I/O store goes through the libc API to a mounted filesystem, while an HDFS I/O store goes through lib{hdfs,jvm} to hdfs.jar Java code.]
PLFS/HDFS Benchmark
• Testbed: PRObE (www.nmc-probe.org)
• Each node has dual 1.6GHz AMD cores, 16GB RAM, 1TB drive, gigabit ethernet
• Ubuntu Linux, HDFS 0.21.0, PLFS, OpenMPI
• Benchmark: LANL FS Test Suite (fs_test)
• Simulates a strided N-1 checkpoint
• Filesystems tested:
  – PVFS OrangeFS 2.8.4 w/64MB stripe size
  – PLFS/HDFS w/1 replica (local disk)
  – PLFS/HDFS w/3 replicas (local disk + remote1 + remote2)
• Block sizes: 47001 bytes, 48K, 1M
• Checkpoint size: 32GB written by 64 nodes
Benchmark Operation
[Diagram: in the write phase, nodes 0, 1, 2, 3 each write one block per stride in an interleaved N-1 pattern, continuing the pattern for the remaining strides; in the read phase the nodes are shifted (e.g. 2, 3, 0, 1) so each node reads back blocks that a different node wrote.]
• We unmount and flush the caches of the data filesystem between the write and read phases
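A small sketch of the read-phase mapping implied by the diagram: each node reads back blocks written by a different node, so reads cannot be satisfied from the reader's own cache. The shift amount and stride count here are assumptions for illustration.

```cpp
// Prints which node's data each reader touches in the shifted read phase.
// Shift amount and stride count are illustrative assumptions.
#include <cstdint>
#include <cstdio>

int main() {
    const int nodes = 4;            // 4 nodes shown in the diagram
    const int shift = 2;            // diagram shows 0,1,2,3 -> 2,3,0,1
    const int64_t blocksize = 47001;
    const int nstrides = 2;

    for (int reader = 0; reader < nodes; reader++) {
        int writer = (reader + shift) % nodes;  // whose blocks this node reads
        for (int s = 0; s < nstrides; s++) {
            int64_t off = (static_cast<int64_t>(s) * nodes + writer) * blocksize;
            std::printf("node %d reads stride %d at offset %lld (written by node %d)\n",
                        reader, s, static_cast<long long>(off), writer);
        }
    }
    return 0;
}
```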
PLFS Implementation Architecture
• PLFS is delivered as a FUSE filesystem and as a middleware library (MPI)
[Diagram: unmodified application processes issue I/O through the VFS/POSIX API; the kernel FUSE module upcalls into the user-space PLFS FUSE daemon. MPI application processes instead link the PLFS library directly into their MPI I/O stack. Both paths go through a backing-store I/O module to a local filesystem (to disk) or a distributed filesystem (over the interconnect to other nodes).]
PLFS/HDFS Write Bandwidth
• PLFS/HDFS performs well (note HDFS1 is local disk; HDFS3 keeps 3 copies)
[Figure: write bandwidth in MB/s (0 to ~2000) for PVFS, PLFS/HDFS with 1 replica, and PLFS/HDFS with 3 replicas, at access unit sizes of 47001 bytes, 48K, and 1M.]
PLFS/HDFS Read Bandwidth
• HDFS with small access sizes benefits from PLFS log grouping
• HDFS3 with large access sizes suffers from read imbalance
[Figure: read bandwidth in MB/s (0 to ~1000) for PVFS, PLFS/HDFS1, and PLFS/HDFS3 at access unit sizes of 47001 bytes, 48K, and 1M.]
HDFS 1 vs 3: I/O Scheduling
• Network counters show the HDFS3 read imbalance
[Figure: total size of data served (MB, 0 to ~1000) per node for nodes 1-64, PLFS/HDFS1 vs PLFS/HDFS3; the amount served per node is markedly uneven under HDFS3.]
I/O Store Status
• Rewrote the initial I/O Store prototype
  – Production-level code
  – Multiple concurrent instances of I/O Stores
• Re-plumbed the entire backend I/O path
• Prototyped POSIX, HDFS, PVFS stores
  – IOFSL store done by EMC
• Regression tested at LANL
• I/O Store is now part of the released PLFS code
  – https://github.com/PLFS
Conclusions
• PLFS extensions for workload transformation:
  – Logical FS interface
    • Not just container logs: packing small files, burst buffer
  – I/O Store layer
    • Non-POSIX backends (HDFS, IOFSL, PVFS)
    • Compression, write buffering, I/O forwarding
  – Container index extensions
• PLFS is open source, available on GitHub
  – http://github.com/plfs
  – Developer email: plfs-devel@lists.sourceforge.net