XtreemFS: high-performance network file system clients and servers in userspace
Minor Gordon, NEC Deutschland GmbH
mgordon@hpce.nec.com
High Performance Computing, 05/04/09
Why userspace?
❚ File systems are traditionally implemented in the kernel for performance and control
❚ Some advantages of doing things in userspace:
  ❚ High-level languages: Python, Ruby, et al. for prototyping, then C++ (→ tool support, reduced code footprint, etc.)
  ❚ Protection: kernel-userspace bridges (Dokan, FUSE) are fairly stable; the file system can crash without requiring a reboot
  ❚ Porting: one common kernel-to-userspace upcall interface (FUSE) on Linux, OS X, Solaris
❚ Acceptable performance for network file systems
  ❚ Often bound to disk anyway
Overview
❚ Implementing file systems in userspace
❚ Handling concurrency
❚ XtreemFS: an object-based distributed file system
Implementing file systems in userspace
❚ The FUSE kernel module translates VFS operations into messages and writes them to a file descriptor
❚ The FUSE userspace library reads the messages, calls the appropriate callback, and returns the result as a message
❚ Callbacks must be thread-safe and complete synchronously
❚ Dokan (Win32) calls can be translated, sans sharing modes
❚ The callbacks roughly mirror the VFS functions (a minimal registration sketch follows):

    // ~ VFS functions (FUSE)
    static int mkdir(
        const char* path,
        mode_t mode );

    // Dokan (Win32)
    static int DOKAN_CALLBACK CreateDirectory(
        LPCWSTR FileName,
        PDOKAN_FILE_INFO );
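To make the callback wiring concrete, here is a minimal sketch, assuming the FUSE 2.x high-level C API: a mkdir handler is registered in a fuse_operations table, and libfuse's dispatch loop invokes it when the kernel module forwards a mkdir. The handler name my_mkdir and its empty body are illustrative only, not XtreemFS code.

    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <sys/stat.h>

    // Illustrative mkdir handler: libfuse calls this from one of its dispatch
    // threads, so it must be thread-safe and return synchronously
    // (0 on success, -errno on failure).
    static int my_mkdir(const char* path, mode_t mode) {
        (void)path;
        (void)mode;
        return 0;   // a real handler would forward the request, e.g. to a metadata server
    }

    int main(int argc, char** argv) {
        struct fuse_operations ops = {};   // unused callbacks stay NULL
        ops.mkdir = my_mkdir;              // register the handler
        // Mount the file system and run libfuse's multithreaded dispatch loop.
        return fuse_main(argc, argv, &ops, NULL);
    }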
Abstract away
❚ Yield C++ library for minimalist platform primitives, concurrency (next section), IPC
❚ Auto-generate client-server interfaces from IDL; make synchronous proxy calls that do message passing under the hood (an illustrative sketch follows the example below)

    bool Volume::mkdir(
        const YIELD::Path& path,
        mode_t mode )
    {
        mrc_proxy.mkdir( Path( this->name, path ), mode );
        return true;
    }
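The mrc_proxy.mkdir call above looks like an ordinary synchronous call, but underneath it is message passing. The self-contained sketch below shows the general pattern with standard C++ threads and futures; the names MRCProxy, MkdirRequest, and serve_one are hypothetical stand-ins for the IDL-generated proxy and the network round trip, not the Yield API.

    #include <condition_variable>
    #include <future>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // A request message carrying a promise that the "server" side fulfills.
    struct MkdirRequest {
        std::string path;
        unsigned mode;
        std::promise<bool> done;
    };

    class MRCProxy {
    public:
        // Synchronous to the caller: build a request, enqueue it, block on the future.
        bool mkdir(const std::string& path, unsigned mode) {
            MkdirRequest req{path, mode, {}};
            std::future<bool> result = req.done.get_future();
            {
                std::lock_guard<std::mutex> lock(mutex_);
                queue_.push(std::move(req));
            }
            cond_.notify_one();
            return result.get();               // wait for the response
        }

        // Worker loop standing in for the wire protocol and the MRC server.
        void serve_one() {
            std::unique_lock<std::mutex> lock(mutex_);
            cond_.wait(lock, [this] { return !queue_.empty(); });
            MkdirRequest req = std::move(queue_.front());
            queue_.pop();
            lock.unlock();
            req.done.set_value(true);          // pretend the mkdir succeeded
        }

    private:
        std::queue<MkdirRequest> queue_;
        std::mutex mutex_;
        std::condition_variable cond_;
    };

    int main() {
        MRCProxy proxy;
        std::thread server([&] { proxy.serve_one(); });
        std::cout << proxy.mkdir("/volume/dir", 0755) << "\n";
        server.join();
    }

The caller blocks only on its own future, so other threads and stages keep running while the request is in flight.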
Handling concurrency
❚ Possible approaches:
  1) Let the (multiple) FUSE threads execute all of the logic of the system
     Advantages: simple at the outset
     Disadvantages: have to lock around shared data structures (see the sketch after this slide); error prone, and the code becomes a mess
  2) Have some sort of event loop
     Advantages: obviates the need for locks
     Disadvantages: the code becomes even uglier, even faster; hard to parallelize
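A brief illustration of the locking that approach 1 forces on you, assuming a hypothetical open-file counter shared by the FUSE dispatch threads (generic code, not XtreemFS):

    #include <map>
    #include <mutex>
    #include <string>

    // Approach 1: the FUSE threads run the file system logic directly, so every
    // shared structure must be guarded in every callback that touches it.
    static std::map<std::string, int> open_counts;   // hypothetical shared state
    static std::mutex open_counts_mutex;

    static int my_open(const char* path) {
        std::lock_guard<std::mutex> lock(open_counts_mutex);  // easy to forget in one of many callbacks
        ++open_counts[path];
        return 0;
    }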
Stages
❚ Decompose file system logic into stages that pass messages via queues (a minimal stage sketch follows this slide)
❚ A stage is a unit of concurrency: two stages can always run concurrently on two different physical processors
❚ Single-threaded stages: shared data structures encapsulated by a single serializing stage, so no locking
❚ Most stages should be thread-safe (otherwise Amdahl's law comes into play)
❚ A stage-aware scheduler can exploit the nature of stages as well as their communications pattern (the stage graph, similar to a process interaction graph)

[Diagram: the FUSE stage passes request messages to the Volume stage, which passes requests on to the MRC Proxy stage]
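A minimal sketch of a single-threaded stage, assuming a generic SEDA-style design rather than the actual Yield/XtreemFS classes: other stages enqueue events, one worker thread drains the queue, and any state owned by the stage is serialized by that worker without explicit locks.

    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>

    class Stage {
    public:
        using Event = std::function<void()>;

        Stage() : worker_([this] { run(); }) {}
        ~Stage() {
            enqueue(nullptr);                  // empty event = shutdown sentinel
            worker_.join();
        }

        // Called by other stages: hand an event to this stage's queue.
        void enqueue(Event event) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                queue_.push_back(std::move(event));
            }
            cond_.notify_one();
        }

    private:
        void run() {
            for (;;) {
                std::unique_lock<std::mutex> lock(mutex_);
                cond_.wait(lock, [this] { return !queue_.empty(); });
                Event event = std::move(queue_.front());
                queue_.pop_front();
                lock.unlock();
                if (!event)
                    return;
                event();   // runs on the single worker: stage-owned state needs no locks
            }
        }

        std::deque<Event> queue_;
        std::mutex mutex_;
        std::condition_variable cond_;
        std::thread worker_;
    };

    int main() {
        Stage volume_stage;                             // e.g. the Volume stage in the diagram
        volume_stage.enqueue([] { /* handle a mkdir request */ });
    }                                                   // destructor drains the queue and joins

In the diagram above, the FUSE stage would enqueue requests on the Volume stage, which in turn enqueues messages for the MRC Proxy stage.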
XtreemFS
❚ EU research project
❚ Wide-area file system with RAID, replication
❚ Aim for POSIX semantics, allow per-volume relaxation
❚ Everything in userspace
  ❚ Test new ideas with minimal implementation cost
❚ Goal: a usable file system that performs within an order of magnitude of kernel-based network file systems
XtreemFS: Features
❚ Staged design
❚ Efficient key-value store for metadata
  ❚ Based on Log-Structured Merge Trees
  ❚ Simple implementation (~5k SLOC)
  ❚ Snapshots
❚ Striping
❚ WAN operation
  ❚ Distributed replicas held consistent
  ❚ Automatic failover
❚ Security with SSL, X.509
XtreemFS: Stages
❚ Client
  [Stage diagram: FUSE → XtreemFS Volume, which passes request messages to the DIR Proxy, MRC Proxy, and File Cache stages; the File Cache passes requests on to the OSD Proxy]
❚ Servers
  ❚ Directory (DIR)
  ❚ Metadata catalogue (MRC)
  ❚ Object store (OSD)
XtreemFS: Stages cont'd
❚ Advantages of the staged design in XtreemFS:
  ❚ No locking around shared data structures like caches
  ❚ Other stages can be multithreaded to increase concurrency or offset blocking
  ❚ Graceful degradation under [over]load with queue backpressure (the original raison d'être of stages in servers; see the sketch after this slide)
  ❚ Userspace scheduling
    ❚ Per-stage queue disciplines like SRPT
    ❚ Stage selection (CPU scheduling)
    ❚ Increased cache efficiency (cohort scheduling, my research)
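One simple way to realize the queue backpressure mentioned above is to bound each stage's queue so that a producer blocks when its consumer falls behind. The template below is a generic sketch under that assumption, not the XtreemFS implementation, which could equally shed load or apply a queue discipline such as SRPT instead of FIFO.

    #include <condition_variable>
    #include <cstddef>
    #include <deque>
    #include <mutex>

    // Bounded per-stage queue: when the consumer stage cannot keep up, the
    // producer blocks in enqueue(), propagating the overload upstream instead
    // of letting the queue grow without limit.
    template <typename Event>
    class BoundedQueue {
    public:
        explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

        void enqueue(Event event) {
            std::unique_lock<std::mutex> lock(mutex_);
            not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
            queue_.push_back(std::move(event));
            not_empty_.notify_one();
        }

        Event dequeue() {
            std::unique_lock<std::mutex> lock(mutex_);
            not_empty_.wait(lock, [this] { return !queue_.empty(); });
            Event event = std::move(queue_.front());
            queue_.pop_front();
            not_full_.notify_one();
            return event;
        }

    private:
        const std::size_t capacity_;
        std::deque<Event> queue_;
        std::mutex mutex_;
        std::condition_variable not_full_, not_empty_;
    };

Replacing the FIFO std::deque with a priority structure keyed on remaining processing time would give an SRPT-style per-stage discipline.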
XtreemFS: local reads
[Benchmark chart: local read throughput, y-axis 0 to 500,000, for read, reread, reverse read, stride read, random read, pread; series: NFS with O_DIRECT, NFS, ext4 with O_DIRECT, ext4, XtreemFS, XtreemFS clienrel]
XtreemFS: local writes
[Benchmark chart: local write throughput, y-axis 0 to 250,000, for write, rewrite, random write, pwrite; same series as the previous chart]
Conclusion
❚ Project runs until June 2010
❚ Next release: beginning of May
  ❚ Re-implemented client (Linux, Windows, OS X)
  ❚ Client-side metadata and data caching
  ❚ New binary protocol (based on ONC-RPC)
  ❚ Full SSL/X.509 support
  ❚ Read-only WAN replication
  ❚ Plugin policy modules for access control

http://www.xtreemfs.org/
Thank you for your attention. Questions?