Cloud Filesystem
Jeff Darcy, for BBLISA, October 2011
What is a Filesystem?
• “The thing every OS and language knows”
• Directories, files, file descriptors
• Directories within directories
• Operate on single record (POSIX: single byte) within a file
• Built-in permissions model (e.g. UID, GID, ugo·rwx)
• Defined concurrency behaviors (e.g. fsync)
• Extras: symlinks, ACLs, xattrs
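As a concrete illustration of the byte-granular API, permissions model, and durability behavior listed above, here is a minimal C sketch of my own (not from the talk); the file name example.dat and the byte offset are arbitrary.

    /* Minimal sketch: byte-granular POSIX I/O with an explicit permission
     * mode and explicit durability via fsync. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/stat.h>

    int main(void)
    {
        /* Create a file: owner read/write, group/other read (the ugo model). */
        int fd = open("example.dat", O_CREAT | O_RDWR,
                      S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
        if (fd < 0) { perror("open"); return 1; }

        /* Operate on a single record at an arbitrary byte offset. */
        const char rec[] = "record";
        if (pwrite(fd, rec, sizeof(rec) - 1, 4096) < 0) { perror("pwrite"); return 1; }

        /* Defined durability behavior: force the data to stable storage. */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }

        close(fd);
        return 0;
    }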
Are Filesystems Relevant?
• Supported by every language and OS natively
• Shared data with rich semantics
• Graceful and efficient handling of multi-GB objects
• Permission model missing in some alternatives
• Polyglot storage, e.g. DB to index data in FS
Network Filesystems
• Extend filesystem to multiple clients
• Awesome idea so long as total required capacity/performance doesn't exceed a single server
  o ...otherwise you get server sprawl
• Plenty of commercial vendors, community experience
• Making NFS highly available brings extra headaches
Distributed Filesystems
• Aggregate capacity/performance across servers
• Built-in redundancy
  o ...but watch out: not all deal with HA transparently
• Among the most notoriously difficult kinds of software to set up, tune and maintain
  o Anyone want to see my Lustre scars?
• Performance profile can be surprising
• Result: seen as specialized solution (esp. HPC)
Example: NFS4.1/pNFS
• pNFS distributes data access across servers
• Referrals etc. offload some metadata
• Only a protocol, not an implementation
  o OSS clients, proprietary servers
• Does not address metadata scaling at all
• Conclusion: partial solution, good for compatibility, full solution might layer on top of something else
Example: Ceph
• Two-layer architecture
• Object layer (RADOS) is self-organizing
  o can be used alone for block storage via RBD (librados sketch after the diagram)
• Metadata layer provides POSIX file semantics on top of RADOS objects
• Full-kernel implementation
• Great architecture; some day it will be a great implementation
Ceph Diagram: client, metadata layer, and data objects layered on top of the Ceph RADOS layer
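To make the "object layer can be used alone" point concrete, here is a hedged librados sketch (not from the talk) that writes one object directly into RADOS with no filesystem layer involved; the pool name "data", the object name "greeting", and the config path are assumptions.

    /* Hedged sketch: talk to the RADOS object layer directly via librados.
     * Link with -lrados. Pool/object names and config path are made up. */
    #include <rados/librados.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        rados_t cluster;
        rados_ioctx_t io;

        if (rados_create(&cluster, NULL) < 0) { fprintf(stderr, "rados_create failed\n"); return 1; }
        rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");   /* assumed config path */
        if (rados_connect(cluster) < 0) { fprintf(stderr, "connect failed\n"); return 1; }

        /* Write an object straight into the object layer; no POSIX metadata in sight. */
        if (rados_ioctx_create(cluster, "data", &io) < 0) { fprintf(stderr, "no such pool\n"); return 1; }
        const char *payload = "hello, object world";
        rados_write(io, "greeting", payload, strlen(payload), 0);

        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        return 0;
    }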
Example: GlusterFS
• Single-layer architecture
  o sharding instead of layering
  o one type of server – data and metadata
• Servers are dumb, smart behavior driven by clients (toy sketch after the diagram)
• FUSE implementation
• Native, NFSv3, UFO, Hadoop
GlusterFS Diagram: client talks directly to bricks A–D, each holding both data and metadata
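The toy sketch below is my own illustration of the client-driven placement idea: the client hashes a file name and picks a brick itself, so no central metadata server is consulted. Real GlusterFS assigns per-directory hash ranges stored in xattrs and uses a different hash; the brick list and FNV-1a hash here are simplifying assumptions.

    /* Toy illustration of client-side placement: hash the file name, pick a
     * brick. Not GlusterFS code; it only shows why the servers can stay dumb. */
    #include <stdio.h>
    #include <stdint.h>

    static const char *bricks[] = { "brickA", "brickB", "brickC", "brickD" };

    /* FNV-1a, a stand-in for GlusterFS's actual hash. */
    static uint32_t hash_name(const char *name)
    {
        uint32_t h = 2166136261u;
        for (; *name; name++) {
            h ^= (uint8_t)*name;
            h *= 16777619u;
        }
        return h;
    }

    int main(void)
    {
        const char *files[] = { "notes.txt", "photo.jpg", "build.log" };
        for (int i = 0; i < 3; i++)
            printf("%s -> %s\n", files[i], bricks[hash_name(files[i]) % 4]);
        return 0;
    }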
OK, What About HekaFS?
• Don't blame me for the name
  o trademark issues are a distraction from real work
• Existing DFSes solve many problems already
  o sharding, replication, striping
• What they don't address is cloud-specific deployment
  o lack of trust (user/user and user/provider)
  o location transparency
  o operationalization
Why Start With GlusterFS?
• Not going to write my own from scratch
  o been there, done that
  o leverage existing code, community, user base
• Modular architecture allows adding functionality via an API
  o separate licensing, distribution, support
• By far the best configuration/management
• OK, so it's FUSE
  o not as bad as people think
    + add more servers
HekaFS Current Features
• Directory isolation
• ID isolation
  o “virtualize” between server ID space and tenants'
• SSL
  o encryption useful on its own
  o authentication is needed by other features
• At-rest encryption
  o keys ONLY on clients
  o AES-256 through AES-1024, “ESSIV-like”
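For the "ESSIV-like" bullet, here is a sketch of the generic ESSIV construction (as used by dm-crypt), not HekaFS's actual code: hash the data key to get a salt, then encrypt the block number under that salt to produce a per-block IV. The demo key and the use of OpenSSL are my assumptions; build with -lcrypto.

    /* Sketch of the ESSIV idea: IV = E_{H(key)}(block number).
     * Illustration only; not taken from HekaFS. */
    #include <openssl/aes.h>
    #include <openssl/sha.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static void essiv_iv(const unsigned char *key, size_t keylen,
                         uint64_t block_no, unsigned char iv[16])
    {
        unsigned char salt[SHA256_DIGEST_LENGTH];
        unsigned char plain[16] = { 0 };
        AES_KEY essiv_key;

        SHA256(key, keylen, salt);                  /* salt = H(key)          */
        AES_set_encrypt_key(salt, 256, &essiv_key); /* key the IV cipher      */
        memcpy(plain, &block_no, sizeof(block_no)); /* block number as input  */
        AES_encrypt(plain, iv, &essiv_key);         /* IV = E_salt(block#)    */
    }

    int main(void)
    {
        unsigned char key[32] = "demo key, 32 bytes of material!"; /* demo key only */
        unsigned char iv[16];

        essiv_iv(key, sizeof(key), 42, iv);
        for (int i = 0; i < 16; i++)
            printf("%02x", iv[i]);
        printf("\n");
        return 0;
    }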
HekaFS Future Features
• Enough of multi-tenancy, now for other stuff
• Improved (local/sync) replication
  o lower latency, faster repair
• Namespace (and small-file?) caching
• Improved data integrity
• Improved distribution
  o higher server counts, smoother reconfiguration
• Erasure codes?
HekaFS Global Replication
• Multi-site asynchronous
• Arbitrary number of sites
• Write from any site, even during partition
  o ordered, eventually consistent with conflict resolution
• Caching is just a special case of replication
  o interest expressed (and withdrawn), not assumed
• Some infrastructure being done early for local replication
Project Status
• All open source
  o code hosted by Fedora, bugzilla by Red Hat
  o Red Hat also pays me (and others) to work on it
• Close collaboration with Gluster
  o they do most of the work
  o they're open-source folks too
  o completely support their business model
• “current” = Fedora 16
• “future” = Fedora 17+ and Red Hat product
Contact Info
• Project
  o http://hekafs.org
  o jdarcy@redhat.com
• Personal
  o http://pl.atyp.us
  o jeff@pl.atyp.us