CephFS fsck: Distributed Filesystem Checking
Hi, I’m Greg
Greg Farnum, CephFS Tech Lead, Red Hat
gfarnum@redhat.com
Been working as a core Ceph developer since June 2009
What is Ceph?
An awesome, software-based, scalable, distributed storage system that is designed for failures
• Object storage (our native API)
• Block devices (Linux kernel, QEMU/KVM, others)
• RESTful S3 & Swift API object store
• POSIX Filesystem
[Architecture diagram: APP, HOST/VM, and CLIENT layers sit on top of four interfaces, all built on RADOS]
• RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
• RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
• CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
• LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
• RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes (Reliable Autonomic Distributed Object Store)
What is CephFS?
An awesome, software-based, scalable, distributed POSIX-compliant file system that is designed for failures
RADOS
A user perspective
Objects in RADOS
• Data (01110010101 01010101010 00010101101)
• xattrs: e.g. version: 1
• omap key/value pairs: e.g. foo -> bar, baz -> qux
The librados API
C, C++, Python, Java, shell.
File-like API:
• read/write (extent), truncate, remove; get/set/remove xattr or key
• efficient copy-on-write clone
• Snapshots: single object or pool-wide
• atomic compound operations/transactions
  • read + getxattr, write + setxattr
  • compare xattr value, if match write + setxattr
• “object classes”
  • load new code into cluster to implement new methods
  • calc sha1, grep/filter, generate thumbnail
  • encrypt, increment, rotate image
  • Implement your own access mechanisms (HDF5 on the node)
• watch/notify: use object as communication channel between clients (locking primitive)
• pgls: list the objects within a placement group
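As a rough illustration of how this file-like API looks in practice, here is a minimal sketch using the python-rados binding; the config path, pool name, and object name are placeholders, and only the simple data and xattr calls from the list above are shown.

import rados

# Connect to the cluster and open an I/O context on a pool (names are placeholders).
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("mypool")
try:
    ioctx.write_full("greeting", b"hello rados")    # write the object's data
    ioctx.set_xattr("greeting", "version", b"1")    # attach an xattr
    print(ioctx.read("greeting"))                   # read the data back
    print(ioctx.get_xattr("greeting", "version"))   # read the xattr back
finally:
    ioctx.close()
    cluster.shutdown()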
The RADOS Cluster
[Diagram: a CLIENT talks to a cluster of Object Storage Devices (OSDs) and three monitors (M)]
[Diagram: Object Storage Devices (OSDs) holding objects (Object 1 through Object 4) grouped into Pool 1 and Pool 2, plus a Monitor (M)]
[Diagram: an object is mapped to a placement group via hash(object name) % num pg, and the PG is mapped to OSDs via CRUSH(pg, cluster state, rule set)]
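To make the two-step mapping concrete, here is a toy sketch; it uses SHA-1 in place of Ceph's real hash and a simple rendezvous-style ranking in place of the actual CRUSH algorithm, so treat it purely as an illustration of the idea.

import hashlib

def object_to_pg(object_name, num_pgs):
    # Step 1: hash(object name) % num pg (Ceph uses its own hash, not SHA-1).
    h = int(hashlib.sha1(object_name.encode()).hexdigest(), 16)
    return h % num_pgs

def pg_to_osds(pg, osd_ids, replicas=3):
    # Step 2: stand-in for CRUSH(pg, cluster state, rule set): rank OSDs by a
    # deterministic pseudo-random score and take the top `replicas` of them.
    def score(osd):
        return hashlib.sha1(f"{pg}.{osd}".encode()).hexdigest()
    return sorted(osd_ids, key=score)[:replicas]

pg = object_to_pg("10000000000.00000000", num_pgs=128)
print(pg, pg_to_osds(pg, osd_ids=list(range(12))))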
RADOS data guarantees
• Any write acked as safe will be visible to all subsequent readers
• Any write ever visible to a reader will be visible to all subsequent readers
• Any write acked as safe will not be lost unless the whole containing PG is lost
• A PG will not be lost unless all N copies are lost (N is admin-configured, usually 3)…
• …and in case of OSD failure the system will try to bring you back up to N copies (no user intervention required)
RADOS data guarantees
• Data is regularly scrubbed to ensure copies are consistent with each other, and administrators are alerted if inconsistencies arise
• …and while it’s not automated, it’s usually easy to identify the correct data with “majority voting” or similar
• btrfs maintains checksums for certainty and we think this is the future
CephFS System Design
CephFS Design Goals
• Infinitely scalable
• Avoid all Single Points Of Failure
• Self Managing
[Diagram: the RADOS cluster with a Metadata Server (MDS) added; the CLIENT sends metadata operations to the MDS and reads/writes file data directly against the OSDs, while the monitors (M) track cluster state]
Scaling Metadata
So we have to use multiple MetaData Servers (MDSes)
Two Issues:
• Storage of the metadata
• Ownership of the metadata
Scaling Metadata – Storage
Some systems store metadata on the MDS system itself
But that’s a Single Point Of Failure!
• Hot standby?
• External metadata storage √
Scaling Metadata – Ownership
Traditionally: assign hierarchies manually to each MDS
• But if workloads change, your nodes can become unbalanced
Newer: hash directories onto MDSes
• But then clients have to jump around for every folder traversal
one tree, two metadata servers
The Ceph Metadata Server
Key insight: if metadata is stored in RADOS, ownership should be impermanent
One MDS is authoritative over any given subtree, but...
• That MDS doesn’t need to keep the whole tree in-memory
• There’s no reason the authoritative MDS can’t be changed!
The Ceph MDS – Partitioning
Cooperative Partitioning between servers:
• Keep track of how hot metadata is
• Migrate subtrees to keep heat distribution similar
• Cheap because all metadata is in RADOS
• Maintains locality
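A toy sketch of that heat-based balancing loop, assuming hypothetical per-subtree heat counters and per-MDS load numbers; this only illustrates the idea from the bullets above, not Ceph's actual balancer code.

def pick_exports(subtree_heat, my_load, peer_loads):
    # subtree_heat: {subtree path: heat}; peer_loads: {mds rank: load}.
    # Export the hottest subtrees to the coolest peers until this MDS is
    # close to the cluster-wide average load.
    target = (my_load + sum(peer_loads.values())) / (1 + len(peer_loads))
    surplus = my_load - target
    exports = []
    for path, heat in sorted(subtree_heat.items(), key=lambda kv: -kv[1]):
        if surplus <= 0:
            break
        coolest = min(peer_loads, key=peer_loads.get)
        exports.append((path, coolest))
        peer_loads[coolest] += heat
        surplus -= heat
    return exports

print(pick_exports({"/home/a": 50.0, "/home/b": 10.0, "/logs": 5.0},
                   my_load=65.0, peer_loads={1: 5.0, 2: 10.0}))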
The Ceph MDS – Persistence
All metadata is written to RADOS
• And changes are only visible once in RADOS
The Ceph MDS – Clustering Benefits
Dynamic adjustment to metadata workloads
• Replicate hot data to distribute workload
Dynamic cluster sizing:
• Add nodes as you wish
• Decommission old nodes at any time
Recover quickly and easily from failures
DYNAMIC SUBTREE PARTITIONING
Does it work?
It scales!
It redistributes!
Cool Extras
Besides POSIX-compliance and scaling
Snapshots
$ mkdir foo/.snap/one          # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776             # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls foo/.snap/one
myfile bar/
$ rmdir foo/.snap/one          # remove snapshot
Recursive statistics
$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
$ getfattr -d -m ceph. pomceph
# file: pomceph
ceph.dir.entries="39"
ceph.dir.files="37"
ceph.dir.rbytes="10550153946827"
ceph.dir.rctime="1298565125.590930000"
ceph.dir.rentries="2454401"
ceph.dir.rfiles="1585288"
ceph.dir.rsubdirs="869113"
ceph.dir.subdirs="2"
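The same recursive statistics can be read programmatically; here is a small sketch using Python's os.getxattr on Linux, where the mount point and directory are hypothetical and the ceph.dir.* names are the ones shown above.

import os

path = "/mnt/cephfs/pomceph"                           # hypothetical CephFS mount + directory
rbytes = int(os.getxattr(path, "ceph.dir.rbytes"))     # recursive byte count
rfiles = int(os.getxattr(path, "ceph.dir.rfiles"))     # recursive file count
print(f"{path}: {rbytes} bytes across {rfiles} files")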
Different Storage strategies
• Set a “virtual xattr” on a directory and all new files underneath it follow that layout
• Layouts can specify lots of detail about storage:
  • which pool file data goes into
  • how large file objects and stripes are
  • how many objects are in a stripe set
• So in one cluster you can use:
  • one slow pool with big objects for Hadoop workloads
  • one fast pool with little objects for a scratch space
  • one slow pool with small objects for home directories
  • or whatever else makes sense...
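For illustration, setting such a layout might look like the following on a mounted CephFS; the mount point and pool name are hypothetical, and the exact ceph.dir.layout.* virtual-xattr names vary by Ceph version, so treat them as assumptions.

import os

scratch = "/mnt/cephfs/scratch"    # hypothetical directory on a CephFS mount
# Assumed virtual-xattr names; check your Ceph version's documentation.
os.setxattr(scratch, "ceph.dir.layout.pool", b"fast-ssd-pool")
os.setxattr(scratch, "ceph.dir.layout.object_size", b"1048576")   # 1 MiB objects
# New files created under `scratch` should now follow this layout.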
CephFS Important Data structures
Directory objects
• One (or more!) per directory
• Deterministically named: <inode number>.<directory piece>
• Embeds dentries and inodes for each child of the folder
• Contains a potentially-stale versioned backtrace (path location)
• Located in the metadata pool
File objects
• One or more per file
• Deterministically named <ino number>.<object number>
• First object contains a potentially-stale versioned backtrace
• Located in any of the data pools
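As a sketch of that naming scheme, a file's RADOS object names can be derived from its inode number and size; the hexadecimal <ino>.<8-hex-digit object number> formatting shown here is how CephFS data objects commonly appear, but treat the exact format and the 4 MiB default object size as assumptions.

import os

def file_object_names(path, object_size=4 * 1024 * 1024):
    # On a CephFS mount, st_ino is the CephFS inode number.
    st = os.stat(path)
    num_objects = max(1, (st.st_size + object_size - 1) // object_size)
    return ["%x.%08x" % (st.st_ino, n) for n in range(num_objects)]

# e.g. a 10 MiB file with inode 0x10000000000 and 4 MiB objects gives:
# ['10000000000.00000000', '10000000000.00000001', '10000000000.00000002']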
MDS log (objects)
The MDS fully journals all metadata operations. The log is chunked across objects.
• Deterministically named <log inode number>.<log piece>
• Log objects may or may not be replayable if previous entries are lost
  • Each entry contains what it needs, but e.g. a file move can depend on a previous rename entry
• Located in the metadata pool
MDSTable objects
• Single objects
• SessionMap (per-MDS)
  • Stores the state of each client Session
  • Particularly: preallocated inodes for each client
• InoTable (per-MDS)
  • Tracks which inodes are available to allocate
  • (This is not a traditional inode mapping table or similar)
• SnapTable (shared)
  • Tracks system snapshot IDs and their state (in use, pending create/delete)
• All located in the metadata pool
CephFS Metadata update flow
Client Sends Request
[Diagram: the CLIENT sends a “create dir” request to the MDS; the OSDs hold log objects (log.1, log.2, log.3) and directory objects (dir.1, dir.2, dir.3)]
MDS Processes Request: “Early Reply” and journaling
[Diagram: the MDS sends an Early Reply to the CLIENT and issues a Journal Write to the OSDs]
MDS Processes Request: journaling and safe reply
[Diagram: the OSDs ack the journal write (a new log.4 object now exists) and the MDS sends a Safe Reply to the CLIENT]
…time passes…
[Diagram: the journal has grown to log.4 through log.8; dir.1, dir.2, and dir.3 are unchanged]
MDS Flushes Log
[Diagram: the MDS issues a Directory Write to the OSDs]
MDS Flushes Log
[Diagram: the OSDs ack the directory write; a new dir.4 object now exists]
MDS Flushes Log
[Diagram: the MDS deletes the now-unneeded journal object (Log Delete), trimming log.4 from the log]
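Pulling the preceding slides together, the update flow can be summarized with a toy, self-contained simulation; the class and method names are hypothetical stand-ins for the real MDS internals, and only the ordering of steps shown in the diagrams is modeled.

class ToyMDS:
    def __init__(self):
        self.memory = {}        # in-memory metadata (dirty until flushed)
        self.journal = []       # stand-in for the log.N objects in RADOS
        self.rados_dirs = {}    # stand-in for the dir.N objects in RADOS

    def handle_request(self, name):
        self.memory[name] = {"dirty": True}           # apply the change in memory
        print("early reply (unsafe) for", name)       # client can proceed
        self.journal.append(name)                     # journal write to RADOS
        print("journal acked; safe reply for", name)  # change is now durable

    def flush_log(self):
        # Later: write back dirty directory objects, then trim the journal.
        for name, meta in self.memory.items():
            if meta["dirty"]:
                self.rados_dirs[name] = {"dirty": False}   # Directory Write + ack
                meta["dirty"] = False
        self.journal.clear()                               # Log Delete

mds = ToyMDS()
mds.handle_request("create dir foo")
mds.flush_log()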
Traditional fsck