CephFS fsck: Distributed Filesystem Checking


  1. 1

  2. CephFS fsck: Distributed Filesystem Checking

  3. Hi, I’m Greg. Greg Farnum, CephFS Tech Lead, Red Hat. gfarnum@redhat.com. Been working as a core Ceph developer since June 2009.

  4. 4

  5. What is Ceph? An awesome, software-based, scalable, distributed storage system that is designed for failures:
      • Object storage (our native API)
      • Block devices (Linux kernel, QEMU/KVM, others)
      • RESTful S3 & Swift API object store
      • POSIX filesystem

  6. The Ceph stack:
      • LIBRADOS (APP): a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
      • RADOSGW (APP): a bucket-based REST gateway, compatible with S3 and Swift
      • RBD (HOST/VM): a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
      • CEPH FS (CLIENT): a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
      • RADOS (Reliable Autonomic Distributed Object Store): a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes; everything above sits on top of it

  7. What is CephFS? An awesome, software-based, scalable, distributed POSIX-compliant file system that is designed for failures.

  8. RADOS: a user perspective

  9. Objects in RADOS have three parts:
      • Data: a byte stream (01110010101…)
      • xattrs: small attributes (e.g. version: 1)
      • omap: a key/value store (e.g. foo -> bar, baz -> qux)

  10. The librados API: C, C++, Python, Java, shell. A file-like API:
      • read/write (extent), truncate, remove; get/set/remove xattr or key
      • efficient copy-on-write clone
      • snapshots: single object or pool-wide
      • atomic compound operations/transactions, e.g. read + getxattr, write + setxattr, or compare an xattr value and, if it matches, write + setxattr
      • "object classes": load new code into the cluster to implement new methods (calc sha1, grep/filter, generate thumbnail; encrypt, increment, rotate image), or implement your own access mechanisms (HDF5 on the node)
      • watch/notify: use an object as a communication channel between clients (locking primitive)
      • pgls: list the objects within a placement group
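
To make the file-like API above concrete, here is a minimal sketch using the Python librados bindings. It assumes a cluster reachable via a standard /etc/ceph/ceph.conf and an existing pool named "data"; the object name "greeting" is just an example.

      import rados

      # Connect to the cluster described by the local ceph.conf (assumed path).
      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()

      # Open an I/O context on a pool; "data" is a placeholder pool name.
      ioctx = cluster.open_ioctx('data')
      try:
          ioctx.write_full('greeting', b'hello world')    # write object data
          print(ioctx.read('greeting'))                   # read it back
          ioctx.set_xattr('greeting', 'version', b'1')    # attach an xattr
          print(ioctx.get_xattr('greeting', 'version'))   # read the xattr
      finally:
          ioctx.close()
          cluster.shutdown()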

  11. The RADOS cluster: a few monitors (M) and many Object Storage Devices (OSDs), with the CLIENT talking to both.

  12. Diagram: objects (Object 1–4) are grouped into pools (Pool 1, Pool 2) and stored on the Object Storage Devices (OSDs); a Monitor (M) sits alongside.

  13. Object placement is a two-step mapping: an object maps to a placement group via hash(object name) % num_pg, and the PG maps to OSDs via CRUSH(pg, cluster state, rule set).
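
A toy illustration of that two-step mapping in Python. The PG step mirrors the real formula (a stable hash of the object name modulo the PG count, though Ceph uses its own hash function rather than SHA-1); the CRUSH step is only stood in for by a simple deterministic choice, since real CRUSH walks the cluster map and rule set and respects weights and failure domains.

      import hashlib

      def pg_for_object(name: str, num_pg: int) -> int:
          # Step 1: object name -> placement group (stable hash mod PG count).
          h = int.from_bytes(hashlib.sha1(name.encode()).digest()[:4], 'little')
          return h % num_pg

      def osds_for_pg(pg: int, osds: list[str], replicas: int = 3) -> list[str]:
          # Step 2 stand-in for CRUSH(pg, cluster state, rule set): pick
          # `replicas` OSDs deterministically from the PG id. Not real CRUSH.
          ranked = sorted(osds, key=lambda o: hashlib.sha1(f"{pg}:{o}".encode()).digest())
          return ranked[:replicas]

      pg = pg_for_object("10000000001.00000000", num_pg=128)
      print(pg, osds_for_pg(pg, [f"osd.{i}" for i in range(12)]))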

  14. RADOS data guarantees:
      • Any write acked as safe will be visible to all subsequent readers
      • Any write ever visible to a reader will be visible to all subsequent readers
      • Any write acked as safe will not be lost unless the whole containing PG is lost
      • A PG will not be lost unless all N copies are lost (N is admin-configured, usually 3)…
      • …and in case of OSD failure the system will try to bring you back up to N copies (no user intervention required)

  15. RADOS data guarantees:
      • Data is regularly scrubbed to ensure copies are consistent with each other, and administrators are alerted if inconsistencies arise
      • …and while it's not automated, it's usually easy to identify the correct data with "majority voting" or similar
      • btrfs maintains checksums for certainty, and we think this is the future

  16. CephFS System Design

  17. CephFS design goals:
      • Infinitely scalable
      • Avoid all single points of failure
      • Self-managing

  18. Diagram: the CLIENT talks to the Metadata Server (MDS) for metadata and to RADOS for file data (01…10); the monitors (M) are unchanged.

  19. Scaling metadata: so we have to use multiple MetaData Servers (MDSes). Two issues:
      • Storage of the metadata
      • Ownership of the metadata

  20. Scaling Metadata – Storage: some systems store metadata on the MDS system itself, but that's a Single Point Of Failure!
      • Hot standby?
      • External metadata storage √

  21. Scaling Metadata – Ownership:
      • Traditionally: assign hierarchies manually to each MDS. But if workloads change, your nodes can unbalance.
      • Newer: hash directories onto MDSes. But then clients have to jump around for every folder traversal.

  22. one tree, two metadata servers

  23. one tree, two metadata servers

  24. The Ceph Metadata Server. Key insight: if metadata is stored in RADOS, ownership should be impermanent. One MDS is authoritative over any given subtree, but…
      • That MDS doesn't need to keep the whole tree in memory
      • There's no reason the authoritative MDS can't be changed!

  25. The Ceph MDS – Partitioning: cooperative partitioning between servers:
      • Keep track of how hot metadata is
      • Migrate subtrees to keep the heat distribution similar
      • Cheap because all metadata is in RADOS
      • Maintains locality

  26. The Ceph MDS – Persistence: all metadata is written to RADOS, and changes are only visible once they are in RADOS.

  27. The Ceph MDS – Clustering benefits:
      • Dynamic adjustment to metadata workloads: replicate hot data to distribute workload
      • Dynamic cluster sizing: add nodes as you wish, decommission old nodes at any time
      • Recover quickly and easily from failures

  28–31. (diagram-only slides)

  32. DYNAMIC SUBTREE PARTITIONING

  33. Does it work?

  34. It scales! (chart)

  35. It redistributes! (chart)

  36. Cool Extras – besides POSIX-compliance and scaling

  37. Snapshots:
      $ mkdir foo/.snap/one      # create snapshot
      $ ls foo/.snap
      one
      $ ls foo/bar/.snap
      _one_1099511627776         # parent's snap name is mangled
      $ rm foo/myfile
      $ ls -F foo
      bar/
      $ ls foo/.snap/one
      myfile bar/
      $ rmdir foo/.snap/one      # remove snapshot

  38. Recursive statistics:
      $ ls -alSh | head
      total 0
      drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
      drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
      drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
      drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
      drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
      $ getfattr -d -m ceph. pomceph
      # file: pomceph
      ceph.dir.entries="39"
      ceph.dir.files="37"
      ceph.dir.rbytes="10550153946827"
      ceph.dir.rctime="1298565125.590930000"
      ceph.dir.rentries="2454401"
      ceph.dir.rfiles="1585288"
      ceph.dir.rsubdirs="869113"
      ceph.dir.subdirs="2"
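
The same recursive statistics are exposed as ordinary virtual xattrs, so they can also be read programmatically. A small sketch; the mount point and directory name are hypothetical, but the xattr names are the ones shown above:

      import os

      path = "/mnt/cephfs/pomceph"   # hypothetical CephFS mount + directory
      rbytes   = int(os.getxattr(path, "ceph.dir.rbytes"))    # recursive byte count
      rentries = int(os.getxattr(path, "ceph.dir.rentries"))  # recursive entry count
      print(f"{path}: {rbytes} bytes across {rentries} entries")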

  39. Different storage strategies:
      • Set a "virtual xattr" on a directory and all new files underneath it follow that layout.
      • Layouts can specify lots of detail about storage: which pool file data goes into, how large file objects and stripes are, and how many objects are in a stripe set.
      • So in one cluster you can use:
        • one slow pool with big objects for Hadoop workloads
        • one fast pool with little objects for a scratch space
        • one slow pool with small objects for home directories
        • or whatever else makes sense…
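
Layouts are set through those virtual xattrs (ceph.dir.layout.*). A hedged sketch of steering new files under one directory into a different pool; the mount point and pool name are assumptions:

      import os

      # Hypothetical directory on a CephFS mount and a hypothetical pool name;
      # the xattr names are CephFS's layout xattrs.
      scratch = "/mnt/cephfs/scratch"
      os.setxattr(scratch, "ceph.dir.layout.pool", b"fast-small-objects")
      # stripe_unit, stripe_count and object_size can be set the same way, e.g.:
      # os.setxattr(scratch, "ceph.dir.layout.object_size", b"67108864")   # 64 MiB
      print(os.getxattr(scratch, "ceph.dir.layout"))   # show the resulting layout
      # New files created under `scratch` follow this layout; existing files keep theirs.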

  40. CephFS: important data structures

  41. Directory objects:
      • One (or more!) per directory
      • Deterministically named: <inode number>.<directory piece>
      • Embeds dentries and inodes for each child of the folder
      • Contains a potentially-stale versioned backtrace (path location)
      • Located in the metadata pool

  42. File objects:
      • One or more per file
      • Deterministically named: <inode number>.<object number>
      • First object contains a potentially-stale versioned backtrace
      • Located in any of the data pools
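
For a plain file with a simple layout, the object holding a given byte offset can be worked out directly from the naming rule above. A sketch; the inode number and the 4 MiB object size are assumptions (they match the default simple layout, but striped layouts interleave extents across a stripe set instead):

      def data_object_name(ino: int, offset: int, object_size: int = 4 * 1024 * 1024) -> str:
          # "<inode number in hex>.<object index as 8 hex digits>", assuming
          # consecutive objects hold consecutive file extents.
          return f"{ino:x}.{offset // object_size:08x}"

      # Byte 10 MiB of inode 0x10000000001 lives in its third object:
      print(data_object_name(0x10000000001, 10 * 1024 * 1024))   # -> "10000000001.00000002"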

  43. MDS log (objects): the MDS fully journals all metadata operations, and the log is chunked across objects.
      • Deterministically named: <log inode number>.<log piece>
      • Log objects may or may not be replayable if previous entries are lost: each entry contains what it needs, but e.g. a file move can depend on a previous rename entry
      • Located in the metadata pool

  44. MDSTable objects (single objects, all located in the metadata pool):
      • SessionMap (per-MDS): stores the state of each client session, particularly the preallocated inodes for each client
      • InoTable (per-MDS): tracks which inodes are available to allocate (this is not a traditional inode mapping table or similar)
      • SnapTable (shared): tracks system snapshot IDs and their state (in use, pending create/delete)

  45. CephFS: metadata update flow

  46. Client sends request: the CLIENT sends a request (create dir) to the MDS. The OSDs already hold log objects (log.1–log.3) and directory objects (dir.1–dir.3).

  47. MDS processes request ("early reply" and journaling): the MDS sends an early reply to the CLIENT while writing a journal entry for the operation to the log in RADOS.

  48. MDS processes request (journaling and safe reply): once the OSDs ack the journal write (a new log object, log.4), the MDS sends the safe reply to the CLIENT.

  49. …time passes… The log keeps growing (log.4–log.8) while the directory objects (dir.1–dir.3) remain unchanged.

  50. MDS flushes log: the MDS writes the updated directory object back to RADOS (directory write).

  51. MDS flushes log: the OSDs ack the directory write; the new directory object (dir.4) now sits alongside dir.1–dir.3.

  52. MDS flushes log: with the directory safely written, the MDS deletes the log objects it no longer needs (log.4 is gone; log.5–log.8 remain).
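
Summarizing slides 46–52, here is a toy, self-contained simulation of that ordering. The class and method names are illustrative stand-ins, not actual Ceph code; the point is the sequence: journal write, early reply, journal ack, safe reply, and only later the directory writeback and log trim.

      class Journal:
          """Toy stand-in for the MDS log stored as objects in RADOS."""
          def __init__(self):
              self.entries = []

          def append(self, op):
              self.entries.append(op)
              print(f"journal write: {op}")
              return len(self.entries) - 1

          def wait_for_ack(self, idx):
              # Stands in for RADOS acking that the log object write is durable.
              print(f"journal ack:   {self.entries[idx]}")

          def trim(self):
              print(f"log delete:    {self.entries}")
              self.entries = []


      class MDS:
          """Toy metadata server illustrating the update ordering, not real Ceph."""
          def __init__(self):
              self.journal = Journal()
              self.dirty_dirs = set()

          def handle(self, op, directory):
              idx = self.journal.append(op)                    # journal the intent
              print(f"early reply:   {op} (unsafe, client can proceed)")
              self.journal.wait_for_ack(idx)                   # wait for durability
              print(f"safe reply:    {op} (survives MDS failure)")
              self.dirty_dirs.add(directory)                   # directory object is now stale

          def flush_log(self):
              # Later, in the background: write dirty directory objects back to
              # RADOS, then drop the log entries they cover.
              for d in sorted(self.dirty_dirs):
                  print(f"dir write:     {d} object rewritten in RADOS")
              self.dirty_dirs.clear()
              self.journal.trim()


      mds = MDS()
      mds.handle("create dir /foo/bar", "/foo")
      mds.flush_log()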

  53. Traditional fsck
