The Btrfs Filesystem, Chris Mason



  1. The Btrfs Filesystem Chris Mason

  2. Btrfs Design Goals
     • Broad development community
     • General purpose filesystem that scales to very large storage
       – Extents for large files
       – Small files packed in as metadata
     • Flexible disk format that can adapt to new features
       – Btree indexes based on extensible key/value lookups
       – Key ordering determines relative location in the btree
     • Data and metadata checksumming
       – CRC32C used for fast, hardware-accelerated CRCs
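
The point about key ordering can be illustrated with a small sketch. It assumes the usual Btrfs key layout of (objectid, type, offset) compared field by field; the sample keys and the helper name `items_for_object` are made up for illustration, and a sorted Python list stands in for the btree.

```python
import bisect

# Item type codes as used in Btrfs keys; treat the exact values as illustrative.
INODE_ITEM, INODE_REF, EXTENT_DATA = 1, 12, 108


def items_for_object(sorted_keys, objectid):
    """Return every key belonging to one object (for example, one inode).

    Keys compare as (objectid, type, offset) tuples, so all items for an
    object are adjacent in sort order and can be found with two binary
    searches instead of a full scan.
    """
    lo = bisect.bisect_left(sorted_keys, (objectid, 0, 0))
    hi = bisect.bisect_left(sorted_keys, (objectid + 1, 0, 0))
    return sorted_keys[lo:hi]


# Hypothetical keys, inserted in arbitrary order and then sorted.
keys = sorted([
    (257, EXTENT_DATA, 4096),
    (256, INODE_ITEM, 0),
    (257, INODE_ITEM, 0),
    (257, EXTENT_DATA, 0),
    (258, INODE_ITEM, 0),
    (257, INODE_REF, 256),
])

for key in items_for_object(keys, 257):
    print(key)
```

Because the inode item, its back reference, and its file extents sort next to each other, related metadata tends to land in the same or neighbouring leaves.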

  3. Btrfs Design Goals
     • Data and metadata copy on write
       – Block contents preserved until replacement is safely on disk
     • Data and metadata reference counting with back references
       – Every block and filename links back to its owner
     • Fast, writable snapshots
       – COW enables O(1) snapshots of subvolumes
       – O(number of extents in the file) snapshots of single files
     • Efficient detection of recently modified files
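
The copy-on-write and snapshot bullets above can be sketched with a small in-memory model. This is not Btrfs code: `Node`, `Subvolume`, and `_cow_set` are hypothetical names, and real Btrfs tracks sharing with reference counts on disk blocks rather than relying on a garbage collector. The sketch only shows why a snapshot can be O(1) and why both sides stay writable.

```python
class Node:
    """Tree node mapping a name to a child Node or to file contents (bytes)."""

    def __init__(self, entries=None):
        self.entries = dict(entries or {})


def _cow_set(node, parts, data):
    """Return a new copy of `node` with `data` stored at the path `parts`.

    Only the nodes along the path are copied; every other subtree is shared
    with the old tree, which is left untouched.
    """
    new = Node(node.entries)
    head, rest = parts[0], parts[1:]
    if not rest:
        new.entries[head] = data
    else:
        child = node.entries.get(head)
        if not isinstance(child, Node):
            child = Node()
        new.entries[head] = _cow_set(child, rest, data)
    return new


class Subvolume:
    def __init__(self, root=None):
        self.root = root if root is not None else Node()

    def snapshot(self):
        # O(1): the snapshot starts out pointing at the same root node.
        return Subvolume(self.root)

    def write(self, path, data):
        # The old root stays valid until the new one replaces it.
        self.root = _cow_set(self.root, path.strip("/").split("/"), data)

    def read(self, path):
        node = self.root
        for part in path.strip("/").split("/"):
            node = node.entries[part]
        return node


vol = Subvolume()
vol.write("etc/conf", b"v1")
vol.write("var/log", b"boot")
snap = vol.snapshot()
vol.write("etc/conf", b"v2")   # the snapshot still sees v1
snap.write("etc/extra", b"x")  # snapshots are writable too
```

After these writes the untouched `var` subtree is still the same object in both trees, which is the sharing that keeps many snapshots of one source cheap.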

  4. Btrfs Design Goals
     • Simple, online disk administration
       – btrfs dev add /dev/xxx /mnt
       – btrfs dev delete /dev/xxx /mnt
       – btrfs filesystem resize XX /mnt
         • Can also resize a single device
       – btrfs filesystem balance /mnt
     • Multiple device support
       – Flexible relocation of space
       – Easily find good copies when CRCs fail
     • Efficient synchronous operations that do not stall the rest of the filesystem
     • These goals have been met!

  5. Snapshots and Subvolumes
     • Subvolume is the unit of snapshotting
     • Snapshots are very efficient, even when many are in place against the same source
       – Individual files may be cloned without a full snapshot
       – Cloning support now in cp --reflink
     • Subvolumes and snapshots may be created anywhere
     • Subvolumes are roughly as expensive as directories
       – But you may not rename or hardlink files between subvolumes
     • Snapshots can be written and snapshotted again

  6. Snapshot Rollback
     • The snapshot or subvolume used as the root of the filesystem can be specified
       – btrfs subvol list to find subvolumes
       – btrfs subvolume set-default to set a new default
     • Allows you to snapshot before upgrading and roll back if things don't work well

  7. Current Work In Progress
     • Fsck with repair
       – Initially fs rescue
     • Robust error handling
     • RAID5/6
       – Reuse MD's parity calculation code
       – Single stripe size, adapt allocator and FS writeback to send down full stripes
     • SSD front end cache
     • Locking bottlenecks

  8. SSD Optimizations
     • Really just turning off rotational optimizations
     • Send IO to the device right away
       – No stalling or waiting to collect more IO
     • Don't avoid fragmentation
     • Send large writes whenever possible
     • Reuse blocks instead of spreading across the device
       – Unless you're on a cheap SSD
     • Send discards down in large batches
       – Collected in bulk and sent down right after transaction commit
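
The batched-discard bullet can be sketched as follows. This is an illustrative model only, not the kernel implementation: `merge_extents` and `DiscardBatcher` are hypothetical names, extents are (start, length) pairs in arbitrary units, and `commit()` simply returns the ranges that would be sent to the device.

```python
def merge_extents(extents):
    """Merge overlapping or adjacent (start, length) ranges into larger ones."""
    merged = []
    for start, length in sorted(extents):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            prev_start, prev_len = merged[-1]
            end = max(prev_start + prev_len, start + length)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, length))
    return merged


class DiscardBatcher:
    """Collect freed extents during a transaction; emit them at commit."""

    def __init__(self):
        self.pending = []

    def free(self, start, length):
        # Nothing is sent to the device yet; the range is only remembered.
        self.pending.append((start, length))

    def commit(self):
        # After the transaction commits the old blocks are no longer
        # referenced, so the merged ranges can be discarded in bulk.
        batch = merge_extents(self.pending)
        self.pending = []
        return batch


batcher = DiscardBatcher()
for extent in [(100, 50), (0, 10), (10, 20), (150, 10)]:
    batcher.free(*extent)
batch = batcher.commit()
print(batch)  # [(0, 30), (100, 60)]
```

Four small frees become two larger discard requests, which is the effect the slide describes.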

  9. Why Discard/Trim

  10. SSD Front End Cache
      • Stage writes to a set of fast SSD devices
      • Remapping layer to remember which blocks are up to date on the SSD
      • Push frequently read extents into the SSD as well
      • Hot data will stay on the SSD without hitting spinning disks
      • Work in progress, slightly different from IBM's experiments over the summer

  11. Thin Provisioning
      • Btrfs storage chunks are well suited to thin provisioning
      • Btrfs can return large chunks of storage back to the array
      • Btrfs can quickly expand the FS
      • Discard support in Btrfs sends information about unused blocks down to the storage at run time
      • FITRIM ioctl support is important for thin provisioning

  12. Atomic Writes for Applications
      • COW writes to Btrfs can be atomic up to large sizes
      • Some hardware supports fast atomic writes of larger IOs as well
      • Work in progress to wire up Btrfs atomic write support and use optimizations from the hardware
      • We may also support linked atomic writes between two or more files

  13. Database Write Performance
      • Poor random write performance in COW mode
      • Large files tend to fragment badly, leading to huge amounts of metadata and seeking
      • New data from random writes can be collected in bulk after transaction commit and copied back to the original location
      • Work in progress

  14. Finding Recent Modifications
      • btrfs subvol find-new
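
One way to see why this kind of search can be cheap is a generation-based tree walk. The sketch below is a simplified model under stated assumptions, not the btrfs-progs implementation: it assumes every node records the transaction generation in which it was last rewritten, and that copy-on-write makes a parent's generation at least as new as any child's. `Leaf`, `Interior`, and `find_new` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    generation: int  # generation in which this leaf was last rewritten
    files: dict      # filename -> generation of its last modification


@dataclass
class Interior:
    generation: int
    children: list


def find_new(node, last_gen):
    """Return names of files modified after generation `last_gen`.

    A subtree whose root is not newer than `last_gen` cannot contain newer
    items, so it is skipped without being read.
    """
    if node.generation <= last_gen:
        return []
    if isinstance(node, Leaf):
        return sorted(name for name, gen in node.files.items() if gen > last_gen)
    found = []
    for child in node.children:
        found.extend(find_new(child, last_gen))
    return found


old_leaf = Leaf(generation=5, files={"a": 3, "b": 5})
new_leaf = Leaf(generation=9, files={"c": 4, "d": 9})
root = Interior(generation=9, children=[old_leaf, new_leaf])

print(find_new(root, 5))  # ['d']
```

With `last_gen` set to 5, `old_leaf` is pruned entirely and only `new_leaf` is examined.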

  15. Btrfs Scrubbing
      • Scrubbing finds and repairs bad data
      • Read all the allocated extents
      • Verify checksums
      • Replace bad copies with correct mirror
      • Work in progress, initial implementation working
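
The scrub steps above (read, verify, replace from a good mirror) can be sketched in a few lines. This is a toy model, not the kernel's scrub code: mirrors are plain Python lists of blocks, `scrub` is a hypothetical helper, and `crc32c` is a slow bitwise CRC-32C (Castagnoli polynomial) written out so the example has no dependencies. The exact way Btrfs seeds and stores its checksums on disk is not modelled here.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C using the reflected Castagnoli polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def scrub(mirrors, checksums):
    """Verify every block on every mirror and repair bad copies in place.

    mirrors:   list of mirrors, each a list of blocks (bytes)
    checksums: expected CRC-32C for each block index
    Returns a list of (mirror_index, block_index) pairs that were repaired.
    """
    repaired = []
    for i, expected in enumerate(checksums):
        good = next((m[i] for m in mirrors if crc32c(m[i]) == expected), None)
        if good is None:
            continue  # no good copy anywhere: cannot repair this block
        for mirror_index, mirror in enumerate(mirrors):
            if crc32c(mirror[i]) != expected:
                mirror[i] = good
                repaired.append((mirror_index, i))
    return repaired


blocks = [b"alpha", b"beta", b"gamma"]
sums = [crc32c(b) for b in blocks]
mirror0, mirror1 = list(blocks), list(blocks)
mirror0[1] = b"bXta"   # simulated corruption on the first mirror
mirror1[2] = b"gamm?"  # simulated corruption on the second mirror

repaired = scrub([mirror0, mirror1], sums)
print(repaired)  # [(0, 1), (1, 2)]
```

Each corrupted block is rewritten from the mirror whose copy still matches the stored checksum; a block that is bad on every mirror is left alone.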

  16. Conclusions
      • Many things working and stable
      • Focused on stability and performance
      • http://btrfs.wiki.kernel.org/
      • chris.mason@oracle.com
