The Btrfs Filesystem
Chris Mason

Btrfs Design Goals
• Broad development community
• General purpose filesystem that scales to very large storage
  – Extents for large files
  – Small files packed in as metadata
• Flexible disk format that can adapt to new features
  – Btree indexes based on extensible key/value lookups
  – Key ordering determines relative location in the btree
• Data and metadata checksumming
  – CRC32C used for fast, hardware-accelerated CRCs
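
The point that key ordering determines location in the btree can be illustrated with a small sketch. It assumes keys are (objectid, type, offset) triples compared field by field; the type numbers and the item descriptions below are placeholder values chosen for illustration, and a sorted Python list stands in for the btree.

```python
from bisect import bisect_left

# Item type codes: placeholder values for illustration; only their order matters.
INODE_ITEM, INODE_REF, DIR_ITEM, EXTENT_DATA = 1, 12, 84, 108

# Each key is (objectid, type, offset). Python compares tuples field by field,
# which mimics sorting by objectid first, then type, then offset.
items = {
    (257, EXTENT_DATA, 4096): "extent for file 257 at offset 4096",
    (256, DIR_ITEM, 0): "directory entry in dir 256",
    (257, INODE_ITEM, 0): "inode 257",
    (258, INODE_ITEM, 0): "inode 258",
    (257, EXTENT_DATA, 0): "extent for file 257 at offset 0",
    (257, INODE_REF, 256): "name of 257 inside dir 256",
}

sorted_keys = sorted(items)  # stand-in for the on-disk btree order


def items_for_object(objectid):
    """Return every key belonging to one object using a single range lookup."""
    lo = bisect_left(sorted_keys, (objectid,))
    hi = bisect_left(sorted_keys, (objectid + 1,))
    return sorted_keys[lo:hi]


for key in items_for_object(257):
    print(key, "->", items[key])
```

Because everything about object 257 sorts together, one range scan returns its inode, its name, and its extents in file-offset order. A new feature can add a new item type without changing how lookups work.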

Btrfs Design Goals
• Data and metadata copy on write
  – Block contents preserved until replacement is safely on disk
• Data and metadata reference counting with back references
  – Every block and filename links back to its owners
• Fast, writable snapshots
  – COW enables O(1) snapshots of subvolumes
  – O(number of extents in the file) snapshots of single files
• Efficient detection of recently modified files
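
A minimal sketch of why copy on write makes subvolume snapshots O(1). This is a toy in-memory model, not the Btrfs implementation: `Node`, `Subvolume`, and `cow_insert` are hypothetical names, a binary search tree stands in for the btree, and Python's garbage collector stands in for the block reference counts.

```python
class Node:
    """A tree block that is never modified after it is created."""
    __slots__ = ("key", "value", "left", "right")

    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right


def cow_insert(node, key, value):
    """Return a new root; only the blocks on the path to `key` are copied."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value, cow_insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, cow_insert(node.right, key, value))
    return Node(key, value, node.left, node.right)


def lookup(node, key):
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None


class Subvolume:
    def __init__(self, root=None):
        self.root = root

    def write(self, key, value):
        # The old root stays intact, so anything still pointing at it is unaffected.
        self.root = cow_insert(self.root, key, value)

    def read(self, key):
        return lookup(self.root, key)

    def snapshot(self):
        # O(1): the snapshot shares the current root; no blocks are copied.
        return Subvolume(self.root)


vol = Subvolume()
vol.write("a", "v1")
vol.write("b", "v1")
snap = vol.snapshot()
vol.write("a", "v2")
print(vol.read("a"), snap.read("a"))
```

After the write, the volume and the snapshot see different values for "a" but still share the untouched block for "b". The snapshot is writable for the same reason: a write to it copies only the path it touches.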

Btrfs Design Goals
• Simple, online disk administration
  – btrfs device add /dev/xxx /mnt
  – btrfs device delete /dev/xxx /mnt
  – btrfs filesystem resize XX /mnt
    • Can also resize a single device
  – btrfs filesystem balance /mnt
• Multiple device support
  – Flexible relocation of space
  – Easily find good copies when CRCs fail
• Efficient synchronous operations that do not stall the rest of the filesystem
• These goals have been met!

Snapshots and Subvolumes
• Subvolume is the unit of snapshotting
• Snapshots are very efficient, even when many are in place against the same source
  – Individual files may be cloned without a full snapshot
  – Cloning support now in cp --reflink
• Subvolumes and snapshots may be created anywhere
• Subvolumes are roughly as expensive as directories
  – But you may not rename or hardlink files between subvolumes
• Snapshots can be written and snapshotted again

Snapshot Rollback
• The snapshot or subvolume used as the root of the filesystem can be specified
  – btrfs subvolume list to find subvolumes
  – btrfs subvolume set-default to set a new default
• Allows you to snapshot before upgrading and roll back if things don't work well
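
The snapshot-then-rollback workflow can be sketched as a dry run that only prints the commands involved. The mount point, snapshot path, and subvolume ID are hypothetical; `rollback_plan` is an illustrative helper, and nothing here touches a real filesystem.

```python
import shlex


def rollback_plan(mountpoint, snapshot_path, snapshot_id):
    """Build (but do not run) the commands for snapshot-before-upgrade and rollback."""
    before_upgrade = [
        ["btrfs", "subvolume", "snapshot", mountpoint, snapshot_path],
    ]
    to_roll_back = [
        # The ID passed to set-default is read from the output of the list command.
        ["btrfs", "subvolume", "list", mountpoint],
        ["btrfs", "subvolume", "set-default", str(snapshot_id), mountpoint],
    ]
    return before_upgrade, to_roll_back


# Hypothetical values, for illustration only.
before, rollback = rollback_plan("/", "/snapshots/pre-upgrade", 258)
for argv in before + rollback:
    print(shlex.join(argv))
```

The new default takes effect the next time the filesystem is mounted, so a rollback of the root filesystem normally ends with a reboot.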

Current Work In Progress
• Fsck with repair
  – Initially fs rescue
• Robust error handling
• RAID5/6
  – Reuse MD's parity calculation code
  – Single stripe size, adapt allocator and FS writeback to send down full stripes
• SSD front end cache
• Locking bottlenecks

SSD Optimizations
• Really just turning off rotational optimizations
• Send IO to the device right away
  – No stalling or waiting to collect more IO
• Don't avoid fragmentation
• Send large writes whenever possible
• Reuse blocks instead of spreading across the device
  – Unless you're on a cheap SSD
• Send discards down in large batches
  – Collected in bulk and sent down right after transaction commit
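
The batched discard idea can be sketched as follows. This is an illustrative model rather than the kernel code: `DiscardBatcher` and `merge_ranges` are hypothetical names, and the `send_discard` callback stands in for whatever actually issues the discard to the device.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent (start, end) byte ranges into larger ones."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


class DiscardBatcher:
    """Collect freed ranges during a transaction; discard them in bulk at commit."""

    def __init__(self, send_discard):
        self._pending = []
        self._send = send_discard

    def free(self, start, length):
        # Nothing is sent to the device yet; the range is only remembered.
        self._pending.append((start, start + length))

    def commit(self):
        # After commit the old blocks are no longer needed, so discarding is safe.
        merged = merge_ranges(self._pending)
        self._pending = []
        for start, end in merged:
            self._send(start, end - start)
        return merged


sent = []
batcher = DiscardBatcher(lambda start, length: sent.append((start, length)))
batcher.free(0, 4096)
batcher.free(8192, 4096)
batcher.free(4096, 4096)
batcher.free(1 << 20, 4096)
batcher.commit()
print(sent)
```

Three adjacent 4 KiB frees collapse into one 12 KiB discard, so the device sees a few large requests instead of many small ones.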
Why Discard/Trim

SSD Front End Cache
• Stage writes to a set of fast SSD devices
• Remapping layer to remember which blocks are up to date on the SSD
• Push frequently read extents into the SSD as well
• Hot data will stay on the SSD without hitting spinning disks
• Work in progress, slightly different from IBM's experiments over the summer

Thin Provisioning
• Btrfs storage chunks are well suited to thin provisioning
• Btrfs can return large chunks of storage back to the array
• Btrfs can quickly expand the FS
• Discard support in Btrfs sends information about unused blocks down to the storage at run time
• FITRIM ioctl support is important for thin provisioning

Atomic Writes for Applications
• COW writes to Btrfs can be atomic up to large sizes
• Some hardware supports fast atomic writes of larger IOs as well
• Work in progress to wire up Btrfs atomic write support and use optimizations from the hardware
• We may also support linked atomic writes between two or more files

Database Write Performance
• Poor random write performance in COW mode
• Large files tend to fragment badly, leading to huge amounts of metadata and seeking
• New data from random writes can be collected in bulk after transaction commit and copied back to the original location
• Work in progress

Finding Recent Modifications
• btrfs subvolume find-new
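
A sketch of how recent modifications can be found without scanning every file. It assumes each tree block records the transaction generation in which it was last written, so that with copy on write a changed leaf also gives its ancestors the newer generation. `Block` and `find_new` are hypothetical names for this toy model.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    generation: int                               # last transaction that wrote this block
    children: list = field(default_factory=list)  # interior block: child blocks
    files: dict = field(default_factory=dict)     # leaf block: file name -> generation


def find_new(root, last_gen):
    """Return (names changed after last_gen, number of blocks actually visited)."""
    changed, visited = [], 0
    stack = [root]
    while stack:
        block = stack.pop()
        if block.generation <= last_gen:
            continue  # the whole subtree is older than last_gen, skip it unread
        visited += 1
        for name, gen in block.files.items():
            if gen > last_gen:
                changed.append(name)
        stack.extend(block.children)
    return sorted(changed), visited


old_leaf = Block(generation=5, files={"a": 3, "b": 5})
new_leaf = Block(generation=9, files={"c": 4, "d": 9})
root = Block(generation=9, children=[old_leaf, new_leaf])
print(find_new(root, last_gen=5))
```

Only the root and the one modified leaf are visited, so the cost tracks the amount of change rather than the size of the filesystem.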

Btrfs Scrubbing
• Scrubbing finds and repairs bad data
• Read all the allocated extents
• Verify checksums
• Replace bad copies with data from a correct mirror
• Work in progress, initial implementation working
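
The scrub steps above can be sketched as follows. This is a toy model under stated assumptions: mirrors are plain Python lists of blocks, `scrub` is a hypothetical helper, and the bitwise `crc32c` is a slow software version of the checksum that real systems compute with hardware support.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def scrub(mirrors, checksums):
    """Verify every block on every mirror and repair bad copies from a good one.

    mirrors:   list of mirrors, each a list of blocks (bytes)
    checksums: expected CRC32C for each block index
    Returns (repaired, unrecoverable): repaired is a list of
    (block_index, mirror_index); unrecoverable is a list of block indexes.
    """
    repaired, unrecoverable = [], []
    for i, expected in enumerate(checksums):
        good = next((m[i] for m in mirrors if crc32c(m[i]) == expected), None)
        if good is None:
            unrecoverable.append(i)  # no mirror holds a valid copy
            continue
        for m_idx, mirror in enumerate(mirrors):
            if crc32c(mirror[i]) != expected:
                mirror[i] = good  # overwrite the bad copy with the verified one
                repaired.append((i, m_idx))
    return repaired, unrecoverable


blocks = [b"alpha", b"beta"]
sums = [crc32c(b) for b in blocks]
mirror0 = list(blocks)
mirror1 = [b"alpha", b"bXta"]  # simulated corruption on the second mirror
print(scrub([mirror0, mirror1], sums))
```

The checksum is what makes the repair possible: without it, a scrub could see that two mirrors disagree but could not tell which copy is the correct one.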

Conclusions
• Many things working and stable
• Focused on stability and performance
• http://btrfs.wiki.kernel.org/
• chris.mason@oracle.com