

  1. Scaling the Btrfs Free Space Cache Omar Sandoval Vault 2016

  2. Outline Background Design Background Identifying the Problem Solution The Free Space Tree Results Questions

  3. Background

  4. What is Btrfs? • General-purpose filesystem • Advanced features • First-class snapshots • Checksumming of all data and metadata • Transparent compression • Integrated multi-device support (RAID) • Incremental backups with send/receive • Actively developed

  5. Contributors $ git log --since="January 1, 2015" --pretty=format:"%an" --no-merges -- fs/btrfs Filipe Manana Dan Carpenter Borislav Petkov Mike Snitzer David Sterba Forrest Liu Andreas Gruenbacher Mel Gorman Zhao Lei Yang Dongsheng Alex Lyakas Matthew Wilcox Qu Wenruo Tsutomu Itoh Zach Brown Konstantin Khlebnikov Anand Jain Michal Hocko Yauhen Kharuzhy Jan Kara Josef Bacik Kirill A. Shutemov Yaowei Bai Guenter Roeck Omar Sandoval Jiri Kosina Vladimir Davydov Greg Kroah-Hartman Chris Mason Daniel Dressler Tom Van Braeckel Geert Uytterhoeven Zhaolei Wang Shilong Sudip Mukherjee Gabríel Arthúr Pétursson Liu Bo Satoru Takeuchi Shilong Wang Dmitry Monakhov Chandan Rajendra Naohiro Aota Shaohua Li Deepa Dinamani Dongsheng Yang Luis de Bethencourt Shan Hai Davide Italiano Alexandru Moise Kinglong Mee Sebastian Andrzej Siewior Dave Jones Mark Fasheh Kent Overstreet Sasha Levin Darrick J. Wong Eric Sandeen Justin Maggard Sam Tygier Colin Ian King Jeff Mahoney Jens Axboe Robin Ruede Chengyu Song Byongho Lee Holger Hoffstätte Rasmus Villemoes chandan r Christoph Hellwig Gui Hecheng Quentin Casasnovas Ashish Samant Al Viro Fabian Frederick Pranith Kumar Arnd Bergmann chandan David Howells Oleg Nesterov Adam Buchbinder Geliang Tang Christian Engelmayer NeilBrown

  6. Btrfs at Facebook • GlusterFS • Testing on other tiers • Web servers • Build machines • More...

  7. Design Background

  8. Copy-on-Write B-trees • B+-trees with no leaf chaining • Lock coupling • Relaxed balancing constraints, proactive balancing • Lazy reference-counting • Update in memory, write out in periodic commits • Due to O. Rodeh

  9. Copy-on-Write B-tree Insertion. Figure: a tree whose root [1, 4, 10] points to leaves [1, 2], [5, 6, 7], and [10, 11].

  10. Copy-on-Write B-tree Insertion. Figure: inserting 19 copies the path from the root: a new root [1, 4, 10] points to a new leaf [10, 11, 19], while the untouched leaves [1, 2] and [5, 6, 7] are shared with the old version of the tree.

  11. Copy-on-Write B-tree Deletion. Figure: the same tree, root [1, 4, 10] with leaves [1, 2], [5, 6, 7], and [10, 11].

  12. Copy-on-Write B-tree Deletion. Figure: deleting 6 copies the path: a new root [1, 4, 10] points to a new leaf [5, 7]; leaves [1, 2] and [10, 11] are shared with the old version.
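
The figures above show path copying: every node on the root-to-leaf path is rewritten to a new location, and untouched subtrees are shared between the old and new versions. A minimal sketch in C of that idea (illustrative only, not the btrfs implementation; node splitting and rebalancing are omitted):

	/* Copy-on-write insertion: duplicate each node on the path before
	 * modifying it; children outside the path are shared, not copied. */
	#include <stdlib.h>
	#include <string.h>

	#define MAX_KEYS 16

	struct node {
		int nkeys;
		long keys[MAX_KEYS];		/* key i = smallest key in child i */
		struct node *child[MAX_KEYS];	/* all NULL in a leaf */
	};

	static struct node *copy_node(const struct node *n)
	{
		struct node *c = malloc(sizeof(*c));
		memcpy(c, n, sizeof(*c));	/* shallow copy: children stay shared */
		return c;
	}

	/* Returns the root of the new tree version; the old root stays
	 * valid, which is what makes snapshots and crash consistency cheap. */
	static struct node *cow_insert(const struct node *n, long key)
	{
		struct node *c = copy_node(n);

		if (!c->child[0]) {		/* leaf: insert into sorted key array */
			int i = c->nkeys;
			while (i > 0 && c->keys[i - 1] > key) {
				c->keys[i] = c->keys[i - 1];
				i--;
			}
			c->keys[i] = key;
			c->nkeys++;		/* splitting on overflow omitted */
		} else {			/* interior: recurse into covering child */
			int i = 0;
			while (i < c->nkeys - 1 && key >= c->keys[i + 1])
				i++;
			c->child[i] = cow_insert(c->child[i], key);
		}
		return c;
	}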

  13. Btrfs B-trees • Generic data structure • Three main structures: • Block header: header of every block in the B-tree; contains checksum, flags, filesystem ID, generation, etc. • Key: comprises 64-bit object ID, 8-bit type, and 64-bit offset • Item: comprises key, 32-bit offset, and 32-bit size • Nodes contain only keys and block pointers to children • Leaves contain array of items and data

  14. B-tree Nodes. Figure: a struct btrfs_disk_key with objectid=256, type=INODE_ITEM, offset=0.

  15. B-tree Nodes. Figure: a struct btrfs_key_ptr: the struct btrfs_disk_key (256, INODE_ITEM, 0) plus blockptr=46976991232 and generation=51107.

  16. B-tree Nodes. Figure: a struct btrfs_node: a struct btrfs_header followed by an array of struct btrfs_key_ptr entries, e.g. key (256, INODE_ITEM, 0) with blockptr=46976991232, generation=51107 and key (261, DIR_ITEM, 358895273) with blockptr=46976319488, generation=51107.

  17. B-tree Leaves. Figure: a struct btrfs_item: the struct btrfs_disk_key (256, INODE_ITEM, 0) plus offset=16123 and size=160.

  18. B-tree Leaves. Figure: a struct btrfs_leaf: a struct btrfs_header followed by an array of struct btrfs_item entries, e.g. key (256, INODE_ITEM, 0) with offset=16123, size=160 pointing at a struct btrfs_inode_item, and key (256, INODE_REF, 256) with offset=16111, size=12 pointing at a struct btrfs_inode_ref.

  19. B-tree Structure. Figure: interior struct btrfs_node blocks (struct btrfs_header plus struct btrfs_key_ptr entries) point to child nodes; the bottom level consists of struct btrfs_leaf blocks whose struct btrfs_item entries point at their data.
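
For reference, the key, key pointer, and item layouts shown in these figures correspond to the following on-disk structures (abridged from the kernel's fs/btrfs/ctree.h; integer fields are little-endian on disk):

	#include <linux/types.h>

	struct btrfs_disk_key {
		__le64 objectid;
		__u8 type;
		__le64 offset;
	} __attribute__ ((__packed__));

	/* Nodes: the block header is followed by an array of these. */
	struct btrfs_key_ptr {
		struct btrfs_disk_key key;	/* smallest key in the child block */
		__le64 blockptr;		/* logical address of the child */
		__le64 generation;
	} __attribute__ ((__packed__));

	/* Leaves: an array of these grows forward from the header, while
	 * the item data they point at grows backward from the end of the
	 * block. */
	struct btrfs_item {
		struct btrfs_disk_key key;
		__le32 offset;		/* where the item data starts in the leaf */
		__le32 size;
	} __attribute__ ((__packed__));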

  20. B-trees Used in Btrfs • Root tree • FS trees • Extent tree • Checksum tree • Chunk tree, device tree • Relocation tree, log tree, quota tree, UUID tree

  21. B-trees Used in Btrfs • Root tree • FS trees • Extent tree • Checksum tree • Chunk tree, device tree • Relocation tree, log tree, quota tree, UUID tree • Free space tree: subject of this talk!

  22. Identifying the Problem

  23. Symptoms • Large (tens of terabytes), busy filesystems on GlusterFS • Stalls during commits, sometimes seconds long • Narrowed down the problem to write-out of the free space cache

  24. Terminology • Chunk: physical unit of allocation, 1 GB for data, 256 MB for metadata • Block group: logical unit of allocation, comprises one or more chunks • Extent: range of the filesystem used for a particular purpose. Figure: Basic terminology: chunks on Disk 1 and Disk 2 making up a block group that contains extents.

  25. Extent Tree • Tracks allocated space with references • Three types of items: • Block group items • Extent items • Metadata items. Figure: Data block group: a BLOCK_GROUP_ITEM followed by EXTENT_ITEM entries.

  26. Extent Tree • Tracks allocated space with references • Three types of items: • Block group items • Extent items • Metadata items. Figure: Metadata block group: a BLOCK_GROUP_ITEM followed by METADATA_ITEM entries.

  27. Free Space Cache • Track free space in memory after loading a block group • Red-black tree • Hybrid of extents + bitmaps • Loading directly from the extent tree is slow • Instead, dump the in-memory cache to disk • Special free space inodes, one per block group • Compact, very fast to load • Has to be written out in its entirety
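
A hedged sketch of the in-memory entry behind this hybrid scheme (field names modeled on struct btrfs_free_space in fs/btrfs/free-space-cache.h, abridged): each red-black tree node describes either one contiguous free extent or, when bitmap is set, a fixed-size region tracked at block granularity:

	#include <stdint.h>

	struct free_space_entry {
		uint64_t offset;	/* start of the free extent or bitmap region */
		uint64_t bytes;		/* free bytes accounted by this entry */
		unsigned long *bitmap;	/* NULL for an extent entry; otherwise
					   one bit per block in the region */
		/* the kernel struct also embeds the rb_node that links the
		   entry into the per-block-group red-black tree, keyed by
		   offset */
	};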

  28. Delayed Refs • Don't want to update the extent tree on disk for every operation • Instead, track updates in memory (red-black tree) • Run at commit time as a batch. Figure: Delayed refs tree: nodes keyed by extent (8, 16, 24, 32, 128) holding pending operations such as add 2, drop 1, add 1.
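
An illustrative sketch of the delayed-ref idea (a linear array and hypothetical names instead of the kernel's red-black tree; not the btrfs code): reference-count changes are merged in memory and applied to the extent tree in one batch at commit time:

	#include <stdint.h>
	#include <stdio.h>

	#define MAX_PENDING 1024

	struct delayed_ref {
		uint64_t bytenr;	/* extent whose reference count changed */
		int delta;		/* +1 per add, -1 per drop */
	};

	static struct delayed_ref pending[MAX_PENDING];
	static int npending;

	/* Record a change, merging with an existing entry when possible. */
	static void record_ref(uint64_t bytenr, int delta)
	{
		for (int i = 0; i < npending; i++) {
			if (pending[i].bytenr == bytenr) {
				pending[i].delta += delta;
				return;
			}
		}
		pending[npending++] = (struct delayed_ref){ bytenr, delta };
	}

	/* At commit, apply surviving deltas in one pass; changes that
	 * cancelled out (add then drop) never touch the disk at all. */
	static void run_delayed_refs(void)
	{
		for (int i = 0; i < npending; i++)
			if (pending[i].delta)
				printf("update extent %llu by %+d\n",
				       (unsigned long long)pending[i].bytenr,
				       pending[i].delta);
		npending = 0;
	}

	int main(void)
	{
		record_ref(8, +2);	/* operations like those in the figure */
		record_ref(16, -1);
		record_ref(24, +1);
		run_delayed_refs();
		return 0;
	}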

  29. Diagnosis • Large, busy filesystem can dirty many block groups • Have to write out the free space cache for each one • Run during the critical section of the commit • Blocks other operations

  30. Solution

  31. False Start • Start writing out free space cache outside of critical section • Redo block groups that got dirtied again during that window • Still possible for a lot to get dirtied again • Back to square one

  32. The Free Space Tree

  33. Key Ideas • Use another B-tree instead • Inverse of the extent tree

  34. On-Disk Format • Per-block group free space info • Extent count • Flags • Free space extents • Start • Length • Free space bitmaps • Start • Length • Bitmap. Figure: Free space tree extents: FREE_SPACE_EXTENT entries mirroring the gaps between the extent tree's EXTENT_ITEM entries.

  35. On-Disk Format • Per-block group free space info • Extent count • Flags • Free space extents • Start • Length • Free space bitmaps • Start • Length • Bitmap. Figure: Free space tree bitmaps: FREE_SPACE_BITMAP entries covering the same gaps between the extent tree's EXTENT_ITEM entries.
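
A sketch of the corresponding on-disk items (the info item follows struct btrfs_free_space_info from the kernel headers; the key conventions in the comments summarize the free space tree format described above):

	#include <linux/types.h>

	/* One per block group: key = (block group start, FREE_SPACE_INFO,
	 * block group length). */
	struct btrfs_free_space_info {
		__le32 extent_count;	/* free extents in this block group */
		__le32 flags;		/* e.g. whether bitmaps are in use */
	} __attribute__ ((__packed__));

	/*
	 * The extents and bitmaps themselves are encoded almost entirely
	 * in their keys:
	 *
	 *   FREE_SPACE_EXTENT: key = (start, FREE_SPACE_EXTENT, length),
	 *                      no item body at all
	 *   FREE_SPACE_BITMAP: key = (start, FREE_SPACE_BITMAP, length),
	 *                      item body = the bitmap, one bit per block
	 */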

  36. Maintaining the Free Space Tree • Piggy-back updates on delayed refs • Get batching for free • Convert between extents and bitmaps • Thresholds pick whichever format is more space-efficient • Buffer to avoid thrashing between formats
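
A hedged sketch of that conversion heuristic (the constants and the 10% hysteresis margin are illustrative, not the kernel's actual thresholds): compare the bytes consumed by extent keys against the bytes a bitmap would need, and only convert once one format is clearly smaller so a block group does not thrash back and forth:

	#include <stdbool.h>
	#include <stdint.h>

	#define EXTENT_KEY_BYTES	17	/* approximate cost of one extent key */
	#define BLOCK_SIZE		4096

	static bool should_use_bitmaps(uint64_t block_group_size,
				       uint64_t extent_count, bool using_bitmaps)
	{
		uint64_t extent_bytes = extent_count * EXTENT_KEY_BYTES;
		uint64_t bitmap_bytes = block_group_size / BLOCK_SIZE / 8;

		if (using_bitmaps)
			/* hysteresis: convert back to extents only once they
			 * are at least 10% cheaper than the bitmap */
			return extent_bytes * 10 >= bitmap_bytes * 9;
		/* convert to bitmaps once extents become the larger format */
		return extent_bytes > bitmap_bytes;
	}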

  37. Results

  38. Measuring Commit Overhead • Use fallocate and unlink to dirty a lot of block groups • Sparse filesystem image on a loop device to allow for exaggerated sizes (50 TB of dirtied space) • Measure time spent committing, esp. in the critical section
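
A reconstruction of that workload (illustrative and scaled down, not the original benchmark): fallocate large files so allocations touch many block groups, then unlink them so the frees dirty those same groups again within the transaction:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char name[64];

		/* 50 x 1 GB files span roughly 50 data block groups */
		for (int i = 0; i < 50; i++) {
			snprintf(name, sizeof(name), "dirty.%d", i);
			int fd = open(name, O_CREAT | O_WRONLY, 0600);
			if (fd < 0 || fallocate(fd, 0, 0, 1024 * 1024 * 1024) < 0) {
				perror(name);
				return 1;
			}
			close(fd);
			unlink(name);	/* freeing dirties the block group again */
		}
		/* the next commit must update free space for every dirtied group */
		return 0;
	}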

  39. Commit Overhead. Figure: Free space commit overhead: time spent in the critical section per transaction (ms), y-axis 0 to 45,000, comparing the free space cache, the free space tree, and no cache.

  40. Commit Overhead. Figure: the same plot zoomed to a y-axis of 0 to 0.8 ms, comparing the free space tree and no cache.

  41. Measuring Load Overhead • Create a fragmented filesystem with fs_mark • Unmount, remount • Run fs_mark again • Measure time spent loading space cache

  42. Load Overhead. Figure: Free space loading overhead: total time spent caching block groups (ms), y-axis 0 to 250, comparing the free space cache, the free space tree, and no cache.

  43. Try It Out (Since Linux v4.5) mount -o space_cache=v2

  44. Questions
