Advanced File Systems, ZFS


  1. Advanced File Systems, ZFS
      - http://d3s.mff.cuni.cz/aosy
      - http://d3s.mff.cuni.cz
      - Jan Šenolt, Jan.Senolt@Oracle.COM

  2. Traditional UNIX File System, ext2
      - Disk layout (figure): BB, Block Grp 0, Block Grp 1, ..., Block Grp N
      - Block group: Super Block, Block Bitmap, Inode Bitmap, Array of Inodes, Data Blocks
      - Inode: e2di_mode, e2di_uid, ..., e2di_blocks[0], e2di_blocks[1], e2di_blocks[2], ..., e2di_blocks[14]

      Jan Šenolt, Advanced Operating Systems, April 11th 2019

  3. Crash consistency problem
      - Appending a new block to the file involves at least 3 I/Os to different data structures at different locations:
          - Block bitmap: mark the block as allocated
          - Inode: update e2di_blocks[], e2di_size, ...
          - Block: write the actual payload
      - These cannot be performed atomically. What happens if we fail to make some of these changes persistent?
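
The question on this slide can be explored by enumerating every subset of the three writes that might have reached the disk before a crash. The following is a minimal sketch, not a model of real ext2 behaviour; the names `UPDATES`, `classify` and `all_outcomes` are hypothetical and the outcome descriptions are a simplified reading of the slide.

```python
from itertools import combinations

# The three independent writes named on the slide (labels are illustrative).
UPDATES = ("bitmap", "inode", "data")


def classify(persisted):
    """Describe the on-disk state if only the `persisted` writes survived a crash."""
    done = set(persisted)
    if done == set(UPDATES):
        return "consistent: append completed"
    if "inode" in done and "bitmap" not in done:
        return "inconsistent: inode references a block marked free"
    if "bitmap" in done and "inode" not in done:
        return "inconsistent: block allocated but unreferenced (leak)"
    if "inode" in done and "bitmap" in done:
        # Both metadata writes landed, only the payload is missing.
        return "metadata consistent, but file contains garbage"
    # Nothing, or only the payload, reached the disk.
    return "consistent: append lost"


def all_outcomes():
    """Map every subset of the three writes to its outcome."""
    return {
        subset: classify(subset)
        for n in range(len(UPDATES) + 1)
        for subset in combinations(UPDATES, n)
    }


for subset, outcome in all_outcomes().items():
    print(subset, "->", outcome)
```

Only two of the eight subsets leave the file system fully intact with the append applied or cleanly lost in metadata terms; the rest need detection or prevention, which is what the following slides address.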

  4. fsck
      - Lazy approach: try to detect the inconsistency and fix it
      - Does not scale well
          - Can take a very long time for a large file system
      - Checks metadata only, unable to detect some types of inconsistencies
          - For example: the bitmap and the inode were updated, but the system crashed before writing the data block content
      - … but we still need fsck to detect other issues
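
A metadata-only check of the kind described above can be sketched as a cross-check between the block bitmap and the block lists of all inodes. This is an illustrative toy, not how any real fsck is structured; `check`, its arguments and the message texts are assumptions made for the example.

```python
def check(bitmap, inodes):
    """Cross-check allocation metadata.

    bitmap: set of block numbers marked as allocated
    inodes: dict mapping inode number -> list of block numbers it references
    Returns a list of human-readable problems (empty if metadata is consistent).
    """
    owner = {}      # block number -> first inode seen referencing it
    problems = []
    for ino, blocks in inodes.items():
        for blk in blocks:
            if blk in owner:
                problems.append(f"block {blk} referenced by inodes {owner[blk]} and {ino}")
            else:
                owner[blk] = ino
            if blk not in bitmap:
                problems.append(f"block {blk} used by inode {ino} but marked free")
    for blk in sorted(bitmap - set(owner)):
        problems.append(f"block {blk} allocated but unreferenced")
    return problems
```

Note that the case from the slide (bitmap and inode updated, payload never written) produces an empty problem list here, because the check never looks at data block contents.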

  5. Journaling
      - Write all changes to the journal first, make sure that all writes have completed, and then make the actual in-place updates
      - The journal can be a file within the fs or a dedicated disk
      - Journal (figure): TB1, Inode, BlkBmp, DBlk, TE1, TB2, Inode
      - Journal replay: traverse the log, find all complete records and apply them
      - Physical journaling
          - Writes the actual new content of blocks
          - Requires more space but is simple to replay
      - Logical journaling
          - Description of what needs to be done
          - Must be idempotent
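
Journal replay for a physical journal can be sketched as a single pass over the log that applies only transactions with both a begin (TB) and a matching end (TE) record. The record format (`("TB", id)`, `("W", block, content)`, `("TE", id)`) and the function name `replay` are assumptions made for this example, loosely following the figure labels.

```python
def replay(journal, disk):
    """Apply every complete transaction found in `journal` to `disk` (a dict)."""
    txid, writes = None, []
    for record in journal:
        if record[0] == "TB":              # transaction begin
            txid, writes = record[1], []
        elif record[0] == "TE":            # transaction end: commit if it matches
            if record[1] == txid:
                for block, content in writes:
                    disk[block] = content
            txid, writes = None, []
        elif txid is not None:             # ("W", block, new_content)
            writes.append((record[1], record[2]))
    return disk


journal = [
    ("TB", 1), ("W", "inode", "inode v1"), ("W", "blkbmp", "bitmap v1"),
    ("W", "dblk", "payload"), ("TE", 1),
    ("TB", 2), ("W", "inode", "inode v2"),   # crash happened before TE 2
]
disk = replay(journal, {})
```

Because the records carry full block contents, replaying the same journal a second time leaves the disk unchanged, which is why physical journaling is simple to replay.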

  6. Journaling (2)
      - Journal aggregation
          - Do multiple updates in memory, log them together in one transaction
          - Efficient when updating the same data multiple times
      - (Ordered) metadata-only journal
          - Log only metadata → smaller write overhead
          - Write the data block first, then create the transaction for metadata
          - Metadata block reuse issue

  7. Log structured FS
      - Copy-on-Write
      - Fast crash recovery
      - Long sequential I/O instead of many small I/Os
          - Better I/O bandwidth utilization
      - Aggressive caching
          - Most I/Os are actually writes
      - No block/inode bitmaps
      - But the disk has a finite size
          - Needs a garbage collector

  8. Log structured FS (2)
      - Disk layout (figure): CR, Seg1, Seg2, SegN, CR
      - Inode Map: inode# to block mapping; stored with other data but pointed to from the Checkpoint Regions
      - Seg. Summary: #blks, Inode no, Generation, Offset, Next SS
      - Data UID: <inode# : gen>, (unused)

  9. Log structured FS (3)
      - Segment cleaner (garbage collector)
      - Creates empty segments by compacting fragmented ones:
          1) Read whole segment(s) into memory
          2) Determine live data and copy them to other segment(s)
          3) Mark the original segment as empty
      - Live data = still pointed to by its inode
      - Increment the inode version number when the file is deleted
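
The three cleaner steps and the liveness test can be sketched with in-memory lists standing in for segments. Everything about the data layout here is an assumption for illustration: each block carries a summary tuple `(inode_no, generation, offset)`, and each inode is a dict with a `gen` counter and a `blocks` map from file offset to `(segment, slot)`.

```python
def is_live(inodes, seg, slot, summary):
    """A block is live if its inode still points at exactly this location."""
    ino, gen, off = summary
    inode = inodes.get(ino)
    if inode is None or inode["gen"] != gen:
        return False   # file was deleted (version bumped): block is dead
    return inode["blocks"].get(off) == (seg, slot)


def clean(segments, inodes, victims, target):
    """Copy live blocks from `victims` into segment `target`, then empty the victims."""
    for seg in victims:
        for slot, (summary, data) in enumerate(segments[seg]):   # 1) read segment
            if is_live(inodes, seg, slot, summary):              # 2) copy live data
                segments[target].append((summary, data))
                ino, _gen, off = summary
                inodes[ino]["blocks"][off] = (target, len(segments[target]) - 1)
        segments[seg] = []                                       # 3) mark as empty


segments = [
    [((1, 0, 0), "a"), ((1, 0, 1), "old"), ((2, 0, 0), "x")],   # fragmented segment
    [((1, 0, 1), "new")],                                       # overwrite of offset 1
    [],                                                         # empty target
]
inodes = {
    1: {"gen": 0, "blocks": {0: (0, 0), 1: (1, 0)}},
    2: {"gen": 1, "blocks": {}},   # deleted file: generation was incremented
}
clean(segments, inodes, victims=[0], target=2)
```

The generation check lets the cleaner discard blocks of deleted files from the segment summary alone, without walking the inode's block pointers.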

  10. Soft Updates
      - Enforce rules for data updates:
          - Never point to an uninitialized structure
          - Never reuse a block which is still referenced
          - Never remove an existing reference until the new one exists
      - Keep blocks in memory, maintain their dependencies and write them asynchronously
      - Cyclic dependencies
          - Create a file in a directory
          - Remove a different file in the same dir (both files' inodes are in the same block)
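
The cyclic dependency from this slide can be illustrated by treating write ordering as a topological sort. This sketch uses Python's standard `graphlib` module; the node names and the helper `write_order` are hypothetical, and the dependency edges are a simplified reading of the three rules above.

```python
from graphlib import CycleError, TopologicalSorter


def write_order(deps):
    """deps maps each item to the set of items that must reach the disk before it.

    Returns a valid write order, or None if the dependencies are cyclic.
    """
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None


# Tracked per update, the two operations are independent:
#  - create f: the inode must be initialized before the directory entry points to it
#  - remove g: the directory entry must be gone before the inode is cleared
update_deps = {
    "dir: add entry f": {"inode f: initialize"},
    "inode g: clear": {"dir: remove entry g"},
}

# Tracked per block, both inodes share one block and both entries share one
# directory block, so each block must be written before the other.
block_deps = {
    "dir block": {"inode block"},
    "inode block": {"dir block"},
}

print(write_order(update_deps))
print(write_order(block_deps))
```

At update granularity an order exists; at block granularity the sorter reports a cycle, which is the situation the slide describes.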

  11. Soft Updates (2): pro and con
      - Can mount the FS immediately after a crash
          - The worst-case scenario is a resource leak
          - Run fsck later or in the background
          - Needs a snapshot
      - Hard to implement properly
      - Delayed unref breaks POSIX
      - fsync(2) and umount(2) slowness

  12. ZFS vs traditional file systems
      - New administrative model
          - 2 commands: zpool(1M) and zfs(1M)
      - Pooled storage
          - Eliminates the notion of volumes and slices (partitions)
          - Dynamic inode allocation
      - Data protection
          - Transactional object system: always consistent on disk, no fsck(1M)
          - Detects and corrects data corruption
      - Integrated RAID
          - Stripes, mirror, RAID-Z
      - Advanced features
          - Snapshots, writable snapshots, transparent compression & encryption, replication, integrated NFS & CIFS sharing, deduplication, ...

  13. ZFS in Solaris
      - Architecture (figure), from userland down to the devices:
          - apps & libs, libzfs, libzpool
          - syscalls into the kernel; ioctl(2) on /dev/zfs (devfs)
          - VFS; VFS/vnode: zfs_mount(), zfs_putpage(), zfs_inactive(), ...
          - ZPL, zvol
          - DSL
          - DMU + ARC
          - SPA + ZIO
          - ldi_strategy(9F), DDI

  14. Pooled Storage Layer, SPA
      - ZFS pool
          - Collection of blocks allocated within a vdev hierarchy
              - top-level vdevs
              - physical vs. logical vdevs
              - leaf vdevs
              - special vdevs: l2arc, log, meta
          - Blocks addressed via "block pointers": blkptr_t
      - ZIO
          - Pipelined parallel IO subsystem
          - Performs aggregation, compression, checksumming, ...
          - Calculates and verifies checksums
              - self-healing
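
The idea behind checksum verification with self-healing can be sketched for a two-way mirror: read a copy, verify it against the expected checksum, and overwrite any bad copy with a good one. This is a toy under stated assumptions, not the ZIO pipeline; `read_mirror` is a hypothetical helper, the mirror is a plain list of byte strings, and SHA-256 stands in for whichever checksum the pool uses.

```python
import hashlib


def checksum(data):
    """Stand-in checksum function (assumption: SHA-256)."""
    return hashlib.sha256(data).digest()


def read_mirror(copies, expected):
    """Return verified data from a mirror and repair corrupted copies in place.

    copies: list with one bytes object per mirror side
    expected: checksum stored outside the block itself (e.g. in its parent)
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("all copies failed checksum verification")
    for i, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[i] = good          # self-healing: rewrite the bad side
    return good


mirror = [b"bit-rotted", b"payload"]
data = read_mirror(mirror, checksum(b"payload"))
```

Keeping the checksum apart from the data it covers is what allows the reader to tell which side of the mirror is wrong, rather than merely that the sides differ.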

  15. Pooled Storage Layer, SPA (2)

          # zpool status mypool
            pool: mypool
           state: ONLINE
            scan: none requested
          config:

                  NAME        STATE     READ WRITE CKSUM
                  mypool      ONLINE       0     0     0
                    mirror-0  ONLINE       0     0     0
                      /A      ONLINE       0     0     0
                      /B      ONLINE       0     0     0
                    /C        ONLINE       0     0     0

          errors: No known data errors

      - Figure annotations: mypool is the root vdev; mirror-0 and /C are the top-level vdevs; mirror-0 consists of /A and /B

  16. Pooled Storage Layer, blkptr_t
      - Layout (figure): sixteen 64-bit words, bit positions 64 ... 0
          - 0: VDEV 1, ASIZE
          - 1: G, OFFSET 1
          - 2: VDEV 2, ASIZE
          - 3: G, OFFSET 2
          - 4: VDEV 3, ASIZE
          - 5: G, OFFSET 3
          - 6: BDE, LVL, TYPE, CKSUM, COMP, PSIZE, LSIZE
          - 7: PADDING
          - 8: PADDING
          - 9: PHYSICAL BIRTH TXG
          - A: BIRTH TXG
          - B: FILL COUNT
          - C: CHECKSUM[0]
          - D: CHECKSUM[1]
          - E: CHECKSUM[2]
          - F: CHECKSUM[3]
      - Legend
          - DVA: disk virtual address
          - VDEV: top-level vdev number
          - ASIZE: allocated size
          - LSIZE: logical size, without compression, RAID-Z or gang overhead
          - PSIZE: compressed size
          - LVL: block level; 0 = data block, >0 = indirect block
          - BDE: little vs big-endian, dedup, encryption
          - FILL: number of blkptrs in block
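
Packing a DVA into two 64-bit words can be sketched with plain shifts and masks. The field widths and positions below are assumptions chosen for illustration (a 24-bit vdev number at bit 32, a 24-bit ASIZE in the low bits, a gang flag in the top bit of the second word, sizes and offsets stored in 512-byte units); the slide does not give exact widths, and the actual Solaris layout may differ.

```python
SECTOR_SHIFT = 9   # assumption: sizes and offsets are stored in 512-byte units


def bits(word, low, length):
    """Extract `length` bits of `word`, starting at bit position `low`."""
    return (word >> low) & ((1 << length) - 1)


def encode_dva(vdev, asize, offset, gang=False):
    """Pack one DVA into (word0, word1) using the assumed field layout."""
    word0 = (vdev << 32) | (asize >> SECTOR_SHIFT)
    word1 = (int(gang) << 63) | (offset >> SECTOR_SHIFT)
    return word0, word1


def decode_dva(word0, word1):
    """Inverse of encode_dva: recover the fields from the two words."""
    return {
        "vdev": bits(word0, 32, 24),
        "asize": bits(word0, 0, 24) << SECTOR_SHIFT,
        "gang": bool(bits(word1, 63, 1)),
        "offset": bits(word1, 0, 63) << SECTOR_SHIFT,
    }


w0, w1 = encode_dva(vdev=3, asize=4096, offset=1 << 20, gang=True)
print(decode_dva(w0, w1))
```

A block pointer holds up to three such DVAs (words 0 to 5 in the figure), so the same block can be stored at three independent locations.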

  17. Data Management Unit, DMU
      - dbuf (dmu_buf_t)
          - in-core block, stored in ARC
          - size 512 B to 1 MB
      - object (dnode_t, dnode_phys_t)
          - array of dbufs
          - types: DMU_OT_PLAIN_FILE_CONTENTS, DMU_OT_DNODE, …
          - dn_dbufs: list of dbufs
          - dn_dirty_records: list of modified dbufs
      - objset (objset_t, objset_phys_t)
          - set of objects
          - os_dirty_dnodes: list of modified dnodes

  18. Adaptive Replacement Cache, ARC
      - Megiddo, Modha: ARC: A Self-Tuning, Low Overhead Replacement Cache [1]
      - Lists (figure): MRU, MFU, MRU ghost, MFU ghost; sizes c and p
      - p: increase if found in MRU-Ghost, decrease if found in MFU-Ghost
      - c: increase to fill unused memory, arc_shrink()
      - Evict list: list of unreferenced dbufs
          - can be moved to L2ARC: l2arc_feed_thread()
      - Hash table
          - hash(SPA, DVA, TXG)
          - arc_hash_find(), arc_hash_insert()
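
The adaptation of p on ghost-list hits can be sketched as a small pure function. This follows the general rule from the cited ARC paper in simplified form (integer arithmetic, with a step of at least one); the function name `adapt` and its parameters are hypothetical and do not correspond to identifiers in the ZFS sources.

```python
def adapt(p, c, mru_ghost_len, mfu_ghost_len, hit):
    """Return the new MRU target size p after a ghost-list hit.

    p: current target size of the MRU list
    c: total cache size (upper bound for p)
    hit: "mru_ghost" or "mfu_ghost"; anything else leaves p unchanged
    """
    if hit == "mru_ghost":
        # Recently evicted MRU data was wanted again: favour recency.
        step = max(1, mfu_ghost_len // max(mru_ghost_len, 1))
        return min(c, p + step)
    if hit == "mfu_ghost":
        # Recently evicted MFU data was wanted again: favour frequency.
        step = max(1, mru_ghost_len // max(mfu_ghost_len, 1))
        return max(0, p - step)
    return p
```

The step grows when the ghost list that was hit is the shorter one, so a hit in a small ghost list moves the balance faster than a hit in a large one.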

  19. Dataset and Snapshot Layer, DSL
      - dsl_dir_t, dsl_dataset_t
      - Adds names to objsets, creates parent-child relations
      - Implements snapshots and clones
      - Maintains properties
      - DSL dead list
          - set of blkptrs which were referenced in the previous snapshot, but not in this dataset
          - when a block is no longer referenced:
              - free it if it was born after the most recent snapshot
              - otherwise put it on the dataset's dead list
      - DSL scan
          - traverse the pool, corrupted data triggers self-healing
          - scrub: scan all txgs, vs. resilver: scan only the txgs when the vdev was missing
      - ZFS stream
          - serialized dataset(s)
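
The dead list rule above comes down to one comparison of transaction group numbers. A minimal sketch, assuming a block is represented as a `(name, birth_txg)` tuple and the dataset as a dict with hypothetical keys `prev_snap_txg`, `deadlist` and `freed`:

```python
def kill_block(block, dataset):
    """Handle a block that the live dataset no longer references."""
    _name, birth_txg = block
    if birth_txg > dataset["prev_snap_txg"]:
        # Born after the most recent snapshot: no snapshot can reference it.
        dataset["freed"].append(block)
    else:
        # The previous snapshot still references it: remember it for later.
        dataset["deadlist"].append(block)


dataset = {"prev_snap_txg": 100, "deadlist": [], "freed": []}
kill_block(("old block", 90), dataset)    # existed when the snapshot was taken
kill_block(("new block", 120), dataset)   # written after the snapshot
```

Since the birth txg is stored in every block pointer, the decision needs no reference counts and no scan of the snapshot.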

  20. ZFS POSIX Layer, ZPL & ZFS Volume
      - ZPL
          - creates a POSIX-like file system within a dataset
          - znode_t, zfsvfs_t
          - System Attributes
              - portion of znode with variable layouts to accommodate type-specific attributes
      - ZVOL
          - block devices in /dev/zvol
          - SCSI targets (via COMSTAR)
          - direct access to DMU & ARC, RDMA
