filesystems 3

last time
  - hard drives: seek time, rotational latency
  - block remapping by disk controller
  - solid state disks: wear leveling; move data to new block when written; work around limitations of large erasure blocks (that can’t be …


  1. xv6 filesystem performance issues
     - inode, block map stored far away from file data: long seek times for reading files
     - unintelligent choice of file/directory data blocks: xv6 finds first free block/inode; result: files/directory entries scattered about
     - blocks are pretty small: needs lots of space for metadata; could change size? but that wastes space for small files; large files have giant lists of blocks
     - linear searches of directory entries to resolve paths

  2. empirical file sizes: Roselli et al., “A Comparison of Filesystem Workloads”, in FAST 2000

  3. typical file sizes
     - most files are small: sometimes 50+% less than 1 KB, often 80-95% less than 10 KB
     - doesn’t mean large files are unimportant: they still take up most of the space and cause the biggest performance problems

  4. fragments
     - FFS: a file’s last block can be a fragment, i.e. only part of a block
     - each block split into approx. 4 fragments; each fragment has its own index
     - extra field in inode indicates that last block is a fragment
     - allows one block to store data for several small files

  5. non-FFS changes
     - now some techniques beyond FFS
     - some of these are supported by current filesystems, like Microsoft’s NTFS and Linux’s ext4 (successor to ext2)


  7. extents
     - large file? lists of many thousands of blocks is awkward …and requires multiple reads from disk to get
     - solution: store extents: (start disk block, size)
     - replaces or supplements block list
     - Linux’s ext4 and Windows’s NTFS both use this
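To make the (start disk block, size) idea concrete, here is a minimal sketch that collapses a per-block list into extents. The function name `blocks_to_extents` and the sample block numbers are made up for illustration; real filesystems store extents in their own on-disk formats.

```python
def blocks_to_extents(blocks):
    """Collapse an ordered list of disk block numbers into (start, size) extents."""
    extents = []
    for b in blocks:
        if extents and extents[-1][0] + extents[-1][1] == b:
            # b directly follows the last extent: grow it by one block
            start, size = extents[-1]
            extents[-1] = (start, size + 1)
        else:
            # not contiguous: start a new extent
            extents.append((b, 1))
    return extents


# hypothetical file: 4000 contiguous blocks, then 100 more elsewhere on disk
file_blocks = list(range(1000, 5000)) + list(range(9000, 9100))
print(blocks_to_extents(file_blocks))  # [(1000, 4000), (9000, 100)]
```

A list of 4100 block pointers becomes two extents, which is why extents help most when files are laid out contiguously.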

  8. allocating extents
     - challenge: finding contiguous sets of free blocks
     - FFS’s strategy “first in block group” doesn’t work well: first several blocks likely to be ‘holes’ from deleted files
     - NTFS: scan block map for “best fit”: look for big enough chunk of free blocks; choose smallest among all the candidates
     - don’t find any? okay: use more than one extent
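A rough sketch of the “best fit” scan described above, over an in-memory block map where `True` means used. The names `free_runs` and `allocate_best_fit` are hypothetical, and the fallback policy (take the largest free runs first when no single run fits) is an assumption, not something the slides specify.

```python
def free_runs(block_map):
    """Yield (start, length) for each maximal run of free (False) blocks."""
    start = None
    for i, used in enumerate(block_map):
        if not used and start is None:
            start = i
        elif used and start is not None:
            yield (start, i - start)
            start = None
    if start is not None:
        yield (start, len(block_map) - start)


def allocate_best_fit(block_map, needed):
    """Return a list of (start, size) extents covering `needed` blocks, or None."""
    runs = list(free_runs(block_map))
    fits = [r for r in runs if r[1] >= needed]
    if fits:
        # best fit: the smallest free run that is still big enough
        start, _ = min(fits, key=lambda r: r[1])
        return [(start, needed)]
    # assumed fallback: use more than one extent, largest runs first
    extents = []
    for start, length in sorted(runs, key=lambda r: r[1], reverse=True):
        take = min(length, needed)
        extents.append((start, take))
        needed -= take
        if needed == 0:
            return extents
    return None  # not enough free space
```

The sketch only chooses extents; a real allocator would also mark the chosen blocks as used in the map.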

  9. seeking with extents
     - challenge: finding byte X of the file
     - with block pointers: can compute index
     - with extents: need to scan list?
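One way to avoid a full scan on every lookup, sketched under the assumption that the extent list is held in memory in file order: precompute the first file-block index covered by each extent, then binary-search it. `locate_with_extents` and the 4 KB block size are illustrative choices.

```python
from bisect import bisect_right
from itertools import accumulate

BLOCK_SIZE = 4096  # assumed block size for this sketch


def locate_with_extents(extents, offset):
    """Map byte `offset` of a file to (disk block, offset within that block)."""
    file_block = offset // BLOCK_SIZE
    # first file-block index covered by each extent: 0, size0, size0+size1, ...
    firsts = [0] + list(accumulate(size for _, size in extents))
    if file_block >= firsts[-1]:
        raise ValueError("offset past end of file")
    i = bisect_right(firsts, file_block) - 1  # extent containing file_block
    start, _ = extents[i]
    return start + (file_block - firsts[i]), offset % BLOCK_SIZE


extents = [(1000, 4), (9000, 2)]  # hypothetical 6-block file
print(locate_with_extents(extents, 5 * BLOCK_SIZE + 10))  # (9001, 10)
```

With plain block pointers the same lookup is just `offset // BLOCK_SIZE` used as an array index, which is the contrast the slide draws.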

  10. exercise
      filesystem has:
      - root directory with 2 subdirectories
      - each subdirectory contains 3 512B files, 2 4MB files (1MB = 1024KB; 1KB = 1024B)
      - 32B directory entries
      - 4B block pointers
      - 4KB blocks
      - inode: 12 direct pointers, 1 indirect pointer, 1 double-indirect, 1 triple-indirect
      (a) how many inodes used?
      (b) how many blocks (outside of inodes) with 1KB fragments? [minimum w/partial blocks]
      (c) how many blocks (outside of inodes) with block pointers replaced by 8B extents (no fragments)? [compute minimum]

  11. filesystem reliability: a crash happens; what’s the state of my filesystem?

  12. hard disk atomicity
      - interrupt a hard drive write? write whole disk sector or corrupt it
      - hard drive stores checksum for each sector
      - write interrupted? checksum mismatch: hard drive returns read error

  13. reliability issues
      - is the filesystem in a consistent state? do we know what blocks are free? do we know what files exist?
      - is the data for files actually what was written?
      - also important topics, but won’t spend much time on these:
        - what data will I lose if storage fails? mirroring, erasure coding (e.g. RAID): using multiple storage devices; idea: if one storage device fails, other(s) still have data
        - what data will I lose if I make a mistake? filesystem can store multiple versions: “snapshots” of what was previously there

  14. several bad options (1)
      suppose we’re moving a file from one directory to another on xv6
      steps:
      - A: write new directory entry
      - B: overwrite (remove) old directory entry
      if we do A before B and crash happens after A: can have extra pointer to file. problem: if old directory entry removed later, will get confused and free the file!
      if we do B before A and crash happens after B: the file disappeared entirely!


  17. several bad options (2)
      suppose we’re creating a new file
      - A: mark blocks as used in free block map
      - B: write inode for file
      - C: write directory entry for file
      if we do A before B+C and crash happens after A: have blocks we can’t use (not free), but which are unused
      if we do B before A+C and crash happens after B: have inode we can’t use (not free), but which is not really used
      if we do C before A+B and crash happens after C: have directory entry that points to junk, which will behave weirdly


  21. beyond ordering
      - recall: updating a sector is atomic: happens entirely or doesn’t
      - can we make filesystem updates work this way?
      - yes: ‘just’ make updating one sector do the update


  23. concept: transaction
      - transaction: bunch of updates that happen all at once
      - implementation trick: one update means transaction “commits”
        - update done: whole transaction happened
        - update not done: whole transaction did not happen

  24. redo logging: file creation
      [figure: disk layout with super block, inode array, data, and log regions; log contents: BEGIN … data blk 74 = (file) … BEGIN, data blk 17 = (dir) …(new.txt, 53)…, data blk 34 = (file) …, inode #53 = … addr[0]=34 …, free map pt 2 = … 1 0 1 …, COMMIT]
      - write log entries with intended operations
      - write commit message to log
      - filesystem needs to ensure that committed updates will definitely happen! mechanism: check this log for commit messages and redo them (just in case)
      - later, start applying results to actual disk
      - when everything is written, can overwrite log …and start more transactions later

  32. redo logging: file creation
      normal operation, transaction steps:
      - write to log: data blocks to create; directory entry, inode to write; update directory inode (size, time)
      - write to log “commit transaction”
      - in any order: update file data blocks, update directory entry, update file inode, update directory inode
      - reclaim space in log (“garbage collection”)
      recovery (after system reboots/recovers), read log and…
      - ignore any operation with no “commit”
      - redo any operation with “commit” (already done? okay, setting inode twice)
      - reclaim space in log
      crash before commit? file not created; no partial operation to real data
      crash after commit? file created; promise: will perform logged updates
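The normal-operation and recovery steps can be sketched with a toy in-memory “disk”. Everything here (the `Disk` class, the record tuples, the `crash_before_commit` flag) is invented for illustration and is not how xv6 or any real journalling filesystem lays out its log.

```python
class Disk:
    """Toy disk: `blocks` is the real data, `log` holds redo-log records."""

    def __init__(self):
        self.blocks = {}
        self.log = []


def log_transaction(disk, writes, crash_before_commit=False):
    """Append the intended writes to the log, then the commit marker."""
    disk.log.append(("BEGIN",))
    for block_no, value in writes.items():
        disk.log.append(("WRITE", block_no, value))
    if crash_before_commit:
        return  # simulate losing power before the commit record is written
    disk.log.append(("COMMIT",))


def recover(disk):
    """Redo committed transactions, ignore uncommitted ones, reclaim the log."""
    pending = None
    for record in disk.log:
        if record[0] == "BEGIN":
            pending = []
        elif record[0] == "WRITE" and pending is not None:
            pending.append(record[1:])
        elif record[0] == "COMMIT" and pending is not None:
            for block_no, value in pending:
                disk.blocks[block_no] = value  # overwrite: safe to do twice
            pending = None
    disk.log.clear()


d = Disk()
log_transaction(d, {17: "dir entry (new.txt, 53)", 34: "file data"})
recover(d)
print(d.blocks)  # both writes applied

d2 = Disk()
log_transaction(d2, {17: "dir entry (new.txt, 53)"}, crash_before_commit=True)
recover(d2)
print(d2.blocks)  # {} : no partial operation reached the real data
```

Because each logged record is a plain overwrite of a block, running `recover` on the same log twice leaves the same result.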

  37. idempotency
      - logged operations should be okay to do twice = idempotent
      - good example: overwrite inode number X with new value (as long as last committed inode value in log is right…)
      - bad example: allocate new inode with particular contents
      - good example: overwrite data block with new value
      - bad example: append data to last used block of file
      - good example: set inode link count to 4
      - bad example: increment inode link count
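A small sketch of why “set link count to 4” is safe to replay while “increment link count” is not. The `replay` helper and the tuple log format are hypothetical.

```python
def replay(log, inode):
    """Apply each logged operation to `inode` (a dict) in order and return it."""
    for op, field, value in log:
        if op == "set":
            inode[field] = value
        elif op == "add":
            inode[field] += value
    return inode


set_log = [("set", "links", 4)]  # idempotent: overwrite with a fixed value
add_log = [("add", "links", 1)]  # not idempotent: result depends on prior state

once = replay(set_log, {"links": 3})
twice = replay(set_log, replay(set_log, {"links": 3}))
print(once == twice)  # True

once = replay(add_log, {"links": 3})
twice = replay(add_log, replay(add_log, {"links": 3}))
print(once == twice)  # False
```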

  38. redo logging summary
      - write intended operation to the log before ever touching ‘real’ data, in format that’s safe to do twice
      - write marker to commit to the log: if exists, the operation will be done eventually
      - actually update the real data

  39. redo logging and filesystems: filesystems that do redo logging are called journalling filesystems

  40. backup slides

  41. Linux ext2 inode
      struct ext2_inode {
          __le16 i_mode;        /* File mode */
          __le16 i_uid;         /* Low 16 bits of Owner Uid */
          __le32 i_size;        /* Size in bytes */
          __le32 i_atime;       /* Access time */
          __le32 i_ctime;       /* Creation time */
          __le32 i_mtime;       /* Modification time */
          __le32 i_dtime;       /* Deletion Time */
          __le16 i_gid;         /* Low 16 bits of Group Id */
          __le16 i_links_count; /* Links count */
          __le32 i_blocks;      /* Blocks count */
          __le32 i_flags;       /* File flags */
          ...
          __le32 i_block[EXT2_N_BLOCKS]; /* Pointers to blocks */
          ...
      };
      - i_mode: type (regular, directory, device) and permissions (read/write/execute for owner/group/others)
      - i_uid, i_gid: owner and group
      - i_atime, i_ctime, i_mtime, i_dtime: whole bunch of times
      - i_block: similar pointers like xv6 FS, but more indirection

  46. double/triple indirect
      [figure: i_block[0] through i_block[11] are the 12 direct pointers to data blocks; i_block[12] is the indirect pointer, i_block[13] the double-indirect pointer, and i_block[14] the triple-indirect pointer, each reaching data blocks through blocks of block pointers]

  52. ext2 indirect blocks
      - 12 direct block pointers
      - 1 indirect block pointer: pointer to block containing more direct block pointers
      - 1 double indirect block pointer: pointer to block containing more indirect block pointers
      - 1 triple indirect block pointer: pointer to block containing more double indirect block pointers
      exercise: if 1K blocks, 4 byte block pointers, how big can a file be?
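One way to work the exercise, counting only what the pointer structure can address and ignoring any other limits (such as the width of the size field). The function name is made up for this sketch.

```python
BLOCK_SIZE = 1024   # 1K blocks, from the exercise
POINTER_SIZE = 4    # 4-byte block pointers
DIRECT = 12         # direct pointers in the inode


def max_file_blocks(block_size=BLOCK_SIZE, pointer_size=POINTER_SIZE, direct=DIRECT):
    """Data blocks reachable via direct, indirect, double- and triple-indirect pointers."""
    per_block = block_size // pointer_size  # pointers that fit in one block (256)
    return direct + per_block + per_block**2 + per_block**3


blocks = max_file_blocks()
print(blocks)               # 16843020
print(blocks * BLOCK_SIZE)  # 17247252480 bytes, a little over 16 GiB
```

Almost all of the capacity comes from the triple-indirect term, 256³ blocks.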


  54. ext2 indirect blocks (2)
      - 12 direct block pointers
      - 1 indirect block pointer
      - 1 double indirect block pointer
      - 1 triple indirect block pointer
      exercise: if 1K (2^10 byte) blocks, 4 byte block pointers, how does OS find byte 2^15 of the file?
      (1) using indirect pointer or double-indirect pointer in inode?
      (2) what index of block pointer array pointed to by pointer in inode?
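The lookup in this exercise can be sketched as below, assuming the layout from the previous slide (12 direct pointers, then one indirect, one double-indirect, one triple-indirect). `locate` is a hypothetical helper that returns which inode pointer to start from and the index to use in each block of pointers along the way.

```python
def locate(offset, block_size=1024, pointer_size=4, direct=12):
    """Return (inode pointer kind, indices to follow in each pointer block)."""
    per_block = block_size // pointer_size
    n = offset // block_size  # which block of the file holds this byte
    if n < direct:
        return ("direct", [n])
    n -= direct
    if n < per_block:
        return ("indirect", [n])
    n -= per_block
    if n < per_block**2:
        return ("double-indirect", [n // per_block, n % per_block])
    n -= per_block**2
    if n < per_block**3:
        return ("triple-indirect",
                [n // per_block**2, (n // per_block) % per_block, n % per_block])
    raise ValueError("offset too large for this inode layout")


print(locate(2**15))  # ('indirect', [20])
```

Byte 2^15 is in file block 32; the first 12 blocks are direct, so it is entry 32 - 12 = 20 of the block the indirect pointer refers to.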

  55. exercise
      say xv6 filesystem with:
      - 64-byte inodes (12 direct + 1 indirect pointer)
      - 16-byte directory entries
      - 512 byte blocks
      - 2-byte block pointers
      how many blocks (not storing inodes) are used to store a directory of 200 30464B (29 · 1024 + 256 byte) files?
      remember: blocks could include blocks storing data or block pointers or directory entries
      how many blocks are used to store a directory of 2000 3KB files?

  56. recall: FAT: file creation (1)
      [figure: the disk as clusters numbered 0 through 35, next to the file allocation table (index → entry value); each entry is 0 (free), -1 (end mark), or the number of the next cluster]

  57. recall: FAT: file creation (2)
      [figure: the disk as clusters numbered 0 through 35, the file allocation table (index → entry value; 0 = free, -1 = end mark), and the directory of the new file]
      directory of new file:
      - “foo.txt”, cluster 11, size …, created …
      - “quux.txt”, cluster 104, size …, created …
      - “new.txt”, cluster 21, size …, created …
      - unused entry
      - unused entry
      - unused entry

  58. exercise: FAT file creation
      [figure: the disk as clusters numbered 0 through 35, with the 6 clusters to write labeled 1 through 6: FAT entries for directory + file, new directory cluster, new file clusters]
      - 6 clusters to write
      - on loss of power: only some completed
      - exercise: what happens if only 1, 2 complete? everything but 3?

  60. exercise: FAT ordering
      (creating a file that needs new cluster of direntries)
      1. FAT entry for extra directory cluster
      2. FAT entry for new file clusters
      3. file clusters
      4. file’s directory entry (in new directory cluster)
      what ordering is best if a crash happens in the middle?
      A. 1, 2, 3, 4
      B. 4, 3, 1, 2
      C. 1, 3, 4, 2
      D. 3, 4, 2, 1
      E. 3, 1, 4, 2

  61. exercise: xv6 FS ordering
      (creating a file that needs new block of direntries)
      ignoring journalling for now; we’ll talk about it later
      1. free block map for new directory block
      2. free block map for new file block
      3. directory inode
      4. new file inode
      5. new directory entry for file (in new directory block)
      6. file data blocks
      what ordering is best if a crash happens in the middle?
      A. 1, 2, 3, 4, 5, 6
      B. 6, 5, 4, 3, 2, 1
      C. 1, 2, 6, 5, 4, 3
      D. 2, 6, 4, 1, 5, 3
      E. 3, 4, 1, 2, 5, 6

  62. inode-based FS: careful ordering
      - mark blocks as allocated before referring to them from directories
      - write data blocks before writing pointers to them from inodes
      - write inodes before directory entries pointing to them
      - remove inode from directory before marking inode as free (or decreasing link count, if there’s another hard link)
      - idea: better to waste space than point to bad data

  63. recovery with careful ordering
      - avoiding data loss → can ‘fix’ inconsistencies
      - programs like fsck (filesystem check), chkdsk (check disk)
      - run manually or periodically or after abnormal shutdown

  64. inode-based FS: creating a file
      normal operation:
      - allocate data block
      - write data block
      - update free block map
      - update file inode
      - update directory entry (filename + inode number)
      - update directory inode (modification time)
      general rule: better to waste space than point to bad data; mark blocks/inodes used before writing block/inode pointers
      recovery (fsck):
      - read all directory entries
      - scan all inodes
      - free unused inodes (unused = not in directory)
      - free unused data blocks (unused = not in inode lists)
      - scan directories for missing update/access times

  67. inode-based FS: exercise: unlink
      what order to remove a hard link (= directory entry) for file? assume not the last hard link
      1. overwrite directory entry for file
      2. decrement link count in inode (but link count still > 1 so don’t remove)
      what does recovery operation do?


  69. inode-based FS: exercise: unlink last
      what order to remove a hard link (= directory entry) for file? assume it is the last hard link
      1. overwrite last directory entry for file
      2. mark inode as free (link count = 0 now)
      3. mark inode’s data blocks as free
      what does recovery operation do?


  71. fsck
      - Unix typically has an fsck utility; Windows equivalent: chkdsk
      - checks for filesystem consistency:
        - is a data block marked as used that no inode uses?
        - is a data block referred to by two different inodes?
        - is an inode marked as used that no directory references?
        - is the link count for each inode = number of directories referencing it?
        - …
      - assuming careful ordering, can fix errors after a crash without loss
      - maybe can fix other errors, too
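The four checks listed above can be sketched over a toy in-memory model. The data shapes (`inodes`, `directories`, `used_blocks`) and the `root` parameter are assumptions made for this example; a real fsck reads these structures from disk and also handles cases this sketch ignores (such as `.` and `..` entries).

```python
from collections import Counter


def fsck(inodes, directories, used_blocks, root=1):
    """Return a list of inconsistencies found.

    inodes:      {ino: {"links": int, "blocks": [block numbers]}} for inodes marked used
    directories: {directory ino: [ino named by each directory entry]}
    used_blocks: set of block numbers marked used in the free block map
    """
    problems = []
    owners = Counter()  # block number -> how many inodes point at it
    for data in inodes.values():
        owners.update(set(data["blocks"]))
    for b in sorted(used_blocks - set(owners)):
        problems.append(f"block {b} marked used but no inode uses it")
    for b in sorted(b for b, n in owners.items() if n > 1):
        problems.append(f"block {b} referred to by {owners[b]} inodes")
    refs = Counter()  # ino -> number of directory entries naming it
    for entries in directories.values():
        refs.update(entries)
    for ino, data in sorted(inodes.items()):
        if ino == root:
            continue  # the root directory is not named by any entry in this toy model
        if refs[ino] == 0:
            problems.append(f"inode {ino} marked used but no directory references it")
        elif data["links"] != refs[ino]:
            problems.append(f"inode {ino} link count {data['links']} but {refs[ino]} references")
    return problems


inodes = {
    1: {"links": 1, "blocks": [10]},
    2: {"links": 1, "blocks": [11, 12]},
    3: {"links": 2, "blocks": [12]},
    4: {"links": 1, "blocks": []},
}
for p in fsck(inodes, {1: [2, 3]}, {10, 11, 12, 13}):
    print(p)
```

Building `owners` requires reading every used inode and every indirect block, which is where the cost of a full check comes from.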

  72. fsck costs
      - my desktop’s filesystem: 2.4M used inodes; 379.9M of 472.4M used blocks
      - recall: check for data block marked as used that no inode uses:
        - read blocks containing all of the 2.4M used inodes
        - add each block pointer to a list of used blocks
        - if they have indirect block pointers, read those blocks, too
        - get list of all used blocks (via direct or indirect pointers)
        - compare list of used blocks to actual free block bitmap
      - pretty expensive and slow

  73. running fsck automatically
      - common to have “clean” bit in superblock
      - last thing written (to set) on shutdown
      - first thing written (to clear) on startup
      - on boot: if clean bit clear, run fsck first

  74. ordering and disk performance
      - recall: seek times
      - would like to order writes based on locations on disk: write many things in one pass of disk head; write many things in cylinder in one rotation
      - ordering constraints make this hard: free block map for file (start), then file blocks (middle), then… file inode (start), then directory (middle), …


  76. beyond mirroring
      - mirroring seems to waste a lot of space: 10 disks of data? mirroring → 20 disks
      - 10 disks of data? how well can we do with 15 disks?
      - best possible: lose 5 disks, still okay; can’t do better or it wasn’t really 10 disks of data
      - schemes that do this are based on erasure codes
      - erasure code: encode data in a way that handles parts missing (being erased)

  77. erasure code example
      - store 2 disks of data on 3 disks
      - recompute original 2 disks of data from any 2 of the 3 disks
      - extra disk of data: some formula based on the original disks; common choice: bitwise XOR
      - common set of schemes like this: RAID (Redundant Array of Independent Disks)
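A minimal sketch of the XOR scheme with three “disks” modeled as equal-length byte strings. The disk contents and the `xor_bytes` helper are made up for illustration.

```python
def xor_bytes(a, b):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


disk0 = b"filesys"
disk1 = b"journal"
parity = xor_bytes(disk0, disk1)  # the extra disk

# any one disk can be lost and recomputed from the other two
print(xor_bytes(disk1, parity) == disk0)  # True: recovered disk0
print(xor_bytes(disk0, parity) == disk1)  # True: recovered disk1
print(xor_bytes(disk0, disk1) == parity)  # True: recovered parity
```

This works because XOR is its own inverse: `a ^ (a ^ b) == b`.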

  78. snapshots
      - filesystem snapshots idea: filesystem keeps old versions of files around
      - accidental deletion? old version still there
      - eventually discard some old versions
      - can access snapshot of files at prior time
      - mechanism: copy-on-write: changing file makes new copy of filesystem; common parts shared between versions
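As a loose analogy for copy-on-write sharing, the sketch below models a filesystem version as a mapping from names to file contents. A write builds a new top-level mapping while unchanged files stay shared with the old version. `write_file` and the sample files are hypothetical; a real filesystem shares on-disk blocks, not Python objects.

```python
def write_file(root, name, data):
    """Return a new version; unchanged files are shared with the old version."""
    new_root = dict(root)  # copies only the name -> contents references
    new_root[name] = data
    return new_root


v1 = {"a.txt": b"hello", "b.txt": b"big unchanged file"}
v2 = write_file(v1, "a.txt", b"HELLO")

print(v1["a.txt"], v2["a.txt"])    # b'hello' b'HELLO': old version still there
print(v1["b.txt"] is v2["b.txt"])  # True: shared between versions, not copied
```

Keeping `v1` around is the snapshot; discarding old versions amounts to dropping the last reference to them.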


  80. inode and copy-on-write
      [figure: old inode and new inode, each pointing through indirect blocks to file data]
      - inode update: new inode + new indirect blocks + new data blocks
      - unchanged parts of file shared
      - both old+new inode valid
      - challenge: FFS/xv6/ext2 design has big array of inodes; don’t want to write new copy of entire inode array

  84. extra indirection for inode array
      [figure: old and new root inodes pointing through indirect blocks to arrays of inodes]
      - arrays of inodes split into pieces
      - root inode + pointers (indirect blocks) lead to the pieces
      - update one inode? create new root inode; unchanged parts of inode array shared between versions
      - multiple snapshots? array of root inodes
