Storage
Jeff Chase
Duke University

Storage: The Big Issues
1. Disks are rotational media with mechanical arms.
  • High access cost ⇒ caching and prefetching.
  • Cost depends on previous access ⇒ careful block placement and scheduling.
2. Stored data is hard state.
  • Stored data persists after a restart.
  • Data corruption and poor allocations also persist.
  • ⇒ Allocate for longevity, and write carefully.
3. Disks fail.
  • Plan for failure ⇒ redundancy and replication.
  • RAID: integrate redundancy with striping across multiple disks for higher throughput.

Rotational Media
[Figure: disk anatomy with labels Track, Sector, Arm, Cylinder, Platter, Head]
Access time = seek time + rotational delay + transfer time
  seek time = 5-15 milliseconds to move the disk arm and settle on a cylinder
  rotational delay = about 8 milliseconds (60 s / 7200) for a full rotation at 7200 RPM; average delay = about 4 ms
  transfer time = about 1 millisecond for an 8KB block at 8 MB/s
Bandwidth utilization is less than 50% for any noncontiguous access at a block grain. With the numbers above, a random 8KB access spends about 1 ms transferring data out of roughly 10-20 ms total, i.e. 5-10% utilization.

RAID
[Figure: RAID disk arrays]
• RAID levels 3 through 5
• Backup and parity drives are shaded

Disks and Drivers
• Disk hardware and driver software provide foundational support for block devices.
• OS views the block devices as a collection of volumes.
  – A logical volume may be a partition of a single disk or a concatenation of multiple physical disks (e.g., RAID).
    • volume == LUN
• Each volume is an array of fixed-size sectors.
  – Name sector/block by (volumeID, sector ID).
  – Read/write operations DMA data to/from physical memory.
• Device interrupts OS on I/O completion.
  – ISR wakes process, updates internal records, etc.

A Typical Unix File Tree
Each volume is a set of directories and files; a host's file tree is the set of directories and files visible to processes on a given host.
File trees are built by grafting volumes from different devices or from network servers.
In Unix, the graft operation is the privileged mount system call, and each volume is a filesystem.
[Figure: example file tree rooted at / with entries bin, etc, tmp, usr, vmunix; ls, sh; project, users; packages (mount point, volume root); tex, emacs]
mount (coveredDir, volume)
  coveredDir: directory pathname
  volume: device specifier or network volume
  volume root contents become visible at pathname coveredDir
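The effect of mount (coveredDir, volume) on name lookup can be sketched with a toy mount table. This is only an illustration under stated assumptions, not how any real kernel implements it: the names `mounts`, `mount`, and `resolve`, the volume names `rootvol` and `pkgvol`, and the path `/usr/project/packages` are all hypothetical, and paths are assumed to be absolute and already normalized.

```python
# Toy mount table: covered directory -> volume name (all names are hypothetical).
mounts = {"/": "rootvol"}


def mount(covered_dir, volume):
    """Graft `volume` so its root becomes visible at `covered_dir`."""
    mounts[covered_dir.rstrip("/") or "/"] = volume


def resolve(path):
    """Map an absolute path to (volume, path within that volume).

    Picks the longest mount point that is a whole-component prefix of `path`.
    """
    best = "/"
    for point in mounts:
        if point == "/":
            continue
        if path == point or path.startswith(point + "/"):
            if len(point) > len(best):
                best = point
    rest = path if best == "/" else path[len(best):]
    return mounts[best], rest or "/"


mount("/usr/project/packages", "pkgvol")
print(resolve("/usr/project/packages/tex"))  # ('pkgvol', '/tex')
print(resolve("/bin/ls"))                    # ('rootvol', '/bin/ls')
```

A real kernel performs this crossing during component-by-component pathname traversal rather than by string prefix matching; the prefix match is used here only to keep the sketch short.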
Filesystems
• Files
  – Sequentially numbered bytes or logical blocks.
  – Metadata stored in on-disk data object
    • e.g., Unix "inode"
• Directories
  – A special kind of file with a set of name mappings.
    • E.g., name to inode
  – Pointer to parent in rooted hierarchy: .., /
• System calls
  – Unix: open, close, read, write, stat, seek, sync, link, unlink, symlink, chdir, chroot, mount, chmod, chown.

File Systems: The Big Issues
• Buffering disk data for access from the processor.
  – Block I/O (DMA) needs aligned physical buffers.
  – Block update is a read-modify-write.
• Creating/representing/destroying independent files.
• Allocating disk blocks and scheduling disk operations to deliver the best performance for the I/O stream.
  – What are the patterns in the request stream?
• Multiple levels of name translation.
  – Pathname → inode, logical → physical block
• Reliability and the handling of updates.

Representing a File On Disk
[Figure: an inode holding file attributes and a block map; the block map points to logical blocks 0, 1, and 2, which hold the text "once upon a time /n in a land far far away, /n lived the wise and sage wizard."]
file attributes: may include owner, access control list, time of create/modify/access, etc.
block map: physical block pointers in the block map are sector IDs or physical block numbers

Representing Large Files
[Figure: inode with direct block map, indirect block, and double indirect block]
Classical Unix
Each file system block is a clump of sectors (4KB, 8KB, 16KB).
Inode == 128 bytes, packed into blocks.
Each inode has 68 bytes of attributes and 15 block map entries, so each entry is 4 bytes: (128 - 68) / 15.
Suppose block size = 8KB
  12 direct block map entries in the inode can map 96KB of data.
  One indirect block (referenced by the inode) holds 2K entries (8KB / 4 bytes) and can map 16MB of data.
  One double indirect block pointer in inode maps 2K indirect blocks.
  maximum file size is 96KB + 16MB + (2K*16MB) + ...

Unix index blocks
• Intuition
  – Many files are small
    • Length = 0, length = 1, length < 80, ...
  – Some files are huge (3 gigabytes)
• "Clever heuristic" in Unix FFS inode
  – 12 (direct) block pointers: 12 * 8 KB = 96 KB
    • Availability is "free" - you need inode to open() file anyway
  – 3 indirect block pointers
    • single, double, triple
[Figure: inode index blocks with direct, single-, double-, and triple-indirect pointers]
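The block-map arithmetic above can be checked with a short sketch. It assumes the slide's parameters (8KB blocks, 12 direct entries, one single, one double, and one triple indirect pointer) plus the 4-byte entry size implied by (128 - 68) / 15. The function names `level_capacities` and `locate` are made up for illustration and do not come from any real filesystem code.

```python
BLOCK_SIZE = 8 * 1024                    # bytes per file system block (from the slide)
PTR_SIZE = 4                             # assumed: (128 - 68) bytes / 15 entries
NDIRECT = 12                             # direct block map entries in the inode
PTRS_PER_BLOCK = BLOCK_SIZE // PTR_SIZE  # 2048 entries per indirect block


def level_capacities():
    """Bytes mapped by the direct entries and by each indirect pointer."""
    direct = NDIRECT * BLOCK_SIZE
    single = PTRS_PER_BLOCK * BLOCK_SIZE
    double = PTRS_PER_BLOCK * single
    triple = PTRS_PER_BLOCK * double
    return direct, single, double, triple


def locate(logical_block):
    """Return (level, indices) telling which map entries reach a logical block."""
    p = PTRS_PER_BLOCK
    n = logical_block
    if n < NDIRECT:
        return "direct", [n]
    n -= NDIRECT
    if n < p:
        return "single", [n]
    n -= p
    if n < p * p:
        return "double", [n // p, n % p]
    n -= p * p
    if n < p ** 3:
        return "triple", [n // (p * p), (n // p) % p, n % p]
    raise ValueError("logical block is beyond the maximum file size")


direct, single, double, triple = level_capacities()
print(direct // 1024, "KB direct;", single // 2**20, "MB single indirect;",
      double // 2**30, "GB double indirect")
print(locate(0), locate(12), locate(12 + 2048))
```

Under these assumptions the double indirect level alone maps 2K * 16MB = 32GB, which is why the slide's sum is dominated by its last terms.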
Unix index blocks
[Figure sequence: an inode's index blocks filling in as the file grows, with labels Direct blocks, Indirect pointer, Double-indirect, Triple-indirect; unused pointers are shown as -1]

Directories
[Figure: a directory file with slots wind: 18, snow: 62, rain: 32, hail: 48, and empty slots marked 0; sector 32 holds a file whose data reads "once upon a time /n in a land far far away, lived th..."]
Entries or slots are found by a linear scan.

A Filesystem On Disk
[Figure: sector 0 and sector 1 hold the allocation bitmap file and the directory file; the directory (wind: 18, snow: 62, rain: 32, hail: 48) names a file inode that points to the data blocks]
This is just an example (Nachos)
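The linear scan over directory slots mentioned on the Directories slide can be sketched as follows. This is a minimal illustration, not the Nachos code: the slot layout (a list of (name, inode number) pairs with `None` for a free slot) and the function names `lookup`, `add`, and `remove` are assumptions made for the example.

```python
FREE = None  # assumed marker for an unused directory slot


def lookup(slots, name):
    """Linear scan: return the inode number bound to `name`, or None."""
    for entry in slots:
        if entry is not FREE and entry[0] == name:
            return entry[1]
    return None


def add(slots, name, inum):
    """Bind `name` to `inum` in the first free slot, growing the table if full."""
    if lookup(slots, name) is not None:
        raise FileExistsError(name)
    for i, entry in enumerate(slots):
        if entry is FREE:
            slots[i] = (name, inum)
            return i
    slots.append((name, inum))
    return len(slots) - 1


def remove(slots, name):
    """Free the slot holding `name` and return its inode number."""
    for i, entry in enumerate(slots):
        if entry is not FREE and entry[0] == name:
            slots[i] = FREE
            return entry[1]
    raise FileNotFoundError(name)


# Entries taken from the figure; the positions of the free slots are made up.
directory = [FREE, ("wind", 18), ("snow", 62), FREE, ("rain", 32), ("hail", 48)]
print(lookup(directory, "rain"))
```

Every operation costs time proportional to the number of slots, which is one reason "efficient directory operations" shows up as an allocation goal for large directories.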
Unix File Naming (Hard Links)
A Unix file may have multiple names.
Each directory entry naming the file is called a hard link.
Each inode contains a reference count showing how many hard links name it.
[Figure: directory A (wind: 18, rain: 32, hail: 48) and directory B (sleet: 48) both name inode 48, whose link count = 2]
link system call
  link (existing name, new name)
  create a new name for an existing file
  increment inode link count
unlink system call ("remove")
  unlink(name)
  destroy directory entry
  decrement inode link count
  if count == 0 and file is not in active use
    free blocks (recursively) and on-disk inode
Illustrates: garbage collection by reference counting.

Unix Symbolic (Soft) Links
A soft link is a file containing a pathname of some other file.
symlink system call
  symlink (existing name, new name)
  allocate a new file (inode) with type symlink
  initialize file contents with existing name
  create directory entry for new file with new name
[Figure: directory A (wind: 18, rain: 32, hail: 48) and directory B (sleet: 67); inode 48 has link count = 1; inode 67 is a symlink whose contents are ../A/hail/0]
The target of the link may be removed at any time, leaving a dangling reference.
How should the kernel handle recursive soft links?

Failures, Commits, Atomicity
• What guarantees does the system offer about the hard state if the system fails?
  – Durability
    • Did my writes commit, i.e., are they on the disk?
  – Atomicity
    • Can an operation "partly commit"?
    • Also, can it interleave with other operations?
  – Recoverability and Corruption
    • Is the metadata well-formed on recovery?

Unix Failure/Atomicity
• File writes are not guaranteed to commit until close.
  – A process can force commit with a sync.
  – The system forces commit every (say) 30 seconds.
  – Failure could lose an arbitrary set of writes.
• Reads/writes to a shared file interleave at the granularity of system calls.
• Metadata writes are atomic/synchronous.
• Disk writes are carefully ordered.
  – The disk can become corrupt in well-defined ways.
  – Restore with a scrub ("fsck") on restart.
  – Alternatives: logging, shadowing
• Want better reliability? Use a database.

The Problem of Disk Layout
• The level of indirection in the file block maps allows flexibility in file layout.
• "File system design is 99% block allocation." [McVoy]
• Competing goals for block allocation:
  – allocation cost
  – bandwidth for high-volume transfers
  – stamina/longevity
  – efficient directory operations
• Goal: reduce disk arm movement and seek overhead.
• metric of merit: bandwidth utilization

Bandwidth utilization
Define
  b  Block size
  B  Raw disk bandwidth ("spindle speed")
  s  Average access (seek+rotation) delay per block I/O
Then
  Transfer time per block = b/B
  I/O completion time per block = s + (b/B)
  Effective disk bandwidth for I/O request stream = b/(s + (b/B))
  Bandwidth wasted per I/O: sB
  Effective bandwidth utilization (%): b/(sB + b)
How to get better performance?
  - Larger b (larger blocks, clustering, extents, etc.)
  - Smaller s (placement / ordering, sequential access, logging, etc.)
[Figure: disk anatomy with labels Track, Sector, Arm, Cylinder, Platter, Head]
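To make the utilization formula concrete, the sketch below plugs in the numbers from the Rotational Media slide (8KB blocks, 8 MB/s raw bandwidth, 5-15 ms seek plus about 4 ms average rotational delay). The function names are made up for illustration, and treating 1 MB as 2^20 bytes is an assumption, since the slides do not say which convention they use.

```python
KB = 1024
MB = 1024 * KB  # assumption: binary megabytes


def effective_bandwidth(b, B, s):
    """Bytes per second actually delivered: b / (s + b/B)."""
    return b / (s + b / B)


def utilization(b, B, s):
    """Fraction of raw bandwidth B that is used: b / (sB + b)."""
    return b / (s * B + b)


B = 8 * MB  # raw disk bandwidth from the Rotational Media slide

# s = seek + average rotational delay: 5 + 4 ms (best case) and 15 + 4 ms (worst case)
for s in (0.009, 0.019):
    print(f"s = {s * 1000:.0f} ms, b = 8KB: {utilization(8 * KB, B, s):.1%}")

# Larger b: the same access delay is amortized over more data per I/O.
for blocks in (1, 8, 64):
    b = blocks * 8 * KB
    print(f"b = {b // KB}KB, s = 9 ms: {utilization(b, B, 0.009):.1%}")
```

With these inputs a random 8KB access uses roughly 10% of the raw bandwidth in the best case and roughly 5% in the worst, and utilization climbs as b grows, which matches the "larger b" advice above.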