The Page Cache



  1. 3/16/16 CSE 506: Operating Systems

     The Page Cache
     Don Porter

     Logical Diagram
     [Figure: layered system diagram. User: Binary Formats, Memory Allocators, Threads. System Calls. Kernel: RCU, File System, Networking, Sync, Memory Management, Device Drivers, CPU Scheduler. Hardware: Interrupts, Disk, Net, Consistency. Today's lecture: kernel-level memory management.]

     Recap of previous lectures
     • Page tables: translate virtual addresses to physical addresses
     • VM Areas (Linux): track what should be mapped in the virtual address space of a process
     • Hoard/Linux slab: Efficient allocation of objects from a superblock/slab of pages

     Background
     • Lab2: Track physical pages with an array of PageInfo structs
       – Contains reference counts
       – Free list layered over this array
     • Just like JOS, Linux represents physical memory with an array of page structs
       – Obviously, not the exact same contents, but same idea
     • Pages can be allocated to processes, or to cache file data in memory

     Today's Problem
     • Given a VMA or a file's inode, how do I figure out which physical pages are storing its data?
     • Next lecture: We will go the other way, from a physical page back to the VMA or file inode

     The address space abstraction
     • Unifying abstraction:
       – Each file inode has an address space (0 to file size)
       – So do block devices that cache data in RAM (0 to device size)
       – The (anonymous) virtual memory of a process has an address space (0 to 4GB on x86)
     • In other words, all page mappings can be thought of as an (object, offset) tuple
       – Make sense?
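The (object, offset) idea can be pictured as a lookup table keyed by that tuple. The following is only a minimal sketch: the names `page_cache`, `cache_key`, and `lookup`, and the use of a string to stand in for an inode, are assumptions made for illustration and are not how Linux stores the mapping.

```python
PAGE_SIZE = 4096  # 4 KB pages, as on x86

# Hypothetical cache: maps (object, page index) -> page contents.
page_cache = {}

def cache_key(obj, byte_offset):
    """Return the (object, page index) key for the page holding byte_offset."""
    return (obj, byte_offset // PAGE_SIZE)

def lookup(obj, byte_offset):
    """Return the cached page for this offset, or None if it is not in memory."""
    return page_cache.get(cache_key(obj, byte_offset))

# The string "inode:foo.txt" is a stand-in for a real inode object.
page_cache[cache_key("inode:foo.txt", 0)] = b"Hello!"
```

Any byte offset inside the first 4 KB of the file resolves to the same cached page, and an offset in a page that was never loaded resolves to nothing.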

  2. Address Spaces for:
     • VM Areas (VMAs)
     • Files

     Start Simple
     • "Anonymous" memory: no file backing it
       – E.g., the stack for a process
     • Not shared between processes
       – Will discuss sharing and swapping later
     • How do we figure out virtual to physical mapping?
       – Just walk the page tables!
     • Linux doesn't do anything outside of the page tables to track this mapping

     File mappings
     • A VMA can also represent a memory mapped file
     • The kernel can also map file pages to service read() or write() system calls
     • Goal: We only want to load a file into memory once!

     Logical View
     [Figure: Processes A, B, and C each point (with "?" arrows) toward the inode for Foo.txt and a cached page containing "Hello!", which is backed by the disk.]

     VMA to a file
     • Also easy: VMA includes a file pointer and an offset into file
       – A VMA may map only part of the file
       – Offset must be at page granularity
       – Anonymous mapping: file pointer is null
     • File pointer is an open file descriptor in the process file descriptor table
       – We will discuss file handles later

     Logical View
     [Figure: Processes A, B, and C each have a file descriptor table (FDs are process-specific) pointing to the inode for Foo.txt; the path from the inode to the cached "Hello!" page on disk is still marked "?".]
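The "VMA to a file" step amounts to a small address calculation. Here is a hedged sketch, assuming a simplified `VMA` record with hypothetical fields `start`, `end`, `file`, and `pgoff` (the file offset of the mapping, counted in pages); the field names are illustrative and a string stands in for the file pointer.

```python
from dataclasses import dataclass
from typing import Optional

PAGE_SIZE = 4096

@dataclass
class VMA:
    start: int            # first virtual address covered (page aligned)
    end: int              # one past the last virtual address covered
    file: Optional[str]   # stand-in for the file pointer; None means anonymous
    pgoff: int = 0        # where the mapping starts in the file, in pages

def vma_to_file_offset(vma, addr):
    """Translate a virtual address inside vma to (file, byte offset in file).

    Returns None for an anonymous mapping, where only the page tables apply.
    """
    if not (vma.start <= addr < vma.end):
        raise ValueError("address is outside this VMA")
    if vma.file is None:
        return None
    return vma.file, vma.pgoff * PAGE_SIZE + (addr - vma.start)
```

Storing the offset in whole pages is one way to guarantee the page-granularity rule from the slide: a mapping can then never start in the middle of a page.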

  3. Tracking file pages
     • What data structure to use for a file?
       – No page tables for files
     • For example: What page stores the first 4k of file "foo"
     • What data structure to use?
       – Hint: Files can be small, or very, very large

     The Radix Tree
     • A space-optimized trie
       – Trie: Rather than store entire key in each node, traversal of parent(s) builds a prefix, node just stores suffix
     • Especially useful for strings
       – Prefix less important for file offsets, but does bound key storage space
     • More important: A tree with a branching factor k > 2
       – Faster lookup for large files (esp. with tricks)
     • Note: Linux's use of the Radix tree is constrained

     From "Understanding the Linux Kernel"
     [Figure: (a) radix tree of height 1: a radix_tree_root (height = 1) whose rnode points to one radix_tree_node (count = 2, slots 0 to 63); slots[0] and slots[4] point to pages with index = 0 and index = 4. (b) radix tree of height 2: a radix_tree_root (height = 2) whose rnode points to a radix_tree_node (count = 2) with slots[0] and slots[2] in use; the first child node (count = 2) has slots[0] and slots[4] pointing to pages with index = 0 and index = 4; the second child node (count = 1) has slots[3] pointing to the page with index = 131.]

     A bit more detail
     • Assume an upper bound on file size when building the radix tree
       – Can rebuild later if we are wrong
     • Specifically: Max size is 256k, branching factor (k) = 64
     • 256k / 4k pages = 64 pages
       – So we need a radix tree of height 1 to represent these pages

     Tree of height 1
     • Root has 64 slots, can be null, or a pointer to a page
     • Lookup address X:
       – Shift off low 12 bits (offset within page)
       – Use next 6 bits as an index into these slots (2^6 = 64)
       – If pointer non-null, go to the child node (page)
       – If null, page doesn't exist

     Tree of height n
     • Similar story:
       – Shift off low 12 bits
     • At each child shift off 6 bits from middle (starting at 6 * (distance to the bottom - 1) bits) to find which of the 64 potential children to go to
       – Use fixed height to figure out where to stop, which bits to use for offset
     • Observations:
       – "Key" at each node implicit based on position in tree
       – Lookup time constant in height of tree
     • In a general-purpose radix tree, may have to check all k children, for higher lookup cost
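The lookup steps for trees of height 1 and height n can be sketched as follows. This is an illustrative model, not the kernel's radix tree code: the class name `RadixTree`, its methods, and the use of plain Python lists for nodes are all assumptions. It does follow the parameters from the slides: 12 offset bits, 6 index bits per level, and growth by adding a new root whose first child is the old tree.

```python
PAGE_SHIFT = 12                 # low 12 bits are the offset within a 4 KB page
BITS_PER_LEVEL = 6              # 6 index bits per level
SLOTS = 1 << BITS_PER_LEVEL     # branching factor k = 64

def max_file_bytes(height):
    """Largest file a tree of this height can cover: 2^(6*height + 12) bytes."""
    return 1 << (BITS_PER_LEVEL * height + PAGE_SHIFT)

class RadixTree:
    def __init__(self, height=1):
        self.height = height
        self.root = [None] * SLOTS          # a node is just a list of 64 slots

    def max_index(self):
        return (1 << (BITS_PER_LEVEL * self.height)) - 1

    def _grow(self):
        # Add another root; the previous tree becomes its first child.
        new_root = [None] * SLOTS
        new_root[0] = self.root
        self.root = new_root
        self.height += 1

    def insert(self, index, page):
        """Store page at page index `index`, growing the tree if needed."""
        while index > self.max_index():
            self._grow()
        node = self.root
        for level in range(self.height - 1, 0, -1):
            slot = (index >> (BITS_PER_LEVEL * level)) & (SLOTS - 1)
            if node[slot] is None:
                node[slot] = [None] * SLOTS
            node = node[slot]
        node[index & (SLOTS - 1)] = page

    def lookup(self, byte_offset):
        """Return the page caching byte_offset, or None if it is not present."""
        index = byte_offset >> PAGE_SHIFT   # shift off the in-page offset
        if index > self.max_index():
            return None
        node = self.root
        for level in range(self.height - 1, 0, -1):
            node = node[(index >> (BITS_PER_LEVEL * level)) & (SLOTS - 1)]
            if node is None:                # sparse tree: subtree is missing
                return None
        return node[index & (SLOTS - 1)]
```

Inserting page index 131 into a height-1 tree forces one growth step, after which the page sits at root slot 2, child slot 3, which matches the height-2 figure from "Understanding the Linux Kernel" (2 * 64 + 3 = 131).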

  4. Fixed heights
     • If the file size grows beyond max height, must grow the tree
     • Relatively simple: Add another root, previous tree becomes first child
     • Scaling in height:
       – 1: 2^((6*1) + 12) = 256 KB
       – 2: 2^((6*2) + 12) = 16 MB
       – 3: 2^((6*3) + 12) = 1 GB
       – 4: 2^((6*4) + 12) = 64 GB
       – 5: 2^((6*5) + 12) = 4 TB

     Back to address spaces
     • Each address space for a file cached in memory includes a radix tree
       – Radix tree is sparse: pages not in memory are missing
     • Radix tree also supports tags: such as dirty
       – A tree node is tagged if at least one child also has the tag
     • Example: I tag a file page dirty
       – Must tag each parent in the radix tree as dirty
       – When I am finished writing page back, I must check all siblings; if none dirty, clear the parent's dirty tag

     Logical View
     [Figure: Processes A, B, and C reach the inode for Foo.txt; the inode's address space holds a radix tree that points to the cached "Hello!" page, which is backed by the disk.]

     Recap
     • Anonymous page: Just use the page tables
     • File-backed mapping
       – VMA -> open file descriptor -> inode
       – Inode -> address space (radix tree) -> page

     Problem 2: Dirty pages
     • Most OSes do not write file updates to disk immediately
       – (Later lecture) OS tries to optimize disk arm movement
     • OS instead tracks "dirty" pages
       – Ensures that write back isn't delayed too long
     • Lest data be lost in a crash
     • Application can force immediate write back with sync system calls (and some open/mmap options)

     Sync system calls
     • sync(): Flush all dirty buffers to disk
     • fsync(fd): Flush all dirty buffers associated with this file to disk (including changes to the inode)
     • fdatasync(fd): Flush only dirty data pages for this file to disk
       – Don't bother with the inode
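The dirty-tag rule (tag every parent on the way up; on write-back, clear a parent's tag only when no sibling is still dirty) can be sketched as below. The `Node` and `TaggedTree` classes, the per-slot `dirty` list, and the parent back-pointers are assumptions chosen to keep the example short; the kernel's layout differs.

```python
BITS = 6
SLOTS = 1 << BITS

class Node:
    def __init__(self, parent=None, parent_slot=0):
        self.slots = [None] * SLOTS
        self.dirty = [False] * SLOTS    # one dirty tag per slot
        self.parent = parent            # back-pointer (an assumption of this sketch)
        self.parent_slot = parent_slot  # which slot of the parent holds this node

class TaggedTree:
    def __init__(self, height):
        self.height = height
        self.root = Node()

    def _leaf(self, index, create=False):
        """Walk to the bottom-level node for page `index`."""
        node = self.root
        for level in range(self.height - 1, 0, -1):
            slot = (index >> (BITS * level)) & (SLOTS - 1)
            if node.slots[slot] is None:
                if not create:
                    return None
                node.slots[slot] = Node(node, slot)
            node = node.slots[slot]
        return node

    def insert(self, index, page):
        self._leaf(index, create=True).slots[index & (SLOTS - 1)] = page

    def set_dirty(self, index):
        # Tag the page's slot, then every ancestor slot up to the root.
        node, slot = self._leaf(index), index & (SLOTS - 1)
        while node is not None:
            node.dirty[slot] = True
            node, slot = node.parent, node.parent_slot

    def clear_dirty(self, index):
        # Clear the page's tag; keep clearing upward only while no sibling is dirty.
        node, slot = self._leaf(index), index & (SLOTS - 1)
        while node is not None:
            node.dirty[slot] = False
            if any(node.dirty):
                break
            node, slot = node.parent, node.parent_slot

    def is_dirty(self):
        return any(self.root.dirty)
```

With this layout, finding dirty pages only requires descending into slots whose tag is set, so clean subtrees are never visited.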

  5. How to implement sync?
     • Goal: keep overheads of finding dirty blocks low
       – A naïve scan of all pages would work, but expensive
       – Lots of clean pages
     • Idea: keep track of dirty data to minimize overheads
       – A bit of extra work on the write path, of course

     How to implement sync?
     • Background: Each file system has a super block
       – All super blocks in a list
     • Each super block keeps a list of dirty inodes
     • Inodes and superblocks both marked dirty upon use

     FS Organization
     [Figure: one superblock (SB) per file system, here mounted at /, /floppy, and /d1; each superblock has a dirty list of inodes. Inodes and radix nodes/pages are marked dirty separately.]

     Simple traversal
       for each s in superblock list:
         if (s->dirty) writeback s
         for i in inode list:
           if (i->dirty) writeback i
           if (i->radix_root->dirty):
             // Recursively traverse tree writing
             // dirty pages and clearing dirty flag

     Asynchronous flushing
     • Kernel thread(s): pdflush
       – A kernel thread is a task that only runs in the kernel's address space
       – 2-8 threads, depending on how busy/idle threads are
     • When pdflush runs, it is given a target number of pages to write back
       – Kernel maintains a total number of dirty pages
       – Administrator configures a target dirty ratio (say 10%)

     pdflush
     • When pdflush is scheduled, it figures out how many dirty pages are above the target ratio
     • Writes back pages until it meets its goal or can't write more back
       – (Some pages may be locked, just skip those)
     • Same traversal as sync() + a count of written pages
       – Usually quits earlier
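The traversal and the pdflush variant of it can be modeled in a few lines. This is a toy sketch under several assumptions: the `Page`, `Inode`, and `SuperBlock` classes are hypothetical, a flat list of pages stands in for the per-inode radix tree, "writing back" is modeled as clearing a dirty flag, and locked pages are skipped in both modes even though the slides describe skipping only for pdflush.

```python
class Page:
    def __init__(self, dirty=False, locked=False):
        self.dirty = dirty
        self.locked = locked

class Inode:
    def __init__(self, pages):
        self.dirty = False      # inode metadata dirty flag
        self.pages = pages      # stand-in for the radix tree of cached pages

class SuperBlock:
    def __init__(self, dirty_inodes):
        self.dirty = False
        self.dirty_inodes = dirty_inodes

def writeback(superblocks, max_pages=None):
    """Walk superblocks -> dirty inodes -> dirty pages, 'writing' each one.

    max_pages=None models sync(): keep going until everything is visited.
    An integer models pdflush: stop once that many pages were written.
    Returns the number of pages written back.
    """
    written = 0
    for sb in superblocks:
        sb.dirty = False                    # write back the superblock
        for inode in sb.dirty_inodes:
            inode.dirty = False             # write back the inode
            for page in inode.pages:
                if max_pages is not None and written >= max_pages:
                    return written          # goal met, quit early
                if page.dirty and not page.locked:
                    page.dirty = False      # write back the page
                    written += 1
    return written

def pages_over_target(total_pages, dirty_pages, target_percent=10):
    """How many dirty pages exceed the configured target dirty ratio."""
    return max(0, dirty_pages - total_pages * target_percent // 100)
```

A pdflush-style pass would call `writeback(superblocks, max_pages=pages_over_target(total, dirty))`, while a sync-style pass omits the limit.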
