accounting pages
- shared pages make it difficult to count memory usage
- Linux cgroups accounting: last touch
  - count shared file pages for the process that last ‘used’ them
  - …as detected by page fault for page
Linux cgroup limits
- Linux “control groups” of processes
- can set memory limits for group of processes:
  - low limit: don’t ‘steal’ pages when group uses less than this
    - always take pages someone is using (unless no choice)
  - high limit: never let group use more than this
    - replace pages from this group before anything else
  - …
Linux cgroups
- Linux mechanism: separate processes into groups:
  - cgroup website: webserver, webapp, …
  - cgroup login: bash (shell), ls, …
- can set memory and CPU and … shares for each group
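Linux cgroup memory limits
(diagram: memory usage axis from 0 GB up to memory capacity, marked with low limit, high limit, max)
- usage above high limit: actively deallocate pages cgroup is using
- usage between low limit and high limit: if other processes need memory, take from this group
- usage below low limit: do not take from this group for other groups (even if pages not recently used)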
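The three regions in the diagram can be sketched as a small classification function. This is only an illustration of the policy described on the slide, not kernel code: the type and function names (`cgroup_limits`, `classify_usage`, the `RECLAIM_*` values) are made up here, and the exact treatment of the boundary values is an assumption.

```c
#include <stddef.h>

/* Hypothetical names: how eagerly pages would be taken from a group. */
enum reclaim_policy {
    RECLAIM_PROTECTED,  /* below low limit: do not take pages for other groups */
    RECLAIM_IF_NEEDED,  /* between low and high: take pages if others need memory */
    RECLAIM_ACTIVELY    /* above high limit: actively deallocate this group's pages */
};

struct cgroup_limits {
    size_t low;   /* bytes */
    size_t high;  /* bytes */
};

/* Map a group's current memory usage to one of the three regions. */
enum reclaim_policy classify_usage(const struct cgroup_limits *lim, size_t usage)
{
    if (usage < lim->low)
        return RECLAIM_PROTECTED;
    if (usage <= lim->high)
        return RECLAIM_IF_NEEDED;
    return RECLAIM_ACTIVELY;
}
```

With a low limit of 1 MB and a high limit of 4 MB, a group using 2 MB falls in the middle region: its pages are fair game, but only when another group needs memory.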
recall: kernel buffering (reads)
- program: read char from terminal …via buffer
  - operating system buffer: keyboard input waiting for program
  - keyboard: keypress happens, read
- program: read char from file …via buffer
  - operating system buffer: recently read data from disk
  - disk: read block of data from disk
recall: kernel buffering (writes)
- program: print char (to remote machine)
  - operating system buffer: output waiting for network
  - network: send data (when ready)
- program: write char to file
  - operating system buffer: data waiting to be written on disk
  - disk: write block of data to disk (when ready)
recall: layering
- application
- standard library: cout/printf, and their own buffers
- system calls: read/write
- kernel’s file interface, kernel’s buffers
- device drivers
- hardware interfaces
ways to talk to I/O devices
- user program: read/write/mmap/etc.
- file interface
  - regular files → filesystems → device drivers
  - device files → device drivers
devices as files
- talking to device? open/read/write/close
- typically similar interface within the kernel
- device driver implements the file interface
example device files from a Linux desktop
- /dev/snd/pcmC0D0p: audio playback
  - configure, then write audio data
- /dev/sda, /dev/sdb: SATA-based SSD and hard drive
  - usually access via filesystem, but can mmap/read/write directly
- /dev/input/event3, /dev/input/event10: mouse and keyboard
  - can read list of keypress/mouse movement/etc. events
- /dev/dri/renderD128: builtin graphics
  - DRI = direct rendering infrastructure
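A minimal sketch of "devices as files": the helper below uses only open/read/close, so the same code works on a regular file or a device file. The helper name `read_from_path` is made up for illustration, and `/dev/zero` (a device that yields zero bytes, not one of the devices listed above) is assumed to exist, as it does on typical Linux systems.

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* Read up to n bytes from path using only open/read/close.
 * Returns the number of bytes read, or -1 on error.
 * Nothing here is specific to devices or to regular files. */
ssize_t read_from_path(const char *path, void *buf, size_t n)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t got = read(fd, buf, n);
    close(fd);
    return got;
}
```

Calling `read_from_path("/dev/zero", buf, 8)` goes through the device driver's read implementation; calling it on a file path goes through the filesystem instead. The caller cannot tell the difference from the interface.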
devices: extra operations?
- read/write/mmap not enough
  - audio output device: set format of audio?
  - terminal: whether to echo back what user types?
  - CD/DVD: open the disk tray? is a disk present?
  - …
- POSIX: ioctl (general I/O control), tcgetattr/tcsetattr (for terminal settings), …
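For the terminal echo example, a sketch using the POSIX terminal-settings calls. The function name `set_echo` is hypothetical; `tcgetattr`, `tcsetattr`, `TCSANOW`, and the `ECHO` flag are the standard termios interface.

```c
#include <termios.h>
#include <unistd.h>

/* Turn terminal echo on (enable != 0) or off (enable == 0) for fd.
 * Returns 0 on success, -1 if fd is not a terminal or a call fails. */
int set_echo(int fd, int enable)
{
    struct termios t;
    if (tcgetattr(fd, &t) < 0)       /* fetch current terminal settings */
        return -1;
    if (enable)
        t.c_lflag |= ECHO;
    else
        t.c_lflag &= ~(tcflag_t)ECHO;
    return tcsetattr(fd, TCSANOW, &t) < 0 ? -1 : 0;  /* apply immediately */
}
```

This operation makes no sense for most files, which is why it is not part of read/write: calling it on a pipe or a regular file fails.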
Linux example: file operations
(selected subset; table of pointers to functions)

    struct file_operations {
        ...
        ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
        ...
        long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
        ...
        int (*mmap)(struct file *, struct vm_area_struct *);
        unsigned long mmap_supported_flags;
        int (*open)(struct inode *, struct file *);
        ...
        int (*release)(struct inode *, struct file *);
        ...
    };
special case: block devices
- devices like disks often have a different interface
  - unlike normal file interface, works in terms of ‘blocks’ instead of bytes
- used by filesystems: store directories on devices
  - filesystems are specialized to know disks aren’t byte-based
- want to work with page cache: bytes not convenient
  - read/write page at a time
- implement read/write to use page cache, not direct
  - common code to translate from working with bytes to blocks
Linux example: block device operations

    struct block_device_operations {
        int (*open)(struct block_device *, fmode_t);
        void (*release)(struct gendisk *, fmode_t);
        int (*rw_page)(struct block_device *, sector_t, struct page *, bool);
        int (*ioctl)(struct block_device *, fmode_t, unsigned, unsigned long);
        ...
    };

- rw_page: read/write a page for a sector number (= block number)
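The "common code to translate from working with bytes to blocks" can be sketched in user space. Everything here is hypothetical: an in-memory array stands in for the disk, `read_block` stands in for the driver's block operation, and the block size is tiny so the arithmetic is easy to follow (real devices commonly use 512 or 4096 bytes).

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 8   /* tiny block size, for illustration only */
#define NUM_BLOCKS 4

/* Hypothetical "disk" that can only be read one whole block at a time. */
static unsigned char disk[NUM_BLOCKS][BLOCK_SIZE];

static int read_block(size_t blockno, unsigned char *out)
{
    if (blockno >= NUM_BLOCKS)
        return -1;
    memcpy(out, disk[blockno], BLOCK_SIZE);
    return 0;
}

/* Byte-oriented read built on top of block reads.
 * Returns the number of bytes copied (short at the end of the disk). */
size_t read_bytes(size_t offset, unsigned char *dst, size_t n)
{
    size_t done = 0;
    unsigned char block[BLOCK_SIZE];
    while (done < n) {
        size_t pos = offset + done;
        size_t blockno = pos / BLOCK_SIZE;   /* which block holds this byte */
        size_t within = pos % BLOCK_SIZE;    /* where in that block it sits */
        if (read_block(blockno, block) < 0)
            break;
        size_t chunk = BLOCK_SIZE - within;  /* bytes left in this block */
        if (chunk > n - done)
            chunk = n - done;
        memcpy(dst + done, block + within, chunk);
        done += chunk;
    }
    return done;
}
```

A 10-byte read starting at offset 5 touches two blocks: bytes 5..7 of block 0 and bytes 0..6 of block 1. In a real kernel the temporary `block` buffer would be a page in the page cache rather than a local array.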
device driver flow
- thread making read/write/etc. (“top half”)
  - read/write/… system call or page cache miss/eviction…
  - get I/O request
  - check if satisfied from buffers (e.g. previous keypresses to keyboard)
  - send or queue I/O operation
  - put thread to sleep (if needed)
- trap handler (“bottom half”)
  - get interrupt from device
  - update buffers
  - store and return request result
  - send more to device (if needed)
  - wake up thread (if needed)
- device hardware
xv6: device files

    struct devsw {
        int (*read)(struct inode*, char*, int);
        int (*write)(struct inode*, char*, int);
    };

    extern struct devsw devsw[];

- devsw: table of devices
- device file uses entry in devsw array
- filesystem stores name to index lookup
- similar scheme used on ‘real’ Unix/Linux
  - files referencing major/minor device number
  - table of device numbers in kernel
xv6: console devsw
code run at boot:

    devsw[CONSOLE].write = consolewrite;
    devsw[CONSOLE].read = consoleread;

- CONSOLE is a constant
- consoleread/consolewrite: run when you read/write console
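A user-space sketch of the same table-of-function-pointers idea. All names prefixed `demo_` are invented for this example, and the function signatures are simplified (no inode argument); the "console" just appends to an array so the effect can be inspected.

```c
#include <stddef.h>
#include <string.h>

#define NDEV 10
#define DEMO_CONSOLE 1   /* hypothetical device index, playing the role of CONSOLE */

/* One pair of function pointers per device number. */
struct demo_devsw {
    int (*read)(char *dst, int n);
    int (*write)(const char *src, int n);
};

static struct demo_devsw demo_devsw[NDEV];

static char demo_output[64];   /* stands in for the screen */
static int demo_output_len;

static int demo_console_write(const char *src, int n)
{
    if (n < 0)
        return -1;
    if (n > (int)sizeof demo_output - demo_output_len)
        n = (int)sizeof demo_output - demo_output_len;
    memcpy(demo_output + demo_output_len, src, (size_t)n);
    demo_output_len += n;
    return n;
}

static int demo_console_read(char *dst, int n)
{
    (void)dst;
    (void)n;
    return 0;   /* no input in this sketch */
}

/* Run once "at boot" to fill in the table entry. */
void demo_console_init(void)
{
    demo_devsw[DEMO_CONSOLE].read = demo_console_read;
    demo_devsw[DEMO_CONSOLE].write = demo_console_write;
}

/* What a write to a device file boils down to:
 * look up the device number, then call through the table. */
int demo_dev_write(int dev, const char *src, int n)
{
    if (dev < 0 || dev >= NDEV || demo_devsw[dev].write == NULL)
        return -1;
    return demo_devsw[dev].write(src, n);
}
```

The generic code in `demo_dev_write` never names the console; adding another device means filling in another table entry, not changing the dispatch.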
xv6: console top half (read)

    int
    consoleread(struct inode *ip, char *dst, int n)
    {
      ...
      target = n;
      acquire(&cons.lock);
      while(n > 0){
        while(input.r == input.w){
          if(myproc()->killed){
            ...
            return -1;
          }
          sleep(&input.r, &cons.lock);
        }
        ...
      }
      release(&cons.lock);
      ...
    }
xv6: console top half (read)

    int
    consoleread(struct inode *ip, char *dst, int n)
    {
      ...
      target = n;
      acquire(&cons.lock);
      while(n > 0){
        ...
        c = input.buf[input.r++ % INPUT_BUF];
        ...
        *dst++ = c;
        --n;
        if(c == '\n')
          break;
      }
      release(&cons.lock);
      ...
      return target - n;
    }
xv6: console top half
- wait for buffer to fill
  - no special work to request data: keyboard input always sent
- copy from buffer
- check if done (newline or enough chars), if not repeat
xv6: console interrupt (one case)

    void
    trap(struct trapframe *tf)
    {
      ...
      switch(tf->trapno){
      ...
      case T_IRQ0 + IRQ_KBD:
        kbdintr();
        lapiceoi();
        break;
      ...
      }
      ...
    }

- kbdintr: actually read from keyboard device
- lapiceoi: tell CPU “I’m done with this interrupt”
xv6: console interrupt reading
- kbdintr function actually reads from device
- adds data to buffer (if room)
- wakes up sleeping thread (if any)
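The buffer shared by the two halves can be sketched as a ring buffer with a read index and a write index, in the style of the `input.r`/`input.w` fields above. This is a simplified user-space version, not xv6's code: the `demo_` names are invented, there is no locking, and where a real driver would sleep or wake a thread the sketch just returns.

```c
#include <stddef.h>

#define DEMO_INPUT_BUF 8

/* r = read index, w = write index. Indices only grow;
 * the slot in buf is index % DEMO_INPUT_BUF. */
struct demo_input {
    char buf[DEMO_INPUT_BUF];
    unsigned r;
    unsigned w;
};

/* "Bottom half": called when a keypress arrives.
 * Adds data to the buffer if there is room; returns 1 if stored, 0 if dropped.
 * A real driver would also wake up any sleeping reader here. */
int demo_put(struct demo_input *in, char c)
{
    if (in->w - in->r >= DEMO_INPUT_BUF)
        return 0;
    in->buf[in->w++ % DEMO_INPUT_BUF] = c;
    return 1;
}

/* "Top half": copy up to n chars, stopping after a newline.
 * A real driver would sleep while the buffer is empty;
 * this sketch returns what it has instead. */
int demo_read(struct demo_input *in, char *dst, int n)
{
    int target = n;
    while (n > 0) {
        if (in->r == in->w)
            break;                       /* buffer empty */
        char c = in->buf[in->r++ % DEMO_INPUT_BUF];
        *dst++ = c;
        --n;
        if (c == '\n')
            break;                       /* done: got a full line */
    }
    return target - n;
}
```

The buffer is empty exactly when `r == w`, which is the same condition the top half loops on before sleeping.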
connecting devices
(diagram: processor, other processors…, actual memory, other devices, and a device controller all attached to the memory bus; device controller sits in front of the external hardware; interrupt controller connects devices to the processor)
- device controller contains control registers (status, read?, write?, …) and buffers/queues
- control registers have memory addresses (0x80004800, 0x80004808, 0x80004810, …)
  - looks like write to memory
  - actually changes value in device controller
- control registers might not really be registers
  - e.g. maybe writing to “control register” actually just sends the value to the external hardware
- buffers/queues will also have memory addresses
- interrupt controller: way to send “please interrupt” signal
  - component of processor decides when to handle
  - (deals with ordering, interrupt disabling, which of several processors handles it, …, etc.)
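bus adaptors
(diagram: as in “connecting devices”, but the device controller is not on the memory bus directly)
- memory bus connects processor, other processors…, actual memory, other devices, interrupt controller, and a bus adaptor
- bus adaptor connects to a different bus
- on the different bus: device controller (control registers: status, read?, write?, …; buffers/queues; external hardware?), other bus adaptors or other devices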
devices as magic memory (1)
- devices expose memory locations to read/write
- use read/write instructions to manipulate device
- example: keyboard controller
  - read from magic memory location: get last keypress/release
  - reading location clears buffer for next keypress/release
  - get interrupt whenever new keypress/release you haven’t read
devices as magic memory (2)
- example: display controller
  - write pixels to magic memory location: displayed on screen
  - other memory locations control format/screen size
- example: network interface
  - write to buffers
  - write “send now” signal to magic memory location: send data
  - read from “status” location, buffers to receive
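A sketch of the network-interface example as C code. The register layout (`struct demo_regs`), the ready bit, and the function name are all made up for illustration; no real device is being described. On real hardware the pointer would refer to the device's fixed addresses; here any struct of the same shape can stand in for the device.

```c
#include <stdint.h>

/* Hypothetical register layout for a made-up network-like device. */
struct demo_regs {
    uint32_t status;      /* bit 0 set = device ready for more data */
    uint32_t send_now;    /* writing 1 asks the device to send the buffer */
    uint8_t  buffer[16];  /* data to send */
};

#define DEMO_STATUS_READY 0x1u

/* volatile: the compiler must perform every load and store to the device
 * as written, rather than keeping values in CPU registers or dropping
 * stores that look redundant. */
int demo_send(volatile struct demo_regs *dev, const uint8_t *data, unsigned len)
{
    if (len > sizeof dev->buffer)
        return -1;
    if ((dev->status & DEMO_STATUS_READY) == 0)
        return -1;                  /* read from "status" location: busy */
    for (unsigned i = 0; i < len; i++)
        dev->buffer[i] = data[i];   /* write to buffers */
    dev->send_now = 1;              /* write "send now" signal */
    return (int)len;
}
```

Note that `volatile` only constrains the compiler. It does not stop the processor's cache from holding device data, which is a separate problem.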
what about caching?
- caching “last keypress/release”?
- I press ‘h’, OS reads ‘h’, does that get cached?
- …I press ‘e’, OS reads what?
- solution: OS can mark memory uncachable
  - x86: bit in page table entry can say “no caching”
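A sketch of setting the "no caching" bit when building a page table entry for device memory. The bit positions are the standard x86 ones for a 32-bit entry mapping a 4 KB page; the helper name `demo_device_pte` is invented. On real processors the final caching behavior can also depend on other settings (MTRRs, PAT), which this sketch ignores.

```c
#include <stdint.h>

/* x86 page table entry flag bits (32-bit entries, 4 KB pages). */
#define PTE_P   0x001u  /* present */
#define PTE_W   0x002u  /* writable */
#define PTE_PWT 0x008u  /* write-through */
#define PTE_PCD 0x010u  /* page-level cache disable: "no caching" */

/* Build an entry mapping the page that contains physical address pa,
 * for use with device registers: present, writable, caching disabled. */
uint32_t demo_device_pte(uint32_t pa)
{
    return (pa & ~0xFFFu) | PTE_P | PTE_W | PTE_PWT | PTE_PCD;
}
```

With this mapping, each read of the keyboard's location goes to the device, so after ‘h’ is read and ‘e’ is pressed the OS sees ‘e’ rather than a stale cached ‘h’.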
aside: I/O space
- x86 has “I/O addresses”
  - like memory addresses, but accessed with different instructions
  - in and out instructions
- historically: separate I/O bus
- more recent processors/devices would just use memory addresses
  - no need for more instructions, buses
  - other reasons to have devices and memory close (later)
xv6 keyboard access
- two control registers:
  - KBSTATP: status register (I/O address 0x64)
  - KBDATAP: data buffer (I/O address 0x60)

    st = inb(KBSTATP);        // in instruction: read from I/O address
    if ((st & KBS_DIB) == 0)  // bit KBS_DIB indicates data in buffer?
        return -1;
    data = inb(KBDATAP);      // read from data --- *clears* buffer
    /* interpret data to learn what kind of keypress/release */
programmed I/O
- “programmed I/O”: write to or read from device buffers directly
- OS runs loop to transfer data to or from device
- might still be triggered by interrupt
  - to know/wait for “is device ready”
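A sketch of the programmed I/O loop: the CPU itself checks a status register and copies each byte. To keep it runnable without hardware, device access goes through function pointers that stand in for instructions like `inb`; the `demo_` names, the ready bit, and the fake device are all hypothetical. The `max_polls` cutoff is an addition so the loop cannot spin forever.

```c
#include <stdint.h>
#include <stddef.h>

#define DEMO_DATA_READY 0x01u

/* Hypothetical device access functions, standing in for inb on real hardware. */
struct demo_port_ops {
    uint8_t (*read_status)(void *dev);
    uint8_t (*read_data)(void *dev);
};

/* Programmed I/O read: the OS loops, checking "is device ready" and
 * copying one byte at a time. Gives up after max_polls checks in a row
 * find no data. Returns the number of bytes copied. */
size_t demo_pio_read(const struct demo_port_ops *ops, void *dev,
                     uint8_t *dst, size_t n, unsigned max_polls)
{
    size_t got = 0;
    unsigned idle = 0;
    while (got < n && idle < max_polls) {
        if (ops->read_status(dev) & DEMO_DATA_READY) {
            dst[got++] = ops->read_data(dev);
            idle = 0;
        } else {
            idle++;
        }
    }
    return got;
}

/* A fake device for trying the loop out: hands out the bytes of a string. */
struct demo_fake_dev {
    const char *data;
    size_t pos;
    size_t len;
};

static uint8_t demo_fake_status(void *d)
{
    struct demo_fake_dev *f = d;
    return f->pos < f->len ? DEMO_DATA_READY : 0;
}

static uint8_t demo_fake_data(void *d)
{
    struct demo_fake_dev *f = d;
    return (uint8_t)f->data[f->pos++];
}
```

Every byte costs the CPU at least one status check and one data read, which is the motivation for letting an interrupt say when the device is ready instead of polling.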
approximating LRU: SEQ
- two lists: active list and inactive list
- active list → inactive list
  - guess: oldest active page is really inactive
- inactive page referenced?
  - not really inactive: move to active list
- evict page at bottom of inactive list
  - know: not referenced ‘recently’
- “new” pages start in active list
- detecting references?
  - scan reference bits
  - or mark invalid + get fault
- extra details needed: how big is the inactive list?
- this is current Linux algorithm for non-file pages
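A small simulation of the two-list scheme, as a sketch only. It is not Linux's implementation: the `demo_` names, the fixed-size arrays, and the `inactive_target` parameter (one possible answer to "how big is the inactive list?") are all assumptions, and the `referenced` field stands in for a hardware reference bit.

```c
#include <stddef.h>

#define DEMO_MAX_PAGES 16

struct demo_page {
    int id;
    int referenced;   /* stands in for the hardware reference bit */
};

/* A list kept as an array: index 0 is the oldest ("bottom"). */
struct demo_list {
    struct demo_page *pages[DEMO_MAX_PAGES];
    size_t count;
};

struct demo_lru {
    struct demo_list active;
    struct demo_list inactive;
    size_t inactive_target;   /* assumed knob: desired inactive list size */
};

static void demo_push_newest(struct demo_list *l, struct demo_page *p)
{
    l->pages[l->count++] = p;
}

static struct demo_page *demo_pop_oldest(struct demo_list *l)
{
    struct demo_page *p = l->pages[0];
    for (size_t i = 1; i < l->count; i++)
        l->pages[i - 1] = l->pages[i];
    l->count--;
    return p;
}

/* "New" pages start in the active list. Returns -1 if the sketch is full. */
int demo_add_page(struct demo_lru *lru, struct demo_page *p)
{
    if (lru->active.count + lru->inactive.count >= DEMO_MAX_PAGES)
        return -1;
    p->referenced = 0;
    demo_push_newest(&lru->active, p);
    return 0;
}

/* Choose a page to evict; returns NULL if there are no pages at all. */
struct demo_page *demo_evict(struct demo_lru *lru)
{
    size_t want = lru->inactive_target ? lru->inactive_target : 1;
    for (;;) {
        /* Guess: the oldest active pages are really inactive. */
        while (lru->inactive.count < want && lru->active.count > 0) {
            struct demo_page *p = demo_pop_oldest(&lru->active);
            p->referenced = 0;   /* start watching for new references */
            demo_push_newest(&lru->inactive, p);
        }
        if (lru->inactive.count == 0)
            return NULL;
        struct demo_page *victim = demo_pop_oldest(&lru->inactive);
        if (!victim->referenced)
            return victim;       /* not referenced 'recently': evict */
        victim->referenced = 0;  /* guess was wrong: not really inactive */
        demo_push_newest(&lru->active, victim);
    }
}
```

A larger `inactive_target` gives pages more time to prove they are still in use before reaching the bottom of the inactive list, at the cost of tracking references on more pages.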
swapping timeline
(diagram: program A runs → page fault → OS: start read → program B runs … → interrupt → OS → program A resumes; one of program A’s pages is loaded, one of program B’s pages is evicted)
- first step of replacement: OS needs to choose page to replace
  - hopefully copy on disk is already up-to-date?
- mark evicted page invalid in each page table
  - this example: only process B
  - real case: possibly many page tables
- other processes can run while reading page
- OS will get interrupt when disk is done
- process A’s page table updated, and restarted from point of fault
POSIX: everything is a file
- the file: one interface for
  - devices (terminals, printers, …)
  - regular files on disk
  - networking (sockets)
  - local interprocess communication (pipes, sockets)
- basic operations: open(), read(), write(), close()
the file interface
- open before use
  - setup, access control happens here
- byte-oriented
  - real device isn’t? operating system needs to hide that
- explicit close
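A sketch of the "one interface" idea using a pipe: once the file descriptors exist, the calls are the same read(), write(), and close() used on regular files. The function name `demo_pipe_roundtrip` is made up; it assumes a short message that fits in the pipe's buffer, so the write does not block.

```c
#include <unistd.h>
#include <string.h>
#include <sys/types.h>

/* Send msg through a pipe and read it back with the same
 * read()/write()/close() calls that work on regular files.
 * Returns the number of bytes read, or -1 on error. */
ssize_t demo_pipe_roundtrip(const char *msg, char *out, size_t outsize)
{
    int fds[2];
    if (pipe(fds) < 0)             /* setup happens before use, like open() */
        return -1;
    ssize_t written = write(fds[1], msg, strlen(msg));
    close(fds[1]);                 /* explicit close of the write end */
    if (written < 0) {
        close(fds[0]);
        return -1;
    }
    ssize_t got = read(fds[0], out, outsize);  /* byte-oriented read */
    close(fds[0]);
    return got;
}
```

The reader asks for up to `outsize` bytes and gets however many are available; nothing in the interface exposes how the kernel stores the data in between.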
kernel buffering (reads)
- program: read char from terminal …via buffer
  - operating system buffer: keyboard input waiting for program
  - keyboard: keypress happens, read
- program: read char from file …via buffer
  - operating system buffer: recently read data from disk
  - disk: read block of data from disk