Block Device Scheduling
Don Porter
CSE 506
Logical Diagram (figure): the usual course map of kernel components; today's lecture covers device management (the I/O scheduler and disk driver), between the file system and the hardware.
Quick Recap
- CPU scheduling: balance competing concerns with heuristics
  - What were some goals? No perfect solution
- Today: block device scheduling
  - How is it different from the CPU?
  - Focus primarily on a traditional hard drive; extend to new storage media
Block device goals
- Throughput
- Latency
- Safety: the file system can be recovered after a crash
- Fairness: surprisingly, very little attention is given to storage-access fairness
  - Hard problem; solutions usually just prevent starvation
  - Disk quotas for space fairness
Big Picture (figure): VFS -> low-level FS (ext4, BTRFS, etc.) -> page cache -> block device / IO scheduler -> driver -> disk
OS Model of a Block Dev.
- Simple array of blocks
- Blocks are usually 512 bytes or 4 KB
Recall: Page Cache (figure): a page in the page cache with 3 buffer heads; buffer heads map the page's contents to blocks on the block device
Caching
- Obviously, the number 1 trick in the OS designer's toolbox is caching disk contents in RAM
  - Remember the page cache?
- Latency can be hidden by pre-reading data into RAM
  - And by keeping any free RAM full of disk contents
- Doesn't help synchronous reads that miss in the RAM cache, or synchronous writes
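A user-space analogue of the pre-reading idea, in case it helps make this concrete: POSIX lets an application hint that a file range should be pulled into the page cache ahead of time. This is a minimal sketch; the path and range are placeholders, and the kernel is free to ignore the hint.

    /* Ask the kernel to pre-read a file range into the page cache so
     * later reads hit RAM instead of waiting on the disk (sketch only). */
    #include <fcntl.h>
    #include <unistd.h>

    static int prefetch(const char *path, off_t off, off_t len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        /* Request asynchronous read-ahead of [off, off+len). */
        int rc = posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);
        close(fd);
        return rc;
    }

For example, prefetch("data.bin", 0, 1 << 20) would warm the first megabyte of a (hypothetical) file before a latency-sensitive phase begins.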
Caching + throughput
- Assume that most reads and writes to disk are asynchronous
  - Dirty data can be buffered and written at the OS's leisure
  - Most reads hit in the RAM cache; most disk reads are read-ahead optimizations
- Key problem: how to optimally order pending disk I/O requests?
  - Hint: it isn't first-come, first-served
Another view of the problem
- Between the page cache and the disk, you have a queue of pending requests
- Requests are a tuple of (block #, read/write, buffer address)
- You can reorder these as you like to improve throughput
- What reordering heuristic to use, if any?
  - The heuristic is called the IO scheduler
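As a concrete (hypothetical) data structure, each pending request might be represented roughly like this; the IO scheduler's whole job is to decide the order in which entries leave the list:

    #include <stdbool.h>
    #include <stdint.h>

    /* One pending disk request: (block #, read/write, buffer address). */
    struct io_request {
        uint64_t block;            /* logical block number on the device */
        bool write;                /* false = read, true = write */
        void *buf;                 /* data source/destination in RAM */
        struct io_request *next;   /* link in the pending-request list */
    };

    /* The scheduler sees the whole list and may dequeue entries in any
     * order, subject to the safety constraints discussed later. */
    struct io_queue {
        struct io_request *head;
    };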
A simple disk model
- Disks are slow. Why? Moving parts << circuits
- Programming interface: simple array of sectors (blocks)
- Physical layout:
  - Concentric circular "tracks" of blocks on a platter
  - E.g., sectors 0-9 on the innermost track, 10-19 on the next track, etc.
  - The disk arm moves between tracks
  - The platter rotates under the disk head to align with the requested sector
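With the toy layout above (10 sectors per track), translating a logical sector number into a physical (track, rotational offset) position is just arithmetic; a sketch:

    #include <stdint.h>

    #define SECTORS_PER_TRACK 10   /* from the example: sectors 0-9 on track 0 */

    struct disk_pos {
        uint32_t track;    /* which concentric track (drives seek distance) */
        uint32_t offset;   /* sector index within the track (rotational position) */
    };

    static struct disk_pos sector_to_pos(uint64_t sector)
    {
        struct disk_pos p;
        p.track  = (uint32_t)(sector / SECTORS_PER_TRACK);
        p.offset = (uint32_t)(sector % SECTORS_PER_TRACK);
        return p;
    }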
Disk Model (figure): each block is on a sector; the disk spins at a constant speed and sectors rotate underneath the head; the disk head reads at the granularity of an entire sector
Disk Model (figure): concentric tracks; the disk head seeks between tracks; the gap between sectors 7 and 8 accounts for the seek time
Many Tracks (figure): a platter with many concentric tracks under the disk head
Several (~4) Platters (figure): platters spin together at the same speed; each platter has a head, and all heads seek together
Implications of multiple platters
- Blocks are actually striped across platters
- Example:
  - Sector 0 on platter 0
  - Sector 1 on platter 1, at the same position
  - Sector 2 on platter 2, sector 3 on platter 3, also at the same position
- 4 heads can read all 4 sectors simultaneously
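Under this striping scheme, the platter and the on-platter position follow directly from the sector number; a sketch assuming 4 platters:

    #include <stdint.h>

    #define NUM_PLATTERS 4

    /* Sector N lives on platter (N % 4) at on-platter position (N / 4),
     * so sectors 0-3 sit at the same position on platters 0-3 and the
     * four heads can read them in one rotation. */
    static void sector_to_platter(uint64_t sector,
                                  uint32_t *platter, uint64_t *pos)
    {
        *platter = (uint32_t)(sector % NUM_PLATTERS);
        *pos     = sector / NUM_PLATTERS;
    }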
3 key latencies
- I/O delay: time it takes to read/write a sector
- Rotational delay: time the disk head waits for the platter to rotate the desired sector under it
  - Note: the disk rotates continuously at a constant speed
- Seek delay: time the disk arm takes to move to a different track
Observations
- The latency of a given operation is a function of the current disk arm and platter position
  - Each request changes these values
- Idea: build a model of the disk
  - Maybe use delay values from measurement or manuals
  - Use simple math to evaluate the latency of each pending request
  - Greedy algorithm: always select the lowest-latency request
Example formula
- s = seek latency, in time/track
- r = rotational latency, in time/sector
- i = I/O latency, in seconds
- Time = (Δtracks * s) + (Δsectors * r) + i
- Note: Δsectors must factor in the head's rotational position after the seek is finished. Why? (The platter keeps spinning while the arm seeks.)
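A sketch of the greedy scheduler's cost model, following the formula above. The constants are illustrative, not taken from a real drive; the rotational term is computed from where the head will be after the seek completes, since the platter keeps spinning during the seek.

    #define SECTORS_PER_TRACK 10

    /* Illustrative per-unit delays, in milliseconds (made-up values). */
    static const double SEEK_PER_TRACK = 0.5;   /* s: time per track moved   */
    static const double ROT_PER_SECTOR = 0.1;   /* r: time per sector passed */
    static const double IO_DELAY       = 0.05;  /* i: transfer time          */

    /* Estimated latency to service sector `target`, given the head's
     * current track and rotational position. */
    static double estimate_latency(unsigned cur_track, unsigned cur_sector,
                                   unsigned target)
    {
        unsigned t_track  = target / SECTORS_PER_TRACK;
        unsigned t_sector = target % SECTORS_PER_TRACK;

        unsigned dtracks = (t_track > cur_track) ? t_track - cur_track
                                                 : cur_track - t_track;
        double seek = dtracks * SEEK_PER_TRACK;

        /* Advance the rotational position by the sectors that pass
         * under the head while the arm is seeking. */
        unsigned after_seek =
            (cur_sector + (unsigned)(seek / ROT_PER_SECTOR)) % SECTORS_PER_TRACK;
        unsigned dsectors =
            (t_sector + SECTORS_PER_TRACK - after_seek) % SECTORS_PER_TRACK;

        return seek + dsectors * ROT_PER_SECTOR + IO_DELAY;
    }

A greedy scheduler would evaluate this for every pending request and always service the minimum, which is exactly what leads to the starvation problem on the next slide.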
Problem with greedy?
- "Far" requests will starve
- The disk head may just hover around the "middle" tracks
Elevator Algorithm
- Require the disk arm to move in continuous "sweeps" in and out
- Reorder requests within a sweep
- Ex: if the disk arm is moving "out," reorder requests between the current track and the outside of the disk in ascending order (by block number)
- A request for a sector the arm has already passed must be ordered after the outermost request, in descending order
Elevator Algo, pt. 2
- This approach prevents starvation
  - Sectors at the "inside" or "outside" get service after a bounded time
- Reasonably good throughput
  - Sorts requests to minimize seek latency
  - Can get hit with rotational-latency pathologies (how?)
- Simple to code up!
- The programming model hides low-level details; fine-grained optimizations are difficult to do in practice
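A minimal sketch of how requests might be ordered for one outward sweep, assuming the block number is a reasonable proxy for track position (arrays instead of kernel lists, and arrivals during the sweep are ignored):

    #include <stdint.h>
    #include <stdlib.h>

    static int cmp_asc(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }
    static int cmp_desc(const void *a, const void *b) { return cmp_asc(b, a); }

    /* Order pending block numbers for an outward sweep from head_pos:
     * blocks at or beyond the head are served first, ascending; blocks
     * the head already passed go behind the outermost request, in
     * descending order (served on the sweep back in). */
    static void elevator_order(uint64_t *pending, size_t n, uint64_t head_pos)
    {
        size_t split = 0;
        for (size_t i = 0; i < n; i++)        /* partition: "ahead" first */
            if (pending[i] >= head_pos) {
                uint64_t tmp = pending[split];
                pending[split] = pending[i];
                pending[i] = tmp;
                split++;
            }
        qsort(pending, split, sizeof(uint64_t), cmp_asc);
        qsort(pending + split, n - split, sizeof(uint64_t), cmp_desc);
    }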
Pluggable Schedulers
- Linux allows the disk scheduler to be replaced
  - Just like the CPU scheduler
- Can choose a different heuristic that favors:
  - Fairness
  - Real-time constraints
  - Performance
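For reference, switching the scheduler on a running Linux system is just a write to sysfs; a sketch, where "sda" is a placeholder and the accepted names (e.g., noop/deadline/cfq on older kernels, mq-deadline/bfq/kyber/none on blk-mq kernels) depend on the kernel build:

    /* Equivalent to: echo deadline > /sys/block/sda/queue/scheduler
     * (needs root; "sda" and "deadline" are placeholders). */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/block/sda/queue/scheduler", "w");
        if (!f) {
            perror("open scheduler");
            return 1;
        }
        /* The kernel rejects names it does not recognize. */
        fputs("deadline\n", f);
        fclose(f);
        return 0;
    }

Reading the same file lists the available schedulers, with the active one shown in brackets.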
Completely Fair Queuing (CFQ)
- Idea: add a second layer of queues (one per process)
  - Round-robin promote requests from them to the "real" queue
- Goal: fairly distribute disk bandwidth among tasks
- Problems?
  - Overall throughput is likely reduced
  - Ping-pongs the disk head around
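A toy sketch of the per-process layer: each task keeps its own FIFO of requests, and a round-robin cursor decides whose request is promoted to the dispatch queue next (the real CFQ also time-slices the disk and honors I/O priorities):

    #include <stddef.h>

    #define MAX_TASKS 64

    struct io_request {            /* fields trimmed for this sketch */
        struct io_request *next;
    };

    static struct io_request *task_queue[MAX_TASKS];  /* one FIFO per task */
    static size_t rr_cursor;                          /* round-robin position */

    /* Promote one request from the next non-empty per-task queue. */
    static struct io_request *cfq_pick(void)
    {
        for (size_t tries = 0; tries < MAX_TASKS; tries++) {
            size_t t = (rr_cursor + tries) % MAX_TASKS;
            if (task_queue[t]) {
                struct io_request *r = task_queue[t];
                task_queue[t] = r->next;
                rr_cursor = (t + 1) % MAX_TASKS;   /* next task gets the next turn */
                return r;
            }
        }
        return NULL;   /* nothing pending */
    }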
Deadline Scheduler
- Associate expiration times with requests
- As requests get close to expiration, make sure they are dispatched
- Constrains reordering to ensure some forward progress
- Good for real-time applications
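One way to picture the mechanism: keep arrival-ordered requests with absolute deadlines, and before doing the usual seek-minimizing pick, check whether anything has expired. A rough sketch (the real Linux deadline scheduler keeps separate sector-sorted and FIFO lists, with different deadlines for reads and writes):

    #include <stddef.h>
    #include <time.h>

    struct dreq {
        unsigned long long block;
        time_t expires;        /* absolute deadline for this request */
        struct dreq *next;     /* FIFO (arrival) order */
    };

    /* Return the first expired request, if any; the caller services it
     * immediately instead of the seek-optimal choice. */
    static struct dreq *expired_request(struct dreq *fifo)
    {
        time_t now = time(NULL);
        for (struct dreq *r = fifo; r; r = r->next)
            if (r->expires <= now)
                return r;
        return NULL;   /* nothing urgent: fall back to throughput-optimal order */
    }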
Anticipatory Scheduler
- Idea: try to anticipate locality of requests
- If process P tends to issue bursts of requests for close disk blocks:
  - When you see a request from P, hold it in the disk queue for a while
  - See if more "nearby" requests come in
  - Then schedule all the requests at once, and coalesce adjacent requests
Optimizations at Cross-purposes
- The disk itself does some optimizations:
  - Caching: write requests can sit in a volatile cache for longer than expected
  - Reordering requests internally
    - Can't assume that requests are serviced in order
    - Dependent operations must wait until the first finishes
  - Remapping bad sectors to "spares"
    - Problem: the disk arm ends up flailing on an old disk with many remapped sectors
A note on safety
- In Linux, and other OSes, the I/O scheduler can reorder requests arbitrarily
- It is the file system's job to keep unsafe I/O requests out of the scheduling queues
Dangerous I/Os
- What can make an I/O request unsafe?
- File system bookkeeping has invariants on disk
  - Example: inodes point to file data blocks; a bitmap tracks which data blocks are free
- Updates must uphold these invariants
  - Ex: write an update to the inode, then to the bitmap
  - What if the system crashes between the writes? The block can end up in two files!
3 Simple Rules (courtesy of Ganger and McKusick, "Soft Updates" paper)
- Never write a pointer to a structure until it has been initialized
  - Ex: don't write a directory entry to disk until the inode it points to has been written to disk
- Never reuse a resource before nullifying all pointers to it
  - Ex: before re-allocating a block to a new file, write an update to the inode that previously referenced it
- Never reset the last pointer to a live resource before a new pointer has been set
  - Ex: renaming a file: write the new directory entry before removing the old one (better 2 links than none)
A note on safety, cont.
- It is the file system's job to keep unsafe I/O requests out of the scheduling queues
- While these constraints are simple, enforcing them in the average file system is surprisingly difficult
- Journaling helps by creating a log of what you are in the middle of doing, which can be replayed after a crash
- (Simpler) constraint: journal updates must go to disk before the FS updates they describe
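A user-space sketch of the journal-before-update ordering, using fsync as the barrier (a real file system enforces this at the block layer with flush/FUA requests rather than fsync; the descriptors, buffers, and offsets here are hypothetical):

    #include <sys/types.h>
    #include <unistd.h>

    /* The journal record must be durable before the in-place update is
     * issued, so a crash in between can be repaired by replaying the log. */
    static int ordered_update(int journal_fd, const void *logrec, size_t loglen,
                              int fs_fd, const void *update, size_t uplen,
                              off_t fs_off)
    {
        if (write(journal_fd, logrec, loglen) != (ssize_t)loglen)
            return -1;
        if (fsync(journal_fd) != 0)      /* barrier: log reaches stable storage */
            return -1;
        /* Only now is the in-place update safe to hand to the scheduler,
         * which may reorder it freely with other post-barrier writes. */
        if (pwrite(fs_fd, update, uplen, fs_off) != (ssize_t)uplen)
            return -1;
        return 0;
    }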
Disks aren't everything
- Flash is increasing in popularity
  - Different types with slight variations (NAND, NOR, etc.)
- No moving parts: who cares about block ordering anymore?
- Can only write to a block of flash ~100k times
  - Can read as much as you want
More in a Flash
- Flash reads are generally fast; writes are more expensive
- Prefetching has little benefit
  - Queuing optimizations can take longer than a read
- New issue: wear leveling; writes need to be evenly distributed across blocks
- Flash devices usually have a custom, log-structured FS
  - Groups random writes
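A toy sketch of the wear-leveling idea that a flash translation layer or log-structured flash FS implements: logical blocks map to physical blocks, and each write is redirected to the least-worn free physical block (garbage collection, out-of-place logging, and error handling are all omitted; the arrays and sizes are made up):

    #include <stdint.h>

    #define NBLOCKS  1024
    #define UNMAPPED UINT32_MAX   /* l2p[] must be initialized to this */

    static uint32_t l2p[NBLOCKS];          /* logical -> physical block map */
    static uint32_t erase_count[NBLOCKS];  /* wear per physical block */
    static uint8_t  in_use[NBLOCKS];       /* physical block holds live data? */

    /* Redirect a logical write to the least-worn free physical block.
     * Assumes at least one free block exists. */
    static uint32_t wear_level_write(uint32_t logical)
    {
        uint32_t best = UNMAPPED;
        for (uint32_t p = 0; p < NBLOCKS; p++)
            if (!in_use[p] &&
                (best == UNMAPPED || erase_count[p] < erase_count[best]))
                best = p;

        if (l2p[logical] != UNMAPPED)
            in_use[l2p[logical]] = 0;   /* old copy becomes garbage */

        erase_count[best]++;            /* erase-before-write wears the block */
        in_use[best] = 1;
        l2p[logical] = best;
        return best;                    /* caller writes the new data here */
    }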
Even newer hotness
- Byte-addressable, persistent RAMs (BPRAM)
  - Phase-Change Memory (PCM), memristors, etc.
- Splits the difference between RAM and flash:
  - Byte-granularity writes (vs. blocks)
  - Fast reads; slower, higher-energy writes
  - Doesn't need energy to hold state (no DRAM-style refresh)
  - Wear is an issue (bytes get stuck at their last value)
- Still in the lab, but getting close
Important research topic
- Most work on optimizing storage accesses is tailored to hard drives
- These heuristics are not easily adapted to new media
- Future systems will have a mix of disks, flash, PRAM, and DRAM
  - Does it even make sense to treat them all the same?
Summary
- Performance characteristics of disks, flash, and BPRAM
- Disk scheduling heuristics
- Safety constraints for file systems