Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs


  1. Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs
     Jinkyu Jeong, Sungkyunkwan University (SKKU)
     Source: Gyusun Lee et al., Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs, USENIX ATC '19
     NVRAMOS 2019

  2. Storage Performance Trends
     • Emerging ultra-low latency SSDs deliver I/Os in a few µs
     [Figure: latency (ns, log scale) by year, 1985 to 2017, for HDD, SSD, ULL-SSD (Samsung Z-SSD, Intel Optane SSD), DRAM and SRAM]
     Source: R. E. Bryant and D. R. O'Hallaron, Computer Systems: A Programmer's Perspective, Second Edition, Pearson Education, Inc., 2015

  3. Overhead of Kernel I/O Stack
     • Low-latency SSDs expose the overhead of the kernel I/O stack
     [Figure: normalized latency split into device, kernel and user time for SATA SSD, NVMe SSD, Z-SSD and Optane SSD; panels: Read, Write (+fsync)]

  4. Synchronous I/O vs. Asynchronous I/O
     [Figure: CPU and device timelines. Synchronous I/O: A (computation) on the CPU, then B (I/O) on the device. Asynchronous I/O: A split into A′ and A″ on the CPU, B on the device. Labels: total latency, throughput]
     • A″ is independent of B
     • Our idea: apply the asynchronous I/O concept to the I/O stack itself
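
The timing idea on this slide can be written as a small latency model. This is only a sketch under an assumed model (A″ fully overlaps with B, and nothing else contends for the CPU); the function names and the example durations are hypothetical and do not come from the slides.

```python
def sync_latency(a_us: float, b_us: float) -> float:
    """Synchronous model: computation A finishes before I/O B starts."""
    return a_us + b_us


def async_latency(a1_us: float, a2_us: float, b_us: float) -> float:
    """Asynchronous model: A' runs first, then A'' overlaps with B."""
    return a1_us + max(a2_us, b_us)


# Hypothetical durations in microseconds, chosen only for illustration.
A1, A2, B = 1.0, 2.0, 7.0

sync_total = sync_latency(A1 + A2, B)    # A and B run back to back
async_total = async_latency(A1, A2, B)   # A'' is hidden behind B
print(sync_total, async_total)
```

In this model the saving is bounded by the smaller of A″ and B, so the benefit grows with the share of the computation that is independent of the I/O.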

  5. Read Path Overview
     [Figure: vanilla read path. sys_read() → I/O stack operations on the CPU → device I/O → I/O stack operations on the CPU → return to user]
     [Figure: proposed read path. sys_read() → async. operations on the CPU alongside device I/O → return to user, with latency reduction]

  6. Write Path Overview
     [Figure: vanilla write path. sys_write() → buffered write on the CPU → return to user; timelines: CPU, device]

  7. Write Path Overview
     [Figure: vanilla fsync path. sys_fsync() → four rounds of I/O stack ops. on the CPU, three device I/Os → return to user]
     [Figure: proposed fsync path. sys_fsync() → two rounds of I/O stack ops. on the CPU, three device I/Os → return to user, with latency reduction]

  8. Agenda
     • Read path
       − Analysis of vanilla read path
       − Proposed read path
     • Light-weight block I/O layer
     • Write path
       − Analysis of vanilla write path
       − Proposed write path
     • Evaluation
     • Conclusion

  9. Analysis of Vanilla Read Path
     • Timeline: sys_read() → I/O submit → device I/O → interrupt → return to user
     • Layers: page cache, file system, block layer, device driver, interrupt handler
     • CPU operations
       − Page cache lookup 0.30 µs
       − Page allocation 0.19 µs
       − Page cache insertion 0.33 µs
       − LBA retrieval 0.09 µs
       − BIO submission 0.72 µs
       − DMA mapping 0.29 µs
       − NVMe I/O submission 0.37 µs
       − Context switch 0.95 µs
       − DMA unmapping 0.23 µs
       − Request completion 0.81 µs
       − Context switch 0.95 µs
       − Copy-to-user 0.21 µs
     • Device I/O 7.26 µs
     • Total latency 12.82 µs
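
As a quick arithmetic check, the per-operation costs on this slide can be summed and compared with the reported total. The sketch below only adds up the numbers shown; the variable names are hypothetical, and the assumption that the operations run strictly one after another is ours, not something the slide states.

```python
# Per-operation CPU costs from the slide, in microseconds.
CPU_OPS_US = [
    ("page cache lookup", 0.30),
    ("page allocation", 0.19),
    ("page cache insertion", 0.33),
    ("LBA retrieval", 0.09),
    ("BIO submission", 0.72),
    ("DMA mapping", 0.29),
    ("NVMe I/O submission", 0.37),
    ("context switch", 0.95),
    ("DMA unmapping", 0.23),
    ("request completion", 0.81),
    ("context switch", 0.95),
    ("copy-to-user", 0.21),
]
DEVICE_IO_US = 7.26
REPORTED_TOTAL_US = 12.82

cpu_total = sum(cost for _, cost in CPU_OPS_US)
listed_total = cpu_total + DEVICE_IO_US
unattributed = REPORTED_TOTAL_US - listed_total

print(f"CPU operations: {cpu_total:.2f} us")      # 5.44 us
print(f"CPU + device:   {listed_total:.2f} us")   # 12.70 us
print(f"Not itemized:   {unattributed:.2f} us")   # 0.12 us
```

The labeled items account for 12.70 µs of the 12.82 µs total; the slide does not itemize the remaining 0.12 µs.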

  10. Page Allocation / DMA Mapping
      • Page allocation 0.19 µs
      • DMA mapping 0.29 µs
      [Timeline: CPU; I/O submit → device I/O 7.26 µs → interrupt]

  11. Asynchronous Page Allocation / DMA Mapping
      • DMA-mapped page pool
        − Per-core pools (Core 0 … Core N), each with 64 DMA-mapped 4KB pages
        − Pagepool allocation
      • Page allocation 0.19 µs
      • DMA mapping 0.29 µs
      [Timeline: CPU; I/O submit → device I/O 7.26 µs → interrupt]

  12. Asynchronous Page Allocation / DMA Mapping
      • DMA-mapped page pool
        − Per-core pools (Core 0 … Core N), each with 64 DMA-mapped 4KB pages
        − Pagepool allocation
        − Page refill
      • Pagepool allocation 0.016 µs
      • Page allocation 0.19 µs
      • DMA mapping 0.29 µs
      [Timeline: CPU; I/O submit → device I/O 7.26 µs → interrupt]
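
The page pool can be pictured as a per-core free list that is drawn from on the critical path and topped up off it. The following is a toy user-space sketch, not the kernel implementation: `DmaPagePool`, `_alloc_and_map`, the integer page ids and the four-core setup are all hypothetical stand-ins, and only the pool size of 64 comes from the slide.

```python
from collections import deque

POOL_SIZE = 64  # pages per core, as on the slide


class DmaPagePool:
    """Toy per-core pool of pages that are already allocated and mapped."""

    def __init__(self, size=POOL_SIZE):
        self.size = size
        self.next_id = 0
        self.free = deque()
        self.refill()

    def _alloc_and_map(self):
        # Stand-in for the slow path: page allocation plus DMA mapping.
        page_id = self.next_id
        self.next_id += 1
        return page_id

    def refill(self):
        """Top the pool back up; meant to run while the device is busy."""
        added = 0
        while len(self.free) < self.size:
            self.free.append(self._alloc_and_map())
            added += 1
        return added

    def get(self):
        """Fast path: take a ready page, or fall back to the slow path."""
        if self.free:
            return self.free.popleft()
        return self._alloc_and_map()


pools = [DmaPagePool() for _ in range(4)]  # one pool per (hypothetical) core
page = pools[0].get()          # critical path: a single pop
refilled = pools[0].refill()   # off the critical path, during device I/O
```

Keeping one pool per core means the fast path needs no cross-core synchronization in this sketch; the fallback in `get` covers the case where the pool runs dry before a refill.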

  13. Page Cache Insertion
      • Page cache tree (root, leaf nodes, pages)
        − Page cache lookup overhead
        − Page cache tree extension overhead
      • Cache lookup miss → cache insertion success → make I/O request
      • Cache lookup hit → wait for page read
      • Prevents duplicated I/O requests for the same file index
      • Page cache insertion 0.33 µs
      [Timeline: CPU; I/O submit → device I/O 7.26 µs → interrupt]

  14. Lazy Page Cache Insertion
      • Page cache tree (root, leaf nodes, pages)
        − Page cache lookup overhead
        − Page cache tree extension overhead
      • Cache lookup miss → make I/O request → lazy cache insertion
        − Success: the page stays in the cache
        − Fail: page free
      • Two concurrent misses on the same file index lead to duplicated I/O requests (extremely low frequency)
      • Page cache insertion 0.35 µs
      [Timeline: CPU; I/O submit → device I/O 7.26 µs → interrupt]
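
The race this slide describes can be replayed step by step. This is a minimal single-threaded sketch with a dictionary standing in for the page cache tree; `PageCache`, `submit_io` and the readers "A" and "B" are hypothetical names, and the interleaving is written out by hand rather than produced by real concurrency.

```python
class PageCache:
    """Toy page cache keyed by file index."""

    def __init__(self):
        self.pages = {}

    def lookup(self, index):
        return self.pages.get(index)

    def insert(self, index, page):
        """Insert only if the index is still absent; report success."""
        if index in self.pages:
            return False
        self.pages[index] = page
        return True


def submit_io(index, log):
    # Stand-in for issuing the device request before any cache insertion.
    log.append(f"I/O submitted for index {index}")
    return bytearray(4096)  # the page the device will fill


cache, log, freed = PageCache(), [], []

# Readers A and B both miss on index 7 before either has inserted.
miss_a = cache.lookup(7) is None
miss_b = cache.lookup(7) is None

page_a = submit_io(7, log)   # A issues its request
page_b = submit_io(7, log)   # B issues a duplicated request

# Lazy insertion happens after submission; only the first one succeeds.
for name, page in (("A", page_a), ("B", page_b)):
    if not cache.insert(7, page):
        freed.append(name)   # the loser frees its page
```

Here both requests reach the device, reader A's page ends up in the cache, and reader B's insertion fails so its page is freed, which matches the fail branch on the slide.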

  15. DMA Unmapping
      • DMA unmapping 0.23 µs
      [Timeline: CPU; I/O submit → device I/O 7.26 µs → interrupt]

  16. Lazy DMA Unmapping
      • Implementation
        − Delays DMA unmapping until the system is idle or waiting for another I/O request
        − Extended version of the deferred protection scheme in Linux [ASPLOS '16]
        − Optionally disabled for safety
      • Lazy DMA unmapping 0.35 µs
      [Timeline: CPU; I/O submit → device I/O 7.26 µs → interrupt]
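
Deferring the unmap amounts to recording it at completion time and doing the work later. The sketch below is a toy model under that assumption: `LazyUnmapper`, its methods and the example addresses are hypothetical, and a Python set stands in for the live DMA mappings.

```python
class LazyUnmapper:
    """Toy model: unmaps are queued at completion and done when idle."""

    def __init__(self):
        self.mapped = set()   # addresses with a live DMA mapping
        self.pending = []     # unmaps that have been deferred

    def map(self, addr):
        self.mapped.add(addr)

    def complete_io(self, addr):
        """Completion path: note the unmap instead of performing it."""
        self.pending.append(addr)

    def on_idle(self):
        """Idle or waiting-for-I/O path: perform the deferred unmaps."""
        done = len(self.pending)
        for addr in self.pending:
            self.mapped.discard(addr)
        self.pending.clear()
        return done


u = LazyUnmapper()
for addr in (0x1000, 0x2000):
    u.map(addr)
    u.complete_io(addr)

still_mapped = len(u.mapped)   # mappings outlive the I/O until the flush
flushed = u.on_idle()
```

In this model the mappings stay live between completion and the idle-time flush, which is one way to read why the slide lists the feature as optionally disabled for safety.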

  17. Remaining Overheads in the Proposed Read Path
      • BIO submission 0.72 µs
      • NVMe I/O submission 0.37 µs
      • Request completion 0.81 µs
      [Timeline: CPU; I/O submit → device I/O 7.26 µs → interrupt]

  18. Agenda
      • Read path
        − Analysis of vanilla read path
        − Proposed read path
      • Light-weight block I/O layer
      • Write path
        − Analysis of vanilla write path
        − Proposed write path
      • Evaluation
      • Conclusion

  19. Linux Multi-queue Block I/O Layer
      • Structure conversion
        − Merge bio with pending request via I/O merging
        − Assign new tag & request and convert from bio
      • Multi-queue structure
        − Software staging queue (SW queue)
          ✓ Support I/O scheduling and reordering
        − Hardware dispatch queue (HW queue)
          ✓ Deliver the I/O request to the device driver
      • Multiple dynamic memory allocations
        − Bio (block layer)
        − NVMe iod, scatter/gather list, NVMe PRP* list (device driver)
      [Figure: Linux multi-queue block layer. submit_bio() → bio (LBA, length, page(s), …) → request (length, bio(s)) with tag → per-core SW queues → HW queues → device driver (iod, sg_list, prp_list, NVMe CMD) → NVMe queue pairs]
      *PRP: physical region page

  20. Linux Multi-queue Block I/O Layer
      • Structure conversion
        − Inefficiency of I/O merging [Zhang, OSDI '18]
          ✓ Useful feature for low-performance storage devices
      • Multi-queue structure
      • Multiple dynamic memory allocations
      [Figure: Linux multi-queue block layer. submit_bio() → bio (LBA, length, page(s), …) → request (length, bio(s)) with tag → per-core SW queues → HW queues → device driver (iod, sg_list, prp_list, NVMe CMD) → NVMe queue pairs]

  21. Linux Multi-queue Block I/O Layer
      • Structure conversion
        − Inefficiency of I/O merging [Zhang, OSDI '18]
          ✓ Useful feature for low-performance storage devices
      • Multi-queue structure
        − Inefficiency of I/O scheduling for low-latency SSDs [Saxena, ATC '10] [Xu, SYSTOR '15]
          ✓ Default configuration is noop scheduler
        − Bypass multi-queue structure [Zhang, OSDI '18]
        − Device-side I/O scheduling [Peter, OSDI '14] [Joshi, HotStorage '17]
      • Multiple dynamic memory allocations
      [Figure: Linux multi-queue block layer. submit_bio() → bio (LBA, length, page(s), …) → request (length, bio(s)) with tag → per-core SW queues → HW queues → device driver (iod, sg_list, prp_list, NVMe CMD) → NVMe queue pairs]

  22. Light-weight Block I/O Layer
      • Light-weight bio (lbio) structure
        − Contains only the essential arguments to make an NVMe I/O request
        − Eliminates unnecessary structure conversions and allocations
      • Per-CPU lbio pool
        − Supports lockless lbio object allocation
        − Supports tagging function
      • Single dynamic memory allocation
        − NVMe PRP* list (device driver)
      [Figure: light-weight block layer. submit_lbio() → lbio (LBA, length, prp_list, page(s), dma_addr(s)) from the per-CPU lbio pool, with pages from the DMA-mapped page pool (Core 0 … Core n) → device driver (tag, prp_list, NVMe CMD) → NVMe queue pairs]
      *PRP: physical region page
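
One way to picture the per-CPU lbio pool is a fixed array of preallocated objects whose slot index doubles as the tag. This is a sketch under that assumption only: the slide names the lbio fields (LBA, length, prp_list, page(s), dma_addr(s)) but not how tags are derived, and `Lbio`, `LbioPool`, `alloc`, `complete` and the pool size are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Lbio:
    """Toy lbio carrying the fields named on the slide, plus a tag."""
    tag: int
    lba: int = 0
    length: int = 0
    prp_list: list = field(default_factory=list)
    pages: list = field(default_factory=list)
    dma_addrs: list = field(default_factory=list)
    in_use: bool = False


class LbioPool:
    """Toy per-CPU pool: only the owning CPU touches it, so no lock."""

    def __init__(self, size):
        self.slots = [Lbio(tag=i) for i in range(size)]  # allocated once
        self.free_tags = list(range(size - 1, -1, -1))   # pop() yields 0 first

    def alloc(self, lba, length):
        if not self.free_tags:
            return None                  # pool exhausted
        lbio = self.slots[self.free_tags.pop()]
        lbio.lba, lbio.length, lbio.in_use = lba, length, True
        return lbio

    def complete(self, tag):
        """Find the lbio from the completion's tag and recycle it."""
        lbio = self.slots[tag]
        lbio.in_use = False
        self.free_tags.append(tag)
        return lbio


pool = LbioPool(size=2)          # hypothetical, tiny pool for the example
first = pool.alloc(lba=100, length=8)
second = pool.alloc(lba=200, length=8)
third = pool.alloc(lba=300, length=8)    # None: no free slot left
done = pool.complete(first.tag)
```

Because the objects are created once and reused, the submission path in this sketch performs no per-request allocation for the lbio itself, and the tag gives a constant-time lookup at completion.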

  23. Read Path Comparison
      • Proposed read path (before applying light-weight block I/O layer)
        − BIO submission 0.72 µs
        − NVMe I/O submission 0.37 µs
        − Request completion 0.81 µs
        − Device I/O 7.26 µs
      • Proposed read path
        − LBIO submission 0.13 µs
        − LBIO completion 0.65 µs
        − Device I/O 7.26 µs
        − Latency reduction

  24. Read Path Comparison
      • Vanilla read path: device I/O 7.26 µs, total latency 12.82 µs
      • Proposed read path: device I/O 7.26 µs, total latency 10.10 µs (latency reduction)
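
The size of the reduction follows directly from the two totals on this slide. The short calculation below uses only those numbers and the device I/O time; treating "total minus device I/O" as the non-device share is our simplification, and the variable names are hypothetical.

```python
VANILLA_US = 12.82
PROPOSED_US = 10.10
DEVICE_IO_US = 7.26

saved_us = VANILLA_US - PROPOSED_US
total_reduction = saved_us / VANILLA_US

# Time spent outside the device, assuming total = device I/O + the rest.
vanilla_rest = VANILLA_US - DEVICE_IO_US
proposed_rest = PROPOSED_US - DEVICE_IO_US
rest_reduction = (vanilla_rest - proposed_rest) / vanilla_rest

print(f"saved {saved_us:.2f} us ({total_reduction:.1%} of total)")
print(f"non-device time: {vanilla_rest:.2f} -> {proposed_rest:.2f} us "
      f"({rest_reduction:.1%} lower)")
```

Under that simplification, the end-to-end read latency drops by about 21%, while the time spent outside the device drops by roughly half, since the device I/O time itself is unchanged.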

  25. Agenda
      • Read path
        − Analysis of vanilla read path
        − Proposed read path
      • Light-weight block I/O layer
      • Write path
        − Analysis of vanilla write path
        − Proposed write path
      • Evaluation
      • Conclusion

  26. Analysis of Vanilla Fsync Path (Ext4 Ordered Mode)
      • CPU (sys_fsync() → return to user)
        − Data writeback 5.68 µs, then data block submit
        − jbd2 call 0.80 µs
      • jbd2
        − Journal block preparation 5.55 µs (allocating buffer pages, allocating journal area block, checksum computation, …), then journal block submit
        − Commit block preparation 2.15 µs, then flush & commit block submit
      • Device
        − Data block I/O 12.57 µs
        − Journal block I/O 12.73 µs
        − Commit block I/O 10.72 µs
