  1. Optimizing Memory-mapped I/O for Fast Storage Devices
  Anastasios Papagiannis 1,2, Giorgos Xanthakis 1,2, Giorgos Saloustros 1, Manolis Marazakis 1, and Angelos Bilas 1,2
  Foundation for Research and Technology – Hellas (FORTH) 1 & University of Crete 2
  USENIX ATC 2020

  2. Fast storage devices
  • Fast storage devices → Flash, NVMe
  • Millions of IOPS
  • < 10 μs access latency
  • Small I/Os are not as big an issue as with rotational disks
  • Require many outstanding I/Os for peak throughput

  3. Read/write system calls
  (Figure: a DRAM cache placed either in kernel space or in user space, between the application and the device)
  • Read/write system calls + DRAM cache
    • Reduce accesses to the device
  • Kernel-space cache
    • Requires system calls also for hits
    • Used for raw (serialized) blocks
  • User-space cache
    • Lookups for hits + system calls only for misses (sketch below)
    • Application-specific (deserialized) data
  • A user-space cache removes system calls for hits
  • Hit lookups in user space introduce significant overhead [SIGMOD’08]
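A minimal user-space sketch of the cache-in-user-space approach (the direct-mapped policy, sizes, and all names are illustrative, not from the paper): hits are served from application memory with a lookup, and only misses pay the cost of a pread() system call.

/* Illustrative user-space block cache: hits avoid system calls entirely,
 * misses fall back to one pread() on the underlying file or device. */
#include <stdint.h>
#include <unistd.h>

#define BLOCK_SIZE  4096
#define CACHE_SLOTS 1024                 /* direct-mapped, for simplicity */

struct cache_slot {
    uint64_t block;                      /* block number currently cached */
    int      valid;
    char     data[BLOCK_SIZE];
};

static struct cache_slot cache[CACHE_SLOTS];

/* Return a pointer to the cached block, filling the slot on a miss. */
static char *cache_read(int fd, uint64_t block)
{
    struct cache_slot *slot = &cache[block % CACHE_SLOTS];

    if (slot->valid && slot->block == block)
        return slot->data;               /* hit: served without a system call */

    /* miss: one pread() system call to fetch the block */
    if (pread(fd, slot->data, BLOCK_SIZE, (off_t)(block * BLOCK_SIZE)) != BLOCK_SIZE)
        return NULL;
    slot->block = block;
    slot->valid = 1;
    return slot->data;
}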

  4. Memory-mapped I/O
  • In memory-mapped I/O (mmio), hits are handled in hardware → MMU + TLB
    • Less overhead compared to a cache lookup
  • In mmio a file is mapped into the virtual address space (see the sketch below)
    • Load/store processor instructions access the data
    • The kernel fetches/evicts pages on demand
  • Additionally, mmio removes
    • Serialization/deserialization
    • Memory copies between user and kernel
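The mmio access pattern from this slide, as a minimal user-space sketch (the file name data.bin and the use of MAP_SHARED are illustrative assumptions): after mmap(), data is read and updated with plain loads and stores, and the kernel pages it in and out on demand.

/* Map a file and access it with loads/stores instead of read()/write(). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);           /* illustrative file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* MAP_SHARED: stores reach the file once the kernel writes pages back. */
    char *buf = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    char c = buf[0];            /* load: page fault on a miss, plain TLB hit otherwise */
    buf[0] = c + 1;             /* store: the page is marked dirty for later writeback */

    munmap(buf, st.st_size);    /* tear down the mapping */
    close(fd);
    return 0;
}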

  5. Disadvantages of mmio
  • Misses require a page fault instead of a system call
  • 4KB page size → small & random I/Os
    • With fast storage devices this is not a big issue
  • The Linux mmio path fails to scale with #threads

  6. Mmio path scalability
  (Figure: million page-faults/sec (IOPS) vs. #threads (1–32) for Linux-Read and Linux-Write; device: null_blk, dataset: 4TB, DRAM cache: 192GB)

  7. Mmio path scalability
  (Figure: million page-faults/sec (IOPS) vs. #threads (1–32) for Linux-Read and Linux-Write on kernels 4.14 and 5.4; device: null_blk, dataset: 4TB, DRAM cache: 192GB; annotated peaks of ≈2M and ≈1.3M IOPS, queue depth ≈ 27)

  8. FastMap
  • A novel mmio path that achieves high scalability and I/O concurrency
    • In the Linux kernel
  • Avoids all centralized contention points
  • Reduces CPU processing in the common path
  • Uses dedicated data structures to minimize interference

  9. Mmio path scalability
  (Figure: same experiment as before with FastMap added: million page-faults/sec (IOPS) vs. #threads (1–32) for Linux-Read/Write (4.14 and 5.4) and FastMap-Read/Write; FastMap achieves ≈3x more IOPS in reads and ≈6x in writes)

  10. Outline
  • Introduction
  • Motivation
  • FastMap design
  • Experimental analysis
  • Conclusions

  11. Outline
  • Introduction
  • Motivation
  • FastMap design
  • Experimental analysis
  • Conclusions

  12. FastMap design: 3 main techniques
  • Separates data structures that keep clean and dirty pages
    • Avoids all centralized contention points
  • Optimizes reverse mappings
    • Reduces CPU processing in the common path
  • Uses a scalable DRAM cache
    • Minimizes interference and reduces latency variability

  13. FastMap design: 3 main techniques
  • Separates data structures that keep clean and dirty pages
    • Avoids all centralized contention points
  • Optimizes reverse mappings
    • Reduces CPU processing in the common path
  • Uses a scalable DRAM cache
    • Minimizes interference and reduces latency variability

  14. Linux mmio design
  (Figure: the VMA's address_space with a single page_tree protected by tree_lock; callouts: 126x contended lock acquisitions, 155x more page wait time)
  • tree_lock is acquired for 2 main reasons
    • Insert/remove elements from page_tree (lookups are lock-free via RCU)
    • Modify tags for a specific entry → used to mark a page dirty

  15. FastMap design
  (Figure: the VMA points to a PFD that holds N per-core page_tree and dirty_tree instances, indexed 0…N-1)
  • Keep dirty pages in a separate data structure
    • Marking a page dirty/clean does not serialize insert/remove ops
  • Choose the data structure based on page_offset % num_cpus (see the sketch below)
  • Radix trees keep ALL cached pages → lock-free (RCU) lookups
  • Red-black trees keep ONLY dirty pages → sorted by device offset
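A user-space sketch of the partitioning idea, not FastMap's actual kernel code (the partition count, lock types, and all names are illustrative): the per-core structure is chosen by page_offset % NUM_TREES, and dirty-page tracking has its own lock, so marking a page dirty does not serialize inserts/removals of cached pages.

/* User-space sketch: per-core partitions separate "all cached pages" from
 * "dirty pages", each with its own lock, chosen by page_offset % NUM_TREES. */
#include <pthread.h>
#include <stddef.h>

#define NUM_TREES 32                      /* e.g. one partition per core */

struct partition {
    pthread_mutex_t page_lock;            /* guards insert/remove of cached pages */
    void           *page_tree;            /* stand-in for a radix tree (RCU lookups) */
    pthread_mutex_t dirty_lock;           /* guards the dirty-page structure */
    void           *dirty_tree;           /* stand-in for an rb-tree sorted by device offset */
};

static struct partition parts[NUM_TREES];

static void parts_init(void)
{
    for (int i = 0; i < NUM_TREES; i++) {
        pthread_mutex_init(&parts[i].page_lock, NULL);
        pthread_mutex_init(&parts[i].dirty_lock, NULL);
    }
}

static inline struct partition *part_for(size_t page_offset)
{
    return &parts[page_offset % NUM_TREES];
}

/* Marking a page dirty takes only this partition's dirty_lock ... */
static void mark_page_dirty(size_t page_offset)
{
    struct partition *p = part_for(page_offset);
    pthread_mutex_lock(&p->dirty_lock);
    /* ... insert page_offset into p->dirty_tree, kept sorted by device offset ... */
    pthread_mutex_unlock(&p->dirty_lock);
}

/* ... while caching a new page takes only this partition's page_lock,
 * so the two operations never contend on a single, global tree_lock. */
static void cache_page(size_t page_offset)
{
    struct partition *p = part_for(page_offset);
    pthread_mutex_lock(&p->page_lock);
    /* ... insert page_offset into p->page_tree ... */
    pthread_mutex_unlock(&p->page_lock);
}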

  16. FastMap design: 3 main techniques
  • Separates data structures that keep clean and dirty pages
    • Avoids all centralized contention points
  • Optimizes reverse mappings
    • Reduces CPU processing in the common path
  • Uses a scalable DRAM cache
    • Minimizes interference and reduces latency variability

  17. Reverse mappings
  • Find out which page table entries map a specific page
    • Page eviction → due to memory pressure or explicit writeback
    • Destroy mappings → munmap
  • Linux uses object-based reverse mappings
    • Executables and libraries (e.g. libc) introduce a large amount of sharing
    • Reduces DRAM consumption and housekeeping costs
  • Storage applications that use memory-mapped I/O
    • Require minimal sharing
    • Can be applied selectively to certain devices or files

  18. Linux object-based reverse mappings
  (Figure: each page points to its address_space; i_mmap, protected by a read/write semaphore, lists the VMAs that map the file, each leading to a PGD to walk; page->_mapcount counts mappings)
  • _mapcount can still result in useless page table traversals
  • The rw-semaphore is acquired as a reader on all operations (see the sketch below)
    • Cross NUMA-node traffic
    • Spends many CPU cycles
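A simplified illustration of the object-based walk (the types here are stand-ins, not the real Linux definitions): finding the PTEs for a page means taking the file object's rw-semaphore as a reader and probing every VMA that maps the file, even though with little sharing most probes find nothing.

/* Simplified object-based reverse-mapping walk (illustrative types only). */
#include <pthread.h>

struct vma;                         /* one mapping of the file into a process */

struct file_object {                /* simplified stand-in for address_space  */
    pthread_rwlock_t i_mmap_sem;    /* taken as a reader on every rmap walk   */
    struct vma     **i_mmap;        /* all VMAs that map this file            */
    int              nr_vmas;
};

struct cached_page {
    struct file_object *mapping;
    int _mapcount;                  /* how many PTEs map this page            */
};

/* Visit every VMA that *might* map `page`; the caller walks that VMA's page
 * tables. With little sharing most visits find nothing, which is the wasted
 * work (and shared-lock traffic) that FastMap's full reverse mappings avoid. */
static void rmap_walk_object(struct cached_page *page,
                             void (*walk_vma)(struct vma *, struct cached_page *))
{
    struct file_object *obj = page->mapping;

    pthread_rwlock_rdlock(&obj->i_mmap_sem);    /* shared reader lock */
    for (int i = 0; i < obj->nr_vmas; i++)
        walk_vma(obj->i_mmap[i], page);
    pthread_rwlock_unlock(&obj->i_mmap_sem);
}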

  19. FastMap full reverse mappings
  (Figure: each page keeps the (VMA, vaddr) pairs that map it)
  • Full reverse mappings (see the sketch below)
    • Reduce CPU overhead
    • Efficient munmap
    • No ordering required → scalable updates
  • More DRAM required per-core
    • Limited by the small degree of sharing in pages
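A minimal sketch of the full reverse mapping idea (types and the list layout are illustrative; synchronization is omitted): each cached page records exactly which (VMA, vaddr) pairs map it, so eviction and munmap visit only those entries instead of probing every VMA of the file, at the cost of extra DRAM per mapping.

#include <stdlib.h>

struct vma;                          /* a mapping of the file into a process  */

struct rmap_entry {                  /* one mapping of this particular page   */
    struct vma        *vma;
    unsigned long      vaddr;        /* virtual address within that VMA       */
    struct rmap_entry *next;
};

struct cached_page {
    struct rmap_entry *rmaps;        /* usually 0 or 1 entries: little sharing */
};

/* Record a new mapping; entries need no global ordering, so updates can be
 * kept scalable (locking omitted in this sketch). */
static int rmap_add(struct cached_page *page, struct vma *vma, unsigned long vaddr)
{
    struct rmap_entry *e = malloc(sizeof(*e));
    if (!e)
        return -1;
    e->vma = vma;
    e->vaddr = vaddr;
    e->next = page->rmaps;           /* per-page list; extra DRAM per mapping */
    page->rmaps = e;
    return 0;
}

/* Eviction/munmap walks only the recorded mappings of this page. */
static void rmap_for_each(struct cached_page *page,
                          void (*unmap)(struct vma *, unsigned long))
{
    for (struct rmap_entry *e = page->rmaps; e; e = e->next)
        unmap(e->vma, e->vaddr);
}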

  20. FastMap design: 3 main techniques
  • Separates data structures that keep clean and dirty pages
    • Avoids all centralized contention points
  • Optimizes reverse mappings
    • Reduces CPU processing in the common path
  • Uses a scalable DRAM cache
    • Minimizes interference and reduces latency variability

  21. Batched TLB invalidations
  • Under memory pressure FastMap evicts a batch of clean pages
    • Cache-related operations
    • Page table cleanup
    • TLB invalidation
  • TLB invalidation requires an IPI (Inter-Processor Interrupt)
    • Limits scalability [EuroSys’13, USENIX ATC’17, EuroSys’20]
  • Single TLB invalidation for the whole batch
    • Converting the batch to a range may include unnecessary invalidations (see the sketch below)
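A user-space sketch of the batching idea (the batch contents and helper names are illustrative): the evicted pages are collapsed into one covering virtual address range so that a single range invalidation, e.g. one flush_tlb_mm_range() call in the kernel, replaces per-page IPIs, accepting that some addresses inside the range were never evicted.

/* Collapse a batch of evicted page addresses into one covering range,
 * so a single range TLB invalidation (one IPI round) replaces per-page ones. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

struct flush_range {
    uintptr_t start;    /* inclusive */
    uintptr_t end;      /* exclusive */
};

static struct flush_range batch_to_range(const uintptr_t *pages, size_t n)
{
    struct flush_range r = { UINTPTR_MAX, 0 };

    for (size_t i = 0; i < n; i++) {
        if (pages[i] < r.start)
            r.start = pages[i];
        if (pages[i] + PAGE_SIZE > r.end)
            r.end = pages[i] + PAGE_SIZE;
    }
    return r;           /* may also cover pages that were not evicted */
}

int main(void)
{
    /* Illustrative batch of evicted, page-aligned virtual addresses. */
    uintptr_t batch[] = { 0x7f0000001000, 0x7f0000008000, 0x7f0000003000 };
    struct flush_range r = batch_to_range(batch, 3);

    /* In the kernel this becomes one range flush, e.g. flush_tlb_mm_range(). */
    printf("flush [%#lx, %#lx)\n", (unsigned long)r.start, (unsigned long)r.end);
    return 0;
}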

  22. Other optimizations in the paper
  • DRAM cache
  • Eviction/writeback operations
  • Implementation details

  23. Outline
  • Introduction
  • Motivation
  • FastMap design
  • Experimental analysis
  • Conclusions

  24. Testbed
  • 2x Intel Xeon E5-2630 v3 CPUs (2.4GHz)
    • 32 hyper-threads
  • Different devices
    • Intel Optane SSD DC P4800X (375GB) in workloads
    • null_blk in microbenchmarks
  • 256 GB of DDR4 DRAM
  • CentOS v7.3 with Linux 4.14.72

  25. Workloads
  • Microbenchmarks
  • Storage applications
    • Kreon [ACM SoCC’18] – persistent key-value store (YCSB)
    • MonetDB – column-oriented DBMS (TPC-H)
  • Extend available DRAM over fast storage devices
    • Silo [SOSP’13] – key-value store with scalable transactions (TPC-C)
    • Ligra [PPoPP’13] – graph algorithms (BFS)

  26. FastMap Scalability
  (Figure: million page-faults/sec (IOPS) vs. #threads (1–80) on 4x Intel Xeon E5-4610 v3 CPUs (1.7 GHz), 80 hyper-threads, for FastMap-Rd-SPF, FastMap-Wr-SPF, FastMap-Rd, FastMap-Wr, mmap-Rd, mmap-Wr; FastMap reaches up to 11.8x the IOPS of mmap; other annotations: 37.4%, 32%, 25.4%, 7.6%)

  27. FastMap execution time breakdown
  (Figure: #samples (x1000) broken down into mark_dirty, address-space, page-fault, and other, for mmap-Read, mmap-Write, FastMap-Read, and FastMap-Write)

  28. Kreon key-value store
  • Persistent key-value store based on LSM-tree
  • Designed to use memory-mapped I/O in the common path
  • YCSB with 80M records
    • 80GB dataset
    • 16GB DRAM

  29. Kreon – 100% inserts
  (Figure: execution time (sec) vs. #cores (1–32) for FastMap and mmap, broken down into idle, iowait, kworker, pgfault, pthread, others, ycsb, and kreon; annotated speedup of 3.2x)

  30. Kreon – 100% lookups
  (Figure: execution time (sec) vs. #cores (1–32) for FastMap and mmap with the same breakdown; annotated speedup of 1.5x)

  31. Batched TLB invalidations (Silo key-value store, TPC-C)
  • TLB batching results in 25.5% more TLB misses
  • Improvement comes from fewer IPIs
    • 24% higher throughput
    • 23.8% lower average latency
  • Less time spent in flush_tlb_mm_range()
    • 20.3% → 0.1%

  32. Conclusions
  • FastMap, an optimized mmio path in Linux
    • Scalable with number of threads & low CPU overhead
  • FastMap has significant benefits for data-intensive applications
    • Fast storage devices
    • Multi-core servers
  • Up to 11.8x more IOPS with 80 cores and null_blk
  • Up to 5.2x more IOPS with 32 cores and Intel Optane SSD
