Optimizing Memory-mapped I/O for Fast Storage Devices
Anastasios Papagiannis (1,2), Giorgos Xanthakis (1,2), Giorgos Saloustros (1), Manolis Marazakis (1), and Angelos Bilas (1,2)
(1) Foundation for Research and Technology – Hellas (FORTH), (2) University of Crete
USENIX ATC 2020
Fast storage devices
• Fast storage devices → Flash, NVMe
• Millions of IOPS
• < 10 μs access latency
• Small I/Os are not as big an issue as with rotational disks
• Require many outstanding I/Os for peak throughput
Read/write system calls
[Diagram: cache placement in the I/O path — user space, kernel space, device]
• Read/write system calls + DRAM cache → reduce accesses to the device
• Kernel-space cache
  • Requires system calls also for hits
  • Used for raw (serialized) blocks
• User-space cache
  • Lookups for hits + system calls only for misses
  • Application-specific (deserialized) data
• A user-space cache removes system calls for hits
• However, hit lookups in user space introduce significant overhead [SIGMOD'08] (see the sketch below)
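To illustrate this trade-off, here is a minimal sketch of a user-space block cache layered over pread(): hits avoid the system call but still pay for a lookup on every access, which is the overhead referred to above. The names, file path, and cache geometry are illustrative assumptions, not code from the paper.

```c
/* Minimal sketch of a user-space block cache over pread(): hits are served
 * from a direct-mapped DRAM cache without a system call, misses fall back
 * to pread(). All names and sizes are illustrative. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLK_SIZE 4096
#define NSLOTS   1024                    /* 4 MB cache, direct-mapped */

struct slot { int64_t blkno; char data[BLK_SIZE]; };
static struct slot cache[NSLOTS];

/* Read one 4 KB block; returns 0 on success, -1 on I/O error. */
static int cached_read(int fd, int64_t blkno, char *out)
{
    struct slot *s = &cache[blkno % NSLOTS];
    if (s->blkno == blkno) {             /* hit: lookup only, no syscall */
        memcpy(out, s->data, BLK_SIZE);
        return 0;
    }
    if (pread(fd, s->data, BLK_SIZE, blkno * BLK_SIZE) != BLK_SIZE)
        return -1;                       /* miss: one system call */
    s->blkno = blkno;
    memcpy(out, s->data, BLK_SIZE);
    return 0;
}

int main(void)
{
    int fd = open("/tmp/datafile", O_RDONLY);
    if (fd < 0)
        return 1;
    for (int i = 0; i < NSLOTS; i++)
        cache[i].blkno = -1;
    char buf[BLK_SIZE];
    cached_read(fd, 0, buf);             /* miss: pread() */
    cached_read(fd, 0, buf);             /* hit: served from DRAM */
    close(fd);
    return 0;
}
```

Even on the hit path, every access pays for the index computation and tag check; with a real hash table or tree the lookup cost is higher still, which is what the [SIGMOD'08] observation refers to.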
Memory-mapped I/O
• In memory-mapped I/O (mmio), hits are handled in hardware → MMU + TLB
  • Less overhead compared to a cache lookup
• In mmio a file is mapped to the virtual address space
  • Load/store processor instructions to access data
  • The kernel fetches/evicts pages on demand
• Additionally, mmio removes
  • Serialization/deserialization
  • Memory copies between user and kernel
(a minimal mmap example follows)
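A minimal user-space example of memory-mapped file access, assuming a pre-existing file at an illustrative path; this is a generic POSIX sketch, not code from the paper.

```c
/* Minimal sketch of file access through memory-mapped I/O: the file is
 * mapped once, then accessed with plain loads/stores; the kernel pages
 * data in and out on demand. File name is illustrative. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/datafile", O_RDWR);
    if (fd < 0)
        return 1;

    struct stat st;
    if (fstat(fd, &st) < 0)
        return 1;

    char *data = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (data == MAP_FAILED)
        return 1;

    /* A load: may trigger a page fault (a "miss") the first time;
     * later accesses to the same page hit in the MMU/TLB. */
    char first = data[0];

    /* A store: the page is marked dirty and written back later. */
    data[0] = first + 1;

    /* Optionally force write-back of the dirty range. */
    msync(data, st.st_size, MS_SYNC);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```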
Disadvantages of mmio
• Misses require a page fault instead of a system call
• 4KB page size → small & random I/Os
  • With fast storage devices this is not a big issue
• The Linux mmio path fails to scale with the number of threads
Mmio path scalability
[Figure: million page-faults/sec (IOPS) vs. number of threads (1–32) for Linux-Read and Linux-Write; device: null_blk, dataset: 4TB, DRAM cache: 192GB]
Mmio path scalability
[Figure: same setup (null_blk, 4TB dataset, 192GB DRAM cache), for Linux kernels 4.14 and 5.4; reads peak at about 2M IOPS and writes at about 1.3M IOPS, with an average queue depth of ≈ 27]
FastMap
• A novel mmio path that achieves high scalability and I/O concurrency
  • In the Linux kernel
• Avoids all centralized contention points
• Reduces CPU processing in the common path
• Uses dedicated data structures to minimize interference
Mmio path scalability
[Figure: same setup (null_blk, 4TB dataset, 192GB DRAM cache), now including FastMap-Read and FastMap-Write; FastMap achieves about 3x higher IOPS for reads and about 6x for writes compared to Linux]
Outline
• Introduction
• Motivation
• FastMap design
• Experimental analysis
• Conclusions
FastMap design: 3 main techniques
• Separates data structures that keep clean and dirty pages
  • Avoids all centralized contention points
• Optimizes reverse mappings
  • Reduces CPU processing in the common path
• Uses a scalable DRAM cache
  • Minimizes interference and reduces latency variability
Linux mmio design
[Diagram: address_space with its page_tree protected by tree_lock, pages, and the VMA; profiling shows 126x more contended lock acquisitions and 155x more page wait time]
• tree_lock is acquired for 2 main reasons
  • Insert/remove elements from page_tree (lookups are lock-free via RCU)
  • Modify tags for a specific entry → used to mark a page dirty
(a simplified sketch of this serialization follows)
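A simplified sketch of the contention, modeled loosely on the Linux 4.14 page cache but using illustrative structure names rather than actual kernel code: both publishing a newly fetched page and tagging a cached page dirty take the same per-file spinlock, so faulting and dirtying threads serialize on it.

```c
/* Simplified sketch (not actual kernel code) of why a single per-file
 * tree_lock serializes the Linux mmio path. */
#include <linux/radix-tree.h>
#include <linux/spinlock.h>

struct my_address_space {
    spinlock_t             tree_lock;   /* single lock per mapped file */
    struct radix_tree_root page_tree;   /* all cached pages */
};

#define MY_TAG_DIRTY 0

/* Fault path: a missing page was read from the device; publish it. */
static int insert_page(struct my_address_space *m, unsigned long idx, void *page)
{
    int err;

    spin_lock(&m->tree_lock);
    err = radix_tree_insert(&m->page_tree, idx, page);
    spin_unlock(&m->tree_lock);
    return err;
}

/* Write-fault path: an already cached page is being written; mark it dirty. */
static void mark_dirty(struct my_address_space *m, unsigned long idx)
{
    spin_lock(&m->tree_lock);            /* same lock as insert_page() */
    radix_tree_tag_set(&m->page_tree, idx, MY_TAG_DIRTY);
    spin_unlock(&m->tree_lock);
}
```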
FastMap design
[Diagram: a VMA points to a PFD structure holding per-core page_tree_0 … page_tree_{N-1} and dirty_tree_0 … dirty_tree_{N-1}]
• Keep dirty pages in a separate data structure
  • Marking a page dirty/clean does not serialize insert/remove ops
• Choose the data structure based on page_offset % num_cpus
• Radix trees keep ALL cached pages → lock-free (RCU) lookups
• Red-black trees keep ONLY dirty pages → sorted by device offset
(a minimal sketch of the partitioning follows)
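A simplified sketch of the partitioning idea, assuming per-core locks and structures; structure and function names are illustrative and this is not FastMap's actual implementation. The per-core structure is selected by page_offset % num_cpus, and dirty marking uses a different tree (and lock) than page insertion.

```c
/* Simplified sketch (not FastMap's code) of per-core clean/dirty trees.
 * Initialization of locks and trees is omitted. */
#include <linux/radix-tree.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>

struct dirty_node {
    struct rb_node node;
    unsigned long  offset;          /* device offset, keeps the tree sorted */
    void          *page;
};

struct percore_trees {
    spinlock_t             pages_lock;
    struct radix_tree_root pages;   /* ALL cached pages of this file */
    spinlock_t             dirty_lock;
    struct rb_root         dirty;   /* ONLY dirty pages */
};

struct pfd {                        /* one per mapped file */
    int                   ncpus;
    struct percore_trees *trees;    /* array of ncpus entries */
};

static struct percore_trees *pick(struct pfd *p, unsigned long page_offset)
{
    return &p->trees[page_offset % p->ncpus];
}

/* Fault path: publish a newly fetched page in its per-core radix tree. */
static int cache_insert(struct pfd *p, unsigned long off, void *page)
{
    struct percore_trees *t = pick(p, off);
    int err;

    spin_lock(&t->pages_lock);
    err = radix_tree_insert(&t->pages, off, page);
    spin_unlock(&t->pages_lock);
    return err;
}

/* Write-fault path: record the page in the per-core dirty tree only;
 * this does not serialize inserts/removals in the radix trees. */
static void cache_mark_dirty(struct pfd *p, struct dirty_node *dn)
{
    struct percore_trees *t = pick(p, dn->offset);
    struct rb_node **link, *parent = NULL;

    spin_lock(&t->dirty_lock);
    link = &t->dirty.rb_node;
    while (*link) {
        struct dirty_node *cur = rb_entry(*link, struct dirty_node, node);
        parent = *link;
        link = (dn->offset < cur->offset) ? &(*link)->rb_left
                                          : &(*link)->rb_right;
    }
    rb_link_node(&dn->node, parent, link);
    rb_insert_color(&dn->node, &t->dirty);
    spin_unlock(&t->dirty_lock);
}
```

Keeping dirty pages in a red-black tree sorted by device offset also lets writeback issue the batch in offset order.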
FastMap design: 3 main techniques (cont.)
• Next: optimizes reverse mappings → reduces CPU processing in the common path
Reverse mappings
• Find which page table entries map a specific page
  • Page eviction → due to memory pressure or explicit writeback
  • Destroy mappings → munmap
• Linux uses object-based reverse mappings
  • Executables and libraries (e.g. libc) introduce a large amount of sharing
  • Reduces DRAM consumption and housekeeping costs
• Storage applications that use memory-mapped I/O
  • Require minimal sharing
  • Can be applied selectively to certain devices or files
Linux object-based reverse mappings
[Diagram: pages with _mapcount point to the address_space; its i_mmap tree, protected by a read/write semaphore, links to the VMAs and their page tables (PGDs)]
• _mapcount can still result in useless page table traversals
• The rw-semaphore is acquired as read on all operations
  • Cross NUMA-node traffic
  • Spends many CPU cycles
(a simplified walk is sketched below)
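A simplified sketch of the object-based walk, using the kernel's interval-tree iteration over i_mmap; it is illustrative of the cost pattern rather than the exact kernel rmap code, and it omits the page-table walk itself.

```c
/* Simplified sketch (not actual kernel code) of the object-based reverse
 * mapping walk Linux performs to find the PTEs mapping a file page: every
 * VMA covering the page's file offset is visited under the mapping's
 * rw-semaphore, even when only one of them (or none) has the page mapped. */
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static void walk_file_rmap(struct page *page)
{
    struct address_space *mapping = page->mapping;
    pgoff_t pgoff = page_to_pgoff(page);
    struct vm_area_struct *vma;

    i_mmap_lock_read(mapping);           /* rw-semaphore taken on every walk */
    vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
        unsigned long addr = vma->vm_start +
                             ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
        /* Walk this VMA's page table to locate (and possibly clear) the
         * PTE at addr; the walk is useless for VMAs that never faulted
         * the page in. */
        (void)addr;
    }
    i_mmap_unlock_read(mapping);
}
```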
FastMap full reverse mappings
[Diagram: each page keeps a list of (VMA, vaddr) entries that map it]
• Full reverse mappings
  • Reduce CPU overhead
  • Efficient munmap
• No ordering required ⇒ scalable updates
• More DRAM required per-core
  • Limited by the small degree of sharing in pages
(a minimal sketch follows)
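A simplified sketch of per-page full reverse mappings, assuming a per-page lock and list; names are illustrative and this is not FastMap's actual code.

```c
/* Simplified sketch (not FastMap's code) of full reverse mappings: each
 * cached page carries an explicit list of (vma, virtual address) pairs
 * that map it, so eviction and munmap can find and clear exactly the PTEs
 * involved without walking every VMA of the file. */
#include <linux/errno.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct rmap_entry {
    struct list_head       link;
    struct vm_area_struct *vma;
    unsigned long          vaddr;   /* where this page is mapped in vma */
};

struct cached_page {
    struct page     *page;
    spinlock_t       rmap_lock;
    struct list_head rmaps;         /* usually 0 or 1 entries */
};

/* Fault path: remember who mapped the page. No global ordering is needed,
 * so updates on different pages never contend. */
static int add_rmap(struct cached_page *cp, struct vm_area_struct *vma,
                    unsigned long vaddr)
{
    struct rmap_entry *e = kmalloc(sizeof(*e), GFP_KERNEL);

    if (!e)
        return -ENOMEM;
    e->vma = vma;
    e->vaddr = vaddr;
    spin_lock(&cp->rmap_lock);
    list_add(&e->link, &cp->rmaps);
    spin_unlock(&cp->rmap_lock);
    return 0;
}

/* Eviction/munmap path: visit only the mappings that actually exist. */
static void for_each_mapping(struct cached_page *cp,
                             void (*fn)(struct vm_area_struct *, unsigned long))
{
    struct rmap_entry *e;

    spin_lock(&cp->rmap_lock);
    list_for_each_entry(e, &cp->rmaps, link)
        fn(e->vma, e->vaddr);
    spin_unlock(&cp->rmap_lock);
}
```

The extra DRAM cost is one small entry per mapping; because mmapped storage files see little page sharing, the list almost always holds a single entry.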
FastMap design: 3 main techniques (cont.)
• Next: uses a scalable DRAM cache → minimizes interference and reduces latency variability
Batched TLB invalidations
• Under memory pressure FastMap evicts a batch of clean pages
  • Cache-related operations
  • Page table cleanup
  • TLB invalidation
• A TLB invalidation requires an IPI (Inter-Processor Interrupt)
  • Limits scalability [EuroSys'13, USENIX ATC'17, EuroSys'20]
• Single TLB invalidation for the whole batch
  • Convert the batch to an address range, at the cost of some unnecessary invalidations
(a minimal sketch follows)
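A simplified sketch of converting an eviction batch into one ranged TLB flush; the batch structure is an illustrative assumption, and the flush_tlb_mm_range() call follows the Linux 4.14 x86 interface, which differs across kernel versions.

```c
/* Simplified sketch (not FastMap's code) of batched TLB invalidation:
 * instead of one IPI-backed invalidation per evicted page, the batch of
 * evicted pages is converted into a single virtual address range and
 * flushed once. */
#include <asm/tlbflush.h>
#include <linux/kernel.h>
#include <linux/mm.h>

struct evict_batch {
    struct mm_struct *mm;
    unsigned long    *vaddrs;   /* user addresses of evicted pages */
    int               count;
};

static void flush_batch(struct evict_batch *b)
{
    unsigned long start = ULONG_MAX, end = 0;
    int i;

    if (b->count == 0)
        return;

    /* Convert the batch into one range; pages inside the range that were
     * not evicted get invalidated too, which is the accepted trade-off. */
    for (i = 0; i < b->count; i++) {
        if (b->vaddrs[i] < start)
            start = b->vaddrs[i];
        if (b->vaddrs[i] + PAGE_SIZE > end)
            end = b->vaddrs[i] + PAGE_SIZE;
    }

    /* One ranged invalidation (one round of IPIs) for the whole batch. */
    flush_tlb_mm_range(b->mm, start, end, VM_NONE);
}
```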
Other optimizations in the paper
• DRAM cache
• Eviction/writeback operations
• Implementation details
Outline (cont.)
• Next: Experimental analysis
Testbed
• 2x Intel Xeon E5-2630 v3 CPUs (2.4GHz)
  • 32 hyper-threads
• Different devices
  • Intel Optane SSD DC P4800X (375GB) in workloads
  • null_blk in microbenchmarks
• 256 GB of DDR4 DRAM
• CentOS v7.3 with Linux 4.14.72
Workloads
• Microbenchmarks
• Storage applications
  • Kreon [ACM SoCC'18] – persistent key-value store (YCSB)
  • MonetDB – column-oriented DBMS (TPC-H)
• Extend available DRAM over fast storage devices
  • Silo [SOSP'13] – key-value store with scalable transactions (TPC-C)
  • Ligra [PPoPP'13] – graph algorithms (BFS)
FastMap Scalability
[Figure: million page-faults/sec (IOPS) vs. number of threads (1–80) on 4x Intel Xeon E5-4610 v3 CPUs (1.7 GHz, 80 hyper-threads), comparing FastMap-Rd/Wr, FastMap-Rd/Wr-SPF, and mmap-Rd/Wr; FastMap achieves up to 11.8x more IOPS than mmap]
FastMap execution time breakdown
[Figure: execution samples (x1000) broken down into mark_dirty, address-space, page-fault, and other, for mmap-Read, mmap-Write, FastMap-Read, and FastMap-Write]
Kreon key-value store
• Persistent key-value store based on an LSM-tree
• Designed to use memory-mapped I/O in the common path
• YCSB with 80M records
  • 80GB dataset
  • 16GB DRAM
Kreon – 100% inserts
[Figure: execution time (sec) for 1–32 cores with FastMap vs. mmap, broken down into kreon, ycsb, others, pthread, pgfault, kworker, iowait, and idle; FastMap improves performance by up to 3.2x]
Kreon – 100% lookups
[Figure: execution time (sec) for 1–32 cores with FastMap vs. mmap, broken down into kreon, ycsb, others, pgfault, kworker, iowait, and idle; FastMap improves performance by up to 1.5x]
Batched TLB invalidations
• TLB batching results in 25.5% more TLB misses
• Improvement comes from fewer IPIs (Silo key-value store, TPC-C)
  • 24% higher throughput
  • 23.8% lower average latency
  • Less time in flush_tlb_mm_range(): 20.3% → 0.1%
Conclusions
• FastMap, an optimized mmio path in Linux
  • Scalable with the number of threads & low CPU overhead
• FastMap has significant benefits for data-intensive applications
  • Fast storage devices
  • Multi-core servers
• Up to 11.8x more IOPS with 80 cores and null_blk
• Up to 5.2x more IOPS with 32 cores and Intel Optane SSD