Effectively Prefetching Remote Memory with Leap
Hasan Al Maruf and Mosharaf Chowdhury

Memory-Intensive Applications
Perform Great Until Memory Runs Out
• TPC-C on VoltDB, throughput in thousand TPS by in-memory working set: 38.61 (100%), 6.61 (75%), 1.01 (50%)
• PageRank on PowerGraph, completion time in seconds by in-memory working set: 116.19 (100%), 124.96 (75%), 424.47 (50%)
50% Less Memory Causes Slowdown of …

Between a Rock and a Hard Place
• Underallocation: leads to severe performance loss
vs.
• Overallocation: leads to underutilization (30-40% in Google, Alibaba, and Facebook)

Memory Disaggregation
[Figure: Disaggregated memory spanning Machine 1, Machine 2, Machine 3, …, Machine N. Legend: used memory, free memory, remote memory.]

Remote Memory Access
• 4KB page access latency, local vs. remote: 100 ns vs. 4 µs
• Memory disaggregation frameworks (between user-space applications and remote memory):
  • Infiniswap (NSDI'17): remote memory paging
  • Remote Regions (ATC'18): remote file abstraction
  • LegoOS (OSDI'18): disaggregated OS
• Latency requirement for preferable performance [1]: 3 µs
• Existing frameworks can't achieve this, because of data path overhead and variation in network latency
[1] P. X. Gao et al. "Network requirements for resource disaggregation." OSDI'16.

Life of a Page
Default data path of a page fault, from user-space processes (Process 1 … Process N) to remote memory, with the measured latency of each stage:
• Page fault reaches the Memory Management Unit (MMU) in kernel space
• Cache hit: 0.27 µs
• Cache miss: page cache, 2.1 µs
• Device mapping layer and generic block layer (bio): 10.04 µs
• I/O scheduler, request queue processing (insertion, merging, sorting, staging, and dispatch): 21.88 µs
• Block device driver, RDMA: 4.3 µs
• Remote memory

Where Does the Time Go?
• Fast path (page is in the page cache):
  • Page cache lookup: 0.12 µs
  • Update page table & end I/O: 0.15 µs
• Slow path (page is not in the page cache):
  • Allocate cache for page: 2.1 µs
  • Prepare for I/O: 10.04 µs
  • Queue and batch requests: 21.88 µs
  • Execute I/O over RDMA: 4.3 µs
• Decision points in the flowchart: In page cache? Read request?

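To put the stage latencies above side by side, the short sketch below adds them up. It assumes (the slide does not state this) that a cache miss costs the sum of the four slow-path stages and that average latency is a plain hit-ratio-weighted mix of the two paths. The dictionary keys and `average_latency_us` are hypothetical names chosen for illustration.

```python
# Stage latencies in microseconds, copied from the slide.
FAST_PATH_US = {
    "page cache lookup": 0.12,
    "update page table & end I/O": 0.15,
}
SLOW_PATH_US = {
    "allocate cache for page": 2.1,
    "prepare for I/O": 10.04,
    "queue and batch requests": 21.88,
    "execute I/O over RDMA": 4.3,
}

fast_us = sum(FAST_PATH_US.values())
slow_us = sum(SLOW_PATH_US.values())

# Assumption: the "prepare" and "queue and batch" stages are the block-layer part.
block_layer_us = (SLOW_PATH_US["prepare for I/O"]
                  + SLOW_PATH_US["queue and batch requests"])
block_layer_share = block_layer_us / slow_us


def average_latency_us(hit_ratio: float) -> float:
    """Assumed model: hit-ratio-weighted mix of fast-path and slow-path cost."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * fast_us + (1.0 - hit_ratio) * slow_us


if __name__ == "__main__":
    print(f"fast path: {fast_us:.2f} us, slow path: {slow_us:.2f} us")
    print(f"block-layer share of slow path: {block_layer_share:.1%}")
    for ratio in (0.5, 0.9, 0.99):
        print(f"hit ratio {ratio:.2f}: {average_latency_us(ratio):.2f} us on average")
```

Under these assumptions the two block-layer stages make up roughly 83% of the slow path, so both a higher hit ratio and a shorter miss path reduce the average.
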
Design Goals
1. Increase cache hits
   • the faster path serves more page faults
2. Reduce the latency of the slow path
   • remove unnecessary block-layer operations for RDMA

Leap
Online remote memory prefetcher
Identifies memory access patterns to prefetch pages in a
• fast,
• cache-efficient, and
• resilient
manner, without modifying any
• applications, or
• hardware

Life of a Page w/ Leap
Data path of a page fault with Leap, from user-space processes (Process 1 … Process N) to remote memory:
• Page fault reaches the Memory Management Unit (MMU) in kernel space
• Cache hit: 0.27 µs
• Cache miss: page cache (2.1 µs), then RDMA (4.3 µs) to remote memory, with no block-layer stages in between
• Leap prefetcher (0.34 µs): process-specific page access tracker, trend detection, prefetch candidate generation, eager cache eviction

Prefetching in Linux
• Reads ahead pages sequentially
• Too aggressive on sequential access: cache pollution
• Based only on the last page access
• Too conservative off sequential access: brings nothing
• Does not distinguish between processes
• Cannot detect thread-level access irregularities

Prefetching Techniques
Approach | Low Computational Complexity | Low Memory Overhead | Unmodified Application | HW/SW Independence | Temporal Locality | Spatial Locality | Low Cache Pollution
Next N-Line | Yes | Yes | Yes | Yes | No | Yes | No
Stride | Yes | Yes | Yes | Yes | No | Yes | No
Instruction Prefetch | No | No | No | No | Yes | Yes | No
Linux Read-Ahead | Yes | Yes | Yes | Yes | Yes | Yes | No
Leap | Yes | Yes | Yes | Yes | Yes | Yes | Yes

Leap Prefetcher
Linear-time and constant memory space
Two main components:
• Trend detection
• Prefetch window size detection
Prefetch decision:
1. Get the prefetch window size.
2. If the window size is 0, read only the requested page.
3. Otherwise, if a trend is found, prefetch with the current trend; if not, prefetch with the previous trend.

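The decision flow on this slide can be sketched as a small function. This is an illustration under stated assumptions, not Leap's kernel code: the name `pages_to_prefetch`, the use of integer page numbers, the choice to return the requested page followed by `window_size` pages spaced by the trend, and the fallback when no previous trend exists are all hypothetical.

```python
from typing import List, Optional


def pages_to_prefetch(requested_page: int,
                      window_size: int,
                      current_trend: Optional[int],
                      previous_trend: Optional[int]) -> List[int]:
    """Return the pages to read for one fault, following the slide's flowchart."""
    # Window size 0: read only the requested page.
    if window_size == 0:
        return [requested_page]

    # Trend found: use it. Otherwise fall back to the previous trend.
    trend = current_trend if current_trend is not None else previous_trend

    # Assumption: with no trend at all (e.g. at startup), read only the request.
    if trend is None:
        return [requested_page]

    # Assumption: prefetch `window_size` pages spaced by the trend delta.
    return [requested_page + k * trend for k in range(window_size + 1)]


if __name__ == "__main__":
    print(pages_to_prefetch(0x3F, 0, -3, None))
    print(pages_to_prefetch(0x3F, 2, -3, None))
    print(pages_to_prefetch(0x06, 3, None, 2))
```

Keeping the previous trend as a fallback means a brief irregularity does not stop prefetching outright; only a window size of 0 does.
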
Trend Detection
• Flexible to short-term irregularity
• Identifies the majority element in access history
• Regular trends can be found within recent accesses
Procedure:
1. Start with a smaller window of the access history.
2. Run Boyer-Moore (majority vote) on the window.
3. If a majority is found, return the majority ∆maj.
4. Otherwise, if the window is already at its maximum size, no trend is found; if not, double the window size and go back to step 2.

Trend Detection Example
Access history (page address, with the delta ∆ from the previous access in parentheses):
(a) at time t3: trend of -3
    t0 0x48 (+72), t1 0x45 (-3), t2 0x42 (-3), t3 0x3F (-3)
(b) at time t7: trend of -3 disappears, no major new trend
    t0 0x48 (+72), t1 0x45 (-3), t2 0x42 (-3), t3 0x3F (-3), t4 0x3C (-3), t5 0x02 (-58), t6 0x04 (+2), t7 0x06 (+2)
(c) at time t8: trend of +2 detected (t8 overwrites the oldest entry, t0)
    t8 0x08 (+2), t1 0x45 (-3), t2 0x42 (-3), t3 0x3F (-3), t4 0x3C (-3), t5 0x02 (-58), t6 0x04 (+2), t7 0x06 (+2)
(d) at time t15: trend of +2 detected among irregularities
    t8 0x08 (+2), t9 0x0A (+2), t10 0x0C (+2), t11 0x10 (+4), t12 0x39 (+41), t13 0x12 (-39), t14 0x14 (+2), t15 0x16 (+2)

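The example can be replayed with a minimal sketch of the trend-detection steps: a Boyer-Moore majority vote over a window of recent deltas, doubling the window when no majority is found. The function names, the initial window of 4, and the history size of 8 are assumptions read off the example; the deltas are recomputed from the addresses, starting from address 0.

```python
from typing import List, Optional


def boyer_moore_majority(deltas: List[int]) -> Optional[int]:
    """Return the value that fills more than half of `deltas`, or None."""
    candidate, count = None, 0
    for d in deltas:
        if count == 0:
            candidate, count = d, 1
        elif d == candidate:
            count += 1
        else:
            count -= 1
    # The vote only yields a candidate; verify it is a strict majority.
    if candidate is not None and 2 * deltas.count(candidate) > len(deltas):
        return candidate
    return None


def find_trend(history: List[int], initial_window: int = 4,
               max_window: int = 8) -> Optional[int]:
    """Search the most recent deltas, doubling the window until a majority appears."""
    window = initial_window
    while True:
        majority = boyer_moore_majority(history[-window:])
        if majority is not None:
            return majority
        if window >= max_window or window >= len(history):
            return None  # no trend found
        window *= 2


def history_at(addresses: List[int], t: int, size: int = 8) -> List[int]:
    """Deltas of the last `size` accesses up to time t (first delta is from 0)."""
    deltas = [a - b for a, b in zip(addresses, [0] + addresses[:-1])]
    return deltas[max(0, t - size + 1): t + 1]


ADDRESSES = [0x48, 0x45, 0x42, 0x3F, 0x3C, 0x02, 0x04, 0x06,
             0x08, 0x0A, 0x0C, 0x10, 0x39, 0x12, 0x14, 0x16]

if __name__ == "__main__":
    for t in (3, 7, 8, 15):
        print(f"t{t}: trend = {find_trend(history_at(ADDRESSES, t))}")
```

The verification pass after the vote matters: at t7 the vote over the full window ends with -3 as its candidate, but -3 fills only four of the eight slots, so no trend is reported.
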
Prefetch Window Size Detection
• Cache hit indicates prefetch utilization
• High cache hit: increase prefetch window aggressively
• Trend availability: increase prefetch window gradually
• No cache hit, no trend: decrease prefetch window gradually
• Gradual slowdown helps during sudden changes

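One way to turn these rules into code is sketched below. The slide gives only the qualitative policy, so every concrete number here is an assumption: growing to the next power of two above the hit count, restarting at 1 when only a trend is available, never shrinking below half the previous size, and the cap of 8. The names `next_window_size` and `next_power_of_two` are hypothetical.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p


def next_window_size(prev_size: int, cache_hits: int,
                     follows_trend: bool, max_size: int = 8) -> int:
    """Pick the next prefetch window size from recent prefetch utilization."""
    if cache_hits > 0:
        # Prefetched pages were used: grow aggressively (assumed: power of two).
        size = min(next_power_of_two(cache_hits + 1), max_size)
    elif follows_trend:
        # No hits yet, but the access follows the trend: grow gradually.
        size = 1
    else:
        # No hits and no trend: aim for zero.
        size = 0
    # Gradual slowdown: never drop below half of the previous window.
    return max(size, prev_size // 2)


if __name__ == "__main__":
    size = 8
    for step in range(5):
        size = next_window_size(size, cache_hits=0, follows_trend=False)
        print(f"step {step}: window = {size}")
```

With no hits and no trend, a window of 8 shrinks over successive faults rather than collapsing at once, which keeps some prefetching alive if the pattern returns.
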
Evaluation
• Deployed and evaluated over a 56 Gbps InfiniBand network
• Memory disaggregation frameworks:
  • Disaggregated VMM: Infiniswap
  • Disaggregated VFS: Remote Regions

Lowers Remote Page Access Latency by…
[Figure: CDFs of remote page access latency in µs (log-scale axis from 0.01 to 10000) for sequential access and stride access, comparing Infiniswap with Infiniswap+Leap.]