HawkEye: Efficient Fine-grained OS Support for Huge Pages
Ashish Panwar (Indian Institute of Science, Bangalore), Sorav Bansal (Indian Institute of Technology, Delhi), K. Gopinath (Indian Institute of Science, Bangalore)
Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019
Background: address translation
[Diagram: a virtual address space mapped onto a physical address space]
• Mapping with 4-KB base pages creates too much TLB pressure
• Huge pages map the same region with far fewer entries, yielding fewer TLB misses
OS Challenges
❑ Complex trade-offs
• Memory bloat vs. performance
• Page fault latency vs. the number of page faults
❑ Challenges due to (external) fragmentation
• How to leverage limited memory contiguity
• Fairness in huge page allocation
Memory bloat vs. performance
Internal fragmentation
[Diagram: a huge page mapping from virtual to physical memory]
• Aggressive allocation: the huge page covers unused base pages, causing memory bloat
• Conservative allocation: no bloat, but lower TLB reach (impacts performance)
Bloat vs. performance

              Conservative   Aggressive
Performance   Lower          Higher
Bloat         Lower          Higher
Latency vs. # page faults
▪ Page fault work: find a page, zero-fill, map
• 4-KB fault (pre / zero-fill / post): zero-filling is about 25% of fault latency
• 2-MB fault (pre / zero-fill / post): dominated by zero-filling (97%)
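The shift in the cost breakdown follows from simple arithmetic: a 2-MB huge page covers 512 4-KB base pages, so a single huge-page fault zero-fills 512x as much memory as a base-page fault, which is why zeroing comes to dominate the latency.

```python
# A 2-MB huge page covers 512 4-KB base pages, so one huge-page
# fault zero-fills 512x the memory of a single base-page fault.
HUGE_PAGE = 2 * 1024 * 1024
BASE_PAGE = 4 * 1024
ratio = HUGE_PAGE // BASE_PAGE  # -> 512
```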
Latency vs. # page faults

              Aggressive   Conservative
Latency       High         Low
# faults      Fewer        Higher
Current systems favor opposite ends of the design spectrum
• FreeBSD is conservative (compromises on performance)
• Linux is throughput-oriented (compromises on latency and bloat)

                                   FreeBSD   Linux
Tradeoff-1:  Memory bloat          Low       High
             Performance           Low       High
Tradeoff-2:  Allocation latency    Low       High
             # page faults         High      Low
Ingens (OSDI'16)
▪ Asynchronous allocation (low latency, but too many page faults)
• Huge pages allocated in the background
▪ Utilization-threshold based allocation (manual tuning)
• Tunable bloat vs. performance
• Adaptive based on memory pressure
▪ Fairness driven by per-process fairness metric (weak correlation with page walk overhead)
• Heuristic based on past behavior
Current state-of-the-art

                                   FreeBSD   Linux   Ingens
Tradeoff-1:  Memory bloat          Low       High    Tunable
             Performance           Low       High    Tunable
Tradeoff-2:  Allocation latency    Low       High    Low
             # page faults         High      Low     High

▪ Hard to find the sweet spot for the utilization threshold in Ingens
• Application dependent, phase dependent
HawkEye
Key Optimizations
➢ Asynchronous page pre-zeroing [1]
➢ Content-deduplication-based bloat mitigation
➢ Fine-grained intra-process allocation
➢ Fairness driven by hardware performance counters

[1] Optimizing the Idle Task and Other MMU Tricks, OSDI '99
Asynchronous page pre-zeroing
▪ Pages zero-filled in the background
▪ Potential issues:
• Cache pollution: mitigated with non-temporal writes
• DRAM bandwidth consumption: rate-limited
o Limit CPU utilization (e.g., 5%)
Asynchronous page pre-zeroing enables aggressive allocation with low latency
✓ 13.8x faster VM spin-up
✓ 1.26x higher throughput (Redis)
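The mechanism above can be sketched as a simulation: a background pass zeroes a bounded number of free pages per invocation (a stand-in for the CPU-utilization cap), and the fault path prefers a pre-zeroed page so it skips the dominant zeroing cost. All names and the per-pass budget are illustrative, not the kernel's.

```python
# Minimal simulation of background page pre-zeroing with rate limiting.
ZERO_BUDGET_PER_PASS = 4  # illustrative stand-in for the ~5% CPU cap

def prezero_pass(free_list, zeroed_list, budget=ZERO_BUDGET_PER_PASS):
    """Zero-fill at most `budget` free pages per background pass."""
    for _ in range(min(budget, len(free_list))):
        page = free_list.pop()
        page[:] = bytes(len(page))  # in the kernel: non-temporal stores
        zeroed_list.append(page)

def allocate(free_list, zeroed_list):
    """Fault path: prefer a pre-zeroed page to avoid zeroing latency."""
    if zeroed_list:
        return zeroed_list.pop()    # fast path: already zeroed
    page = free_list.pop()          # slow path: zero synchronously
    page[:] = bytes(len(page))
    return page
```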
Mitigating bloat
[Diagram: a huge page mapping in which the unused base pages of physical memory remain zero-filled]
▪ Observation: unused base pages remain zero-filled
▪ Identify bloat by scanning memory
▪ Dedup zero-filled base pages to remove bloat
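The dedup step can be sketched as follows: scan the base pages under a huge page and remap every all-zero page to a single shared zero page, reclaiming the physical frames (a later write would trigger copy-on-write back to a real page). The function and variable names here are illustrative, not HawkEye's.

```python
# Sketch: remove bloat by deduplicating zero-filled base pages.
PAGE = 4096
ZERO_PAGE = bytes(PAGE)  # one shared all-zero page

def dedup_zero_pages(base_pages):
    """Remap every all-zero base page to the shared zero page."""
    mapping, reclaimed = [], 0
    for frame in base_pages:
        if frame == ZERO_PAGE:         # the scan is cheap: it can stop
            mapping.append(ZERO_PAGE)  # at the first non-zero byte
            reclaimed += 1             # physical frame freed (CoW on write)
        else:
            mapping.append(frame)
    return mapping, reclaimed
```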
Mitigating bloat
[Chart: ease of detecting non-zero pages; average offset (bytes) of the first non-zero byte vs. distance (bytes)]
Mitigating bloat
✓ Automated "bloat vs. performance" management
[Chart: Redis RSS (GB) over time (seconds) for Linux, Ingens, and HawkEye across three phases (P1: insert, P2: delete, P3: insert); annotations mark which runs succeed and which run out of memory]
                                   FreeBSD   Linux   Ingens    HawkEye
Tradeoff-1:  Memory bloat          Low       High    Tunable   Automated
             Performance           Low       High    Tunable   Automated
Tradeoff-2:  Allocation latency    Low       High    Low       Low
             # page faults         High      Low     High      Low
Fine-grained (intra-process) allocation
▪ Maximizing performance with limited contiguity
[Chart: access-coverage across the XSBench address space, highlighting hot regions]
access-coverage: # base pages accessed per second
❖ A good indicator of TLB contention due to a region
Fine-grained (intra-process) allocation
▪ Track access-coverage per region (access_map)
▪ Allocate in sorted order (top to bottom)
✓ Yields higher profit per allocation
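The allocation order above amounts to sorting candidate regions by their access-coverage and promoting huge pages from the hottest region down. A minimal sketch, with made-up region names and coverage values:

```python
def allocation_order(access_map):
    """Return regions sorted by access-coverage, hottest first."""
    return sorted(access_map, key=access_map.get, reverse=True)

# hypothetical access-coverage values (base pages accessed per second)
access_map = {"heap_hot": 480, "mmap_hot": 210, "stack": 12, "heap_cold": 3}
order = allocation_order(access_map)  # promote huge pages top to bottom
```

Under limited contiguity, stopping after the first few entries of this order captures most of the TLB-miss reduction.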
Fine-grained (intra-process) allocation
[Chart: page walk overhead (%) over time (seconds) for Linux, Ingens, and HawkEye; workload: XSBench]
Fine-grained (intra-process) allocation
[Chart: execution time (ms) saved per huge page allocation for Linux, Ingens, and HawkEye on Graph500, XSBench, and NPB_CG.D]
Fair (inter-process) allocation
▪ Prioritize allocation to the process with the highest expected improvement
▪ How to estimate page walk overhead?
• Profile hardware performance counters
• Low cost, accurate!
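The selection policy can be sketched as follows: read each process's page-walk and total cycle counts (e.g., from MMU performance counters) and hand the next huge page to the process with the highest page-walk overhead. The tuple fields are illustrative stand-ins for real counter reads.

```python
def next_recipient(procs):
    """procs maps pid -> (page_walk_cycles, total_cycles); return the
    pid with the highest page-walk overhead, i.e., the process with
    the highest expected improvement from a huge page."""
    return max(procs, key=lambda pid: procs[pid][0] / procs[pid][1])

# hypothetical counter readings for three processes
procs = {"redis": (30, 100), "xsbench": (45, 100), "compiler": (5, 100)}
```

Because the metric is measured rather than inferred from past behavior, it tracks actual TLB pressure, unlike a heuristic fairness score.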
Fair (inter-process) allocation
[Chart: % speedup for Linux, Ingens, and HawkEye on cactusADM, tigr, Graph500, lbm_s, SVM, XSBench, and CG.D; workloads running alongside a TLB-insensitive process]
Summary
▪ OS support for huge pages involves complex tradeoffs
▪ Balancing fine-grained control with high performance
▪ Dealing with fragmentation for efficiency and fairness

HawkEye: resolving fundamental conflicts for huge page optimizations
https://github.com/apanwariisc/HawkEye