HawkEye: Efficient Fine-grained OS Support for Huge Pages
Ashish Panwar (Indian Institute of Science, Bangalore), Sorav Bansal (Indian Institute of Technology, Delhi), K. Gopinath (Indian Institute of Science, Bangalore)
Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019
Background: address translation
[Diagram: a virtual address space mapped onto a physical address space]
• Mapping with 4-KB base pages creates too much TLB pressure
• Huge pages map the same region with far fewer entries, yielding fewer TLB misses
OS Challenges
❑ Complex trade-offs
• Memory bloat vs. performance
• Page fault latency vs. the number of page faults
❑ Challenges due to (external) fragmentation
• How to leverage limited memory contiguity
• Fairness in huge page allocation
Memory bloat vs. performance
Internal fragmentation
[Diagram: a huge page mapping from virtual to physical memory]
• Aggressive allocation: the huge page covers unused base pages, causing memory bloat
• Conservative allocation: no bloat, but lower TLB reach (impacts performance)
Bloat vs. performance

              Conservative   Aggressive
Performance   Lower          Higher
Bloat         Lower          Higher
Latency vs. # page faults
▪ Page fault work: find a page, zero-fill, map
• 4-KB fault (pre / zero-fill / post): zero-filling is about 25% of fault latency
• 2-MB fault (pre / zero-fill / post): dominated by zero-filling (97%)
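The shift in the cost breakdown follows from simple arithmetic: a 2-MB huge page covers 512 4-KB base pages, so a single huge-page fault zero-fills 512x as much memory as a base-page fault, which is why zeroing comes to dominate the latency.

```python
# A 2-MB huge page covers 512 4-KB base pages, so one huge-page
# fault zero-fills 512x the memory of a single base-page fault.
HUGE_PAGE = 2 * 1024 * 1024
BASE_PAGE = 4 * 1024
ratio = HUGE_PAGE // BASE_PAGE  # -> 512
```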
Latency vs. # page faults

              Aggressive   Conservative
Latency       High         Low
# faults      Fewer        Higher
Current systems favor opposite ends of the design spectrum
• FreeBSD is conservative (compromises on performance)
• Linux is throughput-oriented (compromises on latency and bloat)

                                   FreeBSD   Linux
Tradeoff-1:  Memory bloat          Low       High
             Performance           Low       High
Tradeoff-2:  Allocation latency    Low       High
             # page faults         High      Low
Ingens (OSDI'16)
▪ Asynchronous allocation (low latency, but too many page faults)
• Huge pages allocated in the background
▪ Utilization-threshold based allocation (manual tuning)
• Tunable bloat vs. performance
• Adaptive based on memory pressure
▪ Fairness driven by per-process fairness metric (weak correlation with page walk overhead)
• Heuristic based on past behavior
Current state-of-the-art

                                   FreeBSD   Linux   Ingens
Tradeoff-1:  Memory bloat          Low       High    Tunable
             Performance           Low       High    Tunable
Tradeoff-2:  Allocation latency    Low       High    Low
             # page faults         High      Low     High

▪ Hard to find the sweet spot for the utilization threshold in Ingens
• Application dependent, phase dependent
HawkEye
Key Optimizations
➢ Asynchronous page pre-zeroing [1]
➢ Content-deduplication-based bloat mitigation
➢ Fine-grained intra-process allocation
➢ Fairness driven by hardware performance counters

[1] Optimizing the Idle Task and Other MMU Tricks, OSDI '99
Asynchronous page pre-zeroing
▪ Pages zero-filled in the background
▪ Potential issues:
• Cache pollution: mitigated with non-temporal writes
• DRAM bandwidth consumption: rate-limited
o Limit CPU utilization (e.g., 5%)
Asynchronous page pre-zeroing enables aggressive allocation with low latency
✓ 13.8x faster VM spin-up
✓ 1.26x higher throughput (Redis)
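The mechanism above can be sketched as a simulation: a background pass zeroes a bounded number of free pages per invocation (a stand-in for the CPU-utilization cap), and the fault path prefers a pre-zeroed page so it skips the dominant zeroing cost. All names and the per-pass budget are illustrative, not the kernel's.

```python
# Minimal simulation of background page pre-zeroing with rate limiting.
ZERO_BUDGET_PER_PASS = 4  # illustrative stand-in for the ~5% CPU cap

def prezero_pass(free_list, zeroed_list, budget=ZERO_BUDGET_PER_PASS):
    """Zero-fill at most `budget` free pages per background pass."""
    for _ in range(min(budget, len(free_list))):
        page = free_list.pop()
        page[:] = bytes(len(page))  # in the kernel: non-temporal stores
        zeroed_list.append(page)

def allocate(free_list, zeroed_list):
    """Fault path: prefer a pre-zeroed page to avoid zeroing latency."""
    if zeroed_list:
        return zeroed_list.pop()    # fast path: already zeroed
    page = free_list.pop()          # slow path: zero synchronously
    page[:] = bytes(len(page))
    return page
```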
Mitigating bloat
[Diagram: a huge page mapping in which the unused base pages of physical memory remain zero-filled]
▪ Observation: unused base pages remain zero-filled
▪ Identify bloat by scanning memory
▪ Dedup zero-filled base pages to remove bloat
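The dedup step can be sketched as follows: scan the base pages under a huge page and remap every all-zero page to a single shared zero page, reclaiming the physical frames (a later write would trigger copy-on-write back to a real page). The function and variable names here are illustrative, not HawkEye's.

```python
# Sketch: remove bloat by deduplicating zero-filled base pages.
PAGE = 4096
ZERO_PAGE = bytes(PAGE)  # one shared all-zero page

def dedup_zero_pages(base_pages):
    """Remap every all-zero base page to the shared zero page."""
    mapping, reclaimed = [], 0
    for frame in base_pages:
        if frame == ZERO_PAGE:         # the scan is cheap: it can stop
            mapping.append(ZERO_PAGE)  # at the first non-zero byte
            reclaimed += 1             # physical frame freed (CoW on write)
        else:
            mapping.append(frame)
    return mapping, reclaimed
```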
Mitigating bloat
[Chart: ease of detecting non-zero pages; average offset (bytes) of the first non-zero byte vs. distance (bytes)]
Mitigating bloat
✓ Automated "bloat vs. performance" management
[Chart: Redis RSS (GB) over time (seconds) for Linux, Ingens, and HawkEye across three phases (P1: insert, P2: delete, P3: insert); annotations mark which runs succeed and which run out of memory]
                                   FreeBSD   Linux   Ingens    HawkEye
Tradeoff-1:  Memory bloat          Low       High    Tunable   Automated
             Performance           Low       High    Tunable   Automated
Tradeoff-2:  Allocation latency    Low       High    Low       Low
             # page faults         High      Low     High      Low
Fine-grained (intra-process) allocation
▪ Maximizing performance with limited contiguity
[Chart: access-coverage across the XSBench address space, highlighting hot regions]
access-coverage: # base pages accessed per second
❖ A good indicator of TLB contention due to a region
Fine-grained (intra-process) allocation
▪ Track access-coverage per region (access_map)
▪ Allocate in sorted order (top to bottom)
✓ Yields higher profit per allocation
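The allocation order above amounts to sorting candidate regions by their access-coverage and promoting huge pages from the hottest region down. A minimal sketch, with made-up region names and coverage values:

```python
def allocation_order(access_map):
    """Return regions sorted by access-coverage, hottest first."""
    return sorted(access_map, key=access_map.get, reverse=True)

# hypothetical access-coverage values (base pages accessed per second)
access_map = {"heap_hot": 480, "mmap_hot": 210, "stack": 12, "heap_cold": 3}
order = allocation_order(access_map)  # promote huge pages top to bottom
```

Under limited contiguity, stopping after the first few entries of this order captures most of the TLB-miss reduction.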
Fine-grained (intra-process) allocation
[Chart: page walk overhead (%) over time (seconds) for Linux, Ingens, and HawkEye; workload: XSBench]
Fine-grained (intra-process) allocation
[Chart: execution time (ms) saved per huge page allocation for Linux, Ingens, and HawkEye on Graph500, XSBench, and NPB_CG.D]
Fair (inter-process) allocation
▪ Prioritize allocation to the process with the highest expected improvement
▪ How to estimate page walk overhead?
• Profile hardware performance counters
• Low cost, accurate!
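The selection policy can be sketched as follows: read each process's page-walk and total cycle counts (e.g., from MMU performance counters) and hand the next huge page to the process with the highest page-walk overhead. The tuple fields are illustrative stand-ins for real counter reads.

```python
def next_recipient(procs):
    """procs maps pid -> (page_walk_cycles, total_cycles); return the
    pid with the highest page-walk overhead, i.e., the process with
    the highest expected improvement from a huge page."""
    return max(procs, key=lambda pid: procs[pid][0] / procs[pid][1])

# hypothetical counter readings for three processes
procs = {"redis": (30, 100), "xsbench": (45, 100), "compiler": (5, 100)}
```

Because the metric is measured rather than inferred from past behavior, it tracks actual TLB pressure, unlike a heuristic fairness score.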
Fair (inter-process) allocation
[Chart: % speedup for Linux, Ingens, and HawkEye on cactusADM, tigr, Graph500, lbm_s, SVM, XSBench, and CG.D; workloads running alongside a TLB-insensitive process]
Summary
▪ OS support for huge pages involves complex tradeoffs
▪ Balancing fine-grained control with high performance
▪ Dealing with fragmentation for efficiency and fairness

HawkEye: resolving fundamental conflicts for huge page optimizations
https://github.com/apanwariisc/HawkEye