
HawkEye: Efficient Fine-grained OS Support for Huge Pages


  1. HawkEye: Efficient Fine-grained OS Support for Huge Pages
     Ashish Panwar (1), Sorav Bansal (2), K. Gopinath (1)
     (1) Indian Institute of Science (IISc), Bangalore
     (2) Indian Institute of Technology, Delhi
     Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019

  2. [Figure: virtual address space]

  3. [Figure: virtual address space mapped onto physical address space]

  6. Too much TLB pressure!
     [Figure: virtual address space mapped onto physical address space]

  8. Huge pages: fewer misses
     [Figure: virtual address space mapped onto physical address space with huge pages]
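
The TLB-pressure argument can be made concrete with a little arithmetic. The sketch below computes TLB reach (number of entries times page size) for 4-KB and 2-MB pages. The entry count `TLB_ENTRIES` is an assumed, illustrative value and does not come from the talk.

```python
# Illustrative TLB-reach arithmetic. TLB_ENTRIES is an assumed value,
# chosen only to make the comparison concrete.
KB = 1024
MB = 1024 * KB

TLB_ENTRIES = 1536        # hypothetical TLB size
BASE_PAGE = 4 * KB
HUGE_PAGE = 2 * MB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory that can be addressed without a TLB miss."""
    return entries * page_size


base_reach = tlb_reach(TLB_ENTRIES, BASE_PAGE)
huge_reach = tlb_reach(TLB_ENTRIES, HUGE_PAGE)

print(f"4-KB pages: {base_reach // MB} MB of reach")
print(f"2-MB pages: {huge_reach // MB} MB of reach")
print(f"ratio: {huge_reach // base_reach}x")
```

Whatever the real entry count is, the ratio between the two reaches is fixed at 512, the number of base pages in one huge page.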

  9. OS Challenges
     ❑ Complex trade-offs
       • Memory bloat vs. performance
       • Page fault latency vs. the number of page faults
     ❑ Challenges due to (external) fragmentation
       • How to leverage limited memory contiguity
       • Fairness in huge page allocation

  10. Memory bloat vs. performance

  15. Internal fragmentation
      [Figure: virtual memory mapped to physical memory through huge page mappings]
      • Aggressive allocation: unused pages inside huge pages become bloat
      • Conservative allocation: lower TLB reach (impacts performance)

  16. Bloat vs. performance
      • Conservative: lower performance, lower bloat
      • Aggressive: higher performance, higher bloat

  17. Latency vs. # page faults

  25. ▪ Page fault handling: find a page, zero-fill, map
      [Figure: fault latency split into pre, zero-fill, and post stages]
      • 4-KB page: zero-fill is 25% of the fault latency
      • 2-MB page: fault latency dominated by zero-filling (97%)
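
The two sides of this trade-off can be sketched numerically. The example below counts the faults needed to touch a region with each page size, and uses the zero-fill shares from the slide (25% and 97%) to bound how much faster a single fault could get if zero-filling were taken off the critical path. The 1-GB region size and the assumption that zero-filling could cost nothing are illustrative, not measurements from the talk.

```python
BASE_PAGE = 4 * 1024
HUGE_PAGE = 2 * 1024 * 1024

# Share of fault latency spent zero-filling, as reported on the slide.
ZERO_FILL_SHARE = {BASE_PAGE: 0.25, HUGE_PAGE: 0.97}


def faults_to_touch(region_bytes: int, page_size: int) -> int:
    """Page faults needed to touch every page of a region once."""
    return -(-region_bytes // page_size)  # ceiling division


def best_case_speedup(page_size: int) -> float:
    """Upper bound on per-fault speedup if zero-filling cost nothing."""
    return 1.0 / (1.0 - ZERO_FILL_SHARE[page_size])


region = 1024 * 1024 * 1024  # 1 GB, an arbitrary example size
for size in (BASE_PAGE, HUGE_PAGE):
    print(size, faults_to_touch(region, size), round(best_case_speedup(size), 1))
```

Huge pages cut the number of faults by a factor of 512, but almost all of each huge-page fault is zero-filling, which is why the latency of an individual fault is so much higher.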

  26. Latency vs. # page faults
      • Aggressive: high latency, fewer faults
      • Conservative: low latency, more faults

  27. Current systems favor opposite ends of the design spectrum
      • FreeBSD is conservative (compromise on performance)
      • Linux is throughput-oriented (compromise on latency and bloat)

      Conservative (FreeBSD) vs. aggressive (Linux):

      |            |                    | FreeBSD | Linux |
      |------------|--------------------|---------|-------|
      | Tradeoff-1 | Memory bloat       | Low     | High  |
      |            | Performance        | Low     | High  |
      | Tradeoff-2 | Allocation latency | Low     | High  |
      |            | # page faults      | High    | Low   |

  31. Ingens (OSDI’16)
      ▪ Asynchronous allocation (low latency, but too many page faults)
        • Huge pages allocated in the background
      ▪ Utilization-threshold based allocation
        • Tunable bloat vs. performance (manual)
        • Adaptive based on memory pressure
      ▪ Fairness driven by per-process fairness metric
        • Heuristic based on past behavior (weak correlation with page walk overhead)

  32. Current state-of-the-art

      |            |                    | FreeBSD | Linux | Ingens  |
      |------------|--------------------|---------|-------|---------|
      | Tradeoff-1 | Memory bloat       | Low     | High  | Tunable |
      |            | Performance        | Low     | High  | Tunable |
      | Tradeoff-2 | Allocation latency | Low     | High  | Low     |
      |            | # page faults      | High    | Low   | High    |

      ▪ Hard to find the sweet-spot for utilization-threshold in Ingens
        • Application dependent, phase dependent

  33. HawkEye

  34. Key Optimizations
      ➢ Asynchronous page pre-zeroing [1]
      ➢ Content deduplication based bloat mitigation
      ➢ Fine-grained intra-process allocation
      ➢ Fairness driven by hardware performance counters
      [1] Optimizing the Idle Task and Other MMU Tricks, OSDI'99

  35. Asynchronous page pre-zeroing
      ▪ Pages zero-filled in the background
      ▪ Potential issues:
        • Cache pollution: leverage non-temporal writes
        • DRAM bandwidth consumption: rate-limited
          o Limit CPU utilization (e.g., 5%)
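
The rate-limiting idea can be illustrated with a user-space sketch: a worker zeroes pages in small batches and sleeps between batches so that its busy share stays at or below a CPU limit. This is only a simulation under stated assumptions. The names `prezero_worker`, `sleep_needed`, and the batch size are hypothetical, pages are modelled as `bytearray` objects, and the plain slice assignment stands in for the non-temporal stores a kernel would use.

```python
import time


def sleep_needed(busy_seconds: float, cpu_limit: float) -> float:
    """Idle time that keeps busy / (busy + idle) at or below cpu_limit."""
    if not 0.0 < cpu_limit <= 1.0:
        raise ValueError("cpu_limit must be in (0, 1]")
    return busy_seconds * (1.0 - cpu_limit) / cpu_limit


def prezero_worker(dirty_pages, zero_pool, cpu_limit=0.05, batch=8):
    """Zero free pages in small batches, sleeping between batches."""
    while dirty_pages:
        start = time.perf_counter()
        for _ in range(min(batch, len(dirty_pages))):
            page = dirty_pages.pop()
            page[:] = bytes(len(page))  # stand-in for non-temporal stores
            zero_pool.append(page)
        busy = time.perf_counter() - start
        time.sleep(sleep_needed(busy, cpu_limit))


# Sixteen "freed" 4-KB pages that still hold old data.
pages = [bytearray(b"\xff" * 4096) for _ in range(16)]
pool = []
prezero_worker(pages, pool)
print(len(pool), all(not any(p) for p in pool))
```

With a 5% limit, every unit of time spent zeroing is followed by 19 units of sleep, so the allocator finds pre-zeroed pages in `pool` without the worker competing heavily for CPU.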

  36. Asynchronous page pre-zeroing
      Enables aggressive allocation with low latency
      ✓ 13.8x faster VM spin-up
      ✓ 1.26x higher throughput (Redis)

  37. Mitigating bloat

  41. Mitigating bloat
      [Figure: virtual memory mapped to physical memory through a huge page mapping; the unused part of the huge page is zero-filled]
      ▪ Observation: Unused base pages remain zero-filled
      ▪ Identify bloat by scanning memory
      ▪ Dedup zero-filled base pages to remove bloat

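  42. Mitigating bloat
      ▪ Ease of detecting non-zero pages
      [Figure: bar chart with axis labels "offset (bytes)" and "distance (bytes)", scale 0 to 120; bar values shown: 115.5, 67.5, 55.4, 27.4, 9.11, 6.63, 3.9, 2.8, 1.2; category labels lost in extraction]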
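
The scanning step can be sketched as follows. The code walks the 512 base pages inside a 2-MB huge page and reports those that are entirely zero. The scan of a page stops at its first non-zero byte, so pages that are in use are rejected after reading only a few bytes. This is an illustrative model that uses a Python `bytes` object in place of physical memory. The function names are hypothetical, and the dedup step itself (reclaiming the zero-filled pages) is not shown.

```python
BASE_PAGE = 4096
HUGE_PAGE = 2 * 1024 * 1024


def first_nonzero_offset(page: bytes) -> int:
    """Offset of the first non-zero byte, or -1 if the page is all zeros."""
    for offset, byte in enumerate(page):
        if byte:
            return offset  # early exit: in-use pages are rejected quickly
    return -1


def zero_filled_base_pages(huge_page: bytes) -> list[int]:
    """Indices of base pages inside a huge page that are entirely zero."""
    if len(huge_page) != HUGE_PAGE:
        raise ValueError("expected exactly one 2-MB huge page")
    zero_pages = []
    for index in range(HUGE_PAGE // BASE_PAGE):
        base = huge_page[index * BASE_PAGE:(index + 1) * BASE_PAGE]
        if first_nonzero_offset(base) == -1:
            zero_pages.append(index)
    return zero_pages


# A huge page in which only base pages 0 and 3 were ever written.
huge = bytearray(HUGE_PAGE)
huge[10] = 1
huge[3 * BASE_PAGE + 100] = 7
free = zero_filled_base_pages(bytes(huge))
print(len(free), "of", HUGE_PAGE // BASE_PAGE, "base pages are zero-filled")
```

Only pages that really are unused pay the full 4096-byte scan, which matches the slide's point that non-zero pages are easy to detect.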

  43. Mitigating bloat
      ✓ Automated "bloat vs. performance" management
      [Figure: Redis RSS (GB) vs. time (seconds) for Linux, Ingens, and HawkEye across three phases (P1: insert, P2: delete, P3: insert); runs are annotated "success" or "out-of-memory"]

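  44. |            |                    | FreeBSD | Linux | Ingens  | HawkEye   |
      |------------|--------------------|---------|-------|---------|-----------|
      | Tradeoff-1 | Memory bloat       | Low     | High  | Tunable | Automated |
      |            | Performance        | Low     | High  | Tunable | Automated |
      | Tradeoff-2 | Allocation latency | Low     | High  | Low     | Low       |
      |            | # page faults      | High    | Low   | High    | Low       |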

  46. Fine-grained (intra-process) allocation
      ▪ Maximizing performance with limited contiguity
      [Figure: access-coverage of XSBench memory regions, with hot regions marked]
      access-coverage: # base pages accessed per second
      ❖ A good indicator of TLB-contention due to a region

  47. Fine-grained (intra-process) allocation
      ▪ Track access-coverage (access_map)
      ▪ Allocate in the sorted order (top to bottom)
      ✓ Yields higher profit per allocation
      [Figure: access_map]
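
A minimal sketch of the ordering policy, assuming access-coverage has already been measured per region: regions are sorted from highest to lowest coverage and the available huge pages go to the top of the list. The `Region` type and `pick_regions` function are hypothetical names for illustration. The slide does not describe the internal layout of `access_map`, so a plain sorted list is used here as a stand-in.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int            # hypothetical huge-page-aligned start address
    access_coverage: int  # base pages accessed per second in this region


def pick_regions(regions: list[Region], free_huge_pages: int) -> list[Region]:
    """Choose which regions to back with huge pages, hottest first."""
    ranked = sorted(regions, key=lambda r: r.access_coverage, reverse=True)
    return ranked[:max(free_huge_pages, 0)]


regions = [
    Region(start=0x200000, access_coverage=12),
    Region(start=0x400000, access_coverage=480),
    Region(start=0x600000, access_coverage=75),
    Region(start=0x800000, access_coverage=300),
]

# Only two huge pages' worth of contiguity is available.
chosen = pick_regions(regions, free_huge_pages=2)
print([hex(r.start) for r in chosen])
```

When contiguity is scarce, spending it on the regions with the highest coverage is what gives each allocation the most benefit.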

  48. Fine-grained (intra-process) allocation
      [Figure: page walk (MMU) overhead (%) vs. time (seconds) for Linux, Ingens, and HawkEye, with an "access-coverage" annotation; workload: XSBench]

  49. Fine-grained (intra-process) allocation
      [Figure: execution time (ms) saved per huge page allocation for Linux, Ingens, and HawkEye on Graph500, XSBench, and NPB_CG.D]

  50. Fair (inter-process) allocation
      ▪ Prioritize allocation to the process with highest expected improvement
      ▪ How to estimate page walk overhead
        • Profile hardware performance counters
        • Low cost, accurate!
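
One way to picture this policy: estimate each process's page walk overhead as the fraction of its cycles spent in page walks, then give the next huge page to the process with the highest estimate. The sketch below assumes the two counter values are already available for each process. The field names are hypothetical and do not correspond to specific hardware events named in the talk.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    pid: int
    page_walk_cycles: int  # hypothetical: cycles spent walking page tables
    total_cycles: int      # hypothetical: cycles in the same sampling window


def mmu_overhead(sample: CounterSample) -> float:
    """Fraction of cycles spent in page walks (0.0 if nothing ran)."""
    if sample.total_cycles == 0:
        return 0.0
    return sample.page_walk_cycles / sample.total_cycles


def next_process(samples: list[CounterSample]) -> int:
    """PID of the process expected to gain most from a huge page."""
    return max(samples, key=mmu_overhead).pid


samples = [
    CounterSample(pid=101, page_walk_cycles=40_000, total_cycles=1_000_000),
    CounterSample(pid=102, page_walk_cycles=350_000, total_cycles=1_000_000),
    CounterSample(pid=103, page_walk_cycles=5_000, total_cycles=2_000_000),
]
print(next_process(samples))
```

A TLB-insensitive process such as pid 103 here ranks last, so it does not take contiguity away from processes that would benefit more.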

  51. Fair (inter-process) allocation
      [Figure: % speedup for Linux, Ingens, and HawkEye on cactusADM, tigr, Graph500, lbm_s, SVM, XSBench, and CG.D; workloads running alongside a TLB-insensitive process]

  53. Summary
      ▪ OS support for huge pages involves complex tradeoffs
      ▪ Balancing fine-grained control with high performance
      ▪ Dealing with fragmentation for efficiency and fairness
      HawkEye: Resolving fundamental conflicts for huge page optimizations
      https://github.com/apanwariisc/HawkEye
