Past Proposals: Direct Segments
• Segment-based translation [1]
  • Three values (base, limit, offset) represent a contiguous translation of any size
  • Fully associative lookup for multiple segments (limits the size of the TLB)
  • Redundant Memory Mappings (RMM) [6] -> 32-entry fully associative TLB
• Efficient with a small number of big memory chunks
[Figure: virtual pages mapped to physical pages through a direct segment defined by Base, Limit, and Offset]
[1] Basu et al. ISCA ’13
[6] Karakostas et al. ISCA ’15
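The base/limit/offset translation described on this slide can be sketched in a few lines. This is a minimal illustration, not the design from [1] or [6]: the `Segment` class, the `translate` function, and the example addresses are all hypothetical, and the linear scan merely stands in for a fully associative hardware lookup.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Segment:
    base: int    # first virtual address covered (inclusive)
    limit: int   # end of the covered virtual range (exclusive)
    offset: int  # added to a virtual address to form the physical address


def translate(segments: Iterable[Segment], vaddr: int) -> Optional[int]:
    """Compare vaddr against every segment, as a fully associative lookup would."""
    for seg in segments:
        if seg.base <= vaddr < seg.limit:
            return vaddr + seg.offset
    return None  # not covered by any segment: fall back to regular paging


# Hypothetical 1GB segment, for illustration only.
segments = [Segment(base=0x10000000, limit=0x50000000, offset=0x20000000)]
```

Because every segment must be compared on each lookup, the number of segments that can be cached stays small, which matches the "limits the size of the TLB" point above.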
Past Proposals: Summary
• Large pages
  • Affinity for large pages (2MB)
• Cluster TLB
  • Affinity for clustering of mappings of up to 8 pages
• Segment translations
  • Affinity for a small number of large chunks (32-entry TLB)
Prior proposals efficiently support specific memory mapping scenarios
Large Contiguity vs. Memory Non-Uniformity
• Conflicting goals of NUMA systems and large pages [2]
  • Memory traffic balance vs. efficient address translation
• Heterogeneous memory worsens non-uniformity [3][4]
[Figure: regular pages and a large page distributed across eight NUMA nodes; hot pages and cold pages in heterogeneous memory]
Different systems have different memory mapping needs
[2] Baptiste et al. ATC ’14
[3] Lee et al. ISCA ’15
[4] Agarwal et al. ASPLOS ’17
Need for an All-Rounder Solution
• Contiguity distribution varies among workloads
• Also varies within the same workload [7]
[Figure: CDF of process memory, with regions annotated "Well suited for Cluster", "Well suited for Large pages", and "Well suited for ??"]
Can we make a TLB scheme that works well for diverse scenarios?
[7] Kwon et al. OSDI ’16
Hybrid TLB Coalescing
We propose a TLB with adjustable coverage
• HW-SW joint effort
• Hardware (TLB): HW offers adjustable TLB coverage
  • Number of TLB entries fixed
  • Coverage of each entry adjustable
• Operating System: OS decides the best TLB coverage
  • Adjusts TLB coverage per process
• Page Table: OS identifies contiguous chunks
  • Marks them onto the process page table
Anchor
• Anchors are special entries in the page table
• Placed at every page table entry aligned to the anchor distance
• Anchor distance is a power of 2 (for encoding efficiency)
• Anchor distance is configurable by the OS
[Figure: page table with anchors placed at Anchor Distance = 4 and at Anchor Distance = 8]
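Since the anchor distance is a power of 2, the anchor responsible for a given virtual page can be found with bit masking. The sketch below assumes virtual page numbers (VPNs) are plain integers; the helper names `anchor_vpn` and `anchor_offset` are illustrative and do not come from the slides.

```python
def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def anchor_vpn(vpn: int, distance: int) -> int:
    """Round vpn down to the nearest multiple of the anchor distance."""
    if not is_power_of_two(distance):
        raise ValueError("anchor distance must be a power of 2")
    return vpn & ~(distance - 1)


def anchor_offset(vpn: int, distance: int) -> int:
    """Number of pages between vpn and the anchor that precedes it."""
    if not is_power_of_two(distance):
        raise ValueError("anchor distance must be a power of 2")
    return vpn & (distance - 1)
```

For example, with an anchor distance of 8, VPN 13 belongs to the anchor at VPN 8 with offset 5; with a distance of 4, it belongs to the anchor at VPN 12 with offset 1.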
Anchor Page Table
• Uses the page table
• An anchor covers up to anchor-distance contiguous pages (4 in this example)
• Each anchor represents the contiguity that begins at the anchor
• OS marks contiguity onto the anchor page table entries
[Figure: virtual pages mapped to physical pages through anchor mappings and regular mappings; values shown on the virtual pages: 2, 3, 4, 1, 4]
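One way to picture the marking step is a pass over the page table that, for each anchor, counts how many pages starting at the anchor map to consecutive physical frames. The sketch below is an assumption-laden model, not the OS implementation: the page table is a plain `dict` from VPN to PFN, the function name `mark_contiguity` is hypothetical, and contiguity is capped at the anchor distance, following the "covers up to distance" bullet above.

```python
from typing import Dict


def mark_contiguity(page_table: Dict[int, int], distance: int) -> Dict[int, int]:
    """Return {anchor VPN: contiguity} for every mapped anchor entry."""
    marks = {}
    for vpn, pfn in page_table.items():
        if vpn % distance != 0:
            continue  # not an anchor position
        run = 1  # the anchor page itself
        # Extend the run while the next virtual page maps to the next physical frame.
        while run < distance and page_table.get(vpn + run) == pfn + run:
            run += 1
        marks[vpn] = run
    return marks


# Hypothetical mapping: pages 0-2 contiguous, 4-7 contiguous, 8 and 9 unrelated.
example_table = {0: 100, 1: 101, 2: 102, 3: 200,
                 4: 300, 5: 301, 6: 302, 7: 303,
                 8: 50, 9: 70}
example_marks = mark_contiguity(example_table, distance=4)
# example_marks == {0: 3, 4: 4, 8: 1}
```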
Anchor TLB
• Integrated into the L2 TLB
  • L1 keeps regular entries
• Caches both regular and anchor page table entries
• Regular and anchor entries are indexed differently
[Figure: Anchor TLB (4 sets) holding anchor entries and regular entries; each entry is shown as Tag | Contiguity, with X as the contiguity of a regular entry]
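The slide does not spell out how the two kinds of entries are indexed. One plausible scheme, shown here purely as an assumption, is to index regular entries by the low VPN bits and anchor entries by the VPN bits above the anchor-distance bits, so that consecutive anchors spread across sets. The function names and the modulo set selection are illustrative.

```python
def regular_set_index(vpn: int, num_sets: int) -> int:
    """Assumed indexing for regular entries: low bits of the VPN."""
    return vpn % num_sets


def anchor_set_index(vpn: int, distance: int, num_sets: int) -> int:
    """Assumed indexing for anchor entries: drop the log2(distance) low bits first."""
    shift = distance.bit_length() - 1  # log2(distance) for a power-of-2 distance
    return (vpn >> shift) % num_sets
```

Under this assumption, all pages that share an anchor also share an anchor set index, so a single anchor entry can serve any of them.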
Anchor TLB Lookup
• On an L1 TLB miss, the anchor TLB is looked up
  • Regular entry first
  • Anchor entry next
• Anchor HIT if the offset from the anchor is smaller than the anchor entry's contiguity (in the example, offset 2 < contiguity 3): return anchor PFN + offset
• MISS otherwise: start a page walk
[Figure: Anchor TLB (4 sets) holding anchor entries and regular entries, stepping through a lookup]
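The lookup order on this slide can be sketched as follows. This is a simplified model under stated assumptions: both structures are plain dictionaries rather than set-associative arrays, `regular_tlb` maps VPN to PFN, `anchor_tlb` maps an anchor VPN to an `(anchor_pfn, contiguity)` pair, and the name `l2_lookup` is hypothetical.

```python
from typing import Dict, Optional, Tuple


def l2_lookup(vpn: int,
              regular_tlb: Dict[int, int],
              anchor_tlb: Dict[int, Tuple[int, int]],
              distance: int) -> Tuple[str, Optional[int]]:
    """Return (outcome, pfn) for an access that missed in the L1 TLB."""
    # Step 1: regular entry lookup.
    if vpn in regular_tlb:
        return ("regular hit", regular_tlb[vpn])

    # Step 2: anchor entry lookup (distance is assumed to be a power of 2).
    avpn = vpn & ~(distance - 1)
    offset = vpn - avpn
    entry = anchor_tlb.get(avpn)
    if entry is not None:
        anchor_pfn, contiguity = entry
        if offset < contiguity:
            return ("anchor hit", anchor_pfn + offset)

    # Step 3: neither entry translates the page, so a page walk starts.
    return ("miss", None)


# Hypothetical contents: one regular entry and one anchor with contiguity 3.
regular = {9: 500}
anchors = {4: (300, 3)}
```

With these contents and an anchor distance of 4, VPN 6 has offset 2, which is below the contiguity of 3, so it hits and returns PFN 302; VPN 7 has offset 3 and misses.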
Operating System Responsibilities
• OS periodically selects the process anchor distance
  • Heuristic algorithm to minimize TLB entry count
• OS adjusts the anchor distance
  • Anchor distance based on the selection algorithm
• OS marks mapping contiguity
  • Memory mapping contiguity in the anchor page table entry
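The slides do not give the selection heuristic itself, so the following is only a rough sketch of the idea of "minimize TLB entry count". It assumes a hypothetical cost model: each process mapping is a list of contiguous chunks `(start_vpn, length_in_pages)`, every anchor inside a chunk costs one entry, and every page that no anchor in its chunk can reach costs one regular entry. Large pages and TLB capacity are ignored.

```python
from typing import Iterable, List, Tuple

Chunk = Tuple[int, int]  # (start VPN, length in pages); an assumed representation


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def entries_needed(chunks: Iterable[Chunk], distance: int) -> int:
    """Estimate the TLB entries needed to cover all chunks at this anchor distance."""
    total = 0
    for start, length in chunks:
        end = start + length
        first_anchor = ceil_div(start, distance) * distance
        if first_anchor >= end:
            total += length  # no anchor inside the chunk: regular entries only
            continue
        total += first_anchor - start                   # pages before the first anchor
        total += ceil_div(end - first_anchor, distance)  # one entry per anchor in the chunk
    return total


def select_distance(chunks: List[Chunk],
                    candidates=(2, 4, 8, 16, 32, 64, 128, 256, 512)) -> int:
    """Pick the candidate distance with the lowest estimated entry count."""
    return min(candidates, key=lambda d: entries_needed(chunks, d))
```

Under this model, two aligned 64-page chunks are best served by a distance of 64 (two entries), while a larger distance leaves the second chunk without an anchor and pushes the estimate back up.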
Simulation Methodology
• Trace-based TLB simulator (based on Intel Haswell)
TLB configuration:
Scheme                            Configuration
Common L1                         4KB: 64 entry, 4 way; 2MB: 32 entry, 4 way
Baseline L2 / THP                 4KB/2MB: 1024 entry, 8 way
Cluster                           Regular (4KB/2MB): 768 entry, 6 way; Cluster-8: 320 entry, 5 way
RMM (multiple segments)           Baseline L2 TLB + RMM: 32 entry, fully assoc.
Anchor (Selected / Static Ideal)  4KB/2MB/anchor: 1024 entry, 8 way
Memory Mapping Scenarios
• Two classes of memory mapping scenarios
  • Two real system memory mappings
  • Four synthetic memory mappings
Name      Trace information
demand    Default Linux memory mapping
eager     ‘Eager’ allocation
low       1 to 16 pages (4KB to 64KB)
medium    1 to 512 pages (4KB to 2MB)
high      512 to 64K pages (2MB to 256MB)
max       Maximum contiguity
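The slides give only the chunk-size ranges of the synthetic scenarios, not how chunks were drawn. As a hedged illustration, the sketch below draws chunk sizes uniformly at random within each range; the uniform distribution, the seed, and the name `synthetic_chunks` are all assumptions made here for illustration.

```python
import random
from typing import List

# Chunk-size ranges in 4KB pages, taken from the table above.
SCENARIOS = {
    "low": (1, 16),
    "medium": (1, 512),
    "high": (512, 64 * 1024),
}


def synthetic_chunks(scenario: str, total_pages: int, seed: int = 0) -> List[int]:
    """Split total_pages into contiguous chunk sizes drawn from the scenario's range."""
    lo, hi = SCENARIOS[scenario]
    rng = random.Random(seed)  # seeded so that a run is repeatable
    chunks = []
    remaining = total_pages
    while remaining > 0:
        # The final chunk is truncated to the pages left, so it may fall below lo.
        size = min(rng.randint(lo, hi), remaining)
        chunks.append(size)
        remaining -= size
    return chunks
```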
Evaluation: TLB Misses of Demand Mapping
[Figure: relative TLB misses (%), 0 to 100, for THP, Cluster, RMM, Anchor Selected, and Anchor Ideal under the demand mapping]
Anchor TLB adjusted to satisfy small contiguities
Evaluation: TLB Misses of Medium Mapping
[Figure: relative TLB misses (%), 0 to 100, for THP, Cluster, RMM, Anchor Selected, and Anchor Ideal under the medium mapping]
Anchor adjusted its coverage to provide the best TLB miss reduction
Evaluation: TLB Misses of All Mappings
[Figure: relative TLB misses (%), 0 to 100, for Baseline, THP, Cluster, RMM, Anchor Selected, and Anchor Ideal under the demand, eager, low cont., med cont., high cont., and max cont. mappings]