Criticality Aware Tiered Cache Hierarchy (CATCH)
Anant Nori*, Jayesh Gaur*, Siddharth Rai#, Sreenivas Subramoney*, Hong Wang*
* Microarchitecture Research Lab, Intel
# Indian Institute of Technology Kanpur, India
Popular Three Level Cache Hierarchy

Cache capacity trades off against access latency; the hierarchy targets a low average latency.

Skylake-like server:
• L1 (private): 32 KB data + 32 KB code, 5 cycles
• L2 (private): 1 MB, 15 cycles
• LLC (shared, exclusive): 1.375 MB/core (5.5 MB for 4 cores), 40 cycles

Broadwell-like server:
• L1 (private): 32 KB data + 32 KB code, 5 cycles
• L2 (private): 256 KB, 15 cycles
• LLC (shared, inclusive): 2 MB/core (8 MB for 4 cores), 40 cycles

• The large distributed LLC has high latency, so a low L2 latency is important
• The trend is toward larger L2 sizes

[Figure: performance impact of removing the L2 (NoL2 + 6.5 MB LLC and NoL2 + 9.5 MB LLC) on client, FSPEC, HPC, ISPEC, and server workloads; losses range from 2.7% to 14.0%, with geomean losses of roughly 5.1% and 7.8% for the two configurations.]

Is a large L2 the most efficient design choice?
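The latency trade-off above can be sketched as a simple average-latency model. The cycle counts come from the Skylake-like hierarchy; the DRAM latency and all hit rates are illustrative assumptions of this sketch, not measured values:

```python
# Latencies (cycles) from the Skylake-like hierarchy above; the DRAM
# latency and all hit-rate fractions below are illustrative assumptions.
LATENCY = {"L1": 5, "L2": 15, "LLC": 40, "DRAM": 200}

def avg_load_latency(hit_rates):
    """Average load latency; hit_rates maps each level to the
    fraction of all loads served there (fractions sum to 1)."""
    assert abs(sum(hit_rates.values()) - 1.0) < 1e-9
    return sum(frac * LATENCY[lvl] for lvl, frac in hit_rates.items())

# With an L2, many loads are served at 15 cycles instead of 40.
with_l2 = avg_load_latency({"L1": 0.85, "L2": 0.10, "LLC": 0.03, "DRAM": 0.02})
# Without an L2, the former L2 hits pay the 40-cycle LLC latency.
no_l2 = avg_load_latency({"L1": 0.85, "L2": 0.00, "LLC": 0.13, "DRAM": 0.02})
print(with_l2, no_l2)  # 10.95 vs 13.45 cycles
```

Even a modest fraction of loads served at L2 rather than LLC latency visibly lowers the average, which is why the hierarchy targets low average latency.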
Large L2 Caches

• Inclusive LLC → Exclusive LLC
• Lower effective on-die cache per core
• A large LLC is better for multiple threads with disparate cache footprints
• Extra area is needed for a snoop filter / coherence directory

Despite the area and power overheads, average latency reduction (performance) drives the large L2.
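The capacity points above can be made concrete with a small accounting sketch, using the two example servers' capacities. The accounting rule is an assumption of this sketch, not from the slides:

```python
# Capacities in MB from the two example 4-core servers. The simple
# accounting rule below is an assumption of this sketch.

def single_thread_reach(l2_mb, llc_total_mb, inclusive):
    """On-die cache one thread with a large footprint can use: its
    private L2 plus the shared LLC. With inclusion the L2 lines are
    duplicated in the LLC, so only the LLC capacity counts."""
    return llc_total_mb if inclusive else l2_mb + llc_total_mb

# Skylake-like: 1 MB L2, 5.5 MB exclusive LLC
skylake = single_thread_reach(1.0, 5.5, inclusive=False)
# Broadwell-like: 256 KB L2, 8 MB inclusive LLC
broadwell = single_thread_reach(0.25, 8.0, inclusive=True)
print(skylake, broadwell)  # 6.5 vs 8.0 MB
```

A single thread with a large footprint reaches more on-die cache on the large-LLC design, which is exactly what the large-L2 design trades away.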
Loads and Program Criticality

Program execution can be expressed as a data dependency graph (Fields et al.), with dispatch (D), execute (E), and commit (C) nodes for each instruction and edges weighted by latencies. Execution time is governed by the "critical path" through this graph.

The example graph has seven instructions, four of them loads: an L2 hit (10 cycles), an LLC hit (30 cycles), another L2 hit, and an LLC miss (200 cycles).

• CRITICAL L2-hit load: converting it from an L2 hit to an LLC hit lengthens the critical path, costing ~9% performance.
• NON-CRITICAL L2-hit load: the same conversion has no performance impact, since the extra latency is hidden off the critical path.

Only critical load L2 hits matter to performance.
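The critical-path idea can be sketched as a longest-path computation over a small Fields-style DDG. The three-instruction graph below is a hypothetical example of this sketch, not the slide's seven-instruction graph:

```python
from collections import defaultdict

def critical_path(edges, source):
    """Longest path from `source` in a DAG given {(u, v): weight},
    relaxed in topological (Kahn) order."""
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for (u, v), w in edges.items():
        succ[u].append((v, w))
        indeg[v] += 1
        nodes |= {u, v}
    dist = {n: float("-inf") for n in nodes}
    dist[source] = 0
    ready = [n for n in nodes if indeg[n] == 0]
    while ready:
        u = ready.pop()
        for v, w in succ[u]:
            dist[v] = max(dist[v], dist[u] + w)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return max(dist.values())

def graph(load_latency):
    # D = dispatch, E = execute, C = commit. A 2-cycle op computes the
    # load address, and the load's latency feeds a 2-cycle consumer.
    return {
        ("D1", "E1"): 0, ("E1", "E2"): 2,
        ("D2", "E2"): 0, ("E2", "E3"): load_latency,
        ("D3", "E3"): 0, ("E3", "C3"): 2,
    }

print(critical_path(graph(10), "D1"))  # L2 hit (10 cyc): path = 14
print(critical_path(graph(30), "D1"))  # LLC hit (30 cyc): path = 34
```

Here the load sits on the only dependence chain, so it is critical: the 20 extra cycles of an LLC hit land directly on the critical path. A load with more than 20 cycles of slack would see no change in the path length.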
Cache Hierarchy and Program Criticality

Oracle study:
• Track critical load PCs
• Increase the latencies of targeted load PCs

[Figure: converting ALL L1 hits to L2 latency costs 7.8% performance, while converting only the non-critical L1 hits (49.1% of loads) costs just 0.8%. Converting ALL L2 hits to LLC latency costs 16.1%, while converting only the non-critical L2 hits (39.6% of loads) costs 4.9%.]

The L2 cache is the most amenable to criticality optimizations.
Criticality Aware Tiered Cache Hierarchy (CATCH)

A. Track critical load PCs
   • Those served from non-L1 on-die caches
B. Prefetch critical loads into L1
   • Accelerate the critical path

[Figure: oracle performance potential, with prefetchers disabled. Tracking 32, 128, 2048, or all critical PCs, and a NoL2 + 2048 PC configuration, yields performance gains between roughly 5.5% and 6.6%, with 14.1% to 17.0% of L1 misses converted.]
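A possible shape for the two CATCH ingredients above can be sketched in software. The table size, confidence threshold, and LRU eviction are assumptions of this sketch, not the paper's exact hardware design:

```python
from collections import OrderedDict

class CriticalPCTable:
    """Tracks load PCs flagged as critical, and decides whether a
    critical load served beyond the L1 should be prefetched into L1.
    Sizing and policies here are illustrative assumptions."""

    def __init__(self, capacity=32):
        self.capacity = capacity       # e.g. 32 tracked PCs, as in the figure
        self.table = OrderedDict()     # PC -> confidence, LRU-ordered

    def train(self, pc):
        # Called when critical-path analysis flags a load PC; bumping
        # the count also moves the entry to the MRU position.
        self.table[pc] = self.table.pop(pc, 0) + 1
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)   # evict the LRU entry

    def should_prefetch_to_l1(self, pc, hit_level):
        # Only critical loads served from non-L1 on-die caches benefit.
        return hit_level != "L1" and self.table.get(pc, 0) >= 2

table = CriticalPCTable()
for _ in range(2):
    table.train(0x401A2C)              # hypothetical load PC
print(table.should_prefetch_to_l1(0x401A2C, "L2"))   # True
print(table.should_prefetch_to_l1(0x401A2C, "L1"))   # False
```

The point of the confidence threshold is to prefetch only loads repeatedly seen on the critical path, keeping the small PC budget (32 entries captures most of the oracle potential in the figure) focused on loads whose L1 residency actually shortens the critical path.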