
Criticality Aware Tiered Cache Hierarchy (CATCH)

Anant Nori*, Jayesh Gaur*, Siddharth Rai#, Sreenivas Subramoney*, Hong Wang*
* Microarchitecture Research Lab, Intel
# Indian Institute of Technology Kanpur, India


  1. Criticality Aware Tiered Cache Hierarchy (CATCH). Anant Nori*, Jayesh Gaur*, Siddharth Rai#, Sreenivas Subramoney*, Hong Wang*. (* Microarchitecture Research Lab, Intel; # Indian Institute of Technology Kanpur, India)

  2. Popular Three Level Cache Hierarchy

     Cache capacity ↔ access latency
     • Target low average latency
     • Large distributed LLC, high latency
     • Lower L2 latency important
     • Trend to larger L2 sizes

     Skylake-like server:
     • L1 (private), data and code: 32 KB, 5 cycles
     • L2 (private): 1 MB, 15 cycles
     • LLC (shared, exclusive): 1.375 MB/core, 5.5 MB for 4 cores, 40 cycles

     Broadwell-like server:
     • L1 (private), data and code: 32 KB
     • L2 (private): 256 KB
     • LLC (shared, inclusive): 2 MB/core, 8 MB for 4 cores

     [Figure: performance impact of removing the L2 for two configurations, "NoL2 + 6.5MB LLC" and "NoL2 + 9.5MB LLC", across client, FSPEC, HPC, ISPEC, server, and GeoMean. Data labels: -2.7%, -3.3%, -4.0%, -4.2%, -5.1%, -6.5%, -6.8%, -6.9%, -7.0%, -7.8%, -9.5%, -14.0%.]

     Is a large L2 the most efficient design choice?
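The capacity-versus-latency trade-off on this slide can be made concrete with an average load latency calculation. The sketch below uses the slide's Skylake-like latencies (5, 15, and 40 cycles). The fractions of loads served by each level and the 200-cycle memory latency are illustrative assumptions, not measurements from the talk.

```python
# Latencies in cycles. L1/L2/LLC come from the slide; MEM is an assumption.
LATENCY = {"L1": 5, "L2": 15, "LLC": 40, "MEM": 200}

# Assumed fraction of all loads served by each level (illustrative only).
SERVED = {"L1": 0.90, "L2": 0.06, "LLC": 0.03, "MEM": 0.01}


def average_load_latency(served, latency):
    """Weighted average of per-level latency over the loads each level serves."""
    return sum(fraction * latency[level] for level, fraction in served.items())


def without_l2(served):
    """Model removing the L2: loads that hit in L2 are served by the LLC instead."""
    moved = dict(served)
    moved["LLC"] += moved.pop("L2")
    return moved


with_l2 = average_load_latency(SERVED, LATENCY)
no_l2 = average_load_latency(without_l2(SERVED), LATENCY)
print(f"with L2: {with_l2:.1f} cycles, without L2: {no_l2:.1f} cycles")
```

Under these assumed fractions the average rises from about 8.6 to about 10.1 cycles when L2 hits pay LLC latency. Average latency alone does not say how much of that increase lands on the critical path.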

  3. Large L2 caches

     • Inclusive LLC → Exclusive LLC
     • Lower effective on-die cache per core
     • Large LLC better for multiple threads with disparate cache footprints
     • Area for Snoop-filter/Coherence-directory

     [Diagram: L2 and LLC arrangement per core under an inclusive and an exclusive hierarchy.]

     Despite area and power overheads, average latency reduction (performance) drives large L2.

  4. Loads and Program Criticality

     Program execution expressed in a Data Dependency Graph (Fields et al.)
     • Execution time governed by the "Critical Path"

     [Figure: data dependency graph with D, E, and C nodes for seven instructions. Four of the instructions are loads, labelled L2 hit, LLC hit, L2 hit, and LLC miss. Edge weights shown: 2, 10, 30, and 200.]

     • Critical L2-hit load: ~9% performance loss if the L2 hit becomes an LLC hit
     • Non-critical L2-hit load: no performance impact if the L2 hit becomes an LLC hit

     Only critical load L2 hits matter to performance.
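The contrast between critical and non-critical L2 hits can be illustrated with a longest-path computation over a small dependency graph. This is a simplified sketch: it models only data-dependence edges between execute nodes and ignores the dispatch and commit constraints of the full Fields et al. graph. The five-instruction program is a made-up toy, not the graph on the slide. The latencies 2, 10, 30, and 200 are taken from the slide's edge labels.

```python
ALU, L2_HIT, LLC_HIT, LLC_MISS = 2, 10, 30, 200


def execution_time(latencies, deps):
    """Length of the longest dependence chain.

    latencies[i] is the execute latency of instruction i; deps[i] lists the
    earlier instructions whose results instruction i consumes.
    """
    finish = []
    for i, latency in enumerate(latencies):
        start = max((finish[d] for d in deps[i]), default=0)
        finish.append(start + latency)
    return max(finish)


# Toy program (hypothetical):
#   i0: load, LLC miss
#   i1: load, L2 hit, independent of i0 (off the critical path)
#   i2: ALU op using i1
#   i3: load, L2 hit, address depends on i0 (on the critical path)
#   i4: ALU op using i2 and i3
deps = [[], [], [1], [0], [2, 3]]
base = [LLC_MISS, L2_HIT, ALU, L2_HIT, ALU]

slow_noncritical = list(base)
slow_noncritical[1] = LLC_HIT  # non-critical L2 hit becomes an LLC hit

slow_critical = list(base)
slow_critical[3] = LLC_HIT  # critical L2 hit becomes an LLC hit

print(execution_time(base, deps))
print(execution_time(slow_noncritical, deps))
print(execution_time(slow_critical, deps))
```

In this toy, slowing the load that runs in the shadow of the LLC miss leaves the total unchanged, while slowing the load that depends on the miss lengthens the total by the full latency difference.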

  5. Cache Hierarchy and Program Criticality

     Oracle study
     • Track critical load PCs
     • Increase latencies of targeted load PCs

     [Figure: oracle results for two experiments, "L1 hits to L2 lat." and "L2 hits to LLC lat.", each applied to ALL loads and to NonCritical loads only. Left axis: performance impact. Right axis: % loads converted to higher latency. Performance impact labels: -0.8%, -4.9%, -7.8%, -16.1%. % loads converted labels: 49.1%, 39.6%.]

     L2 cache most amenable to criticality optimizations.
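The oracle methodology (pick a set of load PCs, then serve their hits at the next level's latency) can be sketched as a transformation over a load trace. Everything here is an illustrative assumption: the trace format, the `convert_hits` helper, and the PCs are hypothetical, and a real study would feed the adjusted latencies into a timing simulator rather than just list them. The latencies are the 5, 15, and 40 cycles from the Skylake-like configuration.

```python
LATENCY = {"L1": 5, "L2": 15, "LLC": 40}


def convert_hits(trace, critical_pcs, from_level, to_level, only_noncritical):
    """Return per-load latencies and the fraction of loads converted.

    trace is a list of (pc, level) pairs, where level is the cache level that
    served the load. Loads hitting in from_level are charged to_level latency;
    with only_noncritical=True, loads from critical PCs are left alone.
    """
    latencies = []
    converted = 0
    for pc, level in trace:
        targeted = level == from_level and not (only_noncritical and pc in critical_pcs)
        if targeted:
            converted += 1
        latencies.append(LATENCY[to_level] if targeted else LATENCY[level])
    return latencies, converted / len(trace)


# Hypothetical trace: load PC and the level that served it.
trace = [(0x400, "L1"), (0x408, "L2"), (0x410, "L2"), (0x400, "L1"), (0x418, "LLC")]
critical_pcs = {0x408}  # assumed output of a criticality oracle

lat_all, frac_all = convert_hits(trace, critical_pcs, "L2", "LLC", only_noncritical=False)
lat_nc, frac_nc = convert_hits(trace, critical_pcs, "L2", "LLC", only_noncritical=True)
```

Comparing the simulated performance of the two variants is what separates the cost of slowing all L2 hits from the cost of slowing only the non-critical ones.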

  6. Criticality Aware Tiered Cache Hierarchy (CATCH)

     A. Track critical load PCs
        • Served from non-L1 on-die caches
     B. Prefetch critical loads into L1
        • Accelerate the critical path

     Oracle Performance Potential (*prefetchers disabled)

     [Figure: performance impact (left axis, 0% to 10%) and % L1 misses converted (right axis, 0% to 20%) when tracking 32 PC, 128 PC, 2048 PC, All PC, and NoL2 + 2048 PC. Data labels: 17.0%, 14.1%, 15.5%, 6.6%, 6.2%, 6.1%, 5.8%, 5.5%.]
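Steps A and B suggest a small table of critical load PCs that gates prefetching into L1, with the table size (32, 128, or 2048 PCs) as the knob explored in the chart. The sketch below is one possible shape for such a table, not the mechanism from the talk: the class name, the LRU replacement policy, and the `should_prefetch_to_l1` check are all assumptions made for illustration.

```python
from collections import OrderedDict


class CriticalPCTable:
    """Bounded set of critical load PCs with LRU replacement (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._pcs = OrderedDict()

    def record_critical(self, pc):
        """Note that a load at this PC was found on the critical path."""
        if pc in self._pcs:
            self._pcs.move_to_end(pc)  # refresh recency
            return
        if len(self._pcs) >= self.capacity:
            self._pcs.popitem(last=False)  # evict the least recently used PC
        self._pcs[pc] = True

    def should_prefetch_to_l1(self, pc, served_from):
        """Step A filter: tracked PC whose load was served by a non-L1 on-die cache."""
        return pc in self._pcs and served_from in ("L2", "LLC")


table = CriticalPCTable(capacity=2)
for pc in (0x10, 0x20, 0x10, 0x30):  # 0x20 is evicted when 0x30 arrives
    table.record_critical(pc)

decisions = [
    table.should_prefetch_to_l1(0x10, "L2"),
    table.should_prefetch_to_l1(0x20, "L2"),
    table.should_prefetch_to_l1(0x30, "L1"),
]
```

A hardware table would more likely be set-associative with confidence counters; an ordered dictionary is used here only to keep the capacity limit easy to see.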
