A Fast Analytical Model of Fully Associative Caches spcl.inf.ethz.ch - PowerPoint PPT Presentation

spcl.inf.ethz.ch @spcl_eth T OBIAS G YSI , T OBIAS G ROSSER , L AURIN B RANDNER , AND T ORSTEN H OEFLER A Fast Analytical Model of Fully Associative Caches

spcl.inf.ethz.ch @spcl_eth The Cost of Data Movement Depends on Global State and Does Not Compose int N = 1000; for ( int i = 0; i < N; i++) { for ( int j = 0; j < i; j++) { for ( int k = 0; k < j; k++) { percentage of cache misses? A[i][j] -= A[i][k] * A[j][k]; L1 cache 1.6% } L2 cache 1.4% A[i][j] /= A[j][j]; } most expensive memory access? for ( int k = 0; k < i; k++) { A[j][k] A[i][i] -= A[i][k] * A[i][k]; } amount of compulsory and capacity misses? A[i][i] = sqrt(A[i][i]); # compulsory misses 31,752 } # capacity misses 10,630,620 Cholesky kernel from https://sourceforge.net/projects/polybench/ 2

spcl.inf.ethz.ch @spcl_eth HayStack Output for Cholesky Factorization parameters: relative number of cache misses (statement) • cache sizes (32k and 512k) • cacheline size (64B) 5 for (int i = 0; i < N; i++) { 6 for (int j = 0; j < i; j++) { 7 for (int k = 0; k < j; k++) { 8 A[i][j] -= A[i][k] * A[j][k]; ----------------------------------------------------------------------------- ref type comp[%] L1[%] L2[%] tot[%] reuse[ln] A[i][j] rd 0.00459 0.00000 0.00000 24.86910 8,10 A[i][k] rd 0.00000 0.00000 0.00000 24.86910 8,10 A[j][k] rd 0.00000 1.58635 1.38213 24.86910 8,10,13,15 A[i][j] wr 0.00000 0.00000 0.00000 24.86910 8 ----------------------------------------------------------------------------- absolute number of cache misses (program) compulsory: 31'752 capacity (L1): 10'630'620 capacity (L2): 9'258'460 total: 668'166'500 3

spcl.inf.ethz.ch @spcl_eth Comparison to Simulation 1 day 1 hour 1 minute 1 seconds 4

spcl.inf.ethz.ch @spcl_eth Symbolic Counting Avoids the Explicit Enumeration 1d illustration i j Barvinok algorithm enumeration #points = 3 #points = 1 #points = 2 #points = j-i+1 = 3 symbolic i j enumeration #points = 1 #points = 2 #points = 3 #points = 4 #points = 5 #points = 6 #points = 7 #points = 8 #points = 9 symbolic #points = j-i+1 = 9 Alexander I. Barvinok, A Polynomial Time Algorithm for Counting Integral Points in Polyhedra When the Dimension is Fixed . 1994. 5

spcl.inf.ethz.ch @spcl_eth The LRU Stack Distance Allows Us to Model Fully Associative Caches example memory accesses LRU stack int sum = 0; distance 3 distance 4 distance 2 distance 1 i=0 M(0) M(0) M(1) M(0) M(2) M(1) M(3) M(2) for ( int i=0; i<4; ++i) hit i=1 M(1) M(0) M(1) M(2) M(2) M(3) M(1) S0: M[i] = i; for ( int j=0; j<4; ++j) i=2 M(2) M(1) M(0) M(3) M(2) S1: sum += M[3-j]; miss i=3 M(3) M(3) M(0) j=0 M(3) j=1 M(2) deliberately generic model j=2 M(1) j=3 M(0) Richard L Mattson, Jan Gecsei, Donald R Slutz, and Irving L Traiger, Evaluation techniques for storage hierarchies. 1970. 6

spcl.inf.ethz.ch @spcl_eth Compute the LRU Stack Distance example 𝑞 𝑘 = 𝑘 + 1 int sum = 0; 4 for ( int i=0; i<4; ++i) S0: M[i] = i; stack distance 3 for ( int j=0; j<4; ++j) S1: sum += M[3-j]; 2 1 apply symbolic counting once j 0 1 2 3 4 Kristof Beyls and Erik H D’Hollander , Generating cache hints for improved program efficiency . 2005. 7

spcl.inf.ethz.ch @spcl_eth Count the Cache Misses Given the LRU Stack Distance example 𝑞 𝑘 = 𝑘 + 1 int sum = 0; misses 4 for ( int i=0; i<4; ++i) hits S0: M[i] = i; stack distance 3 for ( int j=0; j<4; ++j) S1: sum += M[3-j]; cache size C=2 2 1 apply symbolic counting twice j 0 1 2 3 4 many different pieces and 𝑘 ∶ 𝒒 𝒌 > 𝑫 ∧ 0 ≤ 𝑘 < 4 𝑘 ∶ 𝒒 𝒌 > 𝑫 ∧ 0 ≤ 𝑘 < 4 = 2 sometimes non-affine polynomials 8

spcl.inf.ethz.ch @spcl_eth Some Access Patterns Result in Non-Linearities example original int sum = 0; int sum = 0; for ( int i=0; i<4; ++i) for ( int t=0; t<4; ++t) { S0: M[i] = i; for ( int i=0; i<4; ++i) for ( int j=0; j<4; ++j) S0: M[i] = i; i j S1: sum += M[3-j]; for ( int m=0; m<t; ++m) 𝑞 𝑘 = 𝑘 + 1 for ( int n=0; n<t; ++n) N[m][n] = t; additional time loop for ( int j=0; j<4; ++j) t S1: sum += M[3-j]; } i j 𝑞 𝑘, 𝑢 = 𝒖 𝟑 + 𝑘 + 1 partial enumeration 9

spcl.inf.ethz.ch @spcl_eth Enumerate the Non-Affine Dimensions p 𝐮=𝟏 𝑘 = 𝟏 + 𝑘 + 1 0 1 2 𝑞 𝑘, 𝑢 = 𝒖 𝟑 + 𝑘 + 1 0 ≤ 𝑢 < 3 (0,0) (1,0) (2,0) 0 π t 𝑞 𝐮=𝟐 𝑘 = 𝟐 + 𝑘 + 1 (0,1) (1,1) (2,1) 1 0 1 2 2 (0,2) (1,2) (2,2) 0 ≤ (𝑘, 𝑢) < 3 p 𝐮=𝟑 𝑘 = 𝟓 + 𝑘 + 1 0 1 2 12.4x speedup due to partial enumeration 10

spcl.inf.ethz.ch @spcl_eth Modelling Cache Lines Introduces Floor Terms example original int sum = 0; for ( int i=0; i<4; ++i) S0: M[i] = i; i j for ( int j=0; j<4; ++j) S1: sum += M[3-j]; 𝑞 𝑘 = 𝑘 + 1 modelling cache lines equalization and rasterization i j 𝑞 𝑘 = 𝑘 𝟑 − 𝒌 − 𝟐 𝒌 + 1 2 𝟑 11

spcl.inf.ethz.ch @spcl_eth Split the Domain to Eliminate Floor Terms 𝑞 𝑘 𝒌%𝟑=𝟏 = 𝑘 2 𝟐 + 1 0 2 𝑞 𝑘 = 𝑘 𝟑 − 𝒌 − 𝟐 𝒌 + 1 2 𝟑 3 0 1 2 𝑞 𝑘 𝒌%𝟑>𝟏 = 𝑘 0 ≤ 𝑘 < 4 2 𝟏 + 1 1 3 1.9x speedup due to equalization 12

spcl.inf.ethz.ch @spcl_eth Accuracy of HayStack for the L1 Cache of Our Test System 13

spcl.inf.ethz.ch @spcl_eth Error of HayStack Compared to Simulation (Dinero IV) HayStack (fully associative) Dinero IV (fully associative) Dinero IV (8-way associative) 14

spcl.inf.ethz.ch @spcl_eth Performance of HayStack for the Large Problem Size of PolyBench 15

spcl.inf.ethz.ch @spcl_eth Performance of HayStack Compared to PolyCache and Dinero Dinero IV PolyCache - simulator - analytical cache model - setup to simulate full associativity - models set associativity - problem size dependent performance - one core per cache set 370x speedup 21x speedup Jan Elder and Mark D. Hill, Dinero IV Trace-Driven Wenlei Bao, Sriram Krishnamoorthy, Louis-Noel Pouchet, and P Sadayappan, Uniprocessor Cache Simulator . 2003. Analytical modeling of cache behavior for affine programs. 2017. 16

spcl.inf.ethz.ch @spcl_eth Conclusion generic model of fully associative caches accurate results compared to measurements fast enough to provide interactive feedback excellent performance compared to alternatives 17

A Fast Analytical Model of Fully Associative Caches spcl.inf.ethz.ch - PowerPoint PPT Presentation

spcl.inf.ethz.ch @spcl_eth T OBIAS G YSI , T OBIAS G ROSSER , L AURIN B RANDNER , AND T ORSTEN H OEFLER A Fast Analytical Model of Fully Associative Caches spcl.inf.ethz.ch @spcl_eth The Cost of Data Movement Depends on Global State and Does

Associative caches (3 rd Ed: p.496-504, 4 th Ed: 479-487) flexible block placement schemes

Lazy Associative Classification Decision Tree Classifier (Eager) Associative Classifier By

Associative arrays Associative arrays map a key to a value Keys and values can be different

In-Place Associative Computing Avidan Akerib Ph.D. Vice President Associative Computing BU

Set-Associative Caches Improve cache hit ratio by allowing a memory location to be placed in

Multicore Workshop Caches Mark Bull David Henty EPCC, University of Edinburgh Overview

Trace Caches and optimizations therein CSE 240C - Rushi Chakrabarti - Winter 2009 Trace Caches

Cache design overview ANY cache can be viewed as k-way associative. What are the pros and cons of

CSE 351: Week 7 Tom Bergan, TA 1 Today Cache geometries Lab 4 2 Caches they make

A Fully Associative Software-Managed Cache Design Erik G. Hallnor and Steven K. Reindhardt

Associative Fine-Tuning of Biologically Inspired Active Neuro-Associative Knowledge Graphs Adrian

CHAPTER II III I CHAPTER Neural Networks as Neural Networks as Associative Memory

Associative dyadic boolean functions Goals Def: A Boolean function f : { 0 , 1 } 2 { 0 , 1 }

Example: Associative Arrays An environment can be expressed as an associative array, e.g.:

10. Left-associative grammar (LAG) 10.1 Rule types and derivation order 10.1.1 The notion

Highly-Associative Caches for Low-Power Processors

TESLA V100 GPU Xudong Shao Houxiang Ji Hao Gao The history of GPU architecture 2017 Volta

Addressing Shared Resource Contention in Multicore Processors via Scheduling ASPLOS 10

Advanced cache memory optimizations Computer Architecture J. Daniel Garca Snchez

Scope-based Method Cache Analysis Benedikt Huber 1 , Stefan Hepp 1 , Martin Schoeberl 2 1 Vienna

Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group

CS654 Advanced Computer Architecture Lec 8 Memory Hierarchy Review Peter Kemper Adapted

Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000

Leases and Cache Coherence Leases Lease - a time-limited right to do something - can be renewed