1 Blocking Example Reducing Conflict Misses by Blocking /* After - PDF document

Reducing Misses by Compiler Optimizing Memory Layout or Access Pattern McFarling [1989] reduced caches misses by 75% Lecture 15: Application-level Cache on 8KB direct mapped cache, 4 byte blocks in software Instructions Optimizations � Reorder procedures in memory so as to reduce conflict misses � Profiling to look at conflicts Data � Merging Arrays: improve spatial locality by single array of compound elements vs. 2 arrays � Loop Interchange: change nesting of loops to access data in order stored in memory � Loop Fusion: Combine 2 independent loops that have same looping and some variables overlap � Blocking: Improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows 1 2 Adapted from UCB CS252 S01 Adapted from UCB CS252 S01, Revised by Zhao Zhang in ISU CPRE 585 F04 Merging Arrays Example Loop Interchange Example /* Before: 2 sequential arrays */ /* Before */ int val[SIZE]; for (k = 0; k < 100; k = k+1) int key[SIZE]; for (j = 0; j < 100; j = j+1) for (i = 0; i < 5000; i = i+1) /* After: 1 array of stuctures */ x[i][j] = 2 * x[i][j]; struct merge { /* After */ int val; for (k = 0; k < 100; k = k+1) int key; for (i = 0; i < 5000; i = i+1) }; for (j = 0; j < 100; j = j+1) struct merge merged_array[SIZE]; x[i][j] = 2 * x[i][j]; Reducing potential conflicts between val & key; Sequential accesses instead of striding through memory Improve spatial locality every 100 words; improved spatial locality 3 4 Loop Fusion Example Blocking Example: Dense Matrix Multipication /* Before */ /* Before */ for (i = 0; i < N; i = i+1) for (i = 0; i < N; i = i+1) for (j = 0; j < N; j = j+1) for (j = 0; j < N; j = j+1) a[i][j] = 1/b[i][j] * c[i][j]; {r = 0; for (i = 0; i < N; i = i+1) for (k = 0; k < N; k = k+1){ for (j = 0; j < N; j = j+1) r = r + y[i][k]*z[k][j];}; d[i][j] = a[i][j] + c[i][j]; x[i][j] = r; /* After */ }; for (i = 0; i < N; i = i+1) Two Inner Loops: for (j = 0; j < N; j = j+1) { a[i][j] = 1/b[i][j] * c[i][j]; � Read all NxN elements of z[] d[i][j] = a[i][j] + c[i][j];} � Read N elements of 1 row of y[] repeatedly � Write N elements of 1 row of x[] 2 misses per access to a & c vs. one miss per access; improve temporal locality Capacity Misses a function of N & Cache Size: � 2N 3 + N 2 => (assuming no conflict; otherwise …) Idea: compute on BxB submatrix that fits 5 6 1

Blocking Example Reducing Conflict Misses by Blocking /* After */ for (jj = 0; jj < N; jj = jj+B) for (kk = 0; kk < N; kk = kk+B) 0.1 for (i = 0; i < N; i = i+1) for (j = jj; j < min(jj+B-1,N); j = j+1) {r = 0; for (k = kk; k < min(kk+B-1,N); k = k+1) { Direct Mapped Cache 0.05 r = r + y[i][k]*z[k][j];}; x[i][j] = x[i][j] + r; }; Fully Associative Cache 0 B called Blocking Factor 0 50 100 150 Capacity Misses from 2N 3 + N 2 to N 3 /B+2N 2 Blocking Factor But may suffer from conflict misses Conflict misses in caches not FA vs. Blocking size • Choose the best blocking factor 7 8 Reducing Conflict Misses by OS Methods in Reducing L2 Cache Copying or Padding Conflicts Conflicts are caused by “bad” mapping; can mapping be changed? Copying: If some date cause severe cache Note L2 cache is usually physically indexed! conflict misses, copy them to another region [Gatlin et al. HPCA’99] OS Approaches: change the mapping between virtual memory and physical memory Padding: Insert padding space to change the � Dynamically detected memory pages that cause severe mapping of the data onto cache [zhang et al. conflict misses � Change the physical page of those pages so that they are SC’99] not mapped onto the same sets in cache � Needs hardware support (cache miss lookaside buffer) Automated approaches: let compiler [bershad et al, ISCA’94] chooses the best method after analysis 9 10 Summary of Compiler Optimizations to Reducing Misses by Software Prefetching Reduce Cache Misses (by hand) Data vpenta (nasa7) Data Prefetch gmty (nasa7) � Load data into register (HP PA-RISC loads) � Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9) tomcatv � Special prefetching instructions cannot cause faults; a form of btrix (nasa7) speculative execution Prefetching comes in two flavors: mxm (nasa7) � Binding prefetch: Requests load directly into register. spice � Must be correct address and register! cholesky � Non-Binding prefetch: Load into cache. (nasa7) compress � Can be incorrect. Frees HW/SW to guess! Issuing Prefetch Instructions takes time 1 1.5 2 2.5 3 � Is cost of prefetch issues < savings in reduced misses? � Higher superscalar reduces difficulty of issue bandwidth Performance Improvement merged loop loop fusion blocking arrays interchange 11 12 2

Cache Optimization Summary Cache Optimization Summary Technique MP MR HT Complexity Multilevel cache + 2 Technique MP MR HT Complexity Critical work first + 2 penalty Nonblocking caches + 3 miss Read first + 1 penalty Hardware prefetching + 2/3 miss Merging write buffer + 1 Software prefetching + + 3 Victim caches + + 2 Small and simple cache - + 0 Larger block - + 0 Avoiding address translation + 2 miss rate Larger cache + - 1 hit time Pipeline cache access + 1 Higher associativity + - 1 Trace cache + 3 Way prediction + 2 Pseudoassociative + 2 Compiler techniques + 0 13 14 3

1 Blocking Example Reducing Conflict Misses by Blocking /* After - PDF document

Reducing Misses by Compiler Optimizing Memory Layout or Access Pattern McFarling [1989] reduced caches misses by 75% Lecture 15: Application-level Cache on 8KB direct mapped cache, 4 byte blocks in software Instructions Optimizations

Controlling Loops in Parallel Mercury Code Paul Bone, Zoltan Somogyi and Peter Schachte National

How to Establish Loop-Free Multipath Routes in Named Data Networking? NDNcomm 2017 Klaus

The Eikonal Limit and Post-Minkowskian Scattering Talk by P.H. Damgaard at QCD Meets Gravity

Multifractal Volatility: Multifractal Volatility: Theory, Forecasting, and Pricing Theory,

MaD2: An Ultra-Performance Stream Cipher for Pervasive Data Encryption Jie Li and Jianliang Zheng

Virtual Memory Goals for Today Virtual memory Mechanism Mechanism How does it

Micro-structural analysis & radiation stability studies in undoped and cerium doped

Spontaneous symmetry breaking - long range order 5 Ginzburg-Landau theory (Ising model of

Infinite discrete symmetries near singularities and modular forms Axel Kleinschmidt (Albert

CSE 152: Computer Vision Hao Su Lecture 9: Convolutional Neural Network and Learning Recap:

Central Drift Chamber for the Belle experiment Kota Nakagiri (KEK) on behalf of Belle CDC

Strichartz inequalities on non compact manifolds Jean-Marc Bouclet Institut de Mathmatiques de

Higher partial waves in ! "# interaction . Albert Feijoo Aliau Nuclear Physics Institute,

Accessing localization properties of many - body systems with quantum Monte Carlo Fabien Alet

Maps A map is an equivalence class of labeled graphs embedded on a compact Riemann surface.

Adiabatic quantum optimization applied to the stereo matching problem William Cruz-Santos 1

On the cubic-quintic Schr odinger equation R emi Carles CNRS & Univ Rennes Based on a

On a low Mach number limit for Supernovae Donatella Donatelli joint work with E. Feireisl (

Hidden-charm Pentaquarks in Constituent Quark Models Exotic Hadrons Hadron is a

foldr, end of lexical scoping Review of foldr foldr (some6mes

Programming Abstractions Week 9-2: Exam 2 Review Stephen Checkoway Exam Format n mostly

Improving Link Discovery using context-aware link specifications LDOW 2016 PhD candidate Andrea

Any questions on lists? Loops Slices Len + indexing Slices: lst[start:stop] Goes from start

En v ironments , Reference Beha v ior , & Shared Fields OBJ E C T - OR IE N TE D P R OG R

1 Blocking Example Reducing Conflict Misses by Blocking /* After - PDF document

Reducing Misses by Compiler Optimizing Memory Layout or Access Pattern McFarling [1989] reduced caches misses by 75% Lecture 15: Application-level Cache on 8KB direct mapped cache, 4 byte blocks in software Instructions Optimizations

Controlling Loops in Parallel Mercury Code Paul Bone, Zoltan Somogyi and Peter Schachte National

How to Establish Loop-Free Multipath Routes in Named Data Networking? NDNcomm 2017 Klaus

The Eikonal Limit and Post-Minkowskian Scattering Talk by P.H. Damgaard at QCD Meets Gravity

Multifractal Volatility: Multifractal Volatility: Theory, Forecasting, and Pricing Theory,

MaD2: An Ultra-Performance Stream Cipher for Pervasive Data Encryption Jie Li and Jianliang Zheng

Virtual Memory Goals for Today Virtual memory Mechanism Mechanism How does it

Micro-structural analysis &amp; radiation stability studies in undoped and cerium doped

Spontaneous symmetry breaking - long range order 5 Ginzburg-Landau theory (Ising model of

Infinite discrete symmetries near singularities and modular forms Axel Kleinschmidt (Albert

CSE 152: Computer Vision Hao Su Lecture 9: Convolutional Neural Network and Learning Recap:

Central Drift Chamber for the Belle experiment Kota Nakagiri (KEK) on behalf of Belle CDC

Strichartz inequalities on non compact manifolds Jean-Marc Bouclet Institut de Mathmatiques de

Higher partial waves in ! &quot;# interaction . Albert Feijoo Aliau Nuclear Physics Institute,

Accessing localization properties of many - body systems with quantum Monte Carlo Fabien Alet

Maps A map is an equivalence class of labeled graphs embedded on a compact Riemann surface.

Adiabatic quantum optimization applied to the stereo matching problem William Cruz-Santos 1

On the cubic-quintic Schr odinger equation R emi Carles CNRS &amp; Univ Rennes Based on a

On a low Mach number limit for Supernovae Donatella Donatelli joint work with E. Feireisl (

Hidden-charm Pentaquarks in Constituent Quark Models Exotic Hadrons Hadron is a

foldr, end of lexical scoping Review of foldr foldr (some6mes

Programming Abstractions Week 9-2: Exam 2 Review Stephen Checkoway Exam Format n mostly

Improving Link Discovery using context-aware link specifications LDOW 2016 PhD candidate Andrea

Any questions on lists? Loops Slices Len + indexing Slices: lst[start:stop] Goes from start

En v ironments , Reference Beha v ior , &amp; Shared Fields OBJ E C T - OR IE N TE D P R OG R

Micro-structural analysis & radiation stability studies in undoped and cerium doped

Higher partial waves in ! "# interaction . Albert Feijoo Aliau Nuclear Physics Institute,

On the cubic-quintic Schr odinger equation R emi Carles CNRS & Univ Rennes Based on a

En v ironments , Reference Beha v ior , & Shared Fields OBJ E C T - OR IE N TE D P R OG R