How to Write Fast Numerical Code, Spring 2011, Lecture 7


  1. How to Write Fast Numerical Code, Spring 2011, Lecture 7
     Instructor: Markus Püschel, TA: Georg Ofenbeck

  2. Last Time: Locality
     - Temporal and spatial locality
     [Diagram: memory access patterns illustrating temporal and spatial locality]

  3. Last Time: Reuse
     - FFT: O(log(n)) reuse
     - MMM: O(n) reuse
     [Plots: performance in Gflop/s vs. input size for the Discrete Fourier Transform (DFT) on 2 x Core 2 Duo 3 GHz (single precision), and vs. matrix size for Matrix-Matrix Multiplication (MMM) on 2 x Core 2 Duo 3 GHz (double precision)]

  4. Today
     - Caches

  5. Cache
     - Definition: computer memory with short access time used for the storage of frequently or recently used instructions or data
     [Diagram: CPU - Cache - Main Memory]
     - Naturally supports temporal locality
     - Spatial locality is supported by transferring data in blocks
     - Core 2: one block = 64 B = 8 doubles

  6. General Cache Mechanics
     - Cache: smaller, faster, more expensive memory; caches a subset of the blocks
     - Data is copied between cache and memory in block-sized transfer units
     - Memory: larger, slower, cheaper; viewed as partitioned into "blocks"
     [Diagram: cache holding a few blocks (e.g., 8, 9, 14, 3); memory partitioned into blocks 0-15; block 10 being transferred into the cache]

  7. General Cache Concepts: Hit
     - Data in block b is needed (example: request for block 14)
     - Block b is in cache: hit!
     [Diagram: request for block 14 served directly from the cache]

  8. General Cache Concepts: Miss
     - Data in block b is needed (example: request for block 12)
     - Block b is not in cache: miss!
     - Block b is fetched from memory and stored in cache
     - Placement policy: determines where b goes
     - Replacement policy: determines which block gets evicted (victim)

  9. Types of Cache Misses (The 3 C's)
     - Compulsory (cold) miss: occurs on the first access to a block
     - Capacity miss: occurs when the working set is larger than the cache
     - Conflict miss: occurs when the cache is large enough, but multiple data objects map to the same slot

  10. Cache Performance Metrics
     - Miss rate: fraction of memory references not found in cache: misses / accesses = 1 - hit rate
     - Hit time: time to deliver a block in the cache to the processor (Core 2: 3 clock cycles for L1, 14 clock cycles for L2)
     - Miss penalty: additional time required because of a miss (Core 2: about 100 cycles for an L2 miss)
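     These metrics combine in the usual back-of-the-envelope model (a worked example added for this transcript, not from the slides): average access time ≈ hit time + miss rate x miss penalty. With the Core 2 numbers above and an assumed 5% miss rate in L2, an access that reaches L2 costs on average about 14 + 0.05 x 100 = 19 cycles; even a small miss rate contributes a large share of the cost.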

  11. General Cache Organization (S, E, B)
     - S = 2^s sets
     - E = 2^e lines per set (E = associativity; E = 1: direct mapped)
     - B = 2^b bytes per cache block (the data)
     - Each line: valid bit, tag, and B data bytes
     - Cache size: S x E x B data bytes

  12. Cache Read
     - Address of the word is split into t tag bits, s set index bits, and b block offset bits
     - Locate the set using the set index bits
     - Check if any line in the set has a matching tag
     - Yes + line valid: hit
     - Locate the data starting at the block offset (data begins at this offset within the B-byte block)
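     To make the address split concrete, here is a minimal C sketch (added for this transcript; the struct and function names are made up) that extracts offset, set index, and tag for a cache with S = 2^s sets and B = 2^b bytes per block. For example, a Core 2-style L1 data cache (32 KB, 8-way, 64 B blocks) has s = 6 and b = 6.

        #include <stdint.h>

        typedef struct { uint64_t tag; uint64_t set; uint64_t offset; } cache_addr;

        /* Split an address into block offset (low b bits), set index (next s bits),
           and tag (all remaining bits). */
        cache_addr split_address(uint64_t addr, int s, int b)
        {
            cache_addr a;
            a.offset = addr & ((1ULL << b) - 1);
            a.set    = (addr >> b) & ((1ULL << s) - 1);
            a.tag    = addr >> (s + b);
            return a;
        }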

  13. Example (S = 8, E = 1)
     Assume: cold (empty) cache, a[0][0] maps to the first set, B = 32 byte = 4 doubles. Ignore the variables sum, i, j.

        int sum_array_rows(double a[16][16])
        {
            int i, j;
            double sum = 0;
            for (i = 0; i < 16; i++)
                for (j = 0; j < 16; j++)
                    sum += a[i][j];
            return sum;
        }

        int sum_array_cols(double a[16][16])
        {
            int i, j;
            double sum = 0;
            for (j = 0; j < 16; j++)
                for (i = 0; i < 16; i++)
                    sum += a[i][j];
            return sum;
        }

     (Analysis done on the blackboard.)
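     A sketch of the blackboard analysis (my working, added for this transcript): the cache holds S x E x B = 8 x 1 x 32 B = 256 B, i.e., two rows of the 16 x 16 array. In sum_array_rows each loaded block of 4 doubles is fully consumed before the next block is needed, so the miss rate is 1/4 (64 misses). In sum_array_cols consecutive accesses are 128 B = 4 blocks apart, so the blocks of a[0][j] and a[2][j] map to the same set and evict each other before any reuse; every access misses (256 misses).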

  14. Example (S = 4, E = 2)
     Same two functions and same assumptions as on the previous slide (cold cache, a[0][0] maps to the first set, B = 32 byte = 4 doubles; ignore the variables sum, i, j), but now with a 2-way set-associative cache of 4 sets.

     (Analysis done on the blackboard.)
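     A sketch of the analysis for this variant (my working, added for this transcript): the cache size is again 4 x 2 x 32 B = 256 B. Row-wise traversal still misses once per block (miss rate 1/4). Column-wise, consecutive accesses are 128 B = 4 blocks apart, which is exactly S blocks, so all 16 blocks of a column map to the same set; with only E = 2 lines per set they thrash, and every access still misses. Associativity alone does not rescue the column traversal here.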

  15. What about writes?
     - What to do on a write-hit?
       - Write-through: write immediately to memory
       - Write-back: defer the write to memory until the line is replaced (needs a dirty bit)
     - What to do on a write-miss?
       - Write-allocate: load the block into the cache, update the line in the cache
       - No-write-allocate: write immediately to memory
     - Core 2: write-back + write-allocate

  16. Small Example, Part 1
     - Cache: E = 1 (direct mapped), S = 2, B = 16 (2 doubles)
     - Array: x = x[0], ..., x[7], accessed twice

        % Matlab-style code
        for j = 0:1
          for i = 0:7
            access(x[i])

     - Access pattern: 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
     - Hit/miss:       M H M H M H M H M H M H M H M H
     - Result: 8 misses, 8 hits
     - Spatial locality: yes; temporal locality: no

  17. Small Example, Part 2
     - Cache: E = 1 (direct mapped), S = 2, B = 16 (2 doubles)
     - Array: x = x[0], ..., x[7], accessed twice

        % Matlab-style code
        for j = 0:1
          for i = 0:2:7
            access(x[i])
          for i = 1:2:7
            access(x[i])

     - Access pattern: 0 2 4 6 1 3 5 7 0 2 4 6 1 3 5 7
     - Hit/miss:       M M M M M M M M M M M M M M M M
     - Result: 16 misses
     - Spatial locality: no; temporal locality: no

  18. Small Example, Part 3
     - Cache: E = 1 (direct mapped), S = 2, B = 16 (2 doubles)
     - Array: x = x[0], ..., x[7], accessed twice

        % Matlab-style code
        for j = 0:1
          for k = 0:1
            for i = 0:3
              access(x[i + 4*j])

     - Access pattern: 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7
     - Hit/miss:       M H M H H H H H M H M H H H H H
     - Result: 4 misses, 12 hits (this is optimal; why?)
     - Spatial locality: yes; temporal locality: yes
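     One way to answer the "why?" (my reasoning, not spelled out on the slide): the array occupies 4 distinct blocks, and on a cold cache each block must be loaded at least once, so 4 compulsory misses are a lower bound for any access order; this blocked schedule meets that bound exactly.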

  19. The Killer: Two-Power Strided Access
     - Array: x = x[0], ..., x[n-1], n >> cache size
     - Cache: E-way set associative (here: 2), S sets
     - Stride 1 (0 1 2 3 ...): spatial locality, full cache used
     - Stride 2 (0 2 4 6 ...): some spatial locality, 1/2 of the cache used
     - Stride 4 (0 4 8 12 ...): no spatial locality, 1/4 of the cache used
     - Same for larger strides:
       - Stride 8 (0 8 16 24 ...): no spatial locality, 1/8 of the cache used
       - Stride 4S (0 4S 8S ...): no spatial locality, 1/(4S) of the cache used
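     To make the effect concrete, a minimal C sketch (added for this transcript; the function name and parameters are made up). Assuming 4 doubles per block as in the fractions above, any stride of 4 or more touches only one double per loaded block, and a two-power stride of 4S or more maps every access to the same cache set:

        /* Sum every stride-th element of a large array. With two-power strides,
           spatial locality degrades and, for large strides, only a fraction of
           the cache sets is ever used. */
        double strided_sum(const double *x, long n, long stride)
        {
            double sum = 0.0;
            for (long i = 0; i < n; i += stride)
                sum += x[i];
            return sum;
        }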

  20. The Killer: Where Does It Occur?
     - Accessing two-power size 2D arrays (e.g., images) columnwise
       - 2D transforms
       - Stencil computations
       - Correlations
     - Various transform algorithms
       - Fast Fourier transform
       - Wavelet transforms
       - Filter banks

  21. Today
     - Linear algebra software: history, LAPACK and BLAS
     - Blocking: key to performance
     - MMM
     - ATLAS: MMM program generator

  22. Linear Algebra Algorithms: Examples
     - Solving systems of linear equations
     - Eigenvalue problems
     - Singular value decomposition
     - LU/Cholesky/QR/... decompositions
     - ... and many others
     - These make up most of the numerical computation across disciplines (sciences, computer science, engineering)
     - Efficient software is extremely relevant

  23. The Path to LAPACK
     - EISPACK and LINPACK
       - Libraries for linear algebra algorithms
       - Developed in the early 70s
       - Jack Dongarra, Jim Bunch, Cleve Moler, Pete Stewart, ...
       - LINPACK is still used as the benchmark for the TOP500 list of the most powerful supercomputers
     - Problem:
       - Implementation is "vector-based," i.e., little locality in data access
       - Low performance on computers with a deep memory hierarchy
       - Became apparent in the 80s
     - Solution: LAPACK
       - Reimplement the algorithms "block-based," i.e., with locality
       - Developed late 1980s, early 1990s
       - Jim Demmel, Jack Dongarra et al.

  24. LAPACK and BLAS
     - Basic idea: LAPACK is static; the Basic Linear Algebra Subroutines (BLAS) are reimplemented for each platform
     - BLAS 1: vector-vector operations (e.g., vector sum), reuse: O(1)
     - BLAS 2: matrix-vector operations (e.g., matrix-vector product), reuse: O(1)
     - BLAS 3: matrix-matrix operations (e.g., MMM), reuse: O(n)
     - LAPACK is implemented on top of BLAS, using BLAS 3 as much as possible
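     As an illustration of the three levels and their reuse (plain-C sketches added for this transcript, not the actual BLAS interfaces):

        /* BLAS 1 style: y = y + alpha*x. O(n) data, O(n) flops -> O(1) reuse. */
        void vec_axpy(int n, double alpha, const double *x, double *y)
        {
            for (int i = 0; i < n; i++)
                y[i] += alpha * x[i];
        }

        /* BLAS 2 style: y = y + A*x, A is n x n row-major. O(n^2) data, O(n^2) flops -> O(1) reuse. */
        void mat_vec(int n, const double *A, const double *x, double *y)
        {
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    y[i] += A[i * n + j] * x[j];
        }

        /* BLAS 3 style: C = C + A*B, all n x n row-major. O(n^2) data, O(n^3) flops -> O(n) reuse. */
        void mat_mat(int n, const double *A, const double *B, double *C)
        {
            for (int i = 0; i < n; i++)
                for (int k = 0; k < n; k++)
                    for (int j = 0; j < n; j++)
                        C[i * n + j] += A[i * n + k] * B[k * n + j];
        }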

  25. Why is BLAS 3 so important?
     - Using BLAS 3 = blocking = enabling reuse
     - Cache analysis for blocking MMM (done on the blackboard)
     - Blocking (for the memory hierarchy) is the single most important optimization for dense linear algebra algorithms
     - Unfortunately: the introduction of multicore processors requires a reimplementation of LAPACK; just multithreading the BLAS is not good enough
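     A minimal sketch of the blocking idea for MMM (my illustration, not the lecture's blackboard code; NB is an assumed tuning parameter, chosen so that three NB x NB blocks fit in cache):

        #define NB 32   /* assumed block size; in practice tuned per platform */

        /* C += A * B for n x n row-major matrices, computed block by block so that
           each NB x NB block of A, B, and C is reused from cache many times. */
        void mmm_blocked(int n, const double *A, const double *B, double *C)
        {
            for (int i0 = 0; i0 < n; i0 += NB)
                for (int j0 = 0; j0 < n; j0 += NB)
                    for (int k0 = 0; k0 < n; k0 += NB)
                        /* mini-MMM on one block; the "&& < n" guards handle the matrix edge */
                        for (int i = i0; i < i0 + NB && i < n; i++)
                            for (int k = k0; k < k0 + NB && k < n; k++) {
                                double a = A[i * n + k];
                                for (int j = j0; j < j0 + NB && j < n; j++)
                                    C[i * n + j] += a * B[k * n + j];
                            }
        }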
