How to Write Fast Numerical Code
Spring 2011, Lecture 8
Instructor: Markus Püschel
TA: Georg Ofenbeck
Reuse (Inherent Temporal Locality)
Reuse of an algorithm = number of operations / minimal number of memory accesses ≈ number of operations / (size of input + size of output data)
Examples:
• Matrix multiplication C = AB + C: 2n^3 / (3n^2) = (2/3) n = O(n)
• Discrete Fourier transform: ≈ 5n log2(n) / (2n) = (5/2) log2(n) = O(log(n))
• Adding two vectors x = x + y: n / (2n) = 1/2 = O(1)
Last Time: Caches
Cache organization: S = 2^s sets, E = 2^e lines per set (E = associativity; E = 1: direct mapped), B = 2^b bytes per cache block (the data); each line has a valid bit and a tag.
The address of a word is split into t tag bits, s set index bits, and b block offset bits; the data begins at this offset within the block.
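As a small illustration (not from the lecture), how the tag/set/offset split can be computed for a given address; the cache geometry and the address below are made-up example values:

```c
#include <stdint.h>
#include <stdio.h>

/* Example cache geometry (made-up values): 2^S_BITS sets, 2^B_BITS bytes
   per block; the remaining high-order address bits form the tag. */
enum { S_BITS = 6, B_BITS = 6 };   /* 64 sets, 64-byte blocks */

int main(void) {
    uintptr_t addr   = 0x7ffe1234;                     /* example address */
    uintptr_t offset = addr & ((1u << B_BITS) - 1);    /* low b bits */
    uintptr_t set    = (addr >> B_BITS) & ((1u << S_BITS) - 1); /* next s bits */
    uintptr_t tag    = addr >> (B_BITS + S_BITS);      /* remaining t bits */
    printf("tag = %#lx, set = %lu, offset = %lu\n",
           (unsigned long)tag, (unsigned long)set, (unsigned long)offset);
    return 0;
}
```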
Last Time: Blocking
Cache misses for MMM (assuming 8 doubles per cache line): without blocking ≈ (9/8) n^3; with blocking into B x B tiles that fit in cache ≈ n^3/(4B).
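A sketch of where these two counts come from (the standard analysis behind the blackboard discussion; assumptions: doubles, a cache line holds 8 doubles, the cache cannot hold a full matrix but holds three B x B tiles):

```latex
% Unblocked triple loop: for each of the n^2 entries of C, one pass over a
% row of A (n/8 misses, 8 doubles per line) and a column of B (n misses,
% stride-n access, one new line per element):
\[
  \text{misses}_{\text{unblocked}} \approx n^2 \left(\frac{n}{8} + n\right)
  = \frac{9}{8}\, n^3 .
\]
% Blocked with B x B tiles: each of the (n/B)^3 tile-tile products brings in
% one new tile of A and one of B, i.e. 2 \cdot B^2/8 = B^2/4 misses
% (the misses for the tiles of C are lower order):
\[
  \text{misses}_{\text{blocked}} \approx \left(\frac{n}{B}\right)^3 \frac{B^2}{4}
  = \frac{n^3}{4B}.
\]
```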
Today
• Linear algebra software: LAPACK and BLAS
• MMM
• ATLAS: MMM program generator
Linear Algebra Algorithms: Examples
• Solving systems of linear equations
• Eigenvalue problems
• Singular value decomposition
• LU/Cholesky/QR/… decompositions
• … and many others
These make up most of the numerical computation across disciplines (sciences, computer science, engineering); efficient software is extremely relevant.
LAPACK and BLAS
Basic idea: LAPACK is static; the BLAS are reimplemented for each platform.
Basic Linear Algebra Subprograms (BLAS):
• BLAS 1: vector-vector operations (e.g., vector sum), reuse O(1)
• BLAS 2: matrix-vector operations (e.g., matrix-vector product), reuse O(1)
• BLAS 3: matrix-matrix operations (e.g., MMM), reuse O(n)
LAPACK is implemented on top of BLAS, using BLAS 3 as much as possible.
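As an illustration (not from the slides), a minimal sketch of calling a BLAS 3 routine through the CBLAS interface; it assumes a CBLAS header is installed as <cblas.h>, and the matrices and size are made up for the example:

```c
#include <cblas.h>   /* CBLAS interface to the BLAS */
#include <stdlib.h>

/* Minimal sketch: C = A*B + C with square n x n matrices in row-major
   order, done with one BLAS 3 call (dgemm) instead of hand-written loops. */
int main(void) {
    int n = 1024;                              /* example size */
    double *A = calloc((size_t)n * n, sizeof *A);
    double *B = calloc((size_t)n * n, sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);
    if (!A || !B || !C) return 1;

    /* C := 1.0 * A * B + 1.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 1.0, C, n);

    free(A); free(B); free(C);
    return 0;
}
```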
Why is BLAS 3 so important?
Using BLAS 3 = blocking: reuse goes from O(1) to O(n).
Cache analysis for blocked MMM (blackboard).
Blocking (for the memory hierarchy) is the single most important optimization for dense linear algebra algorithms.
Unfortunately, the introduction of multicore processors requires a reimplementation of LAPACK; just multithreading the BLAS is not good enough.
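A minimal sketch of the blocking idea in C (my illustration, not the lecture's blackboard code); the tile size nb is a tuning parameter that should be chosen so three nb x nb tiles fit in cache:

```c
#include <stddef.h>

/* Blocked MMM: C = A*B + C, all matrices n x n, row-major.
   nb is the tile size. The three outer loops walk over tiles; the inner
   loops perform a mini-MMM on tiles that stay resident in cache. */
void mmm_blocked(size_t n, size_t nb,
                 const double *A, const double *B, double *C) {
    for (size_t i0 = 0; i0 < n; i0 += nb)
        for (size_t j0 = 0; j0 < n; j0 += nb)
            for (size_t k0 = 0; k0 < n; k0 += nb) {
                size_t imax = i0 + nb < n ? i0 + nb : n;
                size_t jmax = j0 + nb < n ? j0 + nb : n;
                size_t kmax = k0 + nb < n ? k0 + nb : n;
                for (size_t i = i0; i < imax; i++)
                    for (size_t j = j0; j < jmax; j++) {
                        double cij = C[i*n + j];
                        for (size_t k = k0; k < kmax; k++)
                            cij += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = cij;
                    }
            }
}
```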
Matlab
Invented in the late 1970s by Cleve Moler; commercialized (MathWorks) in 1984.
Motivation: make LINPACK and EISPACK easy to use.
Matlab uses LAPACK and other libraries, but can only call them if you operate on matrices and vectors and do not write your own loops:
• A*B (calls an MMM routine)
• A\b (calls a linear system solver)
Today
• Linear algebra software: history, LAPACK and BLAS
• MMM
• ATLAS: MMM program generator
MMM by Definition
Usually computed as C = AB + C.
Cost, as computed before: n^3 multiplications + n^3 additions = 2n^3 floating point operations = O(n^3) runtime.
Blocking increases locality (see previous example) but does not decrease cost.
Can we do better?
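For reference, a straightforward (unblocked) implementation of the definition; each of the n^3 inner iterations does one multiply and one add, which gives the 2n^3 flops above:

```c
#include <stddef.h>

/* Naive MMM by definition: C = A*B + C, all n x n, row-major.
   n^3 multiply-add pairs = 2n^3 floating point operations. */
void mmm_naive(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```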
Strassen's Algorithm
Strassen, V., "Gaussian Elimination is Not Optimal," Numerische Mathematik 13, 354-356, 1969.
Until then, MMM was thought to be Θ(n^3).
Recurrence T(n) = 7T(n/2) + O(n^2): multiplies two n x n matrices in O(n^log2(7)) ≈ O(n^2.808).
Crossover point, in terms of cost: n = 654, but …
• Structure more complex → performance crossover much later
• Numerical stability inferior
Can we do better?
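A one-line justification of the exponent (standard recurrence reasoning, not spelled out on the slide):

```latex
% Seven half-size multiplications per level, plus O(n^2) additions:
\[
  T(n) = 7\,T(n/2) + O(n^2)
  \;\Longrightarrow\;
  T(n) = O\!\left(n^{\log_2 7}\right) \approx O\!\left(n^{2.808}\right),
\]
% since the recursion performs 7^{\log_2 n} = n^{\log_2 7} scalar
% multiplications at the leaves, which dominates the O(n^2) additions
% done at each level.
```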
MMM Complexity: What is Known
Coppersmith, D. and Winograd, S., "Matrix Multiplication via Arithmetic Progressions," J. Symb. Comput. 9, 251-280, 1990.
MMM is O(n^2.376).
MMM is obviously Ω(n^2); it could well be Θ(n^2).
Compare this to matrix-vector multiplication: known to be Θ(n^2) (Winograd), i.e., boring.
Today
• Linear algebra software: history, LAPACK and BLAS
• MMM
• ATLAS: MMM program generator
MMM: Memory Hierarchy Optimization
[Performance plot: MMM (square, real, double) on a Core 2 Duo, 3 GHz; performance vs. matrix size for the triple loop and ATLAS-generated code, with the theoretical scalar peak marked; Intel compiler icc -O2]
• Huge performance difference for large sizes
• Great case study to learn memory hierarchy optimization
ATLAS
Successor of PhiPAC; a BLAS program generator (web).
Idea: automatic porting; LAPACK stays static, the BLAS are regenerated for each platform.
People can also contribute handwritten code.
The generator uses empirical search over implementation alternatives to find the fastest implementation (a toy sketch of such a timing-based search follows).
No vectorization or parallelization in the generator, so it is not really used anymore.
We focus on BLAS 3 MMM.
Search only over algorithms of cost 2n^3 (cost equal to the triple loop).
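To make "empirical search" concrete, a toy sketch (my illustration, far simpler than ATLAS itself): time one candidate parameter, the tile size of the blocked routine from the earlier sketch, and keep the fastest. The candidate list and problem size are made up:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* From the blocking sketch earlier: blocked MMM with tile size nb. */
void mmm_blocked(size_t n, size_t nb,
                 const double *A, const double *B, double *C);

/* Toy empirical search: time each candidate tile size on a fixed problem
   size and keep the fastest. ATLAS performs a much richer search of this
   kind at installation time. */
int main(void) {
    size_t n = 512;                                   /* example size */
    size_t candidates[] = {16, 32, 48, 64, 80, 96, 128};
    size_t ncand = sizeof candidates / sizeof candidates[0];
    double *A = calloc(n * n, sizeof *A);
    double *B = calloc(n * n, sizeof *B);
    double *C = calloc(n * n, sizeof *C);
    if (!A || !B || !C) return 1;

    size_t best_nb = candidates[0];
    double best_time = 1e30;
    for (size_t i = 0; i < ncand; i++) {
        clock_t start = clock();
        mmm_blocked(n, candidates[i], A, B, C);
        double t = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("nb = %3zu: %.3f s\n", candidates[i], t);
        if (t < best_time) { best_time = t; best_nb = candidates[i]; }
    }
    printf("best nb = %zu\n", best_nb);
    free(A); free(B); free(C);
    return 0;
}
```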
ATLAS Architecture
[Block diagram: Detect Hardware Parameters (L1Size, NR, MulAdd, L*) → ATLAS Search Engine (MMSearch) → ATLAS MM Code Generator (MMCase) → MiniMMM source → compile, execute, measure MFLOPS → feedback to the search]
Search parameters (NB, MU, NU, KU, xFetch, MulAdd, Latency):
• span the search space
• specify the code
• found by orthogonal line search
Hardware parameters:
• L1Size: size of L1 data cache
• NR: number of registers
• MulAdd: fused multiply-add available?
• L*: latency of FP multiplication
Source: Pingali, Yotov, Cornell U.
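To give a feel for what a generated mini-MMM looks like, a hand-written sketch of register blocking with unroll factors MU, NU, KU (my simplification; the real ATLAS-generated code also handles instruction scheduling, latency, and fetch patterns, and the parameter values below are made up):

```c
#include <stddef.h>

/* Example unroll factors (made up; ATLAS searches for them per platform).
   NB must be divisible by MU, NU and KU in this simplified sketch. */
#define NB 48
#define MU 4
#define NU 2
#define KU 2

/* Mini-MMM on one NB x NB tile (row-major within the tile):
   an MU x NU block of C is accumulated in scalars that the compiler can
   keep in registers; the k loop is unrolled by KU. */
void mini_mmm(const double *A, const double *B, double *C) {
    for (size_t i = 0; i < NB; i += MU)
        for (size_t j = 0; j < NB; j += NU) {
            double c[MU][NU];
            for (size_t mi = 0; mi < MU; mi++)        /* load C block */
                for (size_t nj = 0; nj < NU; nj++)
                    c[mi][nj] = C[(i + mi) * NB + (j + nj)];
            for (size_t k = 0; k < NB; k += KU)        /* unrolled k loop */
                for (size_t ku = 0; ku < KU; ku++)
                    for (size_t mi = 0; mi < MU; mi++)
                        for (size_t nj = 0; nj < NU; nj++)
                            c[mi][nj] += A[(i + mi) * NB + (k + ku)]
                                       * B[(k + ku) * NB + (j + nj)];
            for (size_t mi = 0; mi < MU; mi++)        /* store C block */
                for (size_t nj = 0; nj < NU; nj++)
                    C[(i + mi) * NB + (j + nj)] = c[mi][nj];
        }
}
```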
How ATLAS Works (blackboard)
References:
• R. C. Whaley, A. Petitet, and J. Dongarra, "Automated Empirical Optimization of Software and the ATLAS Project," Parallel Computing, 27(1-2):3-35, 2001.
• K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill, "Is Search Really Necessary to Generate High-Performance BLAS?," Proceedings of the IEEE, 93(2):358-386, 2005. Link.
Our presentation is based on this paper.