The M4RI & M4RIE libraries for linear algebra over F_2 and small extensions
Martin R. Albrecht
Nancy, March 30, 2011
Outline
◮ M4RI: Introduction, Multiplication, Elimination
◮ M4RIE: Introduction, Travolta Tables, Karatsuba Multiplication, Results
◮ Final
The M4RI Library
◮ available under the GPL Version 2 or later (GPLv2+)
◮ provides basic arithmetic (addition, equality testing, stacking, augmenting, sub-matrices, randomisation, etc.)
◮ implements asymptotically fast multiplication [ABH10]
◮ implements asymptotically fast decomposition [AP10]
◮ implements some multi-core support
◮ Linux, Mac OS X (x86 and PPC), OpenSolaris (Sun Studio Express) and Windows (Cygwin)
http://m4ri.sagemath.org
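For orientation, a minimal C program against the M4RI API might look as follows. This is a sketch, not from the slides: the names mzd_init, mzd_randomize, mzd_mul, mzd_echelonize and mzd_free are assumed to match the library's public header, so check the documentation of your M4RI version before relying on the exact signatures.

```c
/* Minimal M4RI usage sketch; compile with: gcc demo.c -lm4ri
 * The mzd_* names are assumed to match the installed M4RI version. */
#include <m4ri/m4ri.h>
#include <stdio.h>

int main(void) {
    mzd_t *A = mzd_init(1000, 1000);   /* 1000 x 1000 matrix over F_2   */
    mzd_t *B = mzd_init(1000, 1000);
    mzd_randomize(A);
    mzd_randomize(B);

    mzd_t *C = mzd_mul(NULL, A, B, 0); /* C = A*B, 0 = default cutoff   */
    rci_t r = mzd_echelonize(C, 1);    /* reduced row echelon form      */
    printf("rank of A*B: %d\n", (int)r);

    mzd_free(A); mzd_free(B); mzd_free(C);
    return 0;
}
```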
F_2
◮ the field with two elements
◮ addition is logical bitwise XOR (⊕), multiplication is logical bitwise AND (⊙):

  a b | a ⊕ b | a ⊙ b
  0 0 |   0   |   0
  0 1 |   1   |   0
  1 0 |   1   |   0
  1 1 |   0   |   1

◮ 64 (128) basic operations in at most one CPU cycle
◮ . . . arithmetic rather cheap
Memory access is the expensive operation, not arithmetic.
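To make the word-level view concrete, here is a small self-contained C illustration (not taken from the library): one 64-bit word holds 64 elements of F_2, so a single XOR adds 64 field elements at once and a single AND multiplies them component-wise.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t a = 0xF0F0F0F0F0F0F0F0ULL;  /* 64 F_2 elements, bit-packed */
    uint64_t b = 0xFF00FF00FF00FF00ULL;

    uint64_t sum  = a ^ b;  /* 64 additions in F_2 (XOR)                  */
    uint64_t prod = a & b;  /* 64 component-wise multiplications (AND)    */

    printf("a + b = %016llx\n", (unsigned long long)sum);
    printf("a * b = %016llx\n", (unsigned long long)prod);
    return 0;
}
```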
Outline
◮ M4RI: Introduction, Multiplication, Elimination
◮ M4RIE: Introduction, Travolta Tables, Karatsuba Multiplication, Results
◮ Final
M4RM [ADKF70] I

Consider C = A · B (A is m × ℓ and B is ℓ × n).
A can be divided into ℓ/k vertical "stripes" A_0, ..., A_{(ℓ−1)/k} of k columns each.
B can be divided into ℓ/k horizontal "stripes" B_0, ..., B_{(ℓ−1)/k} of k rows each.
We have:

C = A · B = Σ_{i=0}^{(ℓ−1)/k} A_i · B_i.
M4RM [ADKF70] II

    [1 1 0 1]        [1 0 1 1]         [1 1]         [0 1]
A = [0 0 0 0],   B = [0 1 1 0],  A_0 = [0 0],  A_1 = [0 0]
    [1 1 1 1]        [0 1 1 0]         [1 1]         [1 1]
    [0 1 1 1]        [0 1 0 1]         [0 1]         [1 1]

B_0 = [1 0 1 1],  B_1 = [0 1 1 0]
      [0 1 1 0]         [0 1 0 1]

            [1 1 0 1]               [0 1 0 1]
A_0 · B_0 = [0 0 0 0],   A_1 · B_1 = [0 0 0 0]
            [1 1 0 1]               [0 0 1 1]
            [0 1 1 0]               [0 0 1 1]
M4RM: Algorithm, O(n^3 / log n)

begin
  C ← create an m × n matrix with all entries 0;
  k ← ⌊log n⌋;
  for 0 ≤ i < (ℓ/k) do
    // create table of 2^k − 1 linear combinations
    T ← MakeTable(B, i × k, 0, k);
    for 0 ≤ j < m do
      // read index for table T
      id ← ReadBits(A, j, i × k, k);
      add row id from T to row j of C;
  return C;
end

Algorithm 1: M4RM
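The sketch below is a minimal, self-contained C rendering of Algorithm 1; it is not the M4RI implementation. It assumes matrices are bit-packed row-major into 64-bit words (bit t of word w in a row is column 64·w + t), that n and ℓ are multiples of 64, and that k divides both ℓ and 64 (with k ≤ 16 in practice). It uses the GCC/Clang builtin __builtin_ctzll.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* read k bits of row j, starting at column col (never crosses a word here) */
static inline uint64_t read_bits(const uint64_t *M, size_t words_per_row,
                                 size_t j, size_t col, size_t k) {
    uint64_t w = M[j * words_per_row + col / 64];
    return (w >> (col % 64)) & ((1ULL << k) - 1);
}

/* T[x] = XOR of the rows start..start+k-1 of B selected by the bits of x */
static void make_table(uint64_t *T, const uint64_t *B,
                       size_t start, size_t k, size_t wn) {
    memset(T, 0, ((size_t)1 << k) * wn * sizeof(uint64_t));
    for (uint64_t x = 1; x < (1ULL << k); x++) {
        size_t b = (size_t)__builtin_ctzll(x);        /* lowest set bit     */
        const uint64_t *prev = &T[(x ^ (1ULL << b)) * wn];
        const uint64_t *row  = &B[(start + b) * wn];
        for (size_t w = 0; w < wn; w++)
            T[x * wn + w] = prev[w] ^ row[w];
    }
}

/* C = A*B over F_2: A is m x l, B is l x n, C is m x n */
void m4rm(uint64_t *C, const uint64_t *A, const uint64_t *B,
          size_t m, size_t l, size_t n, size_t k) {
    size_t wn = n / 64, wl = l / 64;
    uint64_t *T = malloc(((size_t)1 << k) * wn * sizeof(uint64_t));
    memset(C, 0, m * wn * sizeof(uint64_t));

    for (size_t i = 0; i < l / k; i++) {
        make_table(T, B, i * k, k, wn);               /* 2^k combinations   */
        for (size_t j = 0; j < m; j++) {
            uint64_t id = read_bits(A, wl, j, i * k, k);
            for (size_t w = 0; w < wn; w++)           /* C[j] += T[id]      */
                C[j * wn + w] ^= T[id * wn + w];
        }
    }
    free(T);
}
```

A caller would allocate the three bit-packed arrays and invoke, e.g., m4rm(C, A, B, m, l, n, 8).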
Strassen-Winograd [Str69] Multiplication
◮ fastest known practical algorithm
◮ complexity: O(n^{log_2 7})
◮ linear algebra constant: ω = log_2 7
◮ M4RM can be used as base case for small dimensions → optimisation of this base case
Cache [Bhu99]

Memory          Regs        L1     L2      RAM     Swap
Speed (ns)      0.5         2      6       10^2    10^7
Cost (cycles)   1           4      14      200     2 · 10^7
Size            4 · 64-bit  64k    1–4M    1G      100G
Cache Friendly M4RM I

begin
  C ← create an m × n matrix with all entries 0;
  for 0 ≤ i < (ℓ/k) do
    // this is cheap in terms of memory access
    T ← MakeTable(B, i × k, 0, k);
    for 0 ≤ j < m do
      // we load each row j to take care of k bits
      id ← ReadBits(A, j, i × k, k);
      add row id from T to row j of C;
  return C;
end
Cache Friendly M4RM II

begin
  C ← create an m × n matrix with all entries 0;
  for 0 ≤ start < m/b_s do
    for 0 ≤ i < (ℓ/k) do
      // we regenerate T for each block
      T ← MakeTable(B, i × k, 0, k);
      for 0 ≤ s < b_s do
        j ← start × b_s + s;
        id ← ReadBits(A, j, i × k, k);
        add row id from T to row j of C;
  return C;
end
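For concreteness, a plain-C rendering of this row-blocked variant is sketched below, reusing the read_bits and make_table helpers (and headers) from the M4RM sketch after Algorithm 1; the same packing and divisibility assumptions apply, plus b_s dividing m. Again, this is an illustration, not the M4RI code.

```c
/* Cache-friendly variant: process rows of C in blocks of bs rows, rebuilding
 * the table T for every block so that T, the touched rows of A and the block
 * of C all stay in cache.  Assumes bs | m, k | l, k | 64, 64 | n and 64 | l. */
void m4rm_blocked(uint64_t *C, const uint64_t *A, const uint64_t *B,
                  size_t m, size_t l, size_t n, size_t k, size_t bs) {
    size_t wn = n / 64, wl = l / 64;
    uint64_t *T = malloc(((size_t)1 << k) * wn * sizeof(uint64_t));
    memset(C, 0, m * wn * sizeof(uint64_t));

    for (size_t start = 0; start < m / bs; start++) {
        for (size_t i = 0; i < l / k; i++) {
            make_table(T, B, i * k, k, wn);     /* regenerated per block */
            for (size_t s = 0; s < bs; s++) {
                size_t j = start * bs + s;
                uint64_t id = read_bits(A, wl, j, i * k, k);
                for (size_t w = 0; w < wn; w++)
                    C[j * wn + w] ^= T[id * wn + w];
            }
        }
    }
    free(T);
}
```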
Cache Friendly M4RM III

Matrix Dimensions    Plain     Cache Friendly
10,000 × 10,000       4.141     2.866
16,384 × 16,384      16.434    12.214
20,000 × 20,000      29.520    20.497
32,000 × 32,000      86.153    82.446

Table: Strassen-Winograd with different base cases (times in seconds) on 64-bit Linux, 2.33Ghz Core 2 Duo
t > 1 Gray Code Tables I
◮ actual arithmetic is quite cheap compared to memory reads and writes
◮ the cost of memory accesses greatly depends on where in memory data is located
◮ try to fill all of L1 with Gray code tables
◮ Example: k = 10 and 1 Gray code table → 10 bits at a time; k = 9 and 2 Gray code tables → still the same memory for the tables, but deal with 18 bits at once
◮ the price is one extra row addition, which is cheap if the operands are all in cache
t > 1 Gray Code Tables II

begin
  C ← create an m × n matrix with all entries 0;
  for 0 ≤ i < (ℓ/(2k)) do
    T_0 ← MakeTable(B, i × 2k, 0, k);
    T_1 ← MakeTable(B, i × 2k + k, 0, k);
    for 0 ≤ j < m do
      id_0 ← ReadBits(A, j, i × 2k, k);
      id_1 ← ReadBits(A, j, i × 2k + k, k);
      add row id_0 from T_0 and row id_1 from T_1 to row j of C;
  return C;
end
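A plain-C sketch of the two-table main loop is given below; it reuses read_bits, make_table and the headers from the M4RM sketch after Algorithm 1 and keeps the same packing assumptions (plus 2k dividing ℓ). It is an illustration of the idea, not the M4RI implementation.

```c
/* C = A*B with two tables per pass, so each row of C is touched only once
 * per 2k columns of A.  Assumes 2k | l and k | 64. */
void m4rm_two_tables(uint64_t *C, const uint64_t *A, const uint64_t *B,
                     size_t m, size_t l, size_t n, size_t k) {
    size_t wn = n / 64, wl = l / 64;
    uint64_t *T0 = malloc(((size_t)1 << k) * wn * sizeof(uint64_t));
    uint64_t *T1 = malloc(((size_t)1 << k) * wn * sizeof(uint64_t));
    memset(C, 0, m * wn * sizeof(uint64_t));

    for (size_t i = 0; i < l / (2 * k); i++) {
        make_table(T0, B, i * 2 * k,     k, wn);
        make_table(T1, B, i * 2 * k + k, k, wn);
        for (size_t j = 0; j < m; j++) {
            uint64_t id0 = read_bits(A, wl, j, i * 2 * k,     k);
            uint64_t id1 = read_bits(A, wl, j, i * 2 * k + k, k);
            /* one pass over row j of C handles 2k bits of A */
            for (size_t w = 0; w < wn; w++)
                C[j * wn + w] ^= T0[id0 * wn + w] ^ T1[id1 * wn + w];
        }
    }
    free(T0); free(T1);
}
```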
t > 1 Gray Code Tables III

Matrix Dimensions    t = 1     t = 2     t = 8
10,000 × 10,000       4.141     1.982     1.599
16,384 × 16,384      16.434     7.258     6.034
20,000 × 20,000      29.520    14.655    11.655
32,000 × 32,000      86.153    49.768    44.999

Table: Strassen-Winograd with different base cases (times in seconds) on 64-bit Linux, 2.33Ghz Core 2 Duo
Results: Multiplication

Figure: execution time t and log_2 relation for Magma, Sage and the ratio Magma/Sage, matrix dimensions 1,000 to 29,000; 2.66 Ghz Intel i7, 4GB RAM
Work-in-Progress: Small Matrices

M4RI is efficient for large matrices but not for small ones; work is under way by Carlo Wood to fix this.

             Emmanuel     M4RI
transpose    4.9064 µs    5.3642 µs
copy         0.2019 µs    0.2674 µs
add          0.2533 µs    0.7503 µs
mul          0.2535 µs    0.4472 µs

Table: 64 × 64 matrices (matops.c)

Note: One performance bottleneck is that our matrix structure is much more complicated than Emmanuel's.
Results: Multiplication Revisited

Figure: execution time in seconds for M4RI and Magma, together with the cost-model curves 'optimal' 7^n M and 7^{n−1}(7M + 15A), matrix dimensions up to 35,000; 2.66 Ghz Intel i7, 4GB RAM
Outline
◮ M4RI: Introduction, Multiplication, Elimination
◮ M4RIE: Introduction, Travolta Tables, Karatsuba Multiplication, Results
◮ Final
PLE Decomposition I

Definition (PLE)
Let A be an m × n matrix over a field K. A PLE decomposition of A is a triple of matrices P, L and E such that P is an m × m permutation matrix, L is a unit lower triangular matrix, E is an m × n matrix in row-echelon form, and A = PLE.

PLE decomposition can be done in-place, that is, L and E are stored in A and P is stored as an m-vector.
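For orientation (not from the slides), the sketch below shows how one might call M4RI's PLE routine from C; the names mzd_ple, mzp_init and mzp_free are assumed to match the PLE interface of recent M4RI releases, so check your version's headers before relying on them.

```c
/* Hedged sketch of an in-place PLE decomposition via M4RI.
 * Assumes mzd_ple(A, P, Q, cutoff) returns the rank and stores L and E in A,
 * with the row permutation in P.  Compile with: gcc ple_demo.c -lm4ri */
#include <m4ri/m4ri.h>
#include <stdio.h>

int main(void) {
    mzd_t *A = mzd_init(4096, 4096);
    mzd_randomize(A);

    mzp_t *P = mzp_init(A->nrows);     /* row permutation              */
    mzp_t *Q = mzp_init(A->ncols);     /* pivot columns                */
    rci_t rank = mzd_ple(A, P, Q, 0);  /* 0 = default cutoff           */
    printf("rank = %d\n", (int)rank);

    mzp_free(P); mzp_free(Q);
    mzd_free(A);
    return 0;
}
```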
PLE Decomposition II

From the PLE decomposition we can
◮ read the rank r,
◮ read the row rank profile (pivots),
◮ compute the null space,
◮ solve y = Ax for x and
◮ compute the (reduced) row echelon form.
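As a brief sketch of how the solving step works (reasoning added here, not from the slides): with A = PLE of rank r,

  y = Ax = PLEx  ⇒  L(Ex) = P^T y        (P^{-1} = P^T for a permutation matrix)
  solve Lz = P^T y by forward substitution (L is unit lower triangular)
  solve Ex = z by back substitution along the pivot columns of E

The system is consistent exactly when the last m − r entries of z are zero.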
Block Recursive PLE Decomposition, O(n^ω)

Write

A = [ A_W | A_E ] = [ A_NW  A_NE ]
                    [ A_SW  A_SE ]

Main steps:
1. Call PLE on A_W
2. Apply row permutation to A_E
3. L_NW ← the lower left triangular matrix in A_NW
4. A_NE ← L_NW^{-1} × A_NE
5. A_SE ← A_SE + A_SW × A_NE
6. Call PLE on A_SE
7. Apply row permutation to A_SW
8. Compress L
Visualisation
Block Iterative PLE Decomposition I

We need an efficient base case for PLE decomposition.
◮ block recursive PLE decomposition gives rise to a block iterative PLE decomposition
◮ choose blocks of size k = log n and use M4RM for the "update" multiplications
◮ this gives a complexity of O(n^3 / log n)
◮ this is an alternative way of looking at the M4RI algorithm or its PLE decomposition equivalent ("MMPF")
◮ M4RI is more cache friendly than straight block iterative PLE decomposition, so we adapt the PLE base case using the M4RI idea
Visualisation
Results: Reduced Row Echelon Form

Figure: execution time t and log_2 relation for Magma, Sage and the ratio Magma/Sage, matrix dimensions 1,000 to 29,000; 2.66 Ghz Intel i7, 4GB RAM
Results: Row Echelon Form

Using one core we can compute the echelon form of a 500,000 × 500,000 dense random matrix over F_2 in 9711.42 seconds = 2.7 hours.
Using four cores we can compute the echelon form of a random dense 500,000 × 500,000 matrix in 3806.28 seconds = 1.05 hours.
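As a rough way to reproduce this kind of experiment at a smaller scale (not from the slides; the 500,000 × 500,000 instance needs roughly 30 GB of RAM), one might run something like the following, assuming M4RI's mzd_echelonize with full = 0 computes a (non-reduced) row echelon form:

```c
/* Hedged sketch: time the row echelon form of a dense random matrix over F_2.
 * Compile with: gcc echelon_demo.c -lm4ri */
#include <m4ri/m4ri.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    const rci_t n = 20000;             /* scaled down from 500,000      */
    mzd_t *A = mzd_init(n, n);
    mzd_randomize(A);

    clock_t t0 = clock();
    rci_t rank = mzd_echelonize(A, 0); /* 0 = echelon, 1 = reduced      */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("rank %d in %.2f seconds\n", (int)rank, secs);
    mzd_free(A);
    return 0;
}
```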
Work-in-Progress: Sensitivity to Sparsity

Figure: execution time t of M4RI, M+P 0.15, PLS and Magma as a function of the average number of nonzero elements per row (0 to 50)