

  1. The M4RI & M4RIE libraries for linear algebra over F_2 and small extensions
     Martin R. Albrecht. Nancy, March 30, 2011

  2. Outline
     M4RI: Introduction, Multiplication, Elimination
     M4RIE: Introduction, Travolta Tables, Karatsuba Multiplication
     Results
     Final

  3. Outline (repeated at the start of each section)

  4. The M4RI Library
     ◮ available under the GPL version 2 or later (GPLv2+)
     ◮ provides basic arithmetic (addition, equality testing, stacking, augmenting, sub-matrices, randomisation, etc.)
     ◮ implements asymptotically fast multiplication [ABH10]
     ◮ implements asymptotically fast decomposition [AP10]
     ◮ implements some multi-core support
     ◮ runs on Linux, Mac OS X (x86 and PPC), OpenSolaris (Sun Studio Express) and Windows (Cygwin)
     http://m4ri.sagemath.org

  5. F_2
     ◮ the field with two elements
     ◮ addition is logical bitwise XOR (⊕), multiplication is logical bitwise AND (⊙):

         a b | a ⊕ b | a ⊙ b
         0 0 |   0   |   0
         0 1 |   1   |   0
         1 0 |   1   |   0
         1 1 |   0   |   1

     ◮ 64 (or 128) basic operations in at most one CPU cycle
     ◮ ... hence arithmetic is rather cheap
     Memory access is the expensive operation, not arithmetic.
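
     To make this concrete, a minimal Python sketch (illustration only; M4RI itself is C code that packs rows into arrays of 64-bit machine words): one XOR adds 64 elements of F_2 at once, one AND multiplies them entrywise.

         # Two rows of F_2 entries packed into integers (bit c = column c).
         a = 0b1011001110
         b = 0b0110101011
         print(bin(a ^ b))  # entrywise addition over F_2: a single XOR
         print(bin(a & b))  # entrywise multiplication over F_2: a single AND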

  6. Outline (as above)

  7. M4RM [ADKF70] I
     Consider C = A · B, where A is m × ℓ and B is ℓ × n. A can be divided into ℓ/k vertical “stripes” A_0, …, A_{(ℓ−1)/k} of k columns each, and B into ℓ/k horizontal “stripes” B_0, …, B_{(ℓ−1)/k} of k rows each. We have:

         C = A · B = ∑_{i=0}^{(ℓ−1)/k} A_i · B_i

  8. M4RM [ADKF70] II

         A = [1 1 0 1]    B = [1 0 1 1]    A_0 = [1 1]    A_1 = [0 1]
             [0 0 0 0]        [0 1 1 0]          [0 0]          [0 0]
             [1 1 1 1]        [0 1 1 0]          [1 1]          [1 1]
             [0 1 1 1]        [0 1 0 1]          [0 1]          [1 1]

         B_0 = [1 0 1 1]    B_1 = [0 1 1 0]
               [0 1 1 0]          [0 1 0 1]

         A_0 · B_0 = [1 1 0 1]    A_1 · B_1 = [0 1 0 1]
                     [0 0 0 0]                [0 0 0 0]
                     [1 1 0 1]                [0 0 1 1]
                     [0 1 1 0]                [0 0 1 1]
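
     The example can be checked mechanically; below is a quick sanity check of the stripe identity in plain Python (lists of bits, illustration only, not the M4RI API):

         A = [[1,1,0,1],[0,0,0,0],[1,1,1,1],[0,1,1,1]]
         B = [[1,0,1,1],[0,1,1,0],[0,1,1,0],[0,1,0,1]]

         def mul(X, Y):
             # matrix product over F_2: sums are XOR, products are AND
             return [[sum(X[i][t] & Y[t][j] for t in range(len(Y))) % 2
                      for j in range(len(Y[0]))] for i in range(len(X))]

         def add(X, Y):
             return [[x ^ y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

         A0 = [r[:2] for r in A]; A1 = [r[2:] for r in A]  # column stripes
         B0 = B[:2];              B1 = B[2:]               # row stripes
         assert add(mul(A0, B0), mul(A1, B1)) == mul(A, B)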

  9. M4RM: Algorithm, O(n³/log n)

         begin
             C ← create an m × n matrix with all entries 0
             k ← ⌊log n⌋
             for 0 ≤ i < ℓ/k do
                 // create table of 2^k − 1 linear combinations
                 T ← MakeTable(B, i × k, 0, k)
                 for 0 ≤ j < m do
                     // read index for table T
                     id ← ReadBits(A, j, i × k, k)
                     add row id from T to row j of C
             return C
         end

     Algorithm 1: M4RM
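
     A minimal, unoptimised Python sketch of the algorithm (illustration only; the real M4RI code is C and builds each table in Gray code order). Rows are Python ints with bit c holding column c; make_table and the shift-and-mask below play the roles of MakeTable and ReadBits.

         def make_table(rows):
             # all 2^len(rows) linear combinations of the given rows; each
             # entry extends a previous one, so it costs one row addition
             T = [0] * (1 << len(rows))
             for x in range(1, len(T)):
                 low = (x & -x).bit_length() - 1  # lowest set bit of x
                 T[x] = T[x & (x - 1)] ^ rows[low]
             return T

         def m4rm(A, B, m, l, n, k):
             C = [0] * m
             for i in range(0, l, k):             # stripe B_i: k rows of B
                 T = make_table(B[i:i + k])
                 w = min(k, l - i)                # last stripe may be narrower
                 for j in range(m):
                     idx = (A[j] >> i) & ((1 << w) - 1)  # k bits of row j of A
                     C[j] ^= T[idx]               # add precomputed combination
             return C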

  10. Strassen-Winograd [Str69] Multiplication
     ◮ fastest known practical algorithm
     ◮ complexity: O(n^{log₂ 7})
     ◮ linear algebra constant: ω = log₂ 7
     ◮ M4RM can be used as the base case for small dimensions → optimisation of this base case
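
     For context, the exponent comes from the standard recurrence (textbook reasoning, not spelled out on the slide): seven half-size multiplications plus quadratic-cost additions give

         T(n) = 7 · T(n/2) + O(n²)  ⟹  T(n) = O(n^{log₂ 7}) ≈ O(n^{2.807})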

  11. Cache [Bhu99]

         Memory         Regs        L1     L2      RAM    Swap
         Speed (ns)     0.5         2      6       10²    10⁷
         Cost (cycles)  1           4      14      200    2 · 10⁷
         Size           4 · 64-bit  64k    1-4M    1G     100G

  12. Cache Friendly M4RM I

         begin
             C ← create an m × n matrix with all entries 0
             for 0 ≤ i < ℓ/k do
                 // this is cheap in terms of memory access
                 T ← MakeTable(B, i × k, 0, k)
                 for 0 ≤ j < m do
                     // we load each row j to take care of k bits
                     id ← ReadBits(A, j, i × k, k)
                     add row id from T to row j of C
             return C
         end

  13. Cache Friendly M4RM II

         begin
             C ← create an m × n matrix with all entries 0
             for 0 ≤ start < m/b_s do
                 for 0 ≤ i < ℓ/k do
                     // we regenerate T for each block
                     T ← MakeTable(B, i × k, 0, k)
                     for 0 ≤ s < b_s do
                         j ← start × b_s + s
                         id ← ReadBits(A, j, i × k, k)
                         add row id from T to row j of C
             return C
         end
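
     The blocking idea as a Python sketch on top of the earlier m4rm/make_table sketch; the block size bs is a tuning parameter (the default below is an arbitrary assumption; it should be chosen so a block of C plus one table fit in cache):

         def m4rm_blocked(A, B, m, l, n, k, bs=64):
             # process C in blocks of bs rows so the active rows stay in
             # cache; T is regenerated once per (block, stripe) pair
             C = [0] * m
             for start in range(0, m, bs):
                 for i in range(0, l, k):
                     T = make_table(B[i:i + k])
                     w = min(k, l - i)
                     for j in range(start, min(start + bs, m)):
                         C[j] ^= T[(A[j] >> i) & ((1 << w) - 1)]
             return C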

  14. Cache Friendly M4RM III

         Matrix dimensions      Plain     Cache friendly
         10,000 × 10,000         4.141     2.866
         16,384 × 16,384        16.434    12.214
         20,000 × 20,000        29.520    20.497
         32,000 × 32,000        86.153    82.446

     Table: Strassen-Winograd with different base cases (times in seconds) on 64-bit Linux, 2.33Ghz Core 2 Duo

  15. t > 1 Gray Code Tables I
     ◮ actual arithmetic is quite cheap compared to memory reads and writes
     ◮ the cost of memory accesses greatly depends on where in memory the data is located
     ◮ idea: fill all of L1 with Gray code tables
     ◮ Example: k = 10 and one Gray code table deals with 10 bits at a time; k = 9 and two Gray code tables use the same memory for the tables but deal with 18 bits at once.
     ◮ The price is one extra row addition, which is cheap if the operands are all in cache.

  16. t > 1 Gray Code Tables II

         begin
             C ← create an m × n matrix with all entries 0
             for 0 ≤ i < ℓ/(2k) do
                 T₀ ← MakeTable(B, i × 2k, 0, k)
                 T₁ ← MakeTable(B, i × 2k + k, 0, k)
                 for 0 ≤ j < m do
                     id₀ ← ReadBits(A, j, i × 2k, k)
                     id₁ ← ReadBits(A, j, i × 2k + k, k)
                     add row id₀ from T₀ and row id₁ from T₁ to row j of C
             return C
         end
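
     The t = 2 variant as a Python sketch (again illustration only, reusing make_table from above): two tables cover 2k columns per pass, so each row of C is touched half as often for the same table memory.

         def m4rm_t2(A, B, m, l, n, k):
             C = [0] * m
             for i in range(0, l, 2 * k):
                 rows0, rows1 = B[i:i + k], B[i + k:i + 2 * k]
                 T0, T1 = make_table(rows0), make_table(rows1)
                 for j in range(m):
                     id0 = (A[j] >> i) & ((1 << len(rows0)) - 1)
                     id1 = (A[j] >> (i + k)) & ((1 << len(rows1)) - 1)
                     # one extra row addition per row, but twice the bits
                     C[j] ^= T0[id0] ^ T1[id1]
             return C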

  17. t > 1 Gray Code Tables III

         Matrix dimensions      t = 1     t = 2     t = 8
         10,000 × 10,000         4.141     1.982     1.599
         16,384 × 16,384        16.434     7.258     6.034
         20,000 × 20,000        29.520    14.655    11.655
         32,000 × 32,000        86.153    49.768    44.999

     Table: Strassen-Winograd with different base cases (times in seconds) on 64-bit Linux, 2.33Ghz Core 2 Duo

  18. Results: Multiplication
     Figure: execution time t for multiplication in Magma and Sage (M4RI) for dimensions 1,000 up to 29,000, together with the log₂ ratio Magma/Sage; 2.66 Ghz Intel i7, 4GB RAM. (Plot only.)

  19. Work-in-Progress: Small Matrices
     M4RI is efficient for large matrices, but not so for small matrices. But there is work under way by Carlo Wood to fix this.

         Operation     Emmanuel     M4RI
         transpose     4.9064 µs    5.3642 µs
         copy          0.2019 µs    0.2674 µs
         add           0.2533 µs    0.7503 µs
         mul           0.2535 µs    0.4472 µs

     Table: 64 × 64 matrices (matops.c)
     Note: one performance bottleneck is that our matrix structure is much more complicated than Emmanuel’s.

  20. Results: Multiplication Revisited
     Figure: execution time in seconds for M4RI and Magma up to dimension 35,000, together with the cost-model curves 7ⁿ M (labelled ‘optimal’) and 7ⁿ⁻¹ (7M + 15A); 2.66 Ghz Intel i7, 4GB RAM. (Plot only.)

  21. Outline (as above)

  22. PLE Decomposition I
     Definition (PLE). Let A be an m × n matrix over a field K. A PLE decomposition of A is a triple of matrices P, L and E such that P is an m × m permutation matrix, L is a unit lower triangular matrix, E is an m × n matrix in row-echelon form, and A = PLE.
     PLE decomposition can be computed in-place, that is, L and E are stored in A and P is stored as an m-vector.
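
     A plain Gaussian-elimination sketch of PLE over F_2 in Python (illustration only; M4RI’s actual algorithms are the block recursive and block iterative ones on the following slides, and they store L and E in-place). Rows are ints with bit c = column c; P is returned as a permutation vector and L is kept explicitly.

         def ple(A, m, n):
             E = list(A)                     # will end up in row-echelon form
             P = list(range(m))              # row permutation as an m-vector
             L = [1 << i for i in range(m)]  # unit lower triangular rows
             r, pivots = 0, []
             for c in range(n):
                 piv = next((i for i in range(r, m) if (E[i] >> c) & 1), None)
                 if piv is None:
                     continue                # no pivot in this column
                 E[r], E[piv] = E[piv], E[r]
                 P[r], P[piv] = P[piv], P[r]
                 mask = (1 << r) - 1         # swap the computed part of L too
                 L[r], L[piv] = (L[r] & ~mask) | (L[piv] & mask), \
                                (L[piv] & ~mask) | (L[r] & mask)
                 for i in range(r + 1, m):
                     if (E[i] >> c) & 1:     # eliminate below the pivot
                         E[i] ^= E[r]
                         L[i] |= 1 << r      # over F_2 every multiplier is 1
                 pivots.append(c)
                 r += 1
             # rank = len(pivots); row i of L·E equals row P[i] of A
             return P, L, E, pivots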

  23. PLE Decomposition II
     From the PLE decomposition we can
     ◮ read the rank r,
     ◮ read the row rank profile (pivots),
     ◮ compute the null space,
     ◮ solve y = Ax for x, and
     ◮ compute the (reduced) row echelon form.
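
     Toy usage of the ple sketch above (the 4 × 4 matrix is made up for illustration):

         A = [0b1011, 0b1011, 0b0110, 0b1101]  # rows as ints, bit c = column c
         P, L, E, pivots = ple(A, 4, 4)
         print(len(pivots), pivots)            # rank 2, pivot columns [0, 1]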

  24. Block Recursive PLE Decomposition, O(n^ω)

     Write

         A = [A_W | A_E] = [A_NW  A_NE]
                           [A_SW  A_SE]

     Main steps:
     1. Call PLE on A_W
     2. Apply the row permutation to A_E
     3. L_NW ← the lower left triangular matrix in A_NW
     4. A_NE ← L_NW⁻¹ × A_NE
     5. A_SE ← A_SE + A_SW × A_NE
     6. Call PLE on A_SE
     7. Apply the row permutation to A_SW
     8. Compress L
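
     A standard complexity count for the recursion (not spelled out on the slide): the non-recursive work (the permutations, the triangular solve in step 4 and the multiplication in step 5) costs O(n^ω), and there are two half-size recursive calls, so

         T(n) = 2 · T(n/2) + O(n^ω)  ⟹  T(n) = O(n^ω)   since ω > 2.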

  25.–32. Visualisation (step-by-step figures animating the block recursive PLE decomposition; no text content)

  33. Block Iterative PLE Decomposition I
     We need an efficient base case for PLE decomposition.
     ◮ block recursive PLE decomposition gives rise to a block iterative PLE decomposition
     ◮ choose blocks of size k = log n and use M4RM for the “update” multiplications
     ◮ this gives a complexity of O(n³/log n)
     ◮ this is an alternative way of looking at the M4RI algorithm, or rather its PLE decomposition equivalent (“MMPF”)
     ◮ M4RI is more cache friendly than the straight block iterative PLE decomposition, so we adapt the PLE decomposition using the M4RI idea
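
     A back-of-envelope count behind the stated O(n³/log n) (standard M4RI reasoning): there are n/k stripe steps; each builds a table of 2^k rows at one O(n)-bit-operation row addition per entry, and performs n row additions of cost O(n). With k = log n, so that 2^k = n:

         (n/k) · (2^k · n + n · n) = (n/log n) · O(n²) = O(n³/log n)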

  34.–45. Visualisation (step-by-step figures animating the block iterative PLE decomposition; no text content)

  46. Results: Reduced Row Echelon Form
     Figure: execution time t for computing the reduced row echelon form in Magma and Sage (M4RI) for dimensions 1,000 up to 29,000, together with the log₂ ratio Magma/Sage; 2.66 Ghz Intel i7, 4GB RAM. (Plot only.)

  47. Results: Row Echelon Form
     Using one core we can compute the echelon form of a 500,000 × 500,000 dense random matrix over F_2 in 9711.42 seconds ≈ 2.7 hours. Using four cores we can compute the echelon form of such a matrix in 3806.28 seconds ≈ 1.05 hours.

  48. Work-in-Progress: Sensitivity to Sparsity
     Figure: execution time t as a function of the average number of nonzero elements per row (0 to 50) for M4RI, M+P 0.15, PLS and Magma. (Plot only.)
