I/O Lower Bounds and Algorithms for Matrix-Matrix Multiplication
Tyler M. Smith
July 5, 2017
1 / 69
Introduction
◮ Dense matrix-matrix multiplication (MMM)
◮ Goal: reduce I/O cost for machines with hierarchical memory
◮ Novel contributions:
  ◮ I/O lower bounds with a tight constant: 2mnk/√S
  ◮ A family of algorithms for machines with any number of levels of memory hierarchy
  ◮ Outperform the state-of-the-art Goto's Algorithm by 38% when there is low bandwidth to main memory
2 / 69
Problem definition
◮ Classical MMM
  ◮ C += AB
  ◮ C is m × n, A is m × k, and B is k × n
◮ Reduce the I/O cost of MMM algorithms
3 / 69
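For reference, a minimal unblocked implementation of C += AB. This is a sketch, not code from the talk; column-major storage and the function name are assumptions chosen to match BLAS conventions:

```c
#include <stddef.h>

/* Naive C += A*B for column-major matrices.
   C is m x n, A is m x k, B is k x n.
   Each element of C accumulates a dot product of a row of A
   and a column of B. */
void mmm_naive(size_t m, size_t n, size_t k,
               const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < n; j++)
        for (size_t p = 0; p < k; p++)
            for (size_t i = 0; i < m; i++)
                C[i + j * m] += A[i + p * m] * B[p + j * k];
}
```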
Hierarchical memory 4 / 69
Blocked algorithms
◮ MMM is an operation with a lot of opportunities for reuse
  ◮ Each element of A is used n times
  ◮ Each element of B is used m times
  ◮ Each element of C is used k times
◮ With O(n²) elements, one can perform O(n³) flops
◮ If all matrices fit into fast memory, amortize O(n²) memops over O(n³) flops
◮ Work with blocks of matrices at a time, where the blocks can fit into fast memory (see the sketch below)
5 / 69
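A minimal sketch of such a blocked algorithm. The block size BS and the assumption that m, n, and k divide evenly by it are illustrative; BS would be chosen so that three BS × BS blocks fit in fast memory:

```c
#include <stddef.h>

#define BS 64  /* hypothetical block size: 3 * BS * BS doubles should fit in cache */

/* Blocked C += A*B (column-major). Each (i,j,p) block triple performs
   2*BS^3 flops on 3*BS^2 elements, amortizing the cost of moving blocks
   into fast memory. Assumes m, n, k are multiples of BS for brevity. */
void mmm_blocked(size_t m, size_t n, size_t k,
                 const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < n; j += BS)
        for (size_t p = 0; p < k; p += BS)
            for (size_t i = 0; i < m; i += BS)
                /* multiply the BS x BS blocks A(i,p) and B(p,j) into C(i,j) */
                for (size_t jj = j; jj < j + BS; jj++)
                    for (size_t pp = p; pp < p + BS; pp++)
                        for (size_t ii = i; ii < i + BS; ii++)
                            C[ii + jj * m] += A[ii + pp * m] * B[pp + jj * k];
}
```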
Building blocks of dense linear algebra
◮ MMM is the bottom of the food chain
◮ Level-3 BLAS
◮ LAPACK/FLAME
◮ ScaLAPACK/Elemental
6 / 69
Outline
◮ Introduction
◮ State-of-the-art MMM
  ◮ Goto's Algorithm
◮ Lower bounds
◮ Algorithms
◮ Experiments
7 / 69
Goto's Algorithm
[Figure, built up incrementally over slides 8–14: the five loops around the micro-kernel.
The 5th loop partitions the n dimension into blocks of n_C; the 4th loop partitions the
k dimension into blocks of k_C and packs B_p → B̃_p, resident in the L3 cache; the 3rd
loop partitions the m dimension into blocks of m_C and packs A_i → Ã_i, resident in the
L2 cache; the 2nd loop partitions B̃_p into micro-panels of width n_R, resident in the L1
cache; the 1st loop partitions Ã_i into micro-panels of height m_R; the micro-kernel
updates an m_R × n_R block of C held in registers.]
8–14 / 69
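The loop structure above, as a hedged C sketch. It is not the talk's implementation: the blocksize values are placeholders, the packing here copies blocks contiguously rather than into the m_R × k_C and k_C × n_R micro-panel layouts a real implementation uses, and the micro-kernel is plain scalar code rather than vectorized assembly:

```c
#include <stddef.h>

/* Illustrative blocksizes; real values are tuned so that a KC x NC panel
   of B fits in L3, an MC x KC block of A fits in L2, and a KC x NR
   micro-panel of B fits in L1. */
enum { NC = 128, KC = 64, MC = 32, NR = 4, MR = 8 };

/* Simplified packing: copy a block into contiguous column-major storage.
   (Real implementations reorder into MR- and NR-wide micro-panels.) */
static void pack(size_t rows, size_t cols, const double *X, size_t ldx, double *Xt)
{
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            Xt[i + j * rows] = X[i + j * ldx];
}

/* Simplified micro-kernel: update an MR x NR block of C.
   (A real micro-kernel keeps this block in registers.) */
static void micro_kernel(const double *At, const double *Bt, double *C, size_t ldc)
{
    for (size_t j = 0; j < NR; j++)
        for (size_t p = 0; p < KC; p++)
            for (size_t i = 0; i < MR; i++)
                C[i + j * ldc] += At[i + p * MC] * Bt[p + j * KC];
}

/* The five loops of Goto's Algorithm for C += A*B (column-major);
   assumes m, n, k are multiples of the blocksizes for brevity.
   At and Bt are workspaces of MC*KC and KC*NC doubles. */
void goto_mmm(size_t m, size_t n, size_t k,
              const double *A, const double *B, double *C,
              double *At, double *Bt)
{
    for (size_t jc = 0; jc < n; jc += NC)                   /* 5th loop: n by NC */
        for (size_t pc = 0; pc < k; pc += KC) {             /* 4th loop: k by KC */
            pack(KC, NC, &B[pc + jc * k], k, Bt);           /* B_p -> B~ (L3)    */
            for (size_t ic = 0; ic < m; ic += MC) {         /* 3rd loop: m by MC */
                pack(MC, KC, &A[ic + pc * m], m, At);       /* A_i -> A~ (L2)    */
                for (size_t jr = 0; jr < NC; jr += NR)      /* 2nd loop: NC by NR */
                    for (size_t ir = 0; ir < MC; ir += MR)  /* 1st loop: MC by MR */
                        micro_kernel(&At[ir], &Bt[jr * KC],
                                     &C[(ic + ir) + (jc + jr) * m], m);
            }
        }
}
```

In practice the packing routines and the micro-kernel are what make this algorithm fast; the point here is only the five-loop structure and where each packed buffer is meant to reside.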
I/O cost of Goto's Algorithm
◮ Reuse dictates the I/O cost for Goto's Algorithm
◮ Each time an element is read from main memory:
  ◮ An element of A is reused n_C times
  ◮ An element of B is reused m times
  ◮ An element of C is reused k_C times
◮ Overall I/O costs of:
  ◮ A: mnk/n_C
  ◮ B: mnk/m
  ◮ C: mnk/k_C
15 / 69
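A quick check of these totals from the reuse factors, ignoring lower-order terms (a derivation sketch, not from the slides):

```latex
\begin{align*}
\mathrm{IO}(A) &= mk \cdot \frac{n}{n_C} = \frac{mnk}{n_C}
  && \text{each element of $A$ is re-read once per 5th-loop iteration}\\
\mathrm{IO}(B) &= kn = \frac{mnk}{m}
  && \text{each element of $\widetilde{B}_p$ is packed once, reused across all of $m$}\\
\mathrm{IO}(C) &= mn \cdot \frac{k}{k_C} = \frac{mnk}{k_C}
  && \text{$C$ is revisited once per 4th-loop iteration}
\end{align*}
```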
Roofline model
[Roofline plot for a 4-core Intel i7-7700K: GFLOPS (log scale) versus flops per byte,
with Goto's Algorithm plotted against the roofline.]
16 / 69
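The roofline itself is the standard bound: attainable performance is the lesser of peak compute throughput and arithmetic intensity times memory bandwidth,

```latex
\mathrm{attainable}(I) \;=\; \min\bigl(\text{peak flops/s},\; I \times \text{bandwidth}\bigr),
\qquad I = \text{flops per byte moved to/from main memory.}
```

For example, at 6.4 GB/s, sustaining 100 GFLOPS requires an intensity of at least 100/6.4 ≈ 15.6 flops per byte.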
Roofline model
[Two roofline plots, GFLOPS versus flops per byte: one with 51.2 GB/s bandwidth to main
memory and one with 6.4 GB/s, each showing Goto's Algorithm against the roofline.]
17 / 69
Outline
◮ Introduction
◮ State-of-the-art MMM
◮ Lower bounds
◮ Algorithms
◮ Experiments
18 / 69
I/O lower bounds
◮ The theoretical minimum I/O cost of an operation
◮ We want to find the greatest I/O lower bound
◮ Model of computation:
  ◮ 2 layers of memory: slow and fast
  ◮ Slow memory has unlimited capacity
  ◮ Fast memory has capacity S
  ◮ Data must be in fast memory before computing with it
19 / 69
Related work
◮ Hong and Kung (1981)
  ◮ I/O lower bound: Ω(mnk/√S)
◮ Irony, Toledo, and Tiskin (2004)
  ◮ I/O lower bound: mnk/(2√2·√S)
  ◮ With a little calculus this can be improved to mnk/√S
◮ Tyler Smith, Robert van de Geijn, Bradley Lowery, and Julien Langou (2017)
  ◮ I/O lower bound: 2mnk/√S
  ◮ Under submission at ACM TOMS
20 / 69
Lower bound strategy
◮ Consider any algorithm for MMM
◮ Break the algorithm into phases
  ◮ Each phase has an I/O cost of exactly S (except possibly the last)
  ◮ If there must be at least h phases, and each phase has an I/O cost of S, the overall I/O cost must be at least Sh
◮ Determine the minimum number of phases
  ◮ Let F be an upper bound on the multiplications during a phase
  ◮ There are mnk total multiplications during MMM
  ◮ There must be at least mnk/F phases
◮ Determine F based on the number of elements available
  ◮ Each phase: 2S elements available as inputs and 2S elements available as outputs
21 / 69
Upper bound on elementary multiplications in a phase
Irony, Toledo, and Tiskin (2004)
◮ Inequality from Loomis and Whitney (1949)
  ◮ Using N_A, N_B, and N_C elements of A, B, and C,
  ◮ one can perform at most √(N_A·N_B·N_C) multiplications
◮ At most 2S elements available as inputs, and 2S elements available as outputs
  ◮ N_A ≤ 2S, N_B ≤ 2S, and N_C ≤ 2S
  ◮ At most √(8S³) = 2√2·S√S multiplications in a phase
◮ Gives an overall lower bound of mnk/(2√2·√S)
22 / 69
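Written out, the chain of bounds on this slide is:

```latex
\begin{align*}
\text{mults per phase} &\le \sqrt{N_A N_B N_C} \le \sqrt{(2S)^3} = 2\sqrt{2}\,S\sqrt{S},\\
\text{number of phases} &\ge \frac{mnk}{2\sqrt{2}\,S\sqrt{S}},\\
\text{I/O cost} &\ge S \cdot \frac{mnk}{2\sqrt{2}\,S\sqrt{S}} = \frac{mnk}{2\sqrt{2}\,\sqrt{S}}.
\end{align*}
```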
Improving the lower bound
◮ Assume we perform FMAs instead of elementary multiplications
  ◮ In an FMA, elements of A, B, and C are all inputs
  ◮ We can reason about the input cost of C
◮ What if we generalize the I/O cost of each phase?
  ◮ Each phase can have S + M inputs and S + M outputs
  ◮ This adds a degree of freedom to our lower bound
23 / 69
Upper bound on FMAs during a phase
◮ There are at most S + M inputs
  ◮ N_A + N_B + N_C ≤ S + M
◮ We again use the Loomis-Whitney inequality
  ◮ Maximize √(N_A·N_B·N_C) subject to N_A + N_B + N_C = S + M
  ◮ Maximized when N_A = N_B = N_C
  ◮ Then our lower bound is 3√3·M·mnk/((S + M)√(S + M))
◮ Finding the greatest lower bound
  ◮ Maximizing over M, this occurs when M = 2S
  ◮ The greatest lower bound is 2mnk/√S
24 / 69
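The optimization over M, written out (here a phase is taken to move M words, with S elements already resident, so the total I/O is at least M times the number of phases):

```latex
\begin{align*}
F &\le \sqrt{N_A N_B N_C} \le \left(\frac{S+M}{3}\right)^{3/2}
   = \frac{(S+M)\sqrt{S+M}}{3\sqrt{3}},\\
\text{I/O} &\ge M \cdot \frac{mnk}{F}
   \ge \frac{3\sqrt{3}\,M\,mnk}{(S+M)\sqrt{S+M}},\\
\frac{d}{dM}\!\left[\frac{M}{(S+M)^{3/2}}\right] = 0
  &\;\Longrightarrow\; S + M = \tfrac{3}{2}M
   \;\Longrightarrow\; M = 2S,\\
\text{I/O} &\ge \frac{3\sqrt{3}\cdot 2S\cdot mnk}{(3S)\sqrt{3S}}
   = \frac{2\,mnk}{\sqrt{S}}.
\end{align*}
```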
Roofline model
[The same two roofline plots (51.2 GB/s and 6.4 GB/s bandwidth to main memory), now with
the 2mnk/√S lower bound plotted alongside Goto's Algorithm.]
25 / 69
Outline
◮ Introduction
◮ State-of-the-art MMM
◮ Lower bounds
◮ Algorithms
  ◮ Single level of cache
  ◮ Multiple levels of cache
◮ Experiments
26 / 69
Resident C [figure: C += A B] 27 / 69
Resident C: partition the m dimension into blocks of size m_c [figure] 28 / 69
Resident C: partition the n dimension into blocks of size n_c [figure] 29 / 69
Resident C: move the m_c × n_c block of C into fast memory [figure] 30 / 69
Resident C: stream panels of A and B from slow memory [figure] 31 / 69
Resident C: partition the k dimension into slices of width 1 [figure] 32 / 69
Resident C: move vectors into fast memory [figure] 33 / 69
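A hedged C sketch of the Resident C algorithm just illustrated. The names and the even-divisibility assumption are illustrative; the m_c × n_c block of C is the data held in fast memory while rank-1 slices of A and B stream past:

```c
#include <stddef.h>

/* Resident C for C += A*B (column-major): an mc x nc block of C stays in
   fast memory while, for each p, one column slice of A and one row slice
   of B are streamed in and applied as a rank-1 update of the resident
   block. Assumes m and n are multiples of mc and nc for brevity. */
void resident_c(size_t m, size_t n, size_t k, size_t mc, size_t nc,
                const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < n; j += nc)          /* partition n */
        for (size_t i = 0; i < m; i += mc)      /* partition m: C block resident */
            for (size_t p = 0; p < k; p++)      /* stream rank-1 updates */
                for (size_t jj = j; jj < j + nc; jj++)
                    for (size_t ii = i; ii < i + mc; ii++)
                        C[ii + jj * m] += A[ii + p * m] * B[p + jj * k];
}
```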
I/O cost for Resident C
[Figure: the m_c × n_c resident block of C, updated by slices of A and B.]
◮ I/O cost per block dot product:
  ◮ C_{i,j}: m_c·n_c reads and m_c·n_c writes
  ◮ A_i: m_c·k reads
  ◮ B_j: k·n_c reads
◮ Total I/O cost:
  ◮ C: mn reads and mn writes
  ◮ A: mnk/n_c reads
  ◮ B: mnk/m_c reads
34 / 69
Choosing blocksizes for Resident C
[Figure: the resident block of C is √S × √S, updated by slices of width 1.]
◮ If m_c ≈ n_c ≈ √S
◮ Total I/O cost:
  ◮ C: mn reads and mn writes
  ◮ A: mnk/√S reads
  ◮ B: mnk/√S reads
◮ If m, n, and k are large and we can ignore lower-order terms:
  ◮ The I/O cost is 2mnk/√S
  ◮ The same as the lower bound
35 / 69
Three algorithms
[Figure: the shapes of the three algorithms, Resident C, Resident B, and Resident A;
shading distinguishes data in cache from data in main memory.]
36 / 69
Resident A, B, and C algorithms in Goto's Algorithm [figure] 37 / 69
Algorithms for multiple levels of cache
◮ Suppose we have 2 levels of cache: L2 and L1
◮ We have 3 algorithms
  ◮ Resident A, Resident B, and Resident C
  ◮ Each is associated with a shape of MMM
◮ Suppose we have one of those shapes at the L2 level
  ◮ Then how do we also encounter one at the L1 level?
  ◮ We can do it with two loops (a sketch follows the next few slides)
38 / 69
Resident C at the L2 cache [figure: the resident block of the L2 cache] 39 / 69
L1 outer loop: partition the k dimension [figure] 40 / 69
L1 outer loop: partition the k dimension [figure: the shapes that result at the L1 level] 41 / 69
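One way to realize the two loops, as a hedged C sketch. This illustrates the nesting idea (Resident C at the L2 cache, with a Resident B shape appearing at the L1 level); it is not necessarily the exact algorithm of the talk, and the blocksizes and divisibility assumptions are illustrative:

```c
#include <stddef.h>

/* Two-level sketch for C += A*B (column-major): an m2 x n2 block of C is
   resident in the L2 cache; within it, an L1 outer loop partitions the k
   dimension so that a k1 x n2 block of B is resident in the L1 cache
   while columns of A stream by. Assumes every dimension is a multiple of
   its (illustrative) blocksize. */
void two_level(size_t m, size_t n, size_t k,
               size_t m2, size_t n2, size_t k1,
               const double *A, const double *B, double *C)
{
    for (size_t j2 = 0; j2 < n; j2 += n2)
        for (size_t i2 = 0; i2 < m; i2 += m2)      /* C block resident in L2 */
            for (size_t p1 = 0; p1 < k; p1 += k1)  /* L1 outer loop: partition k */
                /* the k1 x n2 block of B is reused m2 times: resident in L1 */
                for (size_t j = j2; j < j2 + n2; j++)
                    for (size_t p = p1; p < p1 + k1; p++)
                        for (size_t i = i2; i < i2 + m2; i++)
                            C[i + j * m] += A[i + p * m] * B[p + j * k];
}
```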