I/O Lower Bounds and Algorithms for Matrix-Matrix Multiplication
Tyler M. Smith
July 5, 2017
1 / 69
Introduction
◮ Dense matrix-matrix multiplication (MMM)
◮ Goal: reduce I/O cost for machines with hierarchical memory
◮ Novel contributions:
  ◮ I/O lower bounds with a tight constant: 2mnk/√S
  ◮ A family of algorithms for machines with any number of levels of memory hierarchy
  ◮ Outperform the state-of-the-art Goto's Algorithm by 38% when there is low bandwidth to main memory
2 / 69
Problem definition
◮ Classical MMM
  ◮ C += AB
  ◮ C is m × n, A is m × k, and B is k × n
◮ Reduce the I/O cost of MMM algorithms
3 / 69
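For reference, a minimal unblocked implementation of C += AB. This is a sketch, not code from the talk; column-major storage and the function name are assumptions chosen to match BLAS conventions:

```c
#include <stddef.h>

/* Naive C += A*B for column-major matrices.
   C is m x n, A is m x k, B is k x n.
   Each element of C accumulates a dot product of a row of A
   and a column of B. */
void mmm_naive(size_t m, size_t n, size_t k,
               const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < n; j++)
        for (size_t p = 0; p < k; p++)
            for (size_t i = 0; i < m; i++)
                C[i + j * m] += A[i + p * m] * B[p + j * k];
}
```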
Hierarchical memory 4 / 69
Blocked algorithms
◮ MMM is an operation with a lot of opportunities for reuse
  ◮ Each element of A is used n times
  ◮ Each element of B is used m times
  ◮ Each element of C is used k times
◮ With O(n²) elements, one can perform O(n³) flops
◮ If all matrices fit into fast memory, amortize O(n²) memops over O(n³) flops
◮ Work with blocks of matrices at a time, where the blocks can fit into fast memory (see the sketch below)
5 / 69
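A minimal sketch of such a blocked algorithm. The block size BS and the assumption that m, n, and k divide evenly by it are illustrative; BS would be chosen so that three BS × BS blocks fit in fast memory:

```c
#include <stddef.h>

#define BS 64  /* hypothetical block size: 3 * BS * BS doubles should fit in cache */

/* Blocked C += A*B (column-major). Each (i,j,p) block triple performs
   2*BS^3 flops on 3*BS^2 elements, amortizing the cost of moving blocks
   into fast memory. Assumes m, n, k are multiples of BS for brevity. */
void mmm_blocked(size_t m, size_t n, size_t k,
                 const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < n; j += BS)
        for (size_t p = 0; p < k; p += BS)
            for (size_t i = 0; i < m; i += BS)
                /* multiply the BS x BS blocks A(i,p) and B(p,j) into C(i,j) */
                for (size_t jj = j; jj < j + BS; jj++)
                    for (size_t pp = p; pp < p + BS; pp++)
                        for (size_t ii = i; ii < i + BS; ii++)
                            C[ii + jj * m] += A[ii + pp * m] * B[pp + jj * k];
}
```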
Building blocks of dense linear algebra
◮ MMM is the bottom of the food chain
◮ Level-3 BLAS
◮ LAPACK/FLAME
◮ ScaLAPACK/Elemental
6 / 69
Outline
◮ Introduction
◮ State-of-the-art MMM
  ◮ Goto's Algorithm
◮ Lower bounds
◮ Algorithms
◮ Experiments
7 / 69
Goto's Algorithm
[Figure, built up incrementally over slides 8–14: the five loops around the micro-kernel.
The 5th loop partitions the n dimension into blocks of n_C; the 4th loop partitions the
k dimension into blocks of k_C and packs B_p → B̃_p, resident in the L3 cache; the 3rd
loop partitions the m dimension into blocks of m_C and packs A_i → Ã_i, resident in the
L2 cache; the 2nd loop partitions B̃_p into micro-panels of width n_R, resident in the L1
cache; the 1st loop partitions Ã_i into micro-panels of height m_R; the micro-kernel
updates an m_R × n_R block of C held in registers.]
8–14 / 69
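The loop structure above, as a hedged C sketch. It is not the talk's implementation: the blocksize values are placeholders, the packing here copies blocks contiguously rather than into the m_R × k_C and k_C × n_R micro-panel layouts a real implementation uses, and the micro-kernel is plain scalar code rather than vectorized assembly:

```c
#include <stddef.h>

/* Illustrative blocksizes; real values are tuned so that a KC x NC panel
   of B fits in L3, an MC x KC block of A fits in L2, and a KC x NR
   micro-panel of B fits in L1. */
enum { NC = 128, KC = 64, MC = 32, NR = 4, MR = 8 };

/* Simplified packing: copy a block into contiguous column-major storage.
   (Real implementations reorder into MR- and NR-wide micro-panels.) */
static void pack(size_t rows, size_t cols, const double *X, size_t ldx, double *Xt)
{
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            Xt[i + j * rows] = X[i + j * ldx];
}

/* Simplified micro-kernel: update an MR x NR block of C.
   (A real micro-kernel keeps this block in registers.) */
static void micro_kernel(const double *At, const double *Bt, double *C, size_t ldc)
{
    for (size_t j = 0; j < NR; j++)
        for (size_t p = 0; p < KC; p++)
            for (size_t i = 0; i < MR; i++)
                C[i + j * ldc] += At[i + p * MC] * Bt[p + j * KC];
}

/* The five loops of Goto's Algorithm for C += A*B (column-major);
   assumes m, n, k are multiples of the blocksizes for brevity.
   At and Bt are workspaces of MC*KC and KC*NC doubles. */
void goto_mmm(size_t m, size_t n, size_t k,
              const double *A, const double *B, double *C,
              double *At, double *Bt)
{
    for (size_t jc = 0; jc < n; jc += NC)                   /* 5th loop: n by NC */
        for (size_t pc = 0; pc < k; pc += KC) {             /* 4th loop: k by KC */
            pack(KC, NC, &B[pc + jc * k], k, Bt);           /* B_p -> B~ (L3)    */
            for (size_t ic = 0; ic < m; ic += MC) {         /* 3rd loop: m by MC */
                pack(MC, KC, &A[ic + pc * m], m, At);       /* A_i -> A~ (L2)    */
                for (size_t jr = 0; jr < NC; jr += NR)      /* 2nd loop: NC by NR */
                    for (size_t ir = 0; ir < MC; ir += MR)  /* 1st loop: MC by MR */
                        micro_kernel(&At[ir], &Bt[jr * KC],
                                     &C[(ic + ir) + (jc + jr) * m], m);
            }
        }
}
```

In practice the packing routines and the micro-kernel are what make this algorithm fast; the point here is only the five-loop structure and where each packed buffer is meant to reside.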
I/O cost of Goto's Algorithm
◮ Reuse dictates the I/O cost for Goto's Algorithm
◮ Each time an element is read from main memory:
  ◮ An element of A is reused n_C times
  ◮ An element of B is reused m times
  ◮ An element of C is reused k_C times
◮ Overall I/O costs of:
  ◮ A: mnk/n_C
  ◮ B: mnk/m
  ◮ C: mnk/k_C
15 / 69
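A quick check of these totals from the reuse factors, ignoring lower-order terms (a derivation sketch, not from the slides):

```latex
\begin{align*}
\mathrm{IO}(A) &= mk \cdot \frac{n}{n_C} = \frac{mnk}{n_C}
  && \text{each element of $A$ is re-read once per 5th-loop iteration}\\
\mathrm{IO}(B) &= kn = \frac{mnk}{m}
  && \text{each element of $\widetilde{B}_p$ is packed once, reused across all of $m$}\\
\mathrm{IO}(C) &= mn \cdot \frac{k}{k_C} = \frac{mnk}{k_C}
  && \text{$C$ is revisited once per 4th-loop iteration}
\end{align*}
```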
Roofline model
[Roofline plot for a 4-core Intel i7-7700K: GFLOPS (log scale) versus flops per byte,
with Goto's Algorithm plotted against the roofline.]
16 / 69
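The roofline itself is the standard bound: attainable performance is the lesser of peak compute throughput and arithmetic intensity times memory bandwidth,

```latex
\mathrm{attainable}(I) \;=\; \min\bigl(\text{peak flops/s},\; I \times \text{bandwidth}\bigr),
\qquad I = \text{flops per byte moved to/from main memory.}
```

For example, at 6.4 GB/s, sustaining 100 GFLOPS requires an intensity of at least 100/6.4 ≈ 15.6 flops per byte.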
Roofline model
[Two roofline plots, GFLOPS versus flops per byte: one with 51.2 GB/s bandwidth to main
memory and one with 6.4 GB/s, each showing Goto's Algorithm against the roofline.]
17 / 69
Outline
◮ Introduction
◮ State-of-the-art MMM
◮ Lower bounds
◮ Algorithms
◮ Experiments
18 / 69
I/O lower bounds
◮ The theoretical minimum I/O cost of an operation
◮ We want to find the greatest I/O lower bound
◮ Model of computation:
  ◮ 2 layers of memory: slow and fast
  ◮ Slow memory has unlimited capacity
  ◮ Fast memory has capacity S
  ◮ Data must be in fast memory before computing with it
19 / 69
Related work
◮ Hong and Kung (1981)
  ◮ I/O lower bound: Ω(mnk/√S)
◮ Irony, Toledo, and Tiskin (2004)
  ◮ I/O lower bound: mnk/(2√2·√S)
  ◮ With a little calculus this can be improved to mnk/√S
◮ Tyler Smith, Robert van de Geijn, Bradley Lowery, and Julien Langou (2017)
  ◮ I/O lower bound: 2mnk/√S
  ◮ Under submission at ACM TOMS
20 / 69
Lower bound strategy
◮ Consider any algorithm for MMM
◮ Break the algorithm into phases
  ◮ Each phase has an I/O cost of exactly S (except possibly the last)
  ◮ If there must be at least h phases, and each phase has an I/O cost of S, the overall I/O cost must be at least Sh
◮ Determine the minimum number of phases
  ◮ Let F be an upper bound on the multiplications during a phase
  ◮ There are mnk total multiplications during MMM
  ◮ There must be at least mnk/F phases
◮ Determine F based on the number of elements available
  ◮ Each phase: 2S elements available as inputs and 2S elements available as outputs
21 / 69
Upper bound on elementary multiplications in a phase
Irony, Toledo, and Tiskin (2004)
◮ Inequality from Loomis and Whitney (1949)
  ◮ Using N_A, N_B, and N_C elements of A, B, and C,
  ◮ one can perform at most √(N_A·N_B·N_C) multiplications
◮ At most 2S elements available as inputs, and 2S elements available as outputs
  ◮ N_A ≤ 2S, N_B ≤ 2S, and N_C ≤ 2S
  ◮ At most √(8S³) = 2√2·S√S multiplications in a phase
◮ Gives an overall lower bound of mnk/(2√2·√S)
22 / 69
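Written out, the chain of bounds on this slide is:

```latex
\begin{align*}
\text{mults per phase} &\le \sqrt{N_A N_B N_C} \le \sqrt{(2S)^3} = 2\sqrt{2}\,S\sqrt{S},\\
\text{number of phases} &\ge \frac{mnk}{2\sqrt{2}\,S\sqrt{S}},\\
\text{I/O cost} &\ge S \cdot \frac{mnk}{2\sqrt{2}\,S\sqrt{S}} = \frac{mnk}{2\sqrt{2}\,\sqrt{S}}.
\end{align*}
```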
Improving the lower bound
◮ Assume we perform FMAs instead of elementary multiplications
  ◮ In an FMA, elements of A, B, and C are all inputs
  ◮ We can reason about the input cost of C
◮ What if we generalize the I/O cost of each phase?
  ◮ Each phase can have S + M inputs and S + M outputs
  ◮ This adds a degree of freedom to our lower bound
23 / 69
Upper bound on FMAs during a phase
◮ There are at most S + M inputs
  ◮ N_A + N_B + N_C ≤ S + M
◮ We again use the Loomis-Whitney inequality
  ◮ Maximize √(N_A·N_B·N_C) subject to N_A + N_B + N_C = S + M
  ◮ Maximized when N_A = N_B = N_C
  ◮ Then our lower bound is 3√3·M·mnk/((S + M)√(S + M))
◮ Finding the greatest lower bound
  ◮ Maximizing over M, this occurs when M = 2S
  ◮ The greatest lower bound is 2mnk/√S
24 / 69
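The optimization over M, written out (here a phase is taken to move M words, with S elements already resident, so the total I/O is at least M times the number of phases):

```latex
\begin{align*}
F &\le \sqrt{N_A N_B N_C} \le \left(\frac{S+M}{3}\right)^{3/2}
   = \frac{(S+M)\sqrt{S+M}}{3\sqrt{3}},\\
\text{I/O} &\ge M \cdot \frac{mnk}{F}
   \ge \frac{3\sqrt{3}\,M\,mnk}{(S+M)\sqrt{S+M}},\\
\frac{d}{dM}\!\left[\frac{M}{(S+M)^{3/2}}\right] = 0
  &\;\Longrightarrow\; S + M = \tfrac{3}{2}M
   \;\Longrightarrow\; M = 2S,\\
\text{I/O} &\ge \frac{3\sqrt{3}\cdot 2S\cdot mnk}{(3S)\sqrt{3S}}
   = \frac{2\,mnk}{\sqrt{S}}.
\end{align*}
```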
Roofline model
[The same two roofline plots (51.2 GB/s and 6.4 GB/s bandwidth to main memory), now with
the 2mnk/√S lower bound plotted alongside Goto's Algorithm.]
25 / 69
Outline
◮ Introduction
◮ State-of-the-art MMM
◮ Lower bounds
◮ Algorithms
  ◮ Single level of cache
  ◮ Multiple levels of cache
◮ Experiments
26 / 69
Resident C [figure: C += A B] 27 / 69
Resident C: partition the m dimension into blocks of size m_c [figure] 28 / 69
Resident C: partition the n dimension into blocks of size n_c [figure] 29 / 69
Resident C: move the m_c × n_c block of C into fast memory [figure] 30 / 69
Resident C: stream panels of A and B from slow memory [figure] 31 / 69
Resident C: partition the k dimension into slices of width 1 [figure] 32 / 69
Resident C: move vectors into fast memory [figure] 33 / 69
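A hedged C sketch of the Resident C algorithm just illustrated. The names and the even-divisibility assumption are illustrative; the m_c × n_c block of C is the data held in fast memory while rank-1 slices of A and B stream past:

```c
#include <stddef.h>

/* Resident C for C += A*B (column-major): an mc x nc block of C stays in
   fast memory while, for each p, one column slice of A and one row slice
   of B are streamed in and applied as a rank-1 update of the resident
   block. Assumes m and n are multiples of mc and nc for brevity. */
void resident_c(size_t m, size_t n, size_t k, size_t mc, size_t nc,
                const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < n; j += nc)          /* partition n */
        for (size_t i = 0; i < m; i += mc)      /* partition m: C block resident */
            for (size_t p = 0; p < k; p++)      /* stream rank-1 updates */
                for (size_t jj = j; jj < j + nc; jj++)
                    for (size_t ii = i; ii < i + mc; ii++)
                        C[ii + jj * m] += A[ii + p * m] * B[p + jj * k];
}
```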
I/O cost for Resident C
[Figure: the m_c × n_c resident block of C, updated by slices of A and B.]
◮ I/O cost per block dot product:
  ◮ C_{i,j}: m_c·n_c reads and m_c·n_c writes
  ◮ A_i: m_c·k reads
  ◮ B_j: k·n_c reads
◮ Total I/O cost:
  ◮ C: mn reads and mn writes
  ◮ A: mnk/n_c reads
  ◮ B: mnk/m_c reads
34 / 69
Choosing blocksizes for Resident C
[Figure: the resident block of C is √S × √S, updated by slices of width 1.]
◮ If m_c ≈ n_c ≈ √S
◮ Total I/O cost:
  ◮ C: mn reads and mn writes
  ◮ A: mnk/√S reads
  ◮ B: mnk/√S reads
◮ If m, n, and k are large and we can ignore lower-order terms:
  ◮ The I/O cost is 2mnk/√S
  ◮ The same as the lower bound
35 / 69
Three algorithms
[Figure: the shapes of the three algorithms, Resident C, Resident B, and Resident A;
shading distinguishes data in cache from data in main memory.]
36 / 69
Resident A, B, and C algorithms in Goto's Algorithm [figure] 37 / 69
Algorithms for multiple levels of cache
◮ Suppose we have 2 levels of cache: L2 and L1
◮ We have 3 algorithms
  ◮ Resident A, Resident B, and Resident C
  ◮ Each is associated with a shape of MMM
◮ Suppose we have one of those shapes at the L2 level
  ◮ Then how do we also encounter one at the L1 level?
  ◮ We can do it with two loops (a sketch follows the next few slides)
38 / 69
Resident C at the L2 cache [figure: the resident block of the L2 cache] 39 / 69
L1 outer loop: partition the k dimension [figure] 40 / 69
L1 outer loop: partition the k dimension [figure: the shapes that result at the L1 level] 41 / 69
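One way to realize the two loops, as a hedged C sketch. This illustrates the nesting idea (Resident C at the L2 cache, with a Resident B shape appearing at the L1 level); it is not necessarily the exact algorithm of the talk, and the blocksizes and divisibility assumptions are illustrative:

```c
#include <stddef.h>

/* Two-level sketch for C += A*B (column-major): an m2 x n2 block of C is
   resident in the L2 cache; within it, an L1 outer loop partitions the k
   dimension so that a k1 x n2 block of B is resident in the L1 cache
   while columns of A stream by. Assumes every dimension is a multiple of
   its (illustrative) blocksize. */
void two_level(size_t m, size_t n, size_t k,
               size_t m2, size_t n2, size_t k1,
               const double *A, const double *B, double *C)
{
    for (size_t j2 = 0; j2 < n; j2 += n2)
        for (size_t i2 = 0; i2 < m; i2 += m2)      /* C block resident in L2 */
            for (size_t p1 = 0; p1 < k; p1 += k1)  /* L1 outer loop: partition k */
                /* the k1 x n2 block of B is reused m2 times: resident in L1 */
                for (size_t j = j2; j < j2 + n2; j++)
                    for (size_t p = p1; p < p1 + k1; p++)
                        for (size_t i = i2; i < i2 + m2; i++)
                            C[i + j * m] += A[i + p * m] * B[p + j * k];
}
```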