
Communication Lower Bounds for Matrix-Matrix Multiplication - PowerPoint PPT Presentation



  1. Communication Lower Bounds for Matrix-Matrix Multiplication. Dagstuhl Seminar #15281, July 6-9, 2015. Julien Langou.

  2. MOTIVATIONS. Outline: Motivations; Communication Lower Bound for Sequential Matrix-Matrix Multiplication; Application to Parallel Distributed.

  3. MOTIVATIONS. Getting Up to Speed: The Future of Supercomputing, Eds. Susan L. Graham, Marc Snir, and Cynthia A. Patterson, National Research Council, 227 pages, 2004. Annual improvement: time per flop 59%; network bandwidth 26%, network latency 15%; DRAM bandwidth 23%, DRAM latency 5%.

  4. MOTIVATIONS. Getting Up to Speed: The Future of Supercomputing, Eds. Susan L. Graham, Marc Snir, and Cynthia A. Patterson, National Research Council, 227 pages, 2004. Annual improvement: time per flop 59%; network bandwidth 26%, network latency 15%; DRAM bandwidth 23%, DRAM latency 5%. [Figure 5.3: arithmetic performance (Mflops), memory bandwidth, and DRAM chip bandwidth per calendar year, Jan 1988 to Jan 2002. Figure 5.4: decrease in memory latency (in nanoseconds) per calendar year, Jan 1988 to Jan 2002.]

  5. MOTIVATIONS. http://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

  6. MOTIVATIONS. Data Movement Cost: Energy Trends. FLOPs are almost free; the cost of data movement is dominant, so minimizing the amount of data movement is increasingly critical. [Figure: energy per operation in picojoules (1 to 10000), comparing 45 nm technology ("no change") with projected 11 nm (2018).] Source: Jim Demmel, John Shalf.

  7. MOTIVATIONS.

  8. MOTIVATIONS.

  9. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. Outline: Motivations; Communication Lower Bound for Sequential Matrix-Matrix Multiplication; Application to Parallel Distributed.

  10. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. One core of an Intel Xeon Processor E5520 (Nehalem): β⁻¹ = 580 · 10⁶ words/sec, γ⁻¹ = 10.12 · 10⁹ flops/sec, M = 10⁶ words. [Figure: DGEMM on one core of an Intel Xeon Processor E5520 (Nehalem), GFlops/sec (0 to 10) versus matrix order (0 to 5000).]
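A quick sanity check on these numbers (not part of the slide, just arithmetic on the two rates it quotes) shows why data reuse in fast memory is essential:

```latex
% Machine-balance estimate from the rates quoted on the slide.
% If every flop required one word from slow memory, DGEMM would be
% bandwidth-bound at beta^{-1} words/sec, far below the peak gamma^{-1}.
\[
  \frac{\gamma^{-1}}{\beta^{-1}}
  = \frac{10.12 \cdot 10^{9}\ \text{flops/sec}}{580 \cdot 10^{6}\ \text{words/sec}}
  \approx 17.4\ \text{flops per word}.
\]
% So each word brought into fast memory must participate in roughly 17 or
% more flops for the core to run near peak, which is why blocking for the
% fast memory of size M matters.
```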

  11. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. Mission Statement: We study communication costs for the ordinary dense (OD) matrix-matrix multiplication in the sequential model.

  12. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. Mission Statement: We study communication costs for the ordinary dense (OD) matrix-matrix multiplication in the sequential model. • dense matrix-matrix multiplication: C = A · B.

  13. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. Mission Statement: We study communication costs for the ordinary dense (OD) matrix-matrix multiplication in the sequential model. • dense matrix-matrix multiplication. Input: A an n-by-n matrix, B an n-by-n matrix. Output: C an n-by-n matrix. % starting from C = 0: for i = 1:n, for j = 1:n, for k = 1:n, c_ij = c_ij + a_ik b_kj.

  14. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. Mission Statement: We study communication costs for the ordinary dense (OD) matrix-matrix multiplication in the sequential model. • dense matrix-matrix multiplication. Input: A an n-by-n matrix, B an n-by-n matrix. Output: C an n-by-n matrix. % starting from C = 0: for i = 1:n, for j = 1:n, for k = 1:n, c_ijk = a_ik b_kj; c_ij = c_ij + c_ijk. 2n³ operations; any order of creation of the c_ijk results in the correct answer.
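For concreteness, here is a minimal C rendering of the ordinary algorithm described above; the function name, row-major storage, and the fixed i, j, k loop order are illustrative choices (the slide's point is that any order of the n³ updates gives the same result):

```c
#include <stddef.h>

/* Ordinary dense n-by-n matrix multiplication, C += A*B, as on the slide:
 * all n^3 partial products c_ijk = a_ik * b_kj are formed and accumulated
 * into c_ij.  Matrices are stored row-major; C is assumed to start at zero
 * (or to hold whatever is being accumulated into). */
static void matmul_ordinary(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];  /* one c_ijk */
}
```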

  15. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. Mission Statement: We study communication costs for the ordinary dense (OD) matrix-matrix multiplication in the sequential model. • dense matrix-matrix multiplication • sequential: two levels of memory » sequential = not parallel! » fast memory of size M » slow memory » computation happens in fast memory. [Diagram: Intel Xeon Processor E5520 (Nehalem); CPU at 10.12 GFLOP/sec/core; cache (8 MB); 25.6 GB/sec link to main memory (16 GB).]

  16. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. Mission Statement: We study communication costs for the ordinary dense (OD) matrix-matrix multiplication in the sequential model. • dense matrix-matrix multiplication • sequential: two levels of memory • communication cost: time, energy, etc.

  17. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. Mission Statement: We study communication costs for the ordinary dense (OD) matrix-matrix multiplication in the sequential model. • dense matrix-matrix multiplication • sequential: two levels of memory • communication cost: time, energy, etc. • ordinary: we compute all n³ of the products c_ijk = a_ik · b_kj (consequence: Strassen-like matrix-matrix multiplication algorithms are not allowed).

  18. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. Sequential Lower Bounds for Matrix-Matrix Multiplication. Consider any ordinary dense matrix-matrix multiplication algorithm for multiplying an n-by-n matrix with an n-by-n matrix, and consider a computer with fast memory of size M. Upper bound (square-tile matrix-matrix multiplication): the number of words transferred between slow and fast memory is at most 3.46 · n³/√M. Lower bound (Irony, Toledo, and Tiskin, 2004): the number of words transferred between slow and fast memory is at least 0.35 · n³/√M − M. Note: 3.46 ≈ 2√3. Note: 0.35 ≈ (2√2)⁻¹.
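The upper bound in this statement comes from the classical square-tiled (blocked) algorithm. A sketch in C follows, assuming a tile size b ≈ √(M/3) (so that one b-by-b tile of each of A, B, and C fits in fast memory) and that b divides n; the function name and storage layout are illustrative. Counting tile loads gives roughly 2n³/b ≈ 2√3 · n³/√M ≈ 3.46 · n³/√M words moved between slow and fast memory, plus lower-order terms, which matches the constant quoted above.

```c
#include <stddef.h>

/* Square-tiled C += A*B with tile size b (row-major storage, b divides n).
 * Choosing b ~ sqrt(M/3) lets one b-by-b tile of each of A, B, and C reside
 * in fast memory at once; the three innermost loops then run entirely out of
 * fast memory, so slow-memory traffic is dominated by the tile loads:
 * about 2*(n/b)^3*b^2 = 2n^3/b words for A and B, plus 2n^2 words for C. */
static void matmul_tiled(size_t n, size_t b, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += b)
        for (size_t jj = 0; jj < n; jj += b)
            for (size_t kk = 0; kk < n; kk += b)
                /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B
                 * and accumulate into the (ii,jj) tile of C */
                for (size_t i = ii; i < ii + b; i++)
                    for (size_t j = jj; j < jj + b; j++)
                        for (size_t k = kk; k < kk + b; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

A typical call would be matmul_tiled(n, b, A, B, C) with b derived from the actual cache size; handling tile sizes that do not divide n is omitted from this sketch.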

  19. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. What is an algorithm? • A sequence of the following instructions defines an algorithm: » Read an element from slow memory to fast memory. » Create an element in fast memory. » Write an element from fast memory to slow memory. » Delete an element from fast memory. » Perform a floating-point operation in fast memory. Example trace: Read a_11; Read b_11; Create c_111 = a_11 b_11; Read a_12; Read b_21; Create c_112 = a_12 b_21; Write c_11; Delete c_11, a_11, b_11; ... Split the instructions into segments so that exactly M reads and writes occur in each segment.

  20. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. What do we want to compute? We want to compute the n³ products c_ijk = a_ik b_kj. The computation of c_ijk requires a_ik and b_kj to be in cache. [Diagram: multiplication (i,j,k), c_ij = c_ij + a_ik b_kj, with c_ij at row i, column j of C; a_ik at row i, column k of A; b_kj at row k, column j of B.]

  21. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. What do we want to compute? We want to compute the n³ products c_ijk = a_ik b_kj. The computation of c_ijk requires a_ik, b_kj, and a c_ij* to be in cache. In order to compute c_ijk, • we either have a_ik in cache at the start of the segment (M_a) or we have to read a_ik (R_a) from slow memory to cache during the segment; • we either have b_kj in cache at the start of the segment (M_b) or we have to read b_kj (R_b) from slow memory to cache during the segment; • we either have a c_ij* in cache at the end of the segment (N_c) or we have to write it back (W_c) during the segment. [Diagram: multiplication (i,j,k), c_ij = c_ij + a_ik b_kj.]

  22. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. • Split the instructions into segments so that exactly M reads and writes occur in each segment. • Each segment contains M reads and writes. Example segment: Read a_11; Read b_11; Create c_111 = a_11 b_11; Read a_12; Read b_21; Create c_112 = a_12 b_21; Write c_11; Delete c_11, a_11, b_11; ...

  23. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. • Split the instructions into segments so that exactly M reads and writes occur in each segment. • Each segment contains M reads and writes. » R_a = number of reads for A. Example segment: Read a_11; Read b_11; Create c_111 = a_11 b_11; Read a_12; Read b_21; Create c_112 = a_12 b_21; Write c_11; Delete c_11, a_11, b_11; ...

  24. COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION. • Split the instructions into segments so that exactly M reads and writes occur in each segment. • Each segment contains M reads and writes. » R_a = number of reads for A. » W_a = number of writes for A. Example segment: Read a_11; Read b_11; Create c_111 = a_11 b_11; Read a_12; Read b_21; Create c_112 = a_12 b_21; Write c_11; Delete c_11, a_11, b_11; ...
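The transcript ends here, but the segment counting being set up on the last few slides feeds the standard argument that yields the lower bound quoted on slide 18. A sketch of that argument, assuming the Loomis-Whitney inequality (which these slides do not state explicitly), is:

```latex
% Per segment: exactly M reads/writes, so the number of distinct elements of A
% available during a segment is at most M (resident at the start) + M (read
% during the segment) = 2M; likewise at most 2M elements of B and at most 2M
% partial sums c_{ij*}.  By the Loomis-Whitney inequality, the number of
% products c_{ijk} = a_{ik} b_{kj} computable in one segment is at most
\[
  \sqrt{(2M)(2M)(2M)} \;=\; 2\sqrt{2}\, M^{3/2}.
\]
% Since n^3 products are needed in total, there are at least
% \lfloor n^3 / (2\sqrt{2}\, M^{3/2}) \rfloor complete segments, and each
% complete segment moves exactly M words, so the total traffic is at least
\[
  M \left\lfloor \frac{n^{3}}{2\sqrt{2}\, M^{3/2}} \right\rfloor
  \;\ge\; \frac{n^{3}}{2\sqrt{2}\,\sqrt{M}} - M
  \;\approx\; 0.35\, \frac{n^{3}}{\sqrt{M}} - M,
\]
% which is the lower bound of slide 18, with 0.35 \approx (2\sqrt{2})^{-1}.
```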
