Lower Bounds for Communication in Linear Algebra


  1. Lower Bounds for Communication in Linear Algebra
     Grey Ballard, UC Berkeley, January 9, 2012
     Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding, and by matching funds from U.C. Discovery (Award #DIG07-10227).

  2. Summary
     Motivation:
     - Communication is costly.
     - We'd like to reduce communication at the algorithm level.
     - How much communication can we possibly avoid?
     Outline:
     - Communication model
     - Methods of proving lower bounds
     - New algorithms developed to match lower bounds

  3. Memory Model
     By communication we mean:
     - moving data within the memory hierarchy on a sequential computer
     - moving data between processors on a parallel computer
     [Figure: sequential model with FAST and SLOW memory levels; parallel model with many processors, each with its own local memory]

  4. Communication Cost Model
     Measure communication in terms of messages and words.
     - Flop cost: γ
     - Cost of a message of size w words: α + β·w
     Total running time of an algorithm (ignoring overlap):
         α · (# messages) + β · (# words) + γ · (# flops)
     Think of α as latency + overhead cost and β as inverse bandwidth.
     As flop rates continue to improve more quickly than data transfer rates, the relative cost of communication (the first two terms) grows larger.
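     To make the model concrete, here is a minimal sketch (mine, not from the talk) that evaluates the α-β-γ cost model for given message, word, and flop counts; the parameter values and the matmul counts below are illustrative assumptions, not measurements.

         # Sketch of the alpha-beta-gamma cost model from the slide (illustrative values).
         def running_time(messages, words, flops,
                          alpha=1e-6,    # latency + overhead per message (seconds), assumed
                          beta=1e-9,     # inverse bandwidth: seconds per word, assumed
                          gamma=1e-11):  # seconds per flop, assumed
             """Total running time, ignoring overlap of communication and computation."""
             return alpha * messages + beta * words + gamma * flops

         # Example: classical O(n^3) matmul with fast memory of size M words
         # (communication counts taken from the bound on the next slides, up to constants).
         n, M = 4096, 2**20
         flops = 2 * n**3           # multiply-adds
         words = n**3 / M**0.5      # words moved between fast and slow memory
         messages = words / M       # each message holds at most M words
         print(running_time(messages, words, flops))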

  5.–6. Prior Work: Matrix Multiplication Lower Bounds
     Assume an O(n^3) algorithm (i.e., not Strassen-like).
     Sequential case with fast memory of size M:
     - lower bound on words moved between fast and slow memory: Ω(n^3 / √M) [HK81]
     - attained by the blocked algorithm and by the recursive algorithm
     Parallel case with P processors (local memory of size M):
     - lower bound on words communicated (along the critical path): Ω(n^3 / (P·√M)) [ITT04]
     - attained by "2D" and "3D" algorithms:

           algorithm     M                 lower bound attained
           2D [Can69]    O(n^2 / P)        Ω(n^2 / √P)
           3D [DNS81]    O(n^2 / P^(2/3))  Ω(n^2 / P^(2/3))

     More on these upper bounds later.
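     As an illustration of how the blocked algorithm attains the sequential bound, here is a minimal Python sketch (my own, not from the talk): with block size b ≈ √(M/3), three b×b blocks fit in fast memory at once, and the algorithm moves O((n/b)^3 · b^2) = O(n^3/√M) words.

         import numpy as np

         def blocked_matmul(A, B, M):
             """Blocked matrix multiplication: a sketch of the algorithm that attains
             the Omega(n^3 / sqrt(M)) sequential communication lower bound.
             M is the fast memory size in words; b is chosen so that three b-by-b
             blocks (one each of A, B, C) fit in fast memory simultaneously."""
             n = A.shape[0]
             b = max(1, int((M / 3) ** 0.5))
             C = np.zeros((n, n))
             for i in range(0, n, b):
                 for j in range(0, n, b):
                     for k in range(0, n, b):
                         # Each block access stands for a transfer of at most b^2 words
                         # between slow and fast memory: O((n/b)^3 * b^2) words in total.
                         C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
             return C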

  7. Proving Lower Bounds
     We've used three approaches to proving communication lower bounds:
     1. Reduction argument [BDHS10], [GDX11]
     2. Geometric embedding [ITT04], [BDHS11b]
     3. Computation graph analysis [HK81], [BDHS11a]
     Blue = work in which our group at Berkeley was involved.

  8. Reduction Example: LU
     It's easy to reduce matrix multiplication to LU:

         [ I  0  -B ]   [ I        ]   [ I  0   -B  ]
         [ A  I   0 ] = [ A  I     ] · [    I  A·B  ]  ≡  L · U
         [ 0  0   I ]   [        I ]   [         I  ]

     - LU factorization can be used to perform matrix multiplication.
     - The communication lower bound for matrix multiplication therefore applies to LU.
     - The reduction to Cholesky is a little trickier, but uses the same idea [BDHS10].
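     A quick numerical check of this identity (my own sketch, not part of the talk): build the 3n×3n block matrix and its two factors with NumPy, and read the product A·B out of the U factor.

         import numpy as np

         n = 3
         rng = np.random.default_rng(0)
         A, B = rng.random((n, n)), rng.random((n, n))
         I, Z = np.eye(n), np.zeros((n, n))

         # The 3n-by-3n matrix whose LU factorization encodes the product A @ B:
         T = np.block([[I, Z, -B],
                       [A, I,  Z],
                       [Z, Z,  I]])
         L = np.block([[I, Z, Z],
                       [A, I, Z],
                       [Z, Z, I]])
         U = np.block([[I, Z, -B],
                       [Z, I, A @ B],
                       [Z, Z, I]])

         assert np.allclose(L @ U, T)               # the factorization holds
         assert np.allclose(U[n:2*n, 2*n:], A @ B)  # the product appears inside U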

  9. Geometric Embedding Example: Matmul
     Crux of the proof is based on a geometric inequality from [LW49], used to prove the matrix multiplication lower bound in [ITT04].
     [Figure: a box with side lengths x, y, z, and a general 3D set V, each projected onto the three coordinate planes as the A, B, and C shadows]

         Volume of a box:     V = xyz = √(xz) · √(yz) · √(xy)
         Volume of a 3D set:  V ≤ √(area(A shadow)) · √(area(B shadow)) · √(area(C shadow))

     Given a limited set of data, how much useful computation can be done?
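     A discrete analogue of the inequality is easy to check numerically. The sketch below (an illustration I added, not from the talk) samples a random set of lattice points and compares its size against the product of the square roots of its three shadow sizes.

         import random

         # Discrete Loomis-Whitney check:
         # |V| <= sqrt(|shadow_xy| * |shadow_yz| * |shadow_xz|).
         random.seed(0)
         V = {(random.randrange(8), random.randrange(8), random.randrange(8))
              for _ in range(100)}

         shadow_xy = {(x, y) for x, y, z in V}  # projection along z ("C shadow")
         shadow_yz = {(y, z) for x, y, z in V}  # projection along x ("A shadow")
         shadow_xz = {(x, z) for x, y, z in V}  # projection along y ("B shadow")

         bound = (len(shadow_xy) * len(shadow_yz) * len(shadow_xz)) ** 0.5
         assert len(V) <= bound
         print(len(V), "<=", bound)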

  10. Extensions to the Rest of O(n^3) Linear Algebra
     We extended the geometric embedding approach of [ITT04]:
     - to other algorithms that "smell" like 3 nested loops:
       - the rest of the BLAS
       - Cholesky, LU, QR factorizations
       - eigenvalue and SVD reductions
       - sequences of algorithms (e.g., repeated matrix squaring)
       - graph algorithms (e.g., all-pairs shortest paths)
     - to dense or sparse problems
     - to sequential or parallel cases
     See [BDHS11b] for details and proofs.

  11. Extensions to the Rest of O(n^3) Linear Algebra
     General lower bound for O(n^3) linear algebra (3 nested loops):

         # words = Ω( # flops / √(fast/local memory size) )

     This bound can be applied processor by processor, even on heterogeneous platforms:
     - # flops refers to the work done by that processor
     - the memory size may depend on the hardware or the algorithm
     A lower bound on # messages can be derived by dividing the word bound by the memory size (the largest possible message size); it corresponds to the synchronization required.
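     For instance, here is a worked instance of the formula (my own example; the problem sizes and the standard (2/3)n^3 LU flop count are assumptions, and constants are ignored):

         from math import sqrt

         # Worked example of the general bound, for LU spread over P processors.
         n, M, P = 10_000, 2**22, 64           # matrix size, local memory (words), processors
         flops_per_proc = (2 / 3) * n**3 / P   # LU flop count, divided evenly
         words = flops_per_proc / sqrt(M)      # lower bound on words moved (up to a constant)
         messages = words / M                  # divide by the largest possible message size
         print(f"words >= ~{words:.3g}, messages >= ~{messages:.3g}")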

  12.–13. Computation Graph Analysis
     - Red-blue pebble game introduced in [HK81]
     - lower bounds proved for matrix multiplication, FFT, and others
     - pebbling game extended in [Sav95] and later papers
     [Figure: a computation graph with input/output vertices, intermediate values, and dependency edges; a vertex subset S with its read set R_S and write set W_S]
     We've connected graph expansion to communication [BDHS11a]:
     - expansion describes the relationship between a subset and its neighbors in the complement
     - larger expansion implies more communication necessary
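     For reference, one standard definition of edge expansion for a d-regular graph G = (V, E) (a general textbook definition, not necessarily the exact variant used in [BDHS11a]):

         % Edge expansion: the minimum, over small vertex subsets S, of the
         % fraction of S's edge endpoints that cross into the complement;
         % larger h(G) forces more communication across any partition of the work.
         h(G) \;=\; \min_{S \subseteq V,\; |S| \le |V|/2}
                    \frac{|E(S,\, V \setminus S)|}{d\,|S|}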

  14. Computation Graph Analysis Example: Strassen
     Strassen's original algorithm uses 7 multiplies and 18 adds for n = 2:

         M1 = (A11 + A22) · (B11 + B22)        C11 = M1 + M4 − M5 + M7
         M2 = (A21 + A22) · B11                C12 = M3 + M5
         M3 = A11 · (B12 − B22)                C21 = M2 + M4
         M4 = A22 · (B21 − B11)                C22 = M1 − M2 + M3 + M6
         M5 = (A11 + A12) · B22
         M6 = (A21 − A11) · (B11 + B12)
         M7 = (A12 − A22) · (B21 + B22)

     [Figure: the computation graph for one level of Strassen, with "Enc A" and "Enc B" encoding the inputs, the seven products M1–M7 in the middle, and "Dec C" decoding the outputs]
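     A minimal recursive implementation of these formulas (my own sketch, not from the talk; it assumes n is a power of 2 and uses an arbitrary cutoff below which it falls back to classical multiplication):

         import numpy as np

         def strassen(A, B, cutoff=64):
             """Recursive Strassen multiplication; n must be a power of 2."""
             n = A.shape[0]
             if n <= cutoff:                  # fall back to the classical algorithm
                 return A @ B
             h = n // 2
             A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
             B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
             M1 = strassen(A11 + A22, B11 + B22, cutoff)
             M2 = strassen(A21 + A22, B11,       cutoff)
             M3 = strassen(A11,       B12 - B22, cutoff)
             M4 = strassen(A22,       B21 - B11, cutoff)
             M5 = strassen(A11 + A12, B22,       cutoff)
             M6 = strassen(A21 - A11, B11 + B12, cutoff)
             M7 = strassen(A12 - A22, B21 + B22, cutoff)
             C = np.empty((n, n))
             C[:h, :h] = M1 + M4 - M5 + M7
             C[:h, h:] = M3 + M5
             C[h:, :h] = M2 + M4
             C[h:, h:] = M1 - M2 + M3 + M6
             return C

         # Sanity check against NumPy:
         A = np.random.rand(128, 128); B = np.random.rand(128, 128)
         assert np.allclose(strassen(A, B), A @ B)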

  15.–19. Computation Graph Analysis Example: Strassen
     Strassen works recursively, so its computation graph has recursive structure.
     [Figure build, shown over several slides: the one-level graph is expanded level by level, with each Enc A, Enc B, and Dec C node replaced by a smaller copy of the same encoding/decoding structure]
     Expansion properties of this graph lead to communication lower bounds.

  20.–22. Lower Bounds for Strassen
     The communication lower bounds are similar to those for classical matmul:

                      Classical                    Strassen                     Strassen-like
         Sequential   Ω( (n/√M)^(log₂ 8) · M )     Ω( (n/√M)^(log₂ 7) · M )     Ω( (n/√M)^(ω₀) · M )
         Parallel     Ω( (n/√M)^(log₂ 8) · M/P )   Ω( (n/√M)^(log₂ 7) · M/P )   Ω( (n/√M)^(ω₀) · M/P )

     (Here ω₀ denotes the exponent of the Strassen-like algorithm's flop count.)
     - These lower bounds imply that Strassen and faster algorithms require less communication.
     - The sequential lower bound is attained by the recursive algorithm.
     - The parallel lower bound is attainable with a new algorithm [BDH+12].
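     As a sanity check (my own algebra, not a slide), the classical column reduces to the familiar bounds from slides 5–6:

         % Classical case: log_2 8 = 3, so the bound collapses to n^3 / sqrt(M).
         \left(\frac{n}{\sqrt{M}}\right)^{\log_2 8} \cdot M
           \;=\; \frac{n^3}{M^{3/2}} \cdot M
           \;=\; \frac{n^3}{\sqrt{M}},
         \qquad
         \left(\frac{n}{\sqrt{M}}\right)^{\log_2 8} \cdot \frac{M}{P}
           \;=\; \frac{n^3}{P\,\sqrt{M}}.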

  23. Lower Bounds → Algorithms
     Proving lower bounds allows us to:
     - evaluate existing algorithms
     - identify possibilities for algorithmic innovation
     - obtain asymptotic improvements (speedups increase with n and/or M)

  24. Communication Upper Bounds - Sequential O(n^3)

         Algorithm              Minimizes # words                          Minimizes # messages
         BLAS3                  usual blocked or recursive algorithms [Gus97, FLPR99]
         Cholesky               LAPACK, [Gus97, AP00], [BDHS10]            [Gus97, AP00], [BDHS10]
         Symmetric Indefinite   LAPACK (rarely), [BDD+12]                  [BDD+12]
         LU with pivoting       LAPACK (rarely), [Gus97, Tol97], [GDX11]   [GDX11]
         QR                     LAPACK (rarely), [FW03], [EG98], [DGHL11]  [FW03], [DGHL11]
         Eig, SVD               [BDK12, BDD11]                             [BDK12, BDD11]

     Blue = work in which our group at Berkeley was involved.

  25. Communication Upper Bounds - Parallel O(n^3)

         Algorithm          Reference    Factor exceeding lower   Factor exceeding lower
                                         bound for # words        bound for # messages
         Matrix Multiply    [Can69]      1                        1
         Cholesky           ScaLAPACK    log P                    log P
         LU with pivoting   [GDX11]      log P                    log P
                            ScaLAPACK    log P                    (N/P^(1/2)) · log P
         QR                 [DGHL11]     log P                    log^3 P
                            ScaLAPACK    log P                    (N/P^(1/2)) · log P
         SymEig, SVD        [BDK12]      log P                    log^3 P
                            ScaLAPACK    log P                    N/P^(1/2)
         NonsymEig          [BDD11]      log P                    P^(1/2) · log P
                            ScaLAPACK    P^(1/2) · log P          N · log P

     Blue = work in which our group at Berkeley was involved. Red = not optimal.
     Lower bounds: # words = Ω(n^2/√P), # messages = Ω(√P).
