
Single processor tuning (2/2), Prof. Richard Vuduc, Georgia Institute of Technology (PowerPoint presentation)



  1. Single processor tuning (2/2) Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.16] Thursday, February 28, 2008 1

  2. Today’s sources
  - CS 267 (Demmel & Yelick @ UCB; Spring 2007)
  - “A family of high-performance matrix multiplication algorithms,” by Gunnels et al. (2006)
  - “Anatomy of high-performance matrix multiplication,” by Goto and van de Geijn (2006)
  - “An experimental comparison of cache-oblivious and cache-conscious programs,” by Yotov et al. (SPAA 2007)
  - Talk by Matteo Frigo at CScADS Autotuning Workshop (2007)

  3. Review: GPGPUs. (I don’t know; you tell me!) 3

  4. Review: A one-level model of the memory hierarchy 4

  5. A simple model of memory
  m ≡ no. of words moved from slow to fast memory
  f ≡ no. of flops
  α ≡ time per slow memory op.
  τ ≡ time per flop
  q ≡ f/m = flop-to-mop ratio ⇐ “computational intensity”
  T = f·τ + m·α = f·τ·(1 + (α/τ)·(1/q)), where α/τ is the machine balance.

  6. Blocked (tiled) matrix multiply
  // Let I, J, K = blocks of b indices
  for I ← index blocks 1 to n/b do
    for J ← index blocks 1 to n/b do
      // Read block C_IJ
      for K ← index blocks 1 to n/b do
        // Read block A_IK
        // Read block B_KJ
        C_IJ ← C_IJ + A_IK · B_KJ
      // Write C_IJ to slow memory
  m ≈ n³/b and q ≈ b ⇒ T = f·τ·(1 + (α/τ)·(1/b))

  7. Can we do better? Nope. Theorem [Hong and Kung (1981)]: any schedule of conventional matrix multiply must transfer Ω(n³/√M) words between slow and fast memory, where M < n²/6. Last time: we saw the intuitive proof by Toledo (1999). Historical note: Rutledge & Rubinstein (1951-52). So cache-blocked matrix multiply is asymptotically optimal: b = O(√M) ⇒ m = O(n³/b) = O(n³/√M).

  8. Architectural implications
  Let M ≡ size of fast memory. The blocked algorithm needs 3b² ≤ M, with q ≈ b. To keep the memory term below 10% of compute time we want τ·(1 + (α/τ)·(1/q)) < 1.1·τ, i.e., q ≥ 10·(α/τ), hence M ≥ 3q² ≥ 300·(α/τ)².
  Arch.       α/τ    M ≥
  Ultra 2i    25     1.5 MB
  Ultra 3     14     460 KB
  Pentium 3   6.3     94 KB
  Pentium 3M  10     240 KB
  Power3      8.8    180 KB
  Power4      15     527 KB
  Itanium 1   36     3.0 MB
  Itanium 2   5.5     71 KB
  Note: “M” in bytes to 2 digits; assumes 8-byte (double-precision) words.

  9. What happens in practice? Experiment: One-level cache-blocked matrix multiply Block size chosen as square, by exhaustive search over sizes up to 64 9

  10. Tiled MM on AMD Opteron: 2.2 GHz (4.4 Gflop/s peak), 1 MB L2 cache. It achieves less than 25% of peak! We evidently still have a lot of work to do...

  11. Review: Real memory hierarchies 11

  12. What happened at powers of 2?
  Byte-addressable, 32-bit addresses. Cache: direct-mapped, 8 KB capacity, 16-byte lines.
  XXXX XXXX XXXX XXXX XXX0 0000 0000 0000
  XXXX XXXX XXXX XXXX XXX0 0000 0001 0000
  XXXX XXXX XXXX XXXX XXX0 0000 0010 0000
  XXXX XXXX XXXX XXXX XXX0 0000 0011 0000
  XXXX XXXX XXXX XXXX XXX0 0000 0100 0000
  XXXX XXXX XXXX XXXX XXX0 0000 0101 0000
  ...
  XXXX XXXX XXXX XXXX XXX1 1111 1111 0000
  Addresses that agree in their low 13 bits (9 index bits + 4 offset bits) map to the same cache line, so power-of-2 strides repeatedly hit the same lines.

  13. [Diagram: the memory hierarchy, fast to slow: registers, L1, L2, main memory]

  14. [Diagram: the memory hierarchy with the TLB added: registers, L1, TLB, L2, main memory]

  15. TLB is part of the memory hierarchy
  - Translation Look-aside Buffer (TLB) for virtual address space management
  - Divide the address space into pages (4-32 KB typical; larger possible)
  - The page table maps virtual to physical addresses & records whether a page is in memory or on disk
  - The page table can be large; the TLB caches recent translations
  - May be set-associative or fully associative
  - Conceptually like a cache with a large block size, i.e., 1 page
  - May have multiple levels of TLB, just like cache
  - Can prefetch to hide cache misses, but not TLB misses

  16. Experiment to observe memory parameters: stream through an array with stride s and measure the average access time (Saavedra-Barrera benchmark).

  17. Average memory access time (Saavedra-Barrera), Sun Ultra IIi (333 MHz). [Plot; the curves reveal: L1: 16 KB, 16 B lines; L2: 2 MB, 64 B lines; TLB: 32 entries, 8 KB pages; main memory beyond.]

  18. Average memory access time (Saavedra-Barrera), Pentium III (Katmai; 550 MHz). [Plot; the curves reveal: L1: 16 KB, 32 B lines; L2: 512 KB, 32 B lines; TLB: 64 entries, 4 KB pages; main memory beyond.]

  19. General multi-level blocking [Goto & van de Geijn (2006)] 19

  20. C ← C + A·B, where C is m×n, A is m×k, B is k×n. Shapes of interest: “matrix-matrix” (m, n, k all large), “matrix-panel” (n small), “panel-matrix” (m small), and “panel-panel,” or “fat outer product” (k small).

  21. [Figure: A, B, C block diagram, animation step 1]

  22. [Figure: animation step 2]

  23. [Figure: animation step 3]

  24. [Figure: animation step 4]

  25. C ← C + A·B, where C is m×n, A is m×k, B is k×n. Further shapes: “block-panel,” “panel-block,” and “fat dot product” (m and n small, k large).

  26. [Figure: A, B, C block diagram]

  27. [Figure only]

  28. [Figure only]

  29. [Figure only]

  30. [Figure: C, A, B partitioned into b_m × b_n, b_m × b_k, and b_k × b_n blocks]
  // Let I, J, K = blocks of indices
  for K ← blocks 1 to k/b_k do
    for I ← blocks 1 to m/b_m do
      for J ← blocks 1 to n/b_n do
        C_IJ ← C_IJ + A_IK × B_KJ

  31. “Block-panel” multiply. Assumes:
  1. A, B_J, C_J fit in cache (of size M): b_m·b_k + (b_k + b_m)·b_n ≤ M
  2. The above ⇒ the product runs at peak
  3. A is not evicted prematurely
  // “Block-panel” multiply
  // Load b_m × b_k block of A into cache
  for J ← blocks 1 to n/b_n do
    // Load b_k × b_n block of B into cache
    // Load b_m × b_n block of C into cache
    C_J ← C_J + A × B_J
    // Store b_m × b_n block of C to memory

  32. Counting flops and memory operations for the block-panel multiply (same assumptions as before):
  f = 2·b_m·b_k·n
  m = b_m·b_k + (b_k + 2·b_m)·n
  ⇒ q = f/m = 2 / (1/b_m + 2/b_k + 1/n)

  33. Given a multi-level memory hierarchy, in what cache should the “A” block live?
  - Want a large A block
  - The L1 cache is usually quite small
  - What about L2? Let ρ₁ ≡ peak L1 flop/s and β₂ ≡ peak L2-to-CPU bandwidth. To keep the kernel busy while streaming A from L2, we need 2·b_m·b_k·b_n / ρ₁ ≥ b_m·b_k / β₂, i.e., b_n ≥ ρ₁ / (2·β₂). Typically, need b_n ≥ 2.

  34. Constraints so far (same assumptions): b_m·b_k + (b_k + b_m)·b_n ≤ M and b_n ≥ ρ₁ / (2·β₂).

  35. Constraints so far (same assumptions): b_m·b_k + (b_k + b_m)·b_n ≤ M and b_n ≥ ρ₁ / (2·β₂). What about the TLB?

  36. Considerations for the TLB
  [Figure: an n = 1024 matrix stored in column-major order, columns 1, 2, 3, ..., 32, 33 marked; page = 4 KB; TLB: 32 entries. Striding across columns touches one or more new pages per column, so a block spanning a few dozen columns already exhausts the 32-entry TLB.]

  37. What about the TLB? A block of A straddles pages, so re-pack it on the fly ⇒ the “copy optimization.” Copy the B panel as well. This adds assumption 4: operands “fit in” the TLB. (Constraints as before: b_m·b_k + (b_k + b_m)·b_n ≤ M; b_n ≥ ρ₁ / (2·β₂).)

  38. [Figure: the panel-block (“fat dot”) decomposition]

  39. Blocked multiply with packed buffers:
  // Let I, J, K = blocks of indices
  for K ← blocks 1 to k/b_k do
    B̃ ← pack(B_{K,⋆})
    for I ← blocks 1 to m/b_m do
      Ã ← pack(A_IK)
      for J ← blocks 1 to n/b_n do
        C̃ ← Ã × B̃_J       // Compute in buffer C̃
        C_IJ ← C_IJ + C̃    // Unpack C̃

  40. Which is better? [Figure: the block-panel and panel-block shapes of A, B, C side by side]

  41. Dense matrix multiply performance (square n×n operands), 333 MHz Sun Ultra 2i. [Plot: Mflop/s and fraction of peak vs. matrix dimension n (0 to 1200), for four implementations: vendor; register/instruction-level tuning + cache tiling + copy optimization; cache tiling + copy optimization; reference.] Source: Vuduc, Demmel, Bilmes (IJHPCA 2004)

  42. Dense matrix multiply performance (square n×n operands), 800 MHz Intel Pentium III-mobile. [Plot: Mflop/s and fraction of peak vs. matrix dimension n (0 to 1200), for five implementations: vendor; Goto-BLAS; register/instruction-level tuning + cache tiling + copy optimization; cache tiling + copy optimization; reference.] Source: Vuduc, Demmel, Bilmes (IJHPCA 2004)

  43. The inner kernel: instruction scheduling and register allocation. [Figure: the cache-level tile further blocked into small register-level sub-blocks]

  44. Administrivia 44

  45. Two joint classes with CS 8803 SC
  - Tues 2/19: Floating-point issues in parallel computing, by me
  - Tues 2/26: GPGPUs, by Prof. Hyesoon Kim. Scribe?
  Both classes meet in Klaus 1116E.

  46. Homework 1: Parallel conjugate gradients
  - Extension: due Wednesday 2/27 @ 8:30 am
  - Implement a parallel solver for Ax = b (serial C version provided)
  - Evaluate on three matrices: a 27-pt stencil, and two application matrices
  - “Simplified”: no preconditioning
  - Build performance models to understand the scalability of your implementation: make measurements, then build predictive models
  - Collaboration encouraged: compare programming models or platforms
