
Autotuning (2.5/2): TCE & Empirical compilers

Autotuning (2.5/2): TCE & Empirical compilers. Prof. Richard Vuduc, Georgia Institute of Technology. CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.19], Tuesday, March 11, 2008. Today's sources: CS 267 at UCB (Demmel & Yelick)


  1. Autotuning (2.5/2): TCE & Empirical compilers. Prof. Richard Vuduc, Georgia Institute of Technology. CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.19], Tuesday, March 11, 2008

  2. Today’s sources: CS 267 at UCB (Demmel & Yelick); papers from various autotuning projects (PHiPAC, ATLAS, FFTW, SPIRAL, TCE); see the Proc. IEEE 2005 special issue on Program Generation, Optimization, and Platform Adaptation; me (for once!)

  3. Review: Autotuners

  4. Performance-engineering challenges. [Figure: design-space diagram. Outer control structure: iterative or recursive. Inner control structure: recursive, iterative, or statement. Kernel level: mini-kernel or micro-kernel. Variant labels: ATLAS CGw/S, ATLAS Unleashed, Scalarized, None, Coloring, Belady, Compiler, BRILA.]

  5. Motivation for performance tuning. [Figure: performance plot, y-axis in pseudo Mflop/s.] Source: J. Johnson (2007), CScADS autotuning workshop

  6. Context for autotuning. Problem: HPC needs detailed low-level machine knowledge. Autotuning methodology: (1) identify and generate a space of implementations; (2) search (modeling, experiments) to choose the best one. Early idea seedlings: polyalgorithms; profile- and feedback-directed compilation; domain- and architecture-specific code generators
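The two-step methodology on this slide (generate a space of implementations, then search it empirically) can be sketched in a few lines. This is a toy illustration, not how PHiPAC or ATLAS are implemented: the kernel (a blocked matrix multiply in pure Python), the candidate block sizes, and the names `make_blocked_matmul` and `autotune` are all assumptions chosen for the example.

```python
import time

def make_blocked_matmul(block):
    """Generate one implementation variant: matmul tiled with the given block size."""
    def matmul(A, B, n):
        C = [[0.0] * n for _ in range(n)]
        for ii in range(0, n, block):
            for kk in range(0, n, block):
                for jj in range(0, n, block):
                    for i in range(ii, min(ii + block, n)):
                        Ai, Ci = A[i], C[i]
                        for k in range(kk, min(kk + block, n)):
                            aik, Bk = Ai[k], B[k]
                            for j in range(jj, min(jj + block, n)):
                                Ci[j] += aik * Bk[j]
        return C
    return matmul

def autotune(candidates, n=24, trials=2):
    """Empirical search: time every variant on this machine and keep the fastest."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        kernel = make_blocked_matmul(block)
        best = float("inf")
        for _ in range(trials):
            t0 = time.perf_counter()
            kernel(A, B, n)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best
    return min(timings, key=timings.get), timings

best_block, timings = autotune([1, 2, 4, 8, 24])
print("fastest block size on this machine:", best_block)
```

The winning block size depends on the machine and interpreter, which is the point of searching by experiment rather than fixing the choice in advance.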

  7. Example: What a search space looks like. [Figure: Mflop/s as a function of the tile sizes m0 and n0, with k0 = 1.] Platform: Sun Ultra IIi, 16 double regs, 667 Mflop/s peak; unrolled, pipelined inner-kernel; Sun cc v5.0 compiler. Source: PHiPAC Project at UC Berkeley (1997)

  8. Cooley-Tukey FFT algorithm: Encoding in FFTW’s codelet generator

      For $x, y \in \mathbb{C}^N$:

      $\mathrm{DFT}_N(x, k) \equiv y[k] \leftarrow \sum_{j=0}^{N-1} x[j] \cdot \omega_N^{-kj}$

      With $N = N_1 N_2$:

      $y[k_1 + k_2 \cdot N_1] \leftarrow \sum_{n_2=0}^{N_2-1} \left[ \left( \sum_{n_1=0}^{N_1-1} x[n_1 \cdot N_2 + n_2] \cdot \omega_{N_1}^{-k_1 n_1} \right) \cdot \omega_N^{-k_1 n_2} \right] \cdot \omega_{N_2}^{-k_2 n_2}$

      The inner sum is an $N_1$-point DFT, $\omega_N^{-k_1 n_2}$ is the twiddle factor, and the outer sum is an $N_2$-point DFT.

      Functional pseudo-code:

      let dftgen(N, x) ≡ fun k → ...   # DFT_N(x, k)
      let cooley_tukey(N1, N2, x) ≡
        let x̂ ≡ fun n2, n1 → x(n2 + n1 · N2) in
        let G1 ≡ fun n2 → dftgen(N1, x̂(n2, ·)) in
        let W ≡ fun k1, n2 → G1(n2, k1) · ω_N^(−k1·n2) in
        let G2 ≡ fun k1 → dftgen(N2, W(k1, ·)) in
        fun k → G2(k mod N1, k div N1)
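The functional pseudo-code on this slide translates almost line for line into Python closures. The sketch below is an illustration only, not FFTW's generator (which emits optimized straight-line code rather than evaluating closures). It assumes $\omega_N = e^{2\pi i/N}$, and the helper name `omega` is introduced here for the example.

```python
import cmath

def omega(n, e):
    """omega_n ** e, assuming omega_n = exp(2*pi*i/n)."""
    return cmath.exp(2j * cmath.pi * e / n)

def dftgen(n, x):
    """Return the function k -> DFT_n(x, k); x is itself a function j -> x[j]."""
    return lambda k: sum(x(j) * omega(n, -k * j) for j in range(n))

def cooley_tukey(n1, n2, x):
    """Return k -> DFT_{n1*n2}(x, k), built from n1-point and n2-point DFTs."""
    n = n1 * n2
    x_hat = lambda i2, i1: x(i2 + i1 * n2)                  # view x as a 2-D array
    g1 = lambda i2: dftgen(n1, lambda i1: x_hat(i2, i1))    # n1-point DFTs
    w = lambda k1, i2: g1(i2)(k1) * omega(n, -k1 * i2)      # twiddle multiplication
    g2 = lambda k1: dftgen(n2, lambda i2: w(k1, i2))        # n2-point DFTs
    return lambda k: g2(k % n1)(k // n1)

data = [1, 2, 3, 4, 5, 6]
x = lambda j: data[j]
direct = dftgen(6, x)
split = cooley_tukey(2, 3, x)
max_err = max(abs(direct(k) - split(k)) for k in range(6))
print("max difference between direct and Cooley-Tukey DFT:", max_err)
```

Comparing against the direct $O(N^2)$ sum is a quick way to check that the index mapping $k = k_1 + k_2 N_1$, $j = n_1 N_2 + n_2$ has been wired correctly.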

  11. Tensor Contraction Engine (TCE) for quantum chemistry

  12. Tensor Contraction Engine (TCE). Application domain: quantum chemistry (electronic structure calculations); the dominant computation is expressible as a “tensor contraction”. TCE generates a complete parallel program from a high-level spec; automates time-space trade-offs; Output. S. Hirata (2002), and many others. The following presentation is taken from the Proc. IEEE 2005 special issue

  13. Motivation: Simplify program development. Source: Baumgartner, et al. (2005)

  14. Rewriting to reduce operation counts

      Naïvely, $\approx 4 \times N^{10}$ flops:

      $S_{abij} = \sum_{c,d,e,f,k,l} A_{acik} \times B_{befl} \times C_{dfjk} \times D_{cdel}$

      $\Downarrow$

      $S_{abij} = \sum_{c,k} \left[ \sum_{d,f} \left( \sum_{e,l} B_{befl} \times D_{cdel} \right) \times C_{dfjk} \right] \times A_{acik}$

      Assuming associativity and distributivity, $\approx 6 \times N^{6}$ flops, but this also requires temporary storage. Source: Baumgartner, et al. (2005)
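The flop counts on this slide can be checked numerically on a tiny problem. The sketch below is a plain-Python illustration rather than TCE output: tensors are stored as dictionaries keyed by index tuples, every index is assumed to range over the same extent `n`, and the function names are made up for the example.

```python
import random
from itertools import product

def random_tensor(n, rng):
    """Rank-4 tensor stored as a dict keyed by index tuples."""
    return {idx: rng.random() for idx in product(range(n), repeat=4)}

def contract_naive(A, B, C, D, n):
    """One 10-deep loop nest: 3 multiplies + 1 add per innermost iteration."""
    r, flops, S = range(n), 0, {}
    for a, b, i, j in product(r, repeat=4):
        acc = 0.0
        for c, d, e, f, k, l in product(r, repeat=6):
            acc += A[a, c, i, k] * B[b, e, f, l] * C[d, f, j, k] * D[c, d, e, l]
            flops += 4
        S[a, b, i, j] = acc
    return S, flops

def contract_factored(A, B, C, D, n):
    """Three 6-deep loop nests with temporaries T1 and T2: 2 flops per iteration."""
    r, flops = range(n), 0
    T1, T2, S = {}, {}, {}
    for b, c, d, f in product(r, repeat=4):
        acc = 0.0
        for e, l in product(r, repeat=2):
            acc += B[b, e, f, l] * D[c, d, e, l]
            flops += 2
        T1[b, c, d, f] = acc
    for b, c, j, k in product(r, repeat=4):
        acc = 0.0
        for d, f in product(r, repeat=2):
            acc += T1[b, c, d, f] * C[d, f, j, k]
            flops += 2
        T2[b, c, j, k] = acc
    for a, b, i, j in product(r, repeat=4):
        acc = 0.0
        for c, k in product(r, repeat=2):
            acc += T2[b, c, j, k] * A[a, c, i, k]
            flops += 2
        S[a, b, i, j] = acc
    return S, flops

n = 3
rng = random.Random(0)
A, B, C, D = (random_tensor(n, rng) for _ in range(4))
S_naive, flops_naive = contract_naive(A, B, C, D, n)
S_fact, flops_fact = contract_factored(A, B, C, D, n)
print(flops_naive, flops_fact)  # 4*n**10 versus 6*n**6
```

Both versions produce the same `S` up to floating-point rounding, since the factored form only reorders the sums and products.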

  15. Operation and storage minimization via loop fusion

      $T^{(1)}_{bcdf} = \sum_{e,l} B_{befl} \times D_{cdel}$

      $T^{(2)}_{bcjk} = \sum_{d,f} T^{(1)}_{bcdf} \times C_{dfjk}$

      $S_{abij} = \sum_{c,k} T^{(2)}_{bcjk} \times A_{acik}$

      T1 = T2 = S = 0
      for b, c, d, e, f, l do
        T1[b, c, d, f] += B[b, e, f, l] · D[c, d, e, l]
      for b, c, d, f, j, k do
        T2[b, c, j, k] += T1[b, c, d, f] · C[d, f, j, k]
      for a, b, c, i, j, k do
        S[a, b, i, j] += T2[b, c, j, k] · A[a, c, i, k]

  16. Operation and storage minimization via loop fusion

      $T^{(1)}_{bcdf} = \sum_{e,l} B_{befl} \times D_{cdel}$

      $T^{(2)}_{bcjk} = \sum_{d,f} T^{(1)}_{bcdf} \times C_{dfjk}$

      $S_{abij} = \sum_{c,k} T^{(2)}_{bcjk} \times A_{acik}$

      Unfused:

      T1 = T2 = S = 0
      for b, c, d, e, f, l do
        T1[b, c, d, f] += B[b, e, f, l] · D[c, d, e, l]
      for b, c, d, f, j, k do
        T2[b, c, j, k] += T1[b, c, d, f] · C[d, f, j, k]
      for a, b, c, i, j, k do
        S[a, b, i, j] += T2[b, c, j, k] · A[a, c, i, k]

      Fused (T1 shrinks to a scalar T1f, T2 to a 2-D array T2f):

      S = 0
      for b, c do
        T2f ← 0
        for d, f do
          T1f ← 0
          for e, l do
            T1f += B[b, e, f, l] · D[c, d, e, l]
          for j, k do
            T2f[j, k] += T1f · C[d, f, j, k]
        for a, i, j, k do
          S[a, b, i, j] += T2f[j, k] · A[a, c, i, k]
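A small runnable check makes the storage saving from fusion concrete. This is a sketch under simplifying assumptions: all indices share one extent `n`, tensors are dictionaries keyed by index tuples, and the function names are hypothetical. The fused version clears the scalar temporary at the start of each (d, f) iteration, which the accumulation requires.

```python
import random
from itertools import product

def contract_unfused(A, B, C, D, n):
    """Three separate loop nests; T1 and T2 are full rank-4 temporaries."""
    r = range(n)
    idx4 = list(product(r, repeat=4))
    T1, T2, S = (dict.fromkeys(idx4, 0.0) for _ in range(3))
    for b, c, d, e, f, l in product(r, repeat=6):
        T1[b, c, d, f] += B[b, e, f, l] * D[c, d, e, l]
    for b, c, d, f, j, k in product(r, repeat=6):
        T2[b, c, j, k] += T1[b, c, d, f] * C[d, f, j, k]
    for a, b, c, i, j, k in product(r, repeat=6):
        S[a, b, i, j] += T2[b, c, j, k] * A[a, c, i, k]
    return S, len(T1) + len(T2)          # temporary storage: 2 * n**4 elements

def contract_fused(A, B, C, D, n):
    """Loops over b, c (and d, f) fused; T1 becomes a scalar, T2 an n-by-n array."""
    r = range(n)
    S = dict.fromkeys(product(r, repeat=4), 0.0)
    for b, c in product(r, repeat=2):
        T2f = dict.fromkeys(product(r, repeat=2), 0.0)
        for d, f in product(r, repeat=2):
            T1f = 0.0                    # reset for every (d, f)
            for e, l in product(r, repeat=2):
                T1f += B[b, e, f, l] * D[c, d, e, l]
            for j, k in product(r, repeat=2):
                T2f[j, k] += T1f * C[d, f, j, k]
        for a, i, j, k in product(r, repeat=4):
            S[a, b, i, j] += T2f[j, k] * A[a, c, i, k]
    return S, 1 + n * n                  # temporary storage: 1 + n**2 elements

n = 3
rng = random.Random(1)
A, B, C, D = ({idx: rng.random() for idx in product(range(n), repeat=4)}
              for _ in range(4))
S_unfused, temp_unfused = contract_unfused(A, B, C, D, n)
S_fused, temp_fused = contract_fused(A, B, C, D, n)
print("temporary elements:", temp_unfused, "unfused vs", temp_fused, "fused")
```

The operation count is unchanged by this particular fusion; only the size of the intermediates drops, from $2N^4$ to $1 + N^2$ elements.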

  17. Time-space trade-offs

      Max index of a to f: O(1000); i to k: O(100)

      for a, e, c, f do
        for i, j do
          X_aecf += T_ijae · T_ijcf             # “contraction” of T over i, j
      for c, e, b, k do
        T(1)_cebk ← f1(c, e, b, k)              # integrals, O(1000) flops
      for a, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)              # integrals, O(1000) flops
      for c, e, a, f do
        for b, k do
          Y_ceaf += T(1)_cebk · T(2)_afbk       # “contraction” over T(1) and T(2)
      for c, e, a, f do
        E += X_aecf · Y_ceaf
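A back-of-the-envelope calculation shows why this loop nest forces a trade-off. The numbers below are assumptions for illustration: the O(1000) and O(100) extents are taken as exactly 1000 and 100, and each element is assumed to be an 8-byte double.

```python
N_LARGE = 1000   # assumed extent of indices a..f
N_SMALL = 100    # assumed extent of indices i..k
BYTES = 8        # assumed size of one double-precision element

# Storage if every intermediate is kept as a full array.
x_elems = N_LARGE ** 4                 # X[a,e,c,f]
y_elems = N_LARGE ** 4                 # Y[c,e,a,f]
t1_elems = N_LARGE ** 3 * N_SMALL      # T1[c,e,b,k]: c, e, b large; k small
t2_elems = N_LARGE ** 3 * N_SMALL      # T2[a,f,b,k]
x_terabytes = x_elems * BYTES / 1e12

# Integral evaluations: once per element when T1 is stored, versus once per
# (a, e, c, f, b, k) iteration when T1 is recomputed inside the Y loop nest.
f1_calls_stored = t1_elems
f1_calls_recomputed = N_LARGE ** 5 * N_SMALL
recompute_factor = f1_calls_recomputed // f1_calls_stored

print(f"X alone: {x_elems:.0e} elements, about {x_terabytes:.0f} TB")
print(f"recomputing f1 inside the Y nest costs {recompute_factor:.0e}x more calls")
```

Under these assumptions, storing X or Y in full is impractical, while recomputing every integral wherever it is needed multiplies the integral cost by roughly the product of the two extra loop extents.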

  18. Time-space trade-offs

      Max index of a to f: O(1000); i to k: O(100)

      for a, e, c, f do
        for i, j do
          X_aecf += T_ijae · T_ijcf
      for c, e, b, k do
        T(1)_cebk ← f1(c, e, b, k)
      for a, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)
      for c, e, a, f do
        for b, k do
          Y_ceaf += T(1)_cebk · T(2)_afbk
      for c, e, a, f do
        E += X_aecf · Y_ceaf

      Loop nests with the same indices ⇒ loop fusion candidates

  19. Time-space trade-offs

      Original:

      for a, e, c, f do
        for i, j do
          X_aecf += T_ijae · T_ijcf
      for c, e, b, k do
        T(1)_cebk ← f1(c, e, b, k)
      for a, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)
      for c, e, a, f do
        for b, k do
          Y_ceaf += T(1)_cebk · T(2)_afbk
      for c, e, a, f do
        E += X_aecf · Y_ceaf

      With redundant loops added around T(1) and T(2) (adds extra flops):

      for a, e, c, f do
        for i, j do
          X_aecf += T_ijae · T_ijcf
      for a, c, e, f, b, k do
        T(1)_cebk ← f1(c, e, b, k)
      for a, e, c, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)
      for c, e, a, f do
        for b, k do
          Y_ceaf += T(1)_cebk · T(2)_afbk
      for c, e, a, f do
        E += X_aecf · Y_ceaf

  20. Time-space trade-offs

      Original:

      for a, e, c, f do
        for i, j do
          X_aecf += T_ijae · T_ijcf
      for c, e, b, k do
        T(1)_cebk ← f1(c, e, b, k)
      for a, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)
      for c, e, a, f do
        for b, k do
          Y_ceaf += T(1)_cebk · T(2)_afbk
      for c, e, a, f do
        E += X_aecf · Y_ceaf

      Fused:

      for a, e, c, f do                         ⇐ fused
        for i, j do
          x += T_ijae · T_ijcf
        for b, k do
          T(1)_cebk ← f1(c, e, b, k)
          T(2)_afbk ← f2(a, f, b, k)
          y += T(1)_cebk · T(2)_afbk
        E += x · y
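The fully fused loop nest can be compared against the original on a toy problem. Everything concrete in this sketch is an assumption made for illustration: the extents `NL` and `NS`, the stand-in functions `T`, `f1`, and `f2` (cheap formulas in place of real integrals), and the call counter. The scalars `x` and `y` are cleared for each (a, e, c, f) iteration, which the slide's pseudo-code leaves implicit.

```python
from itertools import product

NL, NS = 3, 2                      # toy extents for a..f and i..k
calls = {"f1": 0, "f2": 0}         # counts "integral" evaluations

def T(i, j, a, e):
    return 1.0 / (1 + i + 2 * j + 3 * a + 5 * e)

def f1(c, e, b, k):
    calls["f1"] += 1
    return 1.0 / (1 + c + 2 * e + 3 * b + 4 * k)

def f2(a, f, b, k):
    calls["f2"] += 1
    return 1.0 / (2 + a + 3 * f + 5 * b + 7 * k)

def energy_unfused():
    """Store X, T1, T2 and Y as full arrays, as in the original loop nest."""
    L, S = range(NL), range(NS)
    X = {(a, e, c, f): sum(T(i, j, a, e) * T(i, j, c, f) for i, j in product(S, S))
         for a, e, c, f in product(L, repeat=4)}
    T1 = {(c, e, b, k): f1(c, e, b, k) for c, e, b in product(L, repeat=3) for k in S}
    T2 = {(a, f, b, k): f2(a, f, b, k) for a, f, b in product(L, repeat=3) for k in S}
    Y = {(c, e, a, f): sum(T1[c, e, b, k] * T2[a, f, b, k] for b in L for k in S)
         for c, e, a, f in product(L, repeat=4)}
    return sum(X[a, e, c, f] * Y[c, e, a, f] for a, e, c, f in product(L, repeat=4))

def energy_fused():
    """Fuse everything under the (a, e, c, f) loops; X and Y become scalars."""
    L, S = range(NL), range(NS)
    E = 0.0
    for a, e, c, f in product(L, repeat=4):
        x = sum(T(i, j, a, e) * T(i, j, c, f) for i, j in product(S, S))
        y = 0.0
        for b, k in product(L, S):
            y += f1(c, e, b, k) * f2(a, f, b, k)   # integrals recomputed here
        E += x * y
    return E

calls.update(f1=0, f2=0)
E_unfused = energy_unfused()
unfused_calls = dict(calls)
calls.update(f1=0, f2=0)
E_fused = energy_fused()
fused_calls = dict(calls)
print(unfused_calls, fused_calls)
```

The two energies agree up to rounding, but the fused version evaluates each integral function `NL**2` times more often: storage is traded for recomputation.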

  21. Tiled & partially fused

      Original:

      for a, e, c, f do
        for i, j do
          X_aecf += T_ijae · T_ijcf
      for c, e, b, k do
        T(1)_cebk ← f1(c, e, b, k)
      for a, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)
      for c, e, a, f do
        for b, k do
          Y_ceaf += T(1)_cebk · T(2)_afbk
      for c, e, a, f do
        E += X_aecf · Y_ceaf

      Tiled and partially fused (hatted arrays are tile-sized; a, e, c, f range within the current tile):

      for a_B, e_B, c_B, f_B do
        for a, e, c, f do
          for i, j do
            X̂_aecf += T_ijae · T_ijcf
        for b, k do
          for c, e do
            T̂(1)_ce ← f1(c, e, b, k)
          for a, f do
            T̂(2)_af ← f2(a, f, b, k)
          for c, e, a, f do
            Ŷ_ceaf += T̂(1)_ce · T̂(2)_af
        for c, e, a, f do
          E += X̂_aecf · Ŷ_ceaf
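The tile size acts as a dial between the two extremes of storing everything and recomputing everything. The sketch below illustrates this on a toy problem. The stand-in functions `T`, `f1`, and `f2`, the extents, and the name `energy_tiled` are assumptions for the example, and the tile size `B` is assumed to divide the extent `NL` evenly.

```python
from itertools import product

def T(i, j, a, e):
    return 1.0 / (1 + i + 2 * j + 3 * a + 5 * e)

def f1(c, e, b, k):
    return 1.0 / (1 + c + 2 * e + 3 * b + 4 * k)

def f2(a, f, b, k):
    return 1.0 / (2 + a + 3 * f + 5 * b + 7 * k)

def energy_tiled(NL, NS, B):
    """Tile the a, e, c, f loops with tile size B; fuse everything per tile."""
    assert NL % B == 0
    S = range(NS)
    E, f1_calls = 0.0, 0
    for aB, eB, cB, fB in product(range(0, NL, B), repeat=4):
        ar, er, cr, fr = (range(t, t + B) for t in (aB, eB, cB, fB))
        # Tile-sized X: B**4 elements instead of NL**4.
        Xh = {(a, e, c, f): sum(T(i, j, a, e) * T(i, j, c, f) for i in S for j in S)
              for a, e, c, f in product(ar, er, cr, fr)}
        Yh = dict.fromkeys(Xh, 0.0)
        for b, k in product(range(NL), S):
            # Tile-sized slices of T1 and T2: B**2 elements each.
            T1h = {(c, e): f1(c, e, b, k) for c in cr for e in er}
            T2h = {(a, f): f2(a, f, b, k) for a in ar for f in fr}
            f1_calls += len(T1h)
            for a, e, c, f in Xh:
                Yh[a, e, c, f] += T1h[c, e] * T2h[a, f]
        E += sum(Xh[key] * Yh[key] for key in Xh)
    return E, f1_calls

results = {B: energy_tiled(4, 2, B) for B in (1, 2, 4)}
for B, (E, n_calls) in results.items():
    print(f"tile size {B}: E = {E:.6f}, f1 evaluations = {n_calls}")
```

With `B = 1` this degenerates to the fully fused version (minimal storage, maximal recomputation), and with `B = NL` to the unfused version. Intermediate tile sizes cut recomputation by a factor of `B**2` while keeping the temporaries at `B**4` elements.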

  22. Transform algebraically to minimize flops; minimize temporary storage; distribute and partition data for a parallel system; search with respect to the space-time trade-off (feedback); for out-of-core problems, optimize data locality; generate the final program (C/Fortran + MPI/Global Arrays)

  23. Tensor loop nest ⇒ Expression tree

      for a, e, c, f do
        for i, j do
          X_aecf += T_ijae · T_ijcf
      for c, e, b, k do
        T(1)_cebk ← f1(c, e, b, k)
      for a, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)
      for c, e, a, f do
        for b, k do
          Y_ceaf += T(1)_cebk · T(2)_afbk
      for c, e, a, f do
        E += X_aecf · Y_ceaf

      Expression tree:

      E, + over c, e, a, f
        X, + over i, j
          T
          T
        Y, + over b, k
          T(1)
            f1
          T(2)
            f2

  24. Expression tree ⇒ fusion graph. [Figure: the expression tree (E, + over c, e, a, f; X, + over i, j; Y, + over b, k; T(1), T(2), T, T, f1, f2) redrawn as a fusion graph, with each node annotated with index labels drawn from a, b, c, e, f, i, j, k.]

  25. Fusion graph. [Figure: fusion graph over the nodes E, X, Y, T(1), T(2), T, T, f1, f2, with index labels drawn from a, b, c, e, f, i, j, k.]

  26. Fusion graph. Fuse ⇒ X scalar. [Figure: fusion graph over the nodes E, X, Y, T(1), T(2), T, T, f1, f2.]

  27. Fusion graph. Fuse ⇒ Y scalar. [Figure: fusion graph over the nodes E, X, Y, T(1), T(2), T, T, f1, f2.]

  30. Fusion graph. [Figure: fusion graph over the nodes E, X, Y, T(1), T(2), T, T, f1, f2, now with additional a, f and c, e index labels at the bottom level.]

  32. Empirical compilers and tools

  33. Code generation tools for autotuning: GNU Superoptimizer (exhaustive search over schedules of straight-line code); Denali (theorem-proving-based scheduling); iFKO (Whaley @ UTSA), the iterative floating-point kernel optimizer; POET (Yi @ UTSA), Parameterized Optimizations for Empirical Tuning

  34. Iterative/empirical compilers. Compile-time: “iterative compilation” (Kisuki, Knijnenburg, O’Boyle, et al.); hybrid model/search-based compiler (Hall, et al., USC); Eigenmann @ Purdue (Polaris); Quinlan, et al. (LLNL / PERI); whole-program tuning (Qasem (TSU), Kennedy, Mellor-Crummey (Rice)); compilers that learn (Cavazos (UDel); Stephenson/Amarasinghe (MIT)). Run-time: ADAPT (Voss, et al.)

  35. Administrivia

  36. Upcoming schedule changes. Some adjustment of topics (TBD). Today: project proposals due. Th 3/13: SIAM Parallel Processing (attendance encouraged). Tu 4/1: no class. Th 4/3: attend talk by Doug Post from the DoD HPC Modernization Program
