Autotuning (2.5/2): TCE & Empirical compilers
Prof. Richard Vuduc
Georgia Institute of Technology
CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.19]
Tuesday, March 11, 2008

Today’s sources
- CS 267 at UCB (Demmel & Yelick)
- Papers from various autotuning projects: PHiPAC, ATLAS, FFTW, SPIRAL, TCE
- See: Proc. IEEE 2005 special issue on Program Generation, Optimization, and Platform Adaptation
- Me (for once!)

Review: Autotuners

Performance-engineering challenges
[Figure: design-space diagram. Outer control structure: iterative or recursive. Inner control structure: recursive or iterative. Statement level: mini-kernel, micro-kernel. Labels: ATLAS CGw/S, ATLAS Unleashed, BRILA, Compiler; Scalarized / None / Coloring / Belady / Compiler.]

Motivation for performance tuning
[Figure: performance plot; vertical axis in pseudo Mflop/s.]
Source: J. Johnson (2007), CScADS autotuning workshop

Context for autotuning
- Problem: HPC needs detailed low-level machine knowledge
- Autotuning methodology
    - Identify and generate a space of implementations
    - Search (modeling, experiments) to choose the best one
- Early idea seedlings
    - Polyalgorithms
    - Profile and feedback-directed compilation
    - Domain- and architecture-specific code generators

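The two-step methodology above (generate a space of implementations, then search it by experiment) can be sketched in a few lines. This is only an illustration under assumptions: the kernel is a pure-Python blocked matrix multiply, the implementation space is a hypothetical set of tile sizes, and the names `blocked_matmul` and `autotune` are invented here, not taken from any of the projects named in these slides.

```python
import time

def blocked_matmul(A, B, n, bs):
    """Multiply two n-by-n matrices (lists of lists) using square tiles of size bs."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    Ci, Ai = C[i], A[i]
                    for k in range(kk, min(kk + bs, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + bs, n)):
                            Ci[j] += a * Bk[j]
    return C

def autotune(n=48, candidates=(4, 8, 16, 48), trials=2):
    """Time every candidate tile size; return (fastest size, all timings)."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for bs in candidates:          # the "space of implementations"
        best = float("inf")
        for _ in range(trials):    # the "experiments"
            t0 = time.perf_counter()
            blocked_matmul(A, B, n, bs)
            best = min(best, time.perf_counter() - t0)
        timings[bs] = best
    return min(timings, key=timings.get), timings
```

Calling `autotune()` returns the tile size that ran fastest on the current machine together with the measured times. Which size wins depends on the platform, which is the point of searching empirically rather than fixing the choice in advance.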
Example: What a search space looks like
[Figure: Mflop/s plotted over m0 and n0, with k0 = 1.]
- Platform: Sun Ultra IIi
- 16 double regs
- 667 Mflop/s peak
- Unrolled, pipelined inner-kernel
- Sun cc v5.0 compiler
Source: PHiPAC Project at UC Berkeley (1997)

Cooley-Tukey FFT algorithm: Encoding in FFTW’s codelet generator

$$y[k] \leftarrow \mathrm{DFT}_N(x, k) \equiv \sum_{j=0}^{N-1} x[j] \cdot \omega_N^{-kj}, \qquad x, y \in \mathbb{C}^N$$

$$y[k_1 + k_2 \cdot N_1] \leftarrow \sum_{n_2=0}^{N_2-1} \left[ \left( \sum_{n_1=0}^{N_1-1} x[n_1 \cdot N_2 + n_2] \cdot \omega_{N_1}^{-k_1 n_1} \right) \cdot \omega_N^{-k_1 n_2} \right] \cdot \omega_{N_2}^{-k_2 n_2}$$

The inner sum is an N1-point DFT, the factor \omega_N^{-k_1 n_2} is the twiddle, and the outer sum is an N2-point DFT.

(Functional pseudo-code)
    let dftgen(N, x) ≡ fun k → ...   # DFT_N(x, k)
    let cooley_tukey(N1, N2, x) ≡
        let x̂ ≡ fun n2, n1 → x(n2 + n1 · N2) in
        let G1 ≡ fun n2 → dftgen(N1, x̂(n2, ·)) in
        let W ≡ fun k1, n2 → G1(n2, k1) · ω_N^(−k1 n2) in
        let G2 ≡ fun k1 → dftgen(N2, W(k1, ·)) in
        fun k → G2(k mod N1, k div N1)

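The functional pseudo-code above translates almost line for line into Python closures. The sketch below is an illustration, not FFTW's generator: it evaluates the DFT numerically instead of emitting code, and the helper `omega` is a name introduced here for the primitive root of unity.

```python
import cmath

def omega(N):
    """Primitive N-th root of unity, exp(2*pi*i/N)."""
    return cmath.exp(2j * cmath.pi / N)

def dftgen(N, x):
    """Return the function k -> DFT_N(x, k); x maps 0..N-1 to complex values."""
    w = omega(N)
    return lambda k: sum(x(j) * w ** (-k * j) for j in range(N))

def cooley_tukey(N1, N2, x):
    """Return k -> DFT_{N1*N2}(x, k), built from N1-point and N2-point DFTs."""
    w = omega(N1 * N2)
    xhat = lambda n2, n1: x(n2 + n1 * N2)                  # 2-D view of the input
    G1 = lambda n2: dftgen(N1, lambda n1: xhat(n2, n1))    # N1-point DFTs
    W = lambda k1, n2: G1(n2)(k1) * w ** (-k1 * n2)        # twiddle factors
    G2 = lambda k1: dftgen(N2, lambda n2: W(k1, n2))       # N2-point DFTs
    return lambda k: G2(k % N1)(k // N1)
```

For a length-6 input stored in a list `data`, `cooley_tukey(2, 3, lambda j: data[j])(k)` should agree with `dftgen(6, lambda j: data[j])(k)` up to rounding for every `k` in `range(6)`.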
Tensor Contraction Engine (TCE) for quantum chemistry

Tensor Contraction Engine (TCE)
- Application domain: Quantum chemistry
    - Electronic structure calculations
    - Dominant computation expressible as a “tensor contraction”
- TCE generates a complete parallel program from a high-level spec
    - Automates time-space trade-offs
- Output S. Hirata (2002), and many others
- Following presentation taken from Proc. IEEE 2005 special issue

Motivation: Simplify program development
Source: Baumgartner, et al. (2005)

Rewriting to reduce operation counts

Naïvely, ≈ 4 × N^10 flops:

$$S_{abij} = \sum_{c,d,e,f,k,l} A_{acik} \times B_{befl} \times C_{dfjk} \times D_{cdel}$$

⇓

$$S_{abij} = \sum_{c,k} \left( \sum_{d,f} \left( \sum_{e,l} B_{befl} \times D_{cdel} \right) \times C_{dfjk} \right) \times A_{acik}$$

Assuming associativity and distributivity, ≈ 6 × N^6 flops, but this also requires temporary storage.
Source: Baumgartner, et al. (2005)

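The two flop counts can be checked on a toy problem. The sketch below is a pure-Python illustration with tensors stored as dictionaries; the names `rand4`, `contract_naive`, and `contract_factored` are made up for this example. It evaluates both forms of the contraction and counts one flop per multiply or add.

```python
from itertools import product
import random

def rand4(N, rng):
    """Random rank-4 tensor stored as a dict keyed by index 4-tuples."""
    return {idx: rng.random() for idx in product(range(N), repeat=4)}

def contract_naive(A, B, C, D, N):
    """Single 10-deep loop nest; returns (S, flops)."""
    S = dict.fromkeys(product(range(N), repeat=4), 0.0)
    flops = 0
    for a, b, i, j, c, d, e, f, k, l in product(range(N), repeat=10):
        S[a, b, i, j] += A[a, c, i, k] * B[b, e, f, l] * C[d, f, j, k] * D[c, d, e, l]
        flops += 4  # three multiplies plus one add
    return S, flops

def contract_factored(A, B, C, D, N):
    """Three 6-deep loop nests with temporaries T1 and T2; returns (S, flops)."""
    R = range(N)
    T1 = dict.fromkeys(product(R, repeat=4), 0.0)  # T1[b, c, d, f]
    T2 = dict.fromkeys(product(R, repeat=4), 0.0)  # T2[b, c, j, k]
    S = dict.fromkeys(product(R, repeat=4), 0.0)   # S[a, b, i, j]
    flops = 0
    for b, c, d, f, e, l in product(R, repeat=6):
        T1[b, c, d, f] += B[b, e, f, l] * D[c, d, e, l]
        flops += 2
    for b, c, j, k, d, f in product(R, repeat=6):
        T2[b, c, j, k] += T1[b, c, d, f] * C[d, f, j, k]
        flops += 2
    for a, b, i, j, c, k in product(R, repeat=6):
        S[a, b, i, j] += T2[b, c, j, k] * A[a, c, i, k]
        flops += 2
    return S, flops
```

With N = 3 the naive version performs 4 · 3^10 flops and the factored version 6 · 3^6, and both produce the same S up to rounding. The price is the two N^4-element temporaries T1 and T2.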
Operation and storage minimization via loop fusion

$$T^{(1)}_{bcdf} = \sum_{e,l} B_{befl} \times D_{cdel}$$
$$T^{(2)}_{bcjk} = \sum_{d,f} T^{(1)}_{bcdf} \times C_{dfjk}$$
$$S_{abij} = \sum_{c,k} T^{(2)}_{bcjk} \times A_{acik}$$

    T1 = T2 = S = 0
    for b, c, d, e, f, l do
        T1[b, c, d, f] += B[b, e, f, l] · D[c, d, e, l]
    for b, c, d, f, j, k do
        T2[b, c, j, k] += T1[b, c, d, f] · C[d, f, j, k]
    for a, b, c, i, j, k do
        S[a, b, i, j] += T2[b, c, j, k] · A[a, c, i, k]

Operation and storage minimization via loop fusion

$$T^{(1)}_{bcdf} = \sum_{e,l} B_{befl} \times D_{cdel}$$
$$T^{(2)}_{bcjk} = \sum_{d,f} T^{(1)}_{bcdf} \times C_{dfjk}$$
$$S_{abij} = \sum_{c,k} T^{(2)}_{bcjk} \times A_{acik}$$

Unfused:
    T1 = T2 = S = 0
    for b, c, d, e, f, l do
        T1[b, c, d, f] += B[b, e, f, l] · D[c, d, e, l]
    for b, c, d, f, j, k do
        T2[b, c, j, k] += T1[b, c, d, f] · C[d, f, j, k]
    for a, b, c, i, j, k do
        S[a, b, i, j] += T2[b, c, j, k] · A[a, c, i, k]

Fused (T1 shrinks to a scalar T1f, T2 to a 2-D array T2f):
    S = 0
    for b, c do
        T2f ← 0
        for d, f do
            T1f ← 0
            for e, l do
                T1f += B[b, e, f, l] · D[c, d, e, l]
            for j, k do
                T2f[j, k] += T1f · C[d, f, j, k]
        for a, i, j, k do
            S[a, b, i, j] += T2f[j, k] · A[a, c, i, k]

T1f must be zeroed for every (d, f) pair, since it holds the sum over e, l for that pair only; T2f is zeroed once per (b, c) pair.

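The fused loop nest can be checked against the unfused one on a small example. The following is a sketch with invented function names (`rand4`, `unfused`, `fused`), using dictionaries for the full tensors. Each function also reports how many temporary elements it allocates. In this sketch T1f is zeroed at the start of every (d, f) iteration, as the sum over e, l requires.

```python
from itertools import product
import random

def rand4(N, rng):
    """Random rank-4 tensor stored as a dict keyed by index 4-tuples."""
    return {idx: rng.random() for idx in product(range(N), repeat=4)}

def unfused(A, B, C, D, N):
    """Three separate loop nests; T1 and T2 are full N^4 arrays."""
    R = range(N)
    T1 = dict.fromkeys(product(R, repeat=4), 0.0)
    T2 = dict.fromkeys(product(R, repeat=4), 0.0)
    S = dict.fromkeys(product(R, repeat=4), 0.0)
    for b, c, d, e, f, l in product(R, repeat=6):
        T1[b, c, d, f] += B[b, e, f, l] * D[c, d, e, l]
    for b, c, d, f, j, k in product(R, repeat=6):
        T2[b, c, j, k] += T1[b, c, d, f] * C[d, f, j, k]
    for a, b, c, i, j, k in product(R, repeat=6):
        S[a, b, i, j] += T2[b, c, j, k] * A[a, c, i, k]
    return S, len(T1) + len(T2)

def fused(A, B, C, D, N):
    """Fused loops; T1 becomes the scalar T1f, T2 the N-by-N array T2f."""
    R = range(N)
    S = dict.fromkeys(product(R, repeat=4), 0.0)
    for b, c in product(R, repeat=2):
        T2f = [[0.0] * N for _ in R]          # reset once per (b, c)
        for d, f in product(R, repeat=2):
            T1f = 0.0                         # reset once per (d, f)
            for e, l in product(R, repeat=2):
                T1f += B[b, e, f, l] * D[c, d, e, l]
            for j, k in product(R, repeat=2):
                T2f[j][k] += T1f * C[d, f, j, k]
        for a, i, j, k in product(R, repeat=4):
            S[a, b, i, j] += T2f[j][k] * A[a, c, i, k]
    return S, 1 + N * N
```

Both versions perform the same arithmetic, but temporary storage drops from 2·N^4 elements to N^2 + 1.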
Time-space trade-offs
Max index of a through f: O(1000); of i through k: O(100)

    for a, e, c, f do
        for i, j do
            X_aecf += T_ijae · T_ijcf            # “Contraction” of T over i, j
    for c, e, b, k do
        T(1)_cebk ← f1(c, e, b, k)               # Integrals, O(1000) flops
    for a, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)
    for c, e, a, f do
        for b, k do
            Y_ceaf += T(1)_cebk · T(2)_afbk      # “Contraction” over T(1) and T(2)
    for c, e, a, f do
        E += X_aecf · Y_ceaf

Time-space trade-offs
Max index of a through f: O(1000); of i through k: O(100)

    for a, e, c, f do
        for i, j do
            X_aecf += T_ijae · T_ijcf
    for c, e, b, k do
        T(1)_cebk ← f1(c, e, b, k)
    for a, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)
    for c, e, a, f do
        for b, k do
            Y_ceaf += T(1)_cebk · T(2)_afbk
    for c, e, a, f do
        E += X_aecf · Y_ceaf

Same indices (the loops over a, e, c, f) ⇒ loop fusion candidates

Time-space trade-offs

Before:
    for a, e, c, f do
        for i, j do
            X_aecf += T_ijae · T_ijcf
    for c, e, b, k do
        T(1)_cebk ← f1(c, e, b, k)
    for a, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)
    for c, e, a, f do
        for b, k do
            Y_ceaf += T(1)_cebk · T(2)_afbk
    for c, e, a, f do
        E += X_aecf · Y_ceaf

After adding redundant loops around the integrals (adds extra flops):
    for a, e, c, f do
        for i, j do
            X_aecf += T_ijae · T_ijcf
    for a, c, e, f, b, k do
        T(1)_cebk ← f1(c, e, b, k)
    for a, e, c, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)
    for c, e, a, f do
        for b, k do
            Y_ceaf += T(1)_cebk · T(2)_afbk
    for c, e, a, f do
        E += X_aecf · Y_ceaf

Time-space trade-offs

Before:
    for a, e, c, f do
        for i, j do
            X_aecf += T_ijae · T_ijcf
    for c, e, b, k do
        T(1)_cebk ← f1(c, e, b, k)
    for a, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)
    for c, e, a, f do
        for b, k do
            Y_ceaf += T(1)_cebk · T(2)_afbk
    for c, e, a, f do
        E += X_aecf · Y_ceaf

⇐ Fused:
    for a, e, c, f do
        for i, j do
            x += T_ijae · T_ijcf
        for b, k do
            T(1)_cebk ← f1(c, e, b, k)
            T(2)_afbk ← f2(a, f, b, k)
            y += T(1)_cebk · T(2)_afbk
        E += x · y

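The cost of full fusion shows up when counting integral evaluations. The sketch below is illustrative only: `f1` and `f2` are hypothetical cheap stand-ins for the expensive integrals, `T` holds arbitrary values, and the index ranges `V` (for a through f) and `O` (for i through k) are toy sizes rather than the O(1000) and O(100) quoted on the slides. The accumulators x and y are reset explicitly for each (a, e, c, f).

```python
from itertools import product
import math

V, O = 3, 2  # toy ranges: a..f take V values, i..k take O values

def f1(c, e, b, k):
    """Hypothetical stand-in for an expensive integral."""
    return math.sin(1 + c + 2 * e + 3 * b + 5 * k)

def f2(a, f, b, k):
    """Hypothetical stand-in for an expensive integral."""
    return math.cos(1 + a + 2 * f + 3 * b + 5 * k)

# Input tensor T[i, j, a, e] with arbitrary deterministic values.
T = {(i, j, a, e): 1.0 / (1 + i + 2 * j + 3 * a + 4 * e)
     for i, j, a, e in product(range(O), range(O), range(V), range(V))}

def unfused():
    """Store X, T1, T2, Y in full. Returns (E, integral evaluations, temp elements)."""
    Vs, Os = range(V), range(O)
    X = {(a, e, c, f): sum(T[i, j, a, e] * T[i, j, c, f] for i, j in product(Os, Os))
         for a, e, c, f in product(Vs, repeat=4)}
    T1 = {(c, e, b, k): f1(c, e, b, k) for c, e, b, k in product(Vs, Vs, Vs, Os)}
    T2 = {(a, f, b, k): f2(a, f, b, k) for a, f, b, k in product(Vs, Vs, Vs, Os)}
    Y = {(c, e, a, f): sum(T1[c, e, b, k] * T2[a, f, b, k] for b, k in product(Vs, Os))
         for c, e, a, f in product(Vs, repeat=4)}
    E = sum(X[a, e, c, f] * Y[c, e, a, f] for a, e, c, f in product(Vs, repeat=4))
    return E, len(T1) + len(T2), len(X) + len(T1) + len(T2) + len(Y)

def fused():
    """Fully fused: only scalars x and y are kept, but integrals are recomputed."""
    Vs, Os = range(V), range(O)
    E, evals = 0.0, 0
    for a, e, c, f in product(Vs, repeat=4):
        x = 0.0
        for i, j in product(Os, Os):
            x += T[i, j, a, e] * T[i, j, c, f]
        y = 0.0
        for b, k in product(Vs, Os):
            y += f1(c, e, b, k) * f2(a, f, b, k)
            evals += 2
        E += x * y
    return E, evals, 2
```

Under these assumptions the unfused version evaluates 2·V^3·O integrals and stores 2·V^4 + 2·V^3·O temporaries, while the fused version stores two scalars but evaluates 2·V^5·O integrals, a factor of V^2 more.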
Tiled & partially fused

Before:
    for a, e, c, f do
        for i, j do
            X_aecf += T_ijae · T_ijcf
    for c, e, b, k do
        T(1)_cebk ← f1(c, e, b, k)
    for a, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)
    for c, e, a, f do
        for b, k do
            Y_ceaf += T(1)_cebk · T(2)_afbk
    for c, e, a, f do
        E += X_aecf · Y_ceaf

Tiled and partially fused (a_B, e_B, c_B, f_B range over tiles; hatted arrays are tile-sized):
    for a_B, e_B, c_B, f_B do
        for a, e, c, f do
            for i, j do
                X̂_aecf += T_ijae · T_ijcf
        for b, k do
            for c, e do
                T̂(1)_ce ← f1(c, e, b, k)
            for a, f do
                T̂(2)_af ← f2(a, f, b, k)
            for c, e, a, f do
                Ŷ_ceaf += T̂(1)_ce · T̂(2)_af
        for c, e, a, f do
            E += X̂_aecf · Ŷ_ceaf

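Tiling interpolates between the unfused and fully fused extremes. The sketch below makes the tile size `B` a parameter. It is an illustration under assumptions: `f1`, `f2`, and `T` are hypothetical stand-ins, `V` and `O` are toy index ranges, and `tiled` is a name invented here.

```python
from itertools import product
import math

V, O = 4, 2  # toy ranges: a..f take V values, i..k take O values

def f1(c, e, b, k):
    """Hypothetical stand-in for an expensive integral."""
    return math.sin(1 + c + 2 * e + 3 * b + 5 * k)

def f2(a, f, b, k):
    """Hypothetical stand-in for an expensive integral."""
    return math.cos(1 + a + 2 * f + 3 * b + 5 * k)

# Input tensor T[i, j, a, e] with arbitrary deterministic values.
T = {(i, j, a, e): 1.0 / (1 + i + 2 * j + 3 * a + 4 * e)
     for i, j, a, e in product(range(O), range(O), range(V), range(V))}

def tiled(B):
    """Tile a, e, c, f by B. Returns (E, integral evaluations, temp elements)."""
    assert V % B == 0, "tile size must divide the index range in this sketch"
    E, evals = 0.0, 0
    for a0, e0, c0, f0 in product(range(0, V, B), repeat=4):
        As, Es, Cs, Fs = (range(s, s + B) for s in (a0, e0, c0, f0))
        # Tile of X: B^4 elements.
        Xh = {(a, e, c, f): sum(T[i, j, a, e] * T[i, j, c, f]
                                for i, j in product(range(O), repeat=2))
              for a, e, c, f in product(As, Es, Cs, Fs)}
        # Tile of Y: B^4 elements, accumulated over b, k.
        Yh = dict.fromkeys(product(Cs, Es, As, Fs), 0.0)
        for b, k in product(range(V), range(O)):
            T1h = {(c, e): f1(c, e, b, k) for c, e in product(Cs, Es)}  # B^2 elements
            T2h = {(a, f): f2(a, f, b, k) for a, f in product(As, Fs)}  # B^2 elements
            evals += len(T1h) + len(T2h)
            for c, e, a, f in product(Cs, Es, As, Fs):
                Yh[c, e, a, f] += T1h[c, e] * T2h[a, f]
        for c, e, a, f in product(Cs, Es, As, Fs):
            E += Xh[a, e, c, f] * Yh[c, e, a, f]
    return E, evals, 2 * B ** 4 + 2 * B ** 2
```

In this sketch the number of integral evaluations is 2·V^5·O / B^2 and the temporary storage is 2·B^4 + 2·B^2 elements, so `B = 1` reproduces the fully fused cost and `B = V` matches the unfused evaluation count. Choosing `B` is the kind of space-time search the TCE automates.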
- Transform algebraically, to minimize flops
- Minimize temporary storage
- Distribute and partition data for a parallel system
- Search with respect to the space-time trade-off (feedback)
- For out-of-core problems, optimize data locality
- Generate final program (C/Fortran + MPI/Global Arrays)

Tensor loop nest ⇒ Expression tree

    for a, e, c, f do
        for i, j do
            X_aecf += T_ijae · T_ijcf
    for c, e, b, k do
        T(1)_cebk ← f1(c, e, b, k)
    for a, f, b, k do
        T(2)_afbk ← f2(a, f, b, k)
    for c, e, a, f do
        for b, k do
            Y_ceaf += T(1)_cebk · T(2)_afbk
    for c, e, a, f do
        E += X_aecf · Y_ceaf

Expression tree (each node is labeled with the array it produces and the indices it sums over):
    E, +ceaf
        X, +ij
            T
            T
        Y, +bk
            T(1)
                f1
            T(2)
                f2

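Expression tree ⇒ fusion graph
[Figure: the expression tree (E, +ceaf; X, +ij; Y, +bk; T(1); T(2); T; T; f1; f2) shown beside its fusion graph, in which the nodes E, Y, X, T(1), T(2), T, T, f1, f2 carry vertices labeled with the loop indices a, e, c, f, i, j, b, k.]
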
Fusion graph
[Figure sequence: the fusion graph with nodes E, Y, X, T(1), T(2), T, T, f1, f2 and index vertices a, e, c, f, i, j, b, k, shown over several steps. Two of the steps are annotated “Fuse ⇒ X scalar” and “Fuse ⇒ Y scalar”.]

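One way to make the expression tree and its fusion candidates concrete is a small data structure. The sketch below is a simplification introduced here, not the TCE's actual representation: each node records the indices of the array it produces and the indices it sums over, a parent and child are treated as fusable over the loop indices they share, and fusing an index removes that dimension from the child's array. The names `Node`, `fusable`, and `dims_after_fusion` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    array_indices: tuple          # indices of the array this node produces
    sum_indices: tuple = ()       # indices summed away at this node
    children: list = field(default_factory=list)

    def loop_indices(self):
        """All indices of the loop nest that computes this node."""
        return set(self.array_indices) | set(self.sum_indices)

def fusable(parent, child):
    """Loop indices shared by the producer (child) and consumer (parent)."""
    return parent.loop_indices() & child.loop_indices()

def dims_after_fusion(child, fused):
    """Array dimensions the child still needs once the given loops are fused."""
    return tuple(i for i in child.array_indices if i not in fused)

# Expression tree for the loop nest on the slides.
T1 = Node("T(1)", ("c", "e", "b", "k"), children=[Node("f1", ("c", "e", "b", "k"))])
T2 = Node("T(2)", ("a", "f", "b", "k"), children=[Node("f2", ("a", "f", "b", "k"))])
X = Node("X", ("a", "e", "c", "f"), ("i", "j"),
         [Node("T", ("i", "j", "a", "e")), Node("T", ("i", "j", "c", "f"))])
Y = Node("Y", ("c", "e", "a", "f"), ("b", "k"), [T1, T2])
E = Node("E", (), ("c", "e", "a", "f"), [X, Y])
```

Here `fusable(E, X)` is the set {a, e, c, f}, and `dims_after_fusion(X, fusable(E, X))` is the empty tuple, matching the “Fuse ⇒ X scalar” annotation. Fusing T(1) with Y over only c and e would leave a two-dimensional array indexed by b and k.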
Empirical compilers and tools

Code generation tools for autotuning
- GNU Superoptimizer: exhaustive search over schedules of straight-line code
- Denali: theorem-proving-based scheduling
- iFKO (Whaley @ UTSA): Iterative floating-point kernel optimizer
- POET (Yi @ UTSA): Parameterized Optimizations for Empirical Tuning

Iterative/empirical compilers
- Compile-time
    - “Iterative compilation”: Kisuki, Knijnenburg, O’Boyle, et al.
    - Hybrid model/search-based compiler: Hall, et al. (USC)
    - Eigenmann @ Purdue (Polaris)
    - Quinlan, et al. (LLNL / PERI)
    - Qasem (TSU), Kennedy, Mellor-Crummey (Rice): whole program tuning
    - Compilers that learn: Cavazos (UDel); Stephenson/Amarasinghe (MIT)
- Run-time: Voss, et al.: ADAPT

Administrivia

Upcoming schedule changes
- Some adjustment of topics (TBD)
- Today: Project proposals due
- Th 3/13: SIAM Parallel Processing (attendance encouraged)
- Tu 4/1: No class
- Th 4/3: Attend talk by Doug Post from DoD HPC Modernization Program