A Fast Fourier Transform Compiler Paper by: Matteo Frigo MIT Laboratory for Computer Science. February 16, 1999 Presented by: Marco Poltera. November 16, 2011 Software Engineering Seminar
Introduction and motivation / Computation of Discrete Fourier transform (DFT) required by many real world applications real world result, uses DFT program, application , i.e. i.e. JPEG i.e. FFTW compressed compression image β’ Look at the internals of FFTW Goal β’ Argue that a specialized compiler is a valuable tool 2
Recap: DFT / linear transform: π§ = ππ¦ / DFT: with (primitive n-root of unity) π§ = πΈπΊπ π π¦ / FFT: We can compute π§ = ππ¦ = (π 1 (π 2 .. (π π π¦))) 3
Recap: DFT / DFT 4 = Example from: How to write fast numerical code. Markus PΓΌschel. Carnegie Mellon University. Course 18-645. Lecture 17. 4
FFTW / FFTW consists of three parts: Compiler Planner Executor ( genfft ) β’ run once β’ run once for β’ computes every the DFT β’ output: transform codelets β’ output: size transformed β’ hardware vector adaption β’ output: plan β’ reusable 5
FFTW graphic from: How to write fast numerical code. Markus PΓΌschel. Carnegie Mellon University. Course 18-645. Lecture 19. 6
FFTW code to adapt 5 % to hardware auto- 95 % codelets generated by genfft FFTW code graph from paper 7
The four phases of ge genf nfft creation phase simplifier scheduler unparser 8
creation phase simplifier scheduler unparser Creation phase n is a multiple split-radix of 4 algorithm choose an FFT algorithm n = n 1 n 2 and prime factor gcd (n 1 , n 2 ) = 1 algorithm n = n 1 n 2 and Cooley-Tukey n i β 1 FFT algorithm Raderβs n is prime algorithm application of DFT definition 9
creation phase simplifier scheduler unparser Creation phase according to FFT generate dag 10
creation phase simplifier scheduler unparser Creation phase / Example: Cooley-Tukey algorithm let rec cooley_tukey n1 n2 input sign = let tmp1 j2 = fftgen n1 (fun j1 -> input (j1*n2+j2)) sign in let tmp2 i1 j2 = exp n (sign*i1*j2) @* tmp1 j2 i1 in let tmp3 i1= fftgen n2 (tmp2 i1) sign in (fun i -> tmp3 (i mod n1) (i/n1)) 11
creation phase simplifier scheduler unparser Creation phase / DAG representation Type node = Num of Number.number | Load of Variable.variable | Store of Variable.variable * node | Plus of node list | Times of node * node | Uminus of node 2 v 2 v 4 v 3 = Plus [ v 2 ; Times (Num 3, v 0 ) ] v 1 v 4 = Plus [ Times (Num 2, v 2 ); v 1 ; v 0 ] v 0 v 3 3 12
creation phase simplifier scheduler unparser Simplifier / algebraic transformations / i.e. apply distributive property: ππ¦ + ππ§ β π(π¦ + π§) / common-subexpressions / DFT-specific improvements / make numeric constants positive / dag transposition 13
creation phase simplifier scheduler unparser Simplifier: DAG transposition / three passes: E T D F simplify simplify simplify F T E G 14
creation phase simplifier scheduler unparser Simplifier: DAG transposition 2 a π = 4(2π + 3π) c 4 b 3 a 2 a = 2 β 4π c 4 π = 3 β 4π 3 b 15
creation phase simplifier scheduler unparser Scheduler Goal β’ maximize register usage / schedule is cache-oblivious 16
creation phase simplifier scheduler unparser Scheduler 17
creation phase simplifier scheduler unparser Scheduler / #register spills = Ξ (n log(n) / log(R)) 18
creation phase simplifier scheduler unparser Unparser / Schedule is unparsed to C 19
Conclusion / performance / rapid turnaround / effectiveness / derived new algorithms / not reduced to a specific language such as C 20
Further information / Download FFTW: www.fftw.org / Details on FFTW: βFFTW: An Adaptive Software Architecture For The FFT β by M. Frigo/S. Johnson (1998) 21
22
Usage of FFTW #include <fftw3.h> ... { fftw_complex *in, *out; fftw_plan p; ... in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N); out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N); p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE); ... fftw_execute(p); /* repeat as needed */ ... fftw_destroy_plan(p); fftw_free(in); fftw_free(out); } from the tutorial included in the FFTW distribution 3.3 23
DFT / FFT refers to / any O( N log N ) algorithm or / the specific Cooley-Tukey algorithm / computing a DFT of N points takes / in the naive way, using the definition, O( N 2 ) arithmetical operations / O( N log N ) operations for a FFT 24
FFTW and Parallelism / Parallel versions are available for / Cilk / Posix threads / MPI 25
creation phase simplifier scheduler unparser Simplifier / Implementation: / simplifier written as if it was an expression tree / mapping from trees to DAGs accomplished by memoization which is performed implicitly by a monad 26
Pragmatic aspects of ge genf nfft / running time / memory requirements 27
Recommend
More recommend