Large Multicore FFTs: Approaches to Optimization
Sharon Sacco and James Geraci
24 September 2008
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
MIT Lincoln Laboratory HPEC 2008-1 SMHS 9/24/2008
Outline
• Introduction
– 1D Fourier Transform
– Mapping 1D FFTs onto Cell
– 1D as 2D Traditional Approach
• Technical Challenges
• Design
• Performance
• Summary
1D Fourier Transform
$$ g_j = \sum_{k=0}^{N-1} f_k \, e^{-2\pi i jk/N} $$
• This is a simple equation
• A few people spend a lot of their careers trying to make it run fast
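As a point of reference for the equation above, here is a minimal Python sketch that evaluates the sum directly. The function name `dft` is chosen for this illustration and is not from the slides. It costs O(N^2) operations and is useful only as a correctness baseline for faster algorithms.

```python
import cmath


def dft(f):
    """Direct evaluation of g_j = sum_k f_k * exp(-2*pi*i*j*k/N)."""
    n = len(f)
    return [
        sum(f[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]
```

For example, a unit impulse at index 0 transforms to a constant, and a constant input transforms to a single nonzero value at j = 0.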
Mapping 1D FFT onto Cell
[Figure: FFT data mapped onto Cell memory]
• Cell FFTs can be classified by memory requirements
• Small FFTs can fit into a single local store (LS) memory. 4096 is the largest size.
• Medium FFTs can fit into multiple LS memories. 65536 is the largest size.
• Large FFTs must use XDR memory as well as LS memory.
• Medium and large FFTs require careful memory transfers
1D as 2D Traditional Approach
1. Corner turn to compact columns
2. FFT on columns
3. Corner turn to original orientation
4. Multiply (elementwise) by central twiddles
5. FFT on rows
6. Corner turn to correct data order
[Figure: 16-point example laid out as a 4 x 4 array, elements 0 to 15, shown in its original and corner-turned orientations]
Central twiddles for the 4 x 4 example:
w^0 w^0 w^0 w^0
w^0 w^1 w^2 w^3
w^0 w^2 w^4 w^6
w^0 w^3 w^6 w^9
• 1D as 2D FFT reorganizes data a lot
– Timing jumps when used
• Can reduce memory for twiddle tables
• Only one FFT needed
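The six steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' Cell implementation: the names `fft_1d_as_2d`, `dft`, and `transpose` are assumptions, a direct DFT stands in for the optimized column and row FFTs, and the corner turns are ordinary list transposes rather than DMA transfers.

```python
import cmath


def dft(f):
    """Direct DFT, standing in for an optimized small FFT."""
    n = len(f)
    return [
        sum(f[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def transpose(m):
    """Corner turn: swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def fft_1d_as_2d(f, rows, cols):
    """Length rows*cols DFT computed as column FFTs, twiddles, row FFTs."""
    n = rows * cols
    assert len(f) == n
    grid = [list(f[r * cols:(r + 1) * cols]) for r in range(rows)]
    by_col = transpose(grid)                # 1. corner turn to compact columns
    by_col = [dft(col) for col in by_col]   # 2. FFT on columns
    grid = transpose(by_col)                # 3. corner turn to original orientation
    grid = [                                # 4. multiply by central twiddles w^(r*c)
        [grid[r][c] * cmath.exp(-2j * cmath.pi * r * c / n) for c in range(cols)]
        for r in range(rows)
    ]
    grid = [dft(row) for row in grid]       # 5. FFT on rows
    # 6. corner turn to correct data order: output index is r + rows * c
    return [v for col in transpose(grid) for v in col]
```

With `rows == cols`, the column and row transforms have the same length, which is consistent with the slide's remark that only one FFT is needed.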
Outline
• Introduction
• Technical Challenges
– Communications
– Memory
– Cell Rounding
• Design
• Performance
• Summary
Communications
[Figure labels: SPE connection to EIB is 50 GB/s; bandwidth to XDR memory is 25.3 GB/s; EIB bandwidth is 96 bytes / cycle]
• Minimizing XDR memory accesses is critical
• Leverage EIB
• Coordinating SPE communication is desirable
– Need to know SPE relative geometry
Memory
[Figure labels: each SPE has 256 KB local store memory; each Cell has 2 MB local store memory; XDR memory is much larger than 1M pt FFT requirements]
• Need to rethink algorithms to leverage the total memory
– Consider local store both from individual and collective SPE point of view
Cell Rounding
[Figure: values b00, b01, b10 on a 1 bit scale, shown for each rounding mode]
IEEE 754 Round to Nearest
• Average value – x01 + 0 bits
Cell (truncation)
• Average value – x01 + 0.5 bit
• The cost to correct basic binary operations (add, multiply, and subtract) is prohibitive
• Accuracy should be improved by minimizing the number of steps the algorithm takes to produce a result
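The bias difference between the two rounding modes can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not a model of the SPE floating-point hardware: it rounds exact fractions to integers (one integer step stands for one result bit) and averages the signed error over a uniform grid covering two result steps. The names `mean_error`, `truncate`, and `round_nearest_even` are hypothetical.

```python
from fractions import Fraction


def truncate(x):
    """Drop the fractional part (round toward zero for x >= 0)."""
    return Fraction(x.numerator // x.denominator)


def round_nearest_even(x):
    """Round to nearest, ties to even; Python's round() on a Fraction does this."""
    return Fraction(round(x))


def mean_error(round_fn, extra_bits=8):
    """Average signed rounding error over a uniform grid on [0, 2)."""
    steps = 1 << extra_bits
    count = 2 * steps
    total = sum(round_fn(Fraction(i, steps)) - Fraction(i, steps) for i in range(count))
    return total / count
```

In this model the round-to-nearest error averages to zero, while truncation leaves a bias of about half a step in magnitude, which accumulates over the many operations in a long FFT.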
Outline
• Introduction
• Technical Challenges
• Design
– Using Memory Well
  – Reducing Memory Accesses
  – Distributing on SPEs
  – Bit Reversal
  – Complex Format
– Computational Considerations
• Performance
• Summary
FFT Signal Flow Diagram and Terminology
[Figure: radix 2 signal flow diagram for a 16-point FFT, with a stage, a butterfly, and a block labeled; inputs are in order 0 to 15 and outputs are in order 0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]
• Size 16 can illustrate concepts for large FFTs
– Ideas scale well and it is “drawable”
• This is the “decimation in frequency” data flow
• Where the weights are applied determines the algorithm
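To make the data flow concrete, the following is a minimal sketch of an in-place radix 2 decimation-in-frequency FFT in Python. It is a textbook formulation, not the authors' SPE code, and the names `fft_dif_in_place` and `bit_reverse` are chosen for this illustration. Each pass of the outer loop is one stage; each `start` value selects one block; each inner iteration is one butterfly.

```python
import cmath


def fft_dif_in_place(x):
    """Radix 2 DIF FFT; len(x) must be a power of two.

    Results are left in bit-reversed order, as in the signal flow diagram.
    """
    n = len(x)
    span = n // 2
    while span >= 1:                         # one pass per stage
        block = 2 * span
        for start in range(0, n, block):     # one iteration per block
            for i in range(span):            # one iteration per butterfly
                w = cmath.exp(-2j * cmath.pi * i / block)
                a, b = x[start + i], x[start + i + span]
                x[start + i] = a + b
                x[start + i + span] = (a - b) * w  # weight applied after the subtract
        span //= 2
    return x


def bit_reverse(i, bits):
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r
```

For size 16, `[bit_reverse(p, 4) for p in range(16)]` reproduces the output ordering shown in the diagram.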
Reducing Memory Accesses
[Figure: data array with dimension labels 1024, 1024, 64, and 4]
• Columns will be loaded in strips that fit in the total Cell local store
• FFT algorithm processes 4 columns at a time to leverage SIMD registers
• Requires separate code from row FFTs
• Data reorganization requires SPE to SPE DMAs
• No bit reversal
1D FFT Distribution with Single Reorganization
[Figure: 16-point signal flow diagram with one “reorganize” step marked between stages]
• One approach is to load everything onto a single SPE to do the first part of the computation
• After a single reorganization each SPE owns an entire block and can complete the computations on its points
1D FFT Distribution with Multiple Reorganizations
[Figure: 16-point signal flow diagram with two “reorganize” steps marked between stages]
• A second approach is to divide groups of contiguous butterflies among SPEs and reorganize after each stage until the SPEs own a full block
Selecting the Preferred Reorganization
N - the number of elements in SPE memory, P - number of SPEs. Typical N is 32k complex elements.
Single Reorganization
• Number of exchanges: P * (P – 1)
• Number of elements exchanged: N * (P – 1) / P
Multiple Reorganizations
• Number of exchanges: P * log2(P)
• Number of elements exchanged: (N / 2) * log2(P)

Number of   Single Reorganization              Multiple Reorganizations
SPEs        Exchanges   Data Moved in 1 DMA    Exchanges   Data Moved in 1 DMA
2           2           N / 4                  2           N / 4
4           12          N / 16                 8           N / 8
8           56          N / 64                 24          N / 16

• Evaluation favors multiple reorganizations
– Fewer DMAs have less bus contention (Single Reorganization exceeds the number of busses)
– DMA overhead (~0.3 μs) is minimized
– Programming is simpler for multiple reorganizations
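The table entries follow from the formulas on this slide: dividing the number of elements exchanged by the number of exchanges gives the data moved per DMA. The short Python sketch below reproduces that arithmetic; the function names are hypothetical, data sizes are expressed as fractions of N, and P is assumed to be a power of two.

```python
from fractions import Fraction


def single_reorganization(p):
    """Return (exchanges, data per DMA as a fraction of N) for P SPEs."""
    exchanges = p * (p - 1)
    elements = Fraction(p - 1, p)          # N * (P - 1) / P, in units of N
    return exchanges, elements / exchanges


def multiple_reorganizations(p):
    """Return (exchanges, data per DMA as a fraction of N) for P SPEs."""
    stages = p.bit_length() - 1            # log2(P) for a power of two
    exchanges = p * stages
    elements = Fraction(stages, 2)         # (N / 2) * log2(P), in units of N
    return exchanges, elements / exchanges


for p in (2, 4, 8):
    print(p, single_reorganization(p), multiple_reorganizations(p))
```

For 8 SPEs this gives 56 exchanges of N / 64 elements for the single reorganization against 24 exchanges of N / 16 elements for multiple reorganizations, matching the last row of the table.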
Column Bit Reversal
[Figure: binary row numbers 000000001 and 100000000 shown as a reversal pair]
• Bit reversal of columns can be implemented by the order of processing rows and double buffering
• Reversal row pairs are both read into local store and then written to each other's memory location
• Exchanging rows for bit reversal has a low cost
• DMA addresses are table driven
• Bit reversal table can be very small
• Row FFTs are conventional 1D FFTs
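The row exchange described above can be sketched as a table of row pairs whose indices are bit reversals of each other. This Python sketch is illustrative only: the names `reversal_pairs` and `swap_reversal_rows` are assumptions, a list of rows stands in for XDR memory, and the tuple swap stands in for reading both rows into local store and writing each to the other's location.

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def reversal_pairs(bits):
    """Table of (row, partner) pairs; rows that reverse to themselves are skipped."""
    pairs = []
    for row in range(1 << bits):
        partner = bit_reverse(row, bits)
        if row < partner:                  # list each pair once
            pairs.append((row, partner))
    return pairs


def swap_reversal_rows(matrix, bits):
    """Exchange each pair of rows in place."""
    for a, b in reversal_pairs(bits):
        matrix[a], matrix[b] = matrix[b], matrix[a]
    return matrix
```

Because whole rows move as units, the table is indexed by row number rather than by element, which is one way a bit reversal table can stay small.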
Complex Format
• Two common formats for complex
– interleaved: real 0, imag 0, real 1, imag 1
– split: real 0, real 1 in one array; imag 0, imag 1 in another
• Complex format for user should be standard
• Internal format conversion is lightweight
• Internal format should benefit the algorithm
– Internal format is opaque to user
• SIMD units need split format for complex arithmetic
• Interleaved complex format reduces number of DMAs
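The two layouts and the conversion between them can be sketched in a few lines of Python. The function names are hypothetical and plain lists stand in for SIMD registers; the elementwise multiply shows why the split layout suits vector arithmetic, since every operation combines whole arrays of real parts and imaginary parts.

```python
def interleaved_to_split(data):
    """[re0, im0, re1, im1, ...] -> ([re0, re1, ...], [im0, im1, ...])."""
    return data[0::2], data[1::2]


def split_to_interleaved(re, im):
    """([re0, re1, ...], [im0, im1, ...]) -> [re0, im0, re1, im1, ...]."""
    out = [0.0] * (2 * len(re))
    out[0::2] = re
    out[1::2] = im
    return out


def split_multiply(ar, ai, br, bi):
    """Elementwise complex multiply on split-format operands."""
    real = [x * u - y * v for x, y, u, v in zip(ar, ai, br, bi)]
    imag = [x * v + y * u for x, y, u, v in zip(ar, ai, br, bi)]
    return real, imag
```

For example, `interleaved_to_split([1, 2, 3, 4])` gives `([1, 3], [2, 4])`, and converting back restores the original list.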
Outline
• Introduction
• Technical Challenges
• Design
– Using Memory Well
– Computational Considerations
  – Central Twiddles
  – Algorithm Choice
• Performance
• Summary
Central Twiddles
Central Twiddles for 1M FFT:
w^0 w^0 w^0 w^0 w^0 …
w^0 w^1 w^2 w^3 …
w^0 w^2 w^4 …
w^0 w^3 …
w^0 … w^(1023 * 1023)
• Central twiddles can take as much memory as the input data
• Reading from memory could increase FFT time up to 20%
• For 32-bit FFTs central twiddles can be computed as needed
– Trigonometric identity methods require double precision (next generation Cell should make this the method of choice)
– Direct sine and cosine algorithms are long
• Central twiddles are a significant part of the design
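One way to compute a row of central twiddles as needed is an angle-addition recurrence: each entry is the previous entry multiplied by a fixed step. The Python sketch below illustrates that idea under stated assumptions; it is not the authors' method, the function names are hypothetical, and Python floats are double precision, which matches the slide's remark that identity-based methods need double precision.

```python
import cmath


def twiddle_row_direct(r, cols, n):
    """Row r of the central twiddles, w^(r*c), with one exp call per entry."""
    return [cmath.exp(-2j * cmath.pi * r * c / n) for c in range(cols)]


def twiddle_row_recurrence(r, cols, n):
    """Same row, built from one exp call plus repeated complex multiplies."""
    step = cmath.exp(-2j * cmath.pi * r / n)   # w^r
    w = 1 + 0j                                 # w^0
    row = []
    for _ in range(cols):
        row.append(w)
        w *= step                              # w^(r*(c+1)) = w^(r*c) * w^r
    return row
```

For the 1M case (n = 1024 * 1024, 1024 columns) the two rows agree closely in double precision; rounding error grows with the number of multiplies, so a lower-precision version of the recurrence would need more care.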