
Fast Transforms using the Cell/B.E. Processor - PowerPoint PPT Presentation

Fast Transforms using the Cell/B.E. Processor. David A. Bader, joint work with Seunghwa Kang and Virat Agarwal. Sony-Toshiba-IBM Center of


  1. Fast Transforms using the Cell/B.E. Processor. David A. Bader, joint work with Seunghwa Kang and Virat Agarwal

  2. Sony-Toshiba-IBM Center of Competence for the Cell/B.E. at Georgia Tech • Mission: grow the community of Cell Broadband Engine users and developers • Fall 2006: Georgia Tech wins competition for hosting the STI Center • First publicly available IBM QS20 cluster • 200 attendees at 2007 STI Workshop • Multicore curriculum and training • Demonstrated performance on – Multimedia and gaming – Scientific computing – Medical applications – Financial services • David A. Bader, Director http://sti.cc.gatech.edu

  3. Applications • CellBuzz: Freely available, open-source libraries optimized for the Cell/B.E. http://sourceforge.net/projects/cellbuzz/ – ZLIB & GZIP: data compression – FFT: fast Fourier transform – RC5: encryption – MPEG-2: video encoding and decoding – JPEG2000: digital content processing • Financial Modeling

  4. Cell/B.E. Libraries: FFT and JPEG2000 • FFTC: Fastest Fourier Transform on the Cell/B.E. – 1-dimensional single-precision DIF FFT optimized for 1K-16K complex input samples – Parallelize and optimize the computation of a single FFT – Design a high-performance synchronization barrier using inter-SPE communication – Demonstrated superior performance of 18.6 GFlop/s for 8K complex input samples • JPEG2000 on the Cell/B.E. – Optimize coding/decoding by data decomposition, data alignment, and vectorization – Demonstrated average speedup of 3.1 over Intel 3.2 GHz Pentium-4 • The source code is freely available from our CellBuzz project on SourceForge: http://sourceforge.net/projects/cellbuzz/ [Figures: butterflies of the ordered DIF FFT; chart of GigaFlop/s versus input size (1024 to 16384) comparing FFTC (our implementation, 8 SPEs), FFTW on Cell, IBM Power5, AMD Opteron, Intel Pentium 4, and Intel Core Duo]

  5. Cell/B.E. Libraries: ZLIB and MPEG-2 • ZLIB data compression and decompression library – Vectorize compute-intensive kernels and parallelize to run on multiple SPEs – Extend the gzip header format while maintaining compatibility with legacy gzip decompressors – Demonstrated speedup of 2.9 over high-end Intel Pentium-4 system • MPEG-2 video decoding – First parallelization of a multimedia application on Cell/B.E. – Demonstrated a speedup of 3 over Intel 3.2 GHz Xeon • The source code is freely available from our CellBuzz project on SourceForge: http://sourceforge.net/projects/cellbuzz/

  6. Using the Cell/B.E. in Aircraft Health Monitoring “Retired Marine Lt. Gen. Bernard Trainor said the issue of aging aircraft is a constant complaint of all branches of service.” (Atlanta Journal Constitution, April 27, 2002) • Fault Diagnosis – Estimate the crack length without disassembly, based on vibration data collected from multiple sensors. • Failure Prognosis – Estimate the expected time before crash.

  7. System-of-Systems Decomposition [Diagram of aircraft subsystems and monitored conditions; labels in extraction order: Power and Cooling Turbo Machine Life, Oil Actuator Leakage Engine Condition, Oil Servicing and Wear and Filter Condition Generator Oil Level Hydraulic Filters, Pump, and Hydraulic Fluid Level Battery Oxygen Generator Nitrogen Generator and Filter Landing Gear and Arresting Hook Structure fatigue life Landing Gear Strut Rotary Actuator Wear Pressure and Fluid Level]

  8. Overview of the Diagnosis and Prognosis Process [Flowchart; recoverable labels: Online Modules: DAQ, Sensor Data, De-Noising, Preprocessed Data, Feature Extraction & Mapping, Features, Diagnosis, Prognosis, Particle Filter, RUL. Offline Modules: Experimental Data, Simulated Data, Data Driven Methods, System Model for Diagnosis, System Model for Prognosis, De-Noising Techniques, Feature Extraction & Mapping Techniques, Flight Regime Data, Noise Models, Model Parameter Tuning, Loading, Fault Growth, Stress Table (crack length, K min, K max). Plots: feature versus crack length.] Involves multiple computationally expensive modules!

  9. Fast Transforms on the Cell/B.E. • Fast Fourier Transform • Discrete Wavelet Transform

  10. FFTC: Fastest Fourier Transform for Cell/B.E. • Focus on medium-size FFT computations – Complex single-precision 1-dimensional FFT • Input samples and output results reside in main memory. • Radix 2, 3 and 5. • Optimized for 1K-16K input samples. • Focus on achieving high performance for the computation of a single FFT, rather than increasing throughput.

  11. Existing FFT Research on Cell/B.E. • [Williams et al., 2006] analyzed peak performance. • [Cico, Cooper and Greene, 2006] estimated 22.1 GFlop/s for an 8K complex 1D FFT that resides in the Local Store of one SPE. – 8 independent FFTs in the local stores of 8 SPEs give 176.8 GFlop/s. • [Chow, Fossum and Brokenshire, 2005] achieved 46.8 GFlop/s for a 16M complex FFT. – Highly specialized for this particular input size. • FFTW is a highly portable FFT library supporting various types, precisions and input sizes.

  12. Our FFTC is based on Cooley-Tukey • Input is a one-dimensional vector of complex values. • Algorithm is iterative, with no recursion. • An out-of-place approach is used. • Requires two arrays, A and B, for computation, one input and one output, which are swapped at every stage. • The out-of-place approach avoids data reordering after the last stage. • Algorithm requires log N stages. Each stage requires O(N) computation. – Complexity O(N log N)
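
The properties listed on slide 12 (iterative, out of place, two arrays swapped at every stage, log N stages, no final reordering) can be illustrated with a small Python sketch. This is not the FFTC source: it assumes a radix-2, Stockham-style decimation-in-frequency formulation and a power-of-two length, whereas the slides also mention radix 3 and 5. The function name `fft_out_of_place` is hypothetical.

```python
import cmath


def fft_out_of_place(samples):
    """Iterative, out-of-place radix-2 DIF FFT (Stockham-style sketch).

    Assumes len(samples) is a power of two. Two buffers, a and b, swap
    roles after every stage, so the result ends up in natural order
    without a separate reordering pass.
    """
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    a = [complex(v) for v in samples]  # input of the current stage
    b = [0j] * n                       # output of the current stage

    length, stride = n, 1              # sub-transform size, interleave factor
    while length > 1:                  # log2(n) stages
        half = length // 2
        for p in range(half):
            # Twiddle factor for butterfly p of this stage.
            w = cmath.exp(-2j * cmath.pi * p / length)
            for q in range(stride):
                u = a[q + stride * p]
                v = a[q + stride * (p + half)]
                b[q + stride * (2 * p)] = u + v
                b[q + stride * (2 * p + 1)] = (u - v) * w
        a, b = b, a                    # swap input and output arrays
        length, stride = half, stride * 2
    return a


if __name__ == "__main__":
    print(fft_out_of_place([1, 0, 0, 0]))  # unit impulse: all bins equal 1
```

Each pass of the `while` loop touches every element once, which corresponds to the O(N) work per stage and the O(N log N) total stated on the slide.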

  13. [Figure annotations: Stage begin; Twiddle factors; Stage end]

  14. Illustration of the Algorithm • Illustration of the algorithm for n=16 complex values. • Distance between pairs of output values doubles at every subsequent stage. • Shows how the output of one stage serves as the input to another.
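
The doubling distance described on slide 14 can be checked with a short sketch. It assumes the same radix-2, Stockham-style output indexing as an ordered out-of-place DIF FFT; the slide's figure is not available in this transcript, so the exact index layout is an assumption, and the function name `output_pairs_per_stage` is hypothetical.

```python
def output_pairs_per_stage(n):
    """For each stage, list the (first, second) output indices of every butterfly.

    Assumes n is a power of two and Stockham-style output indexing.
    """
    stages = []
    length, stride = n, 1
    while length > 1:
        half = length // 2
        pairs = [(q + stride * 2 * p, q + stride * (2 * p + 1))
                 for p in range(half) for q in range(stride)]
        stages.append(pairs)
        length, stride = half, stride * 2
    return stages


if __name__ == "__main__":
    for stage, pairs in enumerate(output_pairs_per_stage(16), start=1):
        distances = sorted({second - first for first, second in pairs})
        print(f"stage {stage}: pair distance {distances}, "
              f"example pairs {sorted(pairs)[:2]}")
```

For n=16 this gives four stages with pair distances 1, 2, 4 and 8, and every stage writes all 16 output positions exactly once.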

  15. FFTC design on Cell/B.E.: Challenges • A synchronization step after every stage leads to significant overhead. • Reduce synchronization stages. • Design an efficient barrier synchronization routine. • We will later describe an efficient tree-based synchronization algorithm based on inter-SPE communication. [Figure annotation: Insert synchronization barrier]
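
The tree-based barrier itself is described later in the talk and is not part of this transcript. As a rough illustration of the general idea only, the sketch below builds the message schedule of a generic binary-tree barrier: arrival notifications are gathered toward SPE 0 in about log2(P) rounds, then a release is sent back down in the reverse order. The function name and the choice of a binary tree rooted at SPE 0 are assumptions, not the authors' design.

```python
def tree_barrier_schedule(num_spes):
    """Return (gather, release) rounds of (sender, receiver) pairs.

    Generic binary-tree barrier sketch: in gather round k, every SPE whose
    index is an odd multiple of 2**k notifies the SPE 2**k below it.
    The release phase mirrors the gather phase in reverse order.
    """
    gather = []
    step = 1
    while step < num_spes:
        gather.append([(i, i - step) for i in range(step, num_spes, 2 * step)])
        step *= 2
    release = [[(receiver, sender) for sender, receiver in pairs]
               for pairs in reversed(gather)]
    return gather, release


if __name__ == "__main__":
    gather, release = tree_barrier_schedule(8)
    for k, pairs in enumerate(gather):
        print(f"gather round {k}: {pairs}")
    for k, pairs in enumerate(release):
        print(f"release round {k}: {pairs}")
```

With 8 SPEs the schedule has 3 gather rounds and 3 release rounds, and each non-root SPE sends exactly one arrival message, compared with 7 sequential notifications if every SPE reported to a single coordinator in turn.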

  16. FFTC design on Cell/B.E.: Challenges (cont’d) • Load balancing to achieve better SPU utilization – No SPE should wait at the synchronization barrier. – Requires an efficient parallelization technique to allocate data to SPEs. – Strategy should be scalable across multiple chips (large number of SPEs). • Vectorization is difficult for every stage – Stages 1 and 2 do not have a regular data access pattern. – Requires data reorganization to fully utilize the SPE computational power. – Optimizing the first 2 stages becomes important for medium-size inputs, as they may constitute 20-25% of the total running time. [Figure label: First 2 stages]

  17. FFTC design on Cell/B.E.: Challenges (cont’d) • Limited local store – Requires space for N/2 twiddle factors and input data. – Loop unrolling and duplication increase the size of the code. – Effectively manage code and data within 256KB. • Algorithm is branchy – Doubly nested for loop within the outer while loop. – Lack of a branch predictor compromises performance. – Provide branch hints and restructure the algorithm to eliminate branches.
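
To make the local store constraint on slide 17 concrete, the following back-of-the-envelope sketch adds up the bytes needed for N/2 twiddle factors and a block of input data against the 256KB local store. It assumes single-precision complex values of 8 bytes each (two 32-bit floats). How much input data is resident on one SPE at a time is not stated in the slides, so `resident_samples` is a hypothetical parameter, as is the function name.

```python
LOCAL_STORE_BYTES = 256 * 1024  # SPE local store size from the slide
BYTES_PER_COMPLEX = 8           # assumed: two 32-bit floats per sample


def local_store_budget(n, resident_samples):
    """Return (twiddle_bytes, data_bytes, remaining_bytes) for one SPE.

    n: FFT size; N/2 twiddle factors are kept in the local store.
    resident_samples: hypothetical number of input samples held at once.
    """
    twiddle_bytes = (n // 2) * BYTES_PER_COMPLEX
    data_bytes = resident_samples * BYTES_PER_COMPLEX
    remaining = LOCAL_STORE_BYTES - twiddle_bytes - data_bytes
    return twiddle_bytes, data_bytes, remaining


if __name__ == "__main__":
    # Illustrative worst case: a full 16K-sample input resident on one SPE.
    twiddles, data, remaining = local_store_budget(16384, 16384)
    print(f"twiddles: {twiddles // 1024} KB, data: {data // 1024} KB, "
          f"left for code, stack and buffers: {remaining // 1024} KB")
```

In that illustrative case the twiddle factors take 64 KB and the data 128 KB, leaving 64 KB for code, stack and any second buffer, which is why unrolled and duplicated code has to be budgeted carefully.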
