
A Systolic FFT Architecture for Real Time FPGA Systems



  1. A Systolic FFT Architecture for Real Time FPGA Systems
     Preston Jackson, Cy Chan, Charles Rader, Jonathan Scalera, and Michael Vai
     HPEC 2004, 29 September 2004
     This work was sponsored by DARPA ATO under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
     MIT Lincoln Laboratory, Systolic Architecture-1, PAJ 9/29/2004

  2. Outline
     • Introduction
       – Motivation
       – Evaluation metrics
     • Parallel architecture
     • Systolic architecture
     • Performance summary
     • Conclusions

  3. Radar Processing Application
     Correlation of x and y: $\mathrm{Corr}[m] = \sum_{n} x[n]\, y^{*}[n-m]$
     FFT bottleneck
     • Real-time
     • Complex
     • 0.6 GSPS input (16-bits)
     • 1.2 GSPS output (12-bits)
     [Block diagram labels: ADC at 1.2 GSPS for x (32K) and for y (8K); I/Q, FFT, FIFO, and Conjugate blocks; multipliers and adders.]
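The correlation on this slide can be computed in the frequency domain, which is why the FFT becomes the bottleneck. The sketch below is a minimal illustration, not the authors' implementation: it assumes a circular correlation over two equal-length sequences, and the naive `dft` helper (a hypothetical name) stands in for the hardware FFT blocks.

```python
import cmath

def dft(seq, inverse=False):
    """Naive O(N^2) DFT, used here only as a stand-in for an FFT block."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(seq[i] * cmath.exp(sign * 2j * cmath.pi * k * i / n)
                  for i in range(n))
        out.append(acc / n if inverse else acc)
    return out

def correlate_direct(x, y):
    """Corr[m] = sum_n x[n] * conj(y[n - m]), with indices taken modulo N."""
    n = len(x)
    return [sum(x[i] * complex(y[(i - m) % n]).conjugate() for i in range(n))
            for m in range(n)]

def correlate_via_dft(x, y):
    """Same result: inverse DFT of X[k] * conj(Y[k])."""
    spectrum_x, spectrum_y = dft(x), dft(y)
    product = [a * b.conjugate() for a, b in zip(spectrum_x, spectrum_y)]
    return dft(product, inverse=True)
```

Both functions agree to rounding error, so the direct sum can be replaced by two forward transforms, a conjugate, a pointwise multiply, and one inverse transform.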

  4. Evaluation Scorecard
     • The design changes will be scored based on the following metrics:
       Metric                      Size 16   Size 8192   ∆
       Pins  (IO pins)             ?         ?           ?
       Fly   (butterflies)         ?         ?           ?
       Mult  (multipliers)         ?         ?           ?
       Add   (adder/subtractors)   ?         ?           ?
       Shift (shift registers)     ?         ?           ?
     Size is the length of the FFT.

  5. Outline
     • Introduction
     • Parallel architecture
       – Data flow graph
       – Effects of serial input
     • Systolic architecture
     • Performance summary
     • Conclusions

  6. Baseline Parallel Architecture
     Parallel FFT
     • Butterfly structure
     • Removes redundant calculation
     [Figure: 16-point data flow graph with rows numbered 1 to 16.]
       Metric   Size 16   Size 8192   ∆
       Pins     448       229K
       Fly      32        53K
       Mult
       Add
       Shift    0         0

  7. Complex Butterfly
     • Butterfly contains
       – 1 complex addition
       – 1 complex subtraction
       – 1 complex, constant multiply
     [Figure: butterfly with inputs x and y, outputs u and v, an adder, a subtractor, and a multiplier by the twiddle factor $W_N^r$.]
       Metric   Size 16   Size 8192   ∆
       Pins     448       229K
       Fly      32        53K
       Mult
       Add
       Shift    0         0
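The butterfly above (one complex addition, one complex subtraction, one constant multiply) can be sketched in software together with a transform built from it. The slide text does not say where the multiplier sits, so this sketch assumes the decimation-in-frequency form, u = x + y and v = (x - y)·W; the names `butterfly` and `fft_dif` are illustrative only.

```python
import cmath

def butterfly(x, y, w):
    """One radix-2 butterfly (assumed DIF form): add, subtract, twiddle multiply."""
    return x + y, (x - y) * w

def fft_dif(samples):
    """Iterative radix-2 DIF FFT; returns (spectrum, butterflies used).

    The length must be a power of two.
    """
    n = len(samples)
    data = [complex(s) for s in samples]
    count = 0
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                # Twiddle factor W_M^k for a block of length M = 2 * span.
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                i, j = start + k, start + k + span
                data[i], data[j] = butterfly(data[i], data[j], w)
                count += 1
        span //= 2
    # DIF leaves the results in bit-reversed order; undo that here.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        out[int(format(i, f"0{bits}b")[::-1], 2)] = data[i]
    return out, count
```

For a 16-point input the counter reaches 32, matching the Fly entry of the scorecard (4 stages of 8 butterflies).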

  8. Complex Addition
     • Complex addition adds the real and imaginary parts separately:
       $(a + jb) + (c + jd) = (a + c) + j(b + d)$
     • 2 adds
     [Figure: adders forming real = a + c and imag = b + d.]
       Metric   Size 16   Size 8192   ∆
       Pins     448       229K
       Fly      32        53K
       Mult
       Add      128       213K
       Shift    0         0

  9. Complex Multiply
     • The FOIL method of multiplying complex numbers:
       $(a + jb)(c + jd) = (ac - bd) + j(ad + bc)$
     • 4 multiplies and 2 adds
     [Figure: four multipliers on a, b, c, d; a subtractor forms the real part and an adder forms the imaginary part.]
       Metric   Size 16   Size 8192   ∆
       Pins     448       229K
       Fly      32        53K
       Mult     128       213K
       Add      192       320K
       Shift    0         0

  10. Efficient Complex Multiply
     • Another approach requires fewer multiplies:
       $(ad + bc) = c(a + b) - a(c - d)$
       $(ac - bd) = d(a - b) + a(c - d)$
     • 3 multiplies and 5 adds
     [Figure: three multipliers and five adder/subtractors on a, b, c, d producing the real and imag outputs.]
       Metric   Size 16   Size 8192   ∆
       Pins     448       229K
       Fly      32        53K
       Mult     96        159K        75%
       Add      288       480K        150%
       Shift    0         0
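The two identities above share the product a(c - d), which is how the multiply count drops from four to three. A minimal sketch, with the hypothetical helper name `complex_multiply_3`, shows the arithmetic on the real and imaginary parts directly:

```python
def complex_multiply_3(a, b, c, d):
    """Return (real, imag) of (a + jb)(c + jd) using 3 multiplies and 5 adds."""
    shared = a * (c - d)          # multiply 1, add 1 (c - d)
    imag = c * (a + b) - shared   # multiply 2, adds 2 and 3: ad + bc
    real = d * (a - b) + shared   # multiply 3, adds 4 and 5: ac - bd
    return real, imag
```

For example, (3 + 4j)(5 + 6j) gives `complex_multiply_3(3, 4, 5, 6) == (-9, 38)`, the same as the four-multiply FOIL form.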

  11. Parallel-Pipelined Architecture
     A pipelined version
     • IO bound
     • 100% efficient
     [Figure: 16-point data flow graph with rows numbered 1 to 16.]
       Metric   Size 16   Size 8192   ∆
       Pins     448       229K
       Fly      32        53K
       Mult     96        159K
       Add      288       480K
       Shift    0         0

  12. Serial Input
     A serial version
     • IO rate matches A/D
     • 6.25% efficient
     [Figure: 16-point data flow graph with rows numbered 1 to 16.]
       Metric   Size 16   Size 8192   ∆
       Pins     28        28          .01%
       Fly      32        53K
       Mult     96        159K
       Add      288       480K
       Shift    0         0

  13. Outline
     • Introduction
     • Parallel architecture
     • Systolic architecture
       – Serial implementation
       – Application specific optimizations
     • Performance summary
     • Conclusions

  14. Serial Architecture
     • The parallel architecture can be collapsed
       – One butterfly per stage
       – Consumes 1 sample per cycle
       – Same latency and throughput
       – More efficient design
     • 50% efficiency
     [Figure: four-stage pipeline, Stage 1 to Stage 4.]
       Metric   Size 16   Size 8192   ∆
       Pins     28        28
       Fly      4         13          .03%
       Mult     12        39          .03%
       Add      36        117         .03%
       Shift    22        12K
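The Pins, Fly, Mult, and Add rows of the scorecard for the parallel and serial designs follow from simple counting, which the sketch below reproduces. It assumes 3 multiplies and 9 adds per butterfly (5 adds in the efficient multiply plus 2 each for the complex add and subtract) and 28 pins per sample; the pin figure is inferred from the table (448 pins for 16 points) rather than stated in the deck. The Shift row is not modelled.

```python
MULTS_PER_BUTTERFLY = 3   # efficient complex multiply
ADDS_PER_BUTTERFLY = 9    # 5 (multiply) + 2 (complex add) + 2 (complex subtract)
PINS_PER_SAMPLE = 28      # assumption: inferred from 448 pins / 16 points

def scorecard(n, serial):
    """Resource counts for an n-point radix-2 FFT (n must be a power of two)."""
    stages = n.bit_length() - 1                 # log2(n)
    # Serial: one butterfly per stage. Parallel: n/2 butterflies per stage.
    fly = stages if serial else (n // 2) * stages
    return {
        "pins": PINS_PER_SAMPLE * (1 if serial else n),
        "fly": fly,
        "mult": MULTS_PER_BUTTERFLY * fly,
        "add": ADDS_PER_BUTTERFLY * fly,
    }
```

`scorecard(8192, serial=True)` returns 28 pins, 13 butterflies, 39 multipliers, and 117 adders, while `scorecard(8192, serial=False)` gives 53,248 butterflies, which the slides round to 53K.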

  15. High Level View
     • Replace complex structure with an abstract cell which contains:
       – FIFOs
       – Butterfly
       – Switch network
     [Figure: four abstract cells in series, Stage 1 to Stage 4.]
       Metric   Size 16   Size 8192   ∆
       Pins     28        28
       Fly      4         13
       Mult     12        39
       Add      36        117
       Shift    22        12K

  16. 8192-Point Architecture
     • Requires 13 stages
     • Fixed point arithmetic
     • Varies the dynamic range to increase accuracy
     • Overflow replaced with saturated value
     • Multipliers limit design to 18-bits and 150 MHz
     • Achieves 70 dB of accuracy
     [Figure: stages 1 to 13 annotated with fixed-point formats: 4 int / 14 frac, 5 int / 13 frac, 6 int / 12 frac, 7 int / 11 frac, 8 int / 10 frac, 9 int / 9 frac, 10 int / 8 frac. Format example: 0110.0101 (4 int, 4 frac) = 6 + 5/16.]
       Metric   Size 16   Size 8192   ∆
       Pins     28        28
       Fly      4         13
       Mult     12        39
       Add      36        117
       Shift    22        12K
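The fixed-point behaviour described above (a per-stage split between integer and fractional bits, with overflow replaced by a saturated value) can be sketched as follows. This is an illustration under stated assumptions: the deck does not say whether the formats are signed, so the sketch assumes two's complement with the sign bit counted among the integer bits, and `quantize_saturate` is a hypothetical name.

```python
def quantize_saturate(value, int_bits, frac_bits):
    """Quantize to a fixed-point grid and clamp instead of wrapping on overflow.

    Assumes two's complement, with the sign bit included in int_bits.
    """
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    raw = round(value * scale)              # nearest representable step
    raw = max(lowest, min(highest, raw))    # saturate on overflow
    return raw / scale

# The slide's example word 0110.0101 has 4 integer and 4 fractional bits.
example = int("01100101", 2) / (1 << 4)     # 6 + 5/16
```

Moving one bit from the fractional field to the integer field doubles the representable range and halves the resolution, which is the trade the slide makes as values grow through the stages.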

  17. Increase Parallelism
     Add more pipelines
     • Design limited to 150 MHz by multipliers
     • I/Q module generates 600 MSPS
     • Meets real-time requirement through parallelism
     [Figure: four parallel pipelines, each with stages 1 to 13.]
       Metric   Size 16   Size 8192   ∆
       Pins     112       112         400%
       Fly      16        52          400%
       Mult     48        156         400%
       Add      144       468         400%
       Shift    16        12K         100%
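The 400% entries follow from the ratio of the sample rate to the clock rate. A small sketch of that arithmetic, assuming each pipeline consumes one sample per clock cycle as in the serial architecture (the function names are illustrative):

```python
from math import ceil

def pipelines_needed(sample_rate_msps, clock_mhz):
    """Number of one-sample-per-cycle pipelines needed to keep up with the input."""
    return ceil(sample_rate_msps / clock_mhz)

def scaled_resources(per_pipeline, pipelines):
    """Replicating a pipeline multiplies its butterfly, multiplier, adder, and pin counts."""
    return {name: count * pipelines for name, count in per_pipeline.items()}

serial_8192 = {"pins": 28, "fly": 13, "mult": 39, "add": 117}
lanes = pipelines_needed(600, 150)
replicated = scaled_resources(serial_8192, lanes)
```

With a 600 MSPS input and a 150 MHz clock this yields four pipelines and the 112 / 52 / 156 / 468 figures in the table.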

  18. Simplification
     Target application allows a specific simplification
     • Pads a 4096-point sequence with 4096 zeros
     • Removes 1st-stage multipliers and adders
     • Achieves 100% efficiency in steady state
     [Figure: four parallel pipelines, each with stages 1 to 13.]
       Metric   Size 16   Size 8192   ∆
       Pins     160       160         143%
       Fly      16        52
       Mult     36        144         92%
       Add      108       432         92%
       Shift    4         8K          67%
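One way to see why zero padding empties the first stage: if that stage pairs samples half a transform apart (an assumption about the data flow, which the slide text does not spell out), the second operand of every pair is zero, so the sum and the difference both equal the input sample. The sketch below checks this and recomputes the multiplier and adder counts with one stage removed; all names are illustrative.

```python
def first_stage_pairs(padded):
    """Sum and difference of samples half a transform apart (assumed pairing)."""
    half = len(padded) // 2
    return [(padded[n] + padded[n + half], padded[n] - padded[n + half])
            for n in range(half)]

def simplified_counts(pipelines=4, stages=13, mults_per_fly=3, adds_per_fly=9):
    """Multipliers and adders left when the first stage needs no arithmetic."""
    active_stages = stages - 1
    return (pipelines * active_stages * mults_per_fly,
            pipelines * active_stages * adds_per_fly)

data = [1 + 1j, 2 + 0j, -3j, 4 + 0j]
padded = data + [0j] * len(data)   # pad with as many zeros as samples
pairs = first_stage_pairs(padded)
```

Every entry of `pairs` is `(x, x)` for the corresponding input sample, and `simplified_counts()` returns `(144, 432)`, matching the Mult and Add rows for size 8192.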

  19. Outline
     • Introduction
     • Parallel architecture
     • Systolic architecture
     • Performance summary
       – Power, operations per second
       – FPGA resources, frequency
       – Latency, throughput
     • Conclusions

  20. Results
     The current implementation has been placed on a Virtex II 8000 and verified at 150 MHz
     • Power: 22 Watts @ 65 °C
     • GOPS: 86 total @ 3.9 GOPS/Watt
     • FPGA resources (XC2V8000)
       – Multipliers: 144 (85%)
       – LUTs and SRLs: 39,453 (42%)
       – BlockRAM: 56 (33%)
       – Flip-flops: 35,861 (38%)
     • Frequency: 150 MHz
     • Latency: 1127 cycles
     • Throughput: 1.2 GSPS

  21. Outline
     • Introduction
     • Parallel architecture
     • Systolic architecture
     • Performance summary
     • Conclusions
       – Applicability to other platforms
       – Future work

  22. Conclusions
     • Created a high performance, real-time FFT core
       – Low power (3.9 GOPS/Watt)
       – High throughput (1.2 GSPS), low latency (7.6 µsec/sample)
       – Fixed-point (18-bits), high accuracy (70 dB)
     • General architecture
       – Extendable to a generic FPGA core
       – Retargetable to ASIC technology
     • Future work
       – Develop a parameterizable IP core generator
