
Reconfiguration Overhead in Dynamic Task-Based Implementations on FPGAs



  1. Reconfiguration Overhead in Dynamic Task-Based Implementations on FPGAs
     Padmini Nagaraj, University of California, Berkeley (Distributed Mentor Program, Researcher, Summer 2004)
     Professor Elaheh Bozorgzadeh, University of California, Irvine (Distributed Mentor Program, Mentor)

  2. Introduction
     • Field Programmable Gate Arrays (FPGAs)
     • Metrics:
       – Performance Time
       – Reconfiguration Time
       – Resources Available
     • Xilinx Virtex 2 XCV2000E: partially reconfigurable by Configurable Logic Block (CLB) columns
     [Figure: Example Xilinx FPGA chip, showing CLBs arranged in columns]
     Padmini Nagaraj - minar@ocf.berkeley.edu

  3. Reconfiguration Overhead
     • Reconfiguration delay is crucial in a dynamically reconfigurable architecture when reconfiguration is exploited at runtime.
     • Project: study the trade-off between reconfiguration delay and the performance of a task implemented on an FPGA device.
     • Reconfiguration delay is highly correlated with the physical layout of the implementation.
     • In Xilinx devices, reconfiguration proceeds column by column.
     • The number of columns occupied by the design layout is therefore highly correlated with reconfiguration delay.
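
The column-by-column point above can be sketched as a simple linear model: if every CLB column takes the same number of configuration bits, reconfiguration time grows in proportion to the number of columns the layout occupies. This is only an illustration. The function name and every device parameter below (frames per column, bits per frame, configuration port rate) are hypothetical placeholders, not values from the slides or from a Xilinx datasheet.

```python
def reconfiguration_time(num_columns, frames_per_column, bits_per_frame,
                         config_bits_per_second):
    """Estimate reconfiguration time in seconds under a linear column model.

    Assumes every CLB column needs the same number of configuration bits,
    so the delay scales directly with the number of columns rewritten.
    """
    if num_columns < 0:
        raise ValueError("num_columns must be non-negative")
    total_bits = num_columns * frames_per_column * bits_per_frame
    return total_bits / config_bits_per_second


# Hypothetical device parameters, chosen only to make the example concrete.
FRAMES_PER_COLUMN = 48
BITS_PER_FRAME = 1_000
CONFIG_RATE = 50e6 * 8  # hypothetical 8-bit configuration port at 50 MHz

if __name__ == "__main__":
    for columns in (10, 12, 14, 16):
        t = reconfiguration_time(columns, FRAMES_PER_COLUMN,
                                 BITS_PER_FRAME, CONFIG_RATE)
        print(f"{columns} columns: {t * 1e3:.3f} ms")
```

Under this model, constraining a design to fewer columns lowers the reconfiguration cost, which is the quantity the project trades against clock frequency.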

  4. Project Description
     Task: Implementation on FPGA devices
     Objective: Configuration Time (Number of CLB Columns) vs. Performance (Application Clock Frequency)

  5. FPGA-based Compilation Flow
     • Hardware Description Language: VHDL
     • Simulation: ModelSim SE 5.7g
     • Synthesis: Synplify Pro 7.6.1
     • Place and Route: Xilinx Place and Route Tools + Xilinx COREGen
     • Write to Chip
     • Report performance and area

  6. Experimental Analysis
     • Metrics used (provided by the Xilinx Place and Route Tools):
       – CLB Columns constrained
       – Maximum Clock Frequency
       – Maximum Pin Delay
       – Average Delay of 10 Worst Nets
     • Applications:
       – Matrix Multiply
       – Fast Fourier Transform
       – 2-D Discrete Cosine Transform
       – JPEG
       – Others: CORDIC, Multiply Accumulator, Comb Filter, etc.
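
To make the two delay metrics above concrete, here is a minimal sketch of how a maximum delay and an average over the 10 worst delays could be computed from a list of per-connection delays. In the study these numbers come directly from the Xilinx place and route report. The function name and the input list are hypothetical, and the tool's own definition of each metric may differ in detail.

```python
import heapq


def delay_metrics(delays, worst_n=10):
    """Return (maximum delay, average of the worst_n largest delays).

    `delays` is a hypothetical flat list of delays in any consistent unit.
    If fewer than worst_n values exist, the average covers all of them.
    """
    if not delays:
        raise ValueError("delays must not be empty")
    worst = heapq.nlargest(worst_n, delays)
    return max(delays), sum(worst) / len(worst)


if __name__ == "__main__":
    # Made-up delays in nanoseconds, for illustration only.
    sample = [float(i) for i in range(1, 21)]
    maximum, worst_avg = delay_metrics(sample)
    print(f"maximum delay: {maximum} ns")
    print(f"average of 10 worst: {worst_avg} ns")
```

The average of the worst delays can never exceed the maximum, which matches the ordering of the two delay columns reported in the experimental data.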

  7. Experimental Data: Matrix Multiplier
     [Figure: Matrix Multiplier Clock Frequency vs. CLB Columns. x-axis: Physical Constraint (Number of CLB Columns): 10, 12, 14, 16, Whole Chip; y-axis: Maximum Clock Frequency (Hz)]
     [Figure: Matrix Multiplier Delays and Clock Period. x-axis: Physical Constraint (Number of CLB Columns): 10, 12, 14, 16, Whole Chip; series: Minimum Clock Period (s), Maximum Pin Delay (s), Worst 10 Net Delays (s)]
     [Figure: Matrix Multiplier layout constrained at 12 columns, and unconstrained]

  8. Experimental Data: Fast Fourier Transform
     [Figure: FFT Clock Frequency vs. CLB Columns. x-axis: Physical Constraint (Number of CLB Columns): 16, 20, 24, 28, 32, Whole Chip; y-axis: Maximum Clock Frequency (Hz)]
     [Figure: FFT Delays and Clock Period. x-axis: Physical Constraint (Number of CLB Columns): 16, 20, 24, 28, 32, Whole Chip; series: Minimum Clock Period (s), Maximum Pin Delay (s), Worst 10 Net Delay (s)]
     [Figure: FFT layout constrained at 20 columns, and unconstrained]

  9. Experimental Data: 2-D Discrete Cosine Transform
     [Figure: 2-D Discrete Cosine Transform Clock Frequency vs. CLB Columns. x-axis: Physical Constraint (Number of CLB Columns): 12, 16, 20, 24, 28, Whole Chip; y-axis: Maximum Clock Frequency (Hz)]
     [Figure: 2-D Discrete Cosine Transform Delays and Clock Period. x-axis: Physical Constraint (Number of CLB Columns): 12, 16, 20, 24, 28, Whole Chip; series: Minimum Clock Period (s), Maximum Pin Delay, Worst 10 Net Delays]
     [Figure: 2DCT layout constrained at 28 columns, and unconstrained]

  10. Experimental Data

      Application                  Min. Number of   Min. Clock   Max. Clock       Max Pin     Worst 10 Net
                                   CLB Columns      Period (s)   Frequency (Hz)   Delay (s)   Delay (s)
      FFT 256                      20               7.571E-09    1.321E+08        5.228E-09   3.702E-09
      FFT                          16               1.053E-08    9.501E+07        6.711E-09   5.617E-09
      2-D Disc. Cosine Transform   14               6.923E-09    1.444E+08        4.040E-09   3.382E-09
      FFT 1024                     12               9.312E-09    1.074E+08        5.462E-09   4.724E-09
      Matrix Multiplier            10               6.466E-09    1.547E+08        4.235E-09   3.567E-09
      CORDIC                       4                8.453E-09    1.183E+08        2.876E-09   2.288E-09
      Digital Down Converter       4                8.373E-09    1.194E+08        3.108E-09   2.377E-09
      1-D Disc. Cosine Transform   2                4.857E-09    2.059E+08        2.835E-09   2.360E-09
      Cascaded Int. Comb Filter    2                3.380E-09    2.959E+08        1.461E-09   1.009E-09
      Multiply Accumulator         2                5.443E-09    1.837E+08        3.060E-09   2.388E-09
      Sine/Cosine Look Up Table    2                0.000E+00    0.000E+00        1.677E-09   1.120E-09
      Direct Digital Synthesizer   2                4.532E-09    2.207E+08        1.810E-09   1.233E-09
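
The clock columns in the data above are related by f_max = 1 / T_min. The short check below recomputes the frequency from the reported period for a few rows and compares it with the reported frequency. The row values are copied from the slide; the dictionary layout, function names, and the 0.1% tolerance are choices made for this sketch.

```python
# (minimum clock period in s, reported maximum clock frequency in Hz)
ROWS = {
    "FFT 256": (7.571e-9, 1.321e8),
    "Matrix Multiplier": (6.466e-9, 1.547e8),
    "CORDIC": (8.453e-9, 1.183e8),
    "Cascaded Int. Comb Filter": (3.380e-9, 2.959e8),
}


def max_frequency(min_period_s):
    """Maximum clock frequency is the reciprocal of the minimum period."""
    return 1.0 / min_period_s


def relative_error(computed, reported):
    """Relative difference between a computed and a reported value."""
    return abs(computed - reported) / reported


if __name__ == "__main__":
    for name, (period, reported) in ROWS.items():
        computed = max_frequency(period)
        err = relative_error(computed, reported)
        # Small mismatches are expected: both columns are rounded to 4 digits.
        status = "ok" if err < 1e-3 else "mismatch"
        print(f"{name}: computed {computed:.4e} Hz, "
              f"reported {reported:.4e} Hz ({status})")
```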

  11. Application: JPEG
      • JPEG encoding steps: Image Block (8 x 8 Pixels) -> RGB->YCrCb -> 2-D Disc. Cosine Transform -> Quantize
      • JPEG decoding steps: Inverse Quantize -> Inverse 2-D Disc. Cosine Transform -> YCrCb->RGB -> Image Block (8 x 8 Pixels)

  12. Application: JPEG
      [Figure: JPEG Application Frequencies. x-axis: Applications (XAPP637 RGB to YCbCr, XAPP238 YCrCb to RGB, 2-D Disc. Cosine Transform, Inverse 2-D Disc. Cosine Transform, XAPP615 Quantization, XAPP615 Inverse-Quantization); y-axis: Frequency (Hz)]
      [Figure: JPEG Clock Period and Delays. x-axis: Applications (same six blocks); series: Clock Period (s), Max Pin Delay, Worst 10 net Delay]

  13. Conclusion
      • Studied the trade-off between reconfiguration delay and performance when implementing applications on an FPGA device.
      • Compared performance across different layout areas for each implementation.
      • Results show the following:
        – In several cases, relaxing the area constraint lets the tool improve performance. In other cases it does not, for the following reasons:
          • I/O-dominated applications
          • FPGA CAD tools are not yet mature enough to exploit a small area for better performance
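
One way to reason about the trade-off summarized above is a break-even calculation: a wider layout costs more reconfiguration time but may run at a shorter clock period, so it pays off only if the task runs for enough cycles. The sketch below assumes total time = reconfiguration time + cycles x clock period. All function names and all numbers are hypothetical and are not measurements from this study.

```python
import math


def total_time(reconfig_s, clock_period_s, cycles):
    """Total task time: one reconfiguration plus `cycles` clock periods."""
    return reconfig_s + cycles * clock_period_s


def break_even_cycles(reconfig_small, period_small,
                      reconfig_large, period_large):
    """Approximate cycle count from which the larger layout is faster.

    Returns None when the larger layout has no clock advantage, and 0
    when it is not slower to reconfigure in the first place.
    """
    if period_large >= period_small:
        return None
    extra_reconfig = reconfig_large - reconfig_small
    if extra_reconfig <= 0:
        return 0
    return math.ceil(extra_reconfig / (period_small - period_large))


if __name__ == "__main__":
    # Hypothetical layouts: narrow (cheap to load, slower clock)
    # versus wide (expensive to load, faster clock).
    narrow = (1.0e-3, 8e-9)
    wide = (1.5e-3, 6e-9)
    n = break_even_cycles(*narrow, *wide)
    print(f"wide layout pays off after roughly {n} cycles")
```

When the relaxed constraint gives no clock improvement, as in the I/O-dominated cases, the function returns None and the narrower layout is preferable for any run length.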
