
Reconfiguration Overhead in Dynamic Task-Based Implementations on FPGAs



  1. Reconfiguration Overhead in Dynamic Task-Based Implementations on FPGAs
     Padmini Nagaraj, UCB, Distributed Mentor Program, Researcher, Summer 2004
     Professor Elaheh Bozorgzadeh, UCI, Distributed Mentor Program, Mentor

  2. Outline
     I. Introduction
     II. Project Description
     III. Example Application: Matrix Multiplier
     IV. Experimental Data
         A. Matrix Multiplier
         B. Fast Fourier Transform
         C. 2-D Discrete Cosine Transform
         D. Multiple Applications
     V. Real World Application: JPEG
     VI. Conclusion
     Padmini Nagaraj - minar@ocf.berkeley.edu

  3. Introduction
     Field Programmable Gate Array (FPGA)
     Metrics: Performance Time, Reconfiguration Time, Resources Available
     Xilinx Virtex 2 XCV2000E: Partially Reconfigurable by CLB columns
     [Figure: Example Xilinx Chip, with CLBs labeled]

  4. Introduction (cont…)
     Hardware Description Language: VHDL
     Simulation: ModelSim SE 5.7g
     Synthesis: Synplify Pro 7.6.1
     Place and Route: Xilinx Place and Route Tools
     Write to Chip (iMPACT): No Writes to Chip
     Project Navigator 6.2.03i

  5. Project Description
     GOAL: Application configuration time vs. performance time.
     For each application: Clock Frequency and Number of CLB Columns
     Several small to large independent applications
     Real world example: JPEG
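The trade-off named in the goal can be sketched as a simple cost model. This is only an illustration under stated assumptions: it assumes reconfiguration time grows linearly with the number of CLB columns rewritten (plausible because the device is partially reconfigurable by CLB columns) and that performance time is the cycle count divided by the clock frequency. The function name and the parameters `seconds_per_column` and `cycles` are hypothetical; the slides do not report either quantity.

```python
def total_time(clb_columns, clock_hz, cycles, seconds_per_column):
    """Return reconfiguration time plus execution time, in seconds.

    Assumptions (not taken from the slides):
      - reconfiguration cost is linear in the number of CLB columns
      - execution takes `cycles` clock cycles at `clock_hz`
    """
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    reconfiguration = clb_columns * seconds_per_column
    execution = cycles / clock_hz
    return reconfiguration + execution


# Hypothetical numbers, chosen only to show the shape of the trade-off:
# fewer columns reconfigure faster but may run at a lower clock frequency.
narrow = total_time(clb_columns=10, clock_hz=1.547e8, cycles=1_000_000,
                    seconds_per_column=1e-4)
wide = total_time(clb_columns=16, clock_hz=1.539e8, cycles=1_000_000,
                  seconds_per_column=1e-4)
```

With a model like this, a task is worth constraining to fewer columns whenever the saved reconfiguration time exceeds the extra execution time.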

  6. Example Application: Matrix Multiplier
     8 x 8 Matrix Multiplier: Needs Lots of Data!
     a.) BRAMs: Interested in seeing effects independent of other chip resources.
     b.) Lots of I/O pins: Okay.
     c.) Neither: Really slow! Too much time reading inputs.

  7. Example Application: Matrix Multiplier (cont…)
     [Figure: Matrix Multiply Block Diagram. Eight multipliers (Mult) take the 16-bit input pairs A0[15:0]/B0[15:0] through A7[15:0]/B7[15:0]; seven adders (Add) sum the products into Result[15:0].]
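The block diagram describes an eight-term multiply-and-add producing one 16-bit result. A minimal software model of that datapath might look like the sketch below. It is an illustration, not the VHDL design from the project: the names `dot8` and `matmul8` are hypothetical, the adders are arranged as a balanced tree (the exact arrangement is not recoverable from the slide), and every intermediate value is assumed to wrap to 16 bits because the diagram shows only `[15:0]` signals.

```python
MASK16 = 0xFFFF  # keep only bits [15:0], as in the block diagram


def dot8(a, b):
    """Model one Result[15:0]: eight multipliers feeding seven adders."""
    if len(a) != 8 or len(b) != 8:
        raise ValueError("dot8 expects two vectors of length 8")
    # Eight parallel multipliers, each truncated to 16 bits (assumption).
    level = [(x * y) & MASK16 for x, y in zip(a, b)]
    # Adder tree: 8 -> 4 -> 2 -> 1 values, seven additions in total.
    while len(level) > 1:
        level = [(level[i] + level[i + 1]) & MASK16
                 for i in range(0, len(level), 2)]
    return level[0]


def matmul8(a, b):
    """8 x 8 matrix product built from 64 evaluations of dot8."""
    columns = list(zip(*b))  # transpose b so each column is a vector
    return [[dot8(row, col) for col in columns] for row in a]
```

For example, `dot8([1] * 8, [1, 2, 3, 4, 5, 6, 7, 8])` sums the integers 1 through 8. Because the arithmetic wraps modulo 2^16, the result does not depend on how the seven adders are arranged.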

  8. Example Application: Matrix Multiplier (cont…)
     1.) Write code
     2.) Simulate - Testbench
     3.) Synthesis
     4.) Place and Route - Constrain Time and Columns

  9. Experimental Data
     Xilinx CORE Generator: Intellectual Property of Xilinx
     Metrics Used:
       CLB Columns
       Maximum Clock Frequency
       Maximum Pin Delay
       Average Delay of 10 Worst Nets

  10. Experimental Data: Matrix Multiplier
      [Figure: Matrix Multiplier Clock Frequency vs. CLB Columns. x-axis: Physical Constraint (Number of CLB Columns): 10, 12, 14, 16, Whole Chip; y-axis: Maximum Clock Frequency (Hz).]
      [Figure: Matrix Multiplier Delays and Clock Period. x-axis: Physical Constraint (Number of CLB Columns): 10, 12, 14, 16, Whole Chip; series: Minimum Clock Period (s), Maximum Pin Delay (s), Worst 10 Net Delays (s).]

  11. Experimental Data: Matrix Multiplier (cont…)
      [Figures: Matrix Multiplier constrained at 12 columns; Matrix Multiplier unconstrained]

  12. Experimental Data: Matrix Multiplier (cont…)
      Physical Constraint (number of CLB columns)   10          12          14          16          Whole Chip
      Minimum Clock Period (s)                      6.466E-09   6.476E-09   6.496E-09   6.496E-09   5.930E-09
      Maximum Clock Frequency (Hz)                  1.547E+08   1.544E+08   1.539E+08   1.539E+08   1.686E+08
      Maximum Pin Delay (s)                         4.235E-09   4.174E-09   3.938E-09   4.120E-09   3.787E-09
      Worst 10 Net Delays (s)                       3.567E-09   3.692E-09   3.406E-09   3.470E-09   3.396E-09
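The frequency row follows from the period row, since the maximum clock frequency is the reciprocal of the minimum clock period. The short sketch below recomputes it from the Matrix Multiplier periods reported on this slide and expresses the cost of each column constraint relative to the whole-chip placement. The helper names are hypothetical; only the period values come from the slide.

```python
# Minimum clock periods (s) for the Matrix Multiplier, keyed by constraint.
PERIODS = {
    "10": 6.466e-9,
    "12": 6.476e-9,
    "14": 6.496e-9,
    "16": 6.496e-9,
    "Whole Chip": 5.930e-9,
}


def max_frequency(period_s):
    """Maximum clock frequency (Hz) for a given minimum clock period (s)."""
    return 1.0 / period_s


def slowdown(period_s, baseline_s):
    """Fractional increase in clock period relative to a baseline."""
    return period_s / baseline_s - 1.0


baseline = PERIODS["Whole Chip"]
for constraint, period in PERIODS.items():
    print(f"{constraint:>10}: {max_frequency(period):.3e} Hz, "
          f"{slowdown(period, baseline):+.1%} period vs. whole chip")
```

For this design, constraining placement to 10 CLB columns lengthens the clock period by roughly 9% compared with the unconstrained whole-chip result.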

  13. Experimental Data: Fast Fourier Transform
      [Figure: FFT Clock Frequency vs. CLB Columns. x-axis: Physical Constraint (Number of CLB Columns): 16, 20, 24, 28, 32, Whole Chip; y-axis: Maximum Clock Frequency (Hz).]
      [Figure: FFT Delays and Clock Period. x-axis: Physical Constraint (Number of CLB Columns): 16, 20, 24, 28, 32, Whole Chip; series: Minimum Clock Period (s), Maximum Pin Delay (s), Worst 10 Net Delay (s).]

  14. Experimental Data: Fast Fourier Transform (cont…)
      [Figures: FFT constrained at 20 columns; FFT unconstrained]

  15. Experimental Data: Fast Fourier Transform (cont…)
      Physical Constraint (Number of CLB columns)   16          20          24          28          32          Whole Chip
      Minimum Clock Period (s)                      1.053E-08   7.214E-09   8.276E-09   8.276E-09   8.170E-09   8.365E-09
      Maximum Clock Frequency (Hz)                  9.501E+07   1.386E+08   1.208E+08   1.208E+08   1.224E+08   1.195E+08
      Maximum Pin Delay (s)                         6.711E-09   5.545E-09   6.227E-09   5.397E-09   5.864E-09   5.540E-09
      Worst 10 Net Delay (s)                        5.617E-09   4.736E-09   5.404E-09   4.778E-09   5.067E-09   4.776E-09

  16. Experimental Data: 2-D Discrete Cosine Transform
      [Figure: 2-D Discrete Cosine Transform Clock Frequency vs. CLB Columns. x-axis: Physical Constraint (Number of CLB Columns): 12, 16, 20, 24, 28, Whole Chip; y-axis: Maximum Clock Frequency (Hz).]
      [Figure: 2-D Discrete Cosine Transform Delays and Clock Period. x-axis: Physical Constraint (Number of CLB Columns): 12, 16, 20, 24, 28, Whole Chip; series: Minimum Clock Period (s), Maximum Pin Delay, Worst 10 Net Delays.]

  17. Experimental Data: 2-D Discrete Cosine Transform (cont…)
      [Figures: 2DCT constrained at 28 columns; 2DCT unconstrained]

  18. Experimental Data: 2-D Discrete Cosine Transform (cont…)
      Physical Constraint (number of CLB columns)   12          16          20          24          28          Whole Chip
      Minimum Clock Period (s)                      7.169E-09   6.349E-09   6.197E-09   6.286E-09   6.163E-09   7.457E-09
      Maximum Clock Frequency (Hz)                  1.395E+08   1.575E+08   1.614E+08   1.591E+08   1.623E+08   1.341E+08
      Maximum Pin Delay (s)                         4.798E-09   4.208E-09   4.163E-09   4.088E-09   3.707E-09   6.367E-09
      Worst 10 Net Delays (s)                       3.667E-09   3.420E-09   3.373E-09   3.295E-09   3.280E-09   5.711E-09

  19. Experimental Data: Multiple Applications
      [Figure: Multiple Applications Frequencies. x-axis: Applications (labels include FFT 256, 2-D Disc. Cosine, Matrix Multiplier, Digital Down Converter, Cascaded Int. Comb, Sine/Cosine Look Up); y-axis: Frequency (Hz).]
      [Figure: Multiple Applications Delays. x-axis: Applications (same labels); series: Minimum Clock Period (s), Max Pin Delay (s), Worst 10 net Delay (s).]

  20. Experimental Data: Multiple Applications (cont…)
      Application                 Min. CLB Columns  Min. Clock Period (s)  Max. Clock Frequency (Hz)  Max Pin Delay (s)  Worst 10 Net Delay (s)
      FFT 256                     20                7.571E-09              1.321E+08                  5.228E-09          3.702E-09
      FFT                         16                1.053E-08              9.501E+07                  6.711E-09          5.617E-09
      2-D Disc. Cosine Transform  14                6.923E-09              1.444E+08                  4.040E-09          3.382E-09
      FFT 1024                    12                9.312E-09              1.074E+08                  5.462E-09          4.724E-09
      Matrix Multiplier           10                6.466E-09              1.547E+08                  4.235E-09          3.567E-09
      CORDIC                      4                 8.453E-09              1.183E+08                  2.876E-09          2.288E-09
      Digital Down Converter      4                 8.373E-09              1.194E+08                  3.108E-09          2.377E-09
      1-D Disc. Cosine Transform  2                 4.857E-09              2.059E+08                  2.835E-09          2.360E-09
      Cascaded Int. Comb Filter   2                 3.380E-09              2.959E+08                  1.461E-09          1.009E-09
      Multiply Accumulator        2                 5.443E-09              1.837E+08                  3.060E-09          2.388E-09
      Sine/Cosine Look Up Table   2                 0.000E+00              0.000E+00                  1.677E-09          1.120E-09
      Direct Digital Synthesizer  2                 4.532E-09              2.207E+08                  1.810E-09          1.233E-09

  21. Experimental Data: Multiple Applications (cont…)
      [Figures: FFT constrained at 16 columns; 1DCT constrained at 2 columns]

  22. Real World Application: JPEG
      JPEG encoding steps: Image Block (8 x 8 Pixels), RGB->YCrCb, 2-D Disc. Cosine Transform, Quantize, Encoding
      JPEG decoding steps: Decoding, Inverse Quantize, Inverse 2-D Disc. Cosine Transform, YCrCb->RGB, Image Block (8 x 8 Pixels)
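The per-block stages named on this slide can be illustrated with a small software reference model. This is a sketch under stated assumptions, not the hardware design from the project: the color conversion uses the common JFIF full-range coefficients, the transform is a naive orthonormal 2-D DCT-II, and quantization uses a single uniform `step` (a hypothetical simplification; real JPEG uses an 8 x 8 quantization table). The entropy encoding and decoding stages are omitted.

```python
import math

N = 8  # JPEG processes the image in 8 x 8 pixel blocks


def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (JFIF coefficients)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def _scale(k):
    # Extra factor on the k = 0 basis function that makes the DCT orthonormal.
    return math.sqrt(0.5) if k == 0 else 1.0


def _basis(x, u):
    # Cosine basis function of the 8-point DCT-II.
    return math.cos((2 * x + 1) * u * math.pi / (2 * N))


def dct_2d(block):
    """Naive 2-D DCT-II of an 8 x 8 block (O(N^4), written for clarity)."""
    return [[0.25 * _scale(u) * _scale(v) * sum(
                block[x][y] * _basis(x, u) * _basis(y, v)
                for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def idct_2d(coeffs):
    """Inverse of dct_2d."""
    return [[0.25 * sum(
                _scale(u) * _scale(v) * coeffs[u][v]
                * _basis(x, u) * _basis(y, v)
                for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]


def quantize(coeffs, step):
    """Uniform quantization: divide by the step and round to an integer."""
    return [[round(c / step) for c in row] for row in coeffs]


def dequantize(levels, step):
    """Inverse quantization: scale the integer levels back up."""
    return [[q * step for q in row] for row in levels]
```

A decoder for one block would then chain `dequantize` and `idct_2d`, mirroring the encoder's `dct_2d` and `quantize`. Only the quantization step loses information; the transform pair on its own reconstructs the block up to floating-point error.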
