Ted N. Booth DesignLinx Hardware Solutions September 2015
Using Vivado HLS for Video Algorithm Implementation for Demonstration and Validation
Agenda • Project Description • HLS Lessons Learned • Summary
Project Description • Create a platform for developing and demonstrating video IP • Faster validation for large IP and/or large images • Better demonstrations for customers • Hardware based on Virtex-7 VC709 evaluation board and a host PC • Use Ethernet to transfer frames to/from host PC • Use Video DMAs to move data between IP & external memory • Use MicroBlaze processor to initialize and control the system • Vivado HLS for IP development • Faster conversion from C/C++ to IP • Support different implementations with same code
HLS Architecture Options • Supports different implementations with the same code • Real-Time (II = 1): 3 elements/clock cycle • Minimum Resources (II = 3): 3 elements/3 clock cycles
What is Initiation Interval (II)? Number of clock cycles before the function can accept new input data
    ROW_LOOP: for (row = 0; row < rows; row++) {
      COL_LOOP: for (col = 0; col < cols; col++) {
    #pragma HLS PIPELINE II=? // 1 or 3
        RGB_LOOP: for (rgb = 0; rgb < 3; rgb++) {
          out[row][col][rgb] = in[row][col][rgb] …
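As a rough illustration of what the two II settings mean for throughput, the sketch below estimates the cycles needed per frame as rows × cols × II. It assumes the RGB loop is fully unrolled inside the pipelined column loop, so one pixel (3 elements) enters the pipeline every II cycles, and it ignores pipeline fill latency and loop overhead. `frame_cycles` is a hypothetical helper for this estimate, not part of Vivado HLS.

```cpp
#include <cstdint>

// Hypothetical helper: approximate clock cycles to process one frame when
// COL_LOOP is pipelined with the given initiation interval (II).
// Assumption: one pixel (3 colour elements) is accepted every II cycles;
// pipeline depth and loop-entry overhead are ignored.
constexpr std::uint64_t frame_cycles(std::uint32_t rows, std::uint32_t cols,
                                     std::uint32_t ii) {
    return static_cast<std::uint64_t>(rows) * cols * ii;
}
```

Under these assumptions a 1920 × 1080 frame takes about 2.07 million cycles at II = 1 and about 6.22 million cycles at II = 3.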
HLS Lessons Learned Use optimized HLS libraries • OpenCV – Open Source Computer Vision • HLS Video – HLS video infrastructure • HLS Math – C/C++ math libraries (fixed and float) • HLS IP – Xilinx IP such as FFT and FIR • HLS Linear Algebra – Common functions • HLS DSP – Common DSP functions for SDR
HLS Lessons Learned Create a library of optimized base modules • Optimize small modules to build larger modules • Use #define and function parameters • MAX_WIDTH and MAX_HEIGHT define hardware resources • Rows and Cols set the actual image size at run time
    #define MAX_WIDTH 1920
    #define MAX_HEIGHT 1080
    void my_image (…, int Rows, int Cols) { … }
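A minimal sketch of such a base module, written as plain C++ so it builds without the HLS headers. The function `vavg2` (a 2-tap vertical average) and its line buffer are hypothetical and chosen only to show the split: `MAX_WIDTH` fixes the buffer depth, while `Rows` and `Cols` select how much of it a given image uses.

```cpp
#include <cassert>
#include <cstdint>

#define MAX_WIDTH  1920  // sizes the line buffer (the hardware resource)
#define MAX_HEIGHT 1080

// Hypothetical base module: average each pixel with the pixel directly above
// it. The first row has no row above, so it is passed through unchanged.
void vavg2(const std::uint8_t* in, std::uint8_t* out, int Rows, int Cols) {
    assert(Rows <= MAX_HEIGHT);
    assert(Cols <= MAX_WIDTH);
    std::uint8_t line_buf[MAX_WIDTH] = {0};  // always MAX_WIDTH deep
    for (int row = 0; row < Rows; row++) {
        for (int col = 0; col < Cols; col++) {
            const std::uint8_t cur = in[row * Cols + col];
            const std::uint8_t prev = (row == 0) ? cur : line_buf[col];
            out[row * Cols + col] = static_cast<std::uint8_t>((cur + prev) / 2);
            line_buf[col] = cur;  // remember this row for the next one
        }
    }
}
```

The same source serves any image up to 1920 × 1080; only the two `#define` values need to change to trade buffer size against the maximum supported resolution.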
HLS Lessons Learned Learn to read the synthesis reports • Reports provide estimated utilization, timing, and latency • Reports are hierarchical so you can inspect lower level functions • Loops with variable bounds lead to undetermined latency values, shown as "?" in the report • Asserts or LOOP_TRIPCOUNT directives can be added to the code to bound loops that have variable bounds
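The trip-count option can be sketched as follows. `sum_pixels` is a hypothetical function; the `LOOP_TRIPCOUNT` directive only feeds the latency numbers in the report and does not change the generated hardware, and a standard C++ compiler simply ignores the pragma.

```cpp
// Hypothetical example: sum all pixels of a Rows x Cols image stored row-major.
// Rows and Cols are run-time values, so without a hint the report cannot
// give a latency for either loop.
unsigned long long sum_pixels(const unsigned char* img, int Rows, int Cols) {
    unsigned long long total = 0;
ROW_LOOP:
    for (int j = 0; j < Rows; j++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=1080
    COL_LOOP:
        for (int i = 0; i < Cols; i++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=1920
            total += img[j * Cols + i];
        }
    }
    return total;
}
```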
HLS Lessons Learned Using Asserts to define Loop boundaries • Affects latency calculations and hardware generation
    #define MAX_WIDTH 1920
    #define MAX_HEIGHT 1080
    void my_image (…, int Rows, int Cols) {
      …
      assert(Rows <= MAX_HEIGHT);
      assert(Cols <= MAX_WIDTH);
      ROW_LOOP: for (j = 0; j < Rows; j++) {
        COL_LOOP: for (i = 0; i < Cols; i++) {
          …
        }
      }
    }
HLS Lessons Learned Use streaming data in and out of an HLS IP • Use DMAs to move data on/off chip • Let HLS implement AXI4 interfaces
    void my_filter(hls::stream<ap_axiu<16,1,1,1> >& In,
                   hls::stream<ap_axiu<16,1,1,1> >& Out,
                   int Rows, int Cols) {
      // Specify AXI4-Stream connections
    #pragma HLS INTERFACE axis port=In bundle=INPUT_STREAM
    #pragma HLS INTERFACE axis port=Out bundle=OUTPUT_STREAM
      // Group all other ports into an AXI4-Lite interface
    #pragma HLS INTERFACE s_axilite register port=Rows bundle=Ctrl
    #pragma HLS INTERFACE s_axilite register port=Cols bundle=Ctrl
    #pragma HLS INTERFACE s_axilite port=return bundle=Ctrl
      …
    }
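For checking the data flow in software before synthesis, a plain C++ model of the streaming function can help. The sketch below is an assumption-laden stand-in: `AxiWord` mimics only the data, user and last fields of `ap_axiu<16,1,1,1>`, `std::queue` replaces `hls::stream`, and the body (a bitwise inversion) is a placeholder for the elided filter. The sideband flags follow the usual Xilinx AXI4-Stream video convention of TUSER marking start of frame and TLAST marking end of line.

```cpp
#include <cstdint>
#include <queue>

// Hypothetical stand-in for ap_axiu<16,1,1,1>: only the fields used here.
struct AxiWord {
    std::uint16_t data;  // TDATA
    bool user;           // TUSER: start of frame
    bool last;           // TLAST: end of line
};

// Stand-in for hls::stream<>, so the sketch builds without the HLS headers.
using Stream = std::queue<AxiWord>;

// Software model of a streaming filter. Precondition: `in` holds at least
// Rows * Cols words. The inversion is a placeholder for the real processing.
void my_filter_model(Stream& in, Stream& out, int Rows, int Cols) {
    for (int r = 0; r < Rows; r++) {
        for (int c = 0; c < Cols; c++) {
            AxiWord w = in.front();
            in.pop();
            w.data = static_cast<std::uint16_t>(~w.data);
            w.user = (r == 0 && c == 0);  // first word of the frame
            w.last = (c == Cols - 1);     // last word of each line
            out.push(w);
        }
    }
}
```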
HLS Lessons Learned Specify bit widths for variables • Sizing variables defines limits for HLS synthesis • Support for signed and unsigned data • ap_[u]int<N> for integers • N specifies the number of bits • ap_[u]fixed<W,I,Q,O,N> for fixed-point • W = total bits, I = integer bits (binary point position), Q = quantization mode, O = overflow mode, N = saturation bits in wrap modes • HLS aligns the binary point during calculations
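To make the quantization and overflow parameters concrete, here is a small plain C++ model of a signed fixed-point type with W total bits and I integer bits. It is a sketch, not the `ap_fixed` implementation: `to_fixed_raw` and `fixed_to_double` are hypothetical helpers, the model covers only truncation versus round-half-up and wrap versus saturate, and it assumes W is well below 63.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical model: convert x to the raw integer of a signed fixed-point
// format with W total bits and I integer bits (value = raw / 2^(W-I)).
//   round    = false: truncate toward minus infinity; true: round half up
//   saturate = false: wrap around on overflow;        true: clamp to range
std::int64_t to_fixed_raw(double x, int W, int I, bool round, bool saturate) {
    const double scaled = std::ldexp(x, W - I);  // x * 2^(W-I)
    std::int64_t raw = static_cast<std::int64_t>(
        round ? std::floor(scaled + 0.5) : std::floor(scaled));
    const std::int64_t min_raw = -(std::int64_t{1} << (W - 1));
    const std::int64_t max_raw = (std::int64_t{1} << (W - 1)) - 1;
    if (saturate) {
        if (raw > max_raw) raw = max_raw;
        if (raw < min_raw) raw = min_raw;
    } else {
        // Two's-complement wrap into [min_raw, max_raw].
        const std::int64_t span = std::int64_t{1} << W;
        raw = ((raw - min_raw) % span + span) % span + min_raw;
    }
    return raw;
}

// Convert a raw fixed-point integer back to a real value.
double fixed_to_double(std::int64_t raw, int W, int I) {
    return std::ldexp(static_cast<double>(raw), -(W - I));
}
```

With W = 8 and I = 4 the step size is 1/16, so 1.30 is stored as 1.25 when truncating and as 1.3125 when rounding. The out-of-range value 9.0 becomes 7.9375 when saturating and -7.0 when wrapping.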
HLS Lessons Learned Using "int" vs. "ap_int" for multiplication
    void my_int_mult (int in1,
                      int in2,
                      int &out) {
      out = in1 * in2;
    }

    void my_ap_mult (ap_int<10> in1,
                     ap_int<10> in2,
                     ap_int<20> &out) {
      out = in1 * in2;
    }
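The 20-bit output in the `ap_int` version can be checked with a short calculation: the largest product of two 10-bit signed operands is (-512) × (-512) = 262144, which needs 20 signed bits. The sketch below works this out in plain C++; `signed_bits_needed` and `product_bits` are hypothetical helpers and assume operand widths small enough that the products fit in 64 bits.

```cpp
#include <algorithm>
#include <cstdint>

// Smallest two's-complement width that can hold v.
// An N-bit signed value covers [-2^(N-1), 2^(N-1) - 1].
int signed_bits_needed(std::int64_t v) {
    int bits = 1;
    while (v < -(std::int64_t{1} << (bits - 1)) ||
           v > (std::int64_t{1} << (bits - 1)) - 1) {
        ++bits;
    }
    return bits;
}

// Width needed for the product of two signed operands of w1 and w2 bits,
// found by checking the four corner cases of the operand ranges.
int product_bits(int w1, int w2) {
    const std::int64_t min1 = -(std::int64_t{1} << (w1 - 1));
    const std::int64_t max1 = (std::int64_t{1} << (w1 - 1)) - 1;
    const std::int64_t min2 = -(std::int64_t{1} << (w2 - 1));
    const std::int64_t max2 = (std::int64_t{1} << (w2 - 1)) - 1;
    const std::int64_t corners[] = {min1 * min2, min1 * max2,
                                    max1 * min2, max1 * max2};
    int bits = 1;
    for (std::int64_t c : corners) {
        bits = std::max(bits, signed_bits_needed(c));
    }
    return bits;
}
```

`product_bits(10, 10)` evaluates to 20, matching the `ap_int<20>` output, whereas the `int` version describes a full 32-bit by 32-bit multiply.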
HLS Lessons Learned Using "int" vs. "ap_int" for summation
    void int_sum (int in1, int in2,
                  int in3, int in4,
                  int &out) {
      int temp1, temp2;
      temp1 = in1 + in2;
      temp2 = in3 + in4;
      out = temp1 + temp2;
    }

    void ap_sum (ap_int<8> in1, ap_int<9> in2,
                 ap_int<10> in3, ap_int<11> in4,
                 ap_int<13> &out) {
      int temp1, temp2;
      temp1 = in1 + in2;
      temp2 = in3 + in4;
      out = temp1 + temp2;
    }
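The 13-bit output of `ap_sum` follows from the usual bit-growth rule for signed addition: the sum of two operands needs one bit more than the wider operand. A minimal sketch of that rule, with `sum_bits` as a hypothetical helper:

```cpp
#include <algorithm>

// Width of the sum of two signed operands that are a and b bits wide:
// one bit more than the wider operand, so the carry cannot overflow.
inline int sum_bits(int a, int b) {
    return std::max(a, b) + 1;
}
```

Applied to the adder tree above, `in1 + in2` needs `sum_bits(8, 9)` = 10 bits, `in3 + in4` needs `sum_bits(10, 11)` = 12 bits, and the final sum needs `sum_bits(10, 12)` = 13 bits.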
HLS Lessons Learned Small code changes can affect synthesis
    void my_code1 (…, ap_uint<9> &out) {
      …
      out = 0;
      MY_LOOP1: for (i = 0; i < 127; i++) {
    #pragma HLS PIPELINE II=2
        sum = my_array[i] + my_array[i+1];
        if (sum > thresh) { out++; }
      }
      …
    }

    void my_code2 (…, ap_uint<9> &out) {
      …
      ap_uint<9> cnt = 0;
      MY_LOOP2: for (i = 0; i < 127; i++) {
    #pragma HLS PIPELINE II=2
        sum = my_array[i] + my_array[i+1];
        if (sum > thresh) { cnt++; }
      }
      …
      out = cnt;
      …
    }
Summary Achieving optimal results from Vivado HLS often requires tuning of the C/C++ code • Addition of HLS Directives • Use of optimized libraries • Re-architecting to use a modular approach • Use of HLS defined datatypes – ap_int and ap_fixed • Reorganizing code can affect synthesis results
About DesignLinx • Veteran Owned Business Offering FPGA Design & Support Services • On-Demand Onsite FPGA Support • Support for small and large-scale projects, enabling increased bandwidth for your team • Xilinx Certified Alliance Program Member and Fidus Systems Partner • Senior design team with expertise in: • FPGA design • Verilog, VHDL and HLS • IP integration • Video Processing, DSP • DDR3/4, LPDDR2, high-speed transceivers, and more • Embedded hardware/software • SMP, AMP, Linux, vxWorks, FreeRTOS, and more • C, C++ • ASIC-level verification using ModelSim and System Verilog
About Fidus • High-speed, high-complexity design • High-speed communications, high-resolution video, high-performance computing • Original products that enable Xilinx IP and custom services • Xilinx Premier Design Services member • Senior team with expertise in: • Hardware design, including digital, RF, analog, and PCB layout • FPGA design, including IP integration, signal processing, DDR3/4, high-speed transceivers, and more • Embedded software, including Zynq/MPSoC, and MicroBlaze • Signal integrity (board and system level)
Questions? Ted Booth DesignLinx Hardware Solutions tbooth@designlinxhs.com