


  1. Accelerated Data Processing on SoC with FPGA
     Marek Vašut <marex@denx.de>
     June 3, 2015

  2. Marek Vasut
     ◮ Software engineer at DENX S.E. since 2011
     ◮ Embedded and Real-Time Systems Services, Linux kernel and driver development, U-Boot development, consulting, training
     ◮ Versatile Linux kernel hacker
     ◮ Custodian at U-Boot bootloader

  3. Structure of the talk
     ◮ Motivation
     ◮ Introduction to FPGAs
     ◮ Your first FPGA data cruncher
     ◮ Interfacing with Linux
     ◮ Speeding things up

  4. Why listen to this talk
     ◮ Get fresh ideas
     ◮ Learn something new
     ◮ Reduce energy envelope of your device
     ◮ Process data quickly and efficiently
     You won't learn marketing stuff or random benchmark numbers

  5. FPGA
     ◮ Abbr. for Field Programmable Gate Array
     ◮ Programmable logic
     ◮ Usually used for:
        ◮ Digital Signal Processing (DSP)
        ◮ Data crunching
        ◮ Custom hardware interfaces
        ◮ ASIC prototyping
        ◮ ...
     ◮ Common vendors: Xilinx, Altera, Lattice, Microsemi, ...

  6. Internal structure
     [Figure: FPGA internal structure. Credit: W.T.Freeman, http://www.vision.caltech.edu/CNS248/Fpga/fpga1a.gif, CC BY 2.5: http://creativecommons.org/licenses/by/2.5/]

  7. FPGA and the outside
     ◮ FPGA has plenty of I/O options:
        ◮ Regular I/O with configurable voltage levels
        ◮ Differential I/O
        ◮ High-speed SerDes
        ◮ ...
     ◮ Usual interface with host:
        ◮ Stand-alone FPGA, usually PCIe, USB, ...
        ◮ FPGA on a CPU bus (PowerPCs, e.g. ML507)
        ◮ Built into CPU (SoCFPGA/Zynq), usually AMBA/AXI

  8. Programming the FPGA
     ◮ Each vendor has its own tools, e.g. Altera Quartus, Xilinx Vivado
     ◮ FPGA tools are often closed source :-(
     ◮ FPGA bitstream format is closed :-(
     ◮ Basic vendor tools are available free of charge
     ◮ They offer enough functionality to implement a data cruncher
     ◮ Vendor tools are needed for place-and-route and for the assembler
     ◮ Third-party tools for synthesis are available

  9. Comparison to a GPU – I.

                  CPU          GPU          FPGA
     Toolchain    Open         Closed       Closed
     HW design    Proprietary  Proprietary  Your own
     HW units     Fixed        Fixed        As needed
     I/O          Limited      None         As needed

  10. HDL – Hardware Description Language
     ◮ FPGA content is written in HDLs
     ◮ HDL – Hardware Description Language
     ◮ HDLs are used to model the behavior of logic blocks
     ◮ Two major HDLs: VHDL and Verilog
     ◮ Tools often allow seamless mixing of HDLs
     ◮ Many readily-available cores under acceptable license:
        OpenCores            http://opencores.org/
        OpenCores projects   http://opencores.org/projects
        CERN Open HW Repo    http://www.ohwr.org/

  11. Modeling behavior
     HW behavior modeling vs. writing CPU code:
     ◮ Vastly different and confusing to software people :-)
     ◮ CPU: Programmer implements an algorithm
     ◮ FPGA: Programmer implements hardware to run the algorithm

  12. Implicit parallelism
     ◮ Everything in a block is executed in parallel
     ◮ All conditions in a conditional statement (if, case) are tested in parallel, which differs from C

         if (foo == 1)
             bar <= 1'b0;
         else
             bar <= 1'b1;

     ◮ Blocks are executed in parallel

         begin
             x <= 1'b0;
             y <= 1'b1;
         end
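For readers who think in C, one way to picture "everything in a block is executed in parallel" is a two-phase update: every new value is computed from the old state, then all of them are committed together. The following is only a rough software model under that assumption; the `struct regs` type and both function names are made up for this illustration and do not come from the talk.

```c
#include <stdint.h>

/* Hypothetical two-register design state, for illustration only. */
struct regs {
    uint8_t x;
    uint8_t y;
};

/* Plain C: statements run one after another, so the second assignment
 * already sees the value written by the first one. */
void update_sequential(struct regs *r)
{
    r->x = r->y;
    r->y = r->x;   /* reads the value just written, both end up equal */
}

/* Software model of the Verilog block
 *     begin x <= y; y <= x; end
 * All right-hand sides are sampled from the old state, then every
 * register is updated together, so the two values are swapped. */
void update_parallel(struct regs *r)
{
    struct regs next;

    next.x = r->y;   /* sampled from the old state */
    next.y = r->x;   /* sampled from the old state */
    *r = next;       /* commit everything at once */
}
```

Starting from x = 1, y = 0, the sequential version leaves both fields at 0, while the parallel model swaps them, which is what the hardware would do on a clock edge.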

  13. Combinatorial vs. Sequential logic
     ◮ Combinatorial: the immediate value of a variable is a function of the immediate inputs:

         assign Z = X ^ Y;

     ◮ Sequential: logic is synchronous to a clock (involves a storage element, i.e. an edge-triggered flip-flop):

         always @(posedge clk)
             Z <= DAT;
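The same distinction can be sketched in C, purely as an analogy: combinatorial logic behaves like a pure function of its inputs, while sequential logic is a piece of state that changes only when a clock-edge function is called. The names `comb_xor`, `struct dff` and `dff_posedge` are assumptions made up for this sketch.

```c
#include <stdint.h>

/* Combinatorial logic: a pure function of the current inputs,
 * like "assign Z = X ^ Y;". */
uint8_t comb_xor(uint8_t x, uint8_t y)
{
    return (uint8_t)((x ^ y) & 1u);
}

/* Sequential logic: one bit of state that only changes on a clock edge,
 * like "always @(posedge clk) Z <= DAT;". */
struct dff {
    uint8_t q;   /* value currently held by the register */
};

/* Model of one rising clock edge: the register captures its input. */
void dff_posedge(struct dff *ff, uint8_t dat)
{
    ff->q = (uint8_t)(dat & 1u);
}
```

Between calls to `dff_posedge()` the stored value stays put no matter how the input changes, whereas `comb_xor()` follows its inputs immediately.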

  14. Verilog example
     ◮ Looks like C, based on C, but behaves differently
     ◮ Used a lot in Europe
     ◮ Example: CRC5, polynomial x^5 + x^2 + x^0
     ◮ Example modified from: http://www.asic-world.com/examples/verilog/serial_crc.html

         module crc5 (
             /* SYSTEM I/O */
             input reset,
             input clk,
             /* CRC5 I/O */
             input data,
             output reg [4:0] crc
         );

  15. Verilog example II

         always @(posedge clk) begin
             if (reset) begin
                 crc <= 5'b00000;
             end else begin
                 crc[0] <= data ^ crc[4];
                 crc[1] <= crc[0];
                 crc[2] <= crc[1] ^ data ^ crc[4];
                 crc[3] <= crc[2];
                 crc[4] <= crc[3];
             end
         end
         endmodule
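A software reference model is handy for checking the simulated or on-target CRC5 output. The sketch below is an assumption-laden illustration rather than part of the talk: `crc5_step_bits()` transliterates the five non-blocking assignments bit by bit, `crc5_step()` is the equivalent shift-and-XOR form, and `crc5_bits()` feeds a bit stream through the register. All three function names are made up here.

```c
#include <stddef.h>
#include <stdint.h>

/* One clock cycle of the CRC5 register, written bit by bit like the
 * non-blocking assignments in the Verilog module. */
uint8_t crc5_step_bits(uint8_t crc, uint8_t data)
{
    uint8_t fb = (uint8_t)((data ^ (crc >> 4)) & 1u);  /* data ^ crc[4] */
    uint8_t next = 0;

    next |= fb;                                        /* crc[0] <= data ^ crc[4] */
    next |= (uint8_t)((crc & 1u) << 1);                /* crc[1] <= crc[0] */
    next |= (uint8_t)((((crc >> 1) & 1u) ^ fb) << 2);  /* crc[2] <= crc[1] ^ data ^ crc[4] */
    next |= (uint8_t)(((crc >> 2) & 1u) << 3);         /* crc[3] <= crc[2] */
    next |= (uint8_t)(((crc >> 3) & 1u) << 4);         /* crc[4] <= crc[3] */
    return next;
}

/* The same step in shift-and-XOR form; 0x05 encodes the x^2 + x^0
 * terms of the polynomial x^5 + x^2 + x^0. */
uint8_t crc5_step(uint8_t crc, uint8_t data)
{
    uint8_t fb = (uint8_t)((data ^ (crc >> 4)) & 1u);

    crc = (uint8_t)((crc << 1) & 0x1fu);
    return fb ? (uint8_t)(crc ^ 0x05u) : crc;
}

/* Feed a bit stream (one bit per array element) through the register,
 * starting from `init`; the Verilog example resets to 0. */
uint8_t crc5_bits(uint8_t init, const uint8_t *bits, size_t nbits)
{
    uint8_t crc = (uint8_t)(init & 0x1fu);
    size_t i;

    for (i = 0; i < nbits; i++)
        crc = crc5_step(crc, bits[i]);
    return crc;
}
```

The `init` parameter is there so that a design with a different reset value can be modelled with the same code.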

  16. VHDL example
     ◮ Distinctive syntax based on Ada
     ◮ More explicit typing system than Verilog
     ◮ Used a lot in the USA
     ◮ Example: CRC5, polynomial x^5 + x^2 + x^0
     ◮ Example from http://outputlogic.com/?page_id=321

         library ieee;
         use ieee.std_logic_1164.all;

         entity crc is
             port (
                 data_in  : in  std_logic_vector (0 downto 0);
                 rst, clk : in  std_logic;
                 crc_out  : out std_logic_vector (4 downto 0));
         end crc;

  17. VHDL example II

         architecture imp_crc of crc is
             signal lfsr_q: std_logic_vector (4 downto 0);
             signal lfsr_c: std_logic_vector (4 downto 0);
         begin
             crc_out <= lfsr_q;
             lfsr_c(0) <= lfsr_q(4) xor data_in(0);
             lfsr_c(1) <= lfsr_q(0);
             lfsr_c(2) <= lfsr_q(1) xor lfsr_q(4) xor data_in(0);
             lfsr_c(3) <= lfsr_q(2);
             lfsr_c(4) <= lfsr_q(3);

             process (clk,rst) begin
                 if (rst = '1') then
                     lfsr_q <= b"11111";
                 elsif (clk'EVENT and clk = '1') then
                     lfsr_q <= lfsr_c;
                 end if;
             end process;
         end architecture imp_crc;

  18. Comparison to a GPU – II.

                           CPU          GPU           FPGA
     Languages             All          OpenCL, CUDA  OpenCL, HDLs
     Design paradigm       Sequential   Seq/Par       Parallel
     Design granularity    Instruction  Instruction   Gate
     Opt. possibility      Low          Low           High
     Opt. difficulty       Low          Low           High

  19. Development and debugging
     ◮ Simulation (on developer's system)
     ◮ Probing (on-target)

  20. Simulation
     ◮ Simulation tools:
        Icarus Verilog   http://iverilog.icarus.com/
        ghdl             http://home.gna.org/ghdl/
        ModelSim         http://en.wikipedia.org/wiki/ModelSim/
     ◮ Write testcase for a module in an augmented HDL
     ◮ Execute testcase
     ◮ Observe results
     ◮ View waveforms
     ◮ Decode and inspect busses
     ◮ Trigger on complex conditions
     ◮ ...

  21. Probing
     ◮ Used to observe design on target
     ◮ Think of this as a bus analyzer in the FPGA
     ◮ Probing tools (e.g. SignalTap)
     ◮ Design is augmented with a probing IP, FPGA is reprogrammed
     ◮ Probing is controlled through a debug probe attached to the FPGA (JTAG or similar)
     ◮ Probe internal signals, observe waveforms, trigger on complex conditions, ...

  22. Structuring the design
     ◮ HDL files: lowest in the hierarchy
     ◮ IP block: collection of HDL files with an interface
     ◮ FPGA design: collection of IP blocks
     ◮ Vendor suites include tools to assemble IP blocks into an FPGA design, e.g. Altera QSys

  23. Comparison to a GPU – III.

                   CPU    GPU               FPGA
     Simulation    QEMU   ?                 Icarus, ModelSim
     Debugger      GDB    CUDA-GDB, CodeXL  SignalTap

  24. Linux interface
     ◮ No standard in-kernel FPGA interface due to variance of designs
     ◮ Attempts do exist:
        ◮ Device Tree Overlay(s) stored in FPGA
        ◮ SDB – http://www.ohwr.org/projects/fpga-config-space
     ◮ Usually there are control registers in the FPGA design
     ◮ Usually DMA is involved (either on the FPGA or the CPU side)
     ◮ Two options for controlling the FPGA:
        ◮ Custom Linux kernel driver
        ◮ Userspace utility

  25. Custom kernel driver
     ◮ Driver written to match the particular FPGA bitstream
     ◮ Driver can crash the host machine if written badly :-(
     ◮ Driver usually exports custom userland I/O
     ◮ splice(2)

  26. Userland approach
     ◮ Userland accesses the FPGA registers via uio
     ◮ The uio is like a restricted devmem
     ◮ If DMA is involved, a kernel module is needed to prepare the data for the DMA (e.g. to ensure cache coherency)
     ◮ CMA might be used to export a large slab of custom kernel memory to userspace
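As a rough illustration of the uio route, a userspace program maps the register window with mmap(2) and then reads and writes it like memory. This is a minimal sketch, not the speaker's code: the helper names (`fpga_map`, `fpga_write32`, `fpga_read32`, `fpga_unmap`) are invented for this example, and the device node, mapping length and register offsets all depend on the actual FPGA design.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Map `len` bytes of register space from an already opened descriptor.
 * With UIO, `fd` would typically come from open("/dev/uio0", O_RDWR),
 * and offset 0 selects the device's first memory region.
 * Returns NULL on failure. */
volatile uint32_t *fpga_map(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* 32-bit register accessors; `off` is a byte offset into the mapping
 * and is assumed to be 4-byte aligned. */
void fpga_write32(volatile uint32_t *regs, size_t off, uint32_t val)
{
    regs[off / sizeof(uint32_t)] = val;
}

uint32_t fpga_read32(volatile uint32_t *regs, size_t off)
{
    return regs[off / sizeof(uint32_t)];
}

/* Undo fpga_map(). */
int fpga_unmap(volatile uint32_t *regs, size_t len)
{
    return munmap((void *)regs, len);
}
```

Taking a file descriptor instead of a path keeps the helpers independent of which uio node the design ends up on. The `volatile` qualifier keeps the compiler from caching or dropping register accesses; it does not deal with DMA buffers or cache coherency, which is the part the slide assigns to a kernel module.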
