
Data Processing on Modern Hardware. Jens Teubner, TU Dortmund, DBIS



  1. Data Processing on Modern Hardware. Jens Teubner, TU Dortmund, DBIS Group, jens.teubner@cs.tu-dortmund.de. Summer 2014. © Jens Teubner · Data Processing on Modern Hardware · Summer 2014

  2. Part VII: FPGAs for Data Processing

  3. Motivation. Modern hardware features a number of “speed-up tricks”: caches; instruction scheduling (out-of-order execution, branch prediction, ...); parallelism (SIMD, multi-core); throughput-oriented designs (GPUs). Combining these “tricks” is essentially an economic choice: → chip space ≡ €€€; → chip space ↔ component selection ↔ workload.

  4. Another Constraint: Power. Transistors can be used for either logic or caches. Source: Borkar and Chien. The Future of Microprocessors. CACM 2011. [Figure: total power dissipation (Watts) and cache size (MB) plotted against logic transistors (millions) for a 2008, 45 nm, 100 mm² chip. Case A: 0 logic, 16 MB of cache, 8 W. Case C: 50 MT logic, 6 MB cache. Case B is also marked.] → Power consumption limits the amount of logic that can be put on a chip.

  5. Heterogeneous Hardware. [Figure: three ways to spend the same transistor budget on cores. (a) Large-core homogeneous: 25 MT per core, large-core throughput 1, total throughput 6. (b) Small-core homogeneous: 5 MT per core, small-core throughput by Pollack’s Rule (5/25)^0.5 = 0.45, total throughput 13. (c) Mix of large and small cores: large-core throughput 1, small-core throughput (5/25)^0.5 = 0.45, total throughput 11.]
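
The three totals on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes core counts read off the figure (6 large cores; 30 small cores; 2 large plus 20 small cores), which all fit the same 150 MT budget. The function and variable names are illustrative and not part of the slides.

```python
import math

LARGE_MT = 25  # transistors per large core (millions), from the slide
SMALL_MT = 5   # transistors per small core (millions), from the slide


def core_throughput(core_mt, ref_mt=LARGE_MT):
    """Pollack's Rule: performance scales with the square root of core size."""
    return math.sqrt(core_mt / ref_mt)


def total_throughput(n_large, n_small):
    """Sum the per-core throughput of all cores on the chip."""
    return (n_large * core_throughput(LARGE_MT)
            + n_small * core_throughput(SMALL_MT))


# Assumed core counts per panel: (large cores, small cores).
configs = {"a": (6, 0), "b": (0, 30), "c": (2, 20)}
totals = {name: total_throughput(*cores) for name, cores in configs.items()}

for name, (n_large, n_small) in configs.items():
    budget = n_large * LARGE_MT + n_small * SMALL_MT
    print(f"({name}) budget {budget} MT, total throughput {totals[name]:.1f}")
```

Rounded to whole numbers, the results agree with the totals 6, 13, and 11 shown in the figure.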

  6. Field-Programmable Gate Arrays. Field-Programmable Gate Arrays (FPGAs) are yet another point in the design space: “programmable hardware.” Make (some) design decisions after chip fabrication. Promises of FPGA technology: build an application-/workload-specific circuit; spend chip space only on functionality that you really need; tune for throughput, latency, energy consumption, ...; overcome limits of general-purpose hardware with regard to the task at hand (e.g., I/O limits).

  7. Field-Programmable Gate Arrays. An array of logic gates. Functionality is fully programmable. Re-programmable after deployment (“in the field”) → “programmable hardware”. FPGAs can be configured to implement any logic circuit. Complexity is bound by the available chip space. → Obviously, the effective chip space is less than in custom-fabricated chips (ASICs).

  8. Basic FPGA Architecture. Chip layout: 2D array. Components: CLB: Configurable Logic Block (“logic gates”); IOB: Input/Output Block; DCM: Digital Clock Manager. Interconnect network: signal lines and configurable switch boxes. [Figure: a 4×4 grid of CLBs surrounded by IOBs, with DCMs on the top and bottom edges.]

  9. Signal Routing. [Figure: a switch box joins bundles of lines; each programmable intersection point is a programmable switch with a memory cell (SRAM cell).]

  10. Configurable Logic Block (CLB). [Figure: a 4-LUT with inputs in0 to in3, a D flip-flop with a clock input, a multiplexer, SRAM cells, and the output out.] The 4-LUT implements a {0,1}^4 → {0,1} function; the flip-flop stores a single bit.
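
A 4-LUT can be modeled in software as a 16-entry truth table: the four inputs form an address, and the configuration bits hold the function value for each address. The following is a minimal sketch under that assumption; the helper names `make_lut4` and `lut4` are hypothetical and do not correspond to any vendor API.

```python
def make_lut4(func):
    """Build the 16-bit configuration word of a 4-LUT from a Boolean function."""
    config = 0
    for addr in range(16):
        # in0 is the least significant address bit.
        in0, in1, in2, in3 = ((addr >> k) & 1 for k in range(4))
        if func(in0, in1, in2, in3):
            config |= 1 << addr
    return config


def lut4(config, in0, in1, in2, in3):
    """Evaluate the LUT: the inputs select one of the 16 configuration bits."""
    addr = in0 | (in1 << 1) | (in2 << 2) | (in3 << 3)
    return (config >> addr) & 1


# Two example configurations: a 4-input AND and a 4-input XOR (parity).
and4 = make_lut4(lambda a, b, c, d: a & b & c & d)
xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)

print(hex(and4), hex(xor4))
print(lut4(and4, 1, 1, 1, 1), lut4(xor4, 1, 0, 0, 0))
```

Changing the 16 configuration bits changes the function without touching the evaluation logic, which is the sense in which the block is “configurable”.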

  11. Programming FPGAs. Programming is usually done using a hardware description language, e.g., VHDL (VHSIC Hardware Description Language) or Verilog. High-level circuit description. The circuit description is compiled into a bitstream, then loaded into SRAM cells on the FPGA: VHDL → (synthesis) → netlist → (map, place & route) → bitstream → FPGA.

  12. Example: VHDL. HDLs enable programming language-like descriptions of hardware circuits.

        architecture Behavioral of compare is
        begin
          -- C is '1' exactly when A equals B
          process (A, B)
          begin
            if (A = B) then
              C <= '1';
            else
              C <= '0';
            end if;
          end process;
        end Behavioral;

      VHDL can be synthesized, but also executed in software (simulation).

  13. Real-World Hardware. Simplified Virtex-5 XC5VFXxxxT floor plan. Frequently used high-level components are provided in discrete silicon. BlockRAM (BRAM): set of blocks that each store up to 36 kbits of data. DSP48 slices: 25x18-bit multipliers followed by a 48-bit accumulator. CPU: two full embedded PowerPC 440 cores (CPU 0 and CPU 1 in the floor plan).

  14. Development Board with Virtex-5 FPGA. Virtex-5 XC5VLX110T: Lookup Tables (LUTs): 69,120; Block RAM (kbit): 5,328; DSP48 Slices: 64; PowerPC Cores: 0; max. clock speed: ≈ 450 MHz; release year: 2006. Source: Xilinx Inc., ML50x Evaluation Platform. User Guide. Low-level speed of configurable gates is slower than in custom-fabricated chips (clock frequencies: ∼100 MHz). → Compensate with an efficient circuit for the problem at hand.

  15. State Machines. The key asset of FPGAs is their inherent parallelism. Chip areas naturally operate independently and in parallel. For example, consider finite-state automata. [Figure: non-deterministic automaton for .*abc.*d with states q0 to q4, transitions labeled a, b, c, d, and two * self-loops.]

  16. State Machines. How would you implement an automaton in software? Problems with state machine implementations in software: In non-deterministic automata, several states can be active at a time, which requires iterative execution on sequential hardware. Deterministic automata avoid this problem at the expense of a significantly higher state count.
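
One possible software answer is to keep the set of active states and visit every transition for each input symbol. The sketch below does this for the automaton for .*abc.*d, assuming the * self-loops sit on q0 and q3; state numbering and names are illustrative. The inner loop is the iterative execution that sequential hardware forces.

```python
WILDCARD = None  # edge label that matches any input symbol

# Edges (source, symbol, target) of the NFA for .*abc.*d; 0..4 stand for q0..q4.
TRANSITIONS = [
    (0, WILDCARD, 0),
    (0, "a", 1),
    (1, "b", 2),
    (2, "c", 3),
    (3, WILDCARD, 3),
    (3, "d", 4),
]
START, ACCEPT = 0, 4


def nfa_matches(text):
    """Return the 0-based input positions at which q4 becomes active."""
    active = {START}
    hits = []
    for pos, ch in enumerate(text):
        nxt = set()
        # Sequential hardware has to visit the transitions one after another.
        for src, sym, dst in TRANSITIONS:
            if src in active and (sym is WILDCARD or sym == ch):
                nxt.add(dst)
        active = nxt
        if ACCEPT in active:
            hits.append(pos)
    return hits


print(nfa_matches("xxabcyyd"))
print(nfa_matches("abd"))
```

The work per input symbol grows with the number of transitions, whereas a circuit evaluates all transitions in the same clock cycle.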

  17. State Machines in Hardware. Automata can be translated mechanically into hardware circuits. Each state → flip-flop. (A flip-flop holds a single bit of information: just the right amount to keep the ‘active’/‘not active’ information.) Transitions → signals (“wires”) between states, conditioned on the current input symbol (→ ‘and’ gate). Multiple sources for one flip-flop input → ‘or’ gate.

  18. State Machines in Hardware. [Figure: the automaton for .*abc.*d (states q0 to q4) and the corresponding circuit: one flip-flop (FF) per state, ‘and’ gates that combine a source state with one of the comparisons input = a, input = b, input = c, input = d, and ‘or’ gates in front of flip-flops with several sources.]
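
The circuit can be mimicked in software by treating each flip-flop as one bit and computing all next-state bits from the current ones, as the gates would within one clock cycle. This is a minimal sketch, assuming the * self-loops sit on q0 and q3; it models the logic only, not the exact gate layout of the figure, and the function names are illustrative.

```python
def clock_cycle(ff, ch):
    """One clock tick: derive every flip-flop input from the current outputs."""
    q0, q1, q2, q3, q4 = ff
    return (
        q0,                       # * self-loop: 'and' with an always-true test
        q0 & (ch == "a"),         # q0 --a--> q1
        q1 & (ch == "b"),         # q1 --b--> q2
        (q2 & (ch == "c")) | q3,  # 'or' of q2 --c--> q3 and the * self-loop
        q3 & (ch == "d"),         # q3 --d--> q4
    )


def run(text):
    """Feed one symbol per clock cycle and record the output of flip-flop q4."""
    ff = (1, 0, 0, 0, 0)  # only the start state q0 is active initially
    out = []
    for ch in text:
        ff = clock_cycle(ff, ch)
        out.append(ff[4])
    return out


print(run("abcd"))
print(run("abd"))
```

Each tuple element depends only on the previous tuple, so in hardware all five expressions are evaluated at the same time.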

  19. [Figure: flip-flop consumption (in %) and LUT consumption (in %) of NFA, DFA, and compressed DFA implementations of (0|1)*1(0|1)^i, for i = 0 to 10.]
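
The state counts behind this comparison can be checked in software. The sketch below builds an NFA for (0|1)*1(0|1)^i with i + 2 states and counts the states reachable in the standard subset construction; the encoding and the function names are assumptions made for illustration, and the numbers are state counts, not the resource percentages plotted in the figure.

```python
def nfa_state_count(i):
    """NFA for (0|1)*1(0|1)^i: start state plus a chain of i + 1 states."""
    return i + 2


def dfa_state_count(i):
    """Count the subsets reachable from {0} in the subset construction."""

    def step(states, sym):
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)          # (0|1)* self-loop on the start state
                if sym == "1":
                    nxt.add(1)      # the mandatory '1'
            elif s <= i:
                nxt.add(s + 1)      # one of the i trailing symbols
            # state i + 1 is accepting and has no outgoing edges
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    todo = [start]
    while todo:
        current = todo.pop()
        for sym in "01":
            nxt = step(current, sym)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


for i in range(0, 11, 2):
    print(i, nfa_state_count(i), dfa_state_count(i))
```

The NFA grows linearly in i while the deterministic automaton doubles with every step, which is the cost of determinism mentioned for software implementations.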

  20. Use Case: Network Intrusion Detection. Analyze network traffic using regular expressions: scan for known attack tools; prevent exploitation of known security holes; scan for shell code. E.g., Snort (http://www.snort.org/) → hundreds of (regular expression-based) rules. Idea: instantiate a hardware state machine for each rule. → Leverage available hardware parallelism. → Challenge: optimize for high throughput.

  21. Predicate Decoding. Optimization 1: centralized character classification. [Figure: a single decoder classifies the input character into the predicates a to d, which feed the ‘and’ gates of the state machine for .*abc.*d (states q0 to q4, one flip-flop each).] → Optimizes for space, not for speed. Character/predicate decoder: use FPGA logic resources or use on-chip BRAM (configure as ROM and use as lookup table).
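
The lookup-table variant of the decoder can be sketched in software as a 256-entry table that maps each input byte to one bit per predicate, so that the state machine only sees pre-decoded bits. The table layout, the predicate set, and all names below are assumptions for illustration; a real design would size the ROM for the predicates of all rules.

```python
SYMBOLS = "abcd"  # predicates needed by the automaton for .*abc.*d

# ROM contents: one entry per byte value, bit k set if the byte is SYMBOLS[k].
DECODER_ROM = [
    sum(1 << k for k, s in enumerate(SYMBOLS) if byte == ord(s))
    for byte in range(256)
]


def step_with_predicates(ff, pred):
    """Next-state logic that consumes decoded predicate bits, not characters."""
    is_a, is_b, is_c, is_d = ((pred >> k) & 1 for k in range(4))
    q0, q1, q2, q3, q4 = ff
    return (q0, q0 & is_a, q1 & is_b, (q2 & is_c) | q3, q3 & is_d)


def scan(data):
    """Decode each byte once, then drive the state machine with the result."""
    ff = (1, 0, 0, 0, 0)
    out = []
    for byte in data:
        ff = step_with_predicates(ff, DECODER_ROM[byte])
        out.append(ff[4])
    return out


print(DECODER_ROM[ord("a")], DECODER_ROM[ord("c")], DECODER_ROM[ord("x")])
print(scan(b"abcd"))
```

With many state machines, every machine would read the same decoded bits, so the character comparison logic exists once instead of once per transition.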

  22. Predicate Decoding Factored Out. [Figure: resource consumption (in %), measured in LUTs and slices, with and without a decoder, for the expression (A B)^i with i from 0 to 250.]

  23. Signal Propagation Delay. Signal propagation delays determine a circuit’s speed. Here: one state transition per clock cycle. Longest signal path → maximum clock frequency. [Figure: timing diagram of CLK, register input, and register output. The register is written at the rising clock edge; the register input must be stable at the rising clock edge and may be undefined in between.]

  24. Propagation Delays and Many State Machines. Straightforward design with many rules and one input. [Figure: a single input fans out to NFA 1 to NFA 6; their outputs are combined by an ‘or’ gate into one output.]

  25. Pipelining. Optimization 2: pipelining. → What matters is the longest path between any two registers (flip-flops). [Figure: the longest path runs from the input through NFA 1 to NFA 6 and the ‘or’ gate to the output.] → Introduce pipeline registers. → Flip side of the idea?
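
The effect of pipeline registers on the clock frequency can be illustrated with a toy timing model. All delays below are hypothetical numbers chosen for easy arithmetic, not measurements, and the three-stage split (input fan-out, NFA logic, output ‘or’) is an assumption about where registers might be placed.

```python
# Hypothetical combinational delays in nanoseconds (not measured values).
STAGE_DELAYS_NS = {"input fan-out": 4.0, "NFA logic": 3.0, "output or": 3.0}


def f_max_mhz(longest_path_ns):
    """The longest register-to-register path bounds the clock frequency."""
    return 1000.0 / longest_path_ns


# Without pipeline registers a signal crosses all stages in one cycle.
unpipelined_path_ns = sum(STAGE_DELAYS_NS.values())
unpipelined_cycles = 1

# With a register after each stage, only the slowest stage limits the clock.
pipelined_path_ns = max(STAGE_DELAYS_NS.values())
pipelined_cycles = len(STAGE_DELAYS_NS)

print("unpipelined:", f_max_mhz(unpipelined_path_ns), "MHz,",
      unpipelined_cycles * unpipelined_path_ns, "ns until a result appears")
print("pipelined:  ", f_max_mhz(pipelined_path_ns), "MHz,",
      pipelined_cycles * pipelined_path_ns, "ns until a result appears")
```

In this toy model the clock rate rises from 100 MHz to 250 MHz, while a result needs three cycles (12 ns) instead of one (10 ns) to appear and additional registers are spent; this is one candidate answer to the question on the slide.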
