
Data Processing on Modern Hardware
Jens Teubner, TU Dortmund, DBIS Group
jens.teubner@cs.tu-dortmund.de
Summer 2014
© Jens Teubner · Data Processing on Modern Hardware · Summer 2014

Part VII
FPGAs for Data Processing

Motivation
Modern hardware features a number of “speed-up tricks”:
- caches,
- instruction scheduling (out-of-order exec., branch prediction, ...),
- parallelism (SIMD, multi-core),
- throughput-oriented designs (GPUs).
Combining these “tricks” is essentially an economic choice:
→ chip space ≡ €€€
→ chip space ↔ component selection ↔ workload

Another Constraint: Power
Can use transistors for either logic or caches.
[Figure: total power dissipation (watts) and cache size (MB) as functions of the number of logic transistors (millions), for a 100 mm² chip in 45 nm technology (2008). Case A: 0 logic transistors, 16 MB of cache, 8 W. Case C: 50 MT logic, 6 MB cache. Case B is also marked. Source: Borkar and Chien. The Future of Microprocessors. CACM 2011.]
→ Power consumption limits the amount of logic that can be put on a chip.

Heterogeneous Hardware
[Figure: three ways to fill a chip with cores. Large cores have 25 MT each, small cores 5 MT each.
(a) Large-core homogeneous: large-core throughput 1, total throughput 6.
(b) Small-core homogeneous: small-core throughput by Pollack’s Rule (5/25)^0.5 = 0.45, total throughput 13.
(c) Mix of large and small cores: large-core throughput 1, small-core throughput by Pollack’s Rule (5/25)^0.5 = 0.45, total throughput 11.]
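As a rough check of the totals in the figure, Pollack's Rule (throughput grows with the square root of the transistor count) can be evaluated directly. This is only a sketch: the core counts used below (6 large; 30 small; 2 large plus 20 small) and all function names are assumptions read off the figure, not stated in the text.

```python
import math

LARGE_MT = 25  # transistors per large core, in millions (from the figure)
SMALL_MT = 5   # transistors per small core, in millions (from the figure)

def pollack_throughput(core_mt, reference_mt=LARGE_MT):
    """Relative throughput of one core: square root of its relative size."""
    return math.sqrt(core_mt / reference_mt)

def total_throughput(n_large, n_small):
    """Sum of the per-core throughputs for a given core mix."""
    return (n_large * pollack_throughput(LARGE_MT)
            + n_small * pollack_throughput(SMALL_MT))

# Assumed core counts; each mix adds up to 150 MT of logic.
scenarios = {"(a)": (6, 0), "(b)": (0, 30), "(c)": (2, 20)}
for name, (n_large, n_small) in scenarios.items():
    print(name, round(total_throughput(n_large, n_small), 1))
```

Under these assumed counts the rounded totals agree with the 6, 13, and 11 shown in the figure.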

Field-Programmable Gate Arrays
Field-Programmable Gate Arrays (FPGAs) are yet another point in the design space.
- “Programmable hardware.”
- Make (some) design decisions after chip fabrication.
Promises of FPGA technology:
- Build application-/workload-specific circuit.
- Spend chip space only on functionality that you really need.
- Tune for throughput, latency, energy consumption, ...
- Overcome limits of general-purpose hardware with regard to task at hand (e.g., I/O limits).

Field-Programmable Gate Arrays
- An array of logic gates
- Functionality fully programmable
- Re-programmable after deployment (“in the field”)
→ “programmable hardware”
FPGAs can be configured to implement any logic circuit.
- Complexity bound by available chip space.
→ Obviously, the effective chip space is less than in custom-fabricated chips (ASICs).

Basic FPGA Architecture
[Figure: chip layout as a 2D array of CLBs, surrounded by IOBs and DCMs along the edges.]
Chip layout: 2D array
Components:
- CLB: Configurable Logic Block (“logic gates”)
- IOB: Input/Output Block
- DCM: Digital Clock Manager
Interconnect network:
- signal lines
- configurable switch boxes

Signal Routing
[Figure: a programmable switch box connecting bundles of lines. Each programmable intersection point is a programmable switch with a memory cell (SRAM cell).]

Configurable Logic Block (CLB)
[Figure: a 4-LUT with inputs in0 to in3, a D flip-flop driven by the clock, and a multiplexer producing out; SRAM cells hold the configuration.]
- 4-LUT: implements a {0,1}^4 → {0,1} function
- Flip-flop: stores a single bit
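To make the 4-LUT concrete: a function {0,1}^4 → {0,1} is fully described by its 16 output bits, so configuring a LUT amounts to storing a 16-entry truth table and using the four inputs as the address. The following is a minimal software sketch of that idea; the names `make_lut4` and `lut4_eval` are hypothetical, and the bit ordering is an arbitrary choice.

```python
def make_lut4(func):
    """Precompute the 16 configuration bits of a 4-input LUT for `func`."""
    return [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) & 1
            for i in range(16)]

def lut4_eval(config, in0, in1, in2, in3):
    """Evaluate the LUT: the four input bits form the table address."""
    return config[(in0 << 3) | (in1 << 2) | (in2 << 1) | in3]

# Two example configurations: 4-input 'and' and 4-input parity (xor).
and4 = make_lut4(lambda a, b, c, d: a & b & c & d)
xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)

print(lut4_eval(and4, 1, 1, 1, 1), lut4_eval(xor4, 1, 0, 0, 0))
```

Since the table has 16 bits, a single 4-LUT can be configured as any of the 2^16 possible four-input Boolean functions.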

Programming FPGAs
Programming is usually done using a hardware description language.
- E.g., VHDL (VHSIC Hardware Description Language), Verilog
- High-level circuit description
Circuit description is compiled into a bitstream, then loaded into SRAM cells on the FPGA:
VHDL → synthesis → netlist → map, place & route → bitstream → FPGA

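Example: VHDL
HDLs enable programming language-like descriptions of hardware circuits.

    library ieee;
    use ieee.std_logic_1164.all;

    -- The entity is not shown on the slide; the 8-bit port width is an assumption.
    entity compare is
      port (A, B : in  std_logic_vector(7 downto 0);
            C    : out std_logic);
    end compare;

    -- Combinational equality comparator: C is '1' exactly when A = B.
    architecture Behavioral of compare is
    begin
      process (A, B)  -- re-evaluated whenever A or B changes
      begin
        if (A = B) then
          C <= '1';
        else
          C <= '0';
        end if;
      end process;
    end Behavioral;

VHDL can be synthesized, but also executed in software (simulation).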

Real-World Hardware
[Figure: simplified Virtex-5 XC5VFXxxxT floor plan, including CPU 0 and CPU 1.]
Frequently used high-level components are provided in discrete silicon:
- BlockRAM (BRAM): set of blocks that each store up to 36 kbits of data
- DSP48 slices: 25x18-bit multipliers followed by a 48-bit accumulator
- CPU: two full embedded PowerPC 440 cores

Development Board with Virtex-5 FPGA
Virtex-5 XC5VLX110T
  Lookup Tables (LUTs)     69,120
  Block RAM (kbit)          5,328
  DSP48 Slices                 64
  PowerPC Cores                 0
  max. clock speed      ≈ 450 MHz
  release year               2006
source: Xilinx Inc., ML50x Evaluation Platform. User Guide.
- Low-level speed of configurable gates is slower than in custom-fabricated chips (clock frequencies: ∼ 100 MHz).
→ Compensate with efficient circuit for problem at hand.

State Machines
The key asset of FPGAs is their inherent parallelism.
- Chip areas naturally operate independently and in parallel.
For example, consider finite-state automata.
[Figure: non-deterministic automaton for .*abc.*d with states q0 to q4, transitions labeled a, b, c, d between consecutive states, and two * self-loops.]

State Machines
How would you implement an automaton in software?
Problems with state machine implementations in software:
- In non-deterministic automata, several states can be active at a time, which requires iterative execution on sequential hardware.
- Deterministic automata avoid this problem at the expense of a significantly higher state count.
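One possible software answer, shown here only as a sketch, keeps a set of active states and loops over it for every input symbol. That loop is the "iterative execution" mentioned above. The transition table encodes the automaton for .*abc.*d from the earlier slide; placing the two * loops on q0 and q3 is an assumption, and all names are hypothetical.

```python
# Transitions of the automaton for .*abc.*d.
# None stands for a '*' transition that fires on any symbol.
TRANSITIONS = {
    0: [(None, 0), ("a", 1)],
    1: [("b", 2)],
    2: [("c", 3)],
    3: [(None, 3), ("d", 4)],
    4: [],
}
ACCEPT = 4

def step(active, symbol):
    """Advance every active state by one input symbol."""
    nxt = set()
    for state in active:  # sequential loop over all active states
        for label, target in TRANSITIONS[state]:
            if label is None or label == symbol:
                nxt.add(target)
    return nxt

def matches(text):
    """True if some prefix of `text` matches .*abc.*d."""
    active = {0}
    for symbol in text:
        active = step(active, symbol)
        if ACCEPT in active:
            return True
    return False

print(matches("xxabcyyd"), matches("abd"))
```

The work per input symbol grows with the number of active states, which is the cost a sequential machine pays for non-determinism.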

State Machines in Hardware
Automata can be translated mechanically into hardware circuits.
- each state → flip-flop
  (A flip-flop holds a single bit of information. Just the right amount to keep the ‘active’/‘not active’ information.)
- transitions → signals (“wires”) between states
  - conditioned on current input symbol (→ ‘and’ gate)
  - multiple sources for one flip-flop input → ‘or’ gate.
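The translation rules above can be mimicked in software by keeping one Boolean per state and computing all next values from the current ones in a single step, as a clock edge would. This is a sketch for the .*abc.*d automaton only, assuming the * loops sit on q0 and q3; `clock_tick` and `run` are hypothetical names.

```python
def clock_tick(q, symbol):
    """One rising clock edge: all flip-flops are updated at once
    from the current flip-flop values and the current input symbol."""
    q0, q1, q2, q3, q4 = q
    return (
        True,                           # q0: start state with '*' loop, stays active
        q0 and symbol == "a",           # 'and' gate: source state and (input = a)
        q1 and symbol == "b",
        (q2 and symbol == "c") or q3,   # 'or' gate: two sources feed q3
        q3 and symbol == "d",
    )

def run(text):
    """Feed one symbol per clock cycle; report whether q4 ever becomes active."""
    q = (True, False, False, False, False)
    for symbol in text:
        q = clock_tick(q, symbol)
        if q[4]:
            return True
    return False

print(run("xxabcyyd"), run("abd"))
```

Unlike a loop over a set of active states, every line of `clock_tick` corresponds to gates that operate concurrently in the circuit, so the cost per input symbol does not depend on how many states are active.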

State Machines in Hardware
[Figure: the automaton for .*abc.*d (states q0 to q4) and its circuit. Each state is a flip-flop (FF). Comparators “input = a?”, “input = b?”, “input = c?”, “input = d?” feed ‘and’ gates on the transitions, and ‘or’ gates merge multiple sources for one flip-flop.]

[Figure: flip-flop consumption (%) and LUT consumption (%) of NFA, DFA, and compressed DFA implementations, plotted over i in (0|1)*1(0|1)^i for i = 0 to 10.]
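The state-count gap between NFA and DFA for the expression (0|1)*1(0|1)^i can be reproduced with a small subset construction. The sketch below counts states only; it says nothing about the flip-flop or LUT percentages in the plot, and the state numbering and function names are assumptions.

```python
def nfa_step(states, symbol, i):
    """NFA for (0|1)*1(0|1)^i with states 0..i+1; state i+1 is accepting."""
    nxt = set()
    for s in states:
        if s == 0:
            nxt.add(0)          # (0|1)* loop on the start state
            if symbol == "1":
                nxt.add(1)      # the distinguished '1'
        elif s <= i:
            nxt.add(s + 1)      # one of the i trailing (0|1) symbols
    return frozenset(nxt)

def dfa_state_count(i):
    """Number of DFA states reachable by subset construction."""
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        current = todo.pop()
        for symbol in "01":
            nxt = nfa_step(current, symbol, i)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)

for i in range(5):
    print(i, "NFA states:", i + 2, "DFA states:", dfa_state_count(i))
```

For this family the NFA needs i + 2 states while the subset construction reaches 2^(i+1) DFA states, which illustrates the "significantly higher state count" of deterministic automata.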

Use Case: Network Intrusion Detection
Analyze network traffic using regular expressions.
- Scan for known attack tools.
- Prevent exploitation of known security holes.
- Scan for shell code.
E.g., Snort (http://www.snort.org/)
→ Hundreds of (regular expression-based) rules.
Idea: Instantiate a hardware state machine for each rule.
→ Leverage available hardware parallelism.
→ Challenge: optimize for high throughput.

Predicate Decoding
Optimization 1: Centralized character classification
[Figure: a central character decoder (outputs a to d) drives the ‘and’ gates of the automaton circuit for .*abc.*d with states q0 to q4 and one flip-flop per state.]
→ Optimizes for space, not for speed.
Character/predicate decoder:
- Use FPGA logic resources, or
- use on-chip BRAM (configure as ROM and use as lookup table).
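The lookup-table variant of the decoder can be sketched in software as a 256-entry table indexed by the input byte, where each entry holds one bit per predicate. The predicate set, the bit layout, and the names below are assumptions chosen to match the .*abc.*d example.

```python
PREDICATES = ["a", "b", "c", "d"]  # symbols tested by the example automaton

# 256-entry table, analogous to a BRAM configured as ROM:
# the entry for byte value v has bit k set if v equals PREDICATES[k].
DECODER_ROM = [
    sum(1 << k for k, ch in enumerate(PREDICATES) if v == ord(ch))
    for v in range(256)
]

def decode(byte):
    """One lookup per input byte; the result is shared by all automata."""
    return DECODER_ROM[byte]

def is_set(signals, symbol):
    """Check the predicate bit for `symbol` in a decoded signal word."""
    return bool((signals >> PREDICATES.index(symbol)) & 1)

signals = decode(ord("c"))
print(signals, is_set(signals, "c"), is_set(signals, "a"))
```

Each automaton then consumes the shared predicate bits instead of comparing the raw input itself, which is where the space saving comes from.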

Predicate Decoding Factored Out
[Figure: resource consumption (%) in LUTs and slices, without a decoder and with a decoder, plotted over i in (A B)^i for i = 0 to 250.]

Signal Propagation Delay
Signal propagation delays determine a circuit’s speed.
Here: One state transition per clock cycle.
Longest signal path → maximum clock frequency
[Figure: timing diagram with CLK, register input, and register output. The register is written at the rising clock edge; its input must be stable at the rising clock edge and may be undefined in between.]

Propagation Delays and Many State Machines
Straightforward design with many rules and one input:
[Figure: a single input is distributed to NFA 1 through NFA 6; their results are combined by an ‘or’ into one output.]

Pipelining
Optimization 2: Pipelining
→ What matters is the longest path between any two registers (flip-flops).
[Figure: the design with NFA 1 through NFA 6 between input and ‘or’ output, with the longest path marked.]
→ Introduce pipeline registers.
→ Flip side of the idea?
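The relation between the longest register-to-register path and the clock frequency can be illustrated with a small calculation. The delay values below are hypothetical and chosen only for round numbers; the model ignores register setup times and clock skew. It also hints at one commonly cited cost of pipelining, namely added latency in clock cycles.

```python
def max_clock_mhz(stage_delays_ns):
    """The clock period must cover the slowest register-to-register path."""
    return 1000.0 / max(stage_delays_ns)

def latency_ns(stage_delays_ns):
    """Every stage takes one full clock period, set by the slowest stage."""
    return len(stage_delays_ns) * max(stage_delays_ns)

unpipelined = [10.0]          # hypothetical: one 10 ns path from input to output
pipelined = [4.0, 3.0, 3.0]   # hypothetical: same logic split by two pipeline registers

for name, stages in (("unpipelined", unpipelined), ("pipelined", pipelined)):
    print(name, max_clock_mhz(stages), "MHz,", latency_ns(stages), "ns latency")
```

With these made-up numbers the pipelined design clocks 2.5 times faster, while a single input now needs three cycles (12 ns instead of 10 ns) to reach the output.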
