
FPGAs - PowerPoint PPT Presentation



  1. The accelerator concept: a second processor specialized for a particular computation. Examples: GPUs — vector computations; FPGAs — ??? ; custom chips — ??? (next week). FPGAs take milliseconds+ to reconfigure. Reconfigurable hardware versus a ‘normal’ processor: flexible, slower functional units vs. fixed, fast functional units; lots of routing vs. lots of control logic; one set of wirings vs. fetching 1+ instruction/cycle; reconfigurable HW vs. a stream of instructions. This day’s papers: Putnam et al., “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services”. To read more (no review required): Brown and Rose, “Architecture of FPGAs and CPLDs: A Tutorial”.

  2. FPGA structure: Brown and Rose, Figure 2. FPGA programs: RTL, e.g., Verilog. Everything happens in parallel every cycle; you manually specify what’s in registers, etc.; the RTL determines the wiring between gates, registers, and memories; the same languages are used to design real processors. RTL example:

     module counter(clock, reset, value);
       input clock;
       input reset;
       output [32:0] value;
       reg [32:0] count;
       always @(posedge reset or posedge clock)
         if (reset) begin
           count <= 0;
         end else begin
           count <= count + 1'b1;
         end
       assign value = count;
     endmodule

     A note about HW programming: it is not intuitive, and there have been attempts at easier interfaces. “Schematic capture” — drawing the circuit diagram — was once common but doesn’t seem great at scale. Higher-level tools, e.g., Chisel (a Berkeley research project), compile to RTL and are used at scale. Automatic translation of a C-like language (“C to gates”) has a very mixed reputation — it is a very hard compilers problem (but see the Aladdin paper).
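A software sketch of what the Verilog above means (my own illustration, not from the slides): on each rising clock edge, every register updates in parallel from the values computed before the edge, and `reset` forces the count back to zero.

```python
# Simulate the counter's RTL semantics: `count` models `reg [32:0] count`,
# and `assign value = count` is combinational, so `value` always shows the
# register's current contents.

def simulate_counter(cycles, reset_at=()):
    count = 0
    values = []
    for cycle in range(cycles):
        values.append(count)  # combinational output this cycle
        # posedge clock: the non-blocking assignment picks the next value;
        # a reset in this cycle clears the register instead.
        count = 0 if cycle in reset_at else (count + 1) % (1 << 33)
    return values

print(simulate_counter(5))                # [0, 1, 2, 3, 4]
print(simulate_counter(5, reset_at={2}))  # [0, 1, 2, 0, 1]
```

The key difference from software: in the real hardware all such registers update simultaneously every cycle, rather than one statement at a time.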

  3. FPGA design pipeline: RTL compiles to a “gate list”; place and route then turns that into which components in the FPGA to connect. This is not straightforward — hours+ to compute if the FPGA is nearly full — and it affects performance (longer wires/more switches). Programmable switches, example (Brown and Rose, Figures 5 and 7): a transistor plus an SRAM cell. The SRAM cell continuously outputs its stored value (an SRAM cell ≈ a 1-bit register) and can be written by a separate circuit (not shown).
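A toy model (mine, not from Brown and Rose) of the switch just described: a stored SRAM bit continuously drives the gate of a pass transistor, deciding whether two wire segments are connected.

```python
# One programmable routing switch: configuring the FPGA writes the SRAM
# bit; afterwards the bit passively controls whether signals pass through.

class RoutingSwitch:
    def __init__(self):
        self.sram_bit = 0  # written by a separate configuration circuit

    def configure(self, bit):
        self.sram_bit = bit  # "loading a value into memory"

    def propagate(self, wire_a):
        # Switch on: the signal passes through to the other segment.
        # Switch off: the output segment is disconnected (None here).
        return wire_a if self.sram_bit else None

sw = RoutingSwitch()
sw.configure(1)
print(sw.propagate(1))  # 1 — segments connected
sw.configure(0)
print(sw.propagate(1))  # None — segments disconnected
```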

  4. FPGA routing example; FPGA logic block examples (1) and (2). FPGA configuration determines what to do for every switch; configuring the FPGA is just loading values into the memory that controls each switch.
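A minimal sketch of how a logic block gets its function (assuming a lookup-table-based block, as in the Brown and Rose examples): the “program” is just a truth table loaded into a tiny memory, indexed by the input signals.

```python
# Configure a 2-input LUT: truth_table[i] is the output for the input
# combination whose bits encode i. The same hardware computes different
# functions depending only on the loaded bits.

def make_lut(truth_table):
    def lut(a, b):
        return truth_table[(b << 1) | a]
    return lut

xor_gate = make_lut([0, 1, 1, 0])  # bits configure the LUT as XOR
and_gate = make_lut([0, 0, 0, 1])  # same structure, different bits: AND

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([xor_gate(a, b) for a, b in inputs])  # [0, 1, 1, 0]
print([and_gate(a, b) for a, b in inputs])  # [0, 0, 0, 1]
```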

  5. FPGA efficiency: most transistors perform routing (or hold configuration RAM), not computation; signal paths are much longer than in CPUs, so clock rates are slower for the same task; development tool usefulness/quality is not great. Review comments: programmability? other large-scale deployments? versus/combined with GPUs/CPUs? what are FPGAs good for anyways? Catapult challenges: failure handling; centralized allocation; programs spread across multiple FPGAs need fast FPGA-to-FPGA communication; physical space; power density (cooling, power distribution); cost (only 10% more???); datacenter logistics. FPGA: more complex logic — many FPGAs include specialized fixed functionality: adders, multipliers; floating-point units; common DSP computations; full embedded-class CPU cores; … these could be implemented using the fully programmable logic, but slower/bigger.

  6. Search engine architecture: a search query goes through the query cache to a top-level aggregator (TLA), then to mid-level aggregators (MLAs), then to index shards; the index shards send documents and the query to the ranking service, which returns rankings. Catapult roles: run the ranking service; a precise duplication of the existing software; hand-coded Verilog (RTL language); hand-partitioned across FPGAs? The Shell takes 23% of the FPGA’s (configurable) area. CPU-to-FPGA transfers: 10 µs for 16 KB; approx. 15 GB/s (about the maximum PCIe 3.0 transfer rate).

  7. Search engine architecture (diagram, built up over several slides): search query → query cache → top-level aggregator (TLA) → mid-level aggregators (MLAs) → index shards; documents + query → ranking service → rankings.

  8. Overall FPGA operation: each FPGA runs a macropipeline stage — 8 µs (1600 clock cycles). It receives a document and some features via shared memory; its output is a score. Queue Manager: an FPGA can only store one model at a time, and a “model reload” from external RAM takes 250 µs; trick: process queries for the same model together. FPGA memories: approx. 40 MB capacity (distributed) on the FPGA. Feature Extraction FSMs: parallel finite-state machines — essentially regexes compiled to gates? — fully pipelined.
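A rough software analogue (my own, not the paper’s implementation) of the feature-extraction idea: many small finite-state machines scan the same document stream, each recognizing one pattern, like regexes compiled to gates. In hardware all FSMs advance on the same input character every cycle; here a pre-built regex engine stands in for each machine.

```python
import re

def run_fsms_in_parallel(patterns, document):
    # One FSM per pattern; each independently counts its matches over
    # the shared document stream.
    fsms = [re.compile(p) for p in patterns]
    return [len(f.findall(document)) for f in fsms]

doc = "cheap flights cheap hotels flights"
print(run_fsms_in_parallel([r"cheap", r"flights", r"hotels"], doc))
# [2, 2, 1]
```

The hardware wins because the FSMs really do run simultaneously, one character per cycle, with no instruction fetch at all.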

  9. Feature Expressions: the “complex” logic area. The model determines what the expressions are, so they run on a custom multithreaded processor: threads are priority-scheduled and split across multiple FPGAs; mostly integer — small FPGA area — but some FP. Programming FPGAs ≈ processor design! What are FPGAs good for? inherently parallel programs — perhaps even with different operations, which are hard for GPUs; low-latency I/O interfacing and processing?; specialized mathematical expressions; bit-twiddling (lots of simple CPU instructions)?; prototyping CPUs, GPUs. What are FPGAs bad at? floating point and other ‘big’ arithmetic operations — purpose-built, denser ALUs just win; caching lots of data? (though dedicated SRAM blocks sometimes help); being easy to program well.
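A sketch (details are my assumptions, not the paper’s design) of the “custom multithreaded processor” idea: each feature expression becomes a thread of mostly-integer work, and a priority scheduler picks which thread runs next.

```python
import heapq

def evaluate_expressions(expressions, features):
    # Each entry is (priority, function-of-features); a lower priority
    # number runs first, mimicking priority-scheduled threads.
    ready = [(prio, i, fn) for i, (prio, fn) in enumerate(expressions)]
    heapq.heapify(ready)
    results = [None] * len(expressions)
    while ready:
        _prio, i, fn = heapq.heappop(ready)
        results[i] = fn(features)   # run this "thread" to completion
    return results

feats = {"tf": 3, "idf": 2}
exprs = [(0, lambda f: f["tf"] * f["idf"]),  # small integer expression
         (1, lambda f: f["tf"] + 1)]
print(evaluate_expressions(exprs, feats))    # [6, 4]
```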

  10. FPGAs versus GPUs: both are good at massively parallel computations; FPGAs are better at exploiting multiple-instruction parallelism?; FPGAs can be lower latency for simple operations; FPGAs are much worse at floating-point/non-small-integer calculations? Interlude: Homework 3 — what does the supplied kernel do? (elements 0 1 2 … 255, 256 257 258 … 511, 512 …). Exam topics: memory hierarchy — caches, TLBs; pipelining, instruction scheduling, VLIW; multiple issue/out-of-order — reorder buffers, register renaming and reservation stations, branch prediction; hardware multithreading; multicore shared memory — cache coherency protocols/networks, relaxed memory models and sequential consistency, synchronization (spin locks, transactional memory, etc.); vector machines, GPUs, other accelerators.

  11. Next time: custom ASICs — higher development cost, higher efficiency. Two papers: one on automating the design of custom ASIC accelerators (Aladdin); another a case study using that (Minerva). Preview: Aladdin — a tool (used by Minerva) for quickly evaluating accelerator designs; produces fast estimates of architectural tradeoffs; complements existing high-level synthesis (“C to gates”-like) tools. Preview: Minerva — accelerating/evaluating deep neural networks (machine learning models; making predictions from a pre-trained model); mathematical tradeoffs (remove “unimportant” things from the model); all these things probably apply to FPGA stuff.
