Discussion on ML in FPGAs

TRIGGER SYSTEM

40 MHz
  -> "Level-1" (subset data): custom electronics, absorbs ~100s Tb/s, trigger decision in ~10 μs
100 kHz
  -> "High Level Trigger" (full data): ~13k CPU farm, 100 ms/event
500 Hz
  -> "Offline Computing": Grid, O(10) PB
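
The rates on this slide fix the rejection factor each stage has to deliver and how many events each stage holds at once. A minimal sketch of that arithmetic, using only the numbers quoted above (the function names are illustrative, not part of any trigger software):

```python
# Rates and per-event times as quoted on the trigger-system slide.
NS_PER_S = 1_000_000_000
L1_INPUT_HZ = 40_000_000          # 40 MHz into Level-1
HLT_INPUT_HZ = 100_000            # 100 kHz into the High Level Trigger
OFFLINE_INPUT_HZ = 500            # 500 Hz to offline computing
L1_DECISION_NS = 10_000           # ~10 us Level-1 decision time
HLT_TIME_PER_EVENT_NS = 100_000_000  # 100 ms/event in the HLT


def rejection_factor(rate_in_hz, rate_out_hz):
    """How many input events there are per accepted event."""
    return rate_in_hz / rate_out_hz


def events_in_flight(rate_hz, time_per_event_ns):
    """Average number of events being held or processed at once
    (arrival rate times time spent per event)."""
    return rate_hz * time_per_event_ns / NS_PER_S


l1_rejection = rejection_factor(L1_INPUT_HZ, HLT_INPUT_HZ)        # 400.0
hlt_rejection = rejection_factor(HLT_INPUT_HZ, OFFLINE_INPUT_HZ)  # 200.0
l1_buffer_depth = events_in_flight(L1_INPUT_HZ, L1_DECISION_NS)   # 400.0
hlt_concurrency = events_in_flight(HLT_INPUT_HZ, HLT_TIME_PER_EVENT_NS)  # 10000.0

print(f"Level-1 keeps 1 in {l1_rejection:.0f}, HLT keeps 1 in {hlt_rejection:.0f}")
print(f"Level-1 buffers ~{l1_buffer_depth:.0f} events while deciding")
print(f"HLT processes ~{hlt_concurrency:.0f} events concurrently")
```

The ~10,000 events in flight at the HLT is the same order as the ~13k quoted for the CPU farm, which is what one would expect if each event occupies roughly one processing slot for 100 ms.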

GOALS

Custom hardware / firmware:
- Level-1 Trigger: 100s Tb/s, 40 MHz pipeline, 10 μs/event buffer

"Off the shelf" hardware:
- High-level Trigger: 100 kHz pipeline, 100 ms/event
- Offline processing: minutes/event

Identify where we can get the biggest gains from ML inference with FPGAs for these very different classes of problems.

Physicist-"friendly" tools; engineering resources are scarce:
- HLS4ML: better for physicist prototyping, faster development cycles, less expert knowledge needed
- Level-1 trigger is a custom target for LHC physics problems: not many tools available for O(500 ns) performance, need RTL; preference for HLS over tools like OpenCL
- open source, accessible for academia
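
To make the O(500 ns) target concrete, it helps to express it in clock ticks. A rough sketch, where the 40 MHz rate and the 500 ns figure come from the slide but the 200 MHz FPGA clock is purely an assumed value for illustration:

```python
NS_PER_S = 1_000_000_000
BUNCH_CROSSING_HZ = 40_000_000       # 40 MHz pipeline (from the slide)
L1_ALGO_LATENCY_NS = 500             # O(500 ns) target (from the slide)
ASSUMED_FPGA_CLOCK_HZ = 200_000_000  # hypothetical clock, not from the slide


def ticks_within(latency_ns, clock_hz):
    """Whole clock ticks that fit inside a latency budget."""
    return latency_ns * clock_hz // NS_PER_S


# New events that arrive while one inference is still in flight.
crossings_in_budget = ticks_within(L1_ALGO_LATENCY_NS, BUNCH_CROSSING_HZ)

# Clock cycles available end to end at the assumed FPGA clock.
fpga_cycles_in_budget = ticks_within(L1_ALGO_LATENCY_NS, ASSUMED_FPGA_CLOCK_HZ)

# Clock cycles between consecutive events at the assumed FPGA clock.
cycles_per_crossing = ASSUMED_FPGA_CLOCK_HZ // BUNCH_CROSSING_HZ

print(crossings_in_budget)    # 20
print(fpga_cycles_in_budget)  # 100
print(cycles_per_crossing)    # 5
```

Under these assumptions a single processing pipeline has about 100 cycles of latency to spend and must accept a new event every 5 cycles, which is why a fully pipelined design is needed and why tools without fine control over pipelining are a poor fit for Level-1.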

STATUS WITH AWS

- A first test of the SDSoC examples on Zynq works out of the box:
  https://github.com/Xilinx/SDSoC_Examples/tree/master/cpp/getting_started
  (uses the HLS "parlance")
- Next: adapt our HLS4ML to work with the AWS F1 examples:
  https://github.com/aws/aws-fpga/tree/master/SDAccel/examples
- First tests of the SDAccel examples on t2.2xlarge with the FPGA AMI look good too
- Push in this direction in the next month

FOR DISCUSSION

Level-1 Trigger: 100s Tb/s, 40 MHz pipeline, 10 μs/event buffer

How optimal is HLS?
- we have some conceptual idea of the optimization, but…
- is it easy to compare a simple example against a Verilog implementation?
- are we missing some obvious improvements?

Useful for other fields?
- is it something other fields would be interested in?
- are there particular features we are not thinking of?
- how tied are we to Xilinx design tools?
- which HLS? we use Vivado HLS right now

FOR DISCUSSION

High-level Trigger: 100 kHz pipeline, 100 ms/event
Offline processing: minutes/event

Exploring hardware
- could we explore AWS F1 for such applications?
- what are the tools like? which language? is HLS OK?
- comparison to GPUs and other FPGA architectures?

Throughput
- some studies by colleagues from another experiment note that the PCIe interface is a bottleneck
- InfiniBand? parallel input streams?
- how to partition/organize multi-FPGA networks?

Non-ML applications
- porting some popular physics algorithms, like the Kalman filter and cellular automata