

1. A distributed model of computation for reconfigurable devices based on a streaming architecture
   Paolo Cretaro, National Institute for Nuclear Physics
   FPL 2019, Barcelona, September 2019

2. The ExaNeSt project: hardware highlights
   Unit: Xilinx Zynq UltraScale+ FPGA
   - Four 64-bit ARM Cortex-A53 cores @ 1.5 GHz
   - Programmable logic
   - 16 high-speed serial links @ 16 Gbps
   Node: Quad-FPGA Daughter-Board (QFDB)
   - All-to-all internal connectivity
   - 10 HSS links to remote QFDBs (through the network FPGA)
   - 64 GB DDR4 RAM (16 GB per FPGA)
   - 512 GB NVMe SSD on the storage FPGA
   Blade/mezzanine
   - 4 QFDBs in Track 1
   - 2 HSS links per edge (local direct network)
   - 32 SFP+ connectors for the inter-mezzanine hybrid network
   I worked in the team that built the 3D torus network, based on a custom Virtual Cut-Through protocol.

3. Mixing acceleration and network
   - With High-Level Synthesis tools, FPGAs are becoming a viable way to accelerate tasks
   - Accelerators must be able to access the network directly to achieve low-latency communication among themselves and with remote hosts
   - A dataflow programming paradigm could take advantage of this feature to optimize communication patterns and loads
   [Figure: node block diagram with two CPUs, an accelerator, DDR, and the network interface attached to a system memory-mapped bus]
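   As a minimal sketch (not the author's actual code), an HLS kernel with AXI-Stream ports can sit directly on the path to the network interface, so data moves between accelerators without a round trip through host memory. The flit width and the increment operation are assumptions for illustration.

      #include "hls_stream.h"
      #include "ap_int.h"

      // Hypothetical 64-bit flit; the actual ExaNeSt packet format is not
      // shown in the slides.
      typedef ap_uint<64> flit_t;

      void stream_accel(hls::stream<flit_t> &from_net,
                        hls::stream<flit_t> &to_net) {
      #pragma HLS INTERFACE axis port=from_net
      #pragma HLS INTERFACE axis port=to_net
      #pragma HLS INTERFACE ap_ctrl_none port=return
          // Blocking read: the kernel stalls until the network delivers a flit.
          flit_t word = from_net.read();
          // Placeholder computation; a real task would do its work here.
          word += 1;
          // Blocking write pushes the result straight back toward the network.
          to_net.write(word);
      }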

4. Kahn process networks: advantages
   A group of sequential processes communicating through FIFO channels
   - Determinism: for the same input history the network produces exactly the same output
   - No shared memory: processes run concurrently and synchronize through blocking reads on input channel FIFOs
   - Distributing tasks over multiple devices is easy
   [Figure: example network of processes A, B, C and P connected by FIFO channels]
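   A single Kahn process reduces to a few lines of HLS C++: block on each input FIFO, compute, forward. The process name, channel names, and the sum are illustrative, not from the slides.

      #include "hls_stream.h"
      #include "ap_int.h"

      typedef ap_uint<32> token_t;

      // Process P: consumes one token from A and one from B, emits one to C.
      void process_P(hls::stream<token_t> &from_A,
                     hls::stream<token_t> &from_B,
                     hls::stream<token_t> &to_C) {
          // read() blocks until a token arrives: the only synchronization a
          // Kahn process needs. With no shared state and no way to test a
          // channel for emptiness, the output history is fully determined
          // by the input histories, which is where determinism comes from.
          token_t a = from_A.read();
          token_t b = from_B.read();
          to_C.write(a + b);
      }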

5. Accelerator hardware interface
   - Virtual input/output channels for each source/destination
   - Direct host memory access for buffering and configuration (a device driver is needed)
   - Direct coupling with the network
   [Figure: acceleration cores coupled on one side to network adapters and on the other to host memory]
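   A purely illustrative guess at the driver-visible side of one virtual channel; the slides give no register layout, so every field below is an assumption.

      #include <stdint.h>

      // Hypothetical memory-mapped descriptor for one virtual channel.
      struct vchan_regs {
          uint64_t buf_addr;  // host-memory ring buffer backing this channel
          uint32_t buf_len;   // buffer length in bytes
          uint32_t head;      // producer index, advanced by the accelerator
          uint32_t tail;      // consumer index, advanced by the driver
          uint32_t ctrl;      // enable/reset/interrupt bits
      };
      // The device driver would map one such block per channel into the
      // accelerator's configuration space and expose the buffers to user space.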

6. Steps description
   1. Write kernels in HLS
   2. A config file delineates tasks and data dependencies
   3. A directed graph is built and mapped onto the network topology
   4. Accelerator blocks are flashed on the targeted nodes
   5. Data is fed into the entry points and the tasks are started
   6. Each task consumes its data and sends the results to the next ones
   [Figure: task graph A-E mapped onto a 3x3 grid of compute units, e.g. A on CU 0, B on CU 1, C on CU 3, D and E on CU 4]
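   From the host side, steps 2-5 could look like the sketch below. Every identifier here is hypothetical, invented to mirror the flow; the slides do not show the actual toolchain API.

      #include <cstddef>

      struct TaskGraph { /* tasks, channels, node assignments */ };

      // Hypothetical runtime calls (stubbed so the sketch compiles):
      TaskGraph load_task_graph(const char *) { return {}; }           // step 2
      void map_onto_topology(TaskGraph &) {}                           // step 3
      void flash_accelerators(const TaskGraph &) {}                    // step 4
      void feed_entry_point(TaskGraph &, int, const void *, size_t) {} // step 5
      void start_tasks(TaskGraph &) {}                                 // step 5

      int main() {
          TaskGraph g = load_task_graph("graph.cfg"); // parse the config file
          map_onto_topology(g);      // place tasks on nodes of the 3D torus
          flash_accelerators(g);     // program the blocks on the target FPGAs
          int data[4] = {1, 2, 3, 4};
          feed_entry_point(g, 0, data, sizeof data);  // feed an entry point
          start_tasks(g);
          // Step 6 runs on the fabric: each task blocks on its input
          // channels, computes, and forwards results downstream.
          return 0;
      }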

7. Simplified task graph configuration example

      Device0 {
        Type: FPGA
        Task0 {
          Impl: source_task.c
          Input_channels: 0
          Output_channels {
            Ch0: Device1.Task0.Ch1
          }
        }
        Task1 {
          Impl: source_task.c
          Input_channels: 0
          Output_channels {
            Ch0: Device1.Task0.Ch0
          }
        }
      }
      Device1 {
        Type: FPGA
        Task0 {
          Impl: example_task.c
          Input_channels: 2
          Output_channels {
            Ch0: Device1.Task1.Ch0
          }
        }
        Task1 {
          Impl: sink_task.c
          Input_channels: 1
        }
      }
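   The example_task.c referenced above would be a two-input, one-output kernel. A minimal sketch in HLS C++, with the body (summing the inputs) invented for illustration:

      #include "hls_stream.h"
      #include "ap_int.h"

      typedef ap_uint<32> token_t;

      void example_task(hls::stream<token_t> &ch0_in,   // from Device0.Task1
                        hls::stream<token_t> &ch1_in,   // from Device0.Task0
                        hls::stream<token_t> &ch0_out)  // to Device1.Task1
      {
      #pragma HLS INTERFACE axis port=ch0_in
      #pragma HLS INTERFACE axis port=ch1_in
      #pragma HLS INTERFACE axis port=ch0_out
          token_t a = ch0_in.read();   // blocking reads on both input FIFOs
          token_t b = ch1_in.read();
          ch0_out.write(a + b);        // forward the result to the sink task
      }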

8. Thank you!
