IWES 2018 – Third Italian Workshop on Embedded Systems, Siena, 13-14 September 2018
An FPGA-Based Scalable Hardware Scheduler for Data-Flow Models
Roberto Giorgi, Marco Procaccini, Farnam Khalili – University of Siena, Italy
The end of Dennard scaling [1] has forced the engineering community to find new solutions to improve performance within a limited power budget:
- Stop increasing the clock frequency
- Shift to multicore processors
[Figure: Moore's law, 2018 – Source: Wikipedia]
Programming limitations to exploiting the full performance still remain…
The "DF-Threads" Data-Flow execution model
The DF-Threads Data-Flow execution model can exploit the full parallelism offered by multicore systems [2][3][4][5][6][7]:
- Execution relies on data dependencies
- Data-independent paths execute in parallel (see the sketch below)
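To make the data-dependency idea concrete, here is a minimal illustrative sketch (ours, not from the slides) of data-independent paths that a data-flow runtime can run in parallel:

```c
#include <stdio.h>

static long f(long a) { return a * 2; }        /* path 1 */
static long g(long b) { return b + 3; }        /* path 2 */
static long h(long c, long d) { return c + d; }

int main(void)
{
    long a = 10, b = 20;
    /* c and d depend only on their own inputs: they form two
     * data-independent paths that a data-flow runtime such as
     * DF-Threads can execute in parallel as separate threads. */
    long c = f(a);
    long d = g(b);
    /* h depends on both paths, so it runs only once c and d are ready. */
    printf("%ld\n", h(c, d));
    return 0;
}
```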
Hybrid Data-Flow Model
DF-Threads-based execution does not need to totally replace conventional general-purpose processors (GPPs):
- Hybrid model based on GPPs and Field Programmable Gate Arrays (FPGAs)
- GPP cores are suitable for legacy code or the OS
- The FPGA can easily provide efficient parallel execution via DF-Threads
System Design
A possible architecture to enable an easy distribution of the Data-Flow Threads (DF-Threads) [8] among multiple cores and multiple nodes
The Idea
Improving the scheduling of Data-Flow Threads by implementing a Hardware Scheduler (HS) on the FPGA [9][10]
Legend – PS: processing system; GPPs: general-purpose processors; HDF: hardware Data-Flow Threads; HS: hardware scheduler; HS-L1: local scheduler; HS-L2: distributed scheduler
The GPP: uses asynchronous APIs, schedules DF-Threads, executes DF-Threads
The HS: retrieves meta-information, provides ready HDF-Threads, distributes HDF-Threads over the network
Compilation and testing flow
Testing environment: COTSon simulator [11] and AXIOM board [12]
System Abstraction in a Perspective
Routing topologies: 2D-mesh or ring
[Diagram: software stack on the PS (Processing System) – application (Fibonacci algorithm), HS API, AXIOM library [2], AXIOM IOCTLs, HS device driver, and NIC device driver – connected over AXI buses to the proposed Hardware Scheduler (HS) and its registers, the NIC [1], and the DDR memory in the PL (Programmable Logic), replicated across multiple nodes]
[1] Vasileios Amourgianos-Lorentzos. "Efficient network interface design for low cost distributed systems." Master Thesis, Technical University of Crete, 2017, as part of the FORTH AXIOM program.
[2] Evidence Embedding Technology, 2017, https://github.com/evidence
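A rough sketch of how an application on the PS might reach the HS through the device-driver stack in the diagram above. The device path `HS_DEV_PATH`, the ioctl request code `HS_IOCTL_SCHEDULE`, and the `hs_cmd` layout are hypothetical placeholders, not the actual AXIOM driver interface:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical names: the real device node and ioctl codes are
 * defined by the AXIOM HS device driver and are not reproduced here. */
#define HS_DEV_PATH       "/dev/hs"                       /* assumption */
#define HS_IOCTL_SCHEDULE _IOW('h', 1, struct hs_cmd)     /* assumption */

struct hs_cmd {            /* assumed command layout */
    uint64_t frame_ptr;
    uint64_t instr_ptr;
    uint32_t init_sc;
};

int main(void)
{
    int fd = open(HS_DEV_PATH, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct hs_cmd cmd = { .frame_ptr = 0x1000, .instr_ptr = 0x2000,
                          .init_sc = 2 };
    /* The driver forwards the command over the AXI bus
     * to the HS registers in the PL. */
    if (ioctl(fd, HS_IOCTL_SCHEDULE, &cmd) < 0)
        perror("ioctl");

    close(fd);
    return 0;
}
```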
Hardware Scheduler (HS) Primitives
HS API [1] (a usage sketch follows):
f_ptr = load_frame();
HDF_schedule(f_ptr, i_ptr, init_sc)
HDF_decrease(f_ptr, num_sc)
HDF_subscribe(d_ptr)
HDF_publish(f_ptr)
[Diagram: on the PS (Processing System), the HS API drives the HS module in the PL (Programmable Logic); the HS module comprises a Register Controller with opcode, argument-1, and argument-2 registers, a Decoder FSM, HS-L1 (Hardware Scheduler Level 1) with a memory controller towards the DDR memory, and HS-L2 (Hardware Scheduler Level 2) connected to the NIC (Network Interface Card)]
[1] F. Khalili, M. Procaccini and R. Giorgi. "Reconfigurable logic interface architecture for CPU-FPGA accelerators." In HiPEAC ACACES-2018, pp. 1-4. Fiuggi, Italy, July 2018. Poster.
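A minimal usage sketch of the primitives above. Only the call shapes come from the slide; the prototypes and the frame layout are assumptions, and the synchronization-count semantics follow the DF-Threads model [4] (a frame holds a thread's inputs, and the count tracks how many inputs are still missing):

```c
#include <stdint.h>

/* Assumed prototypes; the exact types live in the real HS API header. */
void *load_frame(void);
void  HDF_schedule(void *f_ptr, void *i_ptr, uint32_t init_sc);
void  HDF_decrease(void *f_ptr, uint32_t num_sc);
void  HDF_subscribe(void *d_ptr);
void  HDF_publish(void *f_ptr);

extern void consumer_thread(void);   /* body of the consumer DF-Thread */

void spawn_consumer(uint64_t input)
{
    /* Allocate a fresh frame to hold the consumer's inputs. */
    uint64_t *frame = (uint64_t *)load_frame();

    /* Register the thread with its instruction pointer and an initial
     * synchronization count of 1 (one input still missing). */
    HDF_schedule(frame, (void *)consumer_thread, 1);

    /* Deliver the input, then decrease the synchronization count;
     * when it reaches zero, HS-L1 moves the frame pointer into the
     * Ready Frame Queue and the thread becomes executable. */
    frame[0] = input;
    HDF_decrease(frame, 1);
}
```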
Register Controller [2]
The write/read access of each register is separately controllable through the 'Control' register. The Register Controller FSM (1) controls the Master AXI-Stream Handler module (2) and exchanges data between the AXI-Stream and AXI memory-mapped domains. The Register Controller FSM (1) also polls control_reg (3) and checks the corresponding bit field of each register to determine whether it is configured for write or read access, setting the direction of the data accordingly.
[2] F. Khalili, M. Procaccini and R. Giorgi. "Reconfigurable logic interface architecture for CPU-FPGA accelerators." In HiPEAC ACACES-2018, pp. 1-4. Fiuggi, Italy, July 2018. Poster.
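A sketch of how the PS side might drive these memory-mapped registers, assuming the polling protocol described above. The register offsets and the busy bit are placeholders; only the existence of opcode, argument, and control registers comes from the slides:

```c
#include <stdint.h>

/* Hypothetical register map (offsets are assumptions). */
#define REG_OPCODE   0x00
#define REG_ARG1     0x08
#define REG_ARG2     0x10
#define REG_CONTROL  0x18
#define CTRL_BUSY    (1u << 0)   /* assumed busy bit */

static inline void reg_write(volatile uint8_t *base, uint32_t off,
                             uint64_t val)
{
    *(volatile uint64_t *)(base + off) = val;
}

static inline uint64_t reg_read(volatile uint8_t *base, uint32_t off)
{
    return *(volatile uint64_t *)(base + off);
}

/* Issue one HS command: fill the argument registers, write the opcode,
 * then wait until the Register Controller FSM clears the busy bit. */
void hs_issue(volatile uint8_t *hs_base, uint64_t opcode,
              uint64_t arg1, uint64_t arg2)
{
    reg_write(hs_base, REG_ARG1, arg1);
    reg_write(hs_base, REG_ARG2, arg2);
    reg_write(hs_base, REG_OPCODE, opcode);   /* triggers the decoder */

    while (reg_read(hs_base, REG_CONTROL) & CTRL_BUSY)
        ;                                     /* poll, as the FSM does */
}
```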
HS-L1 (Hardware Scheduler Level 1)
- Retrieves the meta-information of FRAMEs (Schedule FSM); FRAMEs are stored in the GM (Global Memory) sector
- Schedules the FRAMEs that are ready to be executed (Decrease FSM)
- Fetches the IP (Instruction Pointer) from the ready FRAMEs (Fetch FSM); ready frame pointers are stored in the RFQ (Ready Frame Queue) sector
A software model of the Decrease and Fetch FSMs follows.
[Diagram: the Decoder drives the Schedule, Decrease, and Fetch FSMs; the GM controller and GM-DMA access the Global Memory sector via direct memory access, while the RFQ controller and RFQ-DMA access the Ready Frame Queue sector; HS-L1 connects to HS-L2]
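A software model of what the Decrease and Fetch FSMs are described as doing. The frame-header layout and the ring-buffer RFQ are our assumptions; in hardware these live in the GM and RFQ sectors and are reached through the DMA engines:

```c
#include <stdint.h>

/* Assumed frame header layout in the Global Memory (GM) sector. */
struct frame {
    uint32_t sync_count;   /* inputs still missing */
    uint64_t instr_ptr;    /* set by HDF_schedule  */
};

/* Ready Frame Queue (RFQ) sector, modeled as a ring buffer. */
#define RFQ_SIZE 256       /* power of two, so free-running indices wrap */
static struct frame *rfq[RFQ_SIZE];
static unsigned rfq_head, rfq_tail;

/* Decrease FSM: decrement the count; at zero, the frame is ready and
 * its pointer is pushed into the RFQ sector. */
void decrease_fsm(struct frame *f, uint32_t num_sc)
{
    f->sync_count -= num_sc;
    if (f->sync_count == 0)
        rfq[rfq_tail++ % RFQ_SIZE] = f;
}

/* Fetch FSM: pop a ready frame and hand its IP to a GPP core. */
uint64_t fetch_fsm(struct frame **f_out)
{
    if (rfq_head == rfq_tail)
        return 0;                       /* nothing ready */
    *f_out = rfq[rfq_head++ % RFQ_SIZE];
    return (*f_out)->instr_ptr;
}
```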
HS-L2 (Hardware Scheduler Level 2)
- Distributes FRAMEs in order to balance the load throughout the network
- Work-stealing from remote nodes
- Off-loads work to remote nodes
A schematic model of this policy follows.
[Diagram: the Load Balancing FSM connects HS-L1 to the NIC through a Msg-Composer feeding the TX-FIFO and a Msg-Interpreter draining the RX-FIFO; the NIC routes to the N, W, S, E neighbors]
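A schematic model of the HS-L2 policy as described: off-load when the local ready queue is long, steal when it runs dry. The thresholds, message format, and helper functions are assumptions; the real policy and the Msg-Composer/Msg-Interpreter formats are not specified on the slide:

```c
#include <stdint.h>

/* Assumed watermarks for the load-balancing decision. */
#define HIGH_WATERMARK 64
#define LOW_WATERMARK   4

enum msg_type { MSG_OFFLOAD_FRAME, MSG_STEAL_REQUEST };

struct msg { enum msg_type type; uint64_t frame_ptr; };

extern unsigned rfq_length(void);            /* from HS-L1           */
extern uint64_t rfq_pop_tail(void);          /* frame to give away   */
extern void     tx_fifo_push(struct msg m);  /* towards Msg-Composer */

/* One step of the Load Balancing FSM. */
void hs_l2_step(uint8_t neighbor /* N, W, S or E port */)
{
    (void)neighbor;   /* neighbor-selection policy omitted */

    unsigned len = rfq_length();
    if (len > HIGH_WATERMARK) {
        /* Too much local work: off-load one frame to a remote node. */
        struct msg m = { MSG_OFFLOAD_FRAME, rfq_pop_tail() };
        tx_fifo_push(m);
    } else if (len < LOW_WATERMARK) {
        /* Running dry: try to steal work from a remote node. */
        struct msg m = { MSG_STEAL_REQUEST, 0 };
        tx_fifo_push(m);
    }
}
```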
Design Snippets
[Figure: block-design snippets of the main modules – Register Controller, Schedule FSM, Decrease FSM, Fetch FSM, GM Controller, GM DMA, RFQ Controller, RFQ DMA, Msg-Composer, Msg-Interpreter, TX-FIFO, RX-FIFO]
Evaluation – Execution Cycles

Number of clock cycles (PL), basic operations:

Operation                     Data Width   Worst   Best
FIFO Enqueue/Dequeue          64 bits      2       1
Global Memory Write (DDR4)    16 bytes     48      40
Global Memory Read (DDR4)     16 bytes     38      38
Ready Queue Write             32 bits      48      40
Ready Queue Read              32 bits      44      44

Number of clock cycles (PL), HS instructions; each total is the sum of its delay contributors (e.g. HDF-Schedule worst case: 48 + 1 = 49):

Instruction    Delay Contributor   Worst   Best
HDF-Schedule   Total               49      40
               DMA IP              48      39
               Decoder FSM         1       1
HDF-Decrease   Total               89      43
               DMA IP              86      40
               Decoder FSM         3       3
HDF-Fetch      Total               85      34
               DMA IP              82      31
               Fetch FSM           3       3
Evaluation – Resource Utilization
Resource utilization extracted from the Vivado Design Suite 2016.4, targeting the AXIOM board (Zynq UltraScale+ XCZU9EG platform):

PL Units   Number of Units   Available   Utilization %
LUT        20357             274080      7.43
LUTRAM     2876              144000      2.00
FF         26116             548160      4.76
BRAM       49.50             912         5.43
IO         27                204         13.24
GT         2                 16          12.50
BUFG       6                 404         1.49
Results: HDF-Threads vs OpenMPI – Matrix Multiply test (size 512, block 8)
[Charts: execution time (sec), speedup T(1)/T(n), and efficiency S(p)/p for OpenMPI and HDF-Threads across the 1N 1C, 2N 1C, 4N 1C, and 4N 2C configurations (nN mC = n nodes, m cores per node)]
Results: HDF-Threads vs OpenMPI – Matrix Multiply test (size = 512, b = 8)
[Charts: kernel cycles and bus utilization (%) for OpenMPI and HDF-Threads across the 1N 1C, 2N 1C, 4N 1C, and 4N 2C configurations. Reported data labels – kernel cycles: 2.26, 2.17, 2.01, 1.97 (OpenMPI) vs 1.4, 1.08, 1.07, 1.07 (HDF-Threads); bus utilization: 58.73, 52.03, 42.56, 39.75% (OpenMPI) vs 2.6, 2.52, 2.42, 0.79% (HDF-Threads)]
References
[1] Frank, D. J., Dennard, R. H., Nowak, E., Solomon, P. M., Taur, Y., & Wong, H. S. P. (2001). Device scaling limits of Si MOSFETs and their application dependencies. Proceedings of the IEEE, 89(3), 259-288.
[2] Mondelli, A., et al. (2015). Dataflow support in x86_64 multicore architectures through small hardware extensions. In Digital System Design (DSD), 2015 Euromicro Conference on. IEEE.
[3] Dennis, J. B. (1980). Data flow supercomputers. Computer, (11), 48-56.
[4] Giorgi, R., & Faraboschi, P. (2014, October). An introduction to DF-Threads and their execution model. In Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2014 International Symposium on (pp. 60-65). IEEE.
[5] Verdoscia, L., Vaccaro, R., & Giorgi, R. (2014, August). A clockless computing system based on the static dataflow paradigm. In Data-Flow Execution Models for Extreme Scale Computing (DFM), 2014 Fourth Workshop on (pp. 30-37). IEEE.
[6] Giorgi, R., Popovic, Z., & Puzovic, N. (2007, October). DTA-C: A decoupled multi-threaded architecture for CMP systems. In Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on (pp. 263-270). IEEE.
[7] Kavi, K. M., Giorgi, R., & Arul, J. (2001). Scheduled dataflow: Execution paradigm, architecture, and performance evaluation. IEEE Transactions on Computers, 50(8), 834-846.
[8] Procaccini, M., & Giorgi, R. (2018). A Data-Flow Execution Engine for Scalable Embedded Computing. HiPEAC ACACES-2018.
[9] Procaccini, M., Khalili, F., & Giorgi, R. (2018). An FPGA-based Scalable Hardware Scheduler for Data-Flow Models. HiPEAC ACACES-2018.
[10] Khalili, F., Procaccini, M., & Giorgi, R. (2018). Reconfigurable Logic Interface Architecture for CPU-FPGA Accelerators. HiPEAC ACACES-2018.
[11] Argollo, E., Falcón, A., Faraboschi, P., Monchiero, M., & Ortega, D. (2009). COTSon: infrastructure for full system simulation. ACM SIGOPS Operating Systems Review, 43(1), 52-61.
[12] Theodoropoulos, D., Mazumdar, S., Ayguade, E., Bettin, N., Bueno, J., Ermini, S., ... & Montefoschi, F. (2017). The AXIOM platform for next-generation cyber physical systems. Microprocessors and Microsystems, 52, 540-555.
THANK YOU FOR YOUR ATTENTION. ANY QUESTIONS?