
HERO: Open-Source Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Architectures on FPGA



  1. HERO: Open-Source Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Architectures on FPGA. First Workshop on Computer Architecture Research with RISC-V (CARRV) @ MICRO 50, 2017-10-14. Andreas Kurth, Pirmin Vogel, Alessandro Capotondi, Andrea Marongiu, Luca Benini. Integrated Systems Laboratory, Digital Circuits and Systems Group.

  2. Heterogeneous Embedded Systems on Chip (HESoCs). [Figures: die shot of an Apple A11 SoC (source: chipworks); architectural template of HESoCs showing a host, PMCA 0 through PMCA N, I/O, and a shared main memory.] HESoCs co-integrate a general-purpose host processor and efficient, domain-specific programmable manycore accelerators (PMCAs). They combine versatility with extreme nominal energy efficiency. While industry rapidly advances products, ... A. Kurth, P. Vogel, A. Capotondi, A. Marongiu, L. Benini (Digital CAS Group, IIS) 1 / 20

  3. The Research Gap on HESoCs. ... research on HESoCs lags behind! [Figure: architectural template of HESoCs.] There are many open questions in various areas of computer engineering: programming models, task distribution and scheduling, memory organization, communication, synchronization, accelerator architectures and granularity, ... But there is no research platform for HESoCs!

  4. Problems with Simulating HESoCs. Developing HESoC components in isolation and estimating their system-level performance is problematic. Complex interactions between host, accelerators, and memory hierarchy make (reasonably accurate) simulations orders of magnitude slower than running prototypes. Even full-system simulators (e.g., GEM5) do not model all HESoC components. Models make assumptions about non-deterministic processes. The validity of results thus entirely depends on the validity of assumptions, and the assumptions for modeling HESoCs are very complex. Conclusion: A research platform for HESoCs must be available. This is not only about hardware: for system-level research, the platform must be efficiently programmable. Additionally, the platform should come with tools to increase the observability and decrease the validation and implementation overhead of the prototype.

  5. HERO: Open-Source Heterogeneous Embedded Research Platform. Heterogeneous hardware architecture. [Figure: block diagram of the host (ARM Juno SoC with A57 and A53 cores, MMUs, L1 and L2 caches, coherent interconnect, DDR DRAM) and the PMCA (clusters 0 to L-1, each with RISC-V PEs 0 to N-1, L1 SPM banks 0 to M-1, DMA, shared L1 I$, shared APU, event unit, and timer; plus L2 memory, mailbox, RAB, and SoC bus), connected through TLX-400 links.] Heterogeneous software stack: single-source, single-binary cross-compilation toolchain; OpenMP 4.5; shared virtual memory for host and PMCA. [Figure: software stack with the heterogeneous application and offloaded kernel on top, OpenMP RTE and runtime libraries (RTE LIB, VMM LIB) at user level, driver and Linux kernel at kernel level, and host and PMCA hardware at the bottom.] Profiling and automated verification solutions.

  6. HERO’s Hardware Architecture. Host: industry-standard, hard-macro ARM Cortex-A host processor. PMCA: scalable, configurable, modifiable FPGA implementation of a silicon-proven, cluster-based PMCA with RISC-V PEs. Shared main DRAM. Low-latency interconnect, which offers coherency to host caches. [Figure: HERO’s hardware, as implemented on the Juno ADP.]

  7. PMCA Implementation on FPGA: Overview. Multi-banked, software-managed scratchpad memories (SPMs) and a multi-channel DMA engine instead of data caches. RISC-V processing elements (PEs) and shared auxiliary processing units (APUs) operating on local data. Multi-cluster design to overcome scalability limitations. Shared virtual memory access through the software-managed, lightweight Remapping Address Block (RAB). [Figure: PMCA based on the PULP architectural template.]
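The slide characterizes the RAB only as software-managed and lightweight. To illustrate what software-managed address remapping can look like in general, the following is a minimal C sketch of a range-based translation table that software fills and a lookup consults. The types `rab_slice_t` and `rab_table_t`, the functions `rab_map` and `rab_translate`, and the table size are all hypothetical; they are not taken from the slides and do not describe the actual RAB register interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RAB_NUM_SLICES 4  /* hypothetical table size, not from the slides */

typedef struct {
    bool     valid;
    uint64_t virt_start;  /* first virtual address covered */
    uint64_t virt_end;    /* last virtual address covered (inclusive) */
    uint64_t phys_start;  /* physical address that virt_start maps to */
} rab_slice_t;

typedef struct {
    rab_slice_t slices[RAB_NUM_SLICES];
} rab_table_t;

/* Software side: install a mapping in the first free slice.
 * size must be non-zero. Returns the slice index, or -1 if the table is full. */
int rab_map(rab_table_t *t, uint64_t virt_start, uint64_t size, uint64_t phys_start)
{
    for (int i = 0; i < RAB_NUM_SLICES; ++i) {
        if (!t->slices[i].valid) {
            t->slices[i].valid      = true;
            t->slices[i].virt_start = virt_start;
            t->slices[i].virt_end   = virt_start + size - 1;
            t->slices[i].phys_start = phys_start;
            return i;
        }
    }
    return -1;
}

/* Lookup side: translate a virtual address. Returns false on a miss,
 * which software would have to resolve by installing a mapping. */
bool rab_translate(const rab_table_t *t, uint64_t virt, uint64_t *phys)
{
    for (int i = 0; i < RAB_NUM_SLICES; ++i) {
        const rab_slice_t *s = &t->slices[i];
        if (s->valid && virt >= s->virt_start && virt <= s->virt_end) {
            *phys = s->phys_start + (virt - s->virt_start);
            return true;
        }
    }
    return false;
}
```

In this sketch a miss is simply reported to the caller; how misses are actually handled on HERO is not covered by this slide.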

  8. PMCA on FPGA: Configurable, Modifiable, and Expandable. Configurable: L2 SPM size; # of clusters ∈ {1, 2, 4, 8}; L1 SPM size and # of banks; RAB L1 TLB size and L2 TLB size, associativity, and # of banks; # of PEs ∈ {2, 4, 8}; FPU ∈ {private, shared (APU), off}; integer DSP unit ∈ {private, shared (APU)}; system-level interconnect topology; I$ design, size, and # of banks. Modifiable and expandable: All components are open-source and written in industry-standard SystemVerilog. Interfaces are either standard (mostly AXI) or simple (e.g., stream-payload). New components can be easily added to the memory map.
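The value sets listed on this slide can be captured as a small validity check. The sketch below is only an illustration in C: the struct `pmca_config_t`, the enum `unit_mode_t`, and the function `pmca_config_is_valid` are hypothetical names, and the platform itself is configured in SystemVerilog, not through this code.

```c
#include <stdbool.h>

/* Hypothetical encoding of the per-unit options named on the slide. */
typedef enum { UNIT_OFF, UNIT_PRIVATE, UNIT_SHARED_APU } unit_mode_t;

typedef struct {
    unsigned    n_clusters;  /* slide: 1, 2, 4, or 8 */
    unsigned    n_pes;       /* slide: 2, 4, or 8 */
    unit_mode_t fpu;         /* slide: private, shared (APU), or off */
    unit_mode_t int_dsp;     /* slide: private or shared (APU) */
} pmca_config_t;

/* Returns true if every field lies in the set given on the slide. */
bool pmca_config_is_valid(const pmca_config_t *c)
{
    bool clusters_ok = c->n_clusters == 1 || c->n_clusters == 2 ||
                       c->n_clusters == 4 || c->n_clusters == 8;
    bool pes_ok      = c->n_pes == 2 || c->n_pes == 4 || c->n_pes == 8;
    bool fpu_ok      = c->fpu == UNIT_OFF || c->fpu == UNIT_PRIVATE ||
                       c->fpu == UNIT_SHARED_APU;
    bool dsp_ok      = c->int_dsp == UNIT_PRIVATE ||
                       c->int_dsp == UNIT_SHARED_APU;
    return clusters_ok && pes_ok && fpu_ok && dsp_ok;
}
```

For example, a configuration with 8 clusters, 8 PEs, a shared FPU, and a private integer DSP unit passes the check, while one with 3 clusters does not.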

  9. HERO’s Memory Map (from the Perspective of a PE in the PMCA). 256 MiB of virtual addresses (0x1000 0000 to 0x1FFF FFFF) are reserved for PMCA-internal usage:

     0x1000 0000 to 0x103F FFFF   Remote Cluster 0
     0x1040 0000 to 0x107F FFFF   Remote Cluster 1
     0x1000 0000 + n*0x40 0000 to 0x103F FFFF + n*0x40 0000   Remote Cluster n
     0x1A10 0000 to 0x1A1F FFFF   SoC Peripherals: UART (0x1A10 2000), Standard I/O (0x1A11 0000), Mailbox (0x1A12 0000), RAB Configuration (0x1A13 0000)
     0x1B00 0000 to 0x1B3F FFFF   Own Cluster: Tightly-Coupled Data Memory (0x1B00 0000), Peripherals (0x1B20 0000)
     Peripherals detail: Timer (0x1B20 0400), Clkgate Control (0x1B20 0900), Event Unit (0x1B40 4000), DMA Control (0x1B40 4400)
     0x1C00 0000 to 0x1CFF FFFF   L2 Memory
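The remote-cluster rule on this slide (base 0x1000 0000 plus n times 0x40 0000) is simple address arithmetic. A minimal C sketch follows; the macro and function names are made up for illustration, and only the two constants come from the slide.

```c
#include <stdint.h>

/* Constants from the memory-map slide; the names are hypothetical. */
#define PMCA_REMOTE_CLUSTER_BASE UINT32_C(0x10000000)
#define PMCA_CLUSTER_STRIDE      UINT32_C(0x00400000)  /* 4 MiB per cluster */

/* First address of remote cluster n as seen by a PE. */
uint32_t remote_cluster_start(uint32_t n)
{
    return PMCA_REMOTE_CLUSTER_BASE + n * PMCA_CLUSTER_STRIDE;
}

/* Last address of remote cluster n (inclusive). */
uint32_t remote_cluster_end(uint32_t n)
{
    return remote_cluster_start(n) + PMCA_CLUSTER_STRIDE - 1;
}
```

With n = 1 this gives the range 0x1040 0000 to 0x107F FFFF, matching the Remote Cluster 1 entry of the map.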

  10. HERO’s Software Stack. Allows writing programs that start on the host but seamlessly integrate the PMCAs:

         int main() {
             vertex vertices[N];
             load(&vertices, N);
             /* offload the block below to the PMCA */
             #pragma omp target map(tofrom:vertices)
             {
                 /* distribute the loop iterations over the PEs */
                 #pragma omp parallel for
                 for (int i = 0; i < N; ++i)
                     vertices[i] = process();
             }
         }

      Offloads with OpenMP 4.5 target semantics, zero-copy (pointer passing) or copy-based. Integrated cross-compilation and single-binary linkage. PMCA-specific runtime environment and hardware abstraction libraries (HAL). [Figure: software stack with the heterogeneous application and offloaded kernel on top, OpenMP RTE and runtime libraries (RTE LIB, VMM LIB) at user level, driver and Linux kernel at kernel level, and host and PMCA hardware at the bottom.]
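The slide's snippet leaves `vertex`, `load`, and `process` undefined. A self-contained variant of the same offload pattern might look like the sketch below. The `vertex` layout, the doubling done in `process`, and the wrapper `process_all` are assumptions made purely for illustration. The code uses only standard OpenMP 4.5 directives, so a compiler without OpenMP support ignores the pragmas and runs the loop on the host.

```c
#include <stddef.h>

/* Hypothetical element type; the slide does not define vertex. */
typedef struct { float x, y, z; } vertex;

/* Functions called inside a target region must be compiled for the device. */
#pragma omp declare target
static vertex process(vertex v)
{
    /* Placeholder computation: scale every coordinate by 2. */
    v.x *= 2.0f;
    v.y *= 2.0f;
    v.z *= 2.0f;
    return v;
}
#pragma omp end declare target

/* Offload the loop to the accelerator; the array section is mapped
 * to the device on entry and back to the host on exit (tofrom). */
void process_all(vertex *vertices, size_t n)
{
    #pragma omp target map(tofrom: vertices[0:n])
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; ++i)
            vertices[i] = process(vertices[i]);
    }
}
```

Because `vertices` is a pointer here, the `map` clause names an explicit array section `vertices[0:n]`; the slide maps a whole array, whose size the compiler already knows.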

  11. Software Stack: OpenMP. The libgomp plugin determines how input and output variables are passed between host and PMCA. With copy-based shared memory, data is copied to and from a physically contiguous, uncached section in main memory, and physical pointers are passed to the PMCA. Shared virtual memory enables zero-copy offloads, directly passing virtual pointers to the PMCA. Furthermore, the plugin implements essential OpenMP functionality efficiently on the specific PMCA hardware, such as: parallel (starting parallel execution); team (definition of parallel thread teams); sections (distributed, one-time execution worksharing); barrier (synchronization barrier); critical (single-threaded execution within a parallel region).
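For readers less familiar with the constructs named on this slide, the following generic C sketch combines `parallel`, `sections`, `critical`, and `barrier` in one function. It is plain OpenMP and says nothing about how the HERO plugin implements these constructs; the function `sum_with_sections` is a made-up example.

```c
#include <stddef.h>

/* Sum an array by giving each half to one section, then combine the
 * per-thread partial sums inside a critical region. */
long sum_with_sections(const int *data, size_t n)
{
    long total = 0;  /* shared by all threads of the team */

    #pragma omp parallel
    {
        long partial = 0;  /* private: declared inside the parallel region */

        /* Each section is executed exactly once by some thread of the team. */
        #pragma omp sections
        {
            #pragma omp section
            for (size_t i = 0; i < n / 2; ++i)
                partial += data[i];

            #pragma omp section
            for (size_t i = n / 2; i < n; ++i)
                partial += data[i];
        }

        /* Only one thread at a time may update the shared total. */
        #pragma omp critical
        total += partial;

        /* Explicit barrier: all threads wait here until every update is done. */
        #pragma omp barrier
    }
    return total;
}
```

Threads that execute no section contribute a partial sum of zero, so the result does not depend on the team size. Without OpenMP support the pragmas are ignored and the function computes the same sum serially.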
