Broadening the Exploration of the Accelerator Design Space in Embedded Scalable Platforms
Luca Piccolboni, Paolo Mantovani, Giuseppe Di Guglielmo, Luca Carloni
Columbia University, New York, NY, USA
IEEE High Performance Extreme Computing Conference (HPEC), 2017
Why Hardware Accelerators?
• High-performance embedded systems are heterogeneous:
  • they include multiple general-purpose processor cores
  • they include special-function hardware accelerators
[Figure: a chip with four processor cores and several hardware accelerators. Processor cores are associated with generality, hardware accelerators with efficiency.]
Embedded Scalable Platforms (ESP)
• To balance the demand for hardware specialization with the need to maintain helpful degrees of regularity and modularity, we proposed Embedded Scalable Platforms [L. Carloni, "The Case for Embedded Scalable Platforms", DAC 2016]
[Figure: an ESP instance with a processor core, two memory controllers, an I/O tile (miscellaneous channels, etc.), and twelve hardware accelerators.]
Embedded Scalable Platforms (ESP)
• ESP instance for WAMI (Wide-Area Motion Imagery)
[Figure: the ESP instance with a LEON3 CPU as processor core, two memory controllers, an I/O tile (miscellaneous channels, etc.), and twelve accelerators: WARP, GRAYSCALE, MATRIX-SUB, MATRIX-RES, SD-UPDATE, GRADIENT, MATRIX-ADD, CHANGE-DET, HESSIAN, DEBAYER, MATRIX-MUL, STEEP-DESC.]
Embedded Scalable Platforms (ESP)
• System-Level Design with High-Level Synthesis (HLS)
[Figure: the WAMI instance with each of the twelve accelerators labeled HLS.]
Embedded Scalable Platforms (ESP)
• System-Level Design with High-Level Synthesis (HLS): rapid integration and prototyping
[Figure: ESP instance for WAMI (Wide-Area Motion Imagery).]
Hardware Accelerators with HLS
[Figure: SystemC specification of the GRAYSCALE accelerator from the WAMI instance. It consists of the GRAYSCALE interface and the GRAYSCALE logic, with load, compute, and store blocks operating on a multi-bank input PLM and a multi-bank output PLM. PLM stands for Private Local Memory.]
Hardware Accelerators with HLS
[Figure: High-Level Synthesis (HLS) translates the SystemC specification of GRAYSCALE into RTL. Knob configurations #1 through #4 each yield an RTL implementation with its own cost (area) and performance (latency).]
Hardware Accelerators with HLS
[Figure: the RTL implementations on the cost (area) versus performance (latency) plot, classified as Pareto optimal or Pareto dominated.]
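The split between Pareto-optimal and Pareto-dominated implementations can be illustrated with a small sketch. It assumes both latency and area are to be minimized; the four design points and their names are made up for illustration and are not results from the talk.

```python
from typing import List, Tuple

# (name, latency, area); both metrics are minimized.
Design = Tuple[str, float, float]


def pareto_front(designs: List[Design]) -> List[Design]:
    """Return the designs that no other design dominates, sorted by latency."""
    front = []
    for name, lat, area in designs:
        # A design is dominated if another one is no worse on both
        # metrics and strictly better on at least one of them.
        dominated = any(
            (l2 <= lat and a2 <= area) and (l2 < lat or a2 < area)
            for _, l2, a2 in designs
        )
        if not dominated:
            front.append((name, lat, area))
    return sorted(front, key=lambda d: d[1])


# Hypothetical knob configurations (illustrative numbers only).
designs = [
    ("conf1", 3.0, 1.0),
    ("conf2", 2.0, 1.4),
    ("conf3", 2.5, 1.8),  # slower and larger than conf2
    ("conf4", 1.0, 2.2),
]

for name, lat, area in pareto_front(designs):
    print(f"{name}: latency={lat}, area={area}")
```

In this made-up set, conf3 is the only dominated point because conf2 is both faster and smaller.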
Standard HLS Knobs
Standard knobs provided by the current HLS tools:

Knob                 Settings and Effects
Loop manipulations   Unrolls, pipelines or breaks the body of loops
Array mappings       Maps arrays to registers or on-chip memories
Clock period         Sets the target clock period for synthesis

• These knobs already enable a rich design-space exploration
• However, they are not sufficient for exploring accelerators
We need other knobs to broaden the exploration
Motivational Example #1
[Figure: DEBAYER, normalized area versus normalized effective latency, comparing designs synthesized with the standard knobs and designs synthesized with the proposed knobs. Annotation: bounded by on-chip memory bandwidth.]
• Limiting factor: limited bandwidth to the on-chip memory
• We need knobs to tailor the PLM to the accelerator needs
Motivational Example #2
[Figure: GRAYSCALE, normalized area versus normalized effective latency, comparing designs synthesized with the standard knobs and designs synthesized with the proposed knobs. Annotation: bounded by off-chip memory bandwidth.]
• Limiting factor: limited bandwidth to the off-chip memory
• We need knobs to operate on the communication interfaces
Contributions: Xknobs
eXtended Knobs for High-Level Synthesis:

Xknob       Settings and Effects
PLM PORTS   Sets the on-chip memory bandwidth
DMA WIDTH   Sets the off-chip memory bandwidth
DMA CHUNK   Sets the size of the input and output PLM
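As a rough sketch of the design space these three Xknobs span, the snippet below enumerates all combinations of the values that appear in the plot legends of the individual Xknob slides. The `XknobConfig` class and the assumption that every combination is valid are illustrative only; the talk does not say how the knobs are exposed to the HLS flow, and it notes that DMA WIDTH also affects the number of PLM ports.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class XknobConfig:
    """Hypothetical container for one Xknob setting (not a tool API)."""
    plm_ports: int  # read/write ports of the input/output PLMs
    dma_width: int  # DMA channel width in bits
    dma_chunk: int  # PLM size, in elements of the stored data type


# Values taken from the plot legends of the Xknob slides.
PLM_PORTS = (1, 2, 4)
DMA_WIDTH = (64, 128, 256, 512)
DMA_CHUNK = (256, 512, 1024, 2048)


def enumerate_configs():
    """List every combination of the three Xknobs (assumed all valid)."""
    return [
        XknobConfig(p, w, c)
        for p, w, c in product(PLM_PORTS, DMA_WIDTH, DMA_CHUNK)
    ]


configs = enumerate_configs()
print(len(configs), "candidate configurations")
print(configs[0])
```

Each configuration would correspond to one synthesis run, producing one point in the area versus latency plane.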
Xknob #1: PLM PORTS
• Sets the number of read/write ports of the input/output PLMs
• Higher values of PLM PORTS → more read/write accesses in parallel
• Higher values of PLM PORTS → higher area (more banks)
[Figure: DEBAYER, normalized area versus normalized effective latency, for PLM PORTS = 1, 2, and 4.]
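One common way to obtain several ports from single-port memories is to split the PLM into banks and interleave elements across them. The sketch below assumes such a cyclic interleaving, with one bank per port and one access per bank per cycle; the talk only states that more ports mean more banks, so the mapping and the function names are assumptions.

```python
import math


def bank_of(index: int, plm_ports: int):
    """Map an element index to (bank, offset) under cyclic interleaving."""
    return index % plm_ports, index // plm_ports


def conflict_free(start: int, plm_ports: int) -> bool:
    """Check that plm_ports consecutive elements land in distinct banks."""
    banks = {bank_of(start + k, plm_ports)[0] for k in range(plm_ports)}
    return len(banks) == plm_ports


def access_cycles(n_elements: int, plm_ports: int) -> int:
    """Cycles to touch n consecutive elements, one access per bank per cycle."""
    return math.ceil(n_elements / plm_ports)


for ports in (1, 2, 4):
    print(f"PLM PORTS = {ports}: {access_cycles(1024, ports)} cycles "
          f"to read 1024 consecutive elements")
```

Under these assumptions, doubling PLM PORTS halves the cycles spent on sequential accesses, at the cost of more banks.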
Xknob #2: DMA WIDTH
• Sets the size in bits of the DMA communication channels
• Higher values of DMA WIDTH → higher memory throughput
• Higher values of DMA WIDTH → higher area (more banks, due to the higher number of write/read ports of the input/output PLMs)
[Figure: GRAYSCALE, normalized area versus normalized effective latency, for DMA WIDTH = 64, 128, 256, and 512.]
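The effect of DMA WIDTH on transfer length can be worked out with simple arithmetic. The sketch below assumes 16-bit data elements and a transfer of 1024 elements, both chosen only for illustration, and assumes that one DMA beat moves exactly DMA WIDTH bits.

```python
import math


def words_per_beat(dma_width_bits: int, word_bits: int) -> int:
    """Number of data elements carried by one DMA beat."""
    if dma_width_bits % word_bits != 0:
        raise ValueError("DMA width must be a multiple of the word size")
    return dma_width_bits // word_bits


def beats_for_transfer(n_words: int, dma_width_bits: int, word_bits: int) -> int:
    """Beats needed to move n_words elements over the DMA channel."""
    return math.ceil(n_words / words_per_beat(dma_width_bits, word_bits))


WORD_BITS = 16   # assumed element size, not stated in the slides
N_WORDS = 1024   # assumed transfer length

for width in (64, 128, 256, 512):
    print(f"DMA WIDTH = {width}: "
          f"{words_per_beat(width, WORD_BITS)} words/beat, "
          f"{beats_for_transfer(N_WORDS, width, WORD_BITS)} beats")
```

The words-per-beat figure also suggests why area grows: to absorb a full beat per cycle, the PLM would need that many write ports, which matches the slide's remark about a higher number of ports.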
Xknob #3: DMA CHUNK
• Sets the size of the PLM as a multiple of the stored data type
• Higher values of DMA CHUNK → optimized communication
• Higher values of DMA CHUNK → higher area (for the PLM)
[Figure: two panels, without contention and with contention. Each shows GRAYSCALE with DMA WIDTH = 256 and PLM PORTS = 4/8, normalized area versus normalized effective latency, for DMA CHUNK = 256, 512, 1024, and 2048.]
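A back-of-the-envelope sketch of the DMA CHUNK trade-off follows. It assumes a 128x128 input of 16-bit elements and that the accelerator issues one DMA transaction per chunk; these numbers and the transaction model are illustrative assumptions, not figures from the talk.

```python
import math


def dma_transactions(total_words: int, dma_chunk: int) -> int:
    """Transactions needed if each one moves at most dma_chunk elements."""
    return math.ceil(total_words / dma_chunk)


def plm_bits(dma_chunk: int, word_bits: int) -> int:
    """Storage needed for one PLM holding dma_chunk elements."""
    return dma_chunk * word_bits


TOTAL_WORDS = 128 * 128  # assumed input size
WORD_BITS = 16           # assumed element size

for chunk in (256, 512, 1024, 2048):
    print(f"DMA CHUNK = {chunk}: "
          f"{dma_transactions(TOTAL_WORDS, chunk)} transactions, "
          f"{plm_bits(chunk, WORD_BITS)} bits per PLM")
```

Larger chunks mean fewer transactions, so any fixed per-transaction cost is paid less often, while PLM storage grows linearly with the chunk size.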