
Broadening the Exploration of the Accelerator Design Space in Embedded Scalable Platforms
Luca Piccolboni, Paolo Mantovani, Giuseppe Di Guglielmo, Luca Carloni
Columbia University, New York, NY, USA
IEEE High Performance Extreme Computing Conference (HPEC), 2017

Why Hardware Accelerators?
• High-performance embedded systems are heterogeneous:
  • they include multiple general-purpose processor cores
  • they include special-function hardware accelerators
[Figure: a chip with four processor cores and several hardware accelerators; processor cores are labeled "Generality" and hardware accelerators are labeled "Efficiency".]

Embedded Scalable Platforms (ESP)
• To balance the demand for hardware specialization with the need to maintain helpful degrees of regularity and modularity, we proposed Embedded Scalable Platforms [L. Carloni, "The Case for Embedded Scalable Platforms", DAC 2016]
• System-Level Design with High-Level Synthesis (HLS)
[Figure: generic ESP tile grid with memory controllers, a processor core, I/O and miscellaneous tiles, and hardware accelerator tiles.]
[Figure: ESP instance for WAMI (Wide-Area Motion Imagery), annotated "rapid integration and prototyping": memory controllers, a LEON3 CPU processor core, I/O and miscellaneous tiles, and twelve accelerator tiles designed with HLS: WARP, GRAYSCALE, MATRIX-SUB, MATRIX-RES, SD-UPDATE, GRADIENT, MATRIX-ADD, CHANGE-DET, HESSIAN, DEBAYER, MATRIX-MUL, STEEP-DESC.]

Hardware Accelerators with HLS
• Each accelerator (example: GRAYSCALE) is written as a SystemC specification made of the GRAYSCALE interface and the GRAYSCALE logic
• The accelerator logic is organized into load, compute, and store phases
• Private Local Memories (PLMs): a multi-bank input PLM and a multi-bank output PLM
• High-Level Synthesis (HLS) turns the SystemC specification into RTL; a knob configuration (e.g., knob conf. #1) yields one design point
[Figure: design points plotted as Cost (Area) versus Performance (Latency), marked as Pareto optimal or Pareto dominated.]
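
The Pareto-optimal versus Pareto-dominated distinction in the plot can be illustrated with a minimal sketch. It assumes each synthesized design is summarized by a (latency, area) pair where lower is better on both axes; the function names and the sample values are hypothetical and are not taken from the talk.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (latency, area); both are minimized


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates, sorted by latency."""
    return sorted(p for p in points if not any(dominates(q, p) for q in points))


# Hypothetical normalized (latency, area) pairs, for illustration only.
designs = [(1.0, 2.2), (1.5, 1.6), (2.0, 1.8), (2.5, 1.2), (3.0, 1.0)]
front = pareto_front(designs)
print(front)
```

In this sample, (2.0, 1.8) is dropped because (1.5, 1.6) is both faster and smaller; the remaining points each trade latency for area.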

Standard HLS Knobs
Standard knobs provided by current HLS tools:

Knob                 Settings and Effects
Loop manipulations   Unrolls, pipelines, or breaks the body of loops
Array mappings       Maps arrays to registers or on-chip memories
Clock period         Sets the target clock period for synthesis

• These knobs already enable a rich design-space exploration
• However, they are not sufficient for exploring accelerators
We need other knobs to broaden the exploration

Motivational Example #1
[Figure: DEBAYER, Normalized Area (1.0 to 2.2) versus Normalized Effective Latency (1.0 to 3.0), comparing designs synthesized with the standard knobs and designs synthesized with the proposed knobs. Annotation: bounded by on-chip memory bandwidth.]
• Limiting factor: limited bandwidth to the on-chip memory
• We need knobs to tailor the PLM to the accelerator needs

Motivational Example #2
[Figure: GRAYSCALE, Normalized Area (1.0 to 3.5) versus Normalized Effective Latency (1.0 to 9.0), comparing designs synthesized with the standard knobs and designs synthesized with the proposed knobs. Annotation: bounded by off-chip memory bandwidth.]
• Limiting factor: limited bandwidth to the off-chip memory
• We need knobs to operate on the communication interfaces

Contributions: XKnobs
eXtended Knobs for High-Level Synthesis

XKnob       Settings and Effects
PLM PORTS   Sets the on-chip memory bandwidth
DMA WIDTH   Sets the off-chip memory bandwidth
DMA CHUNK   Sets the size of the input and output PLM
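
To give a sense of how three knobs span a design space, the sketch below enumerates their cross product. The value lists are the ones that appear in the plot legends of the per-knob slides; whether the authors synthesized every combination is not stated in the deck, so the full cross product is an assumption, and the dictionary keys are hypothetical names.

```python
from itertools import product

# Values taken from the plot legends of the per-knob slides.
PLM_PORTS = [1, 2, 4]
DMA_WIDTH = [64, 128, 256, 512]      # bits
DMA_CHUNK = [256, 512, 1024, 2048]   # multiples of the stored data type

# Assumption: every combination is a candidate configuration for HLS.
configs = [
    {"plm_ports": p, "dma_width": w, "dma_chunk": c}
    for p, w, c in product(PLM_PORTS, DMA_WIDTH, DMA_CHUNK)
]

print(len(configs))  # 3 * 4 * 4 candidate configurations
print(configs[0])
```

Each configuration would correspond to one HLS run and therefore to one point in the area versus latency plane.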

XKnob #1: PLM PORTS
• Sets the number of read/write ports of the input/output PLMs
• Higher values of PLM PORTS → more parallel read/write accesses
• Higher values of PLM PORTS → higher area (more banks)
[Figure: DEBAYER, Normalized Area (1.0 to 2.2) versus Normalized Effective Latency (1.0 to 3.0), for PLM PORTS = 1, 2, and 4.]
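
One common way to obtain several ports from single-port memories is to split an array across banks. The sketch below shows cyclic interleaving as an illustration of why more ports imply more banks; the deck does not say which banking scheme the PLMs use, so this scheme and the helper names are assumptions.

```python
from typing import Iterable, Tuple


def bank_of(index: int, num_banks: int) -> Tuple[int, int]:
    """Map an array index to (bank, offset) under cyclic interleaving."""
    return index % num_banks, index // num_banks


def conflict_free(indices: Iterable[int], num_banks: int) -> bool:
    """True if all indices fall in distinct banks, so they can be served together."""
    banks = [i % num_banks for i in indices]
    return len(set(banks)) == len(banks)


# Four consecutive elements land in four different banks.
print(conflict_free(range(4), 4))
# Elements 0 and 4 share bank 0 and would have to be serialized.
print(conflict_free([0, 4], 4))
```

Under this assumption, an unrolled loop that reads consecutive elements can issue one access per bank in the same cycle, at the cost of the extra banks.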

XKnob #2: DMA WIDTH
• Sets the size in bits of the DMA communication channels
• Higher values of DMA WIDTH → higher memory throughput
• Higher values of DMA WIDTH → higher area (more banks, i.e., a higher number of read/write ports of the input/output PLMs)
[Figure: GRAYSCALE, Normalized Area (1.0 to 1.5) versus Normalized Effective Latency (1.0 to 3.5), for DMA WIDTH = 64, 128, 256, and 512.]
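
The effect of DMA WIDTH on throughput can be approximated with simple arithmetic. The sketch below counts how many channel-wide transfers (beats) are needed to move a buffer, assuming an ideal channel that delivers one full-width beat at a time; the 32-bit word size and the buffer length are hypothetical.

```python
def dma_beats(num_words: int, word_bits: int, dma_width: int) -> int:
    """Number of channel-wide beats needed to move num_words words."""
    total_bits = num_words * word_bits
    return -(-total_bits // dma_width)  # ceiling division


# Hypothetical buffer: 1024 words of 32 bits each.
for width in (64, 128, 256, 512):
    print(width, dma_beats(1024, 32, width))
```

Doubling the channel width halves the beat count in this idealized model; real gains depend on the memory system and on contention.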

XKnob #3: DMA CHUNK
• Sets the size of the PLM as a multiple of the stored data type
• Higher values of DMA CHUNK → optimized communication
• Higher values of DMA CHUNK → higher area (for the PLM)
[Figure: GRAYSCALE with DMA WIDTH = 256 and PLM PORTS = 4/8, for DMA CHUNK = 256, 512, 1024, and 2048. Two panels plot Normalized Area (1.0 to 2.4) versus Normalized Effective Latency: without contention (latency 1.0 to 1.3) and with contention (latency 1.0 to 2.2).]
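
The trade-off behind DMA CHUNK can be sketched numerically. The example assumes that each chunk is moved by one DMA transaction and that the PLM holds exactly one chunk; both are simplifying assumptions, and the 32-bit word size and the 512 × 512 input are hypothetical.

```python
def plm_bits(dma_chunk: int, word_bits: int) -> int:
    """Storage needed for one PLM holding dma_chunk words."""
    return dma_chunk * word_bits


def num_transactions(total_words: int, dma_chunk: int) -> int:
    """DMA transactions needed if each one moves a single chunk."""
    return -(-total_words // dma_chunk)  # ceiling division


total_words = 512 * 512  # hypothetical input size
for chunk in (256, 512, 1024, 2048):
    print(chunk, num_transactions(total_words, chunk), plm_bits(chunk, 32))
```

Larger chunks mean fewer transactions to set up, which is one plausible reading of "optimized communication", while the PLM storage grows linearly with the chunk size.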
