Parallel Programming and Heterogeneous Computing
FPGA Accelerators
Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze
Operating Systems and Middleware Group

Introduction: Mapping Workloads to Hardware

Example: Given arrays a, b, and f, calculate r[i] = a[i] × f[i] + b[i] × (1 − f[i])

General Purpose Hardware:

          LD  R0, #0
    loop: LD  R1, [f + R0]
          SUB R2, #1, R1
          LD  R3, [a + R0]
          LD  R4, [b + R0]
          MUL R5, R3, R1
          MUL R6, R4, R2
          ADD R5, R5, R6
          ST  [r + R0], R5
          ADD R0, R0, #1
          BLT R0, #N, loop

[Figure: General Purpose Hardware with Memory, Register, and Execute blocks, stepping through the operations −, ×, ×, + one at a time; the Custom Hardware column is still empty]

ParProg 2020 C3, FPGA Accelerators, Lukas Wenzel, Chart 2.1

Introduction: Mapping Workloads to Hardware

Example: Given arrays a, b, and f, calculate r[i] = a[i] × f[i] + b[i] × (1 − f[i])

[Figure: General Purpose Hardware (the instruction loop above, with Memory, Register, and Execute blocks) next to Custom Hardware (a datapath built from one −, two ×, and one + operator)]
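For reference, the workload in this example can be written as a plain C loop. This is a minimal sketch: the function name `blend` and the choice of `float` elements are assumptions, since the slides do not state an element type.

```c
#include <stddef.h>

/* Element-wise blend: r[i] = a[i] * f[i] + b[i] * (1 - f[i]).
 * The name `blend` and the float element type are assumptions. */
void blend(const float *a, const float *b, const float *f,
           float *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float f_inv = 1.0f - f[i];          /* SUB R2, #1, R1      */
        r[i] = a[i] * f[i] + b[i] * f_inv;  /* MUL, MUL, ADD, ST   */
    }
}
```

Each iteration corresponds to one pass through the instruction loop: three loads, one subtraction, two multiplications, one addition, and one store, plus the loop counter update and branch.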

Introduction: Mapping Workloads to Hardware

■ Truly custom hardware built as Application-Specific Integrated Circuits (ASICs) is extremely expensive to design and manufacture
  ➢ Only feasible for high production volumes
  ➢ Usually requires at least some general-purpose aspects to fit many use-cases
■ Field Programmable Gate Arrays (FPGAs) are manufactured as general-purpose integrated circuits, and thus far less expensive than equivalent ASICs
■ FPGAs can be configured to realize a custom hardware architecture

FPGA Characteristics: Hardware Structure

■ Regular fixed-function integrated circuits implement a single and usually highly optimized hardware architecture (e.g. CPUs, GPUs, …)
■ FPGA fabric is a regular structure of hardware primitives and an interconnect for signal lines
  □ Interconnect can be configured to connect signal lines between primitives
  □ Primitives can be configured to select variations of their basic behavior
  ➢ Appropriate configurations can make the FPGA behave like any custom hardware design (within fabric capacity)

FPGA Characteristics: Hardware Structure

Hardware primitives include:
■ Logic Blocks (CLB) with Flipflops, Lookup Tables, Multiplexers, …
■ Memory Blocks (BRAM) to act as single port, dual port or FIFO memories
■ Arithmetic Blocks (DSP) with hardware multipliers, adders, shifters, …
■ Clock Management Blocks (MMCM) to derive clock signals with specific frequency and phase relations
■ IO Banks with logic for various signaling standards

[Figure: CLB in a Xilinx UltraScale FPGA (from: Xilinx UG 474, Figure 5-1)]

FPGA Characteristics: Hardware Structure

[Figure: Floorplan of a Xilinx Kintex UltraScale XCKU060 FPGA]

FPGA Characteristics: Hardware Structure

Example: Accumulator (2 bit)

[Figure: 2-bit accumulator (acc + in) on an FPGA, mapped to two CLBs with inputs in0, in1, outputs acc0, acc1, four flipflops (FF), and three lookup tables]

Lookup table contents (inputs|output):
  LUT2: 00|0  01|1  10|1  11|0
  LUT2: 00|0  01|0  10|0  11|1
  LUT3: 000|0  001|0  010|0  011|1  100|0  101|1  110|1  111|1
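To make the idea of lookup tables as configurable logic concrete, the following C sketch models a LUT as a small bit vector and builds a 2-bit accumulator from three of them. The helper names (`lut`, `accumulate_step`) and the input packing order are assumptions. The two 2-input tables match the slide (XOR for the low sum bit, AND for the carry). For the 3-input table the sketch uses the full-adder sum function (three-input XOR), which is what addition modulo 4 requires; the third table in the extracted slide text reads differently, so treat that choice as an assumption of this sketch.

```c
#include <stdint.h>

/* A lookup table stored as a bit vector: bit i of `table` is the output
 * for the input pattern whose binary value is i. */
static unsigned lut(uint8_t table, unsigned inputs)
{
    return (table >> inputs) & 1u;
}

#define LUT_XOR2 0x06u  /* 00|0 01|1 10|1 11|0 : low sum bit          */
#define LUT_AND2 0x08u  /* 00|0 01|0 10|0 11|1 : carry out of bit 0   */
#define LUT_XOR3 0x96u  /* 3-input XOR: high sum bit (assumption)     */

/* Combinational logic between the flipflops: computes the value that
 * the acc flipflops take on at the next clock edge, (acc + in) mod 4. */
unsigned accumulate_step(unsigned acc, unsigned in)
{
    unsigned in0  = in & 1u,  in1  = (in >> 1) & 1u;
    unsigned acc0 = acc & 1u, acc1 = (acc >> 1) & 1u;

    unsigned sum0  = lut(LUT_XOR2, (in0 << 1) | acc0);
    unsigned carry = lut(LUT_AND2, (in0 << 1) | acc0);
    unsigned sum1  = lut(LUT_XOR3, (carry << 2) | (in1 << 1) | acc1);

    return (sum1 << 1) | sum0;
}
```

Changing only the table constants changes the function the circuit computes, which is the sense in which FPGA primitives are configured rather than rewired.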

FPGA Characteristics: Performance

■ Fixed-function hardware is rated by maximum operating clock frequency
■ FPGAs have no uniform clock frequency rating:
  □ FPGA fabric supports multiple clock signals in different regions
  □ Specific configurations define combinatorial paths of varying lengths
  ➢ Maximum clock frequency is design specific and constrained by the longest combinatorial path delay
■ Specific primitives like BRAMs can have maximum clock frequency ratings
  □ BRAMs on current Xilinx FPGAs run at up to 800MHz
■ Individual logic delays range from 0.1ns to 0.5ns
  ➢ Small and tightly coupled design sections may run at 1GHz
■ Common frequency for complete designs is 250MHz

FPGA Characteristics: Performance

Example: Accumulator (2 bit)

■ Combinatorial paths begin and end at flipflops
■ Clock period must be longer than the maximum path delay

[Figure: the 2-bit accumulator circuit annotated with per-element delays (+1ns to +3ns) and cumulative signal arrival times of up to 7ns]

Maximum delay: $\max\{t_\delta\} = 7\,\text{ns}$

Clock frequency: $f \le \frac{1}{\max\{t_\delta\}} = 143\,\text{MHz}$
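The relation between path delays and clock frequency can be sketched in C as follows. The function names are hypothetical, and the list of path delays is assumed to come from some timing analysis; the code only takes the maximum and inverts it.

```c
#include <stddef.h>

/* Longest flipflop-to-flipflop path delay in nanoseconds.
 * Returns 0.0 for an empty list. */
double max_path_delay_ns(const double *path_delays_ns, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (path_delays_ns[i] > worst) {
            worst = path_delays_ns[i];
        }
    }
    return worst;
}

/* Upper bound on the clock frequency in MHz for a given worst-case
 * delay in ns (1 / 1 ns = 1000 MHz). Assumes worst_delay_ns > 0. */
double max_clock_mhz(double worst_delay_ns)
{
    return 1000.0 / worst_delay_ns;
}
```

With a worst-case delay of 7ns, as in the accumulator example, `max_clock_mhz(7.0)` gives about 142.9MHz, which the slide rounds to 143MHz.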

FPGA Characteristics: Performance

FPGA designs operate at up to an order of magnitude lower clock frequencies than ASIC accelerators!

How do FPGAs achieve speedups over fixed-function hardware?

➢ Avoid overheads of general-purpose hardware:
  □ CPUs invest a large amount of logic and cycles into fetching and decoding general-purpose instructions
  □ CPUs must accommodate a wide variety of applications by providing a compromise set of execution facilities (e.g. function units, forwarding paths, …)

FPGA Design: Basic Patterns

Any program can be transformed into an equivalent hardware design:
■ Variables and operations are realized in the datapath
■ Control flow is realized through a finite state machine (FSM) controlling the datapath

    int proc(int a, int b, int f)
    {
        int f_inv = 1 - f;
        a *= f;
        b *= f_inv;
        return a + b;
    }

[Figure: datapath with inputs a, b, f, registers rA, rB, rF, rI, the constant 1, operators −, ×, +, and output ret; the FSM drives the datapath through Control Signals and observes it through Status Signals]

FSM states and register transfers:
  S0: rA ← a, rB ← b, rF ← f
  S1: rA ← rA × rF, rI ← 1 − rF
  S2: rB ← rB × rI
  S3: ret ← rA + rB
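The FSM-plus-datapath pattern can be simulated in software to see how the control flow of `proc` turns into one register transfer step per clock cycle. The sketch below assumes the state assignment S0: load rA, rB, rF; S1: rA ← rA × rF and rI ← 1 − rF; S2: rB ← rB × rI; S3: ret ← rA + rB. The type and function names, and the extra `DONE` state that marks completion, are hypothetical additions for the simulation.

```c
/* FSM states; DONE is an assumption added to signal completion. */
typedef enum { S0, S1, S2, S3, DONE } state_t;

/* Datapath registers plus the current FSM state. */
typedef struct {
    state_t state;
    int rA, rB, rF, rI, ret;
} proc_hw_t;

/* Advance the design by one clock cycle. */
void proc_hw_clock(proc_hw_t *hw, int a, int b, int f)
{
    switch (hw->state) {
    case S0:                       /* load inputs into registers */
        hw->rA = a; hw->rB = b; hw->rF = f;
        hw->state = S1;
        break;
    case S1:                       /* multiplier and subtractor busy */
        hw->rA = hw->rA * hw->rF;
        hw->rI = 1 - hw->rF;
        hw->state = S2;
        break;
    case S2:                       /* multiplier reused for b */
        hw->rB = hw->rB * hw->rI;
        hw->state = S3;
        break;
    case S3:                       /* adder produces the result */
        hw->ret = hw->rA + hw->rB;
        hw->state = DONE;
        break;
    case DONE:
        break;
    }
}

/* Clock the design until it finishes; reports the cycle count. */
int proc_hw_run(int a, int b, int f, int *cycles)
{
    proc_hw_t hw = { S0, 0, 0, 0, 0, 0 };
    int n = 0;
    while (hw.state != DONE) {
        proc_hw_clock(&hw, a, b, f);
        n++;
    }
    if (cycles) {
        *cycles = n;
    }
    return hw.ret;
}
```

Under these assumptions one call takes four cycles, and in each cycle most of the datapath operators sit idle.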

FPGA Design: Basic Patterns

Strictly reproducing the original control flow always yields a correct hardware implementation for a program.

! Resulting design is rarely efficient, as original control flow is ignorant of datapath utilization and does not capture data dependencies
➢ Efficient designs leverage pipelining and replication of operations to maximize computational throughput

[Figure: the proc function from the previous chart shown as equivalent (=) to its FSM with states S0 to S3]
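A cycle-level C simulation can illustrate what pipelining and replication buy for the array workload r[i] = a[i] × f[i] + b[i] × (1 − f[i]). The three-stage split used here (load and subtract, two multipliers side by side, add and store) is an assumption chosen for illustration, not a design taken from the slides, and all names are hypothetical.

```c
#include <stddef.h>

/* Pipeline registers between the stages. */
typedef struct { int valid; int a, b, f, f_inv; } stage1_reg_t;
typedef struct { int valid; int pa, pb; } stage2_reg_t;

/* Simulates the pipeline cycle by cycle, writes the results to r,
 * and returns the number of clock cycles needed for n elements. */
size_t blend_pipelined(const int *a, const int *b, const int *f,
                       int *r, size_t n)
{
    stage1_reg_t s1 = {0};
    stage2_reg_t s2 = {0};
    size_t issued = 0, retired = 0, cycles = 0;

    while (retired < n) {
        /* Stages are evaluated last to first, so each stage reads the
         * register contents written in the previous cycle. */

        /* Stage 3: add the two products and store the result. */
        if (s2.valid) {
            r[retired++] = s2.pa + s2.pb;
        }

        /* Stage 2: two replicated multipliers work in parallel. */
        s2.valid = s1.valid;
        if (s1.valid) {
            s2.pa = s1.a * s1.f;
            s2.pb = s1.b * s1.f_inv;
        }

        /* Stage 1: fetch the next element and compute 1 - f. */
        s1.valid = (issued < n);
        if (s1.valid) {
            s1.a = a[issued];
            s1.b = b[issued];
            s1.f = f[issued];
            s1.f_inv = 1 - f[issued];
            issued++;
        }

        cycles++;
    }
    return cycles;
}
```

After two cycles of fill latency the simulated pipeline retires one result per cycle, so n elements take n + 2 cycles. A design that runs a four-state FSM once per element would need 4n cycles for the same work.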
