A GPU-Inspired Soft Processor for High-Throughput Acceleration
Jeffrey Kingyens and J. Gregory Steffan
Electrical and Computer Engineering, University of Toronto
FPGA-Based Acceleration
• In-socket acceleration platforms
  • FPGA and CPU on the same motherboard
  • XtremeData, Nallatech, SGI RASC
• How do we program them?
  • HDL is for experts
  • Behavioural synthesis is limited
[Figure: XtremeData XD1000 FPGA module]
Can we provide a more familiar programming model?
Potential Solution: Soft Processors
• Advantages of soft processors:
  • Familiar, portable, customizable
• Our goal: develop a new soft-processor architecture that:
  • Excels at high-throughput workloads
  • Naturally keeps the datapath highly utilized
• Challenges:
  • Memory latency
  • Pipeline latency and hazards
  • Exploiting parallelism
  • Scaling
Inspiration: GPU Architecture
• Multithreading
  • Tolerates memory and pipeline latencies
• Vector instructions
  • Data-level parallelism, scaling
• Multiple processors
  • Scaling
Long-term goal: an FPGA-specific design using the above
This work: an FPGA implementation of a GPU
Overview
• A GPU-based system
  • NVIDIA's Cg
  • AMD's CTM r5xx ISA
• A GPU-inspired architecture
  • Overcoming port limitations
  • Avoiding stalls
• Preliminary results
  • Simulation based on the XtremeData XD1000
A GPU-Based System
GPU Shader Processors
[Figure: shader processor datapath — a coordinate generator supplies output coordinates (Xo, Yo); the shader program fetches from input buffers (indexed by n, x, y), reads constant and general-purpose registers via the register file, and writes its result to the output buffer]
Separate input/output buffers simplify memory coherence
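To make this execution model concrete, here is a minimal C++ sketch (my own illustration, not from the talk): a coordinate generator walks the output buffer and invokes the same shader kernel once per output element, reading only from the input buffers and writing only to the output buffer. All names (shade, run_pass, and so on) are hypothetical.

    #include <cstddef>
    #include <vector>

    struct Vec4 { float x, y, z, w; };

    using InputBuffer  = std::vector<Vec4>;   // read-only during a pass
    using OutputBuffer = std::vector<Vec4>;   // write-only during a pass

    // The "shader program": a pure function of its coordinate and inputs.
    Vec4 shade(std::size_t x, std::size_t y, std::size_t width,
               const InputBuffer& A, const InputBuffer& B) {
        const Vec4& a = A[y * width + x];
        const Vec4& b = B[y * width + x];
        return { a.x * b.x + 1.0f, a.y * b.y + 1.0f,
                 a.z * b.z + 1.0f, a.w * b.w + 1.0f };
    }

    // The "coordinate generator": one shader invocation per output element,
    // so no two invocations ever write the same location.
    void run_pass(std::size_t width, std::size_t height,
                  const InputBuffer& A, const InputBuffer& B,
                  OutputBuffer& out) {
        for (std::size_t y = 0; y < height; ++y)
            for (std::size_t x = 0; x < width; ++x)
                out[y * width + x] = shade(x, y, width, A, B);
    }

    int main() {
        const std::size_t W = 4, H = 4;
        InputBuffer A(W * H, Vec4{1, 2, 3, 4}), B(W * H, Vec4{2, 2, 2, 2});
        OutputBuffer out(W * H);
        run_pass(W, H, A, B, out);
        return 0;
    }

Because each invocation owns exactly one output location, the memory system never has to reconcile conflicting writes, which is the coherence simplification the slide points to.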
NVIDIA's Cg Language (C-like)
Cg shader program:

    struct data_out {
        float4 sum : COLOR;
    };

    data_out multadd(float2 coord : TEXCOORD0,
                     uniform sampler2D A : TEXUNIT0,
                     uniform sampler2D B : TEXUNIT1)
    {
        data_out r;
        float4 offset = {1.0f, 1.0f, 1.0f, 1.0f};
        r.sum = tex2D(A, coord) * tex2D(B, coord) + offset;
        return r;
    }

Element-wise matrix-matrix multiplication plus an offset
AMD's CTM r5xx ISA (simplified)

    multadd:
        TEX r1, r0, s1       ; load: sample s1 at the coordinate in r0
        TEX r0, r0, s0       ; load: sample s0 at the coordinate in r0
        MAD o0, r1, r0, c0   ; ALU op: dest o0, source regs A, B, C
        END

Each register is a 4-element vector
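For readers unfamiliar with the ISA, the following C++ sketch (mine, not the talk's) shows what one thread's execution of the multadd program does: two texture loads followed by a lane-by-lane multiply-add on 4-element vector registers. The tex_fetch helper is a stand-in for the real TEX behaviour, simplified to a flat index.

    #include <array>
    #include <cstdio>

    using Reg = std::array<float, 4>;   // each register is a 4-element vector

    // Stand-in for a TEX instruction: fetch a 4-element texel from input
    // buffer `tex` at the coordinate held in a register.
    Reg tex_fetch(const Reg* tex, const Reg& coord) {
        return tex[static_cast<int>(coord[0])];
    }

    // One thread's execution of the multadd program.
    Reg multadd_thread(const Reg* s0, const Reg* s1, const Reg& c0, Reg coord) {
        Reg r0 = coord;
        Reg r1 = tex_fetch(s1, r0);          // TEX r1, r0, s1
        r0     = tex_fetch(s0, r0);          // TEX r0, r0, s0
        Reg o0;
        for (int lane = 0; lane < 4; ++lane) // MAD o0, r1, r0, c0
            o0[lane] = r1[lane] * r0[lane] + c0[lane];
        return o0;                           // END: o0 goes to the output buffer
    }

    int main() {
        Reg s0[1] = {{1, 2, 3, 4}}, s1[1] = {{2, 2, 2, 2}};
        Reg c0 = {1, 1, 1, 1}, coord = {0, 0, 0, 0};
        Reg out = multadd_thread(s0, s1, c0, coord);
        std::printf("out.x = %f\n", out[0]);   // 1*2 + 1 = 3
        return 0;
    }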
A GPU-Inspired Architecture
Soft Processor Architecture
[Figure: soft processor block diagram — HyperTransport slave and master FIFOs, HT config, coordinate generator, a register file built from FPGA block RAMs (which have only 2 ports!), operand fetch for sources A, B, C, the ALU pipeline (64 cycles!), TEX fetches over HyperTransport (305 cycles!), and an output register]
Must tolerate port limitations and latencies
Overcoming Port Limitations
• Problem: a central register file
  • Needs four reads and two writes per cycle
  • FPGA block RAMs have only two ports
• Solution: exploit the symmetry of threads
  • Symmetry: every thread executes the same instruction sequence
  • Group threads into batches of four
  • Fetch operands across a batch in lock-step (sketched after the transposed register file figure below)
Only read one operand per thread per cycle
Reading Operands Across a Batch
A batch of four threads, each executing the same instruction of the same program at the same time:

    multadd:
        TEX r1, r0, s1
        TEX r0, r0, s0
        MAD o0, r1, r0, c0
        END

Three cycles to read an instruction's operands: 1) read the A's, 2) read the B's, 3) read the C's
Only read one operand per thread per cycle
Transposed RegFile Access
[Figure: threads T0-T3 each map to their own register-file bank (RF0-RF3); on successive cycles every bank reads the same operand slot (A, then B, then C) for its thread]
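A minimal C++ sketch of this transposed access pattern (the names and the single-read-port-per-cycle model are my assumptions, not the talk's): each thread in a batch owns one block-RAM bank, and on each cycle every bank performs exactly one read, of the same operand slot for its thread.

    #include <array>
    #include <cstdio>

    constexpr int kBatch     = 4;   // threads per batch
    constexpr int kRegisters = 8;   // registers per thread (illustrative)

    struct Vec4 { float v[4]; };

    // One block-RAM bank per thread in the batch; only a single read per
    // bank is modelled per cycle.
    using Bank = std::array<Vec4, kRegisters>;

    // One cycle: read the same register slot for all four threads,
    // using exactly one read port on each bank.
    std::array<Vec4, kBatch> read_operand(const std::array<Bank, kBatch>& rf,
                                          int reg_index) {
        std::array<Vec4, kBatch> out{};
        for (int t = 0; t < kBatch; ++t)
            out[t] = rf[t][reg_index];
        return out;
    }

    int main() {
        std::array<Bank, kBatch> rf{};        // transposed register file
        // Three cycles gather the three source operand slots (A, B, C)
        // of one instruction for the whole batch.
        const int srcA = 1, srcB = 0, srcC = 2;   // illustrative register indices
        auto A = read_operand(rf, srcA);      // cycle 1
        auto B = read_operand(rf, srcB);      // cycle 2
        auto C = read_operand(rf, srcC);      // cycle 3
        std::printf("gathered %zu+%zu+%zu operands over 3 cycles\n",
                    A.size(), B.size(), C.size());
        return 0;
    }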
Avoiding ALU Pipeline Bubbles
• Problem: long pipeline and memory latencies
  • Frequent stalls lead to an underutilized ALU datapath
• Solution: exploit the abundance of threads
  • Store contexts for multiple batches of threads
  • Issue instructions from different batches to hide latencies (a toy scheduling sketch follows the next slide)
  • Requires logic to issue from and manage batches
How many batches do we need to avoid bubbles?
Issuing from Multiple Batches
[Figure: instructions from batches 0-3 are issued in turn into the ALU pipeline, interleaved so the pipeline stays busy]
Ideally the ALU is fully utilized
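A rough sketch of the idea (my own simplification, not the talk's actual issue logic): keep a ready time per batch context and issue each cycle from whichever batch has its dependences resolved, so batches stalled on memory or the ALU pipeline never block the issue slot. With only a few contexts and the latencies from the architecture slide, utilization stays low, which is exactly why many batch contexts are needed.

    #include <array>
    #include <cstdio>

    constexpr int kBatches = 4;              // hardware batch contexts (illustrative)

    // Per-batch context: the cycle at which its next instruction is ready.
    struct BatchContext {
        long ready_at = 0;
        int  pc = 0;
    };

    int main() {
        std::array<BatchContext, kBatches> batch{};
        constexpr long kAluLatency = 64;     // latencies from the architecture slide
        constexpr long kMemLatency = 305;

        int last_issued = kBatches - 1;
        long busy_cycles = 0, total_cycles = 2000;

        for (long cycle = 0; cycle < total_cycles; ++cycle) {
            // Round-robin: pick the next batch whose dependences are resolved.
            for (int i = 1; i <= kBatches; ++i) {
                int b = (last_issued + i) % kBatches;
                if (batch[b].ready_at <= cycle) {
                    // Pretend every third instruction is a TEX, the rest ALU ops.
                    bool is_tex = (batch[b].pc % 3 == 0);
                    batch[b].ready_at = cycle + (is_tex ? kMemLatency : kAluLatency);
                    batch[b].pc++;
                    last_issued = b;
                    busy_cycles++;
                    break;                   // one issue slot per cycle
                }
            }
        }
        std::printf("ALU utilization: %.1f%%\n",
                    100.0 * busy_cycles / total_cycles);
        return 0;
    }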
Methodology and Results
Simulation Methodology
• SystemC-based simulation
  • Parameterized to model the XtremeData XD1000
  • Assumes a conservative 100 MHz soft-processor clock
  • Cycle-accurate at the block interfaces
  • Models HyperTransport (bandwidth and latency)
    • Currently 8-bit HT; the platform is capable of 16-bit HT
• Benchmarks
  • photon: Monte Carlo heat-transfer simulation (ALU-intensive)
  • matmatmult: dense matrix multiplication (memory-intensive)
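As a flavour of how such an interconnect model can work, here is a generic C++ sketch (not the authors' SystemC code, and with purely illustrative parameters): a link is characterized by a fixed round-trip latency plus a per-cycle byte budget, so back-to-back requests serialize on the wire and the model becomes bandwidth-bound once requests arrive faster than they can stream.

    #include <algorithm>
    #include <cstdio>

    // Generic latency + bandwidth link model.
    struct LinkModel {
        long latency;            // fixed round-trip latency, in cycles
        long bytes_per_cycle;    // bandwidth limit
        long link_free = 0;      // cycle at which the wire is next available
        long completed_bytes = 0;

        // Returns the cycle at which the request's data has fully arrived.
        long issue(long now, long bytes) {
            long transfer = (bytes + bytes_per_cycle - 1) / bytes_per_cycle;
            long start    = std::max(now, link_free);
            link_free     = start + transfer;
            completed_bytes += bytes;
            return start + transfer + latency;
        }
    };

    int main() {
        // Illustrative numbers only; the real HT link width and clocking differ.
        LinkModel ht{305, 2};
        long last_done = 0;
        for (long cycle = 0; cycle < 1000; cycle += 4)
            last_done = ht.issue(cycle, 16);    // a 16-byte read every 4 cycles
        std::printf("last request completes at cycle %ld after %ld bytes\n",
                    last_done, ht.completed_bytes);
        return 0;
    }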
ALU Utilization (8-bit HT)
[Figure: ALU utilization (%) for photon versus the number of hardware batch contexts (1 to 64), with cycles broken down into ALU utilized, not-ALU, memory stalls, and data-hazard stalls]
ALU Utilization (8-bit HT)
[Figure: the same breakdown for matmatmult]
Matmatmult is bottlenecked on memory bandwidth
ALU Utilization (16-bit HT)
[Figure: the same breakdown with a 16-bit HyperTransport link]
32 batches are sufficient
Conclusions
• A GPU-inspired soft processor architecture
  • Exploits multithreading and vector operations
• Thread symmetry and batching allow:
  • Tolerating limited block-RAM ports
  • Tolerating long memory and pipeline latencies
• 32 batches are sufficient
  • To achieve 100% ALU utilization
• Future work:
  • Customize the programming model and architecture to FPGAs
  • Exploit longer vectors, multiple processors, custom operations