Generic Polyphase Filterbanks with CUDA
Jan Krämer
DLR German Aerospace Center, Communication and Navigation, Satellite Networks
Weßling, 04.02.2017
Knowledge for Tomorrow

www.dlr.de · Slide 1 of 27 > Generic Polyphase Filterbanks with CUDA > Jan Krämer > 04.02.2017

Outline
1. Motivation
2. Short introduction to CUDA
3. PFBs and the Channelizer
4. Translation to CUDA
5. Results
6. Release plans and future changes

Slide 2 of 27: Once upon a time in a space project
- Multicarrier scheme with 15/30/45 carriers
- So let’s just use a PFB, right?

Slide 3 of 27: Early Trouble
- 45 carriers means 45x the bandwidth
- Only 12-15 % guard band available
- At least 3x oversampling needed
- Filters with up to 1500 taps needed

Slide 4 of 27: Early Trouble
- CPU reference implementation
- 1000 taps
- 35 dB rejection
- Originally 9x oversampling
- 2 Msamples/second achieved ⇒ 4 Msamples/second needed

Slide 5 of 27: What is CUDA
- NVIDIA's framework for GPGPU
- Used mainly to accelerate scientific computing
- Uses the large number of compute cores available inside a GPU

Slide 6 of 27: GPU Interior
- A GPU consists of several Streaming Multiprocessors (SM)
- Each SM consists of numerous compute or CUDA cores
- Single-Instruction Multiple-Threads (SIMT) structure
- Several kinds of memory
  - Global Memory (GDDR5 RAM) (slow)
  - On-Chip (shared) Memory per SM (faster)
  - Registers (blazingly fast)

Slide 7 of 27: CUDA Interior
- Builds an (up to) 3-dimensional Grid
- The Grid contains the (up to) 3-dimensional Thread Blocks, which contain the threads
- Threads inside a Thread Block are grouped in sets of 32 ⇒ Warp

Slide 8 of 27: CUDA Interior
[Figure not preserved in the transcript]

Slide 9 of 27: Thread Execution
- Each Block has a unique ID inside the Grid ⇒ each thread has a unique global ID
- The thread scheduler assigns each Thread Block to one SM, where it is executed concurrently with other blocks
- All threads in a Warp are executed concurrently inside the SM
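
The ID arithmetic can be sketched in a few lines of Python. This is only an illustration for a one-dimensional launch; the function names are hypothetical and simply mirror CUDA's built-in `blockIdx.x`, `blockDim.x` and `threadIdx.x`.

```python
WARP_SIZE = 32  # threads per warp, as stated on the slides


def global_thread_id(block_idx, block_dim, thread_idx):
    """Unique global ID of a thread in a 1-D grid of 1-D blocks."""
    return block_idx * block_dim + thread_idx


def warp_of(thread_idx):
    """Index of the warp (inside its block) that a thread belongs to."""
    return thread_idx // WARP_SIZE


# Hypothetical launch: 4 blocks of 128 threads each.
ids = [global_thread_id(b, 128, t) for b in range(4) for t in range(128)]
```

Every (block, thread) pair maps to a distinct ID, which is what lets each thread pick its own element of the input or output buffer.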

Slide 10 of 27: Performance Bottlenecks
- Uncoalesced loads from global memory ⇒ several cache lines have to be loaded
- Bank conflicts when accessing shared memory
- Branching ⇒ which instruction should be executed?
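
The first two bottlenecks can be made concrete with a small counting model in Python. It is a rough sketch, not a hardware simulator: the 128-byte cache line, 4-byte elements and 32 shared-memory banks are assumptions typical of NVIDIA GPUs, and the function names are hypothetical.

```python
CACHE_LINE_BYTES = 128  # assumed cache-line size
ELEM_BYTES = 4          # assumed 32-bit elements
NUM_BANKS = 32          # assumed number of shared-memory banks
WARP_SIZE = 32


def cache_lines_touched(stride):
    """Distinct cache lines one warp touches when thread t reads element t * stride."""
    lines = {(t * stride * ELEM_BYTES) // CACHE_LINE_BYTES for t in range(WARP_SIZE)}
    return len(lines)


def bank_conflict_degree(stride):
    """Largest number of threads of one warp that land in the same bank (stride >= 1)."""
    hits = {}
    for t in range(WARP_SIZE):
        bank = (t * stride) % NUM_BANKS
        hits[bank] = hits.get(bank, 0) + 1
    return max(hits.values())
```

Under these assumptions a unit stride needs one cache line and causes no bank conflict, while a stride of 32 elements needs 32 cache lines and serialises all 32 shared-memory accesses.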

Slide 11 of 27: Why PFBs and Channelizers/Synthesizers?
- Used to reduce computational complexity for resampling filters
- Used to separate small bandwidth channels
- Used to generate multicarrier ’broadband’ signals

Slide 13 of 27: Structure of a PFB Channelizer
- Extracting a channel with 1/N of the total bandwidth
  - Mix signal to baseband
  - Apply anti-alias filter
  - Downsample the signal
- An N-phase PFB splits the one-dimensional filter into its N different phase shares
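
The three steps for a single channel can be sketched directly in Python. This is a minimal illustration under stated assumptions, not the implementation from the talk: the function name, the zero initial filter history and the crude moving-average taps are all hypothetical.

```python
import cmath


def extract_channel(x, taps, n_channels, k):
    """Mix channel k to baseband, low-pass filter it, keep every n_channels-th sample."""
    # 1. Mix: multiply by exp(-j*2*pi*k*n/N) to shift channel k to 0 Hz.
    mixed = [s * cmath.exp(-2j * cmath.pi * k * n / n_channels) for n, s in enumerate(x)]
    out = []
    # 2. + 3. Filter and downsample: evaluate the FIR only at the outputs that are kept.
    for n in range(0, len(mixed), n_channels):
        acc = 0j
        for i, h in enumerate(taps):
            if n - i >= 0:  # samples before the start of the stream count as zero
                acc += h * mixed[n - i]
        out.append(acc)
    return out


N = 4
tone = [cmath.exp(2j * cmath.pi * 1 * n / N) for n in range(16)]  # tone centred in channel 1
taps = [1 / 8] * 8  # crude moving-average low-pass, for illustration only
in_band = extract_channel(tone, taps, N, k=1)
out_of_band = extract_channel(tone, taps, N, k=2)
```

Once the filter has filled up, `in_band` settles at 1 and `out_of_band` at 0. Running this once per channel repeats the full-length filter N times, which is the cost the polyphase structure avoids.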

Slide 17 of 27: Structure of a PFB Channelizer
- Taps of the regular prototype filter
- Split into 4 polyphase partitions
- Newly structured dataflow
- FFT separates all the channels
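
The partitioning and the transform stage can be sketched in Python as follows. This is an illustrative, maximally decimated channelizer under assumed conventions: names are hypothetical, the filter history starts at zero, and the transform is written out as a plain sum where a real implementation would call an FFT. With the indexing chosen here the exponent is positive (an unnormalised inverse DFT); other commutator conventions reorder the branches and use a forward FFT instead.

```python
import cmath


def polyphase_partitions(taps, n):
    """Partition p holds the prototype taps h[p], h[p + n], h[p + 2n], ..."""
    return [taps[p::n] for p in range(n)]


def channelize(x, taps, n):
    """Return outputs[m][k]: sample m of channel k, for all n channels at once."""
    parts = polyphase_partitions(taps, n)
    outputs = []
    for m in range((len(x) + n - 1) // n):
        # Polyphase filtering: branch p sees the input samples x[(m - r) * n - p].
        v = []
        for p, h_p in enumerate(parts):
            acc = 0j
            for r, h in enumerate(h_p):
                idx = (m - r) * n - p
                if idx >= 0:  # samples before the start of the stream count as zero
                    acc += h * x[idx]
            v.append(acc)
        # Transform across the branches separates the channels.
        outputs.append([
            sum(v[p] * cmath.exp(2j * cmath.pi * k * p / n) for p in range(n))
            for k in range(n)
        ])
    return outputs


N = 4
tone = [cmath.exp(2j * cmath.pi * 1 * n / N) for n in range(16)]  # tone centred in channel 1
taps = [1 / 8] * 8  # crude moving-average prototype, for illustration only
channels = channelize(tone, taps, N)
```

Each prototype tap is now applied once per output frame, no matter how many channels are extracted. After the filter has filled up, the tone appears only in channel 1.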

Slide 18 of 27: Structure of a PFB Channelizer
- Oversampling can be achieved by manipulating the input commutator and the FFT input
- To synthesize several incoming channels, just reorder the operations

Slide 19 of 27: Identifying necessary operations
- The channelizer consists of 4 operations
  - Shuffle the input stream
  - Polyphase filtering
  - FFT
  - Shuffle the output stream

Slide 20 of 27: Input Shuffling
- Input commutator implemented as a matrix traversal
- The number of threads needs to accommodate the filter history ⇒ the Grid dimension takes care of this
- Input buffer reads are coalesced ⇒ Block x-dimension has the same size as a polyphase partition
- Intermediate buffer writes are therefore not coalesced
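
One way to picture the commutator as a matrix traversal is the Python sketch below. The indexing convention, the zero initial history and the function name are assumptions made for illustration; the layout used in the actual CUDA kernels may differ.

```python
def shuffle_input(x, n):
    """Arrange a 1-D stream as an n-row matrix; row p feeds polyphase branch p."""
    cols = len(x) // n
    # Branch p receives x[m * n - p]; samples before the start of the stream are zero.
    return [
        [x[m * n - p] if m * n - p >= 0 else 0 for m in range(cols)]
        for p in range(n)
    ]


rows = shuffle_input([1, 2, 3, 4, 5, 6, 7, 8], 4)
```

Neighbouring input samples end up in different rows, so a kernel that reads the stream contiguously has to write with a stride into the intermediate buffer, which is the read/write trade-off the slide describes.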

Slide 21 of 27: Filter Operations
- Block X dimension computes several input samples
- Block Y dimension computes oversampled output samples
- Grid X dimension represents the polyphase partitions
- Grid Y dimension provides additional concurrency (due to block thread limits)

Slide 22 of 27: Filter Operations
- Each thread block transfers its data from global memory to shared memory
- Each sample is accessed several times ⇒ shared memory offers faster memory transfers
- Register and shared memory spills are avoided
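
The benefit of staging can be estimated with a short calculation, sketched here in Python. It assumes one thread block computes consecutive outputs of a single polyphase branch; the function name and the example numbers are hypothetical.

```python
def global_reads(block_outputs, taps_per_partition):
    """Return (reads without staging, reads with staging) for one thread block."""
    # Without staging, every tap of every output is fetched from global memory.
    naive = block_outputs * taps_per_partition
    # With staging, each distinct input sample is loaded once into shared memory.
    staged = block_outputs + taps_per_partition - 1
    return naive, staged


naive, staged = global_reads(256, 34)
```

For 256 outputs and 34 taps per partition this gives 8704 global reads without staging against 289 with it. The remaining accesses are served from shared memory.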

Slide 23 of 27: FFT and Output Shuffling
- The FFT uses cuFFT, CUDA's FFT library
- Output shuffling is implemented as a double loop on the host CPU (for now)
