Bifrost: Easy GPU Pipeline Development - PowerPoint PPT Presentation

Bifrost: Easy GPU Pipeline Development
github.com/ledatelescope/bifrost
• Presenter: Miles Cranmer (CfA/McGill)
• On behalf of: Ben Barsdell (NVIDIA), Danny Price (Berkeley), Jayce Dowell (UNM), Hugh Garsden (CfA), Frank Schinzel (NRAO), Greg Taylor (UNM), Lincoln Greenhill (CfA)
8/14/17 Miles Cranmer 1

Stream-processing and real-time GPU computing
• Stream-processing: operating on data which is potentially unlimited in extent
  • E.g., time stream of digitized voltages
• Nontrivial for CPU/GPU systems:
  • Creation of data structures for buffer memory management, packet capture
  • Additional complexities for asynchronous copies and kernel execution
  • Manual parallelization/core binding of algorithms and pipelines
  • Potential issues include memory leaks and race conditions
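To make "potentially unlimited in extent" concrete, here is a minimal sketch in plain Python (not Bifrost code) of consuming a stream in fixed-size gulps, so that memory use stays bounded however long the stream runs. The names `voltage_stream`, `gulps`, and `mean_power` are hypothetical and chosen only for this illustration.

```python
import itertools
import math

def voltage_stream():
    """Hypothetical endless source of digitized voltage samples."""
    for n in itertools.count():
        yield round(100 * math.sin(0.1 * n))

def gulps(stream, gulp_size):
    """Yield successive fixed-size lists of samples taken from an iterator."""
    while True:
        gulp = list(itertools.islice(stream, gulp_size))
        if len(gulp) < gulp_size:
            return  # source ended; drop the incomplete tail
        yield gulp

def mean_power(gulp):
    """Average of the squared samples in one gulp."""
    return sum(v * v for v in gulp) / len(gulp)

# Process only the first three gulps of an unbounded stream.
first_three = itertools.islice(gulps(voltage_stream(), 4096), 3)
powers = [mean_power(g) for g in first_three]
```

Only one gulp is held in memory at a time, which is the property a streaming framework has to preserve across threads and devices.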

Bifrost is deployed in the wild:
• Backend for newest LWA station in NM
• Bifrost-powered data capture for live all-sky image
  • Google: “LWA TV 2”
• Pulsar detection:
  • Validation timing within 0.0001 ms of canonical for PSR B0834+06 (well within 1σ of measurement)

Bifrost core concepts
• Blocks
  • Independent thread
  • “Black box” algorithm
• Ring buffers (Rings)
  • Emulates wrap-around in memory
• Memory spaces
  • Rings assigned to specific “space”
• Pipelines
  • Combination of the above

The Bifrost framework
• Python frontend wraps fast C/C++/CUDA backend
• Frontend:
  • Blocks and Pipelines are Python object abstractions for the backend
  • ND-array object for memory management (span of ring buffer)
  • ctypes wraps all C calls
• Backend:
  • Common type definitions and “BFarray” generic data structure
  • “Ring buffer” used for inter-block communication
  • Several common modules implemented

Ring buffer implementation
• What’s unique?
  • Multiple readers, single writer ⇒ branched pipelines OK
  • Thread safe
  • Allocated in system (CPU), cuda (GPU), or cuda_host (pinned CPU) memory
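The single-writer, multi-reader idea can be sketched with a toy ring in plain Python. This is not Bifrost's implementation: the class `Ring`, its methods, and the choice to raise an error when a slow reader is overwritten are all assumptions made for illustration.

```python
import threading

class Ring:
    """Toy single-writer, multi-reader ring buffer (illustrative only)."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0  # total number of items written so far
        self.cond = threading.Condition()

    def write(self, item):
        with self.cond:
            self.buf[self.head % self.capacity] = item  # wrap around in memory
            self.head += 1
            self.cond.notify_all()  # wake every waiting reader

    def read(self, index, timeout=None):
        """Return item number `index`, blocking until it has been written."""
        with self.cond:
            if not self.cond.wait_for(lambda: self.head > index, timeout):
                raise TimeoutError("item not written yet")
            if self.head - index > self.capacity:
                raise LookupError("item was overwritten")
            return self.buf[index % self.capacity]

ring = Ring(capacity=8)
results = {}

def reader(name):
    # Each reader keeps its own position, so readers never disturb each other.
    results[name] = [ring.read(i, timeout=5) for i in range(4)]

threads = [threading.Thread(target=reader, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for sample in [10, 20, 30, 40]:
    ring.write(sample)
for t in threads:
    t.join()
```

Because every reader tracks its own index, two downstream consumers can read the same data independently, which is what makes a branched pipeline possible.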

API example 1: block

    from copy import deepcopy

    import bifrost as bf
    from bifrost.pipeline import TransformBlock

    class QuantizeBlock(TransformBlock):
        def __init__(self, iring, dtype, scale=1., *args, **kwargs):
            TransformBlock.__init__(self, iring, *args, **kwargs)
            self.dtype = dtype
            self.scale = scale

        def on_sequence(self, isequence):
            # Output header is the input header with the new data type
            ohdr = deepcopy(isequence.header)
            ohdr['_tensor']['dtype'] = self.dtype
            return ohdr

        def on_data(self, ispan, ospan):
            # Quantize one span of input data into the output span
            bf.quantize.quantize(ispan.data, ospan.data, self.scale)
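The split between per-sequence and per-span work can be tried without a GPU using a stand-in class. The sketch below is not the Bifrost base class: `ToyQuantizeBlock` is hypothetical, it works on Python lists, and its scale, round, and clip rule is an assumption about what an 8-bit quantizer does rather than a statement of Bifrost's exact behaviour.

```python
from copy import deepcopy

class ToyQuantizeBlock:
    """Stand-in that mimics the on_sequence/on_data split (illustrative only)."""

    def __init__(self, dtype, scale=1.0):
        self.dtype = dtype
        self.scale = scale

    def on_sequence(self, iheader):
        # Called once per sequence: derive the output header from the input one.
        ohdr = deepcopy(iheader)
        ohdr['_tensor']['dtype'] = self.dtype
        return ohdr

    def on_data(self, idata):
        # Called once per gulp: scale, round, and clip to the signed 8-bit range.
        return [max(-128, min(127, round(x * self.scale))) for x in idata]

block = ToyQuantizeBlock('i8', scale=2.0)
iheader = {'name': 'obs1', '_tensor': {'dtype': 'f32'}}
oheader = block.on_sequence(iheader)
odata = block.on_data([0.4, -1.6, 300.0, -999.0])
```

The deep copy matters: the input header stays untouched for any other block reading the same sequence, while the output header advertises the new data type downstream.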

API example 2: pipeline

    import bifrost as bf

    bc = bf.BlockChainer()
    # Read in file
    bc.blocks.read_wav(['audio_file.wav'], gulp_nframe=4096)
    # Copy to GPU
    bc.blocks.copy(space='cuda')
    # Split the time axis into blocks of 256 samples
    bc.views.split_axis('time', 256, label='fine_time')
    # FFT
    bc.blocks.fft(axes='fine_time', axis_labels='freq')
    # Square modulus
    bc.blocks.detect(mode='scalar')
    # Transpose
    bc.blocks.transpose(['time', 'pol', 'freq'])
    # Copy back to CPU
    bc.blocks.copy(space='cuda_host')
    # Convert to 8-bit integer
    bc.blocks.quantize('i8')
    # Save
    bc.blocks.write_sigproc()
    # Run the pipeline
    pipeline = bf.get_default_pipeline()
    pipeline.shutdown_on_signals()
    pipeline.run()
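The numerical core of this chain (split the time axis, Fourier transform each block, take the square modulus) can be followed in plain Python. This is a CPU sketch with a naive DFT standing in for the GPU FFT; the helper names `split_axis`, `dft`, and `detect` mirror the pipeline steps but are hypothetical functions, not Bifrost calls.

```python
import cmath

def split_axis(samples, n):
    """Reshape a 1-D time series into rows of n 'fine_time' samples."""
    usable = len(samples) - len(samples) % n
    return [samples[i:i + n] for i in range(0, usable, n)]

def dft(row):
    """Naive O(n^2) discrete Fourier transform (stand-in for a GPU FFT)."""
    n = len(row)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(row))
            for k in range(n)]

def detect(spectrum):
    """Square modulus of each complex frequency channel."""
    return [abs(z) ** 2 for z in spectrum]

# Complex tone at 2 cycles per 8-sample block, 4 blocks long.
n = 8
signal = [cmath.exp(2j * cmath.pi * 2 * t / n) for t in range(4 * n)]

# Result is indexed as power[time][freq].
power = [detect(dft(row)) for row in split_axis(signal, n)]
peak_channels = [max(range(n), key=row.__getitem__) for row in power]
```

Every block of the tone peaks in frequency channel 2, so `peak_channels` is the same for each coarse time step. In the pipeline above the same operations run on the GPU between the two copy blocks.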

bf.map
• Easy CUDA kernel generation from Bifrost
• JIT compiler uses NVRTC

    import bifrost as bf

    # Create two input arrays on the GPU, a and b, and an empty output array c
    a = bf.ndarray([1,2,3,4,5], space='cuda')
    b = bf.ndarray([1,0,1,0,1], space='cuda')
    c = bf.empty(5, space='cuda')
    # Add a and b together element-wise
    bf.map("c = a + b", data={'c': c, 'a': a, 'b': b})

bf.map
• Explicit indexing also supported. Outer product:

    bf.map("c(i,j) = a(i) * b(j)",
           {'c': c, 'a': a, 'b': b},
           axis_names=('i','j'))
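For readers without a GPU at hand, the two expressions can be read as the following plain-Python loops. This is only a sketch of what the expression strings mean, using lists in place of GPU arrays; it says nothing about how the generated kernel is organized.

```python
a = [1, 2, 3, 4, 5]
b = [1, 0, 1, 0, 1]

# "c = a + b": the expression is applied independently to each element.
c_sum = [x + y for x, y in zip(a, b)]

# "c(i,j) = a(i) * b(j)": one output element per (i, j) index pair,
# so the output has shape (len(a), len(b)).
c_outer = [[a[i] * b[j] for j in range(len(b))] for i in range(len(a))]
```

Here `c_sum` is `[2, 2, 4, 4, 6]`, and the last row of `c_outer` is `[5, 0, 5, 0, 5]`. Note that the outer product needs a two-dimensional output array, unlike the five-element `c` used for the sum.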

Why Bifrost?

Why Bifrost? Astronomy-specific
• Bifrost developed in parallel with LWA-SV, driven by radio astronomy applications
  • ⇒ Core structural advantages for astronomy
• Ring features
  • Metadata describes the units of ring buffer dimensions; used in algorithms (e.g., dedispersion)
  • Multi-sequence ring buffers, useful for different observations. The metadata will propagate down the pipeline.
  • Time-tagged sequences in ring buffers ⇒ can dump section of data to disk based on time range, observation name
    • Useful for detections of transient phenomena
• ndarray is a subclass of numpy.ndarray ⇒ compatibility with many numpy functions, matplotlib, etc.
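Selecting data by time range from a time-tagged sequence comes down to converting times into frame indices. The sketch below assumes a header carrying the time of the first frame and a constant frame spacing; the function `frames_in_range` and the keys `time_tag` and `frame_dt` are hypothetical names for this illustration, not a description of Bifrost's header layout.

```python
import math

def frames_in_range(header, nframe, t_start, t_stop):
    """Return (first, stop) frame indices with timestamps in [t_start, t_stop)."""
    t0 = header['time_tag']  # hypothetical key: time of frame 0, in seconds
    dt = header['frame_dt']  # hypothetical key: seconds per frame
    # Frame i has timestamp t0 + i * dt.
    first = max(0, math.ceil((t_start - t0) / dt))
    stop = min(nframe, math.ceil((t_stop - t0) / dt))
    return first, max(first, stop)

header = {'name': 'obs1', 'time_tag': 1000.0, 'frame_dt': 0.5}
first, stop = frames_in_range(header, nframe=100, t_start=1010.0, t_stop=1012.0)
```

With these values the selection is frames 20 to 23, whose timestamps are 1010.0, 1010.5, 1011.0, and 1011.5 s. A range that falls entirely outside the sequence yields an empty selection (`first == stop`).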

Why Bifrost? Block library
Many astronomy and general processing blocks already built
• State of the art and flexible high-performance implementations
• Metadata rich
• Well-documented
• Flexible dimensions
These include:
• accumulate
• audio
• binary_io
• detect
• fdmt
• fft
• fftshift
• guppi_raw
• quantize
• reduce
• reverse
• serialize
• sigproc
• transpose
• unpack
• wav

Why Bifrost? Logging and performance benchmarking
• getirq
• getsiblings
• like_bmon
• like_ps
• like_top
• pipeline2dot
• setirq

Why Bifrost? Rapid development speed; high performance
• Bifrost code vs. C++ legacy (figure)


Conclusion
• Future work
  • PSRDADA – Bifrost block
    • To enable capture with PSRDADA to a Bifrost ring for post-processing
  • Additional options for visualization, “ScopeBlock”
    • Visualize ring contents in real-time
  • Aiming for full support of correlation, pulsar/transient backend pipelines
github.com/ledatelescope/bifrost (or, Google: “leda telescope bifrost”)
