Bifrost: Easy GPU Pipeline Development
github.com/ledatelescope/bifrost
• Presenter: Miles Cranmer (CfA/McGill)
• On behalf of: Ben Barsdell (NVIDIA), Danny Price (Berkeley), Jayce Dowell (UNM), Hugh Garsden (CfA), Frank Schinzel (NRAO), Greg Taylor (UNM), Lincoln Greenhill (CfA)
8/14/17 Miles Cranmer
Stream-processing and real-time GPU computing
• Stream-processing: operating on data which is potentially unlimited in extent
  • E.g., a time stream of digitized voltages
• Nontrivial for CPU/GPU systems:
  • Creation of data structures for buffer memory management, packet capture
  • Additional complexities for asynchronous copies and kernel execution
  • Manual parallelization/core binding of algorithms and pipelines
  • Potential issues include memory leaks and race conditions
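A minimal sketch of the stream-processing idea above, in plain Python with NumPy rather than Bifrost. An endless sample source is regrouped into fixed-size chunks ("gulps"), so memory use stays bounded however long the stream runs. The names `voltage_stream`, `gulps`, and `mean_power` are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

def voltage_stream(seed=0, block=1000):
    """Hypothetical endless source of 8-bit digitized voltage samples."""
    rng = np.random.default_rng(seed)
    while True:
        yield rng.integers(-128, 128, size=block).astype(np.int8)

def gulps(stream, gulp_nframe):
    """Regroup arbitrarily sized input blocks into fixed-size gulps."""
    pending = np.empty(0, dtype=np.int8)
    for block in stream:
        pending = np.concatenate([pending, block])
        while pending.size >= gulp_nframe:
            yield pending[:gulp_nframe]
            pending = pending[gulp_nframe:]

def mean_power(stream, gulp_nframe, ngulp):
    """Mean squared voltage of the first ngulp gulps of the stream."""
    first = itertools.islice(gulps(stream, gulp_nframe), ngulp)
    return [float(np.mean(g.astype(np.float64) ** 2)) for g in first]

# Only three gulps are ever pulled from the otherwise unbounded stream.
powers = mean_power(voltage_stream(), gulp_nframe=4096, ngulp=3)
```

Only `pending` and the current gulp are held in memory at any time; buffer management of this kind is what the slide describes as nontrivial once GPU copies and threads are involved.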
Bifrost is deployed in the wild:
• Backend for newest LWA station in NM
• Bifrost-powered data capture for live all-sky image
  • Google: “LWA TV 2”
• Pulsar detection:
  • Validation timing within 0.0001 ms of canonical for PSR B0834+06 (well within 1σ of measurement)
Bifrost core concepts
• Blocks
  • Independent thread
  • “Black box” algorithm
• Ring buffers (Rings)
  • Emulates wrap-around in memory
• Memory spaces
  • Rings assigned to specific “space”
• Pipelines
  • Combination of the above
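The block/pipeline idea can be sketched with the Python standard library alone: each "block" runs in its own thread and sees only its input and output channel. This is an analogy, not Bifrost code. Here `queue.Queue` stands in for a ring buffer, and `source`, `square`, `sink`, and `run_pipeline` are hypothetical names.

```python
import queue
import threading

def source(out_q, n):
    """Source block: emits 0..n-1, then an end-of-stream marker (None)."""
    for i in range(n):
        out_q.put(i)
    out_q.put(None)

def square(in_q, out_q):
    """Transform block: a black box that squares each item it receives."""
    for item in iter(in_q.get, None):
        out_q.put(item * item)
    out_q.put(None)

def sink(in_q, results):
    """Sink block: collects everything that reaches the end of the pipeline."""
    for item in iter(in_q.get, None):
        results.append(item)

def run_pipeline(n):
    """Connect the three blocks with bounded queues and run each in a thread."""
    q1, q2 = queue.Queue(maxsize=4), queue.Queue(maxsize=4)
    results = []
    threads = [
        threading.Thread(target=source, args=(q1, n)),
        threading.Thread(target=square, args=(q1, q2)),
        threading.Thread(target=sink, args=(q2, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

results = run_pipeline(5)
```

The bounded queues give back-pressure: a fast source stalls instead of filling memory when a downstream block is slow.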
The Bifrost framework
• Python frontend wraps fast C/C++/CUDA backend
• Frontend:
  • Blocks and Pipelines are Python object abstractions for the backend
  • ND-array object for memory management (span of ring buffer)
  • ctypes wraps all C calls
• Backend:
  • Common type definitions and “BFarray” generic data structure
  • “Ring buffer” used for inter-block communication
  • Several common modules implemented
Ring buffer implementation
• What’s unique?
  • Multiple readers, single writer ⇒ branched pipelines OK
  • Thread safe
  • Allocated in system (CPU), cuda (GPU), or cuda_host (pinned CPU) memory
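A toy illustration of the single-writer, multiple-reader ring with wrap-around, written in NumPy. It is a sketch under simplifying assumptions and not Bifrost's implementation: it is single-threaded (no locking), lives only in host memory, and refuses a write that would overwrite data a reader has not consumed. `ToyRing` and its methods are hypothetical names.

```python
import numpy as np

class ToyRing:
    """Fixed-size buffer with one writer and any number of independent readers."""

    def __init__(self, size):
        self.buf = np.zeros(size, dtype=np.float32)
        self.size = size
        self.head = 0      # total samples written so far
        self.tails = {}    # reader name -> total samples read so far

    def add_reader(self, name):
        """Register a reader that starts at the current write position."""
        self.tails[name] = self.head

    def write(self, data):
        data = np.asarray(data, dtype=np.float32)
        oldest = min(self.tails.values(), default=self.head)
        if self.head + data.size - oldest > self.size:
            raise BufferError("write would overwrite data a reader still needs")
        idx = (self.head + np.arange(data.size)) % self.size  # wrap-around
        self.buf[idx] = data
        self.head += data.size

    def read(self, name, n):
        """Return up to n unread samples for this reader and advance it."""
        tail = self.tails[name]
        n = min(n, self.head - tail)
        idx = (tail + np.arange(n)) % self.size
        self.tails[name] = tail + n
        return self.buf[idx]

ring = ToyRing(8)
ring.add_reader("fft")       # two readers on one ring: a branched pipeline
ring.add_reader("disk")
ring.write([1, 2, 3, 4, 5, 6])
first_fft = ring.read("fft", 6)
first_disk = ring.read("disk", 6)
ring.write([7, 8, 9, 10])    # runs past the end of the 8-slot buffer and wraps
wrapped = ring.read("fft", 4)
```

Each reader keeps its own position, so both branches see the same data without copying it per reader.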
API example 1: block
    from copy import deepcopy

    import bifrost as bf
    from bifrost.pipeline import TransformBlock

    class QuantizeBlock(TransformBlock):
        def __init__(self, iring, dtype, scale=1., *args, **kwargs):
            TransformBlock.__init__(self, iring, *args, **kwargs)
            self.dtype = dtype
            self.scale = scale

        def on_sequence(self, isequence):
            # Copy the input header and advertise the new output data type
            ohdr = deepcopy(isequence.header)
            ohdr['_tensor']['dtype'] = self.dtype
            return ohdr

        def on_data(self, ispan, ospan):
            # Quantize one span of input data into the output span
            bf.quantize.quantize(ispan.data, ospan.data, self.scale)
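The slide does not spell out what the quantize step does numerically. The NumPy sketch below assumes the common definition (multiply by `scale`, round to nearest, saturate to the output range); the actual rounding and saturation rules of `bf.quantize` may differ. `quantize_i8` is a hypothetical helper, not part of Bifrost.

```python
import numpy as np

def quantize_i8(x, scale=1.0):
    """Scale, round to nearest, and clip into the signed 8-bit range."""
    scaled = np.rint(np.asarray(x, dtype=np.float64) * scale)
    return np.clip(scaled, -128, 127).astype(np.int8)

samples = np.array([-300.0, -1.4, 0.4, 1.6, 300.0])
q = quantize_i8(samples, scale=1.0)   # out-of-range values saturate
```

Clipping before the cast matters: casting an out-of-range float straight to `int8` does not saturate.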
API example 2: pipeline
    import bifrost as bf

    bc = bf.BlockChainer()
    bc.blocks.read_wav(['audio_file.wav'], gulp_nframe=4096)  # Read in file
    bc.blocks.copy(space='cuda')                              # Copy to GPU
    bc.views.split_axis('time', 256, label='fine_time')
    bc.blocks.fft(axes='fine_time', axis_labels='freq')       # FFT
    bc.blocks.detect(mode='scalar')                           # Square modulus
    bc.blocks.transpose(['time', 'pol', 'freq'])              # Transpose
    bc.blocks.copy(space='cuda_host')                         # Copy back to CPU
    bc.blocks.quantize('i8')                                  # Convert to 8-bit integer
    bc.blocks.write_sigproc()                                 # Save
    pipeline = bf.get_default_pipeline()
    pipeline.shutdown_on_signals()
    pipeline.run()                                            # Run the pipeline
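For readers without a GPU, the arithmetic of the middle of this pipeline can be approximated on the CPU with NumPy. This is a sketch of the math only (split the time axis, FFT, square modulus, transpose), assuming input shaped (time, pol). It does not reproduce Bifrost's streaming, headers, or file I/O, and `spectrometer` is a hypothetical name.

```python
import numpy as np

def spectrometer(samples, nfine=256):
    """Split time into (time, fine_time), FFT over fine_time, take |x|^2."""
    ntime = samples.shape[0] // nfine
    frames = samples[: ntime * nfine].reshape(ntime, nfine, -1)  # (time, fine_time, pol)
    spectra = np.fft.fft(frames, axis=1)                         # fine_time -> freq
    power = np.abs(spectra) ** 2                                 # square modulus
    return power.transpose(0, 2, 1)                              # (time, pol, freq)

# Two-channel test tone sitting exactly on FFT bin 8 of each 256-sample frame
t = np.arange(4096)
tone = np.cos(2 * np.pi * 8 * t / 256)
audio = np.stack([tone, 0.5 * tone], axis=1)   # shape (time, pol) = (4096, 2)
power = spectrometer(audio)
```

A pure tone on bin 8 gives a quick correctness check: the power peaks in that bin, and the half-amplitude channel carries a quarter of the power.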
bf.map
• Easy CUDA kernel generation from Bifrost
• JIT compiler uses NVRTC
    # Create three arrays on the GPU, A and B, and an empty output C
    a = bf.ndarray([1,2,3,4,5], space='cuda')
    b = bf.ndarray([1,0,1,0,1], space='cuda')
    c = bf.empty(5, space='cuda')
    # Add A, B together
    bf.map("c = a + b", data={'c': c, 'a': a, 'b': b})
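A CPU reference for the same elementwise operation, using NumPy instead of Bifrost, can be handy for checking the result of a `bf.map` call. This sketch assumes the expression `c = a + b` is applied element by element, as the slide's comment suggests.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 0, 1, 0, 1])
c = np.empty(5, dtype=a.dtype)

# Elementwise add into a preallocated output, mirroring "c = a + b"
np.add(a, b, out=c)
```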
bf.map
Explicit indexing also supported. Outer product:
    bf.map("c(i,j) = a(i) * b(j)",
           {'c': c, 'a': a, 'b': b},
           axis_names=('i','j'))
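The indexed expression `c(i,j) = a(i) * b(j)` reads as a double loop over `i` and `j`. The NumPy sketch below writes that loop out and compares it with the equivalent broadcast form. The arrays are made-up example data and no Bifrost calls are involved.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0])

# Literal reading of c(i,j) = a(i) * b(j): one output element per (i, j) pair
c_loops = np.empty((a.size, b.size))
for i in range(a.size):
    for j in range(b.size):
        c_loops[i, j] = a[i] * b[j]

# Same result via broadcasting: a varies along axis i, b along axis j
c_broadcast = a[:, None] * b[None, :]
```

Note that the output has one axis per index name, so `c` must be two-dimensional with shape `(len(a), len(b))`.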
Why Bifrost?
Why Bifrost? Astronomy-specific
• Bifrost developed in parallel with LWA-SV, driven by radio astronomy applications
  • ⇒ Core structural advantages for astronomy
• Ring features
  • Metadata describes the units of ring buffer dimensions; used in algorithms (e.g., dedispersion)
  • Multi-sequence ring buffers, useful for different observations. The metadata will propagate down the pipeline.
  • Time-tagged sequences in ring buffers ⇒ can dump section of data to disk based on time range, observation name
    • Useful for detections of transient phenomena
• Ndarray is a child of numpy.ndarray ⇒ compatibility with many numpy functions, matplotlib, etc.
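A rough sketch of how axis metadata enables the time-range selection mentioned above. The header layout here (per-axis `labels`, `scales` as (first value, step), and `units`) is a hypothetical stand-in for illustration and should not be read as Bifrost's exact header schema. `axis_values` and `select_time_range` are likewise made-up helpers.

```python
import numpy as np

# Hypothetical sequence header: one entry per axis of the data
header = {
    "name": "example_observation",
    "labels": ["time", "freq"],
    "scales": [(100.0, 0.5), (60.0, 0.025)],  # (first value, step) per axis
    "units": ["s", "MHz"],
}
data = np.arange(20 * 4, dtype=np.float32).reshape(20, 4)  # (time, freq)

def axis_values(header, label, n):
    """Physical coordinate of each of the n samples along the named axis."""
    i = header["labels"].index(label)
    start, step = header["scales"][i]
    return start + step * np.arange(n)

def select_time_range(header, data, t_start, t_stop):
    """Return the frames whose time tag falls in [t_start, t_stop)."""
    times = axis_values(header, "time", data.shape[0])
    keep = (times >= t_start) & (times < t_stop)
    return times[keep], data[keep]

times, chunk = select_time_range(header, data, 102.0, 104.0)
```

Because the header travels with the data, a downstream block can make this selection without knowing how the upstream blocks were configured.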
Why Bifrost? Block library
Many astronomy and general processing blocks already built
• State of the art and flexible high-performance implementations
• Metadata rich
• Well-documented
• Flexible dimensions
These include:
• accumulate • audio • binary_io • detect • fdmt • fft • fftshift • guppi_raw • quantize • reduce • reverse • serialize • sigproc • transpose • unpack • wav
Why Bifrost? Logging and performance benchmarking
• getirq
• getsiblings
• like_bmon
• like_ps
• like_top
• pipeline2dot
• setirq
Why Bifrost? Rapid development speed; high performance
[Figure: Bifrost code vs. C++ legacy]
Conclusion
• Future work
  • PSRDADA – Bifrost block
    • To enable capture with PSRDADA to a Bifrost ring for post-processing
  • Additional options for visualization, “ScopeBlock”
    • Visualize ring contents in real-time
  • Aiming for full support of correlation, pulsar/transient backend pipelines
github.com/ledatelescope/bifrost (or, Google: “leda telescope bifrost”)