Tuesday 19th March, 2019
GPU Technology Conference 2019, San Jose, USA
S9306
Extreme Signal-Processing Performance Using Tensor Cores
Astronomical Imaging on GPUs
John Romein and Bram Veenboer
Netherlands Institute for Radio Astronomy

This talk consists of two parts. In the first part, we explain how we use Tensor Cores to obtain extreme signal-processing performance. In the second part, we explain how we solve the largest computational challenge in the imaging pipeline of modern radio telescopes.
Tensor Cores: Signal Processing at Unprecedented Speeds
John Romein
ASTRON (Netherlands Institute for Radio Astronomy)
GTC'19 / Tensor Core Signal Processing, March 18-21, 2019
outline
● tensor cores
● complex numbers and matrix multiplications
● signal-processing algorithms (analyze performance)
  – correlations
  – beam forming
  – ...
tensor cores
● mixed-precision matrix multiplication hardware
  – Volta, Turing
● V100: peak 112 (!) TFLOPS
● designed for deep learning
how to use tensor cores
● libraries (cuBLAS, cutlass, ...) ✘
  – insufficient support for complex numbers
● WMMA ✔
  – operates directly on 16x16 matrices (+ few more formats)
  – use in CUDA program
WMMA example
● warp performs 16x16 matrix multiplication

    load_matrix_sync(a_frag, &a[…][…], K);
    load_matrix_sync(b_frag, &b[…][…], N);
    fill_fragment(c_frag, 0);
    mma_sync(d_frag, a_frag, b_frag, c_frag);   // d = a * b + c
    store_matrix_sync(&d[…][…], d_frag, …);
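To make the `d = a * b + c` semantics of the tile operation concrete, here is a minimal host-side C++ sketch of what one 16x16 multiply-accumulate computes. It is only a scalar reference, not the WMMA API: the names `Tile` and `mma_reference` are hypothetical, and it uses `float` throughout, whereas the tensor cores take half-precision inputs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t TILE = 16;  // tile edge used on the slide
using Tile = std::array<std::array<float, TILE>, TILE>;

// Scalar reference for one tile operation: d = a * b + c.
// Assumption of this sketch: all operands are float; on the GPU the
// inputs a and b would be half precision and a whole warp cooperates.
Tile mma_reference(const Tile& a, const Tile& b, const Tile& c) {
    Tile d = c;  // start from the accumulator
    for (std::size_t i = 0; i < TILE; ++i)
        for (std::size_t j = 0; j < TILE; ++j)
            for (std::size_t k = 0; k < TILE; ++k)
                d[i][j] += a[i][k] * b[k][j];
    return d;
}
```

A reference like this can be handy for checking GPU results on small inputs, within the rounding tolerance of half-precision inputs.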
signal processing: complex numbers
● describes phase & amplitude of signal
● real and imaginary part (a_r, a_i)
● complex multiply-add = 4 real multiply-adds, for c += ab:
    c_r += a_r b_r
    c_r += –a_i b_i    (–sign → no tensor core support)
    c_i += a_r b_i
    c_i += a_i b_r
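The four real multiply-adds can be written out directly. The following is a minimal C++ sketch; the function name `complex_mac` is hypothetical.

```cpp
#include <cassert>

// c += a * b on complex numbers, spelled out as four real multiply-adds.
// (ar, ai), (br, bi) and (cr, ci) are the real and imaginary parts.
inline void complex_mac(float ar, float ai, float br, float bi,
                        float& cr, float& ci) {
    cr += ar * br;
    cr += -ai * bi;  // the negation that must be applied outside the tensor cores
    ci += ar * bi;
    ci += ai * br;
}
```

Only the second line needs a negated operand, which is why the sign has to be folded into the input data before a plain multiply-accumulate unit can be used.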
two complex array/matrix formats

1) split matrix: float real[4][4], imag[4][4];

    r0,0 r0,1 r0,2 r0,3     i0,0 i0,1 i0,2 i0,3
    r1,0 r1,1 r1,2 r1,3     i1,0 i1,1 i1,2 i1,3
    r2,0 r2,1 r2,2 r2,3     i2,0 i2,1 i2,2 i2,3
    r3,0 r3,1 r3,2 r3,3     i3,0 i3,1 i3,2 i3,3

2) interleaved: std::complex<float> matrix[4][4];

    r0,0 i0,0 r0,1 i0,1 r0,2 i0,2 r0,3 i0,3
    r1,0 i1,0 r1,1 i1,1 r1,2 i1,2 r1,3 i1,3
    r2,0 i2,0 r2,1 i2,1 r2,2 i2,2 r2,3 i2,3
    r3,0 i3,0 r3,1 i3,1 r3,2 i3,2 r3,3 i3,3
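The two layouts hold the same data, so converting between them is a plain copy. A minimal C++ sketch follows; the struct `SplitMatrix` and the functions `to_split` and `to_interleaved` are hypothetical names, and row-major storage is assumed.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Split format: separate row-major arrays for real and imaginary parts.
struct SplitMatrix {
    std::size_t rows, cols;
    std::vector<float> real, imag;
};

// Interleaved (std::complex) -> split.
SplitMatrix to_split(const std::vector<cfloat>& m,
                     std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    SplitMatrix s{rows, cols,
                  std::vector<float>(rows * cols),
                  std::vector<float>(rows * cols)};
    for (std::size_t i = 0; i < m.size(); ++i) {
        s.real[i] = m[i].real();
        s.imag[i] = m[i].imag();
    }
    return s;
}

// Split -> interleaved.
std::vector<cfloat> to_interleaved(const SplitMatrix& s) {
    assert(s.real.size() == s.rows * s.cols && s.imag.size() == s.real.size());
    std::vector<cfloat> m(s.real.size());
    for (std::size_t i = 0; i < m.size(); ++i)
        m[i] = cfloat(s.real[i], s.imag[i]);
    return m;
}
```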
1) complex split matrices

$$[C] = [A][B] \;\rightarrow\; \begin{aligned} [C_r] &= [A_r][B_r] + [-A_i][B_i] \\ [C_i] &= [A_r][B_i] + [A_i][B_r] \end{aligned}$$

● maps well to tensor cores
  – negate A_i values (needed for [C_r] only)
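A minimal C++ sketch of the split-matrix approach: a complex product built from four real multiply-accumulate passes. `real_mma` is a hypothetical stand-in for a tensor-core multiply-accumulate, and row-major `float` storage is assumed. The real part uses a negated copy of A_i; the imaginary part uses A_i as is, as the scalar rule c_i += a_r b_i + a_i b_r requires.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major

// c += a * b for real matrices: a is m x k, b is k x n, c is m x n.
// Stand-in for what a tensor-core multiply-accumulate would do.
void real_mma(const Matrix& a, const Matrix& b, Matrix& c,
              std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n && c.size() == m * n);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t p = 0; p < k; ++p)
                c[i * n + j] += a[i * k + p] * b[p * n + j];
}

struct SplitComplex { Matrix re, im; };

// Complex C = A * B using only real multiply-accumulates.
SplitComplex split_complex_matmul(const SplitComplex& a, const SplitComplex& b,
                                  std::size_t m, std::size_t k, std::size_t n) {
    Matrix neg_ai(a.im.size());
    for (std::size_t i = 0; i < neg_ai.size(); ++i) neg_ai[i] = -a.im[i];

    SplitComplex c{Matrix(m * n, 0.0f), Matrix(m * n, 0.0f)};
    real_mma(a.re,   b.re, c.re, m, k, n);  // C_r  = A_r B_r
    real_mma(neg_ai, b.im, c.re, m, k, n);  // C_r += (-A_i) B_i
    real_mma(a.re,   b.im, c.im, m, k, n);  // C_i  = A_r B_i
    real_mma(a.im,   b.re, c.im, m, k, n);  // C_i += A_i B_r
    return c;
}
```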
2) interleaved complex matrices

$$
\begin{pmatrix} r & i \end{pmatrix}
=
\begin{pmatrix} r_0 & i_0 & r_1 & i_1 & r_2 & i_2 & \cdots & r_7 & i_7 \end{pmatrix}
\times
\begin{pmatrix}
 r_0 & i_0 \\
-i_0 & r_0 \\
 r_1 & i_1 \\
-i_1 & r_1 \\
 r_2 & i_2 \\
-i_2 & r_2 \\
\vdots & \vdots \\
 r_7 & i_7 \\
-i_7 & r_7
\end{pmatrix}
$$

● reorder right matrix for tensor core use
  – duplicate/permute/negate entries
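The reordering can be sketched in C++ as follows: every complex entry (r, i) of the right matrix becomes a 2x2 real block with rows (r, i) and (-i, r), after which a purely real product of the interleaved left matrix with the expanded right matrix yields an interleaved complex result. The function names are hypothetical and row-major storage is assumed; this is a reference sketch, not the GPU kernel.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Expand complex B (k x n) into a real 2k x 2n matrix: each entry (r, i)
// is duplicated, permuted and negated into the block [ r i ; -i r ].
std::vector<float> expand_right_matrix(const std::vector<cfloat>& b,
                                       std::size_t k, std::size_t n) {
    assert(b.size() == k * n);
    const std::size_t ld = 2 * n;  // row length of the expanded matrix
    std::vector<float> out(4 * k * n);
    for (std::size_t row = 0; row < k; ++row)
        for (std::size_t col = 0; col < n; ++col) {
            const float r = b[row * n + col].real();
            const float i = b[row * n + col].imag();
            out[(2 * row) * ld + 2 * col]         = r;
            out[(2 * row) * ld + 2 * col + 1]     = i;
            out[(2 * row + 1) * ld + 2 * col]     = -i;
            out[(2 * row + 1) * ld + 2 * col + 1] = r;
        }
    return out;
}

// Complex C = A * B (A: m x k, B: k x n) using real multiply-adds only.
std::vector<cfloat> interleaved_matmul(const std::vector<cfloat>& a,
                                       const std::vector<cfloat>& b,
                                       std::size_t m, std::size_t k,
                                       std::size_t n) {
    assert(a.size() == m * k);
    std::vector<float> a_real(2 * m * k);  // A viewed as real m x 2k
    for (std::size_t idx = 0; idx < a.size(); ++idx) {
        a_real[2 * idx]     = a[idx].real();
        a_real[2 * idx + 1] = a[idx].imag();
    }
    const std::vector<float> b_real = expand_right_matrix(b, k, n);

    std::vector<float> c_real(2 * m * n, 0.0f);  // real m x 2n result
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < 2 * n; ++j)
            for (std::size_t p = 0; p < 2 * k; ++p)
                c_real[i * 2 * n + j] += a_real[i * 2 * k + p] * b_real[p * 2 * n + j];

    std::vector<cfloat> c(m * n);
    for (std::size_t idx = 0; idx < c.size(); ++idx)
        c[idx] = cfloat(c_real[2 * idx], c_real[2 * idx + 1]);
    return c;
}
```

With this layout, a 16x16 real tile covers 8 complex values along the summation dimension, which matches the indices r_0 … r_7 on the slide.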
complex formats: split array vs. interleaved
● implemented both
● generally no big performance difference
tensor cores for signal processing
● suitable if
  – input ≤ 16 bit ✔
  – algorithm translates to matrix-matrix multiplication ✔ + ✘
algorithm 1: correlations
correlations
● combines telescope data
  – each pair: multiply & accumulate
  – ½r(r+1) pairs (r = number of receivers)

$$\mathrm{correlation}_{\mathrm{recv}_1,\mathrm{recv}_2} = \sum_{\mathrm{time}=0}^{\text{integration time}-1} \mathrm{sample}_{\mathrm{recv}_1,\mathrm{time}} \times \mathrm{sample}^{*}_{\mathrm{recv}_2,\mathrm{time}}$$

(* denotes the complex conjugate)
correlator computations
● C ← A * A^H
● C = C^H → compute & store triangle
[figure: C (receivers x receivers, triangle stored) computed from A (receivers x time)]
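A minimal C++ reference for the correlator: it computes C = A A^H and stores only one triangle, giving ½r(r+1) values for r receivers. The packed lower-triangle layout and the names `triangle_index` and `correlate` are assumptions of this sketch, not necessarily the layout the GPU kernels use.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Position of pair (r1, r2), with r2 <= r1, in a packed lower triangle.
// (Hypothetical layout chosen for this sketch.)
inline std::size_t triangle_index(std::size_t r1, std::size_t r2) {
    assert(r2 <= r1);
    return r1 * (r1 + 1) / 2 + r2;
}

// samples is A: receivers x times, row-major.
// Returns the lower triangle of C = A * A^H, including the diagonal.
std::vector<cfloat> correlate(const std::vector<cfloat>& samples,
                              std::size_t receivers, std::size_t times) {
    assert(samples.size() == receivers * times);
    std::vector<cfloat> vis(receivers * (receivers + 1) / 2);
    for (std::size_t r1 = 0; r1 < receivers; ++r1)
        for (std::size_t r2 = 0; r2 <= r1; ++r2) {
            cfloat sum(0.0f, 0.0f);
            for (std::size_t t = 0; t < times; ++t)
                sum += samples[r1 * times + t] * std::conj(samples[r2 * times + t]);
            vis[triangle_index(r1, r2)] = sum;
        }
    return vis;
}
```

The upper triangle never needs to be computed, since C = C^H means each missing entry is the conjugate of a stored one.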
work decomposition
● computeSquares()
  – thread block: 64x64 receivers
  – warp: 32x16 receivers
● computeTriangles()
  – redundant computations above diagonal
[figure: triangular receivers x receivers matrix divided into blocks; one square = 64x64 receivers]
correlator implementation
● cache input: L2 → shared mem → registers
  – fix –sign on the fly
● wmma::store_matrix_sync() cannot write to triangle
  – copy via shared mem, or
  – write accumulation registers directly (hack!)
correlator performance
[plot: TFLOPS (0 to 80) vs. # receivers (64 to 576); series: overall, correlateSquares(), correlateTriangles(); measured on Tesla V100]
correlator roofline analysis
[roofline plot: TFLOPS (1 to 100, log scale) vs. FLOPS/byte (1 to 8192, log scale); regions: memory bandwidth bound, compute bound; points: correlateSquares(), correlateTriangles()]
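For readers unfamiliar with roofline plots, the model behind them is a single formula: attainable performance is the minimum of the compute peak and arithmetic intensity times memory bandwidth. A minimal C++ sketch, with hypothetical function names:

```cpp
#include <algorithm>
#include <cassert>

// Attainable TFLOPS under the roofline model.
// flops_per_byte: arithmetic intensity of the kernel.
// peak_tflops: compute peak; bandwidth_tbytes_per_s: memory bandwidth in TB/s.
double roofline_tflops(double flops_per_byte, double peak_tflops,
                       double bandwidth_tbytes_per_s) {
    assert(flops_per_byte >= 0.0);
    return std::min(peak_tflops, flops_per_byte * bandwidth_tbytes_per_s);
}

// Arithmetic intensity at which a kernel stops being bandwidth bound.
double ridge_flops_per_byte(double peak_tflops, double bandwidth_tbytes_per_s) {
    assert(bandwidth_tbytes_per_s > 0.0);
    return peak_tflops / bandwidth_tbytes_per_s;
}
```

As an illustration only: with the 112 TFLOPS peak quoted earlier in the talk and an assumed device memory bandwidth of 0.9 TB/s (a value not taken from the slides), the ridge point would lie at roughly 124 FLOPS/byte.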
correlator energy efficiency
[plot: GFLOP/J (0 to 300) vs. # receivers (64 to 576); series: overall, computeSquares(), computeTriangles(); measured on Titan RTX, not V100]
innovation beyond Moore's law
[plot: TFLOPS (0 to 80) vs. year (2013 to 2020); series: FP32, Tensor Cores; GPUs: Tesla K40 (Kepler), Titan X (Maxwell), Titan X (Pascal), Tesla V100 (Volta), Titan RTX (Turing)]
algorithm 2: beam forming
beam forming
● increase sensitivity in particular direction
● (weighted) addition of signals
[figure; credit: Jason Hessels]

$$\mathrm{bfdata}_{\mathrm{time},\mathrm{beam}} = \sum_{\mathrm{recv}=0}^{\mathrm{nr\,recv}-1} \mathrm{samples}_{\mathrm{time},\mathrm{recv}} \, \mathrm{weights}_{\mathrm{recv},\mathrm{beam}}$$
beam former implementation
● multiple beams: complex matrix-matrix multiplication
[figure: beam-formed data (time x beams) = samples (time x receivers) x beam weights (receivers x beams)]

$$\mathrm{bfdata}_{\mathrm{time},\mathrm{beam}} = \sum_{\mathrm{recv}=0}^{\mathrm{nr\,recv}-1} \mathrm{samples}_{\mathrm{time},\mathrm{recv}} \, \mathrm{weights}_{\mathrm{recv},\mathrm{beam}}$$
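The beam-forming sum is a plain complex matrix product of the samples (time x receivers) with the weights (receivers x beams). A minimal C++ reference sketch, assuming row-major storage; the function name `beam_form` is hypothetical:

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// bfdata[time][beam] = sum over recv of samples[time][recv] * weights[recv][beam]
// samples: times x receivers, weights: receivers x beams, both row-major.
std::vector<cfloat> beam_form(const std::vector<cfloat>& samples,
                              const std::vector<cfloat>& weights,
                              std::size_t times, std::size_t receivers,
                              std::size_t beams) {
    assert(samples.size() == times * receivers);
    assert(weights.size() == receivers * beams);
    std::vector<cfloat> bfdata(times * beams, cfloat(0.0f, 0.0f));
    for (std::size_t t = 0; t < times; ++t)
        for (std::size_t b = 0; b < beams; ++b)
            for (std::size_t r = 0; r < receivers; ++r)
                bfdata[t * beams + b] += samples[t * receivers + r] * weights[r * beams + b];
    return bfdata;
}
```

With a single beam this degenerates to a matrix-vector product, which is why the mapping to matrix-matrix hardware only pays off for multiple beams.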
beam former performance and roofline analysis
[left plot: TFLOPS (0 to 80) vs. # receivers (64 to 512); series: 512 beams, 256 beams, 128 beams]
[right plot: roofline, TFLOPS (10 to 100, log scale) vs. FLOPS/byte (16 to 1024, log scale); regions: memory bandwidth bound, compute bound; points range from 64 receivers to 512 receivers]
other algorithms
other signal-processing algorithms
● nonuniform Fourier transforms ✔
  – map well to complex matrix-matrix multiplication
  – ≤ 80 TFLOPS
● FIR filter ✘ (no need for tensor cores)
  – matrix multiplication → many zeros
  – typically memory bandwidth bound
● FFT ✘ (no need for tensor cores)
  – not a matrix multiplication
  – memory bandwidth bound
[roofline plot: TFLOPS (1 to 100, log scale) vs. FLOPS/byte (1 to 4096, log scale); points: nuFT, FIR filter, FFT]
current / future work
● try further optimizations
  – correlator: near diagonal
  – beam forming: cublasLtMatmul() (CUDA 10.1)
● support any number of receivers/beams
● 8 bit, 4 bit
conclusions
● tensor cores for signal processing (matrix-matrix multiplication):
  – correlating
  – multi-beam forming
  – nonuniform Fourier transforms
● unprecedented performance (≤ ~75 TFLOPS, ≤ 6x)

This work is funded by the European Union under grant no. H2020-FETHPC-754304 (DEEP-EST).
Tuesday 19th March, 2019
GPU Technology Conference 2019, San Jose, USA
Astronomical Imaging on GPUs
Bram Veenboer
Netherlands Institute for Radio Astronomy
Outline
• Introduction:
  • Interferometry
  • The Image-Domain Gridding algorithm
  • Performance analysis
• Analysis and optimization:
  • Gridder kernel
  • Imaging application
• Performance and energy-efficiency comparison
• Results in context of Square Kilometre Array
• Summary
Introduction to radio astronomy
[image; credits: NRAO]
• Observe the sky at radio wavelengths → map of radio sources
• Dish-based telescopes: VLA, ALMA, MeerKAT, SKA-1 Mid
• Size of the telescope is proportional to the wavelength
• Use arrays of antennas for low frequencies (LOFAR, MWA, SKA-1 Low)
Radio telescope: astronomical interferometer
• Interferometer: array of separate telescopes
• Interferometry: combine the signals from separate radio telescopes
• Resolution similar to one very large dish
[figure: two telescopes separated by a baseline]
Interferometry theory
[figure: geometry of the image plane (l, m) and the uv plane (u, v, w), with angle Θ and N Pole; plot of uv coverage, v [km] vs. u [km], both from −60 to 60]
• Sampling of the ‘uv-plane’: ‘visibilities’
• Earth-rotation synthesis