Exascale Computing for Radio Astronomy: GPU or FPGA? Kees van Berkel MPSoC 2016, Nara, Japan 2016, July 14
Radio Astronomy: Herculus A (a.k.a. 3C 348) “… optically invisible jets, one-and-a-half million light-years wide, dwarf the visible galaxy from which they emerge.” Image courtesy of NRAO/AUI Kees van Berkel page 1
VLA radio telescope, New Mexico 27 independent antennae (dishes), each with a diameter of 25m Kees van Berkel page 2
NGC6946: where is the NH 3 ? and how cold is it? Optical + X-ray combined Radio: 24 GHz ( λ =12.5 mm) 20 million light years from earth 1.76 GB of “radio data” (image about 50 arcsecs wide) (a few fJ in total, a few B photons) Kees van Berkel page 3
NGC6946: where is the NH 3 ? and how cold is it? „image cube”: (256 × 256 pixels) × 640 channels 19° Kelvin 640 channels of 500 kHz each NH 3 cloud Kees van Berkel page 4
Exascale Computing for Radio Astronomy: GPU or FPGA? Computing: • for the Square Kilometer Array (y2022) • what kind is needed? • 2D-FFT, (de-)convolution, filters, dedispersion, and a lot more • “exa-scale”: 10 18 FLOP/sec, • how much? i.e. 10 × fastest computer existing • 10 4.5 nodes × 10 4.5 ALUs × f c =10 9 Hz? • in what form? • GPU or FPGA? • accelerator / node? • use rooflines as a tool, • how to find out? for modeling and for prediction Kees van Berkel page 5
Interferometry 2-element interferometer Output of the correlator: • E ν (r 1 ) is the electric field at position r 1 , • ν the observation frequency, and • * denotes complex conjugation baseline correlator Kees van Berkel page 6
Van Cittert–Zernike theorem [1934-38] sky intensity speed of light correlator output quasi solid angle base line vector, monochromatic separating the 2 antennae Adding geometry (assuming “narrow field”): 2D Fourier transform! where (l, m) are sky-image coordinates and (u, v) are coordinates of the base-line vector [Tay99, Cla90, Tho01] Kees van Berkel page 7
Van Cittert–Zernike theorem [1934-38] In principle : I-DFT (u, v) coverage (l,m) image (A, φ ) at (u,v) pixel intensity DFT v m u l Kees van Berkel page 8
Sampling Lucy in u-v domain with a disc u, v × = IDFT DFT = x, y * Kees van Berkel page 9
DFT convolution theorem visibility sampling observed complex function visibility (hermitian) “de-convolution” V ν ( u , v ) S ( u , v ) VS ν ( u , v ) × = DFT ! ! ! I-DFT I D I ν ( l , m ) B ν ( l , m ) ν ( l , m ) ∗ = convolution image dirty beam real dirty image map point spread function dirty map Kees van Berkel page 10
DFT convolution : Lucy with 2 hours VLA time u, v × = IDFT DFT = x, y * Kees van Berkel page 11
DFT convolution : Lucy with 12 hours VLA time u, v × = IDFT DFT = x, y * Kees van Berkel page 12
DFT convolution: synthetic sky with 2 hours VLA time u, v × = IDFT DFT CLEAN I-DFT “de-convolution” = x, y * Kees van Berkel page 13
De-convolution (“imaging”) based on CLEAN **G ~ () FFT V(u,v,w) V(u,v,w=0) sky model **clean beam update − 3 × 10 point Image V(u,v,w) + + iterations source I(l,m) ** γ× PSF extract 100 × (dirty beam) residual image − V(u,v,w) V(u,v,w=0) G T () × I-FFT 3D 2D visibilities [complex] image [real] [Hög74] (W-projection/snapshot implicit) Kees van Berkel page 14
SKA1-mid [South Africa]: science in 2020 Towards a Square Kilometer Array photograph artist impression SKA Organisation [Dew13] /Swinburne Astronomy Productions Kees van Berkel page 15
Imaging: compute load for SKA1-mid quantity unit 10 log note # base lines 5+ 2 2 × (#dishes) 2 = (2 × 200) 2 dump rate s -1 1+ (integration time = 0.08s) -1 observation time s 3 # channels 5 “image cube” for spectral analysis # visibilities / observation 14.5 = input to imaging ( ≈ 10 16 Byte) # op /visibility /iteration 4.5 convolution, matrix multiply, (I)FFT # major iterations 1.5 (3 × calibration) × (10 × major) # op /observation 20.5 # op /sec Hz 17.5 ≈ 1 exaflop/ sec • #operations/visibility/iteration depends on W -projection method • calibration loop (3 × ) around imaging loop [Jon14, Ver15, Wijn14] Kees van Berkel page 16
EXAflops/sec in 2015? SKA1-mid 20 MW Piz Daint (CH) * Tianhe-2, #1 2016 • net SKA1-mid computation load “ 2020 ” versus [Gre14] • gross (peak) compute performance “ 2015 ” Kees van Berkel page 17
Piz Daint: a Cray XC30 system Comprising many kilometers of (optical) cable, ... and 5272 nodes Kees van Berkel page 18
Super computers 101 A modern supercomputer = N (10 3 -10 5 ) identical nodes connected by a network (ignoring storage, peripherals, service nodes, …) network (system-level interconnect) Synchronous DRAM / node LOC LOC SD LOC RAM Mem = on-chip memory, e.g. L2 MC = Memory Controler NIC = Network Interface Mem MC NIC NoC = Network on Chip NoC TOC = Throughput Optimized Core LOC LOC LOC LOC LOC LOC SoC LOC = Latency Optimized Core TOC LOC SoC = System on Chip N “TOC, LOC” is Nvidia speak Kees van Berkel page 19
Piz Daint: 7.8 Pflops/s @ 1.8MW TOC = Tesla K20X GPU Cray CX30, N= 5272 LOC = 8 × Intel Xeon @2.6 GHz Aries network, Dragonfly router IC Piz Daint node system LOC LOC SD LOC RAM # nodes 1 5272 Xeon 2.6GHz 0.17 L2 MC NIC Tesla K20X 1.31 TFLOP/s 1.48 7787 NoC TB 0.06 337 LOC LOC LOC LOC LOC LOC SoC TOC LOC N kW 0.33 1754 Kees van Berkel page 20
Nvidia Research [Ore14]: 1 ExaFlops/s @ 23MW in 2020 TOCs = 8,192 multiply-add @ 1GHz N= 76800 (7nm CMOS) [Ore14] double-precision Aries-like network Nvidia 2020 node system LOC LOC SD LOC RAM # nodes 1 76800 L2 MC NIC TOC 16.4 TFLOP/s PF/s 16.4 1258 NoC TB 0.51 39322 LOC LOC LOC LOC LOC LOC SoC TOC LOC N kW 0.30 23000 Kees van Berkel page 21
Exascale computing for radio astronomy Exascale computing: 10 18 flops Radio astronomy: 10 17.5 flops with gridding ( W -projection) and 2D-FFT as heavy kernels. LOC LOC SD LOC Let’s map 2D-FFT on a node. RAM Option 1: FPGA* Mem MC NIC NoC Option 2: GPU LOC LOC LOC LOC LOC LOC SoC TOC LOC * in same package (not same SoC) N Kees van Berkel page 22
State-of-the-art GPU and FPGAs Nvidia Intel/Altera Xilinx GP100 Stra4x 10 VU13P cmos nm 16 14 16 clock frequency MHz 1328 800 *800 scalar/dsp processors 3584 11520 11,904 peak throughput GFLOP/s 9519 9216 7619 data type [32b] float float fixed DRAM interface HBM2 # HBM2 # HBM2 DRAM bandwidth GB/s 256 256 256 power consump4on W 300 126 GfFLOP/W 32 73 * assumption, no data found # HBM2 (High Bandwidth Memory) interface to 3D stacked DRAM is an option. Kees van Berkel page 23
DFT in matrix-vector form Let x and X be complex vectors of length N. T = F N ⋅ x 0 , x 1 , x N − 1 X T = F N ⋅ x T T ( ) ( ) or X 0 , X 1 , X N − 1 ω = e 2 π i / N Where F N is the twiddle factor matrix, Y = F M ⋅ X ⋅ F N In 2 dimensions: Where Y and X are matrices of size M × N. F M ⋅ X : apply M -point 1D-DFT to each column of matrix X . Kees van Berkel page 24
DFT: Cooley-Tukey factorization theorem Let N=P × M . Then F N can be factorized as [Loa92] F N = P N , P ( I P ⊗ F M ) D N ( F P ⊗ I M ) Where I M and I p are identity matrices, ⊗ denotes the tensor product, D N the diagonal twiddle matrix. Example: P=2, M=N/P =4: • M radix-2 DFTs • D N • P radix- N/2 DFTs • P N,P omitted Cooley Tukey factorization is the basis of FFT Kees van Berkel page 25
Recommend
More recommend