Image-Domain Gridding on Accelerators


  1. Image-Domain Gridding on Accelerators
Bram Veenboer, Netherlands Institute for Radio Astronomy (ASTRON)
Monday 26th March 2018, GPU Technology Conference 2018, San Jose, USA
ASTRON is part of the Netherlands Organisation for Scientific Research (NWO)

  2. Introduction to radio astronomy
Observe the sky at radio wavelengths (image credits: NRAO).
For a given resolution, the size of the telescope is proportional to the wavelength: the Hubble Space Telescope observes at around 1 µm with a 2.4 m mirror; the same resolution at a 1 mm wavelength requires a 2 km dish!

  3. Radio telescope: astronomical interferometer
An array of separate telescopes: an interferometer.
Creating an image from it: interferometry.
Combine the signals for pairs of telescopes.
Resolution similar to one very large dish.

  4. Interferometry theory
Sparse sampling of the 'uv-plane': every baseline samples the uv-plane, yielding a 'visibility' with an associated 'uvw-coordinate'.
The orientation of a baseline also determines its orientation in the uv-plane.
A sample V(u, v) is a sample of the 2D Fourier transform of the sky brightness B(l, m).
Apply a non-uniform Fourier transform to get a sky image from the uv data.
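In formula form, this is the standard 2D relation (ignoring the w-term, which later slides address):

    V(u, v) = \iint B(l, m)\, e^{-2\pi i (u l + v m)}\, \mathrm{d}l\, \mathrm{d}m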

  5. Sampling using 2 antennas
Every sample (in the uv-domain) shows up as a waveform in the image (image credits: NRAO).

  6. Sampling using 4 antennas
Every baseline (pair of two antennas) adds information to the image.

  7. Sampling using 16 antennas (compact)
Using 16 antennas, the (artificial) source in the center of the image becomes visible.

  8. Sampling using 16 antennas (extended)
Longer baselines (larger antenna spacings) increase the resolution of the image.

  9. Sampling using 32 antennas, for 8 hours
Sampling for an extended period of time increases the signal-to-noise ratio.

  10. Creating a sky-image
Imager: visibilities → sky-image.
[Figure: signal chain: incoming radio waves pass through the ionosphere to the receivers; the correlator combines the signals per baseline (pair of receivers) into visibilities; calibration and imaging turn the visibilities into a sky-image and sky-model.]
Correlator: combines the signals into 'visibilities' (with associated 'uvw-coordinates').
Imager: gridder → regular grid → 2D FFT → sky-image.

  11. Gridding using AW-projection (and W-stacking)
A-term: correct for direction-dependent effects.
W-term: correct for the curvature of the earth.
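For context (not shown on the slide), the standard measurement equation including the w-term reads

    V(u, v, w) = \iint \frac{B(l, m)}{\sqrt{1 - l^2 - m^2}}\,
                 e^{-2\pi i \left[ u l + v m + w \left( \sqrt{1 - l^2 - m^2} - 1 \right) \right]}\,
                 \mathrm{d}l\, \mathrm{d}m

and the A-term additionally multiplies B(l, m) with the direction-dependent response of the two receivers that form the baseline.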

  12. W-projection gridding and Image-Domain Gridding
W-projection gridding: each visibility is convolved with a gridder kernel onto the Fourier grid (the convolution updates the pixels around its uv-coordinate).
Image-Domain Gridding: visibilities are first gridded onto small subgrids in the image domain (per chunk of time and channels); each subgrid is then Fourier-transformed (FFT) and an adder places the Fourier subgrids onto the Fourier grid.
For more details: Image-Domain Gridding on Graphics Processors, Bram Veenboer, Matthias Petschow and John W. Romein, IPDPS 2017.
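Schematically, and in simplified form (omitting the A-term, taper and subgrid phase offset), the IDG gridder computes every subgrid pixel as a direct sum over the visibilities assigned to that subgrid, with ν_c the frequency of channel c and c_0 the speed of light:

    \mathrm{pixel}(l, m) = \sum_{t} \sum_{c} V(t, c)\;
        e^{2\pi i\, \frac{\nu_c}{c_0} \left( u_t l + v_t m + w_t\, n(l, m) \right)},
    \qquad n(l, m) = \sqrt{1 - l^2 - m^2} - 1

A 2D FFT then turns the subgrid into its Fourier-domain counterpart, which the adder places onto the grid.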

  13. Square Kilometre Array
SKA1 Low, Australia; SKA1 Mid, Africa.
Data rates up to ≈ 10,000,000,000 visibilities/second.

  14. Results: runtime/throughput
Runtime: time spent in one imaging cycle (gridding, FFT and degridding). Throughput: number of visibilities processed per second.
[Figure: runtime breakdown per kernel (gridder, subgrid-ifft, adder, grid-fft, splitter, subgrid-fft, degridder) in seconds, and gridding/degridding throughput in MVisibilities/s, for Haswell, KNL, Pascal and Vega.]
Most time is spent in the gridder/degridder.
GPUs perform more than an order of magnitude better than the CPU and Xeon Phi.
Very similar throughput for gridding and degridding.

  15. Roofline analysis: overview
[Figure: roofline plot of performance (TOp/s) versus operational intensity (Op/Byte) for the gridder and degridder kernels on Haswell, KNL, Pascal and Vega 10.]

  16. Throughput limitation: host-device transfers
PCIe: ≈ 12 GB/s vs. NVLink: ≈ 68 GB/s.
A fast interconnect is needed to keep the GPU busy computing.
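A minimal sketch of one common way to hide the interconnect: double buffering with CUDA streams, so that host-to-device copies overlap with kernel execution. The buffer sizes, the placeholder kernel and the two-way split below are illustrative assumptions, not taken from the IDG code.

    #include <cuda_runtime.h>

    // Placeholder for the actual gridding work on a chunk of data
    __global__ void process(const float2 *in, float2 *out, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    int main() {
        const size_t n = 1 << 24, half = n / 2;

        float2 *h_in, *d_in, *d_out;
        cudaHostAlloc((void **)&h_in, n * sizeof(float2), cudaHostAllocDefault); // pinned host memory
        cudaMalloc((void **)&d_in, n * sizeof(float2));
        cudaMalloc((void **)&d_out, n * sizeof(float2));

        cudaStream_t streams[2];
        for (int s = 0; s < 2; s++) cudaStreamCreate(&streams[s]);

        // While one half is being copied to the device, the other half is processed
        for (int s = 0; s < 2; s++) {
            size_t offset = s * half;
            cudaMemcpyAsync(d_in + offset, h_in + offset, half * sizeof(float2),
                            cudaMemcpyHostToDevice, streams[s]);
            process<<<(half + 255) / 256, 256, 0, streams[s]>>>(d_in + offset, d_out + offset, half);
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < 2; s++) cudaStreamDestroy(streams[s]);
        cudaFree(d_out); cudaFree(d_in); cudaFreeHost(h_in);
        return 0;
    }

Whether such overlap is enough to keep the gridder busy ultimately depends on the interconnect bandwidth, which is the point of the PCIe vs. NVLink comparison above.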

  17. Roofline analysis: overview
[Figure: the roofline plot of slide 15 again: performance (TOp/s) versus operational intensity (Op/Byte) for the gridder and degridder on Haswell, KNL, Pascal and Vega 10.]

  18. Inner loop of the (de)gridder kernel: instruction mix
Many fused multiply-add (FMA) operations and one sine/cosine computation per channel:

    for c = 1, ..., C̃ do                     // loop over channels
        α = ...                               // phase for this channel
        Φ = cos(α) + i·sin(α)                 // phasor
        // complex multiply-accumulate pix11 += vis11[c] · Φ, expanded:
        Re(pix11) += Re(vis11[c]) · Re(Φ)
        Im(pix11) += Re(vis11[c]) · Im(Φ)
        Re(pix11) -= Im(vis11[c]) · Im(Φ)
        Im(pix11) += Im(vis11[c]) · Re(Φ)
        // [... same for pix12, pix21 and pix22]
    end

FMA: peak performance on all architectures.
sine/cosine: poor performance on Intel architectures.
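The same inner loop as a CUDA sketch, showing how the complex multiply-accumulate maps onto FMAs plus one sine/cosine per channel. This is a simplified illustration (one pixel and one polarization per thread, phases precomputed per channel), not the kernel from the IDG repository; in the real kernel the phase also depends on the pixel position and the uvw-coordinate.

    #include <cuda_runtime.h>

    #define NR_CHANNELS 16  // illustrative channel count

    __global__ void gridder_inner_loop(const float2 *visibilities, // one visibility per channel
                                       const float *phases,        // precomputed phase per channel
                                       float2 *pixels)
    {
        float2 pix = make_float2(0.0f, 0.0f);

        #pragma unroll
        for (int c = 0; c < NR_CHANNELS; c++) {
            // One sine/cosine per channel: the phasor Φ = cos(α) + i·sin(α)
            float sin_a, cos_a;
            __sincosf(phases[c], &sin_a, &cos_a);

            // Complex multiply-accumulate pix += vis · Φ, expanded into four FMAs
            float2 vis = visibilities[c];
            pix.x = fmaf(vis.x, cos_a, pix.x);
            pix.y = fmaf(vis.x, sin_a, pix.y);
            pix.x = fmaf(-vis.y, sin_a, pix.x);
            pix.y = fmaf(vis.y, cos_a, pix.y);
        }

        pixels[blockIdx.x * blockDim.x + threadIdx.x] = pix;
    }

GPUs evaluate the sine/cosine in hardware special function units, which helps explain why they stay close to their FMA peak while the Intel architectures do not.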

  19. Roofline analysis: instruction mix
[Figure: the roofline plot of slide 15, now taking the instruction mix (FMA plus sine/cosine) into account, for the gridder and degridder on Haswell, KNL, Pascal and Vega 10.]

  20. Roofline analysis: shared memory
[Figure: roofline plot of performance (TOp/s) versus operational intensity (Op/Byte) for the gridder and degridder on Pascal and Vega 10, with shared (local) memory traffic taken into account.]
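One way shared memory raises the operational intensity, sketched below under assumptions of my own (chunk size, one pixel per thread; not the actual IDG kernel): each thread block stages a chunk of visibilities in shared memory once, and every thread then reuses that chunk for its own subgrid pixel.

    #include <cuda_runtime.h>

    #define CHUNK_SIZE 128  // illustrative number of visibilities staged per iteration

    __global__ void gridder_shared(const float2 *visibilities, int nr_visibilities,
                                   float2 *subgrid, int nr_pixels)
    {
        __shared__ float2 vis_shared[CHUNK_SIZE];

        int pixel = blockIdx.x * blockDim.x + threadIdx.x;
        float2 pix = make_float2(0.0f, 0.0f);

        for (int offset = 0; offset < nr_visibilities; offset += CHUNK_SIZE) {
            // Cooperative load: the chunk is read from device memory only once per block
            for (int i = threadIdx.x; i < CHUNK_SIZE; i += blockDim.x)
                if (offset + i < nr_visibilities)
                    vis_shared[i] = visibilities[offset + i];
            __syncthreads();

            // Every thread reuses the staged chunk for its own pixel
            // (phasor computation omitted; see the inner-loop sketch of slide 18)
            for (int i = 0; i < CHUNK_SIZE && offset + i < nr_visibilities; i++) {
                pix.x += vis_shared[i].x;
                pix.y += vis_shared[i].y;
            }
            __syncthreads();
        }

        if (pixel < nr_pixels)
            subgrid[pixel] = pix;
    }

Because every staged visibility is reused by all threads in the block, the device-memory traffic per operation drops and the kernel moves to the right in the roofline plot.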

  21. Results: energy consumption/efficiency
[Figure: energy-consumption breakdown per kernel (gridder, subgrid-ifft, adder, grid-fft, splitter, subgrid-fft, degridder, host) in kJ, and energy efficiency in GFlop/W, for Haswell, KNL, Pascal and Vega.]
Most energy is spent in the gridder/degridder.
GPUs perform more than an order of magnitude better than the CPU and Xeon Phi.

  22. Results: comparison with AW-projection
[Figure: throughput (visibilities/s, 10^7 to 10^8) on Pascal for IDG, W-projection gridding (WPG) and AW-projection gridding (AWPG) as a function of the W-kernel size N_W (8 to 64).]
IDG outperforms W-projection, while it also corrects for the challenging A-terms.

  23. Results: creating very large images (GPU-only)
[Figure: gridding and degridding throughput (roughly 180 to 240 MVisibilities/s) versus image size (1024² to 65536² pixels), GPU only.]
The size of the image is restricted by the amount of GPU device memory.

  24. Results: creating very large images (GPU + CPU)
[Figure: gridding and degridding throughput versus image size for the GPU-only and the hybrid configuration.]
In the hybrid configuration the adder kernel is executed by the host CPU.

  25. Results: creating very large images (Unified Memory)
[Figure: gridding and degridding throughput versus image size for the GPU-only, hybrid and Unified Memory configurations; the Unified Memory runs use tiling in the adder/splitter.]
Unified Memory (and tiling) enables the GPU to create very large images.
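A minimal sketch of the Unified Memory idea for the full grid (the sizes and the cuComplex type are illustrative assumptions, and the tiling in the adder/splitter is not shown): the grid is allocated with cudaMallocManaged, so GPU kernels can address a grid larger than device memory while the driver pages data in and out on demand.

    #include <cuda_runtime.h>
    #include <cuComplex.h>

    int main() {
        const size_t grid_size = 65536;                    // pixels per dimension
        const size_t nr_pixels = grid_size * grid_size;    // ≈ 34 GB of complex floats

        cuFloatComplex *grid;
        cudaMallocManaged((void **)&grid, nr_pixels * sizeof(cuFloatComplex));

        // Optional hint: the grid is mostly touched from the GPU
        int device;
        cudaGetDevice(&device);
        cudaMemAdvise(grid, nr_pixels * sizeof(cuFloatComplex),
                      cudaMemAdviseSetPreferredLocation, device);

        // ... adder/splitter kernels can now read and write `grid` directly ...

        cudaFree(grid);
        return 0;
    }

Tiling in the adder/splitter presumably keeps the working set small, so that only a fraction of the grid needs to be resident on the GPU at any time.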

  26. Image-Domain Gridding for the Square Kilometre Array
Imaging data rate: 200 GVis/s. Compute: 50 PFlop/s (DP). Power budget: ≈ 1 MW. (De)gridding: ≈ 60%.

                      SKA-1 Low      SKA-1 Mid
    # receivers       512            133
    # baselines       130,816        8,778
    # channels        65,536         65,536
    # polarizations   4              4
    integration time  0.9 s          0.14 s
    data rate         8.3 GVis/s     9.53 GVis/s

IDG on Tesla V100/NVLink: ≈ 0.26 GVis/s per GPU → 770 V100s required.
Required compute: 770 × 7.8 TFlop/s ≈ 6 PFlop/s ≪ 15 PFlop/s available.
Power budget: 770 × 300 W ≈ 231 kW ≪ 600 kW available.
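As a sanity check on the table (assuming every pair of receivers forms a baseline):

    N_{\text{baselines}} = \frac{N_{\text{receivers}}\,(N_{\text{receivers}} - 1)}{2}:
    \qquad \frac{512 \cdot 511}{2} = 130{,}816 \;(\text{Low}),
    \qquad \frac{133 \cdot 132}{2} = 8{,}778 \;(\text{Mid})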

  27. Summary
High-performance gridding and degridding, including AW-term correction.
GPUs are much faster and more energy-efficient than CPUs and Xeon Phi.
On GPUs, IDG outperforms AW-projection.
IDG is able to make very large images (using Unified Memory).
The most challenging sub-part of imaging for the SKA is solved!
More details: Image-Domain Gridding on Graphics Processors, Bram Veenboer, Matthias Petschow and John W. Romein, IPDPS 2017.
Source available at: https://gitlab.com/astron-idg/idg
