  1. Powering Real-time Radio Astronomy Signal Processing with latest GPU architectures
     Vinay Deshpande (NVIDIA, India), Bharat Kumar (NVIDIA, India), Harshavardhan Reddy Suda (NCRA, India)

  2. What signals are we processing?
     ▪ Digitized baseband signals from the 30 dual-polarized antennas of the GMRT
     ▪ The Giant Metrewave Radio Telescope (GMRT) is a world-class instrument for studying astrophysical phenomena at low radio frequencies
     ▪ Located 80 km north of Pune, 160 km east of Mumbai
     ▪ An array telescope with 30 antennas of 45 m diameter, operating at metre wavelengths

  3. GMRT
     ▪ Supports two modes of operation:
       - Interferometry (correlator)
       - Array mode (beamformer)
     ▪ Frequency bands:
       - 130 to 260 MHz
       - 250 to 500 MHz
       - 550 to 900 MHz
       - 1050 to 1600 MHz
     ▪ Maximum instantaneous bandwidth: 400 MHz (legacy GMRT: 32 MHz)
     ▪ Effective collecting area (2-3% of SKA):
       - 30,000 sq m at lower frequencies
       - 20,000 sq m at higher frequencies

  4. The Giant Metrewave Radio Telescope: a Google eye view

  5. GMRT receiver chain: signal processing in the digital back-end. Image courtesy: Ajith Kumar, NCRA

  6. Computation requirements
     ▪ Antenna signals: M = 64, sampled at a maximum bandwidth of 400 MHz
     ▪ Fourier transform to 16k-point spectral channels, O(N log N): 3 TFlops
     ▪ Phase correction: 0.1 TFlops
     ▪ MAC over M(M+1)/2 signal pairs: 6.6 TFlops
     ▪ Total: ~10 TFlops (a worked estimate follows below)
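A back-of-the-envelope check of the slide's budget in C. The conventional 5 N log2(N) flop count per complex FFT, the 8 flops per complex multiply-accumulate, and the ~4 flops per spectral sample for phase correction are standard assumptions, not figures from the slide; they reproduce the slide's totals to within rounding.

/* Sanity check of the ~10 TFlops figure under stated assumptions */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double M  = 64;     /* antenna signals (32 antennas x 2 polarizations) */
    const double B  = 400e6;  /* maximum bandwidth, Hz */
    const double fs = 2 * B;  /* Nyquist-rate real sampling, samples/s */
    const double N  = 16384;  /* FFT length (16k spectral channels) */

    /* FFT: fs/N transforms per second per signal, 5*N*log2(N) flops each */
    double fft   = M * (fs / N) * 5 * N * log2(N);     /* ~3.6 TFlops */
    /* Phase correction: ~4 flops per complex spectral sample (rate = B) */
    double phase = M * B * 4;                          /* ~0.1 TFlops */
    /* MAC: M(M+1)/2 signal pairs, 8 flops per complex MAC */
    double mac   = (M * (M + 1) / 2) * 8 * B;          /* ~6.7 TFlops */

    printf("FFT: %.1f  Phase: %.1f  MAC: %.1f  Total: %.1f TFlops\n",
           fft / 1e12, phase / 1e12, mac / 1e12, (fft + phase + mac) / 1e12);
    return 0;
}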

  7. Design: time-slicing model

  8. Design: time-slicing model, a 4-node example. Ant 1, Ant 2, ..., Ant 16: digitized baseband signals of the antennas (a sketch of one reading of this model follows below)
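The slide's diagram is not reproduced here. A minimal sketch of one plausible reading of the time-slicing model, in which each compute node handles every P-th time slice of all antenna streams; the round-robin assignment and the process_slice stub are assumptions for illustration, not the GMRT code.

#include <mpi.h>

#define NANT   16   /* antennas in the 4-node example */
#define NSLICE 64   /* time slices in this toy run */

/* Stand-in for the per-slice pipeline (FFT, phase correction, MAC) */
static void process_slice(int slice, int ant) { (void)slice; (void)ant; }

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* e.g. 4 compute nodes */

    /* Round-robin in time: node 'rank' takes slices rank, rank+size, ...
       so every node sees all antennas but only 1/size of the time stream */
    for (int s = rank; s < NSLICE; s += size)
        for (int a = 0; a < NANT; a++)
            process_slice(s, a);

    MPI_Finalize();
    return 0;
}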

  9. Implementation
     ▪ 16 Dell T630 machines as compute nodes
     ▪ 16 ROACH (FPGA) boards with Atmel/e2v-based ADCs, developed by the CASPER group, Berkeley, for digitization and packetization
     ▪ 32 Tesla K40c GPU cards for processing
     ▪ 36-port Mellanox InfiniBand switch for data sharing between compute nodes and host nodes
     ▪ Software: C/C++ and CUDA C, with OpenMPI and OpenMP directives
     ▪ Developed in collaboration with Swinburne University, Australia

  10. Implementation. Image courtesy: Irappa Halagalli, NCRA

  11. Sample result: image of the Coma cluster
      ▪ Legacy GMRT, 325 MHz: RMS noise 350 μJy
      ▪ Upgraded GMRT, 300-500 MHz: RMS noise 28 μJy
      ▪ Significantly lower noise RMS and better image quality with the upgraded GMRT
      Image courtesy: Dharam Vir Lal and Ishwar Chandra, NCRA

  12. Computation performance: K40 (CUDA 7.5; no. of antennas: 32, dual pol)

      Channels   FFT (GFlops)   MAC (GFlops)
      2048       620            626
      4096       626            620
      8192       512            574
      16384      498            537

  13. Motivation for next-generation GPUs
      ▪ Adding more compute-intensive applications:
        - Multi-beamforming
        - Processing on each beam (beam steering)
        - Gated correlator
        - FIR filtering with many taps for the narrow-band mode implementation
      ▪ The working GMRT system and code provide an excellent testing ground for the features of next-generation GPUs
      ▪ Performance measured and compared on GP100 and V100

  14. Computation performance: K40 vs GP100 (CUDA 7.5, ECC off)
      Performance follows the cuFFT benchmarks for the K40 and P100.
      K40 benchmark reference: CUDA 6.5 Performance Report, September 2014
      P100 benchmark reference: CUDA 8 Performance Overview, November 2016

  15. Computation performance: K40 vs GP100 (CUDA 7.5, ECC off; no. of antennas: 32, dual pol)

  16. Computation performance: K40 vs GP100 (CUDA 7.5, ECC off)
      Peak performance: K40 – 4.3 TFlops; GP100 – 9.3 TFlops
      Peak global memory bandwidth: K40 – 288 GB/s; GP100 – 732 GB/s

  17. Computation performance as % of real time (bandwidth: 200 MHz; no. of antennas: 32, dual pol; spectral channels: 16384)

  18. Computation performance: GP100 vs V100 (GP100 on CUDA 7.5; V100 on CUDA 9.1, using the PSG cluster)

  19. Computation performance: GP100 vs V100 (GP100 on CUDA 7.5; V100 on CUDA 9.1, using the PSG cluster; no. of antennas: 32, dual pol)

  20. Computation performance: GP100 vs V100 (GP100 on CUDA 7.5; V100 on CUDA 9.1, using the PSG cluster)
      Peak performance: GP100 – 9.3 TFlops; V100 – 14 TFlops
      Peak global memory bandwidth: GP100 – 732 GB/s; V100 – 900 GB/s

  21. Reasons behind the relatively low performance of MAC
      ▪ Non-contiguous global memory access at the block level, a consequence of the MAC input data format
      ▪ Low arithmetic intensity
      (An illustrative kernel follows below.)
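An illustrative sketch, not the GMRT kernel: a naive correlator MAC over one time slice. The input is assumed channel-major ([chan][ant]), so each thread's loads are strided by nant elements (non-contiguous at the block level), and each 8-flop complex MAC loads 16 bytes of input, i.e. low arithmetic intensity.

// CUDA C sketch of the problem described on the slide
#include <cuda_runtime.h>
#include <cuComplex.h>

// Launch: grid (ceil(nchan/256), nant*(nant+1)/2), block 256 threads
__global__ void mac_naive(const cuFloatComplex *spec, // [chan][ant], antenna fastest
                          cuFloatComplex *vis,        // [baseline][chan]
                          int nant, int nchan)
{
    int chan = blockIdx.x * blockDim.x + threadIdx.x;
    if (chan >= nchan) return;

    int bl = blockIdx.y;               // one baseline per grid row
    int i = 0, j = bl;                 // decode bl -> antenna pair (i, i+j)
    while (j >= nant - i) { j -= nant - i; ++i; }
    j += i;

    // Strided loads: neighbouring threads are nant elements apart
    cuFloatComplex a = spec[chan * nant + i];
    cuFloatComplex b = spec[chan * nant + j];

    // 8 flops per 16 bytes loaded: low arithmetic intensity
    vis[bl * nchan + chan] =
        cuCaddf(vis[bl * nchan + chan], cuCmulf(a, cuConjf(b)));
}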

  22. GPU kernel improvements
      ▪ FFT: single-precision to half-precision floating point
      ▪ MAC:
        - Simplified index arithmetic: improved the L2 hit ratio from less than 5% to nearly 86%
        - Vectorized loads (float4): increased ILP (see the sketch after this list)
        - Exposed more parallelism by increasing occupancy
        - Single-precision to half-precision floating point: no performance gain
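A sketch of the float4 idea (hypothetical kernel, not the production code): with a contiguous per-antenna layout, each thread loads two packed complex samples per float4, doubling the bytes in flight per load instruction and exposing more ILP while keeping accesses coalesced.

// CUDA C sketch: vectorized loads for the MAC inner loop
#include <cuda_runtime.h>

__global__ void mac_float4(const float4 *specA,  // antenna i: 2 complex samples per float4
                           const float4 *specB,  // antenna j, same packing
                           float4 *vis,          // accumulated visibilities, same packing
                           int nquad)            // nchan / 2
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= nquad) return;

    float4 a = specA[q];   // (a0.re, a0.im, a1.re, a1.im)
    float4 b = specB[q];
    float4 v = vis[q];

    // v += a * conj(b) for both packed complex samples
    v.x += a.x * b.x + a.y * b.y;
    v.y += a.y * b.x - a.x * b.y;
    v.z += a.z * b.z + a.w * b.w;
    v.w += a.w * b.z - a.z * b.w;
    vis[q] = v;
}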

  23. MAC: performance gain with optimizations on V100 (CUDA 9.1, using the PSG cluster; no. of antennas: 32, dual pol)

  24. FFT: performance gain with half precision on V100 (CUDA 9.1, using the PSG cluster)
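For reference, the documented cuFFT path to half precision goes through the extensible cufftXt plan interface; the function and type names below are the library's own, while the wrapper and buffer names are illustrative. Half-precision transforms require power-of-two sizes, which the 2048- to 16384-channel modes above satisfy. Error checking is omitted for brevity.

// CUDA C sketch: creating an FP16 complex-to-complex FFT plan
#include <cufft.h>
#include <cufftXt.h>
#include <cuda_fp16.h>
#include <library_types.h>

cufftHandle make_half_fft(long long nchan /* e.g. 2048 */, long long batch)
{
    cufftHandle plan;
    size_t ws = 0;
    long long n[1] = { nchan };

    cufftCreate(&plan);
    // Input, output, and execution all in FP16 (CUDA_C_16F)
    cufftXtMakePlanMany(plan, 1, n,
                        NULL, 1, nchan, CUDA_C_16F,   // input layout / type
                        NULL, 1, nchan, CUDA_C_16F,   // output layout / type
                        batch, &ws, CUDA_C_16F);      // execution type
    return plan;
}
// Execution: cufftXtExec(plan, d_in, d_out, CUFFT_FORWARD);
// with d_in / d_out pointing to half2 device buffers.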

  25. FFT: error analysis with half precision in the power spectrum (spectral channels: 2048; batch size: 128)

  26. FFT: error analysis with half precision in the phase spectrum (spectral channels: 2048; batch size: 128)
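One way to compute the error metrics behind slides 25 and 26 (the slides' exact definitions are not stated, so this is an assumed methodology): run the same input through FP32 and FP16 FFTs, upcast the FP16 result, and compare per-channel power and phase.

/* C sketch: per-channel power and phase error between two spectra */
#include <math.h>
#include <complex.h>

void spectrum_errors(const float complex *fp32,  /* single-precision reference */
                     const float complex *fp16,  /* FP16 result, upcast to float */
                     float *pow_err, float *phase_err, int nchan)
{
    for (int k = 0; k < nchan; k++) {
        float p32 = crealf(fp32[k] * conjf(fp32[k]));  /* |X|^2 */
        float p16 = crealf(fp16[k] * conjf(fp16[k]));
        pow_err[k]   = (p32 > 0.0f) ? fabsf(p16 - p32) / p32 : 0.0f;
        /* No 2*pi unwrapping; purely illustrative */
        phase_err[k] = fabsf(cargf(fp16[k]) - cargf(fp32[k]));
    }
}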

  27. Going forward
      ▪ Improving MAC using Tensor Cores: potential 2x improvement (see the sketch after this list)
      ▪ Implementing the MAC optimizations and the half-precision floating-point FFT in the GMRT code
      ▪ Optimized FIR filtering routines in CUDA for the narrow-band mode implementation
      ▪ Implementing multi-beamforming, beam steering and a gated correlator
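The Tensor Core item is future work on the slide, so the following is only a sketch of the building block, not the GMRT implementation: WMMA multiplies 16x16x16 half-precision tiles with FP32 accumulation, and a correlator can cast the station-by-station MAC as many such small matrix products. Requires sm_70 or later; one full warp computes one tile.

// CUDA C sketch: a single WMMA tile multiply-accumulate (launch with 32 threads)
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void mac_wmma(const half *A,   // 16x16 tile, row-major
                         const half *B,   // 16x16 tile, col-major
                         float *C)        // 16x16 FP32 accumulator tile
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::load_matrix_sync(a, A, 16);                 // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::load_matrix_sync(c, C, 16, wmma::mem_row_major);
    wmma::mma_sync(c, a, b, c);                       // c += a * b on Tensor Cores
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}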

  28. Acknowledgements
      ▪ Prof. Yashwant Gupta, Centre Director, NCRA
      ▪ Ajith Kumar B., Back-end group coordinator, GMRT, NCRA
      ▪ Sanjay Kudale, GMRT, NCRA
      ▪ Shelton Gnanaraj, GMRT, NCRA
      ▪ Andrew Jameson, Swinburne University, Australia
      ▪ Benjamin Barsdell, Swinburne University, Australia (now at NVIDIA)
      ▪ CASPER Group, Berkeley
      ▪ Digital Back-end Group, GMRT, NCRA
      ▪ Computer Group, GMRT, NCRA
      ▪ Control Room, GMRT

  29. Thank You
