Realtime Signal Processing on Nvidia TX2 using CUDA


  1. Realtime Signal Processing on Nvidia TX2 using CUDA. Armin Weiss, Dr. Amin Mazloumian, Dr. Matthias Rosenthal. Institute of Embedded Systems, High Performance Multimedia Research Group, Zurich University of Applied Sciences (Zürcher Fachhochschule)

  2. System Overview [Block diagram: a digital audio mixing console; audio flows from the Audio Source through the Audio I/O Card into the TX2 (Control Unit and Audio Processing) and back out to the Audio Sink]

  3. Motivation
     • In Comparison to FPGA / DSP Solutions:
       • Performance Gain: 100x (e.g. Analog Devices SHARC)
       • Fast Development Time
     • Nvidia TX Series:
       • System on a Single Chip (CPU + GPU)
       • Cost Effective

  4. Challenges
     • Short and Deterministic Latency
       • 32 Samples (0.33 ms @ 96 kHz)
       • For comparison, video (60 Hz): 16.7 ms / frame
     • High Input and Output Data Rate
       • 256 Channels ∗ 7 Inputs ∗ 32 Bit ∗ 96 kHz = 5.5 Gb/s
       • For comparison, 1080p@60 (24-bit RGB): 3.0 Gb/s

  5. Short and Deterministic Latency: Variability in GPU Kernel Launch
     Measurement kernel (numElem = 25):

         __global__ void identity(float* input, float* output, int numElem) {
             for (int index = 0; index < numElem; index++) {
                 output[index] = input[index];
             }
         }

     Result: ≈ 99.8% of launches behave as expected; the remaining outliers fall into two groups of < 0.1% and 0.1% - 0.2%.
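A minimal host-side sketch (not the authors' benchmark; iteration count, launch configuration and timing method are assumptions) of how such a launch-latency distribution could be measured:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void identity(float* input, float* output, int numElem) {
        for (int index = 0; index < numElem; index++) {
            output[index] = input[index];
        }
    }

    int main() {
        const int numElem = 25;          // buffer size from the slide
        const int iterations = 100000;   // assumed number of measurements
        float *in, *out;
        cudaMalloc((void**)&in,  numElem * sizeof(float));
        cudaMalloc((void**)&out, numElem * sizeof(float));

        identity<<<1, 1>>>(in, out, numElem);   // warm-up launch
        cudaDeviceSynchronize();

        std::vector<double> us(iterations);
        for (int i = 0; i < iterations; i++) {
            auto t0 = std::chrono::steady_clock::now();
            identity<<<1, 1>>>(in, out, numElem);   // launch under test
            cudaDeviceSynchronize();                // wait for completion
            auto t1 = std::chrono::steady_clock::now();
            us[i] = std::chrono::duration<double, std::micro>(t1 - t0).count();
        }
        std::sort(us.begin(), us.end());
        printf("median %.1f us, max %.1f us\n", us[iterations / 2], us.back());
        cudaFree(in);
        cudaFree(out);
        return 0;
    }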

  6. Short and Deterministic Latency: Variability in GPU Kernel Launch [Plot: launch latency grows with buffer size; outliers reach ≈ 50 ms]

  7. Short and Deterministic Latency: Problems
     • How to Improve Deterministic Behavior?
       → Solution: Persistent CUDA Kernel
         • Eliminates the Launch Time

  8. Short and Deterministic Latency: Persistent Kernel
     CPU (Host Application):

         while (running) {
             ...
             if (new_audio_samples() == true) {
                 send_sync_to_GPU();
                 wait_for_GPU_sync();
             }
             ...
         }

     GPU (CUDA Kernel):

         __global__ void audioKernel(...) {
             // Infinite loop
             while (true) {
                 wait_for_CPU_sync();
                 // Audio channel processing
                 ...
                 wait_for_all_threads_to_finish();
                 send_sync_to_CPU();
             }
         }

  9. Short and Deterministic Latency: Persistent Kernel: Synchronization
     gpuToken lives in GPU memory accessible by both CPU and GPU.

         __global__ void audioKernel(..., volatile int* gpuToken) {
             ...
             // Infinite loop
             while (true) {
                 wait_for_CPU_sync();
                 // Audio channel processing
                 ...
                 wait_for_all_threads_to_finish();
                 send_sync_to_CPU();
             }
         }

         wait_for_CPU_sync() {
             int i = blockIdx.x * blockDim.x + threadIdx.x;
             if (i == 0)
                 while (*gpuToken == NO_AUDIO);
             synchronize_threads();
         }
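Putting slides 8 and 9 together, a minimal self-contained sketch of the persistent-kernel pattern with zero-copy tokens. The token values, the single-block launch and the trivial per-sample "processing" are assumptions for illustration, not the authors' implementation:

    #include <cstdio>
    #include <cuda_runtime.h>

    #define NO_AUDIO    0
    #define AUDIO_READY 1
    #define GPU_DONE    2
    #define STOP        (-1)

    __global__ void audioKernel(volatile int* gpuToken, volatile int* cpuToken,
                                float* buffer, int numElem) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        while (true) {
            if (i == 0) {
                while (*gpuToken == NO_AUDIO) { }   // wait_for_CPU_sync()
            }
            __syncthreads();
            if (*gpuToken == STOP) return;
            if (i < numElem) buffer[i] *= 0.5f;     // placeholder "audio channel processing"
            __threadfence_system();                 // make results visible to the CPU
            __syncthreads();                        // wait_for_all_threads_to_finish()
            if (i == 0) {
                *gpuToken = NO_AUDIO;               // re-arm for the next block
                __threadfence_system();
                *cpuToken = GPU_DONE;               // send_sync_to_CPU()
            }
        }
    }

    int main() {
        const int numElem = 256;
        int *gpuTokenHost, *cpuTokenHost;
        float *bufferHost;

        cudaSetDeviceFlags(cudaDeviceMapHost);
        cudaHostAlloc((void**)&gpuTokenHost, sizeof(int), cudaHostAllocMapped);
        cudaHostAlloc((void**)&cpuTokenHost, sizeof(int), cudaHostAllocMapped);
        cudaHostAlloc((void**)&bufferHost, numElem * sizeof(float), cudaHostAllocMapped);
        *gpuTokenHost = NO_AUDIO;
        *cpuTokenHost = NO_AUDIO;

        int *gpuTokenDev, *cpuTokenDev;
        float *bufferDev;
        cudaHostGetDevicePointer((void**)&gpuTokenDev, gpuTokenHost, 0);
        cudaHostGetDevicePointer((void**)&cpuTokenDev, cpuTokenHost, 0);
        cudaHostGetDevicePointer((void**)&bufferDev, bufferHost, 0);

        // Launch the persistent kernel once; it keeps running until it sees STOP.
        audioKernel<<<1, numElem>>>(gpuTokenDev, cpuTokenDev, bufferDev, numElem);

        for (int block = 0; block < 3; block++) {
            for (int i = 0; i < numElem; i++) bufferHost[i] = 1.0f;   // "new audio samples"
            __sync_synchronize();                                     // order buffer writes before the token
            *(volatile int*)gpuTokenHost = AUDIO_READY;               // send_sync_to_GPU()
            while (*(volatile int*)cpuTokenHost != GPU_DONE) { }      // wait_for_GPU_sync()
            *(volatile int*)cpuTokenHost = NO_AUDIO;
            printf("block %d processed, buffer[0] = %.2f\n", block, bufferHost[0]);
        }
        *(volatile int*)gpuTokenHost = STOP;                          // shut the kernel down
        cudaDeviceSynchronize();
        return 0;
    }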

  10. Short and Deterministic Latency: CUDA Memory Comparison, Managed <-> Zero Copy [Diagram: two TX2 block diagrams side by side, one labelled Managed Memory and one Zero Copy; each shows CPU and GPU with their caches, the memory controller incl. SMMU, and the GPU buffer in the 8 GB DRAM]
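For reference, a short sketch of the two allocation styles being compared; buffer size and usage are illustrative assumptions:

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 32 * 256 * sizeof(float);   // assumed: 32 samples x 256 channels

        // Must be set before the CUDA context is created if zero copy is used.
        cudaSetDeviceFlags(cudaDeviceMapHost);

        // Managed memory: one pointer, migration handled by the CUDA runtime.
        float* managedBuf;
        cudaMallocManaged((void**)&managedBuf, bytes);

        // Zero copy: pinned host memory mapped into the GPU address space;
        // CPU and GPU access the same DRAM, no copies or migrations.
        float* zeroCopyHost;
        float* zeroCopyDev;
        cudaHostAlloc((void**)&zeroCopyHost, bytes, cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&zeroCopyDev, zeroCopyHost, 0);

        // ... kernels would use managedBuf directly, or zeroCopyDev for zero copy ...

        cudaFree(managedBuf);
        cudaFreeHost(zeroCopyHost);
        return 0;
    }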

  11. Short and Deterministic Latency: Problems
     • Memory Accessible by CPU and GPU
       → Use Zero Copy Memory
     • What about Parameters?

  12. Short and Deterministic Latency: Infinite Loop: Parameter Communication [Diagram: the CPU application writes parameters at arbitrary times into parameter memory in system memory (zero copy, slow to access); the GPU audio processing thread periodically updates a local parameter copy from it]
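A sketch of the local-copy idea with hypothetical names (MixerParams, processBlock): the slow zero-copy parameter memory is read once per audio block into fast on-chip shared memory, and the per-sample processing then uses only the local copy. In the real system this body would sit inside the persistent kernel's loop:

    struct MixerParams { float gain[256]; };   // hypothetical parameter layout

    // One audio-block iteration: refresh the local parameter copy, then process.
    __global__ void processBlock(const MixerParams* zeroCopyParams,
                                 float* buffer, int numElem) {
        __shared__ MixerParams localParams;    // fast local copy, one per block
        // Cooperative copy: slow zero-copy reads happen once per block,
        // not once per sample.
        for (int c = threadIdx.x; c < 256; c += blockDim.x)
            localParams.gain[c] = zeroCopyParams->gain[c];
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numElem)
            buffer[i] *= localParams.gain[i % 256];
    }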

  13. Short and Deterministic Latency: Conclusion
     • Use Memory Accessible by CPU and GPU
       → Use Zero Copy Memory
     • What about Parameters?
       → Speed-up due to Local Copy
     • How to Make the CPUs Deterministic?
       → CPU Core Isolation (a thread-pinning sketch follows below)
       → Initramfs Built with Yocto
         • No Flash Access During Runtime
         • Minimizes Influence from Other Applications
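A hedged sketch of one way the CPU-side determinism measures could look: the audio thread is pinned to a core that was isolated with the isolcpus= kernel parameter and given a real-time priority. The core number, the priority value and the SCHED_FIFO policy are assumptions, not the authors' documented configuration:

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE          // for pthread_attr_setaffinity_np / CPU_SET
    #endif
    #include <cstdio>
    #include <pthread.h>
    #include <sched.h>

    static void* audio_thread(void*) {
        // ... real-time audio loop feeding the persistent CUDA kernel ...
        return NULL;
    }

    int main() {
        pthread_t tid;
        pthread_attr_t attr;
        struct sched_param prio;
        prio.sched_priority = 80;            // assumed real-time priority

        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(3, &cpus);                   // assumed: core 3 isolated via isolcpus=3

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        pthread_attr_setschedparam(&attr, &prio);
        pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

        if (pthread_create(&tid, &attr, audio_thread, NULL) != 0) {
            fprintf(stderr, "pthread_create failed (SCHED_FIFO needs root / CAP_SYS_NICE)\n");
            return 1;
        }
        pthread_join(tid, NULL);
        return 0;
    }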

  14. Challenges
     • Short and Deterministic Latency
     • High Input and Output Data Rate

  15. High Input / Output Data Rate: Separate Buffers [Diagram: Audio I/O PCIe card and TX2 (CPU and GPU with caches, memory controller incl. SMMU, 8 GB DRAM); separate I/O buffer and GPU buffer in DRAM, linked by memcpy]

  16. High Input / Output Data Rate: memcpy() Measurements [Histogram: memcpy() time for 4096 bytes @ 48 kHz, measured for 12 h on 3 CPUs (A57); number of occurrences (log scale, 1 to 1e9) vs. time (µs); TX1 (kernel v3.10.96) vs. TX2 (kernel v4.4.15); 75% CPU usage!]
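A sketch of how such a memcpy() timing histogram could be collected; the 4096-byte block size is from the slide, while the iteration count and the 1 µs bucketing are assumptions:

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
        const size_t bytes = 4096;              // block size from the slide
        const int iterations = 100000;          // assumed (the slide measured for 12 h)
        std::vector<char> src(bytes), dst(bytes);
        std::vector<long> histogram(1000, 0);   // 1 us buckets up to 1 ms

        for (int n = 0; n < iterations; n++) {
            auto t0 = std::chrono::steady_clock::now();
            std::memcpy(dst.data(), src.data(), bytes);
            auto t1 = std::chrono::steady_clock::now();
            long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
            histogram[us < 999 ? us : 999]++;
        }
        for (size_t i = 0; i < histogram.size(); i++)
            if (histogram[i]) printf("%3zu us: %ld\n", i, histogram[i]);
        return 0;
    }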

  17. High Input / Output Data Rate: Shared Buffer [Diagram: as on slide 15, still with separate I/O and GPU buffers in the 8 GB DRAM joined by a memcpy]

  18. High Input / Output Data Rate: Shared Buffer [Diagram: Audio I/O PCIe card and TX2; the memcpy is gone, I/O and GPU now use a single shared buffer in the 8 GB DRAM]

  19. High Input / Output Data Rate: Shared Buffer
     • Existing Solutions for Buffer Sharing
       • GPUDirect RDMA

  20. High Input / Output Data Rate: GPUDirect RDMA [Diagram: Desktop with discrete GPU vs. TX2 with integrated GPU. Desktop: CPUs, system memory, bridge, PCIe; the 3rd-party device reaches the GPU memory directly via GPUDirect RDMA. TX2: CPUs and GPU share the system memory behind one memory controller; the 3rd-party device sits on PCIe]

  21. High Input / Output Data Rate: Shared Buffer
     • Existing Solutions for Buffer Sharing
       • GPUDirect RDMA → Not Available
       • cudaHostRegister()

  22. High Input / Output Data Rate: cudaHostRegister() [Diagram: Audio I/O PCIe card and TX2; the existing I/O buffer in the 8 GB DRAM is registered for GPU access via cudaHostRegister()]
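On platforms where it is supported, registering an existing I/O buffer would look roughly like the sketch below; the stand-in buffer and the error handling are assumptions, and, as the next slide notes, this call is not implemented on the TX2:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 4096;
        // Assume this stands in for the driver-provided audio I/O buffer;
        // here it is just page-aligned heap memory.
        void* ioBuffer = aligned_alloc(4096, bytes);

        // Pin the existing allocation and map it into the GPU address space.
        cudaError_t err = cudaHostRegister(ioBuffer, bytes, cudaHostRegisterMapped);
        if (err != cudaSuccess) {
            // On the TX2 this path fails: the call is not implemented there.
            printf("cudaHostRegister: %s\n", cudaGetErrorString(err));
            return 1;
        }
        void* devPtr;
        cudaHostGetDevicePointer(&devPtr, ioBuffer, 0);
        // ... kernels could now read/write the I/O buffer through devPtr ...
        cudaHostUnregister(ioBuffer);
        free(ioBuffer);
        return 0;
    }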

  23. High Input / Output Data Rate: Shared Buffer
     • Existing Solutions for Buffer Sharing
       • GPUDirect RDMA → Not Available
       • cudaHostRegister() → Not Implemented on TX2
       • Video4Linux2 → Userptr Mode

  24. High Input / Output Data Rate: Video4Linux - Userptr [Diagram: TX2 with an embedded camera as video input; in userptr mode the driver fills the GPU buffer in the 8 GB DRAM directly, which the GPU reaches via mapped access]

  25. High Input / Output Data Rate: Video4Linux - Userptr [Diagram: the same userptr scheme applied to the audio I/O PCIe card; its driver fills the mapped GPU buffer directly, with no memcpy]
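A minimal sketch of the userptr idea with a standard V4L2 capture device: the driver DMAs straight into a buffer that was allocated as CUDA zero-copy memory, so the GPU can process it in place. Device node, format negotiation and error handling are omitted or assumed; the audio I/O card in the talk uses its own driver rather than /dev/video0:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/videodev2.h>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 4096;                       // assumed buffer size

        // Allocate the shared buffer as mapped (zero-copy) pinned memory.
        void* hostPtr;
        void* devPtr;
        cudaSetDeviceFlags(cudaDeviceMapHost);
        cudaHostAlloc(&hostPtr, bytes, cudaHostAllocMapped);
        cudaHostGetDevicePointer(&devPtr, hostPtr, 0);

        int fd = open("/dev/video0", O_RDWR);            // assumed device node

        // Tell the driver we will hand it user-space pointers.
        v4l2_requestbuffers req = {};
        req.count = 1;
        req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        req.memory = V4L2_MEMORY_USERPTR;
        ioctl(fd, VIDIOC_REQBUFS, &req);

        // Queue the zero-copy buffer: the driver DMAs straight into it.
        v4l2_buffer buf = {};
        buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_USERPTR;
        buf.index = 0;
        buf.m.userptr = (unsigned long)hostPtr;
        buf.length = bytes;
        ioctl(fd, VIDIOC_QBUF, &buf);

        int type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        ioctl(fd, VIDIOC_STREAMON, &type);

        // Wait for the driver to fill the buffer; the GPU can then read devPtr.
        ioctl(fd, VIDIOC_DQBUF, &buf);
        // processKernel<<<...>>>((float*)devPtr, ...);   // hypothetical in-place processing

        ioctl(fd, VIDIOC_STREAMOFF, &type);
        close(fd);
        cudaFreeHost(hostPtr);
        return 0;
    }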

  26. High Input / Output Data Rate: Shared Buffer
     • Existing Solutions for Buffer Sharing
       • GPUDirect RDMA → Not Available
       • cudaHostRegister() → Not Implemented on TX2
       • Video4Linux2 → Userptr Mode ✓

  27. Conclusion
     • Feasibility of Low-Latency Signal Processing on the GPU
       • Professional Audio Mixer with 200 Channels
     • Short and Deterministic Latency
       • Persistent CUDA Kernel
     • High Input / Output Data Rate
       • Shared Buffer I/O <-> GPU (Video4Linux)

  28. Questions
     Get started with signal processing on GPU!
     Website: http://www.zhaw.ch/ines/
     Blog: https://blog.zhaw.ch/high-performance/
     Github: https://github.com/ines-hpmm
     E-Mail: armin.weiss@zhaw.ch, amin.mazloumian@zhaw.ch, matthias.rosenthal@zhaw.ch
