  1. Experiments for Time-Predictable Execution of GPU Kernels
     Flavio Kreiliger, Joel Matějka, Michal Sojka and Zdeněk Hanzálek
     OSPERT 2019, July 9, 2019, Stuttgart, Germany

  2. Motivation/Approach: NVIDIA Tegra X2
     ▶ CPUs: 4× ARM Cortex-A57, 2× Denver (ARM/NVIDIA)
     ▶ GPU: 256 CUDA cores in 2 streaming multiprocessors (SMs)
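     A minimal host-side sketch, not part of the slides, of how these figures can be confirmed on the board with the standard CUDA runtime call cudaGetDeviceProperties; the 128-cores-per-SM figure for the Pascal-class TX2 GPU is an assumption used only in the comment.

     #include <cstdio>
     #include <cuda_runtime.h>

     int main() {
         cudaDeviceProp prop;
         if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
             std::printf("no CUDA device found\n");
             return 1;
         }
         // On a Tegra X2 this should report 2 SMs; assuming 128 cores per
         // Pascal SM, that matches the 256 CUDA cores listed above.
         std::printf("%s: %d SMs, compute capability %d.%d\n",
                     prop.name, prop.multiProcessorCount, prop.major, prop.minor);
         return 0;
     }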

  3. Motivation/Approach: Outline
     ▶ Motivation/Approach
     ▶ Experiments and results
     ▶ Future work

  4.-8. Motivation/Approach: NVIDIA Tegra X2 block diagram
     [Block diagram, built up step by step over five slides, highlighting the CPUs, GPU, USB, SATA, Video & display and MEM blocks]

  9. Motivation/Approach: GPU execution times under CPU interference
     Tegra X2, CPUs performing sequential memory accesses
     [Chart: relative execution time (80 %-200 % axis) of CUDA memset, CUDA UVM, CUDA kernel and CUDA memcpy, running alone and with 1-5 interfering CPU cores (Interf 1-Interf 5)]
     Source: Capodieci et al., "Detailed characterization of platforms", Deliverable D2.2, H2020 project HERCULES, 2017.
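     The kind of measurement behind this chart can be sketched as follows; this is not the HERCULES benchmark itself. A CUDA kernel is timed with cudaEvent timers while CPU threads stream through a large buffer to generate memory interference. Buffer sizes, the thread count and the dummy kernel are illustrative assumptions.

     #include <atomic>
     #include <cstdio>
     #include <thread>
     #include <vector>
     #include <cuda_runtime.h>

     __global__ void copy_kernel(const float *in, float *out, int n) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) out[i] = in[i] * 2.0f;       // memory-bound dummy work
     }

     int main() {
         const int n = 1 << 22;
         float *in, *out;
         cudaMalloc(&in, n * sizeof(float));
         cudaMalloc(&out, n * sizeof(float));

         // Interfering CPU threads: sequential reads over a buffer larger
         // than the last-level cache (buffer size is an assumption).
         std::atomic<bool> stop{false};
         std::vector<char> buf(64 * 1024 * 1024, 1);
         std::vector<std::thread> interf;
         for (int t = 0; t < 4; ++t)
             interf.emplace_back([&] {
                 volatile long sum = 0;
                 while (!stop)
                     for (size_t i = 0; i < buf.size(); i += 64) sum += buf[i];
             });

         // Time one kernel launch with CUDA events.
         cudaEvent_t start, end;
         cudaEventCreate(&start);
         cudaEventCreate(&end);
         cudaEventRecord(start);
         copy_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
         cudaEventRecord(end);
         cudaEventSynchronize(end);
         float ms;
         cudaEventElapsedTime(&ms, start, end);
         std::printf("kernel time under interference: %.3f ms\n", ms);

         stop = true;
         for (auto &t : interf) t.join();
         cudaFree(in); cudaFree(out);
         return 0;
     }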

  10.-11. Motivation/Approach: Safety-critical applications
     ▶ Future applications will need to combine safety and high performance, e.g. autonomous driving
     ▶ Typically, only some parts of the system are safety-critical
     ▶ Goal: isolate critical parts from non-critical ones
     ▶ A failure in a non-critical component should not propagate to a critical one
     ▶ ISO 26262: freedom from interference

  12.-13. Motivation/Approach: Interference on TX2
     1. CPU-to-GPU
     2. GPU-to-CPU
     3. CPU-to-CPU
     4. GPU-to-GPU
     [Chart: relative execution time of CUDA memset, CUDA UVM, CUDA kernel and CUDA memcpy, alone and with 1-5 interfering CPU cores]
     Source: Capodieci et al., "Detailed characterization of platforms", Deliverable D2.2, H2020 project HERCULES, 2017.

  14.-16. Motivation/Approach: Interference on TX2
     1. CPU-to-GPU
     2. GPU-to-CPU
     3. CPU-to-CPU
     4. GPU-to-GPU
     [Chart 1: relative execution time of CUDA operations under CPU interference, as on the previous slide]
     [Chart 2: "CPU - sequential read, sequential interference": read latency (ns) vs. working-set size (WSS) from 1 KiB to 16 MiB, alone and with 1-3 interfering cores, with the cache limit marked]
     Source: Capodieci et al., "Detailed characterization of platforms", Deliverable D2.2, H2020 project HERCULES, 2017.
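     A minimal sketch of a sequential-read latency benchmark of this kind; the actual HERCULES benchmark is not shown in the slides, and the sizes, stride and repetition count below are illustrative assumptions.

     #include <chrono>
     #include <cstdio>
     #include <vector>

     // Sequential-read time per access vs. working-set size (WSS),
     // touching one assumed 64-byte cache line per access.
     int main() {
         for (size_t wss = 1024; wss <= (16u << 20); wss *= 2) {
             std::vector<char> buf(wss, 1);
             volatile long sum = 0;
             const int reps = 50;
             auto t0 = std::chrono::steady_clock::now();
             for (int r = 0; r < reps; ++r)
                 for (size_t i = 0; i < wss; i += 64) sum += buf[i];
             auto t1 = std::chrono::steady_clock::now();
             double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
             double accesses = reps * (wss / 64.0);
             std::printf("WSS %8zu B: %.1f ns/access\n", wss, ns / accesses);
         }
         return 0;
     }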

  17. Motivation/Approach » PREM
     ▶ Possible solution (to a part of the CPU-to-CPU interference problem): PRedictable Execution Model (PREM)
     ▶ Tasks prefetch batches of data to CPU-local memory (cache/scratchpad) and synchronize on access to main memory (see the sketch after this slide)
     ▶ Well applicable to number-crunching applications:
       ▶ Image processing
       ▶ Neural networks
     ▶ GPUs are better suited for these
     [Diagram with lanes CPU1 (P C C W), CPU2 (P C W) and MC (P P W W): prefetch, compute and write-back phases, with the memory phases arbitrated at the memory controller]
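     A hedged sketch of what a PREM-style tile could look like on a GPU: the memory phase stages a tile into shared memory (the SM-local scratchpad), a barrier separates it from the compute phase, and results are written back afterwards. The tiling and the saxpy-like computation are illustrative assumptions, not the authors' kernels.

     // PREM-style structure of one GPU tile (illustrative sketch):
     // prefetch -> barrier -> compute -> barrier -> write-back.
     #define TILE 256

     __global__ void prem_saxpy(const float *x, const float *y,
                                float *out, float a, int n) {
         __shared__ float sx[TILE];    // SM-local scratchpad (shared memory)
         __shared__ float sy[TILE];
         int i = blockIdx.x * TILE + threadIdx.x;

         // Memory phase: prefetch one tile from DRAM into shared memory.
         if (i < n) { sx[threadIdx.x] = x[i]; sy[threadIdx.x] = y[i]; }
         __syncthreads();

         // Compute phase: operate on the local copies only (no DRAM traffic).
         float r = 0.0f;
         if (i < n) r = a * sx[threadIdx.x] + sy[threadIdx.x];
         __syncthreads();

         // Write-back phase: flush the results to DRAM.
         if (i < n) out[i] = r;
     }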

  18. Motivation/Approach » PREM: PREM performance
     Problems with PREM on GPUs:
     ▶ Compute phases are shorter due to high parallelism
     ▶ Memory bandwidth is almost always a bottleneck
     ▶ Mutual exclusion for memory access kills performance
     ▶ Costly synchronization (≈ 2 µs)
       ▶ between CPU and GPU, or
       ▶ between multiple SMs in the GPU (see the sketch after this slide)
     [Diagram: P/C/W phases on CPU1, CPU2 and MC, as on the previous slide]
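     One way to synchronize across SMs without going through the CPU is a cooperative-launch grid barrier, sketched below; the slides do not state which mechanism the authors measured, and the kernel body is an illustrative assumption. Grid-wide sync requires a cooperative kernel launch and hardware support.

     #include <cooperative_groups.h>
     namespace cg = cooperative_groups;

     __global__ void two_phase(float *buf, int n) {
         cg::grid_group grid = cg::this_grid();
         int i = blockIdx.x * blockDim.x + threadIdx.x;

         // Phase 1 (e.g. a memory phase) ...
         if (i < n) buf[i] += 1.0f;

         // Barrier across all thread blocks, i.e. across both SMs; this is
         // the kind of device-side synchronization whose cost the slide
         // puts at roughly 2 us.
         grid.sync();

         // Phase 2 runs only after every block has finished phase 1.
         if (i < n) buf[i] *= 2.0f;
     }

     // Host side: grid.sync() is only valid with a cooperative launch, e.g.
     //   void *args[] = { &dev_buf, &n };
     //   cudaLaunchCooperativeKernel((void *)two_phase, grid_dim, block_dim,
     //                               args, 0, 0);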

  19.-20. Motivation/Approach » PREM: PREM on GPU, early approach – GPUguard (ETHZ)
     ▶ Low performance due to excessive synchronization between CPU and GPU (a minimal handshake sketch follows this slide)
     [Timing diagram with CPU, HV, SHM and GPU lanes: Setup, Create GG-Interface, exchange SHM; Offload; kernel Check-in; Req Mem-phase / Req Comp-phase against M-WCET and C-WCET budgets; GPU spinning on SHM between phases; Check-out; cudaDeviceSynchronize; Retrieve GG-stats]
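     A minimal sketch of the kind of shared-memory handshake such a scheme relies on: the GPU kernel spins on a flag in host-mapped (zero-copy) memory until the CPU grants a phase, then signals completion. The flag names and the two-flag protocol are assumptions for illustration; the real GPUguard protocol (hypervisor, WCET budgets) is more involved.

     #include <cstdio>
     #include <cuda_runtime.h>

     // GPU side: check in by spinning on a grant flag written by the CPU,
     // then signal completion through a second flag (check-out).
     __global__ void guarded_phase(volatile int *grant, volatile int *done) {
         if (blockIdx.x == 0 && threadIdx.x == 0) {
             while (*grant == 0) { }    // spin on host-mapped shared memory
             // ... the granted memory or compute phase would run here ...
             *done = 1;                 // check-out
             __threadfence_system();    // make the write visible to the CPU
         }
     }

     int main() {
         int *flags;                    // flags[0] = grant, flags[1] = done
         cudaHostAlloc(&flags, 2 * sizeof(int), cudaHostAllocMapped);
         volatile int *h = flags;
         h[0] = h[1] = 0;

         int *d;
         cudaHostGetDevicePointer(&d, flags, 0);
         guarded_phase<<<1, 32>>>(d, d + 1);   // kernel starts spinning

         h[0] = 1;                      // CPU grants the phase
         while (h[1] == 0) { }          // CPU waits for the check-out
         cudaDeviceSynchronize();
         std::printf("handshake completed\n");
         cudaFreeHost(flags);
         return 0;
     }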

  21.-22. Motivation/Approach » Time-Triggered scheduling
     Another approach: time-triggered scheduling
     ▶ GPU jobs are often offloaded in batches (e.g. one video frame)
       ▶ the whole batch can be scheduled
       ▶ all parameters are known at least at offload time
       ▶ the processing pipeline is static (safety)
     Pros:
     ▶ Low synchronization overhead
     ▶ Applies not only to the GPU but can span the whole chip
     Cons:
     ▶ Cannot handle dynamic workload
     ▶ Over-provisioning due to uncertain execution time (reduced by our approach)
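     A hedged sketch of one way to realize time-triggered release on the GPU itself: each block reads the GPU's global nanosecond timer via the documented PTX special register %globaltimer and busy-waits until its precomputed release time before running its slot. The per-block release table and the workload are assumptions; the slides do not prescribe this exact mechanism.

     #include <cstdint>

     // Read the GPU global timer (nanoseconds) through inline PTX.
     __device__ uint64_t global_timer_ns() {
         uint64_t t;
         asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
         return t;
     }

     // Busy-wait until this block's statically assigned release time,
     // then run the work for the time slot (work itself is illustrative).
     __global__ void tt_job(const uint64_t *release_ns, float *data, int n) {
         uint64_t release = release_ns[blockIdx.x];   // per-block slot start
         while (global_timer_ns() < release) { }      // time-triggered release
         __syncthreads();                             // enter the slot together

         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) data[i] += 1.0f;                  // placeholder workload
     }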
