S7105 – ADAS/AD CHALLENGES: GPU SCHEDULING & SYNCHRONIZATION
Venugopala Madumbu, NVIDIA
GTC 2017 – 210D
ADVANCED DRIVING ASSIST SYSTEMS (ADAS) & AUTONOMOUS DRIVING (AD)
High Compute Workloads Mapped to the GPU
ADAS/AD Requirements & Challenges
Real-Time Behavior:
• Determinism
• Freedom from Interference
• Priority of Functionalities
Performance:
• Maximum Throughput
• Minimal Latency
Compute resources: Multi-Core CPU, GPU/DSP/HWA
ADAS/AD WORKLOADS – Challenges Illustrated
Scenario #1 – Standalone Execution: GL Workload completes in X msec
Scenario #2 – Standalone Execution: CUDA Workload completes in Y msec
Scenario #3 – Concurrent Execution: GL and CUDA Workloads time-share the GPU and together take > (X+Y) msec
If so, how do we:
• Achieve determinism
• Achieve freedom from interference
• Prioritize one workload over the other
while also having maximum throughput and minimum latency?
GPU IN TEGRA – High-Level Tegra SoC Block Diagram
• The CPU submits jobs/work to the GPU
• The GPU runs asynchronously to the CPU
• The GPU has its own hardware scheduler (Host); it switches between workloads without CPU involvement
• Other clients (ISP, Display, and other engines) share the Memory Controller and DRAM with the CPU and GPU
GPU SCHEDULING – Concepts
• Channel – an independent stream of work on the GPU
• Push Buffer – command buffer written by software and read by hardware
• Channel Switching – save/restore of GPU state on a channel switch
• Semaphores/Syncpoints – synchronization mechanism for events within the GPU
• Time Slice – how long the GPU executes commands of a channel before a channel switch
• Run-list – an ordered list of channels that software wants the GPU to execute
GPU SCHEDULING – Timesharing by Channel Switching
Channels (App1 … App4) share the GPU in timesliced round-robin order.
Channel switching occurs when any ONE of the following happens:
• The time slice expires
• The engine runs out of work (no more commands)
• The channel is blocked on a semaphore
Channel switch time = drain time + save/restore time. Preemption can reduce channel switch times drastically.
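The switching rules above can be sketched as a toy simulation. This is a hypothetical model, not the real hardware scheduler: "channels" are just name/remaining-work pairs, and the semaphore-blocked case is omitted for brevity.

```python
# Toy sketch of timesliced round-robin channel scheduling.
# A channel is switched out when its time slice expires or when it
# runs out of work; blocked-on-semaphore is not modeled here.

def schedule_round_robin(channels, time_slice_ms):
    """Run channels round-robin; return the execution trace as
    (channel_name, ms_run) tuples in the order they were scheduled."""
    trace = []
    pending = [[name, work] for name, work in channels]
    while pending:
        for entry in list(pending):
            name, remaining = entry
            ran = min(time_slice_ms, remaining)  # slice expiry or drain
            trace.append((name, ran))
            entry[1] -= ran
            if entry[1] == 0:
                pending.remove(entry)            # engine out of work
    return trace

# Made-up workloads: App1 needs 5 ms, App2 2 ms, App3 7 ms; 3 ms slices.
trace = schedule_round_robin([("App1", 5), ("App2", 2), ("App3", 7)], 3)
```

Note how App2 gives up the GPU early (it drains in 2 ms, under the 3 ms slice), while App3 needs three visits to finish.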
GPU SCHEDULING – Preemption
GPU SCHEDULING – Channel Switching with Time Slice: Scenarios
1. Channel finishes before the time slice expires
   • Context switch to the next channel
2. Channel preemption
   • Stop all commands in the pipeline and wait for the engines to idle
   • Higher context switch time
3. Channel reset
   • The engine could not idle and the context could not be saved before the channel switch timeout
   • A callback notifies the kernel of the channel reset event
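Combining the switch-time formula from the timesharing slide with the timeout behavior above, the outcome of a channel switch can be sketched as a tiny decision function. The numbers and function name are hypothetical, for illustration only.

```python
# Hypothetical sketch: outcome of a channel switch attempt.
# Switch time = drain time + save/restore time (per the earlier slide);
# if it cannot complete within the channel switch timeout, the channel
# is reset and the kernel is notified via a callback.

def channel_switch(drain_ms, save_restore_ms, timeout_ms):
    switch_ms = drain_ms + save_restore_ms
    if switch_ms <= timeout_ms:
        return ("switched", switch_ms)
    return ("reset", timeout_ms)  # engine could not idle in time

# A well-behaved channel drains quickly; a stuck one hits the timeout.
ok = channel_switch(drain_ms=2, save_restore_ms=1, timeout_ms=5)
stuck = channel_switch(drain_ms=10, save_restore_ms=3, timeout_ms=5)
```

This also shows why preemption helps: it shortens the drain time, shrinking `switch_ms` and making the reset path far less likely.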
CHALLENGE REVISITED – How Can We Achieve Both?
Real-Time behavior:
• Determinism
• Freedom from Interference
• Priority of Functionalities
Performance:
• Maximum Throughput
• Minimal Latency
GPU SYNCHRONIZATION & SCHEDULING – Software Control
1. User Driver Level (GPU Synchronization Approach)
   • Syncpoints/semaphores for synchronization
   • Through EGLStreams, EGLSync, etc.
2. Kernel Driver Level (GPU Priority Scheduling Approach)
   • Run-list engineering
   • How long a channel runs
   • Order of channel execution
GPU SYNCHRONIZATION APPROACH – No Synchronization Case
Three tasks, each a CPU portion followed by a kernel launch, run with no synchronization between them.
Concurrent execution on the GPU causes context switching among the GPU tasks, adding latency to the high-priority GPU task.
(Timeline diagram: CPU and GPU rows, 0–35 msec; legend: kernel launch, GPU semaphore, priority)
GPU SYNCHRONIZATION APPROACH – Synchronization on CPU: Not Good for the GPU
Synchronizing on the CPU serializes the GPU tasks, but the GPU sits idle while each CPU task waits for the previous result and prepares the next submission.
(Timeline diagram: CPU and GPU rows, 0–35 msec)
GPU SYNCHRONIZATION APPROACH – Synchronization on GPU: No Context Switches
GPU semaphores serialize the GPU tasks on the GPU itself: the lower-priority tasks get a delayed start, but there are no context switches and the CPU is not involved.
Result: determinism, freedom from interference, and priority of functionalities.
(Timeline diagram: CPU and GPU rows, 0–35 msec)
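The difference between CPU-side and GPU-side synchronization can be made concrete with a toy timeline model. All numbers are made up, and this models scheduling arithmetic only, not real driver behavior: each task is a CPU prep phase followed by GPU work.

```python
# Toy timeline model (ms) contrasting CPU-side and GPU-side sync for a
# chain of tasks, each: CPU prep, then GPU work.

def cpu_sync_timeline(tasks):
    """CPU waits for each GPU task to finish before preparing the next:
    the GPU idles during every CPU prep phase."""
    t, gpu_busy, finish = 0.0, 0.0, []
    for cpu_ms, gpu_ms in tasks:
        t += cpu_ms              # CPU prep while the GPU is idle
        t += gpu_ms              # GPU runs
        gpu_busy += gpu_ms
        finish.append(t)
    return finish

def gpu_sync_timeline(tasks):
    """All work is submitted up front; GPU semaphores chain the tasks,
    so each GPU task starts as soon as the previous one signals."""
    cpu_done, gpu_free, finish = 0.0, 0.0, []
    for cpu_ms, gpu_ms in tasks:
        cpu_done += cpu_ms                # CPU preps run back to back
        start = max(cpu_done, gpu_free)   # wait only on the semaphore
        gpu_free = start + gpu_ms
        finish.append(gpu_free)
    return finish

tasks = [(2, 8), (2, 8), (2, 8)]  # (cpu_ms, gpu_ms), made-up numbers
```

With CPU-side sync the chain finishes at 30 ms; with GPU-side semaphore chaining the CPU prep overlaps the GPU work and the chain finishes at 26 ms, with the GPU never idling between tasks.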
GPU PRIORITY SCHEDULING APPROACH – Hypothetical Example

TASK | PRIORITY        | FPS | WORST-CASE EXECUTION TIME (WCET)
H1   | High            | 60  | 9 ms
M1   | Medium          | 30  | 4 ms
M2   | Medium          | 30  | 4 ms
L1   | Low/Best Effort | 30  | 10 ms
GPU PRIORITY SCHEDULING APPROACH – Engineered Run-list and Time Slice
Ensuring FPS and latency:
• H1 (max exec time = 9 ms): time slice = 9 ms
• M1 (max exec time = 4 ms): time slice = 3 ms
• M2 (max exec time = 4 ms): time slice = 3 ms
• L1 (max exec time = 10 ms): time slice = 1 ms
One pass through the run-list is ensured not to exceed 16 ms (9 + 3 + 3 + 1) for 60 fps operation.
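The frame-budget arithmetic behind this run-list can be checked in a few lines. This is a back-of-the-envelope sketch using the hypothetical numbers above, not a real schedulability analysis.

```python
# Run-list from the hypothetical example: (task, time slice in ms).
run_list = [("H1", 9), ("M1", 3), ("M2", 3), ("L1", 1)]

round_ms = sum(slice_ms for _, slice_ms in run_list)  # one full pass
frame_budget_60fps = 1000 / 60                        # ~16.67 ms

def rounds_to_finish(wcet_ms, slice_ms):
    """Passes through the run-list needed to accumulate wcet_ms of GPU time."""
    return -(-wcet_ms // slice_ms)  # ceiling division

# H1 gets its full 9 ms WCET once per 16 ms round -> 60 fps holds.
# M1 (WCET 4 ms, slice 3 ms) needs 2 rounds = 32 ms, within its
# 33.3 ms budget at 30 fps; likewise M2.
# L1 (WCET 10 ms, slice 1 ms) needs 10 rounds = 160 ms: best effort.
```

This is the run-list engineering trade: the low-priority channel still makes progress every round, but only the slack left after the real-time budgets are met.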
GPU PRIORITY SCHEDULING APPROACH – Reduce Latency for GPU Work Completion
• Ensure the time slice is long enough to complete the work
• Ensure work is continually submitted, and well ahead of time
This avoids:
• GPU idle time
• Unnecessary context switches
GPU SCHEDULING – Best Practices to Keep the GPU Busy
• Submit work in advance, so the GPU has work to execute at any point in time
• Try to reduce/eliminate work dependencies
• Have a contingency plan for work overload:
  • If feedback shows the budget is exceeded, submit work a few frames ahead and spread it out
  • Plan for the worst-case scenario
• Deal with the GPU reset case, especially for the low-priority workloads (GL robustness extensions)
CONCLUSION – GPU Synchronization & Scheduling Approaches
Real-Time behavior:
• Determinism
• Freedom from Interference
• Priority of Functionalities
Performance:
• Maximum Throughput
• Minimal Latency
ACKNOWLEDGEMENTS
• Scott Whitman, NVIDIA
• Vladislav Buzov, NVIDIA
• Amit Rao, NVIDIA
• Yogesh Kini, NVIDIA

GTC Instructor-led Lab: L7105 – EGLSTREAMS: INTEROPERABILITY OF CAMERA, CUDA AND OPENGL
11th May 2017, 9:30–11:30 AM, LL21D
Q & A
THANK YOU