Making OpenVX Really “Real Time”
Ming Yang (1), Tanya Amert (1), Kecheng Yang (1,2), Nathan Otterness (1), James H. Anderson (1), F. Donelson Smith (1), and Shige Wang (3)
(1) The University of North Carolina at Chapel Hill
(2) Texas State University
(3) General Motors Research
700 ms
A new approach for graph scheduling
Shorter response time + Less capacity loss
1. State of the art
2. Our approach
3. Future work
OpenVX: Graph-based architecture
[Figure: Example OpenVX graph. Native Camera Control feeds a graph of OpenVX nodes, which feeds Downstream Application Processing.]
Portability to diverse hardware: GPU, FPGA, DSP
Does OpenVX really target “real-time” processing?
Source: https://www.khronos.org/openvx/
Does OpenVX really target “real-time” processing?
1. It lacks real-time concepts.
2. Entire graphs = monolithic schedulable entities.
[Figure: Example graph with nodes A, B, C, and D, and a timeline labeled "Monolithic scheduling" in which the nodes of the graph execute in sequence (A, B, C, D, ...).]
Source: https://www.khronos.org/openvx/
Prior Work: Coarse-grained scheduling
• OpenVX nodes = schedulable entities [23, 51]
[Figure: The example graph with nodes A, B, C, and D, and a timeline labeled "Coarse-grained scheduling" in which each node runs as its own task (Task A, Task B, Task C, Task D).]
Remaining problems:
1. More parallelism remains to be explored.
2. Suspension-oblivious analysis was applied, which causes capacity loss.
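The capacity loss mentioned above can be made concrete with a little arithmetic. Suspension-oblivious analysis treats the time a CPU task spends suspended (here, waiting for the GPU) as if it were CPU execution. The sketch below uses made-up numbers that do not come from the talk; all names and values are hypothetical.

```python
def utilization(exec_ms: float, period_ms: float) -> float:
    """Fraction of one CPU that a task is charged for."""
    return exec_ms / period_ms

# Hypothetical task: values chosen for illustration only.
cpu_ms = 5.0              # actual CPU execution per job
gpu_suspension_ms = 15.0  # time suspended while the GPU works
period_ms = 33.0

# What the task really demands from the CPU.
cpu_demand = utilization(cpu_ms, period_ms)

# What suspension-oblivious analysis charges: suspension counted as execution.
u_oblivious = utilization(cpu_ms + gpu_suspension_ms, period_ms)

# CPU capacity the analysis reserves but the task never uses.
capacity_loss = u_oblivious - cpu_demand

print(f"demand={cpu_demand:.3f} charged={u_oblivious:.3f} loss={capacity_loss:.3f}")
```

In this toy case the analysis reserves four times the CPU capacity the task actually needs, which is the kind of loss the second remaining problem refers to.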
This Work: Fine-Grained Scheduling
1. Coarse-grained vs. fine-grained
2. Response-time bounds analysis
3. Case study
1. Coarse-grained vs. fine-grained
[Figure: Two timelines for the example graph. Coarse-Grained Scheduling: Tasks A, B, C, and D, with a span labeled "Suspension for GPU execution". Fine-Grained Scheduling: Tasks A, E, F, G, C, and D (E, F, and G appear in place of B), with a span labeled "GPU execution".]
2. Response-time bounds analysis
Deriving Response-Time Bounds for a DAG*
Step 1: Schedule the nodes as sporadic tasks.
Step 2: Compute bounds for every node.
Step 3: Sum the bounds of the nodes on the critical path.
* C. Liu and J. Anderson, “Supporting Soft Real-Time DAG-based Systems on Multiprocessors with No Utilization Loss,” in RTSS, 2013.
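Step 3 can be sketched in a few lines. The sketch assumes the per-node bounds from Step 2 are already known and takes the "critical path" to be the source-to-sink path with the largest summed bound; that reading, the graph shape, and the numbers are illustrative assumptions, not details from the talk or the cited paper.

```python
from functools import lru_cache

def end_to_end_bound(successors: dict, node_bound: dict) -> float:
    """Largest sum of per-node bounds over any source-to-sink path of a DAG."""

    @lru_cache(maxsize=None)
    def longest_from(node: str) -> float:
        # Bound of this node plus the worst continuation through its successors.
        tail = max((longest_from(s) for s in successors.get(node, ())), default=0.0)
        return node_bound[node] + tail

    # Sources are nodes that never appear as a successor.
    has_predecessor = {s for succs in successors.values() for s in succs}
    sources = [n for n in node_bound if n not in has_predecessor]
    return max(longest_from(n) for n in sources)

# Hypothetical DAG: A and B both feed C, and C feeds D.
successors = {"A": ["C"], "B": ["C"], "C": ["D"]}
# Hypothetical per-node response-time bounds in ms (Step 2 output).
node_bound = {"A": 4.0, "B": 6.0, "C": 3.0, "D": 2.0}

bound = end_to_end_bound(successors, node_bound)
print(bound)  # path B -> C -> D: 6 + 3 + 2
```

Memoizing on the node keeps the traversal linear in the size of the graph, which matters little for a four-node example but avoids re-walking shared suffixes in larger graphs.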
[Figure: DAG with nodes A, B, C, D, E, and F, with nodes assigned to the CPU or the GPU.]
Need a response-time bound analysis for GPU tasks.
A System Model of GPU Tasks
$\tau_i = (C_i, T_i, B_i, H_i)$
  $C_i$: per-block worst-case workload
  $T_i$: period
  $B_i$: number of blocks
  $H_i$: number of threads per block (or block size)
Example: $\tau_1 = (3076, 6, 2, 1024)$, with $H_1 = 1024$
[Figure: A job of $\tau_1$ running on SM0 and SM1, each labeled 2048; time axis marked 0, 3, 6.]
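The four-parameter task model maps naturally onto a small record type. The sketch below stores the tuple exactly as given on the slide; the two derived quantities (per-job workload as C times B, and utilization as that workload divided by the period) are assumptions made for illustration and are not stated on the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GpuTask:
    C: float  # per-block worst-case workload
    T: float  # period
    B: int    # number of blocks
    H: int    # number of threads per block (block size)

    def workload_per_job(self) -> float:
        # Assumed: every block may need its worst-case workload.
        return self.C * self.B

    def utilization(self) -> float:
        # Assumed definition: worst-case workload per job over the period.
        return self.workload_per_job() / self.T

# The example task from the slide, copied verbatim.
tau_1 = GpuTask(C=3076, T=6, B=2, H=1024)

print(tau_1.workload_per_job(), tau_1.utilization())
```

Making the record frozen keeps task parameters immutable, which suits an analysis that treats them as fixed inputs.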
Response-Time Bounds Proof Sketch
1. We first show the necessity of a total utilization bound and intra-task parallelism via counterexamples.
[Figure: Releases of jobs 1 through 5, with the resulting schedules shown without and with intra-task parallelism.]
2. We then bound the unfinished workload from jobs released at or before $r_{k,j}$.
3. We prove that job $\tau_{k,j}$ finishes before $r_{k,j} + R_k$.
[Figure: Schedule on SM0 and SM1 over time, with $r_{k,j}$ and $R_k$ marked.]
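Why intra-task parallelism matters in step 1 can be seen in a toy simulation. This is not the counterexample from the talk: it assumes a single task whose jobs each take longer than the period, assumes the GPU has enough spare capacity that overlapping jobs never contend, and uses hypothetical numbers throughout.

```python
def response_times(period: int, exec_time: int, n_jobs: int,
                   intra_task_parallelism: bool) -> list:
    """Response time of each job of one periodic task (toy model)."""
    finish_prev = 0
    responses = []
    for j in range(n_jobs):
        release = j * period
        if intra_task_parallelism:
            # A job may start at its release even if its predecessor is still running.
            start = release
        else:
            # A job must wait for the previous job of the same task to finish.
            start = max(release, finish_prev)
        finish_prev = start + exec_time
        responses.append(finish_prev - release)
    return responses

# Hypothetical task: period 2, execution time 3, five jobs.
sequential = response_times(2, 3, 5, intra_task_parallelism=False)
overlapped = response_times(2, 3, 5, intra_task_parallelism=True)

print(sequential)  # grows by one time unit per job
print(overlapped)  # stays at the execution time
```

Without overlap, each job inherits the backlog of its predecessor and response times grow without bound; with overlap they stay constant in this model.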
3. Case study
Case Study: Comparing Fine-Grained/Coarse-Grained/Monolithic Scheduling
• Application: Histogram of Oriented Gradients (HOG)
  • 6 instances
  • 33 ms period
  • 30,000 samples
• Platform: NVIDIA Titan V GPU + two eight-core Intel CPUs
• Schedulers: G-EDF, G-FL (fair-lateness)
[Figure: The HOG graph at two granularities. CPU+GPU Execution (Coarse-Grained): vxHOGCellsNode and vxHOGFeaturesNode. GPU Execution (Fine-Grained): Resize Image, Compute Gradients, Compute Orientation Histograms, Normalize Orientation Histograms.]
[Figure: Distribution of response times for the three scheduling approaches. x-axis: Time; y-axis: % of samples. Left is better. Example reading: 50% of samples have a response time less than 60 ms.]
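A reading such as "50% of samples have a response time less than 60 ms" is one point on an empirical distribution. The sketch below shows how such a point could be computed from raw measurements; the sample values are invented for illustration and are not data from the case study.

```python
def fraction_below(samples_ms: list, threshold_ms: float) -> float:
    """Fraction of samples with a response time strictly below the threshold."""
    return sum(1 for s in samples_ms if s < threshold_ms) / len(samples_ms)

# Hypothetical response-time samples in ms.
samples_ms = [42.0, 55.5, 58.0, 61.0, 73.5, 90.0]

frac = fraction_below(samples_ms, 60.0)
print(f"{frac:.0%} of samples are below 60 ms")
```

Evaluating this function over a range of thresholds gives the kind of curve the plot shows, where a curve further to the left means shorter response times.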
Results (FL: fair-lateness)

                              [1] Fine-grained   [2] Coarse-grained   [3] Monolithic
                              (G-FL)             (G-EDF)              (G-EDF)
Average Response Time (ms)    65.99              136.57               84669.47
Maximum Response Time (ms)    125.66             427.07               170091.06
Analytical Bound (ms)         542.39             N/A                  N/A

• [1] vs. [2]: about half the average response time and about one-third the maximum response time.
• An alert driver takes 700 ms to react.
• The fair-lateness-based scheduler is beneficial: it reduced node response times by up to 9.9%.
• The overhead of supporting fine-grained scheduling was 14.15%.
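The "half" and "one-third" comparisons, and the relation between the analytical bound and the 700 ms driver reaction time, can be checked directly from the reported numbers. The sketch below only redoes that arithmetic; the dictionary layout and variable names are illustrative choices.

```python
# Reported results in ms; None marks entries reported as N/A.
results_ms = {
    "fine-grained (G-FL)":    {"avg": 65.99,    "max": 125.66,    "bound": 542.39},
    "coarse-grained (G-EDF)": {"avg": 136.57,   "max": 427.07,    "bound": None},
    "monolithic (G-EDF)":     {"avg": 84669.47, "max": 170091.06, "bound": None},
}
DRIVER_REACTION_MS = 700.0  # reaction time of an alert driver, as stated in the talk

fine = results_ms["fine-grained (G-FL)"]
coarse = results_ms["coarse-grained (G-EDF)"]

avg_ratio = fine["avg"] / coarse["avg"]  # roughly one half
max_ratio = fine["max"] / coarse["max"]  # roughly one third
bound_within_reaction = fine["bound"] < DRIVER_REACTION_MS

print(f"avg ratio {avg_ratio:.3f}, max ratio {max_ratio:.3f}, "
      f"bound below 700 ms: {bound_within_reaction}")
```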
Conclusions
1. Fine-grained scheduling
2. Response-time bounds analysis for GPU tasks
3. Case study
Future Work
1. Cycles in the graph
2. Other resource constraints
3. Schedulability studies
Thanks!