Real-Time Multi/Many-Core Architecture Heechul Yun 1
Real-Time Multi/Many-Core Architecture • Projects on Real-Time CPU Architectures • Assigned Papers – Shedding the Shackles of Time-Division Multiplexing, RTSS, 2018 – Deterministic Memory Abstraction and Supporting Multicore System Architecture. ECRTS, 2018 2
Trends in Automotive E/E Systems • Centralization & High-Performance HW Source: Bosch. A. Hamann (Bosch). “Industrial Challenge: Moving from Classical to High-Performance Real-Time Systems.” WATER, 2018. 3
Modern System-on-a-Chip (SoC) [Figure: Core1, Core2, …, a GPU, and an NPU sharing a cache, a memory controller (MC), and DRAM] • Integrate multiple cores, GPU, accelerators • Good performance, size, weight, power • Challenges: time predictability 4
Worst-Case Execution Time (WCET) Image source: [Wilhelm et al., 2008] • Real-time scheduling theory is based on the assumption of known WCETs of real-time tasks 5
Computing WCET • Static analysis – Input: program code, architecture model – Output: WCET – Problem: an accurate architecture model is hard to build, and the resulting bound is pessimistic • Measurement – No guarantee of observing the true worst case – But widely used in practice 6
Memory Hierarchies, Pipelines, and Buses for Future Architectures in Time-Critical Embedded Systems IEEE TCAD, 2009 7
“Problematic” CPU Features • Architectures are optimized to improve average-case performance • WCET estimation is hard because of – Pipelining – TLBs/Caches – Super-scalar execution – Out-of-order scheduling – Branch predictors – Hardware prefetchers – Basically anything that affects processor state 8
Static Timing Analysis [Figure: static timing analysis workflow, including control-flow analysis] 9
Control Flow Graph (CFG) • Analyze code • Split basic blocks • Compute per-block WCET – use abstract CPU model 10
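The per-block step above can be sketched as a longest-path computation over an acyclic CFG. This is only a minimal illustration: the block names and cycle counts are hypothetical, loops are assumed to be already unrolled or bounded, and real analyzers typically use more elaborate path analysis than this plain recursion.

```python
from functools import lru_cache

# Hypothetical per-block WCETs (in cycles), as an abstract CPU model might report.
BLOCK_WCET = {"entry": 5, "then": 20, "else": 8, "exit": 3}

# Hypothetical acyclic control flow graph: block -> successor blocks.
CFG = {
    "entry": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}


def program_wcet(cfg, block_wcet, start):
    """Return the cost of the most expensive path from `start` to any exit block.

    Assumes `cfg` has no cycles (loops must be unrolled or bounded beforehand).
    """

    @lru_cache(maxsize=None)
    def longest_from(block):
        # Worst case over all successors; an exit block contributes no tail.
        tail = max((longest_from(succ) for succ in cfg[block]), default=0)
        return block_wcet[block] + tail

    return longest_from(start)


wcet = program_wcet(CFG, BLOCK_WCET, "entry")
print(wcet)
```

With these made-up numbers the worst path is entry, then, exit. Summing per-block bounds this way is only safe when local worst cases compose into the global worst case.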
Timing Anomalies • Locally faster != globally faster 11 Image source: [Wilhelm et al., 2008]
Challenge: Shared Memory Hierarchy • Memory performance varies widely due to interference • Task WCET can be extremely pessimistic [Figure: Tasks 1–4 running on Core1–Core4, each core with private I and D caches, all sharing a cache, a memory controller (MC), and DRAM] 13
Effect of Memory Interference [Figure: normalized execution time (0 to 12), solo vs. corun, for DNN (Core 0,1) and BwWrite (Core 2,3); the cores share the LLC and DRAM] • DNN control task suffers >10X slowdown – When co-scheduling different tasks on idle cores. Waqar Ali and Heechul Yun. “RT-Gang: Real-Time Gang Scheduling Framework for Safety-Critical Systems.” RTAS, 2019 (to appear) 14
Cache Denial-of-Service Attacks [Figure: a victim and attackers on Core1–Core4 sharing the LLC] • Observed worst-case: >300X slowdown – On simple in-order multicores (Raspberry Pi3, Odroid C2) • Difficult to guarantee predictable timing Michael G. Bechtel and Heechul Yun. “Denial-of-Service Attacks on Shared Cache in Multicore: Analysis and Prevention.” In RTAS, 2019 (to appear, Outstanding Paper Award) 15
Real-Time CPU Architectures • PRET – UC Berkeley. • MERASA/parMERASA project – EU • ACROSS – EU • ARAMIS – Germany • EMC2 – EU 16
FlexPRET: A Processor Platform for Mixed-Criticality Systems RTAS, 2014 17
PRET Pipeline [Figure: thread-interleaved pipeline; six hardware threads (THREAD#1 to THREAD#6) each pass instructions through the FETCH, DECODE, REGACC, EXECUTE, MEM, and EXCEPT stages, with successive threads offset by one clock] 19
FlexPRET Pipeline 20
Hardware Support for WCET Analysis of Hard Real-Time Multicore Systems ISCA 2009 21
Analyzable Multicore Architecture • Idea1: Bound interference on shared resources – On-chip shared bus – (shared) L2 cache • Idea2: WCET computation mode 22
Architecture 23
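Round-Robin Bus Arbitration • UBD = (N_HRT − 1) × L_bus – UBD: upper bound on the delay a bus request suffers from other tasks – N_HRT: number of hard real-time tasks sharing the bus – L_bus: bus latency of one request 24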
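As a quick worked instance of the round-robin bound, the sketch below evaluates UBD = (N_HRT − 1) × L_bus. The function name and the example values (4 hard real-time tasks, a 2-cycle bus latency) are illustrative assumptions, not figures from the slides.

```python
def upper_bound_delay(n_hrt: int, l_bus: int) -> int:
    """Worst-case wait for one bus request under round-robin arbitration.

    n_hrt: number of hard real-time tasks contending for the bus (assumed >= 1)
    l_bus: latency of a single bus transaction, in cycles (assumed >= 0)
    """
    if n_hrt < 1 or l_bus < 0:
        raise ValueError("need n_hrt >= 1 and l_bus >= 0")
    # In the worst case, every other contender is served once before us.
    return (n_hrt - 1) * l_bus


# Hypothetical configuration: 4 contenders, 2-cycle bus transactions.
ubd = upper_bound_delay(n_hrt=4, l_bus=2)
print(ubd)
```

The bound grows linearly with the number of contenders and drops to zero when a task has the bus to itself.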
Request vs. Job-level WCET Analysis • Request-level analysis – Assume worst-case interference for each access of the task under analysis – Pessimistic, as not all accesses will suffer interference • Job-level analysis – Assume the total number of competing memory accesses is known – Can reduce pessimism 25
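The difference between the two analyses can be illustrated with a small calculation. This sketch assumes round-robin arbitration in which each access of the task under analysis is delayed by at most one access from each other core. The job-level formula (capping each competitor's contribution by its own access count) is one common formulation, used here as an assumption. All access counts are hypothetical.

```python
def request_level_delay(own_accesses: int, n_cores: int, l_bus: int) -> int:
    """Every access is assumed to wait for one access from each other core."""
    return own_accesses * (n_cores - 1) * l_bus


def job_level_delay(own_accesses: int, competing_accesses: list, l_bus: int) -> int:
    """A competitor cannot delay us more often than it actually accesses memory.

    competing_accesses: known access counts of the other cores during the job.
    """
    return sum(min(own_accesses, c) for c in competing_accesses) * l_bus


# Hypothetical job: 1000 accesses, 4 cores, 2-cycle bus latency.
own = 1000
competing = [200, 50, 0]  # accesses issued by the three other cores
l_bus = 2

req = request_level_delay(own, n_cores=len(competing) + 1, l_bus=l_bus)
job = job_level_delay(own, competing, l_bus)
print(req, job)
```

With these numbers the job-level bound is far smaller, because the competing cores issue too few accesses to interfere with every request of the task under analysis.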
Summary • Timing anomalies – Locally fast != globally fast on non-timing compositional architectures (i.e., most architectures) • Timing compositional architecture – Free of timing anomalies 26
Discussion • Why is this interesting? • Are assumptions realistic? – Task model – Cache model – Memory model – CPU (pipeline) model 27
Atomic vs. Split-Transaction Bus • … J. P. Shen and M. H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. Waveland Press, 2013. 29
Announcement • Mini Project #1 • DeepPicar Competition – Build a self-driving car – Based on DeepPicar – Competition format 30
Acknowledgement • Some slides are from: – Prof. Rodolfo Pellizzoni, University of Waterloo – Prof. Edward A. Lee, University of California, Berkeley 31