Efficient fine-grain parallelism in shared memory for real-time avionics
P. Baufreton – Safran
V. Bregeon, J. Souyris – Airbus
K. Didier, D. Potop-Butucaru, G. Iooss – Inria
Critical real-time on multi-/many-cores
• All the single-core problems plus:
  – Significantly more concurrency
  – More sources of interference
  – Making the parallelization decisions
  – More complicated memory allocation, etc.
• Ensuring safety is paramount
  – Time/space isolation facilitates the demonstration of certain properties
• Ensuring efficiency
  – Bad implementation decisions -> poor performance
    • If you get a 1.2x acceleration on two cores, then maybe it’s not worth it…
  – Too much isolation -> poor performance!
[Figures: a generic multi-core architecture (cores C1–C4, interconnect, DMA, I/O, shared memory, router, peripherals) and a Kalray MPPA-like compute cluster (cores C0–C15, shared memory SMEM, DSU, system core, DMA, C-NoC/D-NoC routers)]
Critical real-time on multi-/many-cores
• IMA = Integrated Modular Avionics
  – Partition = dual concept
    • Piece of (multi-task) software
    • Resources statically allocated to this software (sketched below)
  – Time-Space Partitioning (TSP)
    • A partition must never over-step its resource allocation
• CAST-32A – Avionics recommendations for multi-core implementation
  – Maintains the strict TSP requirement between partitions: demonstrating « Robust Resource and Time Partitioning » is difficult
[Figures: single-core and multi-core timelines showing partitions App1–App5 statically allocated to cores 0–3 over time]
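To make "resources statically allocated to this software" concrete, a static time-space partitioning configuration could be expressed as a table like the C sketch below. This is a hypothetical sketch: the type, field names and values are illustrative, not an actual ARINC 653 or CAST-32A interface.

  #include <stdint.h>

  /* One entry per partition window inside a repeating major frame.
     A partition must never over-step its window (time partitioning)
     or touch memory outside its region (space partitioning). */
  typedef struct {
      const char *name;       /* partition identifier                  */
      uint32_t    core_mask;  /* cores statically granted              */
      uint32_t    start_us;   /* window start within the major frame   */
      uint32_t    length_us;  /* window length, enforced by the OS     */
      uintptr_t   mem_base;   /* base of the partition's memory region */
      uint32_t    mem_size;   /* size of that region                   */
  } partition_window_t;

  static const partition_window_t major_frame[] = {
      { "App1", 0x1 /* core 0 */, 0,    2000, 0x80000000u, 0x10000u },
      { "App2", 0x1 /* core 0 */, 2000, 1000, 0x80010000u, 0x08000u },
      { "App3", 0x2 /* core 1 */, 0,    3000, 0x80018000u, 0x10000u },
  };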
Critical real-time on multi-/many-cores
• Current approach – natural extension of single-core practice
  – One partition executes on only one core
    • Often corresponding to re-usable modules
  – Advantage: modularity in development
  – Disadvantages:
    • Performance – due to the lack of parallelization inside partitions and due to TSP between partitions
    • Difficult to demonstrate Robust Resource and Time Partitioning on common multi-core platforms
      – Interferences between partitions running in parallel are known only at integration time
      – Requires HW resource partitioning (e.g. caches, RAM, I/O, etc.)
[Figure: partitions App1–App5 scheduled across cores 0–3 over time, with interferences known only at integration time]
Critical real-time on multi-/many-cores
• Possible solution: parallelize the partitions
  – Fixed resource envelopes (cores, memory banks)
  – Advantage:
    • If all partitions are parallelized on all cores, classical IMA TSP between partitions (empty caches, reset shared devices)
    • No time or space isolation required inside the partition
  – Difficulty: efficient parallelization is not easy
    • Concurrent resource allocation is NP-complete, but efficient heuristics exist
    • Timing analysis of parallel code is difficult
      – Interferences due to accesses to shared resources
      – Time/space isolation properties are often used to facilitate timing analysis, reducing efficiency
[Figure: partitions App1–App5 each parallelized across cores 0–3 in successive time slots; interferences are known at application design time]
Our previous work: LoPhT [ACM TACO’19]
• Efficient parallelization of one partition
  – Allow interferences and control them -> better resource sharing/usage
  – Guarantee respect of real-time requirements
  – Scalable
  – Efficient:
    • Low memory footprint
    • Low synchronization overhead
    • Efficient scheduling
    • Memory allocation to minimize cache misses and interferences
[Figure: tool flow – a Lustre/Scade functional specification, a platform model (cores, memory) and non-functional requirements (e.g. real-time) feed timing analysis, parallelization, real-time scheduling and parallel code generation (compilers, linker), producing parallel real-time executable code that respects both the functional and the non-functional requirements]
Our previous work: LoPhT [ACM TACO’19]
• Two large use cases:
  – Flight controller (>5k nodes, >36k variables)
    • 5.17x speed-up on 8 cores (theoretical upper bound: 6.8x)
  – Aircraft engine control
    • 2.66x speed-up on 4 cores (theoretical upper bound: 2.69x)
• Target platforms:
  – Kalray MPPA 256 Bostan compute cluster (16 cores)
  – T1042 (4 cores) – ongoing work
• Also improves sequential code generation
[Figure: guaranteed parallelization speed-up of the flight controller vs. the number of cores used (2 to 16), approaching the theoretical upper bound of 6.8x]
This work
• Evaluate the efficiency cost of isolation properties
  – Use LoPhT and the two use cases
  – Enforce isolation properties through mapping and code generation
  – Determine the costs
  – Do not focus on very costly isolation mechanisms that are obviously not needed when parallelizing (e.g. full-fledged ARINC 653-like TSP), but on those proposed in the literature/industry for the same type of application
Space isolation
• Optimized LoPhT code generation
  – No isolation: one C variable per dataflow variable, all users access it
[Figure: dataflow graph – f(i) -> x, g(z⁻¹) -> y, h(x,y) -> z – and its schedule: f then h on core 0, g on core 1, between global barrier syncs]

Generated code (core 0 on the left, core 1 on the right):

  void* thread_cpu0(void* unused){      void* thread_cpu1(void* unused){
    lock_init_pe(0); init();              lock_init_pe(1);
    for(;;){                              for(;;){
      global_barrier_reinit(2);
      time_wait(3000);
      global_barrier_sync(0);               global_barrier_sync(1);

      dcache_inval();                       dcache_inval();
      f(i,&x);                              g(z,&y);
      dcache_flush();                       dcache_flush();
      unlock(1);                            lock(1,1);
                                            unlock(0);
      lock(0,0);
      dcache_inval();
      h(x,y,&z);
      dcache_flush();
    }                                     }
  }                                     }
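The listing above relies on a handful of runtime primitives (lock/unlock, global barrier, cache maintenance) whose bodies the slides do not show. The sketch below is one plausible shared-memory implementation of the synchronization part, assuming C11 atomics; the actual Kalray MPPA runtime may differ.

  #include <stdatomic.h>

  #define MAX_LOCKS 64

  static atomic_int locks[MAX_LOCKS]; /* one flag per cross-core data dependency */
  static atomic_int barrier_count;
  static int        barrier_total;

  /* unlock(i): the producer signals that the value guarded by flag i is ready */
  static void unlock(int i) {
      atomic_store_explicit(&locks[i], 1, memory_order_release);
  }

  /* lock(i, core): the consumer busy-waits on flag i, then re-arms it;
     the second argument only mirrors the generated code's signature */
  static void lock(int i, int core) {
      (void)core;
      while (!atomic_load_explicit(&locks[i], memory_order_acquire))
          ; /* spin */
      atomic_store_explicit(&locks[i], 0, memory_order_relaxed);
  }

  /* Counting barrier, re-armed once per cycle by core 0. Simplified:
     a production version would use a sense-reversing barrier so that
     re-initialization cannot race with late arrivals. */
  static void global_barrier_reinit(int n) {
      barrier_total = n;
      atomic_store(&barrier_count, 0);
  }

  static void global_barrier_sync(int id) {
      (void)id;
      atomic_fetch_add(&barrier_count, 1);
      while (atomic_load(&barrier_count) < barrier_total)
          ; /* spin until all cores have arrived */
  }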
Space isolation
• Space isolation
  – Between threads/cores
    • Each one has a separate copy of the variables it uses
    • Explicit copy operations transfer values from one core to another (sketched below)
  – Between tasks/nodes
  – Advantage:
    • In conjunction with memory allocation policies, it facilitates timing analysis and error isolation
      – e.g. one memory bank per core, computations only access the local bank
  – Disadvantages:
    • Memory footprint
    • Copy operations overhead
    • Error isolation is not required inside a partition! (over-engineering)
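To make the footprint and copy overhead concrete, here is a minimal hypothetical sketch of the per-CPU copy policy for the y variable of the earlier example (produced by g on core 1, consumed by h on core 0); the section names and the helper are illustrative, not LoPhT's actual output.

  /* No isolation (LoPhT default): one shared variable, all users access it. */
  double y;

  /* Per-CPU copies: each core owns a private copy in its own memory bank,
     and the code generator emits one explicit copy operation per transfer. */
  double y_cpu1 __attribute__((section(".bank1")));  /* written by g on core 1 */
  double y_cpu0 __attribute__((section(".bank0")));  /* read by h on core 0    */

  static void copy_y_cpu1_to_cpu0(void) {
      y_cpu0 = y_cpu1;  /* extra footprint and extra execution time,
                           but core 0's computations stay in bank 0 */
  }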
Space isolation
• Space isolation – memory footprint
  – Flight controller application – communication variables
[Figure: memory footprint of communication variables and of copy operations (one per variable copy) under three policies: per-node variable copies, per-CPU variable copies, and no variable copies (LoPhT default)]
Time isolation methods
• Meant to improve predictability and simplify timing analysis
• Time-triggered execution model (as opposed to event-driven)
  – Computations/tasks remain inside statically-defined time reservations
    • Enforced through mapping (allocation, scheduling)
  – Absence of interferences between cores
    • Two cores cannot access the same shared resource (e.g. a RAM bank) at the same time
    • Ensured by scheduling and resource (memory) allocation
  – Separate computations from communications
    • Globally: BSP (bulk synchronous parallel) – alternating phases of computation (without communication) and global synchronization/communication (sketched below)
    • Often used along with memory allocation (e.g. one memory bank per core)
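A minimal sketch of what a BSP-style core loop could look like, reusing the barrier primitive of the generated code shown earlier; compute_local_step and exchange_step are hypothetical placeholders for the per-core work and the cross-bank copies.

  extern int  core_id;                   /* this core's index, set at boot   */
  extern void global_barrier_sync(int id);
  extern void compute_local_step(void);  /* touches only the local bank      */
  extern void exchange_step(void);       /* copies results to consumer banks */

  void bsp_core_loop(void) {
      for (;;) {
          compute_local_step();          /* computation phase, no communication */
          global_barrier_sync(core_id);  /* all cores have finished computing   */
          exchange_step();               /* global communication phase          */
          global_barrier_sync(core_id);  /* all copies are complete             */
      }
  }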
Time-triggered vs. event-driven execution
• Use TT where it is needed to enforce real-time requirements, ED elsewhere for robustness
[Code: the same two-thread listing as before, annotated – time_wait(3000) and the global barrier are time-triggered, while the lock/unlock pairs between f, g and h are event-driven]
Scheduling-enforced properties
• Constraints reduce the solution space => efficiency loss
  – Intuition:
[Figure: a functional specification (tasks f, g, h, n) and three possible two-core schedules of it – BSP scheduling, no-interference scheduling, and unconstrained scheduling]
Scheduling-enforced properties
• Constraints reduce the solution space => efficiency loss
  – Flight controller application
    • No other isolation property
    • Not allowing interferences carries a significant penalty
[Figure: flight controller speed-up when allowing vs. not allowing interferences]
Application (re-)structuring
• Parallelization requires exposing potential parallelism (concurrency)
  – If your application is intrinsically sequential, parallelization does not help
  – Not exposing parallelism -> significant penalty
    • Automatic parallelization methods exist, but they add to implementation/certification cost
• Aircraft engine control:
  – Version 1: one large sub-system seen as a single, sequential task
    • Theoretical limit on parallelization speed-up: 1.8x (1.74x attained on 4 cores)
  – Version 2: sub-system internal concurrency exposed (20% more nodes)
    • Theoretical limit on parallelization speed-up: 2.69x (2.66x attained on 4 cores)
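The theoretical limits quoted above are presumably instances of the classical work/span bound: no schedule on any number of cores can beat

  speed-up_max = (sum of all task WCETs) / (WCET of the longest dependency chain)

A single large sequential sub-system sits entirely on the critical path and caps this ratio; exposing its internal concurrency shortens the longest chain, which is why version 2 raises the limit from 1.8x to 2.69x despite having 20% more nodes.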
Conclusion
• First evaluation of the cost of common isolation properties on large-scale use cases
• Time/space isolation should be modulated depending (also) on performance needs
  – Subject to (strict) safety requirements
  – Trade-off with ease of development
  – The tools are here