INTEGRATED WCET ESTIMATION OF MULTICORE APPLICATIONS Dumitru - PowerPoint PPT Presentation

Slide 1: INTEGRATED WCET ESTIMATION OF MULTICORE APPLICATIONS
Dumitru Potop-Butucaru, Isabelle Puaut

Slide 2: Motivation: Scalable timing analysis
- Real-time systems: complexity steadily increases
  - Hardware: multi-core, networks-on-chips
  - Software: parallel/concurrent software
- Safety margins used in practice after schedulability analysis are already enormous (40%-60%)
  - Further static abstraction is not a solution
- How to preserve both tractability and precision?
  - Probabilistic approaches (another form of abstraction), or
  - Use « WCET-friendly » hardware and software
    - Limit/control timing interferences due to concurrency
    - Static (off-line) scheduling, non-preemptive, etc.
    - No shared caches, LRU caches, time-triggered execution, etc.

Slide 3: Static timing analysis
- 3 basic sources of imprecision:
  - Application-related: input arrival dates, data-dependent behavior
  - Mapping-related: concurrency (pipelining, buses, scheduling)
  - Analysis-related: abstraction (e.g. IPET, real-time calculus)
- Our thesis: keeping the sources of imprecision in the application and mapping few allows for scalable, precise analysis

Slide 4: Reducing imprecision
- Everybody is doing it (to a point)
  - Industry: space & time partitioning (among others)
    - Time-triggered standards: TTA, ARINC 653
    - Recent many-core chips: TilePro64, Kalray MPPA256, etc.
  - Research:
    - Precision timed architectures (PRET): Lee, etc.
    - CompSoC, Aethereal, etc.
    - Off-line scheduling: Fohler, Eles, Sorel, etc.
- But we do it all the way:
  - Remove all application- and mapping-related imprecision sources that are not handled by classical WCET analysis
  - Possibly add some back later on (future work)
- This paper: see that it's possible and determine the gain

Slide 5: Tiled MPSoC architecture
Djemal et al., DASIP 2012. Based on SoCLib (UPMC/LIP6).
[Figure: block diagram of one tile. Labeled blocks: command router, response router, RAM/ROM, lock unit, multi-bank program RAM, local interconnect (crossbar), NIC, buffered DMA, I/O (option), and caches (labeled "PLRU, write-through" and "LRU, write-through") attached to CPU n (MIPS32).]
- Multi-bank RAM
- Harvard-like architecture
- Full crossbar intra-tile interconnect
- Hardware locks for synchronization (not interrupts)
- Static routing (X-first)
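The static X-first routing mentioned above can be illustrated with a minimal C sketch. It assumes a 2D mesh in which x grows eastward and y grows southward; the type and function names (tile_t, port_t, xfirst_next_port, xfirst_hop_count) are hypothetical and are not taken from the SoCLib platform. The point is only that the path between two tiles is fixed by their coordinates, which is what makes the route known off-line.

```c
#include <assert.h>

/* Output ports of a mesh router (illustrative names). */
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* Tile coordinates. Assumed convention: x grows eastward, y grows southward. */
typedef struct { int x; int y; } tile_t;

/* X-first routing: first correct the X coordinate, then the Y coordinate.
   The decision depends only on the current and destination tiles. */
port_t xfirst_next_port(tile_t here, tile_t dest) {
    if (dest.x > here.x) return PORT_EAST;
    if (dest.x < here.x) return PORT_WEST;
    if (dest.y > here.y) return PORT_SOUTH;
    if (dest.y < here.y) return PORT_NORTH;
    return PORT_LOCAL; /* already at the destination tile */
}

/* Follow the route hop by hop and count the router-to-router hops. */
int xfirst_hop_count(tile_t src, tile_t dest) {
    tile_t cur = src;
    int hops = 0;
    for (;;) {
        port_t p = xfirst_next_port(cur, dest);
        if (p == PORT_LOCAL) break;
        switch (p) {
        case PORT_EAST:  cur.x++; break;
        case PORT_WEST:  cur.x--; break;
        case PORT_SOUTH: cur.y++; break;
        case PORT_NORTH: cur.y--; break;
        default: break;
        }
        hops++;
    }
    /* Postcondition: the walk ends exactly on the destination tile. */
    assert(cur.x == dest.x && cur.y == dest.y);
    return hops;
}
```

Because the route is a pure function of source and destination, the set of router outputs a transfer occupies can be computed before execution, without any run-time information.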

Slide 6: Tiled MPSoC architecture
[Figure: block diagram of one tile. Labeled blocks: command router, response router, router ports West, East, South and Local, RAM/ROM, lock unit, multi-bank program RAM, local interconnect (crossbar), NIC, buffered DMA, I/O (option), and a cache (labeled "LRU, write-back") attached to CPU n (MIPS32).]
- Provide timing guarantees for inter-tile communications
  - Use of locks, programmed arbitration (others do TDMA or other types of resource reservation)
- Tool limitation: 1 CPU/tile

Slide 7: Tiled MPSoC applications
- On each processor, sequential code
  - Non-preemptive, off-line scheduling
- Synchronization by blocking send/recv operations
  - Lossless FIFOs
  - A.k.a. Kahn process networks (G. Kahn, 1974)
- No concurrent access to RAM banks, DMA units, NoC router outputs
  - Data allocation on memory banks, use of locks to enforce a predefined schedule
- Tool limitations
  - Sampled I/O only
  - Send/recv primitives are explicitly matched
  - Send/recv only at top level (global loop), non-conditioned
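As an illustration of the lossless FIFO channels listed above, here is a minimal single-threaded C sketch. It is not the platform's send/recv implementation: the names (chan_t, chan_try_send, chan_try_recv) and the capacity are hypothetical. Blocking is modeled by a false return value, meaning "the caller would have to wait here"; a real blocking primitive would wait until the operation can succeed.

```c
#include <assert.h>
#include <stdbool.h>

#define CHAN_CAPACITY 4 /* illustrative size, not taken from the slides */

/* Bounded FIFO channel stored as a ring buffer. */
typedef struct {
    int buf[CHAN_CAPACITY];
    int head;  /* index of the oldest stored value */
    int count; /* number of values currently stored */
} chan_t;

void chan_init(chan_t *c) {
    c->head = 0;
    c->count = 0;
}

/* Returns false when the channel is full: a blocking send would wait.
   Stored values are never overwritten, so no data is lost. */
bool chan_try_send(chan_t *c, int value) {
    if (c->count == CHAN_CAPACITY) return false;
    c->buf[(c->head + c->count) % CHAN_CAPACITY] = value;
    c->count++;
    return true;
}

/* Returns false when the channel is empty: a blocking receive would wait.
   Values come out in the order they were sent. */
bool chan_try_recv(chan_t *c, int *out) {
    if (c->count == 0) return false;
    *out = c->buf[c->head];
    c->head = (c->head + 1) % CHAN_CAPACITY;
    c->count--;
    return true;
}
```

With one sender and one receiver per channel, and receives that wait on an empty channel, the sequence of values each process observes does not depend on the relative speed of the processors, which is the property that makes Kahn process networks deterministic.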

Slide 8: Tiled MPSoC applications (example)

    const int decis_levl [30];

    void core1() {
      int tqmf[24]; long xa, xb, el;
      int xin1, xin2, decis_levl;
      for(;;) { //Infinite loop
        xa = 0; xb = 0;
        for (i=0;i<12;i++) { // 12 iterations
          xa += (long) tqmf[2*i]*h[2*i];
          xb += (long) tqmf[2*i+1]*h[2*i+1];
        }
        send(channel1,(int)((xa+xb)>>15));
        xin1=read_input(); xin2=read_input();
        for(i=23;i>=2;i--) { // 22 iterations
          tqmf[i]=tqmf[i-2];
        }
        tqmf[1] = xin1; tqmf[0] = xin2;
        decis_levl = receive(channel2) ;
        write_output(decis_levl) ;
      }
    }

    int core2() {
      int q,el;
      for(;;) { //Infinite loop
        el = receive(channel1);
        el = (el>=0)?el:(-el);
        for (q = 0; q < 30; q++) { // 30 iterations
          if (el <= decis_levl[q])
            break;
        }
        send(channel2,decis_levl) ;
      }
    }

Slide 9: Traditional timing analysis
The example code is split into tasks (Task1_1, Task1_2, Task1_3 on core1; Task2_1 on core2):

    const int decis_levl [30];

    void core1() {
      int tqmf[24]; long xa, xb, el;
      int xin1, xin2, decis_levl;
      for(;;) { //Infinite loop
        /* Task1_1 */
        xa = 0; xb = 0;
        for (i=0;i<12;i++) { // 12 iterations
          xa += (long) tqmf[2*i]*h[2*i];
          xb += (long) tqmf[2*i+1]*h[2*i+1];
        }
        send(channel1,(int)((xa+xb)>>15));
        /* Task1_2 */
        xin1=read_input(); xin2=read_input();
        for(i=23;i>=2;i--) { // 22 iterations
          tqmf[i]=tqmf[i-2];
        }
        tqmf[1] = xin1; tqmf[0] = xin2;
        decis_levl = receive(channel2) ;
        /* Task1_3 */
        write_output(decis_levl) ;
      }
    }

    int core2() {
      int q,el;
      for(;;) { //Infinite loop
        el = receive(channel1);
        /* Task2_1 */
        el = (el>=0)?el:(-el);
        for (q = 0; q < 30; q++) { // 30 iterations
          if (el <= decis_levl[q])
            break;
        }
        send(channel2,decis_levl) ;
      }
    }

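Slide 10: Traditional timing analysis
[Figure: task graph with nodes Task1_1, Task1_2, Task2_1, Task1_3]

Slide 11: Traditional timing analysis
[Figure: task graph in which each task is annotated with its WCET: Task1_1 with WCET1_1, Task1_2 with WCET1_2, Task2_1 with WCET2_1, Task1_3 with WCET1_3]

Slide 12: Traditional timing analysis
[Figure: WCET-annotated task graph (Task1_1, Task1_2, Task2_1, Task1_3) with the application latency marked]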

Slide 13: Traditional timing analysis
[Figure: WCET-annotated task graph (Task1_1, Task1_2, Task2_1, Task1_3) with the application latency marked]
- Safety considerations when analyzing subtasks: the WCET_i_j are overestimated
- Glue code between tasks is not considered: margins must be added to the WCET_i_j
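The traditional computation of application latency from per-task WCETs can be sketched in C as follows. This is only an illustration under stated assumptions: the dependency structure is the one read off the example code (Task1_1 precedes both Task1_2 and Task2_1, and Task1_3 waits for both), communication costs are ignored, and the names (task_wcets_t, with_margin, application_latency) and all numeric values are hypothetical.

```c
#include <assert.h>

/* Per-task WCET bounds, in cycles. */
typedef struct {
    long wcet1_1;
    long wcet1_2;
    long wcet2_1;
    long wcet1_3;
} task_wcets_t;

static long max_long(long a, long b) {
    return a > b ? a : b;
}

/* Inflate a WCET bound by a safety margin given in percent, rounding up. */
long with_margin(long wcet, long margin_pct) {
    return (wcet * (100 + margin_pct) + 99) / 100;
}

/* Longest path through the assumed task graph:
   Task1_1, then Task1_2 (core1) in parallel with Task2_1 (core2),
   then Task1_3 once both have finished. */
long application_latency(task_wcets_t w, long margin_pct) {
    long t11 = with_margin(w.wcet1_1, margin_pct);
    long t12 = with_margin(w.wcet1_2, margin_pct);
    long t21 = with_margin(w.wcet2_1, margin_pct);
    long t13 = with_margin(w.wcet1_3, margin_pct);
    return t11 + max_long(t12, t21) + t13;
}
```

For example, with hypothetical bounds of 100, 200, 150 and 50 cycles, the latency is 350 cycles without margins and 525 cycles with a 50% margin on every task, which shows how per-task margins add up along the critical path.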

Slide 14: Unified timing analysis

    const int decis_levl [30];

    void core1() {
      int tqmf[24]; long xa, xb, el;
      int xin1, xin2, decis_levl;
      for(;;) { //Infinite loop
        xa = 0; xb = 0;
        for (i=0;i<12;i++) { // 12 iterations
          xa += (long) tqmf[2*i]*h[2*i];
          xb += (long) tqmf[2*i+1]*h[2*i+1];
        }
        send(channel1,(int)((xa+xb)>>15));
        xin1=read_input(); xin2=read_input();
        for(i=23;i>=2;i--) { // 22 iterations
          tqmf[i]=tqmf[i-2];
        }
        tqmf[1] = xin1; tqmf[0] = xin2;
        decis_levl = receive(channel2) ;
        write_output(decis_levl) ;
      }
    }

    int core2() {
      int q,el;
      for(;;) { //Infinite loop
        el = receive(channel1);
        el = (el>=0)?el:(-el);
        for (q = 0; q < 30; q++) { // 30 iterations
          if (el <= decis_levl[q])
            break;
        }
        send(channel2,decis_levl) ;
      }
    }
