Towards a Reliability-aware Design Flow for Kahn Process Networks on NoC-based Multiprocessors
Onur Derin, Leandro Fiorin
ALaRI, Faculty of Informatics, University of Lugano, Switzerland
{derino,fiorin}@alari.ch
Lübeck – Feb 25, 2014
Outline
- Introduction
- Fault tolerance techniques
  - Online task remapping
  - N-modular redundancy
- Case study
- Related Work
- Conclusion
O. Derin, L. Fiorin, ALaRI, Feb 25, 2014, ARCS/VERFE
Introduction
As CMOS technology scales down, fault tolerance becomes more relevant.
The probability of permanent faults increases with technology scaling due to:
- process variability
- age-related degradation
- single-event effects (e.g., single-event latchup/burnout/gate rupture)
- rupture of wires
- dielectric breakdown
- corrosion
Such failures are often hard to predict and hard to avoid with current design methodologies.
Introducing fault tolerance capabilities increases the lifetime of the system.
Introduction
- Power wall problem: higher power consumption prohibits running at higher frequencies to increase performance.
- Task-level parallelism is a promising way to increase performance, but it requires suitable programming models.
- Advances in microelectronics enable the integration of billions of transistors on a single die.
- Next-generation embedded platforms will contain a higher number of heterogeneous processing and storage elements.
- Bus-based and point-to-point communication do not scale, are power hungry, and are not predictable.
- Networks-on-Chip (NoCs) improve scalability, bandwidth and power efficiency.
- Distributed memory solutions are more scalable than shared memory architectures:
  - Non-uniform Memory Access (NUMA)
  - No-remote Memory Access (NORMA)
Problem: Reliability-aware design flow
[Figure: design flow with the blocks Application, Architecture, Mapping, Estimators (Performance, Power) and Design Space Explorer; the output is a set of Pareto-optimal solutions.]
Problem: Reliability-aware design flow
[Figure: design flow with the blocks Application, Architecture, Mapping, Estimators (Performance, Power, Reliability) and Design Space Explorer; the output is a set of Pareto-optimal solutions.]
Reliability is introduced as a new objective into the design flow.
Context of our work
- NoC-based multiprocessor platforms as the underlying hardware platform
- NORMA as the memory model
- Kahn Process Networks (KPN) as the model of computation
- Throughput-constrained systems
- Fault tolerance addressed at the software level
  - physical or micro-architectural solutions may be too costly for resource-constrained platforms
- Fault model restricted to permanent faults in the processing elements (the interconnect and memory are assumed to be fault tolerant)
- Single-fault assumption
Two fault-tolerance schemes:
- Fault-aware online task remapping (OTR)
- N-modular redundancy at KPN level (NMR)
Kahn Process Networks
- A set of concurrent processes (tasks) connected via FIFO channels with non-blocking write and blocking read
- When running on actual platforms, channels are bounded and therefore have blocking write semantics
[Figure: A KPN example]
- Well suited to streaming applications (e.g., image/video/audio processing)
- Several advantages:
  - suitable for message-passing platforms, as the communication is explicitly exposed
  - no need for a global scheduler to execute in a distributed fashion
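The channel semantics above can be illustrated with a minimal Python sketch. It is not the authors' runtime: threads stand in for KPN processes, `queue.Queue(maxsize=...)` stands in for a bounded FIFO channel, and the names `CAPACITY`, `END`, `producer`, `square` and `run_network` are hypothetical.

```python
import queue
import threading

CAPACITY = 2      # bounded channel size (hypothetical value)
END = object()    # end-of-stream marker (illustrative convention, not part of KPN)

def producer(out_ch, n):
    """Source process: emits the tokens 0..n-1."""
    for i in range(n):
        out_ch.put(i)          # blocks while the bounded channel is full
    out_ch.put(END)

def square(in_ch, out_ch):
    """Worker process: reads a token, writes its square."""
    while True:
        token = in_ch.get()    # blocking read: waits until a token is available
        if token is END:
            out_ch.put(END)
            return
        out_ch.put(token * token)

def run_network(n):
    """Wire producer -> square -> collector and return the collected tokens."""
    a = queue.Queue(maxsize=CAPACITY)
    b = queue.Queue(maxsize=CAPACITY)
    threads = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=square, args=(a, b)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:
        token = b.get()
        if token is END:
            break
        results.append(token)
    for t in threads:
        t.join()
    return results
```

Because every process only blocks on its own channels, no global scheduler is involved, and the output does not depend on how the threads are interleaved.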
Fault-aware online task remapping (OTR)
Fault-aware online task remapping allows the system to survive in the presence of faulty processors.
When a processor becomes faulty, its tasks are executed on a reduced number of fault-free processors, with degraded performance.
[Figure: the tasks SRC, DCT, Q and VLE mapped onto a platform with four tiles (Tile 1 to Tile 4); no faults.]
[Figure, next step: the same four-tile platform with the PE of Tile 1 marked faulty; the tasks SRC, DCT, Q and VLE are shown on the tiles.]
[Figure, final step: the PEs of Tile 1 and Tile 2 are marked faulty; the tasks Q, SRC, DCT and VLE are all placed on Tile 3 and Tile 4.]
Fault tolerant tile for OTR
[Figure: Fault tolerant tile for OTR support (Derin, 2013). The tile contains a Self Testing Module, a Processing Element, dual-port Instruction Memory and Data Memory, and a Local Bus. The Network Adapter (NA) contains a message-passing handler for send()/recv(), a DMA, a Tag Decoder, Task Migration Hardware and an Initiator/Target NI (FLIT-in/FLIT-out) towards the NoC router. An inset shows five tiles, each attached through its NA to a NoC router, with a fault detected on one of them.]
- The self-testing module (STM) detects the fault with a self-test routine
- The task migration hardware module (TMH) notifies the remapping manager (RM)
- The RM calculates the new mapping with a remapping heuristic
- The RM notifies the predecessor, successor and other tiles
- The RM gets the tasks' state (iterators and channel tokens) from the TMH
- The RM transfers the tasks' state to the new tiles
- Migrated tasks are resumed on the new tiles
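The slides do not say which remapping heuristic the RM uses. As a rough sketch of the "calculate the new mapping" step only, the following assumes a simple greedy rule (move each orphaned task to the least-loaded healthy tile whose core type can run it). The function name `remap` and all of its parameters are hypothetical.

```python
def remap(mapping, faulty, tile_type, task_caps, load):
    """Greedy remapping sketch (assumed heuristic, not the one from the talk).

    mapping:   task -> tile currently hosting it
    faulty:    set of tiles whose PE has failed
    tile_type: tile -> core type
    task_caps: task -> set of core types the task can run on
    load:      task -> estimated computation cost
    """
    healthy = [t for t in tile_type if t not in faulty]
    tile_load = {t: 0 for t in healthy}
    for task, tile in mapping.items():
        if tile not in faulty:
            tile_load[tile] += load[task]

    new_mapping = dict(mapping)
    orphans = sorted(task for task, tile in mapping.items() if tile in faulty)
    for task in orphans:
        candidates = [t for t in healthy if tile_type[t] in task_caps[task]]
        if not candidates:
            raise RuntimeError(f"no healthy tile can run {task}")
        # Pick the least-loaded candidate; the tile id breaks ties deterministically.
        target = min(candidates, key=lambda t: (tile_load[t], t))
        new_mapping[task] = target
        tile_load[target] += load[task]
    return new_mapping
```

State transfer and the notification of predecessor and successor tiles are outside the scope of this sketch.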
Fault-aware online task remapping (OTR)
Calculating MTTF for fault-aware online task remapping
- The application will not fail as long as there is at least one healthy core of each core type required by the application.
- Create a fault tree from the platform specification ($M_{NC}$) and the profiling information ($M_{TC}^{cap}$)
- Calculate the MTTF using binary decision diagrams
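The failure condition can be written as a small predicate, which is also the structure function that the fault tree encodes. The sketch below takes one reading of the condition, at task granularity: the application fails when some task has no healthy node of a core type it can run on. The platform and capability tables are illustrative values, not data from the talk, and the names `NODE_TYPE`, `TASK_CAPS`, `capable_nodes` and `otr_failed` are hypothetical.

```python
# Hypothetical platform: node -> core type (plays the role of M_NC).
NODE_TYPE = {"n1": "C1", "n2": "C1", "n3": "C2", "n4": "C3"}

# Hypothetical profiling data: task -> core types it can run on (role of M_TC^cap).
TASK_CAPS = {
    "t1": {"C1"},
    "t2": {"C1", "C2"},
    "t3": {"C1", "C2"},
    "t4": {"C3"},
}

def capable_nodes(task, node_type=NODE_TYPE, task_caps=TASK_CAPS):
    """Nodes whose core type can execute the task (leaves under the task's AND gate)."""
    return {n for n, ctype in node_type.items() if ctype in task_caps[task]}

def otr_failed(failed_nodes, node_type=NODE_TYPE, task_caps=TASK_CAPS):
    """Top event: OR over tasks of (AND over capable nodes of 'node has failed')."""
    return any(capable_nodes(t, node_type, task_caps) <= set(failed_nodes)
               for t in task_caps)
```

With these tables, losing `n4` alone is fatal because it is the only `C3` core, while losing `n1` alone is not.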
OTR: Creating the fault tree
[Figure: example fault tree for OTR. The top event "failure" branches over the tasks t1 to t4; the leaves under each task are the processing nodes (n1 to n4) whose core types (C1 to C3) can execute that task.]
OTR: Calculating MTTF
- The set of all the paths in the binary decision diagram that lead to a 1 terminal are called satisfying paths, $Sat$
- A satisfying path $s_i$ assigns values to nodes as 1 ($n_i$, failure) and 0 ($\bar{n}_i$, non-failure)
- $P_{n_i}(t)$ is the probability that processing node $n_i$ has failed by time $t$
- The overall probability of failure, $Q_{sys}(t)$, is

\[
Q_{sys}(t) = \sum_{s_i \in Sat} \left( \prod_{n_j \in s_i} P_{n_j}(t) \prod_{\bar{n}_k \in s_i} \bigl(1 - P_{n_k}(t)\bigr) \right)
\]

\[
MTTF_{sys} = \int_{0}^{\infty} R_{sys}(t)\, dt
\]

where the reliability of the system is $R_{sys}(t) = 1 - Q_{sys}(t)$.
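A numerical sketch of this calculation follows. It makes two assumptions that are not in the slide: node lifetimes are exponential, so $P_n(t) = 1 - e^{-\lambda_n t}$, and the satisfying paths are replaced by a brute-force enumeration of all node states, which gives the same $Q_{sys}(t)$ but scales exponentially, so it is only usable for small examples. The names `q_sys`, `mttf`, `rates` and `fails` are hypothetical.

```python
import itertools
import math

def q_sys(t, rates, fails):
    """System failure probability at time t.

    rates: node -> failure rate lambda (exponential lifetime is an assumption)
    fails: predicate taking the set of failed nodes, True if the system has failed
    """
    nodes = sorted(rates)
    total = 0.0
    # Enumerate every failed/healthy assignment instead of walking BDD paths.
    for state in itertools.product([False, True], repeat=len(nodes)):
        failed = {n for n, is_failed in zip(nodes, state) if is_failed}
        if not fails(failed):
            continue
        prob = 1.0
        for n, is_failed in zip(nodes, state):
            p = 1.0 - math.exp(-rates[n] * t)
            prob *= p if is_failed else (1.0 - p)
        total += prob
    return total

def mttf(rates, fails, horizon=60.0, steps=6000):
    """Trapezoidal integration of R_sys(t) = 1 - Q_sys(t) over [0, horizon]."""
    dt = horizon / steps
    r = [1.0 - q_sys(i * dt, rates, fails) for i in range(steps + 1)]
    return dt * (sum(r) - 0.5 * (r[0] + r[-1]))
```

For two nodes with rate 1 that must both fail before the system fails, the closed form $1/\lambda_1 + 1/\lambda_2 - 1/(\lambda_1 + \lambda_2) = 1.5$ gives a quick sanity check. The `horizon` must be chosen large enough for $R_{sys}$ to have decayed to nearly zero.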
N-modular redundancy at KPN level (NMR)
[Figure: TMR pattern applied to a KPN task. (a) Original chain t_i, t_j, t_k. (b) After the transformation: t_i feeds a fork, the fork feeds three replicas of t_j, and the replicas feed a voter that feeds t_k.]
[Figure: A mapping of an application with TMR pattern onto a 3x3 NoC. Node types by row: n1 RISC, n2 DSP, n3 RISC; n4 DSP, n5 NPC, n6 DSP; n7 RISC, n8 DSP, n9 RISC. The application consists of t1, a fork, three replicas of a task, a voter and t3.]
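The fork/voter structure of the TMR pattern can be sketched in a few lines of Python. This is an illustration of majority voting in general, not the authors' implementation; `fork`, `vote` and `tmr_step` are hypothetical names, and tokens are assumed to be hashable so that they can be counted.

```python
from collections import Counter

def fork(token, n=3):
    """Fork process: duplicate one input token for each replica."""
    return [token] * n

def vote(results):
    """Voter process: return the value produced by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if 2 * count > len(results):
        return value
    raise ValueError("no majority among replica outputs")

def tmr_step(replicas, token):
    """Run one firing of the replicated task: fork, compute, vote."""
    inputs = fork(token, len(replicas))
    outputs = [replica(x) for replica, x in zip(replicas, inputs)]
    return vote(outputs)
```

With three replicas, one faulty output is outvoted; if all three disagree, the voter cannot provide a checked result and raises instead.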
NMR - safe failure
Calculating MTTF
- Safe failure is the failure of the system to provide checked results.
- The application will fail if there is only one instance left of any task type.
- Create a fault tree from the application specification and the mapping information ($g^R_t$, $M_{NT}$)
- Calculate the MTTF using binary decision diagrams
NMR - safe failure: Creating the fault tree
[Figure: fault tree for safe failure. The top event "failure" branches over pairs of task replicas; the leaves are the nodes hosting those replicas (n2, n4, n6 appear in pairs).]
[Figure: A mapping of an application with TMR pattern onto a 3x3 NoC. Node types by row: n1 RISC, n2 DSP, n3 RISC; n4 DSP, n5 NPC, n6 DSP; n7 RISC, n8 DSP, n9 RISC.]
NMR - unsafe failure
Calculating MTTF
- Unsafe failure is the failure of the system to provide correct results.
- The application will fail if there is no instance left of any task type.
- Create a fault tree from the application specification and the mapping information ($g^R_t$, $M_{NT}$)
- Calculate the MTTF using binary decision diagrams
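The two NMR failure definitions differ only in how many healthy instances of a task type remain, which a short sketch can make concrete. The replica placement below is a hypothetical example, not the mapping from the talk, and fork and voter processes are left out for simplicity; `REPLICAS_ON`, `healthy_instances` and `classify` are hypothetical names.

```python
# Hypothetical placement: task type -> nodes hosting its replicas (role of M_NT).
REPLICAS_ON = {
    "t1": ["n1", "n2", "n3"],
    "t2": ["n4", "n5", "n6"],
    "t3": ["n7", "n8", "n9"],
}

def healthy_instances(task, failed_nodes, replicas_on=REPLICAS_ON):
    """Number of replicas of the task that sit on nodes that have not failed."""
    return sum(1 for node in replicas_on[task] if node not in failed_nodes)

def classify(failed_nodes, replicas_on=REPLICAS_ON):
    """Classify the system state according to the safe/unsafe failure definitions."""
    fewest = min(healthy_instances(t, failed_nodes, replicas_on) for t in replicas_on)
    if fewest == 0:
        return "unsafe failure"   # some task type has no instance left
    if fewest == 1:
        return "safe failure"     # results can no longer be checked
    return "operational"
```

Either predicate can be used as the structure function when building the corresponding fault tree.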
NMR - unsafe failure: Creating the fault tree
[Figure: fault tree for unsafe failure. The top event "failure" branches over t1, the three task replicas, the fork, the voter and t3; the leaves are the nodes hosting them (n1, n2, n4, n5, n6 and n8 appear).]
[Figure: A mapping of an application with TMR pattern onto a 3x3 NoC. Node types by row: n1 RISC, n2 DSP, n3 RISC; n4 DSP, n5 NPC, n6 DSP; n7 RISC, n8 DSP, n9 RISC.]
Reliability-aware mapping flow
[Figure: Reliability-aware mapping tool (GAMapper) based on genetic algorithms (constrained NSGAIIC).
Inputs: Application Specification ($M_{TE}$, $d$); Architecture Specification ($M_{NP}$, $M_{PL}$, $M_{NC}$, $C$, $c$); Profiling ($T_{TC}$); Fault rates ($\lambda$).
GAMapper (reliability-aware mapping tool): applies self-checking patterns and uses an analytical model to evaluate MTTF, execution time and communication cost.
Output: Pareto solutions ($X_{NT}$, $M_{TE}$).]
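What "Pareto solutions" means for the three objectives named in the figure can be shown with a plain dominance filter. This is not NSGA-II and not GAMapper; it is only the dominance test that such multi-objective tools rely on, assuming MTTF is maximized while execution time and communication cost are minimized. The dictionary keys and the candidate values used with it are hypothetical.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b everywhere and better somewhere.

    Assumed objectives: maximize 'mttf', minimize 'time' and 'comm'.
    """
    no_worse = (a["mttf"] >= b["mttf"]
                and a["time"] <= b["time"]
                and a["comm"] <= b["comm"])
    better = (a["mttf"] > b["mttf"]
              or a["time"] < b["time"]
              or a["comm"] < b["comm"])
    return no_worse and better

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]
```

A mapping with a shorter MTTF can still be on the front if it is faster or cheaper in communication, which is why the flow returns a set of solutions rather than a single one.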