FAULT TOLERANCE FOR MULTI-CORE AND MANY-CORE PROCESSORS
Vanessa VARGAS
PhD candidate in Nano Electronics and Nano Technologies, Université de Grenoble Alpes, France
Professor at Universidad de las Fuerzas Armadas ESPE, Department of Electrical and Electronics, Ecuador
OUTLINE
Introduction
Motivation
Background
Work Done
Conclusions
INTRODUCTION
[Diagram: Start, Task 1, Task 2, ..., Task n, End]
OUTLINE
Introduction
MOTIVATION
Background
Work Done
Conclusions
MOTIVATION
SUPERCOMPUTERS: Top500 (June 2016)
1st in the Top500: Sunway TaihuLight (Sunway MPP, NRCPC), 93.01 petaflops
• Sunway SW26010 260C 1.45 GHz, Sunway, NRCPC
• 10,649,600 cores, 15.37 MW
• National Supercomputing Center in Wuxi, China
• Many-core
2nd in the Top500: Tianhe-2 (NUDT), 33.86 petaflops
• Intel Ivy Bridge, 12 cores per processor, 2.2 GHz + Intel Xeon Phi
• 3,120,000 cores, 17.81 MW
• TH Express-2 interconnect
• National University of Defense Technology, China
MOTIVATION
In HPC systems, many-core processors are crucial to satisfying the growing demand for performance and reliability without a critical increase in power consumption.
MOTIVATION
This exponential growth faces many challenges:
• Power: limited power budget
• Space: fit in the available floor space
• Cost: fixed financial budget
• Memory technology: feed the compute units in a power- and cost-efficient way
• Network technology: connect nodes in a power- and cost-efficient way
• Software: scale to utilize the growing compute capacity
• RELIABILITY: failure rates should not grow with machine size
• And others ...
MOTIVATION
CONCERNING RELIABILITY
• Evaluate fault tolerance techniques under radiation and fault injection campaigns.
• Evaluate the impact of fault tolerance techniques on performance and energy consumption.
Figure 1. Radiation experiment
OUTLINE
Introduction
Motivation
BACKGROUND
• Multiprocessing modes
• Fault Tolerance
Work Done
Conclusions
MULTI-PROCESSING MODES
Figure 2. Schemes of AMP and SMP processing modes
SMP (symmetric multiprocessing)
• A single OS is responsible for achieving parallelism in the application.
• It dynamically distributes the tasks among the cores, manages the organization of task completion, and controls the shared resources.
AMP (asymmetric multiprocessing)
• The cores run independently of each other, with or without an OS.
• Each core has its own private memory space, although there is a common infrastructure for inter-core communication.
FAULT TOLERANCE
A system is considered fault tolerant when it continues to work correctly in the presence of a fault.
Fault tolerance can be obtained by redundancy:
1. Spatial redundancy
2. Temporal redundancy
3. Both of them
Spatial vs temporal redundancy
TEMPORAL
• It uses the same physical components.
• It separates identical data signals in time.
• Advantage: fewer components.
• Disadvantages: latency penalty; it has a maximum operating frequency and is therefore not used in faster commercial processes; penalty in performance.
SPATIAL
• It uses different physical components.
• It separates identical data signals in space.
• Advantage: it lacks an inherent maximum operating frequency.
• Disadvantage: it requires more area and components.
Source: Radiation Effects and Soft Errors in Integrated Circuits and Electronic Devices
FAULT TOLERANCE IN MULTICORE
Taking advantage of the multiplicity of cores, various redundancy techniques can be considered:
1. Temporal redundancy
2. Data value redundancy
3. Information redundancy for error detection in multicore designs
4. Redundancy in execution
Any of these techniques can be evaluated by fault injection or by radiation test campaigns.
Redundancy in execution
State machine replication is used: replicated copies of a process are executed. The copies follow the same sequence of execution and produce the same result if the inputs are the same.
The redundant processes must not diverge in the absence of failures. Causes of divergence are:
• Nondeterministic functions (e.g. gettimeofday)
• In multi-core: access to shared memory
• Asynchronous signals
The record/replay method ensures that accesses to shared memory are performed in the same order.
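The record/replay idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the mechanism used in the cited work: the class names `RecordingLock` and `ReplayingLock` and the helper `run` are hypothetical, threads are identified by unique names, and "shared memory" is a plain Python list. The first run logs the order in which threads enter the critical section; the replica run admits threads strictly in that logged order, so both runs touch the shared data in the same sequence.

```python
import threading


class RecordingLock:
    """First run: an ordinary lock that logs the order in which threads enter."""

    def __init__(self):
        self._lock = threading.Lock()
        self.log = []  # thread names, in acquisition order

    def __enter__(self):
        self._lock.acquire()
        self.log.append(threading.current_thread().name)
        return self

    def __exit__(self, *exc):
        self._lock.release()


class ReplayingLock:
    """Replica run: admits threads strictly in the recorded order."""

    def __init__(self, log):
        self._log = list(log)
        self._next = 0  # index of the log entry allowed to enter next
        self._turn = threading.Condition()

    def __enter__(self):
        me = threading.current_thread().name
        with self._turn:
            # Block until the recorded order says it is this thread's turn.
            self._turn.wait_for(
                lambda: self._next < len(self._log) and self._log[self._next] == me
            )
        return self

    def __exit__(self, *exc):
        with self._turn:
            self._next += 1
            self._turn.notify_all()


def run(lock, names=("t0", "t1", "t2"), steps=3):
    """Each thread appends its name to a shared list `steps` times."""
    shared = []

    def worker():
        for _ in range(steps):
            with lock:
                shared.append(threading.current_thread().name)

    threads = [threading.Thread(target=worker, name=n) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


recorder = RecordingLock()
original = run(recorder)                    # order decided by the scheduler
replica = run(ReplayingLock(recorder.log))  # order forced by the log
print(original == replica)
```

The replay side pays for determinism with waiting: a thread that arrives early is held back until its recorded turn.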
Redundancy in execution
[Diagram: Unreliable system → State machine replication → Error checking and recovery → Reliable system]
• State machine replication: deterministic multithreading, record/replay
• Error checking and recovery: Double Modular Redundancy with checkpoint/rollback, Triple Modular Redundancy with fault masking
Redundancy in execution
• Deterministic multithreading: achieved by using locks, barriers and thread creation. Problem: it slows down the application.
• Double Modular Redundancy (DMR): it allows error detection.
• Triple Modular Redundancy (TMR): it allows error detection and correction by a voter.
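The difference between DMR and TMR can be shown with a minimal Python sketch. The function names `dmr` and `tmr`, the exception `MismatchDetected`, and the two toy replicas are hypothetical and chosen for illustration; a real system would run the replicas on separate cores. With two copies a disagreement can only be reported; with three copies a majority vote can also mask it.

```python
from collections import Counter


class MismatchDetected(Exception):
    """Raised when the replicas disagree and no correction is possible."""


def dmr(replica_a, replica_b, x):
    """Double modular redundancy: detects an error but cannot tell which copy is wrong."""
    a, b = replica_a(x), replica_b(x)
    if a != b:
        raise MismatchDetected(f"replicas disagree: {a!r} vs {b!r}")
    return a


def tmr(replicas, x):
    """Triple modular redundancy: a voter returns the value at least two replicas agree on."""
    results = [replica(x) for replica in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise MismatchDetected(f"no majority among {results!r}")
    return value


def healthy(x):
    return x + 1


def faulty(x):
    # Models a replica whose result has one flipped bit.
    return (x + 1) ^ 0b100


print(tmr([healthy, faulty, healthy], 10))  # the faulty replica is outvoted
try:
    dmr(healthy, faulty, 10)
except MismatchDetected as err:
    print("DMR detected:", err)
```

In the DMR case, recovery has to come from somewhere else, for example the checkpoint/rollback scheme named on the previous slide.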
Redundancy in execution
Mixed modelling:
• Deterministic multithreading
• DMR
Figure 3. Example of redundancy in execution
Source: Hamid Mushtaq, Zaid Al-Ars, Koen Bertels, "Fault Tolerance on Multicore Processors using Deterministic Multithreading"
OUTLINE
Introduction
Motivation
Background
WORK DONE
• Freescale P2041RDB
• TMR in AMP mode
• Fault Injection in SMP
• Radiation Tests in AMP and SMP modes
• KALRAY MPPA-256 (Multi Purpose Processing Array)
• Fault Injection in AMP mode
• Radiation Tests in AMP mode
• Fault Injection in mixed mode
• Evaluating Fault Tolerance Technique
Conclusions
FREESCALE P2041
Figure 4. QorIQ P2041 memory architecture
• Built on: Power Architecture technology
• Manufactured: 45 nm SOI technology
• Based on: four e500mc cores (32-bit superscalar processor)
• Operation frequency: up to 1.5 GHz
TMR in AMP mode
TMR in AMP mode
Figure 5. Fault-injection strategy in processor registers
TMR in AMP mode
EXPERIMENT
• The experiment was run 50,000 times.
• One or two SEUs were injected per execution.
Figure 6. Fault-injection consequences
RESULTS
• 20% of injected faults have no detectable consequences (silent faults).
• If one SEU is injected per execution, the error rate reaches 78% and the TMR corrects 99.99% of the errors.
• If two SEUs are injected, the error rate reaches 93% while the error correction factor decreases to 85%.
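To make the notion of an injected SEU concrete, the sketch below models a register as a 32-bit integer and an SEU as a single bit flip, then applies a bitwise 2-out-of-3 voter. It is an assumption-laden illustration, not the injector of Figure 5: the names `inject_seu` and `vote`, the seed, and the value `golden` are hypothetical, and the real campaign acts on actual processor registers rather than Python integers.

```python
import random

WORD_BITS = 32  # the e500mc cores are 32-bit


def inject_seu(word, rng):
    """Flip one randomly chosen bit of a 32-bit word (single-event upset)."""
    return word ^ (1 << rng.randrange(WORD_BITS))


def vote(a, b, c):
    """Bitwise 2-out-of-3 majority of three copies of a word."""
    return (a & b) | (a & c) | (b & c)


rng = random.Random(2016)  # fixed seed so the run is repeatable
golden = 0x12345678        # fault-free value held by all three copies

# One SEU in one copy: the two clean copies outvote the corrupted one.
corrupted = inject_seu(golden, rng)
one_fault = vote(corrupted, golden, golden)

# Two SEUs hitting the same bit of two different copies: the wrong bit wins.
bit = 1 << 5
two_faults = vote(golden ^ bit, golden ^ bit, golden)

print(one_fault == golden, two_faults == golden)
```

The second case is one way in which two upsets in the same execution can escape a TMR voter, whereas a single upset confined to one copy is always masked by this scheme.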
TMR in AMP mode
Figure 7. Fault-injection consequences in processor registers
FAULT INJECTION IN SMP
Table I. Applications summary
FAULT INJECTION IN SMP
Two test campaigns were performed on each selected application:
a) Fault injection in processor registers.
b) Fault injection in memory region.
Table II. Fault-injection campaigns
FAULT INJECTION IN SMP
Figure 8. Proposed software fault injection in memory region
FAULT INJECTION IN APPLICATIONS RUNNING IN SMP
Figure 9. Fault-injection consequences in processor registers
[Bar chart. Series: Register MM, Register TSP. Categories: silent faults, result errors, exceptions, timeouts. Bar labels in extraction order: 84.38%, 65.39%, 34.19%, 13.52%, 1.47%, 0.63%, 0.16%, 0.27%.]
The memory campaigns target only the private code memory:
• the initial process stack memory,
• the threads' stack memory, and
• the process's heap memory.
Figure 10. Fault-injection consequences in memory region
[Bar chart. Series: Memory MM, Memory TSP. Categories: silent faults, result errors, exceptions, timeouts. Bar labels in extraction order: 96.59%, 59.82%, 23.32%, 14.25%, 2.60%, 1.49%, 1.92%, 0.02%.]
RADIATION TESTS
Figure 11. Consequences of radiation test campaigns
The results show that the reliability of an application depends on the characteristics of the software environment:
• Operating system.
• Multiprocessing mode used.
• Characteristics of the application.
RADIATION TESTS IN SMP MODE
Figure 12. Error classification according to OS fault
The results revealed that errors may occur in SMP mode even if the OS is in idle mode.
RADIATION TESTS
Figure 13. SEE consequences according to the scenario implemented. The confidence intervals are shown by means of the red lines.
KALRAY MPPA-256
• Manufactured: TSMC CMOS 28HP technology.
• Compute cluster: multi-banked local static memory (SMEM) of 2 MB shared by the 16 PEs + 1 RM.
• Integrates: 256 Processing Engine (PE) cores and 32 Resource Management (RM) cores.
• I/O cluster: 2 groups of quad-cores, each with 128 KB shared.
• Based on: 32-bit/64-bit VLIW core architecture.
• Operation frequency: 100 MHz to 600 MHz.
• Power consumption: 15 W to 25 W.
• Peak performance at 600 MHz: 634 GFLOPS single-precision and 316 GFLOPS double-precision.
• Clustered architecture: 16 compute clusters (CCs) and 2 I/O clusters per device.
Figure 14. MPPA-256 memory architecture
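The core counts on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures given above, plus one assumption that is not stated on the slide: that the cores of the I/O clusters are counted among the 32 RM cores. The constant names are illustrative.

```python
# Figures taken from the slide.
COMPUTE_CLUSTERS = 16
PE_PER_COMPUTE_CLUSTER = 16
RM_PER_COMPUTE_CLUSTER = 1
IO_CLUSTERS = 2
QUAD_CORES_PER_IO_CLUSTER = 2
CORES_PER_QUAD_CORE = 4

pe_cores = COMPUTE_CLUSTERS * PE_PER_COMPUTE_CLUSTER
io_cores = IO_CLUSTERS * QUAD_CORES_PER_IO_CLUSTER * CORES_PER_QUAD_CORE

# Assumption: the I/O cluster cores are counted as RM cores.
rm_cores = COMPUTE_CLUSTERS * RM_PER_COMPUTE_CLUSTER + io_cores

print(pe_cores, rm_cores)  # 256 32
```

Under that assumption the per-cluster figures add up to the device totals of 256 PE and 32 RM cores.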
Fault Tolerance Approach on MPPA
Implemented at application level, it uses the 2 I/O clusters to improve the reliability of the application.
1) Core 0 initializes the intercluster communications.
2) Core 0 generates a pthread per core:
• Cores 1, 2: masters of a group of computing clusters.
• Cores 4, 5, 6: voters of the results (TMR - arbiter).
• Core 3: arbiter of the final results. It logs the results.
• Core 7 (only on I/O 0): fault injector.
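The core-to-role assignment above can be written down as a small Python sketch. It only mirrors the mapping listed on the slide; the functions `roles_for` and `core0_main` are hypothetical, Python threads stand in for the pthreads, and the body of each thread is a placeholder because the slide does not describe the data exchanged between masters, voters and arbiter.

```python
import threading


def roles_for(io_cluster):
    """Role of each core of an I/O cluster, as listed on the slide."""
    roles = {
        1: "master", 2: "master",           # masters of a group of compute clusters
        3: "arbiter",                        # arbiter of the final results, logs them
        4: "voter", 5: "voter", 6: "voter",  # voters of the results
    }
    if io_cluster == 0:
        roles[7] = "fault injector"          # only present on I/O cluster 0
    return roles


def core0_main(io_cluster):
    """Core 0: (1) set up communications, (2) start one thread per remaining core."""
    # Step 1 placeholder: intercluster communication setup would go here.
    started = []
    guard = threading.Lock()

    def body(core, role):
        # Placeholder for the role's real work; just record that it started.
        with guard:
            started.append((core, role))

    threads = [
        threading.Thread(target=body, args=(core, role))
        for core, role in roles_for(io_cluster).items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(started)


for cluster in (0, 1):
    print(cluster, core0_main(cluster))
```

Keeping the fault injector on a single I/O cluster means the second I/O cluster runs the same roles without it, which is what the conditional in `roles_for` expresses.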