FlipBack: Automatic Target Protection Against Soft Errors
Xiang Ni
Parallel Programming Lab
Soft Errors
• Common sources of soft errors
  • Electrical noise
  • External radiation
  • Manufacturing fault
• Data corruption: we may or may not know
• Shrinking chip size
  • More energy efficient
  • Higher soft error rate
Motivation Example
• Flowchart: on each ghostMsg, msgsRecvd++, then test msgsRecvd == expectedMsg
  • No: wait for the next ghostMsg
  • Yes: proceed
• expectedMsg = 7 (00000111)
• Bit flip 00000111 -> 00001111 (expectedMsg 7 -> 15): HANG, because the expected count is never reached
• Bit flip 00000111 -> 00000011 (expectedMsg 7 -> 3): stop accepting messages much earlier: incorrect result
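The two failure modes above can be reproduced with a small simulation. This is a minimal sketch, not code from the talk: the names `flipBit`, `deliver`, and `Outcome` are hypothetical, and the counter is assumed to be 8 bits wide to match the bit patterns on the slide.

```cpp
#include <cassert>
#include <cstdint>

// Flip a single bit of an 8-bit value, as a soft error would.
std::uint8_t flipBit(std::uint8_t value, unsigned bit) {
    assert(bit < 8);
    return static_cast<std::uint8_t>(value ^ (1u << bit));
}

// Possible results of delivering `sent` ghost messages to a receiver
// that waits until msgsRecvd == expectedMsg.
enum class Outcome { Correct, Hang, EarlyExit };

Outcome deliver(std::uint8_t expectedMsg, unsigned sent) {
    std::uint8_t msgsRecvd = 0;
    for (unsigned m = 0; m < sent; ++m) {
        ++msgsRecvd;                     // msgsRecvd++
        if (msgsRecvd == expectedMsg) {  // Yes branch: stop accepting messages
            return (m + 1 == sent) ? Outcome::Correct : Outcome::EarlyExit;
        }
    }
    return Outcome::Hang;  // every message consumed, condition never met
}
```

With seven messages sent, an uncorrupted `expectedMsg` of 7 completes normally, flipping bit 3 (7 -> 15) leaves the receiver waiting forever, and flipping bit 2 (7 -> 3) makes it proceed after only three messages.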
Runtime Guided Replication
• Control variables
  • msgsRecvd, expectedMsg
  • Affect program flow
• How do we ensure the program control flow is correct?
  • Full duplication is expensive: less than 50% resource utilization, or at least twice the running time
• What about duplicating only the computation that affects program flow?
  • Leverage a compiler slicing pass
  • Reduce computation time
  • Avoid doubling the memory
Compiler Slicing Pass
void Stencil::beginNextIter() {
    iterCount++;
    if (iterCount >= totalIter) {
        mainProxy.done(); // program exits
    } else {
        for (int i = 0; i < totalDirections; i++) {
            ghostMsg* m = createGhostMsg(dirs[i]);
            copy(m->data, boundary[i]);
            int sendTo = myIdx + dirs[i];
            stencilProxy(sendTo).receiveMessage(m);
        }
    }
}
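To make the idea of a control slice concrete, here is a minimal sketch of what a sliced version of beginNextIter() might keep. It is an illustration under assumptions, not the output of the actual compiler pass: `StencilState`, `ControlDecision`, and `beginNextIterSlice` are hypothetical names, and the choice of which statements belong to the slice (iteration counter, exit test, loop bound, message destination) is assumed. The ghost data copy is left out because it does not influence control flow.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the chare's control-related state.
struct StencilState {
    int iterCount = 0;
    int totalIter = 0;
    int totalDirections = 0;
    int myIdx = 0;
    std::vector<int> dirs;
};

// What the control slice decides: exit, or the list of destinations.
struct ControlDecision {
    bool done = false;
    std::vector<int> sendTo;
    bool operator==(const ControlDecision& o) const {
        return done == o.done && sendTo == o.sendTo;
    }
};

// Control slice of beginNextIter(): only statements that affect branches
// and message destinations are kept; no ghost data is created or copied.
ControlDecision beginNextIterSlice(StencilState& s) {
    ControlDecision d;
    s.iterCount++;
    if (s.iterCount >= s.totalIter) {
        d.done = true;  // corresponds to mainProxy.done() in the original
    } else {
        for (int i = 0; i < s.totalDirections; i++) {
            d.sendTo.push_back(s.myIdx + s.dirs[i]);  // destination only
        }
    }
    return d;
}
```

Because the slice touches only a handful of integers, running it a second time costs far less time and memory than re-running the full entry method.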
The Role of Runtime System
• Creation of shadow chares
  • Initialize with the same control variables as the original chare
  • Share pointers to the non-control variables with the original chare
• Compare the values of control variables and outgoing messages at the end of each entry method
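The runtime's bookkeeping can be sketched as follows. This is a simplified illustration, not the Charm++ runtime's API: `Chare`, `makeShadow`, `receiveMessageControl`, and `controlMatches` are hypothetical names, and the comparison of outgoing messages is omitted for brevity.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical chare state: control variables are stored by value,
// non-control field data is reached through a shared pointer.
struct Chare {
    int msgsRecvd = 0;
    int expectedMsg = 0;
    std::shared_ptr<std::vector<double>> field;
};

// Create a shadow: control variables are copied, while the shared_ptr
// copy points at the same underlying field data (no memory doubling).
Chare makeShadow(const Chare& original) {
    return original;
}

// Control part of a hypothetical entry method, run on both copies.
void receiveMessageControl(Chare& c) {
    c.msgsRecvd++;
}

// End-of-entry-method check: true when the control variables agree.
bool controlMatches(const Chare& a, const Chare& b) {
    return a.msgsRecvd == b.msgsRecvd && a.expectedMsg == b.expectedMsg;
}
```

A mismatch reported by `controlMatches` signals that one copy's control variables were corrupted during the entry method.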
Another Example
void Stencil::invokeComputation() { // computation routine
    for (int i = 0; i < size; ++i) {
        temperature[i] = ...
    }
}
• The previous method fails to protect the loop index i
  • Its lifetime ends before the end of the entry method
• However, if a bit flip occurs in i, incorrect data is used or the program crashes
Selective Instruction Duplication
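One way to picture selective instruction duplication for a short-lived variable such as a loop index is to keep a second copy that is updated by a duplicated increment and checked before each use. The sketch below is an illustration at source level, assuming this duplicate-and-compare scheme. It is not the transformation used in the talk, which would operate on compiler instructions. `updateWithDuplicatedIndex` and the `corruptAt` test hook are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if the loop finished with the index and its duplicate in
// agreement at every step, false as soon as a mismatch is detected.
// `corruptAt` is a hypothetical test hook: when it equals a valid
// iteration number, bit 1 of the primary index is flipped there.
bool updateWithDuplicatedIndex(std::vector<double>& temperature,
                               std::size_t corruptAt) {
    const std::size_t size = temperature.size();
    std::size_t i = 0;     // primary loop index
    std::size_t iDup = 0;  // duplicate, advanced by a duplicated increment
    while (i < size) {
        if (iDup == corruptAt) i ^= 2u;  // simulated soft error in i
        if (i != iDup) return false;     // check before the index is used
        temperature[i] += 1.0;           // placeholder for the real update
        ++i;
        ++iDup;
    }
    return true;
}
```

Passing `corruptAt` equal to the vector size disables the simulated fault, so the loop runs to completion. Any smaller value triggers the mismatch check before the corrupted index touches the data.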
Protection for Field Data
• Rules that hold in nature also hold in scientific programs
[Figure: plots for Stencil2d and OpenAtom]
Protection for Field Data
• Spatial similarity
    d(i-1,j-1)  d(i-1,j)  d(i-1,j+1)
    d(i,j-1)    d(i,j)    d(i,j+1)
    d(i+1,j-1)  d(i+1,j)  d(i+1,j+1)
• Temporal similarity
  • data at time step t-2k, t-k, t
• Spatial temporal similarity
  • spatial similarity of temporal updates
  • temporal similarity of spatial differences
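The spatial and temporal checks can be sketched as simple predictors. This is a minimal illustration under assumptions rather than the detector from the talk: the neighbor mean, the linear extrapolation from t-2k and t-k, and the fixed tolerance `tol` are all hypothetical choices, and the grid is assumed to be rectangular.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// Spatial check: compare d(i,j) with the mean of its 8 neighbors.
// Requires an interior point; `tol` is a hypothetical threshold.
bool spatiallyConsistent(const Grid& d, std::size_t i, std::size_t j,
                         double tol) {
    assert(i >= 1 && j >= 1 && i + 1 < d.size() && j + 1 < d[i].size());
    double sum = 0.0;
    for (std::size_t r = i - 1; r <= i + 1; ++r)
        for (std::size_t c = j - 1; c <= j + 1; ++c)
            if (r != i || c != j) sum += d[r][c];
    return std::fabs(d[i][j] - sum / 8.0) <= tol;
}

// Temporal check: linearly extrapolate the values at t-2k and t-k
// to predict the value at t, then compare with the observed value.
bool temporallyConsistent(double atTminus2k, double atTminusK, double atT,
                          double tol) {
    const double predicted = 2.0 * atTminusK - atTminus2k;
    return std::fabs(atT - predicted) <= tol;
}
```

A value that fails either check is a candidate for corruption. The combined spatial temporal variants on the slide would apply the same kind of comparison to temporal updates or spatial differences instead of raw values.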
Evaluation
• Miniaero
  • Mantevo mini-applications suite
  • compressible Navier-Stokes equations using explicit RK4 method
• Particle-in-cell
  • Intel PRK benchmark suite
  • Charm++ implementation
  • Particles are distributed within a fixed grid of charges. At each time step, PIC calculates the impact of the Coulomb potential of particles with related grid points.
• Stencil3d
  • 7-point stencil-based computation on a 3D-structured mesh
• Fault Injection with LLFI
  • random time
  • random processor
Evaluation: Miniaero
[Figure: failure type (%) versus corrupted bit, in eight panels: (a) Original: control, (b) Protected: control, (c) Original: communication, (d) Protected: communication, (e) Original: computation (integer), (f) Protected: computation (integer), (g) Original: computation (floating point), (h) Protected: computation (floating point). Legend: Hang, Crash, Masked, SOC, Detected, Detected & Masked, Detected & Corrected.]