EPVF: AN ENHANCED PROGRAM VULNERABILITY FACTOR METHODOLOGY FOR CROSS-LAYER RESILIENCE ANALYSIS
Bo Fang ☨, Qining Lu ☨, Karthik Pattabiraman ☨, Matei Ripeanu ☨ and Sudhanva Gurumurthi *
☨ The University of British Columbia, Canada
* Cloud Innovation Lab, IBM, USA
What are we facing?
§ SoC soft error trends: overall FIT rate per SoC is increasing [DATE 2014, Chandra AMD]
[Figure: SoC SER FIT rate per node; series: Memory SER, Logic SER]
Why Software-based Fault Tolerance
§ Hardware-based techniques
Software-based techniques: more cost-effective
[Figure: system layers (Application Level, Operating System Level, Architectural Level, Hardware Device/Circuit Level), with labels Faults and Impactful Errors]
Mitigating Silent Data Corruption (SDC): Key to Error Resilience
[Figure: a fault leads to an error; outcomes shown: Crash, Hang, SDC (incorrect output), Benign (normal execution)]
Error Resilience Estimation: Accuracy vs Cost
[Figure: axes Accuracy and Cost; labeled points: FI (high resource consumption, low predictive power), AVF/PVF (conservative estimation of error resilience [HPCA2010, MICRO2003]), Goal]
Identifying SDC-causing Bits
§ AVF/PVF: Identify Architecturally Correct Execution (ACE) Bits [MICRO03, HPCA10]
e(nhanced)PVF: a methodology that distinguishes crash-causing bits from ACE bits
[Figure: total bits for execution, ACE bits, SDC-causing bits, crash-causing bits]
PVF Analysis [Sridharan, HPCA10’]
  R1 = LD R2
  R4 = ADD R1, R3
  R5 = ADD R6*4, R7
  ST R4, R5
  R8 = LD R2
[Figure: DDG of the sequence; nodes R1 to R8, ADDR1, ADDR2, LD, ADD, *, ST]
§ ACE Bits = $\sum_{i \in \text{ACE}} \text{Bits in } R_i$
§ Total Bits = $\sum_{i} \text{Bits in } R_i$
§ PVF = $\frac{\text{ACE Bits}}{\text{Total Bits}}$ = 88.9%
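The PVF computation on this slide can be sketched as a backward slice over a small dependence graph. The sketch below is illustrative only: the node `T` for the intermediate product `R6*4`, the uniform 32-bit width, and the helper names are assumptions, not taken from the slides. Under these assumptions 8 of the 9 nodes are ACE, which is consistent with the 88.9% shown, although the slide does not give its exact bit accounting.

```python
# Hypothetical DDG for the five-instruction example: each value maps to the
# values it is computed from. "T" (the product R6*4) is an assumed extra node.
DEPS = {
    "R1": ["R2"],        # R1 = LD R2
    "R4": ["R1", "R3"],  # R4 = ADD R1, R3
    "T": ["R6"],         # T  = R6 * 4
    "R5": ["T", "R7"],   # R5 = ADD T, R7
    "R8": ["R2"],        # R8 = LD R2 (result is never used)
}
OUTPUT_ROOTS = ["R4", "R5"]  # operands of ST R4, R5
BITS_PER_NODE = 32           # assumed uniform width


def backward_slice(roots, deps):
    """Collect every node reachable backward from the output roots."""
    seen = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, []))
    return seen


def pvf(deps, roots, bits=BITS_PER_NODE):
    """PVF = ACE bits / total bits, with ACE nodes = backward slice of outputs."""
    all_nodes = set(deps) | {src for srcs in deps.values() for src in srcs}
    ace_nodes = backward_slice(roots, deps)
    return (len(ace_nodes) * bits) / (len(all_nodes) * bits)


if __name__ == "__main__":
    print(f"PVF = {pvf(DEPS, OUTPUT_ROOTS):.3f}")
```

Here `R8` is the only node outside the backward slice, because its value never reaches the store.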
Our Approach: ePVF
§ Source of crashes
  § Segmentation faults (99% of crashes are due to segfaults)
§ Direct crash-causing bits
  § Crash model
§ Indirect crash-causing bits
  § Propagation model
[Figure: DDG of the example sequence; pie chart "Source of crashes": Segfaults, Others]
Overall methodology
Pipeline: Obtaining Program Trace → PVF: Identify ACE bits → Crash Model → Propagation Model
Crash Model: identify bits that cause a program to make an invalid memory access and crash
Propagation Model: identify bits on the backward slice of bits that directly cause crashes
Crash model
§ Determining the bits that cause an out-of-bound memory access
§ Applied on every memory instruction
[Figure: for R1 = LD R2 in the example sequence, the address must satisfy R2 ∈ [addr_min, addr_max]; labels: vma_start, vma_end, OS Info, ESP; bit pattern 01110001010010…]
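A minimal sketch of the crash-model idea, assuming a single valid address range and a single-bit-flip fault model: for a memory operand, every bit whose flip moves the address outside `[addr_min, addr_max]` is counted as crash-causing. The function name, the 32-bit width, and the one-range simplification are assumptions for illustration; the slide indicates that the real bounds come from OS information (`vma_start`, `vma_end`, `ESP`).

```python
def crash_causing_bits(addr, addr_min, addr_max, width=32):
    """Return the bit positions whose single flip moves addr out of range."""
    crash_bits = []
    for bit in range(width):
        flipped = addr ^ (1 << bit)  # inject a single-bit flip
        if not (addr_min <= flipped <= addr_max):
            crash_bits.append(bit)   # out-of-bound access, assumed to crash
    return crash_bits


if __name__ == "__main__":
    # Hypothetical 4 KiB region starting at 0x1000.
    bits = crash_causing_bits(0x1000, 0x1000, 0x1FFF)
    print(f"{len(bits)} of 32 bits are crash-causing: {bits}")
```

In this toy case the low 12 bits keep the address inside the region, while flipping any of the upper 20 bits leaves it.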
Propagation model
§ Identifying all possible bits that can affect the bits identified by the crash model
[Figure: the crash model yields min(R5), max(R5) for ST R4, R5; the range propagates backward through R5 = ADD R6*4, R7]
max(R6) = (max(R5) - R7)/4
min(R6) = (min(R5) - R7)/4
max(R7) = max(R5) - R6*4
min(R7) = min(R5) - R6*4
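The four range equations can be sketched as one backward-propagation step through `R5 = R6*4 + R7`. This is an illustrative sketch, not the authors' implementation: the function name is hypothetical, and rounding the lower bound up and the upper bound down for integer registers is an assumption the slide does not state.

```python
def propagate_mul_add(r5_min, r5_max, r6, r7, scale=4):
    """Given the valid range of R5 = R6*scale + R7, return the valid ranges
    of R6 (holding R7 fixed) and of R7 (holding R6 fixed)."""
    # Integer bounds: ceil for the minimum, floor for the maximum (assumed).
    r6_min = -((r7 - r5_min) // scale)  # ceil((r5_min - r7) / scale)
    r6_max = (r5_max - r7) // scale     # floor((r5_max - r7) / scale)
    r7_min = r5_min - r6 * scale
    r7_max = r5_max - r6 * scale
    return (r6_min, r6_max), (r7_min, r7_max)


if __name__ == "__main__":
    # Hypothetical values: R5 must stay inside [0x1000, 0x1FFF].
    r6_range, r7_range = propagate_mul_add(0x1000, 0x1FFF, r6=16, r7=0x1000)
    print("R6 range:", r6_range)
    print("R7 range:", r7_range)
```

Once an operand has a valid range, the same out-of-range bit test used for the address can be applied to that operand, which is how bits further up the backward slice become indirectly crash-causing.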
Overall ePVF methodology
Pipeline: Obtaining Program Trace → PVF: Identify ACE bits → Crash Model → Propagation Model → ePVF
ePVF: bits that potentially lead to SDCs
Experimental setup
§ Scientific benchmarks
  § 8 from Rodinia [IISWC 09]
  § Matrix Multiplication
  § LULESH: DOE proxy app [IPDPS 2013]
§ Fault Model
  § LLFI [DSN 14]
  § 3,000 runs per benchmark
Evaluation
§ RQ1: Accuracy of the models
§ RQ2: Effectiveness of the ePVF methodology
§ RQ3: Performance
RQ1: Accuracy of the models
§ Recall
  FI experiments → crash trials → pick the flipped bit for a crash trial → check that bit against the model
§ Precision
  Randomly pick a bit from the models → flip the exact bit during the execution → check if a crash occurs
Our models achieve 89% recall and 92% precision on average
[Figure: bar charts "Recall of the Model" and "Precision of the Model"; y-axis 50% to 100%]
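The two measurement procedures can be sketched as follows. This is a toy illustration under stated assumptions: a bit is identified by a hypothetical `(dynamic_instruction_id, bit_position)` pair, and `crashes_when_flipped` stands in for an actual fault-injection run, which the slides perform with LLFI.

```python
def recall(crash_bits_from_fi, model_bits):
    """Fraction of bits that crashed under FI and that the model also flags."""
    hits = sum(1 for bit in crash_bits_from_fi if bit in model_bits)
    return hits / len(crash_bits_from_fi)


def precision(sampled_model_bits, crashes_when_flipped):
    """Fraction of sampled model-flagged bits that really crash when flipped."""
    hits = sum(1 for bit in sampled_model_bits if crashes_when_flipped(bit))
    return hits / len(sampled_model_bits)


if __name__ == "__main__":
    # Toy data: bits are (dynamic instruction id, bit position) pairs.
    model_bits = {(1, 31), (1, 30), (2, 31), (3, 29)}
    fi_crash_bits = [(1, 31), (2, 31), (4, 12), (3, 29)]
    ground_truth = {(1, 31), (2, 31), (3, 29), (4, 12)}

    print("recall:", recall(fi_crash_bits, model_bits))
    print("precision:", precision(sorted(model_bits), lambda b: b in ground_truth))
```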
RQ1: Accuracy of the Models
On average, the ePVF methodology correctly identifies crash-causing bits 90% of the time
RQ2: Effectiveness of the ePVF
§ SDC estimate using PVF analysis, ePVF analysis and Fault Injection
ePVF tightens the upper bound of estimated SDCs by 61% on average
[Figure: bar chart; series: PVF value, ePVF value, SDC rate from FI; y-axis 0% to 100%]
ePVF-informed Duplication
§ Rank instructions based on their ePVF value
§ ePVF value per instruction = $\frac{\text{ACE bits} - \text{Crash-causing bits}}{\text{ACE bits}}$
§ The higher the ePVF value, the higher the chance of leading to SDCs
§ Duplicate the instructions with the highest-ranked ePVF values
§ 30% more SDC coverage than hot-path duplication for the same performance overhead
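A minimal sketch of ePVF-informed selection, assuming the per-instruction value is (ACE bits minus crash-causing bits) divided by ACE bits, and assuming a simple greedy choice under an overhead budget. The record fields (`name`, `ace`, `crash`, `cost`), the cost units, and the greedy policy are all hypothetical; the slides do not describe how the budget is enforced.

```python
def epvf_per_instruction(ace_bits, crash_bits):
    """Share of an instruction's ACE bits that are not crash-causing."""
    return (ace_bits - crash_bits) / ace_bits if ace_bits else 0.0


def select_for_duplication(instructions, budget):
    """Greedily pick the highest-ePVF instructions that fit in the budget."""
    ranked = sorted(
        instructions,
        key=lambda ins: epvf_per_instruction(ins["ace"], ins["crash"]),
        reverse=True,
    )
    chosen, spent = [], 0
    for ins in ranked:
        if spent + ins["cost"] <= budget:
            chosen.append(ins["name"])
            spent += ins["cost"]
    return chosen


if __name__ == "__main__":
    # Hypothetical per-instruction bit counts and duplication costs.
    instructions = [
        {"name": "ld", "ace": 32, "crash": 24, "cost": 1},
        {"name": "add", "ace": 32, "crash": 0, "cost": 1},
        {"name": "mul", "ace": 32, "crash": 8, "cost": 2},
    ]
    print(select_for_duplication(instructions, budget=3))
```

The address-producing `ld` ranks lowest here because most of its ACE bits are crash-causing and would be caught by the crash itself rather than surfacing as an SDC.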
RQ3: Performance
§ Modeling time ranges from 30s (lavaMD) to ~4 hours (pathfinder)
  § Depends on the size of the DDG, and hence on the number of dynamic instructions
§ Optimization (Sampling and Extrapolation)
  § Intuition: scientific applications usually have repetitive behaviors
Extrapolated ePVF values based on 10% of the graph differ from the computed values by less than 1% on average
[Figure: predicted ePVF vs. computed ePVF; y-axis 0% to 45%]
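The sampling intuition can be illustrated with a deliberately naive sketch: compute the ratio on a prefix of the per-node records and use it as the estimate for the whole graph. Both the record layout `(total_bits, ace_bits, crash_bits)` and the constant-ratio extrapolation are assumptions; the slides do not specify the extrapolation method. The program-level ePVF is taken here as (ACE bits minus crash-causing bits) over total bits, following the statement that ePVF removes crash-causing bits from PVF.

```python
def epvf(records):
    """records: list of (total_bits, ace_bits, crash_bits), one per DDG node."""
    total = sum(t for t, _, _ in records)
    sdc_prone = sum(a - c for _, a, c in records)
    return sdc_prone / total


def extrapolated_epvf(records, fraction=0.10):
    """Estimate ePVF from the first `fraction` of the records only."""
    sample_size = max(1, int(len(records) * fraction))
    return epvf(records[:sample_size])


if __name__ == "__main__":
    # Hypothetical trace with perfectly repetitive behavior (period of 2 nodes).
    trace = [(32, 32, 8), (32, 16, 0)] * 50
    print("computed ePVF: ", epvf(trace))
    print("predicted ePVF:", extrapolated_epvf(trace))
```

On a perfectly periodic trace the prefix estimate matches the full computation; real traces are only approximately repetitive, so some difference is expected.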
Conclusion
§ ePVF removes the crash-causing bits from PVF to get a more accurate estimate of the SDC rate
  § A crash model that predicts direct crash-causing bits
  § A propagation model that identifies bits that lead to direct crash-causing bits
§ Implementation with the LLVM compiler
§ Drive selective protection of SDC-causing instructions
Email: bof@ece.ubc.ca
Code: https://github.com/flyree/enhancedPVF