Parallel Streaming Computation on Error-Prone Processors
Yavuz Yetim, Margaret Martonosi, Sharad Malik
Hardware Errors on the Rise
Soft errors due to cosmic rays [Sierawski et al., 2011] (plot: upsets/B muons/Mb vs. technology node (nm))
Random process variation [Kuhn et al., 2011] (plot: average number of dopant atoms vs. technology node (nm))
Traditional Solutions: Higher Latencies or Redundancy
Voltage margins (plot: PDF of delay, normalized number of dies vs. delay (ps))
Replication: Processor 1 and Processor 2 compute the same input and their outputs are checked (up to 100% overhead)
Memory subsystem with ECC: reliable SECDED costs 1-cycle latency and ~10k gates; 4EC5ED costs 14-cycle latency and ~100k gates
High power, performance, and area overhead
Architectures for Error-Prone Computing
ERSA [Leem et al., 2010]: a reliable core & memory run the main thread (algorithmic control, worker-thread error handling); unreliable cores & memory run the worker threads (do-all units, restarted on error)
Flikker [Liu et al., 2011]: reliable memory holds data declared critical ("critical int x;"); unreliable memory holds the rest ("int y;")
EnerJ [Sampson et al., 2011]: each instruction/datum is mapped to either a reliable or an unreliable (error-tolerant) execution unit / register / memory
To Minimal Reliable Hardware
Error-tolerant application running on an error-prone processor. Output:
• Crashes due to memory errors
• Hangs due to control-flow errors
To Minimal Reliable Hardware
Error-tolerant application running on an error-prone processor; without protection the output suffers crashes due to memory errors and hangs due to control-flow errors.
StreamIt programming model + memory segmentation (e.g., Filter 1 → Filter 2 / Filter 3 → Filter 4)
Control flow with scopes:
• Known run-times of modular control-flow regions determine timeout limits
• Coarse-grain sequencing of computation
Memory regions with R/W/X permissions:
• Only allowed accesses are permitted; all others are dropped
To Minimal Reliable Hardware
Error-tolerant application on an error-prone processor: output crashes due to memory errors and hangs due to control-flow errors.
Error-tolerant application on an error-prone processor + coarse-grain control-flow, memory, and I/O management: graceful quality degradation with errors.
*Extracting Useful Computation From Error-Prone Processors [Yetim et al., 2013]
Communication Errors for Parallel Streaming Applications
Error-tolerant application on multiple processing nodes with only single-threaded protection: output of unacceptable quality.
This work:
• Communication errors
  – Unrecoverable corruption of the communication mechanism
  – Data misalignment among producer/consumer threads
• CommGuard
  – Application-level communication information
  – Low-overhead recovery from communication errors
Outline
• Motivation
• Communication Errors in Parallel Streaming Applications
• CommGuard System Overview
• Experimental Methodology and Results
• Conclusions
Communication Errors: Transmission Failure
Concurrent software queue shared by producer (push) and consumer (pop):
• List of free pointers
• List of data pointers
• Locks
• State shared by both ends
• State retained throughout the computation
Corruption in the lists, pointers, and locks is permanent (see the sketch below)
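The fragility is easiest to see in code. The following is a minimal sketch, assuming a lock-plus-linked-list design like the one named on the slide (my own illustration, not the authors' queue): every pointer and the lock word live in shared memory for the whole run, so a single corrupted value is never rebuilt and poisons every later push and pop.

    #include <pthread.h>
    #include <stddef.h>

    typedef struct node { struct node *next; int payload; } node_t;

    typedef struct {
        pthread_mutex_t lock;   /* corrupted lock word -> producer/consumer hang */
        node_t *data_head;      /* corrupted data list -> lost or garbage items  */
        node_t *data_tail;
        node_t *free_head;      /* corrupted free list -> allocation breaks      */
    } sw_queue_t;

    /* Producer side: take a node from the free list, append it to the data list. */
    void sw_push(sw_queue_t *q, int item) {
        pthread_mutex_lock(&q->lock);
        node_t *n = q->free_head;          /* assume the free list is non-empty */
        q->free_head = n->next;
        n->payload = item;
        n->next = NULL;
        if (q->data_tail) q->data_tail->next = n; else q->data_head = n;
        q->data_tail = n;
        pthread_mutex_unlock(&q->lock);
    }

    /* Consumer side: detach the data-list head, recycle the node to the free list. */
    int sw_pop(sw_queue_t *q) {
        pthread_mutex_lock(&q->lock);
        node_t *n = q->data_head;          /* assume the data list is non-empty */
        q->data_head = n->next;
        if (!q->data_head) q->data_tail = NULL;
        int item = n->payload;
        n->next = q->free_head;
        q->free_head = n;
        pthread_mutex_unlock(&q->lock);
        return item;
    }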
Communication Errors: Transmission Failure
Even with an error-free hardware queue between producer (push) and consumer (pop):
• Data items are flowing
• But the image is not coherent
Communication Errors: Misalignment I
Producer(): push R; push G; push B;
Consumer(): pop R; pop G; pop B;
A control-flow error that skips a push leaves the error-free hardware queue holding a shifted sequence (e.g., ... R B R B G R ...) that the consumer misinterprets.
Misalignment due to a control-flow error is permanent
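A self-contained toy (my own, with values 0/1/2 standing in for R/G/B) reproduces the effect: one skipped push shifts every later pop, and nothing in the protocol ever realigns the two sides.

    #include <stdio.h>

    int main(void) {
        int q[64], head = 0, tail = 0;
        const char *name[] = { "R", "G", "B" };

        /* Producer: pushes R,G,B per pixel, but an injected control-flow
         * error skips the G push of pixel 1. */
        for (int px = 0; px < 4; px++)
            for (int c = 0; c < 3; c++)
                if (!(px == 1 && c == 1))      /* the skipped push */
                    q[tail++] = c;

        /* Consumer: still assumes strict R,G,B order for every pixel. */
        for (int px = 0; px < 3; px++) {
            printf("pixel %d:", px);
            for (int c = 0; c < 3; c++)
                printf(" %s<-%s", name[c], name[q[head++]]);
            printf("\n");
        }
        return 0;   /* pixels 1 and 2 come out with colours permanently shifted */
    }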
Communication Errors: Misalignment II
Producer R, Producer G, and Producer B feed 64-element chunks (R[0:63], G[0:63], B[0:63], ...) into a Join node that assembles pixel blocks P[0:63], P[64:127], ...
Misalignment at join nodes is also permanent
Outline
• Motivation
• Communication Errors in Parallel Streaming Applications
• CommGuard System Overview
• Experimental Methodology and Results
• Conclusions
CommGuard Overview
Markers delimit producer and consumer iterations in the stream.
• Expecting an item, received a marker: PAD the iteration
• Expecting a marker, received an item: DISCARD
CommGuard Overview
Split and join nodes keep a local iteration counter per edge. For all incoming edges:
• If items are missing: PAD
• If items are extra: DISCARD
CommGuard System Overview
The unreliable producer's pushes and new-iteration signals feed a Frame Inserter, which places a header marker at each iteration boundary before the items enter the hardware queue. On the consumer side, a Frame Checker sits between the queue and the unreliable consumer's pops and new-iteration signals; it can stall the consumer and apply Pad, Discard, or Pad & Discard to realign the stream.
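A software sketch of the checker's pad/discard policy, under my own naming (the real frame checker is a small hardware unit next to the queue): a header marker separates one frame's items from the next, missing items are padded with a default value, and leftover items are discarded when the consumer starts a new iteration.

    typedef enum { TAG_ITEM, TAG_HEADER } tag_t;
    typedef struct { tag_t tag; int value; } flit_t;

    /* Toy queue: assume it never wraps or underflows in this illustration. */
    typedef struct { flit_t buf[1024]; int head; } hw_queue_t;
    static flit_t q_peek(hw_queue_t *q) { return q->buf[q->head]; }
    static flit_t q_pop (hw_queue_t *q) { return q->buf[q->head++]; }

    #define PAD_VALUE 0   /* default value supplied for missing items */

    /* Consumer asks for the next item of its current iteration. */
    int checker_pop_item(hw_queue_t *q) {
        if (q_peek(q).tag == TAG_HEADER)
            return PAD_VALUE;   /* items missing for this iteration: PAD
                                   (the header stays queued for the next one) */
        return q_pop(q).value;
    }

    /* Consumer signals that it is starting a new iteration. */
    void checker_new_iteration(hw_queue_t *q) {
        while (q_peek(q).tag == TAG_ITEM)
            (void)q_pop(q);     /* leftover items from the old frame: DISCARD */
        (void)q_pop(q);         /* consume the header of the new frame */
    }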
Outline
• Motivation
• Communication Errors in Parallel Streaming Applications
• CommGuard System Overview
• Experimental Methodology and Results
• Conclusions
Experimental Methodology
• Built on prior simulation infrastructure [Yetim et al., DATE 2013]
  – Virtutech Simics modeling a 32-bit Intel x86
  – Error-injection capabilities
  – Protection modules for sequential streaming applications
  – Architecturally visible errors follow a distribution with a given mean time between errors (MTBE); see the sketch below
    • Pick an error-injection cycle
    • Pick a random register and a random bit
    • Flip the bit, repeat
• Extensions for multi-core simulation
  – Monitor scheduling of selected threads
  – Pin threads to processor cores
  – Per-core error injection
  – Protection modules implemented for every core
• Modeled frame checker and frame inserter
• JPEG decoder as a streaming application
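The injection procedure in the three sub-bullets can be summarized as below. This is my own stand-in for the Simics module, and the exponential inter-arrival distribution is an assumption; the slide only states a distribution with a given MTBE.

    #include <math.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define NUM_REGS 8                   /* e.g. the eight 32-bit x86 GPRs */
    static uint32_t regs[NUM_REGS];      /* stand-in for the simulated core's registers */

    /* Gap (in cycles) to the next injection; exponential gaps give the requested mean. */
    static double next_gap(double mtbe_cycles) {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* u in (0,1) */
        return -mtbe_cycles * log(u);
    }

    void inject_errors(uint64_t total_cycles, double mtbe_cycles) {
        for (double t = next_gap(mtbe_cycles); t < (double)total_cycles;
             t += next_gap(mtbe_cycles)) {
            int reg = rand() % NUM_REGS;           /* pick a random register */
            int bit = rand() % 32;                 /* pick a random bit      */
            regs[reg] ^= (uint32_t)1u << bit;      /* flip it, then repeat   */
            /* the real module performs this flip inside the simulator at cycle t */
        }
    }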
Output at Different Error Rates
• Output quality restored after misalignment through CommGuard
• Graceful output degradation with increasing errors
Run-time Overhead Due to Stalls
• Run time increases due to stalls caused by misalignments
• Only 2% even at high error rates
Amount of Padding
• Padding to resolve misalignments is observed even at low error rates
Outline
• Motivation
• Communication Errors in Parallel Streaming Applications
• CommGuard System Overview
• Experimental Methodology and Results
• Conclusions
Conclusions
• Communication in parallel applications adds fragility
  – Error-prone communication subsystem
  – Data misalignments due to asynchronous threads
• Explicit communication & control flow can be used
  – Encapsulate coarse-grain data units
  – Use small checker circuitry to recover from communication errors
• Low-overhead solutions sustain quality
  – Only ~150 B of reliable state per core and less than 2% run-time overhead even at high error rates
  – 16 dB output quality can be sustained for errors as frequent as every 1 ms
Backup Slides
Suitably Error Tolerant
Frame Checker FSM
Avoid Running Indefinitely
Regular execution: the program runs Loop 1, then Loop 2.
Indefinite run due to errors: a loop runs too long and the program never finishes.
Fix: divide the program into regions (scopes) with time limits; if Scope 1 or Scope 2 runs too long, break out of it (illustrated below).
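A software stand-in for the scope/timeout policy (in the real design the small reliable MIS hardware enforces the limit; the names and timer are mine): each coarse-grain region gets a worst-case run time, and the monitor breaks out of it rather than letting errors, e.g. a corrupted loop bound, run it forever.

    #include <time.h>

    /* Run one scope: iterate its body, but never past the scope's time limit. */
    void run_scope(void (*body)(void), int expected_iters, clock_t limit_ticks) {
        clock_t deadline = clock() + limit_ticks;   /* known worst-case run time   */
        for (int i = 0; i < expected_iters; i++) {
            body();                                 /* unreliable computation      */
            if (clock() > deadline)                 /* errors made it run too long */
                break;                              /* break out; next scope       */
        }
    }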
Disallowed Memory Accesses
Regular execution: accesses hit memory regions with matching permissions (X, R/X, R/W).
Crash due to errors: an erroneous write (W) to an execute-only or read-only region crashes the program.
Suppress crashes: drop the disallowed access and bump the PC instead of crashing (sketched below).
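The permission check itself is simple; the sketch below (my naming, illustrating the MFU policy rather than its hardware) matches each access against a region table and reports disallowed accesses so they can be dropped and the PC bumped.

    #include <stdbool.h>
    #include <stdint.h>

    enum { PERM_R = 1, PERM_W = 2, PERM_X = 4 };
    typedef struct { uint32_t base, limit; uint8_t perm; } region_t;

    /* Returns true if the access may proceed; false means "drop it, don't crash". */
    bool access_allowed(const region_t *map, int nregions,
                        uint32_t addr, uint8_t needed_perm) {
        for (int i = 0; i < nregions; i++)
            if (addr >= map[i].base && addr < map[i].limit)
                return (map[i].perm & needed_perm) != 0;
        return false;   /* address in no region: also dropped */
    }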
Overall Design
MIS: coarse-grained control-flow constraints and recovery
MFU: coarse-grained constraints on memory accesses
Streamed I/O: manages bounded data streams
Communication Errors: Single-Threaded Case
Toy producer-consumer streaming application on Core 0: the producer pushes 16 items per firing, the consumer pops 64 (schedule P P P P C), over a statically allocated 64-item buffer.
• The static buffer location is preserved in the reliable I-cache throughout the computation
• Every new producer [P] or consumer [C] iteration recovers the pointer values
• Communication never halts indefinitely
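A sketch of why the single-threaded toy is robust (my code, following the bullets above): the buffer address is a compile-time constant in the reliable instruction stream, and the index is re-initialized at the top of every firing, so a corrupted index cannot outlive the firing in which it was corrupted.

    #define FRAME 64
    static int buffer[FRAME];       /* static location, fixed at compile time */

    /* One producer firing: fills 16 slots; schedule is P P P P C (which = 0..3). */
    void producer_fire(int which, const int *src) {
        for (int i = 0; i < 16; i++)          /* index recreated each firing */
            buffer[which * 16 + i] = src[i];
    }

    /* One consumer firing: drains the whole 64-item frame. */
    void consumer_fire(int *dst) {
        for (int i = 0; i < FRAME; i++)       /* index recreated each firing */
            dst[i] = buffer[i];
    }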
Shared State
• The inserter and the checker need to keep state to operate
• The state below is shared by every inserter and checker belonging to a node (a struct sketch follows)
  – Firing per frame (static, set before the computation starts): how many times a node needs to fire for the next frame
  – Frame limit (static): number of total frames the application needs to process
  – Active frame (dynamic): how many frames have been processed so far
  – Active firing (dynamic): how many times the node has fired for the active frame
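As a struct, the shared per-node state above might look like this (field names are mine); a handful of such records fits easily within the ~150 B of reliable state per core quoted in the conclusions.

    #include <stdint.h>

    typedef struct {
        /* static: fixed before the computation starts */
        uint32_t firings_per_frame;  /* how many times the node fires per frame */
        uint32_t frame_limit;        /* total frames the application processes  */
        /* dynamic: advance as the computation runs */
        uint32_t active_frame;       /* frames processed so far                 */
        uint32_t active_firing;      /* firings completed in the active frame   */
    } node_frame_state_t;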
Additional Frame Checker State
• Receiving items (normal): the node is receiving items for the active frame
• Expecting a header (normal): the node has computationally started a new frame, so the next item in the queue should be a header
• Discarding (erroneous): the computation in the node is ahead of the communication on the edge
• Padding (erroneous): the communication on the edge is ahead of the computation in the node
(A transition-function sketch follows.)
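One way to read the list above as a state machine (state names and the transition function are mine, not the hardware FSM drawn in the backup figure): the input is the tag at the queue head and whether the node still expects items for its active frame.

    #include <stdbool.h>

    typedef enum { FC_RECEIVING, FC_EXPECT_HEADER, FC_DISCARDING, FC_PADDING } fc_state_t;
    typedef enum { FC_TAG_ITEM, FC_TAG_HEADER } fc_tag_t;

    fc_state_t fc_next(fc_state_t s, fc_tag_t head, bool node_wants_item) {
        switch (s) {
        case FC_RECEIVING:              /* normal: items for the active frame */
            if (!node_wants_item)
                return (head == FC_TAG_ITEM) ? FC_DISCARDING : FC_EXPECT_HEADER;
            return (head == FC_TAG_HEADER) ? FC_PADDING : FC_RECEIVING;
        case FC_EXPECT_HEADER:          /* normal: next flit should be a header */
            return (head == FC_TAG_HEADER) ? FC_RECEIVING : FC_DISCARDING;
        case FC_DISCARDING:             /* erroneous: computation ahead of the edge */
            return (head == FC_TAG_HEADER) ? FC_RECEIVING : FC_DISCARDING;
        case FC_PADDING:                /* erroneous: edge ahead of the computation */
            return node_wants_item ? FC_PADDING : FC_EXPECT_HEADER;
        }
        return s;
    }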
CommGuard Placement
A Frame Checker (FC) sits on each filter's incoming edge (between the previous filter and this one), and a Frame Inserter (FI) sits on each outgoing edge (toward the next filter).
Output Quality for Varying MTBEs
• Compare the quality loss of lossy compression (baseline) to that of error-prone decompression
• For a raw image I, its compressed/encoded file E, and decoded files F (error-free) and P (error-prone):
  – Baseline: error-free SNR, measured between the raw image I and the error-free decompressed image F
  – Ours: error-prone SNR, measured between the raw image I and the error-prone decompressed image P
• This study was performed for the MP3 and JPEG decoder benchmarks
  – Widely used
  – Full runs
  – Each experimental setting repeated 10 times
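The slide names the metric but not the formula, so the standard SNR definition is assumed here: both decodes are compared against the raw image I, the error-free decode F giving the baseline and the error-prone decode P giving this work's quality.

    % Assumed standard SNR definitions (not spelled out on the slide)
    \mathrm{SNR}_{\mathrm{baseline}} = 10\log_{10}\frac{\sum_i I_i^2}{\sum_i (I_i - F_i)^2},
    \qquad
    \mathrm{SNR}_{\mathrm{ours}} = 10\log_{10}\frac{\sum_i I_i^2}{\sum_i (I_i - P_i)^2}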