Radiation Reliability Issues in Current and Future Supercomputers
Paolo Rech
September 26th, 2017 – Grenoble, France
Sponsors
HPC reliability importance
Available Accelerators
Modern parallel accelerators (e.g. the Kepler K40 and the Xeon Phi) offer:
- Low cost
- Flexible platform
- High efficiency (low per-thread consumption)
- High computational power and frequency
- Huge amount of resources
- Reliability? What is the error rate?
Titan
Titan (Oak Ridge National Lab) has 18,688 GPUs, so there is a high probability of having a GPU corrupted: the MTBF for detected uncorrectable errors is ~44h*
*(field and experimental data from HPCA'15)
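The ~44h figure is essentially device MTBF divided by device count: with independent failures, the system failure rate scales with the number of GPUs. A minimal sketch of that scaling, using a hypothetical per-GPU MTBF (not Titan field data):

```python
# System MTBF scaling: failures of independent devices add up,
# so the system-level MTBF shrinks roughly linearly with device count.
def system_mtbf(mtbf_per_device_h: float, n_devices: int) -> float:
    return mtbf_per_device_h / n_devices

# Hypothetical per-GPU MTBF of ~100 years (in hours), 18,688 GPUs as in Titan.
per_gpu_mtbf_h = 100 * 365 * 24
print(system_mtbf(per_gpu_mtbf_h, 18688))   # ~47 h for the whole machine
```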
HPC bad stories
Virginia Tech's Advanced Computing facility built a supercomputer called Big Mac in 2003:
● 1,100 Apple Power Mac G5
● Could not boot because of the failure rate
● The Power Mac G5 did not have error-correcting code (ECC) memory
● Big Mac was broken apart and sold online
Jaguar (2009, #1 in the Top500 list):
● 360 terabytes of main memory
● 350 ECC errors per minute
ASCI Q (2002, #2 in the Top500 list):
● Built with AlphaServers
● 7 Teraflops
● Could not run for more than 1h without crashing
● After adding metal side panels it could last 6h before crashing
● The address busses on the microprocessors were unprotected (causing the crashes)
Outline
The origins of the issue:
§ Radiation Effects Essentials
§ Error Criticality in HPC
Understand the issue:
§ Experimental Procedure
§ K40 vs Xeon Phi
Toward the solution of the issue:
§ ECC – ABFT – Duplication
§ Selective Hardening
What's the Plan?
Terrestrial Radiation Environment
Cosmic rays can be energetic enough to pass the Van Allen belts.
Galactic cosmic rays interact with the atmosphere, producing a shower of energetic particles: muons, pions, protons, gamma rays, neutrons.
13 n/(cm²·h) @ sea level*
*JEDEC JESD89A Standard
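To go from neutron flux to an error rate, the usual estimate multiplies the flux by the per-bit sensitive cross section and the number of bits (JESD89A-style SER accounting). A minimal sketch with a hypothetical cross section and memory size, for illustration only:

```python
# SER estimate for a memory array under the terrestrial neutron flux.
# FIT (Failures In Time) = errors per 10^9 device-hours.
flux_n_per_cm2_h = 13.0              # sea-level flux (JESD89A)
cross_section_cm2_per_bit = 1e-14    # hypothetical per-bit neutron cross section
n_bits = 12 * 8 * 2**30              # e.g. 12 GB of GPU memory, in bits

errors_per_hour = flux_n_per_cm2_h * cross_section_cm2_per_bit * n_bits
fit = errors_per_hour * 1e9          # convert to FIT
print(f"{fit:.2e} FIT, i.e. one upset every {1/errors_per_hour:.0f} hours")
```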
Altitude and Radiation
Maximum ionization @ ~13 km above sea level.
(Plot: radiation vs altitude, with LANL marked.)
Radiation Effects - Soft Errors
Soft Errors: the device is not permanently damaged, but the ionizing particle may generate:
• One or more bit-flips: Single Event Upset (SEU) or Multiple Bit Upset (MBU)
• A transient voltage pulse in logic, which can be latched by a flip-flop: Single Event Transient (SET)
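In software, an SEU is often emulated by flipping a single bit of a stored value, which gives a feel for how large the resulting numerical error can be. A minimal, illustrative sketch (not the fault-injection framework used in this work):

```python
import random
import struct

def flip_random_bit(value: float) -> float:
    """Emulate an SEU: flip one random bit in the IEEE-754 encoding of a double."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    bits ^= 1 << random.randrange(64)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

# The corrupted value may differ slightly or by many orders of magnitude.
print(flip_random_bit(1.0))
```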
Silent Data Corruption vs Crash
Soft errors in the data cache, register files, logic gates (ALU), and scheduler cause Silent Data Corruption (SDC).
Soft errors in the instruction cache, scheduler/dispatcher, and PCI-e bus controller cause a DUE (Crash).
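In practice the two categories are told apart by comparing each run against a fault-free golden output and by detecting crashes or hangs. A minimal sketch of that classification logic (names and signature are illustrative):

```python
# Classify the outcome of one run under beam: masked, SDC, or DUE.
def classify_run(output, golden, crashed: bool, timed_out: bool) -> str:
    if crashed or timed_out:
        return "DUE"      # Detected Unrecoverable Error (crash or hang)
    if output != golden:
        return "SDC"      # Silent Data Corruption: wrong result, no error flagged
    return "MASKED"       # fault absorbed, or no fault at all
```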
Radiation Effects on Parallel Accelerators
(Block diagram: a CUDA GPU with its blocks scheduler and dispatcher, SMs, L2 cache, and DRAM, and a Streaming Multiprocessor with its instruction cache, warp schedulers, dispatch units, register file, cores, and shared memory / L1 cache; an X marks each structure a particle strike can corrupt.)
Output Correctness in HPC
A single fault can propagate to several parallel threads, corrupting multiple output elements.
Not all SDCs are critical for HPC applications:
- the error can be within the float intrinsic variance
- values in a given range are accepted as correct in physical simulations
- imprecise computation is being applied to HPC
Goal: quantify and qualify SDCs in NVIDIA and Intel architectures.
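When imprecise computation is acceptable, an output mismatch only counts as a critical SDC if it exceeds the application's tolerance. A minimal sketch of such a check (the tolerance value is an arbitrary placeholder):

```python
def is_critical_sdc(output, golden, rel_tol=1e-6):
    """Treat deviations within the physically meaningful range as correct."""
    return any(abs(o - g) > rel_tol * max(abs(g), 1e-30)
               for o, g in zip(output, golden))

print(is_critical_sdc([1.0000001], [1.0]))  # False: within tolerance
print(is_critical_sdc([2.0],       [1.0]))  # True: critical corruption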
Outline
Understand the issue:
§ Experimental Procedure
§ K40 vs Xeon Phi
Radiation Test Facilities
(Photos: the chips under irradiation and the control electronics.)
Experimental Setup
Radiation Tests are NOT for dummies
What can go (and actually went) wrong:
- Ethernet cable failures
- BIOS checksum errors
- HDD failures
- Linux GRUB failure
- power plug failure (wow, this was risky)
- board boot failure
- GPU fell off the bus (this was funny)
- the MIC (Xeon Phi) is lost
- etc… etc… etc…
- Heather/Sean, can you add something to the list?
GPU Radiation Test Setup
(Photos of the boards under test: SoCs, microcontrollers, FPGAs, GPUs, Flash memories, APUs.)
GPU Radiation Test Setup
Intel Xeon Phi, NVIDIA K40, AMD APU.
The GPU power control circuitry is out of the beam, as are the host desktop PCs.
Neutron Spectrum @ LANSCE
1.8×10⁶ n/(cm²·h) @ LANSCE vs 13 n/(cm²·h) @ NYC.
We test each architecture for 800h, simulating 9.2×10⁸ h of natural radiation (~91,000 years).
All the collected SDCs are publicly available:
https://github.com/UFRGS-CAROL/HPCA2017-log-data
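Each beam-time run is essentially a loop that executes the benchmark, compares the result against a pre-computed golden output, and logs every mismatch with a timestamp, so that FIT rates can later be derived from the beam flux. A minimal sketch of such a logging loop (file name and helpers are illustrative, not the actual CAROL framework code):

```python
import time

def beam_test_loop(run_kernel, golden, log_path="sdc_log.txt"):
    """Run the benchmark repeatedly under beam and log every SDC."""
    iteration = 0
    with open(log_path, "a") as log:
        while True:                               # runs until the board hangs or the shift ends
            iteration += 1
            output = run_kernel()                 # one benchmark execution
            errors = sum(o != g for o, g in zip(output, golden))
            if errors:
                log.write(f"{time.time():.0f} it={iteration} SDC corrupted={errors}\n")
                log.flush()                       # don't lose data if the board crashes
```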
Selected Algorithms
We select a set of benchmarks that:
- stimulate different resources
- are representative of HPC applications
- minimize error masking (high AVF)
Benchmarks:
- DGEMM: matrix multiplication
- lavaMD: particle interactions
- Hotspot: heat simulation
- Needleman–Wunsch: biology
- CLAMR: DOE workload
- Quick, Merge, and Radix Sort
- Matrix Transpose: memory
- Gaussian
Xeon Phi vs K40 SDC Rate
The Xeon Phi error rate seems lower than Kepler's, but:
- the Xeon Phi is built in 3D Tri-Gate, Kepler in planar CMOS
- the Xeon Phi and the K40 have different throughput
(Chart: SDC relative FIT [a.u.], log scale, for Hotspot, CLAMR, lavaMD (inputs 15, 19, 23), and DGEMM (inputs 2¹⁰, 2¹¹, 2¹²) on Xeon Phi and K40; some Xeon Phi points N/A.)
Parallelism Management Reliability
~95% of processor resources are used with the smallest input.
Increasing the input size increases the number of threads:
- the Xeon Phi error rate remains constant (<20% variation)
- the K40 SDC error rate increases with input size
(Charts: relative FIT [a.u.] vs input size for lavaMD (15, 19, 23) and DGEMM (2¹⁰, 2¹¹, 2¹²) on K40 and Xeon Phi.)
Parallelism Management Reliability
K40 – FIT increases with input size:
- the HW scheduler is prone to be corrupted
- the data of 2048 active threads is maintained in the register file
Xeon Phi – constant FIT rate:
- the embedded OS is OK!
- only 4 threads/core are maintained; the other threads' data stays in main memory (not exposed)
Parallelism Management Reliability
K40 throughput increases with input size. The reliability vs performance trade-off should be considered.
(Chart: DGEMM GFlops vs matrix size, 2⁹×2⁹ to 2¹³×2¹³: K40 GFlops rapidly increase, Xeon Phi GFlops stay almost constant.)
Mean Workload Between Failures
Both the error rate and the throughput grow with the number of parallel threads.
Which architecture produces a higher amount of data before experiencing a failure? Is there a sweet spot?
Metric: Mean Workload Between Failures (MWBF).
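MWBF can be read as the amount of useful data produced, on average, between two failures: a higher error rate can still win if throughput grows faster. A minimal sketch of the metric with placeholder numbers (not measured values from these experiments):

```python
def mwbf(data_per_run, time_per_run_h, errors_per_hour):
    """Mean Workload Between Failures: useful data produced per failure."""
    errors_per_run = errors_per_hour * time_per_run_h
    return data_per_run / errors_per_run

# Placeholder numbers: a bigger input raises both throughput and error rate.
small = mwbf(data_per_run=1e6, time_per_run_h=0.010, errors_per_hour=0.02)
large = mwbf(data_per_run=1e8, time_per_run_h=0.500, errors_per_hour=0.08)
print(small, large)   # the sweet spot is wherever this ratio peaks
```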