  1. Radiation Reliability Issues in Current and Future Supercomputers – Paolo Rech, September 26th, 2017, Grenoble, France

  2. Sponsors

  3. HPC reliability importance Paolo Rech – Grenoble, France 2

  4. Available Accelerators
     Modern parallel accelerators offer:
     - Low cost
     - Flexible platform
     - High efficiency (low per-thread consumption)
     - High computational power and frequency
     - Huge amount of resources
     - Reliability? Error rate?
     [Pictured: NVIDIA Kepler K40 and Intel Xeon Phi.]

  7. Titan
     Titan (Oak Ridge National Lab) has 18,688 GPUs: a high probability of having a GPU corrupted.
     Titan's detected-uncorrectable-errors MTBF is ~44 h* (*field and experimental data from HPCA'15).
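The ~44 h figure can be sanity-checked with a simple series-system model, a sketch assuming independent, exponentially distributed failures (the 18,688-GPU count and the ~44 h MTBF are from the slide; the per-GPU MTBF is derived here, not measured):

```python
# Series-system reliability sketch: with independent exponential failures,
# the system failure rate is the sum of per-node rates, so
# system MTBF = per-node MTBF / node count.
N_GPUS = 18_688          # Titan's GPU count (from the slide)
SYSTEM_MTBF_H = 44.0     # detected uncorrectable errors, ~44 h (HPCA'15)

per_gpu_mtbf_h = SYSTEM_MTBF_H * N_GPUS       # implied per-GPU MTBF
per_gpu_mtbf_years = per_gpu_mtbf_h / (24 * 365)

print(f"Implied per-GPU MTBF: {per_gpu_mtbf_h:.3g} h "
      f"(~{per_gpu_mtbf_years:.0f} years)")
```

This is the core of the HPC reliability problem: each individual GPU looks extremely reliable on its own, yet at Titan's scale the machine as a whole sees an uncorrectable error roughly every two days.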

  8. HPC bad stories
     Big Mac (Virginia Tech's Advanced Computing facility, 2003):
     - 1,100 Apple Power Mac G5s
     - couldn't even boot because of the failure rate
     - the Power Mac G5 did not have error-correcting code (ECC) memory
     - Big Mac was broken apart and sold online
     Jaguar (2009, #1 in the Top500 list):
     - 360 terabytes of main memory
     - 350 ECC errors per minute
     ASCI Q (2002, #2 in the Top500 list):
     - built with AlphaServers, 7 teraflops
     - couldn't run more than 1 h without crashing
     - after adding metal side panels it could last 6 h before crashing
     - the address buses on the microprocessors were unprotected (causing the crashes)

  9. Outline
     The origins of the issue:
     § Radiation Effects Essentials
     § Error Criticality in HPC
     Understand the issue:
     § Experimental Procedure
     § K40 vs Xeon Phi
     Toward the solution of the issue:
     § ECC – ABFT – Duplication
     § Selective Hardening
     What's the plan?

  11. Terrestrial Radiation Environment
     Cosmic rays can be energetic enough to pass through the Van Allen belts.
     Galactic cosmic rays interact with the atmosphere, producing a shower of energetic particles: muons, pions, protons, gamma rays, and neutrons.
     Flux: 13 n/(cm^2·h) at sea level* (*JEDEC JESD89A standard).

  13. Altitude and Radiation
     Maximum ionization occurs at ~13 km above sea level.
     [Figure: radiation intensity vs altitude; the slide build highlights LANL.]

  15. Radiation Effects – Soft Errors
     Soft errors: the device is not permanently damaged, but an ionizing particle may generate:
     - one or more bit-flips (0→1, 1→0): Single Event Upset (SEU) or Multiple Bit Upset (MBU)
     - a transient voltage pulse propagating through logic into a flip-flop: Single Event Transient (SET)
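A single-bit SEU can be mimicked in software by flipping one bit of a value's underlying representation, a minimal fault-injection sketch (real SEUs of course strike hardware state, not Python objects):

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of the IEEE-754 double representation of x,
    mimicking a Single Event Upset (SEU) in a register or memory cell."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped

x = 1.0
print(flip_bit(x, 0))    # low mantissa bit: tiny perturbation
print(flip_bit(x, 62))   # high exponent bit: catastrophic corruption
```

The same upset has wildly different consequences depending on which bit it hits, which is exactly why not all SDCs are equally critical.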

  16. Silent Data Corruption vs Crash
     Soft errors in the data cache, register files, logic gates (ALU), or scheduler → Silent Data Corruption (SDC).
     Soft errors in the instruction cache, scheduler/dispatcher, or PCI-e bus controller → Detected Unrecoverable Error (DUE, i.e. a crash).

  17. Radiation Effects on Parallel Accelerators
     [Block diagram of a CUDA GPU: blocks scheduler and dispatcher, L2 cache, and DRAM at chip level; each Streaming Multiprocessor (SM) contains an instruction cache, warp schedulers, dispatch units, a register file, cores, and shared memory / L1 cache. "X" marks the structures exposed to radiation.]

  18. Output Correctness in HPC
     A single fault can propagate to several parallel threads, corrupting multiple output elements.
     Not all SDCs are critical for HPC applications:
     - the error can fall within the floats' intrinsic variance
     - values in a given range are accepted as correct in physical simulations
     - imprecise computing is being applied to HPC
     Goal: quantify and qualify SDCs in NVIDIA and Intel architectures.
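The "values in a given range are accepted as correct" criterion can be sketched as a relative-tolerance check against a fault-free golden output (an illustrative classifier; the actual threshold and comparison rule are application-specific):

```python
import math

def classify_sdc(golden, observed, rel_tol=1e-4):
    """Compare a possibly corrupted output against the fault-free
    'golden' result. Mismatches within rel_tol fall inside the
    application's intrinsic variance (tolerable); larger ones are
    critical SDCs."""
    critical = tolerable = 0
    for g, o in zip(golden, observed):
        if g == o:
            continue
        if math.isclose(g, o, rel_tol=rel_tol):
            tolerable += 1
        else:
            critical += 1
    return critical, tolerable

golden   = [1.00000, 2.00000, 3.00000]
observed = [1.00001, 2.00000, 7.00000]   # one tiny drift, one bad value
print(classify_sdc(golden, observed))
```

With this kind of classifier, the raw SDC count can be split into errors that actually matter for the simulation and those a physical application would absorb.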

  21. Outline
     The origins of the issue:
     § Radiation Effects Essentials
     § Error Criticality in HPC
     Understand the issue:
     § Experimental Procedure
     § K40 vs Xeon Phi
     Toward the solution of the issue:
     § ECC – ABFT – Duplication
     § Selective Hardening
     What's the plan?

  22. Radiation Test Facilities
     [Photos: irradiation of chips and full electronic boards.]

  23. Experimental Setup

  24. Radiation Tests are NOT for Dummies
     What can (and actually did) go wrong:
     - Ethernet cable failures
     - BIOS checksum error
     - HDD failures
     - Linux GRUB failure
     - power plug failure (wow, this was risky)
     - board boot failure
     - GPU fell off the bus (this was funny)
     - mic is lost
     - etc., etc., etc.
     - Heather/Sean, can you add something to the list?

  25. GPU Radiation Test Setup
     [Photos of devices under test: SoCs, microcontrollers, FPGAs, a GPU, flash memory, and an APU.]

  26. GPU Radiation Test Setup
     Devices under test: Intel Xeon Phi, NVIDIA K40, AMD APU.
     The power control circuitry and the host desktop PCs are kept out of the beam.

  27. Neutron Spectrum @ LANSCE
     Flux: 1.8x10^6 n/(cm^2·h) at LANSCE vs 13 n/(cm^2·h) in NYC.
     We test each architecture for 800 h, simulating 9.2x10^8 h of natural radiation (~91,000 years).
     All the collected SDCs are publicly available: https://github.com/UFRGS-CAROL/HPCA2017-log-data
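The accelerated-test equivalence follows from the ratio between the beam flux and the natural flux. A sketch using the slide's numbers for a single device (the deck's 9.2x10^8 h total presumably aggregates several boards and benchmarks irradiated in parallel, which is an assumption here):

```python
BEAM_FLUX  = 1.8e6   # n/(cm^2 * h) at LANSCE (from the slide)
NYC_FLUX   = 13.0    # n/(cm^2 * h) at sea level (JEDEC JESD89A)
TEST_HOURS = 800.0   # beam time per architecture

accel = BEAM_FLUX / NYC_FLUX            # acceleration factor
natural_hours = accel * TEST_HOURS      # equivalent field exposure, one device

print(f"Acceleration factor: {accel:,.0f}x")
print(f"One device, 800 h under beam ~ {natural_hours:.2e} h "
      f"(~{natural_hours / (24 * 365):,.0f} years) of natural exposure")
```

The ~138,000x acceleration is what makes beam testing practical: errors that would take decades to observe in the field show up within hours at the facility.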

  29. Selected Algorithms
     We select a set of benchmarks that:
     - stimulate different resources
     - are representative of HPC applications
     - minimize error masking (high AVF)
     Benchmarks:
     - DGEMM: matrix multiplication
     - lavaMD: particle interactions
     - Hotspot: heat simulation
     - Needleman–Wunsch: biology
     - CLAMR: DOE workload
     - Quick-, Merge-, Radix-Sort
     - Matrix Transpose: memory
     - Gaussian

  30. Xeon Phi vs K40 SDC Rate
     Xeon Phi's error rate seems lower than Kepler's, but:
     - Xeon Phi is built in 3D tri-gate, Kepler in planar CMOS
     - Xeon Phi and K40 have different throughput
     [Log-scale chart: relative SDC FIT [a.u.] on Xeon Phi and K40 for Hotspot, CLAMR, lavaMD (inputs 15, 19, 23), and DGEMM (inputs 2^10, 2^11, 2^12); one configuration is N/A.]

  31. Parallelism Management Reliability
     ~95% of processor resources are used with the smallest input. Increasing the input size increases the number of threads:
     - the Xeon Phi error rate remains constant (<20% variation)
     - the K40 SDC error rate increases with input size
     [Charts: relative FIT [a.u.] vs input size for lavaMD (15, 19, 23) and DGEMM (2^10, 2^11, 2^12) on K40 and Xeon Phi.]

  32. Parallelism Management Reliability
     K40 – FIT increases with input size: the HW scheduler is prone to being corrupted, and the data of 2,048 active threads is maintained in the register file.
     Xeon Phi – constant FIT rate: the embedded OS is OK! Only 4 threads/core are maintained; the other threads' data stays in main memory (not exposed).

  33. Parallelism Management Reliability
     K40 throughput increases rapidly with input size, while Xeon Phi GFlops remain almost constant.
     The reliability vs performance trade-off should be considered.
     [Chart: DGEMM GFlops vs matrix size (2^9 x 2^9 to 2^13 x 2^13) for K40 and Xeon Phi.]

  34. Mean Workload Between Failures
     MWBF combines the error rate, the throughput, and the number of parallel threads into a single figure of merit.
     Which architecture produces a higher amount of data before experiencing a failure? Is there a sweet spot?
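The MWBF metric folds error rate and throughput together. A minimal sketch of one plausible formulation (the exact definition used in the talk is not shown on the slide, so treat this formula as an assumption):

```python
def mwbf(workload_per_run: float, exec_time_h: float, fit: float) -> float:
    """Mean Workload Between Failures: data produced before a failure.

    workload_per_run -- useful output data per execution (e.g. bytes)
    exec_time_h      -- execution time of one run, in hours
    fit              -- error rate in FIT (failures per 1e9 device-hours)

    Assumed model: MTBF = 1e9 / FIT hours; the device completes
    MTBF / exec_time_h runs between failures, each producing
    workload_per_run of data.
    """
    mtbf_h = 1e9 / fit
    return workload_per_run * (mtbf_h / exec_time_h)

# A faster device can tolerate a higher error rate and still win on MWBF:
print(mwbf(100.0, 1.0, 1e9))   # baseline device
print(mwbf(100.0, 0.5, 1.5e9)) # 2x throughput, 1.5x error rate
```

Under this model the "sweet spot" question becomes concrete: increasing parallelism helps MWBF as long as the throughput gain outpaces the error-rate growth.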
