Radiation Reliability Issues in Current and Future Supercomputers
Paolo Rech
September 26th, 2017 – Grenoble, France
Sponsors
HPC reliability importance
Available Accelerators
Modern parallel accelerators (e.g. the Kepler K40 and the Xeon Phi) offer:
- Low cost
- Flexible platform
- High efficiency (low per-thread consumption)
- High computational power and frequency
- Huge amount of resources
- Reliability? What is the error rate?
Titan
Titan (Oak Ridge National Lab) has 18,688 GPUs, so there is a high probability of having a GPU corrupted: the MTBF for detected uncorrectable errors is ~44h*
*(field and experimental data from HPCA'15)
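The ~44h figure is essentially device MTBF divided by device count: with independent failures, the system failure rate scales with the number of GPUs. A minimal sketch of that scaling, using a hypothetical per-GPU MTBF (not Titan field data):

```python
# System MTBF scaling: failures of independent devices add up,
# so the system-level MTBF shrinks roughly linearly with device count.
def system_mtbf(mtbf_per_device_h: float, n_devices: int) -> float:
    return mtbf_per_device_h / n_devices

# Hypothetical per-GPU MTBF of ~100 years (in hours), 18,688 GPUs as in Titan.
per_gpu_mtbf_h = 100 * 365 * 24
print(system_mtbf(per_gpu_mtbf_h, 18688))   # ~47 h for the whole machine
```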
HPC bad stories
Virginia Tech's Advanced Computing facility built a supercomputer called Big Mac in 2003:
● 1,100 Apple Power Mac G5
● Could not boot because of the failure rate
● The Power Mac G5 did not have error-correcting code (ECC) memory
● Big Mac was broken apart and sold online
Jaguar (2009, #1 in the Top500 list):
● 360 terabytes of main memory
● 350 ECC errors per minute
ASCI Q (2002, #2 in the Top500 list):
● Built with AlphaServers
● 7 Teraflops
● Could not run for more than 1h without crashing
● After adding metal side panels it could last 6h before crashing
● The address busses on the microprocessors were unprotected (causing the crashes)
Outline
The origins of the issue:
§ Radiation Effects Essentials
§ Error Criticality in HPC
Understand the issue:
§ Experimental Procedure
§ K40 vs Xeon Phi
Toward the solution of the issue:
§ ECC – ABFT – Duplication
§ Selective Hardening
What's the Plan?
Terrestrial Radiation Environment
Cosmic rays can be energetic enough to pass the Van Allen belts.
Galactic cosmic rays interact with the atmosphere, producing a shower of energetic particles: muons, pions, protons, gamma rays, neutrons.
13 n/(cm²·h) @ sea level*
*JEDEC JESD89A Standard
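To go from neutron flux to an error rate, the usual estimate multiplies the flux by the per-bit sensitive cross section and the number of bits (JESD89A-style SER accounting). A minimal sketch with a hypothetical cross section and memory size, for illustration only:

```python
# SER estimate for a memory array under the terrestrial neutron flux.
# FIT (Failures In Time) = errors per 10^9 device-hours.
flux_n_per_cm2_h = 13.0              # sea-level flux (JESD89A)
cross_section_cm2_per_bit = 1e-14    # hypothetical per-bit neutron cross section
n_bits = 12 * 8 * 2**30              # e.g. 12 GB of GPU memory, in bits

errors_per_hour = flux_n_per_cm2_h * cross_section_cm2_per_bit * n_bits
fit = errors_per_hour * 1e9          # convert to FIT
print(f"{fit:.2e} FIT, i.e. one upset every {1/errors_per_hour:.0f} hours")
```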
Altitude and Radiation
Maximum ionization @ ~13 km above sea level.
(Plot: radiation vs altitude, with LANL marked.)
Radiation Effects - Soft Errors
Soft Errors: the device is not permanently damaged, but the ionizing particle may generate:
• One or more bit-flips: Single Event Upset (SEU) or Multiple Bit Upset (MBU)
• A transient voltage pulse in logic, which can be latched by a flip-flop: Single Event Transient (SET)
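In software, an SEU is often emulated by flipping a single bit of a stored value, which gives a feel for how large the resulting numerical error can be. A minimal, illustrative sketch (not the fault-injection framework used in this work):

```python
import random
import struct

def flip_random_bit(value: float) -> float:
    """Emulate an SEU: flip one random bit in the IEEE-754 encoding of a double."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    bits ^= 1 << random.randrange(64)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

# The corrupted value may differ slightly or by many orders of magnitude.
print(flip_random_bit(1.0))
```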
Silent Data Corruption vs Crash
Soft errors in the data cache, register files, logic gates (ALU), and scheduler cause Silent Data Corruption (SDC).
Soft errors in the instruction cache, scheduler/dispatcher, and PCI-e bus controller cause a DUE (Crash).
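In practice the two categories are told apart by comparing each run against a fault-free golden output and by detecting crashes or hangs. A minimal sketch of that classification logic (names and signature are illustrative):

```python
# Classify the outcome of one run under beam: masked, SDC, or DUE.
def classify_run(output, golden, crashed: bool, timed_out: bool) -> str:
    if crashed or timed_out:
        return "DUE"      # Detected Unrecoverable Error (crash or hang)
    if output != golden:
        return "SDC"      # Silent Data Corruption: wrong result, no error flagged
    return "MASKED"       # fault absorbed, or no fault at all
```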
Radiation Effects on Parallel Accelerators
(Block diagram: a CUDA GPU with its blocks scheduler and dispatcher, SMs, L2 cache, and DRAM, and a Streaming Multiprocessor with its instruction cache, warp schedulers, dispatch units, register file, cores, and shared memory / L1 cache; an X marks each structure a particle strike can corrupt.)
Output Correctness in HPC
A single fault can propagate to several parallel threads, corrupting multiple output elements.
Not all SDCs are critical for HPC applications:
- the error can be within the float intrinsic variance
- values in a given range are accepted as correct in physical simulations
- imprecise computation is being applied to HPC
Goal: quantify and qualify SDCs in NVIDIA and Intel architectures.
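When imprecise computation is acceptable, an output mismatch only counts as a critical SDC if it exceeds the application's tolerance. A minimal sketch of such a check (the tolerance value is an arbitrary placeholder):

```python
def is_critical_sdc(output, golden, rel_tol=1e-6):
    """Treat deviations within the physically meaningful range as correct."""
    return any(abs(o - g) > rel_tol * max(abs(g), 1e-30)
               for o, g in zip(output, golden))

print(is_critical_sdc([1.0000001], [1.0]))  # False: within tolerance
print(is_critical_sdc([2.0],       [1.0]))  # True: critical corruption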
Outline
Understand the issue:
§ Experimental Procedure
§ K40 vs Xeon Phi
Radiation Test Facilities
(Photos: the chips under irradiation and the control electronics.)
Experimental Setup
Radiation Tests are NOT for dummies
What can go (and actually went) wrong:
- Ethernet cable failures
- BIOS checksum errors
- HDD failures
- Linux GRUB failure
- power plug failure (wow, this was risky)
- board boot failure
- GPU fell off the bus (this was funny)
- the MIC (Xeon Phi) is lost
- etc… etc… etc…
- Heather/Sean, can you add something to the list?
GPU Radiation Test Setup
(Photos of the boards under test: SoCs, microcontrollers, FPGAs, GPUs, Flash memories, APUs.)
GPU Radiation Test Setup
Intel Xeon Phi, NVIDIA K40, AMD APU.
The GPU power control circuitry is out of the beam, as are the host desktop PCs.
Neutron Spectrum @ LANSCE
1.8×10⁶ n/(cm²·h) @ LANSCE vs 13 n/(cm²·h) @ NYC.
We test each architecture for 800h, simulating 9.2×10⁸ h of natural radiation (~91,000 years).
All the collected SDCs are publicly available:
https://github.com/UFRGS-CAROL/HPCA2017-log-data
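Each beam-time run is essentially a loop that executes the benchmark, compares the result against a pre-computed golden output, and logs every mismatch with a timestamp, so that FIT rates can later be derived from the beam flux. A minimal sketch of such a logging loop (file name and helpers are illustrative, not the actual CAROL framework code):

```python
import time

def beam_test_loop(run_kernel, golden, log_path="sdc_log.txt"):
    """Run the benchmark repeatedly under beam and log every SDC."""
    iteration = 0
    with open(log_path, "a") as log:
        while True:                               # runs until the board hangs or the shift ends
            iteration += 1
            output = run_kernel()                 # one benchmark execution
            errors = sum(o != g for o, g in zip(output, golden))
            if errors:
                log.write(f"{time.time():.0f} it={iteration} SDC corrupted={errors}\n")
                log.flush()                       # don't lose data if the board crashes
```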
Selected Algorithms
We select a set of benchmarks that:
- stimulate different resources
- are representative of HPC applications
- minimize error masking (high AVF)
Benchmarks:
- DGEMM: matrix multiplication
- lavaMD: particle interactions
- Hotspot: heat simulation
- Needleman–Wunsch: biology
- CLAMR: DOE workload
- Quick, Merge, and Radix Sort
- Matrix Transpose: memory
- Gaussian
Xeon Phi vs K40 SDC Rate
The Xeon Phi error rate seems lower than Kepler's, but:
- the Xeon Phi is built in 3D Tri-Gate, Kepler in planar CMOS
- the Xeon Phi and the K40 have different throughput
(Chart: SDC relative FIT [a.u.], log scale, for Hotspot, CLAMR, lavaMD (inputs 15, 19, 23), and DGEMM (inputs 2¹⁰, 2¹¹, 2¹²) on Xeon Phi and K40; some Xeon Phi points N/A.)
Parallelism Management Reliability
~95% of processor resources are used with the smallest input.
Increasing the input size increases the number of threads:
- the Xeon Phi error rate remains constant (<20% variation)
- the K40 SDC error rate increases with input size
(Charts: relative FIT [a.u.] vs input size for lavaMD (15, 19, 23) and DGEMM (2¹⁰, 2¹¹, 2¹²) on K40 and Xeon Phi.)
Parallelism Management Reliability
K40 – FIT increases with input size:
- the HW scheduler is prone to be corrupted
- the data of 2048 active threads is maintained in the register file
Xeon Phi – constant FIT rate:
- the embedded OS is OK!
- only 4 threads/core are maintained; the other threads' data stays in main memory (not exposed)
Parallelism Management Reliability
K40 throughput increases with input size. The reliability vs performance trade-off should be considered.
(Chart: DGEMM GFlops vs matrix size, 2⁹×2⁹ to 2¹³×2¹³: K40 GFlops rapidly increase, Xeon Phi GFlops stay almost constant.)
Mean Workload Between Failures
Both the error rate and the throughput grow with the number of parallel threads.
Which architecture produces a higher amount of data before experiencing a failure? Is there a sweet spot?
Metric: Mean Workload Between Failures (MWBF).
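MWBF can be read as the amount of useful data produced, on average, between two failures: a higher error rate can still win if throughput grows faster. A minimal sketch of the metric with placeholder numbers (not measured values from these experiments):

```python
def mwbf(data_per_run, time_per_run_h, errors_per_hour):
    """Mean Workload Between Failures: useful data produced per failure."""
    errors_per_run = errors_per_hour * time_per_run_h
    return data_per_run / errors_per_run

# Placeholder numbers: a bigger input raises both throughput and error rate.
small = mwbf(data_per_run=1e6, time_per_run_h=0.010, errors_per_hour=0.02)
large = mwbf(data_per_run=1e8, time_per_run_h=0.500, errors_per_hour=0.08)
print(small, large)   # the sweet spot is wherever this ratio peaks
```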