How to Deal with Radiation: Evaluation and Mitigation of GPUs Soft-Errors
Paolo Rech
April 6th 2015 – San José, CA
Motivation: Automotive Applications
- Pedestrian detection system: embedded GPUs increase car safety
- [Figure label: Observed error]
- "The insurance does not cover those accidents caused by: [...] exposure to ionizing radiation" (from Paolo's car insurance)
Paolo Rech – GTC2016, San José, CA
Motivation: HPC Industry
- Titan (Oak Ridge National Lab): 18,688 GPUs, hence a high probability of having a GPU corrupted
- Titan MTBF is ~44 h (field data from Tiwari et al., HPCA '15)
- The field data considers only crashes and hangs, because the correct output is unknown
- We perform radiation experiments to measure Silent Data Corruption (SDC) rates
Outline
- Radiation Effects Essentials
- Evaluation of GPU Radiation Sensitivity
  - Experimental Setup
  - Parallel Algorithms Error Rates
- Hardening Solution Efficiency
- Codes Optimizations Effects on HPC Reliability
- What's the Plan?
Terrestrial Radiation Environment
- Galactic cosmic rays interact with the atmosphere, producing a shower of energetic particles: muons, pions, protons, gamma rays, neutrons
- Neutron flux at sea level: 13 n/(cm² h)
- Neutron flux increases exponentially with altitude
Radiation Effects - Soft Errors
Soft errors: the device is not permanently damaged, but an ionizing particle may generate:
- One or more bit-flips (0 → 1 or 1 → 0): Single Event Upset (SEU) or Multiple Bit Upset (MBU)
- A transient voltage pulse in logic, which can propagate to a flip-flop (FF): Single Event Transient (SET)
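To make the effect of a single bit-flip concrete, here is a minimal Python sketch that inverts one bit of a value stored as an IEEE-754 float32. The helper name `flip_bit_float32` is hypothetical and not part of the talk; it only illustrates that the damage done by an upset depends strongly on which bit is hit.

```python
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value`, stored as a float32, with one bit inverted.

    Bit 0 is the least significant mantissa bit, bit 31 is the sign bit.
    This is an illustrative model of a Single Event Upset, not a fault
    injector used in the experiments.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 as a 32-bit unsigned integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    raw ^= 1 << bit  # invert exactly one bit
    # Reinterpret the modified integer as a float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw))
    return flipped


print(flip_bit_float32(1.0, 0))   # lowest mantissa bit: tiny relative error
print(flip_bit_float32(1.0, 23))  # lowest exponent bit: 0.5
print(flip_bit_float32(1.0, 30))  # highest exponent bit: inf
print(flip_bit_float32(1.0, 31))  # sign bit: -1.0
```

A flip in a low mantissa bit may be numerically negligible, while a flip in the exponent or sign changes the result completely.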
Radiation Effects on GPUs
[Figure: block diagram of a CUDA GPU (Blocks Scheduler and Dispatcher, array of SMs, L2 Cache, DRAM) and of one Streaming Multiprocessor (Instruction Cache, Warp Schedulers, Dispatch Units, Register File, cores, Shared Memory / L1 Cache). Successive builds of the slide mark individual resources with an X.]
Silent Data Corruption vs Crash & Hang
- Errors in the data cache, register files, logic gates (ALU), or scheduler: Silent Data Corruption
- Errors in the instruction cache, scheduler / dispatcher, or PCI-e bus controller: Crash & Hang
Radiation Test Facilities
[Figure label: Weapon Nuclear Research]
Neutron Spectrum
- Flux at LANSCE: 1.8 × 10⁹ n/(cm² h); flux at NYC: 13 n/(cm² h)
- cross section [cm²] = (errors/s) / (flux [n/(cm² s)])
- The cross section is the probability for 1 neutron to generate an output error
- Error rate = cross section × flux (13 n/(cm² h) at NYC)
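The two relations above can be sketched in a few lines of Python. The flux constants come from the slide; the function names and the example run (50 errors in 1 hour of beam) are hypothetical and chosen only for illustration. The sketch reports the result in FIT, the unit used in the results chart, taken here in its usual sense of failures per 10⁹ hours of operation.

```python
NYC_FLUX = 13.0      # n/(cm^2 h), sea-level reference flux from the slide
LANSCE_FLUX = 1.8e9  # n/(cm^2 h), accelerated beam flux from the slide


def cross_section(errors: int, beam_flux: float, beam_hours: float) -> float:
    """Cross section in cm^2: observed errors divided by the neutron fluence."""
    fluence = beam_flux * beam_hours  # n/cm^2 received during the run
    return errors / fluence


def fit_rate(sigma_cm2: float, flux: float = NYC_FLUX) -> float:
    """Error rate in FIT: cross section x flux, scaled to 1e9 hours."""
    return sigma_cm2 * flux * 1e9


def acceleration_factor(beam_flux: float = LANSCE_FLUX,
                        natural_flux: float = NYC_FLUX) -> float:
    """Hours of natural exposure represented by one hour in the beam."""
    return beam_flux / natural_flux


# Hypothetical run, not a measurement from the talk.
sigma = cross_section(errors=50, beam_flux=LANSCE_FLUX, beam_hours=1.0)
print(f"cross section: {sigma:.3e} cm^2")
print(f"error rate at NYC: {fit_rate(sigma):.1f} FIT")
print(f"acceleration factor: {acceleration_factor():.3e}")
```

Because the cross section is normalized by fluence, the same code applies to any beam intensity or run length, as long as the error count and the exposure refer to the same run.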
GPU Radiation Test Setup
[Figure labels, boards in the beam line: SoC, microcontrollers, SoC, FPGA, GPU, FPGA, Flash, APU]
GPU Radiation Test Setup (continued)
- Devices under test: Intel Xeon-Phi, AMD APU, NVIDIA K20
- GPU power control circuitry is out of beam
- [Figure label: desktop PCs]
Tested Parallel Codes
- Matrix Multiplication (linear algebra)
- Matrix Transpose (memory)
- FFT (signal processing)
- Needleman-Wunsch (biology)
- lavaMD (physical simulations)
- Hotspot (physical simulations)
- HOG (pedestrian detection)
The selected algorithms are heterogeneous and representative
Experimental Results (ECC OFF)
- SDC rate varies by ~3 orders of magnitude across codes (details in Oliveira et al., Trans. Comp. 2015)
- Chart annotations: execution dominated by memory latencies; codes that heavily employ registers; higher #instructions
- Matrix Multiplication: 6.46 × 10² FIT, 1 error every 15 years
- Titan: 18,688 errors every 15 years (1 error every 7.3 h)
[Figure: Failure In Time @NYC, log scale from 1 to 10000, Crashes and SDC series for NW, lavaMD, Hotspot, MxM, MTrans, FFT]
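The step from a per-GPU error interval to a machine-wide one is a simple division. The sketch below assumes that all GPUs are identical and fail independently at a constant rate. That assumption is mine for illustration, as are the function name and the 365.25-day year.

```python
HOURS_PER_YEAR = 24 * 365.25  # assumption: average year length


def fleet_hours_between_errors(years_per_device: float, n_devices: int) -> float:
    """Mean hours between errors across a fleet of identical, independent devices.

    If one device sees one error every `years_per_device` years, then
    `n_devices` devices together see `n_devices` errors in that same period.
    """
    if n_devices <= 0:
        raise ValueError("n_devices must be positive")
    return years_per_device * HOURS_PER_YEAR / n_devices


# Figures quoted on the slide: 1 error every 15 years per GPU, 18,688 GPUs.
hours = fleet_hours_between_errors(years_per_device=15, n_devices=18_688)
print(f"about {hours:.1f} h between errors across the machine")
```

With the rounded 15-year figure this gives roughly 7 hours, close to the 7.3 h quoted on the slide. The small gap presumably comes from rounding the per-GPU interval.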