The trade-off between energy consumption and dependability
Johan Karlsson
Department of Computer Science and Engineering
Chalmers University of Technology
Göteborg, Sweden

Trade-offs in Computer System Design
[Figure: trade-off diagram; recoverable labels: Cost, Dependability and Security]
Johan Karlsson Energy-aware computing
Layered fault tolerance
[Figure: three lines of defense between faults and system failures, with the side label "Cost balancing"]
• Faults: SW design faults, HW design faults, physical faults
• 1st line of defense: hardware mechanisms (error corrected; otherwise detected or undetected error)
• 2nd line of defense: software mechanisms (error corrected; otherwise processor failure modes: timing failure, value failure, bounded failure, fail silent, fail signal)
• 3rd line of defense: system mechanisms (error corrected; otherwise system failure modes: catastrophic failure, benign failure, safe shutdown)

Outline
• Trends in integrated circuit reliability
• HP NonStop Advanced Architecture
  – Traditional approach to fault tolerance in high-end servers
• IBM Power7 processor
  – Energy control
  – Chip-level fault tolerance
• Software implemented hardware fault tolerance
• Final remarks
Transistor variability and degradation
Shekhar Borkar, Intel Corp: “As technology scales, variability in transistor performance will continue to increase, making transistors less and less reliable. … Finding solutions to these challenges will require a concerted effort on the part of all the players in a system design.”
Borkar, S., "Designing reliable systems from unreliable components: the challenges of transistor variability and degradation," IEEE Micro, December 2005.

Trends in the bathtub curve
[Figure: failure rate versus time with three phases (infant mortality, constant failure rate, wear out); time axis marked at 1 – 20 weeks and 3 – 10 years]
• Infant mortality: Increasing manufacturing defects
• Constant failure rate: Increasing rate of transient, intermittent and permanent faults
• Wearout: Acceleration of aging phenomena
Source: Vikas Chandra, ARM R&D, "Dependable Design in Nanoscale CMOS Technologies: Challenges and Solutions," keynote address, WDSN, Estoril, Portugal, June 29, 2009
Sources of transistor failures
• Process variations (intermittent and permanent faults)
  – Random variations related to lithography, etching, dopant count
  – Voltage and temperature variations
• Wear out effects (intermittent and permanent faults)
  – NBTI - negative bias temperature instability
  – HCI - hot carrier injection
  – Gate oxide breakdown
  – Electromigration
  – …
• Ionizing particle radiation (mostly transient faults)
  – Cosmic neutrons, alpha particles, muons, …
  – Soft errors (single event upsets): no permanent damage
  – Hard errors (permanent faults): permanent damage
[Figure: Electromigration]

Gate oxide breakdowns
• Gate oxide breakdowns increase leakage currents and change electrical characteristics of transistors
[Figure: Gate oxide scaling. Gate oxide in 90 nm technology, thickness: 5 atom layers. Source: Intel 2005]
Development of Gate-Oxide Breakdown
[Figure: development of gate-oxide breakdown]

Single Event Effects (SEE)
Disturbance caused by a single ionizing particle
Types of SEEs:
• Upset (SEU) – change in logic state (bit-flip) by direct hit in memory element, e.g., flip-flop or SRAM cell
• Transient (SET) – voltage pulse in combinational network, may lead to single bit or multiple bit upset
• Latchup (SEL) – triggering of parasitic pnpn structure
• Burnout (SEB) of high voltage device, e.g., power transistor
Soft Errors
• Soft errors (or single event upsets) are particle induced upsets (bit-flips)
• Caused by highly energetic particles such as neutrons, protons and muons
[Figure: Particle strike in n-channel MOSFET transistor, showing the particle trajectory, source, drain, poly-Si gate, SiO2 gate oxide, n+ regions, depletion region and p-type Si substrate; bit-flips in an SRAM cell]

Flux of cosmic ray-induced high-energy neutrons
– The neutron flux is influenced by latitude, longitude, altitude, atmospheric pressure, and solar activity
– Reference point: New York City, sea-level, medium solar activity
  • Total flux at NYC is 12.9 cm⁻² h⁻¹ for neutron energies > 10 MeV
  • Roughly 10 times higher at an altitude of 3000 meters
– The neutron flux at a specific location can be calculated at http://www.seutest.com
– More information can be found in the JEDEC Standard: JESD89A - Measurement and Reporting of Alpha Particle and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices (October, 2006)
Soft error rate trend for SRAM & Flip-Flops
(Radiation test data from Sun Microsystems)
[Figure: soft error rate trend for SRAM and flip-flops]
1 FIT = 10⁻⁹ faults per hour
Source: A. Dixit, R. Heald, and A. Wood, “Trends from Ten Years of Soft Error Experimentation,” SELSE'09, Stanford, CA, USA.

Raw soft error rate trend for microprocessors
(Data from Sun Microsystems)

Technology   Year         Relative SEU rate   Uncorrected       Relative SEU rate /
node (nm)    introduced   in FITs/kbit        Mbits/processor   microprocessor
250          1998         3.2                 1.52              5.0
180          1999         3.0                 1.52              4.3
130          2000         2.4                 3.28              7.9
90           2002         1.0                 33.6              33.6
65           2006         0.7                 44.3              30.5
40           2008         0.94                71                67

1 FIT = 10⁻⁹ faults per hour
Source: A. Dixit, R. Heald, and A. Wood, “The Impact of New Technology on Soft Error Rates,” SELSE-6, Stanford, CA, USA, 2010
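Note on the microprocessor table: the last column appears to be roughly the product of the per-kbit rate and the amount of uncorrected memory per processor. That relationship is an observation about the printed numbers, not something the source states, so the sketch below only checks how closely the product matches the published column. The function names are illustrative.

```python
# Rows copied from the table above:
# (node_nm, year, relative SEU rate per kbit, uncorrected Mbits/processor,
#  published relative SEU rate per microprocessor)
ROWS = [
    (250, 1998, 3.2, 1.52, 5.0),
    (180, 1999, 3.0, 1.52, 4.3),
    (130, 2000, 2.4, 3.28, 7.9),
    (90, 2002, 1.0, 33.6, 33.6),
    (65, 2006, 0.7, 44.3, 30.5),
    (40, 2008, 0.94, 71, 67),
]


def estimated_processor_rate(rel_rate_per_kbit, mbits):
    """ASSUMPTION: per-processor rate scales as per-kbit rate times memory size."""
    return rel_rate_per_kbit * mbits


def relative_deviation(estimate, published):
    """Fractional difference between the estimate and the published value."""
    return abs(estimate - published) / published


for node, year, per_kbit, mbits, published in ROWS:
    est = estimated_processor_rate(per_kbit, mbits)
    dev = relative_deviation(est, published)
    print(f"{node:>4} nm ({year}): product={est:6.2f}  published={published:5.1f}  "
          f"deviation={dev:.1%}")

# Growth of the published per-processor rate from the 250 nm to the 40 nm node.
growth = ROWS[-1][4] / ROWS[0][4]
print(f"Per-processor relative SEU rate grew by a factor of {growth:.1f}")
```

The point the table makes survives the check: the per-kbit rate falls with scaling, but the amount of uncorrected memory per processor grows faster, so the per-processor rate rises.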
Circuit wear out
[Figure from: Keane, J.; Kim, C.H., "An odometer for CPUs," IEEE Spectrum, May 2011]

HP’s NonStop Computer Systems
• Highly available computers for on-line transaction processing (OLTP) systems
• Typical applications:
  – Automatic teller machines, stock trading, funds transfer, 911 emergency centers, medical records, travel and hotel reservations, etc.
• Availability: 0.99999
  – “five nines”, or 5 min downtime per year
• Data integrity: 1 FIT = 10⁻⁹ undetected errors per hour (one undetected data error per billion hours)
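The two NonStop targets above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper names are illustrative, a 365-day year is assumed, and the FIT conversion assumes a constant error rate.

```python
HOURS_PER_YEAR = 365 * 24            # assumption: non-leap year
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60


def downtime_minutes_per_year(availability):
    """Expected downtime per year for a given steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def mean_hours_between_events(fit):
    """Mean time between events for a rate in FIT (events per 10^9 hours).

    ASSUMPTION: constant event rate, so the mean interval is 1 / rate.
    """
    return 1e9 / fit


five_nines = downtime_minutes_per_year(0.99999)
print(f"Five nines: about {five_nines:.2f} minutes of downtime per year")

hours = mean_hours_between_events(1)
print(f"1 FIT: one undetected error per {hours:.0e} hours "
      f"(about {hours / HOURS_PER_YEAR:,.0f} years)")
```

The "5 min" on the slide is the rounded form of this result, which is slightly above five minutes.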
[Figure: … associated hardware announcements]

Marketing information from HP (from 2005)
• Telecommunications
  – 135 public telephone companies currently rely on NonStop technology.
  – More than half of all 911 calls in the United States and the majority of wireless calls worldwide depend on NonStop servers.
• Finance
  – Eighty percent of all ATM transactions worldwide and 66 percent of all point-of-sale transactions worldwide are handled by NonStop servers.
  – NonStop technology powers 75 percent of the world’s 100 largest electronic funds transfer networks and 106 of the world’s 120 stock and commodity exchanges.
NonStop System with self-checked processors
Self-checked processors
• Stop promptly if an error occurs
• Prevent error propagation
Process pairs
• Critical software is implemented as a process pair, with one primary and one backup process executing on different processors
• The primary process executes the program and sends state changes regularly to the backup process
• The backup process takes over if the primary process fails by itself or as a result of a processor failure

Logical Processors
[Figure: logical processors]
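The process-pair idea from the NonStop slide above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: both processes live in one Python program, a checkpoint is sent after every state change, and a processor error is modeled as an exception. All class and function names are hypothetical and do not correspond to any NonStop API.

```python
class FailStop(Exception):
    """Models a self-checked processor that stops promptly on an error."""


class BackupProcess:
    """Holds the last state received from the primary (hypothetical model)."""

    def __init__(self):
        self.state = {}

    def receive_checkpoint(self, state_change):
        # Apply the state change sent by the primary.
        self.state.update(state_change)

    def take_over(self):
        # Continue from the last checkpointed state.
        return dict(self.state)


class PrimaryProcess:
    """Executes the program and checkpoints each state change to the backup."""

    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def execute(self, key, value, inject_error=False):
        if inject_error:
            # The processor detects an error and stops before any state
            # is changed, so no corrupted data propagates.
            raise FailStop("processor stopped")
        self.state[key] = value
        self.backup.receive_checkpoint({key: value})


def run_pair(operations, fail_at=None):
    """Run operations on the primary; fail over to the backup at index fail_at."""
    backup = BackupProcess()
    primary = PrimaryProcess(backup)
    for index, (key, value) in enumerate(operations):
        try:
            primary.execute(key, value, inject_error=(index == fail_at))
        except FailStop:
            return "backup", backup.take_over()
    return "primary", dict(primary.state)


if __name__ == "__main__":
    ops = [("balance", 100), ("balance", 80), ("owner", "alice")]
    print(run_pair(ops))              # no failure: the primary finishes
    print(run_pair(ops, fail_at=2))   # failure before the third operation
```

In the failure case the backup resumes with exactly the state that was checkpointed before the processor stopped, which is why the fail-stop behaviour of the self-checked processors matters: a processor that kept running after an error could send a corrupted checkpoint.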
IBM Power7 processor
• Released in 2010, successor to the dual-core Power6 processor (released in 2007)
• Implements Power ISA v. 2.06 revision B (July 2010)
• Fabricated in 45 nm SOI, 567 mm², 1.2 billion transistors
• 8 cores
• Each core has 12 execution units: two fixed-point units, two load-store units, four double-precision floating-point units, one vector unit, one branch execution unit, one condition register unit, and one decimal floating-point unit.
• Each core can fetch up to 8 instructions, decode and dispatch up to 6 instructions, and issue and execute up to 8 instructions in one clock cycle.
• Two on-chip memory controllers. Each memory controller supports four DDR3 memory channels, yielding a total memory bandwidth of 100 Gbytes/s
• Scales to 32-socket systems with 1024 threads

Power7 High Volume Card
[Figure: Power7 high volume card]
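A few derived figures follow directly from the Power7 numbers listed above. The sketch below is plain arithmetic on the slide's values. The per-channel bandwidth is an average that assumes the 100 Gbytes/s is split evenly over all channels, which the slide does not state.

```python
# Values taken from the Power7 slide above.
CORES_PER_CHIP = 8
MAX_SOCKETS = 32
MAX_THREADS = 1024
MEMORY_CONTROLLERS = 2
CHANNELS_PER_CONTROLLER = 4
TOTAL_BANDWIDTH_GBYTES_S = 100

EXECUTION_UNITS = {
    "fixed-point": 2,
    "load-store": 2,
    "double-precision floating-point": 4,
    "vector": 1,
    "branch execution": 1,
    "condition register": 1,
    "decimal floating-point": 1,
}

# Hardware threads per core implied by the maximum system configuration.
threads_per_core = MAX_THREADS // (MAX_SOCKETS * CORES_PER_CHIP)

# Total DDR3 channels per chip.
memory_channels = MEMORY_CONTROLLERS * CHANNELS_PER_CONTROLLER

# ASSUMPTION: bandwidth is divided evenly across the channels.
bandwidth_per_channel = TOTAL_BANDWIDTH_GBYTES_S / memory_channels

# The listed units should add up to the stated 12 execution units per core.
total_execution_units = sum(EXECUTION_UNITS.values())

print(f"Threads per core:        {threads_per_core}")
print(f"Memory channels:         {memory_channels}")
print(f"Bandwidth per channel:   {bandwidth_per_channel} Gbytes/s (average)")
print(f"Execution units per core: {total_execution_units}")
```

The thread count works out to four hardware threads per core, and the seven unit types sum to the 12 execution units stated on the slide.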