IBM zSeries Fault Tolerant Design, Lisa Spainhower


IBM zSeries Fault Tolerant Design
Lisa Spainhower
September 20, 2001

Power/Cooling Fault Tolerance
[Diagram, power: N+1; AC input and Battery into AC to DC Converters; 350 volt; DC to DC Converters; Load]
[Diagram, cooling: N+1; AC input and Battery into AC to DC Converter; Fan/Compressor units; Fan/Compressor Control]

I/O ED and Recovery
S/390: Defined errors. Defined robust checking & isolation.
Unix: NO ED standards for PCI. RS/AIX custom design to circumvent.
IBT: IBT is channel-based, like S/390 I/O.
LEVEL THE PLAYING FIELD
[Diagram labels: PROCESSORS; L2 CACHES; Memory Bus; MAIN MEMORY; I/O SUBSYSTEM; GX, S/390 I/O HUB; S/390 STI; RIO; BRIDGE; CHANNEL; IBT, PCI, PCIx; ESCON, FICON, FC; I/O ADAPTER; SCSI, FCAL STORAGE; ETHERNET NETWORK]

Memory Hierarchy Fault Tolerance
Memory
  (72, 64) SEC/DED ECC
  One bit per chip
  Background scrubbing
  Dynamic chip sparing
Level 2 Cache
  (72, 64) SEC/DED ECC
  Line/directory deletes
  Line sparing
Level 1 Cache
  Parity Protected
  Store-through to L2
  ECC'd Store Buffer on uP
  Line delete/sparing
[Diagram labels: Memory; I/O; L2; L1; uP]

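The "(72, 64) SEC/DED ECC" entry means 64 data bits are stored with 8 check bits, so that any single-bit error can be corrected and any double-bit error can be detected. The slides do not give the actual zSeries code matrix, so the sketch below uses a textbook extended Hamming construction as a stand-in; the names `encode`, `decode`, `CHECK_POSITIONS`, and `DATA_POSITIONS` are illustrative, not IBM's.

```python
# Sketch of a (72, 64) SEC/DED code using an extended Hamming construction.
# ASSUMPTION: this is a textbook layout, not the code used in zSeries hardware.
from typing import List, Optional, Tuple

CHECK_POSITIONS = [1, 2, 4, 8, 16, 32, 64]           # 7 Hamming check bits
DATA_POSITIONS = [p for p in range(1, 72) if p not in CHECK_POSITIONS]  # 64 slots
# Position 0 holds an overall parity bit, giving 72 bits in total.


def encode(data: int) -> List[int]:
    """Encode a 64-bit integer as a list of 72 bits indexed by position."""
    if not 0 <= data < (1 << 64):
        raise ValueError("data must fit in 64 bits")
    word = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (data >> i) & 1
    # Each check bit makes the parity of all positions sharing its index bit even.
    for c in CHECK_POSITIONS:
        parity = 0
        for pos in range(1, 72):
            if pos & c:
                parity ^= word[pos]
        word[c] = parity
    word[0] = sum(word[1:]) & 1                      # overall even parity
    return word


def decode(word: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 72):
        if word[pos]:
            syndrome ^= pos                          # XOR of set positions
    overall_odd = sum(word) & 1
    fixed = list(word)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome < 72:
        fixed[syndrome] ^= 1                         # single error: flip it back
        status = "corrected"
    else:
        return None, "uncorrectable"                 # double (or worse) error
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= fixed[pos] << i
    return data, status
```

With a layout like this, "one bit per chip" matters because a whole-chip failure then damages at most one bit of each 72-bit word, which stays within the single-error-correcting capability.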
CP Error Detection & Recovery
Shared:
  Cache controls
  Cache data/address flow
Duplicated:
  Complex controls
  Arithmetic dataflow
Check all state updates
Preserve known good state
If error
  1. Stop state updates
  2. Refresh from saved state
  3. Restart CPU
If error persists
  1. Extract saved state (SE)
  2. Load into spare CPU
  3. Start spare CPU
[Diagram labels: I-Unit (unchecked); I-Unit (mirror); E-Unit (unchecked); E-Unit (mirror); Cache (parity); R-Unit (ECC on saved state). Legend: Address; Cache data; Instructions; Results / state updates; Saved state data. CFW 3/30/00]

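The two recovery sequences on this slide (retry from saved state, then move to a spare CPU if the error persists) can be sketched as a small simulation. This is a software analogy only: the `CPU` class, the `fault_budget` fault injector, `run`, and `max_retries` are hypothetical names introduced for illustration, and the real mechanism is implemented in hardware and firmware.

```python
# Software analogy of "check all state updates, preserve known good state".
# ASSUMPTION: all names and the fault-injection model are illustrative only.
from typing import Callable, Dict, List, Optional, Tuple

Instr = Callable[[Dict[str, int]], Dict[str, int]]   # state -> proposed updates


class CPU:
    def __init__(self, name: str, fault_budget: int = 0) -> None:
        self.name = name
        self.fault_budget = fault_budget   # how many executions get corrupted
        self.state: Dict[str, int] = {}

    def execute_checked(self, instr: Instr) -> Optional[Dict[str, int]]:
        """Run on duplicated units and compare; None signals a mismatch."""
        primary = instr(dict(self.state))
        mirror = instr(dict(self.state))
        if self.fault_budget > 0:          # inject a fault into one copy
            self.fault_budget -= 1
            primary = {k: v + 1 for k, v in primary.items()}
        return primary if primary == mirror else None


def run(program: List[Instr], cpu: CPU, spare: Optional[CPU],
        max_retries: int = 1) -> Tuple[CPU, List[str]]:
    saved = dict(cpu.state)                # known good state
    log: List[str] = []
    for instr in program:
        retries = 0
        while True:
            updates = cpu.execute_checked(instr)
            if updates is not None:
                cpu.state.update(updates)  # commit only checked updates
                saved = dict(cpu.state)    # advance the checkpoint
                break
            if retries < max_retries:
                # Stop state updates, refresh from saved state, restart.
                retries += 1
                cpu.state = dict(saved)
                log.append(f"retry on {cpu.name}")
            elif spare is not None:
                # Error persists: extract saved state, load and start the spare.
                spare.state = dict(saved)
                log.append(f"spared {cpu.name} -> {spare.name}")
                cpu, spare = spare, None
                retries = 0
            else:
                raise RuntimeError("no spare CPU available")
    return cpu, log
```

A transient fault (`fault_budget=1`) is absorbed by a single retry on the same CPU, while a persistent fault exhausts the retry and the saved state is restarted on the spare, so the program completes in both cases.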
2Q01 zSeries Full Field Data
MTTHardware Repair = 8 months
81-83% of repairs are concurrent
13-15% of repairs are deferrable
2-6% of repairs are app loss: MTTAL = 24 years
TYPICAL REPAIR SCENARIO
[Diagram labels: HW Failure; Detect Error; HW Checkpoint Retry Op (~1 second); Restart Op (~1 minute); Repair/restore (hours+); Soft: 100% RESOURCES UP; Hard CPU: 100% RESOURCES UP; Hard Single Channel Element Offline; System up]

zSeries Error Reporting
~2 week interval "call home" recovery data
Suppose CP hard logic (not array) failures caused app loss: MTTAL would go from 24 yrs to 11 yrs
Suppose array (L1, L2, BHT) failures also caused app loss: MTTAL would go from 11 yrs to 5 yrs

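The field-data figures can be related with simple arithmetic. The slides do not state how MTTAL was computed; the sketch below assumes it is roughly the mean time to hardware repair divided by the fraction of repairs that cause application loss. The function names are illustrative.

```python
# Back-of-envelope relation between repair interval and MTTAL.
# ASSUMPTION: MTTAL ~= (mean time to hardware repair) / (app-loss fraction);
# the slides do not give this formula.
MTT_HW_REPAIR_MONTHS = 8.0   # from the 2Q01 field data slide


def mttal_years(app_loss_fraction: float) -> float:
    """MTTAL in years if this fraction of repairs causes application loss."""
    return MTT_HW_REPAIR_MONTHS / app_loss_fraction / 12.0


def implied_app_loss_fraction(mttal_in_years: float) -> float:
    """Fraction of repairs that would have to cause app loss for a given MTTAL."""
    return MTT_HW_REPAIR_MONTHS / (mttal_in_years * 12.0)


if __name__ == "__main__":
    for years in (24, 11, 5):
        pct = 100.0 * implied_app_loss_fraction(years)
        print(f"MTTAL {years:>2} yrs -> {pct:.1f}% of repairs are app loss")
```

Under that assumption, a 24-year MTTAL corresponds to about 2.8% of repairs, inside the reported 2-6% range, and the two hypothetical cases of 11 and 5 years correspond to about 6.1% and 13.3%. This illustrates how much application loss CPU sparing and array recovery avoid.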
S/390 Evolution
S/390 uses same technology building blocks for soft and hard error recovery
  Enhanced over past 35 years
IT'S NOT THE ONLY OPTION
  Beginning afresh, might land elsewhere
  Need to be driven by current conditions
    PAF
    Technology
    Workload
IT'S EFFICIENT & EFFECTIVE FOR S/390
[Diagram labels: Soft error recovery: Instruction retry; Hard error recovery: CPU Sparing; Circuit-level detection; uArch checkpoint]

Challenges for the 00s
Increased importance of firmware
Circuit failure mechanisms
State encapsulation
On-the-fly change
Dynamic resource allocation
Configuration validation
