Availability models
Dr. János Tapolcai
tapolcai@tmit.bme.hu
http://opti.tmit.bme.hu/~tapolcai/
Failure sources – HW failures
• Network element failures
  – Type failures
    • Manufacturing or design failures
    • Detected during the testing phase
  – Wear out
    • Processor, memory, main board, interface cards
    • Components with moving parts:
      – Cooling fans, hard disk, power supply
      – These devices are mostly affected and damaged by natural phenomena (e.g. high humidity, high temperature, earthquakes)
    • Circuit breakers, transistors, etc.
Failure sources – SW failures
• Design errors
• High complexity and compound failures
• Faulty implementations
• Typos in variable names
  – Compiler detects most of these failures
• Failed memory reading/writing operations
Failure sources – Operator errors (1)
• Unplanned maintenance
  – Misconfiguration
    • Routing and addressing
      – misconfigured addresses or prefixes, interface identifiers, link metrics, and timers and queues (DiffServ)
    • Traffic Conditioners
      – Policers, classifiers, markers, shapers
    • Wrong security settings
      – Block legacy traffic
  – Other operation faults:
    • Accidental errors (unplug, reset)
    • Access denial (forgotten password)
• Planned maintenance
  – Upgrade takes longer than planned
Failure sources – Operator errors (2)
• Topology/Dimensioning/Implementation design errors
  – Weak processor in routers
  – High BER in long cables
  – Topology is not meshed enough (not enough redundancy in protection path selection)
• Compatibility errors
  – Between different vendors and versions
  – Between service providers or ASs (Autonomous Systems)
    • Different routing settings and Admission Control between two ASs
Failure sources – Operator errors (3)
• Operation and maintenance errors
  – Updates and patches
  – Misconfiguration
  – Device upgrade
  – Maintenance
  – Data mirroring or recovery
  – Monitoring and testing
  – Teaching users
  – Other
Failure sources – User errors
• Failures caused by malicious users
  – Physical devices
    • Robbery, damaging the device
  – Against nodes
    • Viruses
  – DoS (denial-of-service) attacks (e.g. on the Internet)
    • Routers are overloaded
    • From many addresses at once
    • IP address spoofing
    • Example: Ping of Death – the maximal size of a ping packet is 65535 bytes. In 1996 computers could be frozen by receiving larger packets.
• Unexpected user behavior
  – Short term
    • Extreme events (mass calling)
    • Mobility of users (e.g. after a football match the given cell is congested)
  – Long term
    • New popular sites and killer applications
Failure sources – Environmental causes
• Cable cuts
  – Road construction (‘Universal Cable Locator’)
  – Rodent bites
• Fading of radio waves
  – New skyscraper (e.g. CN Tower)
  – Clouds, fog, smog, etc.
  – Birds, planes
• Electro-magnetic interference
  – Electro-magnetic noise – solar flares
• Power outage
• Humidity and temperature
  – Air-conditioner fault
• Natural disasters
  – Fires, floods, terrorist attacks, lightning, earthquakes, etc.
Michnet ISP Backbone 11/97 – 11/98
• Which failures are the most probable ones?
  – Hardware Problem
  – Maintenance
  – Software Problem
  – Power Outage
  – Fiber Cut/Circuit/Carrier Problem
  – Interface Down
  – Malicious Attack
  – Congestion/Sluggish
  – Routing Problems
Michnet ISP Backbone 11/97 – 11/98

Cause                             | Type          | #   | [%]
Maintenance                       | Operator      | 272 | 16.2
Power Outage                      | Environmental | 273 | 16.0
Fiber Cut/Circuit/Carrier Problem | Environmental | 261 | 15.3
Unreachable                       | Operator      | 215 | 12.6
Hardware Problem                  | Hardware      | 154 | 9.0
Interface Down                    | Hardware      | 105 | 6.2
Routing Problems                  | Operator      | 104 | 6.1
Miscellaneous                     | Unknown       | 86  | 5.9
Unknown/Undetermined/No problem   | Unknown       | 32  | 5.6
Congestion/Sluggish               | User          | 65  | 4.6
Malicious Attack                  | Malice        | 26  | 1.5
Software Problem                  | Software      | 23  | 1.3

By cause type (pie chart): Operator 35%, Environmental 31%, Hardware 15%, Unknown 11%, User 5%, Malice 2%, Software 1%
Case study - 2002
• D. Patterson et al.: “Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies”, UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15, 2002
Failure sources - Summary
• Operator errors (misconfiguration)
  – Simple solutions are needed
  – Sometimes reach 90% of all failures
• Planned maintenance
  – Performed at night
  – Sometimes reaches 20% of all failures
• DoS attacks
  – Will get worse in the future
• Software failures
  – Source code of 10 million lines
• Link failures
  – Anything that makes a point-to-point connection fail (not only cable cuts)
Motivation behind survivable network design
Reliability
• Failure
  – is the termination of the ability of a network element to perform a required function. Hence, a network failure happens at one particular moment t_f
• Reliability, R(t)
  – continuous operation of a system or service
  – refers to the probability of the system being adequately operational (i.e. failure-free operation) for the intended period of time [0, t] in the presence of network failures
Reliability (2)
• Reliability, R(t)
  – Defined as R(t) = 1 − F(t), where F(t) is the cumulative distribution function (cdf) of the time to failure
  – Simple model: exponentially distributed failure times, so that
    R(t) = 1 − F(t) = 1 − (1 − e^(−λt)) = e^(−λt)
• Properties:
  – non-increasing
  – R(0) = 1
  – lim R(t) = 0 as t → ∞
(Figure: the curve of R(t), starting from R(0) = 1 and decreasing towards 0; the value R(a) is marked at time t = a.)
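A minimal Python sketch (not from the slides) of the exponential reliability model above; the failure rate value is an assumed example:

```python
import math

# Assumed example value: one failure per 10,000 hours on average
failure_rate = 1.0 / 10_000  # lambda [1/h]

def reliability(t_hours, lam=failure_rate):
    """R(t) = 1 - F(t) = e^(-lambda * t) for exponentially distributed failure times."""
    return math.exp(-lam * t_hours)

for t in (0, 1_000, 10_000, 50_000):
    print(f"R({t:>6} h) = {reliability(t):.4f}")
# R(0) = 1 and R(t) tends to 0 as t grows, matching the listed properties.
```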
Network with repairable subsystems
• Measures to characterize a repairable system are:
  – Availability, A(t)
    • refers to the probability of a repairable system being found in the operational state at some time t in the future
    • A(t) = P(system is UP at time t)
  – Unavailability, U(t)
    • refers to the probability of a repairable system being found in the faulty state at some time t in the future
    • U(t) = P(system is DOWN at time t)
• A(t) + U(t) = 1 at any time t
(Figure: timeline alternating between UP periods, in which the device is operational, and DOWN periods following a failure, in which the network element is failed and the repair action is in progress.)
Element Availability Assignment
• The most commonly used measures are
  – MTTR - Mean Time To Repair
  – MTTF - Mean Time To Failure
    • MTTR << MTTF
  – MTBF - Mean Time Between Failures
    • MTBF = MTTF + MTTR
    • if the repair is fast, MTBF is approximately the same as MTTF
    • Sometimes given in FITs (Failures In Time): MTBF[h] = 10^9 / FIT
• An alternative notation
  – MUT - Mean Up Time
    • like MTTF
  – MDT - Mean Down Time
    • like MTTR
  – MCT - Mean Cycle Time
    • MCT = MUT + MDT
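As a quick illustration of these measures, a small Python sketch converting a FIT figure to MTBF and computing the steady-state availability as MTTF/(MTTF + MTTR), which is the same quantity as the μ/(λ+μ) result derived later with MTTF = 1/λ and MTTR = 1/μ. The FIT and repair-time values are assumed, not taken from the lecture:

```python
# Assumed example values
fit = 1_000        # failures per 10^9 device-hours
mttr_h = 4.0       # mean time to repair [h]

mtbf_h = 1e9 / fit          # MTBF[h] = 10^9 / FIT
mttf_h = mtbf_h - mttr_h    # MTBF = MTTF + MTTR

availability = mttf_h / (mttf_h + mttr_h)   # steady-state A = MTTF / MTBF
print(f"MTBF = {mtbf_h:.0f} h, A = {availability:.6f}, U = {1 - availability:.2e}")
```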
Availability in hours

Availability | Nines                       | Outage time/year | Outage time/month | Outage time/week
90%          | 1 nine                      | 36.52 days       | 73.04 hours       | 16.80 hours
95%          | -                           | 18.26 days       | 36.52 hours       | 8.40 hours
98%          | -                           | 7.30 days        | 14.60 hours       | 3.36 hours
99%          | 2 nines (maintained)        | 3.65 days        | 7.30 hours        | 1.68 hours
99.5%        | -                           | 1.83 days        | 3.65 hours        | 50.40 min
99.8%        | -                           | 17.53 hours      | 87.66 min         | 20.16 min
99.9%        | 3 nines (well maintained)   | 8.77 hours       | 43.83 min         | 10.08 min
99.95%       | -                           | 4.38 hours       | 21.91 min         | 5.04 min
99.99%       | 4 nines                     | 52.59 min        | 4.38 min          | 1.01 min
99.999%      | 5 nines (failure protected) | 5.26 min         | 25.9 sec          | 6.05 sec
99.9999%     | 6 nines (high reliability)  | 31.56 sec        | 2.62 sec          | 0.61 sec
99.99999%    | 7 nines                     | 3.16 sec         | 0.26 sec          | 0.06 sec
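A short Python sketch (added for illustration) that converts an availability figure into the expected downtime shown in the table; it reproduces a few of the rows above:

```python
# Expected downtime implied by a given availability level
HOURS_PER_YEAR = 365.25 * 24   # 8766 h

def downtime_per_year_hours(availability):
    return (1.0 - availability) * HOURS_PER_YEAR

for a in (0.99, 0.999, 0.99999):
    h = downtime_per_year_hours(a)
    print(f"{a:.5f} -> {h:8.2f} h/year ({h * 60:8.2f} min/year)")
# 0.99 -> 87.66 h/year (~3.65 days), 0.999 -> 8.77 h/year, 0.99999 -> ~5.26 min/year
```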
Availability evaluation – assumptions (1)
• Deployment
  – availability increases (unavailability decreases)
  – performance is optimized
• Steady state
  – the availability remains the same for a long period (time independent)
• Wear out (component aging)
  – availability decreases (unavailability increases)
  – e.g. impairments in the fiber
(Figure: the "bathtub curve" of U(t) over time – high during deployment, flat during the steady state, rising again during wear out.)
Availability evaluation – assumptions (2)
• Failure arrival times
  – independent and identically distributed (iid) variables following an exponential distribution: F(t) = 1 − e^(−λt)
  – sometimes a Weibull distribution is used (hard)
  – λ > 0 failure rate (time independent!)
• Repair times
  – iid exponential variables
  – sometimes a Weibull distribution is used (hard)
  – μ > 0 repair rate (time independent!)
• If both failure arrival times and repair times are exponentially distributed we have a simple model
  – Continuous Time Markov Chain
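Under these assumptions the long-run behaviour can also be checked by simulation. A small Python sketch (illustrative only, with assumed rates λ and μ) samples alternating exponential up and down periods and estimates the availability as the fraction of time spent up:

```python
import random

lam, mu = 0.001, 0.1   # assumed failure and repair rates [1/h]
random.seed(1)

up_time = down_time = 0.0
for _ in range(100_000):                  # alternate UP and DOWN periods
    up_time += random.expovariate(lam)    # exponentially distributed time to failure
    down_time += random.expovariate(mu)   # exponentially distributed repair time

a_sim = up_time / (up_time + down_time)
print(f"simulated A = {a_sim:.5f}, analytic mu/(lambda+mu) = {mu / (lam + mu):.5f}")
```

The simulated value approaches the steady-state availability μ/(λ+μ) derived on the next slides.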
Two-state Markov model – Steady state analysis (1)
• Mean of exponentially distributed variables: MTTF = 1/λ, MTTR = 1/μ
(Figure: two-state diagram with states UP and DN; transition probability λ from UP to DN, μ from DN to UP, and self-loop probabilities 1−λ and 1−μ.)
• Transition probability distribution in matrix form
  – Transition matrix P (stochastic matrix)
• Time-homogeneous Markov chain
  – The transition matrix after k steps: P^k
  – The stationary distribution is a row vector π for which πP = π holds (and in this case it is unique)
Two-state Markov model – Steady state analysis (2)
• Transition matrix (rows/columns: UP, DN):
    P = [ 1−λ    λ  ]
        [  μ    1−μ ]
• Stationary distribution (A, U): it satisfies (A, U) = (A, U)·P with A + U = 1
  – From the UP component: A(1−λ) + Uμ = A, i.e. Aλ = Uμ, so A = (μ/λ)·U
  – Using U = 1 − A, as we have seen:
    A = μ / (λ + μ),   U = λ / (λ + μ)
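A minimal Python sketch (not part of the slides) that iterates π ← πP for the two-state transition matrix with assumed λ and μ values and checks the result against the closed-form A = μ/(λ+μ), U = λ/(λ+μ):

```python
# Two-state chain P = [[1-lam, lam], [mu, 1-mu]]; per-step probabilities assumed small
lam, mu = 0.001, 0.1   # assumed failure / repair probabilities per step

pi = [1.0, 0.0]        # start in the UP state
for _ in range(10_000):                      # power iteration: pi <- pi * P
    pi = [pi[0] * (1 - lam) + pi[1] * mu,
          pi[0] * lam + pi[1] * (1 - mu)]

print(f"iterated:  A = {pi[0]:.6f}, U = {pi[1]:.6f}")
print(f"analytic:  A = {mu / (lam + mu):.6f}, U = {lam / (lam + mu):.6f}")
```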