Survivable Network Design Dr. János Tapolcai tapolcai@tmit.bme.hu 1
The final goal • We prefer not to see: 2
Telecommunication Networks (figure: video, PSTN, Internet, business and mobile access services carried over the service providers' metro, backbone and high-speed backbone networks) 3
Telecommunication Networks http://www.icn.co 4
Traditional network architecture in backbone networks • IP (Internet Protocol): addressing, routing • ATM (Asynchronous Transfer Mode): traffic engineering • SDH/SONET (Synchronous Digital Hierarchy): transport and protection • WDM (Wavelength Division Multiplexing): high bandwidth 5
Evolution of network layers (figure: restoration time scales – BGP-4: 15–30 minutes, OSPF: 10 seconds to minutes, SONET: 50 milliseconds; the layer stack converges from IP / ATM / SONET / optics in 1999, through IP-MPLS / thin SONET / smart optics in 2003, to packet (IP/Ethernet, GMPLS) directly over optics in 201x) 6
IP - Internet Protocol • Packet switched – Hop-by-hop routing – Packets are forwarded based on forwarding tables • Distributed control – Shortest path routing • via link-state protocols: OSPF (Open Shortest Path First), IS-IS (Intermediate System to Intermediate System) • Routing on a logical topology • Widespread and its role is straightforward – though from a technical point of view not very popular (see the shortest-path sketch below) 7
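To make the link-state idea concrete, here is a minimal Python sketch of shortest-path computation (Dijkstra's algorithm, which OSPF and IS-IS are built around) on a small hypothetical logical topology. The node names and link costs are illustrative assumptions, not taken from the slides.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source, as computed conceptually
    by link-state protocols (OSPF, IS-IS) on the logical topology."""
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Hypothetical logical topology with OSPF-style link costs
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
print(dijkstra(graph, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```

In a real router each node runs this computation over the flooded link-state database and installs the resulting next hops into its forwarding table.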
Optical backbone • Circuit switched – Centralized control – Exact knowledge of the physical topology • Logical links are lightpaths – Source and destination node pairs, bandwidth (figure: lightpaths routed across wavelength crossconnects A–E and terminated at IP routers) 9
Optical Backbone Networks 10
Motivation Behind Survivable Network Design 11
FAILURE SOURCES 12
Failure Sources – HW Failures • Network element failures – Type failures • Manufacturing or design failures • Revealed during the testing phase – Wear-out • Processor, memory, main board, interface cards • Components with moving parts: – Cooling fans, hard disks, power supplies – These components are mostly affected and damaged by natural phenomena (e.g. high humidity, high temperature, earthquakes) • Circuit breakers, transistors, etc. 13
Failure Sources – SW Failures • Design errors • High complexity and compound failures • Faulty implementations • Typos in variable names – The compiler detects most of these failures • Failed memory read/write operations 14
Failure Sources – Operator Errors (1) • Unplanned maintenance – Misconfiguration • Routing and addressing – misconfigured addresses or prefixes, interface identifiers, link metrics, timers and queues (DiffServ) • Traffic conditioners – Policers, classifiers, markers, shapers • Wrong security settings – Blocking legitimate traffic – Other operational faults: • Accidental errors (unplug, reset) • Access denial (forgotten password) • Planned maintenance – Upgrade takes longer than planned 15
Failure Sources – Operator Errors (2) • Topology/dimensioning/implementation design errors – Weak processors in routers – High BER in long cables – Topology is not meshed enough (not enough redundancy for protection path selection) • Compatibility errors – Between different vendors and versions – Between service providers or ASes (Autonomous Systems) • Different routing settings and admission control between two ASes 16
Failure Sources – Operator Errors (3) • Operation and maintenance errors – Updates and patches – Misconfiguration – Device upgrades – Maintenance – Data mirroring or recovery – Monitoring and testing – User training – Other 17
Failure Sources – User Errors • Failures from malicious users – Physical devices • Robbery, damaging the device – Against nodes • Viruses – DoS (denial-of-service) attacks (e.g. on the Internet) • Routers are overloaded • From many addresses at once • IP address spoofing • Example: Ping of Death – the maximum size of a ping packet is 65,535 bytes; in 1996 computers could be frozen by receiving larger packets • Unexpected user behavior – Short term • Extreme events (mass calling) • Mobility of users (e.g. after a football match the given cell is congested) – Long term • New popular sites and killer applications 18
Failure Sources – Environmental Causes • Cable cuts – Road construction (‘Universal Cable Locator’) – Rodent bites • Fading of radio waves – New skyscrapers (e.g. CN Tower) – Clouds, fog, smog, etc. – Birds, planes • Electromagnetic interference – Electromagnetic noise – Solar flares • Power outages • Humidity and temperature – Air-conditioner faults • Natural disasters – Fires, floods, terrorist attacks, lightning, earthquakes, etc. 19
Operating Routers During Hurricane Sandy 20
Michnet ISP Backbone (1998) • Which failures were the most probable ones? Hardware Problem, Maintenance, Software Problem, Power Outage, Fiber Cut/Circuit/Carrier Problem, Interface Down, Malicious Attack, Congestion/Sluggish, Routing Problems 21
Michnet ISP Backbone (1998)

Cause                               Type           #     [%]
Maintenance                         Operator       272   16.2
Power Outage                        Environmental  273   16.0
Fiber Cut/Circuit/Carrier Problem   Environmental  261   15.3
Unreachable                         Operator       215   12.6
Hardware Problem                    Hardware       154    9.0
Interface Down                      Hardware       105    6.2
Routing Problems                    Operator       104    6.1
Miscellaneous                       Unknown         86    5.9
Unknown/Undetermined/No problem     Unknown         32    5.6
Congestion/Sluggish                 User            65    4.6
Malicious Attack                    Malice          26    1.5
Software Problem                    Software        23    1.3

(pie chart by type: Operator 35%, Environmental 31%, Hardware 15%, Unknown 11%, User 5%, Malice 2%, Software 1%) 22
Case study - 2002 • D. Patterson et al.: “Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies”, UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15, 2002. 23
Failure Sources - Summary • Operator errors (misconfiguration) – Simple solutions are needed – Can reach 90% of all failures • Planned maintenance – Run at night – Can reach 20% of all failures • DoS attacks – Expected to get worse in the future • Software failures – Source code of ~10 million lines • Link failures – Anything that makes a point-to-point connection fail (not only cable cuts) 24
Reliability • Failure – the termination of the ability of a network element to perform a required function; hence, a network failure happens at one particular moment t_f • Reliability, R(t) – continuous operation of a system or service – the probability that the system is adequately operational (i.e. failure-free operation) for the intended period of time [0, t] in the presence of network failures 25
Reliability (2) • Reliability, R(t) – Defined as 1 − F(t), where F(t) is the cumulative distribution function (cdf) of the time to failure – Simple model: exponentially distributed failure times • Properties: – non-increasing – R(0) = 1 – lim_{t→∞} R(t) = 0 – R(t) = 1 − F(t) = 1 − (1 − e^(−λt)) = e^(−λt) (figure: R(t) decreasing from 1 towards 0, with R(a) marked at t = a) 26
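As a quick numeric illustration of R(t) = e^(−λt), the Python sketch below evaluates the reliability of a hypothetical component with an assumed MTTF of 100,000 hours (so λ = 1/MTTF); the MTTF value is for illustration only.

```python
import math

def reliability_exponential(t_hours, mttf_hours):
    """R(t) = 1 - F(t) = exp(-lambda * t) for exponentially distributed
    failure times, with failure rate lambda = 1 / MTTF."""
    lam = 1.0 / mttf_hours
    return math.exp(-lam * t_hours)

# Hypothetical component with MTTF of 100,000 hours (~11.4 years)
for t in (24, 8760, 5 * 8760):          # one day, one year, five years
    print(f"R({t} h) = {reliability_exponential(t, 100_000):.4f}")
# R(t) is non-increasing, R(0) = 1, and R(t) -> 0 as t -> infinity
```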
Network with Repairable Subsystems • Measures to characterize a repairable system are: – Availability, A(t) • the probability that a repairable system is found in the operational state at some time t in the future • A(t) = P(time = t, system = UP) – Unavailability, U(t) • the probability that a repairable system is found in the faulty state at some time t in the future • U(t) = P(time = t, system = DOWN) • A(t) + U(t) = 1 at any time t (figure: UP/DOWN timeline – the device alternates between operational periods and failure periods, during which the repair action is in progress) 27
Element Availability Assignment • The most commonly used measures are – MTTR - Mean Time To Repair – MTTF - Mean Time To Failure • MTTR << MTTF – MTBF - Mean Time Between Failures • MTBF = MTTF + MTTR • if the repair is fast, MTBF is approximately the same as MTTF • Sometimes given in FITs (Failures In Time): MTBF[h] = 10^9/FIT • Another notation – MUT - Mean Up Time • Like MTTF – MDT - Mean Down Time • Like MTTR – MCT - Mean Cycle Time • MCT = MUT + MDT (see the sketch below) 28
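The sketch below combines these measures in the usual way: the standard steady-state availability formula A = MTTF/(MTTF + MTTR) (equivalently MUT/MCT), which is not spelled out on the slide but is the conventional way these quantities are combined, plus the FIT conversion MTBF[h] = 10^9/FIT from the slide. The 500 FIT failure rate and the 4-hour repair time are hypothetical example values.

```python
def steady_state_availability(mttf_h, mttr_h):
    """Standard steady-state availability A = MTTF / (MTTF + MTTR),
    equivalently MUT / MCT."""
    return mttf_h / (mttf_h + mttr_h)

def mtbf_from_fit(fit):
    """MTBF in hours from a FIT value: MTBF[h] = 1e9 / FIT."""
    return 1e9 / fit

# Hypothetical line card: 500 FIT failure rate, 4-hour mean repair time
mttf = mtbf_from_fit(500)   # = 2,000,000 h; MTBF ~= MTTF since repair is fast
print(f"A = {steady_state_availability(mttf, 4):.7f}")   # ~0.9999980
```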
Availability in Hours

Availability   Nines                        Outage time/year   Outage time/month   Outage time/week
90%            1 nine                       36.52 days         73.04 hours         16.80 hours
95%            -                            18.26 days         36.52 hours          8.40 hours
98%            -                             7.30 days         14.60 hours          3.36 hours
99%            2 nines (maintained)          3.65 days          7.30 hours          1.68 hours
99.5%          -                             1.83 days          3.65 hours         50.40 min
99.8%          -                            17.53 hours        87.66 min           20.16 min
99.9%          3 nines (well maintained)     8.77 hours        43.83 min           10.08 min
99.95%         -                             4.38 hours        21.91 min            5.04 min
99.99%         4 nines                      52.59 min           4.38 min            1.01 min
99.999%        5 nines (failure protected)   5.26 min          25.9 sec             6.05 sec
99.9999%       6 nines (high reliability)   31.56 sec           2.62 sec            0.61 sec
99.99999%      7 nines                       3.16 sec           0.26 sec            0.06 sec
29
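These table entries follow directly from the definition: the expected outage time over a period is (1 − A) times the period length. A small Python sketch, assuming a 365.25-day year (the exact month convention varies, which explains small rounding differences against the table):

```python
def outage_time(availability, period_hours):
    """Expected outage time in hours over a period, for availability A."""
    return (1.0 - availability) * period_hours

YEAR = 365.25 * 24          # hours per year
MONTH = YEAR / 12           # hours per (average) month
WEEK = 7 * 24               # hours per week

for a in (0.99, 0.999, 0.9999):
    print(f"{a:.3%}: {outage_time(a, YEAR):6.2f} h/year, "
          f"{outage_time(a, MONTH) * 60:7.2f} min/month, "
          f"{outage_time(a, WEEK) * 60:7.2f} min/week")
# 99%     ->  87.66 h/year (3.65 days), 438.30 min/month (7.30 h), 100.80 min/week (1.68 h)
# 99.9%   ->   8.77 h/year,              43.83 min/month,           10.08 min/week
# 99.99%  ->   0.88 h/year (52.59 min),   4.38 min/month,            1.01 min/week
```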
Availability Evaluation – Assumptions • Failure arrival times – independent and identically distributed (iid) variables following an exponential distribution – sometimes a Weibull distribution is used (harder to handle): F(t) = 1 − e^(−(λt)^α) – λ > 0 failure rate (time independent!) • Repair times – iid exponential variables – sometimes a Weibull distribution is used (harder to handle) – µ > 0 repair rate (time independent!) • If both failure arrival times and repair times are exponentially distributed we have a simple model – a Continuous-Time Markov Chain (see the simulation sketch below) 30
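A minimal Monte Carlo sketch of this two-state continuous-time Markov chain: up-times are exponential with rate λ, repair times are exponential with rate µ, and the long-run fraction of time spent UP converges to the analytic availability µ/(λ+µ) = MTTF/(MTTF+MTTR). The MTTF = 2000 h and MTTR = 4 h values are illustrative assumptions.

```python
import random

def simulate_two_state_ctmc(failure_rate, repair_rate, horizon_h, seed=1):
    """Simulate the two-state (UP/DOWN) CTMC: exponential up-times with
    rate lambda, exponential repair times with rate mu.
    Returns the fraction of the horizon spent in the UP state."""
    rng = random.Random(seed)
    t, up_time, state_up = 0.0, 0.0, True
    while t < horizon_h:
        rate = failure_rate if state_up else repair_rate
        dwell = min(rng.expovariate(rate), horizon_h - t)  # truncate at horizon
        if state_up:
            up_time += dwell
        t += dwell
        state_up = not state_up
    return up_time / horizon_h

lam, mu = 1 / 2_000.0, 1 / 4.0      # hypothetical: MTTF = 2000 h, MTTR = 4 h
print("simulated A:", simulate_two_state_ctmc(lam, mu, 10_000_000))
print("analytic  A:", mu / (lam + mu))   # = MTTF / (MTTF + MTTR) ~ 0.998
```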