WADS Workshop at ICSE 2002

Evaluation of Dependable Layered Systems with a Fault Management Architecture

Olivia Das, C. Murray Woodside
Dept. of Systems and Computer Engineering, Carleton University, Ottawa, Canada
email: odas@sce.carleton.ca, cmw@sce.carleton.ca

Layered System Model
Tasks, Interactions and Dependencies, and Processors
[Figure: layered system model. User groups UserA (N = 50) and UserB (N = 100); tasks AppA, AppB, Server1, Server2; entries userA, userB, eA, eB, eA-1, eA-2, eB-1, eB-2; services serviceA and serviceB, each with alternatives #1 and #2; processors procA, procB, proc1, proc2, proc3, proc4.]
.... Configuration depends on Failure State

Example Configuration (1)
.... failure compensated by standby servers
Processor 3 fails and puts Server1 out
.... Server2 used instead
[Figure: layered system model, same elements as on the Layered System Model slide.]

Example Configuration (2)
.... failure cannot be compensated by standby servers
Processor 2 fails and puts Application1 out
.... Group Users1 is off the air
.... performability measure is reduced
[Figure: layered system model, same elements as on the Layered System Model slide.]

Fault Propagation Graph
.... used to find the configuration states and add up their probabilities
[Figure: fault propagation graph. Nodes: r; UserA, UserB, AppA, AppB, Server1, Server2; userA, userB, eA, eB, eA-1, eA-2, eB-1, eB-2; serviceA and serviceB with alternatives #1 and #2; procA, procB, proc1, proc2, proc3, proc4.]

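The idea of mapping failure states to configurations and adding up their probabilities can be sketched in a few lines of Python. This is a brute-force enumeration, not the fault propagation graph algorithm itself. The availabilities, the task-to-processor placement (`HOST`) and the priority order of the service alternatives (`SERVICE`) are all assumptions made for illustration; none of them are taken from the slides.

```python
from itertools import product

# Hypothetical steady-state availabilities (illustrative values only).
AVAIL = {"proc1": 0.99, "proc2": 0.99, "proc3": 0.95, "proc4": 0.95}

# Assumed placement of tasks on processors (not confirmed by the slides).
HOST = {"AppA": "proc1", "AppB": "proc2", "Server1": "proc3", "Server2": "proc4"}

# Assumed service alternatives in priority order (#1 first, then #2).
SERVICE = {"AppA": ["Server1", "Server2"], "AppB": ["Server1", "Server2"]}


def configuration(up):
    """Map a failure state (component -> bool) to the resulting configuration."""
    config = {}
    for app, targets in SERVICE.items():
        chosen = None
        if up[HOST[app]]:
            # First alternative whose processor is up, or None if all are down.
            chosen = next((s for s in targets if up[HOST[s]]), None)
        config[app] = chosen
    return tuple(sorted(config.items()))


def configuration_probabilities():
    """Enumerate all failure states, assuming independent failures,
    and accumulate the probability of each distinct configuration."""
    names = list(AVAIL)
    probs = {}
    for state in product([True, False], repeat=len(names)):
        up = dict(zip(names, state))
        p = 1.0
        for name in names:
            p *= AVAIL[name] if up[name] else 1.0 - AVAIL[name]
        key = configuration(up)
        probs[key] = probs.get(key, 0.0) + p
    return probs
```

Under these assumptions the 16 failure states of the four processors collapse into 7 distinct configurations, which is the kind of reduction the graph-based analysis exploits.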
Management Subsystem
[Figure: management subsystem. Elements shown: Manager, Application, Server1, Server2, three Agents, Subagent.]
 - Reaction delays
 - Management subsystem failures and repairs

Specifying a Management Architecture
Elements
Components
 - Application processes
 - Management Agents
 - Managers
Connectors
 - Alive-watch
 - Status-watch
 - Notifier
[Figure: management architecture. Processors: proc1:Proc, proc2:Proc, proc3:Proc, proc4:Proc, proc5:Proc. Application processes: AppA:AT, AppB:AT, Server1:AT, Server2:AT. Agents: ag1:AGT, ag2:AGT, ag3:AGT, ag4:AGT. Manager: m1:MT. Alive-watch connectors (AW): c1, c2, c3, c4, c7, c9, c11, c14. Status-watch connectors (SW): c8, c10, c12, c15. Notifier connectors (Ntfy): c5, c6, c13, c16.]

Functionality
Application process status is monitored by its local agent (Alive-watch connection)
Processor status is monitored by a Manager on another node
    .... e.g. by pinging
System-wide status is gathered by Managers (Status connections)
    .... and distributed back to Agents (Notify connections)
Application process reconfiguration is triggered by the agent on its node (Notification connection)
    .... e.g. to switch to a standby server, or to restart a process
Capability to reconfigure is conditioned by "Knowledge" of the status of the system
    .... that is, by the Management Architecture and its failures

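The point that reconfiguration is conditioned by "Knowledge" can be illustrated with a small Python sketch. It assumes, purely for illustration, that AppA learns about a Server1 failure only if its agent ag1, the manager m1 and the manager's processor proc5 are all operational. The function names and this particular knowledge path are hypothetical and are not read off the architecture diagram.

```python
def knows_server1_status(up):
    """Assumed knowledge path: manager m1 (on proc5) gathers status,
    and agent ag1 receives the notification. Hypothetical, for illustration."""
    return up["ag1"] and up["m1"] and up["proc5"]


def server_used_by_app_a(up):
    """Return the server AppA ends up using, or None if AppA is out of service."""
    if not up["AppA"]:
        return None
    if up["Server1"]:
        return "Server1"
    # Switching to the standby is possible only if the failure is known.
    if up["Server2"] and knows_server1_status(up):
        return "Server2"
    return None
```

With this rule, a Server1 failure that coincides with a manager failure leaves AppA without service even though Server2 is running, which is how management subsystem failures add configurations to the analysis.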
Analysis .... currently ....
* Markov model for component failures and repairs
    .... (e.g., independent failure of processors and processes)
* Derive configurations and their probabilities
    .... additional configurations that include Management Subsystem failure
* Reconfiguration capability is limited by "Knowledge" of the status, and thus by the Management Subsystem state
    .... thus, additional delays to repair
* Analyse the performance of each configuration
    .... assemble measures based on configuration probabilities
    .... related to work by Haverkort with queueing models and server failures
    .... here, extended with layered dependencies for failure, and layered queueing models for performance
* Consider bounds and approximations

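Assembling a measure from configuration probabilities amounts to a probability-weighted sum of per-configuration performance. A minimal Python sketch follows, assuming a two-state (up/down) Markov model per component with exponential failure and repair rates. The configuration names, probabilities and throughput values are placeholders; in the approach described here the throughputs would come from solving a layered queueing model for each configuration.

```python
def availability(failure_rate, repair_rate):
    """Steady-state availability of a two-state (up/down) Markov component."""
    return repair_rate / (failure_rate + repair_rate)


def expected_reward(config_probs, reward):
    """Probability-weighted average of a per-configuration reward
    (for example, throughput)."""
    return sum(p * reward[config] for config, p in config_probs.items())


# Placeholder numbers, for illustration only.
config_probs = {"both_on_Server1": 0.90, "both_on_Server2": 0.07, "failed": 0.03}
throughput = {"both_on_Server1": 10.0, "both_on_Server2": 8.0, "failed": 0.0}
average_throughput = expected_reward(config_probs, throughput)
```

Keeping the reward table separate from the probabilities mirrors the separation of performance analysis from failure and repair analysis: either side can be recomputed without touching the other.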
Conclusions
Scalable technique
    .... separation of performance-level analysis from failure repair
    .... analysis of effective configurations gives a MUCH smaller set of configurations than of failure states
Even so, explosion of configurations is a limitation
Publications .... www.sce.carleton.ca/faculty/woodside