An Approach to Manage Reconfiguration in Fault-Tolerant Distributed Systems
Stefano Porcarelli (1), Marco Castaldi (2), Felicita Di Giandomenico (1), Andrea Bondavalli (3), Paola Inverardi (2)
(1) Italian National Research Council, ISTI Dept, Italy; stefano.porcarelli@guest.cnuce.cnr.it, digiandomenico@iei.pi.cnr.it
(2) University of L'Aquila, Dip. Informatica, Italy; {castaldi, inverard}@di.univaq.it
(3) University of Florence, Dip. Sistemi e Informatica, Italy; a.bondavalli@dsi.uni.it
May 3rd, WADS 2003
Motivations
• Large distributed systems live for several years
• Environmental events and component faults may affect the workload and the functionality of the system
• Critical systems require high availability and reliability
Hence: system reconfiguration to react to faults, to manage the system's life, and to provide dependability properties
System Reconfigurations
• Dynamic: the reconfiguration must be performed while the system is running, without service interruption
• Automatic: the reconfiguration may be triggered as a reaction to a specified event, issued by a human administrator or an automatic Decision Maker
• Distributed: the reconfiguration is performed on distributed systems
In particular, we address:
• Component Reconfiguration: any change of the component parameters (component re-parametrization)
• Application Reconfiguration: any modification of the architecture in terms of topology, number of components, and component location
Our Approach to (Fault) Reconfiguration
• We propose to use Lira, an infrastructure created to perform dynamic reconfiguration, enriched with a model-based Decision Maker (DM)
[Figure: interaction between the Decision Maker and the Managed System]
• For each fault pattern, a set of reconfigurations is specified
• Lira monitors the system, detects faults, and notifies the Decision Maker
• The DM performs the evaluation
• The DM orders the reconfiguration
• Lira reconfigures the system
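The interaction on this slide (fault notification, evaluation of the pre-planned options, reconfiguration order) can be sketched as a small Python function. This is only an illustration under stated assumptions, not Lira's actual API: `PREPLANNED`, `handle_fault`, the fault-pattern name, and the numeric scores are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical table: for each fault pattern, the pre-planned reconfigurations.
PREPLANNED: Dict[str, List[str]] = {
    "node_degraded": ["keep_degraded", "restart_node", "activate_new_node"],
}

def handle_fault(pattern: str,
                 evaluate: Callable[[str], float],
                 reconfigure: Callable[[str], None]) -> str:
    """Decision Maker step: score each pre-planned option and order the best one."""
    options = PREPLANNED[pattern]
    best = min(options, key=evaluate)  # assumption: a lower score is better
    reconfigure(best)                  # the infrastructure carries out the order
    return best

# Usage with made-up scores; `applied` stands in for the managed system.
applied: List[str] = []
scores = {"keep_degraded": 0.3, "restart_node": 0.1, "activate_new_node": 0.2}
chosen = handle_fault("node_degraded", scores.__getitem__, applied.append)
```

Passing `evaluate` and `reconfigure` as callables keeps the decision logic separate from the infrastructure that monitors and reconfigures the system, which mirrors the split between the DM and Lira in the figure.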
Our Approach to (Fault) Reconfiguration
• The decision-making capability is decomposed in a hierarchical fashion:
  – Favoring fault tolerance by distribution of control
  – Avoiding heavy computation and coordination activity whenever faults can be managed at the local level
  – Facilitating the construction and on-line solution of analytical models
  – Favoring scalability
Lira Architecture
• Lira Management Infrastructure
  – Light-weight Infrastructure for Reconfiguring Applications
  – Lira is based on:
    • Agents
    • MIB (Management Information Base)
    • Management Protocol
[Figure: Human Administrator, Manager, Component Agent, component (Comp), MIB, Management Protocol]
Enriched Lira Architecture
• Lira uses a different agent for each hierarchical level:
  – Component, Host, Application, Manager agent
• Each agent is enriched with a decision maker
  – Decision-making capabilities depend on the hierarchical level of the agent
[Figure: Manager, Application Agent, Host Agent, Component Agent, Host, and component (Comp); Decision Maker and MIB boxes attached to the agents; Management Protocol]
Decision Maker
• Model-Based Decision Maker
  – The dynamic topology of the system and the number of managed faults call for statistical decision capabilities
  – Combinatorial and Petri-net-like models (for complex relationships among components) help to take the most appropriate decision
  – The possible reconfiguration options are pre-planned: the models allow deciding, each time, which one is the most appropriate
• The component's state is modeled by using three states:
  – Up
  – Degraded
  – Down
[Figure: component state diagram with states Up, Degraded, Down]
A Case Study
• Distributed computing where peer-to-peer clients on the network are communicating
• Path redundancy is used to prevent service interruption
[Figure: hosts H1 to H6 on networks Net 1 and Net 2]
[Figure: detail of Net 1, with clients on hosts connected through nodes N1 to N4 and links a to g]
Path  Route
1     a-N1-c-N3-f
2     a-N1-c-N3-d-N2-e-N4-g
3     b-N2-e-N4-g
4     b-N2-d-N3-f
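To make the path redundancy concrete, the four routes of the path table can be encoded and filtered by the set of unusable elements. This is a minimal sketch: the routes come from the slide, while `ROUTES` and `available_paths` are hypothetical names introduced for illustration.

```python
from typing import Dict, List, Set

# Routes from the path table; each route alternates links (a..g) and nodes (N1..N4).
ROUTES: Dict[int, List[str]] = {
    1: "a-N1-c-N3-f".split("-"),
    2: "a-N1-c-N3-d-N2-e-N4-g".split("-"),
    3: "b-N2-e-N4-g".split("-"),
    4: "b-N2-d-N3-f".split("-"),
}

def available_paths(unusable: Set[str]) -> List[int]:
    """Return the ids of the paths that avoid every unusable node or link."""
    return [pid for pid, route in sorted(ROUTES.items())
            if not unusable & set(route)]

# Usage: which paths survive if node N3 is bypassed, or if link e is lost?
without_n3 = available_paths({"N3"})
without_e = available_paths({"e"})
```

With node N3 excluded only path 3 remains, since the other three routes all traverse N3; losing link e leaves paths 1 and 4.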
A Case Study (cont)
• Component agent
  – HEALTH_STATE
  – CONNECTED_NODE
  – Function to connect different nodes
  – Functions to control the node
• Host agent
  – HEALTH_STATE
  – CONNECTED_HOST
  – Functions to install and activate nodes
• Application Agent
  – AVAILABLE_PATHS
  – ACTIVE_NODES
  – ACTIVE_HOSTS
  – Functions provided by the Host agents
• Manager Agent
  – ACTIVE_HOSTS
  – Functions provided by the Application agents
[Figure: Manager; application agents AA1 (Net 1) and AA2 (Net 2); host agents HA1 and HA2; component agents A1 to A4 attached to nodes N1 to N4; clients on hosts H1, H2, H5, H6]
An Example
• Suppose that node N3 starts to work in a degraded manner
• The associated agent A3 notifies the upper level, AA1
• The agent AA1 checks the path availability on the controlled network
• Three different reconfiguration options are possible:
  – Continuing to work in a degraded manner
  – Temporarily bypassing node N3 and waiting for its restart
  – Activating a new node to substitute N3
[Figure: Manager; application agents AA1 (Net 1) and AA2 (Net 2); host agents HA1 and HA2; component agents A1 to A4 attached to nodes N1 to N4; clients on hosts H1, H2, H5, H6]
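The notification step of this example, in which a component agent reports a state change to the agent one level up, can be sketched as follows. The `Agent` class, its `inbox`, and the event string format are hypothetical; only the agent names, the HEALTH_STATE variable, and the Up/Degraded states come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Agent:
    """Hypothetical agent: a MIB of variables plus a link to the upper level."""
    name: str
    parent: Optional["Agent"] = None
    mib: Dict[str, str] = field(default_factory=dict)
    inbox: List[str] = field(default_factory=list)

    def set_health(self, state: str) -> None:
        # Record the new state in the MIB and report anything other than "Up".
        self.mib["HEALTH_STATE"] = state
        if state != "Up" and self.parent is not None:
            self.parent.notify(f"{self.name}:{state}")

    def notify(self, event: str) -> None:
        # A real agent would run its decision maker here; this one only logs.
        self.inbox.append(event)

# Usage: agent A3 (node N3) reports a degraded state to application agent AA1.
aa1 = Agent("AA1")
a3 = Agent("A3", parent=aa1, mib={"HEALTH_STATE": "Up"})
a3.set_health("Degraded")
```

After the call, `aa1.inbox` holds the single event `"A3:Degraded"`, which is the point at which AA1 would check path availability on its network.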
An Example
• Three different reconfiguration options are possible:
  – Continuing to work in a degraded manner
  – Temporarily bypassing node N3 and waiting for its restart
  – Activating a new node to substitute N3

Link or component status    Failure probability
Up state                    10^-3
Degraded state              10^-2
Restarted and new           5 * 10^-3

Policy options                P_F
Working in degraded manner    1.73848 * 10^-8
Restart node N3               5.19695 * 10^-9
Set-up a new path             4.77510 * 10^-8

• The best reconfiguration consists in restarting N3 (lowest P_F)
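The final selection step amounts to picking the policy with the smallest P_F. A minimal sketch using the values from the policy table, assuming P_F denotes a failure probability so that lower is better; the dictionary name `P_F` and the variable `best` are introduced here for illustration only.

```python
# P_F values copied from the policy table; lower is assumed to be better.
P_F = {
    "Working in degraded manner": 1.73848e-8,
    "Restart node N3": 5.19695e-9,
    "Set-up a new path": 4.77510e-8,
}

# Choose the policy option with the minimum P_F.
best = min(P_F, key=P_F.get)
```

Here `best` is `"Restart node N3"`, in agreement with the slide. How the P_F values themselves are derived from the component failure probabilities is the job of the models and is not reproduced in this sketch.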
Conclusions
• An architecture for dependability provision has been proposed. It is based on:
  – Lira
  – Model-based Decision Maker
• We concentrate on system reconfiguration as a consequence of faults (both software and hardware)
• Hierarchical approach
Future Work
• The Lira infrastructure has to be fault-tolerant itself
• Development of a Petri-net-based decision maker (combinatorial models are not able to handle complex scenarios)
  – Dependencies among components
  – Accounting for time
  – Repair of components
• Development of a prototype
  – Experimental measurements