University of Paderborn, Software Engineering Group, Prof. Dr. Wilhelm Schäfer
Computing Optimal Self-Repair Actions: Damage Minimization versus Repair Time
Matthias Tichy, Holger Giese, Daniela Schilling, Wladimir Pauls
Daniela Schilling – May 2005

Motivation
www.railcab.de

Motivation
• Redundant implementations of important software components
[Figure: deployment diagram with components vot:Voter, pc1:Position Calculation, pc2:Position Calculation, pc3:Position Calculation, cc:Convoy, mul:Multiplier, gps:GPS-Controller on nodes Taliesin, Avalon, Uther, Gareth, Gorlois, Arthur]
• Required: reconfiguration
• Given: an automatic mechanism to detect failed components
• Self-Repair Actions: automatic calculation of redeployment for failed components

Initial Deployment
[Figure: deployment diagram with pc1:Position Calculation and pc2:Position Calculation, nodes Node1 and Node2, annotation pc1.mem=2.0Mb]
• Map deployment constraints given as extended UML Deployment Diagrams to inequalities over boolean and integer variables
• Use constraint solver to calculate initial deployment
WOSS/FSE 2004: Matthias Tichy, Daniela Schilling, Holger Giese: Design of Self-Managing Dependable Systems with UML and Fault Tolerance Patterns

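The mapping to inequalities over boolean variables can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: only pc1.mem = 2.0 Mb comes from the slide, while the memory demand of pc2, both node capacities, and the brute-force search standing in for a real constraint solver are hypothetical.

```python
from itertools import product

# Hypothetical data: only pc1's 2.0 Mb demand appears on the slide.
component_mem = {"pc1": 2.0, "pc2": 1.5}   # memory demand in Mb (pc2 assumed)
node_mem = {"Node1": 2.0, "Node2": 2.5}    # memory capacity in Mb (assumed)

def feasible_deployments(component_mem, node_mem):
    """Enumerate assignments x[c][n] in {0, 1} that satisfy
         sum over n of x[c][n] == 1                (each component on one node)
         sum over c of x[c][n] * mem[c] <= cap[n]  (memory capacity per node)
    """
    comps = sorted(component_mem)
    nodes = sorted(node_mem)
    results = []
    # Picking exactly one node per component enforces the equality constraint.
    for choice in product(nodes, repeat=len(comps)):
        load = {n: 0.0 for n in nodes}
        for c, n in zip(comps, choice):
            load[n] += component_mem[c]
        # Keep the assignment only if every capacity inequality holds.
        if all(load[n] <= node_mem[n] for n in nodes):
            results.append(dict(zip(comps, choice)))
    return results

deployments = feasible_deployments(component_mem, node_mem)
```

With these assumed numbers, the two components cannot share a node, so only the two split placements remain. A real system would hand the same inequalities to a constraint solver instead of enumerating.
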
Online Redeployment
• Node crash failure ⇒ all components running on this node fail too
• Compute Self-Repair Action
  - Find suitable nodes to redeploy failed components
  - How to find suitable nodes?
  - What to do if there is no suitable node?
• Redeploy further (still running) components
• Damage: negative effects of unavailable components
• Goal: minimize costs
  - Keep damage as low as possible
  - Reduce solving time
[Figure: cost diagram with labels: costs, damage, failed components, components to be migrated, time, calculate redeployment, perform redeployment]

Online Redeployment - 1. Solution -
• Remove crashed nodes from constraint system
• Solve complete constraint system again
[Figure: damage vs. time plot]

Online Redeployment - 2. Solution -
• Remove crashed nodes from constraint system
• Add objective function (minimize damage caused by migration of running components) to the constraint system
• Solve complete system again
[Figure: damage vs. time plot]

Online Redeployment - Our Approach -
• Remove crashed nodes from constraint system
• Add objective function (minimize damage) to the constraint system
• Try to solve constraint systems for failed components only
• Until a solution is found: extend set of components that have to be redeployed/migrated
  - Use constraint solver
  - Heuristic approach

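The incremental loop described above can be sketched as follows. This is a simplified illustration, not the authors' algorithm: a brute-force search stands in for the constraint solver, memory is the only constraint, the damage objective is omitted, and all component names, node names, and numbers are hypothetical.

```python
from itertools import product

def solve(to_place, fixed, comp_mem, node_cap):
    """Place `to_place` on the given nodes while `fixed` components stay put.
    Returns a placement dict, or None if the submodel is not solvable."""
    nodes = sorted(node_cap)
    free = dict(node_cap)
    for c, n in fixed.items():
        free[n] -= comp_mem[c]          # capacity already used by fixed components
    comps = sorted(to_place)
    for choice in product(nodes, repeat=len(comps)):
        load = {n: 0 for n in nodes}
        for c, n in zip(comps, choice):
            load[n] += comp_mem[c]
        if all(load[n] <= free[n] for n in nodes):
            return dict(zip(comps, choice))
    return None

def self_repair(deployment, crashed, comp_mem, node_cap, candidates):
    """Start with the failed components only and extend the set until solvable."""
    cap = {n: m for n, m in node_cap.items() if n != crashed}  # drop crashed node
    to_place = [c for c, n in deployment.items() if n == crashed]
    remaining = [c for c in candidates if c not in to_place]
    while True:
        fixed = {c: n for c, n in deployment.items()
                 if c not in to_place and n != crashed}
        placement = solve(to_place, fixed, comp_mem, cap)
        if placement is not None or not remaining:
            return to_place, placement
        to_place.append(remaining.pop(0))   # heuristic order given by `candidates`

# Hypothetical scenario: node A crashes and x does not fit anywhere on its own.
moved, placement = self_repair(
    deployment={"x": "A", "y": "B", "z": "C"},
    crashed="A",
    comp_mem={"x": 2, "y": 1, "z": 1},
    node_cap={"A": 2, "B": 2, "C": 2},
    candidates=["y", "z"],
)
```

In this scenario the first submodel (x alone) is not solvable, so the running component y is added and migrated to make room. Solving small submodels first is what keeps the solving time low compared with re-solving the complete system.
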
Online Redeployment - Our Approach -
[Figure: damage vs. time plot]

Choosing Components for Redeployment
• Example: 3 redundant copies of important components
• Algorithm:
  - Try to redeploy failed component
  - Until redeployment is possible:
    1. Choose components which are not redundant copies of failed components
    2. Choose components where only one of three redundant copies has already failed
    3. Choose arbitrary components

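The three-step selection order can be sketched as a sort key over the running components. This is one possible reading of the slide, not the authors' code; the function name, the redundancy groups, and the component names q1 to q3 are hypothetical.

```python
def expansion_order(running, failed, group_of):
    """Order running components by the three selection steps:
    1. components that are not redundant copies of a failed component,
    2. components whose group has exactly one failed copy,
    3. all remaining components."""
    failed_per_group = {}
    for c in failed:
        g = group_of[c]
        failed_per_group[g] = failed_per_group.get(g, 0) + 1

    def tier(c):
        k = failed_per_group.get(group_of[c], 0)
        if k == 0:
            return 1
        if k == 1:
            return 2
        return 3

    # sorted() is stable, so components keep their input order within a tier.
    return sorted(running, key=tier)

# Hypothetical redundancy groups; "pc" and "q" have three copies each.
group_of = {"pc1": "pc", "pc2": "pc", "pc3": "pc",
            "q1": "q", "q2": "q", "q3": "q", "cc": "cc"}
order = expansion_order(running=["pc2", "pc3", "cc", "q3"],
                        failed=["pc1", "q1", "q2"],
                        group_of=group_of)
```

Here cc is unrelated to any failure and is tried first, the remaining pc copies follow, and q3 comes last because it is the only surviving copy of its group.
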
Experiment
• Scenario:
  - 36 nodes with 114 links
  - 72 components with 99 connectors
  - 5 node-specific (CPU, OS, Memory, Utilization, HDD) and 2 link-specific (Bandwidth, Loss) deployment restrictions
  - set of deployment constraints on components and connectors
• Experiment:
  - Randomly selected a node and let it fail

Experimental Results

Test   1. Solution            2. Solution            Our Algorithm
Nr.    Time (ms)   Damage     Time (ms)   Damage     Time (ms)   Damage
1      13630       773        > 1h        N/A        50          7
2      14890       97         56060       29         30          30
3      13790       4          14920       1          10          5
4      13660       34         16430       31         50          34

[Figure: damage vs. time plot]

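To make the trade-off in the table concrete, the following sketch computes two derived figures per test: how much faster the presented algorithm is than the 1. Solution, and how much extra damage it accepts compared with the 2. Solution. The measurements are copied from the table; the dictionary layout and function names are illustrative choices.

```python
# Measurements copied from the results table: (time in ms, damage).
# None marks the run that exceeded one hour (damage N/A).
results = {
    1: {"solution1": (13630, 773), "solution2": None,        "ours": (50, 7)},
    2: {"solution1": (14890, 97),  "solution2": (56060, 29), "ours": (30, 30)},
    3: {"solution1": (13790, 4),   "solution2": (14920, 1),  "ours": (10, 5)},
    4: {"solution1": (13660, 34),  "solution2": (16430, 31), "ours": (50, 34)},
}

def speedup_over_solution1(row):
    """Solving time of the 1. Solution divided by the time of our algorithm."""
    return row["solution1"][0] / row["ours"][0]

def extra_damage_over_solution2(row):
    """Damage of our algorithm minus damage of the 2. Solution, if available."""
    if row["solution2"] is None:
        return None
    return row["ours"][1] - row["solution2"][1]

speedups = {t: speedup_over_solution1(r) for t, r in results.items()}
extra_damage = {t: extra_damage_over_solution2(r) for t, r in results.items()}
```

For test 3, for example, the time ratio is 13790 / 10 = 1379, while the damage is 4 points higher than with the 2. Solution.
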
Conclusion & Future Work
• Algorithm to calculate optimal self-repair actions
• Deployment constraints solved by standard constraint solver
• The experiment showed that the algorithm is nearly optimal in damage minimization and time consumption
• Not presented: pre-solving step
• Communication and monitoring framework
• Describe repair rules by graph transformation systems

Appendix

Simple Redeployment
[Figure: deployment diagram with components vot:Voter, pc1:Position Calculation, pc2:Position Calculation, pc3:Position Calculation, cc:Convoy, mul:Multiplier, gps:GPS-Controller on nodes Taliesin, Avalon, Uther, Gareth, Gorlois, Arthur]

Example
[Figure: deployment diagram with components vot:Voter, pc1:Position Calculation, pc2:Position Calculation, pc3:Position Calculation, cc:Convoy, mul:Multiplier, gps:GPS-Controller on nodes Taliesin, Avalon, Uther, Gareth, Gorlois, Arthur, each annotated with a memory value (Mem) between 0.25Mb and 2.5Mb]

Damage Calculation
[Figure: nodes n1 to n5 hosting components C1 to C5, two of them annotated with damage=13]
• damage: all=13, 2of3=4, 1of3=1

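The damage values can be read as a lookup on the number of unavailable copies in a group of three redundant components. The sketch below follows that reading, which is an assumption about the slide; the zero entry for "no copy unavailable", the function name, and the component names are hypothetical.

```python
# Damage per redundancy group, indexed by the number of unavailable copies.
# 1of3=1, 2of3=4 and all=13 come from the slide; the 0 entry is assumed.
DAMAGE_BY_UNAVAILABLE_COPIES = {0: 0, 1: 1, 2: 4, 3: 13}

def total_damage(unavailable, groups):
    """Sum the damage over all groups of three redundant copies."""
    total = 0
    for members in groups.values():
        k = sum(1 for c in members if c in unavailable)
        total += DAMAGE_BY_UNAVAILABLE_COPIES[k]
    return total

# Hypothetical groups of three redundant copies each.
groups = {"pc": ["pc1", "pc2", "pc3"], "q": ["q1", "q2", "q3"]}
partial = total_damage({"pc1", "q1", "q2"}, groups)   # 1 of 3 plus 2 of 3
complete = total_damage({"pc1", "pc2", "pc3"}, groups)  # all copies of "pc"
```

The steep jump from 4 to 13 expresses that losing the last copy of a redundant component is far worse than losing the first.
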
Submodel Expansion
[Figure: stepwise submodel expansion over components a to g, split into failed and running components, with columns "Submodel", "Consider" and "Consider later". Starting from the initial situation, the steps are labeled: 1) submodel not solvable, 2) redundant copies, 3) not related, 4) submodel not solvable]

Submodel Expansion (2)
[Figure: continuation of the submodel expansion over components a to g, split into failed and running components. The steps are labeled: 4) submodel not solvable, 5) redundant copies, 6) (no label), 7) submodel solvable]

Pre-Solving

Foundations (TMR)
• Use fault tolerance techniques to ensure dependability
• Triple Modular Redundancy (TMR)
[Figure: TMR structure with :Provider, :Multiplier, :Component1, :Component2, :Component3, :Voter, :User]

Foundations (TMR)
• Deployment constraints for TMR
  - Avoid single-point-of-failure of voter / multiplier -> Deploy voter and user to same node (if the user fails, the failure of the voter is no problem)
  - Avoid crash failures -> Deploy redundant components to distinct nodes
  - Heterogeneous hardware platform -> require different CPU: { Node3.CPU ≠ Node4.CPU ∧ Node4.CPU ≠ Node5.CPU ∧ Node3.CPU ≠ Node5.CPU }
[Figure: deployment diagram with Node1 and Node2 (:Provider, :Multiplier, :Voter, :User) and Node3, Node4, Node5 (:Component1, :Component2, :Component3)]

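The three TMR deployment constraints can be checked against a concrete deployment with a few comparisons. This is a minimal sketch under stated assumptions: co-locating provider and multiplier mirrors the voter/user rule and is inferred rather than stated on the slide, and the CPU type names and the function name are hypothetical.

```python
def satisfies_tmr_constraints(deployment, node_cpu):
    """Check a component-to-node mapping against the TMR deployment constraints."""
    copies = ["Component1", "Component2", "Component3"]
    copy_nodes = [deployment[c] for c in copies]
    return all([
        # Voter and user share a node (stated on the slide).
        deployment["Voter"] == deployment["User"],
        # Provider and multiplier share a node (assumed by analogy).
        deployment["Provider"] == deployment["Multiplier"],
        # Redundant components run on pairwise distinct nodes.
        len(set(copy_nodes)) == 3,
        # Those nodes have pairwise different CPU types.
        len({node_cpu[n] for n in copy_nodes}) == 3,
    ])

# Hypothetical CPU types; Node1 and Node2 are not constrained by CPU here.
node_cpu = {"Node1": "cpuA", "Node2": "cpuA",
            "Node3": "cpuA", "Node4": "cpuB", "Node5": "cpuC"}
good = {"Provider": "Node1", "Multiplier": "Node1",
        "Voter": "Node2", "User": "Node2",
        "Component1": "Node3", "Component2": "Node4", "Component3": "Node5"}
# Same deployment, but two redundant copies share Node3.
bad = dict(good, Component2="Node3")

ok_good = satisfies_tmr_constraints(good, node_cpu)
ok_bad = satisfies_tmr_constraints(bad, node_cpu)
```

In a constraint system these checks become the inequalities that the solver has to satisfy, rather than a test applied after the fact.
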
Questions?

Online Redeployment - Our Solution -
• Compute Self-Repair Action
  - Find suitable nodes to redeploy failed components
  - How to find suitable nodes?
  - What to do if there is no suitable node?
• 2) Redeploy further (still running) components
• Goal: reduce costs
  - Redeployment should not decrease dependability (reduce damage)
  - Reduce solving time