
Barricade: Defending Systems Against Operator Mistakes



  1. Computer Science. Barricade: Defending Systems Against Operator Mistakes. Ricardo Bianchini, in collaboration with Fábio Oliveira, Andrew Tjang, Rich Martin, Thu Nguyen

  2. Motivation
     - Computer systems pervade our lives
       - Work and leisure
       - Enterprise systems: email, storage, database, etc.
       - Internet services: e-commerce, social networks, etc.
       - Cloud computing
     - Systems need to be highly available and behave correctly
       - Misbehavior and downtime can be costly
     - Operator mistakes are a common source of such problems

  3. “Google Glitch Briefly Disrupts World’s Search”, by Liz Robbins, The New York Times News Blog, Jan 31, 2009
     - Google blacklisted all sites on the Web for 1 hour
       - Results warned “This site may harm your computer”
       - Sites would not load
     - Cause: operator mistake
       - Operator added “/” to blacklist file
     - Estimated cost: $2-3 million

  4. Operator Mistakes Are Very Common
     - Mistakes are responsible (at least partly) for 79% of DB administration problems [Oliveira’06]
     - [Pie chart: causes of problems, average of 3 Internet sites [Patterson’02]; categories: Operator, Hardware, Software, Overload; slice values: 0%, 34%, 51%, 15%]
     - Examples of mistakes: misconfiguration, improper testing of changes, improper deployment, dissemination
     - Fixing mistakes can be time-consuming

  5. Our Idea
     - What do we need?
       - Mistake-aware management of systems
       - Focus: multi-node systems with replicated components
     - Goals
       - Proactively defend the system against potential mistakes
       - Confine mistakes to an isolated, off-line subset of nodes
       - Require less-experienced operators, lowering labor costs

  6. Contributions
     - A framework for mistake-aware management and a prototype system (Barricade)
     - Two case studies
       - Prototype 3-tier online auction service
       - System that mimics our dept’s computing infrastructure (email, DNS, authentication)
     - 64 live experiments with 20 volunteers, showing that mistake-aware management is effective
       - Barricade contained 75 out of 82 observed mistakes

  7. Outline
     - Motivation
     - Mistake-Aware Systems Management
       - Framework overview
       - Example
     - Prototype Implementation: Barricade
     - Evaluation
     - Conclusions & Future Work

  8. Mistake-Aware Management
     - Our approach to mistake-aware management is:
       - Monitor the operator’s actions
       - Predict the expected cost of a mistake
       - If the cost is high, take nodes off-line and block actions
       - Enforce testing of actions
       - Lift blocks when tests confirm correctness
     - Blocking mechanisms: command and file blocking
     - Operators = scripts
     - Key questions: What should be blocked, and when? Can we make it all unintrusive?

  9. Framework for MAM
     - [Diagram: Task Prediction Module, Mistake Prediction Module, Cost Module, Blocking Module, Diagnosis Module, Testing Module, Monitors, Actuators]
     - For each task i, the expected cost of a mistake (ECM) is:
       $$\mathrm{ECM}(task_i) = P(task_i) \times P(mistake \mid task_i) \times CM(task_i)$$
     - Blocking actions for task i: $\mathrm{ECM}(task_i) > \text{threshold}$
     - Lifting actions for task i: $\mathrm{ECM}(task_i) \le \text{threshold}$
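The ECM rule on this slide can be worked through with a few lines of Python. This is only an illustrative sketch: the `TaskEstimate` container, the probabilities, the cost units, and the threshold value are all made-up assumptions, and CM is read here as the cost of a mistake for the task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEstimate:
    """Hypothetical container for the three factors of the ECM product."""
    name: str
    p_task: float                # P(task_i): probability the operator is doing this task
    p_mistake_given_task: float  # P(mistake | task_i)
    cost_of_mistake: float       # CM(task_i), in arbitrary cost units (assumed)


def expected_cost_of_mistake(t: TaskEstimate) -> float:
    """ECM(task_i) = P(task_i) * P(mistake | task_i) * CM(task_i)."""
    return t.p_task * t.p_mistake_given_task * t.cost_of_mistake


def should_block(t: TaskEstimate, threshold: float) -> bool:
    """Block when the ECM exceeds the threshold; otherwise blocks may be lifted."""
    return expected_cost_of_mistake(t) > threshold


# Illustrative numbers only; none of them come from the talk.
add_app_server = TaskEstimate("add_app_server", 0.5, 0.25, 100.0)
print(expected_cost_of_mistake(add_app_server))        # 0.5 * 0.25 * 100
print(should_block(add_app_server, threshold=10.0))
```

With these assumed numbers the product is 12.5, which is above the assumed threshold of 10, so the task's actions would be blocked.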

  10. Overview of a MAM System
      - System engineer instantiates and configures the MAM system
      - [Diagram: a management server holding the Task Prediction Model, Mistake Prediction Model, Cost Model, Diagnosis Model, Blocking Actions, and Test Procedures; three managed servers, each with Monitors and an Actuator]

  11. Overview of a MAM System
      - Operators interact with a modified shell at the managed/target servers (site of operation)
      - [Diagram: management server and three managed servers with Monitors and Actuators; added label: shell commands, changes to persistent state, info for testing]

  12. Overview of a MAM System
      - Super-operator can bypass the blocks
      - [Diagram: management server and three managed servers with Monitors and Actuators; labels: shell commands, changes to persistent state, info for testing; Blocking/Lifting Actions]

  13. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web Server 1, Web Server 2, …, Web Server m; Application Server 1, Application Server 2, …, Application Server n, Application Server n+1; Database Server]
      - Actions: install, configure, and start the application server; reconfigure the Web servers
      - Site of operation: the application server and 1 Web server at a time

  14. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web servers 1 to m, application servers 1 to n+1, database server]
      - P(add_app_server) increases

  15. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web servers 1 to m, application servers 1 to n+1, database server; site of operation marked]
      - P(add_app_server) increases

  16. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web servers 1 to m, application servers 1 to n+1, database server; site of operation marked]
      - Expected cost of mistake for “add app server” > threshold

  17. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web servers 1 to m, application servers 1 to n+1, database server; site of operation and barricade marked]
      - Expected cost of mistake for “add app server” > threshold
      - Erect barricade for “add app server”: containment phase
        - Block commands: startup and shutdown of server software
        - Block files: configuration files of server software
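To make command and file blocking concrete, here is a minimal sketch of a barricade as an in-memory blocklist. It is not how Barricade is implemented; the `Barricade` class, the command names (`appserver-start`, `appserver-stop`), and the configuration path are hypothetical placeholders for "startup and shutdown of server software" and "configuration files of server software".

```python
from dataclasses import dataclass, field


@dataclass
class Barricade:
    """Hypothetical blocklist erected for one task."""
    blocked_commands: set = field(default_factory=set)
    blocked_files: set = field(default_factory=set)

    def allows_command(self, command_line: str) -> bool:
        # Only the program name (first token) is compared against the blocklist.
        tokens = command_line.split()
        program = tokens[0] if tokens else ""
        return program not in self.blocked_commands

    def allows_write(self, path: str) -> bool:
        # A file is writable unless it is on the blocklist.
        return path not in self.blocked_files

    def lift(self) -> None:
        # Destroying the barricade removes every block.
        self.blocked_commands.clear()
        self.blocked_files.clear()


# Hypothetical command names and path, chosen only for illustration.
barricade = Barricade(
    blocked_commands={"appserver-start", "appserver-stop"},
    blocked_files={"/etc/appserver/server.conf"},
)
print(barricade.allows_command("appserver-start --port 8080"))
print(barricade.allows_command("ls -l"))
print(barricade.allows_write("/etc/appserver/server.conf"))
```

A modified shell could consult such a check before executing a command or opening a file for writing; a super-operator bypass would skip the check altogether.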

  18. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web servers 1 to m, application servers 1 to n+1, database server; site of operation and barricade marked]
      - Expected cost of mistake for “add app server” > threshold
      - In background, run test procedures on site of operation: check running processes; verify consistency of config files; etc.
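The background test procedures can be pictured as a set of named checks run against a snapshot of the node under test. The sketch below is an assumption-laden illustration: the snapshot is a plain dictionary, and its keys, the two checks, and the server names are invented for the example rather than taken from the talk.

```python
from typing import Callable, Dict, List

# A test procedure takes a node snapshot and reports pass (True) or fail (False).
TestProcedure = Callable[[dict], bool]


def required_processes_running(node: dict) -> bool:
    """Every process the node needs must appear among the running ones."""
    return set(node["required_processes"]) <= set(node["running_processes"])


def config_is_consistent(node: dict) -> bool:
    """Every back end named in the config must be a server known to exist."""
    return set(node["configured_backends"]) <= set(node["known_backends"])


def failing_tests(node: dict, tests: Dict[str, TestProcedure]) -> List[str]:
    """Return the names of the test procedures that currently fail."""
    return [name for name, test in tests.items() if not test(node)]


TESTS = {
    "processes": required_processes_running,
    "config": config_is_consistent,
}

# Hypothetical snapshot: the config references a server that does not exist.
snapshot = {
    "required_processes": ["appserver"],
    "running_processes": ["sshd", "appserver"],
    "configured_backends": ["app1", "app2", "app9"],
    "known_backends": ["app1", "app2", "app3"],
}
print(failing_tests(snapshot, TESTS))
```

Blocks would stay in place while the returned list is non-empty and could be lifted once it is empty.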

  19. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web servers 1 to m, application servers 1 to n+1, database server; site of operation and barricade marked]
      - Operator starts working on “Web server m”

  20. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web servers 1 to m, application servers 1 to n+1, database server; site of operation and barricade marked]
      - Operator starts working on “Web server m”
      - Extend the barricade and the site of operation; take WS off-line
      - Keep running tests until behavior on site of operation is “correct”

  21. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web servers 1 to m, application servers 1 to n+1, database server; site of operation and barricade marked]
      - Operator is done working on site of operation AND all tests succeed
      - Containment phase is over; dissemination phase begins
      - Establish dissemination order and adjust barricade

  22. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web servers 1 to m, application servers 1 to n+1, database server; dissemination target and barricade marked]
      - Next dissemination target is Web server 2
      - In background, run test procedures on dissemination target

  23. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web servers 1 to m, application servers 1 to n+1, database server; dissemination target and barricade marked]
      - Operator is done working on dissemination target AND tests succeed
      - Allow operator to proceed to next dissemination target
      - Adjust barricade

  24. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web servers 1 to m, application servers 1 to n+1, database server; dissemination target and barricade marked]
      - Web server 1 is the next dissemination target
      - Allow operator to proceed to next dissemination target
      - Adjust barricade

  25. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web servers 1 to m, application servers 1 to n+1, database server; dissemination target and barricade marked]
      - Operator completes last dissemination step AND tests succeed
      - Task is over
      - Destroy barricade
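The walkthrough on the preceding slides amounts to a small state machine: containment, then dissemination one target at a time, then completion. The sketch below is one possible rendering under assumptions; the `Phase` and `TaskProgress` names and the server identifiers are hypothetical, and the only rule taken from the slides is that a step ends when the operator is done AND the tests succeed.

```python
from enum import Enum, auto


class Phase(Enum):
    CONTAINMENT = auto()    # work confined to the site of operation
    DISSEMINATION = auto()  # change rolled out to one target at a time
    DONE = auto()           # task over, barricade destroyed


class TaskProgress:
    """Hypothetical tracker for one barricaded task."""

    def __init__(self, dissemination_order):
        self.phase = Phase.CONTAINMENT
        self.remaining = list(dissemination_order)

    @property
    def current_target(self):
        # There is a dissemination target only during the dissemination phase.
        if self.phase is Phase.DISSEMINATION:
            return self.remaining[0]
        return None

    def step_finished(self, operator_done: bool, tests_pass: bool) -> Phase:
        # Advance only when the operator is done AND all tests succeed.
        if self.phase is Phase.DONE or not (operator_done and tests_pass):
            return self.phase
        if self.phase is Phase.CONTAINMENT:
            self.phase = Phase.DISSEMINATION if self.remaining else Phase.DONE
        else:
            self.remaining.pop(0)  # current target is finished
            if not self.remaining:
                self.phase = Phase.DONE
        return self.phase


# Order from the example: Web server 2 first, then Web server 1.
progress = TaskProgress(["web_server_2", "web_server_1"])
print(progress.step_finished(operator_done=True, tests_pass=False))
print(progress.step_finished(operator_done=True, tests_pass=True))
print(progress.current_target)
```

Each phase change is the point at which the barricade would be adjusted, and reaching `Phase.DONE` corresponds to destroying it.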

  26. Example: 3-tiered Service
      - Operator task: add an application server
      - [Diagram: Web servers 1 to m, application servers 1 to n+1, database server]

  27. Outline
      - Motivation
      - Mistake-Aware Systems Management
      - Prototype Implementation: Barricade
      - Evaluation
      - Conclusions & Future Work
