Computer Science. Barricade: Defending Systems Against Operator Mistakes. Ricardo Bianchini. In collaboration with Fábio Oliveira, Andrew Tjang, Rich Martin, Thu Nguyen
Motivation Computer systems pervade our lives Work and leisure Enterprise systems: email, storage, database, etc. Internet services: e-commerce, social networks, etc. Cloud computing Systems need to be highly available, behave correctly Misbehavior and downtime can be costly Operator mistakes are a common source of such problems
“Google Glitch Briefly Disrupts World’s Search” By LIZ ROBBINS – The New York Times News Blog Jan 31, 2009 Google blacklisted all sites on the Web for 1 hour Results warned “This site may harm your computer” Sites would not load Cause: Operator mistake Operator added “/” to blacklist file Estimated cost: $2-3 million
Operator Mistakes Are Very Common Mistakes are responsible (at least partly) for 79% of DB administration problems [Oliveira’06] [Chart: causes of failures, avg of 3 Internet sites [Patterson’02]; categories: Operator, Hardware, Software, Overload; values shown: 0%, 15%, 34%, 51%] Examples of mistakes: misconfiguration, improper testing of changes, improper deployment, dissemination Fixing mistakes can be time-consuming
Our Idea What do we need? Mistake-aware management of systems Focus: multi-node systems with replicated components Goals Proactively defend the system against potential mistakes Confine mistakes to isolated, off-line subset of nodes Require less-experienced operators, lowering labor costs
Contributions A framework for mistake-aware management and a prototype system (Barricade) Two case studies Prototype 3-tier online auction service System that mimics our dept’s computing infrastructure (email, DNS, authentication) 64 live experiments with 20 volunteers, showing that mistake-aware management is effective Barricade contained 75 out of 82 observed mistakes
Outline Motivation Mistake-Aware Systems Management Framework overview Example Prototype Implementation: Barricade Evaluation Conclusions & Future Work
Mistake-Aware Management Our approach to mistake-aware management is: Monitor the operator’s actions Predict the expected cost of a mistake If cost is high, take nodes off-line and block actions Enforce testing of actions Lift blocks when tests confirm correctness Blocking mechanisms: command and file blocking Operators = scripts Key questions: What should be blocked and when? Can we make it all be un-intrusive?
Framework for MAM Modules: Task Prediction Module, Mistake Prediction Module, Cost Module, Diagnosis Module, Blocking Module, Testing Module; plus Monitors and Actuators For each task i, the expected cost of a mistake (ECM) is: $\mathrm{ECM}(task_i) = P(task_i) \cdot P(mistake \mid task_i) \cdot CM(task_i)$, where $CM(task_i)$ is the cost of a mistake in task i Blocking actions for task i: $\mathrm{ECM}(task_i) > \text{threshold}$ Lifting actions for task i: $\mathrm{ECM}(task_i) < \text{threshold}$
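A minimal sketch of the ECM rule above, assuming the three factors are already available as numbers. The names `TaskEstimate`, `expected_cost_of_mistake`, and `should_block`, and all numeric values, are hypothetical illustrations and not part of Barricade.

```python
from dataclasses import dataclass


@dataclass
class TaskEstimate:
    """Hypothetical container for the three factors of the ECM product."""
    name: str
    p_task: float           # P(task_i): probability the operator is performing task i
    p_mistake: float        # P(mistake | task_i): probability of a mistake given task i
    cost_of_mistake: float  # CM(task_i): cost of a mistake, in arbitrary units


def expected_cost_of_mistake(t: TaskEstimate) -> float:
    # ECM(task_i) = P(task_i) * P(mistake | task_i) * CM(task_i)
    return t.p_task * t.p_mistake * t.cost_of_mistake


def should_block(t: TaskEstimate, threshold: float) -> bool:
    # Block while the ECM exceeds the threshold; otherwise blocks can be lifted.
    return expected_cost_of_mistake(t) > threshold


# Made-up numbers, for illustration only.
add_app = TaskEstimate("add_app_server", p_task=0.5, p_mistake=0.25, cost_of_mistake=80.0)
ecm = expected_cost_of_mistake(add_app)  # 0.5 * 0.25 * 80.0 = 10.0
print(add_app.name, ecm, should_block(add_app, threshold=5.0))
```

As P(task_i) rises while the operator works, the product can cross the threshold even though the other two factors stay fixed, which is the situation the 3-tiered example walks through.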
Overview of a MAM System System engineer instantiates and configures the MAM system [Diagram: a Management Server holding the Task Prediction Model, Mistake Prediction Model, Cost Model, Diagnosis Model, Blocking Actions, and Test Procedures, connected to several Managed Servers, each with Monitors and an Actuator]
Overview of a MAM System Operators interact with modified shell at managed/target servers (site of operation) [Diagram: same Management Server and Managed Servers; arrow label: Shell commands, changes to persistent state, info for testing]
Overview of a MAM System Super-operator can bypass the blocks [Diagram: same Management Server and Managed Servers; arrow labels: Shell commands, changes to persistent state, info for testing; Blocking/Lifting Actions]
Example: 3-tiered Service Operator task: add an application server [Diagram: Web Servers 1 to m, Application Servers 1 to n plus the new Application Server n+1, and a Database Server] Actions: install, config, start appl server; reconfig Web servers Site of operation: appl server and 1 Web server at a time
Example: 3-tiered Service Operator task: add an application server P(add_app_server) increases
Example: 3-tiered Service Operator task: add an application server [Diagram label: Site of operation] P(add_app_server) increases
Example: 3-tiered Service Operator task: add an application server Expected cost of mistake for “add app server” > threshold
Example: 3-tiered Service Operator task: add an application server Expected cost of mistake for “add app server” > threshold Erect barricade for “add app server”: containment phase Block commands: startup and shutdown of server software Block files: configuration files of server software
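Command and file blocking can be pictured as two checks made before an operator action goes through. The sketch below is an assumption-laden illustration, not Barricade's implementation: the `Barricade` class, the command names, and the file pattern are all hypothetical, and it does not model which nodes a block applies to.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from typing import List, Set


@dataclass
class Barricade:
    """Hypothetical record of what is currently blocked for one task."""
    blocked_commands: Set[str] = field(default_factory=set)
    blocked_file_patterns: List[str] = field(default_factory=list)

    def allows_command(self, command_line: str) -> bool:
        # Compare only the program name (first token) against the block list.
        parts = command_line.split()
        return not parts or parts[0] not in self.blocked_commands

    def allows_file_write(self, path: str) -> bool:
        # A write is blocked if the path matches any blocked glob pattern.
        return not any(fnmatch(path, pat) for pat in self.blocked_file_patterns)


# Hypothetical command names and paths, for illustration only.
barricade = Barricade(
    blocked_commands={"start_server", "stop_server"},
    blocked_file_patterns=["/etc/appserver/*.conf"],
)
print(barricade.allows_command("start_server --port 8080"))       # blocked command
print(barricade.allows_command("ls -l /etc/appserver"))           # unrelated command
print(barricade.allows_file_write("/etc/appserver/server.conf"))  # blocked file
```

Lifting a block then amounts to removing the entry from the corresponding set or list once tests confirm correctness.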
Example: 3-tiered Service Operator task: add an application server Expected cost of mistake for “add app server” > threshold In background, run test procedures on site of operation: Check running processes; Verify consistency of config files; etc.
Example: 3-tiered Service Operator task: add an application server Operator starts working on “Web server m”
Example: 3-tiered Service Operator task: add an application server Operator starts working on “Web server m” Extend the barricade and the site of operation; take the Web server (WS) off-line Keep running tests until behavior on site of operation is “correct”
Example: 3-tiered Service Operator task: add an application server Operator is done working on site of operation AND all tests succeed Containment phase is over; dissemination phase begins Establish dissemination order and adjust barricade
Example: 3-tiered Service Operator task: add an application server [Diagram label: Dissemination target] Next dissemination target is Web server 2 In background, run test procedures on dissemination target
Example: 3-tiered Service Operator task: add an application server [Diagram label: Dissemination target] Operator is done working on dissemination target AND tests succeed Allow operator to proceed to next dissemination target Adjust barricade
Example: 3-tiered Service Operator task: add an application server [Diagram label: Dissemination target] Web server 1 is the next dissemination target Allow operator to proceed to next dissemination target Adjust barricade
Example: 3-tiered Service Operator task: add an application server [Diagram label: Dissemination target] Operator completes last dissemination step AND tests succeed Task is over Destroy barricade
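The dissemination phase in this example moves through the remaining Web servers one at a time and lets the operator proceed only when tests succeed on the current target. A rough sketch of that gating follows, assuming the change and the test procedures can be expressed as callables. The function `disseminate` and its parameters are hypothetical, and stopping at the first failing target is a simplification of "keep running tests until behavior is correct".

```python
from typing import Callable, Iterable, List


def disseminate(
    targets: Iterable[str],
    apply_change: Callable[[str], None],
    tests_pass: Callable[[str], bool],
) -> List[str]:
    """Apply a change to one dissemination target at a time.

    Returns the targets that completed; stops at the first target whose
    tests fail, so the change cannot spread further until it is fixed.
    """
    completed: List[str] = []
    for node in targets:
        apply_change(node)        # operator works on the current target
        if not tests_pass(node):  # test procedures on that target
            break                 # barricade stays up; no further dissemination
        completed.append(node)
    return completed


# Hypothetical run: order taken from the example (Web server 2, then Web server 1).
touched: List[str] = []
done = disseminate(
    ["web_server_2", "web_server_1"],
    apply_change=touched.append,
    tests_pass=lambda node: node != "web_server_1",  # pretend the last target fails
)
print(done, touched)
```

When the returned list covers every target, the task is over and the barricade can be destroyed.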
Example: 3-tiered Service Operator task: add an application server [Diagram: the service after the task, with no barricade shown]
Outline Motivation Mistake-Aware Systems Management Prototype Implementation: Barricade Evaluation Conclusions & Future Work