Introduction to Autonomic Computing Johan Tordsson Department of Computing Science www.cloudresearch.org
About me • MSc (Civ.Ing) Computer Science (2004) • PhD Umeå, Grid computing (2009) • Postdoc in Madrid, Spain (2009), OpenNebula • Architect etc. in misc. EC projects (2009-2013) • Associate professor (2014 - now) • Research – Autonomic cloud and data center management – How to make clouds run themselves faster/better/cheaper? • Spare time job: – CTO & co-founder of Elastisys (UMU cloud research spinoff) – Evangelizing that computers (will) beat humans at IT operations
Outline • Why – do we need autonomic computing? • What – are autonomic systems? • How – to build these autonomic systems? • When – will they happen? • Who – will build them?
Motivation: software complexity
Motivation: scale • Enormous buildings with servers, storage equipment, networking, and cooling • A factory for IT services
Motivation: faults Question: what is the probability of a hard drive failure? In my laptop? Will happen every few years, hopefully not right now… In a large supercomputer or data center? More than 100k nodes Will happen during this talk!
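The question on this slide can be made concrete with a back-of-the-envelope calculation. The sketch below assumes independent drives with a constant failure rate; the 2% annualized failure rate, the one-hour talk, and the one-drive-per-node count are illustrative assumptions, not figures from the slides.

```python
import math


def prob_at_least_one_failure(n_drives: int, afr: float, hours: float) -> float:
    """Probability that at least one of n_drives fails within `hours`.

    Assumes independent drives with a constant failure rate derived from
    the annualized failure rate (AFR). Both are simplifications.
    """
    hours_per_year = 24 * 365
    rate_per_hour = -math.log(1.0 - afr) / hours_per_year  # per-drive hazard rate
    return 1.0 - math.exp(-n_drives * rate_per_hour * hours)


ASSUMED_AFR = 0.02  # hypothetical 2% annual failure rate, not from the slides

laptop = prob_at_least_one_failure(1, ASSUMED_AFR, hours=1)
data_center = prob_at_least_one_failure(100_000, ASSUMED_AFR, hours=1)
print(f"one laptop drive, one hour: {laptop:.6f}")
print(f"100k drives, one hour:      {data_center:.3f}")
```

The probability grows quickly with the number of drives, so nodes with several drives each, or a longer time window, push the data center case towards certainty.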
Motivation: costs • Question: How many servers can be handled by a system administrator? • Very old question… • Some numbers: – 10 - very complex systems – ~300 - standard large-scale organization – Several 1000s – virtualized data center – 26k (Facebook 2013) • Higher-level management and better abstractions are needed – Alternative: exponential increase in need for systems management
Autonomic option • Autonomic computing – Named after autonomic nervous system – Systems manage themselves according to admin goals – Self-governing operation of entire system, not just parts of it – New components integrate effortlessly - as a new cell establishes itself in the body
Autonomic Computing • IBM initiative in the early 2000s • Landmark paper published in 2003 in IEEE Computer by Kephart and Chess @ IBM • Active research field since then; during 2003-2013: – 200 conferences/workshops – 8000+ papers • Lots of funding – EC FP6, FP7, H2020 – WASP… • Industry uptake – Many big IT vendors & startups • Key point – Self-management of IT systems
Self-management (1/3) • Self-management – Changing components – External conditions – Hardware/software failures • Ex. component upgrade – Continually check for component upgrades – Download and install – Reconfigure itself – Run a regression test – When it detects errors, revert to the older version
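The component-upgrade example above can be sketched as a single upgrade cycle with rollback. This is a minimal illustration, not a real package manager: `Component`, `self_upgrade`, and the stand-in `install` function are hypothetical names, and the installer and regression test are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Component:
    """Hypothetical managed component; only its version matters here."""
    name: str
    version: str


def self_upgrade(
    component: Component,
    latest_version: str,
    install: Callable[[Component, str], None],
    regression_test: Callable[[Component], bool],
) -> str:
    """Run one upgrade cycle and report what happened."""
    if latest_version == component.version:
        return "up-to-date"              # nothing new to install
    previous = component.version
    install(component, latest_version)   # download, install, reconfigure
    if regression_test(component):
        return "upgraded"
    install(component, previous)         # errors detected: revert
    return "reverted"


def install(component: Component, version: str) -> None:
    """Stand-in installer that only records the new version."""
    component.version = version


db = Component("database", "1.0")
# Pretend that version 2.0 fails its regression test.
outcome = self_upgrade(db, "2.0", install, lambda c: c.version != "2.0")
print(outcome, db.version)
```

In a self-managing system this cycle would run continually, with the regression test acting as the safeguard that lets the element upgrade itself without an administrator watching.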
Self-management (2/3) • Four aspects of self-management – Self-configuration • Configure themselves automatically • High-level policies ( what is desired, not how ) – Self-optimization • Continually seek ways to improve their operation • Hundreds of tunable parameters – Self-healing • Handle faults and errors • Analyze information from logs and monitors – Self-protection • Malicious attacks • Cascading failures • Admin mistakes
Self-management (3/3) [Images: HAL 9000 (2001); Terminator] • Autonomic computing achievable without self-awareness? – Without hard artificial intelligence • (Hollywood) Misconception: machines will take over all human tasks – AI could be a “real danger” (S. Hawking) – Unemployment? • Actual idea: Machines will free people to manage systems at a higher level
Autonomic elements [Figure: autonomic manager (Monitor, Analyze, Plan, Execute around shared Knowledge) on top of a managed element] • Fundamental atom of the architecture – Managed element(s) • Server, database, storage system, etc. – Autonomic manager • Responsible for: – Providing its service – Managing behavior according to goals – Interacting with other autonomic elements
Autonomic element details [Figure: autonomic manager (Monitor, Analyze, Plan, Execute around shared Knowledge) above the managed element, with sensors and effectors on both] • Sensors: monitor environment • Effectors: tune managed element • MAPE loop: – Process for self-management of autonomic element
The MAPE loop 1. Monitor: – Collect information about state of system – Many metrics available – Which ones to gather? – How often to monitor? 4. Execute – Turn the “knobs” of the managed element – Interactions between knobs? • Unknown, even to human operators • At Google, 238 knobs in each managed entity
The MAPE loop (cont.) 2. Analyze – Estimate current state based on monitoring data – Commonly use model of the world for this • “All models are wrong, but some are useful” • What part of system to model? How? • Correlations? 3. Plan – Select action(s), i.e., which knobs to turn? – Can be formulated as optimization problem – Reactive vs. Predictive/Proactive methods • Knowledge management – Update model dynamically (monitoring) – Evaluate effects of actions (execution)
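The four MAPE steps and the shared knowledge can be sketched as one small control loop. The example below is a toy under stated assumptions: the managed element is a hypothetical server pool with one sensor (load) and one knob (server count), and the capacity and target-utilization numbers are invented for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ManagedElement:
    """Hypothetical server pool with one sensor and one effector."""
    servers: int = 2
    load: float = 0.0  # requests per second arriving at the pool

    def sense(self) -> float:                 # sensor
        return self.load

    def set_servers(self, n: int) -> None:    # effector (the "knob")
        self.servers = max(1, n)


@dataclass
class Knowledge:
    """Model parameters and state shared by all four steps (assumed values)."""
    capacity_per_server: float = 100.0  # requests/s one server can handle
    target_utilization: float = 0.5
    smoothed_load: float = 0.0
    alpha: float = 0.5                  # weight of the newest sample


class AutonomicManager:
    def __init__(self, element: ManagedElement, knowledge: Knowledge):
        self.element = element
        self.knowledge = knowledge

    def monitor(self) -> float:
        return self.element.sense()

    def analyze(self, sample: float) -> float:
        # Estimate current load with exponential smoothing; the estimate is
        # written back to the knowledge base.
        k = self.knowledge
        k.smoothed_load = k.alpha * sample + (1 - k.alpha) * k.smoothed_load
        return k.smoothed_load

    def plan(self, estimated_load: float) -> int:
        # Choose the smallest pool that keeps utilization at or below target.
        k = self.knowledge
        return math.ceil(estimated_load / (k.capacity_per_server * k.target_utilization))

    def execute(self, servers: int) -> None:
        self.element.set_servers(servers)

    def step(self) -> None:
        self.execute(self.plan(self.analyze(self.monitor())))


pool = ManagedElement(load=400.0)
manager = AutonomicManager(pool, Knowledge())
for _ in range(3):
    manager.step()
    print(pool.servers)
```

This loop is reactive: it only responds to load it has already observed. A predictive variant would replace the smoothing in `analyze` with a forecast of future load.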
Engineering challenges (1/3) • Life cycle of an autonomic element – Design, test, and verification • Testing autonomic elements a challenge – Installation and configuration • Element registers itself in a directory service – Monitoring and problem determination • Elements will continually monitor themselves • Adaptation, optimization, reconfiguration – Upgrading – Uninstallation or replacement
Engineering challenges (2/3) • Relationships among autonomic elements – Specification • Set of output/input services of autonomic elements • Expressed in a standard format • Description syntax and semantics – Location • Find input services that autonomic element needs – Negotiation – Provision – Operation • Autonomic manager oversees the operation – Termination
Engineering challenges (3/3) • System-wide issues – Authentication, encryption, signing – Autonomic elements can identify themselves – Autonomic system must be robust against insidious forms of attack • Goal specification – Humans provide the goals and constraints – Ensure that goals are specified correctly in the first place – Autonomic systems need to protect themselves from bad input goals: • Inconsistent, implausible, dangerous, or unrealizable
Specifying goals (1/3) • Rules – Often simple condition-action pairs • If something happens, do this • If something else happens, do that • … – Can use more complex languages to express states, context, etc. – Explicit enumeration tedious – Very limited ability to express complex actions
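Condition-action rules of this kind can be sketched as a list of predicates paired with action names. The metric names, thresholds, and action strings below are hypothetical and chosen only to illustrate the pattern.

```python
from typing import Callable, NamedTuple, Optional


class Rule(NamedTuple):
    condition: Callable[[dict], bool]  # predicate over the monitored state
    action: str                        # name of the action to trigger


# Assumed example rules; thresholds are illustrative only.
RULES = [
    Rule(lambda s: s["cpu"] > 0.8, "add_server"),
    Rule(lambda s: s["cpu"] < 0.2 and s["servers"] > 1, "remove_server"),
]


def decide(state: dict, rules: list = RULES) -> Optional[str]:
    """Return the action of the first rule whose condition holds, else None."""
    for rule in rules:
        if rule.condition(state):
            return rule.action
    return None


print(decide({"cpu": 0.9, "servers": 3}))
print(decide({"cpu": 0.5, "servers": 3}))
```

Every situation needs its own rule, which is why explicit enumeration becomes tedious as the number of metrics and actions grows.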
Specifying goals (2/3) • Utility functions – Mathematical expressions – Maps system state to scalar value – Represents high-level objectives – What parts of system state to include? – What should function look like?
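A utility function can be sketched as a plain function from system state to a number, which the planner then maximizes. Everything concrete here is an assumption made for illustration: the response-time target, the reward and cost values, and the simple queueing-style response-time model.

```python
def utility(response_time_ms: float, servers: int,
            target_ms: float = 200.0,
            sla_value: float = 20.0,
            cost_per_server: float = 1.0) -> float:
    """Scalar utility: reward for meeting the target minus the server cost."""
    reward = sla_value if response_time_ms <= target_ms else 0.0
    return reward - cost_per_server * servers


def predicted_response_time_ms(load: float, servers: int,
                               capacity: float = 100.0,
                               base_ms: float = 50.0) -> float:
    """Assumed model: response time grows sharply as utilization nears 1."""
    utilization = load / (servers * capacity)
    if utilization >= 1.0:
        return float("inf")  # overloaded, the queue grows without bound
    return base_ms / (1.0 - utilization)


def best_allocation(load: float, max_servers: int = 20) -> int:
    """Pick the server count with the highest utility for the given load."""
    return max(
        range(1, max_servers + 1),
        key=lambda n: utility(predicted_response_time_ms(load, n), n),
    )


print(best_allocation(300.0))
```

Unlike a rule set, nothing here says which action to take. The function only ranks states, and the choice of what to include in the state and how to weigh it is exactly the design question the slide raises.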
Specifying goals (3/3) • Policies – (higher-level) descriptions of goals and constraints for operation – How to map to lower-level behavior? – Composition of multiple policies – What high-level language to use? • Turing-complete? • No widely used languages available today • Human operators used to explicit steering – Not used to indirect goal specification
Autonomic management techniques - requirements • Robustness – Avoid oscillations or behavioral changes • Scalability – Internet-scale: millions of servers and networks, even more autonomic agents (50 billion devices?) • Adaptive to changing workloads – Some methods reliable for certain load patterns, but unstable once the load or system dynamics change • Performance – Need to make decisions fast enough to react in time – Optimal solutions vs. approximations • Simplicity – Key to adoption – Complex models vs. model-free? – Learning phase required before deployment?
Autonomic management - sample techniques • Heuristic frameworks – Fast and simple, rules of thumb • Control theory – Used to steer, e.g., industrial plants, embedded systems, etc. – Discretization for data packet flows (queuing theory) • Machine learning – Evolve behavior based on empirical (monitor) data – Examples: Neural networks, genetic algorithms, reinforcement learning
Heuristics • Rules of thumb – Often lack theoretical background • Often used to handle very complex (NP-hard) problems – Scalable, find fast solutions – Greedy • Local decisions that make sense right here/now • May not result in optimal solution – Hill climbing • Steer search (manage system in this case) towards steepest slope – Often no upper bound • Not possible to know distance from optimal solution – “The O-word…”
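Hill climbing over a single integer knob can be sketched in a few lines. The knob, its range, and the throughput table below are made up for illustration; the point is that the search stops at the first setting whose neighbors are no better.

```python
from typing import Callable


def hill_climb(score: Callable[[int], float], start: int, lo: int, hi: int) -> int:
    """Move to the better neighboring setting until no neighbor improves."""
    current = start
    while True:
        neighbors = [n for n in (current - 1, current + 1) if lo <= n <= hi]
        if not neighbors:
            return current
        best = max(neighbors, key=score)
        if score(best) <= score(current):
            return current  # local optimum: no neighbor is strictly better
        current = best


# Hypothetical measured throughput for knob settings 0..5.
THROUGHPUT = [1, 3, 2, 5, 9, 4]

from_left = hill_climb(lambda k: THROUGHPUT[k], start=0, lo=0, hi=5)
from_right = hill_climb(lambda k: THROUGHPUT[k], start=5, lo=0, hi=5)
print(from_left, from_right)
```

Started from setting 0, the search stops at setting 1, although setting 4 has the highest throughput. The search has no way of knowing how far it is from the optimum.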
Control theory • Mathematical models to monitor and steer dynamic systems – Real-time allocation of CPU, memory, etc. • Some simple examples: – Proportional control • Adjust signal proportionally to compensate error – PID (Proportional Integral Derivative) control: • Integral: adjustment w.r.t. error over time • Derivative: adjustment w.r.t. error trend
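The three PID terms can be sketched as a small discrete-time controller. This is a textbook-style form without anti-windup or output limits; the gains and the interpretation of the output (for example, a change in CPU allocation) are left to the caller and are not taken from the slides.

```python
from typing import Optional


class PIDController:
    """Minimal discrete PID controller (no anti-windup, no output clamping)."""

    def __init__(self, kp: float, ki: float, kd: float, setpoint: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self._integral = 0.0
        self._prev_error: Optional[float] = None

    def update(self, measurement: float, dt: float = 1.0) -> float:
        """Return the control signal for one new measurement."""
        error = self.setpoint - measurement
        self._integral += error * dt                 # error accumulated over time
        if self._prev_error is None:
            derivative = 0.0                         # no trend on the first sample
        else:
            derivative = (error - self._prev_error) / dt  # error trend
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


# Proportional-only control: the signal is simply kp times the error.
p_only = PIDController(kp=2.0, ki=0.0, kd=0.0, setpoint=10.0)
print(p_only.update(7.0))
```

Setting `ki` and `kd` to zero gives the proportional controller from the first bullet; the integral term removes steady-state error and the derivative term damps fast changes.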
Neural networks & Deep learning • Mimics the brain’s neuron systems • Input/hidden/output layers of neurons: – Neurons in hidden layer: activation functions map input signal to output signal – Activation functions tuned based on error in output layer (errors are propagated back for tuning) • Often used to capture multi-dimensional problems that are hard to model with other techniques • Hard to train (need representative training data) • Hard to understand cause/effect (hidden layers)
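Error back-propagation can be shown on the smallest possible network: one input, one hidden neuron with a sigmoid activation, and one linear output. This is a deliberately tiny sketch in which the weights are the tuned quantities; the starting weights, learning rate, and training sample are arbitrary assumptions, and real networks use many neurons and a numerical library.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(w1: float, w2: float, x: float):
    """Input -> hidden neuron (sigmoid) -> linear output."""
    hidden = sigmoid(w1 * x)
    output = w2 * hidden
    return hidden, output


def loss(w1: float, w2: float, x: float, target: float) -> float:
    _, output = forward(w1, w2, x)
    return 0.5 * (output - target) ** 2


def gradients(w1: float, w2: float, x: float, target: float):
    """Propagate the output error back to both weights (chain rule)."""
    hidden, output = forward(w1, w2, x)
    error = output - target                              # error at the output
    grad_w2 = error * hidden                             # output weight
    grad_w1 = error * w2 * hidden * (1.0 - hidden) * x   # through the sigmoid
    return grad_w1, grad_w2


def train(w1: float, w2: float, x: float, target: float,
          learning_rate: float = 0.5, steps: int = 200):
    """Plain gradient descent on a single training sample."""
    for _ in range(steps):
        grad_w1, grad_w2 = gradients(w1, w2, x, target)
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
    return w1, w2


before = loss(0.5, 0.5, 1.0, 0.8)
w1, w2 = train(0.5, 0.5, 1.0, 0.8)
after = loss(w1, w2, 1.0, 0.8)
print(before, after)
```

Even here the learned weights say little about why the output is what it is, which hints at the cause/effect problem noted for hidden layers.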