
Doomsday Anwesha Das, Frank Mueller, Paul Hargrove, Eric Roman, Scott Baden



  1. Doomsday Anwesha Das, Frank Mueller, Paul Hargrove, Eric Roman, Scott Baden Lawrence

  2. Introduction HPC systems are expensive computing environments composed of hundreds or thousands of nodes with non-uniform memory access. As in any distributed system, individual nodes can fail. Because we want high performance, failure is very expensive. We can reduce the overhead of failure recovery if we can proactively predict failures in these large-scale computing systems.

  3. Motivation Existing work does not place sufficient emphasis on lead time requirements. Prior studies use the same training data for future predictions over a long time frame. Dynamic prediction and scalable online prediction techniques have not yet been explored. Most studies have focused on the rich BlueGene logs of decommissioned systems. Contemporary systems (e.g. Cray) with lower-level, Linux-style raw logs need further exploration.

  4. Proposal The paper proposes a novel prediction scheme, TBP (time-based phrase), to extract relevant log phrases indicative of node failure from noisy data. These events help forecast future failures with lead times ranging from 20 seconds to 2 minutes.

  5. Cray System Architecture Scale: These systems have been widely deployed and typically run more than 1,400,000 jobs/year.

  6. Technical Challenges Failure needs to be discovered by integrating a distributed set of events over space and time. Normalizing, mapping, and asymmetric binarization of the data cannot reveal the information required. Non-critical messages could be better predictors. Errors propagate through the system, making it harder to find a correlation between distant error logs.

  7. What is Node Failure? Broadly speaking, node failures can be classified as Internal Failures, External Failures, and Normal Shutdowns. Normal Shutdowns are administrative events such as maintenance. Internal Failures are specific to the node at hand and are not influenced by the state of the system. External Failures are triggered by errors or failures in other parts of the system.

  8. Example

  9. TBP Framework The framework follows the standard division of steps for any machine learning model. TBP Learning: TBP uses TOT (Topics over Time) to learn the failure chains from the training data (logs). Node Failure Prediction: TBP compares the incoming phrases with those in the failure chains. If the incoming phrases form a chain with at least 50% similarity in log messages to a learned failure chain, the corresponding node is likely to fail in the future.
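
The prediction step above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the slides do not say how "similarity" is measured, so the sketch assumes it is the fraction of a learned chain's phrases that have already been observed on a node. The function names and the example phrases are hypothetical.

```python
def chain_similarity(observed_phrases, failure_chain):
    """Fraction of the chain's distinct phrases seen so far (assumed metric)."""
    chain = set(failure_chain)
    if not chain:
        return 0.0
    return len(chain & set(observed_phrases)) / len(chain)


def likely_to_fail(observed_phrases, failure_chains, threshold=0.5):
    """Flag a node when any learned chain is matched at or above the threshold."""
    return any(
        chain_similarity(observed_phrases, chain) >= threshold
        for chain in failure_chains
    )


# Hypothetical learned chains and incoming phrases for one node.
chains = [
    ["hardware error", "connection lost", "kernel panic"],
    ["ecc error", "heartbeat fault"],
]
observed = ["connection lost", "kernel panic", "mount ok"]
print(likely_to_fail(observed, chains))
```

A set-based overlap ignores phrase order; an order-aware measure would be stricter and might trade lead time for fewer false positives.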

  10. The workflow The main idea is that every phrase is assigned a topic. We have a finite number of topics for an integrated document. During the training phase, TOT learns the top N topics referring to phrases. TBP forms sequences of phrases that correspond to past failures in the data. We use them to forecast future failures when those phrases reappear in the test data.
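
A minimal sketch of how phrase sequences could be tied to past failures is shown below. It skips the TOT topic-assignment step entirely and makes a strong simplifying assumption: a failure chain is just the ordered list of phrases logged on the failing node within a fixed lookback window before the failure. The window length, tuple layout, and names are all hypothetical.

```python
def build_failure_chains(events, failures, lookback_s=120.0):
    """Collect, per past failure, the phrases seen on that node shortly before it.

    events:   iterable of (timestamp_s, node_id, phrase)
    failures: iterable of (timestamp_s, node_id)
    """
    ordered = sorted(events)  # sort by timestamp first
    chains = []
    for fail_time, fail_node in failures:
        chain = [
            phrase
            for ts, node, phrase in ordered
            if node == fail_node and fail_time - lookback_s <= ts < fail_time
        ]
        if chain:
            chains.append(chain)
    return chains


# Hypothetical training data: node n1 fails at t = 150 s.
events = [
    (10.0, "n1", "A"),
    (50.0, "n2", "B"),
    (100.0, "n1", "C"),
    (130.0, "n1", "D"),
    (400.0, "n1", "E"),
]
failures = [(150.0, "n1")]
print(build_failure_chains(events, failures))
```

In the scheme described on the slide, the phrases kept in a chain would additionally be restricted to those belonging to the top N topics learned by TOT.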

  11. Topics Over Time Topics over Time (TOT) captures how topic frequencies change with respect to time. It views time as a continuous entity and does not discretize it. The intuition behind using TOT is that in a continuous, long-running system such as an HPC system, the topics evolve over time and reflect the state of the system during the time period under consideration.
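
To make "time as a continuous entity" concrete, the sketch below assumes, as in the published TOT model, that each topic's timestamps are scaled into (0, 1) and summarized by a Beta distribution whose parameters are estimated by the method of moments. How TBP applies TOT in detail is not stated in the slides, so this illustrates only the general idea; the function name and sample data are hypothetical.

```python
from statistics import fmean, pvariance


def fit_beta_by_moments(normalized_times):
    """Estimate Beta(alpha, beta) parameters from timestamps scaled into (0, 1)."""
    mean = fmean(normalized_times)
    var = pvariance(normalized_times, mu=mean)
    # The moment estimates are only defined when 0 < var < mean * (1 - mean).
    if var <= 0 or var >= mean * (1 - mean):
        raise ValueError("timestamps too degenerate for a moment-based Beta fit")
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


# Hypothetical topic whose phrases appear late in the observation window.
late_topic = [0.7, 0.8, 0.9]
alpha, beta = fit_beta_by_moments(late_topic)
print(alpha > beta)  # alpha > beta means the density is concentrated toward later times
```

Because the Beta density is defined over the whole interval, a topic's prominence can be read off at any instant rather than per fixed-width time bin.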

  12. Capturing information from Logs The requirement is to capture information in the form of correlations between highly probable topics at any given time. Example:

  13. Preprocessing Steps Job Logs and Data Integration: Logs corresponding to one event can show up in various places in the system. They are correlated using a timestamp difference of 15 ms. After successful correlation, a text document with timestamps, node IDs and filtered log messages is formed.
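
The correlation step might look like the sketch below. The slide gives only the 15 ms figure, so the grouping rule here is an assumption: records are sorted by time and a record joins the current group when it is at most 15 ms after the previous record. The record layout and log source names are hypothetical.

```python
def correlate(records, max_gap_ms=15):
    """Group log records whose timestamps are at most max_gap_ms apart.

    records: iterable of (timestamp_ms, source, message)
    """
    groups = []
    for rec in sorted(records, key=lambda r: r[0]):
        # Compare against the last record of the most recent group.
        if groups and rec[0] - groups[-1][-1][0] <= max_gap_ms:
            groups[-1].append(rec)
        else:
            groups.append([rec])
    return groups


# Hypothetical records from three log sources.
records = [
    (1000, "console", "a"),
    (1100, "console", "d"),
    (1010, "messages", "b"),
    (1020, "consumer", "c"),
]
for group in correlate(records):
    print([message for _, _, message in group])
```

Chaining on the previous record lets a group span more than 15 ms in total; comparing against the first record of the group instead would bound each group's width.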

  14. Training Phase Phrase Likelihood Estimation: The training phase includes topic assignment and identification of the top N topics over a period of time. This follows from a continuous-time statistical technique called Topics over Time.

  15. TBP Framework

  16. Performance The data shows that node failures are actually somewhat rare, which calls into question the utility of TBP. However, the number of compute node failures increases with service node failures; predicting service node failures could help prevent cascading failures. Also, rescheduling jobs after node failures is expensive; the job scheduler could avoid running long jobs on nodes with short-term failure predictions.

  17. Observation - Phrase distribution There is significant phrase variation over a short time interval, which means that disparate, large events occur in the system with high frequency. As a result, discrete time models can’t be used here, because they cannot capture variation beyond their time granularity.

  18. Prediction quality and lead time In their experiments, TBP is trained on 4 weeks' worth of logs and tested on a week's worth of data. In this scenario, it predicts 86% of all node failures correctly. However, it needs to be retrained with 4 weeks' worth of data every week to maintain its level of performance. TBP offers at least a minute's worth of lead time. This can be improved by pruning the failure event chains, at the expense of more false positives.
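
The weekly retraining schedule can be sketched as a sliding window. This assumes that each round trains on the most recent 4 weeks and tests on the week that follows, which is one reading of the slide; the start date and function name are hypothetical.

```python
from datetime import date, timedelta


def retraining_windows(start, n_rounds, train_weeks=4, test_weeks=1):
    """Yield (train_start, train_end, test_end) tuples; end dates are exclusive."""
    for i in range(n_rounds):
        train_start = start + timedelta(weeks=i * test_weeks)
        train_end = train_start + timedelta(weeks=train_weeks)
        test_end = train_end + timedelta(weeks=test_weeks)
        yield train_start, train_end, test_end


# Hypothetical log collection starting on 2018-01-01.
for train_start, train_end, test_end in retraining_windows(date(2018, 1, 1), 3):
    print(f"train {train_start} .. {train_end}, test {train_end} .. {test_end}")
```

Each round discards the oldest week of training data, which is why the cost of retraining recurs every week.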

  19. Thoughts TBP does provide a novel method by taking into consideration lead times, low-level logs, and a continuous-time environment. The details of how the TOT algorithm is applied are not obvious. The training phase requires manual intervention to establish the correlation of logs. Does this work for online learning?
