

When bad things happen to good systems: detecting and diagnosing problems

Kimberly Keeton
HPL Storage and Content Distribution, Hewlett-Packard Laboratories
Berkeley/Stanford Recovery-oriented Computing Course Lecture, October 25, 2001

Problem definition

- Detection: determining that a problem has occurred (or will occur)
- Diagnosis: determining the root cause of the problem
- "Problem" can be broadly defined
  - Performance-related, availability-related, security-related
- Fields to draw from:
  - System administration, operating systems, network management, intrusion detection
- Techniques borrowed from:
  - Statistics, database data mining, AI machine learning

Outline

- Problem definition
- Detection techniques
  - Challenges
  - Change point detection
  - Time series analysis
  - Predictive detection
  - Data mining/machine learning algorithms
- Diagnosis techniques
- Additional related work
- Summary

Challenges in detecting problems

- Many types of faults
  - Persistent increase, gradual change, abrupt change, single spike
- Time-varying nature of observed system behavior
  - Trends and seasonality (i.e., cyclic behavior)
- Distinguishing between the "good," the "bad" and the "ugly"
- Detecting problems fast enough to minimize service disruption
- Trading off false positives vs. neglected true positives

Change point detection algorithms [Hellerstein98]

- Basic idea:
  - Determine when process parameters have changed
  - Declare a change point if the I/O response time is "more likely" to have come from a distribution with a different mean
- [Figure: example I/O response-time trace exhibiting an abrupt change]
- Ex: maximum likelihood ratio detection rules, such as the cumulative sum (CUSUM) rule

Maximum likelihood ratio

- Let Y_1, Y_2, ..., Y_T be i.i.d. random variables
- Let f(Y_i, θ) be the probability density function (pdf) of the random variables, where θ is the only parameter in the pdf
- Let f(·, θ_0) and f(·, θ_1) be two different distributions
- Likelihood ratio:

    \prod_{i=1}^{T} f(Y_i, \theta_1) \Big/ \prod_{i=1}^{T} f(Y_i, \theta_0)

- A large ratio means that Y_1, Y_2, ..., Y_T are more likely to have come from f(·, θ_1)
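The likelihood ratio above can be made concrete for the simplest case of Gaussian observations with a known variance and two candidate means. The Python sketch below is an illustration added for this write-up rather than part of the original slides; the variable names, means, and variance are assumptions.

```python
import numpy as np
from scipy.stats import norm

def log_likelihood_ratio(y, theta0, theta1, sigma=1.0):
    """Log of prod_i f(Y_i, theta1) / prod_i f(Y_i, theta0) for Gaussian
    observations with candidate means theta0/theta1 and common std dev sigma."""
    return np.sum(norm.logpdf(y, loc=theta1, scale=sigma)
                  - norm.logpdf(y, loc=theta0, scale=sigma))

# Synthetic example: response times whose mean shifts from 1.0 to 3.0.
rng = np.random.default_rng(0)
before = rng.normal(1.0, 1.0, size=50)   # drawn from f(., theta_0)
after = rng.normal(3.0, 1.0, size=50)    # drawn from f(., theta_1)

print(log_likelihood_ratio(before, theta0=1.0, theta1=3.0))  # strongly negative
print(log_likelihood_ratio(after, theta0=1.0, theta1=3.0))   # strongly positive
```

A large positive log-ratio says the window of observations is better explained by the θ_1 distribution; the detection rules on the following slides turn this ratio into an online stopping rule with a threshold c.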

Maximum likelihood ratio detection rule

- Declare that a change has occurred at time N if the likelihood ratio after the change exceeds a pre-determined threshold level c:

    N = \inf\{\, n \ge 1 : \sup_{1 \le k \le n} \prod_{i=k}^{n} \frac{f(Y_i, \theta_1)}{f(Y_i, \theta_0)} \ge c \,\}

- Ex: CUSUM rule for normal random variables:

    N = \inf\{\, n \ge 1 : \max_{1 \le k \le n} \sum_{i=k}^{n} (Y_i - \bar{Y}) \ge c \,\}

CUSUM example

- Raw data: the change is difficult to detect
- CUSUM (cumulative sums): the change is easier to detect
- CUSUM confidence level compared with bootstrapping (random permutation of the data)
  - Bootstrap: flat cumulative residuals
  - CUSUM: an angle forms at the change point

Change point pros/cons

- Advantages:
  - Well-established statistical technique
  - Several variants of on-line and off-line algorithms
- Disadvantages:
  - Focuses on a single type of fault: abrupt changes
  - Mostly limited to stationary (non-time-varying) processes
    - Must separately deal with long-term trends and seasonality
  - Some dependence on knowledge of, and assumptions about, data distributions

Outline (revisited): next topic is time series analysis

Time series forecasting algorithms

- Basic idea:
  - Build a model of what you expect the next observation to be, and raise an alarm if the observed and predicted values differ too much
- Ex: Holt-Winters forecasting [Hoogenboom93, Brutlag00]
  - 3-part model built on exponential smoothing: prediction = baseline + linear trend + seasonal effect
    - prediction: y'_{t+1} = a_t + b_t + c_{t+1-m}
    - baseline: a_t = α(y_t - c_{t-m}) + (1 - α)(a_{t-1} + b_{t-1})
    - linear trend: b_t = β(a_t - a_{t-1}) + (1 - β) b_{t-1}
    - seasonal trend: c_t = γ(y_t - a_t) + (1 - γ) c_{t-m}
    - where m is the period of the seasonal cycle

Holt-Winters measure of deviation

- Confidence bands measure deviation within the seasonal cycle:
  - predicted deviation: d_t = γ|y_t - y'_t| + (1 - γ) d_{t-m}
  - confidence band: (y'_t - δ·d_{t-m}, y'_t + δ·d_{t-m})
- Trigger an alarm when the number of violations exceeds a threshold
  - To reduce the false-alarm rate, count violations across a moving, fixed-sized window
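The Holt-Winters equations above map almost directly onto code. The following Python sketch is a minimal illustration added here, not the lecture's implementation: the smoothing parameters (alpha, beta, gamma, delta), the seasonal period, and the simple initialization are all assumptions chosen for readability.

```python
import numpy as np

def holt_winters_detect(y, m, alpha=0.5, beta=0.1, gamma=0.3, delta=3.0):
    """Additive Holt-Winters forecasting with confidence bands.

    y: observed series, m: period of the seasonal cycle.
    Returns indices t where y[t] fell outside
    (y'_t - delta*d_{t-m}, y'_t + delta*d_{t-m}).
    The first m samples seed the seasonal terms; this initialization
    is a simplification, not the scheme used in the lecture.
    """
    a = float(np.mean(y[:m]))        # initial baseline a_t
    b = 0.0                          # initial linear trend b_t
    c = list(y[:m] - a)              # initial seasonal terms c_{t-m}
    d = [float(np.std(y[:m]))] * m   # initial predicted deviations d_{t-m}
    violations = []

    for t in range(m, len(y)):
        y_pred = a + b + c[t - m]    # y'_t = a_{t-1} + b_{t-1} + c_{t-m}
        if not (y_pred - delta * d[t - m] <= y[t] <= y_pred + delta * d[t - m]):
            violations.append(t)

        a_prev = a                   # exponential-smoothing updates
        a = alpha * (y[t] - c[t - m]) + (1 - alpha) * (a_prev + b)
        b = beta * (a - a_prev) + (1 - beta) * b
        c.append(gamma * (y[t] - a) + (1 - gamma) * c[t - m])
        d.append(gamma * abs(y[t] - y_pred) + (1 - gamma) * d[t - m])

    return violations

# Usage: a daily cycle sampled once per minute (m = 1440) with an injected level shift.
rng = np.random.default_rng(1)
t = np.arange(5 * 1440)
series = 10 + 2 * np.sin(2 * np.pi * t / 1440) + rng.normal(0, 0.3, t.size)
series[6000:6100] += 5
print(holt_winters_detect(series, m=1440)[:5])
```

Note that this sketch flags every individual out-of-band observation, whereas the slide's alarm rule only fires when the number of violations within a moving, fixed-sized window exceeds a threshold.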

Time series forecasting pros/cons

- Advantages:
  - Well-established statistical technique
  - Considers time-varying properties of the data
    - Trends and seasonality (at many levels)
- Disadvantages:
  - Large number of parameters to tune for the algorithm to work correctly
  - Detecting a problem only after it occurs may imply service disruption

Holt-Winters example

- [Figure: "LU read experiment - faultlu only"; response time (seconds) vs. time (minutes), plotting observations against the lower and upper confidence bounds]
- Simplified Holt-Winters: exponential smoothing
  - Violations occur when an observation falls outside of the lower and upper bounds
- Generally detects 10-minute changes

Outline (revisited): next topic is predictive detection

Predictive detection [Hellerstein00]

- Basic idea:
  - Predict the probability of violations of threshold tests in advance, including how long until the violation
  - Allows pre-emptive corrective action in advance of service disruption
  - Also allows service providers to give customers advance notice of potential service degradations

Predictive detection highlights

- Model both stationary and non-stationary effects
  - Stationary: multi-part model using ANOVA techniques
  - Non-stationary: use auto-correlation and auto-regression to capture short-range dependencies
- Use observed data and the models to predict future transformed values over a prediction horizon
- Calculate the probability that the threshold is violated at each point in the prediction horizon
- May consider both upper and lower thresholds

Predictive detection example

- [Figure: metric of interest and transformed metric of interest vs. time (t-2 ... t+3), showing the data and the threshold]
- Transform the data and thresholds
  - Measured (time-varying) values are transformed into (stationary) values
  - The constant raw threshold is likewise transformed into (time-varying) thresholds
- Predict future values and the probability of threshold violation
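To make the predictive-detection idea concrete, the sketch below forecasts a few steps ahead with a simple autoregressive model and converts each forecast into a probability of crossing a fixed threshold. It is an illustration added for this write-up, not the [Hellerstein00] model: the AR(1) fit, the Gaussian residual assumption, and the threshold value are all assumptions, and the stationarizing transform described on the slides is omitted.

```python
import numpy as np
from scipy.stats import norm

def predict_violation_prob(y, threshold, horizon=3):
    """Fit an AR(1) model to the series y and, for each step h in the
    prediction horizon, return (h-step forecast, P(value > threshold))."""
    y = np.asarray(y, dtype=float)
    x, x_next = y[:-1], y[1:]
    phi, c = np.polyfit(x, x_next, 1)              # x_{t+1} ~ c + phi * x_t
    sigma = (x_next - (c + phi * x)).std(ddof=1)   # residual std deviation

    results = []
    mean, var = y[-1], 0.0
    for h in range(1, horizon + 1):
        mean = c + phi * mean                # h-step-ahead conditional mean
        var = sigma**2 + phi**2 * var        # h-step-ahead forecast variance
        p_violate = 1.0 - norm.cdf(threshold, loc=mean, scale=np.sqrt(var))
        results.append((mean, p_violate))
    return results

# Usage: a slowly climbing metric approaching an assumed service threshold of 8.0.
rng = np.random.default_rng(2)
history = 5 + 0.05 * np.arange(60) + rng.normal(0, 0.2, 60)
for h, (m, p) in enumerate(predict_violation_prob(history, threshold=8.0), start=1):
    print(f"t+{h}: forecast = {m:.2f}, P(violation) = {p:.2f}")
```

Reporting, for each point in the horizon, the probability that the threshold will be violated is what lets an operator take pre-emptive action (or warn customers) before the service actually degrades, as the slides above describe.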
