Changepoint detection for time series prediction
Allen B. Downey
Olin College of Engineering

My background:
• Predoc at San Diego Supercomputer Center.
• Dissertation on workload modeling, queue time prediction, and malleable job allocation for parallel machines.
• Recent: network measurement and modeling.
• Current: history-based prediction.

Connection?
• Resource allocation based on prediction.
• Prediction based on history.
• Historical data characterized by changepoints (nonstationarity).

Three ways to characterize variability:
• Noise around a stationary level.
• Noise around an underlying trend.
• Abrupt changes in level: changepoints.

Important difference:
• Data prior to a changepoint is irrelevant to performance after it.

Example: wide area networks
• Some trends (accumulating queue).
• Many abrupt changepoints:
  - Beginning and end of transfers.
  - Routing changes.
  - Hardware failure, replacement.

Example: parallel batch queues
• Some trends (daily cycles).
• Some abrupt changepoints:
  - Start/completion of wide jobs.
  - Queue policy changes.
  - Hardware failure, replacement.

My claim:
• Many systems are characterized by changepoints, where data before a changepoint is irrelevant to performance after.
• In these systems, good predictions depend on changepoint detection, because old data is wrong.

Discussion?

Two kinds of prediction:
• Single-value prediction.
• Predictive distribution:
  - Summary stats.
  - Intervals.
  - P(error > thresh)
  - E[cost(error)]

If you assume stationarity, life is good:
• Accumulate data indefinitely.
• Predictive distribution = observed distribution.

But this is often not a good assumption.

If the system is nonstationary:
• Fixed window? Exponential decay?
• Too far back: obsolete data.
• Not far enough back: loss of useful info.

If you know where the changepoints are:
• Use data back to the latest changepoint.
• Less information immediately after.

If you don't know, you have to guess.

P(i) = probability of a changepoint at time i

Example:
• 150 data points.
• P(50) = 0.7
• P(100) = 0.5

How do you generate a predictive distribution?

Two steps:
• Derive P(i+): the probability that i is the latest changepoint.
• Compute a weighted mix of data going back to each i.

Example:
  P(50) = 0.7, P(100) = 0.5
  P(⊘) = 0.15  (the probability that there is no changepoint)
  P(50+) = 0.35
  P(100+) = 0.5

Predictive distribution = 0.50 · edf(100, 150) ⊕ 0.35 · edf(50, 150) ⊕ 0.15 · edf(0, 150)

where edf(a, b) is the empirical distribution of the data from time a to time b.

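A minimal sketch of these two steps in Python (not from the talk): it assumes the per-point probabilities P(i) are independent, which is what gives P(50+) = 0.7 · (1 − 0.5) = 0.35 and P(⊘) = 0.3 · 0.5 = 0.15 in the example above; the function names and the resampling approach are illustrative only.

```python
import numpy as np

def latest_changepoint_probs(cp_probs):
    """Convert per-point changepoint probabilities P(i) into P(i+),
    the probability that i is the latest changepoint, plus P(no change).
    Independence of the P(i) is assumed for illustration."""
    p_latest = {}
    p_none = 1.0
    for i in sorted(cp_probs):
        later = [cp_probs[j] for j in cp_probs if j > i]
        p_no_later = np.prod([1.0 - p for p in later]) if later else 1.0
        p_latest[i] = cp_probs[i] * p_no_later
        p_none *= 1.0 - cp_probs[i]
    return p_latest, p_none

def predictive_sample(x, p_latest, p_none, size=10000):
    """Sample from the mixture of empirical distributions, each component
    using only the data back to a candidate latest changepoint."""
    starts = [0] + list(p_latest)            # 0 stands for "no changepoint seen"
    weights = [p_none] + list(p_latest.values())
    comps = np.random.choice(len(starts), size=size, p=weights)
    return np.array([np.random.choice(x[starts[c]:]) for c in comps])

# The slide's example: 150 points with P(50) = 0.7 and P(100) = 0.5
# gives P(100+) = 0.5, P(50+) = 0.35, P(no change) = 0.15.
p_latest, p_none = latest_changepoint_probs({50: 0.7, 100: 0.5})
```
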
So how do you generate the probabilities P(i+)?

Three steps:
• Bayes' theorem.
• Simple case: you know there is one changepoint.
• General case: unknown number of changepoints.

Bayes' theorem (diachronic interpretation):

  P(H | E) = P(E | H) P(H) / P(E)

• H is a hypothesis, E is a body of evidence.
• P(H | E): posterior.
• P(H): prior.
• P(E | H) is usually easy to compute.
• P(E) is often not.

Unless you have a suite of exclusive hypotheses:

  P(H_i | E) = P(E | H_i) P(H_i) / P(E),  where  P(E) = Σ_{H_j ∈ S} P(E | H_j) P(H_j)

In that case life is good.

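As a small illustration (my own, not from the slides), the normalization over an exhaustive suite of exclusive hypotheses is straightforward once the priors and likelihoods are in hand:

```python
def posterior(priors, likelihoods):
    """Posterior over mutually exclusive, exhaustive hypotheses:
    P(H_i | E) = P(E | H_i) P(H_i) / sum_j P(E | H_j) P(H_j)."""
    unnorm = [prior * like for prior, like in zip(priors, likelihoods)]
    total = sum(unnorm)                  # this sum is P(E)
    return [u / total for u in unnorm]
```
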
• If you know there is exactly one changepoint in an interval...
• ...then the P(i) are exclusive hypotheses,
• and all you need is P(E | i).

Which is pretty much a solved problem.

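One common way to get P(E | i) in the simple case is a Gaussian likelihood for the two segments. The sketch below is a simplification of my own, not the method in the talk: it assumes known σ, plugs in the segment sample means instead of integrating over them, and uses a uniform prior over candidate locations.

```python
import numpy as np

def single_changepoint_posterior(x, sigma):
    """P(i | data) for the location of exactly one changepoint, assuming
    Gaussian noise with known sigma, plug-in segment means, and a uniform
    prior over the candidate locations 1..n-1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    log_like = np.full(n, -np.inf)
    for i in range(1, n):                      # change between x[i-1] and x[i]
        left, right = x[:i], x[i:]
        resid = np.concatenate([left - left.mean(), right - right.mean()])
        log_like[i] = -np.sum(resid ** 2) / (2 * sigma ** 2)
    log_like -= np.max(log_like[1:])           # stabilize before exponentiating
    post = np.exp(log_like)
    return post / post.sum()
```
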
What if the number of changepoints is unknown?
• The P(i) are no longer exclusive.
• But the P(i+) are.
• And you can write a system of equations for the P(i+).

  P(i+) = P(i+ | ⊘) P(⊘) + Σ_{j<i} P(i+ | j++) P(j++)

• P(j++) is the probability that the second-to-last changepoint is at j.
• P(i+ | j++) reduces to the simple problem.
• P(⊘) is the probability that we have not seen two changepoints.
• P(i+ | ⊘) reduces to the simple problem (plus).

Great, so what's P(j++)?

  P(j++) = Σ_{k>j} P(j++ | k+) P(k+)

• P(j++ | k+) is just P(j+) computed at time k.
• So you can solve for the P(+) in terms of the P(++),
• and the P(++) in terms of the P(+).
• And at every iteration you have a pretty good estimate.
• Paging Dr. Jacobi!

Implementation:
• Need to keep n²/2 previous values,
• and n²/2 summary statistics.
• And it takes n² work to do an update.
• But you only have to go back two changepoints,
• ...so you can keep n small.

[Figure: synthetic series with two changepoints; µ = −0.5, 0.5, 0.0; σ = 1.0; P(⊘) = 0.04. Lower panel: cumulative probability of P(i+) and P(i++) over time.]

[Figure: the ubiquitous Nile dataset (annual flow in 10^9 m^3, 1880–1960); change in 1898. Lower panel: cumulative probability curves P33(i+), P66(i+), P99(i+). Estimated probabilities can be mercurial.]

[Figure: detecting a change in variance; µ = 1, 0, 0; σ = 1, 1, 0.5. Lower panel: cumulative probability of P(i+) and P(i++). Estimated P(i+) is good; estimated P(i++) is less certain.]

• Qualitative behavior seems good.
• Quantitative tests:
  - Compare to GLR for the online alarm problem.
  - Test the predictive distribution with synthetic data.
  - Test the predictive distribution with real data.

Changepoint problems:
• Detection: online alarm problem.
• Location: offline partitioning.
• Tracking: online prediction.

The proposed method does all three. Starting simple...

Online alarm problem:
• Observe a process in real time.
• µ0 and σ known.
• τ and µ1 unknown.
• Raise an alarm ASAP after the changepoint.
• Minimize delay.
• Minimize the false alarm rate.

GLR = generalized likelihood ratio.
• Compute a decision function g_k.
• E[g_k] = 0 before the changepoint,
• ...and increases after it.
• Alarm when g_k > h.
• GLR is optimal when µ1 is known.

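For reference, a rough sketch of the standard GLR detector for this setting (my own rendering, not code from the talk): with µ0 and σ known and µ1 unknown, maximizing the likelihood ratio over µ1 gives g_k = max_{j≤k} [Σ_{i=j..k} (x_i − µ0)]² / (2σ²(k − j + 1)).

```python
import numpy as np

def glr_alarm_time(x, mu0, sigma, h):
    """Return the first index k at which the GLR statistic g_k exceeds
    the threshold h, or None if it never does."""
    x = np.asarray(x, dtype=float)
    for k in range(len(x)):
        # sums of (x_i - mu0) over the last 1, 2, ..., k+1 observations
        tail_sums = np.cumsum((x[:k + 1] - mu0)[::-1])
        counts = np.arange(1, k + 2)
        g_k = np.max(tail_sums ** 2 / (2 * sigma ** 2 * counts))
        if g_k > h:
            return k
    return None
```
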
CPP = changepoint probability

  P(changepoint) = Σ_{i=0}^{n} P(i+)

• Alarm when P(changepoint) > thresh.

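Given P(i+) values produced by the method above, the alarm rule itself is just a thresholded sum (a trivial sketch; the names are hypothetical):

```python
def cpp_alarm(p_latest, thresh):
    """Alarm when the total probability that some changepoint has
    occurred, sum_i P(i+) = 1 - P(no change), exceeds thresh."""
    return sum(p_latest) > thresh
```
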
[Figure: mean delay vs. false alarm probability for GLR and CPP; µ = 0, 1; σ = 1; τ ∼ Exp(0.01). Goodness = lower mean delay for the same false alarm rate.]

[Figure: mean delay vs. σ for GLR and CPP at a fixed 5% false alarm rate. CPP does well with small S/N.]

So it works on a simple problem.

Future work:
• Other changepoint problems (location, tracking).
• Other data distributions (lognormal).
• Testing robustness (real data, trends).

Related problem:
• How much categorical data to use?
• Example: predict queue time based on size, queue, etc.
• Possible answer: the narrowest category that yields two changepoints.

Good news:
• Very general framework.
• Seems to work.
• Many possible applications.

Bad news:
• Need to apply and test in a real application.
• n² space and time may limit scope.

• More at allendowney.com/research/changepoint
• Or email downey@allendowney.com