
STATISTICS 479/503 TIME SERIES ANALYSIS, PART II

Doug Wiens

April 12, 2005

Part II: Time Domain Analysis


Thus the series is both stationary and invertible. It is ARMA(1,1), not ARMA(2,2) as it initially appeared. Students should verify that the above can be continued as

$X_t = \left[\sum_{j=0}^{\infty} (.9)^j B^j\right](1 + .5B)\, w_t = \left[1 + (.9 + .5)B + \dots + (.9)^{j-1}(.9 + .5)B^j + \dots\right] w_t = \psi(B) w_t,$

where $\psi(z) = \sum_k \psi_k z^k$ with $\psi_0 = 1$ and $\psi_k = 1.4(.9)^{k-1}$.

• Box-Jenkins methodology:

1. Determine the theoretical ACF (and PACF, to be defined) for these and other classes of time series models. Use the sample ACF/PACF to match the data to a possible model (MA(q), AR(p), etc.).

2. Estimate parameters using a method appropriate to the chosen model, assess the fit, study the residuals. The notion of residual will require special treatment; for now think of them as $X_t - \hat X_t$ where, e.g., in an AR(p) model ($X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + w_t$) we have $\hat X_t = \hat\phi_1 X_{t-1} + \hat\phi_2 X_{t-2} + \dots + \hat\phi_p X_{t-p}$. The residuals should then "look like" white noise (why?). If the fit is inadequate, revise steps 1 and 2.

3. Finally, use the model to forecast.

• We treat these three steps in detail. Recall that for an MA(q), the autocovariance function is

$\gamma(m) = \begin{cases} \sigma_w^2 \sum_{k=0}^{q-m} \theta_k \theta_{k+m}, & 0 \le m \le q, \\ 0, & m > q. \end{cases}$

The salient feature is that $\gamma(m) = 0$ for $m > q$; we look for this in the sample ACF. See Figures 2.1, 2.2.

Figure 2.1. Sample ACF of simulated MA(3) series.

Figure 2.2. Sample ACF of simulated MA(3) series.
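A quick way to see this cutoff numerically is to simulate an MA(3) series and compute its sample ACF. The following is a minimal sketch; the coefficients and seed are illustrative, not those used for Figures 2.1-2.2.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
theta = np.array([1.0, 0.7, -0.4, 0.3])        # X_t = sum_k theta_k w_{t-k}, an MA(3)
w = rng.normal(size=T + len(theta) - 1)
x = np.convolve(w, theta, mode="valid")        # length-T simulated series

def sample_acf(x, max_lag):
    # sample autocorrelations rho_hat(0), ..., rho_hat(max_lag)
    x = x - x.mean()
    c0 = np.sum(x * x) / len(x)
    return np.array([np.sum(x[m:] * x[:len(x) - m]) / len(x) / c0
                     for m in range(max_lag + 1)])

print(np.round(sample_acf(x, 10), 2))   # values beyond lag 3 should be near 0 (within ~2/sqrt(T))
```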

• ACF of an AR(p) process. Let $j \ge 0$ and assume the process is stationary. Then

$w_t = X_t - \sum_{i=1}^p \phi_i X_{t-i}$

$\Rightarrow \mathrm{COV}\left[X_t - \sum_{i=1}^p \phi_i X_{t-i},\; X_{t-j}\right] = \mathrm{COV}\left[w_t, X_{t-j}\right]$

$\Rightarrow \gamma(j) - \sum_{i=1}^p \phi_i \gamma(j-i) = \mathrm{COV}\left[w_t, X_{t-j}\right].$

Under the stationarity condition, $X_{t-j}$ is a linear combination $w_{t-j} + \psi_1 w_{t-j-1} + \psi_2 w_{t-j-2} + \dots$, with

$\mathrm{COV}\left[w_t, X_{t-j}\right] = \mathrm{COV}\left[w_t,\; w_{t-j} + \psi_1 w_{t-j-1} + \psi_2 w_{t-j-2} + \dots\right] = \sigma_w^2\, I(j = 0),$

thus

$\gamma(j) - \sum_{i=1}^p \phi_i \gamma(j-i) = \begin{cases} \sigma_w^2, & j = 0, \\ 0, & j > 0. \end{cases}$

These are the "Yule-Walker" equations, to be solved to obtain $\gamma(j)$ for $j \ge 0$; then $\gamma(-j) = \gamma(j)$.

• Example: AR(1). The Yule-Walker equations are

$\gamma(j) - \phi\gamma(j-1) = \begin{cases} \sigma_w^2, & j = 0, \\ 0, & j > 0. \end{cases}$

We get $\gamma(0) = \phi\gamma(1) + \sigma_w^2$ and $\gamma(j) = \phi\gamma(j-1)$ for $j > 0$. In particular $\gamma(0) = \phi\gamma(1) + \sigma_w^2 = \phi(\phi\gamma(0)) + \sigma_w^2$, so

$\gamma(0) = \frac{\sigma_w^2}{1 - \phi^2}.$

Note that $0 < \gamma(0) = \mathrm{VAR}[X_t] < \infty$ by the stationarity condition $|\phi| < 1$. Iterating $\gamma(j) = \phi\gamma(j-1)$ gives $\gamma(j) = \phi^j\gamma(0)$, $j = 1, 2, 3, \dots$. Thus

$\rho(j) = \frac{\gamma(j)}{\gamma(0)} = \phi^{|j|}.$
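The Yule-Walker equations can also be solved numerically for a general AR(p). A minimal sketch follows; the AR(1) call at the end just checks the answer against the closed forms above, and the coefficients are illustrative.

```python
import numpy as np

def yule_walker_gamma(phi, sigma2_w, max_lag):
    # Solve for gamma(0), ..., gamma(p); higher lags follow from the recursion
    # gamma(j) = sum_i phi_i * gamma(j - i).
    p = len(phi)
    A = np.zeros((p + 1, p + 1))
    b = np.zeros(p + 1)
    b[0] = sigma2_w
    for j in range(p + 1):
        A[j, j] += 1.0
        for i, ph in enumerate(phi, start=1):
            A[j, abs(j - i)] -= ph
    gamma = list(np.linalg.solve(A, b))
    for j in range(p + 1, max_lag + 1):
        gamma.append(sum(ph * gamma[j - i] for i, ph in enumerate(phi, start=1)))
    return np.array(gamma)

g = yule_walker_gamma([0.8], sigma2_w=1.0, max_lag=5)
print(g[0], 1.0 / (1 - 0.8**2))     # both equal sigma_w^2 / (1 - phi^2)
print(np.round(g / g[0], 3))        # rho(j) = 0.8^j for the AR(1)
```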

7. Lecture 7

• It is difficult to identify an AR(p) from its ACF.

• Suppose that a series is AR(1), and consider forecasting $X_t$ from two previous values $X_{t-1}, X_{t-2}$:

$X_t = \phi X_{t-1} + w_t, \qquad \hat X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2}.$

One suspects that the "best" $\alpha$'s will be $\alpha_1 = \phi$, $\alpha_2 = 0$. This is in fact true, and is a property of the "Partial Autocorrelation Function" (PACF).

• Assume $\mu_X = 0$; consider the problem of minimizing the function

$f_m(\alpha_{1,m}, \dots, \alpha_{m,m}) = E\left[\{X_t - \alpha_{1,m}X_{t-1} - \dots - \alpha_{m,m}X_{t-m}\}^2\right],$

which is the MSE when $X_t$ is forecast by $\hat X_t = \alpha_{1,m}X_{t-1} + \dots + \alpha_{m,m}X_{t-m}$.

Let the minimizers be $\alpha^*_{1,m}, \dots, \alpha^*_{m,m}$. The lag-$m$ PACF value, written $\phi_{mm}$, is defined to be $\alpha^*_{m,m}$.

• It can also be shown that

$\phi_{mm} = \mathrm{CORR}\left[X_t - \hat X_t,\; X_{t-m} - \hat X_{t-m}\right],$

where each $\hat X$ denotes the best (i.e. minimum MSE) predictor which is a linear function of $X_{t-1}, \dots, X_{t-m+1}$.

• To compute: solve the $m$ equations in $m$ unknowns

$0 = -\frac{1}{2}\frac{\partial f_m}{\partial \alpha_{j,m}} = E\left[X_{t-j}\,\{X_t - \alpha_{1,m}X_{t-1} - \dots - \alpha_{m,m}X_{t-m}\}\right] = \gamma(j) - \alpha_{1,m}\gamma(j-1) - \dots - \alpha_{m,m}\gamma(j-m),$

for $j = 1, \dots, m$; i.e.

$\sum_{i=1}^m \alpha_{i,m}\gamma(j-i) = \gamma(j).$

Then

$m = 1: \quad \phi_{11} = \rho(1),$

$m = 2: \quad \phi_{22} = \frac{\rho(2) - \rho^2(1)}{1 - \rho^2(1)},$

etc.

• Note that for an AR(1), $\rho(j) = \phi^j$ and so $\phi_{11} = \phi$, $\phi_{22} = 0$. See Figure 2.3.

Figure 2.3. Sample PACF from simulated AR(1) series.
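The defining equations $\sum_i \alpha_{i,m}\gamma(j-i) = \gamma(j)$ can be solved mechanically for any ACF. A minimal sketch (the lambda below plugs in the AR(1) ACF $\rho(j) = \phi^j$ purely as an illustration, and reproduces $\phi_{11} = \phi$, $\phi_{mm} = 0$ for $m > 1$):

```python
import numpy as np

def pacf_from_rho(rho, max_m):
    # phi_mm is the last component of the solution of R alpha = r at each m
    out = []
    for m in range(1, max_m + 1):
        R = np.array([[rho(abs(j - i)) for i in range(1, m + 1)] for j in range(1, m + 1)])
        r = np.array([rho(j) for j in range(1, m + 1)])
        out.append(np.linalg.solve(R, r)[-1])
    return np.array(out)

phi = 0.7
print(np.round(pacf_from_rho(lambda j: phi**j, 5), 4))   # [0.7, 0, 0, 0, 0]
```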

• In general, if $\{X_t\}$ is AR(p) and stationary, then $\phi_{pp} = \phi_p$ and $\phi_{mm} = 0$ for $m > p$.

Proof: Write $X_t = \sum_{j=1}^p \phi_j X_{t-j} + w_t$, so for $m \ge p$

$f_m(\alpha_{1,m}, \dots, \alpha_{m,m}) = E\left[\left\{w_t + \sum_{j=1}^p (\phi_j - \alpha_{j,m})X_{t-j} - \sum_{j=p+1}^m \alpha_{j,m}X_{t-j}\right\}^2\right] = E\left[\{w_t + Z\}^2\right], \text{ say}$

(where $Z$ is uncorrelated with $w_t$ - why?), $= \sigma_w^2 + E[Z^2]$.

This is minimized if $Z = 0$ with probability 1, i.e. if $\alpha_{j,m} = \phi_j$ for $j \le p$ and $\alpha_{j,m} = 0$ for $j > p$.

• Forecasting. Given r.v.s $X_t, X_{t-1}, \dots$ (into the infinite past, in principle) we wish to forecast a future value $X_{t+l}$. Let the forecast be $X^t_{t+l}$. We will later show that the "best" forecast is

$X^t_{t+l} = E\left[X_{t+l} \mid X_t, X_{t-1}, \dots\right],$

the conditional expected value of $X_{t+l}$ given $\mathbf{X}^t = \{X_s\}_{s=-\infty}^t$. Our general model identification approach, following Box/Jenkins, is then:

1. Tentatively identify a model, generally by looking at its sample ACF/PACF.

2. Estimate the parameters (generally by the method of Maximum Likelihood, to be covered later). This allows us to estimate the forecasts $X^t_{t+l}$, which depend on unknown parameters, by substituting estimates to obtain $\hat X^t_{t+l}$. We define the residuals by

$\hat w_t = X_t - \hat X^{t-1}_t.$

The (adjusted) MLE of $\sigma_w^2$ is typically (in ARMA(p,q) models)

$\hat\sigma_w^2 = \frac{1}{T - 2p - q}\sum_{t=p+1}^T \hat w_t^2.$

The $T - 2p - q$ in the denominator is the number of residuals used ($T - p$) minus the number of ARMA parameters estimated ($p + q$).

3. The residuals should "look like" white noise. We study them, and apply various tests of whiteness. To the extent that they are not white, we look for possible alternate models.

4. Iterate; finally use the model to forecast.

8. Lecture 8

• Conditional expectation. Example: randomly choose a stock from a listing. $Y$ = price in one week, $X$ = price in the previous week. To predict $Y$: if we have no information about $X$, then the best (minimum MSE) constant predictor of $Y$ is $E[Y]$. (Why? What mathematical problem is being formulated and solved here?) However, suppose we also know that $X = x$. Then we can improve our forecast by using the forecast $\hat Y = E[Y \mid X = x]$, the mean price of all stocks whose price in the previous week was $x$.

• In general, if $(X, Y)$ are any r.v.s, then $E[Y \mid X = x]$ is the expected value of $Y$ when the population is restricted to those pairs with $X = x$. We use the following facts about conditional expectation:

$(X, Y) \text{ independent} \Rightarrow E[Y \mid X = x] = E[Y],$

$E[X \mid X = x] = x, \qquad E[g(X) \mid X = x] = g(x),$

and more generally

$E[g(X, Y) \mid X = x] = E[g(x, Y) \mid X = x].$

In particular, $E[f(X)g(Y) \mid X = x] = f(x)\,E[g(Y) \mid X = x]$.

• Assume from now on that the white noise is normally distributed. The important consequence is that these terms are now independent, not merely uncorrelated, and so

$E[w_s \mid w_t, w_{t-1}, \dots] = \begin{cases} w_s, & s \le t, \\ 0, & s > t. \end{cases}$

• Put $h(x) = E[Y \mid X = x]$. This is a function of $x$; when it is evaluated at the r.v. $X$ we call it $h(X) = E[Y \mid X]$. We have the Double Expectation Theorem:

$E\{E[Y \mid X]\} = E[Y].$

The inner expectation is with respect to the conditional distribution and the outer is with respect to the distribution of $X$; the theorem can be stated as $E[h(X)] = E[Y]$, where $h(x)$ is as above.

- Example: $Y$ = house values, $X$ = location (neighbourhood) of a house. $E[Y]$ can be obtained by averaging within neighbourhoods, then averaging over neighbourhoods.

• Similarly,

$E_X\left\{E_{Y|X}[g(X, Y) \mid X]\right\} = E[g(X, Y)].$

• Minimum MSE forecasting. Consider forecasting a r.v. $Y$ (unobservable) using another r.v. $X$ (or a set of r.v.s); e.g. $Y = X_{t+l}$, $X = \mathbf{X}^t$. We seek the function $g(X)$ which minimizes the MSE

$MSE(g) = E\left[\{Y - g(X)\}^2\right].$

The required function is $g(X) = E[Y \mid X]$ $(= h(X))$.

• Proof: We have to show that for any function $g$, $MSE(g) \ge MSE(h)$. Write

$MSE(g) = E\left[\{(Y - h(X)) + (h(X) - g(X))\}^2\right] = E\left[\{Y - h(X)\}^2\right] + E\left[\{h(X) - g(X)\}^2\right] + 2E\left[(h(X) - g(X))(Y - h(X))\right].$

We will show that the last term $= 0$; then we have

$MSE(g) = MSE(h) + E\left[\{h(X) - g(X)\}^2\right],$

which exceeds $MSE(h)$, with equality iff $g(X) = h(X)$ with probability 1. To establish the claim we evaluate the expected value in stages:

$E\left[(h(X) - g(X))(Y - h(X))\right] = E_X\left\{E_{Y|X}\left[(h(X) - g(X))(Y - h(X)) \mid X\right]\right\}.$

The inner expectation is (why?)

$(h(X) - g(X))\,E_{Y|X}\left[(Y - h(X)) \mid X\right] = (h(X) - g(X))\left\{E_{Y|X}[Y \mid X] - h(X)\right\} = 0.$

• The minimum MSE is

$MSE_{\min} = E\left[\{Y - h(X)\}^2\right] = E_X\left\{E_{Y|X}\left[\{Y - h(X)\}^2 \mid X\right]\right\} = E_X\left\{\mathrm{VAR}[Y \mid X]\right\}.$

We will show that

$\mathrm{VAR}[Y] = E_X\left\{\mathrm{VAR}[Y \mid X]\right\} + \mathrm{VAR}\left[E\{Y \mid X\}\right],$

i.e. $\mathrm{VAR}[Y] = MSE_{\min} + \mathrm{VAR}[h(X)]$; thus $MSE_{\min} \le \mathrm{VAR}[Y]$. $\mathrm{VAR}[Y]$ is the MSE when $Y$ is forecast by its mean and $X$ is ignored; our result is then that using the information in $X$ never increases the MSE, and results in a strict decrease as long as $\mathrm{VAR}[h(X)] > 0$, i.e. $h(x)$ is non-constant. This is analogous to the within/between breakdown in ANOVA (e.g. variation in house prices within and between neighbourhoods).

• Proof of claim:

$\mathrm{VAR}[Y] = E\left[\{Y - E[Y]\}^2\right] = E\left[\{Y - h(X) + (h(X) - E[Y])\}^2\right]$

$= E\left[\{Y - h(X)\}^2\right] + E\left[\{h(X) - E[Y]\}^2\right] + 2E\left[\{Y - h(X)\}\{h(X) - E[Y]\}\right]$

$= MSE_{\min} + \mathrm{VAR}[h(X)] + 2E_X\left\{E_{Y|X}\left[\{Y - h(X)\}\{h(X) - E[Y]\} \mid X\right]\right\}.$

The inner expectation is

$\{h(X) - E[Y]\}\,E_{Y|X}\left[Y - h(X) \mid X\right] = 0.$

• Students should verify that $Y - E[Y \mid X]$ is uncorrelated with $X$.

• Assume $\{X_t\}$ is stationary and invertible. We forecast $X_{t+l}$ by $X^t_{t+l} = E\left[X_{t+l} \mid \mathbf{X}^t\right]$, where $\mathbf{X}^t = \{X_s\}_{s=-\infty}^t$. Note that this forecast is 'unbiased' in that $E[X^t_{t+l}] = E[X_{t+l}]$. By linearity we have that $X_{t+l}$ can be represented as

$X_{t+l} = \sum_{k=0}^{\infty} \psi_k w_{t+l-k} \qquad (\psi_0 = 1),$

so that

$X^t_{t+l} = \sum_{k=0}^{\infty} \psi_k E\left[w_{t+l-k} \mid \mathbf{X}^t\right].$

We have $X_t = \psi(B)w_t$ and (by invertibility) $w_t = \phi(B)X_t$, where $\phi(B)\psi(B) = 1$ determines $\phi(B)$. Thus conditioning on $\mathbf{X}^t$ is equivalent to conditioning on $\mathbf{w}^t = \{w_s\}_{s=-\infty}^t$:

$X^t_{t+l} = \sum_{k=0}^{\infty} \psi_k E\left[w_{t+l-k} \mid \mathbf{w}^t\right], \quad \text{where} \quad E\left[w_{t+l-k} \mid \mathbf{w}^t\right] = \begin{cases} w_{t+l-k}, & \text{if } l \le k, \\ 0, & \text{otherwise.} \end{cases}$

(Note that $E\left[X_{t+l-k} \mid \mathbf{X}^t\right] = X_{t+l-k}$ if $l \le k$.)

Thus the forecast is

$X^t_{t+l} = \sum_{k=l}^{\infty} \psi_k w_{t+l-k},$

with forecast error and variance

$X_{t+l} - X^t_{t+l} = \sum_{k=0}^{l-1} \psi_k w_{t+l-k}, \qquad \mathrm{VAR}\left[X_{t+l} - X^t_{t+l}\right] = \sigma_w^2 \sum_{k=0}^{l-1} \psi_k^2.$

Since $\{w_t\}$ is normal, we have

$X_{t+l} - X^t_{t+l} \sim N\left(0,\; \sigma_w^2 \sum_{k=0}^{l-1}\psi_k^2\right)$

and so a $100(1-\alpha)\%$ prediction (forecast) interval on $X_{t+l}$ is

$X^t_{t+l} \pm z_{\alpha/2}\,\sigma_w\sqrt{\sum_{k=0}^{l-1}\psi_k^2}.$

Interpretation: the probability that $X_{t+l}$ will lie in this interval is $1 - \alpha$.

• In practice we must solve for the $\psi_k$ in terms of the AR and MA parameters of $\{X_t\}$, then substitute estimates of these parameters to obtain estimates $\hat\psi_k$. Substituting these estimates into the expressions above results in the forecast $\hat X^t_{t+l}$; we must also use an estimate $\hat\sigma_w^2$. The residuals, or innovations, are

$\hat w_t = X_t - \hat X^{t-1}_t,$

and typically $\hat\sigma_w^2 = \sum \hat w_t^2 / (\#\text{ of residuals} - \#\text{ of parameters estimated})$.
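Solving for the $\psi_k$ is a simple recursion once the model is written as $\phi(B)X_t = \theta(B)w_t$ with $\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p$ and $\theta(B) = 1 - \theta_1 B - \dots - \theta_q B^q$; matching coefficients in $\phi(B)\psi(B) = \theta(B)$ gives the loop below. A minimal sketch, checked against the ARMA(1,1) example at the start of this part (where $\psi_k = 1.4(.9)^{k-1}$):

```python
import numpy as np

def psi_weights(phi, theta, n):
    # psi_0 = 1; psi_j = -theta_j + sum_{i<=min(j,p)} phi_i psi_{j-i}, theta_j = 0 for j > q
    psi = [1.0]
    for j in range(1, n + 1):
        val = -(theta[j - 1] if j <= len(theta) else 0.0)
        val += sum(phi[i - 1] * psi[j - i] for i in range(1, min(j, len(phi)) + 1))
        psi.append(val)
    return np.array(psi)

# X_t = (1 + .5B)(1 - .9B)^{-1} w_t corresponds to phi_1 = 0.9, theta_1 = -0.5:
print(np.round(psi_weights([0.9], [-0.5], 5), 4))
# 1, 1.4, 1.26, 1.134, ...  i.e. psi_k = 1.4 * 0.9**(k-1), as in the notes
```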

9. Lecture 9

• Example 1: AR(1) (and stationary). $X_t = \phi X_{t-1} + w_t$.

$X^t_{t+l} = E\left[X_{t+l} \mid \mathbf{X}^t\right] = E\left[\phi X_{t+l-1} + w_{t+l} \mid \mathbf{X}^t\right] = \phi E\left[X_{t+l-1} \mid \mathbf{X}^t\right] + E\left[w_{t+l} \mid \mathbf{w}^t\right] = \begin{cases} \phi X_t, & l = 1, \\ \phi X^t_{t+l-1}, & l > 1. \end{cases}$

Iterating: $X^t_{t+l} = \phi^l X_t$ for $l \ge 1$.

The calculation of the forecast was easy (it always is for an AR model); determining the forecast variance requires us to determine the $\psi_k$'s (since $\mathrm{VAR}[X_{t+l} - X^t_{t+l}] = \sigma_w^2\sum_{k=0}^{l-1}\psi_k^2$). Usually this is done numerically; in the present case it can be done explicitly: $(1 - \phi B)X_t = w_t$, so

$X_t = (1 - \phi B)^{-1}w_t = \sum_{k=0}^{\infty}\psi_k w_{t-k}, \quad \text{with } \psi_k = \phi^k.$

Then

$\sum_{k=0}^{l-1}\psi_k^2 = \frac{1 - \phi^{2l}}{1 - \phi^2},$

leading to the forecast interval

$\phi^l X_t \pm z_{\alpha/2}\,\sigma_w\sqrt{\frac{1 - \phi^{2l}}{1 - \phi^2}}.$

Numerically we replace $\phi$ by its estimate $\hat\phi$; then $\hat X^t_{t+l} = \hat\phi^l X_t$ and the residuals are

$\hat w_t = X_t - \hat X^{t-1}_t = X_t - \hat\phi X_{t-1} \qquad (t > 1).$

Note the similarity with $w_t = X_t - \phi X_{t-1}$. This illustrates the fact that the residual can also be obtained by writing $w_t$ in terms of the data and parameters, and then replacing the parameters with estimates. The estimate of the variance of the noise is

$\hat\sigma_w^2 = \sum_{t=2}^T \hat w_t^2 / (T - 2).$
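A minimal sketch of these AR(1) forecasts and intervals; the parameter values and last observation below are made up for illustration, not estimates from any data set.

```python
import numpy as np

phi_hat, sigma_hat, x_last = 0.8, 1.2, 3.5   # illustrative phi-hat, sigma-hat, X_t
z = 1.96                                     # z_{alpha/2} for alpha = 0.05

for l in range(1, 6):
    fc = phi_hat**l * x_last                 # forecast phi^l X_t
    half = z * sigma_hat * np.sqrt((1 - phi_hat**(2 * l)) / (1 - phi_hat**2))
    print(f"l={l}: forecast {fc:6.3f},  95% PI {fc - half:7.3f} .. {fc + half:7.3f}")
```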

• Example 2: AR(p). Similar to Example 1, $X_t = \sum_{i=1}^p \phi_i X_{t-i} + w_t$ results in

$\hat X^t_{t+l} = \sum_{i=1}^p \hat\phi_i \hat X^t_{t+l-i},$

where $\hat X^t_{t+l-i} = X_{t+l-i}$ if $l \le i$. Now solve (numerically) $X_t = (1/\phi(B))w_t = \psi(B)w_t$ to get the $\psi_k$, then the $\hat\psi_k$ and the standard errors of the forecasts. The innovations are obtained from

$\hat X^{t-1}_t = \sum_{i=1}^p \hat\phi_i \hat X^{t-1}_{t-i} = \sum_{i=1}^p \hat\phi_i X_{t-i},$

which gives

$\hat w_t = X_t - \sum_{i=1}^p \hat\phi_i X_{t-i},$

with

$\hat\sigma_w^2 = \sum_{t=p+1}^T \hat w_t^2 / (T - 2p).$

• Example 3: MA(1) (and invertible).

$X_t = w_t - \theta w_{t-1} = (1 - \theta B)w_t \;\Rightarrow\; w_t = \sum_{k=0}^{\infty}\theta^k X_{t-k}.$

We make the approximation $X_0 = w_0 = 0$, and then

$w_t = \sum_{k=0}^{t-1}\theta^k X_{t-k}.$

Now

$X^t_{t+l} = E\left[w_{t+l} - \theta w_{t+l-1} \mid \mathbf{w}^t\right] = \begin{cases} -\theta w_t, & l = 1, \\ 0, & l > 1, \end{cases}$

with $w_t$ obtained from the preceding equation. This gives residuals

$\hat w_t = \sum_{k=0}^{t-1}\hat\theta^k X_{t-k},$

and

$\hat\sigma_w^2 = \sum_{t=1}^T \hat w_t^2 / (T - 1).$

Trivially, since $\psi_0 = 1$, $\psi_1 = -\theta$ and $\psi_k = 0$ for $k > 1$, we have

$\sum_{k=0}^{l-1}\psi_k^2 = \begin{cases} 1, & l = 1, \\ 1 + \theta^2, & l > 1. \end{cases}$

The prediction intervals are

$-\hat\theta\hat w_t \pm z_{\alpha/2}\hat\sigma_w$ for $l = 1$, and $0 \pm z_{\alpha/2}\hat\sigma_w\sqrt{1 + \hat\theta^2}$ for $l > 1$.

• Students should write out the procedure for an invertible MA(q) model.

• Example 4: ARMA(1,1), stationary and invertible. In general, when there is an MA component we make the approximation $X_0 = w_0 = 0$. The model is $(1 - \phi B)X_t = (1 - \theta B)w_t$, i.e. $X_t = \phi X_{t-1} + w_t - \theta w_{t-1}$, leading to

$X^t_{t+l} = \phi X^t_{t+l-1} + w^t_{t+l} - \theta w^t_{t+l-1} = \begin{cases} \phi X_t - \theta w_t, & l = 1, \\ \phi X^t_{t+l-1}, & l > 1. \end{cases}$

To obtain a value for $w_t$ we write

$w_t = (1 - \phi B)(1 - \theta B)^{-1}X_t = (1 - \phi B)\sum_{k=0}^{\infty}\theta^k B^k X_t = \left(1 + \sum_{k=1}^{\infty}\theta^{k-1}(\theta - \phi)B^k\right)X_t,$

with approximation

$w_t = X_t + \sum_{k=1}^{t-1}\theta^{k-1}(\theta - \phi)X_{t-k}.$

For the forecast variance we reverse $\phi$ and $\theta$ in the above:

$X_t = (1 - \theta B)(1 - \phi B)^{-1}w_t = \sum_{k=0}^{\infty}\psi_k w_{t-k}$

with $\psi_k = \phi^{k-1}(\phi - \theta)$ for $k \ge 1$ (and $\psi_0 = 1$); thus

$\mathrm{VAR}\left[X_{t+l} - X^t_{t+l}\right] = \sigma_w^2\sum_{k=0}^{l-1}\psi_k^2 = \sigma_w^2\left[1 + (\phi - \theta)^2\,\frac{1 - \phi^{2(l-1)}}{1 - \phi^2}\right].$

Students should verify that obtaining the residuals from $\hat w_t = X_t - \hat X^{t-1}_t$ results in the same expression as substituting estimates into the expression for $w_t$ given above, i.e.

$\hat w_t = X_t + \sum_{k=1}^{t-1}\hat\theta^{k-1}(\hat\theta - \hat\phi)X_{t-k}.$

Then $\hat\sigma_w^2 = \sum_{t=2}^T \hat w_t^2 / (T - 3)$.

• In general, in an ARMA(p,q) model, the assumption $X_0 = 0$ is necessary if we are to be able to calculate any of the residuals. The resulting expression (of the form $\hat w_t = \sum_{k=0}^{t-1}\hat\alpha_k X_{t-k}$) could also be used to calculate the first $p$ residuals. So there is a trade-off: we could get numerical but approximate values for $\hat w_1, \dots, \hat w_p$, or we could just not bother and calculate the remaining residuals more accurately. Here we take the latter approach. This gives

$\hat\sigma_w^2 = \sum_{t=p+1}^T \hat w_t^2 / (T - 2p - q).$
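As a numerical companion to the algebra above, a minimal sketch that computes ARMA(1,1) residuals directly from the truncated expansion; the series and parameter values below are simulated placeholders, not a fitted model.

```python
import numpy as np

def arma11_residuals(x, phi, theta):
    # w_t = X_t + sum_{k=1}^{t-1} theta^{k-1} (theta - phi) X_{t-k}, with X_0 = w_0 = 0
    T = len(x)
    w = np.zeros(T)
    for t in range(T):                      # index t = 0 corresponds to X_1 in the notes
        w[t] = x[t] + sum(theta**(k - 1) * (theta - phi) * x[t - k]
                          for k in range(1, t + 1))
    return w

rng = np.random.default_rng(2)
x = rng.normal(size=200)                    # stand-in for an observed series
w_hat = arma11_residuals(x, phi=0.9, theta=-0.5)
sigma2_hat = np.sum(w_hat[1:]**2) / (len(x) - 3)   # drop the first residual, as in the notes
print(sigma2_hat)
```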

10. Lecture 10

• Estimation. One method is the method of moments, in which we take expressions relating parameters to expected values, replace the expected values by series averages, then solve for the unknown parameters.

- e.g. $E[X_t] = \mu$ becomes $T^{-1}\sum_{t=1}^T x_t = \hat\mu$.

- e.g. For an AR(1) model we could replace $\gamma(k)$ by the sample autocovariance $\hat\gamma(k)$ in the Yule-Walker equations, then solve them as before to get

$\hat\gamma(0) = \frac{\hat\sigma_w^2}{1 - \hat\phi^2}, \qquad \hat\gamma(1) = \hat\phi\,\hat\gamma(0),$

yielding

$\hat\phi = \hat\rho(1), \qquad \hat\sigma_w^2 = \hat\gamma(0)\left(1 - \hat\phi^2\right).$

Recall we previously used the (adjusted Maximum Likelihood) estimate

$\hat\sigma_{w,MLE}^2 = \sum_{t=2}^T \hat w_t^2 / (T - 2).$

Students should show that the difference between these two estimates is of the order of (i.e. a multiple of) $1/T$; in this sense the two estimates are asymptotically (i.e. as $T \to \infty$) equivalent.

- The same technique applied to the MA(1) model starts with $\rho(1) = -\theta/(1 + \theta^2)$ (note $|\rho(1)| < 1/2$ by invertibility: $|\theta| < 1$); then we solve

$\hat\rho(1) = -\hat\theta/\left(1 + \hat\theta^2\right).$

If $|\hat\rho(1)| < 1/2$ there is a real root $\hat\theta$ with $|\hat\theta| < 1$ and we use it. (Otherwise $|\hat\theta| = 1$ and the estimated model is not invertible.) But even when $|\hat\theta| < 1$ the estimate can be quite inefficient (highly variable) relative to the MLE, which we consider next.
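A minimal sketch of both moment estimators; the data here are simulated placeholders, and with a real series one would plug in its sample autocovariances.

```python
import numpy as np

def sample_acov(x, m):
    # sample autocovariance gamma_hat(m)
    x = x - x.mean()
    return np.sum(x[m:] * x[:len(x) - m]) / len(x)

rng = np.random.default_rng(3)
x = rng.normal(size=500)                  # stand-in for an observed series

g0, g1 = sample_acov(x, 0), sample_acov(x, 1)
r1 = g1 / g0

# AR(1): phi_hat = rho_hat(1), sigma2_hat = gamma_hat(0) (1 - phi_hat^2)
phi_hat = r1
sigma2_ar = g0 * (1 - phi_hat**2)

# MA(1): solve rho(1) = -theta / (1 + theta^2), keeping the invertible root
if r1 == 0:
    theta_hat = 0.0
elif abs(r1) < 0.5:
    theta_hat = (-1 + np.sqrt(1 - 4 * r1**2)) / (2 * r1)   # |theta_hat| < 1
else:
    theta_hat = np.sign(-r1)                                # boundary: not invertible
print(phi_hat, sigma2_ar, theta_hat)
```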

• Maximum Likelihood Estimation. We observe $\mathbf{x} = (x_1, \dots, x_T)'$; suppose the joint probability density function (pdf) is $f(\mathbf{x} \mid \boldsymbol\alpha)$ for a vector $\boldsymbol\alpha = (\alpha_1, \dots, \alpha_p)'$ of unknown parameters. E.g., if the $X_t$ are independent $N(\mu, \sigma^2)$ the joint pdf is

$\prod_{t=1}^T\left(2\pi\sigma^2\right)^{-1/2}e^{-\frac{(x_t-\mu)^2}{2\sigma^2}} = \left(2\pi\sigma^2\right)^{-T/2}e^{-\frac{\sum_{t=1}^T(x_t-\mu)^2}{2\sigma^2}}.$

When evaluated at the numerical data this is a function of $\boldsymbol\alpha$ alone, denoted $L(\boldsymbol\alpha \mid \mathbf{x})$ and known as the likelihood function. The value $\hat{\boldsymbol\alpha}$ which maximizes $L(\boldsymbol\alpha \mid \mathbf{x})$ is known as the Maximum Likelihood Estimator (MLE). Intuitively, the MLE makes the observed data "most likely to have occurred".

- We put $l(\boldsymbol\alpha) = \ln L(\boldsymbol\alpha \mid \mathbf{x})$, the log-likelihood, and typically maximize it (equivalent to maximizing $L$) by solving the likelihood equations

$\dot l(\boldsymbol\alpha) = \frac{\partial}{\partial\boldsymbol\alpha}l(\boldsymbol\alpha) = \mathbf{0}, \quad \text{i.e.} \quad \frac{\partial l(\boldsymbol\alpha)}{\partial\alpha_j} = 0, \; j = 1, \dots, p.$

The vector $\dot l(\boldsymbol\alpha)$ is called the gradient.

- The MLE has attractive large sample properties. With $\boldsymbol\alpha_0$ denoting the true value, we typically have that $\sqrt{T}(\hat{\boldsymbol\alpha} - \boldsymbol\alpha_0)$ has a limiting (as $T \to \infty$) normal distribution, with mean $\mathbf{0}$ and covariance matrix $C = I^{-1}(\boldsymbol\alpha_0)$, where $I(\boldsymbol\alpha_0)$ is the information matrix defined below.

- The role of the covariance matrix is that if a random vector $Y = (Y_1, \dots, Y_p)'$ has covariance matrix $C$, then $\mathrm{COV}[Y_j, Y_k] = C_{jk}$; in particular $\mathrm{VAR}[Y_j] = C_{jj}$.

- The information matrix is given by

$I(\boldsymbol\alpha_0) = \lim_{T\to\infty}E\left\{\frac{1}{T}\,\dot l(\boldsymbol\alpha_0)\,\dot l(\boldsymbol\alpha_0)'\right\};$

a more convenient and equivalent form (for the kinds of models we will be working with) is

$I(\boldsymbol\alpha_0) = \lim_{T\to\infty}E\left\{\frac{1}{T}\left[-\ddot l(\boldsymbol\alpha_0)\right]\right\},$

where $\ddot l(\boldsymbol\alpha)$ is the Hessian matrix with $(j,k)$th element $\partial^2 l(\boldsymbol\alpha)/\partial\alpha_j\partial\alpha_k$.

- To apply these results we estimate $I(\boldsymbol\alpha_0)$ by $\hat I = I(\hat{\boldsymbol\alpha})$. Denote the $(j,k)$th element of $\hat I^{-1}$ by $\hat I^{jk}$. Then the normal approximation is that $\sqrt{T}(\hat\alpha_j - \alpha_j)$ is asymptotically normally distributed with mean zero and variance estimated by $\hat I^{jj}$, so that

$\frac{\hat\alpha_j - \alpha_j}{s_j} \approx N(0, 1), \quad \text{where } s_j = \sqrt{\frac{\hat I^{jj}}{T}}.$

Then, e.g., the p-value for the hypothesis $H_0: \alpha_j = 0$ against a two-sided alternative is

$p = 2P\left(Z > \left|\frac{\hat\alpha_j}{s_j}\right|\right),$

supplied on the ASTSA printout.

• Example 1: AR(1) with a constant, $X_t = \phi_0 + \phi_1 X_{t-1} + w_t$. As is commonly done for AR models we will carry out an analysis conditional on $X_1$; i.e. we act as if $X_1$ is not random, but is the constant $x_1$. We can then carry out the following steps:

1. Write out the pdf of $X_2, \dots, X_T$.

2. From 1., the log-likelihood is $l(\boldsymbol\alpha) = \ln L(\boldsymbol\alpha \mid \mathbf{x})$, where $\boldsymbol\alpha = (\phi_0, \phi_1, \sigma_w^2)'$.

3. Maximize $l(\boldsymbol\alpha)$ to obtain the MLEs $(\hat\phi_0, \hat\phi_1, \hat\sigma_w^2)$.

4. Obtain the information matrix and its estimated inverse.

• Step 1. (Assuming normally distributed white noise.) Transformation of variables: if the pdf of $w_2, \dots, w_T$ is $g(\mathbf{w} \mid \boldsymbol\alpha)$ and we write the $w$'s in terms of the $X$'s, $w_t = X_t - \phi_0 - \phi_1 X_{t-1}$, then the pdf of $X_2, \dots, X_T$ is

$f(\mathbf{x} \mid \boldsymbol\alpha) = g(\mathbf{w} \mid \boldsymbol\alpha)\left|\frac{\partial\mathbf{w}}{\partial\mathbf{x}}\right|_+,$

where $\left(\frac{\partial\mathbf{w}}{\partial\mathbf{x}}\right)$ is the matrix of partial derivatives, with $\left(\frac{\partial\mathbf{w}}{\partial\mathbf{x}}\right)_{jk} = \frac{\partial w_j}{\partial x_k}$, and $|\cdot|_+$ is the absolute value of the determinant. On the right hand side $g(\mathbf{w} \mid \boldsymbol\alpha)$ is evaluated by replacing the $w$'s with their expressions in terms of the $x$'s. In this AR(1) example,

$g(\mathbf{w} \mid \boldsymbol\alpha) = \left(2\pi\sigma_w^2\right)^{-(T-1)/2}e^{-\frac{\sum_{t=2}^T w_t^2}{2\sigma_w^2}}$

and

$\frac{\partial\mathbf{w}}{\partial\mathbf{x}} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ -\phi_1 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & -\phi_1 & 1 \end{pmatrix}$

with determinant $= 1$ (why?). Thus

$f(\mathbf{x} \mid \boldsymbol\alpha) = \left(2\pi\sigma_w^2\right)^{-(T-1)/2}e^{-\frac{\sum_{t=2}^T(x_t - \phi_0 - \phi_1 x_{t-1})^2}{2\sigma_w^2}}.$

This determinant will always $= 1$ if we can write $w_t$ as $X_t$ plus a function of $X_{t-1}, \dots, X_1$. But (i) this can always be done in AR models, and (ii) in models with MA parts we assume invertibility together with "$X_0 = 0$", so that it can be done there as well. So for all models considered in this course, $|\partial\mathbf{w}/\partial\mathbf{x}|_+ = 1$.

• Step 2.

$l(\boldsymbol\alpha) = \ln\left\{\left(2\pi\sigma_w^2\right)^{-(T-1)/2}e^{-\frac{\sum_{t=2}^T(x_t - \phi_0 - \phi_1 x_{t-1})^2}{2\sigma_w^2}}\right\} = -\frac{T-1}{2}\ln\sigma_w^2 - \frac{\sum_{t=2}^T(x_t - \phi_0 - \phi_1 x_{t-1})^2}{2\sigma_w^2} + \text{const.} = -\frac{T-1}{2}\ln\sigma_w^2 - \frac{S(\phi_0, \phi_1)}{2\sigma_w^2} + \text{const.},$

say.

11. Lecture 11

• Step 3. To maximize $l$ over $\phi_0, \phi_1$ we minimize $S(\phi_0, \phi_1)$. Consider a regression model

$x_t = \phi_0 + \phi_1 x_{t-1} + \text{error}, \qquad t = 2, \dots, T. \qquad (*)$

In a regression model $y_i = \phi_0 + \phi_1 x_i + \text{error}$, $i = 1, \dots, n$, the least squares estimates of the slope and intercept minimize $\sum(y_i - \phi_0 - \phi_1 x_i)^2$ and are

$\hat\phi_0 = \bar y - \hat\phi_1\bar x, \qquad \hat\phi_1 = \frac{\sum(x_i - \bar x)(y_i - \bar y)}{\sum(x_i - \bar x)^2} = \frac{\hat\gamma_{YX}(0)}{\hat\gamma_X(0)}.$

Thus the minimizers of $S$ are the LSEs in the regression model (*); they are obtained numerically by doing the regression, and formulas for them can (and should, by the students) be written out. Students should verify that

$\hat\phi_1 \approx \hat\rho(1), \qquad \hat\phi_0 \approx \bar x\left(1 - \hat\phi_1\right).$

The likelihood equation for $\sigma_w^2$ is

$0 = \frac{\partial l}{\partial\sigma_w^2} = -\frac{T-1}{2\sigma_w^2} + \frac{S(\hat\phi_0, \hat\phi_1)}{2\sigma_w^4},$

satisfied by

$\hat\sigma_w^2 = \frac{S(\hat\phi_0, \hat\phi_1)}{T-1} = \frac{\sum_{t=2}^T\left(x_t - \hat\phi_0 - \hat\phi_1 x_{t-1}\right)^2}{T-1} = \frac{\sum_{t=2}^T\hat w_t^2}{T-1}.$

A usual adjustment for bias is to replace $T - 1$ by $T - 3 = (T - 1) - 2 = $ number of residuals minus number of parameters estimated.
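A minimal sketch of this conditional least-squares fit (equivalently, the conditional MLE of $\phi_0, \phi_1$), on simulated placeholder data:

```python
import numpy as np

rng = np.random.default_rng(4)
T, phi0_true, phi1_true = 400, 1.0, 0.6
x = np.zeros(T)
for t in range(1, T):                                      # simulate an AR(1) with constant
    x[t] = phi0_true + phi1_true * x[t - 1] + rng.normal()

y, Z = x[1:], np.column_stack([np.ones(T - 1), x[:-1]])    # regress x_t on (1, x_{t-1})
(phi0_hat, phi1_hat), *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ np.array([phi0_hat, phi1_hat])
sigma2_hat = np.sum(resid**2) / (T - 3)                    # bias-adjusted, as in the notes
print(phi0_hat, phi1_hat, sigma2_hat)
```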

• Step 4. Information matrix. The gradient and negative Hessian are

$\dot l = \begin{pmatrix} -\frac{1}{2\sigma_w^2}\frac{\partial S}{\partial \phi_0} \\ -\frac{1}{2\sigma_w^2}\frac{\partial S}{\partial \phi_1} \\ -\frac{T-1}{2\sigma_w^2} + \frac{S}{2\sigma_w^4} \end{pmatrix}, \qquad -\ddot l = \begin{pmatrix} \frac{1}{2\sigma_w^2}\frac{\partial^2 S}{\partial \phi_0^2} & \frac{1}{2\sigma_w^2}\frac{\partial^2 S}{\partial \phi_0 \partial \phi_1} & -\frac{1}{2\sigma_w^4}\frac{\partial S}{\partial \phi_0} \\ * & \frac{1}{2\sigma_w^2}\frac{\partial^2 S}{\partial \phi_1^2} & -\frac{1}{2\sigma_w^4}\frac{\partial S}{\partial \phi_1} \\ * & * & -\frac{T-1}{2\sigma_w^4} + \frac{S}{\sigma_w^6} \end{pmatrix}$

(the $*$ entries are filled in by symmetry). Calculate

$\frac{\partial S}{\partial \phi_0} = -2\sum_{t=2}^T (x_t - \phi_0 - \phi_1 x_{t-1}) = -2\sum_{t=2}^T w_t$, with expectation $0$;

$\frac{\partial S}{\partial \phi_1} = -2\sum_{t=2}^T x_{t-1}(x_t - \phi_0 - \phi_1 x_{t-1}) = -2\sum_{t=2}^T x_{t-1} w_t$, with expectation (why?) $-2\sum_{t=2}^T \mathrm{COV}[X_{t-1}, w_t] = 0$;

$\frac{\partial^2 S}{\partial \phi_0^2} = 2(T-1)$, with expectation $2(T-1)$;

$\frac{\partial^2 S}{\partial \phi_1 \partial \phi_0} = 2\sum_{t=2}^T x_{t-1}$, with expectation $2(T-1)\mu$;

$\frac{\partial^2 S}{\partial \phi_1^2} = 2\sum_{t=2}^T x_{t-1}^2$, with expectation $2(T-1)\left(\gamma(0) + \mu^2\right)$.

Then, using $E[S] = E\left[\sum_{t=2}^T w_t^2\right] = (T-1)\sigma_w^2$, we get

$\frac{1}{T}E[-\ddot l] = \frac{T-1}{T}\begin{pmatrix} \frac{1}{\sigma_w^2} & \frac{\mu}{\sigma_w^2} & 0 \\ * & \frac{\gamma(0)+\mu^2}{\sigma_w^2} & 0 \\ * & * & \frac{1}{2\sigma_w^4} \end{pmatrix} \;\longrightarrow\; \begin{pmatrix} \frac{1}{\sigma_w^2}A & \mathbf 0 \\ \mathbf 0' & \frac{1}{2\sigma_w^4} \end{pmatrix} = I(\boldsymbol\alpha_0) \quad (T\to\infty), \qquad A = \begin{pmatrix} 1 & \mu \\ \mu & \gamma(0)+\mu^2 \end{pmatrix}$

(the $(3,3)$ entry is $-\frac{1}{2\sigma_w^4} + \frac{\sigma_w^2}{\sigma_w^6} = \frac{1}{2\sigma_w^4}$). The inverse is

$I^{-1}(\boldsymbol\alpha_0) = \begin{pmatrix} \sigma_w^2 A^{-1} & \mathbf 0 \\ \mathbf 0' & 2\sigma_w^4 \end{pmatrix}, \qquad A^{-1} = \frac{1}{\gamma(0)}\begin{pmatrix} \gamma(0)+\mu^2 & -\mu \\ -\mu & 1 \end{pmatrix}.$

Thus, e.g., the normal approximation for $\hat\phi_1$ is

$\hat\phi_1 - \phi_1 \approx N\left(0,\; \frac{\sigma_w^2}{T\gamma(0)} = \frac{1-\phi_1^2}{T}\right).$

This has standard error

$s^2(\hat\phi_1) = \frac{1 - \hat\phi_1^2}{T},$

and $\frac{\hat\phi_1 - \phi_1}{s(\hat\phi_1)} \approx N(0,1)$. A $100(1-\alpha)\%$ confidence interval is

$\hat\phi_1 \pm z_{\alpha/2}\, s(\hat\phi_1).$

• In general, for an AR(p)

$X_t = \phi_0 + \sum_{i=1}^p \phi_i X_{t-i} + w_t$

we minimize

$S(\boldsymbol\phi) = \sum_{t=p+1}^T \left(x_t - \phi_0 - \sum_{i=1}^p \phi_i x_{t-i}\right)^2$

by fitting the regression model $x_t = \phi_0 + \sum_{i=1}^p \phi_i x_{t-i} + \text{error}$ for $t = p+1, \dots, T$. The resulting LSEs are $\hat{\boldsymbol\phi}$.

The associated mean square of the residuals is

$\hat\sigma_w^2 = \frac{S(\hat{\boldsymbol\phi})}{T - 2p - 1}.$

The large-sample standard errors are obtained by ASTSA and appear on the printout. (Time domain > Autoregression or Time domain > ARIMA; then answer the questions that appear. More on this later.)

• Example 2: ARMA(p,q). The model is

$X_t - \sum_{j=1}^p \phi_j X_{t-j} = w_t - \sum_{k=1}^q \theta_k w_{t-k}.$

Assume that $X_t = w_t = 0$ for $t \le 0$ and solve successively for the $w_t$'s in terms of the $X_t$'s:

$w_t = X_t - \sum_{j=1}^p \phi_j X_{t-j} + \sum_{k=1}^q \theta_k w_{t-k};$

$w_1 = X_1, \quad w_2 = X_2 - \phi_1 X_1 + \theta_1 w_1, \quad \text{etc.}$

In this way we write $(w_{p+1}, \dots, w_T)$ in terms of $(x_{p+1}, \dots, x_T)$. The Jacobian is again $= 1$ and so

$f(\mathbf{x} \mid \boldsymbol\alpha) = \left(2\pi\sigma_w^2\right)^{-(T-p)/2}e^{-\frac{S(\boldsymbol\phi, \boldsymbol\theta)}{2\sigma_w^2}},$

where $S(\boldsymbol\phi, \boldsymbol\theta) = \sum_{t=p+1}^T w_t^2(\boldsymbol\phi, \boldsymbol\theta)$. (It is usual to omit the first $p$ $w_t$'s, since they are based on so little data and on the assumption mentioned above. Equivalently, we are conditioning on them.) Now $S(\boldsymbol\phi, \boldsymbol\theta)$ is minimized numerically to obtain the MLEs $\hat{\boldsymbol\phi}, \hat{\boldsymbol\theta}$. The MLE of $\sigma_w^2$ is obtained from the likelihood equation and is $S(\hat{\boldsymbol\phi}, \hat{\boldsymbol\theta})/(T - p)$; it is often adjusted for bias to give

$\hat\sigma_w^2 = \frac{S(\hat{\boldsymbol\phi}, \hat{\boldsymbol\theta})}{T - 2p - q}.$

• The matrix

$I(\boldsymbol\alpha_0) = \lim_{T\to\infty}E\left\{\frac{1}{T}\left[-\ddot l(\boldsymbol\alpha_0)\right]\right\}$

is sometimes estimated by the "observed information matrix" $\frac{1}{T}\left(-\ddot l(\hat{\boldsymbol\alpha})\right)$, evaluated at the data $\{x_t\}$. This is numerically simpler.

• Students should write out the procedure for an MA(1) model.

• The numerical calculations also rely on a modification of least squares regression. Gauss-Newton algorithm: we are to minimize

$S(\boldsymbol\psi) = \sum_t w_t^2(\boldsymbol\psi),$

where $\boldsymbol\psi$ is a p-dimensional vector of parameters. The idea is to set up a series of least squares regressions converging to the solution. First choose an initial value $\boldsymbol\psi_0$. (ASTSA will put all components $= .1$ if ".1 guess" is chosen. Otherwise, it will compute method of moments estimates, by equating the theoretical and sample ACF and PACF values. This takes longer.)

• Now expand $w_t(\boldsymbol\psi)$ around $\boldsymbol\psi_0$ by the Mean Value Theorem:

$w_t(\boldsymbol\psi) \approx w_t(\boldsymbol\psi_0) + \dot w_t(\boldsymbol\psi_0)'(\boldsymbol\psi - \boldsymbol\psi_0) = \text{“}y_t - \mathbf{z}_t'\boldsymbol\beta\text{”}, \qquad (1)$

where $y_t = w_t(\boldsymbol\psi_0)$, $\mathbf{z}_t = -\dot w_t(\boldsymbol\psi_0)$, $\boldsymbol\beta = \boldsymbol\psi - \boldsymbol\psi_0$. Now

$S(\boldsymbol\psi) \approx \sum_t\left(y_t - \mathbf{z}_t'\boldsymbol\beta\right)^2$

is minimized by regressing $\{y_t\}$ on $\{\mathbf{z}_t\}$ to get the LSE $\hat{\boldsymbol\beta}_1$. We now set $\boldsymbol\psi_1 = \hat{\boldsymbol\beta}_1 + \boldsymbol\psi_0$, expand around $\boldsymbol\psi_1$ (i.e. replace $\boldsymbol\psi_0$ by $\boldsymbol\psi_1$ in (1)), and obtain a revised estimate $\hat{\boldsymbol\beta}_2$. Iterate to convergence.
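A minimal sketch of this scheme, using an MA(1) as the test case. The finite-difference gradient and the simulated data are conveniences of the sketch, not part of the algorithm as stated above; ASTSA's own implementation may differ in details.

```python
import numpy as np

def w_of(theta, x):
    # residual function for an MA(1): w_t = x_t + theta * w_{t-1}, with w_0 = 0
    w = np.zeros(len(x))
    for t in range(len(x)):
        w[t] = x[t] + theta * (w[t - 1] if t > 0 else 0.0)
    return w

def gauss_newton_ma1(x, theta0=0.1, n_iter=20, eps=1e-6):
    theta = theta0
    for _ in range(n_iter):
        y = w_of(theta, x)                          # y_t = w_t(psi_0)
        z = -(w_of(theta + eps, x) - y) / eps       # z_t = -dw_t/dtheta (finite difference)
        beta = np.sum(z * y) / np.sum(z * z)        # LS regression of y on z
        theta = theta + beta                        # psi_1 = beta_hat + psi_0
        if abs(beta) < 1e-8:
            break
    return theta

rng = np.random.default_rng(5)
w = rng.normal(size=500)
x = w[1:] - 0.6 * w[:-1]          # simulated MA(1) with theta = 0.6
print(gauss_newton_ma1(x))        # should land near 0.6
```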

12. Lecture 12

• A class of nonstationary models is obtained by taking differences, and assuming that the differenced series is ARMA(p,q):

$\nabla X_t = X_t - X_{t-1} = (1 - B)X_t, \qquad \nabla^2 X_t = \nabla(\nabla X_t) = (1 - B)^2 X_t, \quad \text{etc.}$

We say $\{X_t\}$ is ARIMA(p,d,q) ("Integrated ARMA") if $\nabla^d X_t$ is ARMA(p,q). If so,

$\phi(B)(1 - B)^d X_t = \theta(B)w_t$

for an AR(p) polynomial $\phi(B)$ and an MA(q) polynomial $\theta(B)$. Since $\phi(B)(1 - B)^d$ has roots on the unit circle, $\{X_t\}$ cannot be stationary. The differenced series $\{\nabla^d X_t\}$ is the one we analyze.

• It may happen that the dependence of a series on its past is strongest at multiples of the sampling unit, e.g. monthly economic data may exhibit strong quarterly or annual trends. To model this, define seasonal AR(P) and MA(Q) characteristic polynomials

$\Phi(B^s) = 1 - \Phi_1 B^s - \Phi_2 B^{2s} - \dots - \Phi_P B^{Ps},$

$\Theta(B^s) = 1 - \Theta_1 B^s - \Theta_2 B^{2s} - \dots - \Theta_Q B^{Qs}.$

A seasonal ARMA(P,Q) model, with season $s$, is defined by $\Phi(B^s)X_t = \Theta(B^s)w_t$. This can be combined with the hierarchy of ordinary ARMA models, and with differencing, to give the full ARIMA(p,d,q)×(P,D,Q)$_s$ model defined by

$\Phi(B^s)\phi(B)(1 - B^s)^D(1 - B)^d X_t = \Theta(B^s)\theta(B)w_t.$
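The differencing operators themselves are one-liners; a minimal sketch, applied to a placeholder monthly series built from a random-walk level plus an annual cycle:

```python
import numpy as np

def diff(x, lag=1):
    # (1 - B^lag) x_t
    return x[lag:] - x[:-lag]

rng = np.random.default_rng(6)
x = np.cumsum(rng.normal(size=120)) + 5 * np.sin(2 * np.pi * np.arange(120) / 12)

d1 = diff(x)                 # (1 - B) x_t
d12_d1 = diff(diff(x), 12)   # (1 - B^12)(1 - B) x_t
print(len(x), len(d1), len(d12_d1))   # 120, 119, 107
```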

- Example: the ARIMA(0,1,1)×(0,1,1)$_{12}$ model has $\Phi(B^s) = 1$, $\phi(B) = 1$, $d = D = 1$, $\Theta(B^s) = 1 - \Theta B^{12}$, $\theta(B) = 1 - \theta B$. Thus

$(1 - B^{12})(1 - B)X_t = \left(1 - \Theta B^{12}\right)(1 - \theta B)w_t.$

Expanding:

$X_t = X_{t-1} + X_{t-12} - X_{t-13} + w_t - \theta w_{t-1} - \Theta w_{t-12} + \Theta\theta w_{t-13}.$

This model often arises with monthly economic data.

- The analysis of the ACF and PACF proceeds along the same lines as for the previous models (see below).

• Choosing an appropriate model. Some guiding properties of the ACF/PACF:

- Nonstationarity: ACF drops off very slowly (a root of the AR characteristic equation with $|B|$ near 1 will do this too); PACF large (in absolute value) at 1 (but large only at 1 could also indicate an AR(1)). Try taking differences $\nabla^d X_t$, $d = 1, 2$. Rarely is $d > 2$. Don't be too hasty to take differences; try ARMA models first.

- Seasonal nonstationarity: ACF zero except at lags $s, 2s, \dots$, and decays slowly; or PACF very large at $s$. Try $\nabla_s^d X_t = (1 - B^s)^d X_t$.

- AR(p) behaviour: PACF zero for $m > p$.

- Seasonal AR(P): PACF zero except at $m = s, 2s, \dots, Ps$.

- MA(q): ACF zero for $m > q$.

- Seasonal MA(Q): ACF zero except at $m = s, 2s, \dots, Qs$.

• Note: sometimes one fits an MA(q) model, or an ARMA(p,q) model with $q > 0$, and finds that most residuals are of the same sign. This is generally a sign that $\mu_X \ne 0$ (recall an MA(q) has mean zero). A remedy is to fit a constant as well (possible in ASTSA only in the AR option) or to subtract the average $\bar x$ from the series before the analysis is carried out.

• Principle of Parsimony: we seek the simplest model that is adequate. We can always "improve" the fit by throwing in extra terms, but then the model might only fit these data well.

• See Figures 2.4 - 2.11. Example: U.S. Federal Reserve Board Production Index, an index of economic productivity. Plots of the data and ACF/PACF clearly indicate nonstationarity. The ACF of $\nabla X_t$ indicates seasonal ($s = 12$) nonstationarity. The ACF and PACF of $\nabla_{12}\nabla X_t$ suggest possible models

ARIMA($p = 1$ to $2$, $d = 1$, $q = 1$ to $4$) × ($P = 2$, $D = 1$, $Q = 1$)$_{s=12}$.

The ARIMA Search facility, using BIC, picks out ARIMA(1,1,0)×(0,1,1)$_{12}$. Note $q = P = 0$; the other AR and MA features seem to have accounted for the seeming MA and seasonal AR(2) behaviour.

Figure 2.4. U.S. Federal Reserve Board Production Index (frb.asd in ASTSA). Non-stationarity of the mean is obvious.

Figure 2.5. ACF decays slowly, PACF spikes at m = 1; both indicate nonstationarity.

Figure 2.6. ∇X_t; nonstationary variance exhibited.

Figure 2.7. ∇X_t; the ACF shows seasonal nonstationarity.

Figure 2.8. ∇_{12}∇X_t. The peak in the ACF at m = 12 indicates a seasonal (s = 12) MA(1). The PACF indicates AR(1) or AR(2) and (possibly) seasonal AR(2).

Figure 2.9. Time domain > ARIMA search applied to X_t; 3 × 5 × 3 × 3 = 135 possible models to be fitted.

Figure 2.10. ARIMA search with the BIC criterion picks out ARIMA(1,1,0)×(0,1,1)_{12}. The (default) AICc criterion picks out ARIMA(0,1,4)×(2,1,1)_{12}. One could argue for either one.

Figure 2.11. Output for the "best" model. Save the residuals for further analysis.

• There are several "information criteria". All seek to minimize the residual variation while imposing penalties for nonparsimonious models. Let $K$ be the number of AR and MA parameters fitted, and let $\hat\sigma_w^2(K)$ be the estimated variance of the residual noise. Then

$AIC(K) = \ln\hat\sigma_w^2(K) + \frac{2K}{T},$

$AIC_c(K) = \ln\hat\sigma_w^2(K) + \frac{T+K}{T-K-2}$ (small sample modification),

$BIC(K) = \ln\hat\sigma_w^2(K) + \frac{K\ln T}{T},$ etc.
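A minimal sketch of these three criteria as code; the $(K, \hat\sigma_w^2)$ pairs fed in are hypothetical, just to show how the penalty trades off against residual variance.

```python
import numpy as np

def aic(sigma2, K, T):
    return np.log(sigma2) + 2 * K / T

def aicc(sigma2, K, T):
    return np.log(sigma2) + (T + K) / (T - K - 2)

def bic(sigma2, K, T):
    return np.log(sigma2) + K * np.log(T) / T

T = 372
for K, s2 in [(2, 1.90), (4, 1.82), (7, 1.80)]:   # hypothetical fits of increasing size
    print(K, round(aic(s2, K, T), 4), round(aicc(s2, K, T), 4), round(bic(s2, K, T), 4))
```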

• See Figures 2.12 - 2.15. Residual analysis. Save the residuals. Plot their ACF/PACF; none of the values should be significant (in principle!). These plots were examined but are not included here; all were satisfactory. Also plot the residuals against $\nabla_{12}\nabla X_{t-1}$; they should be approximately uncorrelated and not exhibit any patterns. (A consequence of the fact that $Y - E[Y \mid X]$ is uncorrelated with $X$ is that $X_t - X^{t-1}_t$ is uncorrelated with $X_s$ for $s < t$; i.e., except for the fact that the residuals use estimated parameters, they are uncorrelated with the predictors. This assumes stationarity, so here we use the differenced series.) Click on Graph > Residual tests.

Figure 2.12. Residuals vs. time.

Figure 2.13. Store the residuals $\hat w_t = \{\hat w_{15}, \dots, \hat w_{372}\}$ and the differenced predictors $\nabla_{12}\nabla X_t = \{\nabla_{12}\nabla X_{15}, \dots, \nabla_{12}\nabla X_{372}\}$ (see Transform > Take a subset). Then use Graph > 2D Scatterplot and "Detail view", with the $\hat w_t$ as series 1, $\nabla_{12}\nabla X_t$ as series 2, and "lag = 1", to get this plot. Use "Full view" (or merely double click on this plot) to get plots at a variety of lags, as described in the ASTSA manual.

Figure 2.14. Graphical output from Graph > Residual tests.

Figure 2.15. Printed output from Graph > Residual tests.

• Cumulative spectrum test. We will see later that if the residuals $\{\hat w_t\}$ are white noise, then their "spectrum" $f(\nu)$ is constantly equal to the variance, i.e. is 'flat'. Then the integrated, or cumulative, spectrum is linear with slope = variance; here it is divided by $\hat\sigma_w^2/2$ and so should be linear with a slope of 2. Accounting for random variation, it should at least remain in a band around a straight line through the origin with a slope of 2. The graphical output confirms this, and the printed output gives a non-significant p-value for the hypothesis of a flat spectrum.

13. Lecture 13

• Box-Pierce test. Under the hypothesis of whiteness we expect $\hat\rho_w(m)$ to be small in absolute value for all $m$; a test can be based on

$Q = T(T+2)\sum_{m=1}^M\frac{\hat\rho_w^2(m)}{T - m},$

which is approximately $\sim \chi^2_{M-K}$ under the null hypothesis. The p-value is calculated and reported for $M - K = 1, 20$.

• Fluctuations test (= runs test). Too few or too many fluctuations (i.e. changes from an upward trend to a downward trend, or vice versa) constitute evidence against randomness. E.g. an AR(1) ($\rho(1) = \phi$) with $\phi$ near $+1$ will have few fluctuations; $\phi$ near $-1$ will result in many.
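A minimal sketch of the $Q$ statistic above and its chi-square p-value. The "residuals" here are a white-noise placeholder; $M$ and $K$ are chosen to mimic the $M - K = 1$ and $20$ reports mentioned above, and the tail probability comes from scipy.

```python
import numpy as np
from scipy.stats import chi2

def q_test(resid, M, K):
    T = len(resid)
    r = resid - resid.mean()
    c0 = np.sum(r * r)
    rho = np.array([np.sum(r[m:] * r[:T - m]) / c0 for m in range(1, M + 1)])
    Q = T * (T + 2) * np.sum(rho**2 / (T - np.arange(1, M + 1)))
    return Q, chi2.sf(Q, df=M - K)          # p-value: P(chi2_{M-K} > Q)

rng = np.random.default_rng(7)
w_hat = rng.normal(size=358)                # placeholder residual series
K = 2                                       # e.g. one AR and one seasonal MA parameter
for M in (K + 1, K + 20):
    print(M, *np.round(q_test(w_hat, M, K), 3))
```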

• Normal scores (Q-Q) test. The quantiles of a distribution $F = F_w$ are the values $F^{-1}(q)$, i.e. the values below which $w$ lies with probability $q$. The sample versions are the order statistics $w_{(1)} < w_{(2)} < \dots < w_{(T)}$: the probability of a value $w$ falling at or below $w_{(t)}$ is estimated by $t/T$, so $w_{(t)}$ can be viewed as the $(t/T)$th sample quantile. If the residuals are normal then a plot of the sample quantiles against the standard normal quantiles should be linear, with intercept equal to the mean and slope equal to the standard deviation (this follows from $F^{-1}(q) = \mu + \sigma\Phi^{-1}(q)$ if $F$ is the $N(\mu, \sigma^2)$ df; students should verify this identity). The strength of the linearity is measured by the correlation between the two sets of quantiles; values too far below 1 lead to rejection of the hypothesis of normality of the white noise.
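A minimal sketch of the quantile correlation. The plotting positions $t/(T+1)$ are used below to avoid $\Phi^{-1}(1) = \infty$ at the largest order statistic; this is a detail of the sketch, not necessarily of ASTSA's implementation.

```python
import numpy as np
from scipy.stats import norm

def qq_correlation(resid):
    # correlation between sorted residuals and standard normal quantiles
    T = len(resid)
    sample_q = np.sort(resid)
    theory_q = norm.ppf(np.arange(1, T + 1) / (T + 1))
    return np.corrcoef(sample_q, theory_q)[0, 1]

rng = np.random.default_rng(8)
print(qq_correlation(rng.normal(size=358)))        # near 1 for normal residuals
print(qq_correlation(rng.exponential(size=358)))   # noticeably smaller for skewed residuals
```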

• The residuals from the more complex model chosen by AICc looked no better.

• A log or square root transformation of the original data sometimes improves normality; logging the data made no difference in this case.

• In contrast, if we fit an ARIMA(0,1,0)×(0,1,1)$_{12}$, then the ACF/PACF of the residuals clearly shows the need for the AR(1) component; see Figures 2.16, 2.17. Note also that the largest residual is now at $t = 324$: adding the AR term has explained the drop at this point to such an extent that it is now less of a problem than the smaller drop at $t = 128$.

Figure 2.16. ACF and PACF of residuals from the ARIMA(0,1,0)×(0,1,1)_{12} fit.

Figure 2.17. Residual tests for the ARIMA(0,1,0)×(0,1,1)_{12} fit.

Forecasting in the FRB example. The fitted model is

$(1 - \phi B)\,\nabla\nabla_{12}X_t = \left(1 - \Theta B^{12}\right)w_t; \qquad (1)$

this is expanded as

$X_t = (1 + \phi)X_{t-1} - \phi X_{t-2} + X_{t-12} - (1 + \phi)X_{t-13} + \phi X_{t-14} + w_t - \Theta w_{t-12}. \qquad (2)$

Conditioning on $\mathbf{X}^t$ gives

$X^t_{t+l} = (1 + \phi)X^t_{t+l-1} - \phi X^t_{t+l-2} + X^t_{t+l-12} - (1 + \phi)X^t_{t+l-13} + \phi X^t_{t+l-14} + w^t_{t+l} - \Theta w^t_{t+l-12}.$

We consider the case $1 \le l \le 12$ only; $l = 13$ is left as an exercise. Note that the model is not stationary but seems to be invertible (since $|\hat\Theta| = .6962 < 1$). Thus $w_t$ can be expressed in terms of $\mathbf{X}^t$ and, if we assume $w_0 = X_0 = 0$, we can express $X_t$ in terms of $\mathbf{w}^t$ by iterating (2). Then conditioning on $\mathbf{X}^t$ is equivalent to conditioning on $\mathbf{w}^t$. Thus $X^t_{t+l-k} = X_{t+l-k}$ for $k = 12, 13, 14$ and $w^t_{t+l} = 0$:

$X^t_{t+l} = (1 + \phi)X^t_{t+l-1} - \phi X^t_{t+l-2} + X_{t+l-12} - (1 + \phi)X_{t+l-13} + \phi X_{t+l-14} - \Theta w^t_{t+l-12}.$

These become

$l = 1: \quad X^t_{t+1} = (1 + \phi)X_t - \phi X_{t-1} + X_{t-11} - (1 + \phi)X_{t-12} + \phi X_{t-13} - \Theta w_{t-11}, \qquad (3)$

$l = 2: \quad X^t_{t+2} = (1 + \phi)X^t_{t+1} - \phi X_t + X_{t-10} - (1 + \phi)X_{t-11} + \phi X_{t-12} - \Theta w_{t-10},$

and for $3 \le l \le 12$: from the previous forecasts and $w_{t-9}, \dots, w_t$. To get $w_t, \dots, w_{t-11}$ we write (1) as

$w_t = \Theta w_{t-12} + \left\{X_t - (1 + \phi)X_{t-1} + \phi X_{t-2} - X_{t-12} + (1 + \phi)X_{t-13} - \phi X_{t-14}\right\} = \Theta w_{t-12} + f_t, \text{ say}, \qquad (4)$

and calculate successively, using the assumption $w_0 = X_0 = 0$:

$w_1 = f_1, \; w_2 = f_2, \dots, w_{12} = f_{12}, \; w_{13} = \Theta w_1 + f_{13}, \; w_{14} = \Theta w_2 + f_{14}, \text{ etc.}$

To get the residuals $\hat w_t = X_t - \hat X^{t-1}_t$, replace $t$ by $t - 1$ in (3):

$X^{t-1}_t = (1 + \phi)X_{t-1} - \phi X_{t-2} + X_{t-12} - (1 + \phi)X_{t-13} + \phi X_{t-14} - \Theta w_{t-12},$

so $\hat w_t = X_t - \hat X^{t-1}_t$ is (4) with all parameters estimated, and

$\hat\sigma_w^2 = \sum_{t=15}^T\hat w_t^2/(T - 16).$

(The first 14 residuals employ the assumption $w_0 = X_0 = 0$ and are not used.) The forecast variances and prediction intervals require us to write $X_t = \psi(B)w_t$, and then

$\widehat{PI} = \hat X^t_{t+l} \pm z_{\alpha/2}\,\hat\sigma_w\sqrt{\sum_{0 \le k < l}\hat\psi_k^2}.$

The model is not stationary and so the coefficients of $\psi(B)$ will not be absolutely summable; however, under the assumption that $w_0 = 0$, only finitely many of the $\psi_k$ are needed. Then in (2),

$\left\{(1 - \phi B)(1 - B)\left(1 - B^{12}\right)\right\}\left(1 + \psi_1 B + \psi_2 B^2 + \dots + \psi_k B^k + \dots\right)w_t = \left(1 - \Theta B^{12}\right)w_t,$

so

$\left(1 - (1 + \phi)B + \phi B^2 - B^{12} + (1 + \phi)B^{13} - \phi B^{14}\right)\left(1 + \psi_1 B + \psi_2 B^2 + \dots + \psi_k B^k + \dots\right) = 1 - \Theta B^{12}.$

For $1 \le k < 12$ the coefficient of $B^k$ is $= 0$ on the RHS; on the LHS it is $\psi_1 - (1 + \phi)$ ($k = 1$) and $\psi_k - (1 + \phi)\psi_{k-1} + \phi\psi_{k-2}$ ($k > 1$). Thus

$\psi_0 = 1, \quad \psi_1 = 1 + \phi, \quad \psi_k = (1 + \phi)\psi_{k-1} - \phi\psi_{k-2}, \; k = 2, \dots, 11.$

• See Figures 2.18 - 2.20.
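A minimal sketch of this recursion and the resulting prediction-interval half-widths; $\hat\phi$ and $\hat\sigma_w$ below are placeholders standing in for the ASTSA estimates, not the actual fitted values.

```python
import numpy as np

phi_hat, sigma_hat, z = 0.3, 1.0, 1.96        # placeholder phi-hat, sigma-hat, z_{alpha/2}

# psi_0 = 1, psi_1 = 1 + phi, psi_k = (1 + phi) psi_{k-1} - phi psi_{k-2}, k = 2..11
psi = [1.0, 1 + phi_hat]
for k in range(2, 12):
    psi.append((1 + phi_hat) * psi[k - 1] - phi_hat * psi[k - 2])
psi = np.array(psi)

half_width = z * sigma_hat * np.sqrt(np.cumsum(psi**2))   # lead times l = 1, ..., 12
print(np.round(psi, 4))
print(np.round(half_width, 3))    # PI half-widths grow with the lead time l
```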

Figure 2.18. Time domain > ARIMA, to fit just one model. (See also Time domain > Autoregression.)

Figure 2.19. Forecasts up to 12 steps ahead.

Figure 2.20. Graphical output, including forecasts.

14. Lecture 14 (Review)

Recall the SOI and Recruits series. We earlier attempted to predict Recruits from SOI by regressing Recruits on lagged SOI ($m = 3, \dots, 12$): 11 parameters, including the intercept. See Figure 2.21.

Figure 2.21. Recruits: data and predictions from a regression on SOI lagged by m = 3, ..., 12 months. R² = 67%.

Let's look for a better, and perhaps more parsimonious, model. Put $X_t = \nabla_{12}\mathrm{SOI}_t$, $Y_t = \nabla_{12}\mathrm{Recruits}_t$ (both re-indexed: $t = 1, \dots, 441$); this is to remove an annual trend, and to make this example more interesting. Both look stationary. (What do we look for in order to make this statement?) Consider the CCF, with $Y_t$ as input and $X_t$ as output (Figure 2.22).

Figure 2.22. COV[∇_{12}SOI, ∇_{12}RECRUITS] is largest at lags m = -8, -9, -10.

The largest (absolute) values of $\mathrm{COV}[X_{t+m}, Y_t] = \mathrm{COV}[X_t, Y_{t-m}]$ are at $m = -8, -9, -10$, indicating a linear relationship between $X_t$ and $Y_{t+8}, Y_{t+9}, Y_{t+10}$. Thus we might regress $Y_{t+10}$ on $Y_{t+9}, Y_{t+8}$ and $X_t$; equivalently, $Y_t$ on $Y_{t-1}, Y_{t-2}, X_{t-10}$:

$Y_t = \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \beta_3 X_{t-10} + Z_t, \qquad (t = 11, \dots, 441)$

for a series $\{Z_t\}$ following a time series model, and being white noise if we're lucky. See Figure 2.23.

Figure 2.23. Data and regression fit, Y_t on Y_{t-1}, Y_{t-2}, X_{t-10}. R² = 90%.
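A minimal sketch of this lagged regression by least squares. The two series here are random placeholders; with the actual ∇_{12}SOI and ∇_{12}Recruits series the coefficients should resemble the printout below.

```python
import numpy as np

rng = np.random.default_rng(9)
x, y = rng.normal(size=441), rng.normal(size=441)    # stand-ins for X_t, Y_t, t = 1..441

t = np.arange(10, 441)                               # rows for t = 11, ..., 441 (0-based)
D = np.column_stack([y[t - 1], y[t - 2], x[t - 10]]) # regressors Y_{t-1}, Y_{t-2}, X_{t-10}
beta, *_ = np.linalg.lstsq(D, y[t], rcond=None)
z = y[t] - D @ beta                                  # residual series {Z_t}
print(np.round(beta, 4), np.round(np.var(z), 4))
```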

Multiple regression on Y(t)
AICc = 6.07724   Variance = 158.450   df = 428   R2 = 0.8995

predictor   coef      st. error   t-ratio    p-value
beta(1)      1.3086    .0445       29.4389    .000
beta(2)      -.4199    .0441       -9.5206    .000
beta(3)     -4.1032   1.7181       -2.3883    .017

Figure 2.24. ACF and PACF for the residual series {Z_t}.

The residual series $\{Z_t\}$ still exhibits some trends (see Figure 2.24): seasonal ARMA(4,1)$_{12}$? ARIMA Search, with AICc, selects ARMA(1,1)$_{12}$. With BIC, ARMA(1,2)$_{12}$ is selected. Each of these was fitted. Both failed the cumulative spectrum test, and the Box-Pierce test with $M - K = 1$, although the p-values at $M - K = 20$ were non-significant. (What are these, and what do they test for? What does "p-value" mean?) This suggested adding an AR(1) term to the simpler of the two previous models. In terms of the backshift operator (what is it? how are these polynomials manipulated?):

$(1 - \phi B)\left(1 - \Phi B^{12}\right)Z_t = \left(1 - \Theta B^{12}\right)w_t.$

ARIMA(1,0,0)x(1,0,1)x12 from Z(t)
AICc = 5.45580   variance = 85.0835   d.f. = 415
Start values derived from ACF and PACF

predictor   coef      st. error   t-ratio    p-value
AR(1)       -.0852     .04917     -1.7327    .083
SAR(1)      -.1283     .05030     -2.5516    .011
SMA(1)       .8735     .02635     33.1524    .000

(1 +.09B1) (1 +.13B12) Z(t) = (1 -.87B12) w(t)

See Figure 2.25. All residual tests were passed, with the exception of the Normal Scores test (what does this mean? how does this test work?).

Figure 2.25. ACF and PACF of residuals from an ARIMA(1,0,0)×(1,0,1)_{12} fit to Z_t.

Note that:

• $\{Z_t\}$ seems to be linear, hence stationary. (What does linearity mean, and how can we make this claim? How can we then represent $Z_t$ as a linear series $Z_t = \alpha(B)w_t = \sum_{k=0}^{\infty}\alpha_k w_{t-k}$?)

• $\{Z_t\}$ seems to be invertible. (What does invertibility mean, and how can we make this claim? Representation?)

The polynomial $\beta(B) = 1 - \beta_1 B - \beta_2 B^2$ has zeros $B = 1.34$, $1.77$. This yields the additional representation

$\beta(B)Y_t = \beta_3 B^{10}X_t + \alpha(B)w_t, \quad \text{i.e.} \quad Y_t = \beta_3\gamma(B)B^{10}X_t + \delta(B)w_t,$

with $1/\beta(B)$ expanded as a series $\gamma(B)$ and $\delta(B) = \alpha(B)/\beta(B)$. Assume $\{w_t\}$ independent of $\{X_t, Y_t\}$.

Predictions: you should be sufficiently familiar with the prediction theory covered in class that you can follow the following development, although I wouldn't expect you to come up with something like this on your own. (The point of knowing the theory is so that you can still do something sensible when the theory you know doesn't quite apply!)

The best forecast of $Y_{t+l}$, in the case of a single series $\{Y_t\}$, is $\hat Y^t_{t+l} = E\left[Y_{t+l} \mid \mathbf{Y}^t\right]$ (estimated). (Why? You should be able to apply the Double Expectation Theorem so as to derive this result. And what does "best" mean, in this context?)
