5. Some History

• Origins in problems connected with estimation of ore reserves in mineral exploration/mining (Krige, 1951).
• Subsequent development largely independent of "mainstream" spatial statistics, initially by Matheron and colleagues at École des Mines, Fontainebleau.
• Parallel developments by Matérn (1946, 1960) and Whittle (1954, 1962, 1963).
• Ripley (1981) re-casts kriging in the terminology of stochastic process prediction.
• Significant cross-fertilization during the 1980's and 1990's (e.g. the variogram is now a standard statistical tool for analysing correlated data in space and/or time).
• But still vigorous debate on practical issues:
  – prediction vs inference
  – role of explicit probability models
6. Core Geostatistical Problems

• Design
  – how many locations?
  – how many measurements?
  – spatial layout of the locations?
  – what to measure at each location?
• Modelling
  – probability model for the signal, [S]
  – conditional probability model for the measurements, [Y | S]
• Estimation
  – assign values to unknown model parameters
  – make inferences about (functions of) model parameters
• Prediction
  – evaluate [T | Y], the conditional distribution of the target given the data
7. Model-Based Geostatistics

• declares an explicit stochastic model
• applies general statistical principles for inference

Notation: (Y_i, x_i) : i = 1, ..., n

• {x_i : i = 1, ..., n} is the sampling design
• {Y(x) : x ∈ A} is the measurement process
• {S(x) : x ∈ A} is the signal process
• T = F(S) is the target for prediction
• [S, Y] = [S][Y | S] is the geostatistical model
Traditional geostatistics:

• avoids explicit references to the parametric specification of the model
• inference via variograms (Matheron's "estimating and choosing")
• complex variogram structures are often used
• concentrates on linear estimators
• specific methods/paradigms for:
  – point prediction (SK, OK, KTE, UK)
  – prediction of non-linear functionals (IK, DK)
  – predictive estimation (IK, DK)
  – simulations from the predictive distribution (SGSIM, SISIM, ...)
• the "kriging menu"
PART II: SPATIAL PREDICTION AND GAUSSIAN MODELS

1. The Gaussian Model
2. Specification of the Correlation Function
3. Stochastic Process Prediction
4. Linear Geostatistics
5. Prediction under the Gaussian Model
6. What Does Kriging Actually Do to the Data?
7. Prediction of Functionals
8. Directional Effects
9. Non-stationary Gaussian Models
1. The Gaussian Spatial Model

(a) S(·) is a stationary Gaussian process with
    i.   E[S(x)] = µ,
    ii.  Var{S(x)} = σ²,
    iii. ρ(u) = Corr{S(x), S(x − u)};
(b) the conditional distribution of Y_i given S(·) is Gaussian with mean S(x_i) and variance τ²;
(c) Y_i : i = 1, ..., n are mutually independent, conditional on S(·).

[Figure: simulated data in 1-D illustrating the elements of the model: data Y(x_i) (dots), signal S(x) (curved line) and mean µ (horizontal line).]
An Equivalent Formulation:

Y_i = S(x_i) + ε_i : i = 1, ..., n,

where the ε_i : i = 1, ..., n are mutually independent, identically distributed with ε_i ∼ N(0, τ²).

Then the joint distribution of Y is multivariate Normal,

Y ∼ MVN(µ1, σ²R + τ²I)

where:
• 1 denotes an n-element vector of ones,
• I is the n × n identity matrix,
• R is the n × n matrix with (i, j)th element ρ(u_ij), where u_ij = ||x_i − x_j|| is the Euclidean distance between x_i and x_j.
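As a concrete illustration (a minimal sketch, not part of the original notes; the 1-D locations and parameter values are arbitrary), the following Python code builds Var(Y) = σ²R + τ²I for an exponential correlation function:

```python
import numpy as np

def exponential_corr(u, phi):
    """Exponential correlation function rho(u) = exp(-u/phi)."""
    return np.exp(-u / phi)

# hypothetical design: n = 10 locations on [0, 1]
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, size=10))

u = np.abs(x[:, None] - x[None, :])    # u_ij = ||x_i - x_j|| in 1-D
sigma2, tau2, phi = 1.0, 0.25, 0.2
R = exponential_corr(u, phi)           # correlation matrix R
cov_Y = sigma2 * R + tau2 * np.eye(len(x))
```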
2. Specification of the Correlation Function

Differentiability of Gaussian processes

• A formal mathematical description of the smoothness of a spatial surface S(x) is its degree of differentiability.
• A process S(x) is mean-square continuous if, for all x, E[{S(x + h) − S(x)}²] → 0 as h → 0.
• S(x) is mean-square differentiable if there exists a process S′(x) such that, for all x,

  E[{(S(x + h) − S(x))/h − S′(x)}²] → 0 as h → 0.

• The mean-square differentiability of S(x) is directly linked to the differentiability of its covariance function.

Theorem 3. Let S(x) be a stationary Gaussian process with correlation function ρ(u) : u ∈ ℝ. Then:
• S(x) is mean-square continuous iff ρ(u) is continuous at u = 0;
• S(x) is k times mean-square differentiable iff ρ(u) is (at least) 2k times differentiable at u = 0.
(a) The Matérn family

The correlation function is given by:

ρ(u) = {2^(κ−1) Γ(κ)}^{-1} (u/φ)^κ K_κ(u/φ)

• κ and φ are parameters
• K_κ(·) denotes the modified Bessel function of order κ
• valid for φ > 0 and κ > 0
• κ = 0.5: exponential correlation function
• κ → ∞: Gaussian correlation function
• S(x) is mean-square ⌈κ⌉ − 1 times differentiable

[Figure: three examples of the Matérn correlation function with φ = 0.2 and κ = 1 (solid line), κ = 1.5 (dashed line) and κ = 2 (dotted line).]
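The formula translates directly into Python (a sketch using scipy's modified Bessel function; the special case ρ(0) = 1 is handled separately because the expression is 0/0 at u = 0):

```python
import numpy as np
from scipy.special import gamma, kv   # kv(kappa, z) is the modified Bessel function K_kappa

def matern_corr(u, phi, kappa):
    """Matern correlation: rho(u) = {2^(kappa-1) Gamma(kappa)}^(-1) (u/phi)^kappa K_kappa(u/phi)."""
    u = np.asarray(u, dtype=float)
    rho = np.ones_like(u)             # rho(0) = 1 (the limit as u -> 0)
    pos = u > 0
    s = u[pos] / phi
    rho[pos] = (s ** kappa) * kv(kappa, s) / (2.0 ** (kappa - 1) * gamma(kappa))
    return rho
```

With κ = 0.5 this reproduces exp(−u/φ) up to floating-point error, consistent with the exponential special case above.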
(b) The powered exponential family

ρ(u) = exp{−(u/φ)^κ}

• defined for φ > 0 and 0 < κ ≤ 2
• φ and κ are parameters
• mean-square continuous (but non-differentiable) if κ < 2
• mean-square infinitely differentiable if κ = 2
• ρ(u) very ill-conditioned when κ = 2
• κ = 1: exponential correlation function
• κ = 2: Gaussian correlation function

Conclusion: not as flexible as it looks.

[Figure: three examples of the powered exponential correlation function with φ = 0.2 and κ = 1 (solid line), κ = 1.5 (dashed line) and κ = 2 (dotted line).]
(c) The spherical family

ρ(u; φ) = 1 − (3/2)(u/φ) + (1/2)(u/φ)³ for 0 ≤ u ≤ φ, and ρ(u; φ) = 0 for u > φ

• φ > 0 is a parameter
• finite range
• non-differentiable at the origin

[Figure: the spherical correlation function with φ = 0.6.]
Comparable Simulations (same seed)

[Figure: simulations with Matérn correlation functions with φ = 0.2 and κ = 0.5 (solid line), κ = 1 (dashed line) and κ = 2 (dotted line).]
[Figure: simulations with powered exponential correlation functions with φ = 0.2 and κ = 1 (solid line), κ = 1.5 (dashed line) and κ = 2 (dotted line).]
[Figure: simulations with spherical correlation function (φ = 0.6).]
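The "same seed" device can be reproduced with a short Cholesky-based simulator (a sketch; matern_corr is the function from the earlier block, and the jitter term is a standard numerical safeguard, not part of the model):

```python
import numpy as np

def simulate_gp(x, corr, seed=0, jitter=1e-8):
    """Simulate a zero-mean, unit-variance Gaussian process at 1-D locations x.

    corr: callable mapping a distance matrix to a correlation matrix.
    Re-using the same seed across different corr choices gives
    comparable simulations, as in the figures above."""
    u = np.abs(x[:, None] - x[None, :])
    R = corr(u) + jitter * np.eye(len(x))    # jitter keeps the Cholesky factor stable
    L = np.linalg.cholesky(R)
    z = np.random.default_rng(seed).standard_normal(len(x))
    return L @ z

x = np.linspace(0.0, 1.0, 201)
s_rough  = simulate_gp(x, lambda u: matern_corr(u, phi=0.2, kappa=0.5))
s_smooth = simulate_gp(x, lambda u: matern_corr(u, phi=0.2, kappa=2.0))
```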
3. Stochastic Process Prediction

General results for prediction:

• goal: predict the realised value of a (scalar) random variable T, using data y, a realisation of a (vector) random variable Y
• predictor: a predictor of T is any function of Y, T̂ = t(Y)
• best choice: needs a criterion
• MMSPE: the best predictor minimises MSPE(T̂) = E[(T − T̂)²]

Theorem 1. The minimum mean square error predictor of T is T̂ = E(T | Y).

Theorem 2.
(a) The prediction mean square error of T̂ is E[(T − T̂)²] = E_Y[Var(T | Y)] (the prediction variance is an estimate of the MSPE).
(b) E[(T − T̂)²] ≤ Var(T), with equality if T and Y are independent random variables.
Comments

• We call T̂ the least squares predictor for T, and Var(T | Y) its prediction variance.
• Var(T) − Var(T | Y) measures the contribution of the data (exploiting dependence between T and Y).
• Point prediction and prediction variance are summaries.
• The complete answer is the distribution [T | Y].
• Not transformation invariant: T̂ being the best predictor for T does NOT necessarily imply that g(T̂) is the best predictor for g(T).
4. Linear Gaussian Geostatistics

Suppose the target for prediction is T = S(x). A predictor for T is a function T̂ = T̂(Y). The mean square prediction error (MSPE) is MSPE(T̂) = E[(T̂ − T)²]. The predictor which minimises the MSPE is T̂ = E[S(x) | Y].

Two approaches:

• Model-based geostatistics:
  – specify a probability model for [Y, T]
  – choose T̂ to minimise MSPE(T̂) amongst all functions T̂(Y)
• Traditional (linear) geostatistics:
  – assume that T̂ is linear in Y, so that T̂ = b_0(x) + Σ_{i=1}^n b_i(x) Y_i
  – choose the b_i to minimise MSPE(T̂) within the class of linear predictors

The results coincide under Gaussian assumptions.
5. Prediction Under the Gaussian Model

• assume that the target for prediction is T = S(x)
• [T, Y] is jointly multivariate Gaussian
• T̂ = E(T | Y), Var(T | Y) and [T | Y] can be easily derived from a standard result:

Theorem 4. Let X = (X_1, X_2) be jointly multivariate Gaussian, with mean vector µ = (µ_1, µ_2) and covariance matrix

Σ = [ Σ_11  Σ_12 ]
    [ Σ_21  Σ_22 ],

i.e. X ∼ MVN(µ, Σ). Then the conditional distribution of X_1 given X_2 is also multivariate Gaussian, X_1 | X_2 ∼ MVN(µ_{1|2}, Σ_{1|2}), where

µ_{1|2} = µ_1 + Σ_12 Σ_22^{-1} (X_2 − µ_2)

and

Σ_{1|2} = Σ_11 − Σ_12 Σ_22^{-1} Σ_21.
For the geostatistical model, [T, Y] is multivariate Gaussian with mean vector µ1 and covariance matrix

[ σ²     σ²r′       ]
[ σ²r    τ²I + σ²R  ]

where r is a vector with elements r_i = ρ(||x − x_i||) : i = 1, ..., n.

Hence, using Theorem 4 with X_1 = T and X_2 = Y, we find that the minimum mean square error predictor for T = S(x) is

T̂ = µ + σ²r′(τ²I + σ²R)^{-1}(Y − µ1)     (1)

with prediction variance

Var(T | Y) = σ² − σ²r′(τ²I + σ²R)^{-1}σ²r.     (2)
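Equations (1) and (2) translate directly into a few lines of linear algebra (a sketch under the model above; corr is any correlation function of distance, e.g. corr = lambda u: matern_corr(u, phi=0.2, kappa=1.0) using the earlier block):

```python
import numpy as np

def simple_krige(x0, x, y, mu, sigma2, tau2, corr):
    """Simple kriging of T = S(x0) at a single 1-D location: equations (1) and (2)."""
    n = len(x)
    R = corr(np.abs(x[:, None] - x[None, :]))   # R_ij = rho(u_ij)
    r = corr(np.abs(x0 - x))                    # r_i = rho(||x0 - x_i||)
    K = tau2 * np.eye(n) + sigma2 * R           # Var(Y) = tau^2 I + sigma^2 R
    q = np.linalg.solve(K, sigma2 * r)          # K symmetric, so q' = sigma^2 r' K^{-1}
    t_hat = mu + q @ (y - mu)                   # equation (1)
    pred_var = sigma2 - q @ (sigma2 * r)        # equation (2)
    return t_hat, pred_var
```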
Notes

1. Because the conditional variance does not depend on Y, the prediction mean square error is equal to the prediction variance.
2. Equality of prediction mean square error and prediction variance is a special property of the multivariate Gaussian distribution, not a general result.
3. In conventional geostatistical terminology, construction of the surface Ŝ(x), where T̂ = Ŝ(x) is given by (1), is called simple kriging. This name is a reference to D.G. Krige, who pioneered the use of statistical methods in the South African mining industry (Krige, 1951).
6. What Does Kriging Actually Do to the Data?

The minimum mean square error predictor for S(x) is given by

T̂ = Ŝ(x) = µ + Σ_{i=1}^n w_i(x)(Y_i − µ)
          = {1 − Σ_{i=1}^n w_i(x)} µ + Σ_{i=1}^n w_i(x) Y_i

• the predictor Ŝ(x) compromises between its unconditional mean µ and the observed data Y
• the nature of the compromise depends on the target location x, the data locations x_i and the values of the model parameters
• we call the w_i(x) the prediction weights
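From equation (1), the weight vector is σ²r′(τ²I + σ²R)^{-1}, so the weights can be extracted directly. A sketch reproducing the setting of the weight plots below (10 equally spaced locations, target x = 0.5; matern_corr assumed from the earlier block):

```python
import numpy as np

x = -0.05 + 0.1 * np.arange(1, 11)        # the slides' 10 equally spaced locations
x0, sigma2, tau2 = 0.50, 1.0, 0.0

for phi in (0.05, 0.15, 0.30):
    R = matern_corr(np.abs(x[:, None] - x[None, :]), phi=phi, kappa=2.0)
    r = matern_corr(np.abs(x0 - x), phi=phi, kappa=2.0)
    w = np.linalg.solve(tau2 * np.eye(len(x)) + sigma2 * R, sigma2 * r)
    print(f"phi = {phi}: weights sum to {w.sum():.3f}")
```

As the plots that follow illustrate, stronger correlation (larger φ) pulls the weight sum towards 1, so the predictor leans on the data rather than on µ.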
6.1 Effects on predictions

(a) Varying the correlation function

[Figure: predictions from 10 equally spaced data points using exponential (solid line) or Matérn of order 2 (dashed line) correlation functions.]
[Figure: predictions from 10 randomly spaced data points using exponential (solid line) or Matérn of order 2 (dashed line) correlation functions.]
(b) Varying the correlation parameter

[Figure: predictions from 10 randomly spaced data points using the Matérn (κ = 2) correlation function and different values of φ: 0.05 (solid line), 0.1 (dashed line) and 0.5 (thick dashed line).]
(c) Varying the noise-to-signal ratio

[Figure: predictions from 10 randomly spaced data points using the Matérn correlation function and different values of τ²: 0 (solid line), 0.25 (dashed line) and 0.5 (thick dashed line).]
[Figure: prediction variances from 10 randomly spaced data points using the Matérn correlation function and different values of τ²: 0 (solid line), 0.25 (dashed line) and 0.5 (thick dashed line).]
6.2 Effects on kriging weights

(a) The prediction weights: varying φ

[Figure: prediction weights for 10 equally spaced data points with target location x = 0.50, for φ = 0, 0.05, 0.15, 0.30.]

i.   varying parameter: φ = 0, 0.05, 0.15, 0.30
ii.  locations: equally spaced, x_i = −0.05 + 0.1 i : i = 1, ..., 10
iii. prediction location: x = 0.50
iv.  correlation function: Matérn with κ = 2
v.   nugget: τ² = 0
(b) The prediction weights: varying κ

[Figure: prediction weights for 10 equally spaced data points with target location x = 0.50, for κ = 0.5, 1, 2, 5.]

i.   varying parameter: κ = 0.5, 1, 2, 5
ii.  locations: equally spaced, x_i = −0.05 + 0.1 i : i = 1, ..., 10
iii. prediction location: x = 0.50
iv.  correlation function: Matérn with φ = 0.1
v.   nugget: τ² = 0
(c) The prediction weights: varying τ²

[Figure: prediction weights for 10 equally spaced data points with target location x = 0.45, for τ² = 0, 0.1, 0.25, 0.5.]

i.   varying parameter: τ² = 0, 0.1, 0.25, 0.5
ii.  locations: equally spaced, x_i = −0.05 + 0.1 i : i = 1, ..., 10
iii. prediction location: x = 0.45
iv.  correlation function: Matérn with κ = 2 and φ = 0.1
7. Prediction of Functionals

Let T be any linear functional of S,

T = ∫_A w(x) S(x) dx

for some prescribed weighting function w(x).

Under the Gaussian model:
• [T, Y] is multivariate Gaussian;
• [T | Y] is univariate Gaussian;
• the conditional mean and variance are:

  E[T | Y] = ∫_A w(x) E[S(x) | Y] dx

  Var[T | Y] = ∫_A ∫_A w(x) w(x′) Cov{S(x), S(x′) | Y} dx dx′

Note in particular that

T̂ = ∫_A w(x) Ŝ(x) dx.
In words:

• given a predicted surface Ŝ(x), it is legitimate simply to calculate any linear property of this surface and to use the result as the predictor for the corresponding linear property of the true surface S(x)
• it is NOT legitimate to do this for prediction of non-linear properties
• for example, the maximum of Ŝ(x) is a very bad predictor for the maximum of S(x) (this problem will be addressed later)
8. Directional Effects

• Environmental conditions can induce directional effects (wind, soil formation, etc).
• As a consequence, the spatial correlation may vary with direction.

[Figure: correlation contours for an isotropic model (left) and two anisotropic models (centre and right).]

• a possible approach: geometric anisotropy
• two more parameters: the anisotropy angle ψ_A and the anisotropy ratio ψ_R
• rotation and stretching of the original coordinates:

  (x_1′, x_2′) = (x_1, x_2) [ cos(ψ_A)  −sin(ψ_A) ] [ 1  0      ]
                            [ sin(ψ_A)   cos(ψ_A) ] [ 0  1/ψ_R  ]
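The coordinate transformation above can be written as a two-line function (a sketch; the angle is in radians and coordinates are rows of an n × 2 array; the grid and parameter values are hypothetical, chosen to mimic one of the panels on the next slide):

```python
import numpy as np

def aniso_to_iso(coords, psi_a, psi_r):
    """Map (x1, x2) to an isotropic space: rotate by the anisotropy angle
    psi_a, then divide the second coordinate by the anisotropy ratio psi_r."""
    rot = np.array([[np.cos(psi_a), -np.sin(psi_a)],
                    [np.sin(psi_a),  np.cos(psi_a)]])
    return coords @ rot @ np.diag([1.0, 1.0 / psi_r])

# regular 6 x 6 grid, anisotropy angle 120 degrees, anisotropy ratio 2
grid = np.array([(i, j) for j in np.linspace(0, 1, 6) for i in np.linspace(0, 1, 6)])
iso = aniso_to_iso(grid, psi_a=np.deg2rad(120), psi_r=2.0)
```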
Geometric Anisotropy Correction

[Figure: regular grids of numbered locations in the original space with (ψ_1, ψ_2) = (0°, 2), (120°, 2) and (45°, 2) (left column), and the corresponding transformed locations in the isotropic space (right column).]
9. Non-Stationary Gaussian Models

Stationarity is a convenient working assumption, which can be relaxed in various ways.

• Functional relationship between mean and variance? Can sometimes be resolved by a transformation of the data.
• Non-constant mean? Replace the constant µ by

  µ(x) = Fβ = Σ_{j=1}^k β_j f_j(x)

  for measured covariates f_j(x) (or non-linear versions).
  Note: sometimes called universal kriging or kriging with external trend.
• Non-stationary random variation?
  – Intrinsic variation: a weaker hypothesis than stationarity (the process has stationary increments, cf. the random walk model in time series), widely used as a default model for discrete spatial variation (Besag, York and Mollié, 1991).
  – Spatial deformation methods (Sampson and Guttorp, 1992) seek to achieve stationarity by transformation of the geographical space, x.
• As always, need to balance the increased flexibility of general modelling assumptions against over-modelling of sparse data, leading to poor identifiability of model parameters.
PART III: PARAMETRIC ESTIMATION

1. Second-Moment Properties
2. Variogram Analysis
3. Likelihood Inference
4. Plug-in Prediction
5. Gaussian Transformed Models
6. A Case Study
7. Anisotropic Models
8. Model Validation
1. Second-Moment Properties

• the variogram of a process Y(x) is the function V(x, x′) = (1/2) Var{Y(x) − Y(x′)}
• for the spatial Gaussian model, with u = ||x − x′||,

  V(u) = τ² + σ²{1 − ρ_0(u)}

• the basic structural parameters of the spatial Gaussian model are:
  – the nugget variance: τ²
  – the sill: σ² = Var{S(x)}
  – the total sill: τ² + σ² = Var{Y(x)}
  – the range: φ, such that ρ_0(u) = ρ(u/φ)
• practical implications:
  – any reasonable version of the (linear) spatial Gaussian model has at least three parameters
  – but you need a lot of data (or contextual knowledge) to justify estimating more than three parameters
• the Matérn family uses a fourth parameter to determine the differentiability of S(x)
Paradigms for parameter estimation

• Ad hoc (variogram-based) methods:
  – compute the empirical variogram
  – fit a theoretical covariance model
• Likelihood-based methods:
  – typically under Gaussian assumptions
  – optimal under the stated assumptions, but computationally expensive and may lack robustness?
• Bayesian implementation, combining estimation with prediction, becoming more widely accepted (amongst statisticians!)
2. Variogram Analysis

• The variogram is defined by V(x, x′) = (1/2) Var{Y(x) − Y(x′)}.
• If Y(x) is stationary,

  V(x, x′) = V(u) = (1/2) E[{Y(x) − Y(x′)}²],

  where u = ||x − x′||.
• This suggests an empirical estimate of V(u):

  V̂(u) = average{[y(x_i) − y(x_j)]² / 2},

  where each average is taken over all pairs [y(x_i), y(x_j)] such that ||x_i − x_j|| ≈ u.
• For a process with non-constant mean (covariates), trend removal can be used as follows:
  – define r_i = Y_i − µ̂(x_i)
  – define V̂(u) = average{(r_i − r_j)² / 2}, where each average is taken over all pairs (r_i, r_j)
(a) The variogram cloud

• define the quantities

  r_i = Y_i − µ̂(x_i),   u_ij = ||x_i − x_j||,   v_ij = (r_i − r_j)² / 2

• the variogram cloud is a scatterplot of the points (u_ij, v_ij)

Example: Swiss rainfall data.

[Figure: variogram cloud for the Swiss rainfall data (semivariance against distance).]

• under the spatial Gaussian model:
  – v_ij ∼ V(u_ij) χ²_1
  – the v_ij are correlated
• the variogram cloud is therefore unstable, both pointwise and in its overall shape
(b) The empirical variogram

• derived from the variogram cloud by averaging within bins: u − h/2 ≤ u_ij < u + h/2
• forms k bins, each averaging over n_k pairs
• removes the first objection to the variogram cloud, but not the second
• is sensitive to mis-specification of µ(x)

Example: Swiss rainfall data.

[Figure: empirical binned variogram for the Swiss rainfall data.]
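A minimal sketch of the binned empirical variogram (assuming a constant mean, so the residuals are r_i = y_i − ȳ; coords is an n × 2 array of locations):

```python
import numpy as np

def binned_variogram(coords, y, n_bins=13, max_dist=None):
    """Empirical variogram: average of v_ij = (r_i - r_j)^2 / 2 within distance bins."""
    r = y - y.mean()                                  # trend removal: constant mean
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.triu_indices(len(y), k=1)
    u, v = d[i, j], 0.5 * (r[i] - r[j]) ** 2          # the variogram cloud
    if max_dist is None:
        max_dist = u.max() / 2.0                      # a common rule of thumb
    edges = np.linspace(0.0, max_dist, n_bins + 1)
    k = np.digitize(u, edges) - 1
    inside = (k >= 0) & (k < n_bins)
    u, v, k = u[inside], v[inside], k[inside]
    n_k = np.bincount(k, minlength=n_bins)
    v_bar = np.bincount(k, weights=v, minlength=n_bins) / np.maximum(n_k, 1)
    centres = 0.5 * (edges[:-1] + edges[1:])          # empty bins get v_bar = 0; check n_k
    return centres, v_bar, n_k
```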
(c) The fitted variogram

Estimate θ̃ to minimise a particular criterion, e.g. the weighted least squares criterion (Cressie, 1993)

S(θ) = Σ_k n_k {[V̄_k − V(u_k; θ)] / V(u_k; θ)}²

where V̄_k is the average of the n_k variogram ordinates v_ij in bin k.

Other criteria: OLS, WLS with weights given by n_k only, ...

• Corresponds to a biased estimating equation for θ, although still widely used in practice.
• Potentially misleading because of inherent correlations amongst successive V̄_k.

Example: Swiss rainfall data.

[Figure: empirical variogram with fits by WLS with weights n_k only (thick line), Cressie's WLS (full line) and OLS (dashed line).]
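Cressie's criterion can be minimised numerically (a sketch using scipy; the exponential variogram V(u) = τ² + σ²(1 − exp(−u/φ)) is used purely as an example model, and the starting values are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

def expo_variogram(u, tau2, sigma2, phi):
    """Example theoretical model: V(u) = tau2 + sigma2 * (1 - exp(-u/phi))."""
    return tau2 + sigma2 * (1.0 - np.exp(-u / phi))

def wls_fit(u_bar, v_bar, n_k, theta0):
    """Minimise S(theta) = sum_k n_k {[V_bar_k - V(u_k; theta)] / V(u_k; theta)}^2."""
    def s_theta(theta):
        vk = expo_variogram(u_bar, *np.abs(theta))    # abs() keeps parameters positive
        return np.sum(n_k * ((v_bar - vk) / vk) ** 2)
    return minimize(s_theta, theta0, method="Nelder-Mead")

# usage with the output of binned_variogram above:
# fit = wls_fit(centres, v_bar, n_k, theta0=[0.1, 1.0, 0.2])
```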
Further comments on empirical variograms

(a) Variograms of raw data and residuals can be very different.

Example: Paraná rainfall data.

[Figure: empirical variograms of raw data (left-hand panel) and of residuals after linear regression on latitude, longitude and altitude (right-hand panel).]

• variogram of raw data includes variation due to large-scale geographical trend
• variogram of residuals eliminates this source of variation
(b) How unstable are empirical variograms?

[Figure: the thick solid line shows the true underlying variogram; fine lines show empirical variograms from three independent simulations of the same model.]

• high autocorrelations amongst V̂(u) for successive values of u impart a misleading smoothness
3. Likelihood Inference

(a) Maximum likelihood estimation (ML)

The Gaussian model is given by Y_i | S ∼ N(S(x_i), τ²), where:

• S(x_i) = µ(x_i) + S_c(x_i)
• S_c(·) is a zero-mean stationary Gaussian process with covariance parameters (σ², φ, κ)
• µ(x_i) = Fβ = Σ_{j=1}^k f_j(x_i) β_j, where the f_j(x_i) are covariates at location x_i

This allows us to write:

Y_i = µ(x_i) + S_c(x_i) + ε_i : i = 1, ..., n.

Then Y ∼ MVN(Fβ, σ²R + τ²I) and the log-likelihood function is

L(β, τ, σ, φ, κ) ∝ −0.5 {log |σ²R + τ²I| + (y − Fβ)′(σ²R + τ²I)^{-1}(y − Fβ)},     (3)

whose maximisation yields the maximum likelihood estimates of the model parameters.
Maximum likelihood estimation (cont.)

Computational details:

• reparametrise ν² = τ²/σ² and denote σ²V = σ²(R + ν²I)
• the log-likelihood function is maximised at

  β̂(V) = (F′V^{-1}F)^{-1} F′V^{-1} y
  σ̂² = n^{-1} (y − Fβ̂)′ V^{-1} (y − Fβ̂)

• then, substituting (β, σ²) by (β̂, σ̂²) in (3), the maximisation reduces to

  L(ν², φ, κ) ∝ −0.5 {n log σ̂² + log |R + ν²I|}

• for the Matérn correlation function we suggest taking κ in a discrete set {0.5, 1, 2, 3, ..., N}
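These computational details amount to a concentrated log-likelihood in (φ, ν²) alone; a sketch (up to an additive constant, with β and σ² profiled out in closed form as above):

```python
import numpy as np

def neg_concentrated_loglik(log_params, y, F, dist, corr):
    """Negative log-likelihood with beta_hat(V) and sigma2_hat substituted into (3).

    log_params = (log phi, log nu2), nu2 = tau2/sigma2, V = R(phi) + nu2 I.
    corr: callable (distance matrix, phi) -> correlation matrix, e.g.
    corr = lambda u, phi: matern_corr(u, phi, kappa=1.0)."""
    phi, nu2 = np.exp(log_params)
    n = len(y)
    V = corr(dist, phi) + nu2 * np.eye(n)
    beta = np.linalg.solve(F.T @ np.linalg.solve(V, F), F.T @ np.linalg.solve(V, y))
    e = y - F @ beta
    sigma2 = e @ np.linalg.solve(V, e) / n             # sigma2_hat
    _, logdetV = np.linalg.slogdet(V)
    return 0.5 * (n * np.log(sigma2) + logdetV)        # minimise over (phi, nu2)
```

A numerical optimiser (e.g. scipy.optimize.minimize) applied to this function then gives the ML estimates of (φ, ν²), with β̂ and σ̂² recovered at the optimum.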
(b) Restricted Maximum Likelihood Estimation (REML)

The REML principle is as follows:

• under the assumed model for E[Y] = Fβ, transform the data linearly to Y* = AY such that the distribution of Y* does not depend on β;
• estimate θ = (ν², σ², φ, κ) by maximum likelihood applied to the transformed data Y*.

We can always find a suitable matrix A without knowing the true values of β or θ, for example

A = I − F(F′F)^{-1}F′.

The REML estimators for θ can be computed by maximising

L*(θ) ∝ −(1/2) {log |σ²V| + log |F′(σ²V)^{-1}F| + (y − Fβ̃)′(σ²V)^{-1}(y − Fβ̃)},

where β̃ = β̂(θ).
Comments on REML

• introduced in the context of variance components estimation in designed experiments (Patterson and Thompson, 1971)
• leads to less biased estimators in small samples
• L*(θ) depends on F, and therefore on a correct specification of the model for µ(x)
• L*(θ) does not depend on the choice of A
• given the model for µ(·), the method enjoys the same objectivity as does maximum likelihood estimation
• widely recommended for geostatistical models
• REML is more sensitive than ML to mis-specification of the model for µ(x)
• for designed experiments the mean µ(x) is usually well specified
• however, in the spatial setting there is no sharp distinction between µ(x) and S_c(x)
Profile likelihood

The variability of the parameter estimators can be inspected by looking at the log-likelihood surface defined by (3). The surface reflects the information contained in the data about the model parameters, but its dimension does not allow direct inspection.

Computing profile likelihoods:

• Suppose a model with parameters (α, ψ) and denote its likelihood by L(α, ψ).
• To inspect the likelihood for α, replace the nuisance parameters ψ by their ML estimators ψ̂(α), for each value of α.
• This gives the profile likelihood:

  L_p(α) = L(α, ψ̂(α)) = max_ψ L(α, ψ).
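A profile log-likelihood for φ, say, then follows by maximising the concentrated likelihood from the previous sketch over the remaining nuisance parameter at each point of a φ grid (a sketch; neg_concentrated_loglik and its arguments are assumed from the earlier block, and the search bounds for log ν² are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def profile_loglik_phi(phi_grid, y, F, dist, corr):
    """L_p(phi) = max over nu2 of the concentrated log-likelihood."""
    profile = []
    for phi in phi_grid:
        res = minimize_scalar(
            lambda log_nu2: neg_concentrated_loglik(
                (np.log(phi), log_nu2), y, F, dist, corr),
            bounds=(-10.0, 3.0), method="bounded")    # nu2 searched in (e^-10, e^3)
        profile.append(-res.fun)
    return np.array(profile)
```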
4. Plug-In Prediction – Kriging

Quite often the interest is to predict:

• the realised value of the process S(·) at a point, or
• the average of S(·) over a region,

  T = |B|^{-1} ∫_B S(x) dx,

  where |B| denotes the area of the region B.

For the Gaussian model we have seen that the minimum MSPE predictor for T = S(x) is

T̂ = µ + σ²r′(τ²I + σ²R)^{-1}(Y − µ1)

with prediction variance

Var(T | Y) = σ² − σ²r′(τ²I + σ²R)^{-1}σ²r,

where the only unknowns are the model parameters. Plug-in prediction consists of replacing the true parameters by their estimates.
Comments

• ML estimates and simple kriging
• REML estimates and ordinary kriging
• The ad hoc prediction:
  (a) Estimate β by OLS, β̃ = (F′F)^{-1}F′Y, and construct residuals Z = Y − Fβ̃.
  (b) Calculate the empirical variogram of Z and use it for model formulation and parameter estimation.
  (c) Re-estimate β by GLS and use the fitted model for prediction.
• the role of empirical variograms:
  – diagnostics (model-based approach)
  – inferential tool (ad hoc approach)
• The conventional approach to kriging (model-based or ad hoc) is to plug in estimated parameter values and proceed as if the estimates were the truth. This approach:
  – usually gives good point predictions when predicting T = S(x)
  – but often under-estimates the prediction variance
  – and can produce poor results when predicting other targets T
5. Gaussian Transformed Models

The Gaussian model may be clearly inappropriate for variables with asymmetric distributions. Some flexibility is gained with an extra parameter λ defining a Box-Cox transformation. Terminology: Gaussian-transformed model.

The model is defined as follows:

• assume a variable Y* ∼ MVN(Fβ, σ²V)
• the data, denoted y = (y_1, ..., y_n), are generated by a transformation of the linear Gaussian model, Y = h_λ^{-1}(Y*), such that

  Y*_i = h_λ(y_i) = (y_i^λ − 1)/λ if λ ≠ 0;  log(y_i) if λ = 0

The log-likelihood is:

ℓ(β, θ, λ) = −(1/2) {log |σ²V| + (h_λ(y) − Fβ)′(σ²V)^{-1}(h_λ(y) − Fβ)} + Σ_{i=1}^n log(y_i^{λ−1})
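A sketch of the two ingredients the transformation adds to the Gaussian log-likelihood: the transformation itself and the Jacobian term.

```python
import numpy as np

def h_lambda(y, lam):
    """Box-Cox transformation; lam = 0 gives the log-Gaussian model."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def boxcox_jacobian_term(y, lam):
    """sum_i log(y_i^(lam - 1)), the extra term in the log-likelihood above."""
    return (lam - 1.0) * np.sum(np.log(y))
```

Maximising ℓ over λ amounts to adding boxcox_jacobian_term(y, lam) to the Gaussian log-likelihood of h_lambda(y, lam) and scanning or optimising over λ.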
Notes:

• λ = 0 corresponds to a particular case: the log-Gaussian model.
• Inference strategies:
  (a) λ as a random parameter (De Oliveira, Kedem and Short, 1997). Prediction averages over a range of models.
  (b) An alternative approach is to estimate λ and then fix it equal to the estimate when performing prediction (Christensen, Diggle and Ribeiro, 2000):
    i.   find the best transformation by maximising the profile likelihood for λ
    ii.  fix the transformation and transform the data
    iii. make inferences on the transformed scale
    iv.  back-transform the results
• Difficulties with negative values and back-transformation.
6. A Case Study: Swiss rainfall data

[Figure: locations of the data points, with point size proportional to the value of the observed data; N-S and E-W distances are in kilometres.]

• 467 locations in Switzerland
• daily rainfall measurements on 8th May 1986
• the data values are integers, where the unit of measurement is 1/10 mm
• 5 locations have a value equal to zero
Swiss rainfall data (cont.)

Estimating the transformation parameter, using the Matérn model for the correlation function:

  κ      λ̂       log L̂
  0.5    0.514    -2464.246
  1      0.508    -2462.413
  2      0.508    -2464.160

Maximum likelihood estimate λ̂ and the corresponding value of the log-likelihood function log L̂ for different values of the Matérn parameter κ.

[Figure: profile likelihoods for λ. Left: κ = 0.5; middle: κ = 1; right: κ = 2. The two horizontal lines correspond to the 90% and 95% quantiles of a (1/2)χ²(1) distribution. Neither the untransformed nor the log-transformed scale is indicated.]
Swiss rainfall data (cont.)

Estimates for the model with λ = 0.5:

  κ      β̂       σ̂²       φ̂       τ̂²     log L̂
  0.5    18.36    118.82    87.97    2.48    -2464.315
  1      20.13    105.06    35.79    6.92    -2462.438
  2      21.36    88.58     17.73    8.72    -2464.185

Maximum likelihood estimates β̂, σ̂², φ̂, τ̂² and the corresponding value of the log-likelihood function log L̂ for different values of the Matérn parameter κ, for λ = 0.5.

[Figure: profile likelihoods for the covariance parameters with κ = 1 and λ = 0.5. Left: σ²; middle: φ; right: τ².]
Swiss rainfall data (cont.)

[Figure: estimated semivariances for square-root transformed data (dot-dashed line), compared with the theoretical semivariogram models with parameters equal to the maximum likelihood estimates: κ = 0.5 (dashed line), κ = 1 (thick solid line), κ = 2 (thin solid line).]
Swiss rainfall data (cont.)

[Figure: maps with predictions (left) and prediction variances (right).]

Prediction of the percentage of the area where Y(x) ≥ 200: the predicted value of Ã_200 is 0.4157.

[Figure: histogram of samples from the predictive distribution of Ã_200.]
7. Estimating Anisotropic Models

Likelihood-based methods:
• two more parameters for the correlation function
• increases the dimension of the numerical minimisation problem
• in practice a lot of data might be needed

Variogram-based methods:
• compute variograms for different directions
• tolerance angles, in particular for irregularly distributed data
• fit a variogram model using the directional variograms

[Figure: directional variograms for the Swiss rainfall data, for directions 0°, 45°, 90°, 135° (left panel) and 30°, 75°, 120°, 165° (right panel), each compared with the omnidirectional variogram.]
8. Model Validation and Comparison: Using validation data

Data are divided into two groups: data for model fitting and data for validation. Frequently in practice data are scarce and too expensive to be left out.

"Leaving-one-out":
• One by one, for each datum:
  (a) remove the datum from the data-set
  (b) (re-estimate model parameters)
  (c) predict at the datum location
• Compare the original data with the predicted values.
100 fitting data (+) and 367 validation data (x)

[Figure: validation results based on 100 fitting and 367 validation locations: observed vs predicted values; P-P plot of observed vs theoretical prediction probabilities; maps of the fitting and validation locations; histograms of the prediction errors (data − predicted) and standardised errors; and scatterplots of errors and standardised errors against the predicted values and against the data.]
All data – "leaving-one-out"

[Figure: leave-one-out validation results for the full data-set, with the same panels as above: observed vs predicted values; P-P plot; maps of locations; histograms of prediction errors and standardised errors; and scatterplots of errors and standardised errors against the predicted values and against the data.]
PART IV: BAYESIAN INFERENCE FOR THE GAUSSIAN MODEL

1. Basic Concepts
2. Bayesian Analysis of the Gaussian Model
3. A Case Study: Swiss Rainfall Data
1. Bayesian Analysis - Basic Concepts

Bayesian inference deals with parameter uncertainty by treating parameters as random variables, and expressing inferences about parameters in terms of their conditional distributions, given all observed data.

For inference about model parameters, the full model specification should now include the model parameters:

[Y, θ] = [θ][Y | θ]

Bayes' Theorem allows us to calculate:

[Y, θ] = [Y | θ][θ] = [Y][θ | Y]

Thus,

[θ | Y] = [Y | θ][θ] / [Y]

is the posterior distribution, where

[Y] = ∫ [Y | θ][θ] dθ.
The Bayesian paradigm:

(a) Model
  • the full model specification consists of [Y, θ] = [Y | θ][θ]
  • formulate a model for the observable variable Y
  • this model defines [Y | θ] (and hence an expression for the log-likelihood ℓ(θ; Y))
(b) Prior
  • before we observe Y, the marginal [θ] expresses our uncertainty about θ
  • call [θ] the prior distribution for θ
(c) Posterior
  • having observed Y, it is no longer an unknown (randomly varying) quantity
  • therefore revise uncertainty about θ by conditioning on the observed value of Y
  • call [θ | Y] the posterior distribution for θ, and use it to make inferential statements

NOTE: the likelihood function occupies a central role in both classical and Bayesian inference.
Example: Swiss rainfall data

[Figure: fitted variograms (100 data), using three different methods of estimation: (a) curve-fitting (thin line); (b) ML (dashed line); (c) posterior mode (thick line).]
Prediction

Because Bayesian inference treats θ as a random variable, it makes no formal distinction between parameter estimation problems and prediction problems, and thereby provides a natural means of allowing for parameter uncertainty in predictive inference.

The general idea for prediction is to formulate a model for

[Y, T, θ] = [Y, T | θ][θ]

and make inferences based on the conditional distribution

[T | Y] = ∫ [T, θ | Y] dθ = ∫ [θ | Y][T | Y, θ] dθ.
Comparing plug-in and Bayesian prediction

• plug-in prediction corresponds to inference about [T | Y, θ̂]
• Bayesian prediction is a weighted average of plug-in predictions, with different plug-in values of θ weighted according to their conditional probabilities given the observed data

Bayesian prediction is usually more cautious than plug-in prediction; in other words:

• allowance for parameter uncertainty usually results in wider prediction intervals

Notes:
(a) Until recently, the need to evaluate the integral which defines [Y] represented a major obstacle to practical application.
(b) Development of Markov chain Monte Carlo (MCMC) methods has transformed the situation.
(c) BUT, for geostatistical problems, reliable implementation of MCMC is not straightforward: geostatistical models don't have a natural Markovian structure for the algorithms to work well.
(d) In particular, for the Gaussian model other algorithms can be implemented.
2. Results for the Gaussian Model

Uncertainty only in the mean parameter

Assume for now that only the mean parameter β is regarded as random, with (conjugate) prior

β ∼ N(m_β; σ²V_β).

The posterior is given by

[β | Y] ∼ N((V_β^{-1} + F′R^{-1}F)^{-1}(V_β^{-1}m_β + F′R^{-1}y); σ²(V_β^{-1} + F′R^{-1}F)^{-1}) ∼ N(β̂; σ²V_β̂).

The predictive distribution is

p(S* | Y, σ², φ) = ∫ p(S* | Y, β, σ², φ) p(β | Y, σ², φ) dβ,

with mean and variance given by

E[S* | Y] = (F_0 − r′V^{-1}F)(V_β^{-1} + F′V^{-1}F)^{-1}V_β^{-1}m_β + [r′V^{-1} + (F_0 − r′V^{-1}F)(V_β^{-1} + F′V^{-1}F)^{-1}F′V^{-1}] Y

Var[S* | Y] = σ² [V_0 − r′V^{-1}r + (F_0 − r′V^{-1}F)(V_β^{-1} + F′V^{-1}F)^{-1}(F_0 − r′V^{-1}F)′].

The predictive variance has three interpretable components: the a priori variance, the reduction due to the data, and the uncertainty in the mean.

V_β → ∞ corresponds to universal (or ordinary) kriging.
Uncertainty for all model parameters

Assume (without loss of generality) a model without measurement error and the prior

p(β, σ², φ) ∝ (1/σ²) p(φ).

The posterior distribution is

p(β, σ², φ | y) = p(β, σ² | y, φ) p(φ | y),

with

pr(φ | y) ∝ pr(φ) |V_β̂|^{1/2} |R_y|^{-1/2} (S²)^{-(n-p)/2}.

Algorithm 1:
(a) Discretise the distribution [φ | y], i.e. choose a range of values for φ which is sensible for the particular application, and assign a discrete uniform prior for φ on a set of values spanning the chosen range.
(b) Compute the posterior probabilities on this discrete support set, defining a discrete posterior distribution with probability mass function p̃r(φ | y), say.
(c) Sample a value of φ from the discrete distribution p̃r(φ | y).
(d) Attach the sampled value of φ to the distribution [β, σ² | y, φ] and sample from this distribution.
(e) Repeat steps (c) and (d) as many times as required; the resulting sample of triplets (β, σ², φ) is a sample from the joint posterior distribution.
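Steps (a)-(c) of Algorithm 1 are a generic "discretise, normalise, resample" device. A sketch, where log_post is any user-supplied function returning the unnormalised log posterior log pr(φ | y) given on this slide:

```python
import numpy as np

def sample_discretised_posterior(phi_grid, log_post, n_draws, seed=0):
    """Discretise [phi | y] on phi_grid and draw from the resulting pmf."""
    lp = np.array([log_post(phi) for phi in phi_grid])
    p = np.exp(lp - lp.max())      # subtract the max for numerical stability
    p /= p.sum()                   # normalise to a probability mass function
    rng = np.random.default_rng(seed)
    return rng.choice(phi_grid, size=n_draws, p=p)

# each sampled phi is then attached to [beta, sigma^2 | y, phi] (step (d))
```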
The predictive distribution is given by:

p(S* | Y) = ∫∫∫ p(S*, β, σ², φ | Y) dβ dσ² dφ
          = ∫ [∫∫ p(s*, β, σ² | y, φ) dβ dσ²] pr(φ | y) dφ
          = ∫ p(S* | Y, φ) p(φ | y) dφ.

To sample from this distribution:

Algorithm 2:
(a) Discretise [φ | Y], as in Algorithm 1.
(b) Compute the posterior probabilities on the discrete support set. Denote the resulting distribution p̃r(φ | y).
(c) Sample a value of φ from p̃r(φ | y).
(d) Attach the sampled value of φ to [s* | y, φ] and sample from it, obtaining realisations of the predictive distribution.
(e) Repeat steps (c) and (d) as many times as required to generate a sample from the required predictive distribution.
Notes:
(a) The algorithms are of the same kind when treating τ and/or κ as unknown parameters.
(b) We specify a discrete prior distribution on a multi-dimensional grid of values.
(c) This implies an extra computational load (but no new principles).
3. A Case Study: Swiss rainfall, 100 data

[Figure: profile log-likelihoods for the covariance parameters σ² and φ (top row); posterior distributions for σ² and φ (middle row); 2-D profile log-likelihood (left) and samples from the posterior distribution (right) for the parameters σ² and φ (bottom row).]
Swiss rainfall: prediction results

[Figure: predicted signal surfaces and associated measures of precision for the rainfall data: (a) posterior mean; (b) posterior variance.]

[Figure: posterior probability contours at levels 0.10, 0.50 and 0.90 for the random set T = {x : S(x) < 150}.]
Swiss rainfall: prediction results (cont.)

[Figure: recording stations and selected prediction locations (1 to 4).]

[Figure: Bayesian predictive distributions for average rainfall at the selected locations.]
Comments

• nugget effect
• other covariates (altitude, temperature, etc)
• data "peculiarities"
• (variogram) estimation: REML vs Bayesian
• priors
• Bayesian implementation algorithm
• spatial-temporal modelling
• vector random fields
• ad hoc vs(?) model-based
PART V: GENERALIZED LINEAR SPATIAL MODELS

1. Generalized linear mixed models
2. Inference for the generalized linear geostatistical model
3. Application of MCMC to generalized linear prediction
4. Case-study: Rongelap Island
5. Case-study: Gambia malaria
Non-Gaussian data

[Figure: example histograms of four data types: positive data, count data, binomial data, and positive data with zeros.]
Towards a model specification

The linear model Y = Fβ + ε can be written as:

  Y_i ∼ N(µ_i, σ²)
  µ_i = Σ_{j=1}^k f_ij β_j

and generalized in two ways:

  Y_i ∼ Q(µ_i, ...)
  h(µ_i) = Σ_{j=1}^k f_ij β_j,

where Q is a distribution in the exponential family and h(·) is a known link function.

• Generalized Linear Models (GLM)
• no longer require: normality, homoscedasticity, a continuous scale
• the linear model is a particular case
Generalized Linear Mixed Models

The classical generalized linear model has:
• Y_i : i = 1, ..., n mutually independent, with µ_i = E[Y_i]
• h(µ_i) = Σ_{j=1}^k f_ij β_j, for a known link function h(·)

The generalized linear mixed model has:
• Y_i : i = 1, ..., n mutually independent, with µ_i = E[Y_i], conditional on realised values of a set of latent random variables U_i
• h(µ_i) = U_i + Σ_{j=1}^k f_ij β_j, for a known link function h(·)

The generalized linear geostatistical model has:
• Y_i : i = 1, ..., n mutually independent, with µ_i = E[Y_i], conditional on realised values of a set of latent random variables U_i
• h(µ_i) = U_i + Σ_{j=1}^p f_ij β_j, for a known link function h(·)
• U_i = S(x_i), where {S(x) : x ∈ ℝ²} is a spatial stochastic process
Examples

Let x_1, ..., x_n be the locations with observations.

Poisson-log:
• [Y(x_i) | S(x_i)] is Poisson with density f(z; µ) = exp(−µ) µ^z / z!, z = 0, 1, 2, ...
• link: E[Y(x_i) | S(x_i)] = µ_i = exp(S(x_i))

Binomial-logit:
• [Y(x_i) | S(x_i)] is binomial with density f(z; µ) = (r choose z) (µ/r)^z (1 − µ/r)^{r−z}, z = 0, 1, ..., r
• link: µ_i = E[Y(x_i) | S(x_i)], S(x_i) = log(µ_i / (r − µ_i))

Likelihood function:

L(θ) = ∫_{ℝⁿ} Π_i f(y_i; h^{-1}(s_i)) f(s | θ) ds_1 ... ds_n

A high-dimensional integral!
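To make the model concrete, here is a sketch that simulates from the Poisson-log case (all parameter values are hypothetical; corr is any correlation function of the distance matrix, e.g. the Matérn function from Part II):

```python
import numpy as np

def simulate_poisson_log_glgm(coords, beta0, sigma2, corr, seed=0):
    """Simulate Y_i | S ~ Poisson(exp(beta0 + S(x_i))), S a zero-mean Gaussian process."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    C = sigma2 * corr(d) + 1e-8 * np.eye(len(coords))   # jitter for a stable Cholesky
    S = np.linalg.cholesky(C) @ rng.standard_normal(len(coords))
    return rng.poisson(np.exp(beta0 + S)), S

coords = np.random.default_rng(2).uniform(0, 1, size=(50, 2))
y, S = simulate_poisson_log_glgm(coords, beta0=1.0, sigma2=0.5,
                                 corr=lambda u: np.exp(-u / 0.2))
```

Simulation is easy; it is the reverse direction, evaluating L(θ) by integrating S out, that produces the high-dimensional integral noted above.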
Inference for the Generalized Linear Geostatistical Model

• likelihood evaluation involves high-dimensional numerical integration
• approximate methods (e.g. Breslow and Clayton, 1993) are of uncertain accuracy
• MCMC is feasible, although not routine
Application of MCMC to Generalized Linear Prediction

• Ingredients:
  – prior distributions for the regression parameters β and covariance parameters θ
  – data: Y = (Y_1, ..., Y_n)
  – S = (S(x_1), ..., S(x_n))
  – S* = all other S(x)
• Conditional independence structure:

[Diagram: conditional independence graph linking θ, S, S* and Y.]

• Use output from the chain to construct posterior probability statements about [T | Y], where T = F(S*).
Case-study: Rongelap Island

This case-study illustrates a model-based geostatistical analysis combining:

• a Poisson log-linear model for the sampling distribution of the observations, conditional on a latent Gaussian process which represents spatial variation in the level of contamination
• Bayesian prediction of non-linear functionals of the latent process
• MCMC implementation

Details are in Diggle, Moyeed and Tawn (1998).
Radiological survey of Rongelap Island

• Rongelap Island:
  – approximately 2500 miles south-west of Hawaii
  – contaminated by nuclear weapons testing during the 1950's
  – evacuated in 1985
  – now safe for re-settlement?
• The statistical problem:
  – field-survey of 137-Cs measurements
  – estimate spatial variation in 137-Cs radioactivity
  – compare with agreed safe limits
Poisson Model for Rongelap Data

• Basic measurements are nett counts Y_i over time-intervals t_i at locations x_i (i = 1, ..., n).
• This suggests the following model:
  – S(x) : x ∈ ℝ² is a stationary Gaussian process (local radioactivity)
  – Y_i | {S(·)} ∼ Poisson(µ_i)
  – µ_i = t_i λ(x_i) = t_i exp{S(x_i)}
• Aims:
  – predict λ(x) over the whole island
  – max λ(x)
  – arg(max λ(x))
Predicted radioactivity surface using log-Gaussian kriging

[Figure: map of the predicted radioactivity surface obtained by log-Gaussian kriging.]
Predicted radioactivity surface using Poisson log-linear model with latent Gaussian process

[Figure: map of the predicted radioactivity surface from the Poisson log-linear model.]

• The two maps above show the difference between:
  – log-Gaussian kriging of observed counts per unit time
  – log-linear analysis of observed counts
• The principal visual difference is in the extent of spatial smoothing of the data, which in turn stems from the different treatments of the nugget variance.
Bayesian prediction of non-linear functionals of the radioactivity surface

The left-hand panel shows the predictive distribution of the maximum radioactivity, contrasting the effects of allowing for (solid line) or ignoring (dotted line) parameter uncertainty; the right-hand panel shows 95% pointwise credible intervals for the proportion of the island over which radioactivity exceeds a given threshold.

[Figure: left, predictive density of the maximum intensity level; right, survivor function of the intensity level with 95% pointwise credible intervals.]
• The two panels of the above diagram illustrate Bayesian prediction of non-linear functionals of the latent Gaussian process in the Poisson log-linear model.
• The left-hand panel contrasts posterior distributions of the maximum radioactivity based on:
  (i) the fully Bayesian analysis, incorporating the effects of parameter uncertainty in addition to uncertainty in the latent process (solid line)
  (ii) fixing the model parameters at their estimated values, i.e. allowing for uncertainty only in the latent process
• The right-hand panel gives posterior estimates with 95% point-wise credible intervals for the proportion of the island over which radioactivity exceeds a given threshold (dotted line).
Case-study: Gambia malaria

• In this example, the spatial variation is of secondary scientific importance.
• The primary scientific interest is to describe how the prevalence of malarial parasites depends on explanatory variables measured:
  – on villages
  – on individual children
• There is a particular scientific interest in whether a vegetation index derived from satellite data is a useful predictor of malaria prevalence, as this would help health workers to decide how to make best use of scarce resources.
Data structure

• 2039 children in 65 villages
• test each child for presence/absence of malaria parasites

Covariate information at child level:
• age (days)
• sex (F/M)
• use of mosquito net (none, untreated, treated)

Covariate information at village level:
• location
• vegetation index, from satellite data
• presence/absence of a public health centre