Nearest Neighbor Gaussian Processes for Large Spatial Data
Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3
July 31, 2017
1 Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland.
2 Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles.
3 Departments of Forestry and Geography, Michigan State University, East Lansing, Michigan.
Low rank Gaussian Predictive Process: Pros
• Proper Gaussian process
• Allows for coherent spatial interpolation at arbitrary resolution
• Can be used as a prior for spatial random effects in any hierarchical setup for spatial data
• Computationally tractable
Low rank Gaussian Predictive Process: Cons
Figure: comparing the true w, the full GP, and the Predictive Process (PP) with 64 knots at 2500 locations
• Low rank models like the Predictive Process often tend to oversmooth
• Increasing the number of knots can fix this but leads to heavy computation
Sparse matrices
• Idea: use a sparse matrix instead of a low rank matrix to approximate the dense full GP covariance matrix
• Goals:
  • Scalability, both in terms of storage and of computing inverses and determinants
  • Closely approximate full GP inference
  • A proper Gaussian process model, like the Predictive Process
Cholesky factors
• Write the joint density p(w) = p(w_1, w_2, ..., w_n) as
  p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ··· p(w_n | w_1, w_2, ..., w_{n−1})
• For a Gaussian distribution w ~ N(0, C) this implies
  w_1 = 0 + η_1
  w_2 = a_21 w_1 + η_2
  ···
  w_n = a_n1 w_1 + a_n2 w_2 + ··· + a_{n,n−1} w_{n−1} + η_n
  i.e., w = A w + η, where A is strictly lower triangular with entries a_ij and η ~ N(0, D), D = diag(d_1, d_2, ..., d_n)
• Cholesky factorization: C^{−1} = (I − A)′ D^{−1} (I − A)
Cholesky factors
• w_{<i} = (w_1, w_2, ..., w_{i−1})′
• c_i = Cov(w_i, w_{<i}), C_i = Var(w_{<i})
• The i-th row of A and d_i = Var(η_i) are obtained from p(w_i | w_{<i}) as follows:
  • Solve for the a_ij's from Σ_{j=1}^{i−1} a_ij w_j = E(w_i | w_{<i}) = c_i′ C_i^{−1} w_{<i}
  • d_i = Var(w_i | w_{<i}) = σ² − c_i′ C_i^{−1} c_i
• For large i, inverting C_i becomes slow
• The Cholesky factor approach for the full GP covariance matrix C does not offer any computational benefits
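The two bullets above translate directly into code. Below is a minimal NumPy sketch (the 1-D locations, the exponential covariance, and all function names are illustrative assumptions, not part of the slides): it builds A and D row by row from the conditionals p(w_i | w_{<i}) and checks that C^{−1} = (I − A)′ D^{−1} (I − A).

```python
import numpy as np

def dense_cholesky_factor(C):
    """Build strictly lower-triangular A and diagonal D with
    C^{-1} = (I - A)' D^{-1} (I - A), row by row from p(w_i | w_{<i})."""
    n = C.shape[0]
    A = np.zeros((n, n))
    d = np.empty(n)
    d[0] = C[0, 0]                       # Var(w_1): nothing to condition on
    for i in range(1, n):
        Ci = C[:i, :i]                   # Var(w_{<i})
        ci = C[i, :i]                    # Cov(w_i, w_{<i})
        a = np.linalg.solve(Ci, ci)      # a_{i1}, ..., a_{i,i-1}
        A[i, :i] = a
        d[i] = C[i, i] - ci @ a          # Var(w_i | w_{<i})
    return A, np.diag(d)

# illustrative check on a small exponential covariance matrix
rng = np.random.default_rng(0)
s = np.sort(rng.uniform(0.0, 1.0, 6))
C = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.3)
A, D = dense_cholesky_factor(C)
I = np.eye(6)
print(np.max(np.abs(np.linalg.inv(C) - (I - A).T @ np.linalg.inv(D) @ (I - A))))
```

Note that the solve at step i involves the full (i − 1) × (i − 1) matrix C_i, which is exactly the bottleneck the sparse construction below removes by restricting the conditioning set.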
Cholesky Factors and Directed Acyclic Graphs (DAGs)
Figure: example DAG on seven nodes
• The number of non-zero entries (sparsity) of A equals the number of arrows in the graph
• In particular, the sparsity of the i-th row of A is the same as the number of arrows pointing to i in the DAG
Introducing sparsity via graphical models
Figure: example DAG on seven nodes
p(y_1) p(y_2 | y_1) p(y_3 | y_1, y_2) p(y_4 | y_1, y_2, y_3)
× p(y_5 | y_1, y_2, y_3, y_4) p(y_6 | y_1, y_2, ..., y_5) p(y_7 | y_1, y_2, ..., y_6)
Introducing sparsity via graphical models
Figure: sparse DAG on seven nodes
Striking out variables from the conditioning sets gives
p(y_1) p(y_2 | y_1) p(y_3 | y_1, y_2) p(y_4 | y_1, y_2, y_3)
× p(y_5 | y_2, y_3, y_4) p(y_6 | y_1, y_4, y_5) p(y_7 | y_1, y_2, y_6)
Introducing sparsity via graphical models
Figure: sparse DAG on seven nodes
• Create a sparse DAG by keeping at most m arrows pointing to each node
• Set a_ij = 0 for every pair i, j with no arrow from j to i
• Fixing a_ij = 0 introduces conditional independence, and w_j drops out of the conditioning set in p(w_i | {w_k : k < i})
Introducing sparsity via graphical models
Figure: sparse DAG on seven nodes
• Let N(i) denote the neighbor set of i, i.e., the set of nodes from which there are arrows to i
• a_ij = 0 for j ∉ N(i), and the nonzero a_ij's are obtained by solving
  E[w_i | w_{N(i)}] = Σ_{j ∈ N(i)} a_ij w_j
• The above linear system is only m × m
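A sketch of that neighbor-restricted solve, assuming a user-supplied covariance function `cov(S1, S2)` that returns the cross-covariance matrix between two sets of locations (the function and argument names are illustrative):

```python
import numpy as np

def nngp_row(s_i, s_neighbors, cov):
    """Nonzero a_{ij} (j in N(i)) and conditional variance d_i for one
    location; only an m x m system is solved, regardless of n."""
    C_N = cov(s_neighbors, s_neighbors)         # m x m covariance of the neighbors
    c_iN = cov(s_i[None, :], s_neighbors)[0]    # Cov(w(s_i), w over N(i))
    a = np.linalg.solve(C_N, c_iN)
    d_i = cov(s_i[None, :], s_i[None, :])[0, 0] - c_iN @ a
    return a, d_i
```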
Choosing neighbor sets
Matern covariance function:
C(s_i, s_j) = (2^{ν−1} Γ(ν))^{−1} (||s_i − s_j|| φ)^ν K_ν(||s_i − s_j|| φ);  φ > 0, ν > 0
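A sketch of this covariance in code, using SciPy's modified Bessel function of the second kind; the σ² scale argument is an addition for later reuse and defaults to 1 to match the slide's form:

```python
import numpy as np
from scipy.special import kv, gamma

def matern(h, phi, nu, sigma2=1.0):
    """Matern covariance as a function of distance h:
    sigma2 * (2^{nu-1} Gamma(nu))^{-1} (h * phi)^nu K_nu(h * phi)."""
    h = np.asarray(h, dtype=float)
    out = np.full(h.shape, sigma2)      # limiting value as h -> 0
    pos = h > 0
    u = h[pos] * phi
    out[pos] = sigma2 * (u ** nu) * kv(nu, u) / (2.0 ** (nu - 1) * gamma(nu))
    return out
```

Setting nu = 0.5 recovers the exponential covariance sigma2 * exp(−phi h).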
Choosing neighbor sets
• Spatial covariance functions decay with distance
• Vecchia (1988): N(s_i) = the m nearest neighbors of s_i among s_1, s_2, ..., s_{i−1}
• The nearest points have the highest correlations
• Theory: "screening effect" – Stein, 2002
• We use Vecchia's choice of m nearest neighbors
• Other choices proposed in Stein et al. (2004), Gramacy and Apley (2015), and Guinness (2016) can also be used
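A brute-force sketch of Vecchia's neighbor selection over an ordered set of locations (O(n²) as written; k-d trees or similar structures are typically used in practice, and the function name is an illustrative assumption):

```python
import numpy as np

def vecchia_neighbors(coords, m):
    """For each s_i, indices of its m nearest neighbors among s_1, ..., s_{i-1}."""
    n = coords.shape[0]
    neighbors = [np.array([], dtype=int)]       # s_1 has no earlier neighbors
    for i in range(1, n):
        dists = np.linalg.norm(coords[:i] - coords[i], axis=1)
        neighbors.append(np.argsort(dists)[:min(m, i)])
    return neighbors
```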
Nearest neighbors
Figure: illustration of nearest neighbor sets
Sparse precision matrices
• The neighbor sets and the covariance function C(·,·) define a sparse Cholesky factor A and a diagonal matrix D:
  C̃^{−1} = (I − A)⊤ D^{−1} (I − A)
• N(w | 0, C) ≈ N(w | 0, C̃)
• det(C̃) = Π_{i=1}^{n} d_i
• C̃^{−1} is sparse with O(nm²) entries
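Putting the earlier sketches together, one possible assembly of the sparse precision matrix and its log-determinant (reusing the hypothetical `vecchia_neighbors`, `nngp_row`, and `cov` from above; scipy.sparse stores the O(nm²) nonzeros):

```python
import numpy as np
import scipy.sparse as sp

def nngp_precision(coords, m, cov):
    """Sparse C_tilde^{-1} = (I - A)' D^{-1} (I - A) and log det(C_tilde) = sum_i log d_i."""
    n = coords.shape[0]
    neighbors = vecchia_neighbors(coords, m)           # sketched earlier
    A = sp.lil_matrix((n, n))
    d = np.empty(n)
    d[0] = cov(coords[:1], coords[:1])[0, 0]           # first location has no neighbors
    for i in range(1, n):
        N = neighbors[i]
        a, d[i] = nngp_row(coords[i], coords[N], cov)  # one m x m solve per location
        A[i, N] = a
    IA = sp.identity(n) - A.tocsr()
    Q = IA.T @ sp.diags(1.0 / d) @ IA                  # sparse precision, O(n m^2) entries
    return Q, np.sum(np.log(d))                        # log det(C_tilde)
```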
Extension to a Process
• We have defined w ~ N(0, C̃) over the set of data locations S = {s_1, s_2, ..., s_n}
• For s ∉ S, define N(s) as the set of m nearest neighbors of s in S
• Define w(s) = Σ_{i: s_i ∈ N(s)} a_i(s) w(s_i) + η(s), where η(s) ~ N(0, d(s)) independently
• a_i(s) and d(s) are once again obtained by solving an m × m system
• This yields a well-defined GP over the entire domain
• Nearest Neighbor GP (NNGP) – Datta et al., JASA (2016)
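A sketch of interpolation at a new location s under this definition, again reusing the hypothetical `nngp_row` and `cov` helpers from the earlier sketches:

```python
import numpy as np

def nngp_predict(s0, coords, w, m, cov, rng=None):
    """w(s0) = sum_{i: s_i in N(s0)} a_i(s0) w(s_i) + eta(s0), eta(s0) ~ N(0, d(s0)).
    Returns the conditional mean and variance, or a draw if rng is given."""
    dists = np.linalg.norm(coords - s0, axis=1)
    N = np.argsort(dists)[:m]                    # m nearest neighbors of s0 in S
    a, d0 = nngp_row(s0, coords[N], cov)         # m x m solve
    mean = a @ w[N]
    if rng is None:
        return mean, d0
    return mean + np.sqrt(d0) * rng.normal()
```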
Hierarchical spatial regression with NNGP
Spatial linear model: y(s) = x(s)′β + w(s) + ε(s)
• w(s) is modeled as an NNGP derived from a GP(0, C(·,· | σ², φ))
• ε(s) ~ iid N(0, τ²) contributes the nugget
• Priors for the parameters β, σ², τ² and φ
• The only difference from a full GP model is the NNGP prior on w(s)
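To make the role of the NNGP prior concrete, here is a minimal sketch of the un-normalized log joint density of (y, w) for this model, with the parameter priors omitted and an exponential (Matern ν = 0.5) covariance assumed; it chains together the hypothetical helpers from the previous sketches and is not the authors' implementation:

```python
import numpy as np
from scipy.spatial.distance import cdist

def log_joint(y, X, w, beta, sigma2, tau2, phi, coords, m):
    """log N(y | X beta + w, tau2 I) + log N(w | 0, C_tilde), up to additive constants."""
    cov = lambda S1, S2: matern(cdist(S1, S2), phi, nu=0.5, sigma2=sigma2)
    Q, logdet = nngp_precision(coords, m, cov)         # sparse NNGP precision of w
    resid = y - X @ beta - w
    loglik = -0.5 * (resid @ resid) / tau2 - 0.5 * len(y) * np.log(tau2)
    logprior_w = -0.5 * (w @ (Q @ w)) - 0.5 * logdet   # NNGP prior on w
    return loglik + logprior_w
```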