Lecture 22 - Computational Methods for GPs
Colin Rundel
04/12/2017
GPs and Computational Complexity
The problem with GPs

Unless you are lucky (or clever), Gaussian process models are difficult to scale to large problems. For a Gaussian process y ∼ N(µ, Σ):

Want to sample y?
  µ + chol(Σ) × Z with Z_i ∼ N(0, 1)                              O(n^3)

Evaluate the (log) likelihood?
  -1/2 (y - µ)' Σ^{-1} (y - µ) - 1/2 log|Σ| - n/2 log 2π           O(n^3)

Update covariance parameter?
  {Σ}_ij = σ^2 exp(-{d}_ij ϕ) + σ^2_n 1_{i=j}                      O(n^2)
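As a concrete illustration of where these costs come from, here is a minimal numpy sketch (not from the slides; the locations, parameter values, and variable names are made up for this example) of the three operations:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
n = 1000
coords = rng.uniform(0, 1, size=(n, 2))   # made-up spatial locations
mu = np.zeros(n)

# Update covariance parameter: build Sigma from pairwise distances, O(n^2)
sigma2, phi, sigma2_n = 1.0, 3.0, 0.1
d = cdist(coords, coords)
Sigma = sigma2 * np.exp(-d * phi) + sigma2_n * np.eye(n)

# Sample y: mu + chol(Sigma) x Z, dominated by the O(n^3) Cholesky factorization
L = np.linalg.cholesky(Sigma)
y = mu + L @ rng.standard_normal(n)

# Evaluate the log likelihood, also O(n^3), reusing the Cholesky factor
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu))   # Sigma^{-1} (y - mu)
logdet = 2.0 * np.sum(np.log(np.diag(L)))                  # log |Sigma|
loglik = -0.5 * (y - mu) @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)
print(loglik)
```

Note that the same Cholesky factor serves both sampling and the log likelihood, so the O(n^3) factorization is paid once per covariance update.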
A simple guide to computational complexity

O(n)   - Linear complexity    - Go for it
O(n^2) - Quadratic complexity - Pray
O(n^3) - Cubic complexity     - Give up
How bad is the problem?

[Figure: time (secs) to invert an n × n matrix via Cholesky, LU, and QR factorizations, for n up to 10,000]
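To get a feel for these numbers on your own hardware, a small CPU-only benchmarking sketch (written for these notes, not taken from the lecture code) might look like:

```python
import time
import numpy as np
import scipy.linalg as sla

def bench(n, rng):
    """Time three ways of inverting an n x n SPD matrix (one rep each)."""
    X = rng.standard_normal((n, n))
    A = X @ X.T + n * np.eye(n)          # symmetric positive definite
    I = np.eye(n)
    out = {}

    t0 = time.perf_counter()
    sla.cho_solve(sla.cho_factor(A), I)  # Cholesky-based inverse
    out["chol inv"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    sla.lu_solve(sla.lu_factor(A), I)    # LU-based inverse
    out["LU inv"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    Q, R = sla.qr(A)
    sla.solve_triangular(R, Q.T)         # QR-based inverse
    out["QR inv"] = time.perf_counter() - t0
    return out

rng = np.random.default_rng(1)
for n in (2500, 5000):
    print(n, bench(n, rng))
```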
Practice - Migratory Model Prediction

After fitting the GP we need to sample from the posterior predictive distribution at ∼3000 locations:

y_p ∼ N( µ_p + Σ_po Σ_o^{-1} (y_o - µ_o),  Σ_p - Σ_po Σ_o^{-1} Σ_op )

Step                                         CPU (secs)   CPU+GPU (secs)   Rel. Performance
1. Calc. Σ_p, Σ_po                           1.080        0.046            23.0
2. Calc. chol(Σ_p - Σ_po Σ_o^{-1} Σ_op)      0.467        0.208            2.3
3. Calc. µ_{p|o} + chol(Σ_{p|o}) × Z         0.049        0.052            0.9
4. Calc. Allele Prob                         0.129        0.127            1.0
Total                                        1.732        0.465            3.7

Total run time: CPU (28.9 min), CPU+GPU (7.8 min)
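The timings above come from a GPU-accelerated implementation; purely to illustrate the linear algebra of steps 1-3, here is a numpy sketch (the covariance function, locations, and data are all made up, and the allele-probability step is model specific and omitted):

```python
import numpy as np
from scipy.spatial.distance import cdist

def cov(d, sigma2=1.0, phi=3.0):
    # assumed exponential covariance, for illustration only
    return sigma2 * np.exp(-d * phi)

rng = np.random.default_rng(1)
xo = rng.uniform(0, 1, (500, 2))     # observed locations
xp = rng.uniform(0, 1, (3000, 2))    # prediction locations
yo = rng.standard_normal(500)        # stand-in for the observed data
mu_o, mu_p = np.zeros(500), np.zeros(3000)

Sigma_o = cov(cdist(xo, xo)) + 0.1 * np.eye(500)

# Step 1: Calc. Sigma_p, Sigma_po
Sigma_p  = cov(cdist(xp, xp))
Sigma_po = cov(cdist(xp, xo))

# Step 2: Calc. chol(Sigma_p - Sigma_po Sigma_o^{-1} Sigma_op)
W = np.linalg.solve(Sigma_o, Sigma_po.T)                   # Sigma_o^{-1} Sigma_op
Sigma_cond = Sigma_p - Sigma_po @ W
L = np.linalg.cholesky(Sigma_cond + 1e-8 * np.eye(3000))   # jitter for stability

# Step 3: Calc. mu_{p|o} + chol(Sigma_{p|o}) x Z
mu_cond = mu_p + Sigma_po @ np.linalg.solve(Sigma_o, yo - mu_o)
y_p = mu_cond + L @ rng.standard_normal(3000)
```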
Cholesky CPU vs GPU (P100)

[Figure: time (secs) for chol, LU, and QR inversion on CPU vs GPU (P100), for n up to 10,000]
[Figure: time (secs, log scale) for chol, LU, and QR inversion on CPU vs GPU, for n up to 10,000]
Relative Performance

[Figure: relative performance (CPU vs GPU) of chol, LU, and QR inversion, for n up to 10,000]
Aside - Matrix Multiplication

[Figure: matrix multiplication time (sec) on CPU vs GPU, for n up to 10,000]
An even bigger hammer

bigGP is an R package written by Chris Paciorek (UC Berkeley), et al.

• Specialized distributed implementation of linear algebra operations for GPs
• Uses both shared and distributed memory
• Designed to run on large supercomputer clusters
• Able to fit models on the order of n = 65k (32 GB covariance matrix)

[Figure: Cholesky decomposition execution time (seconds, log scale) vs matrix dimension n (log scale), using 6, 60, 816, 12,480, and 49,920 cores]
More scalable solutions?

• Spectral domain / basis functions
• Covariance tapering
• GMRF approximations
• Low-rank approximations
• Nearest-neighbor models
Low Rank Approximations
Low rank approximations in general

Let's look at the example of the singular value decomposition of a matrix,

M_{n×m} = U_{n×n} diag(S)_{n×m} V^t_{m×m}

where U are called the left singular vectors, V the right singular vectors, and S the singular values. Usually the singular values and vectors are ordered such that the singular values are in descending order.

It turns out (Eckart–Young theorem) that we can approximate M as having rank k by defining S̃ to have only the k largest singular values (the others set to zero). Then

M̃_{n×m} = U_{n×n} diag(S̃)_{n×m} V^t_{m×m} = Ũ_{n×k} diag(S̃)_{k×k} Ṽ^t_{k×m}
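A minimal numpy sketch of this truncation (the matrix and rank here are made up for illustration):

```python
import numpy as np

def low_rank(M, k):
    """Best rank-k approximation of M via the SVD (Eckart-Young)."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

rng = np.random.default_rng(1)
M = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 40))  # true rank 5
M_tilde = low_rank(M, 5)
print(np.allclose(M, M_tilde))   # True: rank-5 truncation recovers M exactly
```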
Example

M = [ 1.000  0.500  0.333  0.250
      0.500  0.333  0.250  0.200     = U diag(S) V^t
      0.333  0.250  0.200  0.167
      0.250  0.200  0.167  0.143 ]

U = V = [ -0.79   0.58  -0.18  -0.03
          -0.45  -0.37   0.74   0.33
          -0.32  -0.51  -0.10  -0.79
          -0.25  -0.51  -0.64   0.51 ]

S = (1.50, 0.17, 0.01, 0.00)

Rank 2 approximation (keeping only the two largest singular values):

Ũ = [ -0.79   0.58      S̃ = (1.50, 0.17)     Ṽ^t = [ -0.79  -0.45  -0.32  -0.25
      -0.45  -0.37                                      0.58  -0.37  -0.51  -0.51 ]
      -0.32  -0.51
      -0.25  -0.51 ]

M̃ = Ũ diag(S̃) Ṽ^t = [ 1.000  0.501  0.333  0.249
                         0.501  0.330  0.251  0.203
                         0.333  0.251  0.200  0.166
                         0.249  0.203  0.166  0.140 ]
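This example can be reproduced in a few lines; the matrix M above is the 4 × 4 Hilbert matrix, so a numpy/scipy sketch (not part of the original slides) is:

```python
import numpy as np
from scipy.linalg import hilbert

M = hilbert(4)                        # the matrix M in the example above
U, S, Vt = np.linalg.svd(M)
print(np.round(S, 2))                 # [1.5  0.17 0.01 0.  ]

k = 2
M_tilde = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(np.round(M_tilde, 3))           # matches the rank 2 approximation above
```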
Approximation Error

We can measure the error of the approximation using the Frobenius norm,

‖M - M̃‖_F = ( Σ_{i=1}^{n} Σ_{j=1}^{m} (M_ij - M̃_ij)^2 )^{1/2}

Strong dependence (large eff. range):

[Figure: singular values of the covariance matrix and the Frobenius-norm error of its rank-k SVD approximation, for ranks 0 to 50]
Weak dependence (short eff. range):

[Figure: singular values of the covariance matrix and the Frobenius-norm error of its rank-k SVD approximation, for ranks 0 to 50]
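A sketch that reproduces this qualitative pattern (the covariance function, locations, and range parameters are all made up for illustration): with strong dependence the singular values decay quickly and a small rank already gives a small Frobenius error, while with weak dependence many more singular values are needed.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, (50, 2))
d = cdist(x, x)

def frob_error_by_rank(Sigma):
    """Singular values and Frobenius-norm error of each rank-k truncation."""
    U, S, Vt = np.linalg.svd(Sigma)
    errors = []
    for k in range(Sigma.shape[0] + 1):
        Sigma_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
        errors.append(np.linalg.norm(Sigma - Sigma_k, "fro"))
    return S, np.array(errors)

# small phi = strong dependence (large eff. range), large phi = weak dependence
for phi in (1.0, 25.0):
    S, err = frob_error_by_rank(np.exp(-phi * d))
    print(f"phi = {phi}: rank needed for 1% Frobenius error =",
          int(np.argmax(err < 0.01 * err[0])))
```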
How does this help? (Sherman-Morrison-Woodbury)

There is an immensely useful linear algebra identity, the Sherman-Morrison-Woodbury formula, for the inverse (and determinant) of a decomposed matrix,

M̃^{-1} = ( A_{n×n} + U_{n×k} S_{k×k} V^t_{k×n} )^{-1}
        = A^{-1} - A^{-1} U ( S^{-1} + V^t A^{-1} U )^{-1} V^t A^{-1}

• Imagine that A = diag(A); then it is trivial to find A^{-1}.
• S^{-1} is k × k, which is hopefully small, or even better S = diag(S).
• V^t A^{-1} U is k × k, which is hopefully small.
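A small numpy sketch verifying the identity in exactly this setting (diagonal A, tall skinny U, made-up sizes and values); only diagonal inverses and k × k systems are ever formed:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1000, 20
A = np.diag(rng.uniform(1, 2, n))             # diagonal, so A^{-1} is trivial
U = rng.standard_normal((n, k))
S = np.diag(rng.uniform(1, 2, k))
V = U                                         # symmetric case for simplicity

M = A + U @ S @ V.T

# Woodbury: A^{-1} - A^{-1} U (S^{-1} + V^t A^{-1} U)^{-1} V^t A^{-1}
A_inv_diag = 1.0 / np.diag(A)
AiU = A_inv_diag[:, None] * U                 # A^{-1} U, costs O(nk)
small = np.linalg.inv(S) + V.T @ AiU          # k x k
M_inv = np.diag(A_inv_diag) - AiU @ np.linalg.solve(small, AiU.T)

print(np.allclose(M_inv, np.linalg.inv(M)))   # True
```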