Lecture 21 - Computational Methods for GPs
Colin Rundel
04/10/2017
GPs and Computational Complexity
The problem with GPs

Unless you are lucky (or clever), Gaussian process models are difficult to scale to large problems. For a Gaussian process $y \sim \mathcal{N}(\mu, \Sigma)$ the core operations are (a short R sketch of each follows this list):

- Want to sample $y$?
  $\mu + \text{Chol}(\Sigma) \times Z$ with $Z_i \sim \mathcal{N}(0, 1)$ - $\mathcal{O}(n^3)$
- Evaluate the (log) likelihood?
  $-\frac{1}{2}\log|\Sigma| - \frac{1}{2}(y - \mu)'\, \Sigma^{-1} (y - \mu) - \frac{n}{2}\log 2\pi$ - $\mathcal{O}(n^3)$
- Update covariance parameter?
  $\{\Sigma\}_{ij} = \sigma^2 \exp(-\{d\}_{ij}\,\phi) + \sigma^2_n \, 1_{i=j}$ - $\mathcal{O}(n^2)$
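As an illustration, here is a minimal R sketch of these three operations. The exponential covariance and the parameter values (`sigma2`, `phi`, `sigma2_n`) are stand-ins chosen for the example, not values from the lecture; the point is to show where the $\mathcal{O}(n^2)$ covariance construction and the $\mathcal{O}(n^3)$ Cholesky / solve steps enter.

```r
set.seed(1)

n  <- 500
d  <- as.matrix(dist(matrix(runif(n), ncol = 1)))  # pairwise distances for n locations
mu <- rep(0, n)

# Assumed covariance parameters (illustrative only)
sigma2 <- 1; phi <- 3; sigma2_n <- 0.1

# Update covariance parameter: filling the matrix is O(n^2)
Sigma <- sigma2 * exp(-d * phi) + sigma2_n * diag(n)

# Sample y: the Cholesky factorization is the O(n^3) step
L <- chol(Sigma)                  # upper triangular, Sigma = t(L) %*% L
y <- mu + t(L) %*% rnorm(n)

# Evaluate the log likelihood: determinant and solve are again O(n^3)
log_lik <- -0.5 * determinant(Sigma, logarithm = TRUE)$modulus -
  0.5 * t(y - mu) %*% solve(Sigma, y - mu) -
  (n / 2) * log(2 * pi)
```

In practice the Cholesky factor would be reused for both the determinant and the quadratic form rather than calling `determinant()` and `solve()` separately, but the cubic cost remains.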
A simple guide to computational complexity

- $\mathcal{O}(n)$ - Linear complexity - Go for it
- $\mathcal{O}(n^2)$ - Quadratic complexity - Pray
- $\mathcal{O}(n^3)$ - Cubic complexity - Give up
How bad is the problem?

[Figure: elapsed time (secs) to invert an $n \times n$ matrix via Cholesky, LU, and QR based methods, for $n$ up to 10,000.]
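A minimal R sketch of this kind of benchmark, timing Cholesky-, LU-, and QR-based inversion of a random positive definite matrix. The matrix sizes and the use of `system.time()` are my own choices for illustration, not necessarily how the figure was produced.

```r
set.seed(1)

inv_times <- function(n) {
  X <- matrix(rnorm(n * n), n, n)
  S <- crossprod(X) / n + diag(n)   # a well-conditioned positive definite matrix

  c(chol = system.time(chol2inv(chol(S)))["elapsed"],  # Cholesky-based inverse
    LU   = system.time(solve(S))["elapsed"],           # solve() uses an LU factorization
    QR   = system.time(qr.solve(S))["elapsed"])        # QR-based inverse
}

sapply(c(1000, 2000, 4000), inv_times)
```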
Practice - Migratory Model Prediction

After fitting the GP we need to sample from the posterior predictive distribution at $\sim 3000$ locations,

$$y_p \sim \mathcal{N}\left(\mu_p + \Sigma_{po} \Sigma_o^{-1}(y_o - \mu_o),\; \Sigma_p - \Sigma_{po} \Sigma_o^{-1} \Sigma_{op}\right)$$

The steps for a single draw (sketched in R after the table) break down as follows:

| Step | CPU (secs) |
|------|-----------:|
| 1. Calc. $\Sigma_p$, $\Sigma_{po}$, $\Sigma_o$ | 1.080 |
| 2. Calc. $\text{chol}(\Sigma_p - \Sigma_{po} \Sigma_o^{-1} \Sigma_{op})$ | 0.467 |
| 3. Calc. $\mu_{p \mid o} + \text{chol}(\Sigma_{p \mid o}) \times Z$ | 0.049 |
| 4. Calc. Allele Prob | 0.129 |
| Total | 1.732 |

Total run time for 1000 posterior predictive draws:

• CPU (28.9 min)
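A minimal, self-contained R sketch of steps 1-3 for one posterior predictive draw. The 1-d locations, exponential covariance, and all parameter values below are synthetic stand-ins for the migratory model (which is not reproduced here); only the linear algebra pattern matches the steps above.

```r
set.seed(1)

# Synthetic stand-in for the fitted model: an assumed exponential covariance
cov_fn <- function(a, b, sigma2 = 1, phi = 3) {
  sigma2 * exp(-phi * abs(outer(a, b, "-")))
}

loc_obs  <- sort(runif(200))
loc_pred <- seq(0, 1, length.out = 100)
y_o  <- sin(2 * pi * loc_obs) + rnorm(length(loc_obs), sd = 0.1)
mu_o <- rep(0, length(loc_obs))
mu_p <- rep(0, length(loc_pred))

# Step 1: covariance blocks (obs-obs, pred-obs, pred-pred)
Sigma_o  <- cov_fn(loc_obs,  loc_obs) + 0.1 * diag(length(loc_obs))  # nugget keeps it PD
Sigma_po <- cov_fn(loc_pred, loc_obs)
Sigma_p  <- cov_fn(loc_pred, loc_pred)

# Conditional (posterior predictive) mean and covariance
mu_cond    <- mu_p + Sigma_po %*% solve(Sigma_o, y_o - mu_o)
Sigma_cond <- Sigma_p - Sigma_po %*% solve(Sigma_o, t(Sigma_po))

# Step 2: Cholesky of the conditional covariance (jitter added for numerical stability)
L <- chol(Sigma_cond + 1e-8 * diag(length(loc_pred)))

# Step 3: one posterior predictive draw
y_p <- mu_cond + t(L) %*% rnorm(length(loc_pred))
```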
A bigger hammer?

| Step | CPU (secs) | CPU+GPU (secs) | Rel. Perf |
|------|-----------:|---------------:|----------:|
| 1. Calc. $\Sigma_p$, $\Sigma_{po}$, $\Sigma_o$ | 1.080 | 0.046 | 23.0 |
| 2. Calc. $\text{chol}(\Sigma_p - \Sigma_{po} \Sigma_o^{-1} \Sigma_{op})$ | 0.467 | 0.208 | 2.3 |
| 3. Calc. $\mu_{p \mid o} + \text{chol}(\Sigma_{p \mid o}) \times Z$ | 0.049 | 0.052 | 0.9 |
| 4. Calc. Allele Prob | 0.129 | 0.127 | 1.0 |
| Total | 1.732 | 0.465 | 3.7 |

Total run time for 1000 posterior predictive draws:

• CPU (28.9 min)
• CPU+GPU (7.8 min)
Cholesky CPU vs GPU (P100)

[Figure: elapsed time (secs) for Cholesky, LU, and QR based inversion, comparing CPU and GPU, for $n$ up to 10,000.]
[Figure: the same CPU vs GPU comparison shown on a log time scale (roughly 0.1 to 10 secs).]
Relative Performance

[Figure: relative CPU vs GPU performance of Cholesky, LU, and QR based inversion (log scale) for $n$ up to 10,000.]
Aside (1) - Matrix Multiplication

[Figure: matrix multiplication time (sec) on CPU vs GPU for $n$ up to 10,000.]
Matrix Multiplication - Relative Performance

[Figure: relative CPU vs GPU performance for matrix multiplication (roughly 20x to 45x) for $n$ up to 10,000.]
Aside (2) - Memory Limitations

A general covariance is a dense $n \times n$ matrix, meaning it will require $n^2 \times 64$ bits to store.

[Figure: covariance matrix size (GB) as a function of $n$, up to $n = 50{,}000$ (roughly 20 GB).]
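A quick back-of-the-envelope check in R (the particular values of $n$ are just examples):

```r
# Size in GB of a dense n x n covariance matrix of doubles (8 bytes = 64 bits each)
cov_size_gb <- function(n) n^2 * 8 / 1e9

cov_size_gb(c(10000, 30000, 50000))
# [1]  0.8  7.2 20.0
```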
Other big hammers

bigGP is an R package written by Chris Paciorek (UC Berkeley), et al.

• Specialized distributed implementation of linear algebra operations for GPs
• Uses both shared and distributed memory
• Designed to run on large super computer clusters
• Able to fit models on the order of $n = 65$k (32 GB Cov. matrix)

[Figure: Cholesky decomposition execution time in seconds (log scale) vs matrix dimension $n$ (log scale), for 6, 60, 816, 12,480, and 49,920 cores.]
More scalable solutions?

• Spectral domain / basis functions
• Covariance tapering
• GMRF approximations
• Low-rank approximations
• Nearest-neighbor models
Low Rank Approximations
Low rank approximations in general

Let's look at the example of the singular value decomposition of a matrix,

$$\underset{n \times m}{M} = \underset{n \times n}{U} \;\underset{n \times m}{\text{diag}(S)}\; \underset{m \times m}{V'}$$

where $U$ are called the left singular vectors, $V$ the right singular vectors, and $S$ the singular values. Usually the singular values and vectors are ordered such that the singular values are in descending order.

The Eckart-Young theorem states that we can construct an approximation of $M$ with rank $k$ by setting $\tilde S$ to contain only the $k$ largest singular values, with all other values set to zero,

$$\underset{n \times m}{\tilde M} = \underset{n \times n}{U} \;\underset{n \times m}{\text{diag}(\tilde S)}\; \underset{m \times m}{V'}$$
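A minimal R sketch of this truncation; the function name and the random test matrix are made up for illustration.

```r
# Best rank-k approximation of M (in the Frobenius sense), obtained by
# keeping only the k largest singular values
low_rank_approx <- function(M, k) {
  s <- svd(M)
  s$u[, 1:k, drop = FALSE] %*%
    diag(s$d[1:k], nrow = k) %*%
    t(s$v[, 1:k, drop = FALSE])
}

M <- matrix(rnorm(100 * 20), 100, 20)
M_tilde <- low_rank_approx(M, 5)

norm(M - M_tilde, type = "F")   # Frobenius error of the rank-5 approximation
```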
Example

$$M = \begin{pmatrix}
1.000 & 0.500 & 0.333 & 0.250 \\
0.500 & 0.333 & 0.250 & 0.200 \\
0.333 & 0.250 & 0.200 & 0.167 \\
0.250 & 0.200 & 0.167 & 0.143
\end{pmatrix} = U \,\text{diag}(S)\, V'$$

where, since $M$ is symmetric, $V = U$, with

$$U = \begin{pmatrix}
-0.79 &  0.58 & -0.18 & -0.03 \\
-0.45 & -0.37 &  0.74 &  0.33 \\
-0.32 & -0.51 & -0.10 & -0.79 \\
-0.25 & -0.51 & -0.64 &  0.51
\end{pmatrix}, \qquad
S = (1.50,\ 0.17,\ 0.01,\ 0.00)$$

Rank 2 approximation, using $\tilde S = (1.50,\ 0.17,\ 0.00,\ 0.00)$:

$$\tilde M = U \,\text{diag}(\tilde S)\, V' = \begin{pmatrix}
1.000 & 0.501 & 0.333 & 0.249 \\
0.501 & 0.330 & 0.251 & 0.203 \\
0.333 & 0.251 & 0.200 & 0.166 \\
0.249 & 0.203 & 0.166 & 0.140
\end{pmatrix}$$
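These numbers can be reproduced in a few lines of R. The construction of $M$ below (entries $1/(i + j - 1)$, a 4 × 4 Hilbert-type matrix) is inferred from the printed values rather than stated on the slide.

```r
# 4 x 4 matrix with entries 1 / (i + j - 1), matching M above
M <- outer(1:4, 1:4, function(i, j) 1 / (i + j - 1))

s <- svd(M)
round(s$d, 2)          # singular values: 1.50 0.17 0.01 0.00

# Rank 2 approximation: zero out all but the two largest singular values
d_tilde <- c(s$d[1:2], 0, 0)
M_tilde <- s$u %*% diag(d_tilde) %*% t(s$v)
round(M_tilde, 3)
```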