
Lecture 21: Computational Methods for GPs. Colin Rundel, 04/10/2017 (PowerPoint presentation transcript)



  1. Lecture 21: Computational Methods for GPs. Colin Rundel, 04/10/2017

  2. GPs and Computational Complexity

  3. The problem with GPs: Unless you are lucky (or clever), Gaussian process models are difficult to scale to large problems. For a Gaussian process $\mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$:

     Want to sample $\mathbf{y}$? Compute $\boldsymbol{\mu} + \text{Chol}(\boldsymbol{\Sigma}) \times \mathbf{Z}$ with $Z_i \sim \mathcal{N}(0, 1)$: $\mathcal{O}(n^3)$

     Evaluate the (log) likelihood? Compute $-\frac{1}{2} \log |\Sigma| - \frac{1}{2} (\mathbf{y} - \boldsymbol{\mu})' \, \Sigma^{-1} (\mathbf{y} - \boldsymbol{\mu}) - \frac{n}{2} \log 2\pi$: $\mathcal{O}(n^3)$

     Update covariance parameters? Compute $\{\Sigma\}_{ij} = \sigma^2 \exp(-\{d\}_{ij} \, \phi) + \sigma^2_n \, 1_{i=j}$: $\mathcal{O}(n^2)$
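To make the costs concrete, here is a minimal numpy sketch of all three operations. It is not from the lecture; the locations `X` and the parameter values (`sigma2`, `sigma2_n`, `phi`, named after the covariance above) are invented for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
n = 1000
X = rng.random((n, 2))                        # hypothetical observation locations
sigma2, sigma2_n, phi = 1.0, 0.1, 2.0         # invented covariance parameters

# "Update covariance parameters": filling the dense n x n matrix is O(n^2)
d = cdist(X, X)                               # pairwise distances {d}_ij
Sigma = sigma2 * np.exp(-d * phi) + sigma2_n * np.eye(n)

# "Want to sample y?": the Cholesky factorization is O(n^3)
mu = np.zeros(n)
L = np.linalg.cholesky(Sigma)
y = mu + L @ rng.standard_normal(n)

# "Evaluate the (log) likelihood?": the solve is another O(n^3) step
# (done here with a fresh solve for clarity; in practice one reuses L)
alpha = np.linalg.solve(Sigma, y - mu)        # Sigma^{-1} (y - mu)
logdet = 2 * np.sum(np.log(np.diag(L)))       # log|Sigma| from the Cholesky factor
loglik = -0.5 * logdet - 0.5 * (y - mu) @ alpha - 0.5 * n * np.log(2 * np.pi)
```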


  4. A simple guide to computational complexity:

     $\mathcal{O}(n)$ - Linear complexity - Go for it
     $\mathcal{O}(n^2)$ - Quadratic complexity - Pray
     $\mathcal{O}(n^3)$ - Cubic complexity - Give up


  5. How bad is the problem? [Figure: time (secs, up to roughly 30) to invert an n x n matrix via chol inv, LU inv, and QR inv, for n from 2500 to 10000.]
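The comparison in the figure can be reproduced in spirit with scipy. This is a hedged benchmark sketch of my own, not the lecture's code; absolute times depend on the BLAS/LAPACK build and hardware.

```python
import time
import numpy as np
from scipy import linalg

def time_inversions(n):
    """Time three ways of forming the inverse of an n x n SPD matrix."""
    A = np.random.rand(n, n)
    S = A @ A.T + n * np.eye(n)                 # well-conditioned SPD test matrix
    I = np.eye(n)
    results = {}
    t0 = time.perf_counter()
    linalg.cho_solve(linalg.cho_factor(S), I)   # Cholesky-based inverse
    results["chol inv"] = time.perf_counter() - t0
    t0 = time.perf_counter()
    linalg.lu_solve(linalg.lu_factor(S), I)     # LU-based inverse
    results["LU inv"] = time.perf_counter() - t0
    t0 = time.perf_counter()
    Q, R = linalg.qr(S)
    linalg.solve_triangular(R, Q.T)             # QR-based inverse: R^{-1} Q'
    results["QR inv"] = time.perf_counter() - t0
    return results

for n in (2500, 5000, 10000):
    print(n, time_inversions(n))
```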

  6. Practice - Migratory Model Prediction: After fitting the GP we need to sample from the posterior predictive distribution at $\sim 3000$ locations,

     $\mathbf{y}_p \sim \mathcal{N}\left(\mu_p + \Sigma_{po} \Sigma_o^{-1} (y_o - \mu_o),\ \Sigma_p - \Sigma_{po} \Sigma_o^{-1} \Sigma_{op}\right)$

     Step                                                                  | CPU (secs)
     1. Calc. $\Sigma_p$, $\Sigma_{po}$, $\Sigma_o$                        | 1.080
     2. Calc. $\text{chol}(\Sigma_p - \Sigma_{po} \Sigma_o^{-1} \Sigma_{op})$ | 0.467
     3. Calc. $\mu_{p|o} + \text{chol}(\Sigma_{p|o}) \times Z$             | 0.049
     4. Calc. Allele Prob                                                  | 0.129
     Total                                                                 | 1.732

     Total run time for 1000 posterior predictive draws: CPU (28.9 min)
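The sampling steps translate directly into code. Below is a sketch of steps 2 and 3, assuming the covariance blocks from step 1 have already been built; the block names (`Sigma_o`, `Sigma_po`, `Sigma_p`) are mine, not the lecture's.

```python
import numpy as np

def predictive_draw(y_o, mu_o, mu_p, Sigma_o, Sigma_po, Sigma_p, rng):
    """One draw of y_p | y_o; block names are illustrative."""
    # Conditional mean: mu_p + Sigma_po Sigma_o^{-1} (y_o - mu_o)
    mu_cond = mu_p + Sigma_po @ np.linalg.solve(Sigma_o, y_o - mu_o)
    # Conditional covariance: Sigma_p - Sigma_po Sigma_o^{-1} Sigma_op
    Sigma_cond = Sigma_p - Sigma_po @ np.linalg.solve(Sigma_o, Sigma_po.T)
    # Step 2: Cholesky of the conditional covariance (O(n^3) in the number
    # of prediction locations; add a small jitter if it is near-singular)
    L = np.linalg.cholesky(Sigma_cond)
    # Step 3: mu_{p|o} + chol(Sigma_{p|o}) x Z
    return mu_cond + L @ rng.standard_normal(len(mu_p))
```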


  7. A bigger hammer?

     Step                                                                  | CPU (secs) | CPU+GPU (secs) | Rel. Perf
     1. Calc. $\Sigma_p$, $\Sigma_{po}$, $\Sigma_o$                        | 1.080      | 0.046          | 23.0
     2. Calc. $\text{chol}(\Sigma_p - \Sigma_{po} \Sigma_o^{-1} \Sigma_{op})$ | 0.467   | 0.208          | 2.3
     3. Calc. $\mu_{p|o} + \text{chol}(\Sigma_{p|o}) \times Z$             | 0.049      | 0.052          | 0.9
     4. Calc. Allele Prob                                                  | 0.129      | 0.127          | 1.0
     Total                                                                 | 1.732      | 0.465          | 3.7

     Total run time for 1000 posterior predictive draws: CPU (28.9 min), CPU+GPU (7.8 min)
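From Python, moving the same linear algebra onto a GPU can be sketched with CuPy, whose `linalg` interface mirrors numpy. This is only an illustration of the idea under the assumption of a CUDA-capable GPU; the CPU+GPU implementation timed above is the lecture's own, not this.

```python
import numpy as np
import cupy as cp                          # assumes a CUDA-capable GPU

n = 5000
A = np.random.rand(n, n)
Sigma = A @ A.T + n * np.eye(n)            # stand-in SPD covariance matrix

Sigma_gpu = cp.asarray(Sigma)              # host -> device copy
L_gpu = cp.linalg.cholesky(Sigma_gpu)      # factorization runs on the GPU
z_gpu = L_gpu @ cp.random.standard_normal(n)
z = cp.asnumpy(z_gpu)                      # device -> host copy of the draw
```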

  8. Cholesky CPU vs GPU (P100) [Figure: time (secs, up to roughly 30) for chol inv, LU inv, and QR inv, computed on CPU vs GPU, for n from 2500 to 10000.]

  9. [Figure: the same chol/LU/QR, CPU vs GPU comparison with time on a log scale (roughly 0.1 to 10 secs), for n from 2500 to 10000.]

  10. Relative Performance [Figure: relative performance of GPU over CPU (log scale, roughly 1x to 10x) for chol inv, LU inv, and QR inv, for n from 2500 to 10000.]

  11. Aside (1) - Matrix Multiplication [Figure: matrix multiplication time (sec, 0 to 7.5) for CPU vs GPU, for n from 2500 to 10000.]

  12. Matrix Multiplication - Relative Performance [Figure: relative performance of the GPU for matrix multiplication over n from 2500 to 10000; plotted values range from about 20 to 45.]

  13. Aside (2) - Memory Limitations: A general covariance is a dense $n \times n$ matrix, meaning it will require $n^2 \times 64$ bits to store. [Figure: covariance matrix size (GB, 0 to 20) as a function of n, for n from 0 to 50,000.]
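The arithmetic behind the figure is simple enough to check directly (a throwaway snippet of mine, not the slides'):

```python
def cov_matrix_gb(n):
    """Memory for a dense n x n matrix of 64-bit (8-byte) floats, in GiB."""
    return n * n * 8 / 1024**3

print(cov_matrix_gb(50_000))   # ~18.6 GB, the right edge of the figure
print(cov_matrix_gb(65_000))   # ~31.5 GB, matching the bigGP example below
```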

  14. Other big hammers: bigGP is an R package written by Chris Paciorek (UC Berkeley), et al.

      • Specialized distributed implementation of linear algebra operations for GPs
      • Uses both shared and distributed memory
      • Designed to run on large supercomputer clusters
      • Able to fit models on the order of n = 65k (32 GB Cov. matrix)

      [Figure: Cholesky decomposition execution time in seconds (log scale, 0.01 to 1000) against matrix dimension n (log scale, 2048 to 131072), for 6, 60, 816, 12480, and 49920 cores.]

  15. More scalable solutions?

      • Spectral domain / basis functions
      • Covariance tapering
      • GMRF approximations
      • Low-rank approximations
      • Nearest-neighbor models

  16. Low Rank Approximations

  26. π‘Š 𝑒 π‘œΓ—π‘œ diag ( Μƒ π‘œΓ—π‘› = 𝑉 π‘Š 𝑒 diag ( Μƒ = 𝑇) π‘œΓ—π‘› 𝑛×𝑛 Low rank approximations in general Μƒ 𝑉 𝑇) 𝑙×𝑙 Μƒ 𝑙×𝑛 π‘œΓ—π‘™ 𝑁 Lets look at the example of the singular value decomposition of a matrix, Μƒ 𝑁 π‘œΓ—π‘› 𝑛×𝑛 where 𝑉 are called the left singular vectors, π‘Š the right singular vectors, and 𝑇 the singular values. Usually the singular values and vectors are ordered such that the singular values are in descending order. The Eckart–Young theorem states that we can construct an approximatation of 𝑁 with rank 𝑙 by setting Μƒ 𝑇 to contain only the 𝑙 largest singular values and all other values set to zero. 17 π‘Š 𝑒 π‘œΓ—π‘› = 𝑉 π‘œΓ—π‘œ diag (𝑇)


  18. Example

      $M = \begin{pmatrix} 1.000 & 0.500 & 0.333 & 0.250 \\ 0.500 & 0.333 & 0.250 & 0.200 \\ 0.333 & 0.250 & 0.200 & 0.167 \\ 0.250 & 0.200 & 0.167 & 0.143 \end{pmatrix} = U \,\text{diag}(S)\, V^t$

      $U = V = \begin{pmatrix} -0.79 & 0.58 & -0.18 & -0.03 \\ -0.45 & -0.37 & 0.74 & 0.33 \\ -0.32 & -0.51 & -0.10 & -0.79 \\ -0.25 & -0.51 & -0.64 & 0.51 \end{pmatrix}, \quad S = (1.50,\ 0.17,\ 0.01,\ 0.00)$

      Rank 2 approximation:

      $\tilde{M} = \begin{pmatrix} 1.000 & 0.501 & 0.333 & 0.249 \\ 0.501 & 0.330 & 0.251 & 0.203 \\ 0.333 & 0.251 & 0.200 & 0.166 \\ 0.249 & 0.203 & 0.166 & 0.140 \end{pmatrix}$
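The example matrix is the 4 x 4 Hilbert matrix, so the numbers above can be reproduced directly (a verification sketch of mine, not the lecture's code):

```python
import numpy as np
from scipy.linalg import hilbert

M = hilbert(4)                              # the matrix M shown above
U, S, Vt = np.linalg.svd(M)
print(np.round(S, 2))                       # [1.5  0.17 0.01 0.  ]

M2 = U[:, :2] @ np.diag(S[:2]) @ Vt[:2, :]  # rank-2 approximation
print(np.round(M2, 3))                      # matches the slide to 3 decimals
```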
