
Scalable Gaussian Processes. Zhenwen Dai, Amazon. 9 September 2019.



  1. Scalable Gaussian Processes. Zhenwen Dai, Amazon. 9 September 2019 @GPSS 2019.

  2. Gaussian process
     Input and output data: X = (x_1, . . . , x_N)^⊤, y = (y_1, . . . , y_N)^⊤.
     Model: p(y | f) = N(y | f, σ²I), p(f | X) = N(f | 0, K(X, X)).
     [Figure: a GP fit showing the mean, the data and the confidence interval.]

  3. Behind a Gaussian process fit
     Maximum likelihood estimate of the hyper-parameters:
     θ* = argmax_θ log p(y | X, θ) = argmax_θ log N(y | 0, K + σ²I).
     Prediction on a test point given the observed data and the optimized hyper-parameters:
     p(f_* | X_*, y, X, θ) = N(f_* | K_*(K + σ²I)⁻¹ y, K_** − K_*(K + σ²I)⁻¹ K_*^⊤).
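     A minimal numpy sketch of the prediction equation above (the RBF kernel, the toy data and all
     variable names are illustrative assumptions, not from the slides):

         # Minimal sketch of GP prediction; assumes an RBF kernel, names are illustrative.
         import numpy as np
         from scipy.linalg import cho_factor, cho_solve
         from scipy.spatial.distance import cdist

         def rbf(X1, X2, gamma=1.0, lengthscale=0.2):
             return gamma * np.exp(-0.5 * cdist(X1, X2, 'sqeuclidean') / lengthscale**2)

         def gp_predict(Xs, X, y, sigma2=0.1):
             K = rbf(X, X)                                  # K(X, X)
             Ks = rbf(Xs, X)                                # K_* = K(X_*, X)
             Kss = rbf(Xs, Xs)                              # K_** = K(X_*, X_*)
             C = cho_factor(K + sigma2 * np.eye(len(X)))    # Cholesky of K + sigma^2 I
             mean = Ks @ cho_solve(C, y)                    # K_* (K + sigma^2 I)^-1 y
             cov = Kss - Ks @ cho_solve(C, Ks.T)            # K_** - K_* (K + sigma^2 I)^-1 K_*^T
             return mean, cov

         # Usage on toy data
         X = np.random.rand(100, 1); y = np.sin(6 * X[:, 0]) + 0.1 * np.random.randn(100)
         mu, cov = gp_predict(np.linspace(0, 1, 20)[:, None], X, y)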

  4. How to implement the log-likelihood (1)
     Compute the covariance matrix K:
     K = [ k(x_1, x_1) ... k(x_1, x_N) ;  ... ;  k(x_N, x_1) ... k(x_N, x_N) ],
     where k(x_i, x_j) = γ exp( −(1/(2l²)) (x_i − x_j)^⊤(x_i − x_j) ).
     The complexity is O(N²Q).
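     For illustration, a vectorized sketch of this computation (assuming the RBF kernel above, with
     gamma and lengthscale standing in for γ and l); forming all N² entries over Q input dimensions
     is what gives the O(N²Q) cost:

         # Vectorized computation of the N x N RBF covariance matrix; O(N^2 Q) time.
         import numpy as np

         def rbf_cov(X, gamma=1.0, lengthscale=0.2):
             # Squared distances (x_i - x_j)^T (x_i - x_j) for all pairs
             sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
             return gamma * np.exp(-0.5 * sq_dist / lengthscale**2)

         X = np.random.rand(1000, 10)   # N = 1000, Q = 10
         K = rbf_cov(X)                 # 1000 x 1000 covariance matrix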

  5. How to implement the log-likelihood (2)
     Plug into the log-pdf of the multivariate normal distribution:
     log p(y | X) = log N(y | 0, K + σ²I)
                  = −(1/2) log |2π(K + σ²I)| − (1/2) y^⊤(K + σ²I)⁻¹ y
                  = −(1/2)(||L⁻¹y||² + N log 2π) − Σ_i log L_ii,
     where L = chol(K + σ²I) is a Cholesky decomposition.
     The computational complexity is O(N³ + N² + N). Therefore, the overall complexity, including
     the computation of K, is O(N³).

  6. A quick profiling (N = 1000, Q = 10). Time unit is microseconds.
     Line #      Time   % Time  Line Contents
          2                     def log_likelihood(kern, X, Y, sigma2):
          3       6.0      0.0      N = X.shape[0]
          4   55595.0     58.7      K = kern.K(X)
          5    4369.0      4.6      Ky = K + np.eye(N)*sigma2
          6   30012.0     31.7      L = np.linalg.cholesky(Ky)
          7    4361.0      4.6      LinvY = dtrtrs(L, Y, lower=1)[0]
          8      49.0      0.1      logL = N*np.log(2*np.pi)/-2.
          9      82.0      0.1      logL += np.square(LinvY).sum()/-2.
         10     208.0      0.2      logL += -np.log(np.diag(L)).sum()
         11       2.0      0.0      return logL
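     For reference, a self-contained sketch of the profiled function with the imports it needs;
     scipy.linalg.solve_triangular is substumed in place of dtrtrs, and the GPy-style kernel object
     kern (exposing kern.K(X)) is an assumption:

         # Self-contained sketch of the profiled log-likelihood above.
         import numpy as np
         from scipy.linalg import solve_triangular

         def log_likelihood(kern, X, Y, sigma2):
             N = X.shape[0]
             K = kern.K(X)                               # N x N covariance, O(N^2 Q)
             Ky = K + np.eye(N) * sigma2                 # add the noise variance
             L = np.linalg.cholesky(Ky)                  # O(N^3)
             LinvY = solve_triangular(L, Y, lower=True)  # L^{-1} y, O(N^2)
             logL = N * np.log(2 * np.pi) / -2.
             logL += np.square(LinvY).sum() / -2.        # -1/2 ||L^{-1} y||^2
             logL += -np.log(np.diag(L)).sum()           # -sum_i log L_ii
             return logL

         # Example usage (assuming GPy is installed):
         # import GPy; kern = GPy.kern.RBF(10)
         # X = np.random.rand(1000, 10); Y = np.random.randn(1000, 1)
         # print(log_likelihood(kern, X, Y, 0.1))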

  7. Empirical analysis of computational time
     I collect the run time for N = {10, 100, 500, 1000, 1500, 2000}.
     They take 1.3 ms, 8.5 ms, 28 ms, 0.12 s, 0.29 s and 0.76 s.
     [Figure: a GP fit of run time (seconds) against data size N, with mean, data and confidence interval.]
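     A rough, illustrative way to reproduce such measurements (the kernel, sizes and timing loop are
     assumptions, not the author's benchmark code; absolute numbers depend on hardware):

         # Illustrative timing loop: measure the exact GP log-likelihood cost as N grows.
         import time
         import numpy as np
         from scipy.spatial.distance import cdist

         def rbf_cov(X, gamma=1.0, lengthscale=0.2):
             return gamma * np.exp(-0.5 * cdist(X, X, 'sqeuclidean') / lengthscale**2)

         def exact_loglik_time(N, Q=10, sigma2=0.1):
             X, y = np.random.rand(N, Q), np.random.randn(N)
             start = time.time()
             L = np.linalg.cholesky(rbf_cov(X) + sigma2 * np.eye(N))
             a = np.linalg.solve(L, y)
             _ = -0.5 * (a @ a + N * np.log(2 * np.pi)) - np.log(np.diag(L)).sum()
             return time.time() - start

         for N in [10, 100, 500, 1000, 1500, 2000]:
             print(N, exact_loglik_time(N))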

  8. What if we have 1 million data points?
     The mean of the predicted computational time is 9.4 × 10⁷ seconds ≈ 2.98 years.
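     As a rough check (an O(N³) extrapolation from the measured timings, not the GP-based prediction
     used on the slide): scaling the 0.76 s measured at N = 2000 by (10⁶ / 2000)³ = 1.25 × 10⁸ gives
     about 9.5 × 10⁷ seconds, i.e. roughly 3 years, consistent with the number above.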

  9. What about waiting for faster computers?
     Computational time = amount of work / computer speed.
     If computer speed increases at a pace of 20% year over year:
     ◮ After 10 years, it will take about 176 days.
     ◮ After 50 years, it will take about 2.9 hours.
     If we double the size of the data, it takes 11.4 years to catch up.
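     These numbers follow from the 9.4 × 10⁷ s estimate above: 9.4 × 10⁷ / 1.2¹⁰ ≈ 1.5 × 10⁷ s ≈ 176
     days, and 9.4 × 10⁷ / 1.2⁵⁰ ≈ 1.0 × 10⁴ s ≈ 2.9 hours. Doubling N multiplies the O(N³) work by
     8, which takes log 8 / log 1.2 ≈ 11.4 years of 20% annual speed-ups to absorb.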

  10. What about parallel computing / GPU?
      There is ongoing work on speeding up the Cholesky decomposition with multi-core CPUs or GPUs.
      Main limitation: heavy communication and shared memory.

  11. Other approaches
      Apart from speeding up the exact computation, there has been a lot of work on approximate GP
      inference. These methods often target a specific scenario and provide a good approximation for
      that scenario. This talk provides an overview of the common approximations.

  12. Big data (?)
      Lots of data ≠ complex function.
      In real-world problems, we often collect a lot of data for modeling relatively simple relations.
      [Figure: a GP fit (mean, data, confidence interval) to a large, densely sampled data set with a simple underlying function.]

  13. Data subsampling?
      Real data are often not evenly distributed. We tend to get a lot of data on common cases and
      very few data on rare cases.
      [Figure: a GP fit (mean, data, confidence interval) on unevenly distributed data.]

  14. Covariance matrix of redundant data
      With redundant data, the covariance matrix becomes low rank. What about a low-rank approximation?
      [Figure: plot illustrating the low-rank structure of the covariance matrix, with values decaying rapidly towards zero.]

  15. Low-rank approximation
      Let's recall the log-likelihood of a GP:
      log p(y | X) = log N(y | 0, K + σ²I),
      where K is the covariance matrix computed from X according to the kernel function k(·, ·) and
      σ² is the variance of the Gaussian noise distribution.
      Assume K to be low rank. This leads to the Nyström approximation of Williams and Seeger [2001].

  16. Approximation by subset
      Let's randomly pick a subset of the training data: Z ∈ R^{M×Q}.
      Approximate the covariance matrix K by
      K̃ = K_z K_zz⁻¹ K_z^⊤,
      where K_z = K(X, Z) and K_zz = K(Z, Z). Note that K ∈ R^{N×N}, K_z ∈ R^{N×M} and K_zz ∈ R^{M×M}.
      The log-likelihood is approximated by
      log p(y | X, θ) ≈ log N(y | 0, K_z K_zz⁻¹ K_z^⊤ + σ²I).
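      A small numpy sketch of this construction (the RBF kernel, the jitter and the error check are
      illustrative assumptions, not from the slides):

          # Build K_tilde = K_z K_zz^{-1} K_z^T from a random subset Z of the inputs
          # and check how close it is to the full covariance matrix K.
          import numpy as np
          from scipy.spatial.distance import cdist

          def rbf(A, B, gamma=1.0, lengthscale=0.2):
              return gamma * np.exp(-0.5 * cdist(A, B, 'sqeuclidean') / lengthscale**2)

          N, M, Q = 2000, 30, 1
          X = np.random.rand(N, Q)
          Z = X[np.random.choice(N, M, replace=False)]     # random subset of the training inputs

          K = rbf(X, X)                                    # N x N (only needed for the comparison)
          Kz = rbf(X, Z)                                   # N x M
          Kzz = rbf(Z, Z) + 1e-8 * np.eye(M)               # M x M, with jitter for stability
          K_tilde = Kz @ np.linalg.solve(Kzz, Kz.T)        # K_z K_zz^{-1} K_z^T

          print(np.abs(K - K_tilde).max())                 # worst-case approximation error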

  17. Efficient computation using the Woodbury formula
      The naive formulation does not bring any computational benefit:
      L̃ = −(1/2) log |2π(K̃ + σ²I)| − (1/2) y^⊤(K̃ + σ²I)⁻¹ y.
      Apply the Woodbury formula:
      (K_z K_zz⁻¹ K_z^⊤ + σ²I)⁻¹ = σ⁻²I − σ⁻⁴ K_z (K_zz + σ⁻² K_z^⊤ K_z)⁻¹ K_z^⊤.
      Note that (K_zz + σ⁻² K_z^⊤ K_z) ∈ R^{M×M}. The computational complexity reduces to O(NM²).
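      A sketch of the resulting O(NM²) approximate log-likelihood, combining the Woodbury identity
      above with the matrix determinant lemma for the log-determinant (the kernel, jitter and all
      variable names are illustrative assumptions):

          # O(N M^2) Nystrom-approximated GP log-likelihood.
          import numpy as np
          from scipy.linalg import cho_factor, cho_solve
          from scipy.spatial.distance import cdist

          def rbf(A, B, gamma=1.0, lengthscale=0.2):
              return gamma * np.exp(-0.5 * cdist(A, B, 'sqeuclidean') / lengthscale**2)

          def nystrom_log_likelihood(X, y, Z, sigma2):
              N, M = X.shape[0], Z.shape[0]
              Kz = rbf(X, Z)                                  # N x M
              Kzz = rbf(Z, Z) + 1e-8 * np.eye(M)              # M x M, with jitter
              A = Kzz + Kz.T @ Kz / sigma2                    # K_zz + s^-2 K_z^T K_z, O(N M^2)
              cA, cKzz = cho_factor(A), cho_factor(Kzz)
              Kzy = Kz.T @ y                                  # M-vector
              # Woodbury: y^T (K_tilde + s2 I)^{-1} y = y^T y / s2 - Kzy^T A^{-1} Kzy / s2^2
              quad = y @ y / sigma2 - Kzy @ cho_solve(cA, Kzy) / sigma2**2
              # Determinant lemma: log|K_tilde + s2 I| = log|A| - log|K_zz| + N log s2
              logdet = (2 * np.log(np.diag(cA[0])).sum()
                        - 2 * np.log(np.diag(cKzz[0])).sum() + N * np.log(sigma2))
              return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

          # Usage: a random subset of the inputs as Z
          X = np.random.rand(5000, 1); y = np.sin(6 * X[:, 0]) + 0.1 * np.random.randn(5000)
          Z = X[np.random.choice(5000, 50, replace=False)]
          print(nystrom_log_likelihood(X, y, Z, sigma2=0.1))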

  18. Nyström approximation
      The above approach is the Nyström approximation of Williams and Seeger [2001].
      The approximation is applied directly to the covariance matrix, without the concept of pseudo data.
      The approximation becomes exact if the whole data set is taken, i.e., K K⁻¹ K^⊤ = K.
      The subset selection is done randomly.

  19. Gaussian process with pseudo data (1)
      Snelson and Ghahramani [2006] propose the idea of having pseudo data, later referred to as the
      fully independent training conditional (FITC) approximation.
      Augment the training data (X, y) with pseudo data u at locations Z:
      p( [y; u] | [X; Z] ) = N( [y; u] | 0, [ K_ff + σ²I   K_fu
                                              K_fu^⊤       K_uu ] ),
      where K_ff = K(X, X), K_fu = K(X, Z) and K_uu = K(Z, Z).

  20. Gaussian process with pseudo data (2)
      Thanks to the marginalization property of the Gaussian distribution,
      p(y | X) = ∫ p(y, u | X, Z) du.
      Further re-arrange the notation: p(y, u | X, Z) = p(y | u, X, Z) p(u | Z), where
      p(u | Z) = N(u | 0, K_uu),
      p(y | u, X, Z) = N(y | K_fu K_uu⁻¹ u, K_ff − K_fu K_uu⁻¹ K_fu^⊤ + σ²I).

  21. FITC approximation (1)
      So far p(y | X) has not been changed, but there is also no speed-up: K_ff ∈ R^{N×N} still
      appears in K_ff − K_fu K_uu⁻¹ K_fu^⊤ + σ²I.
      The FITC approximation assumes
      p̃(y | u, X, Z) = N(y | K_fu K_uu⁻¹ u, Λ + σ²I),
      where Λ = (K_ff − K_fu K_uu⁻¹ K_fu^⊤) ◦ I, i.e., only the diagonal is kept.
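      Since Λ is diagonal, it can be computed in O(NM²) without ever forming the N×N matrix K_ff; a
      minimal sketch under the same illustrative RBF-kernel assumption:

          # FITC diagonal correction: Lambda_ii = k(x_i, x_i) - k_i^T K_uu^{-1} k_i.
          import numpy as np
          from scipy.linalg import cho_factor, cho_solve
          from scipy.spatial.distance import cdist

          def rbf(A, B, gamma=1.0, lengthscale=0.2):
              return gamma * np.exp(-0.5 * cdist(A, B, 'sqeuclidean') / lengthscale**2)

          def fitc_diag(X, Z, gamma=1.0, lengthscale=0.2):
              Kfu = rbf(X, Z, gamma, lengthscale)                     # N x M
              Kuu = rbf(Z, Z, gamma, lengthscale) + 1e-8 * np.eye(len(Z))
              tmp = cho_solve(cho_factor(Kuu), Kfu.T)                 # K_uu^{-1} K_fu^T, M x N
              # diag(K_ff) is just gamma for the RBF kernel
              return gamma - np.einsum('nm,mn->n', Kfu, tmp)          # length-N diagonal of Lambda

          X = np.random.rand(1000, 1)
          Z = X[np.random.choice(1000, 20, replace=False)]
          Lam = fitc_diag(X, Z)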

  22. FITC approximation (2)
      Marginalize u from the model definition:
      p̃(y | X, Z) = N(y | 0, K_fu K_uu⁻¹ K_fu^⊤ + Λ + σ²I).
      The Woodbury formula can be applied in the same way as in the Nyström approximation:
      (K_fu K_uu⁻¹ K_fu^⊤ + Λ + σ²I)⁻¹ = A − A K_fu (K_uu + K_fu^⊤ A K_fu)⁻¹ K_fu^⊤ A,
      where A = (Λ + σ²I)⁻¹.
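      Putting the pieces together, a sketch of an O(NM²) FITC log-likelihood (the kernel, jitter and
      the determinant-lemma step for the log-determinant are illustrative assumptions; this is a
      sketch, not the authors' implementation):

          # O(N M^2) FITC log-likelihood via the Woodbury identity above.
          import numpy as np
          from scipy.linalg import cho_factor, cho_solve
          from scipy.spatial.distance import cdist

          def rbf(A, B, gamma=1.0, lengthscale=0.2):
              return gamma * np.exp(-0.5 * cdist(A, B, 'sqeuclidean') / lengthscale**2)

          def fitc_log_likelihood(X, y, Z, sigma2, gamma=1.0, lengthscale=0.2):
              N, M = X.shape[0], Z.shape[0]
              Kfu = rbf(X, Z, gamma, lengthscale)
              Kuu = rbf(Z, Z, gamma, lengthscale) + 1e-8 * np.eye(M)
              cKuu = cho_factor(Kuu)
              # Diagonal correction Lambda and the diagonal D = Lambda + sigma2 I
              Lam = gamma - np.einsum('nm,mn->n', Kfu, cho_solve(cKuu, Kfu.T))
              d = Lam + sigma2                                    # diagonal of D (length N)
              B = Kuu + Kfu.T @ (Kfu / d[:, None])                # K_uu + K_fu^T D^{-1} K_fu
              cB = cho_factor(B)
              Kfu_Dinv_y = Kfu.T @ (y / d)                        # K_fu^T D^{-1} y
              # Woodbury quadratic term and determinant lemma for the log-determinant
              quad = y @ (y / d) - Kfu_Dinv_y @ cho_solve(cB, Kfu_Dinv_y)
              logdet = (2 * np.log(np.diag(cB[0])).sum()
                        - 2 * np.log(np.diag(cKuu[0])).sum() + np.log(d).sum())
              return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

          X = np.random.rand(5000, 1); y = np.sin(6 * X[:, 0]) + 0.1 * np.random.randn(5000)
          Z = X[np.random.choice(5000, 50, replace=False)]
          print(fitc_log_likelihood(X, y, Z, sigma2=0.01))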
