Scalable Gaussian processes with a twist of Probabilistic Numerics

  1. Scalable Gaussian processes with a twist of Probabilistic Numerics. Kurt Cutajar, EURECOM, Sophia Antipolis, France. Data Science Meetup - October 30th 2017

  2. Agenda • Kernel Methods • Scalable Gaussian Processes (using Preconditioning) • Probabilistic Numerics

  3. Kernel Methods
  • Operate in a high-dimensional, implicit feature space
  • Rely on the construction of an n × n Gram matrix K
  • E.g. RBF: k(x_i, x_j) = σ² exp(−½ d²), where d² = (x_i − x_j)ᵀ Λ (x_i − x_j)
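
  For concreteness, here is a minimal NumPy sketch of the n × n Gram matrix for an ARD RBF kernel; the function and parameter names (rbf_gram, sigma2, lengthscales) are illustrative, not from the slides.

    import numpy as np

    def rbf_gram(X, sigma2=1.0, lengthscales=None):
        # K[i, j] = sigma2 * exp(-0.5 * d2), with d2 the lengthscale-rescaled squared distance
        d = X.shape[1]
        if lengthscales is None:
            lengthscales = np.ones(d)
        Xs = X / lengthscales                                # per-dimension rescaling (ARD)
        sq = np.sum(Xs ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * Xs @ Xs.T     # pairwise squared distances
        return sigma2 * np.exp(-0.5 * np.maximum(d2, 0.0))   # clip tiny negatives from round-off

    K = rbf_gram(np.random.randn(500, 3))                    # a 500 x 500 Gram matrix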

  4. Kernel Methods • Wide variety of kernel functions available [Figure taken from David Duvenaud’s PhD Thesis]

  5. Kernel Methods • Choice is not always straightforward! [Figure taken from David Duvenaud’s PhD Thesis]

  6. All About that Bayes
  • posterior = likelihood × prior / marginal likelihood
  • p(par | X, y) = p(y | X, par) × p(par) / p(y | X)

  7. All About that Bayes - Making Predictions
  • We average over all possible parameter values, weighted by their posterior probability:
  • p(y* | x*, X, y) = ∫ p(y* | x*, par) p(par | X, y) d par = N(E[y*], V[y*])
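
  In practice this integral is usually approximated by Monte Carlo, averaging the per-sample predictive moments over posterior samples of the parameters. The sketch below assumes hypothetical posterior_samples and predict_given_par helpers; it is not the method of the talk, just an illustration of the averaging.

    import numpy as np

    def predictive_moments(x_star, posterior_samples, predict_given_par):
        # p(y* | x*, X, y) ~ (1/S) sum_s p(y* | x*, par_s), with par_s ~ p(par | X, y)
        means, variances = [], []
        for par in posterior_samples:
            m, v = predict_given_par(x_star, par)   # per-sample predictive mean and variance
            means.append(m)
            variances.append(v)
        means = np.asarray(means)
        variances = np.asarray(variances)
        e_y = means.mean(axis=0)                            # E[y*]
        v_y = variances.mean(axis=0) + means.var(axis=0)    # V[y*] via the law of total variance
        return e_y, v_y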

  8. Gaussian Processes

  9. Gaussian Processes - Prior Distribution over Functions
  [Figure: samples drawn from the GP prior (label vs. input), shown alongside the prior covariance matrix K∞]

  10. Gaussian Processes - Conditioned on Observations
  [Figure: GP prior samples shown together with the observed data points and the prior covariance matrix K∞]

  11. Gaussian Processes - Posterior Distribution over Functions
  [Figure: posterior samples conditioned on the observations, shown alongside the posterior covariance matrix K_y]

  12. Gaussian Processes - GP regression example
  [Figure: three panels showing the GP prior, the observed data, and the inference result, with the corresponding covariance matrices K∞ and K_y]
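
  For GP regression the prior-to-posterior update shown in these figures has a closed form; below is a minimal Cholesky-based sketch, assuming kernel(A, B) returns the cross-covariance matrix and lam is the noise variance (both names are illustrative).

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def gp_posterior(X, y, X_star, kernel, lam=0.1):
        K_y = kernel(X, X) + lam * np.eye(len(X))   # K_y = K + lam * I
        K_s = kernel(X, X_star)                     # n x n_star cross-covariances
        K_ss = kernel(X_star, X_star)
        fac = cho_factor(K_y, lower=True)
        alpha = cho_solve(fac, y)                   # K_y^{-1} y
        mean = K_s.T @ alpha                        # posterior mean at X_star
        cov = K_ss - K_s.T @ cho_solve(fac, K_s)    # posterior covariance at X_star
        return mean, cov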

  13. Bayesian Learning vs Deep Learning
  • Deep Learning
    + Scalable to very large datasets
    + Increased model flexibility/capacity
    - Frequentist approaches make only point estimates
    - Less robust to overfitting
  • Bayesian Learning
    + Incorporates uncertainty in predictions
    + Works well with smaller datasets
    - Lack of conjugacy necessitates approximation
    - Expensive computational and storage requirements

  14. Bayesian Learning vs Deep Learning - Deep Gaussian Processes
  • Deep probabilistic models
  • Composition of functions: f(x) = (h^(N_h−1)(·; θ^(N_h−1)) ∘ … ∘ h^(0)(·; θ^(0)))(x)

  15. Bayesian Learning vs Deep Learning - Deep Gaussian Processes
  • Inference requires calculating the marginal likelihood:
  p(Y | X, θ) = ∫ p(Y | F^(N_h), θ^(N_h)) × p(F^(N_h) | F^(N_h−1), θ^(N_h−1)) × … × p(F^(1) | X, θ^(0)) dF^(N_h) … dF^(1)
  • Very challenging!

  16. Bayesian Learning vs Deep Learning - Deep Gaussian Processes
  [Figure: graphical model of a two-layer deep GP with random feature expansions, X → Φ^(0) → F^(1) → Φ^(1) → F^(2) → Y, with spectral frequencies Ω^(l), weights W^(l) and covariance parameters θ^(l)]
  Cutajar et al., Random Feature Expansions for Deep Gaussian Processes, ICML 2017
  Yarin Gal, Bayesian Deep Learning, PhD Thesis

  17. Scalable Gaussian Processes

  18. Gaussian Processes
  • Marginal likelihood: log p(y | par) = −½ log |K_y| − ½ yᵀ K_y⁻¹ y + const.
  • Derivatives wrt par: ∂ log p(y | par) / ∂par_i = −½ Tr(K_y⁻¹ ∂K_y/∂par_i) + ½ yᵀ K_y⁻¹ (∂K_y/∂par_i) K_y⁻¹ y
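
  Both expressions only need solves against K_y and a log-determinant. A Cholesky-based sketch is below; dK_list holds hypothetical derivative matrices ∂K_y/∂par_i supplied by the caller.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def log_marginal_and_grads(K_y, y, dK_list):
        fac = cho_factor(K_y, lower=True)
        alpha = cho_solve(fac, y)                            # K_y^{-1} y
        logdet = 2.0 * np.sum(np.log(np.diag(fac[0])))       # log |K_y| from the Cholesky factor
        lml = -0.5 * logdet - 0.5 * y @ alpha                # + const.
        grads = []
        for dK in dK_list:
            trace_term = np.trace(cho_solve(fac, dK))        # Tr(K_y^{-1} dK)
            quad_term = alpha @ dK @ alpha                   # y^T K_y^{-1} dK K_y^{-1} y
            grads.append(-0.5 * trace_term + 0.5 * quad_term)
        return lml, np.array(grads)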

  19. Gaussian Processes - Stochastic Trace Estimation [Figure taken from Shakir Mohamed’s Machine Learning Blog]

  20. Gaussian Processes - Stochastic Gradients
  • Stochastic estimate of the trace - assuming E[r rᵀ] = I, then
  Tr(K_y⁻¹ ∂K_y/∂par_i) = Tr(K_y⁻¹ (∂K_y/∂par_i) E[r rᵀ]) = E[rᵀ K_y⁻¹ (∂K_y/∂par_i) r]
  • Stochastic gradient:
  −1/(2N_r) ∑_{j=1}^{N_r} r^(j)ᵀ K_y⁻¹ (∂K_y/∂par_i) r^(j) + ½ yᵀ K_y⁻¹ (∂K_y/∂par_i) K_y⁻¹ y
  • Linear systems only!
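
  A sketch of the trace estimator with Rademacher probe vectors; solve_Ky stands for any routine that applies K_y⁻¹ to a vector (e.g. a CG solve) and is an assumed helper, not part of the slides.

    import numpy as np

    def stochastic_trace(solve_Ky, dK, n, n_r=4, seed=0):
        # Estimate Tr(K_y^{-1} dK) as an average of r^T K_y^{-1} dK r with E[r r^T] = I
        rng = np.random.default_rng(seed)
        estimates = []
        for _ in range(n_r):
            r = rng.choice([-1.0, 1.0], size=n)      # Rademacher probe vector
            estimates.append(r @ solve_Ky(dK @ r))   # r^T K_y^{-1} (dK r)
        return np.mean(estimates)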

  22. Solving Linear Systems
  • Involve the solution of linear systems K z = v
  • Cholesky Decomposition
    • O(n²) space and O(n³) time - infeasible for large n
    • K must be stored in memory!
  • Conjugate Gradient
    • Numerical solution of linear systems
    • O(tn²) for t CG iterations - in theory t = n (possibly worse!)
  [Figure: CG iterates moving from the starting point z₀ towards the solution z]
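
  A quick SciPy sketch contrasting the two solvers on a stand-in symmetric positive-definite matrix (the matrix, sizes and constants are arbitrary, chosen only so the snippet runs):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve
    from scipy.sparse.linalg import cg

    rng = np.random.default_rng(0)
    n = 1000
    A = rng.standard_normal((n, n))
    K_y = A @ A.T / n + 0.1 * np.eye(n)        # stand-in SPD "kernel" matrix
    v = rng.standard_normal(n)

    z_chol = cho_solve(cho_factor(K_y), v)     # exact: O(n^3) time, needs K_y in memory
    z_cg, info = cg(K_y, v)                    # iterative: one matrix-vector product per iteration
    # info == 0 signals convergence; the two solutions agree up to the CG tolerance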

  24. Solving Linear Systems
  • Preconditioned Conjugate Gradient (henceforth PCG)
    • Transforms the linear system to be better conditioned, improving convergence
    • Yields a new linear system of the form P⁻¹ K z = P⁻¹ v
    • O(tn²) for t PCG iterations - in practice t ≪ n
  [Figure: convergence paths of CG and PCG from the starting point z₀ to the solution z]
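
  In SciPy, preconditioning amounts to passing an (approximate) P⁻¹ through the M argument of cg. A simple Jacobi (diagonal) preconditioner is used below purely as an illustration; it is not one of the preconditioners studied in the talk.

    import numpy as np
    from scipy.sparse.linalg import cg, LinearOperator

    rng = np.random.default_rng(0)
    n = 1000
    A = rng.standard_normal((n, n))
    K_y = A @ A.T / n + 0.1 * np.eye(n)
    v = rng.standard_normal(n)

    d = np.diag(K_y)                                                 # Jacobi: P = diag(K_y)
    P_inv = LinearOperator((n, n), matvec=lambda x: x.ravel() / d)   # cheap application of P^{-1}
    z_pcg, info = cg(K_y, v, M=P_inv)                                # a good P means far fewer iterations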

  25. Preconditioning Approaches
  • Suppose we want to precondition K_y = K + λI
  • Our choice of preconditioner, P, should:
    • Approximate K_y as closely as possible
    • Be easy to invert
  • For low-rank preconditioners we employ the Woodbury inversion lemma:
    (A + U C Vᵀ)⁻¹ = A⁻¹ − A⁻¹ U (C⁻¹ + Vᵀ A⁻¹ U)⁻¹ Vᵀ A⁻¹
  • For other preconditioners we solve inner linear systems once again using CG!
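
  A sketch of applying P⁻¹ via the Woodbury identity when P = U C Uᵀ + λI is low-rank plus diagonal (for instance the Nyström-style choices listed on the next slide, with U = K_XU and C = K_UU⁻¹); each application then costs O(nm + m³) rather than O(n³). Variable names are mine.

    import numpy as np

    def woodbury_solve(U, C_inv, lam, v):
        # P^{-1} v for P = U C U^T + lam * I, with U of size n x m and C_inv = C^{-1} (m x m):
        # P^{-1} v = v/lam - (1/lam) U (C^{-1} + U^T U / lam)^{-1} (U^T v / lam)
        rhs = U.T @ v / lam                      # m-dimensional right-hand side
        inner = C_inv + (U.T @ U) / lam          # small m x m system
        return v / lam - (U @ np.linalg.solve(inner, rhs)) / lam

  Wrapping this in a LinearOperator and passing it as M to SciPy's cg gives the PCG solver sketched above.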

  28. Preconditioning Approaches
  • Nyström:        P = K_XU K_UU⁻¹ K_UX + λI,  where U ⊂ X
  • FITC:           P = K_XU K_UU⁻¹ K_UX + diag(K − K_XU K_UU⁻¹ K_UX) + λI
  • PITC:           P = K_XU K_UU⁻¹ K_UX + bldiag(K − K_XU K_UU⁻¹ K_UX) + λI
  • Spectral:       P_ij = (σ²/m) ∑_{r=1}^{m} cos[2π s_rᵀ (x_i − x_j)] + λ I_ij
  • Partial SVD:    K = A Λ Aᵀ  ⇒  P = A[·, 1:m] Λ[1:m, 1:m] Aᵀ[1:m, ·] + λI
  • Block Jacobi:   P = bldiag(K) + λI
  • SKI:            P = W K_UU Wᵀ + λI,  where K_UU is Kronecker-structured
  • Regularization: P = K + λI + δI
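
  As one concrete entry from this list, a block Jacobi preconditioner only needs the diagonal blocks of K. The sketch below (block size and function names are mine) factorizes each block once and exposes P⁻¹ as a LinearOperator that can be passed as M to a PCG solve.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve
    from scipy.sparse.linalg import LinearOperator

    def block_jacobi_inverse(K, lam, block_size=200):
        # P = bldiag(K) + lam * I; applying P^{-1} means one small solve per diagonal block
        n = K.shape[0]
        blocks = []
        for start in range(0, n, block_size):
            stop = min(start + block_size, n)
            B = K[start:stop, start:stop] + lam * np.eye(stop - start)
            blocks.append((start, stop, cho_factor(B, lower=True)))   # factor each block once

        def apply_P_inv(v):
            v = np.asarray(v).ravel()
            out = np.empty_like(v, dtype=float)
            for start, stop, fac in blocks:
                out[start:stop] = cho_solve(fac, v[start:stop])
            return out

        return LinearOperator((n, n), matvec=apply_P_inv)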

  29. Comparison of Preconditioners vs CG

  30. Experimental Setup - GP Kernel Parameter Optimization
  • Exact gradient-based optimization using Cholesky decomposition (CHOL)
  • Stochastic gradient-based optimization
    • Linear systems solved with CG and PCG
  • GP Approximations
    • Variational learning of inducing variables (VAR)
    • Fully Independent Training Conditional (FITC)
    • Partially Independent Training Conditional (PITC)

  31. Results - ARD Kernel
  [Figure: test performance vs. training time (log10 seconds). Regression: RMSE and negative test log-likelihood on Protein (n = 45730, d = 9) and Power plant (n = 9568, d = 4). Classification: error rate and negative test log-likelihood on EEG (n = 14979, d = 14) and Spam (n = 4061, d = 57). Methods compared: PCG, CG, CHOL, FITC, PITC, VAR]

  32. Follow-up Work
  • Faster Kernel Ridge Regression Using Sketching and Preconditioning, Avron et al. (2017)
  • FALKON: An Optimal Large Scale Kernel Method, Rudi et al. (2017)
  • Large Linear Multi-output Gaussian Process Learning for Time Series, Feinberg et al. (2017)
  • Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes, Kim et al. (2017)

  33. Follow-up work ... but what’s left to do now?

  35. Probabilistic Numerics
