Probabilistic Graphical Models
Lecture 21: Advanced Gaussian Processes
Andrew Gordon Wilson
www.cs.cmu.edu/~andrewgw
Carnegie Mellon University
April 1, 2015

Gaussian process review

Definition. A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Nonparametric Regression Model
◮ Prior: f(x) ∼ GP(m(x), k(x, x')), meaning (f(x_1), ..., f(x_N)) ∼ N(µ, K), with µ_i = m(x_i) and K_ij = cov(f(x_i), f(x_j)) = k(x_i, x_j).
◮ GP posterior ∝ Likelihood × GP prior:  p(f(x) | D) ∝ p(D | f(x)) p(f(x)).

[Figure: Gaussian process sample prior functions (left) and sample posterior functions (right), output f(t) against input t.]

Gaussian Process Inference

◮ Observed noisy data y = (y(x_1), ..., y(x_N))^T at input locations X.
◮ Start with the standard regression assumption: p(y(x) | f(x)) = N(y(x); f(x), σ²).
◮ Place a Gaussian process distribution over noise-free functions, f(x) ∼ GP(0, k_θ). The kernel k is parametrized by θ.
◮ Infer p(f* | y, X, X*) for the noise-free function f evaluated at test points X*.

Joint distribution

  [y, f*] ∼ N( 0, [ K_θ(X, X) + σ²I   K_θ(X, X*) ;  K_θ(X*, X)   K_θ(X*, X*) ] ).   (1)

Conditional predictive distribution

  f* | X*, X, y, θ ∼ N(f̄*, cov(f*)),   (2)
  f̄* = K_θ(X*, X)[K_θ(X, X) + σ²I]⁻¹ y,   (3)
  cov(f*) = K_θ(X*, X*) − K_θ(X*, X)[K_θ(X, X) + σ²I]⁻¹ K_θ(X, X*).   (4)

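Below is a minimal NumPy sketch of Equations (2)–(4), not the lecture's reference implementation: it assumes an RBF kernel, and the hyperparameter values and toy sinusoidal data are purely illustrative.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, amplitude=1.0):
    """k(x, x') = a^2 exp(-||x - x'||^2 / (2 l^2))."""
    sqdist = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return amplitude**2 * np.exp(-0.5 * sqdist / lengthscale**2)

def gp_predict(X, y, Xstar, kernel, noise_var=0.1):
    """Predictive mean and covariance, Equations (3) and (4)."""
    K = kernel(X, X) + noise_var * np.eye(len(X))
    Ks = kernel(X, Xstar)            # K_theta(X, X*)
    Kss = kernel(Xstar, Xstar)       # K_theta(X*, X*)
    L = np.linalg.cholesky(K)        # Cholesky factor for numerically stable solves
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    cov = Kss - v.T @ v
    return mean, cov

# Toy usage on noisy sinusoidal data (illustrative values only).
X = np.linspace(-3, 3, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)
Xstar = np.linspace(-4, 4, 100)[:, None]
mean, cov = gp_predict(X, y, Xstar, rbf_kernel)
```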
Learning and Model Selection

  p(M_i | y) = p(y | M_i) p(M_i) / p(y).   (5)

We can write the evidence of the model as

  p(y | M_i) = ∫ p(y | f, M_i) p(f) df.   (6)

[Figure: (a) data and fits from simple, appropriate, and complex models, output f(x) against input x; (b) the evidence p(y|M) of simple, appropriate, and complex models across all possible datasets y.]

Learning and Model Selection

◮ We can integrate away the entire Gaussian process f(x) to obtain the marginal likelihood, as a function of kernel hyperparameters θ alone:

  p(y | θ, X) = ∫ p(y | f, X) p(f | θ, X) df.   (7)

  log p(y | θ, X) = −(1/2) y^T (K_θ + σ²I)⁻¹ y  [model fit]  − (1/2) log |K_θ + σ²I|  [complexity penalty]  − (N/2) log(2π).   (8)

◮ An extremely powerful mechanism for kernel learning.

[Figure: samples from the GP prior (left) and samples from the GP posterior (right), output f(x) against input x.]

Inference and Learning

1. Learning: Optimize the marginal likelihood,

  log p(y | θ, X) = −(1/2) y^T (K_θ + σ²I)⁻¹ y  [model fit]  − (1/2) log |K_θ + σ²I|  [complexity penalty]  − (N/2) log(2π),

with respect to kernel hyperparameters θ.

2. Inference: Conditioned on kernel hyperparameters θ, form the predictive distribution for test inputs X*:

  f* | X*, X, y, θ ∼ N(f̄*, cov(f*)),
  f̄* = K_θ(X*, X)[K_θ(X, X) + σ²I]⁻¹ y,
  cov(f*) = K_θ(X*, X*) − K_θ(X*, X)[K_θ(X, X) + σ²I]⁻¹ K_θ(X, X*).

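A minimal sketch of this two-step workflow, under stated assumptions: it reuses the rbf_kernel and gp_predict helpers and the toy data (X, y, Xstar) from the earlier sketch, parametrizes θ as (lengthscale, amplitude, noise standard deviation) in log space, and the initial values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def negative_log_marginal_likelihood(log_params, X, y):
    """-log p(y | theta, X) for the RBF kernel; log_params = log(lengthscale, amplitude, noise_std)."""
    lengthscale, amplitude, noise_std = np.exp(log_params)
    K = rbf_kernel(X, X, lengthscale, amplitude) + noise_std**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    model_fit = -0.5 * y @ alpha                       # -1/2 y^T (K + sigma^2 I)^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))           # -1/2 log|K + sigma^2 I| via the Cholesky factor
    const = -0.5 * len(X) * np.log(2 * np.pi)
    return -(model_fit + complexity + const)

# Step 1 (learning): maximize the marginal likelihood over theta.
result = minimize(negative_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
                  args=(X, y), method="L-BFGS-B")
lengthscale, amplitude, noise_std = np.exp(result.x)

# Step 2 (inference): plug the learned theta into the predictive equations.
mean, cov = gp_predict(X, y, Xstar,
                       lambda A, B: rbf_kernel(A, B, lengthscale, amplitude),
                       noise_var=noise_std**2)
```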
Learning and Model Selection

◮ A fully Bayesian treatment would integrate away the kernel hyperparameters θ:

  p(f* | X*, X, y) = ∫ p(f* | X*, X, y, θ) p(θ | y) dθ.   (9)

◮ For example, we could specify a prior p(θ), use MCMC to take J samples from p(θ | y) ∝ p(y | θ) p(θ), and then find (see the sketch below)

  p(f* | X*, X, y) ≈ (1/J) Σ_{i=1}^{J} p(f* | X*, X, y, θ^(i)),   θ^(i) ∼ p(θ | y).   (10)

◮ If we have a non-Gaussian noise model, and thus cannot integrate away f, the strong dependencies between the Gaussian process f and the hyperparameters θ make sampling extremely difficult. In my experience, the most effective solution is to use a deterministic approximation for the posterior p(f | y), which enables one to work with an approximate marginal likelihood.

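A minimal random-walk Metropolis sketch of Equation (10), assuming the Gaussian noise model so that p(y | θ) is exactly the marginal likelihood of Equation (8). It reuses negative_log_marginal_likelihood, rbf_kernel, gp_predict, and the toy data from the earlier sketches; the Gaussian prior on log θ, the proposal scale, and the chain length are illustrative choices, not values from the lecture.

```python
import numpy as np

def log_posterior(log_params, X, y):
    """log p(theta | y) up to a constant: log p(y | theta) + log p(theta),
    with a broad Gaussian prior on the log hyperparameters (an assumption)."""
    log_prior = -0.5 * np.sum(log_params**2) / 3.0**2
    return -negative_log_marginal_likelihood(log_params, X, y) + log_prior

# Random-walk Metropolis over theta (illustrative settings).
rng = np.random.default_rng(0)
current = np.log([1.0, 1.0, 0.1])
samples = []
for _ in range(2000):
    proposal = current + 0.1 * rng.standard_normal(current.shape)
    if np.log(rng.uniform()) < log_posterior(proposal, X, y) - log_posterior(current, X, y):
        current = proposal
    samples.append(current)
samples = np.array(samples[1000:])            # discard burn-in

# Equation (10): average the predictive distributions over thinned theta samples.
means = []
for log_params in samples[::100]:
    l, a, s = np.exp(log_params)
    m, _ = gp_predict(X, y, Xstar, lambda A, B: rbf_kernel(A, B, l, a), noise_var=s**2)
    means.append(m)
predictive_mean = np.mean(means, axis=0)
```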
Popular Kernels

Let τ = x − x':

  k_SE(τ) = exp(−0.5 τ²/ℓ²),   (11)
  k_MA(τ) = a (1 + √3 |τ|/ℓ) exp(−√3 |τ|/ℓ),   (12)
  k_RQ(τ) = (1 + τ²/(2αℓ²))^(−α),   (13)
  k_PE(τ) = exp(−2 sin²(π τ ω)/ℓ²).   (14)

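These four kernels can be written directly as functions of the lag τ. A minimal NumPy sketch; the default hyperparameter values (ℓ, a, α, ω) are illustrative.

```python
import numpy as np

def k_se(tau, lengthscale=1.0):
    """Squared exponential, Eq. (11)."""
    return np.exp(-0.5 * tau**2 / lengthscale**2)

def k_ma(tau, a=1.0, lengthscale=1.0):
    """Matern-3/2, Eq. (12)."""
    r = np.sqrt(3) * np.abs(tau) / lengthscale
    return a * (1 + r) * np.exp(-r)

def k_rq(tau, alpha=1.0, lengthscale=1.0):
    """Rational quadratic, Eq. (13)."""
    return (1 + tau**2 / (2 * alpha * lengthscale**2)) ** (-alpha)

def k_pe(tau, omega=1.0, lengthscale=1.0):
    """Periodic, Eq. (14)."""
    return np.exp(-2 * np.sin(np.pi * tau * omega)**2 / lengthscale**2)

# Evaluate each kernel on a grid of lags tau = x - x'.
tau = np.linspace(-3, 3, 200)
covariances = {"SE": k_se(tau), "MA": k_ma(tau), "RQ": k_rq(tau), "PE": k_pe(tau)}
```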
Worked Example: Combining Kernels, CO₂ Data

[Figure: atmospheric CO₂ concentration (ppm), roughly 320–400 ppm, plotted against year, 1968–2004.]

Example from Rasmussen and Williams (2006), Gaussian Processes for Machine Learning.

Worked Example: Combining Kernels, CO₂ Data

◮ Long rising trend:  k_1(x_p, x_q) = θ₁² exp(−(x_p − x_q)²/(2θ₂²)).
◮ Quasi-periodic seasonal changes:  k_2(x_p, x_q) = k_RBF(x_p, x_q) k_PER(x_p, x_q) = θ₃² exp(−(x_p − x_q)²/(2θ₄²) − 2 sin²(π(x_p − x_q))/θ₅²).
◮ Multi-scale medium-term irregularities:  k_3(x_p, x_q) = θ₆² (1 + (x_p − x_q)²/(2θ₈θ₇²))^(−θ₈).
◮ Correlated and i.i.d. noise:  k_4(x_p, x_q) = θ₉² exp(−(x_p − x_q)²/(2θ₁₀²)) + θ₁₁² δ_pq.
◮ k_total(x_p, x_q) = k_1(x_p, x_q) + k_2(x_p, x_q) + k_3(x_p, x_q) + k_4(x_p, x_q)  (a code sketch follows below).

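A minimal NumPy sketch of the composite kernel k_total. The hyperparameter values θ₁, …, θ₁₁ below are hypothetical placeholders, not the values learned in Rasmussen and Williams (2006); in practice they would be set by maximizing the marginal likelihood.

```python
import numpy as np

# Hypothetical hyperparameter values theta_1..theta_11, for illustration only;
# t[0] is unused so that t[i] matches theta_i in the slide.
t = [None, 60.0, 90.0, 2.0, 100.0, 1.3, 0.7, 1.2, 0.78, 0.2, 1.6, 0.2]

def k_total(xp, xq):
    """Sum of the four CO2 kernel components, evaluated elementwise."""
    d2 = (xp - xq)**2
    k1 = t[1]**2 * np.exp(-d2 / (2 * t[2]**2))                              # long rising trend
    k2 = t[3]**2 * np.exp(-d2 / (2 * t[4]**2)
                          - 2 * np.sin(np.pi * (xp - xq))**2 / t[5]**2)     # quasi-periodic seasonal
    k3 = t[6]**2 * (1 + d2 / (2 * t[8] * t[7]**2)) ** (-t[8])               # medium-term irregularities
    k4 = t[9]**2 * np.exp(-d2 / (2 * t[10]**2)) + t[11]**2 * (xp == xq)     # correlated + i.i.d. noise
    return k1 + k2 + k3 + k4

# Build the full covariance matrix over monthly time stamps.
x = np.arange(1968, 2005, 1.0 / 12)
K = k_total(x[:, None], x[None, :])
```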
What is a kernel?

◮ Informally, k describes the similarity between pairs of data points. For example, far-away points may be considered less similar than nearby points. K_ij = ⟨φ(x_i), φ(x_j)⟩, and so tells us the overlap between the features (basis functions) φ(x_i) and φ(x_j).
◮ We have seen that all linear basis function models f(x) = w^T φ(x), with p(w) = N(0, Σ_w), correspond to Gaussian processes with kernel k(x, x') = φ(x)^T Σ_w φ(x')  (a numerical check follows below).
◮ We have also accumulated some experience with the RBF kernel k_RBF(x, x') = a² exp(−||x − x'||²/(2ℓ²)).
◮ The kernel controls the generalisation behaviour of a kernel machine. For example, a kernel controls the support and inductive biases of a Gaussian process – which functions are a priori likely.
◮ A kernel is also known as a covariance function or covariance kernel in the context of Gaussian processes.

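A quick numerical check of the second bullet, sketched under illustrative choices of basis functions φ and weight prior Σ_w: sampling w ∼ N(0, Σ_w) and forming f = Φw reproduces the covariance φ(x)^T Σ_w φ(x') up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Polynomial basis functions (an arbitrary illustrative choice)."""
    return np.stack([np.ones_like(x), x, x**2], axis=-1)    # shape (N, 3)

x = np.linspace(-2, 2, 5)
Phi = phi(x)
Sigma_w = np.diag([1.0, 0.5, 0.25])                          # prior covariance of the weights

# Implied kernel matrix: K_ij = phi(x_i)^T Sigma_w phi(x_j).
K_implied = Phi @ Sigma_w @ Phi.T

# Monte Carlo check: sample weights, form f = Phi w, compare the empirical covariance.
W = rng.multivariate_normal(np.zeros(3), Sigma_w, size=100000)
F = W @ Phi.T                                                # samples of (f(x_1), ..., f(x_N))
K_empirical = np.cov(F, rowvar=False)
print(np.max(np.abs(K_implied - K_empirical)))               # small, up to Monte Carlo error
```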
Candidate Kernel

  k(x, x') = 1 if ||x − x'|| ≤ 1, and 0 otherwise.

◮ Symmetric
◮ Provides information about proximity of points
◮ Exercise: Is it a valid kernel?

Candidate Kernel

  k(x, x') = 1 if ||x − x'|| ≤ 1, and 0 otherwise.

Try the points x_1 = 1, x_2 = 2, x_3 = 3. Compute the kernel matrix

  K =
  [ ?  ?  ? ]
  [ ?  ?  ? ]
  [ ?  ?  ? ]   (15)

Candidate Kernel

  k(x, x') = 1 if ||x − x'|| ≤ 1, and 0 otherwise.

Try the points x_1 = 1, x_2 = 2, x_3 = 3. Compute the kernel matrix

  K =
  [ 1  1  0 ]
  [ 1  1  1 ]
  [ 0  1  1 ]   (16)

The eigenvalues of K are 1 + √2, 1, and 1 − √2. Since 1 − √2 < 0, K is not positive semidefinite, so the candidate is not a valid kernel.

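A small NumPy check of this conclusion:

```python
import numpy as np

# Kernel matrix for x = 1, 2, 3 under the candidate kernel.
x = np.array([1.0, 2.0, 3.0])
K = (np.abs(x[:, None] - x[None, :]) <= 1).astype(float)

eigenvalues = np.linalg.eigvalsh(K)   # symmetric eigenvalue solver
print(eigenvalues)                    # approximately [-0.414, 1.0, 2.414]
# The negative eigenvalue (1 - sqrt(2)) shows K is not positive semidefinite,
# so the candidate is not a valid covariance function.
```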
Representer Theorem

A decision function f(x) can be written as

  f(x) = ⟨w, φ(x)⟩ = Σ_{i=1}^{N} α_i ⟨φ(x_i), φ(x)⟩ = Σ_{i=1}^{N} α_i k(x_i, x).   (17)

◮ The representer theorem says this function exists with finitely many coefficients α_i even when φ is infinite dimensional (an infinite number of basis functions).
◮ Initially viewed as a strength of kernel methods, for datasets not exceeding e.g. ten thousand points.
◮ Unfortunately, the number of nonzero α_i often grows linearly with the size of the training set N.
◮ Example: In GP regression, the predictive mean is

  E[f* | y, X, x*] = k*^T (K + σ²I)⁻¹ y = Σ_{i=1}^{N} α_i k(x_i, x*),   (18)

where α = (K + σ²I)⁻¹ y.

Making new kernels from old

Suppose k_1(x, x') and k_2(x, x') are valid. Then the following covariance functions are also valid:

  k(x, x') = g(x) k_1(x, x') g(x'),   (19)
  k(x, x') = q(k_1(x, x')),   (20)
  k(x, x') = exp(k_1(x, x')),   (21)
  k(x, x') = k_1(x, x') + k_2(x, x'),   (22)
  k(x, x') = k_1(x, x') k_2(x, x'),   (23)
  k(x, x') = k_3(φ(x), φ(x')),   (24)
  k(x, x') = x^T A x',   (25)
  k(x, x') = k_a(x_a, x'_a) + k_b(x_b, x'_b),   (26)
  k(x, x') = k_a(x_a, x'_a) k_b(x_b, x'_b),   (27)

where g is any function, q is a polynomial with nonnegative coefficients, φ(x) is a function from x to R^M, k_3 is a valid covariance function in R^M, A is a symmetric positive definite matrix, x_a and x_b are not necessarily disjoint variables with x = (x_a, x_b)^T, and k_a and k_b are valid kernels in their respective spaces.

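A numerical sanity check of a few of these closure rules, sketched with an RBF base kernel and randomly drawn inputs (illustrative choices): each constructed kernel matrix should have no (numerically) negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=30)

def rbf(x1, x2, ell):
    """A valid base kernel (RBF) on one-dimensional inputs."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ell**2)

def min_eig(K):
    """Smallest eigenvalue; nonnegative (up to rounding) for a valid kernel matrix."""
    return np.linalg.eigvalsh(K).min()

K1 = rbf(x, x, 1.0)
K2 = rbf(x, x, 0.3)
g = np.sin(x)

constructions = {
    "g(x) k1(x,x') g(x')   (19)": g[:, None] * K1 * g[None, :],
    "exp(k1)               (21)": np.exp(K1),
    "k1 + k2               (22)": K1 + K2,
    "k1 * k2               (23)": K1 * K2,
}
for name, K in constructions.items():
    print(name, min_eig(K) >= -1e-9)   # all True: each matrix is positive semidefinite
```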
Stationary Kernels

◮ A stationary kernel is invariant to translations of the input space. Equivalently, k = k(x − x') = k(τ).
◮ All distance kernels, k = k(||x − x'||), are examples of stationary kernels.
◮ The RBF kernel k_RBF(x, x') = a² exp(−||x − x'||²/(2ℓ²)) is a stationary kernel. The polynomial kernel k_POL(x, x') = (x^T x' + σ₀²)^p is an example of a non-stationary kernel (see the sketch below).
◮ Stationarity provides a useful inductive bias.

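A small numerical illustration of translation invariance, using one-dimensional inputs and illustrative hyperparameters: shifting both inputs leaves the RBF kernel unchanged but changes the polynomial kernel.

```python
import numpy as np

def k_rbf(x, xp, a=1.0, lengthscale=1.0):
    """Stationary: depends only on x - x'."""
    return a**2 * np.exp(-(x - xp)**2 / (2 * lengthscale**2))

def k_pol(x, xp, sigma0=1.0, p=2):
    """Non-stationary: depends on the inputs themselves, not just their difference."""
    return (x * xp + sigma0**2) ** p

x, xp, shift = 0.5, 2.0, 10.0
print(np.isclose(k_rbf(x, xp), k_rbf(x + shift, xp + shift)))   # True: invariant to translation
print(np.isclose(k_pol(x, xp), k_pol(x + shift, xp + shift)))   # False: not translation invariant
```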