Statistical Machine Learning
Lecture 13: Kernel Regression and Gaussian Processes
Kristian Kersting, TU Darmstadt, Summer Term 2020
Based on slides from J. Peters
Today’s Objectives
Make you understand how to use kernels for regression, both from a frequentist and a Bayesian point of view.
Covered topics: Why kernel methods? Radial basis function networks. What is a kernel? Dual representation. Gaussian process regression.
Outline
1. Kernel Methods for Regression
2. Gaussian Process Regression
3. Bayesian Learning and Hyperparameters
4. Wrap-Up
1. Kernel Methods for Regression
Why Kernels and not Neural Networks?
Multi-layer perceptrons use univariate projections to “span” the space of the data (like an “octopus”): y = g(w⊺x).
Why Kernels and not Neural Networks?
Pros: universal function approximation; large-range generalization (extrapolation); good for high-dimensional data.
Cons: hard to train; danger of interference.
Radial Basis Function Networks
Use spatially localized kernels for learning.
Note: there are other basis functions that are not spatially localized.
Radial Basis Function Networks
For instance, with Gaussian kernels
φ(x, c_k) = exp(−(1/2)(x − c_k)⊺ D (x − c_k))
with D positive definite.
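A minimal NumPy sketch of this basis function (the function names and the example choice of a diagonal D are illustrative, not from the slides):

```python
import numpy as np

def gaussian_rbf(x, c, D):
    """Gaussian RBF phi(x, c) = exp(-0.5 (x - c)^T D (x - c))."""
    d = x - c
    return np.exp(-0.5 * d @ D @ d)

def rbf_features(X, centers, D):
    """Feature matrix Phi with Phi[n, k] = phi(x_n, c_k)."""
    return np.array([[gaussian_rbf(x, c, D) for c in centers] for x in X])

# Example: 2-D inputs, 3 centers, diagonal metric D (an assumed setting)
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
centers = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0]])
D = np.diag([2.0, 2.0])
Phi = rbf_features(X, centers, D)   # shape (3, 3)
```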
Radial Basis Function Networks
The “output layer” is just a linear regression and often needs regularization (e.g., ridge regression):
J = (1/2)(t − y)⊺(t − y) = (1/2)(t − Φw)⊺(t − Φw),
where t = (t_1, …, t_n)⊺ and Φ is the n × m design matrix with entries φ_{nk} = φ(x_n, c_k).
The least-squares solution is w = (Φ⊺Φ)^{−1} Φ⊺ t.
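A short sketch of this output-layer fit; the optional ridge parameter is an assumption added to illustrate the regularized variant mentioned on the slide:

```python
import numpy as np

def fit_output_weights(Phi, t, ridge=0.0):
    """w = (Phi^T Phi + ridge * I)^(-1) Phi^T t.
    ridge=0.0 reproduces the plain least-squares solution from the slide;
    ridge>0.0 gives the regularized (ridge regression) variant."""
    m = Phi.shape[1]
    A = Phi.T @ Phi + ridge * np.eye(m)
    return np.linalg.solve(A, Phi.T @ t)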
Radial Basis Function Networks
The “input layer” can be optimized by gradient descent with respect to the distance metric and the centers of the RBFs:
∂J/∂c_k = (∂J/∂y)(∂y/∂φ_k)(∂φ_k/∂c_k) = −(t − y)⊺ w_k (∂φ_k/∂c_k),
∂J/∂D_k = (∂J/∂y)(∂y/∂φ_k)(∂φ_k/∂D_k) = −(t − y)⊺ w_k (∂φ_k/∂D_k),
where, for a single input x,
∂φ(x, c_k)/∂c_k = exp(−(1/2)(x − c_k)⊺ D_k (x − c_k)) D_k (x − c_k),
∂φ(x, c_k)/∂D_k = −(1/2) exp(−(1/2)(x − c_k)⊺ D_k (x − c_k)) (x − c_k)(x − c_k)⊺.
Gradient descent can make D non-positive-definite ⇒ use the Cholesky decomposition (parameterize D_k = L_k L_k⊺ and update L_k instead).
An iterative procedure is needed for optimization, i.e., alternate the update of w with the update of c_k and D_k.
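A rough sketch of one such gradient step, assuming for simplicity a single shared metric D = L L⊺ parameterized by its Cholesky factor L rather than a separate D_k per kernel; it reuses rbf_features from the sketch above, and all names and the learning rate are illustrative:

```python
import numpy as np

def rbf_gradient_step(X, t, centers, L, w, lr=1e-3):
    """One gradient-descent step on the centers and on the Cholesky factor L
    of the shared metric D = L L^T (keeps D positive semi-definite).
    Shapes: X (n, d), t (n,), centers (m, d), L (d, d), w (m,)."""
    D = L @ L.T
    Phi = rbf_features(X, centers, D)            # (n, m), from the earlier sketch
    err = t - Phi @ w                            # residuals t - y
    grad_c = np.zeros_like(centers)
    grad_D = np.zeros_like(D)
    for k, c in enumerate(centers):
        diff = X - c                             # (n, d)
        coeff = err * w[k] * Phi[:, k]           # (n,)
        # dJ/dc_k = -sum_n err_n w_k phi_k(x_n) D (x_n - c_k)
        grad_c[k] = -(coeff @ diff) @ D
        # dJ/dD  += 0.5 sum_n err_n w_k phi_k(x_n) (x_n - c_k)(x_n - c_k)^T
        grad_D += 0.5 * (diff.T * coeff) @ diff
    grad_L = (grad_D + grad_D.T) @ L             # chain rule through D = L L^T
    return centers - lr * grad_c, L - lr * grad_L
```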
Radial Basis Function Networks
Sensitivity to the kernel width (bandwidth, distance metric) of
φ(x, c_k) = exp(−(x − c_k)²/(2h)).
Radial Basis Function Networks
Sensitivity to the number of kernels and the metric of
φ(x, c_k) = exp(−(x − c_k)²/(2h)).
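A small 1-D experiment, with made-up data and parameter values, that reproduces the qualitative behaviour these slides illustrate (a very small bandwidth or many kernels fits the noise, a very large bandwidth over-smooths):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 50)
t = np.sin(x) + 0.1 * rng.standard_normal(x.shape)

for n_kernels in (5, 20):
    for h in (0.01, 0.5, 5.0):
        centers = np.linspace(0, 2 * np.pi, n_kernels)
        Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * h))
        w = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(n_kernels), Phi.T @ t)
        mse = np.mean((Phi @ w - t) ** 2)
        print(f"kernels={n_kernels:2d}  h={h:5.2f}  training MSE={mse:.4f}")
```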
Radial Basis Function Networks
Benefits of center and metric adaptation.
Radial Basis Function Networks
All adaptations turned on.
Note: RBFs tend to grow wider with a lot of overlap, and the learning rates are sensitive.
Radial Basis Function Networks - Summary
RBFs are a powerful and efficient learning tool. The number of RBFs and the optimization of the hyperparameters are important and a bit difficult to tune.
Theoretical remark: Poggio and Girosi (1990) showed that RBF networks arise naturally from minimizing the penalized cost function
J = (1/2) Σ_n (t_n − y(x_n))² + (1/(2γ)) ∫ |G(x)|² dx,
with, e.g., G(x) = ∂²y/∂x², a smoothness prior.
Kernel Methods in General
What is a kernel? The most intuitive approach, for a fixed nonlinear feature space: an inner product of feature vectors,
k(x, x′) = φ(x)⊺ φ(x′).
A kernel is symmetric: k(x, x′) = k(x′, x).
Examples:
Stationary kernels: k(x, x′) = k(x − x′)
Linear kernel: k(x, x′) = x⊺x′
Homogeneous kernels: k(x, x′) = k(‖x − x′‖)
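A quick numerical illustration of “kernel = inner product of feature vectors”, using the standard degree-2 polynomial kernel as an example (this particular kernel is not taken from the slides):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (x^T x')^2:
    all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(1)
x, xp = rng.standard_normal(3), rng.standard_normal(3)

k_direct = (x @ xp) ** 2                 # kernel evaluated directly
k_feature = phi(x) @ phi(xp)             # inner product in feature space
print(np.isclose(k_direct, k_feature))   # True
```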
Dual Representation of Linear Regression
The dual representation gives rise naturally to the kernel functions:
J(w) = (1/2) Σ_{n=1}^{N} (w⊺φ(x_n) − t_n)² + (λ/2) w⊺w,  where λ ≥ 0,
∂J(w)/∂w = Σ_{n=1}^{N} (w⊺φ(x_n) − t_n) φ(x_n) + λw = 0,
w = −(1/λ) Σ_{n=1}^{N} (w⊺φ(x_n) − t_n) φ(x_n) = Σ_{n=1}^{N} a_n φ(x_n) = Φ⊺a,
with a_n := −(1/λ)(w⊺φ(x_n) − t_n) and Φ ∈ R^{N×D} the design matrix with rows φ(x_n)⊺.
Thus, w is a linear combination of the φ(x_n). The dual representation focuses on solving for a, not for w.
Dual Representation of Linear Regression
Insert the dual representation into the cost function:
J(w) = (1/2) Σ_{n=1}^{N} (w⊺φ(x_n) − t_n)² + (λ/2) w⊺w
J(a) = (1/2) Σ_{n=1}^{N} (a⊺Φφ(x_n) − t_n)² + (λ/2) a⊺ΦΦ⊺a
     = (1/2) Σ_n a⊺Φφ(x_n)φ(x_n)⊺Φ⊺a − Σ_n a⊺Φφ(x_n) t_n + (1/2) Σ_n t_n² + (λ/2) a⊺ΦΦ⊺a
     = (1/2) a⊺ΦΦ⊺ΦΦ⊺a − a⊺ΦΦ⊺t + (1/2) t⊺t + (λ/2) a⊺ΦΦ⊺a
     = (1/2) a⊺KKa − a⊺Kt + (1/2) t⊺t + (λ/2) a⊺Ka
K = ΦΦ⊺ is the Gram matrix, with entries K_ij = φ(x_i)⊺φ(x_j) = k(x_i, x_j).
Dual Representation of Linear Regression
Solve the dual problem for a:
J(a) = (1/2) a⊺KKa − a⊺Kt + (1/2) t⊺t + (λ/2) a⊺Ka
∂J(a)/∂a = KKa − Kt + λKa = K(Ka − t + λa) = 0
a = (K + λI)^{−1} t
Side note: by definition of a kernel matrix, K is positive semi-definite, so K + λI is positive definite for λ > 0 and its inverse exists.
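A small numerical check, on made-up data, that this dual solution agrees with the primal ridge solution through w = Φ⊺a:

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.standard_normal((20, 5))   # N = 20 samples, D = 5 features
t = rng.standard_normal(20)
lam = 0.1

# Primal ridge weights: w = (Phi^T Phi + lam I)^(-1) Phi^T t
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(5), Phi.T @ t)

# Dual coefficients: a = (K + lam I)^(-1) t with K = Phi Phi^T
a = np.linalg.solve(Phi @ Phi.T + lam * np.eye(20), t)

print(np.allclose(w_primal, Phi.T @ a))   # True
```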
Dual Representation of Linear Regression
Compute the prediction as
y(x) = w⊺φ(x) = a⊺Φφ(x) = k(x)⊺(K + λI)^{−1} t,
where k(x) = [k(x, x_1) … k(x, x_N)]⊺.
All computations can be expressed in terms of the kernel function k.
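Putting the pieces together, a minimal kernel ridge regression sketch with a Gaussian kernel (the data, bandwidth, and regularization values are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(A, B, h=0.5):
    """Gaussian kernel matrix with entries exp(-||a - b||^2 / (2h))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h))

rng = np.random.default_rng(3)
X = rng.uniform(0, 2 * np.pi, size=(30, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
lam = 1e-2

K = rbf_kernel(X, X)                                   # Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(X)), t)       # a = (K + lam I)^(-1) t

X_test = np.linspace(0, 2 * np.pi, 100)[:, None]
y_test = rbf_kernel(X_test, X) @ a                     # y(x) = k(x)^T a
```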
Pros and Cons of the Dual Representation
Cons: Need to invert an N × N matrix.
Pros: Can work entirely in feature space with the help of kernels. Can even consider infinite feature spaces, as the kernel function only needs the inner product of feature vectors, which is a scalar even for infinite feature spaces. Many novel algorithms can be derived from the dual representation. Many old problems of RBFs (how many kernels, which metric, which centers) can be solved in a principled way.