
Statistical Machine Learning Lecture 13: Kernel Regression and Gaussian Processes



  1. Statistical Machine Learning Lecture 13: Kernel Regression and Gaussian Processes Kristian Kersting TU Darmstadt Summer Term 2020 K. Kersting based on Slides from J. Peters · Statistical Machine Learning · Summer Term 2020 1 / 71

  2. Today’s Objectives: Make you understand how to use kernels for regression, both from a frequentist and a Bayesian point of view. Covered topics: Why kernel methods? Radial basis function networks. What is a kernel? Dual representation. Gaussian Process Regression.

  3. Outline 1. Kernel Methods for Regression 2. Gaussian Process Regression 3. Bayesian Learning and Hyperparameters 4. Wrap-Up

  4. 1. Kernel Methods for Regression Outline 1. Kernel Methods for Regression 2. Gaussian Process Regression 3. Bayesian Learning and Hyperparameters 4. Wrap-Up

  5. 1. Kernel Methods for Regression Why Kernels and not Neural Networks? Multi-Layer Perceptrons use univariate projections to “span” the space of the data (like an “octopus”): $y = g(w^\top x)$.

  6. 1. Kernel Methods for Regression Why Kernels and not Neural Networks? Pros: universal function approximation; large-range generalization (extrapolation); good for high-dimensional data. Cons: hard to train; danger of interference.

  7. 1. Kernel Methods for Regression Radial Basis Function Networks Use spatially localized kernels for learning. Note: there are other basis functions that are not spatially localized.

  8. 1. Kernel Methods for Regression Radial Basis Function Networks For instance, with Gaussian kernels $\phi(x, c_k) = \exp\left(-\tfrac{1}{2}(x - c_k)^\top D (x - c_k)\right)$ with $D$ positive definite.
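A minimal NumPy sketch of this Gaussian RBF feature map might look as follows; the array shapes, the bandwidth $h = 0.1$, and the isotropic choice of $D$ are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def gaussian_rbf_features(X, centers, D):
    """Phi[n, k] = exp(-0.5 * (x_n - c_k)^T D (x_n - c_k))."""
    diff = X[:, None, :] - centers[None, :, :]          # (n, m, d) pairwise differences
    quad = np.einsum('nmd,de,nme->nm', diff, D, diff)   # quadratic form per (n, k) pair
    return np.exp(-0.5 * quad)

# Example (assumed data): 1-D inputs, 5 centers, isotropic metric D = I / h^2 with h = 0.1
X = np.linspace(0.0, 1.0, 50)[:, None]
centers = np.linspace(0.0, 1.0, 5)[:, None]
Phi = gaussian_rbf_features(X, centers, D=np.eye(1) / 0.1**2)
```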

  9. 1. Kernel Methods for Regression Radial Basis Function Networks The “output layer” is just a linear regression and often needs regularization (e.g., ridge regression): $J = \tfrac{1}{2}(t - y)^\top(t - y) = \tfrac{1}{2}(t - \Phi w)^\top(t - \Phi w)$ with $t = (t_1, \dots, t_n)^\top$ and $\Phi \in \mathbb{R}^{n \times m}$, $[\Phi]_{ij} = \phi_{ij}$. The least-squares solution is $w = (\Phi^\top \Phi)^{-1}\Phi^\top t$.
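Given such a feature matrix Phi, the output-layer fit could be sketched as below; the small ridge term is an added assumption for numerical stability, in the spirit of the regularization remark on the slide.

```python
import numpy as np

def fit_output_weights(Phi, t, ridge=0.0):
    """w = (Phi^T Phi + ridge * I)^{-1} Phi^T t; ridge=0 gives the plain least-squares solution."""
    m = Phi.shape[1]
    A = Phi.T @ Phi + ridge * np.eye(m)
    return np.linalg.solve(A, Phi.T @ t)

# The RBF-network prediction is then y = Phi @ w
```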

  10. 1. Kernel Methods for Regression Radial Basis Function Networks The “input layer” can be optimized by gradient descent with respect to the distance metric and the centers of the RBFs: $\frac{\partial J}{\partial c_k} = -(t - y)^\top w_k \frac{\partial \Phi_k}{\partial c_k}$ and $\frac{\partial J}{\partial D_k} = -(t - y)^\top w_k \frac{\partial \Phi_k}{\partial D_k}$, with $\frac{\partial \phi}{\partial c_k} = \exp\left(-\tfrac{1}{2}(x - c_k)^\top D_k (x - c_k)\right) D_k (x - c_k)$ and $\frac{\partial \phi}{\partial D_k} = -\tfrac{1}{2}\exp\left(-\tfrac{1}{2}(x - c_k)^\top D_k (x - c_k)\right)(x - c_k)(x - c_k)^\top$. Gradient descent can make $D$ non positive definite $\Rightarrow$ use a Cholesky decomposition (optimize a factor $L$ with $D = LL^\top$ to keep $D$ positive definite). An iterative procedure is needed for the optimization, i.e., alternate the update of $w$ with the update of $c_k$ and $D_k$.
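The alternation between the closed-form update of w and gradient steps on the centers could be sketched as below. The learning rate, the iteration count, and the reuse of X, t, D and the helpers from the previous sketches are assumptions; the metric update via a Cholesky factor is omitted for brevity.

```python
import numpy as np

def center_gradient(X, t, centers, w, D, Phi):
    """dJ/dc_k = -sum_n (t_n - y_n) * w_k * Phi[n, k] * D (x_n - c_k)."""
    resid = t - Phi @ w                                  # (n,) residuals t - y
    coeff = resid[:, None] * w[None, :] * Phi            # (n, m) scalar weights per (n, k)
    diff = X[:, None, :] - centers[None, :, :]           # (n, m, d) differences x_n - c_k
    return -np.einsum('nm,de,nme->md', coeff, D, diff)   # (m, d), one row per center

# Alternating optimization (sketch): closed-form output weights, then a gradient step on the centers.
# Assumes X, t, D and the gaussian_rbf_features / fit_output_weights helpers from the sketches above.
for _ in range(100):
    Phi = gaussian_rbf_features(X, centers, D)
    w = fit_output_weights(Phi, t, ridge=1e-6)
    centers = centers - 1e-2 * center_gradient(X, t, centers, w, D, Phi)
```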

  11. 1. Kernel Methods for Regression Radial Basis Function Networks Sensitivity to the kernel width (bandwidth, distance metric) of $\phi(x, c_k) = \exp\left(-\frac{(x - c_k)^2}{2h}\right)$.

  12. 1. Kernel Methods for Regression Radial Basis Function Networks Sensitivity to the number of kernels and the metric of $\phi(x, c_k) = \exp\left(-\frac{(x - c_k)^2}{2h}\right)$.

  13. 1. Kernel Methods for Regression Radial Basis Function Networks Benefits of center and metric adaptation.

  14. 1. Kernel Methods for Regression Radial Basis Function Networks All adaptations turned on. Note: RBFs tend to grow wider with a lot of overlap, and the learning rates are sensitive.

  15. 1. Kernel Methods for Regression Radial Basis Function Networks - Summary RBFs are a powerful and efficient learning tool, but the number of RBFs and the hyperparameters are important and a bit difficult to tune. Theoretical remark: Poggio and Girosi (1990) showed that RBF networks arise naturally from minimizing the penalized cost function $J = \frac{1}{2}\sum_{n}(t_n - y(x_n))^2 + \frac{1}{2\gamma}\int |G(x)|^2\, dx$ with, e.g., $G(x) = \frac{\partial^2 y}{\partial x^2}$, a smoothness prior.

  16. 1. Kernel Methods for Regression Kernel Methods in General What is a kernel? The most intuitive view, for a fixed nonlinear feature space, is an inner product of feature vectors: $k(x, x') = \phi(x)^\top \phi(x')$. A kernel is symmetric: $k(x, x') = k(x', x)$. Examples: stationary kernels $k(x, x') = k(x - x')$; the linear kernel $k(x, x') = x^\top x'$; homogeneous kernels $k(x, x') = k(\|x - x'\|)$.
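For illustration, two such kernels written out in NumPy; the squared-exponential kernel and its bandwidth h are an added example, not taken from the slide.

```python
import numpy as np

def linear_kernel(x, x_prime):
    return x @ x_prime                                   # k(x, x') = x^T x'

def squared_exponential_kernel(x, x_prime, h=1.0):
    # stationary and homogeneous: depends only on ||x - x'||
    return np.exp(-0.5 * np.sum((x - x_prime) ** 2) / h**2)

# Symmetry check: k(x, x') == k(x', x)
x, x_prime = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert np.isclose(linear_kernel(x, x_prime), linear_kernel(x_prime, x))
```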

  17. 1. Kernel Methods for Regression Dual Representation of Linear Regression The dual representation naturally gives rise to the kernel functions. Start from $J(w) = \tfrac{1}{2}\sum_{n=1}^{N}(w^\top \phi(x_n) - t_n)^2 + \tfrac{\lambda}{2} w^\top w$, where $\lambda \geq 0$. Setting the gradient to zero, $\frac{\partial J(w)}{\partial w} = \sum_{n=1}^{N}(w^\top \phi(x_n) - t_n)\phi(x_n) + \lambda w = 0$, gives $w = -\tfrac{1}{\lambda}\sum_{n=1}^{N}(w^\top \phi(x_n) - t_n)\phi(x_n) = \sum_{n=1}^{N} a_n \phi(x_n) = \Phi^\top a$, where $\Phi = [\phi(x_1), \dots, \phi(x_N)]^\top \in \mathbb{R}^{N \times D}$. Thus, $w$ is a linear combination of the $\phi(x_n)$. The dual representation focuses on solving for $a$, not $w$.

  18. 1. Kernel Methods for Regression Dual Representation of Linear Regression Insert the dual representation into the cost function: $J(w) = \tfrac{1}{2}\sum_{n=1}^{N}(w^\top \phi(x_n) - t_n)^2 + \tfrac{\lambda}{2}w^\top w$ becomes $J(a) = \tfrac{1}{2}\sum_{n=1}^{N}(a^\top\Phi\phi(x_n) - t_n)^2 + \tfrac{\lambda}{2}a^\top\Phi\Phi^\top a = \tfrac{1}{2}a^\top\Phi\Phi^\top\Phi\Phi^\top a - a^\top\Phi\Phi^\top t + \tfrac{1}{2}t^\top t + \tfrac{\lambda}{2}a^\top\Phi\Phi^\top a = \tfrac{1}{2}a^\top K K a - a^\top K t + \tfrac{1}{2}t^\top t + \tfrac{\lambda}{2}a^\top K a$, where $K = \Phi\Phi^\top$ is the Gram matrix with $K_{ij} = \phi(x_i)^\top\phi(x_j) = k(x_i, x_j)$.
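A minimal sketch of building the Gram matrix K from an arbitrary kernel function; the double loop is for clarity, not efficiency.

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = k(x_i, x_j) for the rows x_i of X."""
    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j])
    return K
```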

  19. 1. Kernel Methods for Regression Dual Representation of Linear Regression Solve the dual problem for $a$: from $J(a) = \tfrac{1}{2}a^\top K K a - a^\top K t + \tfrac{1}{2}t^\top t + \tfrac{\lambda}{2}a^\top K a$ we get $\frac{\partial J(a)}{\partial a} = KKa - Kt + \lambda K a = K(Ka - t + \lambda a) = 0$, hence $a = (K + \lambda I)^{-1} t$. Side note: since a kernel matrix $K$ is by definition positive semi-definite, $K + \lambda I$ is positive definite for $\lambda > 0$, so the inverse exists.
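In code, the dual solution is a single linear solve; a sketch, assuming K and t are given as above.

```python
import numpy as np

def dual_coefficients(K, t, lam):
    """a = (K + lambda * I)^{-1} t, solved as a linear system rather than by explicit inversion."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), t)
```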

  20. 1. Kernel Methods for Regression Dual Representation of Linear Regression Compute the prediction as $y(x) = w^\top \phi(x) = a^\top \Phi\phi(x) = k(x)^\top (K + \lambda I)^{-1} t$, where $k(x) = [k(x, x_1), \dots, k(x, x_N)]^\top$. All computations can be expressed in terms of the kernel function $k$.
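Putting the pieces together, a minimal sketch of the resulting prediction; it reuses the gram_matrix helper from the sketch above, and the kernel and the regularizer lam are user-supplied assumptions.

```python
import numpy as np

def kernel_regression_predict(x_new, X_train, t, kernel, lam=1e-3):
    """y(x) = k(x)^T (K + lambda I)^{-1} t, using only kernel evaluations."""
    K = gram_matrix(X_train, kernel)                          # Gram matrix from the earlier sketch
    a = np.linalg.solve(K + lam * np.eye(len(t)), t)          # dual coefficients
    k_x = np.array([kernel(x_new, x_n) for x_n in X_train])   # k(x) = [k(x, x_1), ..., k(x, x_N)]
    return k_x @ a
```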

  21. 1. Kernel Methods for Regression Pros and Cons of the Dual Representation Cons: Need to invert an $N \times N$ matrix. Pros: Can work entirely in feature space with the help of kernels. Can even consider infinite feature spaces, since the kernel function only involves inner products of feature vectors, which are scalars even for infinite feature spaces. Many novel algorithms can be derived from the dual representation. Many old problems of RBFs (how many kernels, which metric, which centers) can be solved in a principled way.
