

1. A Neural Network View of Kernel Methods
Shuiwang Ji
Department of Computer Science & Engineering, Texas A&M University

2. Linear Models
1. In a binary classification problem, we have training data $\{\tilde{\mathbf{x}}_i, y_i\}_{i=1}^m$, where $\tilde{\mathbf{x}}_i \in \mathbb{R}^{n-1}$ represents the input feature vector and $y_i \in \{-1, 1\}$ is the corresponding label.
2. In logistic regression, for each sample $\tilde{\mathbf{x}}_i$, a linear classifier, parameterized by $\tilde{\mathbf{w}} \in \mathbb{R}^{n-1}$ and $b \in \mathbb{R}$, computes the classification score as
$$h(\tilde{\mathbf{x}}_i) = \sigma(\tilde{\mathbf{w}}^T \tilde{\mathbf{x}}_i + b) = \frac{1}{1 + \exp[-(\tilde{\mathbf{w}}^T \tilde{\mathbf{x}}_i + b)]}, \tag{1}$$
where $\sigma(\cdot)$ is the sigmoid function.
3. Note that the classification score $h(\tilde{\mathbf{x}}_i)$ can be interpreted as the probability of $\tilde{\mathbf{x}}_i$ having label 1.
4. We let $\mathbf{x}_i = \begin{bmatrix} 1 \\ \tilde{\mathbf{x}}_i \end{bmatrix} \in \mathbb{R}^n$, $i = 1, 2, \ldots, m$, and $\mathbf{w} = \begin{bmatrix} b \\ \tilde{\mathbf{w}} \end{bmatrix} \in \mathbb{R}^n$. Then we can re-write Eqn. (1) as
$$h(\mathbf{x}_i) = \sigma(\mathbf{w}^T \mathbf{x}_i) = \frac{1}{1 + \exp[-\mathbf{w}^T \mathbf{x}_i]}. \tag{2}$$
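A minimal NumPy sketch of Eqns. (1) and (2), assuming a small random sample and arbitrary parameters (all names here are illustrative): it checks that absorbing the bias $b$ into an augmented weight vector gives the same classification score.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x_tilde = rng.normal(size=3)          # input feature vector in R^(n-1), here n-1 = 3
w_tilde = rng.normal(size=3)          # weights of the linear classifier
b = 0.5                               # bias term

# Eqn. (1): score with an explicit bias.
score_1 = sigmoid(w_tilde @ x_tilde + b)

# Eqn. (2): prepend a constant 1 to x and absorb b into w.
x = np.concatenate(([1.0], x_tilde))  # x in R^n
w = np.concatenate(([b], w_tilde))    # w in R^n
score_2 = sigmoid(w @ x)

assert np.isclose(score_1, score_2)   # the two formulations agree
```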

3. Linearly Inseparable Data
1. If the training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$ are linearly separable, there exists a $\mathbf{w}^* \in \mathbb{R}^n$ such that $y_i \mathbf{w}^{*T} \mathbf{x}_i \geq 0$, $i = 1, 2, \ldots, m$. In this case, a linear model like logistic regression can perfectly fit the original training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$.
2. However, this is not possible for linearly inseparable cases.
[Figure: a one-dimensional, linearly inseparable dataset on $[-1, 1]$.]

4. Feature Mapping
1. A typical method to handle such linearly inseparable cases is feature mapping. That is, instead of using the original $\{\mathbf{x}_i\}_{i=1}^m$, we use a feature mapping function $\phi: \mathbb{R}^n \to \mathbb{R}^N$ on $\{\mathbf{x}_i\}_{i=1}^m$, so that the mapped feature vectors $\{\phi(\mathbf{x}_i)\}_{i=1}^m$ are linearly separable.
2. For example, we can map the linearly inseparable data with
$$\phi(\mathbf{x}) = \phi\left(\begin{bmatrix} 1 \\ \tilde{x} \end{bmatrix}\right) = \begin{bmatrix} 1 \\ \tilde{x} \\ \tilde{x}^2 \end{bmatrix}. \tag{3}$$
[Figure: the data plotted against $\tilde{x}$ and $\tilde{x}^2$, where the two classes become linearly separable.]
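A short sketch of the mapping in Eqn. (3), assuming an illustrative one-dimensional dataset where the positive class lies between $-0.5$ and $0.5$ (this dataset is not from the slides): after appending $\tilde{x}^2$, the two classes can be separated by a hyperplane.

```python
import numpy as np

rng = np.random.default_rng(0)
x_tilde = rng.uniform(-1.0, 1.0, size=20)          # 1D inputs in [-1, 1]
y = np.where(np.abs(x_tilde) < 0.5, 1, -1)         # inseparable in 1D: label depends on |x|

def phi(x_tilde):
    """Feature map of Eqn. (3): [1, x, x^2]^T for a scalar input."""
    return np.array([1.0, x_tilde, x_tilde ** 2])

features = np.stack([phi(v) for v in x_tilde])     # mapped vectors in R^3

# In the mapped space the classes are separable, e.g. by w = (0.25, 0, -1):
# w^T phi(x) = 0.25 - x^2 > 0 exactly when |x| < 0.5.
w = np.array([0.25, 0.0, -1.0])
assert np.all(y * (features @ w) >= 0)             # y_i * w^T phi(x_i) >= 0 for all i
```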

5. Logistic Regression with Feature Mapping
In the context of logistic regression, the whole process can be described as
$$h(\mathbf{x}_i) = \sigma(\mathbf{w}^T \phi(\mathbf{x}_i)) = \frac{1}{1 + \exp[-\mathbf{w}^T \phi(\mathbf{x}_i)]}, \tag{4}$$
where the dimension of the parameter vector $\mathbf{w}$ becomes $N$ accordingly.
[Figure: the input $\mathbf{x} \in \mathbb{R}^n$ is mapped to $\phi(\mathbf{x}) \in \mathbb{R}^N$, which is then fed into the linear model $h$.]

6. Computation of Feature Mapping
1. In order to achieve strong enough representation power, it is common in practice that $\phi(\mathbf{x})$ has a much higher dimension than $\mathbf{x}$, i.e., $N \gg n$.
2. However, this dramatically increases the cost of computing either $\phi(\mathbf{x})$ or $\mathbf{w}^T \phi(\mathbf{x})$.
3. In the following, we introduce an efficient way to implicitly compute $\mathbf{w}^T \phi(\mathbf{x})$.
4. Specifically, we use the representer theorem to show that computing $\mathbf{w}^T \phi(\mathbf{x})$ can be transformed into computing $\sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x})$, where $\{\alpha_i\}_{i=1}^m$ are learnable parameters.
5. Then we introduce kernel methods, which significantly reduce the cost of computing $\phi(\mathbf{x}_i)^T \phi(\mathbf{x})$.
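A small numerical sketch of the identity behind this plan, assuming random stand-in feature vectors and coefficients (illustrative only): if $\mathbf{w} = \sum_i \alpha_i \phi(\mathbf{x}_i)$, then $\mathbf{w}^T \phi(\mathbf{x})$ equals $\sum_i \alpha_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x})$, so the score can be computed from inner products alone.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 5, 100                       # m training samples, N-dimensional feature space
Phi = rng.normal(size=(m, N))       # rows are phi(x_i), stand-ins for mapped training points
phi_x = rng.normal(size=N)          # phi(x) for a new input
alpha = rng.normal(size=m)          # learnable coefficients alpha_i

w = Phi.T @ alpha                   # w = sum_i alpha_i phi(x_i)

score_explicit = w @ phi_x                      # w^T phi(x), needs the N-dimensional vectors
score_implicit = alpha @ (Phi @ phi_x)          # sum_i alpha_i phi(x_i)^T phi(x)

assert np.isclose(score_explicit, score_implicit)
```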

7. Summary of Models with Feature Mapping
Given the training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$ ($\mathbf{x}_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$) and a feature mapping $\phi: \mathbb{R}^n \to \mathbb{R}^N$, to solve a supervised learning task (regression or classification), we need to do the following steps (sketched in code below):
1. Compute the feature vectors $\{\phi(\mathbf{x}_i)\}_{i=1}^m$ of all training samples;
2. Initialize a linear model with a parameter vector $\mathbf{w} \in \mathbb{R}^N$;
3. Minimize the task-specific loss function $L$ on $\{\phi(\mathbf{x}_i), y_i\}_{i=1}^m$ with respect to $\mathbf{w}$.
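A compact sketch of these three steps for logistic regression, assuming the one-dimensional toy dataset from the earlier sketch and the map $\phi$ of Eqn. (3); the gradient-descent settings and variable names are illustrative, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x_tilde = rng.uniform(-1.0, 1.0, size=40)
y = np.where(np.abs(x_tilde) < 0.5, 1, -1)       # labels in {-1, +1}

# Step 1: compute the feature vectors phi(x_i) of all training samples.
Phi = np.stack([np.array([1.0, v, v ** 2]) for v in x_tilde])   # shape (m, N), N = 3

# Step 2: initialize a linear model with parameter vector w in R^N.
w = np.zeros(Phi.shape[1])

# Step 3: minimize the task-specific loss (here the logistic loss) w.r.t. w
# by plain gradient descent.
lr = 0.5
for _ in range(2000):
    z = Phi @ w                                  # scores w^T phi(x_i)
    grad = -(Phi.T @ (y * (1.0 / (1.0 + np.exp(y * z))))) / len(y)
    w -= lr * grad

accuracy = np.mean(np.sign(Phi @ w) == y)
print(f"training accuracy: {accuracy:.2f}")
```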

8. Regularization
1. Since the loss $L$ is a function of $z = \mathbf{w}^T \phi(\mathbf{x})$ and $y$, we can write it as $L(\mathbf{w}^T \phi(\mathbf{x}), y)$. Minimizing $L(\mathbf{w}^T \phi(\mathbf{x}), y)$ on $\{\phi(\mathbf{x}_i), y_i\}_{i=1}^m$ is an optimization problem:
$$\min_{\mathbf{w}} \; \frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i). \tag{5}$$
2. However, in many situations, minimizing $L$ alone may cause over-fitting. A common method to address over-fitting is to apply $\ell_2$-regularization, changing Equation (5) into Equation (6):
$$\min_{\mathbf{w}} \; \frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2, \quad \lambda \geq 0, \tag{6}$$
where $\lambda \geq 0$ is a hyper-parameter, known as the regularization parameter, controlling the extent to which we penalize large $\ell_2$-norms of $\mathbf{w}$.
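A brief sketch of evaluating the regularized objective in Eqn. (6) for the logistic loss; the mapped features, labels, parameters, and the value of $\lambda$ below are random illustrative stand-ins.

```python
import numpy as np

def regularized_objective(w, Phi, y, lam):
    """Eqn. (6): average logistic loss plus (lambda / 2) * ||w||_2^2."""
    z = Phi @ w                                   # scores w^T phi(x_i)
    data_loss = np.mean(np.log1p(np.exp(-y * z))) # logistic loss L(w^T phi(x_i), y_i)
    return data_loss + 0.5 * lam * np.dot(w, w)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 3))        # stand-in mapped features phi(x_i)
y = rng.choice([-1, 1], size=10)      # stand-in labels
w = rng.normal(size=3)                # stand-in parameter vector
print(regularized_objective(w, Phi, y, lam=0.1))   # lambda value is arbitrary
```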

9. Representer Theorem
$$\min_{\mathbf{w}} \; \frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2, \quad \lambda \geq 0. \tag{7}$$
In order to derive a solution to this optimization problem, we introduce the following theorem, which is a special case of the well-known Representer Theorem. The Representer Theorem is the theoretical foundation of kernel methods.
Theorem. If the optimization problem in Equation (6) (copied above) has optimal solutions, there must exist an optimal solution of the form $\mathbf{w}^* = \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)$.

10. Proof I
Since the elements of $\{\phi(\mathbf{x}_i)\}_{i=1}^m$ are all in $\mathbb{R}^N$, they span a subspace $V \subseteq \mathbb{R}^N$ such that $V = \{\mathbf{x} : \mathbf{x} = \sum_{i=1}^m c_i \phi(\mathbf{x}_i)\}$. Assume $\{\mathbf{v}_1, \ldots, \mathbf{v}_{n'}\}$ is an orthonormal basis of $V$, where $n' \leq N$. $V$ also has an orthogonal complement subspace $V^\perp$, which has an orthonormal basis $\{\mathbf{u}_1, \ldots, \mathbf{u}_{N-n'}\}$. Clearly, $\mathbf{v}_k^T \mathbf{u}_j = 0$ for any $1 \leq k \leq n'$ and $1 \leq j \leq N - n'$.
For an arbitrary vector $\mathbf{w} \in \mathbb{R}^N$, we can decompose it into a linear combination of the orthonormal basis vectors of the subspaces $V$ and $V^\perp$. That is, we can write $\mathbf{w}$ as
$$\mathbf{w} = \mathbf{w}_V + \mathbf{w}_{V^\perp} = \sum_{k=1}^{n'} s_k \mathbf{v}_k + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j. \tag{8}$$

11. Proof II
First, we can show that
$$\|\mathbf{w}\|_2^2 = \left\| \sum_{k=1}^{n'} s_k \mathbf{v}_k + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j \right\|_2^2 = \left( \sum_{k=1}^{n'} s_k \mathbf{v}_k^T + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j^T \right) \left( \sum_{k=1}^{n'} s_k \mathbf{v}_k + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j \right)$$
$$= \sum_{k=1}^{n'} s_k^2 \|\mathbf{v}_k\|_2^2 + \sum_{j=1}^{N-n'} t_j^2 \|\mathbf{u}_j\|_2^2 + 2 \sum_{k=1}^{n'} \sum_{j=1}^{N-n'} s_k t_j \mathbf{v}_k^T \mathbf{u}_j = \sum_{k=1}^{n'} s_k^2 \|\mathbf{v}_k\|_2^2 + \sum_{j=1}^{N-n'} t_j^2 \|\mathbf{u}_j\|_2^2$$
$$\geq \sum_{k=1}^{n'} s_k^2 \|\mathbf{v}_k\|_2^2 = \|\mathbf{w}_V\|_2^2. \tag{9}$$

12. Proof III
Second, because each $\phi(\mathbf{x}_i)$, $1 \leq i \leq m$, is a vector in $V$ and $\{\mathbf{v}_1, \ldots, \mathbf{v}_{n'}\}$ is an orthonormal basis of $V$, we have $\phi(\mathbf{x}_i) = \sum_{k=1}^{n'} \beta_{ik} \mathbf{v}_k$. This leads to the following equalities:
$$\mathbf{w}^T \phi(\mathbf{x}_i) = \left( \sum_{k=1}^{n'} s_k \mathbf{v}_k^T + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j^T \right) \left( \sum_{k=1}^{n'} \beta_{ik} \mathbf{v}_k \right) = \sum_{k=1}^{n'} s_k \beta_{ik} \|\mathbf{v}_k\|_2^2 = \mathbf{w}_V^T \phi(\mathbf{x}_i). \tag{10}$$

13. Proof IV
Based on the results in Eqn. (9) and Eqn. (10), and the fact that $\mathbf{w}_V$ is a vector in $V = \{\mathbf{x} : \mathbf{x} = \sum_{i=1}^m c_i \phi(\mathbf{x}_i)\}$, we can derive that, for an arbitrary $\mathbf{w}$, there always exists a $\mathbf{w}_V = \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)$ satisfying
$$\frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2 \;\geq\; \frac{1}{m} \sum_{i=1}^m L(\mathbf{w}_V^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}_V\|_2^2. \tag{11}$$
In other words, if a vector $\mathbf{w}^*$ minimizes $\frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2$, the corresponding $\mathbf{w}^*_V$ must also minimize it, and there exist some $\{\alpha_i\}_{i=1}^m$ such that $\mathbf{w}^*_V = \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)$.
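A numerical check of the two key facts in this proof, assuming random stand-in feature vectors and an arbitrary $\mathbf{w}$ (all values illustrative): projecting $\mathbf{w}$ onto $V = \mathrm{span}\{\phi(\mathbf{x}_i)\}$ leaves every score $\mathbf{w}^T \phi(\mathbf{x}_i)$ unchanged (Eqn. (10)) while never increasing the norm (Eqn. (9)), so the objective in Eqn. (11) can only decrease.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 4, 10
Phi = rng.normal(size=(m, N))        # rows are phi(x_i); they span a subspace V of R^N
w = rng.normal(size=N)               # an arbitrary parameter vector

# Orthonormal basis of V via the thin QR decomposition of Phi^T.
Q, _ = np.linalg.qr(Phi.T)           # columns of Q: orthonormal basis {v_1, ..., v_n'}
w_V = Q @ (Q.T @ w)                  # projection of w onto V, i.e. w_V in Eqn. (8)

# Eqn. (10): the scores on the training features are unchanged.
assert np.allclose(Phi @ w, Phi @ w_V)

# Eqn. (9): the projection has no larger l2-norm.
assert np.linalg.norm(w_V) <= np.linalg.norm(w) + 1e-12
```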

14. Use of the Representer Theorem in Training
1. According to Theorem 1, we only need to consider $\mathbf{w} \in \{\sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)\}$ when solving the optimization problem in Equation (6).
2. Therefore, we obtain the following transformed optimization problem by replacing $\mathbf{w}$ in Equation (6) with $\sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)$:
$$\min_{\alpha_1, \ldots, \alpha_m} \; \frac{1}{m} \sum_{j=1}^m L\left( \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j), \, y_j \right) + \frac{\lambda}{2} \sum_{j=1}^m \sum_{i=1}^m \alpha_i \alpha_j \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j), \quad \lambda \geq 0. \tag{12}$$
3. As a result, if we know $\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$ for all $1 \leq i, j \leq m$, we can compute the optimization objective in Equation (12) without explicitly knowing $\{\phi(\mathbf{x}_i)\}_{i=1}^m$.
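A minimal sketch of evaluating the objective in Equation (12) from a precomputed Gram matrix $K$ with $K_{ij} = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$, assuming the logistic loss; the data and $\lambda$ are illustrative stand-ins.

```python
import numpy as np

def kernelized_objective(alpha, K, y, lam):
    """Eqn. (12): the objective written purely in terms of K_ij = phi(x_i)^T phi(x_j)."""
    z = K @ alpha                                   # z_j = sum_i alpha_i phi(x_i)^T phi(x_j)
    data_loss = np.mean(np.log1p(np.exp(-y * z)))   # logistic loss L(z_j, y_j)
    reg = 0.5 * lam * (alpha @ K @ alpha)           # (lambda/2) * sum_{i,j} alpha_i alpha_j K_ij
    return data_loss + reg

rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 50))       # stand-in feature vectors phi(x_i)
K = Phi @ Phi.T                      # Gram matrix; only inner products are needed from here on
y = rng.choice([-1, 1], size=6)
alpha = rng.normal(size=6)
print(kernelized_objective(alpha, K, y, lam=0.1))
```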

15. Use of the Representer Theorem in Prediction
1. In addition, the output of the linear model with parameter $\mathbf{w}$ for any input $\phi(\mathbf{x})$ only depends on $\mathbf{w}^T \phi(\mathbf{x})$, i.e., $\sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x})$.
2. So for any unseen $\mathbf{x}$ that is not in the training set, we can make predictions directly, without computing $\phi(\mathbf{x})$ first, as long as we know $\phi(\mathbf{x}_i)^T \phi(\mathbf{x})$ for all $1 \leq i \leq m$.
3. In summary, in both training and prediction, what we really need is the inner product of two feature vectors, not the feature vectors themselves.
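A matching sketch for prediction, assuming a coefficient vector `alpha` that has already been learned and the inner product of the map from Eqn. (3), here written directly as $k(a, b) = 1 + ab + a^2 b^2$ (the data and coefficients are illustrative).

```python
import numpy as np

def phi(x_tilde):
    """Feature map of Eqn. (3) for a scalar input."""
    return np.array([1.0, x_tilde, x_tilde ** 2])

def kernel(a, b):
    """k(a, b) = phi(a)^T phi(b) = 1 + a*b + a^2*b^2, computed without forming phi."""
    return 1.0 + a * b + (a ** 2) * (b ** 2)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=5)     # training inputs (illustrative)
alpha = rng.normal(size=5)                   # coefficients, assumed already learned
x_new = 0.3                                  # an unseen input

# Prediction score via inner products only: sum_i alpha_i * k(x_i, x_new).
score_kernel = sum(a * kernel(xi, x_new) for a, xi in zip(alpha, x_train))

# Sanity check against the explicit computation w^T phi(x_new), with w = sum_i alpha_i phi(x_i).
w = sum(a * phi(xi) for a, xi in zip(alpha, x_train))
assert np.isclose(score_kernel, w @ phi(x_new))
```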
