

1. A Neural Network View of Kernel Methods
Shuiwang Ji
Department of Computer Science & Engineering, Texas A&M University

2. Linear Models
1. In a binary classification problem, we have training data $\{\tilde{\mathbf{x}}_i, y_i\}_{i=1}^m$, where $\tilde{\mathbf{x}}_i \in \mathbb{R}^{n-1}$ represents the input feature vector and $y_i \in \{-1, 1\}$ is the corresponding label.
2. In logistic regression, for each sample $\tilde{\mathbf{x}}_i$, a linear classifier, parameterized by $\tilde{\mathbf{w}} \in \mathbb{R}^{n-1}$ and $b \in \mathbb{R}$, computes the classification score as
$$h(\tilde{\mathbf{x}}_i) = \sigma(\tilde{\mathbf{w}}^T \tilde{\mathbf{x}}_i + b) = \frac{1}{1 + \exp[-(\tilde{\mathbf{w}}^T \tilde{\mathbf{x}}_i + b)]}, \tag{1}$$
where $\sigma(\cdot)$ is the sigmoid function.
3. Note that the classification score $h(\tilde{\mathbf{x}}_i)$ can be interpreted as the probability of $\tilde{\mathbf{x}}_i$ having label 1.
4. We let $\mathbf{x}_i = \begin{bmatrix} 1 \\ \tilde{\mathbf{x}}_i \end{bmatrix} \in \mathbb{R}^n$, $i = 1, 2, \ldots, m$, and $\mathbf{w} = \begin{bmatrix} b \\ \tilde{\mathbf{w}} \end{bmatrix} \in \mathbb{R}^n$. Then we can re-write Eqn. (1) as
$$h(\mathbf{x}_i) = \sigma(\mathbf{w}^T \mathbf{x}_i) = \frac{1}{1 + \exp[-\mathbf{w}^T \mathbf{x}_i]}. \tag{2}$$
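A minimal NumPy sketch of Eqns. (1) and (2), assuming a small random sample and arbitrary parameters (all names here are illustrative): it checks that absorbing the bias $b$ into an augmented weight vector gives the same classification score.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x_tilde = rng.normal(size=3)          # input feature vector in R^(n-1), here n-1 = 3
w_tilde = rng.normal(size=3)          # weights of the linear classifier
b = 0.5                               # bias term

# Eqn. (1): score with an explicit bias.
score_1 = sigmoid(w_tilde @ x_tilde + b)

# Eqn. (2): prepend a constant 1 to x and absorb b into w.
x = np.concatenate(([1.0], x_tilde))  # x in R^n
w = np.concatenate(([b], w_tilde))    # w in R^n
score_2 = sigmoid(w @ x)

assert np.isclose(score_1, score_2)   # the two formulations agree
```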

3. Linearly Inseparable Data
1. If the training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$ are linearly separable, there exists a $\mathbf{w}^* \in \mathbb{R}^n$ such that $y_i \mathbf{w}^{*T} \mathbf{x}_i \geq 0$, $i = 1, 2, \ldots, m$. In this case, a linear model like logistic regression can perfectly fit the original training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$.
2. However, this is not possible for linearly inseparable cases.
[Figure: a one-dimensional, linearly inseparable dataset on $[-1, 1]$.]

4. Feature Mapping
1. A typical method to handle such linearly inseparable cases is feature mapping. That is, instead of using the original $\{\mathbf{x}_i\}_{i=1}^m$, we use a feature mapping function $\phi: \mathbb{R}^n \to \mathbb{R}^N$ on $\{\mathbf{x}_i\}_{i=1}^m$, so that the mapped feature vectors $\{\phi(\mathbf{x}_i)\}_{i=1}^m$ are linearly separable.
2. For example, we can map the linearly inseparable data with
$$\phi(\mathbf{x}) = \phi\left(\begin{bmatrix} 1 \\ \tilde{x} \end{bmatrix}\right) = \begin{bmatrix} 1 \\ \tilde{x} \\ \tilde{x}^2 \end{bmatrix}. \tag{3}$$
[Figure: the data plotted against $\tilde{x}$ and $\tilde{x}^2$, where the two classes become linearly separable.]
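A short sketch of the mapping in Eqn. (3), assuming an illustrative one-dimensional dataset where the positive class lies between $-0.5$ and $0.5$ (this dataset is not from the slides): after appending $\tilde{x}^2$, the two classes can be separated by a hyperplane.

```python
import numpy as np

rng = np.random.default_rng(0)
x_tilde = rng.uniform(-1.0, 1.0, size=20)          # 1D inputs in [-1, 1]
y = np.where(np.abs(x_tilde) < 0.5, 1, -1)         # inseparable in 1D: label depends on |x|

def phi(x_tilde):
    """Feature map of Eqn. (3): [1, x, x^2]^T for a scalar input."""
    return np.array([1.0, x_tilde, x_tilde ** 2])

features = np.stack([phi(v) for v in x_tilde])     # mapped vectors in R^3

# In the mapped space the classes are separable, e.g. by w = (0.25, 0, -1):
# w^T phi(x) = 0.25 - x^2 > 0 exactly when |x| < 0.5.
w = np.array([0.25, 0.0, -1.0])
assert np.all(y * (features @ w) >= 0)             # y_i * w^T phi(x_i) >= 0 for all i
```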

5. Logistic Regression with Feature Mapping
In the context of logistic regression, the whole process can be described as
$$h(\mathbf{x}_i) = \sigma(\mathbf{w}^T \phi(\mathbf{x}_i)) = \frac{1}{1 + \exp[-\mathbf{w}^T \phi(\mathbf{x}_i)]}, \tag{4}$$
where the dimension of the parameter vector $\mathbf{w}$ becomes $N$ accordingly.
[Figure: the input $\mathbf{x} \in \mathbb{R}^n$ is mapped to $\phi(\mathbf{x}) \in \mathbb{R}^N$, which is then fed into the linear model $h$.]

6. Computation of Feature Mapping
1. In order to achieve strong enough representation power, it is common in practice that $\phi(\mathbf{x})$ has a much higher dimension than $\mathbf{x}$, i.e., $N \gg n$.
2. However, this dramatically increases the cost of computing either $\phi(\mathbf{x})$ or $\mathbf{w}^T \phi(\mathbf{x})$.
3. In the following, we introduce an efficient way to implicitly compute $\mathbf{w}^T \phi(\mathbf{x})$.
4. Specifically, we use the representer theorem to show that computing $\mathbf{w}^T \phi(\mathbf{x})$ can be transformed into computing $\sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x})$, where $\{\alpha_i\}_{i=1}^m$ are learnable parameters.
5. Then we introduce kernel methods, which significantly reduce the cost of computing $\phi(\mathbf{x}_i)^T \phi(\mathbf{x})$.
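A small numerical sketch of the identity behind this plan, assuming random stand-in feature vectors and coefficients (illustrative only): if $\mathbf{w} = \sum_i \alpha_i \phi(\mathbf{x}_i)$, then $\mathbf{w}^T \phi(\mathbf{x})$ equals $\sum_i \alpha_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x})$, so the score can be computed from inner products alone.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 5, 100                       # m training samples, N-dimensional feature space
Phi = rng.normal(size=(m, N))       # rows are phi(x_i), stand-ins for mapped training points
phi_x = rng.normal(size=N)          # phi(x) for a new input
alpha = rng.normal(size=m)          # learnable coefficients alpha_i

w = Phi.T @ alpha                   # w = sum_i alpha_i phi(x_i)

score_explicit = w @ phi_x                      # w^T phi(x), needs the N-dimensional vectors
score_implicit = alpha @ (Phi @ phi_x)          # sum_i alpha_i phi(x_i)^T phi(x)

assert np.isclose(score_explicit, score_implicit)
```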

7. Summary of Models with Feature Mapping
Given the training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$ ($\mathbf{x}_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$) and a feature mapping $\phi: \mathbb{R}^n \to \mathbb{R}^N$, to solve a supervised learning task (regression or classification), we need to do the following steps (sketched in code below):
1. Compute the feature vectors $\{\phi(\mathbf{x}_i)\}_{i=1}^m$ of all training samples;
2. Initialize a linear model with a parameter vector $\mathbf{w} \in \mathbb{R}^N$;
3. Minimize the task-specific loss function $L$ on $\{\phi(\mathbf{x}_i), y_i\}_{i=1}^m$ with respect to $\mathbf{w}$.
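A compact sketch of these three steps for logistic regression, assuming the one-dimensional toy dataset from the earlier sketch and the map $\phi$ of Eqn. (3); the gradient-descent settings and variable names are illustrative, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x_tilde = rng.uniform(-1.0, 1.0, size=40)
y = np.where(np.abs(x_tilde) < 0.5, 1, -1)       # labels in {-1, +1}

# Step 1: compute the feature vectors phi(x_i) of all training samples.
Phi = np.stack([np.array([1.0, v, v ** 2]) for v in x_tilde])   # shape (m, N), N = 3

# Step 2: initialize a linear model with parameter vector w in R^N.
w = np.zeros(Phi.shape[1])

# Step 3: minimize the task-specific loss (here the logistic loss) w.r.t. w
# by plain gradient descent.
lr = 0.5
for _ in range(2000):
    z = Phi @ w                                  # scores w^T phi(x_i)
    grad = -(Phi.T @ (y * (1.0 / (1.0 + np.exp(y * z))))) / len(y)
    w -= lr * grad

accuracy = np.mean(np.sign(Phi @ w) == y)
print(f"training accuracy: {accuracy:.2f}")
```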

8. Regularization
1. Since the loss $L$ is a function of $z = \mathbf{w}^T \phi(\mathbf{x})$ and $y$, we can write it as $L(\mathbf{w}^T \phi(\mathbf{x}), y)$. Minimizing $L(\mathbf{w}^T \phi(\mathbf{x}), y)$ on $\{\phi(\mathbf{x}_i), y_i\}_{i=1}^m$ is an optimization problem:
$$\min_{\mathbf{w}} \; \frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i). \tag{5}$$
2. However, in many situations, minimizing $L$ alone may cause over-fitting. A common method to address over-fitting is to apply $\ell_2$-regularization, changing Equation (5) into Equation (6):
$$\min_{\mathbf{w}} \; \frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2, \quad \lambda \geq 0, \tag{6}$$
where $\lambda \geq 0$ is a hyper-parameter, known as the regularization parameter, controlling the extent to which we penalize large $\ell_2$-norms of $\mathbf{w}$.
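A brief sketch of evaluating the regularized objective in Eqn. (6) for the logistic loss; the mapped features, labels, parameters, and the value of $\lambda$ below are random illustrative stand-ins.

```python
import numpy as np

def regularized_objective(w, Phi, y, lam):
    """Eqn. (6): average logistic loss plus (lambda / 2) * ||w||_2^2."""
    z = Phi @ w                                   # scores w^T phi(x_i)
    data_loss = np.mean(np.log1p(np.exp(-y * z))) # logistic loss L(w^T phi(x_i), y_i)
    return data_loss + 0.5 * lam * np.dot(w, w)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 3))        # stand-in mapped features phi(x_i)
y = rng.choice([-1, 1], size=10)      # stand-in labels
w = rng.normal(size=3)                # stand-in parameter vector
print(regularized_objective(w, Phi, y, lam=0.1))   # lambda value is arbitrary
```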

9. Representer Theorem
$$\min_{\mathbf{w}} \; \frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2, \quad \lambda \geq 0. \tag{7}$$
In order to derive a solution to this optimization problem, we introduce the following theorem, which is a special case of the well-known Representer Theorem. The Representer Theorem is the theoretical foundation of kernel methods.
Theorem. If the optimization problem in Equation (6) (copied above) has optimal solutions, there must exist an optimal solution of the form $\mathbf{w}^* = \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)$.

10. Proof I
Since the elements of $\{\phi(\mathbf{x}_i)\}_{i=1}^m$ are all in $\mathbb{R}^N$, they span a subspace $V \subseteq \mathbb{R}^N$ such that $V = \{\mathbf{x} : \mathbf{x} = \sum_{i=1}^m c_i \phi(\mathbf{x}_i)\}$. Assume $\{\mathbf{v}_1, \ldots, \mathbf{v}_{n'}\}$ is an orthonormal basis of $V$, where $n' \leq N$. $V$ also has an orthogonal complement subspace $V^\perp$, which has an orthonormal basis $\{\mathbf{u}_1, \ldots, \mathbf{u}_{N-n'}\}$. Clearly, $\mathbf{v}_k^T \mathbf{u}_j = 0$ for any $1 \leq k \leq n'$ and $1 \leq j \leq N - n'$.
For an arbitrary vector $\mathbf{w} \in \mathbb{R}^N$, we can decompose it into a linear combination of the orthonormal basis vectors of the subspaces $V$ and $V^\perp$. That is, we can write $\mathbf{w}$ as
$$\mathbf{w} = \mathbf{w}_V + \mathbf{w}_{V^\perp} = \sum_{k=1}^{n'} s_k \mathbf{v}_k + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j. \tag{8}$$

11. Proof II
First, we can show that
$$\|\mathbf{w}\|_2^2 = \left\| \sum_{k=1}^{n'} s_k \mathbf{v}_k + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j \right\|_2^2 = \left( \sum_{k=1}^{n'} s_k \mathbf{v}_k^T + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j^T \right) \left( \sum_{k=1}^{n'} s_k \mathbf{v}_k + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j \right)$$
$$= \sum_{k=1}^{n'} s_k^2 \|\mathbf{v}_k\|_2^2 + \sum_{j=1}^{N-n'} t_j^2 \|\mathbf{u}_j\|_2^2 + 2 \sum_{k=1}^{n'} \sum_{j=1}^{N-n'} s_k t_j \mathbf{v}_k^T \mathbf{u}_j = \sum_{k=1}^{n'} s_k^2 \|\mathbf{v}_k\|_2^2 + \sum_{j=1}^{N-n'} t_j^2 \|\mathbf{u}_j\|_2^2$$
$$\geq \sum_{k=1}^{n'} s_k^2 \|\mathbf{v}_k\|_2^2 = \|\mathbf{w}_V\|_2^2. \tag{9}$$

12. Proof III
Second, because each $\phi(\mathbf{x}_i)$, $1 \leq i \leq m$, is a vector in $V$ and $\{\mathbf{v}_1, \ldots, \mathbf{v}_{n'}\}$ is an orthonormal basis of $V$, we have $\phi(\mathbf{x}_i) = \sum_{k=1}^{n'} \beta_{ik} \mathbf{v}_k$. This leads to the following equalities:
$$\mathbf{w}^T \phi(\mathbf{x}_i) = \left( \sum_{k=1}^{n'} s_k \mathbf{v}_k^T + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j^T \right) \left( \sum_{k=1}^{n'} \beta_{ik} \mathbf{v}_k \right) = \sum_{k=1}^{n'} s_k \beta_{ik} \|\mathbf{v}_k\|_2^2 = \mathbf{w}_V^T \phi(\mathbf{x}_i). \tag{10}$$

13. Proof IV
Based on the results in Eqn. (9) and Eqn. (10), and the fact that $\mathbf{w}_V$ is a vector in $V = \{\mathbf{x} : \mathbf{x} = \sum_{i=1}^m c_i \phi(\mathbf{x}_i)\}$, we can derive that, for an arbitrary $\mathbf{w}$, there always exists a $\mathbf{w}_V = \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)$ satisfying
$$\frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2 \;\geq\; \frac{1}{m} \sum_{i=1}^m L(\mathbf{w}_V^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}_V\|_2^2. \tag{11}$$
In other words, if a vector $\mathbf{w}^*$ minimizes $\frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2$, the corresponding $\mathbf{w}^*_V$ must also minimize it, and there exist some $\{\alpha_i\}_{i=1}^m$ such that $\mathbf{w}^*_V = \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)$.
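A numerical check of the two key facts in this proof, assuming random stand-in feature vectors and an arbitrary $\mathbf{w}$ (all values illustrative): projecting $\mathbf{w}$ onto $V = \mathrm{span}\{\phi(\mathbf{x}_i)\}$ leaves every score $\mathbf{w}^T \phi(\mathbf{x}_i)$ unchanged (Eqn. (10)) while never increasing the norm (Eqn. (9)), so the objective in Eqn. (11) can only decrease.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 4, 10
Phi = rng.normal(size=(m, N))        # rows are phi(x_i); they span a subspace V of R^N
w = rng.normal(size=N)               # an arbitrary parameter vector

# Orthonormal basis of V via the thin QR decomposition of Phi^T.
Q, _ = np.linalg.qr(Phi.T)           # columns of Q: orthonormal basis {v_1, ..., v_n'}
w_V = Q @ (Q.T @ w)                  # projection of w onto V, i.e. w_V in Eqn. (8)

# Eqn. (10): the scores on the training features are unchanged.
assert np.allclose(Phi @ w, Phi @ w_V)

# Eqn. (9): the projection has no larger l2-norm.
assert np.linalg.norm(w_V) <= np.linalg.norm(w) + 1e-12
```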

14. Use of the Representer Theorem in Training
1. According to Theorem 1, we only need to consider $\mathbf{w} \in \{\sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)\}$ when solving the optimization problem in Equation (6).
2. Therefore, we obtain the following transformed optimization problem by replacing $\mathbf{w}$ in Equation (6) with $\sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)$:
$$\min_{\alpha_1, \ldots, \alpha_m} \; \frac{1}{m} \sum_{j=1}^m L\left( \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j), \, y_j \right) + \frac{\lambda}{2} \sum_{j=1}^m \sum_{i=1}^m \alpha_i \alpha_j \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j), \quad \lambda \geq 0. \tag{12}$$
3. As a result, if we know $\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$ for all $1 \leq i, j \leq m$, we can compute the optimization objective in Equation (12) without explicitly knowing $\{\phi(\mathbf{x}_i)\}_{i=1}^m$.
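A minimal sketch of evaluating the objective in Equation (12) from a precomputed Gram matrix $K$ with $K_{ij} = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$, assuming the logistic loss; the data and $\lambda$ are illustrative stand-ins.

```python
import numpy as np

def kernelized_objective(alpha, K, y, lam):
    """Eqn. (12): the objective written purely in terms of K_ij = phi(x_i)^T phi(x_j)."""
    z = K @ alpha                                   # z_j = sum_i alpha_i phi(x_i)^T phi(x_j)
    data_loss = np.mean(np.log1p(np.exp(-y * z)))   # logistic loss L(z_j, y_j)
    reg = 0.5 * lam * (alpha @ K @ alpha)           # (lambda/2) * sum_{i,j} alpha_i alpha_j K_ij
    return data_loss + reg

rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 50))       # stand-in feature vectors phi(x_i)
K = Phi @ Phi.T                      # Gram matrix; only inner products are needed from here on
y = rng.choice([-1, 1], size=6)
alpha = rng.normal(size=6)
print(kernelized_objective(alpha, K, y, lam=0.1))
```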

15. Use of the Representer Theorem in Prediction
1. In addition, the output of the linear model with parameter $\mathbf{w}$ for any input $\phi(\mathbf{x})$ only depends on $\mathbf{w}^T \phi(\mathbf{x})$, i.e., $\sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x})$.
2. So for any unseen $\mathbf{x}$ that is not in the training set, we can make predictions directly, without computing $\phi(\mathbf{x})$ first, as long as we know $\phi(\mathbf{x}_i)^T \phi(\mathbf{x})$ for all $1 \leq i \leq m$.
3. In summary, in both training and prediction, what we really need is the inner product of two feature vectors, not the feature vectors themselves.
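A matching sketch for prediction, assuming a coefficient vector `alpha` that has already been learned and the inner product of the map from Eqn. (3), here written directly as $k(a, b) = 1 + ab + a^2 b^2$ (the data and coefficients are illustrative).

```python
import numpy as np

def phi(x_tilde):
    """Feature map of Eqn. (3) for a scalar input."""
    return np.array([1.0, x_tilde, x_tilde ** 2])

def kernel(a, b):
    """k(a, b) = phi(a)^T phi(b) = 1 + a*b + a^2*b^2, computed without forming phi."""
    return 1.0 + a * b + (a ** 2) * (b ** 2)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=5)     # training inputs (illustrative)
alpha = rng.normal(size=5)                   # coefficients, assumed already learned
x_new = 0.3                                  # an unseen input

# Prediction score via inner products only: sum_i alpha_i * k(x_i, x_new).
score_kernel = sum(a * kernel(xi, x_new) for a, xi in zip(alpha, x_train))

# Sanity check against the explicit computation w^T phi(x_new), with w = sum_i alpha_i phi(x_i).
w = sum(a * phi(xi) for a, xi in zip(alpha, x_train))
assert np.isclose(score_kernel, w @ phi(x_new))
```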
