Multiple Kernel Learning and Feature Space Denoising

Multiple Kernel Learning and Feature Space Denoising, by Fei Yan, Josef Kittler and Krystian Mikolajczyk. PowerPoint PPT presentation, eNTERFACE10.



  1. Multiple Kernel Learning and Feature Space Denoising
     Fei Yan, Josef Kittler and Krystian Mikolajczyk
     eNTERFACE10

  2. Overview of the talk
     Kernel methods: an overview; three examples (kernel PCA, SVM and kernel FDA); the connection between SVM and kernel FDA.
     Multiple kernel learning: motivation; ℓp-regularised multiple kernel FDA; the effect of the regularisation norm in MKL.
     MKL and feature space denoising.
     Conclusions.

  3. Kernel methods: an overview
     Kernel methods are one of the most active areas in machine learning. The key idea is to embed the data from the input space into a high-dimensional feature space and then apply linear methods in that feature space. The input space can consist of vectors, strings, graphs, etc. The embedding is implicit, via a kernel function k(·, ·) that defines the dot product in the feature space. Any algorithm that can be written using only dot products is "kernelisable".
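
As a concrete illustration of the implicit embedding, the sketch below (assuming NumPy is available) evaluates a Gaussian RBF kernel on toy data: every entry of the resulting matrix is a dot product in an implicit feature space that is never constructed explicitly.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel: k(x, y) = exp(-gamma * ||x - y||^2).

    Each entry is a dot product <phi(x), phi(y)> in an implicit,
    infinite-dimensional feature space; phi itself is never computed.
    """
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * sq_dists)

# Toy data: 5 samples in a 3-dimensional input space.
X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel(X, X)                  # 5 x 5 kernel (Gram) matrix
print(K.shape, np.allclose(K, K.T))   # symmetric; PSD by construction
```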

  4. What is PCA
     Principal component analysis (PCA) is an orthogonal basis transformation: it transforms correlated variables into uncorrelated ones, the principal components. PCA can be used for dimensionality reduction, retaining as much variance as possible when the dimensionality is reduced.

  5. How PCA works
     Given m centred vectors X̃ = (x̃_1, x̃_2, ···, x̃_m), where X̃ is the d × m data matrix, take the eigen-decomposition of the covariance C̃ = X̃ X̃^T: C̃ = Ṽ Ω̃ Ṽ^T. The diagonal matrix Ω̃ holds the eigenvalues and Ṽ = (ṽ_1, ṽ_2, ···) the eigenvectors, the orthogonal basis sought. The data can now be projected onto this orthogonal basis; projecting only onto the leading eigenvectors gives dimensionality reduction with minimum variance loss.
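
A minimal NumPy sketch of this procedure on toy data: centre the d × m data matrix, eigen-decompose the covariance, and project onto the leading eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 100))                  # d x m data matrix, d = 4, m = 100
X_tilde = X - X.mean(axis=1, keepdims=True)    # centre each variable

C = X_tilde @ X_tilde.T                        # d x d (scaled) covariance matrix
eigvals, V = np.linalg.eigh(C)                 # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]              # sort eigenvalues in decreasing order
eigvals, V = eigvals[order], V[:, order]

k = 2                                          # keep the k leading principal components
Z = V[:, :k].T @ X_tilde                       # k x m projected (dimension-reduced) data
print(Z.shape, eigvals)
```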

  6. Kernelising PCA
     If we knew the mapping from the input space to the feature space explicitly, x_i = φ(x̃_i), we could map all the data, X = φ(X̃), where X is d × m, and diagonalise the covariance in feature space C = XX^T: CV = VΩ. Multiplying by X^T gives X^T C V = X^T V Ω, an eigen-problem of the form K A = A Δ; the diagonal matrix Δ holds the eigenvalues and V = (v_1, v_2, ···) is the orthogonal basis in feature space. However, we have φ(·) only implicitly, via ⟨φ(x̃_i), φ(x̃_j)⟩ = k(x̃_i, x̃_j), which is why PCA must be kernelised.

  7. Kernelising PCA
     The kernel matrix K is the evaluation of the kernel function on all pairs of samples; it is symmetric and positive semi-definite (PSD). Connection between C and K: C = XX^T and K = X^T X; C is d × d while K is m × m; C is not explicitly available but K is. So we diagonalise K instead of C: K = A Δ A^T, where A = (α_1, α_2, ···) are the eigenvectors.

  8. Kernelising PCA
     Using the connection between C and K: C and K have the same eigenvalues, and their i-th eigenvectors are related by v_i = X α_i. v_i is still not explicitly available: α_i is, but X is not. However, we are interested in the projection onto the orthogonal basis, not the basis itself, and the projection onto v_i is X^T v_i = X^T X α_i = K α_i, where both K and α_i are available.
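
Putting slides 6 to 8 together, the following sketch performs kernel PCA from a precomputed kernel matrix only. The feature-space centring and the rescaling of the α_i are standard kernel-PCA details that the slides do not spell out, so treat them as assumptions of this sketch.

```python
import numpy as np

def kernel_pca(K, n_components):
    """Kernel PCA from a precomputed m x m kernel matrix K (a sketch).

    Centres K in feature space, eigen-decomposes it (K = A Delta A^T),
    and returns the projections onto the leading feature-space
    eigenvectors, computed as K alpha_i -- phi is never needed.
    """
    m = K.shape[0]
    J = np.ones((m, m)) / m
    Kc = K - J @ K - K @ J + J @ K @ J        # centre the data in feature space
    eigvals, A = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, A = eigvals[order], A[:, order]
    A = A / np.sqrt(eigvals)                  # scale so v_i = X alpha_i has unit norm
    return Kc @ A                             # m x n_components projections K alpha_i

# Example with a linear kernel, so the result matches ordinary PCA.
X = np.random.default_rng(0).normal(size=(20, 5))   # m = 20 samples, 5 features
K = X @ X.T
Z = kernel_pca(K, n_components=2)
print(Z.shape)
```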

  9. Support Vector Machine
     The SVM is a supervised learning method, as opposed to (kernel) PCA. In the binary classification setting it maximises the margin; integrating misclassification gives the soft-margin SVM:

         min_{w,b}  (1/2) w^T w + C Σ_{i=1}^m (1 − y_i (w^T x_i + b))_+        (1)

     Here ‖w‖ is the reciprocal of the margin, (x)_+ = max(x, 0) is the hinge loss penalising the empirical error, C is the parameter controlling the tradeoff, and y_i ∈ {+1, −1} is the label of training sample i. The goal is the hyperplane with maximum soft margin.
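
As a worked example of objective (1), the hedged sketch below evaluates the soft-margin objective for a given (w, b) on toy data; it only illustrates the formula and does not solve the optimisation.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Evaluate the soft-margin SVM primal (1):
    0.5 * w^T w + C * sum_i max(0, 1 - y_i (w^T x_i + b)).
    X: m x d data, y: labels in {+1, -1}.
    """
    margins = y * (X @ w + b)                 # y_i (w^T x_i + b) for every sample
    hinge = np.maximum(0.0, 1.0 - margins)    # hinge loss: penalises margin violations
    return 0.5 * w @ w + C * hinge.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = np.sign(X[:, 0])                          # a trivially separable toy labelling
print(soft_margin_objective(np.array([1.0, 0.0]), 0.0, X, y, C=1.0))
```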

  10. Support Vector Machine
      The SVM primal (1) is equivalent to its Lagrangian dual:

          max_α  Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m y_i y_j α_i α_j K_ij        (2)
          subject to  Σ_{i=1}^m y_i α_i = 0,   0 ≤ α ≤ C 1

      (2) depends only on the kernel matrix K (and the labels); the explicit mapping φ(·) into feature space is not needed, so the SVM can be kernelised.
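
Since (2) needs only K and the labels, an SVM can be trained directly from a precomputed kernel matrix. A small sketch using scikit-learn's SVC with kernel='precomputed' (an illustrative choice of solver, not part of the talk) on toy data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))
y_train = np.sign(X_train[:, 0] + 0.1 * rng.normal(size=40))
X_test = rng.normal(size=(10, 3))

# Any PSD kernel matrix will do; here a simple linear kernel.
K_train = X_train @ X_train.T            # m x m kernel between training samples
K_test = X_test @ X_train.T              # rows: test samples, columns: training samples

clf = SVC(kernel='precomputed', C=1.0)   # the dual (2) only needs K and the labels
clf.fit(K_train, y_train)
print(clf.predict(K_test))
```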

  11. Kernel FDA
      Kernel Fisher discriminant analysis (FDA) is another supervised learning technique. It seeks the projection w maximising the Fisher criterion

          max_w  (w^T S_B w) / (w^T (S_T + λ I) w)        (3)

      where S_B and S_T are the between-class and total scatter matrices (built from the m training samples, of which m_+ are positive and m_- negative) and λ is a regularisation parameter.
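
The sketch below evaluates a regularised Fisher criterion of the form (3) for a fixed direction w on toy 2-D data. It uses the standard between-class and total scatter definitions, which may differ from the authors' formulation by constant factors involving m, m_+ and m_-.

```python
import numpy as np

def fisher_criterion(w, X, y, lam=1e-3):
    """Regularised Fisher criterion (3) for a direction w (a sketch).

    Standard scatter definitions are assumed; the authors' version may
    include extra constant factors. X: m x d data, y: labels in {+1, -1}.
    """
    mu_pos = X[y > 0].mean(axis=0)
    mu_neg = X[y < 0].mean(axis=0)
    mu = X.mean(axis=0)
    # Between-class scatter: separation of the two class means.
    S_B = np.outer(mu_pos - mu_neg, mu_pos - mu_neg)
    # Total scatter: overall spread of the (centred) data.
    Xc = X - mu
    S_T = Xc.T @ Xc
    d = X.shape[1]
    return (w @ S_B @ w) / (w @ (S_T + lam * np.eye(d)) @ w)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, size=(30, 2)), rng.normal(-2, 1, size=(30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])
print(fisher_criterion(np.array([1.0, 0.0]), X, y))   # direction along the class gap
```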

  12. Kernel FDA
      It can be proved that (3) is equivalent to

          min_w  ‖(XP)^T w − a‖² + λ ‖w‖²        (4)

      where P and a are constants determined by the labels. (4) is in turn equivalent to its Lagrangian dual:

          min_α  (1/4) α^T (I + (1/λ) K) α − α^T a        (5)

      (5) depends only on K (and the labels), so FDA can be kernelised.
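
Because (5) is an unconstrained quadratic in α, it can be minimised by solving a single linear system. In the sketch below the label-derived vector a is only a placeholder, since the slide does not give the exact form of P and a.

```python
import numpy as np

def solve_kfda_dual(K, a, lam):
    """Minimise the unconstrained quadratic (5):
        0.25 * alpha^T (I + K/lam) alpha - alpha^T a.
    Setting the gradient to zero gives a linear system with solution
        alpha = 2 * (I + K/lam)^{-1} a.
    """
    m = K.shape[0]
    return np.linalg.solve(np.eye(m) + K / lam, 2.0 * a)

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))
K = X @ X.T                       # any PSD kernel matrix
a = rng.normal(size=15)           # placeholder for the label-derived vector a
alpha = solve_kfda_dual(K, a, lam=1.0)
print(alpha.shape)
```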

  13. Connection between SVM and kernel FDA
      Like the SVM, kernel FDA is a special case of Tikhonov regularisation. The goals of Tikhonov regularisation are a small empirical error (the loss function may vary) and, at the same time, a small norm w^T w (for good generalisation); λ controls the tradeoff between the two. Instead of the SVM's hinge loss for the empirical error, FDA uses the squared loss.
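
The sketch below makes the shared Tikhonov form explicit: the same "empirical loss + λ‖w‖²" objective, instantiated once with the hinge loss (SVM-style) and once with the squared loss (FDA-style). The helper name and toy data are illustrative, not from the talk.

```python
import numpy as np

def tikhonov_objective(w, b, X, y, lam, loss="hinge"):
    """Generic Tikhonov-regularised objective: empirical loss + lam * ||w||^2.

    loss="hinge"  -> SVM-style loss  max(0, 1 - y f(x))
    loss="square" -> FDA / regularised least-squares loss  (y - f(x))^2
    """
    f = X @ w + b
    if loss == "hinge":
        emp = np.maximum(0.0, 1.0 - y * f).sum()
    else:
        emp = ((y - f) ** 2).sum()
    return emp + lam * w @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sign(X[:, 0])
w, b = np.array([1.0, 0.0]), 0.0
print(tikhonov_objective(w, b, X, y, lam=0.1, loss="hinge"),
      tikhonov_objective(w, b, X, y, lam=0.1, loss="square"))
```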

  14. MKL: motivation
      A recap on kernel methods: embed (implicitly) into a (very high-dimensional) feature space; "implicitly" because we only need the dot product in the feature space, i.e. the kernel function k(·, ·); apply linear methods in the feature space; and easily balance capacity (empirical error) against generalisation (the norm w^T w). This all sounds nice, but which kernel function should we use? The choice is critically important, for it completely determines the embedding.

  15. MKL: motivation
      The ideal case would be to learn the kernel function itself from the data. If that is too hard, can we at least learn a good combination of given kernel matrices? This is the multiple kernel learning (MKL) problem. Given n m × m kernel matrices K_1, ···, K_n, most MKL formulations consider a linear combination:

          K = Σ_{j=1}^n β_j K_j,   β_j ≥ 0        (6)

      The goal of MKL is to learn the "optimal" weights β ∈ R^n.
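
A minimal sketch of the linear combination (6): given a list of PSD kernel matrices and nonnegative weights β, form the combined kernel. Learning β is what MKL optimises; here the weights are simply supplied.

```python
import numpy as np

def combine_kernels(kernels, beta):
    """Linear kernel combination (6): K = sum_j beta_j * K_j with beta_j >= 0.

    A nonnegative combination of PSD matrices is again PSD, so K is a
    valid kernel.
    """
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0), "kernel weights must be nonnegative"
    return sum(b * K for b, K in zip(beta, kernels))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
K1 = X @ X.T                                                   # linear kernel
K2 = np.exp(-0.5 * np.square(X[:, None] - X[None]).sum(-1))    # RBF kernel
K = combine_kernels([K1, K2], beta=[0.3, 0.7])
print(K.shape)
```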

  16. MKL: motivation
      The kernel matrix K_j contains the pairwise dot products in feature space j. The geometrical interpretation of the unweighted sum K = Σ_{j=1}^n K_j is the Cartesian product of the feature spaces. The geometrical interpretation of the weighted sum K = Σ_{j=1}^n β_j K_j is: scale each feature space by √β_j, then take the Cartesian product. Learning the kernel weights therefore amounts to seeking the "optimal" scaling.
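
The scaling interpretation can be checked numerically with explicit (finite-dimensional) feature maps: concatenating the feature vectors scaled by √β_j yields a space whose dot products equal the weighted kernel sum. The two feature maps below are toy choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))

# Two explicit (finite-dimensional) feature maps, purely for illustration.
phi1 = X                       # identity features -> linear kernel K1
phi2 = X ** 2                  # squared features  -> K2 = <x^2, z^2>
K1, K2 = phi1 @ phi1.T, phi2 @ phi2.T

beta = np.array([0.5, 2.0])
# Scale each feature space by sqrt(beta_j) and take the Cartesian product
# (i.e. concatenate the scaled feature vectors).
phi = np.hstack([np.sqrt(beta[0]) * phi1, np.sqrt(beta[1]) * phi2])

# Dot products in the concatenated space equal the weighted kernel sum.
print(np.allclose(phi @ phi.T, beta[0] * K1 + beta[1] * K2))   # True
```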
