Combining Kernels for Classification
Doctoral Thesis Seminar
Darrin P. Lewis
dplewis@cs.columbia.edu
Outline
- Summary of Contribution
- Stationary kernel combination
- Nonstationary kernel combination
- Sequential minimal optimization
- Results
- Conclusion
Summary of Contribution
- Empirical study of kernel averaging versus SDP weighted kernel combination
- Nonstationary kernel combination
- Double Jensen bound for latent MED
- Efficient iterative optimization
- Implementation
Outline
- Summary of Contribution
- Stationary kernel combination
- Nonstationary kernel combination
- Sequential minimal optimization
- Results
- Conclusion
Example Kernel One

$$K_1 = \begin{pmatrix} 1 & 4 & 2.75 & 3 \\ 4 & 16 & 11 & 12 \\ 2.75 & 11 & 7.5625 & 8.25 \\ 3 & 12 & 8.25 & 9 \end{pmatrix}$$

[Figure: PCA basis for Kernel 1, plotted in coordinates X1 and X2]
Example Kernel Two

$$K_2 = \begin{pmatrix} 9 & 12 & 8.25 & 3 \\ 12 & 16 & 11 & 4 \\ 8.25 & 11 & 7.5625 & 2.75 \\ 3 & 4 & 2.75 & 1 \end{pmatrix}$$

[Figure: PCA basis for Kernel 2, plotted in coordinates X1 and X2]
Example Kernel Combination

$$K_C = K_1 + K_2 = \begin{pmatrix} 10 & 16 & 11 & 6 \\ 16 & 32 & 22 & 16 \\ 11 & 22 & 15.125 & 11 \\ 6 & 16 & 11 & 10 \end{pmatrix}$$

[Figure: PCA basis for the combined kernel, plotted in coordinates X1 and X2]
Effect of Combination

$$K_C(x, z) = K_1(x, z) + K_2(x, z) = \langle \phi_1(x), \phi_1(z) \rangle + \langle \phi_2(x), \phi_2(z) \rangle = \langle \phi_1(x) : \phi_2(x),\; \phi_1(z) : \phi_2(z) \rangle$$

where $:$ denotes concatenation. The implicit feature space of the combined kernel is a concatenation of the feature spaces of the individual kernels. A basis in the combined feature space may be lower dimensional than the sum of the dimensions of the individual feature spaces.
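The identity can be checked numerically. Below is a minimal sketch (not from the thesis); the feature maps `phi1` and `phi2` are arbitrary illustrative choices:

```python
import numpy as np

def phi1(x):
    # hypothetical feature map for K1
    return np.array([x[0], x[0] ** 2])

def phi2(x):
    # hypothetical feature map for K2
    return np.array([x[1], x[0] * x[1]])

def k1(x, z):
    return phi1(x) @ phi1(z)

def k2(x, z):
    return phi2(x) @ phi2(z)

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])

# Summing the kernels ...
k_sum = k1(x, z) + k2(x, z)

# ... equals an inner product in the concatenated feature space.
k_cat = np.concatenate([phi1(x), phi2(x)]) @ np.concatenate([phi1(z), phi2(z)])

assert np.isclose(k_sum, k_cat)
```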
Combination Weights

There are several ways in which the combination weights can be determined:
- Equal weight: an unweighted combination; this is essentially kernel averaging [14].
- Optimized weight: SDP weighted combination [6]. Weights and SVM Lagrange multipliers are determined in a single optimization. To regularize the kernel weights, a constraint keeps the trace of the combined kernel constant. See the sketch after this list.
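As a rough sketch, assuming the kernels are given as Gram matrices, the following illustrates equal-weight averaging and a weighted conic combination under a fixed-trace constraint. The 0.8/0.2 weights are hypothetical stand-ins for what an SDP might return; the actual SDP, which solves for weights and multipliers jointly, is not reproduced here.

```python
import numpy as np

K1 = np.array([[1.0,  4.0,  2.75,   3.0 ],
               [4.0,  16.0, 11.0,   12.0],
               [2.75, 11.0, 7.5625, 8.25],
               [3.0,  12.0, 8.25,   9.0 ]])
K2 = np.array([[9.0,  12.0, 8.25,   3.0 ],
               [12.0, 16.0, 11.0,   4.0 ],
               [8.25, 11.0, 7.5625, 2.75],
               [3.0,  4.0,  2.75,   1.0 ]])

def combine(kernels, weights):
    """Conic combination sum_m nu_m K_m, rescaled so its trace is constant."""
    K = sum(w * Km for w, Km in zip(weights, kernels))
    target = np.mean([np.trace(Km) for Km in kernels])  # one possible trace target
    return K * (target / np.trace(K))

K_avg = combine([K1, K2], [0.5, 0.5])  # equal weight: kernel averaging
K_wtd = combine([K1, K2], [0.8, 0.2])  # hypothetical SDP-style optimized weights
```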
Sequence/Structure

- We compare [10] the state-of-the-art SDP and simple averaging for conic combinations of kernels.
- Drawbacks of SDP include optimization time and the lack of a free implementation.
- We determined the cases in which averaging is preferable and those in which SDP is required.
- Our experiments predict Gene Ontology (GO) terms [2] using a combination of amino acid sequence and protein structural information.
- We use the (4,1)-mismatch sequence kernel [8] and the MAMMOTH (sequence-independent) structure kernel [13].
Cumulative ROC AUC

[Figure: number of GO terms with a given mean ROC AUC (x-axis: mean ROC, 0.6 to 1.0; y-axis: number of terms, 0 to 50), with one curve each for Sequence, Structure, Average, and SDP]
Mean ROC AUC, Top 10 GO Terms

GO term      Structure        Sequence         Average          SDP
GO:0008168   0.941 ± 0.014    0.709 ± 0.020    0.937 ± 0.016    0.938 ± 0.015
GO:0005506   0.934 ± 0.008    0.747 ± 0.015    0.927 ± 0.012    0.927 ± 0.012
GO:0006260   0.885 ± 0.014    0.707 ± 0.020    0.878 ± 0.016    0.870 ± 0.015
GO:0048037   0.916 ± 0.015    0.738 ± 0.025    0.911 ± 0.016    0.909 ± 0.016
GO:0046483   0.949 ± 0.007    0.787 ± 0.011    0.937 ± 0.008    0.940 ± 0.008
GO:0044255   0.891 ± 0.012    0.732 ± 0.012    0.874 ± 0.015    0.864 ± 0.013
GO:0016853   0.855 ± 0.014    0.706 ± 0.029    0.837 ± 0.017    0.810 ± 0.019
GO:0044262   0.912 ± 0.007    0.764 ± 0.018    0.908 ± 0.006    0.897 ± 0.006
GO:0009117   0.892 ± 0.015    0.748 ± 0.016    0.890 ± 0.012    0.880 ± 0.012
GO:0016829   0.935 ± 0.006    0.791 ± 0.013    0.931 ± 0.008    0.926 ± 0.007
GO:0006732   0.823 ± 0.011    0.781 ± 0.013    0.845 ± 0.011    0.828 ± 0.013
GO:0007242   0.898 ± 0.011    0.859 ± 0.014    0.903 ± 0.010    0.900 ± 0.011
GO:0005525   0.923 ± 0.008    0.884 ± 0.015    0.931 ± 0.009    0.931 ± 0.009
GO:0004252   0.937 ± 0.011    0.907 ± 0.012    0.932 ± 0.012    0.931 ± 0.012
GO:0005198   0.809 ± 0.010    0.795 ± 0.014    0.828 ± 0.010    0.824 ± 0.011
Varying Ratio, Top 10 GO Terms

[Figure: mean ROC AUC (y-axis, 0.7 to 0.95) as a function of the log2 ratio of kernel weights (x-axis, from -Inf through -5 to 8 to +Inf)]
Noisy Kernels, 56 GO Terms

[Figure: scatter plot of mean ROC AUC under SDP (y-axis) versus under averaging (x-axis), both 0.65 to 1.0, with separate series for no noise, 1 noise kernel, and 2 noise kernels]
Missing Data, Typical GO Term (GO:0046483)

[Figure: mean ROC AUC (y-axis, roughly 0.7 to 0.9) versus percent of missing structures (x-axis, 0 to 100), with curves for the All/None/Self variants of SDP and averaging, plus the structure-only baseline]
Outline
- Summary of Contribution
- Stationary kernel combination
- Nonstationary kernel combination
- Sequential minimal optimization
- Results
- Conclusion
Kernelized Discriminants

Single kernel:
$$f(x) = \sum_t y_t \lambda_t \, k(x_t, x) + b$$

Linear combination:
$$f(x) = \sum_t y_t \lambda_t \sum_m \nu_m \, k_m(x_t, x) + b$$

Nonstationary combination [9]:
$$f(x) = \sum_t y_t \lambda_t \sum_m \nu_{m,t}(x) \, k_m(x_t, x) + b$$
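To make the progression concrete, here is a brief sketch (not the thesis code) of evaluating the three discriminants; the kernels, training data, multipliers, and gating function are all illustrative placeholders:

```python
import numpy as np

rbf = lambda xt, x: np.exp(-np.sum((xt - x) ** 2))  # placeholder kernel k_1
lin = lambda xt, x: xt @ x                          # placeholder kernel k_2
kernels = [rbf, lin]

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))       # training points x_t
y = np.sign(rng.standard_normal(20))   # labels y_t
lam = rng.random(20)                   # Lagrange multipliers lambda_t
b = 0.0

def f_single(x, k=rbf):
    return sum(y[t] * lam[t] * k(X[t], x) for t in range(len(X))) + b

def f_linear(x, nu=(0.5, 0.5)):
    # one fixed weight nu_m per kernel
    return sum(y[t] * lam[t] * sum(nu[m] * kernels[m](X[t], x) for m in range(2))
               for t in range(len(X))) + b

def f_nonstationary(x, nu_fn):
    # nu_fn(m, t, x): a weight depending on the kernel, the training point,
    # and the query point -- the nonstationary gating of the slide
    return sum(y[t] * lam[t] * sum(nu_fn(m, t, x) * kernels[m](X[t], x)
                                   for m in range(2))
               for t in range(len(X))) + b
```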
Parabola-Line Data

[Figure: the synthetic parabola-line data set]
Parabola-Line SDP

[Figure: SDP kernel combination result on the parabola-line data]
Ratio of Gaussian Mixtures

$$\mathcal{L}(X_t; \Theta) = \ln \frac{\sum_{m=1}^M \alpha_m \, \mathcal{N}(\phi^+_m(X_t) \mid \mu^+_m, I)}{\sum_{n=1}^N \beta_n \, \mathcal{N}(\phi^-_n(X_t) \mid \mu^-_n, I)} + b$$

- $\mu^+_m, \mu^-_n$: Gaussian means
- $\alpha, \beta$: mixing proportions
- $b$: scalar bias

For now, maximum likelihood parameters are estimated independently for each model. Note the explicit feature maps $\phi^+, \phi^-$.
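A minimal numerical sketch of this discriminant, assuming for brevity a single feature map per class rather than one per component, and identity covariances as on the slide; scipy's `logsumexp` keeps the mixture log-sums numerically stable:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mixture_log_lik(phi_x, means, mix):
    # ln sum_m alpha_m N(phi(x) | mu_m, I), computed in the log domain
    logs = [np.log(a) + multivariate_normal.logpdf(phi_x, mean=mu)
            for a, mu in zip(mix, means)]
    return logsumexp(logs)

def L(phi_plus_x, phi_minus_x, mu_plus, alpha, mu_minus, beta, b=0.0):
    """Log-likelihood ratio of the two Gaussian mixtures, plus the bias b."""
    return (mixture_log_lik(phi_plus_x, mu_plus, alpha)
            - mixture_log_lik(phi_minus_x, mu_minus, beta) + b)

# toy usage with two components per class in a 2-D feature space
mu_plus = [np.zeros(2), np.ones(2)]
mu_minus = [np.array([2.0, 2.0]), np.array([3.0, 1.0])]
print(L(np.array([0.1, 0.2]), np.array([0.1, 0.2]),
        mu_plus, [0.5, 0.5], mu_minus, [0.7, 0.3]))
```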
Parabola-Line ML

[Figure: maximum likelihood mixture solution on the parabola-line data]
Ratio of Generative Models

$$\mathcal{L}(X_t; \Theta) = \ln \frac{\sum_{m=1}^M P(m, \phi^+_m(X_t) \mid \theta^+_m)}{\sum_{n=1}^N P(n, \phi^-_n(X_t) \mid \theta^-_n)} + b$$

- Find a distribution $P(\Theta)$ rather than a specific $\Theta^*$
- Classify using $\hat{y} = \operatorname{sign}\left( \int_\Theta P(\Theta) \, \mathcal{L}(X_t; \Theta) \, d\Theta \right)$
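The decision rule averages the discriminant over the solution distribution. As a toy illustration only (MED evaluates this integral analytically under its priors; the Monte Carlo estimate here is an assumption made for concreteness):

```python
import numpy as np

def med_predict(L_fn, theta_samples, x):
    """sign( E_{Theta ~ P(Theta)}[ L(x; Theta) ] ), estimated by Monte Carlo
    from samples theta_samples drawn from P(Theta)."""
    return np.sign(np.mean([L_fn(x, theta) for theta in theta_samples]))
```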
Max Ent Parameter Estimation

Find $P(\Theta)$ to satisfy the "moment" constraints:
$$\int_\Theta P(\Theta) \, y_t \mathcal{L}(X_t; \Theta) \, d\Theta \ge \gamma_t \quad \forall t \in \mathcal{T}$$
while assuming nothing additional.

Minimize the Shannon relative entropy:
$$D(P \,\|\, P^{(0)}) = \int_\Theta P(\Theta) \ln \frac{P(\Theta)}{P^{(0)}(\Theta)} \, d\Theta$$
to allow the use of a prior $P^{(0)}(\Theta)$.

The classic ME solution [3] is:
$$P(\Theta) = \frac{1}{Z(\lambda)} \, P^{(0)}(\Theta) \, e^{\sum_{t \in \mathcal{T}} \lambda_t [\, y_t \mathcal{L}(X_t \mid \Theta) - \gamma_t \,]}$$

$\lambda$ fully specifies $P(\Theta)$. Maximize the concave objective $J(\lambda) = -\log Z(\lambda)$.
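For reference, the standard derivation step behind this solution (not spelled out on the slide): attaching the moment constraints to the relative entropy with multipliers $\lambda_t \ge 0$ and setting the functional derivative to zero gives

$$\frac{\delta}{\delta P(\Theta)} \Big[ D(P \,\|\, P^{(0)}) - \sum_{t \in \mathcal{T}} \lambda_t \Big( \int_\Theta P(\Theta) \, y_t \mathcal{L}(X_t; \Theta) \, d\Theta - \gamma_t \Big) \Big] = 0 \;\Rightarrow\; P(\Theta) \propto P^{(0)}(\Theta) \, e^{\sum_{t \in \mathcal{T}} \lambda_t y_t \mathcal{L}(X_t; \Theta)},$$

and normalizing (the constant factor $e^{-\sum_t \lambda_t \gamma_t}$ folds into $Z(\lambda)$) recovers the form above.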
Tractable Partition

$$\ddot{Z}(\lambda, Q \mid q) = \int_\Theta P^{(0)}(\Theta) \prod_{t \in \mathcal{T}^+} \exp\!\bigg( \lambda_t \Big( \sum_m q_t(m) \ln P(m, \phi^+_m(X_t) \mid \theta^+_m) + H(q_t) - \sum_n Q_t(n) \ln P(n, \phi^-_n(X_t) \mid \theta^-_n) - H(Q_t) + b - \gamma_t \Big) \bigg) \prod_{t \in \mathcal{T}^-} \exp\!\bigg( \lambda_t \Big( \sum_n q_t(n) \ln P(n, \phi^-_n(X_t) \mid \theta^-_n) + H(q_t) - \sum_m Q_t(m) \ln P(m, \phi^+_m(X_t) \mid \theta^+_m) - H(Q_t) - b - \gamma_t \Big) \bigg) \, d\Theta$$

Introduce variational distributions $q_t$ over the correct-class log-sums and $Q_t$ over the incorrect-class log-sums, replacing each with a Jensen bound: a lower bound on the correct-class log-sum (via $q_t$) and, after negation, an upper bound on the incorrect-class term (via $Q_t$).

$$\operatorname*{argmin}_Q \, \operatorname*{argmax}_q \; \ddot{Z}(\lambda, Q \mid q) = Z(\lambda)$$

Iterative optimization is required.
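The single Jensen step behind each bound is standard: for any distribution $q_t(m)$ over the mixture components,

$$\ln \sum_m P(m, \phi^+_m(X_t) \mid \theta^+_m) = \ln \sum_m q_t(m) \, \frac{P(m, \phi^+_m(X_t) \mid \theta^+_m)}{q_t(m)} \ge \sum_m q_t(m) \ln P(m, \phi^+_m(X_t) \mid \theta^+_m) + H(q_t),$$

with equality when $q_t$ is the posterior over components; the bound through $Q_t$ applies the same inequality to the other log-sum before negation.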
MED Gaussian Mixtures

$$\mathcal{L}(X_t; \Theta) = \ln \frac{\sum_{m=1}^M \alpha_m \, \mathcal{N}(\phi^+_m(X_t) \mid \mu^+_m, I)}{\sum_{n=1}^N \beta_n \, \mathcal{N}(\phi^-_n(X_t) \mid \mu^-_n, I)} + b$$

- Gaussian priors $\mathcal{N}(0, I)$ on $\mu^+_m, \mu^-_n$
- Non-informative Dirichlet priors on $\alpha, \beta$
- Non-informative Gaussian $\mathcal{N}(0, \infty)$ prior on $b$

These assumptions simplify the objective and result in a set of linear equality constraints on the convex optimization.
Convex Objective

$$\begin{aligned}
\ddot{J}(\lambda, Q \mid q) = {} & \sum_{t \in \mathcal{T}} \lambda_t \big( H(Q_t) - H(q_t) \big) + \sum_{t \in \mathcal{T}} \lambda_t \gamma_t \\
& - \frac{1}{2} \sum_{t, t' \in \mathcal{T}^+} \lambda_t \lambda_{t'} \Big( \sum_m q_t(m) \, q_{t'}(m) \, k^+_m(t, t') + \sum_n Q_t(n) \, Q_{t'}(n) \, k^-_n(t, t') \Big) \\
& - \frac{1}{2} \sum_{t, t' \in \mathcal{T}^-} \lambda_t \lambda_{t'} \Big( \sum_m Q_t(m) \, Q_{t'}(m) \, k^+_m(t, t') + \sum_n q_t(n) \, q_{t'}(n) \, k^-_n(t, t') \Big) \\
& + \sum_{t \in \mathcal{T}^+} \sum_{t' \in \mathcal{T}^-} \lambda_t \lambda_{t'} \Big( \sum_m q_t(m) \, Q_{t'}(m) \, k^+_m(t, t') + \sum_n Q_t(n) \, q_{t'}(n) \, k^-_n(t, t') \Big)
\end{aligned}$$