Kernel Learning with a Million Kernels
Ashesh Jain (IIT Delhi), S.V.N. Vishwanathan (Purdue University), Manik Varma (Microsoft Research India) — PowerPoint PPT Presentation


  1. Kernel Learning with a Million Kernels — Ashesh Jain (IIT Delhi), S.V.N. Vishwanathan (Purdue University), Manik Varma (Microsoft Research India)

  2. Kernel Learning • The objective in kernel learning is to jointly learn both SVM and kernel parameters from training data. • Kernel parameterizations • Linear: $K = \sum_i d_i K_i$ • Non-linear: $K = \prod_i K_i$ with $K_i = e^{-d_i D_i}$ • Regularizers • Sparse $\ell_1$ • Sparse and non-sparse $\ell_p$, $p \geq 1$ • Log determinant
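The two parameterizations above can be sketched numerically. This is an illustrative sketch, not code from the talk: the function names are ours, and we assume the base kernel matrices K_i (and, for the product form, per-feature distance matrices D_i) are given as NumPy arrays.

```python
import numpy as np

def linear_combination(Ks, d):
    """Linear parameterization: K = sum_i d_i * K_i."""
    return sum(d_i * K_i for d_i, K_i in zip(d, Ks))

def product_combination(Ds, d):
    """Non-linear parameterization: K = prod_i exp(-d_i * D_i),
    where D_i is a distance matrix for the i-th feature/kernel."""
    K = np.ones_like(Ds[0], dtype=float)
    for d_i, D_i in zip(d, Ds):
        K *= np.exp(-d_i * D_i)
    return K
```

The product form is the one that makes a "million kernels" tractable: the combined kernel is built feature by feature without ever materializing a million separate Gram matrices at once.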

  3. Kernel Learning for Object Detection • Vedaldi, Gulshan, Varma and Zisserman ICCV 2009

  4. Kernel Learning for Object Recognition • Orabona, Jie and Caputo CVPR 2010

  5. Kernel Learning for Feature Selection • Varma and Babu, ICML 2009 • [Table: classification accuracy (% ± std. dev.) on the FERET gender identification data set as the number of selected features grows from 10 to 252, comparing AdaBoost (Baluja et al., IJCV 2007), LP-SVM (COA 2004), OWL-QN (ICML 2007), SSVM (ICML 2007), BAHSIC (ICML 2007), QCQP MKL, Linear MKL, and Non-Linear MKL; Non-Linear MKL attains the highest accuracies, e.g. 95.5 ± 0.7]

  6. The GMKL Primal Formulation • $P = \min_{\mathbf{w},b,\mathbf{d}} \; \tfrac{1}{2}\mathbf{w}^t\mathbf{w} + C \sum_i L(\mathbf{w}^t \phi_{\mathbf{d}}(\mathbf{x}_i) + b,\, y_i) + r(\mathbf{d})$ s.t. $\mathbf{d} \in \mathcal{D}$ • $K_{\mathbf{d}}(\mathbf{x}_i, \mathbf{x}_j) = \phi_{\mathbf{d}}^t(\mathbf{x}_i)\, \phi_{\mathbf{d}}(\mathbf{x}_j) \succ 0 \;\; \forall \mathbf{d} \in \mathcal{D}$ • $\nabla_{\mathbf{d}} K$ and $\nabla_{\mathbf{d}} r$ exist and are continuous

  7. The GMKL Primal Formulation • The GMKL primal formulation for binary classification. • $P = \min_{\mathbf{w},b,\mathbf{d},\boldsymbol{\xi}} \; \tfrac{1}{2}\mathbf{w}^t\mathbf{w} + C \sum_i \xi_i + r(\mathbf{d})$ s.t. $y_i(\mathbf{w}^t \phi_{\mathbf{d}}(\mathbf{x}_i) + b) \geq 1 - \xi_i$, $\xi_i \geq 0$ and $\mathbf{d} \in \mathcal{D}$

  8. The GMKL Primal Formulation • The GMKL primal formulation for binary classification. • $P = \min_{\mathbf{w},b,\mathbf{d},\boldsymbol{\xi}} \; \tfrac{1}{2}\mathbf{w}^t\mathbf{w} + C \sum_i \xi_i + r(\mathbf{d})$ s.t. $y_i(\mathbf{w}^t \phi_{\mathbf{d}}(\mathbf{x}_i) + b) \geq 1 - \xi_i$, $\xi_i \geq 0$ and $\mathbf{d} \in \mathcal{D}$ • Intermediate dual: $D = \min_{\mathbf{d}} \max_{\boldsymbol{\alpha}} \; \mathbf{1}^t\boldsymbol{\alpha} - \tfrac{1}{2}\boldsymbol{\alpha}^t Y K_{\mathbf{d}} Y \boldsymbol{\alpha} + r(\mathbf{d})$ s.t. $\mathbf{1}^t Y \boldsymbol{\alpha} = 0$, $0 \leq \boldsymbol{\alpha} \leq C$ and $\mathbf{d} \in \mathcal{D}$
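For a fixed kernel-parameter vector d, the inner objective of the intermediate dual is a direct computation. A minimal sketch (the function name is ours), assuming labels y ∈ {−1, +1} and a precomputed kernel matrix K_d:

```python
import numpy as np

def dual_objective(alpha, y, K_d, r_d=0.0):
    """1' alpha - 1/2 alpha' Y K_d Y alpha + r(d), with Y = diag(y)."""
    Ya = y * alpha                      # elementwise product equals Y @ alpha
    return alpha.sum() - 0.5 * Ya @ K_d @ Ya + r_d
```

The outer min over d is what the rest of the talk is about: each evaluation of this objective at the maximizing α requires solving an SVM.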

  9. Projected Gradient Descent • [Figure: starting point x⁰ on the unit square in the (d₁, d₂) plane]

  10. Projected Gradient Descent • [Figure: first iterate x¹]

  11. Projected Gradient Descent • [Figure: gradient step from x¹ to z¹ outside the feasible set, projected back to x²]

  12. Projected Gradient Descent • [Figure: next iterate x³]

  13. Projected Gradient Descent • [Figure: the iterates x⁰, x¹, x², x³ converge towards the optimum x∗]
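The step-project-repeat loop illustrated in these slides can be sketched as follows. This is a generic projected gradient method with Armijo backtracking (all names ours, not the authors' GMKL code), shown minimizing a toy quadratic over the unit box:

```python
import numpy as np

def project_box(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^n."""
    return np.clip(x, lo, hi)

def pgd(f, grad, x0, sigma=1e-4, beta=0.5, iters=50):
    """Projected gradient descent with an Armijo backtracking rule."""
    x = project_box(np.asarray(x0, dtype=float))
    for _ in range(iters):
        g = grad(x)
        s = 1.0
        while True:
            z = project_box(x - s * g)        # gradient step, then project back
            # Armijo sufficient-decrease test on the projected point
            if f(z) <= f(x) + sigma * g @ (z - x) or s < 1e-12:
                break
            s *= beta
        x = z
    return x
```

For f(x) = ‖x − (1.5, 0.5)‖² the unconstrained minimizer lies outside the box, and the iterates settle on the boundary point (1, 0.5). Note the inner while loop: every rejected step size triggers another projection, which is exactly the inefficiency the next slides address.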

  14.–17. PGD Limitations • PGD requires many function and gradient evaluations because: • No step size information is available. • The Armijo rule might reject many step size proposals. • Inaccurate gradient values can lead to many tiny steps. • Noisy function and gradient values can cause PGD to converge to points far from the optimum. • Solving SVMs to high precision to obtain accurate function and gradient values is very expensive. • Repeated projection onto the feasible set might also be expensive.

  18. SPG Solution – Spectral Step Length • Quadratic approximation: $\tfrac{1}{2}\lambda^{-1}\mathbf{x}^t\mathbf{x} + \mathbf{c}^t\mathbf{x} + d$ • Spectral step length: $\lambda_{SPG} = \dfrac{\langle \mathbf{x}^n - \mathbf{x}^{n-1},\, \mathbf{x}^n - \mathbf{x}^{n-1} \rangle}{\langle \mathbf{x}^n - \mathbf{x}^{n-1},\, \nabla f(\mathbf{x}^n) - \nabla f(\mathbf{x}^{n-1}) \rangle}$ • [Figure: the original function and its quadratic approximation, with iterates x⁰, x¹ and optimum x∗]

  19. SPG Solution – Spectral Step Length • Spectral step length: $\lambda_{SPG} = \dfrac{\langle \mathbf{x}^n - \mathbf{x}^{n-1},\, \mathbf{x}^n - \mathbf{x}^{n-1} \rangle}{\langle \mathbf{x}^n - \mathbf{x}^{n-1},\, \nabla f(\mathbf{x}^n) - \nabla f(\mathbf{x}^{n-1}) \rangle}$ • [Figure: contour plot with iterates x⁰ and x¹]
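The spectral (Barzilai–Borwein) step length is a one-line computation. A minimal sketch (the function name and the safeguard bounds are ours):

```python
import numpy as np

def spectral_step(x_n, x_prev, g_n, g_prev, lam_min=1e-10, lam_max=1e10):
    """lambda_SPG = <s, s> / <s, y>, with s = x_n - x_prev and
    y = grad f(x_n) - grad f(x_prev)."""
    s = x_n - x_prev
    y = g_n - g_prev
    sy = s @ y
    if sy <= 0:                 # curvature estimate unusable; fall back
        return lam_max
    return float(np.clip((s @ s) / sy, lam_min, lam_max))
```

On a quadratic f(x) = ½a‖x‖² this recovers 1/a, the inverse curvature, which is why it acts as a cheap second-order step size estimate.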

  20.–21. PGD Limitations – Repeated Projections • Accept P(z_t) if it satisfies the Armijo rule. [Figure: current point x_t, gradient step z_t = x_t − s∇f(x_t), and its projection P(z_t) onto the feasible set]

  22. PGD Limitations – Repeated Projections • PGD might require many projections before accepting a point. [Figure: repeated gradient steps z_t = x_t − s∇f(x_t), each followed by a projection P(z_t)]

  23. SPG Solution – Spectral Proj Gradient • SPG requires a single projection per step. [Figure: one gradient step z_t = x_t − λ_SPG ∇f(x_t) and a single projection P(z_t)]

  24. SPG Solution – Non-Monotone Rule • Handling function and gradient noise. • Non-monotone rule: $f(\mathbf{x}^t - s\nabla f(\mathbf{x}^t)) \leq \max_{0 \leq j \leq M} f(\mathbf{x}^{t-j}) - \gamma s \|\nabla f(\mathbf{x}^t)\|^2$ • [Plot: f(x) vs. time (s); SPG-GMKL descends to the global minimum despite noisy objective values]
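The acceptance test above fits in a few lines. A sketch (names ours) of the non-monotone rule, which compares the candidate against the maximum of the last M objective values rather than only the current one:

```python
def nonmonotone_accept(f_candidate, f_history, s, grad_norm_sq,
                       gamma=1e-4, M=10):
    """Accept a step of size s if
    f(x - s*grad) <= max of the last M values - gamma * s * ||grad||^2."""
    reference = max(f_history[-M:])
    return f_candidate <= reference - gamma * s * grad_norm_sq
```

A step whose value lies slightly above the current objective can still be accepted as long as it is below a recent peak; this tolerance to small upward fluctuations is what makes the rule robust when function values come from an imprecisely solved SVM.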

  25. PGD Limitations – Step Size Selection • The Armijo rule might get stuck due to noisy function values. • [Plot: f(x) vs. time (s); PGD stalls above the global minimum]

  26. SPG Solution – SVM Precision Tuning • [Figure: contour plots of the objective for SPG (left) and PGD (right), annotated with wall-clock times; SPG progresses in tenths of an hour (0.1–0.5 hr) where PGD takes multiple hours (1–3 hr)]

  27. SPG Advantages • SPG requires fewer function and gradient evaluations due to • The second-order spectral step length estimation. • The non-monotone line search criterion. • SPG is more robust to noisy function and gradient values due to the non-monotone line search criterion. • SPG never needs to solve an SVM with high precision due to our precision tuning strategy. • SPG needs to perform only a single projection per step.

  28. SPG Algorithm • 1: n ← 0 • 2: Initialize d⁰ randomly • 3: repeat • 4: α∗ ← SolveSVM(K_{dⁿ}, ϵ) • 5: λ ← SpectralStepLength • 6: pⁿ ← dⁿ − P(dⁿ − λ ∇W(dⁿ, α∗)) • 7: sⁿ ← Non-Monotone line search • 8: ϵ ← TuneSVMPrecision • 9: dⁿ⁺¹ ← dⁿ − sⁿ pⁿ • 10: until converged
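Putting the pieces together, the loop can be sketched on a generic objective. This is our illustrative reconstruction (all names ours), with an arbitrary smooth function standing in for the SVM-based objective; it is not the authors' SPG-GMKL implementation and omits the SVM precision tuning step:

```python
import numpy as np

def spg(f, grad, project, x0, iters=100, gamma=1e-4, M=10):
    """Spectral projected gradient: one projection per iteration,
    spectral (Barzilai-Borwein) step length, non-monotone line search."""
    x = project(np.asarray(x0, dtype=float))
    g = grad(x)
    lam = 1.0
    hist = [f(x)]
    for _ in range(iters):
        d = project(x - lam * g) - x          # the single projection per step
        if np.linalg.norm(d) < 1e-10:
            break                             # fixed point: converged
        s, f_ref = 1.0, max(hist[-M:])        # non-monotone reference value
        while s > 1e-12 and f(x + s * d) > f_ref + gamma * s * (g @ d):
            s *= 0.5
        x_new = x + s * d
        g_new = grad(x_new)
        sv, yv = x_new - x, g_new - g
        sy = sv @ yv
        lam = (sv @ sv) / sy if sy > 1e-12 else 1.0   # spectral step length
        x, g = x_new, g_new
        hist.append(f(x))
    return x
```

Unlike the PGD loop, backtracking here only rescales the already-projected direction d, so rejected step sizes cost no extra projections.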

  29. Results on Large Scale Data Sets • Covertype: sum of kernels subject to $\ell_{1.33}$ regularization • Number of training points: 581,012 • Number of kernels: 5 • SPG time taken: 64.46 hrs • SPG took 26 SVM evaluations • First SVM evaluation took 44 hours • Only 0.19% of support vectors were cached

  30. Results on Large Scale Data Sets • Sonar: sum of kernels subject to $\ell_{1.33}$ regularization • Number of training points: 208 • Number of kernels: 1 million • SPG time taken: 105.62 hrs • [Plot: log(Time) vs. log(#Kernel) for SPG and PGD at p = 1.33 and p = 1.66]

  31. Results on Large Scale Data Sets • Sum of kernels subject to $\ell_{p \geq 1}$ regularization

      Data Set   # Train  # Kernels  PGD p=1 (hrs)  SPG p=1 (hrs)  PGD p=1.33 (hrs)  SPG p=1.33 (hrs)
      Adult-9     32,561         50          35.84           4.55             31.77              4.42
      Cod-RNA     59,535         50              -          25.17             66.48             19.10
      KDDCup04    50,000         50              -          40.10                 -             42.20

  32. Results on Small Scale Data Sets • Sum of kernels subject to $\ell_1$ regularization

      Data Set        SimpleMKL (s)   Shogun (s)      PGD (s)        SPG (s)
      Wpbc            400 ± 128.4     15 ± 7.7        38 ± 17.6      6 ± 4.2
      Breast-Cancer   676 ± 356.4     12 ± 1.2        57 ± 85.1      5 ± 0.6
      Australian      383 ± 33.5      1094 ± 621.6    29 ± 7.1       10 ± 0.8
      Ionosphere      1247 ± 680.0    107 ± 18.8      1392 ± 824.2   39 ± 6.8
      Sonar           -               1468 ± 1252.7   935 ± 65.0     273 ± 64.0
