  1. Explicit Feature Methods for Accelerated Kernel Learning Purushottam Kar

  2. Quick Motivation
• Kernel algorithms (SVM, SVR, KPCA) have output $f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$
• The number of "support vectors" is typically large: provably a constant fraction of the training set size*
• Prediction time is $O(nd)$, where $x \in X \subset \mathbb{R}^d$
• Too slow for real-time applications
*[Steinwart NIPS 03, Steinwart-Christmann NIPS 08]

  3. The General Idea
• Approximate the kernel using an explicit feature map $Z: X \to \mathbb{R}^D$ such that $K(x, y) \approx Z(x)^\top Z(y)$
• Speeds up prediction time to $O(Dd) \ll O(nd)$: $f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x) \approx \sum_{i=1}^{n} \alpha_i \langle Z(x_i), Z(x) \rangle = w^\top Z(x)$, where $w = \sum_{i=1}^{n} \alpha_i Z(x_i)$ (see the sketch below)
• Speeds up training time as well
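A minimal sketch (not from the slides) of how an explicit feature map linearizes prediction; `kernel_predict`, `linearized_predict`, and `feature_map` are illustrative placeholders for any map $Z$ with $K(x, y) \approx Z(x)^\top Z(y)$, assuming numpy-style vectors:

```python
def kernel_predict(x, support_vectors, alphas, kernel):
    # Classical kernel prediction: O(n d) work per test point.
    return sum(a * kernel(sv, x) for sv, a in zip(support_vectors, alphas))

def precompute_weights(support_vectors, alphas, feature_map):
    # One-off step after training: w = sum_i alpha_i Z(x_i).
    return sum(a * feature_map(sv) for sv, a in zip(support_vectors, alphas))

def linearized_predict(x, w, feature_map):
    # Explicit-feature prediction: one feature map evaluation plus a dot product.
    return w @ feature_map(x)
```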

  4. Why Should Such Maps Exist?
• Mercer's theorem*: every PSD kernel $K$ has the expansion $K(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y)$
• The series converges uniformly to the kernel
• For every $\epsilon > 0$ there exists $D_\epsilon$ such that if we construct the map $Z(x) = \left(\sqrt{\lambda_1}\,\phi_1(x), \ldots, \sqrt{\lambda_{D_\epsilon}}\,\phi_{D_\epsilon}(x)\right) \in \mathbb{R}^{D_\epsilon}$, then for all $x, y \in X$, $\left|K(x, y) - Z(x)^\top Z(y)\right| \le \epsilon$
• Call such maps uniformly $\epsilon$-approximate
*[Mercer 09]

  5. Today's Agenda
• Some explicit feature map constructions
  • Randomized feature maps, e.g. translation invariant, rotation invariant
  • Deterministic feature maps, e.g. intersection, scale invariant
• Some "fast" random feature constructions
  • Translation invariant, dot product
• The BIG picture?

  6. Random Feature Maps: approximate recovery of kernel values with high confidence

  7. Translation Invariant Kernels*
• Kernels of the form $K(x, y) = k(x - y)$
• Gaussian kernel, Laplacian kernel
• Bochner's theorem**: for every such $k$ there exists a positive function $p$ with $k(x - y) = \int p(\omega) \cos\!\left(\omega^\top (x - y)\right) d\omega = \mathbb{E}_{\omega \sim p}\!\left[\cos\!\left(\omega^\top (x - y)\right)\right]$
• Finding $p$: take the inverse Fourier transform of $k$
• Select $\omega_i \sim p$ for $i = 1, \ldots, D$
• $Z_i: x \mapsto \left(\cos(\omega_i^\top x), \sin(\omega_i^\top x)\right)$
*[Rahimi-Recht NIPS 07], **special case for $X \subset \mathbb{R}^d$, [Bochner 33]

  8. Translation Invariant Kernels
• Empirical averages approximate expectations
• Let $Z: x \mapsto \frac{1}{\sqrt{D}}\left(Z_1(x), Z_2(x), \ldots, Z_D(x)\right)$
• $Z(x)^\top Z(y) = \frac{1}{D}\sum_{i=1}^{D} Z_i(x)^\top Z_i(y) = \frac{1}{D}\sum_{i=1}^{D} \cos\!\left(\omega_i^\top (x - y)\right) \approx \mathbb{E}_{\omega \sim p}\!\left[\cos\!\left(\omega^\top (x - y)\right)\right] = k(x - y)$
• Let us assume points $x, y \in B(0, R) \subset \mathbb{R}^d$
• Then we require $D \ge \Omega\!\left(\frac{d}{\epsilon^2} \log \frac{\sigma_p R}{\epsilon \delta}\right)$
• $\sigma_p$ depends on the spectrum of the kernel $k$

  9. Translation Invariant Kernels
• For the RBF kernel $K(x, y) = \exp\!\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right)$
• $p(\omega) = \left(\frac{\sigma^2}{2\pi}\right)^{d/2} \exp\!\left(-\frac{\sigma^2 \|\omega\|_2^2}{2}\right)$, i.e. $\omega \sim N(0, \sigma^{-2} I_d)$
• If the kernel $K$ offers a $\gamma$ margin, then we should require $D \gtrsim \frac{1}{\gamma^2} \log \frac{n}{\delta}$, where $x, y \in \mathbb{R}^d$ (a sketch of the construction follows)
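The following is a minimal random Fourier feature sketch for the RBF kernel in the spirit of slides 7-9; the numpy implementation and variable names are illustrative assumptions, not code from the talk:

```python
import numpy as np

def rff_map(X, D, sigma, rng=np.random.default_rng(0)):
    """Map rows of X (n x d) to D-dimensional random Fourier features."""
    n, d = X.shape
    # Bochner: for the RBF kernel the spectral density is Gaussian, omega ~ N(0, I / sigma^2).
    omega = rng.normal(scale=1.0 / sigma, size=(d, D // 2))
    proj = X @ omega                                   # n x D/2 random projections
    # Stack cos and sin features; normalize so Z(x).Z(y) averages D/2 cosine terms.
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D // 2)

# Usage: the Gram matrix of the features approximates the exact RBF kernel.
X = np.random.default_rng(1).normal(size=(5, 3))
Z = rff_map(X, D=2000, sigma=1.0)
approx = Z @ Z.T
exact = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / 2.0)
print(np.max(np.abs(approx - exact)))  # small, shrinking roughly as 1/sqrt(D)
```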

  10. Rotation Invariant Kernels*
• Kernels of the form $K(x, y) = f(\langle x, y \rangle)$
• Polynomial kernels, exponential kernel
• Schoenberg's theorem**: $f(\langle x, y \rangle) = \sum_{n=0}^{\infty} a_n \langle x, y \rangle^n$ with $a_n \ge 0$
• Select degrees $N_i \sim \mathcal{N}$ over $\mathbb{N}$ for $i = 1, \ldots, D$
• To approximate $\langle x, y \rangle^{N_i}$: select $\omega_1, \ldots, \omega_{N_i} \sim \{-1, +1\}^d$ and set $Z_i: x \mapsto \sqrt{\frac{a_{N_i}}{\Pr[N_i]}} \prod_{j=1}^{N_i} \omega_j^\top x$
• Similar approximation guarantees as earlier (a sketch follows)
*[K.-Karnick AISTATS 12], **[Schoenberg 42]
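Below is a hedged sketch of this random Maclaurin construction for a dot-product kernel; the choice of the exponential kernel $f(z) = e^z$ (coefficients $a_n = 1/n!$) and the geometric degree distribution are illustrative assumptions:

```python
import math
import numpy as np

def random_maclaurin_map(X, D, rng=np.random.default_rng(0)):
    """Random features whose inner products estimate exp(<x, y>)."""
    n, d = X.shape
    Z = np.empty((n, D))
    for i in range(D):
        # Sample a degree N with P(N = k) = 1 / 2^(k+1), then N Rademacher vectors.
        N = rng.geometric(0.5) - 1                       # supported on 0, 1, 2, ...
        p_N = 0.5 ** (N + 1)
        a_N = 1.0 / math.factorial(N)                    # Maclaurin coefficient of exp
        omegas = rng.choice([-1.0, 1.0], size=(N, d))
        prods = np.prod(omegas @ X.T, axis=0) if N > 0 else np.ones(n)  # empty product is 1
        Z[:, i] = math.sqrt(a_N / p_N) * prods
    return Z / math.sqrt(D)

# Usage: Z(x).Z(y) is an unbiased (but high-variance) estimate of exp(<x, y>).
X = np.random.default_rng(1).normal(size=(4, 3)) * 0.5
Z = random_maclaurin_map(X, D=20000)
print(np.max(np.abs(Z @ Z.T - np.exp(X @ X.T))))  # error decreases as D grows
```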

  11. Deterministic Feature Maps: exact/approximate recovery of kernel values with certainty

  12. Intersection Kernel*
• Kernel of the form $K(x, y) = \sum_{k=1}^{d} \min(x_k, y_k)$
• Exploit the additive separability of the kernel: $f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x) = \sum_{k=1}^{d} \sum_{i=1}^{n} \alpha_i \min(x_{ik}, x_k) = \sum_{k=1}^{d} h_k(x_k)$
• Each $h_k(x_k)$ can be calculated in $O(\log n)$ time!
• Requires $O(n \log n)$ preprocessing time per dimension
• Prediction time almost independent of $n$
• However, a deterministic and exact method: no $\epsilon$ or $\delta$ (a sketch follows)
*[Maji-Berg-Malik CVPR 08]
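A minimal sketch of the $O(\log n)$-per-dimension evaluation trick, assuming non-negative features and numpy arrays for the dual coefficients; class and variable names are illustrative:

```python
import numpy as np

class IntersectionPredictor:
    def __init__(self, support_vectors, alphas):
        # support_vectors: n x d non-negative array, alphas: length-n dual coefficients.
        n, d = support_vectors.shape
        # Per dimension k: sort support values, precompute prefix sums of alpha_i * x_ik
        # and suffix sums of alpha_i (O(n log n) preprocessing per dimension).
        self.sorted_vals, self.prefix_ax, self.suffix_a = [], [], []
        for k in range(d):
            order = np.argsort(support_vectors[:, k])
            v, a = support_vectors[order, k], alphas[order]
            self.sorted_vals.append(v)
            self.prefix_ax.append(np.concatenate([[0.0], np.cumsum(a * v)]))
            self.suffix_a.append(np.concatenate([np.cumsum(a[::-1])[::-1], [0.0]]))

    def predict(self, x):
        # h_k(s) = sum_{i: x_ik <= s} alpha_i * x_ik  +  s * sum_{i: x_ik > s} alpha_i
        total = 0.0
        for k, s in enumerate(x):
            j = np.searchsorted(self.sorted_vals[k], s, side="right")  # binary search
            total += self.prefix_ax[k][j] + s * self.suffix_a[k][j]
        return total
```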

  13. Scale Invariant Kernels*
• Kernels of the form $K(x, y) = \sum_{k=1}^{d} K_0(x_k, y_k)$ where $K_0(cx, cy) = c\, K_0(x, y)$ for $c \ge 0$
• Bochner's theorem still applies**: $K_0(x_k, y_k) = \sqrt{x_k y_k}\; \mathcal{K}(\log x_k - \log y_k)$
• Involves working with $\mathcal{K}$
• Restrict the domain so that we have a Fourier series $\mathcal{K}(\lambda) = \sum_{j=-\infty}^{\infty} c_j e^{\mathrm{i} j \omega_0 \lambda}$
• Use only the lower frequencies $j \in \{-t, \ldots, t\}$
• Deterministic $\epsilon$-approximate maps (a sketch follows)
*[Vedaldi-Zisserman CVPR 10], **[K. 12]
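As one concrete (assumed) instance, the sketch below builds such a deterministic low-frequency map for the additive chi-squared kernel $K_0(x, y) = \frac{2xy}{x + y}$, whose signature has spectrum $\kappa(\omega) = \operatorname{sech}(\pi\omega)$; the sampling period $L$ and truncation level $t$ are illustrative parameters:

```python
import numpy as np

def chi2_feature_map(X, L=0.5, t=2):
    """Map non-negative features X (n x d) to a deterministic (2t+1)*d dimensional embedding."""
    X = np.asarray(X, dtype=float)
    logX = np.log(np.clip(X, 1e-12, None))
    feats = [np.sqrt(L * X)]                          # frequency j = 0, kappa(0) = 1
    for j in range(1, t + 1):
        scale = np.sqrt(2 * L * X / np.cosh(np.pi * j * L))
        feats.append(scale * np.cos(j * L * logX))    # real part of frequency j
        feats.append(scale * np.sin(j * L * logX))    # imaginary part of frequency j
    return np.hstack(feats)

# Usage: inner products of embeddings approximate sum_k 2 x_k y_k / (x_k + y_k).
X = np.random.default_rng(0).uniform(0.1, 1.0, size=(4, 3))
Z = chi2_feature_map(X)
exact = np.array([[np.sum(2 * a * b / (a + b)) for b in X] for a in X])
print(np.max(np.abs(Z @ Z.T - exact)))  # approximation improves as t grows and L shrinks
```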

  14. Fast Feature Maps: accelerated random feature constructions

  15. Fast Fourier Features
• Special case of the RBF kernel $K(x, y) = \exp\!\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right)$
• Old method: $\Omega \in \mathbb{R}^{D \times d}$ with $\Omega_{ij} \sim N(0, \sigma^{-2})$, taking $O(Dd)$ time for $Z: x \mapsto \cos(\Omega x)$
• Instead use $\Omega = S H G \Pi H B$
• $H$ is the Hadamard transform, $\Pi$ is a random permutation; $S$, $G$, $B$ are random diagonal scaling, Gaussian, and sign matrices
• Prediction time $O(D \log d)$, with $\mathbb{E}\!\left[Z(x)^\top Z(y)\right] = K(x, y)$
• Rows of $\Omega$ are (non-independent) Gaussian vectors
• Correlations are sufficiently low: $\mathrm{Var}\!\left[Z(x)^\top Z(y)\right] \le O(1/D)$
• However, exponential convergence (for now) only for $D = d$ (a sketch follows)
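A rough sketch of the $\Omega = S H G \Pi H B$ construction; for clarity it uses dense scipy Hadamard matrices rather than an $O(d \log d)$ fast transform, and the exact rescaling in $S$ is an assumed choice made to match Gaussian row norms:

```python
import numpy as np
from scipy.linalg import hadamard

def fastfood_block(d, sigma, rng):
    """One d x d block of approximately Gaussian frequencies (d must be a power of two)."""
    H = hadamard(d) / np.sqrt(d)                  # orthonormal Walsh-Hadamard matrix
    B = np.diag(rng.choice([-1.0, 1.0], size=d))  # random sign flips
    Pi = np.eye(d)[rng.permutation(d)]            # random permutation matrix
    g = rng.normal(size=d)
    G = np.diag(g)                                # diagonal Gaussian
    # Rescale rows so their norms match those of i.i.d. N(0, I) vectors (assumed choice).
    S = np.diag(np.sqrt(rng.chisquare(d, size=d)) * np.sqrt(d) / np.linalg.norm(g))
    return (S @ H @ G @ Pi @ H @ B) / sigma

def fastfood_features(X, D, sigma=1.0, rng=np.random.default_rng(0)):
    """Stack D // d blocks; D should be a multiple of the (power-of-two) input dimension d."""
    d = X.shape[1]
    Omega = np.vstack([fastfood_block(d, sigma, rng) for _ in range(D // d)])
    proj = X @ Omega.T                            # doable in O(D log d) with a fast transform
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)
```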

  16. Fast Taylor Features
• Special case of the polynomial kernel $K(x, y) = \left(\langle x, y \rangle + c\right)^p$
• Earlier method $Z_i: x \mapsto \prod_{j=1}^{p} \omega_j^\top x$ takes $O(pDd)$ time
• New method* takes $O\!\left(p(d + D \log D)\right)$ time
• The earlier method works (a bit) better in certain regimes; should be possible to improve the new method as well
• Crucial idea: $\langle x, y \rangle^p = \langle x^{\otimes p}, y^{\otimes p} \rangle$
• Count Sketch**: $C: v \mapsto C(v)$ such that $\langle C(u), C(v) \rangle \approx \langle u, v \rangle$
• Create a sketch $C(T) \in \mathbb{R}^D$ of the tensor $T = x^{\otimes p}$
• Create $p$ independent count sketches $C_1(x), \ldots, C_p(x)$
• Can show that $C(x^{\otimes p}) \sim C_1(x) * \cdots * C_p(x)$ (circular convolution), i.e. a product in the Fourier domain
• Can be done in $O\!\left(p(d + D \log D)\right)$ time using the FFT (a sketch follows)
*[Pham-Pagh KDD 13], **[Charikar et al 02]
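A minimal Tensor Sketch for the homogeneous polynomial kernel $\langle x, y \rangle^p$, combining $p$ count sketches in the Fourier domain; hash choices, sizes, and names below are illustrative:

```python
import numpy as np

def tensor_sketch(X, D, p, rng=np.random.default_rng(0)):
    """Sketch each row x of X so that inner products of sketches estimate <x, y>^p."""
    n, d = X.shape
    # p independent count-sketch hash functions: h_j: [d] -> [D] and s_j: [d] -> {+1, -1}.
    hashes = rng.integers(0, D, size=(p, d))
    signs = rng.choice([-1.0, 1.0], size=(p, d))
    fft_prod = np.ones((n, D), dtype=complex)
    for j in range(p):
        C_j = np.zeros((n, D))
        for k in range(d):                        # count sketch of every row of X
            C_j[:, hashes[j, k]] += signs[j, k] * X[:, k]
        # Convolving the p count sketches = multiplying their FFTs.
        fft_prod *= np.fft.fft(C_j, axis=1)
    return np.real(np.fft.ifft(fft_prod, axis=1))

# Usage: inner products of sketches approximate the polynomial kernel <x, y>^p.
X = np.random.default_rng(1).normal(size=(4, 6))
Z = tensor_sketch(X, D=4096, p=2)
print(np.max(np.abs(Z @ Z.T - (X @ X.T) ** 2)))  # error shrinks as D grows
```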

  17. The BIG Picture: an overview of explicit feature methods

  18. Other Feature Construction Methods
• Efficiently evaluable maps for efficient prediction
• Fidelity to a particular kernel is not an objective
• Hard(er?) to give generalization guarantees
• Local Deep Kernel Learning (LDKL)*
  • Sparse features speed up evaluation time to $O(d \log D)$
  • Training phase more involved
• Pairwise Piecewise Linear Embedding (PL2)**
  • Encodes (a discretization of) individual features and pairs of features
  • Constructs a feature map whose dimension grows quadratically with $d$
  • Features are $O(d^2 + d)$-sparse
*[Jose et al ICML 13], **[Pele et al ICML 13]

  19. A Taxonomy of Feature Methods (by kernel dependence and data dependence)
• Kernel dependent, data dependent: Nyström methods (slow training, data aware, problem oblivious)
• Kernel dependent, data oblivious: explicit maps (fast training, data oblivious, problem oblivious)
• Kernel oblivious, data dependent: LDKL, PL2 (slow(er) training, data aware, problem aware)

  20. Discussion: the next big thing in accelerated kernel learning?
