School of Computer Science & Informatics
Tricks for kernel methods in large datasets
Matthias Treder
Stellenbosch University MML, 10 May 2019
MATTHIAS TREDER · trederm@cardiff.ac.uk
OVERVIEW
• Denoising in RKHS
• Fast out-of-sample predictions for (kernel) FDA
• CNNs and applications
Denoising in RKHS
CHALLENGES FOR STATISTICAL MODELLING
• Large sample size
• Large number of variables
• Low SNR
Approaches considered here: instance averaging (Cichy et al.) and kernel methods
INSTANCE AVERAGING: CICHY ET AL
• Subjects repeatedly viewed visual stimuli while MEG was recorded
• Instances (trials) of the same class were partitioned into groups of 40 (Cichy 2015) or 5 (Cichy 2017) and averaged
• A linear SVM was trained and tested on the averaged data
• Instance averaging shortened training/testing time and increased classification performance
Cichy, R. M., Ramirez, F. M., & Pantazis, D. (2015). Can visual information encoded in cortical columns be decoded from magnetoencephalography data in humans? NeuroImage, 121, 193–204. https://doi.org/10.1016/j.neuroimage.2015.07.011
Cichy, R. M., & Pantazis, D. (2017). Multivariate pattern analysis of MEG and EEG: A comparison of representational structure in time and space. NeuroImage, 158, 441–454. https://doi.org/10.1016/j.neuroimage.2017.07.023
INSTANCE AVERAGING
[Figure: scatter plots of the data before vs. after averaging]
INSTANCE AVERAGING: GAUSSIAN DENSITY
Before averaging: $x \sim \mathcal{N}(m, \Sigma)$
After averaging groups of $n$ instances: $\bar{x} \sim \mathcal{N}(m, \tfrac{1}{n}\Sigma)$
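The effect is easy to reproduce numerically. Below is a minimal numpy sketch (not from the slides; the group size and covariance are arbitrary choices): averaging groups of n instances within a class leaves the class mean unchanged while shrinking the noise covariance by roughly a factor of 1/n.

```python
import numpy as np

rng = np.random.default_rng(0)
m = np.array([1.0, -0.5])                   # class mean
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])  # noise covariance

# 1000 instances of one class: x ~ N(m, Sigma)
X = rng.multivariate_normal(m, Sigma, size=1000)

# partition into groups of n = 5 instances and average within each group
n = 5
X_avg = X.reshape(-1, n, X.shape[1]).mean(axis=1)

print(np.cov(X, rowvar=False))      # close to Sigma
print(np.cov(X_avg, rowvar=False))  # close to Sigma / n
```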
What about nonlinear classification problems?
INSTANCE AVERAGING IN RADIAL DATA
[Figure: radial two-class data before vs. after averaging in input space]
IDEA
The map $\phi$ takes the input space $\mathcal{X}$ into a feature space $\mathcal{F}$: can we perform the averaging in $\mathcal{F}$?
Kernel methods
KERNEL METHODS: PROJECTION
KERNEL METHODS: PROJECTION
$\phi : \mathbb{R}^1 \mapsto \mathbb{R}^2$, $\quad \phi(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}$
‘KERNEL TRICK’
$\phi : \mathcal{X} \mapsto \mathcal{F}$ maps into a high- (possibly infinite-) dimensional space, so $\phi(x)$ is often not actually computable.
Kernel methods such as SVM and kernel regression only require inner products between data points, $\langle \phi(x), \phi(x') \rangle$.
If the space is a reproducing kernel Hilbert space (RKHS), there exists a kernel function $k$ such that $\langle \phi(x), \phi(x') \rangle_{\mathcal{F}} = k(x, x')$.
Using the kernel function, all computations are carried out efficiently in input space.
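As a quick illustration (my own sketch, not part of the slides): for the explicit map $\phi(x) = [x, x^2]$ from the previous slide, the corresponding kernel is $k(x, x') = x x' + x^2 x'^2$, so the feature-space inner product can be evaluated without ever forming $\phi$; for the Gaussian (RBF) kernel the feature space is infinite-dimensional, and the kernel function is the only practical route.

```python
import numpy as np

def phi(x):
    """Explicit feature map R^1 -> R^2 from the previous slide."""
    return np.array([x, x ** 2])

def k_poly(x, xp):
    """Kernel reproducing <phi(x), phi(x')> without computing phi."""
    return x * xp + x ** 2 * xp ** 2

x, xp = 1.3, -0.7
assert np.isclose(phi(x) @ phi(xp), k_poly(x, xp))

def k_rbf(x, xp, gamma=1.0):
    """Gaussian kernel: an inner product in an infinite-dimensional RKHS."""
    return np.exp(-gamma * (x - xp) ** 2)
```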
KERNEL AVERAGING USING THE KERNEL TRICK
• Problem: we cannot directly compute the inner product between the averaged samples $z$ and $z'$
• However, we can evaluate the kernel function for the original samples $x_1, x_2$ and $x_1', x_2'$
• Using the bilinearity of the inner product, we can recover $\langle z, z' \rangle$ (see the derivation below)
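Written out for groups of two samples (my notation, following the bullet points above): with $z = \tfrac{1}{2}(\phi(x_1) + \phi(x_2))$ and $z' = \tfrac{1}{2}(\phi(x_1') + \phi(x_2'))$, bilinearity of the inner product gives

$$\langle z, z' \rangle_{\mathcal{F}} = \frac{1}{4} \sum_{i=1}^{2} \sum_{j=1}^{2} \langle \phi(x_i), \phi(x_j') \rangle_{\mathcal{F}} = \frac{1}{4} \bigl( k(x_1, x_1') + k(x_1, x_2') + k(x_2, x_1') + k(x_2, x_2') \bigr),$$

i.e. the kernel between two averaged samples is simply the average of the pairwise kernel evaluations between the original samples.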
EXPERIMENTS
• 3 simulated datasets
• UCI datasets: gene expression; p53 mutants; cardiotocography
• EEG dataset
• Two kernel classifiers: SVM and kernel FDA
• 5-fold cross-validation
• Averaging approaches: none; instance (averaging in input space); kernel (averaging in RKHS)
A toy version of this comparison is sketched below.
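A rough toy version of the comparison (my own sketch, not the actual experimental code; the group size, kernel and dataset parameters are arbitrary choices) on a radial two-class problem. On such data, input-space instance averaging collapses the ring structure and hurts accuracy, which is the motivation for averaging in the RKHS instead.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)

def radial_two_class(n_per_class=500):
    """Two concentric rings: a simple nonlinear classification problem."""
    r = np.concatenate([rng.normal(1.0, 0.3, n_per_class),
                        rng.normal(3.0, 0.3, n_per_class)])
    theta = rng.uniform(0, 2 * np.pi, 2 * n_per_class)
    X = np.c_[r * np.cos(theta), r * np.sin(theta)]
    y = np.repeat([0, 1], n_per_class)
    return X, y

def average_instances(X, y, group_size=5):
    """Average groups of instances within each class (input-space averaging)."""
    X_avg, y_avg = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        n_groups = len(Xc) // group_size
        X_avg.append(Xc[:n_groups * group_size]
                     .reshape(n_groups, group_size, -1).mean(axis=1))
        y_avg.append(np.full(n_groups, c))
    return np.vstack(X_avg), np.concatenate(y_avg)

X, y = radial_two_class()
Xa, ya = average_instances(X, y)

clf = SVC(kernel='rbf', gamma='scale')
print(cross_val_score(clf, X, y, cv=5).mean())    # no averaging
print(cross_val_score(clf, Xa, ya, cv=5).mean())  # instance averaging
```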
SIMULATED DATASETS
[Figure: scatter plots of the three simulated datasets: Linear (4 classes, axes LDA 1 / LDA 2), Radial (2 classes, axes Variable 1 / Variable 2), Checkerboard (3 classes, axes Variable 1 / Variable 2)]
EEG DATA
[Figure: classification accuracy over time for no averaging vs. group sizes ℓ = 20 and ℓ = 50]
REAL DATA
KERNEL AVERAGING
[Figure: a 6×6 kernel matrix over the original samples is reduced to a 2×2 kernel matrix over the averaged samples by averaging the corresponding blocks of entries]
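In matrix form, the reduction sketched above is a block average of the full kernel matrix. A small numpy illustration (my own, assuming consecutive samples of the same class form the averaging groups):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 3))           # 6 raw samples, 3 features
groups = np.array([0, 0, 0, 1, 1, 1])     # two averaging groups of size 3

K = rbf_kernel(X, X)                      # 6 x 6 kernel matrix on raw samples

# M[g, i] = 1/|group g| if sample i belongs to group g, else 0, so that
# K_avg = M K M^T holds the pairwise kernels of the feature-space averages.
M = np.stack([(groups == g) / np.sum(groups == g) for g in np.unique(groups)])
K_avg = M @ K @ M.T                       # 2 x 2 kernel matrix on averaged samples

print(K_avg)
```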
DISCUSSION
• Instance averaging (e.g. Cichy 2015, 2017) improves the SNR of the data only in linear classification problems
• Kernel averaging improves the SNR of the data in both linear and nonlinear classification problems
• Smaller kernel matrix: higher speed, lower memory consumption
• Useful for many training/testing iterations (e.g. permutation testing)
• Large datasets: patients vs controls; ERPs; gene expression
Fast out-of-sample predictions for kernel FDA
(NeurIPS’18: reject, ESANN’19: accept)
MOTIVATION
Can we exploit the redundancies in the training/test sets in multi-class kernel FDA?
Example: classification of event-related potentials (ERPs) in EEG data (attended vs. unattended stimuli), with LDA accuracy computed at every time point.
[Figure: ERP amplitude over time for attended vs. unattended stimuli; LDA classification accuracy over time]
600 time points × 5 folds × 5 repetitions × 1000 permutations × 20 participants = 300,000,000 train/test iterations
$X \in \mathbb{R}^{n \times p}$: predictor/feature matrix
$K \in \mathbb{R}^{n \times n}$: kernel matrix
$G = K (K + \lambda I_n)^{-1}$: kernel ‘hat’ matrix, with submatrices $G_{Te}$ (test rows/columns) and $G_{Tr,Te}$ (train rows / test columns)
$y \in \mathbb{R}^{n}$: class labels
$Y \in \mathbb{R}^{n \times c}$: class indicator matrix (two classes shown)
$\hat{y}_{in} = G y$: in-sample predictions; $\hat{y}_{out}$: out-of-sample predictions
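A minimal sketch of these quantities (my own code; it assumes the regression formulation, i.e. kernel ridge regression on the labels, with an arbitrary Gaussian kernel and regularisation λ):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
n, p, lam = 100, 10, 1.0

X = rng.standard_normal((n, p))                       # predictor matrix, n x p
y = np.sign(X[:, 0] + 0.5 * rng.standard_normal(n))   # class labels in {-1, +1}

K = rbf_kernel(X, X)                                  # kernel matrix, n x n
G = K @ np.linalg.inv(K + lam * np.eye(n))            # kernel 'hat' matrix

y_in = G @ y                                          # in-sample predictions
```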
DIRECT OUT-OF-SAMPLE PREDICTIONS FOR TWO-CLASS FDA
Sherman-Morrison-Woodbury formula:
$\hat{y}^{out}_{Te} = (I - G_{Te})^{-1} \bigl( \hat{y}^{in}_{Te} - G_{Te}\, y_{Te} \bigr)$   (⋆)
What about multi-class FDA?
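The shortcut can be checked numerically. The sketch below (my own, again using the kernel ridge regression formulation on ±1 labels) compares formula (⋆), which only uses quantities from the full-data fit, against explicitly refitting on the training rows:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(4)
n, p, lam, n_te = 100, 10, 1.0, 20

X = rng.standard_normal((n, p))
y = np.sign(X[:, 0] + 0.5 * rng.standard_normal(n))

K = rbf_kernel(X, X)
G = K @ np.linalg.inv(K + lam * np.eye(n))
y_in = G @ y                                   # in-sample predictions

te = np.arange(n_te)                           # held-out (test) rows
tr = np.arange(n_te, n)                        # training rows

# Formula (*): out-of-sample predictions from the full-data fit alone
G_te = G[np.ix_(te, te)]
y_out = np.linalg.solve(np.eye(n_te) - G_te, y_in[te] - G_te @ y[te])

# Explicit refit on the training rows only, for comparison
alpha = np.linalg.solve(K[np.ix_(tr, tr)] + lam * np.eye(n - n_te), y[tr])
y_out_refit = K[np.ix_(te, tr)] @ alpha

assert np.allclose(y_out, y_out_refit)
```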
MULTI-CLASS FISHER DISCRIMINANT ANALYSIS
Find the discriminant subspace $W = [w_1, w_2, \ldots] \in \mathbb{R}^{p \times (c-1)}$ by solving the generalized eigenvalue problem $S_b W = S_w W \Lambda$, where $S_b$ is built from the class means and $S_w$ is the within-class covariance matrix.
But: multi-class FDA is not equivalent to multivariate regression :-/
[Figure: discriminant axes $w_1$, $w_2$ and the class means of three classes]
[Diagram: Optimal Scoring (OS) links multi-class Fisher Discriminant Analysis (FDA) and Canonical Correlation Analysis (CCA), and likewise their kernel versions, Kernel Fisher Discriminant Analysis (KFDA) and Kernel Canonical Correlation Analysis (KCCA)]
OPTIMAL SCORING (OS)
Objective: find $W = [w_1, w_2, \ldots] \in \mathbb{R}^{p \times (c-1)}$ and optimal score vectors $\Theta = [\theta_1, \theta_2, \ldots] \in \mathbb{R}^{c \times (c-1)}$ that solve
$\arg\min_{W, \Theta} \; \lVert X_{Tr} W - Y_{Tr} \Theta \rVert_2^2$
Step 1 (multivariate regression):
$B = \arg\min_B \lVert X_{Tr} B - Y_{Tr} \rVert_2^2$, $\quad \hat{Y}^{reg}_{Tr} = X_{Tr} B$
Fast update for the hold-out fold:
$\tilde{Y}^{reg}_{Te} = (I - G_{Te})^{-1} \bigl( \hat{Y}^{in}_{Te} - G_{Te}\, Y_{Te} \bigr)$   (⋆)
$\tilde{Y}^{reg}_{Tr} = \hat{Y}^{in}_{Tr} - G_{Tr,Te} \bigl( Y_{Te} - \tilde{Y}^{reg}_{Te} \bigr)$
Step 2 (rotation and scaling):
$(\Theta, [\alpha_1, \alpha_2, \ldots]) = \mathrm{eig}\bigl( (\hat{Y}^{reg}_{Tr})^\top Y_{Tr} \bigr)$
$W = B \Theta D$, $\quad \hat{Y}^{out}_{Te} = \tilde{Y}^{reg}_{Te}\, \Theta D$, $\quad D_{ii} = \alpha_i^2 (1 - \alpha_i)^2 / n$
COMPLEXITY FOR K-FOLD CV (KERNEL CASE)
Classical approach (k-fold):
• Once: calculate $K$: $\mathcal{O}(n^2 p)$
• In every fold: invert $K_{Tr}$: $\mathcal{O}(k\, n_{Tr}^3)$
Optimal scoring + matrix update:
• Once: calculate and invert $K$: $\mathcal{O}(n^2 p + n^3)$
• In every fold (step 1 of OS): calculate the update $\tilde{Y}^{reg}_{Te}$: $\mathcal{O}(k\, n_{Te}^3)$; calculate the update $\tilde{Y}^{reg}_{Tr}$: $\mathcal{O}(k\, n_{Tr} n_{Te}^2)$
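For a rough sense of scale (a back-of-the-envelope example, not a figure from the paper): with n = 1,000 samples and 5-fold CV, n_Tr = 800 and n_Te = 200. The classical route inverts an 800×800 kernel matrix in every fold, about 5 × 800³ ≈ 2.6 × 10⁹ operations per CV run, whereas the update route inverts the full 1,000×1,000 matrix once (≈ 10⁹ operations) and then costs only about 5 × (200³ + 800·200²) ≈ 2 × 10⁸ operations across the folds; the saving is multiplied when the same folds are reused over many time points, repetitions and permutations.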