School of Computer Science & Informatics
Tricks for kernel methods in large datasets
Matthias Treder
Stellenbosch University MML, 10 May 2019
MATTHIAS TREDER · trederm@cardiff.ac.uk
OVERVIEW
• Denoising in RKHS
• Fast out-of-sample predictions for (kernel) FDA
• CNNs and applications
Denoising in RKHS
CHALLENGES FOR STATISTICAL MODELLING
• Large sample size
• Large number of variables
• Low SNR
Approaches considered here: instance averaging (Cichy et al.) and kernel methods
INSTANCE AVERAGING: CICHY ET AL
• Subjects repeatedly viewed visual stimuli while MEG was recorded
• Instances (trials) of the same class were partitioned into groups of 40 (Cichy 2015) or 5 (Cichy 2017) and averaged
• A linear SVM was trained and tested on the averaged data
• Instance averaging shortened training/testing time and increased classification performance
Cichy, R. M., Ramirez, F. M., & Pantazis, D. (2015). Can visual information encoded in cortical columns be decoded from magnetoencephalography data in humans? NeuroImage, 121, 193–204. https://doi.org/10.1016/j.neuroimage.2015.07.011
Cichy, R. M., & Pantazis, D. (2017). Multivariate pattern analysis of MEG and EEG: A comparison of representational structure in time and space. NeuroImage, 158, 441–454. https://doi.org/10.1016/j.neuroimage.2017.07.023
INSTANCE AVERAGING
[Figure: scatter plots of the data before vs. after averaging]
INSTANCE AVERAGING: GAUSSIAN DENSITY
Before averaging: $x \sim \mathcal{N}(m, \Sigma)$
After averaging groups of $n$ instances: $\bar{x} \sim \mathcal{N}(m, \tfrac{1}{n}\Sigma)$
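The effect is easy to reproduce numerically. Below is a minimal numpy sketch (not from the slides; the group size and covariance are arbitrary choices): averaging groups of n instances within a class leaves the class mean unchanged while shrinking the noise covariance by roughly a factor of 1/n.

```python
import numpy as np

rng = np.random.default_rng(0)
m = np.array([1.0, -0.5])                   # class mean
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])  # noise covariance

# 1000 instances of one class: x ~ N(m, Sigma)
X = rng.multivariate_normal(m, Sigma, size=1000)

# partition into groups of n = 5 instances and average within each group
n = 5
X_avg = X.reshape(-1, n, X.shape[1]).mean(axis=1)

print(np.cov(X, rowvar=False))      # close to Sigma
print(np.cov(X_avg, rowvar=False))  # close to Sigma / n
```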
What about nonlinear classification problems?
INSTANCE AVERAGING IN RADIAL DATA
[Figure: radial two-class data before vs. after averaging in input space]
IDEA
The map $\phi$ takes the input space $\mathcal{X}$ into a feature space $\mathcal{F}$: can we perform the averaging in $\mathcal{F}$?
Kernel methods
KERNEL METHODS: PROJECTION
KERNEL METHODS: PROJECTION
$\phi : \mathbb{R}^1 \mapsto \mathbb{R}^2$, $\quad \phi(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}$
‘KERNEL TRICK’
$\phi : \mathcal{X} \mapsto \mathcal{F}$ maps into a high- (possibly infinite-) dimensional space, so $\phi(x)$ is often not actually computable.
Kernel methods such as SVM and kernel regression only require inner products between data points, $\langle \phi(x), \phi(x') \rangle$.
If the space is a reproducing kernel Hilbert space (RKHS), there exists a kernel function $k$ such that $\langle \phi(x), \phi(x') \rangle_{\mathcal{F}} = k(x, x')$.
Using the kernel function, all computations are carried out efficiently in input space.
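As a quick illustration (my own sketch, not part of the slides): for the explicit map $\phi(x) = [x, x^2]$ from the previous slide, the corresponding kernel is $k(x, x') = x x' + x^2 x'^2$, so the feature-space inner product can be evaluated without ever forming $\phi$; for the Gaussian (RBF) kernel the feature space is infinite-dimensional, and the kernel function is the only practical route.

```python
import numpy as np

def phi(x):
    """Explicit feature map R^1 -> R^2 from the previous slide."""
    return np.array([x, x ** 2])

def k_poly(x, xp):
    """Kernel reproducing <phi(x), phi(x')> without computing phi."""
    return x * xp + x ** 2 * xp ** 2

x, xp = 1.3, -0.7
assert np.isclose(phi(x) @ phi(xp), k_poly(x, xp))

def k_rbf(x, xp, gamma=1.0):
    """Gaussian kernel: an inner product in an infinite-dimensional RKHS."""
    return np.exp(-gamma * (x - xp) ** 2)
```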
KERNEL AVERAGING USING THE KERNEL TRICK
• Problem: we cannot directly compute the inner product between the averaged samples $z$ and $z'$
• However, we can evaluate the kernel function for the original samples $x_1, x_2$ and $x_1', x_2'$
• Using the bilinearity of the inner product, we can recover $\langle z, z' \rangle$ (see the derivation below)
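Written out for groups of two samples (my notation, following the bullet points above): with $z = \tfrac{1}{2}(\phi(x_1) + \phi(x_2))$ and $z' = \tfrac{1}{2}(\phi(x_1') + \phi(x_2'))$, bilinearity of the inner product gives

$$\langle z, z' \rangle_{\mathcal{F}} = \frac{1}{4} \sum_{i=1}^{2} \sum_{j=1}^{2} \langle \phi(x_i), \phi(x_j') \rangle_{\mathcal{F}} = \frac{1}{4} \bigl( k(x_1, x_1') + k(x_1, x_2') + k(x_2, x_1') + k(x_2, x_2') \bigr),$$

i.e. the kernel between two averaged samples is simply the average of the pairwise kernel evaluations between the original samples.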
EXPERIMENTS
• 3 simulated datasets
• UCI datasets: gene expression; p53 mutants; cardiotocography
• EEG dataset
• Two kernel classifiers: SVM and kernel FDA
• 5-fold cross-validation
• Averaging approaches: none; instance (averaging in input space); kernel (averaging in RKHS)
A toy version of this comparison is sketched below.
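A rough toy version of the comparison (my own sketch, not the actual experimental code; the group size, kernel and dataset parameters are arbitrary choices) on a radial two-class problem. On such data, input-space instance averaging collapses the ring structure and hurts accuracy, which is the motivation for averaging in the RKHS instead.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)

def radial_two_class(n_per_class=500):
    """Two concentric rings: a simple nonlinear classification problem."""
    r = np.concatenate([rng.normal(1.0, 0.3, n_per_class),
                        rng.normal(3.0, 0.3, n_per_class)])
    theta = rng.uniform(0, 2 * np.pi, 2 * n_per_class)
    X = np.c_[r * np.cos(theta), r * np.sin(theta)]
    y = np.repeat([0, 1], n_per_class)
    return X, y

def average_instances(X, y, group_size=5):
    """Average groups of instances within each class (input-space averaging)."""
    X_avg, y_avg = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        n_groups = len(Xc) // group_size
        X_avg.append(Xc[:n_groups * group_size]
                     .reshape(n_groups, group_size, -1).mean(axis=1))
        y_avg.append(np.full(n_groups, c))
    return np.vstack(X_avg), np.concatenate(y_avg)

X, y = radial_two_class()
Xa, ya = average_instances(X, y)

clf = SVC(kernel='rbf', gamma='scale')
print(cross_val_score(clf, X, y, cv=5).mean())    # no averaging
print(cross_val_score(clf, Xa, ya, cv=5).mean())  # instance averaging
```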
SIMULATED DATASETS
[Figure: scatter plots of the three simulated datasets: Linear (4 classes, axes LDA 1 / LDA 2), Radial (2 classes, axes Variable 1 / Variable 2), Checkerboard (3 classes, axes Variable 1 / Variable 2)]
EEG DATA
[Figure: classification accuracy over time for no averaging vs. group sizes ℓ = 20 and ℓ = 50]
REAL DATA
KERNEL AVERAGING
[Figure: a 6×6 kernel matrix over the original samples is reduced to a 2×2 kernel matrix over the averaged samples by averaging the corresponding blocks of entries]
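In matrix form, the reduction sketched above is a block average of the full kernel matrix. A small numpy illustration (my own, assuming consecutive samples of the same class form the averaging groups):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 3))           # 6 raw samples, 3 features
groups = np.array([0, 0, 0, 1, 1, 1])     # two averaging groups of size 3

K = rbf_kernel(X, X)                      # 6 x 6 kernel matrix on raw samples

# M[g, i] = 1/|group g| if sample i belongs to group g, else 0, so that
# K_avg = M K M^T holds the pairwise kernels of the feature-space averages.
M = np.stack([(groups == g) / np.sum(groups == g) for g in np.unique(groups)])
K_avg = M @ K @ M.T                       # 2 x 2 kernel matrix on averaged samples

print(K_avg)
```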
DISCUSSION
• Instance averaging (e.g. Cichy 2015, 2017) improves the SNR of the data only in linear classification problems
• Kernel averaging improves the SNR of the data in both linear and nonlinear classification problems
• Smaller kernel matrix: higher speed, lower memory consumption
• Useful for many training/testing iterations (e.g. permutation testing)
• Large datasets: patients vs controls; ERPs; gene expression
Fast out-of-sample predictions for kernel FDA
(NeurIPS’18: reject, ESANN’19: accept)
MOTIVATION
Can we exploit the redundancies in the training/test sets in multi-class kernel FDA?
Example: classification of event-related potentials (ERPs) in EEG data (attended vs. unattended stimuli), with LDA accuracy computed at every time point.
[Figure: ERP amplitude over time for attended vs. unattended stimuli; LDA classification accuracy over time]
600 time points × 5 folds × 5 repetitions × 1000 permutations × 20 participants = 300,000,000 train/test iterations
$X \in \mathbb{R}^{n \times p}$: predictor/feature matrix
$K \in \mathbb{R}^{n \times n}$: kernel matrix
$G = K (K + \lambda I_n)^{-1}$: kernel ‘hat’ matrix, with submatrices $G_{Te}$ (test rows/columns) and $G_{Tr,Te}$ (train rows / test columns)
$y \in \mathbb{R}^{n}$: class labels
$Y \in \mathbb{R}^{n \times c}$: class indicator matrix (two classes shown)
$\hat{y}_{in} = G y$: in-sample predictions; $\hat{y}_{out}$: out-of-sample predictions
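A minimal sketch of these quantities (my own code; it assumes the regression formulation, i.e. kernel ridge regression on the labels, with an arbitrary Gaussian kernel and regularisation λ):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
n, p, lam = 100, 10, 1.0

X = rng.standard_normal((n, p))                       # predictor matrix, n x p
y = np.sign(X[:, 0] + 0.5 * rng.standard_normal(n))   # class labels in {-1, +1}

K = rbf_kernel(X, X)                                  # kernel matrix, n x n
G = K @ np.linalg.inv(K + lam * np.eye(n))            # kernel 'hat' matrix

y_in = G @ y                                          # in-sample predictions
```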
DIRECT OUT-OF-SAMPLE PREDICTIONS FOR TWO-CLASS FDA
Sherman-Morrison-Woodbury formula:
$\hat{y}^{out}_{Te} = (I - G_{Te})^{-1} \bigl( \hat{y}^{in}_{Te} - G_{Te}\, y_{Te} \bigr)$   (⋆)
What about multi-class FDA?
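The shortcut can be checked numerically. The sketch below (my own, again using the kernel ridge regression formulation on ±1 labels) compares formula (⋆), which only uses quantities from the full-data fit, against explicitly refitting on the training rows:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(4)
n, p, lam, n_te = 100, 10, 1.0, 20

X = rng.standard_normal((n, p))
y = np.sign(X[:, 0] + 0.5 * rng.standard_normal(n))

K = rbf_kernel(X, X)
G = K @ np.linalg.inv(K + lam * np.eye(n))
y_in = G @ y                                   # in-sample predictions

te = np.arange(n_te)                           # held-out (test) rows
tr = np.arange(n_te, n)                        # training rows

# Formula (*): out-of-sample predictions from the full-data fit alone
G_te = G[np.ix_(te, te)]
y_out = np.linalg.solve(np.eye(n_te) - G_te, y_in[te] - G_te @ y[te])

# Explicit refit on the training rows only, for comparison
alpha = np.linalg.solve(K[np.ix_(tr, tr)] + lam * np.eye(n - n_te), y[tr])
y_out_refit = K[np.ix_(te, tr)] @ alpha

assert np.allclose(y_out, y_out_refit)
```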
MULTI-CLASS FISHER DISCRIMINANT ANALYSIS
Find the discriminant subspace $W = [w_1, w_2, \ldots] \in \mathbb{R}^{p \times (c-1)}$ by solving the generalized eigenvalue problem $S_b W = S_w W \Lambda$, where $S_b$ is built from the class means and $S_w$ is the within-class covariance matrix.
But: multi-class FDA is not equivalent to multivariate regression :-/
[Figure: discriminant axes $w_1$, $w_2$ and the class means of three classes]
[Diagram: Optimal Scoring (OS) links multi-class Fisher Discriminant Analysis (FDA) and Canonical Correlation Analysis (CCA), and likewise their kernel versions, Kernel Fisher Discriminant Analysis (KFDA) and Kernel Canonical Correlation Analysis (KCCA)]
OPTIMAL SCORING (OS)
Objective: find $W = [w_1, w_2, \ldots] \in \mathbb{R}^{p \times (c-1)}$ and optimal score vectors $\Theta = [\theta_1, \theta_2, \ldots] \in \mathbb{R}^{c \times (c-1)}$ that solve
$\arg\min_{W, \Theta} \; \lVert X_{Tr} W - Y_{Tr} \Theta \rVert_2^2$
Step 1 (multivariate regression):
$B = \arg\min_B \lVert X_{Tr} B - Y_{Tr} \rVert_2^2$, $\quad \hat{Y}^{reg}_{Tr} = X_{Tr} B$
Fast update for the hold-out fold:
$\tilde{Y}^{reg}_{Te} = (I - G_{Te})^{-1} \bigl( \hat{Y}^{in}_{Te} - G_{Te}\, Y_{Te} \bigr)$   (⋆)
$\tilde{Y}^{reg}_{Tr} = \hat{Y}^{in}_{Tr} - G_{Tr,Te} \bigl( Y_{Te} - \tilde{Y}^{reg}_{Te} \bigr)$
Step 2 (rotation and scaling):
$(\Theta, [\alpha_1, \alpha_2, \ldots]) = \mathrm{eig}\bigl( (\hat{Y}^{reg}_{Tr})^\top Y_{Tr} \bigr)$
$W = B \Theta D$, $\quad \hat{Y}^{out}_{Te} = \tilde{Y}^{reg}_{Te}\, \Theta D$, $\quad D_{ii} = \alpha_i^2 (1 - \alpha_i)^2 / n$
COMPLEXITY FOR K-FOLD CV (KERNEL CASE)
Classical approach (k-fold):
• Once: calculate $K$: $\mathcal{O}(n^2 p)$
• In every fold: invert $K_{Tr}$: $\mathcal{O}(k\, n_{Tr}^3)$
Optimal scoring + matrix update:
• Once: calculate and invert $K$: $\mathcal{O}(n^2 p + n^3)$
• In every fold (step 1 of OS): calculate the update $\tilde{Y}^{reg}_{Te}$: $\mathcal{O}(k\, n_{Te}^3)$; calculate the update $\tilde{Y}^{reg}_{Tr}$: $\mathcal{O}(k\, n_{Tr} n_{Te}^2)$
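For a rough sense of scale (a back-of-the-envelope example, not a figure from the paper): with n = 1,000 samples and 5-fold CV, n_Tr = 800 and n_Te = 200. The classical route inverts an 800×800 kernel matrix in every fold, about 5 × 800³ ≈ 2.6 × 10⁹ operations per CV run, whereas the update route inverts the full 1,000×1,000 matrix once (≈ 10⁹ operations) and then costs only about 5 × (200³ + 800·200²) ≈ 2 × 10⁸ operations across the folds; the saving is multiplied when the same folds are reused over many time points, repetitions and permutations.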