Tricks for kernel methods in large datasets
Stellenbosch University MML 10 May 2019
Matthias Treder, School of Computer Science & Informatics, Cardiff University (trederm@cardiff.ac.uk)
OVERVIEW
- Denoising in RKHS
- (kernel) FDA
CHALLENGES FOR STATISTICAL MODELLING
- Large number
- Large sample size
- Low SNR

Possible remedies: kernel methods and instance averaging (Cichy et al.). In these studies, participants viewed stimuli while MEG was recorded; trials were partitioned into groups of 40 (Cichy 2015) or 5 (Cichy 2017) and averaged, which reduced training/testing time and increased classification performance.
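The averaging scheme can be sketched in a few lines. This is a minimal sketch, not the authors' code, assuming a generic (n_samples × n_features) data matrix; `group_size` plays the role of the group sizes above (40 or 5):

```python
import numpy as np

def average_instances(X, y, group_size, rng=None):
    """Average raw instances within each class in random groups of `group_size`.

    X : (n_samples, n_features) data matrix
    y : (n_samples,) class labels
    Returns the averaged instances and their class labels.
    """
    rng = np.random.default_rng(rng)
    Xa, ya = [], []
    for cls in np.unique(y):
        # Shuffle the indices of this class, then cut them into full groups
        idx = rng.permutation(np.flatnonzero(y == cls))
        for start in range(0, len(idx) - group_size + 1, group_size):
            group = idx[start:start + group_size]
            Xa.append(X[group].mean(axis=0))
            ya.append(cls)
    return np.array(Xa), np.array(ya)
```

Leftover trials that do not fill a complete group are discarded here; other policies (e.g. a smaller final group) are possible.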
INSTANCE AVERAGING: CICHY ET AL
Cichy, R. M., Ramirez, F. M., & Pantazis, D. (2015). Can visual information encoded in cortical columns be decoded from magnetoencephalography data in humans? NeuroImage, 121, 193–204. https://doi.org/10.1016/j.neuroimage.2015.07.011

Cichy, R. M., & Pantazis, D. (2017). Multivariate pattern analysis of MEG and EEG: A comparison of representational structure in time and space. NeuroImage, 158, 441–454. https://doi.org/10.1016/j.neuroimage.2017.07.023

INSTANCE AVERAGING
[Figure: data distributions before vs. after averaging]

INSTANCE AVERAGING: GAUSSIAN DENSITY

[Figure: Gaussian densities before vs. after averaging]
x ∼ 𝒩(m, Σ)        x̄ ∼ 𝒩(m, (1/n) Σ)
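This shrinkage of the covariance by 1/n is easy to verify numerically. A small sketch (the mean, covariance, and sample counts are arbitrary choices, not from the slides) that draws many groups of n samples and averages within each group:

```python
import numpy as np

rng = np.random.default_rng(0)
m = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
n = 10                                   # samples averaged per group

# Draw 50,000 groups of n samples each and average within each group
groups = rng.multivariate_normal(m, Sigma, size=(50_000, n))
xbar = groups.mean(axis=1)

# The averages keep the mean m, but their covariance shrinks to Sigma / n
print(xbar.mean(axis=0))   # close to m
print(np.cov(xbar.T))      # close to Sigma / 10
```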
What about nonlinear classification problems?
INSTANCE AVERAGING IN RADIAL DATA
[Figure: radial data before vs. after averaging]

IDEA

Map the data from input space 𝒳 into a feature space ℱ via ϕ. Can we perform the averaging in ℱ?
Kernel methods
KERNEL METHODS: PROJECTION
ϕ : ℝ ↦ ℝ²,   ϕ(x) = [x, x²]⊤
‘KERNEL TRICK’
ϕ : 𝒳 ↦ ℱ maps into a high- (possibly infinite-) dimensional space, so ϕ(x) is often not actually computable. However, kernel methods such as SVM and kernel regression only require inner products between data points, ⟨ϕ(x), ϕ(x′)⟩. If ℱ is a reproducing kernel Hilbert space (RKHS), there exists a kernel function k such that ⟨ϕ(x), ϕ(x′)⟩ℱ = k(x, x′). Using the kernel function, all computations are carried out efficiently in input space.
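As a concrete check, for the degree-2 polynomial kernel k(x, x′) = ⟨x, x′⟩² in two dimensions the feature map is explicitly computable, so the kernel trick can be verified directly (a toy example, not from the slides):

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, x') = <x, x'>^2 on 2-D inputs."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k(x, xp):
    """Degree-2 polynomial kernel, evaluated entirely in input space."""
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)), k(x, xp))  # identical values
```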
KERNEL AVERAGING USING THE KERNEL TRICK
Using only kernel evaluations between the original samples x₁, x₂, x₁′, x₂′, we can recover the inner product ⟨z, z′⟩ between the averaged samples z and z′.
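A sketch of this identity: with z and z′ the feature-space averages of two groups, ⟨z, z′⟩ equals the mean of the pairwise kernel evaluations between the groups. The degree-2 polynomial kernel is used here only because its feature map ϕ is explicitly computable, so both sides can be compared:

```python
import numpy as np

def k(x, xp):
    """Degree-2 polynomial kernel; its feature map phi is known explicitly."""
    return (x @ xp) ** 2

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

# Two groups of samples whose feature-space averages are z and z'
A = np.array([[1.0, 2.0], [0.5, -1.0]])
B = np.array([[2.0, 0.0], [1.0, 1.0], [-1.0, 3.0]])

z  = np.mean([phi(x) for x in A], axis=0)   # average in feature space
zp = np.mean([phi(x) for x in B], axis=0)

# <z, z'> is recoverable purely from kernel evaluations on the originals
kernel_avg = np.mean([[k(a, b) for b in B] for a in A])
print(z @ zp, kernel_avg)  # identical values
```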
EXPERIMENTS
SIMULATED DATASETS
EEG DATA
[Figure: classification accuracy (0.3–1.0) over time for no averaging vs. group sizes l = 20 and l = 50]

REAL DATA
KERNEL AVERAGING
[Figure: a 6x6 kernel matrix is reduced to a 2x2 kernel matrix by averaging blocks of kernel entries]
DISCUSSION
Fast out-of-sample predictions for kernel FDA
NeurIPS’18: reject ESANN’19: accept
MOTIVATION
300,000,000 train-test iterations
Classification of event-related potentials in EEG data.

Can we exploit the redundancies in the training/test sets in multi-class kernel FDA?
Notation:
- data matrix X ∈ ℝⁿˣᵖ, labels y ∈ ℝⁿ
- kernel matrix K ∈ ℝⁿˣⁿ
- class indicator matrix Y ∈ ℝⁿˣᶜ
- in-sample predictions ŷ_in = G y
- out-of-sample predictions ŷ_out
- kernel ‘hat’ matrix G = K (K + λ I_n)⁻¹
- G_Te: submatrix (test rows/cols)
- G_Tr,Te: submatrix (train rows/test cols)
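A minimal sketch of the hat matrix for kernel ridge regression (RBF kernel and dimensions are arbitrary choices): G maps the targets directly to in-sample predictions, which matches predicting with the dual weights α = (K + λI)⁻¹ y:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
lam = 0.1

# RBF kernel matrix from pairwise squared distances
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)

# Kernel 'hat' matrix G = K (K + lam I)^{-1}; in-sample predictions y_in = G y
G = K @ np.linalg.inv(K + lam * np.eye(len(y)))
y_in = G @ y

# Equivalent to fitting dual weights alpha and predicting with K @ alpha
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
print(np.allclose(y_in, K @ alpha))  # True
```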
DIRECT OUT-OF-SAMPLE PREDICTIONS FOR TWO-CLASS FDA
Sherman-Morrison-Woodbury formula
What about multi-class FDA?
ŷ_out,Te = (I − G_Te)⁻¹ (ŷ_in,Te − G_Te y_Te)    (⋆)
[Figure: three classes with their class means]

MULTI-CLASS FISHER DISCRIMINANT ANALYSIS
Find the discriminant subspace W = [w₁, w₂, …] ∈ ℝᵖˣ⁽ᶜ⁻¹⁾ by solving the generalized eigenvalue problem S_b W = S_w W Λ.

But: multi-class FDA is not equivalent to multivariate regression :-/
- Canonical correlation analysis (CCA)
- Multi-class Fisher Discriminant Analysis (FDA)
- Optimal Scoring (OS)
- Kernel Fisher Discriminant Analysis (KFDA)
- Kernel canonical correlation analysis (KCCA)
OPTIMAL SCORING (OS)

Objective: find W = [w₁, w₂, …] ∈ ℝᵖˣ⁽ᶜ⁻¹⁾ and optimal score vectors Θ = [θ₁, θ₂, …] ∈ ℝᶜˣ⁽ᶜ⁻¹⁾ that solve

  arg min_{w,θ} ‖X_Tr w − Y_Tr θ‖²₂

Step 1: multivariate regression

  B̃ = arg min_B ‖X_Tr B − Y_Tr‖²₂,   Ŷ_reg,Tr = X_Tr B̃

Step 2: rotation and scaling

  (Θ, [α₁, α₂, …]) = eig((Ŷ_reg,Tr)⊤ Y_Tr),   W = B̃ Θ D,   D_ii = α_i² (1 − α_i)² / n

Applying the matrix update (⋆) to step 1:

  Ŷ_reg,Te = (I − G_Te)⁻¹ (Ŷ_in,Te − G_Te Y_Te)    (⋆)
  Ŷ_reg,Tr = Ŷ_in,Tr − G_Tr,Te Ŷ_reg,Te
  Ŷ_out,Te = Ŷ_reg,Te Θ D

COMPLEXITY FOR K-FOLD CV (KERNEL CASE)
Classical approach (k-fold):
- once: calculate K: 𝒪(n² p)
- in every fold: invert K_Tr: 𝒪(k n_Tr³)

Optimal scoring + matrix update:
- once: calculate and invert K: 𝒪(n² p + n³)
- in every fold (step 1 OS): calculate update Ŷ_reg,Te: 𝒪(k n_Te³) and update Ŷ_reg,Tr: 𝒪(k n_Tr n_Te²)

The same update also applies to permutation tests.
SIMULATIONS (LINEAR CASE)
[Figure: speed increase (10x to 10,000x) for cross-validation and permutation tests, for n = 100 and n = 1000, as a function of the number of features (10–1000), classes (5, 10), and permutations (10, 100)]

DOES IT SCALE TO LARGE DATA?
Approximate the kernel matrix: K ≈ K_r = L L⊤, with L ∈ ℝⁿˣʳ, r ≪ n.

Kernel ‘hat’ matrix: G_r = L R L⊤ with R := λ⁻¹ (I_r − L⊤L (L⊤L + λ I_r)⁻¹) ∈ ℝʳˣʳ, never requiring the full n x n matrix.
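A sketch of the low-rank route (random L and arbitrary sizes): when K = LL⊤ holds exactly, the r × r formula reproduces G while only ever inverting r × r matrices, which follows from the Woodbury identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, lam = 200, 10, 0.5
L = rng.normal(size=(n, r))
K = L @ L.T                      # rank-r kernel matrix K_r = L L^T

# Full computation: requires inverting an n x n matrix
G_full = K @ np.linalg.inv(K + lam * np.eye(n))

# Low-rank computation: only r x r matrices are inverted
LtL = L.T @ L
R = (np.eye(r) - LtL @ np.linalg.inv(LtL + lam * np.eye(r))) / lam
G_low = L @ R @ L.T

print(np.allclose(G_full, G_low))  # True
```

In practice L would come from a low-rank factorization such as a Nyström or incomplete Cholesky approximation of K.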
CONCLUSION
CNNs and applications
Fun with GANs
GENERATIVE ADVERSARIAL NETWORK (GAN)
GAN EXAMPLES
Do you know these Hollywood actors?
GAN EXAMPLES
Style transfer
GENERATIVE MODELING OF BRAINS
- Generative Adversarial Networks (GANs) to produce synthetic MRIs
- synthetic data are usable for many purposes (no privacy concerns)
- modelling the transition from control to disease
GENERATIVE MODELING OF BRAINS
CNNs and applications
Symmetry in convolutional layers
TRANSLATION SYMMETRY
Translation symmetry is implemented via convolutions. But what about other types of symmetry?
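The translation equivariance of convolution can be demonstrated in a few lines. This is a 1-D sketch using circular convolution via the FFT (so shifts wrap around and the equality is exact): shifting the input and then convolving gives the same result as convolving and then shifting.

```python
import numpy as np

def circ_conv(x, w):
    """Circular convolution of signal x with filter w (zero-padded to len(x))."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(w, len(x))))

rng = np.random.default_rng(0)
x = rng.normal(size=32)          # input signal
w = rng.normal(size=5)           # filter
shift = 7

# Convolving a shifted input equals shifting the convolved output
a = circ_conv(np.roll(x, shift), w)
b = np.roll(circ_conv(x, w), shift)
print(np.allclose(a, b))  # True
```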
REFLECTION / MIRROR SYMMETRY
Many natural and artificial objects are mirror-symmetric The human visual system is very sensitive to reflection symmetry
REFLECTION / MIRROR SYMMETRY

[Figure: translation vs. reflection of an example image]
Research questions:
- Do large CNNs trained on ImageNet exhibit filter symmetry?
- Does filter symmetry evolve during training?
- (Does it affect CNN performance on image tasks?)

Filter symmetry is measured as the correlation between a filter and a reflected version of another filter from the same convolutional layer.
REFLECTION SYMMETRY BETWEEN FILTERS
[Figure: a 3x3x2 filter with its horizontal and vertical reflections]

Do large CNNs trained on ImageNet exhibit filter symmetry? Symmetric filter pairs (>75% correlation) are counted across all convolutional layers with kernel size at least 3x3.
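A sketch of this measurement (the filter bank here is synthetic; layer selection and the >75% threshold follow the slides): correlate every filter with the horizontally reflected version of every other filter in the same layer.

```python
import numpy as np

def reflection_correlations(filters):
    """Correlate each filter with the horizontally flipped version of every
    filter in the same layer.

    filters : (n_filters, height, width, channels) array
    Returns an (n_filters, n_filters) matrix of Pearson correlations,
    where entry (i, j) is corr(filter_i, hflip(filter_j)).
    """
    flipped = filters[:, :, ::-1, :]                 # horizontal reflection
    F = filters.reshape(len(filters), -1)
    R = flipped.reshape(len(filters), -1)
    # Standardize each flattened filter, then take normalized dot products
    F = (F - F.mean(1, keepdims=True)) / F.std(1, keepdims=True)
    R = (R - R.mean(1, keepdims=True)) / R.std(1, keepdims=True)
    return F @ R.T / F.shape[1]

# A filter that is the horizontal mirror of another correlates perfectly
rng = np.random.default_rng(0)
w = rng.normal(size=(1, 3, 3, 2))
filters = np.concatenate([w, w[:, :, ::-1, :]])
print(reflection_correlations(filters))
```

Counting entries above 0.75 in this matrix would then give the number of reflection-symmetric filter pairs.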
Does filter symmetry evolve during training?
Thank you for your attention!