Tricks for kernel methods in large datasets - Matthias Treder - PowerPoint PPT Presentation



SLIDE 1 · School of Computer Science & Informatics · MATTHIAS TREDER · trederm@cardiff.ac.uk

Tricks for kernel methods in large datasets

Stellenbosch University MML 10 May 2019

SLIDE 2

OVERVIEW

  • Denoising in RKHS
  • Fast out-of-sample predictions for (kernel) FDA
  • CNNs and applications
SLIDE 3

Denoising in RKHS

SLIDE 4 · MATTHIAS TREDER · BRAIN INFORMATICS 2018

CHALLENGES FOR STATISTICAL MODELLING

  • Large number of variables
  • Large sample size
  • Low SNR

Approaches: kernel methods; instance averaging (Cichy et al.)

SLIDE 5
INSTANCE AVERAGING: CICHY ET AL

  • Subjects repeatedly viewed visual stimuli while MEG was recorded
  • Instances (trials) of the same class were partitioned into groups of 40 (Cichy 2015) or 5 (Cichy 2017)
  • A linear SVM was trained and tested on the averaged data
  • Instance averaging shortened training/testing time and increased classification performance

Cichy, R. M., Ramirez, F. M., & Pantazis, D. (2015). Can visual information encoded in cortical columns be decoded from magnetoencephalography data in humans? NeuroImage, 121, 193–204. https://doi.org/10.1016/j.neuroimage.2015.07.011
Cichy, R. M., & Pantazis, D. (2017). Multivariate pattern analysis of MEG and EEG: A comparison of representational structure in time and space. NeuroImage, 158, 441–454. https://doi.org/10.1016/j.neuroimage.2017.07.023
SLIDE 6

INSTANCE AVERAGING

Before averaging After averaging
SLIDE 7

INSTANCE AVERAGING: GAUSSIAN DENSITY

Before averaging: x ∼ 𝒩(m, Σ)
After averaging (groups of n): x̄ ∼ 𝒩(m, (1/n) Σ)
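A quick numerical check of this property (a minimal numpy sketch; the mean, covariance, and group size are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(42)
m = np.array([1.0, -2.0])                    # class mean
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # noise covariance
X = rng.multivariate_normal(m, Sigma, size=100_000)

# Average non-overlapping groups of n = 5 instances
n = 5
X_avg = X.reshape(-1, n, 2).mean(axis=1)

# Empirical covariance of the averages is close to Sigma / n
C_avg = np.cov(X_avg.T)
print(np.round(C_avg * n, 2))  # approximately Sigma
```

The group means stay centred on m while the noise covariance shrinks by the factor 1/n, which is exactly why averaging raises the SNR.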

SLIDE 8

What about nonlinear classification problems?

SLIDE 9

INSTANCE AVERAGING IN RADIAL DATA

Before averaging After averaging
SLIDE 10

IDEA

ϕ : input space 𝒴 → feature space ℱ

Perform averaging in ℱ?

SLIDE 11

Kernel methods

SLIDE 12

KERNEL METHODS: PROJECTION

SLIDE 13

KERNEL METHODS: PROJECTION

ϕ : ℝ¹ ↦ ℝ², ϕ(x) = [x, x²]

SLIDE 14

‘KERNEL TRICK’

Kernel methods such as SVM and kernel regression only require inner products between data points ⟨ϕ(x), ϕ(x′)⟩. The map ϕ : 𝒴 ↦ ℱ goes into a high- (possibly infinitely-) dimensional space, so ϕ(x) is often not actually computable. However, if ℱ is a reproducing kernel Hilbert space (RKHS), there exists a kernel function k such that ⟨ϕ(x), ϕ(x′)⟩ℱ = k(x, x′). Using the kernel function, all computations are carried out efficiently in input space.
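For the toy map ϕ(x) = [x, x²] from the previous slide, the matching kernel is k(x, x′) = x x′ + (x x′)², and the identity ⟨ϕ(x), ϕ(x′)⟩ = k(x, x′) can be checked directly (a minimal sketch):

```python
import numpy as np

def phi(x):
    # Explicit feature map phi: R^1 -> R^2 from the slide
    return np.array([x, x**2])

def k(x, xp):
    # Kernel matching phi: <phi(x), phi(x')> = x x' + x^2 x'^2
    return x * xp + (x * xp) ** 2

rng = np.random.default_rng(0)
for _ in range(5):
    x, xp = rng.normal(size=2)
    assert np.isclose(phi(x) @ phi(xp), k(x, xp))
```

For richer kernels such as the RBF kernel, ϕ is infinite-dimensional and only the kernel side of the identity remains computable.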

SLIDE 15

KERNEL AVERAGING USING THE KERNEL TRICK

  • Problem: we cannot directly compute the inner product between the averaged samples z and z′
  • However, we can evaluate the kernel function for the original samples x1, x2, x1′, x2′
  • Using the bilinearity of the inner product, we can recover ⟨z, z′⟩

[Figure: x1, x2 averaged to z; x1′, x2′ averaged to z′]
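Concretely, for z = (ϕ(x1) + ϕ(x2))/2 and z′ = (ϕ(x1′) + ϕ(x2′))/2, bilinearity gives ⟨z, z′⟩ = (1/4) Σᵢⱼ k(xᵢ, xⱼ′): each entry of the averaged kernel matrix is the mean of a block of the original kernel matrix. A sketch using the explicit toy map ϕ(x) = [x, x²], where feature-space averaging is still computable and serves as a reference:

```python
import numpy as np

phi = lambda x: np.array([x, x**2])        # explicit toy feature map
k = lambda x, xp: x * xp + (x * xp) ** 2   # matching kernel function

x1, x2, x1p, x2p = 0.5, -1.2, 2.0, 0.3

# Averaging in feature space (not computable for e.g. RBF kernels)
z = (phi(x1) + phi(x2)) / 2
zp = (phi(x1p) + phi(x2p)) / 2

# Kernel averaging: mean of the block of pairwise kernel evaluations
block = np.array([[k(a, b) for b in (x1p, x2p)] for a in (x1, x2)])
assert np.isclose(z @ zp, block.mean())
```

The same block-mean recipe applies to any kernel, which is what makes averaging in the RKHS practical.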
SLIDE 16
EXPERIMENTS

  • 3 simulated datasets
  • UCI datasets: gene expression; p53 mutants; cardiotocography
  • EEG dataset
  • Two kernel classifiers: SVM and kernel FDA
  • 5-fold cross-validation
  • Averaging approach: none / instance (averaging in input space) / kernel (averaging in RKHS)

SLIDE 17

SIMULATED DATASETS

[Figure: three simulated datasets: Linear (4 classes), Radial (2 classes), Checkerboard (3 classes), scatter plots over Variables 1 and 2]
SLIDE 18

EEG DATA

[Figure: classification accuracy (0.3 to 1.0) over time (0.2 to 1 s) for averaging approaches: none, l=20, l=50]
SLIDE 19

REAL DATA

SLIDE 20

KERNEL AVERAGING

[Figure: kernel averaging collapses a 6x6 kernel matrix (six original samples) into a 2x2 kernel matrix (two averaged samples); each entry of the 2x2 matrix is the mean of the corresponding 3x3 block of the 6x6 matrix]

SLIDE 21
DISCUSSION

  • Instance averaging (e.g. Cichy 2015, 2017) improves the SNR of the data only in linear classification problems
  • Kernel averaging improves the SNR in both linear and nonlinear classification problems
  • Smaller kernel matrix: higher speed, less memory consumption
  • Useful for many training-testing iterations (e.g. permutation testing)
  • Large datasets: patients vs controls; ERPs; gene expression

SLIDE 22

Fast out-of-sample predictions for kernel FDA

NeurIPS’18: reject; ESANN’19: accept

SLIDE 23

MOTIVATION

Classification of event-related potentials (ERPs) in EEG data.

[Figure: ERP time courses (attended vs. unattended) and LDA classification accuracy over time, 0.2 to 1 s]

600 time points x 5 folds x 5 repetitions x 1000 permutations x 20 participants = 300,000,000 train-test iterations

Can we exploit the redundancies in the training/test sets in multi-class kernel FDA?

SLIDE 24

  • Predictor/feature matrix: X ∈ ℝ^{n×p}
  • Class labels (two classes): y ∈ ℝ^n
  • Class indicator matrix: Y ∈ ℝ^{n×c}
  • Kernel matrix: K ∈ ℝ^{n×n}
  • Kernel ‘hat’ matrix: G = K (K + λI_n)^{-1}
  • In-sample predictions: ŷ_in = G y
  • Out-of-sample predictions: ŷ_out
  • G_Te: submatrix of G (test rows/cols)
  • G_Tr,Te: submatrix of G (train rows, test cols)
SLIDE 25

DIRECT OUT-OF-SAMPLE PREDICTIONS FOR TWO-CLASS FDA

Using the Sherman-Morrison-Woodbury formula:

ŷ_out,Te = (I − G_Te)^{-1} (ŷ_in,Te − G_Te y_Te)    (⋆)

What about multi-class FDA?
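Since regularised two-class FDA can be cast as kernel ridge regression on the class labels, (⋆) can be verified numerically: the update formula, computed from the full-data hat matrix G, reproduces the predictions of a model retrained on the training fold only. A sketch (the RBF kernel, λ, and split are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 40, 0.5
X = rng.normal(size=(n, 3))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=n))   # +-1 class labels

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf(X, X)
G = K @ np.linalg.inv(K + lam * np.eye(n))   # kernel 'hat' matrix
y_in = G @ y                                 # in-sample predictions

Te, Tr = np.arange(8), np.arange(8, n)       # test/train split

# Out-of-sample predictions via the update formula (*)
G_Te = G[np.ix_(Te, Te)]
y_out = np.linalg.solve(np.eye(len(Te)) - G_Te, y_in[Te] - G_Te @ y[Te])

# Reference: retrain on Tr only, then predict on Te
alpha = np.linalg.solve(K[np.ix_(Tr, Tr)] + lam * np.eye(len(Tr)), y[Tr])
assert np.allclose(y_out, K[np.ix_(Te, Tr)] @ alpha)
```

The update never refits the model: it only inverts a small test-sized matrix, which is where the speed-up in cross-validation comes from.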

SLIDE 26

MULTI-CLASS FISHER DISCRIMINANT ANALYSIS

Find a discriminant subspace W = [w1, w2, . . . ] ∈ ℝ^{p×(c−1)} by solving the generalised eigenvalue problem Sb W = Sw W Λ.

But: multi-class FDA is not equivalent to multivariate regression :-/

[Figure: class means, covariance matrix, and discriminant directions w1, w2]

SLIDE 27

  • Canonical correlation analysis (CCA)
  • Multi-class Fisher Discriminant Analysis (FDA)
  • Optimal Scoring (OS)
  • Kernel Fisher Discriminant Analysis (KFDA)
  • Kernel canonical correlation analysis (KCCA)

SLIDE 28

OPTIMAL SCORING (OS)

Objective: find W = [w1, w2, . . . ] ∈ ℝ^{p×(c−1)} and optimal score vectors Θ = [θ1, θ2, . . . ] ∈ ℝ^{c×(c−1)} that solve arg min_{w,θ} ||X_Tr w − Y_Tr θ||_2^2.

Step 1: multivariate regression
  B̃ = arg min_B ||X_Tr B − Y_Tr||_2^2
  Ŷ_reg,Tr = X_Tr B̃

Step 2: rotation and scaling
  (Θ, [α1, α2, . . . ]) = eig((Ŷ_reg,Tr)^⊤ Y_Tr)
  W = B̃ Θ D, with D_ii = α_i^2 (1 − α_i)^2 / n

Fast out-of-sample predictions via the update trick:
  Ŷ_reg,Te = (I − G_Te)^{-1} (Ŷ_in,Te − G_Te Y_Te)    (⋆)
  Ŷ_reg,Tr = Ŷ_in,Tr − G_Tr,Te (Y_Te − Ŷ_reg,Te)
  Ŷ_out,Te = Ŷ_reg,Te Θ D
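Step 1 is plain kernel ridge regression on the indicator matrix, so both update equations can be checked column-wise against direct retraining on the training fold. A sketch (kernel, λ, and split are illustrative; the Tr update is taken in the form Ŷ_reg,Tr = Ŷ_in,Tr − G_Tr,Te (Y_Te − Ŷ_reg,Te)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, c, lam = 30, 3, 1.0
X = rng.normal(size=(n, 4))
Y = np.eye(c)[rng.integers(0, c, size=n)]     # class indicator matrix, n x c

# RBF kernel, full-data hat matrix, in-sample regression outputs
K = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1))
G = K @ np.linalg.inv(K + lam * np.eye(n))
Y_in = G @ Y

Te, Tr = np.arange(6), np.arange(6, n)
G_TeTe = G[np.ix_(Te, Te)]
G_TrTe = G[np.ix_(Tr, Te)]

# Update (*): out-of-sample regression outputs on the test fold
Yreg_Te = np.linalg.solve(np.eye(len(Te)) - G_TeTe, Y_in[Te] - G_TeTe @ Y[Te])
# Updated fitted values on the training fold
Yreg_Tr = Y_in[Tr] - G_TrTe @ (Y[Te] - Yreg_Te)

# Reference: retrain ridge on the training fold only
A = np.linalg.solve(K[np.ix_(Tr, Tr)] + lam * np.eye(len(Tr)), Y[Tr])
assert np.allclose(Yreg_Te, K[np.ix_(Te, Tr)] @ A)
assert np.allclose(Yreg_Tr, K[np.ix_(Tr, Tr)] @ A)
```

With the regression step updated this way, only step 2 (a small c x c eigendecomposition) has to be recomputed per fold.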
SLIDE 29

COMPLEXITY FOR K-FOLD CV (KERNEL CASE)

Classical approach (k-fold):
  • Once: calculate K: 𝒪(n^2 p)
  • In every fold: invert K_Tr: 𝒪(k n_Tr^3)

Optimal scoring + matrix update:
  • Once: calculate and invert K: 𝒪(n^2 p + n^3)
  • In every fold (step 1 OS): calculate update Ŷ_reg,Te: 𝒪(k n_Te^3); calculate update Ŷ_reg,Tr: 𝒪(k n_Tr n_Te^2)
SLIDE 30
SIMULATIONS (LINEAR CASE)

  • 10-fold cross-validation and permutation tests
  • (Linear) multi-class FDA
  • Multivariate normally distributed data

[Figure: speed increase (10x to 10,000x) for cross-validation and permutation testing as a function of the number of features (10 to 1000) and permutations (10 to 1000), for n = 100 and n = 1000 with 5 or 10 classes]
SLIDE 31

DOES IT SCALE TO LARGE DATA?

Approximate the kernel matrix: K ≈ K_r = L L^⊤, L ∈ ℝ^{n×r}, r ≪ n

Kernel ‘hat’ matrix: G_r = L R L^⊤, where R := λ^{-1} (I_r − L^⊤L (L^⊤L + λI_r)^{-1}) ∈ ℝ^{r×r}

  • Only an r x r matrix needs to be inverted and stored in memory
  • For updating, submatrices G_Te can be extracted efficiently by selecting rows of L, never requiring the full n x n matrix
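The expression for G_r follows from the Woodbury identity applied to (L L^⊤ + λI_n)^{-1}; a numerical sanity check with a random low-rank factor L (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, lam = 200, 10, 0.1
L = rng.normal(size=(n, r))          # low-rank factor, K_r = L L^T
K_r = L @ L.T

# Dense reference: G = K (K + lam I)^{-1}
G_dense = K_r @ np.linalg.inv(K_r + lam * np.eye(n))

# Low-rank route: only r x r matrices are inverted and stored
LtL = L.T @ L
R = (np.eye(r) - LtL @ np.linalg.inv(LtL + lam * np.eye(r))) / lam
G_low = L @ R @ L.T
assert np.allclose(G_dense, G_low)

# Submatrices of G come from selecting rows of L, never forming n x n
Te = np.arange(5)
assert np.allclose(G_dense[np.ix_(Te, Te)], L[Te] @ R @ L[Te].T)
```

In practice L would come from a Nystroem or incomplete Cholesky approximation of the kernel matrix rather than a random draw.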

SLIDE 32
CONCLUSION

  • Generalises the ‘update trick’ for out-of-sample predictions to multi-class KFDA
  • The solution is exact
  • Useful for cross-validation, especially with large k
  • Useful for permutation testing
  • Applicable to other resampling techniques (e.g. bootstrapping)
  • Applicable to large datasets via kernel approximations

SLIDE 33

CNNs and applications

SLIDE 34

CNNs and applications

Fun with GANs

SLIDE 35

GENERATIVE ADVERSARIAL NETWORK (GAN)

SLIDE 36

GAN EXAMPLES

Do you know these Hollywood actors?

SLIDE 37

GAN EXAMPLES

Style transfer

SLIDE 38

GENERATIVE MODELING OF BRAINS

  • Use Generative Adversarial Networks (GANs) to produce synthetic MRIs
  • Have been used for: image transfer (e.g. T1 to T2); super-resolution
  • Synthetic MRIs for display purposes (no privacy concerns)
  • Model trajectories (ageing; transition from control to disease)

SLIDE 39

GENERATIVE MODELING OF BRAINS

SLIDE 40

CNNs and applications

Symmetry in convolutional layers

SLIDE 41

TRANSLATION SYMMETRY

Translation symmetry: implemented via convolutions But what about other types

  • f symmetry?
SLIDE 42

REFLECTION / MIRROR SYMMETRY

Many natural and artificial objects are mirror-symmetric. The human visual system is very sensitive to reflection symmetry.

SLIDE 43

REFLECTION / MIRROR SYMMETRY

Many natural and artificial objects are mirror-symmetric. The human visual system is very sensitive to reflection symmetry. [Figure: translation vs. reflection]

SLIDE 44
REFLECTION SYMMETRY BETWEEN FILTERS

  • Do large CNNs trained on ImageNet exhibit filter symmetry?
  • Does filter symmetry evolve during training?
  • (How does hard-coding symmetry affect CNN performance on image tasks?)
  • Symmetry measure: correlation between a filter and a reflected version of another filter from the same convolutional layer

[Figure: a 3x3x2 filter, its horizontal reflection, and its vertical reflection]
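The symmetry measure described above can be sketched in a few lines of numpy (the function name and the H x W x C filter layout are assumptions for illustration):

```python
import numpy as np

def reflection_symmetry(f1, f2, axis=1):
    """Correlation between filter f1 and a reflected copy of filter f2.

    f1, f2: arrays of shape (H, W, C); axis=1 flips left-right
    (horizontal reflection), axis=0 flips up-down (vertical).
    """
    f2_ref = np.flip(f2, axis=axis)
    return np.corrcoef(f1.ravel(), f2_ref.ravel())[0, 1]

# A filter compared against its own mirror image correlates perfectly
rng = np.random.default_rng(4)
f = rng.normal(size=(3, 3, 2))
assert np.isclose(reflection_symmetry(f, np.flip(f, axis=1)), 1.0)
```

Applied to all filter pairs within a convolutional layer, this yields the distribution of symmetry scores analysed on the following slides.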
SLIDE 45

Do large CNNs trained on ImageNet exhibit filter symmetry? (threshold: >75% correlation; all convolutional layers with kernel size at least 3x3)
SLIDE 46

Do large CNNs trained on ImageNet exhibit filter symmetry?

SLIDE 47

Does filter symmetry evolve during training?

SLIDE 48

Thank you for your attention!