

SLIDE 1

Lecture 13: Even more dimension reduction techniques

Felix Held, Mathematical Sciences

MSA220/MVE440 Statistical Learning for Big Data, 10th May 2019

SLIDE 2

Recap: kernel PCA

Given a set of $p$-dimensional feature vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$ and a kernel $k(\mathbf{x}, \mathbf{y})$, form the Gram matrix $\mathbf{K} = (k(\mathbf{x}_i, \mathbf{x}_j))_{ij}$ and perform:

▶ Solve the eigenvalue problem $\mathbf{K}\mathbf{a}_i = n \lambda_i \mathbf{a}_i$ for $\lambda_i$ and $\mathbf{a}_i$
▶ Scale $\mathbf{a}_i$ such that $\mathbf{a}_i^T \mathbf{K} \mathbf{a}_i = 1$

The projection of a feature vector $\mathbf{x}$ onto the $i$-th principal component in the implicit space of the $\boldsymbol{\varphi}(\mathbf{x}_l)$ is

$$\theta_i(\mathbf{x}) = \sum_{l=1}^{n} a_{il} \, k(\mathbf{x}, \mathbf{x}_l)$$
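A minimal numpy sketch of this recipe (the function names are mine, not from the slides, and the leading eigenvalues are assumed to be positive):

```python
import numpy as np

def kernel_pca_components(K, n_components):
    """Kernel PCA on a (centred) Gram matrix K: solve K a_i = n * lambda_i * a_i
    and scale each a_i such that a_i^T K a_i = 1."""
    n = K.shape[0]
    eigval, eigvec = np.linalg.eigh(K)                # eigenvalues of K, ascending
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]    # largest first
    lam = eigval[:n_components] / n                   # eigenvalues lambda_i of the feature-space covariance
    A = eigvec[:, :n_components] / np.sqrt(eigval[:n_components])  # enforces a_i^T K a_i = 1
    return lam, A

def project(K_new, A):
    """Projection theta_i(x) = sum_l a_il * k(x, x_l), where
    K_new[m, l] = k(x_new_m, x_l) against the training points x_l."""
    return K_new @ A
```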

SLIDE 3

Centring and kernel PCA

▶ The derivation assumed that the implicitly defined feature vectors $\boldsymbol{\varphi}(\mathbf{x}_l)$ were centred. What if they are not?
▶ In the derivation we look at scalar products $\boldsymbol{\varphi}(\mathbf{x}_i)^T \boldsymbol{\varphi}(\mathbf{x}_l)$. Centring in the implicit space leads to

$$\Big(\boldsymbol{\varphi}(\mathbf{x}_i) - \frac{1}{n}\sum_{k=1}^{n}\boldsymbol{\varphi}(\mathbf{x}_k)\Big)^T \Big(\boldsymbol{\varphi}(\mathbf{x}_l) - \frac{1}{n}\sum_{k=1}^{n}\boldsymbol{\varphi}(\mathbf{x}_k)\Big) = K_{il} - \frac{1}{n}\sum_{k=1}^{n} K_{ki} - \frac{1}{n}\sum_{k=1}^{n} K_{kl} + \frac{1}{n^2}\sum_{k=1}^{n}\sum_{m=1}^{n} K_{km}$$

▶ Using the centring matrix $\mathbf{J} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$, centring in the implicit space is equivalent to transforming $\mathbf{K}$ as $\mathbf{K}' = \mathbf{J}\mathbf{K}\mathbf{J}$
▶ The algorithm is the same, apart from using $\mathbf{K}'$ instead of $\mathbf{K}$.
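In code, the centring step is a single matrix sandwich (a short numpy sketch; the function name is mine):

```python
import numpy as np

def centre_gram(K):
    """Centre a Gram matrix in the implicit feature space: K' = J K J
    with J = I_n - (1/n) * ones * ones^T."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ K @ J
```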

SLIDE 4

Dimension reduction while preserving distances

SLIDE 5

Preserving distance

Like in cartography, the goal of dimension reduction can be subject to different sub-criteria; PCA, for example, preserves the directions of largest variance. What if we want to preserve distances while reducing the dimension? For given vectors $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^p$ we want to find $\mathbf{y}_1, \dots, \mathbf{y}_n \in \mathbb{R}^m$ with $m < p$ such that $\|\mathbf{x}_i - \mathbf{x}_j\|_2 \approx \|\mathbf{y}_i - \mathbf{y}_j\|_2$.

SLIDE 6

Distance matrices and the linear kernel

Given a data matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$, note that

$$\mathbf{X}\mathbf{X}^T = \begin{pmatrix} \mathbf{x}_1^T\mathbf{x}_1 & \cdots & \mathbf{x}_1^T\mathbf{x}_n \\ \vdots & & \vdots \\ \mathbf{x}_n^T\mathbf{x}_1 & \cdots & \mathbf{x}_n^T\mathbf{x}_n \end{pmatrix} = \mathbf{K}$$

which is also the Gram matrix $\mathbf{K}$ of the linear kernel. Let $\mathbf{D} = (\|\mathbf{x}_i - \mathbf{x}_j\|_2)_{ij}$ be the distance matrix in the Euclidean norm. Note that

$$\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 = \mathbf{x}_i^T\mathbf{x}_i - 2\,\mathbf{x}_i^T\mathbf{x}_j + \mathbf{x}_j^T\mathbf{x}_j$$

and (with element-wise squaring of $\mathbf{D}$, and $\operatorname{diag}(\mathbf{X}\mathbf{X}^T)$ denoting the column vector of diagonal entries)

$$-\tfrac{1}{2}\mathbf{D}^2 = \mathbf{X}\mathbf{X}^T - \tfrac{1}{2}\,\mathbf{1}\operatorname{diag}(\mathbf{X}\mathbf{X}^T)^T - \tfrac{1}{2}\operatorname{diag}(\mathbf{X}\mathbf{X}^T)\,\mathbf{1}^T.$$

Through calculation it can be shown that with the centring matrix $\mathbf{J} = \mathbf{I}_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T$

$$\mathbf{K} = \mathbf{J}\left(-\tfrac{1}{2}\mathbf{D}^2\right)\mathbf{J}$$

(strictly, the right-hand side equals the Gram matrix of the column-centred data, so the two coincide when $\mathbf{X}$ is centred).
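A short numpy/scipy sketch that checks this identity on made-up, column-centred data (the toy data and variable names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X = X - X.mean(axis=0)                      # centre the columns

D = squareform(pdist(X))                    # Euclidean distance matrix
n = X.shape[0]
J = np.eye(n) - np.ones((n, n)) / n         # centring matrix

K_from_distances = J @ (-0.5 * D**2) @ J    # Gram matrix recovered from distances
K_direct = X @ X.T                          # Gram matrix of the linear kernel

print(np.allclose(K_from_distances, K_direct))  # True (up to numerical error)
```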

SLIDE 7

Finding an exact embedding

▶ It can be shown that if $\mathbf{K}$ is positive semi-definite, then there exists an exact embedding in $m = \operatorname{rank}(\mathbf{K}) \le \operatorname{rank}(\mathbf{X}) \le \min(n, p)$ dimensions.

1. Perform PCA on $\mathbf{K} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T$
2. If $m = \operatorname{rank}(\mathbf{K})$, set $\mathbf{Y} = (\sqrt{\lambda_1}\,\mathbf{v}_1, \dots, \sqrt{\lambda_m}\,\mathbf{v}_m) \in \mathbb{R}^{n \times m}$
3. The rows of $\mathbf{Y}$ are the sought-after embedding, i.e. for $\mathbf{y}_i = \mathbf{Y}_{i\cdot}$ it holds that $\|\mathbf{x}_i - \mathbf{x}_j\|_2 = \|\mathbf{y}_i - \mathbf{y}_j\|_2$

▶ Note: This is not guaranteed to lead to dimension reduction, i.e. $m = p$ is possible. However, the internal structure of the data is usually lower-dimensional and $m < p$.
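A numpy sketch of these three steps, starting from a distance matrix (the function and variable names are mine):

```python
import numpy as np

def classical_mds(D, n_components):
    """Embedding from a Euclidean distance matrix D: double-centre -0.5 * D^2
    to recover the Gram matrix, then scale the leading eigenvectors."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    K = J @ (-0.5 * D**2) @ J                        # Gram matrix of the centred configuration
    eigval, eigvec = np.linalg.eigh(K)
    idx = np.argsort(eigval)[::-1][:n_components]    # indices of the largest eigenvalues
    lam = np.clip(eigval[idx], 0.0, None)            # guard against tiny negative eigenvalues
    return eigvec[:, idx] * np.sqrt(lam)             # rows are the embedded points y_i
```

With `n_components` equal to the rank of the Gram matrix this reproduces the exact embedding; smaller values give the truncated embedding of the next slide.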

SLIDE 8

Multi-dimensional scaling

▶ Keeping only the first $r < m$ components of the $\mathbf{y}_i$ is known as classical scaling or multi-dimensional scaling (MDS) and minimizes the so-called stress or strain

$$S(\mathbf{D}, \mathbf{Y}) = \Big(\sum_{i \ne j} \big(D_{ij} - \|\mathbf{y}_i - \mathbf{y}_j\|_2\big)^2\Big)^{1/2}$$

▶ The results also hold for general distance matrices $\mathbf{D}$, as long as $\lambda_1, \dots, \lambda_m > 0$ for $m = \operatorname{rank}(\mathbf{K})$. This is called metric MDS.
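A small helper to evaluate this criterion for a candidate embedding (a sketch; `D` is the original distance matrix and the rows of `Y` are the embedded points):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def stress(D, Y):
    """Stress of an embedding: sqrt of the sum over i != j of
    (D_ij - ||y_i - y_j||_2)^2."""
    D_emb = squareform(pdist(Y))          # distances in the embedding
    return np.sqrt(np.sum((D - D_emb) ** 2))
```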

SLIDE 9

Lower-dimensional data in a high-dimensional space

SLIDE 10

A problematic geometry

[Figure: Swiss roll data (n = 1000) in three dimensions (axes x, y, z), together with the ideal unrolled graph and the embeddings obtained by PCA, kernel PCA (RBF kernel, sigma = 0.13) and classical scaling.]
SLIDE 11

What is the problem here?

▶ The data has an intrinsic structure that is quite simple (2D) in itself, but much more complex in the three-dimensional space
▶ To understand this data set properly we need to learn about the local structure of the data
▶ PCA is a global method and will always look at all data
▶ Kernel PCA is a local method, but the chosen Gaussian kernel does not represent the structure of the data well
▶ Classical scaling performs roughly like PCA
▶ What is the issue? All approaches measure distances in the Euclidean norm in three dimensions.

SLIDE 12

Data-driven distance measure (I)

We can create a local, data-driven distance measure by looking at the $k$ nearest neighbours of a data point.

[Figure: Swiss roll (n = 1000) and the same data with the nearest-neighbour graph (k = 6) drawn between neighbouring points.]
SLIDE 13

Data-driven distance measure (II)

Computation

1. For a data point $\mathbf{x}_i$, find its $k$ nearest neighbours
2. Construct a graph between data points and their $k$ nearest neighbours, weighting each edge by the Euclidean distance
3. To measure the distance between two data points, use their geodesic distance, i.e. find the shortest path in the weighted graph and sum up the weights along it

This creates a distance matrix $\mathbf{D}_G$ between data points that is better adapted to the actual geometry. To embed the data in a lower-dimensional space, MDS can be applied to $\mathbf{D}_G$, as sketched below.
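A sketch of steps 1–3 with scikit-learn and scipy (the function name is mine; `X` is assumed to be the n × p data matrix):

```python
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, n_neighbors=6):
    """Distance matrix D_G: Euclidean edge weights within the k-nearest-neighbour
    graph, shortest-path (geodesic) distances between all pairs of points."""
    graph = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")
    D_G = shortest_path(graph, method="D", directed=False)  # Dijkstra on the weighted graph
    return D_G  # entries are np.inf for points in disconnected components

# MDS on the geodesic distances then gives the Isomap embedding, e.g.
# Y = classical_mds(geodesic_distances(X, 6), 2)
```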

SLIDE 14

Isomap

The combination of a geodesic, local distance measure with classical scaling is called Isomap.
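In practice one would typically use a ready-made implementation, e.g. in scikit-learn (a usage sketch; the Swiss roll data here is generated for illustration):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# n_neighbors plays the role of k in the neighbourhood graph
embedding = Isomap(n_neighbors=6, n_components=2).fit_transform(X)
```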

[Figure: Isomap embeddings of the Swiss roll for knn = 6 and knn = 20.]

SLIDE 15

Caveats of Isomap

▶ The distance between two vectors in Euclidean space can always be measured, but it can happen that there is no connection in the graph between two data points.
▶ When? If the graph of the data falls into two (or more) components. The distance is considered infinite in these cases.
▶ Implementations typically return a different embedding for each component
▶ Isomap has problems with datasets that have varying density
▶ The number of nearest neighbours has to be carefully tuned

SLIDE 16

t-distributed Stochastic Neighbour Embedding (tSNE)

t-distributed stochastic neighbour embedding (tSNE) follows a similar strategy to Isomap, in the sense that it measures distances locally. Idea: Measure the similarity of a feature vector $\mathbf{x}_i$ to another feature vector $\mathbf{x}_j$ as proportional to the likelihood of $\mathbf{x}_j$ under a Gaussian distribution centred at $\mathbf{x}_i$ with an isotropic covariance matrix.

SLIDE 17

Computation of tSNE

For feature vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$, set

$$p_{j|i} = \frac{\exp\big(-\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 / (2\sigma_i^2)\big)}{\sum_{k \ne i} \exp\big(-\|\mathbf{x}_i - \mathbf{x}_k\|_2^2 / (2\sigma_i^2)\big)}$$

and

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \qquad p_{ii} = 0.$$

The variances $\sigma_i^2$ are chosen such that the perplexity (here: the approximate number of close neighbours) of each conditional distribution (the $p_{j|i}$ for fixed $i$) is constant.

In the lower-dimensional embedding, similarity between $\mathbf{y}_1, \dots, \mathbf{y}_n$ is measured with a t-distribution with one degree of freedom (a Cauchy distribution)

$$q_{ij} = \frac{\big(1 + \|\mathbf{y}_i - \mathbf{y}_j\|_2^2\big)^{-1}}{\sum_{k \ne l} \big(1 + \|\mathbf{y}_k - \mathbf{y}_l\|_2^2\big)^{-1}}, \qquad q_{ii} = 0.$$

To determine the $\mathbf{y}_i$, the KL divergence between the distributions $P = (p_{ij})_{ij}$ and $Q = (q_{ij})_{ij}$ is minimized with gradient descent

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
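A usage sketch with scikit-learn's implementation (the Swiss roll data and the parameter values are illustrative, not prescribed by the lecture):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import TSNE

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# perplexity controls the sigma_i via the target number of close neighbours;
# the optimisation is stochastic, so different random_state values can give
# different embeddings
Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```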

SLIDE 18

Revisiting the Swiss roll with tSNE

[Figure: Swiss roll (n = 1000) together with the Isomap embedding (knn = 6) and the tSNE embedding (perplexity = 30).]

▶ The results are similar to Isomap
▶ Slightly more condensed, but tSNE manages the main goal of unrolling the data

SLIDE 19

A more impressive example of tSNE

[Figure: A digit dataset (classes 1–9) embedded with Isomap (knn = 20, axes Iso1/Iso2) and tSNE (perplexity = 20, axes tSNE1/tSNE2), coloured by digit.]

SLIDE 20

Caveats of tSNE

tSNE is a powerful method but comes with some difficulties as well

▶ Convergence to a local minimum (i.e. repeated runs can give different results)
▶ Perplexity is hard to tune (as with any tuning parameter)

Let’s see what tSNE does to our old friend, the moons dataset.

[Figure: The moons dataset in its original two dimensions (axes x, y).]

SLIDE 21

Influence of perplexity on tSNE

[Figure: The moons data transformed with tSNE for varying perplexity (2, 5, 15, 30, 50, 100), axes tSNE1/tSNE2.]
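A sketch of how such a perplexity sweep could be produced (the sample size, noise level and perplexity values are chosen to mirror the figure; plotting is omitted):

```python
from sklearn.datasets import make_moons
from sklearn.manifold import TSNE

X, labels = make_moons(n_samples=500, noise=0.05, random_state=0)

embeddings = {}
for perplexity in (2, 5, 15, 30, 50, 100):
    embeddings[perplexity] = TSNE(
        n_components=2, perplexity=perplexity, random_state=0
    ).fit_transform(X)
# each entry can then be scatter-plotted, coloured by the moon labels
```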

SLIDE 22

tSNE multiple runs

[Figure: The moons data transformed with tSNE (perplexity = 20) in ten repeated runs, axes tSNE1/tSNE2.]

SLIDE 23

Take-home message

▶ Dimension reduction has multiple sub-goals, like preserving structure
▶ Data that has a lower-dimensional structure in a high-dimensional space can be tricky to uncover
▶ Isomap and tSNE are powerful dimension reduction techniques that also uncover structure, but be careful about applying them blindly
