Lecture 13: Even more dimension reduction techniques
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
10th May 2019
Recap: kernel PCA
Given a set of $p$-dimensional feature vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$ and a kernel $k(\mathbf{x}, \mathbf{z})$, form the Gram matrix $\mathbf{K} = (k(\mathbf{x}_i, \mathbf{x}_j))_{ij}$ and perform
▶ Solve the eigenvalue problem $\mathbf{K} \mathbf{a}_j = \lambda_j n \mathbf{a}_j$ for $\lambda_j$ and $\mathbf{a}_j$
▶ Scale $\mathbf{a}_j$ such that $\mathbf{a}_j^T \mathbf{K} \mathbf{a}_j = 1$
The projection of a feature vector $\mathbf{x}$ onto the $j$-th principal component in the implicit space of the $\boldsymbol{\varphi}(\mathbf{x})$ is
$$\eta_j(\mathbf{x}) = \sum_{l=1}^{n} a_{jl} \, k(\mathbf{x}, \mathbf{x}_l)$$
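A minimal NumPy sketch of these two steps, assuming a precomputed (and already centred) Gram matrix; the function name and interface are my own:

```python
import numpy as np

def kernel_pca_coefficients(K, n_components):
    """Solve K a_j = lambda_j * n * a_j and scale so that a_j^T K a_j = 1.

    K is the (centred) n x n Gram matrix; returns the a_j as columns.
    """
    # eigh returns eigenvalues of the symmetric K in ascending order;
    # K v = mu v corresponds to lambda = mu / n
    mu, V = np.linalg.eigh(K)
    idx = np.argsort(mu)[::-1][:n_components]
    mu, V = mu[idx], V[:, idx]
    # for a unit eigenvector v, v^T K v = mu, so a = v / sqrt(mu)
    return V / np.sqrt(mu)

# projection of a new point x onto component j:
#   eta_j(x) = sum_l A[l, j] * k(x, x_l)
```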
Centring and kernel PCA
▶ The derivation assumed that the implicitly defined feature vectors $\boldsymbol{\varphi}(\mathbf{x}_l)$ were centred. What if they are not?
▶ In the derivation we look at scalar products $\boldsymbol{\varphi}(\mathbf{x}_i)^T \boldsymbol{\varphi}(\mathbf{x}_l)$. Centring in the implicit space leads to
$$\Bigl(\boldsymbol{\varphi}(\mathbf{x}_i) - \frac{1}{n} \sum_{j=1}^{n} \boldsymbol{\varphi}(\mathbf{x}_j)\Bigr)^T \Bigl(\boldsymbol{\varphi}(\mathbf{x}_l) - \frac{1}{n} \sum_{j=1}^{n} \boldsymbol{\varphi}(\mathbf{x}_j)\Bigr) = K_{il} - \frac{1}{n} \sum_{j=1}^{n} K_{ji} - \frac{1}{n} \sum_{j=1}^{n} K_{jl} + \frac{1}{n^2} \sum_{j=1}^{n} \sum_{m=1}^{n} K_{jm}$$
▶ Using the centring matrix $\mathbf{J} = \mathbf{I}_n - \frac{1}{n} \mathbf{1}\mathbf{1}^T$, centring in the implicit space is equivalent to transforming $\mathbf{K}$ as $\mathbf{K}' = \mathbf{J}\mathbf{K}\mathbf{J}$
▶ The algorithm is the same, apart from using $\mathbf{K}'$ instead of $\mathbf{K}$.
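A short sketch of the double centring in NumPy (function names are my own):

```python
import numpy as np

def centre_gram(K):
    """Double-centre a Gram matrix: K' = J K J with J = I_n - (1/n) 1 1^T."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ K @ J

def centre_gram_fast(K):
    """Equivalent form that avoids building J explicitly:
    K'_il = K_il - rowmean_i - colmean_l + grandmean (K symmetric)."""
    return K - K.mean(axis=0) - K.mean(axis=1, keepdims=True) + K.mean()
```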
Dimension reduction while preserving distances
Preserving distance
Like in cartography, the goal of dimension reduction can be subject to different sub-criteria; e.g. PCA preserves the directions of largest variance. What if we want to preserve distances while reducing the dimension?
For given vectors $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^p$ we want to find $\mathbf{y}_1, \dots, \mathbf{y}_n \in \mathbb{R}^m$ where $m < p$ such that
$$\|\mathbf{x}_i - \mathbf{x}_l\|_2 \approx \|\mathbf{y}_i - \mathbf{y}_l\|_2$$
Distance matrices and the linear kernel
Given a data matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$, note that
$$\mathbf{X}\mathbf{X}^T = \begin{pmatrix} \mathbf{x}_1^T \mathbf{x}_1 & \cdots & \mathbf{x}_1^T \mathbf{x}_n \\ \vdots & & \vdots \\ \mathbf{x}_n^T \mathbf{x}_1 & \cdots & \mathbf{x}_n^T \mathbf{x}_n \end{pmatrix} = \mathbf{K}$$
which is also the Gram matrix $\mathbf{K}$ of the linear kernel.
Let $\mathbf{D} = (\|\mathbf{x}_l - \mathbf{x}_m\|_2)_{lm}$ be the distance matrix in the Euclidean norm. Note that
$$\|\mathbf{x}_l - \mathbf{x}_m\|_2^2 = \mathbf{x}_l^T \mathbf{x}_l - 2 \mathbf{x}_l^T \mathbf{x}_m + \mathbf{x}_m^T \mathbf{x}_m$$
and (with element-wise squaring in $\mathbf{D}^2$)
$$-\frac{1}{2} \mathbf{D}^2 = \mathbf{X}\mathbf{X}^T - \frac{1}{2} \mathbf{1} \operatorname{diag}(\mathbf{X}\mathbf{X}^T)^T - \frac{1}{2} \operatorname{diag}(\mathbf{X}\mathbf{X}^T) \mathbf{1}^T$$
Through calculation it can be shown that with $\mathbf{J} = \mathbf{I}_n - \frac{1}{n} \mathbf{1}\mathbf{1}^T$
$$\mathbf{K} = \mathbf{J} \Bigl(-\frac{1}{2} \mathbf{D}^2\Bigr) \mathbf{J}$$
Finding an exact embedding
▶ It can be shown that if $\mathbf{K}$ is positive semi-definite then there exists an exact embedding in $m = \operatorname{rank}(\mathbf{K}) \le \operatorname{rank}(\mathbf{X}) \le \min(n, p)$ dimensions.
1. Perform PCA on $\mathbf{K} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^T$
2. If $m = \operatorname{rank}(\mathbf{K})$, set $\mathbf{Y} = (\sqrt{\lambda_1} \mathbf{v}_1, \dots, \sqrt{\lambda_m} \mathbf{v}_m) \in \mathbb{R}^{n \times m}$
3. The rows of $\mathbf{Y}$ are the sought-after embedding, i.e. for $\mathbf{y}_l = \mathbf{Y}_{l \cdot}$ it holds that $\|\mathbf{x}_i - \mathbf{x}_l\|_2 = \|\mathbf{y}_i - \mathbf{y}_l\|_2$
▶ Note: This is not guaranteed to lead to dimension reduction, i.e. $m = p$ is possible. However, the internal structure of the data is usually lower-dimensional and then $m < p$.
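A minimal NumPy sketch of this embedding procedure (classical scaling from a Euclidean distance matrix; function names are my own):

```python
import numpy as np

def classical_mds(D, m):
    """Embed an (n x n) Euclidean distance matrix D into m dimensions.

    Recovers the Gram matrix K = J(-1/2 D^2)J by double centring and
    factors it through its eigendecomposition.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    K = J @ (-0.5 * D**2) @ J             # Gram matrix of centred data
    eigvals, eigvecs = np.linalg.eigh(K)  # ascending order
    idx = np.argsort(eigvals)[::-1][:m]   # keep m largest eigenvalues
    lam, V = eigvals[idx], eigvecs[:, idx]
    return V * np.sqrt(np.clip(lam, 0, None))  # rows are the embedding

# sanity check: for points X in R^3, distances are reproduced exactly
X = np.random.default_rng(1).normal(size=(50, 3))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Y = classical_mds(D, 3)
assert np.allclose(D, np.linalg.norm(Y[:, None] - Y[None, :], axis=-1))
```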
Multi-dimensional scaling
▶ Keeping only the first $r < m$ components of the $\mathbf{y}_l$ is known as classical scaling or multi-dimensional scaling (MDS) and minimizes the so-called stress or strain
$$d(\mathbf{D}, \mathbf{Y}) = \Bigl(\sum_{i \ne j} \bigl(D_{ij} - \|\mathbf{y}_i - \mathbf{y}_j\|_2\bigr)^2\Bigr)^{1/2}$$
▶ The results also hold for general distance matrices $\mathbf{D}$ as long as $\lambda_1, \dots, \lambda_m > 0$ for $m = \operatorname{rank}(\mathbf{K})$. This is called metric MDS.
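In terms of the classical_mds sketch above, MDS simply means calling it with a target dimension r below the rank, e.g. classical_mds(D, m=2) for a planar embedding.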
Lower-dimensional data in a high-dimensional space
A problematic geometry
[Figure: the Swiss roll dataset (n = 1000) in 3D, together with 2D embeddings from the ideal unrolling, PCA, kernel PCA (RBF kernel, sigma = 0.13), and classical scaling.]
What is the problem here?
▶ The data has an intrinsic structure that is quite simple (2D) in itself, but much more complex in the three-dimensional space
▶ To understand this data set properly we need to learn about the local structure of the data
▶ PCA is a global method and will always look at all data
▶ Kernel PCA is a local method, but the chosen Gaussian kernel does not represent the structure of the data well
▶ Classical scaling performs roughly like PCA
▶ What is the issue? All approaches measure distances in the Euclidean norm in three dimensions.
Data-driven distance measure (I)
We can create a local, data-driven distance measure by looking at the $k$ nearest neighbours of a data point.
[Figure: the Swiss roll (n = 1000) next to its nearest-neighbour graph (k = 6).]
Data-driven distance measure (II)
Computation
1. For a data point $\mathbf{x}_l$ find the $k$ nearest neighbours
2. Construct a graph between data points and their $k$ nearest neighbours, weighting each edge by the Euclidean distance
3. To measure the distance between two data points, compute their geodesic distance, i.e. find the shortest path in the weighted graph and sum up the weights
This creates a distance matrix $\mathbf{D}_G$ between data points that is more adapted to the actual geometry. To embed the geometry in a lower-dimensional space, MDS can be applied to $\mathbf{D}_G$.
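A sketch of steps 1-3 using scipy/scikit-learn helpers (parameter values are illustrative; classical_mds is the sketch from above):

```python
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=6):
    """Geodesic distance matrix D_G from a k-nearest-neighbour graph.

    Edges carry Euclidean distances; the geodesic distance between two
    points is the length of the shortest path in the graph.
    Pairs in different graph components end up with distance inf.
    """
    G = kneighbors_graph(X, n_neighbors=k, mode='distance')
    return shortest_path(G, method='D', directed=False)  # Dijkstra

# Isomap = geodesic distances followed by classical scaling, roughly:
# Y = classical_mds(geodesic_distances(X, k=6), m=2)
```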
Isomap
The combination of a geodesic, local distance measure with classical scaling is called Isomap.
[Figure: Swiss roll embeddings with Isomap (knn = 6) and Isomap (knn = 20).]
Caveats of Isomap
▶ The distance between two vectors in Euclidean space can always be measured, but it can happen that there is no connection in the graph between two data points.
▶ When? If the graph of the data falls into two (or more) components. The distance is considered infinite in these cases.
▶ Implementations typically return a different embedding for each component
▶ Isomap has problems with datasets that have varying density
▶ The number of nearest neighbours has to be carefully tuned
t-distributed Stochastic Neighbour Embedding (tSNE)
t-distributed stochastic neighbour embedding (tSNE) follows a similar strategy to Isomap, in the sense that it measures distances locally.
Idea: Measure the distance from a feature vector $\mathbf{x}_l$ to another feature vector $\mathbf{x}_i$ as proportional to the likelihood of $\mathbf{x}_i$ under a Gaussian distribution centred at $\mathbf{x}_l$ with an isotropic covariance matrix.
Computation of tSNE
For feature vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$, set
$$p_{i|l} = \frac{\exp\bigl(-\|\mathbf{x}_l - \mathbf{x}_i\|_2^2 / (2\sigma_l^2)\bigr)}{\sum_{k \ne l} \exp\bigl(-\|\mathbf{x}_l - \mathbf{x}_k\|_2^2 / (2\sigma_l^2)\bigr)} \quad \text{and} \quad p_{il} = \frac{p_{i|l} + p_{l|i}}{2n}, \quad p_{ll} = 0$$
The variances $\sigma_l^2$ are chosen such that the perplexity (here: the approximate number of close neighbours) of each conditional distribution (the $p_{i|l}$ for fixed $l$) is constant.
In the lower-dimensional embedding, the distance between $\mathbf{y}_1, \dots, \mathbf{y}_n$ is measured with a t-distribution with one degree of freedom, or Cauchy distribution,
$$q_{il} = \frac{\bigl(1 + \|\mathbf{y}_i - \mathbf{y}_l\|_2^2\bigr)^{-1}}{\sum_{k \ne j} \bigl(1 + \|\mathbf{y}_k - \mathbf{y}_j\|_2^2\bigr)^{-1}} \quad \text{and} \quad q_{ll} = 0$$
To determine the $\mathbf{y}_l$, the KL divergence between the distributions $P = (p_{il})_{il}$ and $Q = (q_{il})_{il}$ is minimized with gradient descent
$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \ne l} p_{il} \log \frac{p_{il}}{q_{il}}$$
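In practice one rarely implements this by hand; a minimal usage sketch with scikit-learn's TSNE (data and parameter values are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))   # placeholder data: n = 500, p = 10

# perplexity steers the effective number of close neighbours;
# the embedding minimizes KL(P || Q) by gradient descent internally
Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(Y.shape)                   # (500, 2)
```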
Revisiting the Swiss roll with tSNE
[Figure: the Swiss roll (n = 1000) next to embeddings from Isomap (knn = 6) and tSNE (perplexity = 30).]
▶ Results are similar to Isomap
▶ Slightly more condensed, but tSNE manages the main goal of unrolling the data
A more impressive example of tSNE
[Figure: a digits dataset embedded with Isomap (knn = 20, axes Iso1/Iso2) and with tSNE (perplexity = 20, axes tSNE1/tSNE2); colours indicate the digits 1 to 9.]
Caveats of tSNE
tSNE is a powerful method but comes with some difficulties as well
▶ Convergence to a local minimum (i.e. repeated runs can give different results)
▶ Perplexity is hard to tune (as with any tuning parameter)
Let's see what tSNE does to our old friend, the moons dataset.
[Figure: the moons dataset in its original two-dimensional coordinates.]
Influence of perplexity on tSNE
[Figure: tSNE embeddings of the moons dataset for perplexities 2, 5, 15, 30, 50 and 100 (axes tSNE1/tSNE2).]
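A sketch of how such a perplexity sweep could be generated (make_moons stands in for the lecture's moons data; all values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.manifold import TSNE

X, labels = make_moons(n_samples=500, noise=0.05, random_state=0)

# one embedding per perplexity value shown in the figure
embeddings = {
    p: TSNE(n_components=2, perplexity=p, random_state=0).fit_transform(X)
    for p in (2, 5, 15, 30, 50, 100)
}
```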
tSNE multiple runs
[Figure: ten tSNE embeddings of the moons dataset, all at perplexity = 20 but from separate runs (axes tSNE1/tSNE2); the runs differ noticeably.]
Take-home message
▶ Dimension reduction has multiple sub-goals, like preserving structure
▶ Data that has a lower-dimensional structure in a high-dimensional space can be tricky to uncover
▶ Isomap and tSNE are powerful dimension reduction techniques