

SLIDE 1


Dimension Reduction

CSE 6242 / CX 4242

Thanks: Prof. Jaegul Choo, Dr. Ramakrishnan Kannan, Prof. Le Song

SLIDE 2

What is Dimension Reduction?

[Diagram: a data matrix with data item index (n) and dimension index (d), columns as data items, transformed by dimension reduction into low-dim data.]

How big is this? Why?

Attribute = Feature = Variable = Dimension

SLIDE 3

Image Data

[Figure: raw images → pixel values → serialized/rasterized pixel values.]

A 4K (4096 × 2160) image contains about 8.8 million pixels in total.
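As a rough sketch of how an image becomes one high-dimensional vector (a minimal sketch assuming NumPy; the pixel values are stand-ins):

```python
# Serializing a grayscale 4K image into a single data vector.
import numpy as np

img = np.random.randint(0, 256, size=(2160, 4096))  # stand-in 4K image
x = img.reshape(-1)        # serialized/rasterized pixel values
print(x.shape)             # (8847360,) -> 8,847,360 dimensions
```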

SLIDE 4

Video Data

Huge dimensions:
- A 4096 × 2160 frame → 8,847,360 dimensions
- At 30 fps, a 2-minute video yields 3,600 frames, i.e., a matrix of size 8,847,360 × 3,600

[Figure: raw frames → pixel values → serialized/rasterized pixel values.]

SLIDE 5

Text Documents

Bag-of-words vectors:
- Document 1 = “Life of Pi won Oscar”
- Document 2 = “Life of Pi is also a book.”

Vocabulary   Doc 1   Doc 2
Life           1       1
Pi             1       1
movies         0       0
also           0       1
Oscar          1       0
book           0       1
won            1       0
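A quick sketch of building such vectors (a minimal sketch assuming scikit-learn; note its default tokenizer lowercases and drops one-letter tokens such as “a”):

```python
# Bag-of-words vectors for the two documents above.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Life of Pi won Oscar", "Life of Pi is also a book."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # documents x vocabulary counts
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one row of counts per document
```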

SLIDE 6

Two Axes of a Data Set

- Data items: how many data items?
- Dimensions: how many dimensions represent each item?

[Diagram: a matrix with data item index (n) and dimension index (d).]

Columns as data items vs. rows as data items: we will use columns as data items during the lecture.

SLIDE 7

Dimension Reduction

[Diagram: high-dim data (dimension index d, data item index n) enters the dimension reduction box, producing low-dim data (reduced dimension k, n items) and a dim-reducing transformation for new data. User-specified inputs: no. of dimensions (k), additional info about the data, other parameters.]

SLIDE 8

Benefits of Dimension Reduction

Obviously:
- Compression
- Visualization
- Faster computation (e.g., computing distances between 100,000-dim vs. 10-dim vectors)

More importantly:
- Noise removal (improving data quality): separates the data into general pattern + sparse + noise. (Or is the noise the important signal?)
- Works as pre-processing for better performance, e.g., in microarray data analysis, information retrieval, face recognition, protein disorder prediction, network intrusion detection, document categorization, speech recognition

SLIDE 9

Two Main Techniques

1. Feature selection
- Selects a subset of the original variables as the reduced dimensions
- The relevant subset can be small for a particular task, e.g., the number of genes responsible for a particular disease may be small

2. Feature extraction
- Each reduced dimension combines multiple original dimensions
- The original data values are transformed into new values

Feature = Variable = Dimension

SLIDE 10

Feature Selection

What is the optimal subset of m features that maximizes a given criterion?

Widely-used criteria: information gain, correlation, …

These are typically combinatorial optimization problems, so greedy methods are popular (a sketch follows the list):
- Forward selection: start from the empty set and add one variable at a time
- Backward elimination: start from the entire set and remove one variable at a time
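A minimal sketch of greedy selection (assuming scikit-learn; the dataset and estimator are placeholder choices):

```python
# Greedy forward selection / backward elimination.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,        # the m features to keep
    direction="forward",           # "backward" = backward elimination
)
selector.fit(X, y)
print(selector.get_support(indices=True))  # indices of the chosen features
```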

SLIDE 11

Feature Extraction

SLIDE 12

Aspects of Dimension Reduction

- Linear vs. nonlinear
- Unsupervised vs. supervised
- Global vs. local
- Feature vectors vs. similarity (as the input)

SLIDE 13

Linear vs. Nonlinear

Linear: represents each reduced dimension as a linear combination of the original dimensions

Of the form aX + b, where a, X, and b are vectors/matrices, e.g.,
Y1 = 3*X1 - 4*X2 + 0.3*X3 - 1.5*X4
Y2 = 2*X1 + 3.2*X2 - X3 + 2*X4

Naturally capable of mapping new data to the same space

[Example: a 4 × 2 data matrix (dimensions X1–X4, items D1–D2) is mapped by dimension reduction to a 2 × 2 matrix (reduced dimensions Y1–Y2).]
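As a sketch of such a linear mapping applied to columns-as-items data (a minimal sketch assuming NumPy; the input values are placeholders):

```python
# Applying the linear mapping above: Y = A X, columns as data items.
import numpy as np

A = np.array([[3.0, -4.0,  0.3, -1.5],   # coefficients for Y1
              [2.0,  3.2, -1.0,  2.0]])  # coefficients for Y2
X = np.random.rand(4, 2)                 # 4 original dims, 2 data items
Y = A @ X                                # 2 reduced dims, 2 data items
print(Y.shape)                           # (2, 2)
```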

SLIDE 14

Linear vs. Nonlinear

Linear: represents each reduced dimension as a linear combination of the original dimensions, e.g., Y1 = 3*X1 - 4*X2 + 0.3*X3 - 1.5*X4, Y2 = 2*X1 + 3.2*X2 - X3 + 2*X4; naturally capable of mapping new data to the same space

Nonlinear: more complicated, but generally more powerful; a recently popular topic

SLIDE 15

Unsupervised vs. Supervised

Unsupervised: uses only the input data

[Diagram: the dimension reduction pipeline (high-dim data → low-dim data, with no. of dimensions, other parameters, and a dim-reducing transformer for new data); no additional info about the data is used.]

SLIDE 16

Unsupervised vs. Supervised

Supervised: uses the input data + additional info

[Diagram: the same pipeline, now with additional info about the data as an input.]

SLIDE 17

Unsupervised vs. Supervised

Supervised: uses the input data + additional info, e.g., grouping labels

[Diagram: the same pipeline, with grouping labels as the additional info.]

SLIDE 18

Global vs. Local

Dimension reduction typically tries to preserve all the relationships/distances in the data, but information loss is unavoidable! So what should we emphasize more?

Global: treats all pairwise distances as equally important; focuses on preserving large distances

Local: focuses on small distances and neighborhood relationships; an active research area, e.g., manifold learning

SLIDE 19

Feature vectors vs. Similarity (as an input)

Typical setup: feature vectors as the input

[Diagram: high-dim data (d × n) → dimension reduction → low-dim data (k × n), with user-specified no. of dimensions (k), other parameters, additional info about the data, and a dim-reducing transformer for new data.]

SLIDE 20

Feature vectors vs. Similarity (as an input)

Alternatively, a similarity matrix can be taken as the input instead:

[Diagram: similarity matrix → dimension reduction → low-dim data.]

The (i, j)-th component indicates the similarity between the i-th and j-th data items. Assuming the distance is a metric, the similarity matrix is symmetric.

SLIDE 21

Feature vectors vs. Similarity (as an input)

Internally, a method may first convert the feature vectors into a similarity matrix and then perform dimension reduction:

[Diagram: high-dim data (d × n) → similarity matrix (n × n) → dimension reduction (graph embedding) → low-dim data (k × n).]
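A rough sketch of the feature-vectors-to-similarity-matrix step (assuming NumPy/SciPy; the Gaussian kernel and its bandwidth are illustrative choices, not the deck's prescribed ones):

```python
# Converting feature vectors into a symmetric similarity matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(100, 10)        # 100 items, 10 dims (rows as items)
D = squareform(pdist(X))           # pairwise Euclidean distances (n x n)
sigma = 1.0                        # illustrative kernel bandwidth
S = np.exp(-D**2 / (2 * sigma**2)) # Gaussian-kernel similarities
assert np.allclose(S, S.T)         # symmetric, as noted above
```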

SLIDE 22

Feature vectors vs. Similarity (as an input)

Why is it called graph embedding? A similarity matrix can be viewed as a graph in which each similarity value is an edge weight.

[Diagram: high-dim data (d × n) → similarity matrix, drawn as a weighted graph → dimension reduction (graph embedding) → low-dim data.]

SLIDE 23

Methods

Traditional:
- Principal component analysis (PCA)
- Multidimensional scaling (MDS)
- Linear discriminant analysis (LDA)

Advanced (nonlinear, kernelized, manifold learning):
- Isometric feature mapping (Isomap)

* Matlab code is available at
http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html

SLIDE 24

Principal Component Analysis

Finds the axes showing the largest variation and projects all points onto them; the reduced dimensions are orthogonal

Algorithm: eigendecomposition
Pros: fast
Cons: limited performance

Properties: linear, unsupervised, global, feature vectors as input

[Figure: a 2D point cloud with principal components PC1 and PC2. Image source: http://en.wikipedia.org/wiki/Principal_component_analysis]
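A minimal usage sketch (assuming scikit-learn; the data is a placeholder):

```python
# PCA down to 2 dimensions, plus mapping new data to the same space.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 100)            # 500 items, 100 dims (rows as items)
pca = PCA(n_components=2)
Y = pca.fit_transform(X)                # 500 x 2 embedding
print(pca.explained_variance_ratio_)    # variance captured per component
Y_new = pca.transform(np.random.rand(1, 100))  # new data, same space
```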

SLIDE 25

PCA – Some Questions

Algorithm:
1. Subtract the mean from the dataset: X − μ
2. Form the covariance matrix (X − μ)'(X − μ)
3. Perform eigendecomposition on this covariance matrix

Key questions:
- Why the covariance matrix?
- SVD on the original matrix vs. eigendecomposition on the covariance matrix?
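A sketch of these steps and of the SVD connection (assuming NumPy; the data is a placeholder):

```python
# PCA via eigendecomposition of the covariance matrix.
import numpy as np

X = np.random.rand(500, 100)            # rows as items here
Xc = X - X.mean(axis=0)                 # step 1: subtract the mean
C = Xc.T @ Xc / (len(Xc) - 1)           # step 2: covariance matrix (d x d)
vals, vecs = np.linalg.eigh(C)          # step 3: eigendecomposition
top2 = vecs[:, np.argsort(vals)[::-1][:2]]  # two largest eigenvalues
Y = Xc @ top2                           # 2D embedding

# The same subspace comes from SVD of the centered matrix itself:
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
# Vt[:2] spans the same space as top2 (up to sign).
```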

SLIDE 26

Multidimensional Scaling (MDS)

Main idea: tries to preserve the given pairwise distances in the low-dimensional space

Metric MDS: preserves the given distance values

Nonmetric MDS: for when you only know/care about the ordering of distances; preserves only the orderings of the distance values

Algorithm: gradient-descent type
c.f. classical MDS is the same as PCA

Properties: nonlinear, unsupervised, global, similarity input

[Figure: the objective compares each ideal distance with the corresponding low-dim distance.]
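A minimal usage sketch (assuming scikit-learn; the distance matrix is a placeholder):

```python
# Metric MDS from a precomputed pairwise distance matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

X = np.random.rand(100, 50)
D = squareform(pdist(X))                     # pairwise distances (n x n)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Y = mds.fit_transform(D)                     # 100 x 2 embedding
# metric=False would give nonmetric MDS (preserves only orderings).
```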

SLIDE 27

Multidimensional Scaling

Pros: widely used (works well in general)
Cons: slow (an n-body problem)
- Nonmetric MDS is much slower than metric MDS
- Fast algorithms are available: the Barnes-Hut algorithm, GPU-based implementations

SLIDE 28

Linear Discriminant Analysis

What if clustering information is available? LDA tries to separate the clusters by
- putting different clusters as far apart as possible, and
- making each cluster as compact as possible.

[Figure: panels (a) and (b) illustrate the two criteria.]

SLIDE 29

Aspects of Dimension Reduction

Unsupervised vs. Supervised (recap)

Supervised: uses the input data + additional info, e.g., grouping labels

[Diagram: the dimension reduction pipeline with grouping labels as the additional info.]

SLIDE 30

Linear Discriminant Analysis (LDA)

vs. Principal Component Analysis

[Figure: 2D visualizations of a 7-component Gaussian mixture in 1,000 dimensions, by linear discriminant analysis (supervised) and by principal component analysis (unsupervised).]

SLIDE 31

LDA

1. Compute the means of the two classes and the global mean (μ1, μ2, μ)
2. Compute the within-class covariance matrix S_W
3. Compute the between-class covariance matrix S_B from the means
4. Find the eigenvectors corresponding to the largest eigenvalues of inv(S_W) * S_B

Fisher’s LDA generalizes gracefully to $C$ classes, projecting the data onto at most $C - 1$ dimensions via $y = W^T x$. In standard notation (following the lecture notes linked below), the scatter matrices are

$$S_W = \sum_{j=1}^{C} \sum_{x \in \omega_j} (x - \mu_j)(x - \mu_j)^T, \qquad \mu_j = \frac{1}{N_j} \sum_{x \in \omega_j} x,$$

$$S_B = \sum_{j=1}^{C} N_j (\mu_j - \mu)(\mu_j - \mu)^T, \qquad \mu = \frac{1}{N} \sum_{x} x = \frac{1}{N} \sum_{j=1}^{C} N_j \mu_j,$$

and the total scatter satisfies $S_T = S_B + S_W$.

[Figure: three classes in the (x1, x2) plane with within-class scatters S_W1, S_W2, S_W3 and between-class scatters S_B1, S_B2, S_B3.]

*http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf

SLIDE 32

Linear Discriminant Analysis

Maximally separates clusters by
- putting different clusters far apart, and
- shrinking each cluster compactly.

Algorithm: generalized eigendecomposition
Pros: better at showing cluster structure
Cons: may distort the original relationships in the data

Properties: linear, supervised, global, feature vectors as input
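A minimal usage sketch (assuming scikit-learn; Iris is just a convenient labeled dataset):

```python
# Supervised LDA using class labels, reduced to 2 dimensions.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)      # y holds the grouping labels
lda = LinearDiscriminantAnalysis(n_components=2)
Y = lda.fit_transform(X, y)            # at most (n_classes - 1) dims
print(Y.shape)                         # (150, 2)
```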

SLIDE 33

Methods

Traditional:
- Principal component analysis (PCA)
- Multidimensional scaling (MDS)
- Linear discriminant analysis (LDA)

Advanced (nonlinear, kernelized, manifold learning):
- Isometric feature mapping (Isomap)

* Matlab code is available at
http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html

SLIDE 34

Manifold Learning

Swiss Roll Data

Swiss roll data: originally in 3D. What is its intrinsic dimensionality (allowing flattening)?

SLIDE 35

Manifold Learning

Swiss Roll Data

Swiss roll data: originally in 3D; its intrinsic dimensionality (allowing flattening) is 2D.

What if your data has low intrinsic dimensionality but resides in a high-dimensional space?

SLIDE 36

Isomap

(Isometric Feature Mapping)

Idea: preserve the pairwise geodesic distances (along the manifold).
1. Compute geodesic distances as shortest-path lengths on a k-nearest-neighbor (k-NN) graph
2. Eigendecompose* the pairwise geodesic distance matrix to obtain the embedding that best preserves the given distances

* Eigendecomposition is also the main algorithmic step of PCA

SLIDE 37

Isomap

(Isometric Feature Mapping)

Algorithm: all-pairs shortest-path computation + eigendecomposition
Pros: performs well in general
Cons: slow (shortest paths), sensitive to parameters

Properties: nonlinear, unsupervised, global (all pairwise distances are considered), feature vectors as input
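A minimal usage sketch (assuming scikit-learn, which bundles a swiss roll generator):

```python
# Isomap: k-NN graph + geodesic distances + embedding, on a swiss roll.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)  # 3D swiss roll
iso = Isomap(n_neighbors=10, n_components=2)
Y = iso.fit_transform(X)               # 2D "unrolled" coordinates
print(Y.shape)                         # (1000, 2)
```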

SLIDE 38

Practitioner’s Guide

Caveats

- Trustworthiness of dimension reduction results: distortion/information loss in 2D/3D is inevitable
- The best result of a method may not align with what we want, e.g., PCA visualization of facial image data

[Figure: PCA visualizations of facial images using dimensions (1, 2) vs. dimensions (3, 4).]

SLIDE 39

Practitioner’s Guide

General Recommendation

- Want something simple and fast to visualize data? PCA, force-directed layout
- Want to try a manifold learning method first? Isomap; empirically, it often gives the best results
- Have cluster labels to use (pre-given or computed)? LDA (supervised)
  - A supervised approach is sometimes the only viable option when your data do not have clearly separable clusters

SLIDE 40

Practitioner’s Guide

Results Still Not Good?

Try various pre-processing steps (a sketch follows the list):

- Data centering: subtract the global mean from each vector
- Normalization: make each vector have unit Euclidean norm; otherwise, a few outliers can affect the dimension reduction significantly
- Application-specific pre-processing:
  - Documents: TF-IDF weighting; remove terms that are too rare and/or too short
  - Images: histogram normalization
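A minimal sketch of the first two steps (assuming NumPy; the data is a placeholder):

```python
# Data centering and unit-norm normalization, rows as data items.
import numpy as np

X = np.random.rand(500, 100)
Xc = X - X.mean(axis=0)                          # subtract the global mean
norms = np.linalg.norm(Xc, axis=1, keepdims=True)
Xn = Xc / np.maximum(norms, 1e-12)               # unit Euclidean norm per item
```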

SLIDE 41

Practitioner’s Guide

Too Slow?

Apply PCA to reduce to an intermediate number of dimensions before the main dimension reduction step; the results may even improve because PCA removes noise (see the sketch below).

See if an approximate but faster version exists:
- Landmark versions (use only a subset of the data items), e.g., landmark Isomap
- Linearized versions (the same criterion, but only linear mappings allowed), e.g., Laplacian eigenmaps → locality preserving projection
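A minimal sketch of PCA as the intermediate step (assuming scikit-learn; the sizes and the choice of Isomap as the main method are illustrative):

```python
# PCA to an intermediate dimensionality before a slower method (Isomap).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline

X = np.random.rand(1000, 5000)                # very high-dimensional data
pipe = make_pipeline(PCA(n_components=50),    # fast intermediate reduction
                     Isomap(n_neighbors=10, n_components=2))
Y = pipe.fit_transform(X)                     # final 2D embedding
```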

SLIDE 42

Practitioner’s Guide

Still need more?

Tweak dimension reduction for your own purposes:
- Play with the algorithm, convergence criteria, etc.
- See if you can impose label information
- Restrict the number of iterations to save computation time

The main purpose of DR is to serve us in exploring data and solving complicated real-world problems.

SLIDE 43

Take Away

               PCA   MDS   LDA   Isomap
Supervised      ✖     ✖     ✔      ✖
Linear          ✔     ✖     ✔      ✖
Global          ✔     ✔     ✔      ✔
Feature input   ✔     ✖     ✔      ✔

SLIDE 44

Useful Resources

- Tutorial on PCA: http://arxiv.org/pdf/1404.1100.pdf
- Tutorial on LDA: http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf
- Review article: http://www.iai.uni-bonn.de/~jz/dimensionality_reduction_a_comparative_review.pdf
- Matlab toolbox for dimension reduction: http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
- Matlab manifold learning demo: http://www.math.ucla.edu/~wittman/mani/