STAT 209: Dimensionality Reduction
November 26, 2019
Colin Reimer Dawson
Outline: Dimensionality Reduction
High Dimensional Data
● Modern datasets often have huge numbers of variables
● E.g., images, biomarker data, measurements at fine-grained time points, social networks, product preferences
● Clustering can be a useful way to find “groups” of similar observations
● However, distance measures behave strangely in high dimensions (e.g., points tend to become nearly equidistant from one another)
● It can be useful to try to extract a few dimensions that carry most of the “signal”
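To see one of those “strange properties” in action, here is a minimal simulated sketch (not from the course data). The helper `distance_ratio()` and the choice of 100 uniform random points are assumptions made purely for illustration. It compares the smallest and largest pairwise distances as the number of dimensions grows.

```r
## Simulated illustration (not course data): as the number of dimensions p
## grows, the closest and farthest pairs of points end up at nearly the
## same distance, so "near" and "far" carry less information.
set.seed(209)

## Hypothetical helper: ratio of the smallest to the largest pairwise distance
distance_ratio <- function(p, n = 100) {
  X <- matrix(runif(n * p), nrow = n)  # n points, uniform in the unit cube
  d <- dist(X)                         # all pairwise Euclidean distances
  min(d) / max(d)                      # near 1 means all pairs are about equally far
}

dims <- c(2, 10, 100, 1000)
ratios <- sapply(dims, distance_ratio)
data.frame(dimensions = dims, min_over_max = round(ratios, 2))
```

The ratio should climb toward 1 as the dimension increases, which is one reason clustering on raw high dimensional data can be unreliable.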
Images Have Many Variables
but maybe only a few meaningful “features”
High dimensional inputs
Comprehensible arranged this way...
“Eigenfaces”
Finding the "Main Direction" of Variation
[Figure: scatterplot of QuizCentered (vertical axis, −20 to 20) against MidtermCentered (horizontal axis, −20 to 20)]
Finding the “Eigen-features”

## select() and mutate() come from dplyr; extract2() comes from magrittr
library(dplyr)
library(magrittr)

## Here I am pulling out the perpendicular directions in (Midterm,Quiz)
## space that align with the ellipse on the scatterplot.
## If you know some linear algebra:
## These are the eigenvectors of the covariance matrix
directions <-
  select(Scores, Midterm, Quiz) %>%
  cov() %>%
  eigen()
directions %>% extract2("vectors") %>% round(digits = 2)

      [,1]  [,2]
[1,] -0.97  0.24
[2,] -0.24 -0.97

## Creating two new variables that are a weighted sum and weighted
## difference of the midterm and quiz score, with weights chosen so
## that the new variables are uncorrelated.
## (The sign of an eigenvector is arbitrary, so V1 uses the first
## column with its sign flipped.)
Scores_augmented <-
  mutate(Scores,
         V1 = 0.97 * Midterm + 0.24 * Quiz,
         V2 = 0.24 * Midterm - 0.97 * Quiz)
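A quick way to check the “uncorrelated” claim is to run the same recipe on made-up data. The sketch below is only an illustration: `Scores_sim` and every number in it are simulated stand-ins, not the course's `Scores` data. It projects centered scores onto the eigenvectors of the covariance matrix, then compares the result with base R's `prcomp()`.

```r
## Simulated stand-in for the course data: `Scores_sim` and its numbers
## are made up for illustration only.
set.seed(209)
n <- 200
Midterm <- rnorm(n, mean = 75, sd = 10)
Quiz    <- 0.25 * Midterm + rnorm(n, mean = 50, sd = 3)
Scores_sim <- data.frame(Midterm, Quiz)

## Eigenvectors of the covariance matrix, as on the slide
eig <- eigen(cov(Scores_sim))
W <- eig$vectors            # columns are the perpendicular directions

## Project the centered scores onto those directions
centered <- scale(Scores_sim, center = TRUE, scale = FALSE)
V <- centered %*% W
colnames(V) <- c("V1", "V2")

round(cor(V), 3)            # off-diagonal entries are 0 up to rounding error
apply(V, 2, var)            # variances of the new variables ...
eig$values                  # ... match the eigenvalues

## prcomp() finds the same directions; columns may differ in sign
pca <- prcomp(Scores_sim)
round(abs(pca$rotation) - abs(W), 6)
```

The first new variable carries the largest share of the variance, which is what makes it a candidate “main direction” when keeping only a few dimensions.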