Principal Component Analysis

  1. DataCamp Multivariate Probability Distributions in R
     Principal Component Analysis
     Surajit Ray, Reader, University of Glasgow

  2. Principal Component Analysis (PCA) goals
     - Dimension reduction
     - Creating uncorrelated variables
     - Capturing variability in fewer dimensions

  3. Algorithm
     - PC1 explains the maximum variation (orange direction in the slide's figure)
     - PC2 is uncorrelated with PC1 and explains the maximum remaining variation (blue direction)
     - PC3 is uncorrelated with PC1 and PC2 and explains the maximum remaining variation (green direction)
     - The princomp() function calculates the PCs
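
The successive maximum-variance directions described above correspond to an eigendecomposition of the covariance (or correlation) matrix. A minimal sketch on the built-in mtcars data, assuming the correlation-matrix version (cor = TRUE); the variable names are illustrative and not taken from the slides:

```r
# Numeric columns of mtcars (the binary vs and am columns are left out)
x <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "gear", "carb")]

# Eigendecomposition of the correlation matrix:
# eigenvectors give the PC directions, eigenvalues the PC variances
eig <- eigen(cor(x))

# princomp() on the same data, based on the correlation matrix
pca <- princomp(x, cor = TRUE)

# The PC standard deviations agree with the square roots of the eigenvalues
sdev_match <- all.equal(unname(pca$sdev), sqrt(eig$values))

# Scores on different PCs are uncorrelated: off-diagonal correlations are ~0
score_cor <- cor(pca$scores)
max_offdiag <- max(abs(score_cor[upper.tri(score_cor)]))

print(sdev_match)
print(max_offdiag)
```

The eigenvalues come out in decreasing order, which is why PC1 carries the most variation and each later component carries the most of what remains.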

  4. Principal Component Analysis in R
     Simplified format:
         princomp(x, cor = FALSE, scores = TRUE)
     - x: a numeric matrix or data frame
     - cor: if TRUE, use the correlation matrix instead of the covariance matrix
     - scores: if TRUE, compute the scores (projections of the data onto the principal components)
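
The cor argument matters when variables are measured on very different scales. The sketch below compares the two settings on mtcars; it is an illustration under that assumption rather than part of the original slides, and the helper prop_var() is a hypothetical name introduced here:

```r
# Numeric columns of mtcars (columns 8 and 9 are the binary vs and am)
x <- mtcars[, -c(8, 9)]

# PCA on the covariance matrix (default) and on the correlation matrix
pca_cov <- princomp(x, cor = FALSE)
pca_cor <- princomp(x, cor = TRUE)

# Hypothetical helper: proportion of variance explained by each component
prop_var <- function(fit) fit$sdev^2 / sum(fit$sdev^2)

pc1_cov <- unname(prop_var(pca_cov)[1])
pc1_cor <- unname(prop_var(pca_cor)[1])

# Variable with the largest absolute loading on PC1 in the covariance version
top_var_cov <- names(which.max(abs(pca_cov$loadings[, 1])))

print(c(covariance = pc1_cov, correlation = pc1_cor))
print(top_var_cov)
```

With the covariance matrix, the variables with the largest raw variances dominate the first component, so standardizing with cor = TRUE is the usual choice when units differ.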

  5. Principal Component Analysis of the mtcars dataset
     The mtcars dataset records 11 variables on fuel consumption, design and performance for 32 automobiles.
         head(mtcars, 5)
                            mpg cyl  disp  hp drat    wt  qsec vs am gear carb
         Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
         Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
         Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
         Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
         Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2

  6. Selecting numeric columns from the mtcars dataset
     Exclude the vs and am variables, which are both binary:
         mtcars.sub <- mtcars[ , -c(8, 9)]
     Perform PCA:
         cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)
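
Positional indices such as -c(8, 9) break silently if the column order changes. One possible alternative, sketched here with illustrative names (drop.cols and mtcars.sub.byname are not from the slides), is to drop the columns by name:

```r
# Drop the binary columns by name instead of by position
drop.cols <- c("vs", "am")
mtcars.sub.byname <- mtcars[, setdiff(names(mtcars), drop.cols)]

# The positional version used on the slide, for comparison
mtcars.sub <- mtcars[, -c(8, 9)]

# Both approaches select the same nine columns in the same order
same <- identical(mtcars.sub.byname, mtcars.sub)
print(same)
```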

  7. princomp function output
         cars.pca  # Output of cars.pca
         Standard deviations:
         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
          2.378  1.443  0.710  0.515  0.428  0.352  0.324  0.242  0.149

         summary(cars.pca)  # Summary of cars.pca
         Importance of components:
                                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8  Comp.9
         Standard deviation      2.378  1.443  0.710 0.5148 0.4280 0.3518 0.3241 0.2419 0.14896
         Proportion of Variance  0.628  0.231  0.056 0.0294 0.0204 0.0138 0.0117 0.0065 0.00247
         Cumulative Proportion   0.628  0.860  0.916 0.9453 0.9656 0.9794 0.9910 0.9975 1.00000

  8. Let's apply principal component analysis!

  9. Choosing the number of components

  10. Summary of princomp object
          summary(cars.pca)  # Summary of cars.pca
          Importance of components:
                                 Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8  Comp.9
          Standard deviation      2.378  1.443  0.710 0.5148 0.4280 0.3518 0.3241 0.2419 0.14896
          Proportion of Variance  0.628  0.231  0.056 0.0294 0.0204 0.0138 0.0117 0.0065 0.00247
          Cumulative Proportion   0.628  0.860  0.916 0.9453 0.9656 0.9794 0.9910 0.9975 1.00000

  11. Using the scree plot
      Method 1: proportion of variation explained
          screeplot(cars.pca, type = "lines")
      Choose the number of components at the point where the steep part of the curve gives way to a flat line.

  12. Cumulative variance explained
      Method 2: cumulative variation. Retain enough components to explain a predetermined proportion of the total variation.
          summary(cars.pca)  # Summary of cars.pca
          Importance of components:
                                 Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8  Comp.9
          Standard deviation      2.378  1.443  0.710 0.5148 0.4280 0.3518 0.3241 0.2419 0.14896
          Proportion of Variance  0.628  0.231  0.056 0.0294 0.0204 0.0138 0.0117 0.0065 0.00247
          Cumulative Proportion   0.628  0.860  0.916 0.9453 0.9656 0.9794 0.9910 0.9975 1.00000

  13. Calculating cumulative proportional variance
          # Variance explained
          pc.var <- cars.pca$sdev^2
          # Proportion of variation
          pc.pvar <- pc.var / sum(pc.var)
          # Cumulative proportion
          plot(cumsum(pc.pvar), type = 'b')
          abline(h = 0.9)

  14. Calculating cumulative proportional variance (continued)
      Reading the cumulative-proportion plot from slide 13 against the horizontal line at 0.9: the first 3 PCs explain just over 90 percent of the variation (cumulative proportion 0.916).
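
The same threshold can be read off programmatically instead of from the plot. A small sketch, reusing the slide's variable names and adding n.comp as an illustrative name for the result:

```r
# Refit the PCA so the example is self-contained
cars.pca <- princomp(mtcars[, -c(8, 9)], cor = TRUE, scores = TRUE)

# Variance explained by each component, and its proportion of the total
pc.var <- cars.pca$sdev^2
pc.pvar <- pc.var / sum(pc.var)

# Index of the first component at which the cumulative proportion reaches 0.9
n.comp <- which(cumsum(pc.pvar) >= 0.9)[1]
print(n.comp)
```

The 0.9 cut-off is the value used on the slides; any other predetermined proportion can be substituted.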

  15. Let's practice using these techniques!

  16. Interpreting PCA outputs

  17. Attributes of princomp object
          cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)
          attributes(cars.pca)
          $names
          [1] "sdev"     "loadings" "center"   "scale"    "n.obs"    "scores"   "call"

  18. Interpretation of loadings
          cars.pca$loadings  # or loadings(cars.pca)
          Loadings:
               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
          mpg   0.393        -0.221        -0.321  0.720  0.381  0.125 -0.115
          cyl  -0.403        -0.252         0.117  0.224  0.159 -0.810 -0.163
          disp -0.397                0.339 -0.487         0.182         0.662
          hp   -0.367 -0.269               -0.295  0.354 -0.696  0.166 -0.252
          drat  0.312 -0.342  0.150  0.846  0.162               -0.135
          wt   -0.373  0.172  0.454  0.191 -0.187         0.428  0.198 -0.569
          qsec  0.224  0.484  0.628        -0.148  0.258 -0.276 -0.356  0.169
          gear  0.209 -0.551  0.207 -0.282 -0.562 -0.323        -0.316
          carb -0.245 -0.484  0.464 -0.214  0.400  0.357  0.206  0.108  0.320
      Blank entries are loadings smaller than 0.1 in absolute value, which the print method suppresses.

  19. Geometry of loadings - numerical values
      If we choose to retain two components:
          cars.pca$loadings[, 1:2]
          Loadings:
               Comp.1 Comp.2
          mpg   0.393
          cyl  -0.403
          disp -0.397
          hp   -0.367 -0.269
          drat  0.312 -0.342
          wt   -0.373  0.172
          qsec  0.224  0.484
          gear  0.209 -0.551
          carb -0.245 -0.484
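
Geometrically, each column of the loadings matrix is a unit-length direction in the space of the original variables, and the directions are mutually orthogonal. A short check of that property, with L and gram as illustrative names:

```r
# Refit the PCA so the example is self-contained
cars.pca <- princomp(mtcars[, -c(8, 9)], cor = TRUE, scores = TRUE)

# Plain numeric matrix of loadings (variables in rows, components in columns)
L <- unclass(cars.pca$loadings)

# t(L) %*% L should be the 9 x 9 identity matrix if the columns are orthonormal
gram <- crossprod(L)
is_orthonormal <- all.equal(gram, diag(ncol(L)), check.attributes = FALSE)

# Squared loadings in each column sum to 1
col_norms <- colSums(L^2)

print(is_orthonormal)
print(round(col_norms, 6))
```

Because each column has unit length, a squared loading can be read as the share of that component's direction attributable to a variable. The sign of a whole column is arbitrary.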

  20. Geometry of loadings - plot
          biplot(cars.pca, col = c("gray", "steelblue"), cex = c(0.5, 1.3))

  21. PCA scores
      Projection of the original dataset onto the principal components. A total of 9 scores is available for each observation.
          head(cars.pca$scores)  # PC scores of first 6 observations
                            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7  Comp.8  Comp.9
          Mazda RX4           0.67  -1.19  -0.21 -0.128  0.764 -0.127  0.430  0.0033  0.1697
          Mazda RX4 Wag       0.65  -0.99   0.11 -0.087  0.667 -0.067  0.456 -0.0575  0.0727
          Datsun 710          2.34   0.33  -0.21 -0.110 -0.077 -0.576 -0.392  0.2053 -0.1163
          Hornet 4 Drive      0.22   2.01  -0.33 -0.313 -0.248  0.085 -0.034  0.0241  0.1476
          Hornet Sportabout  -1.61   0.84  -1.05  0.150 -0.226  0.186  0.059 -0.1548  0.1571
          Valiant            -0.05   2.49   0.11 -0.885 -0.128 -0.234 -0.228 -0.1002  0.0043

  22. PCA scores on first two components
      Projection of the original dataset onto the principal components; scores on the first two components:
          head(cars.pca$scores[, 1:2])  # First two PC scores of first 6 observations
                            Comp.1 Comp.2
          Mazda RX4           0.67  -1.19
          Mazda RX4 Wag       0.65  -0.99
          Datsun 710          2.34   0.33
          Hornet 4 Drive      0.22   2.01
          Hornet Sportabout  -1.61   0.84
          Valiant            -0.05   2.49
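
To make "projection onto the principal components" concrete, the scores can be rebuilt by hand: center and scale the data with the values stored in the fitted object, then multiply by the loadings. A minimal sketch, with z and manual.scores as illustrative names:

```r
# Refit the PCA so the example is self-contained
x <- mtcars[, -c(8, 9)]
cars.pca <- princomp(x, cor = TRUE, scores = TRUE)

# Standardize each variable with the center and scale stored by princomp()
z <- scale(x, center = cars.pca$center, scale = cars.pca$scale)

# Project the standardized data onto the loading vectors
manual.scores <- z %*% unclass(cars.pca$loadings)

# The hand-computed projection agrees with the stored scores
scores_match <- all.equal(unname(manual.scores), unname(cars.pca$scores))
print(scores_match)
```

Using the stored center and scale matters here: princomp() standardizes with the divisor n rather than n - 1, so a bare scale(x) would give slightly different values.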

  23. Calculating, visualizing and interpreting scores
          biplot(cars.pca, col = c("steelblue", "white"), cex = c(0.8, 0.01))

  24. Plotting scores using ggplot
          library(ggplot2)
          scores <- data.frame(cars.pca$scores)
          ggplot(data = scores, aes(x = Comp.1, y = Comp.2, label = rownames(scores))) +
            geom_text(size = 4, col = "steelblue")

  25. Plotting and coloring scores using ggplot
          cylinder <- factor(mtcars$cyl)
          ggplot(data = scores, aes(x = Comp.1, y = Comp.2, label = rownames(scores), color = cylinder)) +
            geom_text(size = 4)

  26. Using the factoextra library
      - fviz_pca_biplot()
      - fviz_pca_ind()
      - fviz_pca_var()

  30. Let's practice these functions!

  31. Multi-dimensional Scaling
