
Multivariate Ordination Analyses: Principal Component Analysis Dilys - PowerPoint PPT Presentation



  1. Multivariate Ordination Analyses: Principal Component Analysis. Dilys Vela, Tatiana Boza

  2. Multivariate Analyses
     • A multivariate data set includes more than one variable recorded from a number of replicate sampling or experimental units, sometimes referred to as objects.

  3. • If these objects are organisms, the variables might be morphological or physiological measurements.
     • If the objects are ecological sampling units, the variables might be physicochemical measurements or species abundances.

  4. What are ordination analyses?
     • Ordination is arranging items along a scale (axis) or multiple axes. The purpose of ordination is to summarize complex relationships graphically, extracting one or a few dominant patterns from an infinite number of possible patterns.
     • Placing variables along an axis is possible because the ordination is based on the correlations among the variables.

  5. What do ordination analyses help us to see?
     • Select the most important variables from multiple variables imagined or hypothesized.
     • Reveal unforeseen patterns and suggest unforeseen processes.

  6. What types of question can we answer with ordination analysis?
     • In ecology, to seek and describe patterns of process.
     • In community ecology, to describe the strongest patterns in species composition.
     • In systematics, to recognize and to define species boundaries.

  7. [Diagram: classification of multivariate analyses]
     • Multivariate Analysis divides into Classification (or Clustering Analysis) and Ordination Analysis.
     • Ordination Analysis divides into Indirect Gradient Analysis and Direct Gradient Analysis.
     • Branch labels: Distance Values, Raw Data Available, Few Species, Many Species.
     • Methods shown: Principal Components Analysis (PCA), Correspondence Analysis (CA), Detrended CA (DCA), Principal Coordinate Analysis (PCoA), Non-metric Dimensional Analysis (NMDS), Linear Regression, Redundancy Analysis (RDA), Canonical CA (CCA).

  8. Principal Components Analysis
     • Principal component analysis (PCA) is a statistical technique that has been specifically developed to address data reduction.
     • In general terms, the major aim of PCA is to reduce the complexity of the interrelationships among a potentially large number of observed variables to a relatively small number of linear combinations of them, which are referred to as principal components.
     • Principal components analysis finds a set of orthogonal standardized linear combinations which together explain all of the variation in the original data.
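
The description above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the presenters' procedure: the data are simulated (hypothetical), and the covariance matrix is decomposed with `numpy.linalg.eigh`. It shows that the components are orthogonal, unit-length linear combinations and that together they account for all of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 objects (rows) by 4 variables (columns).
X = rng.normal(size=(50, 4))
X[:, 1] += 0.8 * X[:, 0]  # make two of the variables correlated

# Centre each variable, then eigen-decompose the covariance matrix.
Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)

# eigh returns eigenvalues in ascending order; PCA convention is descending.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each column of eigvecs holds the coefficients of one linear combination.
scores = Xc @ eigvecs

explained = eigvals / eigvals.sum()
print("proportion of variance per component:", np.round(explained, 3))
print("cumulative proportion:", np.round(np.cumsum(explained), 3))
```

Data reduction then amounts to keeping only the first few columns of `scores`, chosen by how much cumulative variance is judged sufficient.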

  9. What are the assumptions of PCA?
     • Assumes relationships among variables.
     • The cloud of points in p-dimensional space has linear dimensions that can be effectively summarized by the principal axes.
     • If the structure in the data is NONLINEAR (the cloud of points twists and curves its way through p-dimensional space), the principal axes will not be an efficient and informative summary of the data.
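
A small, hypothetical example of why nonlinearity matters: PCA works from correlations or covariances, which measure only linear association. In the sketch below (made-up values), `y_curved` is completely determined by `x`, yet their correlation is essentially zero, so a correlation-based method would see no structure.

```python
import numpy as np

# Hypothetical variable, symmetric around zero.
x = np.linspace(-3.0, 3.0, 61)

y_linear = 2.0 * x + 1.0  # straight-line relationship
y_curved = x ** 2         # perfectly dependent on x, but curved

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print("correlation, linear relationship:", round(r_linear, 3))
print("correlation, curved relationship:", round(r_curved, 3))
```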

  10. Considerations before running a PCA
     • Normal distributions
     • Data outliers
     • Transformations
     • Standardization
     • Data matrix

  11. Normal Distributions
     • When using PCA, data normality is not essential. However, these methods are based on the correlation or covariance matrix, which is strongly affected by non-normally distributed data and the presence of outliers.

  12. Data outliers
     • Extreme values as well as outliers can have a severe influence on PCA, since it is based on the correlation or covariance matrix (Pison et al., 2003).
     • Outliers should thus be removed prior to the statistical analysis, or statistical methods able to handle outliers should be employed, and the influence of extreme values needs to be reduced (e.g., via a suitable transformation).
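
To illustrate how strongly a single outlier can affect the matrix PCA is built on, the sketch below uses simulated (hypothetical) data: two unrelated variables, then the same data with one extreme object appended. The value 25.0 is an arbitrary choice for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical variables with no real relationship between them.
x = rng.normal(size=30)
y = rng.normal(size=30)
r_clean = np.corrcoef(x, y)[0, 1]

# Append a single extreme object that is far from the rest on both variables.
x_out = np.append(x, 25.0)
y_out = np.append(y, 25.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print("correlation without the outlier:", round(r_clean, 3))
print("correlation with one outlier:   ", round(r_outlier, 3))
```

One object out of 31 is enough to create a strong apparent correlation, which a PCA would then report as a dominant axis.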

  13. Transformations
     • Transformations change the scale of measurement of the data, usually to meet the normality assumption of parametric analyses and the homogeneity of variance assumption of most of these analyses.
     • Transformations are particularly important for multivariate procedures based on eigenanalysis (e.g. principal components analysis) because covariances and correlations measure linear relationships between variables.
     • Transformations that improve linearity will increase the efficiency with which the eigenanalysis extracts the eigenvectors.
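
As a hedged illustration of a transformation improving linearity, the sketch below assumes a hypothetical exponential relationship (the coefficient 0.8 is arbitrary) and compares the correlation before and after a log transformation.

```python
import numpy as np

# Hypothetical variables with an exponential (nonlinear) relationship.
x = np.linspace(1.0, 10.0, 50)
y = np.exp(0.8 * x)

r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, np.log(y))[0, 1]  # log transform makes it linear

print("correlation on the raw scale:", round(r_raw, 3))
print("correlation after log(y):    ", round(r_log, 3))
```

Which transformation is suitable depends on the data; the logarithm is only one common choice.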

  14. Standardization
     • The first stage in rotating the data cloud is to standardize the data by subtracting the mean and dividing by the standard deviation.
     • It may be argued that we should not divide by the standard deviation. By standardizing, we are giving all species the same variation, i.e. a standard deviation of 1.
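
The standardization step can be written out directly. The abundance values below are made up for illustration, and the sample standard deviation (`ddof=1`) is an assumption; the slides do not say which divisor is used.

```python
import numpy as np

# Hypothetical abundances of 3 species (columns) at 5 sites (rows).
X = np.array([
    [2.0, 110.0, 0.5],
    [4.0, 150.0, 0.7],
    [6.0, 90.0, 0.2],
    [8.0, 200.0, 0.9],
    [10.0, 130.0, 0.4],
])

mean = X.mean(axis=0)
sd = X.std(axis=0, ddof=1)  # sample standard deviation of each species

# Subtract the mean and divide by the standard deviation, column by column.
Z = (X - mean) / sd

print("means after standardizing:", np.round(Z.mean(axis=0), 6))
print("standard deviations after standardizing:", Z.std(axis=0, ddof=1))
```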

  15. Data Matrix
     • We actually can have it both ways:
     • A PCA without dividing by the standard deviation is an analysis of the covariance matrix.
     • A PCA in which you do indeed divide by the standard deviation is an analysis of the correlation matrix.
     • When using species/variables measured in different units, you must use a correlation matrix.
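
The two options can be compared numerically. The sketch below uses simulated data with hypothetical units (metres and grams) and a helper `first_axis` that is introduced here only for illustration. It checks that the covariance matrix of standardized data is the correlation matrix, and shows that changing a unit alters the first axis of a covariance-based PCA but not of a correlation-based one.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical measurements: column 0 in metres, column 1 in grams.
X = rng.normal(size=(40, 2))
X[:, 1] = X[:, 0] + 0.5 * X[:, 1]  # make the two variables correlated


def first_axis(M):
    """Eigenvector of the symmetric matrix M with the largest eigenvalue."""
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argmax(vals)]


# Same objects, but column 0 re-expressed in millimetres.
X_mm = X * np.array([1000.0, 1.0])

# The covariance matrix of standardized data is the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_of_Z = np.cov(Z, rowvar=False)
R = np.corrcoef(X, rowvar=False)

pc1_cov = first_axis(np.cov(X, rowvar=False))
pc1_cov_mm = first_axis(np.cov(X_mm, rowvar=False))
pc1_cor = first_axis(R)
pc1_cor_mm = first_axis(np.corrcoef(X_mm, rowvar=False))

# Eigenvector signs are arbitrary, so compare axes by |cosine of the angle|.
print("covariance PCA, agreement across units: ", abs(pc1_cov @ pc1_cov_mm))
print("correlation PCA, agreement across units:", abs(pc1_cor @ pc1_cor_mm))
```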

  16. Look at Descriptors
     • Homogeneous nature? All the same kind? Same units? Same order of magnitude? Use the S matrix (covariance).
     • Heterogeneous nature? Different kinds? Different units? Different orders of magnitude? Use the R matrix (correlation).

  17. Advantages and Disadvantages
     • Correlation matrix, advantages:
       - The results of analyses for different sets of random variables are more directly comparable.
     • Correlation matrix, disadvantages:
       - There are considerable differences in the standard deviations, caused mainly by differences in scale.
       - None of the correlations is particularly large in absolute value.
       - PCs have moderate-sized coefficients for several of the variables.
       - PCs give coefficients for standardized variables and are therefore less easy to interpret directly.
     • Covariance matrix, advantages:
       - PCs for the covariance matrix are each dominated by a single variable.
       - The variances and total variance are more meaningful indices for measuring variability in data sets that are symmetric.
     • Covariance matrix, disadvantages:
       - The PCs are sensitive to the units of measurement used for each element of the variables. If there are large differences between the variances of the elements of the variables, then those variables whose variances are largest will tend to dominate the first few PCs.

  18. Eigenvalues & Eigenvectors
     • The eigenvectors are the loadings of the principal components spanning the new PCA coordinate system.
     • The amount of variability contained in each principal component is expressed by the eigenvalues, which are simply the variances of the scores.
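
The statement that eigenvalues are the variances of the scores can be checked numerically. This is a minimal sketch on simulated (hypothetical) data, using the covariance matrix and the sample variance (`ddof=1`) throughout.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 60 objects, 3 correlated variables.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(60, 3)) @ mixing
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: coordinates of the objects in the new PCA coordinate system.
scores = Xc @ eigvecs
score_var = scores.var(axis=0, ddof=1)

print("eigenvalues:       ", np.round(eigvals, 4))
print("variance of scores:", np.round(score_var, 4))
```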

  19. • PCA searches for the direction in the multivariate space that contains the maximum variability.
     • This is the direction of the first principal component (PC1). The second principal component (PC2) has to be orthogonal (perpendicular) to PC1 and will contain the maximum amount of the remaining data variability. Subsequent principal components are found by the same principle.
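
A brute-force check of the idea above, on simulated (hypothetical) two-variable data: project the centred data onto many unit directions and confirm that none carries more variance than PC1, and that PC2 is perpendicular to it. The grid of 181 angles is an arbitrary choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical two-variable data forming an elongated, tilted cloud.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 1.0]])
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1 = eigvecs[:, -1]  # eigh sorts ascending, so the last column is PC1
pc2 = eigvecs[:, -2]

# Variance of the data projected onto unit vectors at many angles.
angles = np.linspace(0.0, np.pi, 181)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
projected_var = (Xc @ directions.T).var(axis=0, ddof=1)

var_pc1 = (Xc @ pc1).var(ddof=1)

print("variance along PC1:", round(var_pc1, 4))
print("largest variance over the tested directions:", round(projected_var.max(), 4))
print("PC1 . PC2 =", round(float(pc1 @ pc2), 12))
```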

  20. Biplots
     • A biplot is a visualization tool to present results of PCA. The PCA biplot is called the scaling process.
     • The loadings (arrows) represent the elements. The lengths of the arrows in the plot are directly proportional to the variability included in the two components (PC1 and PC2) displayed, and the angle between any two arrows is a measure of the correlation between those variables.
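
The link between arrow angles and correlations can be sketched numerically. The scaling used here (eigenvector entries multiplied by the square root of the eigenvalue, from a correlation-matrix PCA) is an assumption, since the slides do not say which biplot scaling was used. With all components retained, the cosine of the angle between two arrows equals the correlation exactly; on a two-axis plot it is an approximation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: 100 objects, 3 variables, two of them correlated.
X = rng.normal(size=(100, 3))
X[:, 1] += X[:, 0]
R = np.corrcoef(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# One arrow per variable (rows): eigenvector entries scaled by sqrt(eigenvalue).
arrows = eigvecs * np.sqrt(eigvals)


def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


cos_full = cosine(arrows[0], arrows[1])        # using all 3 components
cos_2d = cosine(arrows[0, :2], arrows[1, :2])  # as drawn on a PC1-PC2 biplot

print("correlation between variables 0 and 1:", round(R[0, 1], 4))
print("cosine of arrow angle, all components:", round(cos_full, 4))
print("cosine of arrow angle, PC1-PC2 only:  ", round(cos_2d, 4))
```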

  21. Misconceptions
     • PCA cannot cope with missing values (but neither can most other statistical methods).
     • It does not require normality.
     • It is not a hypothesis test.
     • There are no clear distinctions between response variables and explanatory variables.
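
On the missing-values point, a short sketch with made-up numbers: one missing entry turns part of the covariance matrix into NaN, so the eigenanalysis cannot proceed. Dropping incomplete objects, as done here, is only one simple workaround and is not something the slides prescribe.

```python
import numpy as np

# Hypothetical data with one missing value (NaN) in the second variable.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 1.0],
    [3.0, 1.0, 2.0],
    [4.0, 3.0, 5.0],
])

# The missing value propagates into the covariance matrix.
S_with_nan = np.cov(X, rowvar=False)

# Keep only the objects (rows) with no missing values.
complete = X[~np.isnan(X).any(axis=1)]
S_complete = np.cov(complete, rowvar=False)

print("covariance matrix with a missing value:\n", S_with_nan)
print("covariance matrix from complete rows:\n", S_complete)
```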
