Factor Analysis and Beyond
Chris Williams
School of Informatics, University of Edinburgh
October 2011

Overview
◮ Principal Components Analysis
◮ Factor Analysis
◮ Independent Components Analysis
◮ Non-linear Factor Analysis
◮ Reading: Handout on "Factor Analysis and Beyond", Bishop §12.1, 12.2 (but not 12.2.1, 12.2.2, 12.2.3), 12.4 (but not 12.4.2)

Covariance matrix
◮ Let ⟨·⟩ denote an average
◮ Suppose we have a random vector X = (X_1, X_2, ..., X_d)^T
◮ ⟨X⟩ denotes the mean of X, (μ_1, μ_2, ..., μ_d)^T
◮ σ_ii = ⟨(X_i − μ_i)^2⟩ is the variance of component i (gives a measure of the "spread" of component i)
◮ σ_ij = ⟨(X_i − μ_i)(X_j − μ_j)⟩ is the covariance between components i and j
◮ In d dimensions there are d variances and d(d − 1)/2 covariances, which can be arranged into a covariance matrix Σ
◮ The population covariance matrix is denoted Σ, the sample covariance matrix is denoted S
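As a concrete illustration of the sample covariance matrix S, here is a minimal NumPy sketch (not from the slides; the data are synthetic and purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))          # 500 samples of a 3-dimensional random vector

    mu = X.mean(axis=0)                    # sample mean (mu_1, ..., mu_d)
    Xc = X - mu                            # centred data
    S = Xc.T @ Xc / (X.shape[0] - 1)       # sample covariance matrix S (d x d)

    # np.cov gives the same matrix; the diagonal entries are the variances sigma_ii,
    # the off-diagonal entries the covariances sigma_ij
    assert np.allclose(S, np.cov(X, rowvar=False))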
Principal Components Analysis
◮ If you want to use a single number to describe a whole vector drawn from a known distribution, pick the projection of the vector onto the direction of maximum variation (variance)
◮ Assume ⟨x⟩ = 0
◮ y = w · x
◮ Choose w to maximize ⟨y^2⟩, subject to w · w = 1
◮ Solution: w is the eigenvector corresponding to the largest eigenvalue of Σ = ⟨x x^T⟩

◮ Generalize this to consider projection from d dimensions down to m
◮ Σ has eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_d ≥ 0
◮ The directions to choose are the first m eigenvectors of Σ, corresponding to λ_1, ..., λ_m
◮ w_i · w_j = 0 for i ≠ j
◮ Fraction of total variation explained by using m principal components is

      (Σ_{i=1}^m λ_i) / (Σ_{i=1}^d λ_i)

◮ PCA is basically a rotation of the axes in the data space (a code sketch follows the factor analysis model below)

Factor Analysis
◮ A latent variable model; can the observations be explained in terms of a small number of unobserved latent variables?
◮ FA is a proper statistical model of the data; it explains covariance between variables rather than variance (cf PCA)
◮ FA has a controversial rôle in the social sciences

◮ visible variables: x = (x_1, ..., x_d)
◮ latent variables: z = (z_1, ..., z_m), z ∼ N(0, I_m)
◮ noise variables: e = (e_1, ..., e_d), e ∼ N(0, Ψ), where Ψ = diag(ψ_1, ..., ψ_d)
Assume

      x = μ + W z + e

then the covariance structure of x is

      C = W W^T + Ψ

W is called the factor loadings matrix
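Going back to the PCA slides above, here is a minimal sketch of the eigendecomposition recipe (illustrative code, not from the lecture; the function name and interface are assumptions):

    import numpy as np

    def pca(X, m):
        """Project data X (n x d) onto its first m principal components."""
        Xc = X - X.mean(axis=0)                       # enforce <x> = 0
        Sigma = Xc.T @ Xc / (Xc.shape[0] - 1)         # sample covariance matrix
        lam, W = np.linalg.eigh(Sigma)                # eigh returns eigenvalues in ascending order
        lam, W = lam[::-1], W[:, ::-1]                # reorder so lambda_1 >= ... >= lambda_d
        explained = lam[:m].sum() / lam.sum()         # fraction of total variation explained
        Y = Xc @ W[:, :m]                             # projections onto the first m eigenvectors
        return Y, W[:, :m], explained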
p(x) is like a multivariate Gaussian pancake

      p(x|z) ∼ N(W z + μ, Ψ)

      p(x) = ∫ p(x|z) p(z) dz

      p(x) ∼ N(μ, W W^T + Ψ)

◮ Rotation of solution: if W is a solution, so is W R where R R^T = I_m, as (W R)(W R)^T = W W^T. This causes a problem if we want to interpret the factors. A unique solution can be imposed by various conditions, e.g. that W^T Ψ^{-1} W is diagonal.
◮ Is the FA model a simplification of the covariance structure? S has d(d + 1)/2 independent entries. Ψ and W together have d + dm free parameters (and the uniqueness condition above can reduce this). The FA model makes sense if the number of free parameters is less than d(d + 1)/2 (see the numerical check after the FA example below).

FA example
[from Mardia, Kent & Bibby, table 9.4.1]
◮ Correlation matrix (variables 1-5: mechanics, vectors, algebra, analysis, statistics)

      1   0.553   0.547   0.410   0.389
          1       0.610   0.485   0.437
                  1       0.711   0.665
                          1       0.607
                                  1

◮ Maximum likelihood FA (impose that W^T Ψ^{-1} W is diagonal). Require m ≤ 2, otherwise there are more free parameters than entries in S.

                 m = 1    m = 2 (not rotated)    m = 2 (rotated)
      Variable   w_1      w_1      w_2           w̃_1      w̃_2
      1          0.600    0.628    0.372         0.270    0.678
      2          0.667    0.696    0.313         0.360    0.673
      3          0.917    0.899   -0.050         0.743    0.510
      4          0.772    0.779   -0.201         0.740    0.317
      5          0.724    0.728   -0.200         0.698    0.286

◮ The 1-factor solution and the first factor of the 2-factor solution differ (cf PCA)
◮ Problem of interpretation due to rotation of factors
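A quick numerical check of the rotation indeterminacy and the parameter-counting argument (a sketch; the sizes d = 5, m = 2 match the example above, and the reduction by m(m − 1)/2 assumes the uniqueness condition is used to fix the rotation):

    import numpy as np

    d, m = 5, 2
    rng = np.random.default_rng(0)
    W = rng.normal(size=(d, m))
    R, _ = np.linalg.qr(rng.normal(size=(m, m)))    # random orthogonal matrix, R R^T = I_m

    # (W R)(W R)^T = W W^T, so W and W R give the same covariance structure
    assert np.allclose((W @ R) @ (W @ R).T, W @ W.T)

    # parameter counting: Psi and W have d + d*m free parameters; the uniqueness
    # condition removes the m(m-1)/2 rotational degrees of freedom
    free_params = d + d * m - m * (m - 1) // 2
    print(free_params, "free parameters vs", d * (d + 1) // 2, "entries in S")   # 14 vs 15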
FA for visualization

      p(z|x) ∝ p(z) p(x|z)

The posterior is a Gaussian. If z is low dimensional, it can be used for visualization (as with PCA).
[Figure: data points in (x_1, x_2) space (data space) with the one-dimensional latent space shown as the line x = z w through the origin]

Learning W, Ψ
◮ Maximum likelihood solution available (Lawley/Jöreskog)
◮ EM algorithm for ML solution (Rubin and Thayer, 1982); a code sketch of these updates follows the Probabilistic PCA slide below
   ◮ E-step: for each x_i, infer p(z|x_i)
   ◮ M-step: do linear regression from z to x to get W
◮ Choice of m difficult (see Bayesian methods later)

Comparing FA and PCA
◮ Both are linear methods and model second-order structure S
◮ FA is invariant to changes in scaling on the axes, but not rotation invariant (cf PCA)
◮ FA models covariance, PCA models variance

Probabilistic PCA
Tipping and Bishop (1997), see Bishop §12.2
Let Ψ = σ^2 I.
◮ In this case W_ML spans the space defined by the first m eigenvectors of S
◮ PCA and FA give the same results as Ψ → 0
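A minimal EM sketch for maximum likelihood factor analysis in this spirit (one possible implementation of the standard E- and M-step updates, assuming μ is set to the sample mean; this is not code from the lecture):

    import numpy as np

    def fa_em(X, m, n_iter=100, seed=0):
        """EM for factor analysis: X is (n, d) data, m the number of factors."""
        n, d = X.shape
        rng = np.random.default_rng(seed)
        Xc = X - X.mean(axis=0)                          # use the sample mean for mu
        S = Xc.T @ Xc / n                                # ML sample covariance
        W = rng.normal(scale=0.1, size=(d, m))           # factor loadings
        Psi = np.diag(S).copy()                          # diagonal noise variances

        for _ in range(n_iter):
            # E-step: p(z | x_i) is Gaussian with covariance G and mean beta x_i
            G = np.linalg.inv(np.eye(m) + (W.T / Psi) @ W)
            beta = G @ (W.T / Psi)                       # (m, d)
            Ez = Xc @ beta.T                             # (n, m) posterior means
            Ezz = n * G + Ez.T @ Ez                      # sum_i E[z_i z_i^T]

            # M-step: "linear regression from z to x" gives W, then update Psi
            W = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
            Psi = np.diag(S - W @ (Ez.T @ Xc) / n)

        return W, Psi

Each iteration does not decrease the likelihood, and the fitted W is only identified up to the rotation discussed earlier.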
Example Application: Handwritten Digit Recognition
Hinton, Dayan and Revow, IEEE Trans Neural Networks 8(1), 1997
◮ Do digit recognition with class-conditional densities
◮ 8 × 8 images ⇒ 64 · 65/2 entries in the covariance matrix
◮ 10-dimensional latent space used
◮ Visualization of the W matrix: each hidden unit gives rise to a weight image
◮ In practice use a mixture of FAs!

Useful Texts on PCA and FA
◮ B. S. Everitt and G. Dunn, "Applied Multivariate Data Analysis", Edward Arnold, 1991.
◮ C. Chatfield and A. J. Collins, "Introduction to Multivariate Analysis", Chapman and Hall, 1980.
◮ K. V. Mardia, J. T. Kent and J. M. Bibby, "Multivariate Analysis", Academic Press, 1979.

Independent Components Analysis
◮ A non-Gaussian latent variable model, plus a linear transformation, e.g.

      p(z) ∝ ∏_{i=1}^m e^{−|z_i|}

      x = W z + μ + e

◮ Rotational symmetry in z-space is now broken
◮ p(x) is non-Gaussian; go beyond second-order statistics of the data for fitting the model
◮ Can be used with dim(z) = dim(x) for blind source separation
◮ http://www.cnl.salk.edu/∼tony/ica.html
◮ Blind source separation demo: Te-Won Lee
[Figure: scatter plots of the mixed and unmixed signals]
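A small blind source separation sketch using scikit-learn's FastICA as one concrete ICA algorithm (the signals and mixing matrix below are made up for illustration; this is not the infomax demo linked above):

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)
    t = np.linspace(0, 8, 2000)
    sources = np.c_[np.sin(3 * t),                      # two independent, non-Gaussian sources
                    np.sign(np.cos(5 * t))]
    A = np.array([[1.0, 0.5],                           # unknown mixing matrix, dim(z) = dim(x)
                  [0.4, 1.0]])
    X = sources @ A.T + 0.02 * rng.normal(size=(2000, 2))   # observed mixtures x = W z + e

    ica = FastICA(n_components=2, random_state=0)
    recovered = ica.fit_transform(X)    # estimated sources, up to permutation and scaling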
A General View of Latent Variable Models

      p(x) = ∫ p(x|z) p(z) dz

[Figure: graphical model with latent variables z_1, ..., z_m connected to observed variables x_1, ..., x_d]
◮ Clustering: z is a 1-of-m encoding
◮ Factor analysis: z ∼ N(0, I_m)
◮ ICA: p(z) = ∏_i p(z_i), and each p(z_i) is non-Gaussian
◮ Latent Dirichlet Allocation: z ∼ Dir(α) (Blei et al, 2003). Used especially for "topic modelling" of documents

Non-linear Factor Analysis
For PPCA, p(x|z) ∼ N(W z + μ, σ^2 I). If we make the prediction of the mean a non-linear function of z, we get non-linear factor analysis, with p(x|z) ∼ N(φ(z), σ^2 I) and φ(z) = (φ_1(z), φ_2(z), ..., φ_d(z))^T.
However, there is a problem: we can't do the integral analytically, so we need to approximate it by

      p(x) ≃ (1/K) Σ_{k=1}^K p(x|z_k)

where the samples z_k are drawn from the density p(z). Note that the approximation to p(x) is a mixture of Gaussians (sketched in code at the end of this section).

Fitting the Model to Data
◮ Adjust the parameters of φ and σ^2 to maximize the log likelihood of the data
◮ For a simple form of mapping φ(z) = Σ_i w_i ψ_i(z) we can obtain EM updates for the weights {w_i} and the variance σ^2
◮ We are fitting a constrained mixture of Gaussians to the data. The algorithm works quite like Kohonen's self-organizing map (SOM), but is more principled as there is an objective function
◮ Generative Topographic Mapping (Bishop, Svensen and Williams, 1997/8)
◮ Do GTM demo
[Figure: the non-linear mapping φ takes the latent space (z_1, z_2) to a curved manifold in the data space (x_1, x_2, x_3)]
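A sketch of this Monte Carlo approximation to p(x), assuming an RBF form for φ (the basis functions, centres, weights and parameter values are illustrative assumptions, not taken from the lecture; centres has shape (n_basis, m) and Wmap shape (n_basis, d)):

    import numpy as np

    def approx_log_px(x, centres, Wmap, sigma2, K=5000, seed=0):
        """Monte Carlo estimate of log p(x) = log ∫ N(x; phi(z), sigma2 I) p(z) dz."""
        rng = np.random.default_rng(seed)
        m, d = centres.shape[1], x.shape[0]
        Z = rng.normal(size=(K, m))                      # samples z_k from p(z) = N(0, I_m)

        # assumed mapping: phi(z) = psi(z) Wmap with RBF basis psi_i(z) = exp(-||z - c_i||^2 / 2)
        sq_dists = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)   # (K, n_basis)
        Phi = np.exp(-0.5 * sq_dists) @ Wmap                                   # (K, d) means phi(z_k)

        # each term is a spherical Gaussian N(x; phi(z_k), sigma2 I); their average
        # is the mixture-of-Gaussians approximation to p(x)
        log_terms = -0.5 * (((x - Phi) ** 2).sum(axis=1) / sigma2
                            + d * np.log(2 * np.pi * sigma2))
        return np.logaddexp.reduce(log_terms) - np.log(K)

In GTM the weights defining φ and the variance σ^2 would themselves be fitted by EM, as on the "Fitting the Model to Data" slide above.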
Visualization
◮ The mean may be a bad summary of the posterior distribution
[Figure: a multimodal posterior P(z|x) over z; its mean (marked +) can lie in a region of low probability]

Manifold Learning
◮ A manifold is a topological space that is locally Euclidean
◮ We are particularly interested in the case of non-linear dimensionality reduction, where a low-dimensional nonlinear manifold is embedded in a high-dimensional space
◮ As well as GTM, there are other methods for non-linear dimensionality reduction. Some recent methods based on eigendecomposition include:
   ◮ Isomap (Tenenbaum et al, 2000)
   ◮ Locally linear embedding (Roweis and Saul, 2000)
   ◮ Laplacian eigenmaps (Belkin and Niyogi, 2001)
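As a practical pointer, the three methods listed above all have scikit-learn implementations; a minimal sketch (the S-curve data set is a stand-in example, not from the lecture):

    from sklearn.datasets import make_s_curve
    from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding

    X, _ = make_s_curve(n_samples=1000, random_state=0)   # 3-D data lying on a 2-D nonlinear manifold

    embeddings = {
        "Isomap": Isomap(n_components=2),
        "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10),
        "Laplacian eigenmaps": SpectralEmbedding(n_components=2, n_neighbors=10),
    }
    low_dim = {name: est.fit_transform(X) for name, est in embeddings.items()}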