Probabilistic Dimensionality Reduction


  1. Probabilistic Dimensionality Reduction. Neil D. Lawrence, Amazon Research Cambridge and University of Sheffield, U.K. Probabilistic Scientific Computing Workshop, ICERM at Brown, 6th June 2017.

  2. Outline: Dimensionality Reduction; Conclusions.

  3. Motivation for Non-Linear Dimensionality Reduction USPS Data Set Handwritten Digit ◮ 3648 Dimensions ◮ 64 rows by 57 columns

  4. Motivation for Non-Linear Dimensionality Reduction USPS Data Set Handwritten Digit ◮ 3648 Dimensions ◮ 64 rows by 57 columns ◮ Space contains more than just this digit.

  5. Motivation for Non-Linear Dimensionality Reduction USPS Data Set Handwritten Digit ◮ 3648 Dimensions ◮ 64 rows by 57 columns ◮ Space contains more than just this digit. ◮ Even if we sampled every nanosecond from now until the end of the universe, we would never see the original six!

  7.–15. Simple Model of Digit: Rotate a 'Prototype' (the same slide repeated over nine frames of a rotation animation).
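As a rough illustration of the rotation model above (a sketch, not the code behind the original slides), the following generates the one-parameter family of images obtained by rotating a single prototype digit. The 64-by-57 image prototype is a placeholder, and imrotate is assumed to be available from the Image Processing Toolbox (or Octave's image package).

% Rotate a prototype digit through a full turn and stack the results as
% rows of a data matrix, one 3648-dimensional point per rotation angle.
angles = 0:4:356;                      % rotation angles in degrees
Y = zeros(numel(angles), 64 * 57);     % one vectorised image per row
for k = 1:numel(angles)
    rotated = imrotate(prototype, angles(k), 'bilinear', 'crop');
    Y(k, :) = rotated(:)';             % flatten to a row of the data matrix
end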

  16. MATLAB Demo demDigitsManifold([1 2], 'all')

  17. MATLAB Demo demDigitsManifold([1 2], 'all') [scatter plot of the digits projected onto the first two principal components; axes: PC no 1, PC no 2]

  18. MATLAB Demo demDigitsManifold([1 2], 'sixnine') [scatter plot of the 'sixnine' subset projected onto the first two principal components; axes: PC no 1, PC no 2]
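The demo above is from Lawrence's own MATLAB toolbox; the sketch below only illustrates the kind of computation it performs (it is not the demDigitsManifold implementation): project the vectorised digit images onto the first two principal components and scatter-plot the coordinates. The n-by-3648 matrix digits is a placeholder.

% Principal component projection of digit images onto two dimensions.
Yc = digits - mean(digits);          % centre the data
[~, ~, V] = svd(Yc, 'econ');         % right singular vectors give principal directions
Z = Yc * V(:, 1:2);                  % coordinates on the first two principal components
plot(Z(:, 1), Z(:, 2), 'rx');
xlabel('PC no 1'); ylabel('PC no 2');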

  19. Low Dimensional Manifolds Pure Rotation is too Simple ◮ In practice the data may undergo several distortions. ◮ e.g. digits undergo 'thinning', translation and rotation. ◮ For data with 'structure': ◮ we expect fewer distortions than dimensions; ◮ we therefore expect the data to live on a lower dimensional manifold. ◮ Conclusion: deal with high dimensional data by looking for a lower-dimensional non-linear embedding.

  20. Existing Methods Spectral Approaches ◮ Classical Multidimensional Scaling (MDS) (Mardia et al., 1979). ◮ Uses eigenvectors of similarity matrix. ◮ Isomap (Tenenbaum et al., 2000) is MDS with a particular proximity measure. ◮ Kernel PCA (Schölkopf et al., 1998) ◮ Provides a representation and a mapping — dimensional expansion. ◮ Mapping is implied through the use of a kernel function as a similarity matrix. ◮ Locally Linear Embedding (Roweis and Saul, 2000). ◮ Looks to preserve locally linear relationships in a low dimensional space.

  21. Existing Methods II Iterative Methods ◮ Multidimensional Scaling (MDS) ◮ Iterative optimisation of a stress function (Kruskal, 1964). ◮ Sammon Mappings (Sammon, 1969). ◮ Strictly speaking not a mapping — similar to iterative MDS. ◮ NeuroScale (Lowe and Tipping, 1997) ◮ Augmentation of iterative MDS methods with a mapping.

  22. Existing Methods III Probabilistic Approaches ◮ Probabilistic PCA (Tipping and Bishop, 1999; Roweis, 1998) ◮ A linear method.

  23. Existing Methods III Probabilistic Approaches ◮ Probabilistic PCA (Tipping and Bishop, 1999; Roweis, 1998) ◮ A linear method. ◮ Density Networks (MacKay, 1995) ◮ Use importance sampling and a multi-layer perceptron.

  24. Existing Methods III Probabilistic Approaches ◮ Probabilistic PCA (Tipping and Bishop, 1999; Roweis, 1998) ◮ A linear method. ◮ Density Networks (MacKay, 1995) ◮ Use importance sampling and a multi-layer perceptron. ◮ Generative Topographic Mapping (GTM) (Bishop et al., 1998) ◮ Uses a grid based sample and an RBF network.

  25. Existing Methods III Probabilistic Approaches ◮ Probabilistic PCA (Tipping and Bishop, 1999; Roweis, 1998) ◮ A linear method. ◮ Density Networks (MacKay, 1995) ◮ Use importance sampling and a multi-layer perceptron. ◮ Generative Topographic Mapping (GTM) (Bishop et al., 1998) ◮ Uses a grid based sample and an RBF network. Difficulty for Probabilistic Approaches ◮ Propagate a probability distribution through a non-linear mapping.
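A small numeric illustration of that difficulty (an added sketch, not from the slides): pushing a Gaussian through even a simple non-linear mapping gives a distribution with no convenient closed form, so here it can only be summarised by sampling.

% Propagate a Gaussian through a non-linear mapping by sampling.
x = randn(10000, 1);            % x ~ N(0, 1)
y = sin(3 * x) + 0.1 * x.^2;    % an arbitrary non-linear mapping
hist(y, 50);                    % clearly non-Gaussian; no closed-form density in general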

  26. The New Model A Probabilistic Non-linear PCA ◮ PCA has a probabilistic interpretation (Tipping and Bishop, 1999; Roweis, 1998). ◮ It is difficult to 'non-linearise'. Dual Probabilistic PCA ◮ We present a new probabilistic interpretation of PCA (Lawrence, 2005). ◮ This interpretation can be made non-linear. ◮ The result is non-linear probabilistic PCA.

  27. Notation. q — dimension of latent/embedded space; p — dimension of data space; n — number of data points. Centred data, Y = [y_{1,:}, …, y_{n,:}]^⊤ = [y_{:,1}, …, y_{:,p}] ∈ ℜ^{n×p}; latent variables, X = [x_{1,:}, …, x_{n,:}]^⊤ = [x_{:,1}, …, x_{:,q}] ∈ ℜ^{n×q}; mapping matrix, W ∈ ℜ^{p×q}. a_{i,:} is a vector from the i-th row of a given matrix A; a_{:,j} is a vector from the j-th column of a given matrix A.

  28. Reading Notation X and Y are design matrices ◮ Covariance given by n^{-1} Y^⊤ Y. ◮ Inner product matrix given by Y Y^⊤.
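A two-line sketch of the distinction, assuming a centred n-by-p data matrix Y is already in the workspace: the covariance is p-by-p while the inner product matrix is n-by-n, a contrast that matters for the dual view later.

n = size(Y, 1);
S = Y' * Y / n;     % covariance, p-by-p  (n^{-1} Y'Y)
K = Y * Y';         % inner product matrix, n-by-n  (YY')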

  29. Linear Dimensionality Reduction Linear Latent Variable Model ◮ Represent data, Y, with a lower dimensional set of latent variables X. ◮ Assume a linear relationship of the form y_{i,:} = W x_{i,:} + ε_{i,:}, where ε_{i,:} ∼ N(0, σ² I).
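A minimal sketch of sampling from this linear latent variable model; the dimensions and noise level below are illustrative choices rather than values from the talk.

% Draw latent variables and noise, then map them into data space.
n = 500; p = 10; q = 2; sigma2 = 0.01;
W = randn(p, q);                       % mapping matrix
X = randn(n, q);                       % latent variables, x_i ~ N(0, I)
E = sqrt(sigma2) * randn(n, p);        % noise, eps_i ~ N(0, sigma^2 I)
Y = X * W' + E;                        % each row satisfies y_i = W x_i + eps_i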

  30. Linear Latent Variable Model Probabilistic PCA ◮ Define linear-Gaussian relationship between latent variables and data. (Graphical model: nodes X, W, σ², Y.) p(Y | X, W) = ∏_{i=1}^{n} N(y_{i,:} | W x_{i,:}, σ² I)

  31. Linear Latent Variable Model Probabilistic PCA ◮ Define linear-Gaussian relationship between latent variables and data. ◮ Standard latent variable approach: p(Y | X, W) = ∏_{i=1}^{n} N(y_{i,:} | W x_{i,:}, σ² I)

  32. Linear Latent Variable Model Probabilistic PCA ◮ Define linear-Gaussian relationship between latent variables and data. ◮ Standard latent variable approach: p(Y | X, W) = ∏_{i=1}^{n} N(y_{i,:} | W x_{i,:}, σ² I) ◮ Define Gaussian prior over latent space, X: p(X) = ∏_{i=1}^{n} N(x_{i,:} | 0, I)

  33. Linear Latent Variable Model Probabilistic PCA ◮ Define linear-Gaussian relationship between latent variables and data. ◮ Standard latent variable approach: p(Y | X, W) = ∏_{i=1}^{n} N(y_{i,:} | W x_{i,:}, σ² I) ◮ Define Gaussian prior over latent space, X: p(X) = ∏_{i=1}^{n} N(x_{i,:} | 0, I) ◮ Integrate out latent variables: p(Y | W) = ∏_{i=1}^{n} N(y_{i,:} | 0, W W^⊤ + σ² I)

  34. Computation of the Marginal Likelihood. x_{i,:} ∼ N(0, I), ε_{i,:} ∼ N(0, σ² I), y_{i,:} = W x_{i,:} + ε_{i,:}

  35. Computation of the Marginal Likelihood. x_{i,:} ∼ N(0, I), ε_{i,:} ∼ N(0, σ² I), y_{i,:} = W x_{i,:} + ε_{i,:}, so W x_{i,:} ∼ N(0, W W^⊤)

  36. Computation of the Marginal Likelihood. x_{i,:} ∼ N(0, I), ε_{i,:} ∼ N(0, σ² I), y_{i,:} = W x_{i,:} + ε_{i,:}, so W x_{i,:} ∼ N(0, W W^⊤) and W x_{i,:} + ε_{i,:} ∼ N(0, W W^⊤ + σ² I)
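Reusing W, q, p and sigma2 from the sampling sketch above, a quick Monte Carlo check (an added illustration, not part of the talk) that W x + ε does have covariance W W^⊤ + σ² I:

m = 100000;                                         % number of samples
samples = randn(m, q) * W' + sqrt(sigma2) * randn(m, p);
empiricalC   = samples' * samples / m;              % empirical covariance
theoreticalC = W * W' + sigma2 * eye(p);            % WW' + sigma^2 I
max(abs(empiricalC(:) - theoreticalC(:)))           % shrinks towards 0 as m grows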

  37. Linear Latent Variable Model II Probabilistic PCA Max. Likelihood Soln (Tipping and Bishop, 1999). (Graphical model: nodes W, σ², Y.) p(Y | W) = ∏_{i=1}^{n} N(y_{i,:} | 0, W W^⊤ + σ² I)

  38. Linear Latent Variable Model II Probabilistic PCA Max. Likelihood Soln (Tipping and Bishop, 1999). p(Y | W) = ∏_{i=1}^{n} N(y_{i,:} | 0, C), C = W W^⊤ + σ² I

  39. Linear Latent Variable Model II Probabilistic PCA Max. Likelihood Soln (Tipping and Bishop, 1999). p(Y | W) = ∏_{i=1}^{n} N(y_{i,:} | 0, C), C = W W^⊤ + σ² I, log p(Y | W) = −(n/2) log|C| − (1/2) tr(C^{−1} Y^⊤ Y) + const.

  40. Linear Latent Variable Model II Probabilistic PCA Max. Likelihood Soln (Tipping and Bishop, 1999). p(Y | W) = ∏_{i=1}^{n} N(y_{i,:} | 0, C), C = W W^⊤ + σ² I, log p(Y | W) = −(n/2) log|C| − (1/2) tr(C^{−1} Y^⊤ Y) + const. If U_q are the first q principal eigenvectors of n^{-1} Y^⊤ Y and the corresponding eigenvalues are Λ_q,

  41. Linear Latent Variable Model II Probabilistic PCA Max. Likelihood Soln (Tipping and Bishop, 1999). p(Y | W) = ∏_{i=1}^{n} N(y_{i,:} | 0, C), C = W W^⊤ + σ² I, log p(Y | W) = −(n/2) log|C| − (1/2) tr(C^{−1} Y^⊤ Y) + const. If U_q are the first q principal eigenvectors of n^{-1} Y^⊤ Y and the corresponding eigenvalues are Λ_q, then W = U_q L R^⊤, L = (Λ_q − σ² I)^{1/2}, where R is an arbitrary rotation matrix.
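A compact sketch of this maximum likelihood solution, reusing the synthetic Y and q from the sampling sketch above. The choice R = I and the standard ML estimate of σ² as the mean of the discarded eigenvalues (Tipping and Bishop, 1999) are noted in the comments; none of this code is from the original talk.

n = size(Y, 1); p = size(Y, 2);
S = Y' * Y / n;                          % sample covariance n^{-1} Y'Y
[V, D] = eig(S);
[lambda, idx] = sort(diag(D), 'descend');
Uq = V(:, idx(1:q));                     % first q principal eigenvectors
sigma2hat = mean(lambda(q+1:end));       % ML noise variance: mean of discarded eigenvalues
L = diag(sqrt(lambda(1:q) - sigma2hat)); % L = (Lambda_q - sigma^2 I)^(1/2)
W_ml = Uq * L;                           % W = U_q L R', taking R = I
C = W_ml * W_ml' + sigma2hat * eye(p);   % implied covariance WW' + sigma^2 I
logLik = -n/2 * log(det(C)) - 0.5 * trace(C \ (Y' * Y));   % log p(Y|W) up to a constant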

  42. Linear Latent Variable Model III Dual Probabilistic PCA ◮ Define linear-Gaussian relationship between latent variables and data. (Graphical model: nodes W, X, σ², Y.) p(Y | X, W) = ∏_{i=1}^{n} N(y_{i,:} | W x_{i,:}, σ² I)

  43. Linear Latent Variable Model III Dual Probabilistic PCA ◮ Define linear-Gaussian relationship between latent variables and data. ◮ Novel latent variable approach: p(Y | X, W) = ∏_{i=1}^{n} N(y_{i,:} | W x_{i,:}, σ² I)
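For orientation: where standard probabilistic PCA integrates out X, the dual construction (Lawrence, 2005) places the prior on W and integrates that out instead, so each centred data column is marginally Gaussian with an n-by-n covariance X X^⊤ + σ² I. The sketch below only illustrates that dual covariance; X, sigma2 and y are illustrative placeholders, not quantities from the talk.

% Dual marginal covariance over data points rather than data dimensions.
n = 100; q = 2; sigma2 = 0.01;
X = randn(n, q);                      % latent variables, now treated as parameters
Kdual = X * X' + sigma2 * eye(n);     % n-by-n covariance X X' + sigma^2 I
% log-likelihood of one centred data column y (an n-by-1 placeholder):
% logLik = -0.5 * (n*log(2*pi) + log(det(Kdual)) + y' * (Kdual \ y));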
