Rotation Invariant Householder Parameterization for Bayesian PCA
Rajbir-Singh Nirwan, Nils Bertschinger
June 11, 2019
Outline
• Probabilistic PCA (PPCA)
• Non-identifiability issue of PPCA
• Conceptual solution to the problem
• Implementation
• Results
Probabilistic PCA

• Classical PCA: formulated as a projection from data space Y to a lower-dimensional latent space X,

  Y ∈ ℝ^{N×D} → X ∈ ℝ^{N×Q}

  The latent space maximizes the variance of the projected data and minimizes the MSE.

• Probabilistic PCA (PPCA): viewed as a generative model that maps the latent space X to the data space Y,

  X ∈ ℝ^{N×Q} → Y ∈ ℝ^{N×D}

  Y = X Wᵀ + ε,   X ∼ 𝒩(0, I),   ε ∼ 𝒩(0, σ²I)

  Marginalizing over X gives the likelihood

  p(Y | W) = ∏_{n=1}^{N} 𝒩(Y_{n,:} | 0, W Wᵀ + σ²I)

• Rotation invariant likelihood: the likelihood depends on W only through W Wᵀ, so

  W R Rᵀ Wᵀ = W Wᵀ   for all R with R Rᵀ = I

  i.e. W and W R are indistinguishable from the data, as the numerical check below illustrates.

[Figure: maximum likelihood fits of (W₁, W₂) for D = 5, Q = 2 — the optimized solutions trace out a circle, one point for every rotation R.]
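As a concrete illustration (our own NumPy sketch, not from the talk; `ppca_loglik` and all variable names are ours), the marginal likelihood is numerically identical for W and W R:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
N, D, Q = 150, 5, 2
sigma = 0.1

# Simulate data from the PPCA model Y = X W^T + eps.
W = rng.standard_normal((D, Q))
X = rng.standard_normal((N, Q))
Y = X @ W.T + sigma * rng.standard_normal((N, D))

def ppca_loglik(Y, W, sigma):
    """Marginal log-likelihood: sum_n log N(Y_n | 0, W W^T + sigma^2 I)."""
    C = W @ W.T + sigma**2 * np.eye(W.shape[0])
    return multivariate_normal(np.zeros(W.shape[0]), C).logpdf(Y).sum()

# An arbitrary orthogonal R leaves the likelihood unchanged: W and W R
# are indistinguishable given the data.
R, _ = np.linalg.qr(rng.standard_normal((Q, Q)))
print(np.isclose(ppca_loglik(Y, W, sigma), ppca_loglik(Y, W @ R, sigma)))  # True
```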
Bayesian approach to PPCA

  p(W | Y) = p(Y | W) p(W) / p(Y)

• If the prior does not break the symmetry, the posterior will be rotation invariant as well
• Sampling becomes challenging: posterior averages are meaningless and interpretation of the latent space is almost impossible

[Figure: posterior samples of (W₁, W₂) — the sampled values smear out over the full circle of rotations.]
Solution

• Find a different parameterization of the model such that the probabilistic model is not changed

Outline of the procedure:
• SVD of W:  W Wᵀ = U Σ Vᵀ (U Σ Vᵀ)ᵀ = U Σ² Uᵀ
• Fix the coordinate system: V = I (see the numerical check below)
• Specify the correct prior p(U, Σ)
• Sample from p(U, Σ | Y)

Which prior leaves the model unchanged?
• W ∼ 𝒩(0, I)  ⟹  W Wᵀ is Wishart distributed
• W Wᵀ = U Σ Σᵀ Uᵀ is Wishart distributed  ⟹  U ∼ ?,  Σ ∼ ?
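A small numerical check of the decomposition (illustrative NumPy code, not part of the talk): W Wᵀ indeed depends only on U and Σ, so the right singular vectors V can be fixed to I without changing the model:

```python
import numpy as np

rng = np.random.default_rng(0)
D, Q = 5, 2

# Draw W from the standard Gaussian prior and take its SVD.
W = rng.standard_normal((D, Q))
U, s, Vt = np.linalg.svd(W, full_matrices=False)   # W = U diag(s) Vt

# W W^T = U Sigma^2 U^T: the right singular vectors V cancel, so fixing
# V = I removes exactly the non-identified rotational degrees of freedom.
print(np.allclose(W @ W.T, U @ np.diag(s**2) @ U.T))  # True

# The reduced W' = U diag(s) reproduces the same covariance as W.
W_fixed = U @ np.diag(s)
print(np.allclose(W_fixed @ W_fixed.T, W @ W.T))      # True
```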
Theory: U ∼ ? and Σ ∼ ?

• Since (U, Σ) is the SVD of W, (U, Σ²) is the eigendecomposition of W Wᵀ  ⟹  U is the eigenvector matrix
• U lives on the Stiefel manifold  𝒱_{Q,D} = { U ∈ ℝ^{D×Q} | Uᵀ U = I }
• The eigenvectors of a Wishart matrix are distributed uniformly in the space of orthogonal matrices (Blai (2007), Uhlig (1994))
  ⟹ U is uniformly distributed on the Stiefel manifold
• The ordered eigenvalues λ of W Wᵀ are distributed as (James & Lee (2014))

  p(λ) = c e^{−½ ∑_{q=1}^{Q} λ_q} ∏_{q=1}^{Q} λ_q^{(D−Q−1)/2} ∏_{q=1}^{Q} ∏_{q′=q+1}^{Q} (λ_q − λ_{q′})

  Changing variables to the singular values, λ_q = σ_q² (so dλ_q = 2σ_q dσ_q), gives

  p(σ₁, …, σ_Q) = c e^{−½ ∑_{q=1}^{Q} σ_q²} ∏_{q=1}^{Q} σ_q^{D−Q−1} ∏_{q=1}^{Q} ∏_{q′=q+1}^{Q} (σ_q² − σ_{q′}²) ∏_{q=1}^{Q} 2σ_q
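Since only the analytic form matters for inference, the density is cheap to evaluate up to the constant c. A hedged NumPy sketch (our own code; the helper name `log_p_sigma` is ours, and c is omitted):

```python
import numpy as np

def log_p_sigma(sigma, D):
    """Unnormalized log-density of the ordered singular values of a
    D x Q matrix W with i.i.d. N(0, 1) entries (sigma sorted descending)."""
    sigma = np.asarray(sigma, dtype=float)
    Q = sigma.size
    log_p = -0.5 * np.sum(sigma**2)
    log_p += (D - Q - 1) * np.sum(np.log(sigma))
    # Vandermonde-type interaction term between the eigenvalues sigma_q^2.
    for q in range(Q):
        for qp in range(q + 1, Q):
            log_p += np.log(sigma[q]**2 - sigma[qp]**2)
    # Jacobian of the change of variables lambda_q = sigma_q^2.
    log_p += np.sum(np.log(2.0 * sigma))
    return log_p

print(log_p_sigma([3.0, 1.0], D=5))
```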
Implementation

• Need: U ∼ uniform on the Stiefel manifold 𝒱_{Q,D},  Σ ∼ p(Σ)  ← easy, since we know the analytic expression for the density

How to sample U uniformly on 𝒱_{Q,D} via Householder reflections (Mezzadri (2007)); a NumPy sketch follows below:

  for n = D, …, 1:
      v_n ∼ uniform on the sphere 𝕊^{n−1}
      u_n = (v_n + sgn(v_{n,1}) ‖v_n‖ e₁) / ‖v_n + sgn(v_{n,1}) ‖v_n‖ e₁‖
      H̃_n(v_n) = −sgn(v_{n,1}) (I − 2 u_n u_nᵀ)
      H_n = [ I  0 ; 0  H̃_n(v_n) ]

  U = H_D(v_D) H_{D−1}(v_{D−1}) ⋯ H_1(v_1)
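A NumPy sketch of this recursion (our own illustrative implementation, not the authors' code; `householder_stiefel` is a hypothetical helper name). For a point on the Stiefel manifold only the reflections H_D, …, H_{D−Q+1} act non-trivially on the first Q columns of the identity, so the sketch applies just those:

```python
import numpy as np

def householder_stiefel(vs, D):
    """Map unit vectors v_D, v_{D-1}, ..., v_{D-Q+1} (vs[q] has length D - q)
    to U = H_D(v_D) H_{D-1}(v_{D-1}) ... H_{D-Q+1}(v_{D-Q+1}) @ I_{DxQ}.
    Uniform v_n on the spheres give U uniform on the Stiefel manifold."""
    Q = len(vs)
    U = np.eye(D)[:, :Q]                    # first Q columns of the identity
    for q in reversed(range(Q)):            # apply the rightmost factor first
        v = vs[q]
        n = D - q                           # H_n acts on the last n coordinates
        sign = 1.0 if v[0] >= 0 else -1.0
        u = v + sign * np.linalg.norm(v) * np.eye(n)[0]
        u /= np.linalg.norm(u)
        H = np.eye(D)                       # H_n = blockdiag(I, H_tilde_n)
        H[D - n:, D - n:] = -sign * (np.eye(n) - 2.0 * np.outer(u, u))
        U = H @ U
    return U

rng = np.random.default_rng(0)
D, Q = 5, 2
# Normalized Gaussians are uniform on the sphere S^{n-1}.
vs = [v / np.linalg.norm(v) for v in (rng.standard_normal(D - q) for q in range(Q))]
U = householder_stiefel(vs, D)
print(np.allclose(U.T @ U, np.eye(Q)))  # True: orthonormal columns
```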
Implementation

The full generative model for Bayesian PPCA (forward-sampled in the sketch below):

  v_D, …, v_{D−Q+1} ∼ 𝒩(0, I)   ← a Gaussian v_n has a uniformly distributed direction on 𝕊^{n−1}, so no constrained sampling is needed
  σ ∼ p(σ)
  μ ∼ p(μ)
  σ_noise ∼ p(σ_noise)
  U = ∏_{q=1}^{Q} H_{D−q+1}(v_{D−q+1})
  Σ = diag(σ)
  W = U Σ
  Y ∼ ∏_{n=1}^{N} 𝒩(Y_{n,:} | μ, W Wᵀ + σ²_noise I)
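A forward sampler for this model (our own sketch, reusing `householder_stiefel` from above). Caveat: p(σ) and p(μ) below are simple placeholders rather than the Wishart-derived priors from the theory slide, so this shows the wiring of the parameterization, not the exact model:

```python
import numpy as np

def sample_bppca_prior(rng, N, D, Q, sigma_noise=0.1):
    """Draw (W, Y) from a Householder-parameterized BPPCA prior (sketch)."""
    # Only the direction of v_n matters for the Householder map.
    vs = [v / np.linalg.norm(v) for v in (rng.standard_normal(D - q) for q in range(Q))]
    U = householder_stiefel(vs, D)                # sketch from the previous slide
    sigma = np.sort(np.abs(rng.standard_normal(Q)) + 1.0)[::-1]  # placeholder p(sigma)
    mu = np.zeros(D)                                             # placeholder p(mu)
    W = U @ np.diag(sigma)
    C = W @ W.T + sigma_noise**2 * np.eye(D)
    Y = rng.multivariate_normal(mu, C, size=N)
    return W, Y

rng = np.random.default_rng(0)
W, Y = sample_bppca_prior(rng, N=150, D=5, Q=2)
print(W.shape, Y.shape)  # (5, 2) (150, 5)
```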
Results: Synthetic Dataset

• Construction, (N, D, Q) = (150, 5, 2):

  U ∼ uniform on the Stiefel manifold 𝒱_{Q,D}
  Σ = diag(σ₁, σ₂) = diag(3.0, 1.0)
  W = U Σ ∈ ℝ^{D×Q}
  X ∼ 𝒩(0, I) ∈ ℝ^{N×Q}
  ε ∼ 𝒩(0, 0.01 I) ∈ ℝ^{N×D}
  Y = X Wᵀ + ε

• Inference

[Figures: posterior samples of (W₁, W₂) concentrate at a unique solution instead of smearing over rotations; a histogram of samples from p(σ | Y) covers both the classical PCA estimates and the true values σ = (3.0, 1.0).]
Results: Breast Cancer Wisconsin Dataset, (N, D) = (569, 30)

• Bayesian PCA

[Figures: posterior samples of the first two columns of W — a unique, interpretable solution with uncertainty estimates.]

• Advantages
  • Breaks the rotation symmetry without changing the probabilistic model
  • Enriches the classical PCA solution with uncertainty estimates
  • Decomposes the prior into rotation and principal variances
  • Allows constructing other priors without issues:
    • a sparsity prior on the principal variances without an a-priori rotation preference
    • if desired, an a-priori rotation preference without affecting the variances
Extension to non-linear models

• The GPLVM suffers from the same rotation invariance problem:

  p(Y | X) = ∏_{d=1}^{D} 𝒩(Y_{:,d} | μ, K + σ²I)

  Linear kernel: K = X Xᵀ,  K_ij = X_{i,:}ᵀ X_{j,:} = k(X_{i,:}, X_{j,:})
  Squared-exponential kernel: k_SE(x, x′) = σ² exp(−0.5 ‖x − x′‖² / l²)

  Both depend on X only through rotation-invariant quantities (inner products and distances, respectively).

[Figures: posterior samples of the latent X — under the standard parameterization, chains 1–3 smear over rotations; under the unique Householder parameterization, chains 1–3 each concentrate on a single solution.]

• No rotation symmetry in the posterior for the suggested parameterization
• Different chains converge to different solutions due to the increased model complexity
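The invariance is easy to verify for the squared-exponential kernel, which depends only on pairwise distances (a minimal NumPy check; our own illustrative code, not from the talk):

```python
import numpy as np

def se_kernel(X, sigma_f=1.0, length=1.0):
    """SE kernel matrix K_ij = sigma_f^2 exp(-0.5 ||x_i - x_j||^2 / l^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dists / length**2)

rng = np.random.default_rng(0)
N, Q = 20, 2
X = rng.standard_normal((N, Q))
R, _ = np.linalg.qr(rng.standard_normal((Q, Q)))  # arbitrary rotation of the latent space

# Pairwise distances, hence K and the GPLVM likelihood, are unchanged by X -> X R.
print(np.allclose(se_kernel(X), se_kernel(X @ R)))  # True
```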
Conclusion

• Suggested a new parameterization for W in PPCA which uniquely identifies principal components even though the likelihood and the posterior are rotationally symmetric
• Showed how to set the prior on the new parameters such that the model is unchanged compared to a standard Gaussian prior on W
• Provided an efficient implementation via Householder transformations (no Jacobian correction needed)
• The new parameterization allows for other interpretable priors on rotation and principal variances
• Extended the approach to non-linear models and successfully solved the rotation problem there as well
Poster session: #235
Github: https://github.com/RSNirwan/HouseholderBPCA

Thanks for your attention!

Supervisor: Prof. Dr. Nils Bertschinger
Funder: Dr. h. c. Helmut O. Maucher