Spectral Methods for Latent Variable Models
Kaizheng Wang, Department of ORFE, Princeton University
March 20th, 2020
Data Diversity
Unstructured, heterogeneous and incomplete information.
Credit: https://www.mathworks.com/help/textanalytics/gs/getting-started-with-topic-modeling.html, https://www.alliance-scotland.org.uk/alliance-homepage-holding-people-networking-2017-01-3/, https://medicalxpress.com/news/2015-04-tumor-only-genetic-sequencing-misguide-cancer.html, https://www.nature.com/articles/nature21386/figures/1, https://viterbi-web.usc.edu/~soltanol/RSC.pdf, Dzenan Hamzic
Matrix Representations
Object-by-feature ($n \times d$):
• Texts: document-term;
• Genetics: individual-marker;
• Recomm. systems: user-item.
Object-by-object ($n \times n$):
• Networks: adjacency matrices.
Credit (upper right): https://viterbi-web.usc.edu/~soltanol/RSC.pdf.
Matrix Representations
Common belief: high ambient dim. but low intrinsic dim.
Low-rank approximation: data matrix ≈ low-rank signal + noise.
Matrix Representations
Low-dimensional embedding via latent variables:
$$\begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} = \begin{pmatrix} f_1 \\ \vdots \\ f_n \end{pmatrix} B + \begin{pmatrix} E_1 \\ \vdots \\ E_n \end{pmatrix},$$
where the samples form an $n \times d$ matrix, the latent coordinates an $n \times r$ matrix, the latent bases $B$ an $r \times d$ matrix, and the noises an $n \times d$ matrix.
Principal Component Analysis (PCA) — truncated SVD.
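To make the factor model concrete, here is a minimal Python sketch (not from the talk; all sizes are made up): it simulates $X = FB + E$ and recovers a low-dimensional embedding with a truncated SVD, which is exactly what PCA computes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 500, 200, 3                      # hypothetical sizes

# Latent variable model: X = F B + E
F = rng.normal(size=(n, r))                # latent coordinates (n x r)
B = rng.normal(size=(r, d))                # latent bases (r x d)
E = 0.5 * rng.normal(size=(n, d))          # noise (n x d)
X = F @ B + E

# PCA via truncated SVD: keep the top-r components.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
embedding = U[:, :r] * s[:r]               # n x r low-dimensional embedding
approx = embedding @ Vt[:r, :]             # best rank-r approximation of X

rel_err = np.linalg.norm(X - approx) / np.linalg.norm(X)
print(f"relative error of rank-{r} approximation: {rel_err:.3f}")
```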
Example: Genes Mirror Geography within Europe
Novembre et al. (2008), Nature. n = 1387 individuals and d = 197146 SNPs.
Figure 1a: 2-dim. embedding (PC1 vs. PC2) vs. labels.
Outline
• Distributed PCA and linearization of eigenvectors
• An ℓp theory for spectral methods
• Summary and future directions
Distributed PCA and linearization of eigenvectors
Principal Component Analysis
Data: $\{X_i\}_{i=1}^n \subseteq \mathbb{R}^d$ i.i.d., $\mathbb{E}X_i = 0$, $\mathbb{E}(X_i X_i^\top) = \Sigma$.
Goal: estimate the principal subspace spanned by the $K$ leading eigenvectors of $\Sigma$.
PCA: $X = (X_1, \cdots, X_n)^\top \xrightarrow{\text{SVD}} \hat{U} = (\hat{u}_1, \cdots, \hat{u}_K) \in \mathbb{R}^{d \times K}$.
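A small illustrative check of the two equivalent routes to $\hat{U}$ (synthetic data, not from the talk): the top-$K$ eigenvectors of the sample covariance $\frac{1}{n}X^\top X$ span the same subspace as the top-$K$ right singular vectors of the data matrix $X$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, K = 300, 40, 5                         # hypothetical sizes
X = rng.normal(size=(n, d))                  # rows are the samples X_i

# Route 1: top-K eigenvectors of the sample covariance (1/n) X^T X.
S = X.T @ X / n
_, vecs = np.linalg.eigh(S)                  # ascending eigenvalues
U_eig = vecs[:, -K:][:, ::-1]                # d x K, leading eigenvector first

# Route 2: top-K right singular vectors of the data matrix X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
U_svd = Vt[:K, :].T                          # d x K

# The two spans coincide (individual columns may differ by signs).
gap = np.linalg.norm(U_eig @ U_eig.T - U_svd @ U_svd.T)
print(f"difference between projectors: {gap:.2e}")
```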
Distributed PCA
[Diagram: local datasets Data 1–5 connected to a central server.]
Distributed PCA
$m$ local machines in total, each has $n$ samples.
1. PCA in parallel: the $\ell$-th machine conducts $X^{(\ell)} \in \mathbb{R}^{n \times d} \xrightarrow{\text{SVD}} \hat{U}^{(\ell)} \in \mathbb{R}^{d \times K}$ and sends $\hat{U}^{(\ell)}$ to the central server;
2. Aggregation: $\{\hat{U}^{(\ell)}\}_{\ell=1}^m \to \hat{U} \in \mathbb{R}^{d \times K}$.
Related works: McDonald et al. 2009; Zhang et al. 2013; Lee et al. 2015; Battey et al. 2015; Qu et al. 2002; El Karoui and d'Aspremont 2010; Liang et al. 2014.
Center of Subspaces
How to find $\hat{U} \in \mathbb{O}^{d \times K}$ that best summarizes $\{\hat{U}^{(\ell)}\}_{\ell=1}^m$?
• Subspace distance: $\rho(V, W) = \|VV^\top - WW^\top\|_F$.
• Least squares: $\hat{U} = \operatorname{argmin}_{V \in \mathbb{O}^{d \times K}} \sum_{\ell=1}^m \rho^2(V, \hat{U}^{(\ell)})$.
• Algorithm: $(\hat{U}^{(1)}, \cdots, \hat{U}^{(m)}) \in \mathbb{R}^{d \times mK} \xrightarrow{\text{SVD}} \hat{U} \in \mathbb{O}^{d \times K}$ (see the sketch below).
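Below is a minimal sketch of the two-round procedure with simulated Gaussian data (assumed sizes and covariance; not the authors' code): each machine runs a local PCA, and the server aggregates by an SVD of the concatenated local bases.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d, K = 10, 200, 50, 3                  # hypothetical sizes

# Population covariance with a clear top-K eigenspace.
eigvals = np.concatenate([np.array([10.0, 8.0, 6.0]), np.ones(d - K)])
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Sigma = Q @ np.diag(eigvals) @ Q.T
U_true = Q[:, :K]

def local_pca(X, K):
    """Top-K eigenvectors of the local sample covariance."""
    S = X.T @ X / X.shape[0]
    _, vecs = np.linalg.eigh(S)
    return vecs[:, -K:]                      # d x K, orthonormal columns

# 1. PCA in parallel on m machines.
locals_U = []
for _ in range(m):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    locals_U.append(local_pca(X, K))

# 2. Aggregation: top-K left singular vectors of the d x (mK) concatenation.
concat = np.hstack(locals_U)
U_hat = np.linalg.svd(concat, full_matrices=False)[0][:, :K]

def subspace_dist(V, W):
    return np.linalg.norm(V @ V.T - W @ W.T)

print("single machine error:", subspace_dist(locals_U[0], U_true))
print("aggregated error:    ", subspace_dist(U_hat, U_true))
```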
Theoretical Results
Assume that $X_i$ is sub-Gaussian, i.e. $\|\Sigma^{-1/2}X_i\|_{\psi_2} \lesssim 1$.
Define the effective rank and condition number as
$$r = \operatorname{Tr}(\Sigma)/\lambda_1, \qquad \kappa = \lambda_1/(\lambda_K - \lambda_{K+1}).$$
Theorem (FWWZ, AoS 2019). There exists a constant $C$ such that when $n \geq C\kappa^2\sqrt{Kr}$,
$$\|\hat{U}\hat{U}^\top - UU^\top\|_F \lesssim \underbrace{\kappa^2\frac{r\sqrt{K}}{n}}_{\text{bias}} + \underbrace{\kappa\sqrt{\frac{Kr}{mn}}}_{\text{variance}}.$$
• If $m \lesssim n/(\kappa^2 r)$, distributed PCA is optimal.
• The condition $n \geq C\kappa^2\sqrt{Kr}$ cannot be improved.
Analysis of Aggregation
$X^{(\ell)} \in \mathbb{R}^{n \times d} \xrightarrow{\text{SVD}} \hat{U}^{(\ell)} \in \mathbb{O}^{d \times K}$ for $\ell = 1, \ldots, m$; aggregate into $\hat{U} \in \mathbb{O}^{d \times K}$.
$\hat{U}$: eigenvectors of $\frac{1}{m}\sum_{\ell=1}^m \hat{U}^{(\ell)}\hat{U}^{(\ell)\top}$ (see the numerical check below).
Averaging reduces variance but retains bias.
• Variance: controlled by Davis-Kahan: $\|\hat{U}^{(\ell)}\hat{U}^{(\ell)\top} - UU^\top\|_F \lesssim \|(\hat{\Sigma}^{(\ell)} - \Sigma)U\|_F/\Delta$, where $\Delta = \lambda_K - \lambda_{K+1}$ is the eigengap.
• Bias: how large is it?
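The averaged-projector formulation above is equivalent to the stacked-SVD aggregation from the previous slide, since $(\hat{U}^{(1)}, \ldots, \hat{U}^{(m)})(\hat{U}^{(1)}, \ldots, \hat{U}^{(m)})^\top = \sum_{\ell}\hat{U}^{(\ell)}\hat{U}^{(\ell)\top}$. A quick numerical check with random orthonormal inputs (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, K = 8, 30, 4                           # hypothetical sizes

# Random orthonormal d x K matrices standing in for the local outputs U^(l).
U_locals = [np.linalg.qr(rng.normal(size=(d, K)))[0] for _ in range(m)]

# Route 1: top-K eigenvectors of the averaged projection matrix.
P_bar = sum(U @ U.T for U in U_locals) / m
_, vecs = np.linalg.eigh(P_bar)
U_avg = vecs[:, -K:]

# Route 2: top-K left singular vectors of the d x (mK) concatenation.
concat = np.hstack(U_locals)
U_cat = np.linalg.svd(concat, full_matrices=False)[0][:, :K]

# Both give the same K-dimensional subspace.
gap = np.linalg.norm(U_avg @ U_avg.T - U_cat @ U_cat.T)
print(f"difference between the two aggregates: {gap:.2e}")
```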
Linearization of Eigenvectors
Theorem (FWWZ, AoS 2019).
$$\|\hat{U}^{(\ell)}\hat{U}^{(\ell)\top} - [UU^\top + f(\hat{\Sigma}^{(\ell)} - \Sigma)]\|_F \lesssim \big[\|(\hat{\Sigma}^{(\ell)} - \Sigma)U\|_F/\Delta\big]^2,$$
where $f: \mathbb{R}^{d \times d} \to \mathbb{R}^{d \times d}$ is a linear functional determined by $\Sigma$.
More precise than Davis-Kahan: $\|\hat{U}^{(\ell)}\hat{U}^{(\ell)\top} - UU^\top\|_F \lesssim \|(\hat{\Sigma}^{(\ell)} - \Sigma)U\|_F/\Delta$.
PCA has small bias: $\|\mathbb{E}(\hat{U}^{(\ell)}\hat{U}^{(\ell)\top}) - UU^\top\|_F \lesssim \big[\|(\hat{\Sigma}^{(\ell)} - \Sigma)U\|_F/\Delta\big]^2$.
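For intuition, one can think of $f$ as the standard first-order perturbation of the top-$K$ spectral projector; the explicit formula used below is the textbook expansion, stated here as an assumption rather than the paper's exact definition of $f$. The sketch checks numerically that the remainder after linearization is second order in the perturbation.

```python
import numpy as np

rng = np.random.default_rng(4)
d, K = 20, 3

# A symmetric Sigma with a spectral gap after the K-th eigenvalue.
eigvals = np.concatenate([np.array([5.0, 4.0, 3.0]), rng.uniform(0.1, 0.5, d - K)])
Q = np.linalg.qr(rng.normal(size=(d, d)))[0]
Sigma = Q @ np.diag(eigvals) @ Q.T

def top_k_projector(S, K):
    _, vecs = np.linalg.eigh(S)
    U = vecs[:, -K:]
    return U @ U.T

def f_linear(E, Sigma, K):
    """First-order perturbation of the top-K spectral projector (textbook formula)."""
    vals, vecs = np.linalg.eigh(Sigma)
    vals, vecs = vals[::-1], vecs[:, ::-1]          # descending order
    dP = np.zeros_like(Sigma)
    for j in range(K):
        for k in range(K, Sigma.shape[0]):
            c = (vecs[:, k] @ E @ vecs[:, j]) / (vals[j] - vals[k])
            dP += c * (np.outer(vecs[:, j], vecs[:, k]) + np.outer(vecs[:, k], vecs[:, j]))
    return dP

E = rng.normal(size=(d, d)); E = (E + E.T) / 2      # symmetric perturbation
P0 = top_k_projector(Sigma, K)
for t in [1e-1, 1e-2, 1e-3]:
    P_t = top_k_projector(Sigma + t * E, K)
    remainder = np.linalg.norm(P_t - P0 - t * f_linear(E, Sigma, K))
    print(f"t={t:.0e}: remainder {remainder:.2e}  (should scale like t^2)")
```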
Summary
Theoretical guarantees for distributed PCA:
• Bias and variance of PCA;
• Linearization of eigenvectors, high-order Davis-Kahan.
Paper (alphabetical order):
• Fan, Wang, Wang and Zhu. Distributed estimation of principal eigenspaces. The Annals of Statistics, 2019.
Example: Genes Mirror Geography within Europe
Novembre et al. (2008), Nature. n = 1387 individuals and d = 197146 SNPs.
Figure 1a: 2-dim. embedding (PC1 vs. PC2) vs. labels.
A Pipeline for Spectral Methods
1. Similarity matrix construction, e.g. Gram matrix $XX^\top$, adjacency matrix $A$;
2. Spectral decomposition: get $r$ eigen-pairs $\{(\lambda_j, u_j)\}_{j=1}^r$;
3. $r$-dim. embedding, e.g. using the rows of $(u_1, u_2, \ldots, u_r)$;
4. Downstream tasks, e.g. visualization (a toy end-to-end sketch follows).
Ext.: {robust, probabilistic, sparse, nonnegative} PCA.
Pearson (1901), Hotelling (1933), Schölkopf (1997), Tipping and Bishop (1999), Shi and Malik (2000), Ng et al. (2002), Belkin and Niyogi (2003), Von Luxburg (2007)
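An end-to-end toy run of steps 1–4 in Python (synthetic two-cluster data; all parameters are made up): build the Gram matrix, take its leading eigen-pairs, embed each object, and use a sign-based clustering as the downstream task.

```python
import numpy as np

rng = np.random.default_rng(5)
n_per, d, r = 100, 50, 2                    # hypothetical sizes

# Toy data: two Gaussian clusters in d dimensions.
mu = np.zeros(d); mu[0] = 4.0
X = np.vstack([rng.normal(size=(n_per, d)) + mu,
               rng.normal(size=(n_per, d)) - mu])
labels = np.array([0] * n_per + [1] * n_per)

# 1. Similarity matrix: Gram matrix of the centered data.
Xc = X - X.mean(axis=0)
G = Xc @ Xc.T

# 2. Spectral decomposition: leading r eigen-pairs.
vals, vecs = np.linalg.eigh(G)
embedding = vecs[:, -r:][:, ::-1]           # 3. r-dim embedding: one row per object

# 4. Downstream task: cluster by the sign of the leading coordinate.
pred = (embedding[:, 0] > 0).astype(int)
acc = max(np.mean(pred == labels), np.mean(pred != labels))  # labels known up to a swap
print(f"clustering accuracy from the leading eigenvector: {acc:.2f}")
```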
An ℓp theory for spectral methods
• Network analysis and Wigner-type matrices
• Mixture model and Wishart-type matrices
Community Detection and SBM
Community detection in networks.
Stochastic Block Model (Holland et al., 1983): symmetric adjacency matrix $A \in \{0,1\}^{n \times n}$, balanced communities $|J| = |J^c| = \frac{n}{2}$,
$$\mathbb{P}(A_{ij} = 1) = \begin{cases} p, & \text{if } i, j \in J \text{ or } i, j \in J^c, \\ q, & \text{otherwise}. \end{cases}$$
McSherry (2001), Coja-Oghlan (2006), Rohe et al. (2011), Mossel et al. (2013), Massoulie (2014), Lelarge et al. (2015), Chin et al. (2015), Abbe et al. (2016), Zhang and Zhou (2016).
Credit: Yuxin Chen.
Community Detection and SBM
$$\mathbb{E}A = \begin{pmatrix} p\,\mathbf{1}_{J,J} & q\,\mathbf{1}_{J,J^c} \\ q\,\mathbf{1}_{J^c,J} & p\,\mathbf{1}_{J^c,J^c} \end{pmatrix} = \frac{p+q}{2}\,\mathbf{1}\mathbf{1}^\top + \frac{p-q}{2}\begin{pmatrix} \mathbf{1}_J \\ -\mathbf{1}_{J^c} \end{pmatrix}\begin{pmatrix} \mathbf{1}_J \\ -\mathbf{1}_{J^c} \end{pmatrix}^{\top}.$$
The 2nd eigenvector $\bar{u} = \frac{1}{\sqrt{n}}(\mathbf{1}_J - \mathbf{1}_{J^c})$ reveals $(J, J^c)$.
$A = \mathbb{E}A + (A - \mathbb{E}A)$.
Spectral method: $A \xrightarrow{\text{SVD}}$ the 2nd eigenvector $u \to \operatorname{sgn}(u)$.
To recover $(J, J^c)$, we need $u \approx \bar{u}$ in a uniform way.
Classical ℓ2 bounds (Davis and Kahan, 1970) are too loose!
Credit: Yuxin Chen.
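The following simulation illustrates the spectral method on a synthetic SBM (assumed values of $a$ and $b$, chosen above the exact-recovery threshold discussed on the next slide; not the paper's experiments): draw $A$, take the eigenvector of its second largest eigenvalue, and read off the communities from its signs.

```python
import numpy as np

rng = np.random.default_rng(6)
n, a, b = 1000, 8.0, 1.0                    # assumed parameters; (sqrt(a)-sqrt(b))^2 > 2
p, q = a * np.log(n) / n, b * np.log(n) / n

# Balanced communities J and J^c, encoded as +1 / -1.
z = np.array([1] * (n // 2) + [-1] * (n // 2))

# Symmetric adjacency matrix with SBM edge probabilities.
prob = np.where(np.outer(z, z) > 0, p, q)
upper = np.triu(rng.random((n, n)) < prob, k=1)
A = (upper | upper.T).astype(float)

# Spectral method: sign of the eigenvector of the 2nd largest eigenvalue of A.
vals, vecs = np.linalg.eigh(A)              # ascending eigenvalues
u2 = vecs[:, -2]
pred = np.sign(u2)

# Misclassification rate, up to a global sign flip of the communities.
err = min(np.mean(pred != z), np.mean(pred != -z))
print(f"fraction of misclassified nodes: {err:.4f}")
```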
Optimality of Spectral Method
Theorem (AFWZ, AoS 2020+). Let $a \neq b$ and
$$\mathbb{P}(A_{ij} = 1) = \begin{cases} \frac{a \log n}{n}, & \text{if } i, j \in J \text{ or } i, j \in J^c, \\ \frac{b \log n}{n}, & \text{otherwise}. \end{cases}$$
• Exact recovery w.h.p. when $(\sqrt{a} - \sqrt{b})^2 > 2$;
• Error rate $\lesssim n^{-(\sqrt{a} - \sqrt{b})^2/2}$ when $(\sqrt{a} - \sqrt{b})^2 \leq 2$;
• Optimality (Abbe et al., 2016; Zhang and Zhou, 2016).