PCA for Distributed Data Sets


  1. PCA for Distributed Data Sets
     Raymond H. Chan, Department of Mathematics, The Chinese University of Hong Kong
     Joint work with Franklin Luk (RPI) and Z.-J. Bai (CUHK)

  2. Grid
     Powerful processors with relatively slow links.
     [Diagram: five 1-Teraflop processor nodes (0-4), i.e. powerful processors, joined by slow gigabit networks.]

  3. New York State TeraGrid Initiative
     - State University of NY at Albany
     - Rensselaer Polytechnic Institute
     - Brookhaven National Laboratory
     - State University of NY at Stony Brook
     - in partnership with IBM and NYSERNet

  4. [Figure-only slide; no text content.]

  5. Grid Topology
     - Four sites, scalable grid architecture
     - 10 to 20 Gb/sec connection
     - 6 TF processors
     Computation / Communication = (6 × 10^12) / (0.3 × 10^9) = 20,000!
     Communication bottleneck: computation is done locally.

  6. Principal Component Analysis
     Let X be an n × p data matrix, where n ≫ p. The data covariance matrix S is given by
     nS = X^T (I − (1/n) e_n e_n^T) X,
     where e_n^T = (1, 1, ..., 1).
     PCA ⟺ Karhunen-Loève transform ⟺ Hotelling transform

  7. PCA means
     - Get the spectral decomposition of n·S: nS = V Σ^2 V^T.
     - Choose a few largest eigenvalues and the corresponding eigenvectors Ṽ.
     - Form the principal component vectors X·Ṽ.
     - Low-dimensional representation of the original data.

  8. PCA involves only V and Σ
     Since (I − (1/n) e_n e_n^T) is symmetric and idempotent,
     nS = X^T (I − (1/n) e_n e_n^T) X = X^T (I − (1/n) e_n e_n^T)(I − (1/n) e_n e_n^T) X.
     So Σ and V can be obtained from the SVD of the centered data matrix:
     (I − (1/n) e_n e_n^T) X = U Σ V^T.
     The low-dimensional representation X·Ṽ can still be formed.
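To make this baseline concrete, here is a minimal single-machine sketch in NumPy (not from the talk; array sizes and variable names are illustrative): it centers X, computes the SVD of the centered matrix, checks that Σ and V give the spectral decomposition of nS, and forms X·Ṽ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 500, 8, 3                  # n >> p; keep k principal components
X = rng.standard_normal((n, p))

# Centering: (I - (1/n) e_n e_n^T) X just subtracts the column means.
Xc = X - X.mean(axis=0)

# SVD of the centered data matrix: Xc = U Sigma V^T, so nS = V Sigma^2 V^T.
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T

nS = Xc.T @ Xc                       # n times the data covariance matrix
assert np.allclose(nS, V @ np.diag(sigma**2) @ V.T)

# Low-dimensional representation: principal component vectors X * V_tilde.
V_tilde = V[:, :k]
scores = X @ V_tilde
print(scores.shape)                  # (500, 3)
```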

  9. Distributed PCA
     Big data matrix X: n ≈ 10^12, e.g. visualization, data mining.
     Problem: the data are distributed amongst s processors.
     Can we find Σ and V without moving X across processors?

  10. Data among s processors
      Denote
      X = [X_0; X_1; ...; X_{s−1}],   n = Σ_{i=0}^{s−1} n_i,
      where X_i is n_i × p and resides on processor i.
      Typical: n_i ≈ 10^12 and p ≈ 10^3.

  11. Aim
      - Compute the PCA of X without moving the data matrices X_i.
      - Move O(p^α) data across processors instead of O(n_i).

  12. Distributed PCA by Qu et al.
      1. At processor i, calculate the local PCA using the SVD:
         (I − (1/n_i) e_{n_i} e_{n_i}^T) X_i = U_i Σ_i V_i^T.
         Say the matrix has numerical rank k_i.
         Send x̄_i (column sum of X_i), and the k_i largest principal components Σ̃_i and Ṽ_i, to the central processor.
         Communication cost = O(p k_i).

  13. 2. At the central processor:
         Assemble the p × p covariance matrix and find its PCA:
         nS = Σ_{i=0}^{s−1} Ṽ_i Σ̃_i^2 Ṽ_i^T + Σ_{i=0}^{s−1} n_i (x̄_i − x̄)(x̄_i − x̄)^T = V Σ^2 V^T.
         Broadcast Ṽ, the first k columns of V.
         Communication cost = O(p k).

  14. 3. Calculate the principal component vectors at processor i: X_i Ṽ.
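A small sketch of Qu et al.'s three steps, simulated on one machine with NumPy (the s processors are just row blocks here; the rank threshold and all variable names are my own assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
s, p, k = 4, 6, 3
blocks = [rng.standard_normal((rng.integers(50, 100), p)) for _ in range(s)]

# Step 1 (processor i): local truncated SVD of the centered block.
summaries = []
for X_i in blocks:
    n_i = X_i.shape[0]
    mean_i = X_i.mean(axis=0)
    _, sig_i, Vt_i = np.linalg.svd(X_i - mean_i, full_matrices=False)
    k_i = int(np.sum(sig_i > 1e-8))          # numerical rank k_i
    summaries.append((n_i, mean_i, sig_i[:k_i], Vt_i[:k_i].T))  # O(p k_i) data

# Step 2 (central processor): assemble the approximate p x p matrix nS.
n = sum(n_i for n_i, *_ in summaries)
xbar = sum(n_i * m_i for n_i, m_i, *_ in summaries) / n
nS = np.zeros((p, p))
for n_i, m_i, sig_i, V_i in summaries:
    nS += V_i @ np.diag(sig_i**2) @ V_i.T
    nS += n_i * np.outer(m_i - xbar, m_i - xbar)
eigval, eigvec = np.linalg.eigh(nS)
V_k = eigvec[:, ::-1][:, :k]                 # k leading eigenvectors, broadcast

# Step 3 (processor i): principal component vectors X_i @ V_k.
scores = [X_i @ V_k for X_i in blocks]
print([sc.shape for sc in scores])
```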

  15. Analysis of Qu’s Approach
      Advantage: reduces the communication cost:
      O(p n) → O(p Σ_{i=0}^{s−1} k_i).
      Disadvantages:
      - Local SVDs introduce approximation errors.
      - The central processor becomes a bottleneck for communication and computation.

  16. Luk’s Algorithms
      Replace the SVD by the QR decomposition.

  17. 1a. At processor i
          Calculate the QR decomposition of the centered local block:
          (I − (1/n_i) e_{n_i} e_{n_i}^T) X_i = Q_i^(0) R_i^(0),
          where R_i^(0) is p × p.
          Send n_i and x̄_i to the central processor.
          If i ≥ s/2, send R_i^(0) to processor i − s/2.
          No need to send Q_i^(0).

  18. 1b. At processor i < s/2
          Calculate the QR decomposition
          [R_i^(0); R_{i+s/2}^(0)] = Q_i^(1) R_i^(1),
          where R_i^(1) is p × p. This equals the R factor of the QRD of the two stacked centered blocks
          [(I − (1/n_i) e_{n_i} e_{n_i}^T) X_i; (I − (1/n_{i+s/2}) e_{n_{i+s/2}} e_{n_{i+s/2}}^T) X_{i+s/2}].
          If i ≥ s/4, send R_i^(1) to processor i − s/4.
          No need to send Q_i^(1).
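A quick NumPy check of this claim, with two random centered blocks standing in for the local data (sizes are arbitrary, not from the talk): the R factor obtained by stacking the two local R's agrees, up to row signs, with the R factor of the two stacked centered blocks, so the merged R carries the same scatter information.

```python
import numpy as np

rng = np.random.default_rng(5)
p = 4
A = rng.standard_normal((30, p))
B = rng.standard_normal((50, p))
A -= A.mean(axis=0)                  # stands in for the centered X_i
B -= B.mean(axis=0)                  # stands in for the centered X_{i+s/2}

R_A = np.linalg.qr(A, mode="r")
R_B = np.linalg.qr(B, mode="r")

R_merge = np.linalg.qr(np.vstack([R_A, R_B]), mode="r")   # QR of stacked R's
R_direct = np.linalg.qr(np.vstack([A, B]), mode="r")      # QR of stacked blocks

# The two R factors agree up to row signs, so R^T R (the scatter) is identical.
assert np.allclose(R_merge.T @ R_merge, R_direct.T @ R_direct)
```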

  19. 1c. At processor i < s/4
          Calculate the QR decomposition
          [R_i^(1); R_{i+s/4}^(1)] = Q_i^(2) R_i^(2),
          where R_i^(2) is p × p.
          If i ≥ s/8, send R_i^(2) to processor i − s/8.
          No need to send Q_i^(2).

  20. 1d. Eventually, at processor 0
          Calculate the QR decomposition
          [R_0^(l−1); R_1^(l−1)] = Q_0^(l) R_0^(l),
          where l = ⌈log_2 s⌉.
          Send R_0^(l) to the central processor.
          No need to send Q_0^(l).

  21. Main Results
      - Total communication cost = O(⌈log_2 s⌉ p^2).
      - The covariance matrix nS = X^T (I − (1/n) e_n e_n^T) X is given by
        nS = R_0^(ℓ)T R_0^(ℓ) + Σ_{i=0}^{s−1} n_i (x̄_i − x̄)(x̄_i − x̄)^T.
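A minimal NumPy sketch of the whole reduction tree, simulated on one machine (the s processors are just row blocks; block sizes and variable names are illustrative, not the authors' code), ending with a check of the covariance identity above:

```python
import numpy as np

rng = np.random.default_rng(2)
s, p = 4, 5                                   # s = 2^l processors (simulated)
blocks = [rng.standard_normal((rng.integers(40, 80), p)) for _ in range(s)]

# Step 1a (processor i): QR of the locally centered block; keep only R_i (p x p).
ns = [X_i.shape[0] for X_i in blocks]
means = [X_i.mean(axis=0) for X_i in blocks]
Rs = [np.linalg.qr(X_i - m_i, mode="r") for X_i, m_i in zip(blocks, means)]

# Steps 1b-1d: pairwise merge of R factors over ceil(log2 s) rounds;
# only p x p triangles ever move between "processors".
while len(Rs) > 1:
    half = len(Rs) // 2
    Rs = [np.linalg.qr(np.vstack([Rs[i], Rs[i + half]]), mode="r")
          for i in range(half)]
R_final = Rs[0]

# Main result: nS = R^T R + sum_i n_i (xbar_i - xbar)(xbar_i - xbar)^T.
n = sum(ns)
xbar = sum(n_i * m_i for n_i, m_i in zip(ns, means)) / n
nS_tree = R_final.T @ R_final + sum(
    n_i * np.outer(m_i - xbar, m_i - xbar) for n_i, m_i in zip(ns, means))

X = np.vstack(blocks)
nS_direct = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0))
assert np.allclose(nS_tree, nS_direct)
```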

  22. 2. At the central processor
         Assemble the (s + p) × p data matrix:
         Z = [ √n_0 (x̄_0 − x̄)^T;
               √n_1 (x̄_1 − x̄)^T;
               ...;
               √n_{s−1} (x̄_{s−1} − x̄)^T;
               R_0^(l) ].
         Notice that nS = Z^T Z.

  23. 2. At the central processor (continued)
         Compute the SVD Z = U Σ V^T (after triangularization).
         Say Z has numerical rank k.
         Broadcast x̄ and Ṽ, the first k columns of V.
         Communication cost = O(p k).
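A sketch of this central-processor step in NumPy (again simulated on one machine; for brevity the R factor from the reduction tree is stood in for by a direct QR of the stacked centered blocks, which yields the same R^T R, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
s, p, k = 4, 5, 2
blocks = [rng.standard_normal((rng.integers(40, 80), p)) for _ in range(s)]

ns = [B.shape[0] for B in blocks]
means = [B.mean(axis=0) for B in blocks]
n = sum(ns)
xbar = sum(n_i * m_i for n_i, m_i in zip(ns, means)) / n

# Stand-in for R_0^(l) produced by the reduction tree (same R^T R).
R_final = np.linalg.qr(np.vstack([B - m for B, m in zip(blocks, means)]),
                       mode="r")

# Assemble the (s + p) x p matrix Z; then nS = Z^T Z.
Z = np.vstack([np.sqrt(n_i) * (m_i - xbar)[None, :]
               for n_i, m_i in zip(ns, means)] + [R_final])

X = np.vstack(blocks)
Xc = X - X.mean(axis=0)
assert np.allclose(Z.T @ Z, Xc.T @ Xc)       # nS = Z^T Z

# SVD of Z gives the Sigma and V of the global PCA; broadcast xbar and V[:, :k].
_, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
V_k = Vt[:k].T
scores = [B @ V_k for B in blocks]           # step 3 at each processor
print(sigma[:k], [sc.shape for sc in scores])
```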

  24. 3. At processor i
         Calculate the principal component vectors: X_i Ṽ.

  25. Analysis of Luk’s Algorithm
      Advantages over Qu’s approach:
      - Communication cost of the PCA: O(p Σ_{i=0}^{s−1} k_i) → O(p^2 ⌈log_2 s⌉).
      - No local PCA approximation errors.
      - Less congestion at the central processor for communication and computation.
      - Works directly with the data matrices.

  26. Data Updating
      Assume global synchronization at t_0, t_1, ..., t_k, i.e. during [t_{k−1}, t_k] new data X_i^(k) are added on processor i.
      Aim: find the PCA of the new, extended matrix without moving the X_i^(k) across processors.

  27. [Diagram: rows are processors 0, 1, ..., s−1; columns are times t_0, t_1, ..., t_m; processor i receives the new block X_i^(k) at time t_k.]

  28. At time t_k
      Let
      X^(k) = [X_0^(k); X_1^(k); ...; X_{s−1}^(k)],   n^(k) = Σ_{i=0}^{s−1} n_i^(k),
      where X_i^(k) is n_i^(k) × p.
      Assume the PCA of the original matrix X^(0) = X is available from Luk’s algorithm.

  29. Global Data Matrix at t_m
      Denote
      X_{g(m)} = [X^(0); X^(1); ...; X^(m)],   g(m) = Σ_{k=0}^{m} n^(k).
      Aim: find the PCA of its covariance matrix:
      g(m) · S_{g(m)} = X_{g(m)}^T (I − (1/g(m)) e_{g(m)} e_{g(m)}^T) X_{g(m)}.

  30. Our Theorem
      Let n^(k) · S_k = X^(k)T (I − (1/n^(k)) e_{n^(k)} e_{n^(k)}^T) X^(k). Then
      g(m) S_{g(m)} = Σ_{k=0}^{m} n^(k) S_k
                    + Σ_{k=1}^{m} [g(k−1) n^(k) / g(k)] (x̄_{g(k−1)} − x̄_{n^(k)})(x̄_{g(k−1)} − x̄_{n^(k)})^T.

  31. Explanation
      The PCA of the update data matrix X^(k) can be obtained by Luk’s algorithm, i.e. n^(k) S_k = R_k^T R_k. Then
      g(m) S_{g(m)} = Σ_{k=0}^{m} R_k^T R_k
                    + Σ_{k=1}^{m} [g(k−1) n^(k) / g(k)] (x̄_{g(k−1)} − x̄_{n^(k)})(x̄_{g(k−1)} − x̄_{n^(k)})^T.
      Assemble these terms to construct the global PCA for X_{g(m)}.
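A small NumPy sketch that applies the update formula batch by batch and checks it against the covariance of all the data stacked together (batch sizes and variable names are my own; the R_k come from a plain QR here rather than the distributed reduction):

```python
import numpy as np

rng = np.random.default_rng(4)
p, m = 4, 3
batches = [rng.standard_normal((rng.integers(30, 60), p)) for _ in range(m + 1)]

R = []            # R_k with n^(k) S_k = R_k^T R_k (from Luk's algorithm)
n_k = []          # batch sizes n^(k)
mean_k = []       # batch means
for Xk in batches:
    n_k.append(Xk.shape[0])
    mean_k.append(Xk.mean(axis=0))
    R.append(np.linalg.qr(Xk - Xk.mean(axis=0), mode="r"))

# Running totals for the data seen so far: g(k) and its mean xbar_{g(k)}.
gS = R[0].T @ R[0]
g = n_k[0]
xbar_g = mean_k[0]
for k in range(1, m + 1):
    g_new = g + n_k[k]
    diff = xbar_g - mean_k[k]
    # Theorem: add the batch scatter and the between-group correction term.
    gS = gS + R[k].T @ R[k] + (g * n_k[k] / g_new) * np.outer(diff, diff)
    xbar_g = (g * xbar_g + n_k[k] * mean_k[k]) / g_new
    g = g_new

# Check against the scatter matrix of all data stacked together.
X_all = np.vstack(batches)
Xc = X_all - X_all.mean(axis=0)
assert np.allclose(gS, Xc.T @ Xc)
```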

  32. Analysis of Our Algorithm
      - The global PCA can be computed without moving the X^(k).
      - Communication cost is still O(p^2 ⌈log_2 s⌉).
      - No local PCA approximation errors.
      - Works directly with the data matrices and update matrices.
      - Load balancing for communication and computation.

  33. Load Balancing
      Let s = 2^ℓ. We can allocate all processors to do the QR factorizations such that:
      - PCA of X^(k) ← PCA of X^(k−1) + R factor of X^(k).
      - The PCA of X^(k) is obtained at time t_{k+ℓ}.
      - The procedure is periodic with period ℓ.
      - The work is well balanced among the processors.
