Memory-Efficient Parallel Computation of Tensor and Matrix Products for Big Tensor Decomposition


1. Memory-Efficient Parallel Computation of Tensor and Matrix Products for Big Tensor Decomposition
N. Ravindran∗, N.D. Sidiropoulos∗, S. Smith†, and G. Karypis†
∗Dept. of ECE & DTC, †Dept. of CSci & DTC, University of Minnesota, Minneapolis
Asilomar Conf. on Signals, Systems, and Computers, Nov. 3-5, 2014

2. Outline
- Rank decomposition for tensors: PARAFAC/CANDECOMP (CP), ALS
- Computational bottleneck for big tensors: unfolded tensor data times Khatri-Rao matrix product
- Prior work
- Proposed memory- and computation-efficient algorithms
- Review of recent randomized tensor compression results: identifiability, PARACOMP
- Memory- and computation-efficient algorithms for multi-way tensor compression
- Parallelization and high-performance computing optimization (underway)

3. Rank decomposition for tensors
- Sum of outer products: $X = \sum_{f=1}^{F} a_f \circ b_f \circ c_f$  (†)
- $A$ is $I \times F$, holding $\{a_f\}_{f=1}^{F}$; $B$ is $J \times F$, holding $\{b_f\}_{f=1}^{F}$; $C$ is $K \times F$, holding $\{c_f\}_{f=1}^{F}$
- $X(:,:,k)$ := $k$-th $I \times J$ matrix "slice" of $X$
- $X_{(3)}^T$ := $IJ \times K$ matrix whose $k$-th column is $\mathrm{vec}(X(:,:,k))$; similarly, the $JK \times I$ matrix $X_{(1)}^T$ and the $IK \times J$ matrix $X_{(2)}^T$
- Equivalent ways to write (†): $X_{(1)}^T = (C \odot B) A^T \;\Leftrightarrow\; X_{(2)}^T = (C \odot A) B^T \;\Leftrightarrow\; X_{(3)}^T = (B \odot A) C^T \;\Leftrightarrow\; \mathrm{vec}(X_{(3)}^T) = (C \odot B \odot A)\,\mathbf{1}$
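To make the notation concrete, here is a minimal NumPy sketch (not part of the slides; sizes and the random seed are illustrative) that builds a small rank-$F$ tensor and verifies the unfolding identity $X_{(3)}^T = (B \odot A) C^T$ under the column-major vec convention:

```python
import numpy as np

I, J, K, F = 4, 5, 6, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((I, F))
B = rng.standard_normal((J, F))
C = rng.standard_normal((K, F))

# X = sum_f a_f o b_f o c_f  (I x J x K)
X = np.einsum('if,jf,kf->ijk', A, B, C)

# X_(3)^T: IJ x K matrix whose k-th column is vec(X(:,:,k)) (column-major vec)
X3T = np.stack([X[:, :, k].flatten(order='F') for k in range(K)], axis=1)

# Khatri-Rao product B ⊙ A: column-wise Kronecker, (IJ) x F
B_kr_A = np.vstack([np.kron(B[:, f], A[:, f]) for f in range(F)]).T

print(np.allclose(X3T, B_kr_A @ C.T))  # True
```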

4. Alternating Least Squares (ALS)
- Multilinear least squares: $\min_{A,B,C} \| X_{(1)}^T - (C \odot B) A^T \|_F^2$
- Nonconvex; in fact NP-hard even for $F = 1$
- Alternating least squares, using $X_{(1)}^T = (C \odot B) A^T \Leftrightarrow X_{(2)}^T = (C \odot A) B^T \Leftrightarrow X_{(3)}^T = (B \odot A) C^T$
- Reduces to the cyclic updates
  $A = X_{(1)} (C \odot B) (C^T C \ast B^T B)^{\dagger}$
  $B = X_{(2)} (C \odot A) (C^T C \ast A^T A)^{\dagger}$
  $C = X_{(3)} (B \odot A) (B^T B \ast A^T A)^{\dagger}$
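For reference, a minimal dense CP-ALS sketch built directly on these updates (an assumed implementation, not the authors' code; the unfolding helpers, names, and iteration count are illustrative):

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: (U ⊙ V)[:, f] = kron(U[:, f], V[:, f])."""
    F = U.shape[1]
    return np.vstack([np.kron(U[:, f], V[:, f]) for f in range(F)]).T

def cp_als(X, F, n_iter=50, seed=0):
    """Plain dense CP-ALS via the three matricized updates above."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((n, F)) for n in (I, J, K))
    # Mode unfoldings consistent with X_(1)^T = (C ⊙ B) A^T etc. (column-major vec)
    X1 = X.transpose(0, 2, 1).reshape(I, K * J)   # I x KJ, pairs with C ⊙ B
    X2 = X.transpose(1, 2, 0).reshape(J, K * I)   # J x KI, pairs with C ⊙ A
    X3 = X.transpose(2, 1, 0).reshape(K, J * I)   # K x JI, pairs with B ⊙ A
    for _ in range(n_iter):
        A = X1 @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
        B = X2 @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
        C = X3 @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
    return A, B, C
```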

5. Core computation
- Computing and inverting $(C^T C \ast B^T B)$ is relatively easy: $C$ and $B$ are relatively small ($K \times F$ and $J \times F$), the matrix to invert is $F \times F$, and $F$ is usually small
- Direct computation of, say, $X_{(1)} (C \odot B)$ requires $O(JKF)$ memory to store $C \odot B$, in addition to $O(\mathrm{NNZ})$ memory to store the tensor data, where NNZ is the number of non-zero elements of the tensor $X$
- Further, $JKF$ flops are required to compute $C \odot B$, and $JKF + 2F\,\mathrm{NNZ}$ flops to compute its product with $X_{(1)}$
- The bottleneck is computing $X_{(1)} (C \odot B)$; likewise $X_{(2)} (C \odot A)$ and $X_{(3)} (B \odot A)$
- The entire $X$ needs to be accessed for each of these computations in every ALS iteration, incurring large data-transport costs
- The memory access pattern of the tensor data differs across the three computations, making efficient block caching very difficult
- 'Solution': replicate the data three times in main (fast) memory :-(
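A back-of-the-envelope illustration of the point (all sizes below are hypothetical, chosen only to plug into the counts above): the explicitly formed Khatri-Rao factor alone can dwarf the sparse tensor data.

```python
# Hypothetical sizes, used only to evaluate the memory/flop counts above.
I = J = K = 10_000
F = 10
NNZ = 100_000_000                         # nonzeros of a sparse X

kr_entries = J * K * F                    # entries of C ⊙ B
print(f"C ⊙ B:         {kr_entries * 8 / 1e9:.1f} GB at 8 bytes/entry")  # 8.0 GB
print(f"tensor values: {NNZ * 8 / 1e9:.1f} GB")                          # 0.8 GB
print(f"flops to multiply X_(1) by C ⊙ B: {J * K * F + 2 * F * NNZ:.2e}")
```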

6. Prior work
- The Tensor Toolbox [Kolda et al., 2008-] has explicit support for sparse tensors and avoids intermediate data explosion by 'accumulating' tensor-matrix operations
- It computes $X_{(1)} (C \odot B)$ with $3F\,\mathrm{NNZ}$ flops using NNZ intermediate memory (on top of that required to store the tensor), but does not provision for efficient parallelization (the accumulation step must be performed serially)
- Kang et al., 2012, compute $X_{(1)} (C \odot B)$ with $5F\,\mathrm{NNZ}$ flops using $O(\max(J + \mathrm{NNZ}, K + \mathrm{NNZ}))$ intermediate memory; in return, this admits a parallel MapReduce implementation
- Room remains for considerable improvement in memory and computation efficiency, especially on high-performance parallel computing architectures

7. Our first contribution: suite of three algorithms

Algorithm 1: Output: $M_1 \leftarrow X_{(1)} (C \odot B) \in \mathbb{R}^{I \times F}$
1: $M_1 \leftarrow 0$
2: for $k = 1, \ldots, K$ do
3:   $M_1 \leftarrow M_1 + X(:,:,k)\, B\, \mathrm{diag}(C(k,:))$
4: end for

Algorithm 2: Output: $M_2 \leftarrow X_{(2)} (C \odot A) \in \mathbb{R}^{J \times F}$
1: $M_2 \leftarrow 0$
2: for $k = 1, \ldots, K$ do
3:   $M_2 \leftarrow M_2 + X(:,:,k)^T A\, \mathrm{diag}(C(k,:))$
4: end for

Algorithm 3: Output: $M_3 \leftarrow X_{(3)} (B \odot A) \in \mathbb{R}^{K \times F}$
1: for $k = 1, \ldots, K$ do
2:   $M_3(k,:) \leftarrow \mathbf{1}^T (A \ast (X(:,:,k)\, B))$
3: end for
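A minimal dense NumPy rendering of the three loops (the slides target sparse slices and in-place updates; this sketch only pins down the linear algebra, and the Algorithm 2 slice is transposed so the dimensions match $M_2 \in \mathbb{R}^{J \times F}$):

```python
import numpy as np

def mttkrp_mode1(X, B, C):                     # M1 = X_(1)(C ⊙ B), I x F
    I, J, K = X.shape
    M1 = np.zeros((I, C.shape[1]))
    for k in range(K):
        M1 += X[:, :, k] @ (B * C[k, :])       # B diag(C(k,:)) == row-scaling of B
    return M1

def mttkrp_mode2(X, A, C):                     # M2 = X_(2)(C ⊙ A), J x F
    I, J, K = X.shape
    M2 = np.zeros((J, C.shape[1]))
    for k in range(K):
        M2 += X[:, :, k].T @ (A * C[k, :])
    return M2

def mttkrp_mode3(X, A, B):                     # M3 = X_(3)(B ⊙ A), K x F
    I, J, K = X.shape
    M3 = np.zeros((K, A.shape[1]))
    for k in range(K):
        M3[k, :] = np.sum(A * (X[:, :, k] @ B), axis=0)   # 1^T (A * (X(:,:,k) B))
    return M3
```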

8. Features
- Essentially no additional intermediate memory is needed; updates of $A$, $B$, and $C$ can effectively be performed in place
- Computational complexity savings relative to Kang et al.; similar to or better than Kolda et al. (depending on the pattern of nonzeros)
- Algorithms 1, 2, 3 share the same tensor data access pattern, enabling efficient, orderly block caching / pre-fetching if the tensor is stored in slower / serially read memory, without the need for three-fold replication (→ asymmetry between Algorithms 1, 2 and Algorithm 3)
- The loops can be parallelized across $K$ threads, where each thread only requires access to an $I \times J$ slice of the tensor; this favors parallel computation and distributed storage
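One way the slice-level parallelism could look in practice (a hedged sketch, not the authors' HPC implementation): each worker handles a block of slices and produces a partial $M_1$, and the partials are then reduced.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_m1(X, B, C, ks):
    """Partial sum of Algorithm 1 over a block of slice indices ks."""
    M1 = np.zeros((X.shape[0], C.shape[1]))
    for k in ks:
        M1 += X[:, :, k] @ (B * C[k, :])
    return M1

def mttkrp_mode1_parallel(X, B, C, n_workers=4):
    """Block-parallel Algorithm 1; threads suffice since NumPy matmul releases the GIL."""
    K = X.shape[2]
    blocks = np.array_split(np.arange(K), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(lambda ks: partial_m1(X, B, C, ks), blocks)
    return sum(partials)
```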

9. Computational complexity of Algorithm 1
Algorithm 1 (repeated): for $k = 1, \ldots, K$: $M_1 \leftarrow M_1 + X(:,:,k)\, B\, \mathrm{diag}(C(k,:))$
- Let $I_k$ be the number of non-empty rows and $J_k$ the number of non-empty columns of $X(:,:,k)$, and define $\mathrm{NNZ}_1 := \sum_{k=1}^{K} I_k$, $\mathrm{NNZ}_2 := \sum_{k=1}^{K} J_k$
- Assume that empty rows and columns of $X(:,:,k)$ can be identified offline and skipped during the matrix multiplication and the update of $M_1$
- Note: only those rows of $B$ corresponding to non-empty columns of $X(:,:,k)$ need to be scaled by $\mathrm{diag}(C(k,:))$, and this can be done with $F J_k$ flops, for a total of $F\,\mathrm{NNZ}_2$

10. Computational complexity of Algorithm 1, continued
- Next, the multiplications $X(:,:,k)\, B\, \mathrm{diag}(C(k,:))$ can be carried out for all $k$ at $2F\,\mathrm{NNZ}$ flops (counting additions and multiplications)
- Finally, only the rows of $M_1$ corresponding to non-zero rows of $X(:,:,k)$ need to be updated; each row update costs $F$ flops, since $X(:,:,k)\, B\, \mathrm{diag}(C(k,:))$ has $F$ columns, so the total cost of the $M_1$ row updates is $F\,\mathrm{NNZ}_1$ flops
- Overall: $F\,\mathrm{NNZ}_1 + F\,\mathrm{NNZ}_2 + 2F\,\mathrm{NNZ}$ flops. Kang: $5F\,\mathrm{NNZ}$; Kolda: $3F\,\mathrm{NNZ}$. Note $\mathrm{NNZ} > \mathrm{NNZ}_1, \mathrm{NNZ}_2$
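A sparse-slice sketch of this accounting (the data layout, a list of $K$ SciPy CSR slices of size $I \times J$, is an assumption of this sketch, not the authors' code): only rows of $B$ hit by non-empty columns are scaled, and only non-empty rows of $M_1$ are updated.

```python
import numpy as np
import scipy.sparse as sp

def mttkrp_mode1_sparse(slices, B, C):
    I, F = slices[0].shape[0], B.shape[1]
    M1 = np.zeros((I, F))
    for k, Xk in enumerate(slices):                  # Xk: I x J CSR slice
        rows = np.flatnonzero(np.diff(Xk.indptr))    # I_k non-empty rows
        cols = np.unique(Xk.indices)                 # J_k non-empty columns
        Bk = B[cols, :] * C[k, :]                    # F * J_k flops
        # 2 F nnz(X(:,:,k)) flops for the product, F * I_k flops for the row updates
        M1[rows, :] += np.asarray(Xk[rows, :][:, cols] @ Bk)
    return M1

# Example usage with hypothetical sizes:
# slices = [sp.random(1000, 2000, density=1e-3, format='csr') for _ in range(50)]
# M1 = mttkrp_mode1_sparse(slices, np.random.rand(2000, 8), np.random.rand(50, 8))
```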

11. Multi-way tensor compression
- Multiply (every slab of) $X$ from the $I$-mode with $U^T$, from the $J$-mode with $V^T$, and from the $K$-mode with $W^T$, where $U$ is $I \times L$, $V$ is $J \times M$, and $W$ is $K \times N$, with $L \leq I$, $M \leq J$, $N \leq K$, and $LMN \ll IJK$
- Sidiropoulos et al., IEEE SPL '12: if the columns of $A$, $B$, $C$ are sparse, the LRT of the big tensor can be recovered from the LRT of the small tensor, under certain conditions

12. PARACOMP: PArallel RAndomly COMPressed Cubes
- Sidiropoulos et al., IEEE SPM Sep. '14 (Special Issue on SP for Big Data)
- Guaranteed identifiability of the big tensor's LRT from the small tensors' LRTs, for sparse or dense factors and data
- Distributed storage, naturally parallel; overall complexity/storage gains scale as $O\!\left( \frac{IJ}{F} \right)$, for $F \leq I \leq J \leq K$

13. Multi-way tensor compression: Computational aspects
- The compressed tensor $Y_p \in \mathbb{R}^{L_p \times M_p \times N_p}$ can be computed as
  $Y_p(l,m,n) = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} U_p(l,i)\, V_p(m,j)\, W_p(n,k)\, X(i,j,k)$,
  $\forall\, l \in \{1,\ldots,L_p\},\ m \in \{1,\ldots,M_p\},\ n \in \{1,\ldots,N_p\}$
- On the bright side, this can be performed 'in place' and can exploit sparsity by summing only over the non-zero elements of $X$
- On the other hand, the complexity is $O(LMN \cdot IJK)$ for a dense tensor and $O(LMN \cdot \mathrm{NNZ})$ for a sparse tensor
- A bad $U_p$, $V_p$, $W_p$ memory access pattern (especially for sparse $X$) can bog down the computation
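A hedged sketch of the direct $O(LMN \cdot \mathrm{NNZ})$ computation for a sparse $X$ in COO form (the coordinate arrays and the $I \times L$ / $J \times M$ / $K \times N$ orientation of the compression matrices are assumptions of this sketch): each nonzero contributes one rank-one $L \times M \times N$ update.

```python
import numpy as np

def compress_coo(ii, jj, kk, vals, U, V, W):
    """ii, jj, kk, vals: COO coordinates/values of X.
    U: I x L, V: J x M, W: K x N, so U[i, :] plays the role of U_p(:, i) above."""
    L, M, N = U.shape[1], V.shape[1], W.shape[1]
    Y = np.zeros((L, M, N))
    for i, j, k, x in zip(ii, jj, kk, vals):
        # rank-one update x * (U[i,:] outer V[j,:] outer W[k,:]): L*M*N flops per nonzero
        Y += x * np.einsum('l,m,n->lmn', U[i, :], V[j, :], W[k, :])
    return Y
```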

14. Multi-way tensor compression: Computational aspects
Alternative computation schedule:
  $T_1(l,j,k) = \sum_{i=1}^{I} U(l,i)\, X(i,j,k), \quad \forall\, l \in \{1,\ldots,L_p\},\ j \in \{1,\ldots,J\},\ k \in \{1,\ldots,K\}$   (1)
  $T_2(l,m,k) = \sum_{j=1}^{J} V(m,j)\, T_1(l,j,k), \quad \forall\, l \in \{1,\ldots,L_p\},\ m \in \{1,\ldots,M_p\},\ k \in \{1,\ldots,K\}$   (2)
  $Y(l,m,n) = \sum_{k=1}^{K} W(n,k)\, T_2(l,m,k), \quad \forall\, l \in \{1,\ldots,L_p\},\ m \in \{1,\ldots,M_p\},\ n \in \{1,\ldots,N_p\}$   (3)
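A minimal dense sketch of the staged schedule (1)-(3) (illustrative sizes; $U$, $V$, $W$ are taken here as $L \times I$, $M \times J$, $N \times K$ to match the indexing above), contracting one mode at a time so that the largest intermediate is $T_1$ of size $L \times J \times K$:

```python
import numpy as np

I, J, K = 200, 150, 100
L, M, N = 20, 15, 10
rng = np.random.default_rng(2)
X = rng.standard_normal((I, J, K))
U = rng.standard_normal((L, I))
V = rng.standard_normal((M, J))
W = rng.standard_normal((N, K))

T1 = np.tensordot(U, X, axes=(1, 0))            # (1): L x J x K
T2 = np.tensordot(V, T1, axes=(1, 1))           # (2): M x L x K
T2 = np.transpose(T2, (1, 0, 2))                #      reorder to L x M x K
Y  = np.tensordot(T2, W, axes=(2, 1))           # (3): L x M x N

# Sanity check against the direct triple-sum formulation
Y_direct = np.einsum('li,mj,nk,ijk->lmn', U, V, W, X, optimize=True)
print(np.allclose(Y, Y_direct))                 # True
```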
