Constrained Tensor Factorization with Accelerated AO-ADMM




  1. Constrained Tensor Factorization with Accelerated AO-ADMM
  Shaden Smith¹*, Alec Beri², and George Karypis¹
  ¹ Department of Computer Science & Engineering, University of Minnesota
  ² Department of Computer Science, University of Maryland
  * shaden@cs.umn.edu

  2. Table of Contents: Introduction, Accelerated AO-ADMM, Experiments, Conclusions

  3. Tensor Introduction
  ◮ Tensors are the generalization of matrices to higher dimensions.
  ◮ Allow us to represent and analyze multi-dimensional data.
  ◮ Applications in precision healthcare, cybersecurity, recommender systems, ...
  [Figure: a 3D tensor with modes patients × diagnoses × procedures.]

  4. Canonical polyadic decomposition (CPD)
  The CPD models a tensor as the summation of rank-1 tensors:
  minimize over A, B, C:  L(X, A, B, C) = || X − Σ_{f=1}^{F} A(:, f) ◦ B(:, f) ◦ C(:, f) ||_F^2
  Notation: A ∈ R^{I×F}, B ∈ R^{J×F}, and C ∈ R^{K×F} denote the factor matrices for a 3D tensor.
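To make the objective concrete, here is a minimal NumPy sketch (mine, not part of the talk) that reconstructs a small dense tensor from factor matrices and evaluates the squared-Frobenius loss; the names follow the slide's notation.

```python
import numpy as np

def cpd_reconstruct(A, B, C):
    """Sum of F rank-1 tensors: X_hat[i, j, k] = sum_f A[i, f] * B[j, f] * C[k, f]."""
    return np.einsum('if,jf,kf->ijk', A, B, C)

def cpd_loss(X, A, B, C):
    """Squared Frobenius norm of the residual, as in the slide's objective."""
    return np.linalg.norm(X - cpd_reconstruct(A, B, C)) ** 2

# Tiny example: I=4, J=3, K=5, rank F=2.
rng = np.random.default_rng(0)
A, B, C = rng.random((4, 2)), rng.random((3, 2)), rng.random((5, 2))
X = cpd_reconstruct(A, B, C)      # an exactly rank-2 tensor
print(cpd_loss(X, A, B, C))       # ~0 for an exact factorization
```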

  5. Alternating least squares (ALS)
  The CPD is most commonly computed with ALS:
  Algorithm 1 CPD-ALS
  1: while not converged do
  2:   A^T ← (C^T C ∗ B^T B)^{-1} (X_(1) (C ⊙ B))^T
  3:   B^T ← (C^T C ∗ A^T A)^{-1} (X_(2) (C ⊙ A))^T
  4:   C^T ← (B^T B ∗ A^T A)^{-1} (X_(3) (B ⊙ A))^T
  5: end while
  The inverted Gram term forms the normal equations; the X_(n)(· ⊙ ·) term is the MTTKRP (matricized tensor times Khatri-Rao product).
  Notation: ∗ denotes the Hadamard (elementwise) product and ⊙ the Khatri-Rao (column-wise Kronecker) product.
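A dense, toy NumPy sketch of one ALS sweep (my illustration; the talk's implementation is the sparse SPLATT code in C). Note the Khatri-Rao argument order here matches a row-major unfolding rather than the (C ⊙ B) ordering on the slide.

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization X_(n): rows are indexed by mode n."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def khatri_rao(P, Q):
    """Column-wise Khatri-Rao product of P (J x F) and Q (K x F) -> (J*K) x F."""
    return (P[:, None, :] * Q[None, :, :]).reshape(-1, P.shape[1])

def als_sweep(X, factors):
    """One CPD-ALS pass over the list [A, B, C]: MTTKRP + normal equations per mode."""
    for mode in range(3):
        others = [factors[m] for m in range(3) if m != mode]
        G = (others[0].T @ others[0]) * (others[1].T @ others[1])  # Hadamard of Grams
        M = unfold(X, mode) @ khatri_rao(*others)                  # MTTKRP
        factors[mode] = np.linalg.solve(G, M.T).T                  # normal equations
    return factors
```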

  6. Constrained factorization
  We often want to impose some constraints or regularizations on the factorization:
  minimize over A, B, C:  L(X, A, B, C) + r(A) + r(B) + r(C)
  where L(·) is the loss and the r(·) terms encode constraints/regularizations.
  Example: non-negative factorizations use the indicator function of R₊:
  r(A) = 0 if A ≥ 0, ∞ otherwise.
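The constraint or regularizer enters the ADMM updates on the next slides only through its proximal operator. As a hedged sketch of two common choices (my examples, not from the talk): projection onto the non-negative orthant for the indicator above, and soft-thresholding for an ℓ1 penalty.

```python
import numpy as np

def prox_nonneg(V, rho):
    """Prox of the indicator of {A >= 0}: elementwise projection onto R+."""
    return np.maximum(V, 0.0)

def prox_l1(V, rho, lam=0.1):
    """Prox of lam * ||A||_1 under penalty parameter rho: soft-thresholding."""
    return np.sign(V) * np.maximum(np.abs(V) - lam / rho, 0.0)
```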

  7. AO-ADMM [Huang & Sidiropoulos ’15]
  AO-ADMM combines alternating optimization (AO) with the alternating direction method of multipliers (ADMM).
  ◮ A, B, and C are updated in sequence using ADMM.

  8. AO-ADMM [Huang & Sidiropoulos ’15] (cont.)
  ADMM formulation for the update of A:
  minimize over A, Ã:  (1/2) || X_(1) − Ã^T (C ⊙ B)^T ||_F^2 + r(A)
  subject to A = Ã^T.

  9. Alternating optimization step (outer iterations)
  1: Initialize primal variables A, B, and C randomly.
  2: Initialize dual variables Â, B̂, and Ĉ with 0.
  3: repeat
  4:   G ← B^T B ∗ C^T C
  5:   K ← X_(1) (C ⊙ B)
  6:   A, Â ← ADMM(A, Â, K, G)
  7:   G ← A^T A ∗ C^T C
  8:   K ← X_(2) (C ⊙ A)
  9:   B, B̂ ← ADMM(B, B̂, K, G)
  10:  G ← A^T A ∗ B^T B
  11:  K ← X_(3) (B ⊙ A)
  12:  C, Ĉ ← ADMM(C, Ĉ, K, G)
  13: until L(X, A, B, C) ceases to improve.
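A compact NumPy sketch of this outer loop (mine), assuming the `unfold`, `khatri_rao`, and `cpd_loss` helpers from the earlier sketches and the `admm` routine sketched under the next slide; G and K are the Gram matrix and MTTKRP result handed to each ADMM call.

```python
import numpy as np

def ao_admm(X, F, prox, max_outer=200, tol=1e-6, seed=0):
    """Alternating optimization: cycle the ADMM update over the three factors."""
    rng = np.random.default_rng(seed)
    H = [rng.random((n, F)) for n in X.shape]   # primal factors A, B, C
    U = [np.zeros((n, F)) for n in X.shape]     # scaled dual variables
    prev = np.inf
    for _ in range(max_outer):
        for m in range(3):                      # update modes in sequence
            others = [H[n] for n in range(3) if n != m]
            G = (others[0].T @ others[0]) * (others[1].T @ others[1])
            K = unfold(X, m) @ khatri_rao(*others)        # MTTKRP
            H[m], U[m] = admm(H[m], U[m], K, G, prox)     # inner ADMM update
        err = cpd_loss(X, *H) / np.linalg.norm(X) ** 2    # relative error
        if prev - err < tol:                    # loss ceases to improve
            break
        prev = err
    return H
```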

  10. ADMM step (inner iterations)
  ADMM to update one factor matrix:
  1: Input: H, U, K, G
  2: Output: H, U
  3: ρ ← trace(G) / F
  4: L ← Cholesky(G + ρI)
  5: repeat
  6:   H₀ ← H
  7:   H̃^T ← L^{-T} L^{-1} (K + ρ(H + U))^T
  8:   H ← argmin_H r(H) + (ρ/2) || H − H̃^T + U ||_F^2
  9:   U ← U + H − H̃^T
  10:  r ← || H − H̃^T ||_F^2 / || H ||_F^2
  11:  s ← || H − H₀ ||_F^2 / || U ||_F^2
  12: until r < ε and s < ε
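A NumPy sketch of this inner loop for one factor (again mine, not the SPLATT kernel). The Cholesky factor of G + ρI is computed once and reused; `prox` is the constraint's proximal operator, e.g. `prox_nonneg` above, and `Ht` plays the role of H̃^T.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def admm(H, U, K, G, prox, eps=1e-2, max_iter=50):
    """ADMM update of one factor H, with scaled dual U, MTTKRP result K, Gram G."""
    F = G.shape[0]
    rho = np.trace(G) / F
    chol = cho_factor(G + rho * np.eye(F))               # factor once, reuse below
    for _ in range(max_iter):
        H0 = H
        Ht = cho_solve(chol, (K + rho * (H + U)).T).T    # least-squares step (line 7)
        H = prox(Ht - U, rho)                            # proximal step (line 8)
        U = U + H - Ht                                   # dual update (line 9)
        r = np.linalg.norm(H - Ht) ** 2 / np.linalg.norm(H) ** 2             # primal residual
        s = np.linalg.norm(H - H0) ** 2 / (np.linalg.norm(U) ** 2 + 1e-12)   # dual residual
        if r < eps and s < eps:
            break
    return H, U
```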

  11. Table of Contents: Introduction, Accelerated AO-ADMM, Experiments, Conclusions

  12. Parallelization opportunities
  All steps of the ADMM update on the previous slide except Line 8 (the proximal step) are either element-wise or row-wise independent.

  13. Performance opportunities
  1. The factor matrices are tall and skinny (e.g., 10⁶ × 50).
     ◮ The ADMM step will be bound by memory bandwidth.
  2. Real-world tensors have non-uniform distributions of non-zeros.
     ◮ This may lead to non-uniform convergence of the factor rows during ADMM.
  3. Many constraints and regularizations naturally induce sparsity in the factors.
     ◮ We can exploit this sparsity during MTTKRP (in paper).

  14. Blocked ADMM
  If the proximity operator arising from r(·) is row-separable, reformulate the ADMM problem to work on B blocks of rows:
  minimize over (A_1, Ã_1), ..., (A_B, Ã_B):  Σ_{b=1}^{B} [ (1/2) || (X_(1))_b − Ã_b^T (C ⊙ B)^T ||_F^2 + r(A_b) ]
  subject to A_b = Ã_b^T for b = 1, ..., B.
  Optimizing each block separately allows the blocks to converge at different rates, while also acting as a form of cache tiling.
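A sketch of the blocked reformulation under the same assumptions as before: the row blocks share one Gram matrix G, so each block simply reruns the `admm` routine sketched above on its own rows and stops early on its own residuals.

```python
def blocked_admm(H, U, K, G, prox, block=50, eps=1e-2, max_iter=50):
    """Run the ADMM update independently on consecutive blocks of rows.
    Each block converges at its own rate, and its working set stays cache-resident."""
    for start in range(0, H.shape[0], block):
        rows = slice(start, start + block)
        H[rows], U[rows] = admm(H[rows], U[rows], K[rows], G, prox,
                                eps=eps, max_iter=max_iter)
    return H, U
```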

  15. Blocked ADMM
  More simply: [figure illustrating the row-blocked ADMM formulation]

  16. Effects of block size
  The block size affects both convergence rate and computational efficiency:
  ◮ A block size of 1 optimizes each row of H independently.
  ◮ Larger block sizes better utilize hardware resources, but should be chosen to fit in cache.
  Our evaluation uses F = 50, and we experimentally found a block size of 50 rows to be a good balance between convergence rate and performance.
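A rough back-of-the-envelope check (my numbers, not the talk's) of why a 50-row block is cache-friendly at F = 50: the per-block working set of H, H̃, U, and the corresponding rows of K fits comfortably in a Haswell core's 256 KiB L2 cache.

```python
F = 50                    # factorization rank used in the evaluation
rows = 50                 # block size reported to work well
bytes_per_value = 8       # double precision
matrices = 4              # H, H-tilde, U, and the block of K (an assumption)
working_set = rows * F * bytes_per_value * matrices
print(f"{working_set / 1024:.0f} KiB per block")   # ~78 KiB << 256 KiB L2
```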

  17. Table of Contents: Introduction, Accelerated AO-ADMM, Experiments, Conclusions

  18. Experimental Setup
  Source code:
  ◮ Modified from SPLATT (https://github.com/ShadenSmith/splatt)
  ◮ Written in C and parallelized with OpenMP
  ◮ Compiled with icc v17.0.1 and linked with Intel MKL
  Machine specifications:
  ◮ 2 × 10-core Intel Xeon E5-2650v3 (Haswell)
  ◮ 396 GB RAM

  19. Convergence measurement
  We measure convergence based on the relative reconstruction error:
  relative error = L(X, A, B, C) / || X ||_F^2
  Termination:
  ◮ Convergence is detected when the relative error improves by less than 10⁻⁶ or after 200 outer iterations.
  ◮ ADMM is limited to 50 iterations with ε = 10⁻².
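For completeness, the same quantities as a small sketch, reusing `cpd_loss` from the earlier sketch:

```python
import numpy as np

def relative_error(X, A, B, C):
    """Relative reconstruction error as defined on this slide."""
    return cpd_loss(X, A, B, C) / np.linalg.norm(X) ** 2

def should_stop(prev_err, curr_err, outer_it, tol=1e-6, max_outer=200):
    """Terminate when the error improves by less than tol or after 200 outer iterations."""
    return (prev_err - curr_err) < tol or outer_it >= max_outer
```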

  20. Datasets
  We selected four tensors from the FROSTT collection (http://frostt.io/) based on non-negative factorization performance:
  ◮ require a non-trivial number of iterations
  ◮ have a factorization quality that suggests a non-negative CPD is appropriate

  Dataset   NNZ    I      J      K
  Reddit    95M    310K   6K     510K
  NELL      143M   3M     2M     25M
  Amazon    1.7B   5M     18M    2M
  Patents   3.5B   46     240K   240K

  21. Relative Factorization Costs
  Fraction of time spent in MTTKRP and ADMM during a rank-50 non-negative factorization:
  [Stacked bar chart: fraction of factorization time (MTTKRP, ADMM, other) for Reddit, NELL, Amazon, and Patents.]

  22. Parallel Scalability
  Blocked ADMM improves speedup when the factorization is dominated by ADMM:
  [Two speedup-vs-threads plots (baseline and blocked), 1 to 20 threads, for Reddit, NELL, Amazon, and Patents, with the ideal-scaling line.]

  23. Convergence: Reddit
  Blocking results in faster per-iteration runtimes and also converges in fewer iterations.
  [Plots: relative error vs. time (s) and vs. outer iteration for the baseline and blocked formulations.]

  24. Convergence: NELL
  Convergence is 3.7× faster with blocking, despite using additional iterations to achieve a lower error.
  [Plots: relative error vs. time (s) and vs. outer iteration for the baseline and blocked formulations.]

  25. Convergence: Amazon
  Both formulations exceed the maximum of 200 outer iterations, but the blocked formulation achieves a lower error in less time.
  [Plots: relative error vs. time (s) and vs. outer iteration for the baseline and blocked formulations.]

  26. Convergence: Patents
  Per-iteration runtimes are largely unaffected, as Patents is dominated by MTTKRP time; however, fewer iterations are required.
  [Plots: relative error vs. time (s) and vs. outer iteration for the baseline and blocked formulations.]

  27. Table of Contents: Introduction, Accelerated AO-ADMM, Experiments, Conclusions

  28. Wrapping up
  Blocked ADMM accelerates constrained tensor factorization in two ways:
  ◮ Optimizing blocks independently saves computation on the “simple” rows and better optimizes the “hard” rows.
  ◮ Blocks can be kept in cache during ADMM, saving memory bandwidth.
  Also in the paper:
  ◮ MTTKRP can be accelerated by exploiting the sparsity that dynamically evolves in the factors.
  ◮ An additional ~2× speedup is achieved.
  Future work:
  ◮ Analytical model for selecting block sizes.
  ◮ Automatic runtime selection of data structures for sparse factors.

  29. Reproducibility
  All of our work is open source (in the wip/ao-admm branch for now): https://github.com/ShadenSmith/splatt
  Datasets are freely available: http://frostt.io/

  30. Backup Slides
