PLANC: Parallel Low Rank Approximations with Non-negativity Constraints


  1. PLANC: Parallel Low Rank Approximations with Non-negativity Constraints. Ramakrishnan Kannan, Michael Matheson, Grey Ballard, Srinivas Eswar, Koby Hayashi, Haesun Park. Presented by Srinivas Eswar, Ph.D. student, School of CSE, Georgia Institute of Technology; advisors: Rich Vuduc and Haesun Park. January 26, 2019, Workshop on Compiler Techniques for Sparse Tensor Algebra. Acknowledgement: this work was partly sponsored by NSF, Sandia and ORNL.

  2. Summary. PLANC is an open source, scalable and flexible software package to compute Non-negative Tensor Factorisation. It implements a state-of-the-art communication-avoiding algorithm for the matricised-tensor times Khatri-Rao product (MTTKRP). Popular optimisation methods for Non-negative Least Squares are included, such as Block Principal Pivoting, the Alternating Direction Method of Multipliers, and first-order Nesterov-type methods. NTF is an important contributor towards explainable AI, with a wide range of applications such as spectral unmixing, scientific visualization, healthcare analytics and topic modelling.

  3. CP Decomposition. Matrix: M ≈ Σ_{r=1}^{R} (σ_r u_r) v_r^T. Tensor: X ≈ Σ_{r=1}^{R} (λ_r u_r) ◦ v_r ◦ w_r. This is known as the CANDECOMP, PARAFAC, canonical polyadic or CP decomposition. It approximates a tensor as a sum of outer products, i.e. rank-1 tensors. NNCP imposes non-negativity constraints on the factor matrices to aid interpretability.

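To make the sum-of-rank-1-terms picture concrete, here is a minimal NumPy sketch (not PLANC's API; the sizes and the names U, V, W, lam are purely illustrative) that assembles a 3-way CP model from its factor matrices.

```python
# Minimal NumPy sketch (not PLANC's API): build a 3-way CP model
# X ≈ sum_r lambda_r * u_r ◦ v_r ◦ w_r from its factor matrices.
import numpy as np

I, J, K, R = 4, 5, 6, 3                      # tensor dimensions and CP rank (illustrative)
U = np.random.rand(I, R)                     # non-negative factor matrices
V = np.random.rand(J, R)
W = np.random.rand(K, R)
lam = np.random.rand(R)                      # component weights

# Sum of R rank-1 (outer-product) tensors.
X_hat = np.einsum('r,ir,jr,kr->ijk', lam, U, V, W)
print(X_hat.shape)                           # (4, 5, 6)
```
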
  4. Computational Bottlenecks. The MTTKRP is the major bottleneck for NNCP: M_(1) = X_(1) (W ⊙ V), i.e. m_{ir} = Σ_{j=1}^{J} Σ_{k=1}^{K} x_{ijk} v_{jr} w_{kr}. The standard approach is to explicitly matricise the tensor and form the Khatri-Rao product before calling DGEMM. Can we do better? Avoid matricisation of the tensor and full Khatri-Rao products ...

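The "standard approach" on the slide can be sketched in a few lines of NumPy. This is an illustrative reference implementation of the naive mode-1 MTTKRP, not PLANC's code; the only requirement is that the index ordering of the matricisation and of the Khatri-Rao product stay consistent with each other.

```python
# Sketch of the standard (naive) mode-1 MTTKRP: matricise the tensor,
# form the full Khatri-Rao product, then do one big matrix multiply
# (the DGEMM call mentioned on the slide). Names and sizes are illustrative.
import numpy as np

I, J, K, R = 4, 5, 6, 3
X = np.random.rand(I, J, K)
V = np.random.rand(J, R)
W = np.random.rand(K, R)

X1 = X.reshape(I, J * K)                                  # mode-1 matricisation X_(1), size I x JK
KR = np.einsum('kr,jr->jkr', W, V).reshape(J * K, R)      # Khatri-Rao product, size JK x R
M1 = X1 @ KR                                              # DGEMM: MTTKRP result, size I x R

# Check against the elementwise definition m_ir = sum_jk x_ijk v_jr w_kr.
assert np.allclose(M1, np.einsum('ijk,jr,kr->ir', X, V, W))
```
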
  5. Communication Lower Bounds. Following the nested-arrays lower bounds [BKR18]. Theorem: any parallel MTTKRP algorithm involving a tensor with I_k = I^{1/N} for all k, and that evenly distributes one copy of the input and output, performs at least Ω( (NIR/P)^{N/(2N−1)} + NR (I/P)^{1/N} ) sends and receives. (Either term can dominate.) Key assumption: the algorithm is not allowed to pre-compute and re-use temporary values. Ω( NR (I/P)^{1/N} ) is the most frequently occurring case for relatively small P or R.

  6. Shared Memory Optimisation - Dimension Trees. Reuse computations across MTTKRPs: M_(1) = X_(1) (U^(3) ⊙ U^(2)) and M_(2) = X_(2) (U^(3) ⊙ U^(1)) share work. Utilise a "dimension tree" to store and reuse partial products [PTC13, LKL+17, HBJT18]. [Diagram: the root node {1, 2, 3} is split by partial MTTKRPs (PM) into the intermediate node {1, 2} and the leaf M_(3); multi-Tensor-Times-Vector (mTTV) operations on {1, 2} then produce M_(1) and M_(2).]

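A minimal NumPy sketch of the reuse idea, assuming a 3-way tensor: the expensive contraction with U3 (the partial MTTKRP) is done once and then reused by two cheap multi-TTVs. Names and sizes are illustrative; this is not PLANC's dimension-tree implementation.

```python
# Illustrative NumPy sketch of the dimension-tree idea for a 3-way tensor
# (not PLANC's implementation): contract the tensor with U3 once (partial
# MTTKRP, the expensive step), then reuse that partial product to obtain
# both M1 and M2 with cheap multi-TTVs.
import numpy as np

I, J, K, R = 4, 5, 6, 3
X = np.random.rand(I, J, K)
U1, U2, U3 = np.random.rand(I, R), np.random.rand(J, R), np.random.rand(K, R)

T12 = np.einsum('ijk,kr->ijr', X, U3)        # partial MTTKRP: node {1, 2}
M1 = np.einsum('ijr,jr->ir', T12, U2)        # mTTV: M_(1) = X_(1)(U3 ⊙ U2)
M2 = np.einsum('ijr,ir->jr', T12, U1)        # mTTV: M_(2) = X_(2)(U3 ⊙ U1)

# Same results as computing each MTTKRP from scratch.
assert np.allclose(M1, np.einsum('ijk,jr,kr->ir', X, U2, U3))
assert np.allclose(M2, np.einsum('ijk,ir,kr->jr', X, U1, U3))
```
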
  7. Distributed Memory Optimisation - Communication Avoiding. Each processor: (1) starts with one subtensor and a subset of rows of each input factor matrix; (2) All-Gathers all the rows it needs from U^(1); (3) All-Gathers all the rows it needs from U^(3); (4) computes its contribution to the rows of M_(2) (local MTTKRP); (5) Reduce-Scatters to compute and distribute M_(2) evenly; (6) solves its local NLS problem using M_(2) and U^(2).

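The six steps above can be sketched with mpi4py collectives. The processor-grid construction, toy sizes and variable names below are assumptions made for illustration only; this is not PLANC's C++/MPI implementation.

```python
# Hedged mpi4py + NumPy sketch of the communication-avoiding mode-2 MTTKRP
# pattern: all-gather the needed factor rows, local MTTKRP, reduce-scatter M2.
# Run with e.g.:  mpiexec -n 8 python ca_mttkrp_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
p1, p2, p3 = MPI.Compute_dims(comm.Get_size(), 3)     # 3-way processor grid
cart = comm.Create_cart([p1, p2, p3])

# Sub-communicators of the grid:
fiber1 = cart.Sub([False, True, True])   # procs sharing this proc's mode-1 range (own the U1 rows we need)
fiber3 = cart.Sub([True, True, False])   # procs sharing this proc's mode-3 range (own the U3 rows we need)
slab2  = cart.Sub([True, False, True])   # procs contributing to the same rows of M2

R = 4
n1, n2, n3 = fiber1.Get_size(), slab2.Get_size(), fiber3.Get_size()   # toy local block dims
Xloc = np.random.rand(n1, n2, n3)        # step 1: this proc's subtensor ...
U1_own = np.random.rand(1, R)            # ... and its owned rows of each factor (one row each here)
U3_own = np.random.rand(1, R)

U1_need = np.vstack(fiber1.allgather(U1_own))   # step 2: All-Gather needed rows of U1 -> (n1, R)
U3_need = np.vstack(fiber3.allgather(U3_own))   # step 3: All-Gather needed rows of U3 -> (n3, R)

M2_contrib = np.einsum('ijk,ir,kr->jr', Xloc, U1_need, U3_need)   # step 4: local MTTKRP -> (n2, R)

M2_own = np.empty((1, R))                # step 5: Reduce-Scatter sums contributions and
slab2.Reduce_scatter_block(M2_contrib, M2_own, op=MPI.SUM)        # leaves each proc one row of M2

# Step 6 would solve the local non-negative least squares problem using M2_own and U^(2).
```
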
  8. Performance Plots - Strong Scaling. [Stacked bar chart of running time (in seconds) per algorithm-and-node-count configuration (alg-p), broken down into gram, nnls, mttkrp, multittv, reducescatter, allgather and allreduce components.] Figure: strong scaling on a synthetic tensor of dimensions 256 × 256 × 256 × 256 on 8, 16, 32, 64 and 128 nodes of Titan. Can achieve nearly linear scaling since NNCP is compute bound.

  9. Performance Plots - CPU vs GPU. [Two line plots of total time (in seconds) versus low rank k from 20 to 100, one for CPU and one for GPU, comparing the MU, HALS, ANLS/BPP, AO-ADMM, Nesterov and CP-ALS algorithms.] Figure: 4D synthetic tensor of dimensions 384 × 384 × 384 × 384 on 81 Titan nodes arranged as a 3 × 3 × 3 × 3 grid, with varying low rank. Offloading DGEMM calls to the GPU can provide a 7X speedup.

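As an illustration of the offloading idea only (not PLANC's actual GPU path, which drives the BLAS from C++), a minimal CuPy sketch moves the two DGEMM operands to the device, multiplies there, and copies the MTTKRP result back.

```python
# Minimal sketch (using CuPy purely for illustration) of offloading the
# dense matrix multiply in the MTTKRP to the GPU.
import numpy as np
import cupy as cp

I, JK, R = 1024, 4096, 64
X1 = np.random.rand(I, JK)                   # matricised tensor (host)
KR = np.random.rand(JK, R)                   # Khatri-Rao product (host)

X1_d, KR_d = cp.asarray(X1), cp.asarray(KR)  # copy operands to the GPU
M1_d = X1_d @ KR_d                           # DGEMM executes on the device
M1 = cp.asnumpy(M1_d)                        # bring the MTTKRP result back to the host
```
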
  10. Compiler Challenges and Extensions to the Sparse Setting. (1) Dimension tree ordering: combinatorial explosion in the sparse case (contrasted with the single split choice in the dense case); the sparse case also involves growth in intermediate values. (2) Communication pattern establishment and load balancing: automatic communicator setup given a processor grid and tensor operation; automatic data distribution using the communication-avoiding loop optimisation [Kni15, DR16]. (3) Block parallelism in least squares solvers: Active Set orderings can be grouped into an embarrassingly parallel call, as sketched below; the sparse case with a masking matrix has a similar right-hand-side pattern. (4) Binary bloat: separate binaries for GPU/CPU and Sparse/Dense.

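Item (3) rests on the fact that the per-column NLS subproblems share the same Gram-like matrix and are otherwise independent. Below is a hedged sketch using SciPy's nnls as a stand-in for PLANC's block-principal-pivoting solver; names and sizes are illustrative.

```python
# Hedged sketch of the "embarrassingly parallel" structure of the NLS solves:
# each column of the right-hand side is an independent non-negative least
# squares problem that can be dispatched to a separate worker.
import numpy as np
from scipy.optimize import nnls
from multiprocessing import Pool

R, n_cols = 16, 64
A = np.random.rand(R, R)          # Gram-like left-hand side, shared by all columns
B = np.random.rand(R, n_cols)     # right-hand sides, e.g. rows of an MTTKRP result

def solve_column(args):
    # min_x ||A x - b||_2  subject to  x >= 0
    A, b = args
    x, _ = nnls(A, b)
    return x

if __name__ == "__main__":
    with Pool() as pool:                                     # independent columns in parallel
        cols = pool.map(solve_column, [(A, B[:, j]) for j in range(n_cols)])
    H = np.column_stack(cols)                                # non-negative solution, R x n_cols
    print(H.shape)
```
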
  11. Summary. PLANC is an open source, scalable and flexible software package to compute Non-negative Tensor Factorisation. It implements a state-of-the-art communication-avoiding algorithm for the matricised-tensor times Khatri-Rao product (MTTKRP). Popular optimisation methods for Non-negative Least Squares are included, such as Block Principal Pivoting, the Alternating Direction Method of Multipliers, and first-order Nesterov-type methods. NTF is an important contributor towards explainable AI, with a wide range of applications such as spectral unmixing, scientific visualization, healthcare analytics and topic modelling. Coming soon as a miniapp on OLCF machines. https://github.com/ramkikannan/planc
