Scalable Tensor Computations with Cyclops and Faster Algorithms for Alternating Least Squares


  1. Scalable Tensor Computations with Cyclops and Faster Algorithms for Alternating Least Squares
     Edgar Solomonik
     Department of Computer Science, University of Illinois at Urbana-Champaign
     Invited Workshop on Compiler Techniques for Sparse Tensor Algebra
     Cambridge, MA, Jan 26, 2019

  2. A library for parallel tensor computations
     Cyclops Tensor Framework (CTF) [1]: C++ (MPI/OpenMP) ⇒ Python
     distributed-memory symmetric/sparse/dense tensor objects

         Matrix<int> A(n, n, AS|SP, World(MPI_COMM_WORLD));
         Tensor<float> T(order, is_sparse, dims, syms, ring, world);
         T.read(...); T.write(...); T.slice(...); T.permute(...);

     parallel contraction/summation of tensors

         Z["abij"] += V["ijab"];                     // C++
         Z.i("abij") << V.i("ijab")                  // Python
         W["mnij"] += 0.5*W["mnef"]*T["efij"];       // C++
         W.i("mnij") << 0.5*W.i("mnef")*T.i("efij")  // Python
         einsum("mnef,efij->mnij", W, T)             // numpy-style Python

     ∼2000 commits since 2011, open source since 2013
     [1] E.S., D. Matthews, J.R. Hammond, J. Demmel, JPDC 2014
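
To make the index notation on this slide concrete, here is a minimal sketch in plain Python (nested lists, no CTF and no numpy) of what the contraction `einsum("mnef,efij->mnij", W, T)` computes, with the 0.5 scale factor from the slide. The function name `contract_mnef_efij` and the tiny example tensors are illustrative assumptions, not part of Cyclops; unlike the C++ `+=` form, this sketch returns the product rather than accumulating into an existing tensor.

```python
def contract_mnef_efij(W, T, scale=1.0):
    """Return Z[m][n][i][j] = scale * sum over e,f of W[m][n][e][f] * T[e][f][i][j].

    Hypothetical helper for illustration only; tensors are dense nested lists.
    """
    M, N = len(W), len(W[0])
    E, F = len(T), len(T[0])
    I, J = len(T[0][0]), len(T[0][0][0])
    Z = [[[[0.0] * J for _ in range(I)] for _ in range(N)] for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for e in range(E):
                for f in range(F):
                    w = W[m][n][e][f]
                    if w == 0:
                        continue  # zero entries contribute nothing
                    for i in range(I):
                        for j in range(J):
                            Z[m][n][i][j] += scale * w * T[e][f][i][j]
    return Z


# Tiny example: W has shape 1x1x2x2, T has shape 2x2x1x1.
W = [[[[1, 2], [3, 4]]]]
T = [[[[1]], [[0]]], [[[0]], [[2]]]]
Z = contract_mnef_efij(W, T, scale=0.5)
print(Z[0][0][0][0])  # 0.5 * (1*1 + 4*2)
```

The summed indices (`e`, `f`) are exactly those that appear in both inputs but not in the output string, which is the convention the bracket notation on the slide relies on.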

  3. Electronic structure calculations with Cyclops
     Coupled cluster engine in Aquarius (Devin Matthews)
     [Figure: Weak scaling on BlueGene/Q; Teraflops vs. #nodes (512 to 32768); series: Aquarius-CTF CCSD, Aquarius-CTF CCSDT]
     Cyclops works with QChem, VASP, CC4S, Psi4, and PySCF
     Is also being used for other applications, e.g. by IBM+LLNL collaboration to perform 49-qubit quantum circuit simulation [2]
     [2] E. Pednault et al., arXiv:1710.05867

  4. Sparse MP3 code
     Strong and weak scaling of sparse MP3 code, with
       (1) dense V and T
       (2) sparse V and dense T
       (3) sparse V and T
     [Figure: Strong scaling of MP3 with no=40, nv=160; seconds/iteration vs. #cores (24 to 768)]
     [Figure: Weak scaling of MP3 with no=40, nv=160; seconds/iteration vs. #cores (24 to 6144)]
     Series in both plots: dense*dense; 10%, 1%, and .1% sparse*dense; 10%, 1%, and .1% sparse*sparse

  5. Special operator application: betweenness centrality
     Betweenness centrality code snippet, for k of n nodes

         void btw_central(Matrix<int> A, Matrix<path> P, int n, int k){
           // monoid on paths: keep the lighter path; on ties, add multiplicities
           Monoid<path> mon(...,
             [](path a, path b){
               if (a.w<b.w) return a;
               else if (b.w<a.w) return b;
               else return path(a.w, a.m+b.m);
             }, ...);
           Matrix<path> Q(n,k,mon); // shortest path matrix
           Q["ij"] = P["ij"];
           // extend path p by an edge of weight w
           Function<int,path> append([](int w, path p){
             return path(w+p.w, p.m);
           });
           for (int i=0; i<n; i++)
             Q["ij"] = append(A["ik"],Q["kj"]);
           ...
         }
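
The path monoid and the `append` function in the snippet above can be tried out without Cyclops. The following is a minimal sketch in plain Python, assuming dense nested lists, strictly positive edge weights, and paths stored as `(weight, multiplicity)` tuples. The names `combine`, `shortest_path_counts`, and the re-seeding of each iteration with the identity paths `P` are assumptions made for this illustration, since the slide elides the rest of the routine.

```python
INF = float("inf")
IDENTITY = (INF, 0)  # "no path": infinite weight, zero multiplicity


def combine(a, b):
    """Monoid from the slide: keep the lighter path; on ties, add multiplicities."""
    if a[0] < b[0]:
        return a
    if b[0] < a[0]:
        return b
    return (a[0], a[1] + b[1])


def append(w, p):
    """Extend path p by one edge of weight w (multiplicity is unchanged)."""
    return (w + p[0], p[1])


def shortest_path_counts(A):
    """A[i][k] is the weight of edge i->k, or None if there is no edge.

    Returns Q with Q[i][j] = (shortest distance, number of shortest paths)
    from i to j, computed by n-1 Bellman-Ford style relaxations.
    """
    n = len(A)
    # P holds the empty path from each vertex to itself.
    P = [[(0, 1) if i == j else IDENTITY for j in range(n)] for i in range(n)]
    Q = [row[:] for row in P]
    for _ in range(n - 1):
        Q_new = [row[:] for row in P]
        for i in range(n):
            for k in range(n):
                if A[i][k] is None:
                    continue
                for j in range(n):
                    # Q["ij"] accumulates append(A["ik"], Q["kj"]) over k
                    Q_new[i][j] = combine(Q_new[i][j], append(A[i][k], Q[k][j]))
        Q = Q_new
    return Q


# Diamond graph: 0->1, 0->2, 1->3, 2->3, all with weight 1.
A = [[None, 1, 1, None],
     [None, None, None, 1],
     [None, None, None, 1],
     [None, None, None, None]]
Q = shortest_path_counts(A)
print(Q[0][3])  # two shortest paths of weight 2 from vertex 0 to vertex 3
```

Replacing ordinary addition and multiplication by `combine` and `append` is what turns a matrix product into a shortest-path relaxation that also counts tied paths.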

  6. Betweenness Centrality on R-MAT Graphs
     [Figure, left: Strong scaling of MFBC and CombBLAS for R-MAT S=22; MTEPS/node vs. #nodes; series: CTF-MFBC, CA-MFBC, and CombBLAS, each for E=128 and E=8]
     [Figure, right: Strong scaling of three versions of MFBC for R-MAT S=22; MTEPS/node vs. #nodes; series: adapt=sparse*sparse, dense=sparse*sparse, and dense=sparse*dense, each for E=128 and E=8]
     Left plot compares CTF-MFBC with CombBLAS and with CA-MFBC (statically-mapped comm-efficient matrix distribution)
     Right plot compares matrix representations (including push/pull)
       adjacency matrix: sparse for all versions
       frontier: sparse or dense rectangular matrix
       vertices adjacent to frontier (output): sparse or dense rectangular matrix

  7. Tensor Decomposition Algorithms
     Tensor decomposition algorithms generally use a variant of gradient descent or alternating least squares (ALS)
     ALS is effective for CP and Tucker as well as MPS/PEPS/DMRG
       update each site/factor in network individually by quadratic optimization [3]
     [3] Holtz, Rohwedder, and Schneider, SISC 2012
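
As a small illustration of updating each factor individually by quadratic optimization, here is a sketch of ALS for the simplest case: a rank-1 CP approximation of an order-3 tensor, written in plain Python. With the other two factors fixed, the least-squares optimum for one factor has a closed form, so no linear solver is needed. The function names `rank1_als` and `residual`, the all-ones initialization, and the example tensor are assumptions for this illustration; general rank-R CP-ALS instead solves a small linear system per factor, and this sketch assumes no factor ever becomes the zero vector.

```python
def rank1_als(T, sweeps=10):
    """Fit T[i][j][k] ~ a[i]*b[j]*c[k] by alternating least squares (rank 1 only)."""
    I, J, K = len(T), len(T[0]), len(T[0][0])
    a, b, c = [1.0] * I, [1.0] * J, [1.0] * K

    def sq(v):
        return sum(x * x for x in v)

    for _ in range(sweeps):
        # Each update is the exact minimizer of the squared error
        # in one factor while the other two are held fixed.
        a = [sum(T[i][j][k] * b[j] * c[k] for j in range(J) for k in range(K))
             / (sq(b) * sq(c)) for i in range(I)]
        b = [sum(T[i][j][k] * a[i] * c[k] for i in range(I) for k in range(K))
             / (sq(a) * sq(c)) for j in range(J)]
        c = [sum(T[i][j][k] * a[i] * b[j] for i in range(I) for j in range(J))
             / (sq(a) * sq(b)) for k in range(K)]
    return a, b, c


def residual(T, a, b, c):
    """Sum of squared differences between T and the rank-1 reconstruction."""
    return sum((T[i][j][k] - a[i] * b[j] * c[k]) ** 2
               for i in range(len(a)) for j in range(len(b)) for k in range(len(c)))


# Build an exactly rank-1 tensor from known vectors and recover it.
u, v, w = [1, 2], [1, 3, 2], [2, 1]
T = [[[ui * vj * wk for wk in w] for vj in v] for ui in u]
a, b, c = rank1_als(T)
err = residual(T, a, b, c)
print(err)
```

The recovered factors match `u`, `v`, `w` only up to scaling, since a rank-1 term is unchanged when one factor is multiplied by a constant and another is divided by it; the reconstruction error is the meaningful check.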

  8. Accelerating Alternating Least Squares
     Dimension trees amortize cost across quadratic subproblems
     Pairwise perturbation (PP) approximates ALS with less cost [4], specifically for rank R decomposition for order N and s × · · · × s tensor

                 dimension tree ALS sweep    PP setup    PP approximate sweep
       CP        4 s^N R                     6 s^N R     2 N s^2 R
       Tucker    4 s^N R                     6 s^N R     2 N s^2 R^(N−1)

     Cyclops-based implementation of PP shows improvements over regular dimension tree ALS for both synthetic and real-world tensors
     [4] Linjian Ma and E.S., arXiv:1811.10573

  9. Conclusion
     Summary
       Cyclops is a distributed-memory sparse/dense tensor library
         has seen adoption in quantum chemistry and quantum circuit simulation
         supports general semirings, efficient parallel graph algorithms
       Pairwise perturbation is a first-order-accurate approximation to ALS
         it is asymptotically faster in theory and 2-3X faster in practice
     In-progress/future work
       Sparse tensor completion with Cyclops using ALS/CCD/SGD
       Perturbative ALS with low-rank updates
     Acknowledgements
       Devin Matthews (UT Austin), Jeff Hammond (Intel Corp.), Maciej Besta, Flavio Vella, Torsten Hoefler (ETH Zurich), Zecheng Zhang (UIUC), Linjian Ma, James Demmel (UC Berkeley)
       Computational resources at NERSC, CSCS, ALCF, NCSA, and TACC
