A Sparse Tensor Format and a Benchmark Suite
Jiajia Li, Pacific Northwest National Laboratory
January 25, 2019 @ MIT
Figure sources: “A brief survey of tensors” by Berton Earnshaw and NVIDIA Tensor Cores
HiCOO: Hierarchical Storage of Sparse Tensors
Jiajia Li 1,2, Jimeng Sun 1, Richard Vuduc 1
1 Georgia Institute of Technology, 2 Pacific Northwest National Laboratory
SUNLAB
Code: https://github.com/hpcgarage/ParTI (v1.0.0)
Figure sources: “A brief survey of tensors” by Berton Earnshaw and NVIDIA Tensor Cores
Challenges
• Compactness: a space-efficient data structure
• Mode-genericity: efficient traversals of the data structure for computations
The concept of “mode-genericity” is inherited from [Baskaran et al. 2012].
[Baskaran et al. 2012] M. Baskaran et al., “Efficient and scalable computations with sparse tensors,” HPEC 2012.
Baseline Sparse Tensor Formats in This Work
• COO: coordinate format [Bader et al. 2006]
• CSF: compressed sparse fibers, an extension of CSR [Smith et al. 2015]
• F-COO: flagged COO format [Liu et al. 2017]
[Figure: an I x J x K sparse tensor stored as (a) COO, (b) CSF, and (c) F-COO. COO is mode-generic; CSF and F-COO are mode-specific and prefer different representations for different modes.]
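For reference, a minimal C sketch of the COO baseline (illustrative, not the ParTI structs): one index array per mode plus the values, which every kernel traverses in the same mode-generic way.

```c
#include <stdint.h>

/* Illustrative third-order COO tensor: parallel index arrays plus values.
 * Kernels in any mode loop over the nnz entries in the same way. */
typedef struct {
    uint64_t  nnz;          /* number of nonzero entries */
    uint32_t *i, *j, *k;    /* mode-1/2/3 coordinates of each nonzero */
    double   *val;          /* nonzero values */
} coo3_t;
```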
Mode-Specific Tensor Formats
Three CSF/F-COO representations are required/preferred for three kernels.
[Figure: the mode-1, mode-2, and mode-3 kernels of a tensor decomposition, each fed by its own representation (CSF-1, CSF-2, CSF-3) of the same tensor.]
Mode-Specific Tensor Formats
Three CSF/F-COO representations are required/preferred for three kernels.
[Figure: with only CSF-1 available, the mode-1 kernel of a tensor decomposition is served, while performance drops for the mode-2 and mode-3 kernels.]
Mode Orientation
[Table: for the mode-1, mode-2, and mode-3 kernels of a tensor decomposition, a mode-1-oriented mode-specific format (CSF/F-COO) is compared against the mode-generic formats Coordinate (COO) and HiCOO; each cell is marked efficient or inefficient.]
HiCOO Format
• Store a sparse tensor in units of small sparse blocks.
[Figure: an I x J x K sparse tensor stored in COO (i, j, k, val) and in HiCOO, where the nonzeros are grouped into blocks B1–B4 of size 2*2*2 indexed by (bptr, bi, bj, bk), with per-nonzero element indices (ei, ej, ek) and values.]
HiCOO extends the Compressed Sparse Blocks (CSB) format for matrices [Buluc et al., SPAA 2009].
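The blocked layout above can be held as a few flat arrays. A minimal sketch in C for a third-order tensor; the struct and field names are illustrative, not the ParTI implementation.

```c
#include <stdint.h>

/* Illustrative HiCOO layout for a third-order tensor. Each nonzero block keeps
 * 32-bit block indices and a pointer into the nonzero range; the nonzeros
 * themselves keep only 8-bit in-block offsets and their values. */
typedef struct {
    uint64_t  nnz;            /* number of nonzeros */
    uint64_t  nnb;            /* number of nonzero blocks */
    uint64_t *bptr;           /* nnb+1 entries: start of each block's nonzero range */
    uint32_t *bi, *bj, *bk;   /* per-block indices (32-bit) */
    uint8_t  *ei, *ej, *ek;   /* per-nonzero in-block offsets (8-bit) */
    double   *val;            /* nonzero values */
} hicoo3_t;
```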
HiCOO Format
• Store a sparse tensor in units of small sparse blocks.
• Shorten the bit-length of element indices: block indices (bi, bj, bk) use 32 bits, element indices (ei, ej, ek) use 8 bits.
[Figure: the same COO vs. HiCOO layout as before, with block size 2*2*2.]
A global index is recovered as i = bi * B + ei.
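A minimal C sketch of that index recovery (the function name is illustrative, not the ParTI API):

```c
#include <stdint.h>

/* Recover a global mode index from a block index and an 8-bit in-block offset,
 * following the slide's formula i = bi * B + ei. With a power-of-two block
 * size (the slide uses B = 2) this is just a shift plus the low bits. */
static inline uint32_t hicoo_global_index(uint32_t bi, uint8_t ei, uint32_t B)
{
    return bi * B + ei;   /* equivalently (bi << log2(B)) | ei for power-of-two B */
}
```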
HiCOO Format
• Store a sparse tensor in units of small sparse blocks.
• Shorten the bit-length of element indices (32-bit block indices, 8-bit element indices).
• Compress the number of block indices.
[Figure: the same COO vs. HiCOO layout as before.]
HiCOO Format
• Store a sparse tensor in units of small sparse blocks.
• Shorten the bit-length of element indices (32-bit block indices, 8-bit element indices).
• Compress the number of block indices.
Index storage (in bits):
• COO indices: nnz * 3 * 32
• HiCOO indices: nnz * 3 * 8 + nnb * (3 * 32 + 32)
where i = bi * B + ei, nnz is the number of nonzeros, and nnb is the number of nonzero blocks.
[Figure: the same COO vs. HiCOO layout as before.]
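The two index-storage formulas above translate directly into code; a small sketch that evaluates them, in bits, for a given nnz and nnb (the function names are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

/* Index storage in bits for a third-order tensor, following the slide:
 * COO keeps three 32-bit indices per nonzero; HiCOO keeps three 8-bit element
 * indices per nonzero plus three 32-bit block indices and a 32-bit block
 * pointer per nonzero block. */
static uint64_t coo_index_bits(uint64_t nnz)                 { return nnz * 3 * 32; }
static uint64_t hicoo_index_bits(uint64_t nnz, uint64_t nnb) { return nnz * 3 * 8 + nnb * (3 * 32 + 32); }

int main(void)
{
    uint64_t nnz = 8, nnb = 4;   /* the toy tensor on the slide: 8 nonzeros in 4 blocks */
    printf("COO:   %llu bits\n", (unsigned long long)coo_index_bits(nnz));
    printf("HiCOO: %llu bits\n", (unsigned long long)hicoo_index_bits(nnz, nnb));
    return 0;
}
```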
HiCOO Format
• Store a sparse tensor in units of small sparse blocks.
• Shorten the bit-length of element indices (32-bit block indices, 8-bit element indices).
• Compress the number of block indices.
• Works for arbitrary-order sparse tensors.
For a sparse tensor, HiCOO reduces its storage and memory footprint; for matrices, it gives better data locality.
[Figure: the same COO vs. HiCOO layout as before.]
Platform and Dataset
Platform: an Intel Xeon E7-4850 v3 system with 56 physical cores, compiled with icc 18.0.2 and parallelized with OpenMP.
Dataset: FROSTT [Smith et al. 2017], HaTen2 [Jeon et al. 2015], and healthcare data [Perros et al. 2017].
Multicore CP-ALS
HiCOO outperforms COO by 6.2× and CSF by 2.1× on average.
[Figure: speedup over CSF vs. compression ratio relative to CSF (higher is better on both axes) for 3D and 4D tensors (choa, crime, darpa, fb-m, fb-s, nips, nell1, nell2, flickr, deli, deli4d, enron), comparing HiCOO, CSF-1, and COO.]
Following Work
• HiCOO for other tensor operations and Tucker decomposition
• HiCOO-based MTTKRP/CPD on GPUs and distributed systems
PASTA: A Parallel Sparse Tensor Algorithm Benchmark Suite
Jiajia Li 1, Yuchen Ma 2, Xiaolong Wu 3, Ang Li 1, Kevin Barker 1
1 Pacific Northwest National Laboratory, 2 Hangzhou Dianzi University, 3 Virginia Tech
Code: https://gitlab.com/tensorworld/pasta
Figure sources: “A brief survey of tensors” by Berton Earnshaw and NVIDIA Tensor Cores
PASTA Workloads
Workloads: TEW (tensor element-wise), TS (tensor-scalar), TTV (tensor-times-vector), TTM (tensor-times-matrix), and MTTKRP (matricized tensor-times-Khatri-Rao product).
Platforms: single-core CPUs and multi-core CPUs.
Data structures/algorithms: COO.
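As a concrete example of one workload on the single-core CPU row, a sketch of a mode-3 TTV over a COO tensor (illustrative, not the PASTA code; the output is kept as a dense I x J array here):

```c
#include <stdint.h>

/* Mode-3 TTV on a COO tensor: Y(i,j) = sum_k X(i,j,k) * v(k).
 * Y is a dense I x J array (row-major), assumed zero-initialized by the caller. */
void ttv_mode3_coo(uint64_t nnz, const uint32_t *i, const uint32_t *j,
                   const uint32_t *k, const double *val,
                   const double *v, double *Y, uint64_t J)
{
    for (uint64_t x = 0; x < nnz; ++x)
        Y[(uint64_t)i[x] * J + j[x]] += val[x] * v[k[x]];
}
```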
PASTA Workloads
• Supports tensors of arbitrary shape and nonuniform nonzero pattern.
[Table: the same workloads (TEW, TS, TTV, TTM, MTTKRP), platforms (single-core and multi-core CPUs), and COO data structure as before.]
PASTA Workloads
• Each workload is parallelized with a natural strategy: over nonzero partitions, over nonzeros, over nonzero fibers, or over nonzeros with atomics (the per-kernel mapping is shown in the slide's table).
[Table: the same workloads, platforms, and COO data structure as before, annotated with the parallelization strategy used for each kernel.]
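A sketch of the "parallelize nonzeros with atomics" strategy applied to a mode-1 COO MTTKRP with OpenMP (illustrative, not the PASTA implementation):

```c
#include <stdint.h>
#include <omp.h>

/* Mode-1 COO MTTKRP: for each nonzero (i,j,k,v),
 *   M[i][r] += v * B[j][r] * C[k][r]  for r = 0..R-1.
 * Threads split the nonzeros; atomics guard the shared output rows of M.
 * B is J x R, C is K x R, M is I x R, all dense and row-major. */
void mttkrp_mode1_coo(uint64_t nnz, const uint32_t *i, const uint32_t *j,
                      const uint32_t *k, const double *val,
                      const double *B, const double *C, double *M, int R)
{
    #pragma omp parallel for schedule(static)
    for (int64_t x = 0; x < (int64_t)nnz; ++x) {
        const double  v  = val[x];
        const double *Bj = B + (uint64_t)j[x] * R;
        const double *Ck = C + (uint64_t)k[x] * R;
        double       *Mi = M + (uint64_t)i[x] * R;
        for (int r = 0; r < R; ++r) {
            #pragma omp atomic
            Mi[r] += v * Bj[r] * Ck[r];
        }
    }
}
```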
Memory-Bound Workloads
Following Work
• Include HiCOO, CSF, and other formats
• Support GPUs, and FPGAs in the longer term
Other Recent Work
• A dynamic sparse tensor structure for tensor contraction
  Collaborators: Sriram Krishnamoorthy (PNNL)
  Application: quantum chemistry, NWChemEx
• Hybrid formats and nonzero partitioning strategies
  Collaborators: Israt Nisa (OSU), P. (Saday) Sadayappan (OSU), Sriram Krishnamoorthy (PNNL)
Acknowledgement