An Input-Adaptive and In-Place Approach to Dense Tensor-Times-Matrix Multiply
Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, Richard Vuduc
Computational Science & Engineering, Georgia Institute of Technology
19th November 2015
The Problem: TTM
[Figure: the mode-n tensor-times-matrix multiply (TTM), Y = X ×_n U, shown for an I × J × K tensor X and an F × I_n matrix U; in the result Y, the mode-n dimension I_n is replaced by F.]
The Problem: TTM
[Figure: TTM implemented via transformation. X is transformed (matricized) into X(n), the product is computed as the GEMM Y(n) = U X(n), and Y(n) is transformed back into the tensor Y.]
The Problem: TTM
The transformation steps are expensive: up to 70% of the running time and 50% of the space.
We propose an in-place TTM algorithm and employ auto-tuning to adapt its parameters.
Outline
- Background
- Motivation
- InTensLi Framework
- Experiments and Analysis
- Conclusion
Background: Tensors and Applications
- Tensor: a multi-dimensional array, e.g. X ∈ R^(I×J×K). Special cases: vectors (x) are 1-D tensors and matrices (A) are 2-D tensors.
- Tensor dimension (N): also called mode or order.
- We focus on dense tensors in this work.
- Applications: quantum chemistry, quantum physics, signal and image processing, neuroscience, and data analytics.
[Figure: a third-order (three-dimensional) I × J × K tensor, indexed by i = 1,...,I, j = 1,...,J, k = 1,...,K.]
Background: Tensor Representations
- Fibers: columns X(:, j, k), rows X(i, :, k), and tubes X(i, j, :).
- Slices: horizontal X(i, :, :), lateral X(:, j, :), and frontal X(:, :, k).
- Sub-tensors, and the whole tensor X.
- Matricization and tensorization convert between the tensor and its unfoldings, e.g. a 2 × 2 × 2 tensor (I = J = K = 2) unfolds along mode 2 into the J × IK = 2 × 4 matrix X(2).
Different representations → different algorithms → different performance.
Background: Memory Mapping
Tensor organization:
- Logically: a multi-dimensional array.
- Physically: linear storage.
A memory-mapping function [1] connects the two:
- Row-major (LDim: k): mode order K → J → I, so the last mode k is contiguous.
- Column-major (LDim: i): mode order I → J → K, so the first mode i is contiguous.
(LDim: leading dimension.)
[1] Garcia, R., and Lumsdaine, A. MultiArray: A C++ library for generic programming with arrays. Software: Practice and Experience 35 (2004), 159–188.
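To make the two mappings concrete, here is a minimal sketch (our code, not from the slides; function and parameter names are illustrative) of the logical-to-physical index functions for a 3-D tensor:

    /* Linear offset of logical element (i, j, k) of an I x J x K tensor. */

    /* Row-major: k varies fastest (contiguous), as in C arrays X[I][J][K]. */
    static inline long idx_row_major(int i, int j, int k, int J, int K) {
        return ((long)i * J + j) * K + k;
    }

    /* Column-major: i varies fastest (contiguous), as in Fortran/MATLAB. */
    static inline long idx_col_major(int i, int j, int k, int I, int J) {
        return ((long)k * J + j) * I + i;
    }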
Background: Tensor Operations
- Matricization, aka unfolding or flattening.
- Mode-n product, aka tensor-times-matrix multiply (Ttm): Y = X ×_n U, computed on the unfolding as Y(n) = U X(n).
[Figure: Ttm on mode 1 of an I × J × K tensor: the F × I matrix U times the I × JK unfolding X(1) yields the F × JK result Y(1).]
- Others: tensor contraction, Kronecker product, matricized tensor times Khatri-Rao product (MTTKRP), etc.
Background: Ttm Algorithm
Baseline Ttm algorithm in the Tensor Toolbox and the Cyclops Tensor Framework (Ctf):
Input X → matricization → X(n) → multiplication Y(n) = U X(n) → tensorization → output Y.
Ttm applications: low-rank tensor decomposition, e.g. the Tucker decomposition (Tucker-HOOI algorithm), whose core step is a sequence of Ttms:
Y = X ×_1 A^(1)T · · · ×_(n−1) A^(n−1)T ×_(n+1) A^(n+1)T · · · ×_N A^(N)T.
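For reference, a minimal sketch of this baseline (out-of-place) flow for mode 2 of an I × J × K tensor, in C with CBLAS; the column-major layout, workspace handling, and names are our assumptions, not the tools' actual code:

    #include <cblas.h>

    /* Baseline mode-2 TTM, Y = X x_2 U: matricize, GEMM, tensorize.
     * X: I x J x K, column-major (i fastest). U: F x J, column-major.
     * Workspaces X2 (J x I*K) and Y2 (F x I*K) are preallocated;
     * Y: I x F x K, column-major. */
    void ttm_mode2_baseline(const double *X, const double *U, double *Y,
                            double *X2, double *Y2,
                            int I, int J, int K, int F) {
        long n = (long)I * K;  /* number of columns of the unfolding */
        /* 1) Matricize: X2(j, i + k*I) = X(i, j, k). */
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < J; ++j)
                for (int i = 0; i < I; ++i)
                    X2[(i + (long)k * I) * J + j] =
                        X[i + (long)j * I + (long)k * I * J];
        /* 2) Multiply: Y2 = U * X2, an (F x J) x (J x n) GEMM. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    F, (int)n, J, 1.0, U, F, X2, J, 0.0, Y2, F);
        /* 3) Tensorize: Y(i, f, k) = Y2(f, i + k*I). */
        for (int k = 0; k < K; ++k)
            for (int f = 0; f < F; ++f)
                for (int i = 0; i < I; ++i)
                    Y[i + (long)f * I + (long)k * I * F] =
                        Y2[f + (i + (long)k * I) * F];
    }

Steps 1 and 3 touch every element of X and Y once more each; that extra traffic is exactly the transformation overhead the in-place algorithm removes.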
Background: Main Contributions
- Proposed an in-place tensor-times-matrix multiply (InTtm) algorithm that avoids physically reorganizing the tensor.
- Built an input-adaptive framework, InTensLi, that automatically adapts the parameters and generates the code.
- Achieved 4× and 13× speedups over the state-of-the-art Tensor Toolbox and Ctf tools, respectively.
Motivation
Observation 1: Transformation is expensive.
Notation: number of words moved (Q), floating-point operations (W), last-level cache size (Z).
For both general matrix-matrix multiply (Gemm) and Ttm, Q ≥ W/√(8Z) − Z [2].
Suppose Ttm does the same number of flops as Gemm (Ŵ = W). Counting the transformation, the arithmetic intensities (AI) of Gemm (A) and Ttm (Â) are related by Â ≈ A/(1 + A/m); the factor (1 + A/m) is the penalty.
Assuming a last-level cache size Z of 8 MB, the penalty for a 3-D tensor is 33.
Conclusion: when Ttm and Gemm do the same number of flops, the arithmetic intensity of Ttm is decreased by a penalty of 33 or more, and the penalty grows with the tensor dimension.
[2] G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica, 23:1–155, 2014.
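To spell out where the penalty factor comes from, here is a short derivation in LaTeX; this is our reconstruction, with m standing for the (input-dependent) flops-per-transferred-word ratio of the transformation:

    % Lower bound [2]: Q >= W/sqrt(8Z) - Z, so A = W/Q is at most about sqrt(8Z).
    % If matricization moves roughly Q_t extra words, with Q_t ~ W/m, then
    \[
      \hat{A} \;=\; \frac{W}{Q + Q_t}
              \;=\; \frac{A}{1 + A \cdot Q_t / W}
              \;\approx\; \frac{A}{1 + A/m},
    \]
    % i.e. the transformation multiplies the achievable intensity by 1/(1 + A/m).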
Motivation
Observation 2: The performance of the multiplication step in Ttm is far below peak.
The Ttm algorithm involves a variety of rectangular problem sizes: the multiplication is an m × k by k × n Gemm with m = F (tiny, e.g. m = 16), k = I_n (short), and n = I_1 · · · I_(n−1) I_(n+1) · · · I_N (fat).
[Figure: (a) the shape of Ttm's multiplication; (b) Gemm performance of Intel MKL with 4 threads over log2(k) and log2(n), which varies widely across these rectangular shapes.]
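A tiny driver (ours; sizes and ranges illustrative) for reproducing the kind of shape sweep behind panel (b): fix m = 16 and time dgemm over powers-of-two k and n:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cblas.h>
    #include <omp.h>

    /* Sweep rectangular GEMM shapes with tiny m to expose the
     * shape sensitivity seen in TTM's multiplication step. */
    int main(void) {
        const int m = 16;
        for (int lk = 6; lk <= 12; ++lk)
            for (int ln = 6; ln <= 12; ++ln) {
                int k = 1 << lk, n = 1 << ln;
                double *A = malloc(sizeof(double) * m * k);
                double *B = malloc(sizeof(double) * (long)k * n);
                double *C = malloc(sizeof(double) * (long)m * n);
                for (long i = 0; i < (long)m * k; ++i) A[i] = 1.0;
                for (long i = 0; i < (long)k * n; ++i) B[i] = 1.0;
                double t = omp_get_wtime();
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                            m, n, k, 1.0, A, m, B, k, 0.0, C, m);
                t = omp_get_wtime() - t;
                printf("k=%5d n=%6d  %.2f GFLOP/s\n",
                       k, n, 2.0 * m * k * n / t / 1e9);
                free(A); free(B); free(C);
            }
        return 0;
    }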
Motivation
Observation 3: Ttm organization is critical to data locality.
There are many ways to organize the data accesses: e.g., multiplying U by frontal slices X(:, :, k) gives contiguous accesses, while multiplying by lateral slices X(:, j, :) gives non-contiguous accesses (for the storage order shown).
[Figure: contiguous vs. non-contiguous slice accesses of an I × J × K tensor.]
Motivation
Observation 3: Ttm organization is critical to data locality.
There are many ways to organize the data accesses; we choose the slice representation (a code sketch follows the table).

Table 1: Different representation forms of mode-1 Ttm on an I × J × K tensor.

    Representation | Mode-1 product                          | BLAS level | Transformation
    Tensor         | Y = X ×_1 U                             | —          | —
    Matrix         | Y(1) = U X(1)                           | L3         | Yes (full reorganization)
    Fiber          | y(f, :, k) = u(f, :) X(:, :, k),        | L2         | No (sub-tensor extraction)
                   |   loops: k = 1,...,K; f = 1,...,F       |            |
    Slice          | Y(:, :, k) = U X(:, :, k),              | L3         | No
                   |   loops: k = 1,...,K                    |            |
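A minimal sketch of the slice form for mode 1 (our code, assuming a column-major I × J × K tensor, so each frontal slice X(:, :, k) is a contiguous I × J block and every call is a plain Gemm with no copying):

    #include <cblas.h>

    /* Slice-representation mode-1 TTM: Y(:,:,k) = U * X(:,:,k) for each k.
     * X: I x J x K column-major (slice k starts at X + k*I*J).
     * U: F x I column-major. Y: F x J x K column-major. */
    void ttm_mode1_slices(const double *X, const double *U, double *Y,
                          int I, int J, int K, int F) {
        for (int k = 0; k < K; ++k)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        F, J, I, 1.0,
                        U, F,                      /* F x I */
                        X + (long)k * I * J, I,    /* X(:,:,k): I x J */
                        0.0,
                        Y + (long)k * F * J, F);   /* Y(:,:,k): F x J */
    }

Since consecutive frontal slices are adjacent in memory, these K calls can even be fused into a single F × I by I × JK Gemm; generalizing that fusion is the dimension-grouping idea of the next section.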
Layout
1. Background
2. Motivation
3. InTensLi Framework
   - Algorithmic Strategy
   - InTensLi Framework
4. Experiments and Analysis
5. Conclusion
6. References
InTensLi Framework: Algorithmic Strategy
[Figure: the dimensions I_1, I_2, ..., I_N of X grouped around mode n: the n − 1 backward dimensions and the N − n forward dimensions are candidates for compression into a sub-tensor X_sub.]
To avoid data copies, the grouping rules are: 1) compress only contiguous dimensions; 2) always include the leading dimension.
Lemma: Ttm can be performed on up to max{n − 1, N − n} contiguous dimensions without physical reorganization.
To get high Gemm performance:
- Find an appropriate matrix size for the computer architecture.
- Use the auto-tuning method in the InTensLi framework.
A sketch of the resulting grouped kernel follows.
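A minimal sketch of a grouped, in-place mode-n kernel (our illustration of the lemma; the names and column-major layout are assumptions): the modes below n are compressed into L = I_1 · · · I_(n−1), the modes above n are looped over as H = I_(n+1) · · · I_N, and each contiguous L × I_n block is multiplied in place:

    #include <cblas.h>

    /* Grouped in-place mode-n TTM: for each h, Y_sub = X_sub * U^T,
     * where X_sub is the contiguous L x In column-major block at
     * X + h*L*In and Y_sub is the L x F block at Y + h*L*F.
     * U: F x In column-major. No tensor data is copied. */
    void inttm_grouped(const double *X, const double *U, double *Y,
                       long L, int In, long H, int F) {
        for (long h = 0; h < H; ++h)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                        (int)L, F, In, 1.0,
                        X + h * L * In, (int)L,  /* X_sub: L x In */
                        U, F,                    /* op(U): In x F  */
                        0.0,
                        Y + h * L * F, (int)L);  /* Y_sub: L x F  */
    }

Writing the product as Y_sub = X_sub U^T (rather than U X_sub) keeps the grouped modes as the leading dimension on both sides, which is why no reorganization is needed.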
InTensLi Framework: InTtm Algorithm and Comparison
The in-place Tensor-Times-Matrix multiply (InTtm) algorithm:
- InTtm's AI: with Q̃ ≈ Ŵ/√(8Z), Ã = Ŵ/Q̃ ≈ √(8Z) ≈ A.
- Traditional Ttm's AI: Â ≈ A/(1 + A/m).
- InTtm thus eliminates the AI penalty factor of (1 + A/m).
Layout
1. Background
2. Motivation
3. InTensLi Framework
   - Algorithmic Strategy
   - InTensLi Framework
4. Experiments and Analysis
5. Conclusion
6. References
InTensLi Framework
Input: tensor features, hardware configuration, and a matrix-multiply (MM) benchmark.
Parameter estimation:
- Mode partitioning: M_L and M_C.
- Thread allocation: P_L and P_C.
Code generation (a sketch of such generated code follows).
[Figure: the InTensLi framework. The inputs (tensor and mode n; thresholds from the MM benchmark; hardware parameters such as the maximum number of threads) feed the Parameter Estimator, which chooses the mode partition (M_L, M_C) and the thread allocation (P_L, P_C). The Code Generator then emits InTtm code: nested parallel loops (parfor i1 = 1 : I1; parfor i2 = 1 : I2; ...) around a matrix-matrix multiplication, Y_sub = U X_sub or Y_sub = X_sub U', calling MM libraries such as BLIS or MKL.]
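A minimal sketch of the kind of code the generator might emit (our illustration; the real generated code and the tuned values of M_L, M_C, P_L, P_C come from the framework): P_L OpenMP threads over the looped modes, with the MM library free to use P_C threads inside each call:

    #include <omp.h>
    #include <cblas.h>

    /* Generated-style InTtm kernel: the looped modes are flattened into
     * H iterations, parallelized over P_L threads; each iteration runs
     * Y_sub = X_sub * U^T on a contiguous L x In block (see above). */
    void inttm_generated(const double *X, const double *U, double *Y,
                         long L, int In, long H, int F, int P_L) {
        #pragma omp parallel for num_threads(P_L) schedule(static)
        for (long h = 0; h < H; ++h)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                        (int)L, F, In, 1.0,
                        X + h * L * In, (int)L,
                        U, F, 0.0,
                        Y + h * L * F, (int)L);
    }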