I Don’t Care about BLAS Devin Matthews Institute for Computational Engineering and Sciences, UT Austin #smallmoleculesmatter BLIS Retreat, Sept. 18, 2017 1
From QC to DGEMM H ˆ ¯ R | Φ i = E ˆ “Simple” eigenproblem… R | Φ i r abef H abij ¯ ijkl , In terms of tensors… cdkl r abef ijkl , W ai ck , F i k , t ab In terms of other tensors… ij , . . . e ¯ r ab ¯ f r abef With structured sparsity… l or ˇ ij ¯ k ¯ ijkl e< ¯ r a<b ¯ f r abef With symmetry… l or ˇ i<j ¯ k< ¯ i ≤ j ≤ k ≤ l r abef 0000 , r abef 0001 , r abef With slicing (or blocking etc.)… 0002 , . . . ( ab ) γ ab With more sparsity… r ( ef ) γ ef r ab ef ∈ R n a ⊗ n b ⊗ n e ⊗ n f In terms of dense tensors… In terms of matrices. A = B · C BLIS Retreat, Sept. 18, 2017 2
Tensor Contraction Today “TTDT” “LoG” W ak = W ak t ae V mk ic ic im ec ck ai X = [ t am ] ei [ V mk ] ec [ W ak ] ic m ∀ i, e, c A = B · C em ck for i for e em ai for c DGEMM t ae V mk im ec BLIS Retreat, Sept. 18, 2017 3
BLIS Retreat, Sept. 18, 2017 4
DGEMM Considered Harmful • Tensors have to be transposed in order to use DGEMM. • DGEMM needs dense matrices. If our tensors have structure (permutational symmetry, point group symmetry, sparsity, etc.) we have to expand or block them. • Point group symmetry is efficiently handled with the Direct Product Decomposition (DPD), but we want to automate and optimize it. • Blocking reduces the size of individual DGEMM calls. Can we aggregate these into more efficient operations? DPD: Stanton, J.F.; Gauss, J.; Watts, J.D.; Bartlett, R.J. J. Chem. Phys. 1991 , 94 , 4334. BLIS Retreat, Sept. 18, 2017 5
How Much Does Transpose Cost? BLIS Retreat, Sept. 18, 2017 6
BLIS à TBLIS 5 th +loop+around+micro&kernel+ n C n C Tensor#physical#layout# +=+ Tensor A C j B j +=# C" A" B" to 4 th +loop+around+micro&kernel+ ~ matrix Pack “ A i ” → A i Matrix7like#logical#layout# k C B p n C n C C j +=+ A p packing C i A i m C m C k C B p +=# kernel k C ~ Pack B p → B p 3 rd +loop+around+micro&kernel+ ~ ~ C i A i m C m C B p Packed#internal#layout# Pack “ B p ” → B p +=+ ~ ~ n R n R B p C i A i ~ Pack A i → A i m R +=# 2 nd +loop+around+micro&kernel+ ~ ~ n R n R B p C i A i k C m R +=+ #1 st #loop#around#micro7kernel# m R k C +=# 1 st +loop+around+micro&kernel+ (Possibly) k C n R n R m R scattered +=+ k C #micro7kernel# main#memory# update 1 L3#cache# micro&kernel+ +=# L2#cache# main+memory+ 1 1 L1#cache# L3+cache+ registers# +=+ L2+cache+ 1 L1+cache+ registers+ BLIS Retreat, Sept. 18, 2017 7
Results for Dense Tensors 45 45 40 40 35 35 30 30 GFLOPS/s GFLOPS/s 25 25 20 20 15 15 E5-2690 v3 MKL MKL 10 10 12 cores 1 core BLIS BLIS TBLIS TBLIS 5 5 TTDT TTDT 0 0 0 500 1000 1500 2000 0 1000 2000 3000 4000 5000 Approx. M=N=K Approx. M=N=K 45 45 MKL 40 40 BLIS TBLIS 35 35 TTDT 30 30 GFLOPS/s GFLOPS/s 25 25 20 20 15 15 MKL 10 10 BLIS TBLIS 5 5 TTDT 0 0 0 100 200 300 400 500 0 100 200 300 400 500 Approx. K (M=N=4000) Approx. K (M=N=16000) BLIS Retreat, Sept. 18, 2017 8
Works Great on Xeon Phi Too “Square” MM and TC “Rank-k” MM and TC on Xeon Phi 7210 on Xeon Phi 7210 2000 2000 MKL BLIS 1800 1800 TBLIS TTDT 1600 1600 1400 1400 1200 1200 GFLOPS/s GFLOPS/s 1000 1000 800 800 MKL 600 600 BLIS TBLIS 400 400 TTDT 200 200 0 0 0 1000 2000 3000 4000 5000 6000 7000 8000 0 500 1000 1500 2000 Approx. M=N=K Approx. K (M=N=16000) BLIS Retreat, Sept. 18, 2017 9
Quasi-Sparse Tensor Contractions Option #1: Batch within TBLIS framework ~ ~ B p n R C i A i zero, not m R computed … += Entire quantity non-zero laid out on disk n R k C Hunk: sized to fit in memory Option #2: Batch outside of TBLIS framework Chunk: fixed irreducible for k’ representations pfor n’ pfor m’ Virtual block: fixed if ∃ A(m’,k’) ∧ values of ijkl ∃ B(k’,n’) ∧ ∃ C(m’,n’) contract(…) endif done done done BLIS Retreat, Sept. 18, 2017 10
Quasi-Sparse Tensor Contractions for k’ pfor n’ pfor m’ if ∃ A(m’,k’) ∧ ∃ B(k’,n’) ∧ ∃ C(m’,n’) contract(…) endif done done done Split communicator into c_in & c_out pfor n’ over c_out pfor m’ over c_out Use hierarchical ks = {} for k’ dynamic+static parallelism if ∃ A(m’,k’) ∧ and aggregate blocks when ∃ B(k’,n’) ∧ possible. ∃ C(m’,n’) append k’ to ks endif done pcontract(ks,…) over c_in done done BLIS Retreat, Sept. 18, 2017 11
Quasi-Sparse Tensor Contractions (uses TBLIS for 1130 1121 1200 600 558 551 inner tensor ↑46% ↑85% 500 1000 contraction) ↑54% ↑72% 766 400 800 362 baseline, thread baseline, thread 321 within blocks within blocks 612 GFLOPS/s GFLOPS/s (adds hierarchical 282 baseline, thread 600 baseline, thread 300 outside blocks 486 outside blocks ↑80% multithreading ↑55% 205 TBLIS TBLIS 357 400 200 157 and block 230 aggregation) 140 100 200 53 3 5 0 0 nv =40, no=10 nv =60, no=5 nv =20, no=20 nv =40, no=10 nv =60, no=5 nv =20, no=20 70 900 842 845 825 784 800 59 60 ↑0% ↑5% ↑638% 700 50 600 543 baseline, thread baseline, thread 40 within blocks within blocks 40 500 GFLOPS/s GFLOPS/s ↑567% baseline, thread baseline, thread ↑8% outside blocks outside blocks 400 30 311 ↑1257% TBLIS TBLIS 287 300 19 20 200 140 10 8 6 100 1 5 0 0 0 0 0 nv =40, no=10 nv =60, no=5 nv =20, no=20 nv =40, no=10 nv =60, no=5 nv =20, no=20 BLIS Retreat, Sept. 18, 2017 12
Taking Advantage of Structure Point Group Symmetry Cost savings proportional to g 2 (g = number of irreducible representations/blocks). A ijk = -A jik = A jki = Permutational Symmetry -A kji = A kij = -A ikj i Factorial cost savings for increasing dimensionality. A i<j<k k j BLIS Retreat, Sept. 18, 2017 13
Taking Advantage of Structure ik j k i i block j Stride of j within block depends on block of i! ik j Zero pad Full scatter BLIS Retreat, Sept. 18, 2017 14
Speedup in computation of coupled cluster singles and doubles (CCSD) ground state energy when using TBLIS 1.8 1.7 Guanine-cytosine dimer ( GC ), no symmetry Krepl et al., J. Phys. Chem. B 2013, 117, 1872 1.6 2x Xeon E5-2698 v3 (32 cores) 1.5 Speedup 1.4 1.3 1.2 2,4,-diphenylfuran ( DPF ), C s symmetry 1.1 1 GC DPF DBDPE Au2 (E) 1,2-dibromo- 1,2- Less symmetry More symmetry diphenylethene ( DBDPE ) planar, C 2h symmetry Au Au Gold dimer ( Au 2 ) all-electron, D 2h symmetry BLIS Retreat, Sept. 18, 2017 15
Summary • Novel algorithms leveraging the BLIS methodology can significantly outperform DGEMM-based algorithms for tensor contraction . • Breaking through the DGEMM barrier allows new algorithms to be implemented with high efficiency. BLIS Retreat, Sept. 18, 2017 16
Thanks! Robert van Jianyu Field Tyler Devangi de Geijn Huang Van Zee Smith Parikh TBLIS on #smallmoleculesmatter www.cfour.de BLIS Retreat, Sept. 18, 2017 17
Recommend
More recommend