I Dont Care about BLAS Devin Matthews Institute for Computational - PowerPoint PPT Presentation

I Don’t Care about BLAS Devin Matthews Institute for Computational Engineering and Sciences, UT Austin #smallmoleculesmatter BLIS Retreat, Sept. 18, 2017 1

From QC to DGEMM H ˆ ¯ R | Φ i = E ˆ “Simple” eigenproblem… R | Φ i r abef H abij ¯ ijkl , In terms of tensors… cdkl r abef ijkl , W ai ck , F i k , t ab In terms of other tensors… ij , . . . e ¯ r ab ¯ f r abef With structured sparsity… l or ˇ ij ¯ k ¯ ijkl e< ¯ r a<b ¯ f r abef With symmetry… l or ˇ i<j ¯ k< ¯ i ≤ j ≤ k ≤ l r abef 0000 , r abef 0001 , r abef With slicing (or blocking etc.)… 0002 , . . . ( ab ) γ ab With more sparsity… r ( ef ) γ ef r ab ef ∈ R n a ⊗ n b ⊗ n e ⊗ n f In terms of dense tensors… In terms of matrices. A = B · C BLIS Retreat, Sept. 18, 2017 2

Tensor Contraction Today “TTDT” “LoG” W ak = W ak t ae V mk ic ic im ec ck ai X = [ t am ] ei [ V mk ] ec [ W ak ] ic m ∀ i, e, c A = B · C em ck for i for e em ai for c DGEMM t ae V mk im ec BLIS Retreat, Sept. 18, 2017 3

BLIS Retreat, Sept. 18, 2017 4

DGEMM Considered Harmful • Tensors have to be transposed in order to use DGEMM. • DGEMM needs dense matrices. If our tensors have structure (permutational symmetry, point group symmetry, sparsity, etc.) we have to expand or block them. • Point group symmetry is efficiently handled with the Direct Product Decomposition (DPD), but we want to automate and optimize it. • Blocking reduces the size of individual DGEMM calls. Can we aggregate these into more efficient operations? DPD: Stanton, J.F.; Gauss, J.; Watts, J.D.; Bartlett, R.J. J. Chem. Phys. 1991 , 94 , 4334. BLIS Retreat, Sept. 18, 2017 5

How Much Does Transpose Cost? BLIS Retreat, Sept. 18, 2017 6

BLIS à TBLIS 5 th +loop+around+micro&kernel+ n C n C Tensor#physical#layout# +=+ Tensor A C j B j +=# C" A" B" to 4 th +loop+around+micro&kernel+ ~ matrix Pack “ A i ” → A i Matrix7like#logical#layout# k C B p n C n C C j +=+ A p packing C i A i m C m C k C B p +=# kernel k C ~ Pack B p → B p 3 rd +loop+around+micro&kernel+ ~ ~ C i A i m C m C B p Packed#internal#layout# Pack “ B p ” → B p +=+ ~ ~ n R n R B p C i A i ~ Pack A i → A i m R +=# 2 nd +loop+around+micro&kernel+ ~ ~ n R n R B p C i A i k C m R +=+ #1 st #loop#around#micro7kernel# m R k C +=# 1 st +loop+around+micro&kernel+ (Possibly) k C n R n R m R scattered +=+ k C #micro7kernel# main#memory# update 1 L3#cache# micro&kernel+ +=# L2#cache# main+memory+ 1 1 L1#cache# L3+cache+ registers# +=+ L2+cache+ 1 L1+cache+ registers+ BLIS Retreat, Sept. 18, 2017 7

Results for Dense Tensors 45 45 40 40 35 35 30 30 GFLOPS/s GFLOPS/s 25 25 20 20 15 15 E5-2690 v3 MKL MKL 10 10 12 cores 1 core BLIS BLIS TBLIS TBLIS 5 5 TTDT TTDT 0 0 0 500 1000 1500 2000 0 1000 2000 3000 4000 5000 Approx. M=N=K Approx. M=N=K 45 45 MKL 40 40 BLIS TBLIS 35 35 TTDT 30 30 GFLOPS/s GFLOPS/s 25 25 20 20 15 15 MKL 10 10 BLIS TBLIS 5 5 TTDT 0 0 0 100 200 300 400 500 0 100 200 300 400 500 Approx. K (M=N=4000) Approx. K (M=N=16000) BLIS Retreat, Sept. 18, 2017 8

Works Great on Xeon Phi Too “Square” MM and TC “Rank-k” MM and TC on Xeon Phi 7210 on Xeon Phi 7210 2000 2000 MKL BLIS 1800 1800 TBLIS TTDT 1600 1600 1400 1400 1200 1200 GFLOPS/s GFLOPS/s 1000 1000 800 800 MKL 600 600 BLIS TBLIS 400 400 TTDT 200 200 0 0 0 1000 2000 3000 4000 5000 6000 7000 8000 0 500 1000 1500 2000 Approx. M=N=K Approx. K (M=N=16000) BLIS Retreat, Sept. 18, 2017 9

Quasi-Sparse Tensor Contractions Option #1: Batch within TBLIS framework ~ ~ B p n R C i A i zero, not m R computed … += Entire quantity non-zero laid out on disk n R k C Hunk: sized to fit in memory Option #2: Batch outside of TBLIS framework Chunk: fixed irreducible for k’ representations pfor n’ pfor m’ Virtual block: fixed if ∃ A(m’,k’) ∧ values of ijkl ∃ B(k’,n’) ∧ ∃ C(m’,n’) contract(…) endif done done done BLIS Retreat, Sept. 18, 2017 10

Quasi-Sparse Tensor Contractions for k’ pfor n’ pfor m’ if ∃ A(m’,k’) ∧ ∃ B(k’,n’) ∧ ∃ C(m’,n’) contract(…) endif done done done Split communicator into c_in & c_out pfor n’ over c_out pfor m’ over c_out Use hierarchical ks = {} for k’ dynamic+static parallelism if ∃ A(m’,k’) ∧ and aggregate blocks when ∃ B(k’,n’) ∧ possible. ∃ C(m’,n’) append k’ to ks endif done pcontract(ks,…) over c_in done done BLIS Retreat, Sept. 18, 2017 11

Quasi-Sparse Tensor Contractions (uses TBLIS for 1130 1121 1200 600 558 551 inner tensor ↑46% ↑85% 500 1000 contraction) ↑54% ↑72% 766 400 800 362 baseline, thread baseline, thread 321 within blocks within blocks 612 GFLOPS/s GFLOPS/s (adds hierarchical 282 baseline, thread 600 baseline, thread 300 outside blocks 486 outside blocks ↑80% multithreading ↑55% 205 TBLIS TBLIS 357 400 200 157 and block 230 aggregation) 140 100 200 53 3 5 0 0 nv =40, no=10 nv =60, no=5 nv =20, no=20 nv =40, no=10 nv =60, no=5 nv =20, no=20 70 900 842 845 825 784 800 59 60 ↑0% ↑5% ↑638% 700 50 600 543 baseline, thread baseline, thread 40 within blocks within blocks 40 500 GFLOPS/s GFLOPS/s ↑567% baseline, thread baseline, thread ↑8% outside blocks outside blocks 400 30 311 ↑1257% TBLIS TBLIS 287 300 19 20 200 140 10 8 6 100 1 5 0 0 0 0 0 nv =40, no=10 nv =60, no=5 nv =20, no=20 nv =40, no=10 nv =60, no=5 nv =20, no=20 BLIS Retreat, Sept. 18, 2017 12

Taking Advantage of Structure Point Group Symmetry Cost savings proportional to g 2 (g = number of irreducible representations/blocks). A ijk = -A jik = A jki = Permutational Symmetry -A kji = A kij = -A ikj i Factorial cost savings for increasing dimensionality. A i<j<k k j BLIS Retreat, Sept. 18, 2017 13

Taking Advantage of Structure ik j k i i block j Stride of j within block depends on block of i! ik j Zero pad Full scatter BLIS Retreat, Sept. 18, 2017 14

Speedup in computation of coupled cluster singles and doubles (CCSD) ground state energy when using TBLIS 1.8 1.7 Guanine-cytosine dimer ( GC ), no symmetry Krepl et al., J. Phys. Chem. B 2013, 117, 1872 1.6 2x Xeon E5-2698 v3 (32 cores) 1.5 Speedup 1.4 1.3 1.2 2,4,-diphenylfuran ( DPF ), C s symmetry 1.1 1 GC DPF DBDPE Au2 (E) 1,2-dibromo- 1,2- Less symmetry More symmetry diphenylethene ( DBDPE ) planar, C 2h symmetry Au Au Gold dimer ( Au 2 ) all-electron, D 2h symmetry BLIS Retreat, Sept. 18, 2017 15

Summary • Novel algorithms leveraging the BLIS methodology can significantly outperform DGEMM-based algorithms for tensor contraction . • Breaking through the DGEMM barrier allows new algorithms to be implemented with high efficiency. BLIS Retreat, Sept. 18, 2017 16

Thanks! Robert van Jianyu Field Tyler Devangi de Geijn Huang Van Zee Smith Parikh TBLIS on #smallmoleculesmatter www.cfour.de BLIS Retreat, Sept. 18, 2017 17

I Dont Care about BLAS Devin Matthews Institute for Computational - PowerPoint PPT Presentation

I Dont Care about BLAS Devin Matthews Institute for Computational Engineering and Sciences, UT Austin #smallmoleculesmatter BLIS Retreat, Sept. 18, 2017 1 From QC to DGEMM H R | i = E Simple eigenproblem R | i r

Pretorsion theories in arbitrary categories Alberto Facchini Universit` a di Padova Almer

BLAS for Tensors: What, Why, and How Devin Ma;hews

They Don t Want Them Or You t Want Them Or You They Don Don t Have Them: t Have

Don Juans Troubles Don Juans Troubles Hey, Anna, how are you? Don Juans Troubles Hey,

Lower Don Trail Master Plan Refresh Public Open House_September 17 2019 1 Lower Don Trail

DON Cybersecurity/Information Assurance Workforce Management Chris Kelsall DON CIO, Director,

Typical English mistakes The system consist of three main component. Giorgio Buttazzo don't forget

BACKGROUND JOB PROCESSING DO'S AND DON'TS BACKGROUND JOB PROCESSING - DO'S AND DON'TS IMAGE

CBS - 2016 A DIALOG WITH CAPACITY BUILDING CHAMPIONS FROM THE ACADEMIC COMMUNITY Global ICT

PYROT OTECHN HNICS BLAS AST IMPAC ACT LOAD AD DROP EARTHQ HQUAK UAKE FIRE RESISTAN

Eric Calderon, Senior Political Science Major, UCR Counts Committee Marlenee Blas, Director of UCR

SLAM Tutorial Course 6.834 Cognitive Robotics SLAM for Dummies Authors: Morten Rufus Blas,

Materials Science Blas Pedro Uberuaga Los Alamos National Laboratory U N C L A S S I F I E D

Tensor Contraction with Extended BLAS Kernels on CPU and GPU Yang Shi University of California,

Numerical Linear Algebra Software (based on slides written by Michael Grant) BLAS, ATLAS

Some Notes on BLAS, SIMD, MIC and GPGPU (for Octave and Matlab) Christian Himpe (

Automatic 3D Mapping for Infrared Image Analysis i i r r f f m m c c a a d d a a r

Salient Keypoint Selection for Object Representation Paper ID: 1570232318 Twenty Second National

Stereo Matching Wei-Chih Tu ( ) National Taiwan University Fall 2018 Stereo Matching

Budget Presentation Budget Presentation Controller s Office s Office Controller Board

Ang ie Ro dg e rs, Dire c to r Ne w Co mmunitie s Initia tive DC Offic e o f the De puty Ma

Circuits Lecture 11 Uniform Circuit Complexity 1 Recall 2 Recall Non-uniform complexity 2

June 25, 2015 CI4CC Research Resonance Initiative Defining

No-counterexample Interpretations of Logic and the Geometry of Interaction Masaru Shirahata