BLIS Performs
Devangi N. Parikh
Science of High Performance Computing
The University of Texas at Austin
ThunderX2
Architecture: armv8.1
Base frequency: 2.0 GHz
Sockets per node: 2
Cores per socket: 28
The armv8a kernels in BLIS were written by Francisco D. Igual for cortexa57 architectures.
DGEMM (armv8a)
[Figure: DGEMM (single-threaded). y-axis: GFLOPS (0 to 16); x-axis: matrix dimension m=n=k (0 to 2000). Series: BLIS.]
DGEMM – Other Libraries
[Figure: DGEMM (single-threaded). y-axis: GFLOPS (0 to 16); x-axis: matrix dimension m=n=k (0 to 2000). Series: BLIS, OpenBLAS, ARMPL.]
GEMM – Other Datatypes
[Figure: SGEMM (single-threaded). y-axis: GFLOPS (0 to 30); x-axis: matrix dimension m=n=k (0 to 2000). Series: BLIS, OpenBLAS, ARMPL.]
GEMM – Other Datatypes
[Figure: CGEMM (single-threaded). y-axis: GFLOPS (0 to 30); x-axis: matrix dimension m=n=k (0 to 2000). Series: BLIS, OpenBLAS, ARMPL.]
GEMM – Other Datatypes
[Figure: ZGEMM (single-threaded). y-axis: GFLOPS (0 to 16); x-axis: matrix dimension m=n=k (0 to 2000). Series: BLIS, OpenBLAS, ARMPL.]
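The slides do not state how the GFLOPS values on these plots were derived. A minimal sketch, assuming the common convention of counting 2·m·n·k floating-point operations for a real GEMM and four times that for a complex GEMM, is shown below. The function names `gemm_flops` and `gemm_gflops` are hypothetical and used for illustration only.

```python
def gemm_flops(m, n, k, is_complex=False):
    """Nominal flop count for C := alpha*A*B + beta*C (assumed convention).

    Real GEMM: one multiply and one add per inner-product term -> 2*m*n*k.
    Complex GEMM: each complex multiply-add costs 8 real flops -> 8*m*n*k.
    """
    real_flops = 2.0 * m * n * k
    return 4.0 * real_flops if is_complex else real_flops


def gemm_gflops(m, n, k, seconds, is_complex=False):
    """Convert a measured wall-clock time into a GFLOPS rate."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gemm_flops(m, n, k, is_complex) / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing: a 1000 x 1000 x 1000 DGEMM that took 0.25 s.
    print(gemm_gflops(1000, 1000, 1000, 0.25))
```

Under this convention a complex routine and a real routine of the same precision can be compared on the same GFLOPS scale, which matches the way the SGEMM/CGEMM and DGEMM/ZGEMM plots share axis ranges.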
Level 3
[Figure: 4 x 4 grid of single-threaded performance plots. Series: BLIS, OpenBLAS, ARMPL. x-axis: matrix dimension m=n=k (0 to 2000); y-axis: GFLOPS.
Row 1 (0 to 30 GFLOPS): SGEMM, SSYRK, SSYMM, STRMM
Row 2 (0 to 15 GFLOPS): DGEMM, DSYRK, DSYMM, DTRMM
Row 3 (0 to 30 GFLOPS): CGEMM, CSYRK, CHEMM, CTRMM
Row 4 (0 to 15 GFLOPS): ZGEMM, ZSYRK, ZHEMM, ZTRMM]
Multi-threaded BLIS (28 cores)
[Figure: 4 x 4 grid of multi-threaded performance plots. Series: BLIS, OpenBLAS, ARMPL. x-axis: matrix dimension m=n=k (0 to 5000); y-axis: GFLOPS.
Row 1 (0 to 800 GFLOPS): SGEMM, SSYRK, SSYMM, STRMM
Row 2 (0 to 400 GFLOPS): DGEMM, DSYRK, DSYMM, DTRMM
Row 3 (0 to 800 GFLOPS): CGEMM, CSYRK, CHEMM, CTRMM
Row 4 (0 to 400 GFLOPS): ZGEMM, ZSYRK, ZHEMM, ZTRMM]
Multi-threaded BLIS (56 cores)
[Figure: 4 x 4 grid of multi-threaded performance plots. Series: BLIS, OpenBLAS, ARMPL. x-axis: matrix dimension m=n=k (0 to 5000); y-axis: GFLOPS.
Row 1 (0 to 1500 GFLOPS): SGEMM, SSYRK, SSYMM, STRMM
Row 2 (0 to 800 GFLOPS): DGEMM, DSYRK, DSYMM, DTRMM
Row 3 (0 to 1500 GFLOPS): CGEMM, CSYRK, CHEMM, CTRMM
Row 4 (0 to 800 GFLOPS): ZGEMM, ZSYRK, ZHEMM, ZTRMM]
Other Architectures
SkylakeX (single core)
[Figure: 4 x 4 grid of single-threaded performance plots. Series: BLIS, OpenBLAS, MKL. x-axis: matrix dimension m=n=k (0 to 2000); y-axis: GFLOPS.
Row 1 (0 to 100 GFLOPS): SGEMM, SSYRK, SSYMM, STRMM
Row 2 (0 to 50 GFLOPS): DGEMM, DSYRK, DSYMM, DTRMM
Row 3 (0 to 100 GFLOPS): CGEMM, CSYRK, CHEMM, CTRMM
Row 4 (0 to 50 GFLOPS): ZGEMM, ZSYRK, ZHEMM, ZTRMM]
Other Architectures
SkylakeX (20 cores)
[Figure: 4 x 4 grid of multi-threaded performance plots. Series: BLIS, OpenBLAS, MKL. x-axis: matrix dimension m=n=k (0 to 5000); y-axis: GFLOPS.
Row 1 (0 to 2000 GFLOPS): SGEMM, SSYRK, SSYMM, STRMM
Row 2 (0 to 1000 GFLOPS): DGEMM, DSYRK, DSYMM, DTRMM
Row 3 (0 to 2000 GFLOPS): CGEMM, CSYRK, CHEMM, CTRMM
Row 4 (0 to 1000 GFLOPS): ZGEMM, ZSYRK, ZHEMM, ZTRMM]