
BLIS Performs, Devangi N. Parikh, Science of High Performance Computing



  1. BLIS Performs. Devangi N. Parikh, Science of High Performance Computing, The University of Texas at Austin.

  2. ThunderX2. Architecture: ARMv8.1; base frequency: 2.0 GHz; sockets per node: 2; cores per socket: 28. The armv8a kernels in BLIS were written by Francisco D. Igual for cortexa57 architectures.

  3. DGEMM (armv8a). [Figure: DGEMM (single-threaded). y-axis: GFLOPS, 0 to 16. x-axis: matrix dimension m=n=k, 0 to 2000. Series: BLIS.]

  4. DGEMM – Other Libraries. [Figure: DGEMM (single-threaded). y-axis: GFLOPS, 0 to 16. x-axis: matrix dimension m=n=k, 0 to 2000. Series: BLIS, OpenBLAS, ARMPL.]

  5. GEMM – Other Datatypes. [Figure: SGEMM (single-threaded). y-axis: GFLOPS, 0 to 30. x-axis: matrix dimension m=n=k, 0 to 2000. Series: BLIS, OpenBLAS, ARMPL.]

  6. GEMM – Other Datatypes. [Figure: CGEMM (single-threaded). y-axis: GFLOPS, 0 to 30. x-axis: matrix dimension m=n=k, 0 to 2000. Series: BLIS, OpenBLAS, ARMPL.]

  7. GEMM – Other Datatypes. [Figure: ZGEMM (single-threaded). y-axis: GFLOPS, 0 to 16. x-axis: matrix dimension m=n=k, 0 to 2000. Series: BLIS, OpenBLAS, ARMPL.]

  8. Level 3. [Figure: grid of single-threaded plots, one row per datatype and one column per operation. Row 1: SGEMM, SSYRK, SSYMM, STRMM (y-axis 0 to 30 GFLOPS). Row 2: DGEMM, DSYRK, DSYMM, DTRMM (0 to 15 GFLOPS). Row 3: CGEMM, CSYRK, CHEMM, CTRMM (0 to 30 GFLOPS). Row 4: ZGEMM, ZSYRK, ZHEMM, ZTRMM (0 to 15 GFLOPS). x-axis: matrix dimension m=n=k, 0 to 2000. Series: BLIS, OpenBLAS, ARMPL.]

  9. Multi-threaded BLIS (28 cores). [Figure: grid of multi-threaded plots, one row per datatype and one column per operation. Row 1: SGEMM, SSYRK, SSYMM, STRMM (y-axis 0 to 800 GFLOPS). Row 2: DGEMM, DSYRK, DSYMM, DTRMM (0 to 400 GFLOPS). Row 3: CGEMM, CSYRK, CHEMM, CTRMM (0 to 800 GFLOPS). Row 4: ZGEMM, ZSYRK, ZHEMM, ZTRMM (0 to 400 GFLOPS). x-axis: matrix dimension m=n=k, 0 to 5000. Series: BLIS, OpenBLAS, ARMPL.]

  10. Multi-threaded BLIS (56 cores). [Figure: grid of multi-threaded plots, one row per datatype and one column per operation. Row 1: SGEMM, SSYRK, SSYMM, STRMM (y-axis 0 to 1500 GFLOPS). Row 2: DGEMM, DSYRK, DSYMM, DTRMM (0 to 800 GFLOPS). Row 3: CGEMM, CSYRK, CHEMM, CTRMM (0 to 1500 GFLOPS). Row 4: ZGEMM, ZSYRK, ZHEMM, ZTRMM (0 to 800 GFLOPS). x-axis: matrix dimension m=n=k, 0 to 5000. Series: BLIS, OpenBLAS, ARMPL.]

  11. Other Architectures: SkylakeX (single core). [Figure: grid of single-threaded plots, one row per datatype and one column per operation. Row 1: SGEMM, SSYRK, SSYMM, STRMM (y-axis 0 to 100 GFLOPS). Row 2: DGEMM, DSYRK, DSYMM, DTRMM (0 to 50 GFLOPS). Row 3: CGEMM, CSYRK, CHEMM, CTRMM (0 to 100 GFLOPS). Row 4: ZGEMM, ZSYRK, ZHEMM, ZTRMM (0 to 50 GFLOPS). x-axis: matrix dimension m=n=k, 0 to 2000. Series: BLIS, OpenBLAS, MKL.]

  12. Other Architectures: SkylakeX (20 cores). [Figure: grid of multi-threaded plots, one row per datatype and one column per operation. Row 1: SGEMM, SSYRK, SSYMM, STRMM (y-axis 0 to 2000 GFLOPS). Row 2: DGEMM, DSYRK, DSYMM, DTRMM (0 to 1000 GFLOPS). Row 3: CGEMM, CSYRK, CHEMM, CTRMM (0 to 2000 GFLOPS). Row 4: ZGEMM, ZSYRK, ZHEMM, ZTRMM (0 to 1000 GFLOPS). x-axis: matrix dimension m=n=k, 0 to 5000. Series: BLIS, OpenBLAS, MKL.]
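
The slides plot GFLOPS against the matrix dimension m=n=k but do not say how the rate was computed. As a rough sketch, assuming the usual convention of counting 2·m·n·k floating-point operations for a real GEMM and 8·m·n·k for a complex one, a measured run time could be converted as below. The function names `gemm_flops` and `gflops` are hypothetical helpers, not part of BLIS, and the sketch covers GEMM only; the other level-3 operations use different operation counts.

```python
def gemm_flops(m, n, k, is_complex=False):
    """Conventional flop count for C := A*B + C (assumed convention).

    Real data: one multiply and one add per inner-product term (2 flops).
    Complex data: a complex multiply-add is counted as 8 real flops.
    """
    per_term = 8 if is_complex else 2
    return per_term * m * n * k


def gflops(m, n, k, seconds, is_complex=False):
    """Convert a measured wall-clock time in seconds into GFLOPS."""
    return gemm_flops(m, n, k, is_complex) / seconds / 1e9


if __name__ == "__main__":
    # Illustrative input only: a 2000x2000x2000 real GEMM taking 1 second.
    print(gflops(2000, 2000, 2000, 1.0))
```

Under this convention a complex GEMM of the same dimensions is credited with four times as many operations as a real GEMM, which is worth keeping in mind when comparing rates across datatypes.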
