TSM2: Optimizing Tall-and-Skinny Matrix-Matrix Multiplication on GPUs


  1. TSM2: Optimizing Tall-and-Skinny Matrix-Matrix Multiplication on GPUs
     Jieyang Chen, Nan Xiong, Xin Liang, Dingwen Tao*, Sihuan Li, Kaiming Ouyang, Kai Zhao, Nathan DeBardeleben**, Qiang Guan***, Zizhong Chen
     University of California, Riverside; *University of Alabama; **Los Alamos National Laboratory; ***Kent State University

  2. Linear algebra kernels are widely used
     • Linear algebra kernels are used everywhere, e.g., scientific simulation, big data analytics, machine learning, etc.
     • Matrix-matrix multiplication (GEMM):
       • One of the most fundamental computation kernels, used to build up other kernels
       • Core computation of many applications
       • Accounts for most of the computation time of those applications (Source: Berkeley Dwarfs Report)

  3. The input shape of GEMM varies from application to application
     • Input shapes range from relatively regular (near-square) matrices to tall-and-skinny matrices.
     • Example applications: dense matrix decompositions, K-means, deep neural networks, algorithm-based fault tolerance.

  4. Two kinds of computations
     • Computation bound: performance of the application is bounded by the computation power.
       • Example: matrix-matrix multiplication, C (n×n) = A (n×n) × B (n×n)
     • Memory bound: performance of the application is bounded by the memory bandwidth.
       • Example: matrix-vector multiplication, y (n×1) = A (n×n) × x (n×1)
     • Matrix-matrix multiplication with tall-and-skinny input: C (n×k) = A (n×n) × B (n×k), with n > 10,000 and k < 100

  5. Why does tall-and-skinny input behave differently than regular-shaped input?
     • Regular matrix-matrix multiplication, C (n×n) = A (n×n) × B (n×n):
       • Input matrices size is O(n²); computing time complexity is O(n³); each element is used n times.
     • Matrix-matrix multiplication with tall-and-skinny input, C (n×k) = A (n×n) × B (n×k):
       • Input matrices size is O(n²); computing time complexity is O(n²k); each element is used k times on average.
     • So for tall-and-skinny input, depending on k and the ratio between the target GPU's peak computation power and peak memory throughput, the computation is usually memory bound.
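
A rough roofline-style check (my arithmetic, assuming double-precision data and the Tesla K40c figures listed later in the evaluation setup) shows where the memory-bound regime sits:

```latex
% C = A * B with A: n x n, B: n x k, 8-byte doubles.
% Arithmetic intensity, counting each input and output element moved once:
\[
  I \;=\; \frac{2n^{2}k}{8\,(n^{2} + 2nk)} \;\approx\; \frac{k}{4}
  \ \text{flops/byte} \qquad (k \ll n).
\]
% Machine balance of a Tesla K40c: 1430 GFLOPS / 288 GB/s is about 5 flops/byte,
% so the kernel stays memory bound whenever k/4 < 5, i.e. roughly k < 20,
% which covers the k = 2..16 cases shown on the following slides.
```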

  6. GPUs are widely used for accelerating applications
     • Good at parallelized computations.
     • Higher computation power and memory throughput.
     • Commonly used for accelerating matrix-related computations.

  7. The cuBLAS library
     • One of the most commonly used standard linear algebra libraries optimized for GPUs, developed by Nvidia.
     • The core computing library of many big data and scientific computing applications.
     • With Nvidia's deep optimization, cuBLAS provides state-of-the-art performance for regular-shaped input matrices.
     • But it is not fully optimized for tall-and-skinny matrices.

  8. Poor performance of the current state-of-the-art design
     • Regular-sized matrix multiplication (large n and k of similar magnitude) is computation bound; cuBLAS reaches 80%-90% of the peak computation power.
     • Tall-and-skinny matrix multiplication (n >> k) is memory bound, but the current state-of-the-art design is only optimized for the computation-bound case.
     • Low GPU utilization:
       • k = 2: 49.9% of peak memory bandwidth, 37.9% of peak computation power, with a sudden performance drop.
       • k = 16: 31.1% of peak memory bandwidth, 56.6% of peak computation power, with a sudden performance drop.
     [Charts: cuBLAS performance (Gflop/s) and memory throughput (GB/s) vs. input matrix size n = 10240-30720 for k = 2 and k = 16, compared against peak performance and peak memory throughput.]

  9. TSM2: redesigned matrix-matrix multiplication for tall-and-skinny input
     • Several factors are considered:
       1) Total number of global memory accesses.
       2) Efficiency of global memory throughput.
       3) Parallelism of the overall workload.
       4) On-chip memory utilization.
       5) Streaming Multiprocessor (SM) utilization.

  10. Algorithm design: how to fit the workload into the CUDA programming model (continued)
     • We divide the workload by assigning the n rows of matrix A to n different threads: each vector-matrix multiplication (one row of A times B, producing one row of C) is assigned to one thread (see the sketch below).
       i. Ensures high parallelism and high Streaming Multiprocessor occupancy.
       ii. Ensures the minimum number of memory accesses in favor of matrix A.
       iii. Enables high memory access efficiency.
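
A minimal CUDA sketch of this thread mapping (my illustration, not the authors' kernel; it assumes double precision and column-major storage as in cuBLAS, and uses the simple inner-product order that the next slide calls Version 0):

```cuda
// Illustrative sketch: one thread computes one row of C = A * B, where A is
// n x n and B is n x k (column-major, leading dimension n).
__global__ void rowPerThreadGemm(const double *A, const double *B,
                                 double *C, int n, int k) {
    // Grid-stride loop so any grid size covers all n rows.
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < n;
         row += gridDim.x * blockDim.x) {
        for (int col = 0; col < k; ++col) {            // one element of C's row at a time
            double acc = 0.0;
            for (int j = 0; j < n; ++j)                 // dot(row of A, column of B)
                acc += A[row + (size_t)j * n] * B[j + (size_t)col * n];
            C[row + (size_t)col * n] = acc;
        }
    }
}
```

Launched with something like `rowPerThreadGemm<<<(n + 255) / 256, 256>>>(dA, dB, dC, n, k)`, this gives one thread per row of A; the remaining slides refine how each thread reads A and B.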

  11. Redesigning matrix-matrix multiplication for tall-and-skinny input
     • Rethinking the algorithm design, aiming to reduce the total number of memory accesses: inner product vs. outer product (per-thread loop orderings sketched below).
     • Version 0: inner product
       • Memory accesses to each element of A: k times
       • Memory accesses to each element of B: n times
       • Total number of accesses: 2kn²
     • Version 1: outer product
       • Memory accesses to each element of A: 1 time
       • Memory accesses to each element of B: n times
       • Total number of accesses: (k+1)n²
     [Chart: speedup of cuBLAS, BLASX, TSM2-V0, and TSM2-V1 for tall-and-skinny GEMM with k = 8 on Nvidia Tesla K40c, n = 10K-30K.]
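
The outer-product loop ordering, as I read it from the access counts above (illustrative sketch; `K` is assumed known at compile time so the accumulator row stays in registers):

```cuda
#define K 8   // k fixed at compile time for this sketch (k = 8 as in the chart)

// Version 0 (inner product): for each of the K output columns the thread
// re-reads its whole row of A, so each element of A is loaded K times.
// Version 1 (outer product, below): the thread keeps its K-wide row of C in
// registers and streams through its row of A exactly once.
__global__ void outerProductRow(const double *A, const double *B,
                                double *C, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    double c[K] = {0.0};                        // row of C accumulated in registers
    for (int j = 0; j < n; ++j) {
        double a = A[row + (size_t)j * n];      // each element of A loaded once: n^2 loads total
        for (int p = 0; p < K; ++p)
            c[p] += a * B[j + (size_t)p * n];   // B still loaded n times per element: k*n^2 loads
    }
    for (int p = 0; p < K; ++p)
        C[row + (size_t)p * n] = c[p];
}
```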

  12. Global memory access efficiency analysis
     • Global memory access efficiency per transaction = useful data / cache line size
     • Affects overall application memory access efficiency
     • Determined by the memory access pattern and the algorithm
     • Can be challenging to improve without modifying the algorithm design
     • For outer-product GEMM: 128 bytes / 128 bytes = 100% or 8 bytes / 128 bytes = 6.25% (with 32-byte transactions: 32 bytes / 32 bytes = 100% or 8 bytes / 32 bytes = 25%)
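
To make those numbers concrete, here is my own reading of the two cases (an assumption, not stated on the slide), for 8-byte double-precision elements:

```latex
% My interpretation: efficiency = useful bytes / bytes moved per transaction.
% A request that consumes a whole 128-byte L1 cache line vs. one from which
% only a single 8-byte double is actually consumed:
\[
  \frac{128}{128} = 100\%
  \qquad\text{vs.}\qquad
  \frac{8}{128} = 6.25\%,
\]
% and with 32-byte L2 transactions:
\[
  \frac{32}{32} = 100\%
  \qquad\text{vs.}\qquad
  \frac{8}{32} = 25\%.
\]
```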

  13. Improving global memory access efficiency
     • GPU shared memory: sharing data between threads within a thread block.
     • Version 2: outer product + shared memory.
     • Benefit: decouples the data-load pattern from the data-use pattern (a sketch of this version follows):
       • Load data into shared memory in a more efficient way (memory transaction efficiency = 100%).
       • Keep the original data-use pattern of the outer-product version.
     [Chart: speedup of cuBLAS, BLASX, TSM2-V0, TSM2-V1, and TSM2-V2 for tall-and-skinny GEMM with k = 8 on Nvidia Tesla K40c, n = 10K-30K.]
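
A sketch of what Version 2 could look like (my illustration under the same assumptions as before; `T1` is a hypothetical tile size, and the real TSM2 kernel tunes its tile parameters per GPU): the whole thread block loads a T1 x K tile of B into shared memory with coalesced transactions, then every thread reuses it from shared memory.

```cuda
#define K  8
#define T1 128                         // rows of B staged per iteration; also blockDim.x

// Illustrative sketch of outer product + shared-memory staging of B.
__global__ void outerProductSharedB(const double *A, const double *B,
                                    double *C, int n) {
    __shared__ double Bs[T1][K];       // current tile of B, shared by the block
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    double c[K] = {0.0};

    for (int jt = 0; jt < n; jt += T1) {
        // Cooperative, coalesced load: thread t fetches row (jt + t) of the tile;
        // for a fixed p, consecutive threads touch consecutive addresses of B.
        int r = jt + threadIdx.x;
        for (int p = 0; p < K; ++p)
            Bs[threadIdx.x][p] = (r < n) ? B[r + (size_t)p * n] : 0.0;
        __syncthreads();

        if (row < n)
            for (int j = 0; j < T1 && jt + j < n; ++j) {
                double a = A[row + (size_t)(jt + j) * n];   // still one read of A per element
                for (int p = 0; p < K; ++p)
                    c[p] += a * Bs[j][p];                   // B now served from shared memory
            }
        __syncthreads();
    }
    if (row < n)
        for (int p = 0; p < K; ++p)
            C[row + (size_t)p * n] = c[p];
}
```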

  14. Improving global memory access efficiency
     • Even with an efficient global memory loading pattern, Version 2 (outer product + shared memory) still suffers from high GPU underutilization.
     • Main cause: long memory access latency is hard to hide because of the data dependency between data-load and data-use instructions.

  15. Data prefetch: improving GPU utilization
     • Version 3: outer product + shared memory + data prefetch.
     • Prefetching the data needed for the next iteration improves latency hiding and GPU utilization (a simplified sketch follows):
       • The thread block prefetches the next tile of B into registers and stores it to shared memory before the next iteration, so it becomes the current tile in that iteration.
       • Each thread prefetches the next elements of A into registers while computing on the registers that hold the current elements of A.
       • Per-thread pipeline: load C once, then in each iteration load the next B tile and the next A elements, compute on the current data, and synchronize with the rest of the block; store C once at the end.
       • Tile-size parameters (t1, t2, t3 in the slide figure) control how much of B and A is staged per iteration.
     [Chart: speedup of cuBLAS, BLASX, and the TSM2 versions for tall-and-skinny GEMM with k = 8 on Nvidia Tesla K40c, n = 10K-30K.]
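
A simplified sketch of the prefetch idea (again my illustration, not TSM2's tuned kernel; it double-buffers only the B tile through registers and keeps the tile size fixed): while the block computes on the tile of B currently in shared memory, each thread issues the loads for the next tile into registers, and the registers are copied into shared memory only after a synchronization.

```cuda
#define K  8
#define T1 128                          // rows of B per tile; also blockDim.x

// Illustrative sketch of outer product + shared memory + data prefetch.
__global__ void outerProductPrefetch(const double *A, const double *B,
                                      double *C, int n) {
    __shared__ double Bs[T1][K];
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    double c[K] = {0.0};
    double nextB[K];                    // register buffer for the prefetched tile row

    // Load the first tile of B directly into shared memory.
    for (int p = 0; p < K; ++p)
        Bs[threadIdx.x][p] = (threadIdx.x < n) ? B[threadIdx.x + (size_t)p * n] : 0.0;
    __syncthreads();

    for (int jt = 0; jt < n; jt += T1) {
        // 1) Prefetch the next tile of B into registers; nothing depends on it yet,
        //    so these loads can overlap with the computation below.
        int r = jt + T1 + threadIdx.x;
        for (int p = 0; p < K; ++p)
            nextB[p] = (r < n) ? B[r + (size_t)p * n] : 0.0;

        // 2) Compute on the current shared-memory tile of B.
        if (row < n)
            for (int j = 0; j < T1 && jt + j < n; ++j) {
                double a = A[row + (size_t)(jt + j) * n];
                for (int p = 0; p < K; ++p)
                    c[p] += a * Bs[j][p];
            }
        __syncthreads();                // everyone is done reading the current tile

        // 3) Publish the prefetched tile for the next iteration.
        for (int p = 0; p < K; ++p)
            Bs[threadIdx.x][p] = nextB[p];
        __syncthreads();
    }
    if (row < n)
        for (int p = 0; p < K; ++p)
            C[row + (size_t)p * n] = c[p];
}
```

TSM2 additionally prefetches the next elements of A into registers and loads/stores C only once, as the pipeline on the slide shows; those steps are omitted here to keep the sketch short.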

  16. Experimental evaluation:

     GPU Model    Micro-architecture   Memory   Peak performance   Peak memory bandwidth
     Tesla K40c   Kepler               12 GB    1430 GFLOPS        288 GB/s
     Tesla M40    Maxwell              24 GB    213 GFLOPS         288 GB/s
     Tesla P100   Pascal               16 GB    4600 GFLOPS        720 GB/s

  17. Experimental evaluation: Speedup (on Nvidia Tesla K40c)

  18. Experimental evaluation: Memory bandwidth (on Nvidia Tesla K40c)

  19. Experimental evaluation on Nvidia Tesla M40 and P100
     [Charts: results on Tesla M40 and Tesla P100.]
