Carnegie Mellon University

Extending the BLIS Analytical Model for GPUs
Elliot Binder, Claudia Kho, Doru Thom Popovici, Tze Meng Low
18 September 2018, BLIS Retreat
Many problems are "MMM"

Population Genomics · k-Nearest Neighbours · DNA Fingerprinting · All-Pairs Shortest Path

[Figure: a (# Samples) × (Length of DNA Sequence) matrix of DNA bases (A, C, T, G), illustrating how population-genomics data is cast as matrix-matrix multiplication]
Leveraging BLIS

§ Small micro-kernel
§ 5 parameters: m_r, n_r, k_c, m_c, n_c

[Figure: the loops around the BLIS micro-kernel and their mapping onto the memory hierarchy. The 4th loop partitions C and A into k_c-deep panels and packs B_p into B̃_p (L3 cache); the 3rd loop packs an m_c × k_c block A_i into Ã_i (L2 cache); the 2nd and 1st loops sweep n_r-wide and m_r-wide slivers (L1 cache); the micro-kernel performs an m_r × n_r rank-k_c update of C in registers, streaming data from main memory down to registers]
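As a rough illustration of the loop structure in the figure, the blocked loops around an m_r × n_r micro-kernel can be sketched in pure Python. The block sizes below are illustrative defaults, not tuned values, and packing is elided:

```python
def blis_gemm(A, B, C, nc=8, kc=4, mc=8, nr=2, mr=2):
    """Sketch of the BLIS loops around the micro-kernel (C += A * B).

    Loop order, outermost to innermost, mirrors the slide:
      jc: n in steps of nc   -- B panel destined for L3
      pc: k in steps of kc   -- "pack" B_p
      ic: m in steps of mc   -- "pack" A_i into L2
      jr: nc in steps of nr  -- sliver of B_p in L1
      ir: mc in steps of mr  -- sliver of A_i
      micro-kernel: m_r x n_r rank-k_c update held in registers
    """
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):
        for pc in range(0, k, kc):
            for ic in range(0, m, mc):
                for jr in range(jc, min(jc + nc, n), nr):
                    for ir in range(ic, min(ic + mc, m), mr):
                        # micro-kernel: accumulate an mr x nr block of C
                        for i in range(ir, min(ir + mr, m)):
                            for j in range(jr, min(jr + nr, n)):
                                acc = C[i][j]
                                for p in range(pc, min(pc + kc, k)):
                                    acc += A[i][p] * B[p][j]
                                C[i][j] = acc
    return C
```

Only the innermost triple loop changes across the variants discussed later; the outer blocking is what places each operand at the right level of the memory hierarchy.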
Population Genomics

Nikolaos Alachiotis, Thom Popovici, Tze Meng Low. 2016. Efficient Computation of Linkage Disequilibria as Dense Linear Algebra Operations. HiCOMB 2016.
Population Genomics

m_r n_r ≥ N_popcnt L_popcnt N_vec

Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Ortí. 2016. Analytical Modeling Is Enough for High-Performance BLIS. ACM Trans. Math. Softw. 43, 2, Article 12.
Nikolaos Alachiotis, Thom Popovici, Tze Meng Low. 2016. Efficient Computation of Linkage Disequilibria as Dense Linear Algebra Operations. HiCOMB 2016.
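The reduction of Linkage Disequilibrium to "MMM" can be sketched as follows: each row of genotype data is bit-packed into machine words, and the micro-kernel's multiply-add becomes AND + popcount. This is only an illustrative sketch of the idea, not the paper's implementation:

```python
def popcount_mmm(A_bits, B_bits):
    """C[i][j] counts samples where row i of A and row j of B both have a 1.

    A_bits, B_bits: lists of rows, each row a list of bit-packed ints
    (one int packs many samples). The inner product uses AND + popcount
    in place of the FMA of an ordinary GEMM.
    """
    C = [[0] * len(B_bits) for _ in A_bits]
    for i, a_row in enumerate(A_bits):
        for j, b_row in enumerate(B_bits):
            C[i][j] = sum(bin(a & b).count("1")
                          for a, b in zip(a_row, b_row))
    return C
```

Because popcount replaces the FMA in the micro-kernel, the model's latency-hiding condition uses L_popcnt and N_popcnt where the original BLIS model uses the FMA latency and unit count.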
Application of (Partial) Model

Convolutional Neural Nets · Large m-D FFTs · Finite Field Linear Algebra · Microcontrollers

[Figure: performance on Intel Haswell (% of peak; threads t0–t3 over two cores with private L1/L2 and shared L3) for layers 1–5 of AlexNet, comparing OpenBLAS + layout change, OpenBLAS GEMM, and a custom convolution]
[Figure: performance of different finite-field algorithms (4R-BLIS, m4ri with 4-bit tables, O(n³) naive) in bit ops/cycle against peak, for N = M = K ∈ {1024, 4096, 16384}]
"Can we do it on a GPU?"
Our initial attempt

[Figure: Linkage Disequilibrium on GTX 980, % of peak (0–100%) vs. K (64–1024), for problem sizes 2k-64, 2k-1024, 4k-64, and 4k-1024]
GTX 980 in a nutshell (1 of 16 SMs)

• 1 warp = 32 threads
• 4 clusters of 32 (SP) FMA cores
• Each cluster with 8 SFU cores (popcnt)
• 64K registers per SM (255 per thread)
• 48K/96K shared memory
GTX 980 in a nutshell (1 of 16 SMs)

• 1 warp = 32 threads
• 4 clusters of 32 (SP) FMA cores
• Each cluster with 8 SFU cores (popcnt)
• 64K registers per SM (255 per thread)
• 48K/96K shared memory
• Latency of FMA ≈ 8 cycles
• Latency of popcnt ≈ 12–13 cycles
• Popcnt appears to be pipelined
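A back-of-the-envelope check of the issue rates implied by the spec list above (a sketch only; it assumes one op per lane per cycle and ignores dual-issue and scheduling limits):

```python
# Per-cycle issue slots on the GTX 980, from the slide's spec list:
# 16 SMs; per SM: 4 clusters x 32 FMA cores and 4 clusters x 8 SFU cores.
sms = 16
fma_per_sm = 4 * 32   # 128 SP FMA lanes per SM
sfu_per_sm = 4 * 8    # 32 popcnt-capable SFU lanes per SM

print(sms * fma_per_sm)  # 2048 FMA slots per cycle across the chip
print(sms * sfu_per_sm)  # 512 popcnt slots per cycle across the chip
```

The 4:1 ratio of FMA to SFU lanes is why a popcount-bound kernel like LD has a much lower absolute peak than SGEMM on this part.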
Applying the model

• Minimum size of kernel: m_r n_r ≥ N_popcnt L_popcnt N_vec = 4 clusters × 8 cycles × 8 threads = 256
• Maximum size of kernel: 64K registers / 256 threads = 256 registers per thread
Applying the model

• Minimum size of kernel: m_r n_r ≥ N_popcnt L_popcnt N_vec = 4 clusters × 8 cycles × 8 threads = 256
• Maximum size of kernel: 64K registers / 256 threads = 256 registers per thread — over the 255 registers/thread limit
Applying the model

• Minimum size of kernel: m_r n_r ≥ N_popcnt L_popcnt N_vec = 4 clusters × 8 cycles × 8 threads = 256
• Maximum size of kernel: 64K registers / 256 threads = 256 registers per thread — over the 255 registers/thread limit
• At 1024 threads: 64K registers / 1024 threads = 64 registers per thread
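The slide's arithmetic can be checked directly. Values are read off the slides; note that L_popcnt = 8 cycles is the figure the slide plugs into the model, despite the measured 12–13 cycle latency quoted earlier:

```python
# Minimum micro-kernel size needed to hide popcnt latency (slide's numbers):
N_popcnt = 4   # SFU clusters per SM
L_popcnt = 8   # cycles of latency to hide
N_vec    = 8   # threads cooperating per cluster
print(N_popcnt * L_popcnt * N_vec)   # 256: minimum m_r x n_r

# Register-file ceiling: 64K 32-bit registers per SM, at most 255 per thread.
regs_per_sm = 64 * 1024
print(regs_per_sm // 256)    # 256 regs/thread at 256 threads -> over the 255 cap
print(regs_per_sm // 1024)   # 64 regs/thread at 1024 threads
```

The tension is visible in the numbers: the latency bound pushes the micro-kernel up to 256 elements, while at high occupancy each thread has only 64 registers to hold its share of the accumulators.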
Our initial attempt (revisited)

[Figure: the same Linkage Disequilibrium on GTX 980 plot as before: % of peak vs. K (64–1024) for 2k-64, 2k-1024, 4k-64, and 4k-1024]
With Shared Memory

[Figure: Linkage Disequilibrium on GTX 980, % of peak vs. K, for problem sizes 1k, 2k, and 4k]