
An Algorithmic Specific Code Generator for GEMM-Like Operations, by Richard Michael Veras



  1. An Algorithmic Specific Code Generator for GEMM-Like Operations. Richard Michael Veras, Carnegie Mellon. [Figure labels: Applications, Performance, Platforms]

  2. Want: Automatic High Performance GEMM-Like Operations. Model-driven approach for generating high-performance DGEMM; compiler techniques. Richard Veras (rveras@cmu.edu)

  3. GEMM Like Operations
     - Clustering (triangle): ~$A A^{T} A$
     - Centrality (betweenness): ~$A^{T} A$
     - Community Detection: ~$A^{k}$
     Semirings shown: $(\cdot, \wedge, +, 1, 0)$ and $(\cdot, +, \mathrm{MIN}, 0, 1)$.
     Check out GraphBLAS.

  4. High Performance Micro-Kernels
     - Cast micro kernel as outer product.
     - Use models to select from design space: enumerate all possible tilings given ISA.
     - Aggressively schedule and optimize.
     Veras, R., Smith, T., Low, T.M., Franchetti, F., van de Geijn, R. [CGO 2017 Submitted]
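As a rough illustration of casting the micro-kernel as a sequence of outer products, the C sketch below accumulates an MR x NR tile from k rank-1 updates. The names `MR`, `NR`, and `microkernel_outer`, the tile sizes, and the packed panel layout are assumptions made for this example; they are not taken from the slides or from the generated kernels.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 4, NR = 4 }; /* hypothetical register-tile sizes */

/* Computes C += A_panel * B_panel for one MR x NR tile of C.
 * a: packed so that step p supplies MR values (one column of A).
 * b: packed so that step p supplies NR values (one row of B).
 * c: row-major tile with leading dimension ldc. */
static void microkernel_outer(size_t k, const double *a, const double *b,
                              double *c, size_t ldc)
{
    double acc[MR][NR] = {{0.0}}; /* stands in for the register tile */

    for (size_t p = 0; p < k; p++) {
        /* one rank-1 (outer product) update per step p */
        for (size_t i = 0; i < MR; i++)
            for (size_t j = 0; j < NR; j++)
                acc[i][j] += a[p * MR + i] * b[p * NR + j];
    }

    /* write the accumulated tile back to memory */
    for (size_t i = 0; i < MR; i++)
        for (size_t j = 0; j < NR; j++)
            c[i * ldc + j] += acc[i][j];
}
```

For a contiguous 4 x 4 tile, a call such as `microkernel_outer(k, a, b, c, NR)` applies the update. A tuned kernel would keep `acc` in vector registers and choose the tile shape from the ISA, which is the part the models and the scheduler on the slides address.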

  5. High Performance Micro-Kernels
     [Performance plot; series: OpenBLAS, Our Generated, ATLAS]
     Veras, R., Smith, T., Low, T.M., Franchetti, F., van de Geijn, R. [CGO 2017 Submitted]

  6. Automating with Compiler Techniques
     - Subtle semiring changes impact performance.
     - Cast as ILP problem: minimizing stalls.
     Image credits: [extremetech.com] [cs.duke.edu] [massey.ac.nz]

  7. A Generator for GEMM-Like Kernels
     Input: algorithm, here Betweenness Centrality (Floyd-Warshall).
     - GEMM-like: $C \leftarrow AB + C$
     - Initialize: $C_{acc} \leftarrow 0$
     - Compute: $C_{acc} \leftarrow \sum_{p} a b^{T}$
     - Accumulate: $C \leftarrow f(C_{acc})$
     - Semiring: $(\cdot, +, \mathrm{MIN}, 0, 1)$
     Output: kernel from our design space.
     - Kernel sustains high throughput.
     - Tuned to the target architecture.

  8. Our GEMM Generator Pipeline
     Two phases: Find Efficient Instruction Mix (Algo), then Turn Mix into Efficient Code (Implementation).
     - User defines algorithm and semiring.
     - Enumerate space: block size, ISA.
     - Top candidate selected.
     - Template created for candidate.
     - Template transformed into optimized and scheduled code.

  9. From Math to Tiling
     Math: $C \leftarrow AB + C$, with Initialize $C_{acc} \leftarrow 0$, Compute $C_{acc} \leftarrow \sum_{p} a b^{T}$, Accumulate $C \leftarrow f(C_{acc})$.
     - Start with ISA.
     - Identify small outer products from ISA.
     - Enumerate space of outer products.
     - Select best mix with queueing model.
     Result: a high throughput mix.

  10. From Tiling to Template
      Selected Outer Product: [figure]
      Template, with each loop nest labeled by the step it implements:

          /* Initialize: C_acc <- 0 */
          for( ii = 0; ii < m_r; ii++ )
            for( jj = 0; jj < n_r; jj++ )
              init(c_reg, ii,jj );

          /* Compute: C_acc <- sum over p of a b^T */
          for( pp = 0; pp < k_b; pp++ )
            /* perform the outer products */
            for( i = 0; i < m_r; i+=m_s )
              for( j = 0; j < n_r; j+=n_s )
                for( ii = i; ii < i+m_s; ii++ ) {
                  get_a_elem(a_reg, ii,j );
                  for( jj = j; jj < j+n_s; jj++ ) {
                    get_b_elem(b_reg, ii,jj );
                    apply(c_reg,a_reg,b_reg,ii,jj,pp);
                  }
                }

          /* Accumulate: C <- f(C_acc) */
          for( ii = 0; ii < m_r; ii++ )
            for( jj = 0; jj < n_r; jj++ )
              accumulate(C, c_reg, ii, jj );

      Embedding Function (get_b):

          def get_b_element ( var array b_reg[][], ptr B, ii, jj )
            opts = {
              0: assign( b_reg[jj], vload(B,jj )),
              1: assign( b_reg[jj], shuffle(b_reg[jj-1]) ),
              2: assign( b_reg[jj], permute(b_reg[jj-1]) ),
              3: assign( b_reg[jj], shuffle(b_reg[jj-1]) )}
            if ii mod v = 0
              return opts[jj]
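The compute loop nest in the template covers the m_r x n_r register tile with smaller m_s x n_s outer products. The C sketch below shows one such pass (a single value of pp) under assumed sizes. The name `tiled_rank1_update` and the constants are hypothetical, and plain array reads stand in for the embedding functions `get_a_elem` and `get_b_elem`, which on the slides emit loads, shuffles, and permutes instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes: a 4 x 8 register tile covered by 2 x 4 outer products. */
enum { M_R = 4, N_R = 8, M_S = 2, N_S = 4 };

/* One step of the compute loop: c_reg += a * b^T, visited sub-tile by sub-tile. */
static void tiled_rank1_update(const double a[M_R], const double b[N_R],
                               double c_reg[M_R][N_R])
{
    for (size_t i = 0; i < M_R; i += M_S)
        for (size_t j = 0; j < N_R; j += N_S)
            /* one small M_S x N_S outer product */
            for (size_t ii = i; ii < i + M_S; ii++) {
                double a_elem = a[ii]; /* stand-in for get_a_elem */
                for (size_t jj = j; jj < j + N_S; jj++)
                    c_reg[ii][jj] += a_elem * b[jj]; /* stand-in for apply */
            }
}
```

Because M_S divides M_R and N_S divides N_R, every element of the tile is updated exactly once per pass, so the sub-tiled traversal computes the same result as a single full outer product. Only the instruction order changes, which is what the tiling search exploits.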

  11. Floyd-Warshall Embedded in GEMM
      DGEMM, semiring $(\mathbb{R}, *, +, 1, 0)$:

          /* Initialize: C_acc <- 0 */
          assign( c_reg[ii,jj], 0 )
          /* Compute: C_acc <- sum over p of a b^T */
          assign( c_reg[ii,jj], a_reg[ii,pp] * b_reg[pp,jj] + c_reg[ii,jj] )
          /* Accumulate: C <- f(C_acc) */
          assign( C[(ii,jj)], c_reg[ii,jj] + C[(ii,jj)] )

      Floyd-Warshall, semiring $(\cdot, +, \mathrm{MIN}, 0, \infty)$:

          /* Initialize */
          assign( c_reg[ii,jj], (ii==jj)? 0 : INFINITY )
          /* Compute */
          assign( c_reg[ii,jj], MIN(a_reg[ii,pp] + b_reg[pp,jj], c_reg[ii,jj]) )
          /* Accumulate */
          assign( C[(ii,jj)], MIN(c_reg[ii,jj], C[(ii,jj)]) )
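The side-by-side definitions above suggest that one loop nest can serve both operations once the three steps are supplied as parameters. The following C sketch illustrates that idea with function pointers. The type `gemm_like_ops`, the function `gemm_like`, and the two operation tables are hypothetical names introduced here, and the unblocked triple loop is a stand-in for the generated kernel, not a reproduction of it.

```c
#include <assert.h>
#include <math.h>   /* INFINITY */
#include <stddef.h>

/* The three user-supplied steps: initialize, compute, accumulate. */
typedef struct {
    double (*init)(size_t i, size_t j);
    double (*apply)(double a, double b, double c_acc);
    double (*accumulate)(double c_acc, double c);
} gemm_like_ops;

/* DGEMM: multiply, then add. */
static double dgemm_init(size_t i, size_t j) { (void)i; (void)j; return 0.0; }
static double dgemm_apply(double a, double b, double c) { return a * b + c; }
static double dgemm_accumulate(double c_acc, double c) { return c_acc + c; }

/* Floyd-Warshall step: add, then take the minimum. */
static double min2(double x, double y) { return x < y ? x : y; }
static double fw_init(size_t i, size_t j) { return i == j ? 0.0 : INFINITY; }
static double fw_apply(double a, double b, double c) { return min2(a + b, c); }
static double fw_accumulate(double c_acc, double c) { return min2(c_acc, c); }

static const gemm_like_ops DGEMM_OPS = { dgemm_init, dgemm_apply, dgemm_accumulate };
static const gemm_like_ops FW_OPS = { fw_init, fw_apply, fw_accumulate };

/* Row-major A (m x k), B (k x n), C (m x n); C is updated in place. */
static void gemm_like(const gemm_like_ops *ops, size_t m, size_t n, size_t k,
                      const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double c_acc = ops->init(i, j);
            for (size_t p = 0; p < k; p++)
                c_acc = ops->apply(A[i * k + p], B[p * n + j], c_acc);
            C[i * n + j] = ops->accumulate(c_acc, C[i * n + j]);
        }
}
```

Calling `gemm_like(&FW_OPS, n, n, n, D, D, C)` with `C` initialized to a copy of the distance matrix `D` relaxes every path through one intermediate vertex. Function pointers keep the sketch short; a generator would instead inline the three steps into the emitted kernel so they can be vectorized and scheduled.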

  12. Scheduling the Problem
      We have built the kernel code; now we need to schedule. Static scheduling still matters on OOO processors.
      Pipeline for scheduling:
      - Built kernel + embedding func.
      - Express as decision vars.
      - Formulate constraints over vars.
      - Minimize with ILP solver.
      - Scheduled kernel code.

  13. OASIC approach for ILP Scheduling
      Representing Design Space as Polytope. Decision variable $x^{k}_{n,t}$: instruction $n$ (instruction label) is executed on functional unit $k$ at time step (cycle) $t$.
      Expressing Constraints in terms of X:
      - Every instruction $n$ is executed once: $\sum_{k} \sum_{t} x^{k}_{n,t} = 1$
      - At any time step $t$, functional unit $k$ is used no more than it can: $\sum_{n} x^{k}_{n,t} \le R^{k}$
      - If $t_m$ depends on $t_n$, then $t_m$ will not execute until $l$ cycles after $t_n$. [Inequality not recoverable from the extracted slide text]
      But wait, there's more!
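To make the three constraints concrete, the C sketch below checks whether a given 0/1 assignment satisfies them. It only verifies a candidate schedule and does not solve the ILP. The flattened array layout, the `dep_t` record, the function names, and the reading of the dependency rule as "issue cycle of m is at least issue cycle of n plus the latency" are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dependency record: instruction m consumes the result of n. */
typedef struct { size_t m, n, latency; } dep_t;

/* x is a flattened N x K x T array: x[n][k][t] != 0 means instruction n
 * issues on functional unit k at cycle t. */
static int x_at(const int *x, size_t K, size_t T, size_t n, size_t k, size_t t)
{
    return x[(n * K + k) * T + t] != 0;
}

/* Returns 1 and stores the issue cycle if n is scheduled exactly once. */
static int issue_cycle(const int *x, size_t K, size_t T, size_t n, size_t *cycle)
{
    int count = 0;
    for (size_t k = 0; k < K; k++)
        for (size_t t = 0; t < T; t++)
            if (x_at(x, K, T, n, k, t)) { count++; *cycle = t; }
    return count == 1;
}

static int schedule_is_feasible(const int *x, size_t N, size_t K, size_t T,
                                const int *capacity,
                                const dep_t *deps, size_t ndeps)
{
    size_t tm = 0, tn = 0;

    /* Constraint 1: every instruction is executed exactly once. */
    for (size_t n = 0; n < N; n++)
        if (!issue_cycle(x, K, T, n, &tm)) return 0;

    /* Constraint 2: unit k issues at most capacity[k] instructions per cycle. */
    for (size_t k = 0; k < K; k++)
        for (size_t t = 0; t < T; t++) {
            int used = 0;
            for (size_t n = 0; n < N; n++) used += x_at(x, K, T, n, k, t);
            if (used > capacity[k]) return 0;
        }

    /* Constraint 3: a dependent instruction waits for the producer's latency. */
    for (size_t d = 0; d < ndeps; d++) {
        issue_cycle(x, K, T, deps[d].m, &tm);
        issue_cycle(x, K, T, deps[d].n, &tn);
        if (tm < tn + deps[d].latency) return 0;
    }
    return 1;
}
```

An ILP solver searches over all such assignments for one that minimizes an objective such as stalls; a checker like this is mainly useful for validating the solver's output before code is emitted.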

  14. Emitting The Code
      Code is emitted, and the code is now scheduled:

          for( pp = 0; pp < k_b; pp+=KUNR ) {
            /* STEADY STATE CODE */
            VLOAD_IA(GET_A_ADDR(0),GET_A_REG(0))
            VLOAD_IA(GET_A_ADDR(1),GET_A_REG(1))
            VLOAD_IA(GET_B_ADDR(0),GET_B_REG(0))
            VSHUFFLE_IA(GET_B_REG(0),GET_B_REG(1))
            VFMA(GET_A_REG(0),GET_B_REG(0),GET_C_REG(0,0))
            VFMA(GET_A_REG(0),GET_B_REG(1),GET_C_REG(0,1))
            VPERM2F128_IA(0x01,GET_B_REG(1),GET_B_REG(2))
            VSHUFFLE_IA(0x05,GET_B_REG(2),GET_B_REG(3))
            VFMA(GET_A_REG(1),GET_B_REG(0),GET_C_REG(0,0))
            VFMA(GET_A_REG(1),GET_B_REG(1),GET_C_REG(0,1))
          }

      Is this necessary? Need custom ANSI C compliant SIMD wrappers to schedule in the compiler:

          #define VADD(srca,srcb,dest) \
            asm volatile( \
              "vaddpd %[vsrca],%[vsrcb], %[vdest]" \
              : [vdest] "=x"(dest) \
              : [vsrca] "x"(srca), [vsrcb] "x"(srcb));

  15. Putting it All Together
      Have an operation that we can express like GEMM: $C \leftarrow AB + C$, with Initialize $C_{acc} \leftarrow 0$, Compute $C_{acc} \leftarrow \sum_{p} a b^{T}$, Accumulate $C \leftarrow f(C_{acc})$.
      Run it through our pipeline: [figure]

  16. Moving Forward
      - Clustering (triangle): ~$A A^{T} A$
      - Centrality (betweenness): ~$A^{T} A$
      - Community Detection: ~$A^{k}$

  17. Summary
      - There exists a large class of GEMM-like operations.
      - Obtaining DGEMM-level performance for each of these operations requires automation.
      - We have a systematic approach for automatically generating DGEMM.
      - We are extending it by allowing the user to define a semiring with an initialize and an accumulate function.
