ERLANGEN REGIONAL COMPUTING CENTER Analytical Tool-Supported Modeling of Streaming and Stencil Loops Georg Hager, Julian Hammer Erlangen Regional Computing Center (RRZE) Scalable Tools Workshop August 3-6, 2015, Lake Tahoe, CA
RRZE LIKWID tiny.cc/LIKWID GHOST tiny.cc/GHOST Performance Engineering http://blogs.fau.de/... hager/talks/nlpe Automated loop performance model construction | G. Hager 2
Motivation DAXPY on Sandy Bridge core 2D-5pt stencil on Sandy Bridge core Loop length Inner dimension w/ in- memory data Automated loop performance model construction | G. Hager 3
THE ECM MODEL Registers L1 L2 L3 MEM
ECM model – the rules 4 cy 8 cy 3 cy 43 cy 1. LOADs in the L1 cache do not STORE MULT … 6 cy LOAD ADD overlap with any other data L2-L1 transfer in the memory hierarchy 9 cy L3-L2 9 cy 2. Everything else in the core overlaps perfectly with data transfers (STOREs show some Mem-L3 19 cy non-overlap) time [cy] Example: 3. The scaling limit is set by the ratio of Single-core (data in L1): 8 cy (ADD) Single-core (data in memory): # cycles per CL overall 6+9+9+19 cy = 43 cy # cycles per CL at the bottleneck Scaling limit: 43 / 19 = 2.3 cores Automated loop performance model construction | G. Hager 5
ECM model – composition ECM predicted time 𝑈 𝐹𝐷𝑁 = maximum of overlapping time and sum of all other contributions 𝑈 𝑃𝑀 𝑈 𝑑𝑝𝑠𝑓 = max(𝑈 𝑜𝑃𝑀 , 𝑈 𝑃𝑀 ) 𝑈 𝑜𝑃𝑀 LOAD 𝑈 ADD 𝑈 𝐹𝐷𝑁 = max(𝑈 𝑜𝑃𝑀 + 𝑈 𝑒𝑏𝑢𝑏 , 𝑈 𝑃𝑀 ) 𝑑𝑝𝑠𝑓 L2-L1 𝑈 𝑀1𝑀2 Shorthand notation for time contributions: L3-L2 𝑈 𝑀2𝑀3 𝑁𝑓𝑛 𝑈 𝐹𝐷𝑁 𝑈 𝑈 𝑈 𝑀1𝑀2 𝑈 𝑀2𝑀3 | 𝑈 𝑃𝑀 𝑜𝑃𝑀 𝑀3𝑁𝑓𝑛 𝑈 𝑒𝑏𝑢𝑏 Mem-L3 # cy invariant to # cy changes w/ 𝑈 𝑀3𝑁𝑓𝑛 clock speed clock speed Example from previous slide: 8 6 9 9 | 19 cy Automated loop performance model construction | G. Hager 6
ECM model – prediction Notation for cycle predictions in different memory hierarchy levels: 𝑈 𝑃𝑀 𝑀1 𝑀2 𝑀3 𝑁𝑓𝑛 𝑈 𝐹𝐷𝑁 𝑈 𝐹𝐷𝑁 𝑈 𝐹𝐷𝑁 𝑈 𝐹𝐷𝑁 𝑈 𝑜𝑃𝑀 LOAD 𝑀1 𝑈 𝐹𝐷𝑁 ADD 𝑀2 𝑈 𝐹𝐷𝑁 L2-L1 𝑀1 𝑈 𝐹𝐷𝑁 = 𝑈 𝑑𝑝𝑠𝑓 = max 𝑈 𝑜𝑃𝑀 , 𝑈 𝑃𝑀 𝑀3 𝑈 𝐹𝐷𝑁 𝑀2 𝑈 𝐹𝐷𝑁 = max 𝑈 𝑜𝑃𝑀 + 𝑈 𝑀1𝑀2 , 𝑈 𝑃𝑀 L3-L2 𝑁𝑓𝑛 𝑈 𝐹𝐷𝑁 𝑀3 𝑈 𝐹𝐷𝑁 = max 𝑈 𝑜𝑃𝑀 + 𝑈 𝑀1𝑀2 + 𝑈 𝑀2𝑀3 , 𝑈 𝑃𝑀 𝑁𝑓𝑛 = max 𝑈 𝑜𝑃𝑀 + 𝑈 𝑀1𝑀2 + 𝑈 𝑀2𝑀3 + 𝑈 𝑀3𝑁𝑓𝑛 , 𝑈 𝑃𝑀 Mem-L3 𝑈 𝐹𝐷𝑁 Substitute by Example: 8 15 24 43 cy commas Roofline 8.6 16.2 26 47 cy Experimental data (measured) notation: Automated loop performance model construction | G. Hager 7
ECM model – saturation Main assumption: Performance scaling is linear until a bandwidth bottleneck ( 𝑐 𝑇 ) is hit LOAD ADD Performance vs. cores (Memory BN): L2-L1 𝑁𝑓𝑛 𝑁𝑓𝑛 , 𝑐 𝑇 𝑄 𝐹𝐷𝑁 𝑜 = min 𝑜𝑄 𝐹𝐷𝑁 𝑁𝑓𝑛 𝐶 𝐷 L3-L2 𝑁𝑓𝑛 𝑈 𝐹𝐷𝑁 Number of cores at saturation: Mem-L3 𝑁𝑓𝑛 𝑐 𝑇 𝐶 𝐷 𝑈 𝐹𝐷𝑁 𝑈 𝑀3𝑁𝑓𝑛 𝑜 𝑇 = = 𝑁𝑓𝑛 𝑈 𝑄 𝑀3𝑁𝑓𝑛 𝐹𝐷𝑁 Example: cy ⟹ 𝑜 𝑇 = 43 8 6 9 9 | 19 cy, 8 15 24 43 19 = 3 Automated loop performance model construction | G. Hager 9
How do we get the numbers? Basic information about hardware Registers capabilities: 𝑈 𝑑𝑝𝑠𝑓 : Code analysis, Intel IACA In-core limitations L1 Throughput limits:µops, LD/ST, ADD/MULT per cycle Pipeline depths L2 Cache hierarchy 𝑗 : 𝑈 𝑀1𝑀2 , 𝑈 𝑀2𝑀3 , 𝑈 𝑀3𝑁𝑓𝑛 , 𝐶 𝐷 ECM : Cycles per CL transfer Data flow analysis RL : measured max bandwidths for all L3 cache levels, core counts Memory interface ECM : measured saturated BW MEM RL : measured max bandwidths for all core counts Automated loop performance model construction | G. Hager 11
2D 5-PT JACOBI STENCIL (DOUBLE PRECISION) for(j=1; j < Nj-1; ++j) for(i=1; i < Ni-1; ++i) b[j][i] = (a[ j ][i-1] + a[ j ][i+1] + a[j-1][ i ] + a[j+1][ i ] ) * s; Unit of work (1 CL): 8 LUPs Data transfer per unit: 5 CL if layer condition violated in higher cache level 3 CL if layer condition satisfied
ECM Model for 2D Jacobi (AVX) on SNB 2.7 GHz Radius- 𝑠 stencil (2 𝑠 +1) layers have to fit Cache 𝑙 has size 𝐷 𝑙 Layer condition: for(j=1; j < Nj-1; ++j) (2𝑠 + 1) ∙ 𝑂 𝑗 ∙ 8 B < 𝐷 𝑙 for(i=1; i < Ni-1; ++i) b[j][i] = (a[ j ][i-1] + a[ j ][i+1] 2 + a[j-1][ i ] + a[j+1][ i ] ) * s; 2D 5-pt: 𝑠 = 1 LC = layer condition satisfied in … Automated loop performance model construction | G. Hager 18
2D 5-pt serial in-memory performance and layer conditions SNB 2.7 GHz Automated loop performance model construction | G. Hager 19
3D LONG-RANGE STENCIL (SINGLE PRECISION) #pragma omp parallel for for(int k=4; k < N-4; k++) { for(int j=4; j < N-4; j++) { for(int i=4; i < N-4; i++) { float lap = c0 * %V%[k][j][i] + c1 * ( V[ k ][ j ][i+1]+ V[ k ][ j ][i-1]) + c1 * ( V[ k ][j+1][ i ]+ V[ k ][j-1][ i ]) + c1 * ( V[k+1][ j ][ i ]+ V[k-1][ j ][ i ]) ... Source: + c4 * ( V[ k ][ j ][i+4]+ V[ k ][ j ][i-4]) http://goo.gl/dqOlnI + c4 * ( V[ k ][j+4][ i ]+ V[ k ][j-4][ i ]) + c4 * ( V[k+4][ j ][ i ]+ V[k-4][ j ][ i ]); U[k][j][i] = 2.f * V[k][j][i] - U[k][j][i] + ROC[k][j][i] * lap; }}}
3D long-range SP stencil ECM model Layer condition in L3 at problem size 𝑂 𝑗 × 𝑂 𝑘 × 𝑂 𝑙 : 𝑘 ∙ 𝑜 𝑢ℎ𝑠𝑓𝑏𝑒𝑡 ∙ 4 B < 𝐷 3 9 ∙ 𝑂 𝑗 ∙ 𝑐 2 | | | 68 62 24 24 17 cy 68 86 110 127 cy ECM Model: 𝑈 𝑀3𝑁𝑓𝑛 plays minor part 127 Saturation at 𝑜 𝑡 = = 8 cores. 17 Consequences: Temporal blocking will not yield substantial speedup Improve low-level code first (semi-stencil …?) Automated loop performance model construction | G. Hager 29
3D long-range SP stencil results (SNB) Roofline too optimistic due to overlapping assumption Automated loop performance model construction | G. Hager 30
KERNCRAFT First steps towards automated model construction
kerncraft: ECM/Roofline modeling toolkit Automated loop performance model construction | G. Hager 32
Towards automated model generation Manual Automated Registers Code inspection IACA or and/or IACA direct analysis L1 Reuse distance Traffic analysis w/ analysis, cache layer conditions simulation L2 L3 HW limits: micro- HW limits: benchmarking likwid-bench & docs & docs MEM Automated loop performance model construction | G. Hager 33
kerncraft vmovsd (%rsi,%rbx,8), %xmm1 #define N 1000 #define M 2000 vaddsd 16(%rsi,%rbx,8), %xmm1, %xmm2 vaddsd 8(%rdx,%rbx,8), %xmm2, %xmm3 for(j=1; j < N-1; ++j) Compiler vaddsd 8(%rcx,%rbx,8), %xmm3, %xmm4 for(i=1; i < M-1; ++i) vaddsd 8(%r8,%rbx,8), %xmm4, %xmm5 b[j][i] = (a[ j ][i-1] + a[ j ][i+1] vaddsd 8(%r9,%rbx,8), %xmm5, %xmm6 + a[j-1][ i ] + a[j+1][ i ] ) * s; vmulsd %xmm6, %xmm0, %xmm7 IACA pycparser TP/CP 𝑈 𝑃𝑀 , 𝑈 𝑜𝑃𝑀 AST Registers LOAD Cache simulator/ ADD L2- L1 L1 reuse distance Roofline / ECM model L3- L2 L2 𝑈 𝑁𝑓𝑛 𝐹𝐷𝑁 analysis Mem-L3 L3 𝑈 𝑀3𝑁𝑓𝑛 MEM 𝑊 𝑐 𝑈 𝑀1𝑀2 , … , 𝑈 𝑀3𝑁𝑓𝑛 Traffic volumes 𝑈 = Machine description docs likwid-bench (yaml file) Automated loop performance model construction | G. Hager 34
Restrictions on code input (selection) Only doubles and ints supported Array declarations may use fixed sizes or constants, with an optional offset (e.g., double u1[M+3][N-2][23], but not double u[M*N]) Only the innermost loop may contain assignment statements Array references must either use index variables from for-loops, with optional addition or subtraction, constant or fixed values All for-loops must use a declaration as initial statement and an increment or a decrement assignment operation as the next statement (e.g., i++, i -= 2) Function calls and the use of pointers is not allowed anywhere in the kernel code Write access to any data is assumed to use “normal” STORE instructions (e.g., no non-temporal stores) Automated loop performance model construction | G. Hager 35
Operating modes ECM Full ECM model including in-core analysis ECMData Data traffic analysis only (works on any system) ECMCPU In-core part of ECM model (IACA) Roofline Full Roofline model using CPU peak performance as in-core limit RooflineIACA Full Roofline model using IACA analysis for in-core Benchmark Run the actual benchmark for model validation Automated loop performance model construction | G. Hager 36
Recommend
More recommend