ERLANGEN REGIONAL COMPUTING CENTER Analytical Tool-Supported - PowerPoint PPT Presentation

ERLANGEN REGIONAL COMPUTING CENTER Analytical Tool-Supported Modeling of Streaming and Stencil Loops Georg Hager, Julian Hammer Erlangen Regional Computing Center (RRZE) Scalable Tools Workshop August 3-6, 2015, Lake Tahoe, CA

RRZE  LIKWID tiny.cc/LIKWID  GHOST tiny.cc/GHOST  Performance Engineering http://blogs.fau.de/... hager/talks/nlpe Automated loop performance model construction | G. Hager 2

Motivation DAXPY on Sandy Bridge core 2D-5pt stencil on Sandy Bridge core Loop length Inner dimension w/ in- memory data Automated loop performance model construction | G. Hager 3

THE ECM MODEL Registers L1 L2 L3 MEM

ECM model – the rules 4 cy 8 cy 3 cy 43 cy 1. LOADs in the L1 cache do not STORE MULT … 6 cy LOAD ADD overlap with any other data L2-L1 transfer in the memory hierarchy 9 cy L3-L2 9 cy 2. Everything else in the core overlaps perfectly with data transfers (STOREs show some Mem-L3 19 cy non-overlap) time [cy] Example: 3. The scaling limit is set by the ratio of Single-core (data in L1): 8 cy (ADD) Single-core (data in memory): # cycles per CL overall 6+9+9+19 cy = 43 cy # cycles per CL at the bottleneck Scaling limit: 43 / 19 = 2.3 cores Automated loop performance model construction | G. Hager 5

ECM model – composition ECM predicted time 𝑈 𝐹𝐷𝑁 = maximum of overlapping time and sum of all other contributions 𝑈 𝑃𝑀 𝑈 𝑑𝑝𝑠𝑓 = max(𝑈 𝑜𝑃𝑀 , 𝑈 𝑃𝑀 ) 𝑈 𝑜𝑃𝑀 LOAD 𝑈 ADD 𝑈 𝐹𝐷𝑁 = max(𝑈 𝑜𝑃𝑀 + 𝑈 𝑒𝑏𝑢𝑏 , 𝑈 𝑃𝑀 ) 𝑑𝑝𝑠𝑓 L2-L1 𝑈 𝑀1𝑀2 Shorthand notation for time contributions: L3-L2 𝑈 𝑀2𝑀3 𝑁𝑓𝑛 𝑈 𝐹𝐷𝑁 𝑈 𝑈 𝑈 𝑀1𝑀2 𝑈 𝑀2𝑀3 | 𝑈 𝑃𝑀 𝑜𝑃𝑀 𝑀3𝑁𝑓𝑛 𝑈 𝑒𝑏𝑢𝑏 Mem-L3 # cy invariant to # cy changes w/ 𝑈 𝑀3𝑁𝑓𝑛 clock speed clock speed Example from previous slide: 8 6 9 9 | 19 cy Automated loop performance model construction | G. Hager 6

ECM model – prediction Notation for cycle predictions in different memory hierarchy levels: 𝑈 𝑃𝑀 𝑀1 𝑀2 𝑀3 𝑁𝑓𝑛 𝑈 𝐹𝐷𝑁 𝑈 𝐹𝐷𝑁 𝑈 𝐹𝐷𝑁 𝑈 𝐹𝐷𝑁 𝑈 𝑜𝑃𝑀 LOAD 𝑀1 𝑈 𝐹𝐷𝑁 ADD 𝑀2 𝑈 𝐹𝐷𝑁 L2-L1 𝑀1 𝑈 𝐹𝐷𝑁 = 𝑈 𝑑𝑝𝑠𝑓 = max 𝑈 𝑜𝑃𝑀 , 𝑈 𝑃𝑀 𝑀3 𝑈 𝐹𝐷𝑁 𝑀2 𝑈 𝐹𝐷𝑁 = max 𝑈 𝑜𝑃𝑀 + 𝑈 𝑀1𝑀2 , 𝑈 𝑃𝑀 L3-L2 𝑁𝑓𝑛 𝑈 𝐹𝐷𝑁 𝑀3 𝑈 𝐹𝐷𝑁 = max 𝑈 𝑜𝑃𝑀 + 𝑈 𝑀1𝑀2 + 𝑈 𝑀2𝑀3 , 𝑈 𝑃𝑀 𝑁𝑓𝑛 = max 𝑈 𝑜𝑃𝑀 + 𝑈 𝑀1𝑀2 + 𝑈 𝑀2𝑀3 + 𝑈 𝑀3𝑁𝑓𝑛 , 𝑈 𝑃𝑀 Mem-L3 𝑈 𝐹𝐷𝑁 Substitute by Example: 8 15 24 43 cy commas  Roofline 8.6 16.2 26 47 cy Experimental data (measured) notation: Automated loop performance model construction | G. Hager 7

ECM model – saturation Main assumption: Performance scaling is linear until a bandwidth bottleneck ( 𝑐 𝑇 ) is hit LOAD ADD Performance vs. cores (Memory BN): L2-L1 𝑁𝑓𝑛 𝑁𝑓𝑛 , 𝑐 𝑇 𝑄 𝐹𝐷𝑁 𝑜 = min 𝑜𝑄 𝐹𝐷𝑁 𝑁𝑓𝑛 𝐶 𝐷 L3-L2 𝑁𝑓𝑛 𝑈 𝐹𝐷𝑁 Number of cores at saturation: Mem-L3 𝑁𝑓𝑛 𝑐 𝑇 𝐶 𝐷 𝑈 𝐹𝐷𝑁 𝑈 𝑀3𝑁𝑓𝑛 𝑜 𝑇 = = 𝑁𝑓𝑛 𝑈 𝑄 𝑀3𝑁𝑓𝑛 𝐹𝐷𝑁 Example: cy ⟹ 𝑜 𝑇 = 43 8 6 9 9 | 19 cy, 8 15 24 43 19 = 3 Automated loop performance model construction | G. Hager 9

How do we get the numbers? Basic information about hardware Registers capabilities: 𝑈 𝑑𝑝𝑠𝑓 : Code analysis, Intel IACA  In-core limitations L1  Throughput limits:µops, LD/ST, ADD/MULT per cycle  Pipeline depths L2  Cache hierarchy 𝑗 : 𝑈 𝑀1𝑀2 , 𝑈 𝑀2𝑀3 , 𝑈 𝑀3𝑁𝑓𝑛 , 𝐶 𝐷  ECM : Cycles per CL transfer Data flow analysis  RL : measured max bandwidths for all L3 cache levels, core counts  Memory interface  ECM : measured saturated BW MEM  RL : measured max bandwidths for all core counts Automated loop performance model construction | G. Hager 11

2D 5-PT JACOBI STENCIL (DOUBLE PRECISION) for(j=1; j < Nj-1; ++j) for(i=1; i < Ni-1; ++i) b[j][i] = (a[ j ][i-1] + a[ j ][i+1] + a[j-1][ i ] + a[j+1][ i ] ) * s; Unit of work (1 CL): 8 LUPs Data transfer per unit:  5 CL if layer condition violated in higher cache level  3 CL if layer condition satisfied

ECM Model for 2D Jacobi (AVX) on SNB 2.7 GHz Radius- 𝑠 stencil  (2 𝑠 +1) layers have to fit Cache 𝑙 has size 𝐷 𝑙 Layer condition: for(j=1; j < Nj-1; ++j) (2𝑠 + 1) ∙ 𝑂 𝑗 ∙ 8 B < 𝐷 𝑙 for(i=1; i < Ni-1; ++i) b[j][i] = (a[ j ][i-1] + a[ j ][i+1] 2 + a[j-1][ i ] + a[j+1][ i ] ) * s; 2D 5-pt: 𝑠 = 1 LC = layer condition satisfied in … Automated loop performance model construction | G. Hager 18

2D 5-pt serial in-memory performance and layer conditions SNB 2.7 GHz Automated loop performance model construction | G. Hager 19

3D LONG-RANGE STENCIL (SINGLE PRECISION) #pragma omp parallel for for(int k=4; k < N-4; k++) { for(int j=4; j < N-4; j++) { for(int i=4; i < N-4; i++) { float lap = c0 * %V%[k][j][i] + c1 * ( V[ k ][ j ][i+1]+ V[ k ][ j ][i-1]) + c1 * ( V[ k ][j+1][ i ]+ V[ k ][j-1][ i ]) + c1 * ( V[k+1][ j ][ i ]+ V[k-1][ j ][ i ]) ... Source: + c4 * ( V[ k ][ j ][i+4]+ V[ k ][ j ][i-4]) http://goo.gl/dqOlnI + c4 * ( V[ k ][j+4][ i ]+ V[ k ][j-4][ i ]) + c4 * ( V[k+4][ j ][ i ]+ V[k-4][ j ][ i ]); U[k][j][i] = 2.f * V[k][j][i] - U[k][j][i] + ROC[k][j][i] * lap; }}}

3D long-range SP stencil ECM model Layer condition in L3 at problem size 𝑂 𝑗 × 𝑂 𝑘 × 𝑂 𝑙 : 𝑘 ∙ 𝑜 𝑢ℎ𝑠𝑓𝑏𝑒𝑡 ∙ 4 B < 𝐷 3 9 ∙ 𝑂 𝑗 ∙ 𝑐 2  | | | 68 62 24 24 17 cy 68 86 110 127 cy ECM Model: 𝑈 𝑀3𝑁𝑓𝑛 plays minor part 127 Saturation at 𝑜 𝑡 = = 8 cores. 17 Consequences:  Temporal blocking will not yield substantial speedup  Improve low-level code first (semi-stencil …?) Automated loop performance model construction | G. Hager 29

3D long-range SP stencil results (SNB) Roofline too optimistic due to overlapping assumption Automated loop performance model construction | G. Hager 30

KERNCRAFT First steps towards automated model construction

kerncraft: ECM/Roofline modeling toolkit Automated loop performance model construction | G. Hager 32

Towards automated model generation Manual Automated Registers Code inspection IACA or and/or IACA direct analysis L1 Reuse distance Traffic analysis w/ analysis, cache layer conditions simulation L2 L3 HW limits: micro- HW limits: benchmarking likwid-bench & docs & docs MEM Automated loop performance model construction | G. Hager 33

kerncraft vmovsd (%rsi,%rbx,8), %xmm1 #define N 1000 #define M 2000 vaddsd 16(%rsi,%rbx,8), %xmm1, %xmm2 vaddsd 8(%rdx,%rbx,8), %xmm2, %xmm3 for(j=1; j < N-1; ++j) Compiler vaddsd 8(%rcx,%rbx,8), %xmm3, %xmm4 for(i=1; i < M-1; ++i) vaddsd 8(%r8,%rbx,8), %xmm4, %xmm5 b[j][i] = (a[ j ][i-1] + a[ j ][i+1] vaddsd 8(%r9,%rbx,8), %xmm5, %xmm6 + a[j-1][ i ] + a[j+1][ i ] ) * s; vmulsd %xmm6, %xmm0, %xmm7 IACA pycparser TP/CP 𝑈 𝑃𝑀 , 𝑈 𝑜𝑃𝑀 AST Registers LOAD Cache simulator/ ADD L2- L1 L1 reuse distance Roofline / ECM model L3- L2 L2 𝑈 𝑁𝑓𝑛 𝐹𝐷𝑁 analysis Mem-L3 L3 𝑈 𝑀3𝑁𝑓𝑛 MEM 𝑊 𝑐 𝑈 𝑀1𝑀2 , … , 𝑈 𝑀3𝑁𝑓𝑛 Traffic volumes 𝑈 = Machine description docs likwid-bench (yaml file) Automated loop performance model construction | G. Hager 34

Restrictions on code input (selection)  Only doubles and ints supported  Array declarations may use fixed sizes or constants, with an optional offset (e.g., double u1[M+3][N-2][23], but not double u[M*N])  Only the innermost loop may contain assignment statements  Array references must either use index variables from for-loops, with optional addition or subtraction, constant or fixed values  All for-loops must use a declaration as initial statement and an increment or a decrement assignment operation as the next statement (e.g., i++, i -= 2)  Function calls and the use of pointers is not allowed anywhere in the kernel code  Write access to any data is assumed to use “normal” STORE instructions (e.g., no non-temporal stores) Automated loop performance model construction | G. Hager 35

Operating modes  ECM  Full ECM model including in-core analysis  ECMData  Data traffic analysis only (works on any system)  ECMCPU  In-core part of ECM model (IACA)  Roofline  Full Roofline model using CPU peak performance as in-core limit  RooflineIACA  Full Roofline model using IACA analysis for in-core  Benchmark  Run the actual benchmark for model validation Automated loop performance model construction | G. Hager 36

ERLANGEN REGIONAL COMPUTING CENTER Analytical Tool-Supported - PowerPoint PPT Presentation

ERLANGEN REGIONAL COMPUTING CENTER Analytical Tool-Supported Modeling of Streaming and Stencil Loops Georg Hager, Julian Hammer Erlangen Regional Computing Center (RRZE) Scalable Tools Workshop August 3-6, 2015, Lake Tahoe, CA RRZE

Erlangen Plans with SiPMs Erlangen Plans with SiPMs Albert Lehmann, Erlangen University

LIKWID Lightweight performance tools J. Treibig Erlangen Regional Computing Center University

Paraneoplastic Syndromes Case 2 Abbas Agaimy, MD University of Erlangen, Germany

Is 2.44 trillion unknowns the largest finite element system that can be solved today? U. Rde

18. & 19. JUNE 2020 atec.leaguer.com.cn OVERVIEW Exclusive Panel Evaluation in Erlangen

Music Representations Meinard Mller International Audio Laboratories Erlangen

Tempo and Beat Tracking Meinard Mller International Audio Laboratories Erlangen

Levenberg-Marquardt Minimization Jrn Wilms Remeis-Sternwarte & ECAP Universitt

Music Structure Analysis Meinard Mller International Audio Laboratories Erlangen

Audio Decomposition Meinard Mller International Audio Laboratories Erlangen

Introduction Meinard Mller International Audio Laboratories Erlangen

Music Synchronization Meinard Mller International Audio Laboratories Erlangen

Detection and (Linear) Data Model Jrn Wilms Remeis-Sternwarte & ECAP Universitt

Regional Measure 3 May 16, 2017 SFMTA Board of Directors Regional Measure 3 Prior Regional

Regional Networks Regional Networks Rural Creative Placemaking Summit Regional

International, regional and sub- -regional regional International, regional and sub

Superinflation in Loop Quantum Cosmology Nelson Nunes ITP , University of Heidelberg

Differential Cryptanalysis of Keccak Variants olbl 1 , Florian Mendel 2 , Stefan K affer 2

Project 1: Streamlined identification of PAHs/PACs using ultracompact spectroscopy and machine

P ETRI N ETS AS P ARTIAL O RDER S EMANTICS FOR B IOCHEMICAL N ETWORKS Monika Heiner Ina Koch

CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 Instructor: Yizhou Sun

Agenda Introductions & Housekeeping Overview Basics of FEMA Form 81-31 NORFMA

Multi-channel streaming for medical and multimedia industry applications and multimedia industry

Search for production of a Higgs boson and a single top quark Ken Bloom for the CMS

Explore More Topics

Sambuz

Useful Links

Newsletter

Mail Us

ERLANGEN REGIONAL COMPUTING CENTER Analytical Tool-Supported - PowerPoint PPT Presentation

ERLANGEN REGIONAL COMPUTING CENTER Analytical Tool-Supported Modeling of Streaming and Stencil Loops Georg Hager, Julian Hammer Erlangen Regional Computing Center (RRZE) Scalable Tools Workshop August 3-6, 2015, Lake Tahoe, CA RRZE

Erlangen Plans with SiPMs Erlangen Plans with SiPMs Albert Lehmann, Erlangen University

LIKWID Lightweight performance tools J. Treibig Erlangen Regional Computing Center University

Paraneoplastic Syndromes Case 2 Abbas Agaimy, MD University of Erlangen, Germany

Is 2.44 trillion unknowns the largest finite element system that can be solved today? U. Rde

18. &amp; 19. JUNE 2020 atec.leaguer.com.cn OVERVIEW Exclusive Panel Evaluation in Erlangen

Music Representations Meinard Mller International Audio Laboratories Erlangen

Tempo and Beat Tracking Meinard Mller International Audio Laboratories Erlangen

Levenberg-Marquardt Minimization Jrn Wilms Remeis-Sternwarte &amp; ECAP Universitt

Music Structure Analysis Meinard Mller International Audio Laboratories Erlangen

Audio Decomposition Meinard Mller International Audio Laboratories Erlangen

Introduction Meinard Mller International Audio Laboratories Erlangen

Music Synchronization Meinard Mller International Audio Laboratories Erlangen

Detection and (Linear) Data Model Jrn Wilms Remeis-Sternwarte &amp; ECAP Universitt

Regional Measure 3 May 16, 2017 SFMTA Board of Directors Regional Measure 3 Prior Regional

Regional Networks Regional Networks Rural Creative Placemaking Summit Regional

International, regional and sub- -regional regional International, regional and sub

Superinflation in Loop Quantum Cosmology Nelson Nunes ITP , University of Heidelberg

Differential Cryptanalysis of Keccak Variants olbl 1 , Florian Mendel 2 , Stefan K affer 2

Project 1: Streamlined identification of PAHs/PACs using ultracompact spectroscopy and machine

P ETRI N ETS AS P ARTIAL O RDER S EMANTICS FOR B IOCHEMICAL N ETWORKS Monika Heiner Ina Koch

CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 Instructor: Yizhou Sun

Agenda Introductions &amp; Housekeeping Overview Basics of FEMA Form 81-31 NORFMA

Multi-channel streaming for medical and multimedia industry applications and multimedia industry

Search for production of a Higgs boson and a single top quark Ken Bloom for the CMS

Explore More Topics

Sambuz

Useful Links

Newsletter

Mail Us

18. & 19. JUNE 2020 atec.leaguer.com.cn OVERVIEW Exclusive Panel Evaluation in Erlangen

Levenberg-Marquardt Minimization Jrn Wilms Remeis-Sternwarte & ECAP Universitt

Detection and (Linear) Data Model Jrn Wilms Remeis-Sternwarte & ECAP Universitt

Agenda Introductions & Housekeeping Overview Basics of FEMA Form 81-31 NORFMA