Lecture 5: HW1 Discussion, Intro to GPUs
G63.2011.002/G22.2945.001 · October 5, 2010
Outline
• Discuss HW1
• Intro to GPU Computing
Dense Matrix Multiply: Blocking vs Scalar
We provided a blocked example matrix multiplication code. Why is blocked matmul faster than un-blocked?
Key: Computational Intensity
Definition: flops per FPN (floating-point number) moved up the memory hierarchy.
Large intensity: good for deep memory hierarchies.
Computational Intensity for Scalar Matmul
Floating point operations: 2N³.
Assume: size(L1) ≪ N² FPNs.
FPN-size cache misses (neglecting cache lines, etc.):
• N² — read each row of A once
• N³ — read each column of B N times
• 2N² — read/write C
Total: N³ + 3N² FPN-size cache misses.
Computational intensity: 2N³ / (N³ + 3N²) ≈ 2.
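For concreteness, a minimal un-blocked matmul sketch that the counting above refers to. This is a sketch only: the function name and the column-major layout are assumptions for illustration, not the provided HW1 code.

/* Un-blocked ("scalar") matmul: A, B, C are N x N, column-major,
 * C zero-initialized by the caller.  2 flops per inner iteration -> 2 N^3 total. */
void matmul_scalar(int N, const double *A, const double *B, double *C)
{
  for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
      for (int k = 0; k < N; ++k)
        /* the k loop walks a row of A: stride-N accesses in column-major storage */
        C[i + j*N] += A[i + k*N] * B[k + j*N];
}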
Computational Intensity for Blocked Matmul
Floating point operations: still 2N³.
b: block size, n = ⌈N/b⌉.
FPN-size cache misses:
• b²n³ — read each of the n² blocks of A n times
• b²n³ — same for B
• 2N² — read/write C
Total: 2b²n³ + 2N² FPN-size cache misses.
Rewrite: b²n³ ≈ b² · N³/b³ = N³/b.
Computational intensity: 2N³ / (2N³/b + 2N²) ≈ 2N³ / (2N³/b) = b
→ incentive to choose b ≫ 2.
The power of assumptions: Can we choose b = N?
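A quick worked example with purely illustrative sizes (and assuming the b×b blocks really fit in cache): with N = 600 and b = 60, the misses are 2N³/b + 2N² = 7.2·10⁶ + 7.2·10⁵ ≈ 7.9·10⁶ FPNs against 2N³ = 4.3·10⁸ flops, i.e. an intensity of about 55 flops per FPN, versus roughly 2 for the scalar version.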
Hatching a Plan
Consider each level of the memory hierarchy. How do we exploit...
• ...L2: ignore; we're nearly L2-local at most sizes.
• ...L1: 32 KiB = 4096 FPNs. Key: memory layout.
• ...registers: 16 FP registers. Key: loop/operation ordering.
Optimizing for L1: Memory Layout
Memory layout of A: column-major. Only one entry of each cache line is used per fetch.
Better to store A in row-major order. Input is row-major.
If memory is available (not swap!), storing a transposed copy of A can be a good idea. (The copy takes O(N²) time.)
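A minimal sketch of the transposed-copy idea, assuming column-major storage and a caller-provided scratch buffer; the names are illustrative, not from the handout.

/* Make a transposed copy At = A^T (both column-major), so that walking a
 * row of A becomes a unit-stride walk down a column of At. */
void transpose_copy(int N, const double *A, double *At)
{
  for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
      At[k + i*N] = A[i + k*N];   /* O(N^2) work, done once up front */
}

The inner product then reads both operands with unit stride in k: C[i + j*N] += At[k + i*N] * B[k + j*N].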
Optimizing for L1: Reuse Pattern, Block Size
Question: Blocking is a good idea. What is the optimal b_L1?
Follow-up question: How much needs to fit in L1?
• One block of each of A, B, C.
• All of A, plus one column of B and C.
32 KiB: 8 b_L1² + 2 · 8 b_L1 ≤ 32768 → b_L1 ≤ 60.
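As a rough check (ignoring cache-line granularity and conflict misses): with b_L1 = 60 and 8-byte FPNs, 8·60² + 2·8·60 = 28800 + 960 = 29760 bytes, which fits in the 32768-byte L1; b_L1 = 64 would already need 32768 + 1024 = 33792 bytes.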
L1 Block Copy
Further concerns:
• Cache line boundaries
• SIMD
• Cache set conflicts
All solved by the small-block copy optimization: rather than copying all of A (as above), copy b_L1-sized blocks of A, B, and C, operate on those, then copy the output back.
L1 Block Copy: The Plan
Basic plan:
For each i:
  For each j:
    Load block C[i,j]
    For each k:
      Load block A[i,k]
      Load block B[k,j]
      ⌈b_L1/b_r⌉³ register kernels: C += A·B
    Store block C[i,j]
(Can be improved: many A, B loads.)
Aside: this also neatly deals with fringes.
So: how does this solve the problems above? Can you define "alignment"?
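A minimal C sketch of this loop structure, assuming column-major storage with leading dimension N, square blocks, and N divisible by B_L1; the explicit copy-in/copy-out of blocks, fringe handling, and the b_r × b_r register kernel are left out to keep the sketch short, and the names and the value of B_L1 are assumptions.

#define B_L1 60   /* assumed L1 block size; tune for the actual machine */

static void block_kernel(int N, const double *A, const double *B, double *C)
{
  /* C_blk += A_blk * B_blk for one B_L1 x B_L1 block triple; each pointer
   * addresses the top-left entry of its block, column stride N.
   * The real code tiles this further into b_r x b_r register kernels. */
  for (int j = 0; j < B_L1; ++j)
    for (int k = 0; k < B_L1; ++k)
      for (int i = 0; i < B_L1; ++i)
        C[i + j*N] += A[i + k*N] * B[k + j*N];
}

void matmul_blocked(int N, const double *A, const double *B, double *C)
{
  /* column-major; N assumed to be a multiple of B_L1 */
  for (int j = 0; j < N; j += B_L1)
    for (int i = 0; i < N; i += B_L1)
      for (int k = 0; k < N; k += B_L1)
        block_kernel(N, &A[i + k*N], &B[k + j*N], &C[i + j*N]);
}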
Alignment
A memory address a is n-byte aligned when n is a power of two and a is a multiple of n bytes.
(See also the IBM developerWorks article.)
Examples: cache-line-aligned, SIMD-aligned.

#include <stdlib.h>

/* dynamic allocation: posix_memalign returns 64-byte-aligned memory */
double *var;
int error = posix_memalign((void **) &var, 64, array_size);
if (error)
  abort();

/* static allocation: 64-byte-aligned array */
double __attribute__ ((aligned (64))) ary2[500];

Code generation in the non-aligned case?
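A small check corresponding to the definition above (a sketch; 64 stands for whichever alignment matters, e.g. cache line or SIMD width):

#include <stdint.h>

/* a pointer is 64-byte aligned iff its address is a multiple of 64 */
int is_aligned_64(const void *p)
{
  return ((uintptr_t) p % 64) == 0;
}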
Register Kernel
Choose block size b_r = 2^k, with b_L1 mod b_r = 0.

for (int j = 0; j < b_r; ++j)
  for (int k = 0; k < b_r; ++k)
    for (int i = 0; i < b_r; ++i)
      C[i + j*b_l1] += A[i + k*b_l1] * B[k + j*b_l1];

For each A·b matvec: perform b_r scalar·vector updates.
• Vectorizable
• Pipeline-friendly (minimal data dependencies)
• Access to A, C is unit-stride
• Access to B is inner-loop invariant
• Unrolling, software pipelining: compiler's job
Psychoanalyzing the Compiler
Flags for Intel:
-O3 -fno-alias -funroll-loops -std=c99 -D_XOPEN_SOURCE=500 -opt-streaming-stores auto -static -fast -xHost
Flags for GCC:
-O3 -funroll-loops -march=native -std=c99 -D_XOPEN_SOURCE=500 -ftree-vectorizer-verbose=2 -ffast-math
GCC 4.3 is sometimes better than GCC 4.4.
Self-study material:
• Compiler references: Intel, GNU
• C99 restrict keyword, aliasing
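As a small illustration of the restrict keyword mentioned above (a sketch, not HW1 code): promising the compiler that the arrays do not alias removes the need to re-load after every store, which helps vectorization.

/* With restrict, the compiler may assume writes through C never change A or B,
 * so it can keep values in registers and vectorize the loop. */
void madd_arrays(int n, double *restrict C,
                 const double *restrict A, const double *restrict B)
{
  for (int i = 0; i < n; ++i)
    C[i] += A[i] * B[i];
}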
Profiling
OProfile: a sampling profiler. Uses performance counters. Linux only, needs root.
Many event types countable:
• CPU_CLK_UNHALTED: clock cycles when not halted
• L2_RQSTS: number of L2 cache requests
• LLC_MISSES: L2 cache demand requests from this core that missed the L2
• FLOPS: number of FP computational micro-ops executed
• IDLE_DURING_DIV: cycles the divider is busy and all other execution units are idle
• L1D_ALL_REF: all references to the L1 data cache
• L1D_PEND_MISS: total number of outstanding L1 data cache misses at any cycle
• IFU_MEM_STALL: cycles the instruction fetch pipe is stalled
• INST_RETIRED: number of instructions retired
• UOPS_RETIRED: number of UOPs retired
• MACHINE_NUKES_SMC: number of pipeline-flushing events
• RAT_STALLS: partial register stall cycles
• BR_INST_DECODED: number of branch instructions decoded
[Example opannotate output: annotated assembly of the register kernel with per-instruction sample counts for the FLOPS and L1D_PEND_MISS events; most samples fall on the movsd loads/stores rather than on the mulsd/addsd arithmetic.]
Solution Performance
[Plot: MFlops/s (0 to 9000) versus matrix dimension N (0 to 800) for the basic, tuned, and BLAS versions.]
git clone ssh://git@forge.tiker.net:2234/hw1-solution.git
(Private; works if you signed up for an account.)