CUDA Deep Dive – Performance
CSE 6230, Jee Choi
September 10, 2013
Performance Target
• Tesla M2090
  – 512 CUDA processors @ 1.3 GHz
  – each processor is capable of issuing 1 FMA (2 flops) instruction per cycle
  – throughput = 512 × 1.3 GHz × 2 = 1.33 TFLOP/s
  – 384-bit memory interface @ 1.85 GHz
  – GDDR5 memory transfers data on both the rising and falling edge of the clock signal
  – bandwidth = 384/8 bytes × 1.85 GHz × 2 = 177 GB/s
• How can we achieve this level of performance in our code?
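The same peak numbers can be derived at run time from the device properties. Below is a minimal host-side sketch (not from the original slides); it assumes 32 cores per SM, which the runtime does not report, and follows the slide's ×2 double-data-rate convention for GDDR5 bandwidth.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // Cores per SM are not reported by the runtime; 32 is assumed here (Fermi).
        const int coresPerSM = 32;

        // Peak arithmetic throughput: cores x clock x 2 flops per FMA
        // (prop.clockRate is reported in kHz).
        double peakGflops = (double) prop.multiProcessorCount * coresPerSM
                            * (prop.clockRate * 1e3) * 2.0 / 1e9;

        // Peak bandwidth: bus width in bytes x memory clock x 2 (double data rate),
        // following the formula on the slide (prop.memoryClockRate is in kHz).
        double peakGBs = (prop.memoryBusWidth / 8.0)
                         * (prop.memoryClockRate * 1e3) * 2.0 / 1e9;

        printf("%s: %.0f GFLOP/s, %.0f GB/s\n", prop.name, peakGflops, peakGBs);
        return 0;
    }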
GPU Architecture
[Figure: Fermi Streaming Multiprocessor (SM) block diagram – warp schedulers, register file, 32 cores, shared memory / L1 cache]
• Dual warp scheduler
  – in each cycle, the dual warp scheduler selects two warps and issues one instruction from each warp to a group of sixteen cores
• Register file (32,768 × 32-bit)
  – each thread currently residing on the multiprocessor gets its own register set
• 32 cores
  – an instruction that has been scheduled from a warp is assigned to a group of 16 cores
  – it takes 2 cycles to issue 1 warp (first half-warp on the first cycle, second half-warp on the second cycle)
• Shared memory / L1 cache (64 KB)
  – the 64 KB can be configured to give 48 KB or 16 KB to shared memory, and the rest to L1 cache
  – shared memory is “shared” amongst the thread blocks currently residing on the SM
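The 48 KB / 16 KB split mentioned above is selected through the CUDA runtime. A minimal sketch (the kernel name is hypothetical):

    #include <cuda_runtime.h>

    __global__ void my_kernel(float *data) { /* ... */ }

    int main()
    {
        // Prefer 48 KB shared memory / 16 KB L1 for all kernels on this device ...
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

        // ... or set the preference for one kernel only.
        cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);

        my_kernel<<<1, 32>>>((float *) 0);
        cudaDeviceSynchronize();
        return 0;
    }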
Achieving performance
• Does this mean that as long as we have 2 warps in each SM, we can achieve peak performance?
  – theoretically yes, but realistically no
• Latency
  – from issue to completion, an instruction takes a fixed number of cycles to finish
  – arithmetic instructions: ~10s of cycles
  – memory reads: ~100s of cycles
• Data dependency
  – dependent instructions can’t be issued back-to-back
Example
• On the G80 architecture, arithmetic instruction latency is 24 cycles

    a = a * b + c; // 1) takes 24 cycles to complete
    a = a * b + c; // 2) stalled because of its dependency on 1)
    x = x * y + z; // 3) issued immediately after 2) because it has no dependency
    a = a * b + c; // 4) stalled because of its dependency on 2)
    x = x * y + z; // 5) issued immediately after 4) because by the time 4) was issued, 3) was on its last cycle
• Issue timeline for the five instructions above
  – instruction 1 is issued first and occupies the pipeline for 24 cycles
  – instruction 2 is issued only once 1) completes; instruction 3, being independent, is issued on the very next cycle (cycle 26)
  – instruction 4 must wait for 2) to complete; instruction 5 is issued immediately after 4), since 3) is by then on its last cycle
  – after 73 cycles, only 5 instructions have completed, versus the 73 instructions that could have been issued at peak (one per cycle)
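As a concrete (hypothetical) illustration of this latency-bound pattern, the kernel sketch below gives each thread a single dependent FMA chain; with too few resident warps, the SM stalls on every iteration exactly as in the timeline above (host-side launch code omitted):

    // Hypothetical micro-benchmark kernel: one dependent FMA chain per thread.
    // Each iteration reads the result of the previous one, so there is no ILP
    // within a thread; latency can only be hidden by having many warps in flight.
    __global__ void fma_chain(float *out, float b, float c, int n)
    {
        float a = (float) threadIdx.x;   // arbitrary starting value
        for (int i = 0; i < n; i++)
            a = a * b + c;               // serially dependent FMAs
        out[blockIdx.x * blockDim.x + threadIdx.x] = a;   // keep the chain live
    }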
Thread level parallelism
• Thus, we need lots of independent threads/warps to hide latency and achieve peak performance
• Exactly how many do we need?
  – Little’s Law can help us figure this out
  – recall that we used Little’s Law to figure out the number of in-flight memory requests needed to maximize bandwidth utilization
Performance Notes – Bandwidth Utilization II (recap)
• Little’s Law: L = λ × W
  – L = average number of customers in a store
  – λ = arrival rate
  – W = average time spent
• Applied to memory bandwidth: tens of thousands of in-flight requests!
[Figure: in-flight requests = bandwidth (λ) × latency (W)]
Thread level parallelism
• For the G80 architecture GPU
  – arithmetic instruction latency: ~24 cycles (W)
  – 8 cores per multiprocessor (λ)
  – parallelism (number of in-flight instructions) = 24 × 8 = 192 threads (6 warps)
  – we need 6 warps issuing independent FMA instructions in order to achieve peak performance (equivalently, to issue an FMA instruction every cycle)
Thread level parallelism
• What if we have no FMA instructions?
  – achieving “peak” is impossible
• What if we don’t have enough thread-level parallelism?
  – is it impossible to achieve peak?
  – not if you take advantage of instruction-level parallelism (ILP)
Occupancy
• Recall that occupancy is
  – (# of active warps) / (maximum # of warps)
• Higher is generally better, but not always
  – higher occupancy means more warps with which to hide latency
  – higher occupancy comes at the cost of fewer registers per thread
  – registers are the fastest memory and are necessary for achieving high performance
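To make the register trade-off concrete, a worked example using the Fermi-class SM numbers from the architecture slide (the 48-warp maximum and 63-register-per-thread cap are Fermi limits not stated in the original slides):

    registers per thread at full occupancy:
        32,768 registers / (48 warps × 32 threads) ≈ 21 registers per thread
    at 25% occupancy:
        32,768 registers / (12 warps × 32 threads) ≈ 85 registers per thread
        (in practice capped at 63 registers per thread on Fermi)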
Occupancy on G80
• G80
  – maximum of 24 warps per SM
  – 6 warps are required to hide instruction latency using only TLP
  – occupancy = 6 / 24 = 0.25
• Let’s try to achieve peak with occupancy < 0.25 by using ILP
Example
• On the G80 architecture
  – arithmetic instruction latency is 24 cycles
  – there are 8 processors per SM
  – 192 threads / 6 warps are required to completely hide instruction latency

    for (i = 0; i < N; i++) {
        a = a * b + c; // no ILP; requires 6 warps to achieve peak
    }
Example
• On the G80 architecture
  – arithmetic instruction latency is 24 cycles
  – there are 8 processors per SM
  – with an ILP of 2, each warp has 2 instructions that can be issued back-to-back
  – this is similar to having 2 warps with an ILP of 1
  – you can achieve peak using half the warps

    for (i = 0; i < N; i++) {
        a = a * b + c; // 2 independent instructions; ILP is 2
        d = d * e + f;
    }
Example
• On the G80 architecture
  – arithmetic instruction latency is 24 cycles
  – there are 8 processors per SM
  – what is the minimum amount of ILP needed to achieve peak performance with 2 warps?
  – since we need 6 warps with an ILP of 1, we can get away with 2 warps with an ILP of 3
  – equivalent to an occupancy of 2 / 24 = 0.083

    for (i = 0; i < N; i++) {
        a = a * b + c; // 3 independent instructions; ILP is 3
        d = d * e + f;
        g = g * h + I;
    }
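A hypothetical kernel version of this last pattern is sketched below: each thread carries three independent accumulators, so each warp exposes an ILP of 3 and fewer resident warps are needed to keep the FMA pipeline full (launch code again omitted; variable names are illustrative):

    // Hypothetical kernel with 3 independent FMA chains per thread (ILP = 3).
    // The three accumulators have no dependencies on each other, so their FMAs
    // can be issued back-to-back while earlier ones are still in the pipeline.
    __global__ void fma_chain_ilp3(float *out, float b, float c, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float a0 = tid, a1 = tid + 1, a2 = tid + 2;   // arbitrary starting values
        for (int i = 0; i < n; i++) {
            a0 = a0 * b + c;   // independent of a1 and a2
            a1 = a1 * b + c;
            a2 = a2 * b + c;
        }
        out[tid] = a0 + a1 + a2;   // keep all three chains live
    }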
Impact of using fewer warps
• Typically there is a limit on the amount of ILP that the GPU can take advantage of
  – you need to mix ILP and TLP to more easily achieve peak or target performance
• Having lower occupancy means more registers per thread with which to increase performance
  – more work per thread may also reduce thread block scheduling overheads