From Shader Code to a Teraflop: How Shader Cores Work
Kayvon Fatahalian, Stanford University
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

This talk
1. Three major ideas that make GPU processing cores run fast
2. Closer look at real GPU designs
   – NVIDIA GTX 285
   – AMD Radeon 4890
   – Intel Larrabee
3. Memory hierarchy: moving data to processors

Part 1: throughput processing
• Three key concepts behind how modern GPU processing cores run code
• Knowing these concepts will help you:
  1. Understand the space of GPU core (and throughput CPU processing core) designs
  2. Optimize shaders/compute kernels
  3. Establish intuition: what workloads might benefit from the design of these architectures?

What’s in a GPU?
[Diagram: eight shader cores and four texture units (Tex), alongside Input Assembly, Rasterizer, Output Blend, Video Decode, and a Work Distributor (HW or SW?)]
Heterogeneous chip multi-processor (highly tuned for graphics)

A diffuse reflectance shader

    sampler mySamp;
    Texture2D<float3> myTex;
    float3 lightDir;

    float4 diffuseShader(float3 norm, float2 uv)
    {
        float3 kd;
        kd = myTex.Sample(mySamp, uv);
        kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
        return float4(kd, 1.0);
    }

Each fragment is shaded independently, but the code has no explicit parallelism.

Compile shader
Input: 1 unshaded fragment input record
The diffuseShader source (sampler mySamp, Texture2D<float3> myTex, float3 lightDir) compiles to:

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul    r3, v0, cb0[0]
    madd   r3, v1, cb0[1], r3
    madd   r3, v2, cb0[2], r3
    clmp   r3, r3, l(0.0), l(1.0)
    mul    o0, r0, r3
    mul    o1, r1, r3
    mul    o2, r2, r3
    mov    o3, l(1.0)

Output: 1 shaded fragment output record

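To make the correspondence between the source and the compiled listing concrete, here is a minimal Python sketch that mirrors the nine instructions one scalar operation at a time. It is an illustration, not the compiler's actual output semantics: the `sample` helper is a hypothetical nearest-texel lookup standing in for the texture unit, and the mapping of `v0..v2` to the normal and `cb0[0..2]` to `lightDir` is inferred from the source shader.

```python
def sample(texture, uv):
    """Hypothetical nearest-texel lookup standing in for the texture unit.

    texture is a list of rows of (r, g, b) tuples; uv is assumed to lie in [0, 1].
    """
    height, width = len(texture), len(texture[0])
    x = min(int(uv[0] * width), width - 1)
    y = min(int(uv[1] * height), height - 1)
    return texture[y][x]


def diffuse_shader(norm, uv, light_dir, texture):
    """Shade one fragment using only scalar operations on scalar 'registers'."""
    r0, r1, r2 = sample(texture, uv)        # sample r0, v4, t0, s0
    r3 = norm[0] * light_dir[0]             # mul  r3, v0, cb0[0]
    r3 = norm[1] * light_dir[1] + r3        # madd r3, v1, cb0[1], r3
    r3 = norm[2] * light_dir[2] + r3        # madd r3, v2, cb0[2], r3
    r3 = min(max(r3, 0.0), 1.0)             # clmp r3, r3, l(0.0), l(1.0)
    o0 = r0 * r3                            # mul  o0, r0, r3
    o1 = r1 * r3                            # mul  o1, r1, r3
    o2 = r2 * r3                            # mul  o2, r2, r3
    o3 = 1.0                                # mov  o3, l(1.0)
    return (o0, o1, o2, o3)
```

One call consumes one unshaded fragment record (normal and uv) and produces one shaded output record (RGBA), matching the input and output records named on the slide.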
Execute shader
[Diagram: a core made of Fetch/Decode, ALU (Execute), and Execution Context, executing the instruction stream below]

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul    r3, v0, cb0[0]
    madd   r3, v1, cb0[1], r3
    madd   r3, v2, cb0[2], r3
    clmp   r3, r3, l(0.0), l(1.0)
    mul    o0, r0, r3
    mul    o1, r1, r3
    mul    o2, r2, r3
    mov    o3, l(1.0)

CPU-“style” cores
[Diagram: Fetch/Decode, ALU (Execute), and Execution Context, surrounded by:]
• Out-of-order control logic
• Fancy branch predictor
• Memory pre-fetcher
• Data cache (a big one)

Slimming down
Idea #1: Remove components that help a single instruction stream run fast
[Diagram: the core reduced to Fetch/Decode, ALU (Execute), and Execution Context]

Two cores (two fragments in parallel)
[Diagram: two identical cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context. One runs the <diffuseShader> instruction stream on fragment 1, the other runs the same stream on fragment 2.]

Four cores (four fragments in parallel)
[Diagram: four cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context]

Sixteen cores (sixteen fragments in parallel)
[Diagram: a 4 x 4 grid of sixteen ALUs, one per core]
16 cores = 16 simultaneous instruction streams

Instruction stream sharing
But… many fragments should be able to share an instruction stream!

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul    r3, v0, cb0[0]
    madd   r3, v1, cb0[1], r3
    madd   r3, v2, cb0[2], r3
    clmp   r3, r3, l(0.0), l(1.0)
    mul    o0, r0, r3
    mul    o1, r1, r3
    mul    o2, r2, r3
    mov    o3, l(1.0)

Recall: simple processing core
[Diagram: Fetch/Decode, ALU (Execute), Execution Context]

Add ALUs
Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs
SIMD processing
[Diagram: one Fetch/Decode unit feeding ALU 1 to ALU 8, with eight per-ALU contexts (Ctx) and Shared Ctx Data]

Modifying the shader
[Diagram: one Fetch/Decode unit, ALU 1 to ALU 8, eight Ctx, Shared Ctx Data]

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul    r3, v0, cb0[0]
    madd   r3, v1, cb0[1], r3
    madd   r3, v2, cb0[2], r3
    clmp   r3, r3, l(0.0), l(1.0)
    mul    o0, r0, r3
    mul    o1, r1, r3
    mul    o2, r2, r3
    mov    o3, l(1.0)

Original compiled shader: processes one fragment using scalar ops on scalar registers

Modifying the shader
[Diagram: one Fetch/Decode unit, ALU 1 to ALU 8, eight Ctx, Shared Ctx Data]

    <VEC8_diffuseShader>:
    VEC8_sample vec_r0, vec_v4, t0, vec_s0
    VEC8_mul    vec_r3, vec_v0, cb0[0]
    VEC8_madd   vec_r3, vec_v1, cb0[1], vec_r3
    VEC8_madd   vec_r3, vec_v2, cb0[2], vec_r3
    VEC8_clmp   vec_r3, vec_r3, l(0.0), l(1.0)
    VEC8_mul    vec_o0, vec_r0, vec_r3
    VEC8_mul    vec_o1, vec_r1, vec_r3
    VEC8_mul    vec_o2, vec_r2, vec_r3
    VEC8_mov    vec_o3, l(1.0)

New compiled shader: processes 8 fragments using vector ops on vector registers

Modifying the shader
[Diagram: fragments 1 to 8 each map to one of ALU 1 to ALU 8. A single Fetch/Decode unit issues each instruction of <VEC8_diffuseShader> (VEC8_sample through VEC8_mov) to all eight ALUs, each with its own Ctx plus Shared Ctx Data.]

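The vectorized listing can be sketched in plain Python by treating every vector register as a list of 8 lanes, one per fragment, so that each "instruction" is issued once and applied to all lanes. This is only an illustrative model under assumptions: the `vec8_*` helpers, the structure-of-arrays input layout, and the nearest-texel `vec8_sample` are hypothetical names chosen for this sketch, not a real ISA or API.

```python
LANES = 8  # SIMD width assumed from the VEC8_ prefix in the listing


def broadcast(x):
    """Replicate a scalar (e.g. a constant-buffer value) across all lanes."""
    return x if isinstance(x, list) else [x] * LANES


def vec8_mul(a, b):
    return [x * y for x, y in zip(broadcast(a), broadcast(b))]


def vec8_madd(a, b, c):
    return [x * y + z for x, y, z in zip(broadcast(a), broadcast(b), broadcast(c))]


def vec8_clmp(a, lo, hi):
    return [min(max(x, lo), hi) for x in a]


def vec8_sample(texture, us, vs):
    """Hypothetical nearest-texel lookup per lane; returns three lane vectors (r, g, b)."""
    height, width = len(texture), len(texture[0])
    texels = [texture[min(int(v * height), height - 1)][min(int(u * width), width - 1)]
              for u, v in zip(us, vs)]
    return ([t[0] for t in texels], [t[1] for t in texels], [t[2] for t in texels])


def vec8_diffuse_shader(nx, ny, nz, us, vs, light_dir, texture):
    """Shade 8 fragments with one pass over the instruction stream."""
    r0, r1, r2 = vec8_sample(texture, us, vs)   # VEC8_sample
    r3 = vec8_mul(nx, light_dir[0])             # VEC8_mul
    r3 = vec8_madd(ny, light_dir[1], r3)        # VEC8_madd
    r3 = vec8_madd(nz, light_dir[2], r3)        # VEC8_madd
    r3 = vec8_clmp(r3, 0.0, 1.0)                # VEC8_clmp
    o0 = vec8_mul(r0, r3)                       # VEC8_mul
    o1 = vec8_mul(r1, r3)                       # VEC8_mul
    o2 = vec8_mul(r2, r3)                       # VEC8_mul
    o3 = broadcast(1.0)                         # VEC8_mov
    return o0, o1, o2, o3
```

The instruction count is the same nine as in the scalar shader, but each instruction now does eight fragments' worth of work, which is the amortization that Idea #2 describes.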
128 fragments in parallel
16 cores = 128 ALUs = 16 simultaneous instruction streams

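The arithmetic behind this slide is 16 cores x 8 ALUs per core = 128 fragments in flight, with one instruction stream per core. A small sketch of how a batch of fragments might be divided into per-core groups of 8 follows; the function name and the simple in-order grouping are assumptions for illustration, not how any particular GPU schedules work.

```python
def partition_fragments(fragments, num_cores=16, simd_width=8):
    """Split one batch of fragments into per-core groups of simd_width lanes.

    Only the first num_cores * simd_width fragments fit in a single batch;
    each returned group shares one instruction stream on one core.
    """
    per_batch = num_cores * simd_width
    batch = fragments[:per_batch]
    return [batch[i:i + simd_width] for i in range(0, len(batch), simd_width)]


groups = partition_fragments(list(range(128)))
fragments_in_flight = sum(len(g) for g in groups)   # 16 * 8
instruction_streams = len(groups)                   # one per core
```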
128 [ vertices / primitives / fragments / CUDA threads / OpenCL work items / compute shader threads ] in parallel

But what about branches?
[Diagram: fragments 1 to 8 mapped to ALU 1 to ALU 8, with time (clocks) running downward]

    <unconditional shader code>
    if (x > 0) {
        y = pow(x, exp);
        y *= Ks;
        refl = y + Ka;
    } else {
        x = 0;
        refl = Ka;
    }
    <resume unconditional shader code>

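This slide only poses the question. One common answer for SIMD hardware, assumed here and not stated on the slide, is per-lane masking: when lanes disagree on the condition, the core issues both sides of the branch and each lane keeps only the results of the side it took. The sketch below models that idea in Python; `simd_branch` and its pass counter are hypothetical illustration names.

```python
def simd_branch(xs, exponent, Ks, Ka):
    """Model masked execution of the if/else above across SIMD lanes.

    Returns (refl, passes): the per-lane results and how many branch sides
    the shared instruction stream had to issue (1 if all lanes agree, else 2).
    """
    xs = list(xs)
    mask = [x > 0 for x in xs]          # per-lane outcome of (x > 0)
    refl = [None] * len(xs)
    passes = 0

    if any(mask):                       # some lane needs the 'if' side
        passes += 1
        for lane, active in enumerate(mask):
            if active:                  # inactive lanes sit idle during this pass
                y = pow(xs[lane], exponent)
                y *= Ks
                refl[lane] = y + Ka

    if not all(mask):                   # some lane needs the 'else' side
        passes += 1
        for lane, active in enumerate(mask):
            if not active:
                xs[lane] = 0
                refl[lane] = Ka

    return refl, passes
```

Under this model a divergent group of lanes pays for both sides of the branch, while a group whose lanes all agree pays for only one.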