Energy Efficiency in Graphics Rendering

Preeti Ranjan Panda
Department of Computer Science and Engineering
Indian Institute of Technology Delhi
Presentation at TU Dortmund, June 2011

B. V. N. Silpa and P. R. Panda, 2011
Graphics Power Consumption

[Pie charts: power breakdown of a desktop computer and of a mobile computer, showing the shares taken by the CPU, graphics, chipset, monitor/LCD, HDD/DVD, power supply and VR losses, and the cooling fan]

[Ref: PC Energy-Efficiency Trends and Technology, source: intel.com]
Observation

• GPU/graphics rendering power is significant (greater than CPU)
• Yet, very little research on GPU energy efficiency!
  ◦ GPU performance was/is primary
  ◦ Proprietary GPU architectures
Graphics Pipeline

From CPU → Command processor → Vertex processor → Setup and Clipping → Rasterize → Fragment processor (fed by Texture memory) → Image composition → Display

• Command processor: receives vertices and commands from the CPU
• Vertex processor: transforms vertices to screen space and applies lighting
• Setup and Clipping: deletes unseen parts of the scene
• Rasterize: generates pixels
• Fragment processor: pixel coloring and Z-test
• Image composition: blends with the frame buffer
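A minimal sketch (not from the presentation) that lists these pipeline stages as a C enum, with each stage's responsibility noted in a comment; the identifiers are our own naming.

/* Pipeline stages as a C enum; identifiers and comments are our own naming,
 * mirroring the stage descriptions on the slide. */
enum gfx_pipeline_stage {
    STAGE_COMMAND,       /* receive vertices and commands from the CPU            */
    STAGE_VERTEX,        /* transform vertices to screen space and apply lighting */
    STAGE_SETUP_CLIP,    /* delete unseen parts of the scene                      */
    STAGE_RASTERIZE,     /* generate the pixels covered by each primitive         */
    STAGE_FRAGMENT,      /* pixel coloring (texture lookups) and Z-test           */
    STAGE_COMPOSITION,   /* blend the result into the frame buffer                */
    STAGE_DISPLAY        /* scan the frame buffer out to the display              */
};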
Adding Energy Efficiency
IIT Delhi – Intel Collaboration

[Graphics pipeline diagram as on the previous slide, annotated with the two points of attack]
• Component level: TEXTURE MAPPING
• System level: DVFS
LOW POWER TEXTURE MAPPING
[ICCAD'08]
Power Profile of the Pipeline

[Bar chart: normalized energy for the City, Fire, Teapot and Tunnel benchmarks, broken down into transform and lighting, setup and rasterize, texture memory, fragment processing, and frame buffer write]

Texture memory consumes 30-40% of total power.
Texture Mapping

• Adds detail and surface texture to an object.
• Reduces the modeling effort for the programmer.

[Figure: an object, a texture image, and the resulting texture-mapped object]
Texture Filtering

• Texture space and object space could be at arbitrary angles to each other
• Nearest neighbor
• Bilinear interpolation: weighted average of the four texels nearest to the pixel center

[Figure: a pixel center overlaid on the texel grid]
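A minimal sketch of bilinear filtering as described above, assuming a single-channel, row-major texture and non-negative sample coordinates given in texel units; function and variable names are illustrative, not the presentation's implementation.

/* Bilinear filtering sketch: assumes a single-channel texture stored
 * row-major, and sample coordinates (u, v) >= 0 given in texel units. */
#include <math.h>

float bilinear_sample(const float *texels, int width, int height,
                      float u, float v)
{
    int   tx = (int)floorf(u);          /* top-left texel of the footprint */
    int   ty = (int)floorf(v);
    float fx = u - (float)tx;           /* fractional position inside it   */
    float fy = v - (float)ty;

    /* The four texels nearest to the pixel center, clamped at the edges. */
    int x1 = (tx + 1 < width)  ? tx + 1 : width  - 1;
    int y1 = (ty + 1 < height) ? ty + 1 : height - 1;

    float t00 = texels[ty * width + tx];
    float t10 = texels[ty * width + x1];
    float t01 = texels[y1 * width + tx];
    float t11 = texels[y1 * width + x1];

    /* Weighted average: interpolate horizontally, then vertically. */
    float top    = t00 * (1.0f - fx) + t10 * fx;
    float bottom = t01 * (1.0f - fx) + t11 * fx;
    return top * (1.0f - fy) + bottom * fy;
}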
Texture Access Pattern

• Texture mapping exhibits high spatial and temporal locality
• Bilinear filtering requires the 4 neighbouring texels (tx,ty), (tx+1,ty), (tx,ty+1), (tx+1,ty+1) around the pixel center
• Neighbouring pixels map to spatially local texels
• Repetitive textures
Blocking and Texture Cache

• Blocked representation
  ◦ Texels stored as 4x4 blocks
  ◦ Reduces dependency on texture orientation, and exploits spatial locality
• Texture memory accessed through a cache hierarchy ("TEXTURE CACHE")
• Familiar architectural space
• BUT, application knowledge could help improve the HW over a "standard cache"
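As an illustration of the blocked representation, the sketch below maps a texel coordinate to its 4x4 block and intra-block offset; the exact layout (row-major blocks, row-major texels inside a block, 32-bit texels) is an assumption made for illustration, not taken from the paper.

/* Blocked-texture address sketch: maps texel (tx, ty) to its 4x4 block and
 * intra-block offset under an assumed row-major blocked layout. */
#include <stdint.h>

#define BLOCK_DIM   4     /* 4x4 texels per block */
#define TEXEL_BYTES 4     /* 32-bit texels        */

uint32_t blocked_address(uint32_t tx, uint32_t ty, uint32_t width_blocks,
                         uint32_t *block_index, uint32_t *offset_in_block)
{
    uint32_t bx = tx / BLOCK_DIM, by = ty / BLOCK_DIM;   /* block coordinates   */
    uint32_t ox = tx % BLOCK_DIM, oy = ty % BLOCK_DIM;   /* offset inside block */

    *block_index     = by * width_blocks + bx;
    *offset_in_block = oy * BLOCK_DIM + ox;

    /* Byte address of the texel in the blocked texture image. */
    return (*block_index * BLOCK_DIM * BLOCK_DIM + *offset_in_block) * TEXEL_BYTES;
}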
Predictability in Texture Accesses

• The access to the first texel gives information about the accesses to the next 3 texels
• The four texels could be mapped to either one, two or four neighbouring blocks
  (Case 1: one block; Cases 2 and 3: two adjacent blocks; Case 4: four blocks)
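A small sketch, under the same assumed 4x4 blocking, of how the case can be classified from the first texel's coordinates alone; the function name and return convention are ours.

/* Case classification sketch: how many distinct 4x4 blocks the bilinear
 * footprint (tx,ty), (tx+1,ty), (tx,ty+1), (tx+1,ty+1) touches. */
#define BLOCK_DIM 4

int blocks_touched(unsigned tx, unsigned ty)
{
    int crosses_x = ((tx % BLOCK_DIM) == BLOCK_DIM - 1);  /* spans two block columns */
    int crosses_y = ((ty % BLOCK_DIM) == BLOCK_DIM - 1);  /* spans two block rows    */

    if (!crosses_x && !crosses_y) return 1;   /* Case 1: one block         */
    if (crosses_x && crosses_y)   return 4;   /* Case 4: four blocks       */
    return 2;                                 /* Cases 2 and 3: two blocks */
}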
Low Power Texture Memory Architecture

• A lower-power memory architecture than a cache for texturing
  ◦ Use a few registers to filter accesses to blocks expected to be reused
  ◦ The access stream has predictability - a controlled access mechanism reduces tag lookups
How many blocks to buffer?

• Need to buffer up to 4 blocks
• A buffer is a set of 4x4 registers, each 32 bits
• The Texture Buffer Array is a group of 4 such buffers

[Figure: Texture Buffer Array built from four block buffers]
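A sketch of the buffering structure the slide describes: four buffers, each holding one 4x4 block of 32-bit texels. The block-address tag and valid flag are assumptions added so the structure can act as a lookup filter in front of the cache.

/* Texture Buffer Array sketch: four buffers of one 4x4 block each. */
#include <stdint.h>

#define BLOCK_DIM   4
#define NUM_BUFFERS 4

struct block_buffer {
    uint32_t block_addr;                     /* which block is resident        */
    int      valid;                          /* buffer currently holds a block */
    uint32_t texels[BLOCK_DIM][BLOCK_DIM];   /* 4x4 registers, 32 bits each    */
};

struct texture_buffer_array {
    struct block_buffer buf[NUM_BUFFERS];    /* up to 4 blocks buffered        */
};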
Texture Lookup

• Case 1:
  ◦ Lookup(block0); get the 4 texels from the block using offsets
  ◦ SAVING: 3 LOOKUPS
• Cases 2 & 3:
  ◦ Lookup(block0); get texel0 and texel1 from this block
  ◦ Lookup(block2); get texel2 and texel3 from this block
  ◦ SAVING: 2 LOOKUPS
Contd.

• Case 4:
  ◦ Lookup all 4 blocks and get the texels from the respective blocks using offsets

Power savings from:
• Reduced tag lookups
• Smaller buffer than a cache
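Putting the three cases together, a compact sketch of the lookup flow: one block lookup for Case 1, two for Cases 2/3, four for Case 4. The tba_lookup helper is a stand-in stub for the Texture Buffer Array tag check (hit, or load from the cache on a miss); it is not the paper's interface.

/* Lookup-flow sketch: one block (tag) lookup per distinct block touched. */
#define BLOCK_DIM 4

static int tba_lookup(unsigned block_addr)
{
    (void)block_addr;
    return 1;   /* stub: pretend the block is always resident */
}

/* Returns how many block lookups were issued for one bilinear sample at
 * texel (tx, ty): 1 for Case 1, 2 for Cases 2/3, 4 for Case 4. */
int texture_lookup(unsigned tx, unsigned ty, unsigned width_blocks)
{
    unsigned bx0 = tx / BLOCK_DIM,       by0 = ty / BLOCK_DIM;
    unsigned bx1 = (tx + 1) / BLOCK_DIM, by1 = (ty + 1) / BLOCK_DIM;
    int lookups = 0;

    for (unsigned by = by0; by <= by1; ++by) {
        for (unsigned bx = bx0; bx <= bx1; ++bx) {
            tba_lookup(by * width_blocks + bx);   /* one tag lookup per block */
            ++lookups;
        }
    }
    return lookups;
}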
Distribution of Accesses Among the Four Cases

[Bar chart: fraction of texture accesses falling into case 1, case 2, case 3 and case 4]

The number of comparisons per access is 1.38 instead of 4.
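A worked example of where an average like 1.38 comes from, assuming one tag comparison per distinct block touched (1 for case 1, 2 for cases 2/3, 4 for case 4). The fractions below are hypothetical placeholders chosen only to reproduce the quoted average; the measured distribution is in the chart above.

/* Worked average: expected comparisons = 1*f1 + 2*(f2 + f3) + 4*f4. */
#include <stdio.h>

int main(void)
{
    double f1 = 0.66, f2 = 0.16, f3 = 0.16, f4 = 0.02;   /* hypothetical fractions */
    double avg = 1.0 * f1 + 2.0 * (f2 + f3) + 4.0 * f4;
    printf("average comparisons per access = %.2f (vs 4)\n", avg);   /* prints 1.38 */
    return 0;
}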
Architecture of Texture Filter Memory

[Block diagram: 512-bit lines arrive from the L1 cache into two 256-byte banks managed by a controller with a fetch unit and two address comparators; block-address registers, comparators and an encoder implement the TBA tag check, and 32-bit texels are delivered to the filter.]
Hit Rate into TFM

[Bar chart: hit rate for the Fire, Teapot, Tunnel, Gloss, Gearbox and Sphere benchmarks, comparing a 16KB 2-way associative cache, a 512B direct-mapped filter, a 512B fully associative filter, and the TFM]

TFM gives a 4.5% better hit rate than a direct-mapped filter of the same size.
Energy per Access

[Bar chart: energy per access (nJ) for the Fire, Teapot, Tunnel, Gloss, Gearbox and Sphere benchmarks, comparing a 16KB 2-way associative L1, a 512B direct-mapped L1, a 512B direct-mapped filter, a 512B fully associative filter, and the TFM]

TFM consumes 75% less energy than the conventional texture cache.
Texture Filter Memory: Summary

• In addition to high spatial locality, the texture mapping access pattern also has predictability
• Replaced high-energy cache lookups with low-energy register buffer reads
• TFM consumes ~75% less energy than the conventional texture mapping system
• Overheads:
  ◦ TFM access is 4x faster than cache access
  ◦ 0.48% area overhead over the texture cache subsystem
DYNAMIC VOLTAGE AND FREQUENCY SCALING (DVFS)
[CODES+ISSS'10]