

WHAT GRAPHICS PROGRAMMERS NEED TO KNOW ABOUT DRAM
Erik Brunvand, Niladrish Chatterjee, Daniel Kopta


  1. WHAT GRAPHICS PROGRAMMERS NEED TO KNOW ABOUT DRAM
     Erik Brunvand, Niladrish Chatterjee, Daniel Kopta
     AGENDA
     • Just a quick overview of what DRAM is, how it works, and what you should know about it as a programmer.
     • A look at the circuits so you get some insight about why DRAM is so weird
     • A look at the DIMMs so you can see how that weirdness manifests in real memory sticks
     • A peek behind the scenes at the memory controller

  2. MEMORY SYSTEM POWER
     • Memory power has caught up to CPU power!
     • This is for general-purpose applications
     • Even worse for memory-bound applications like graphics…
     (Source: P. Bose, IBM, from WETI keynote, 2012)
     MEMORY HIERARCHY: GENERAL PURPOSE PROCESSOR
     [Diagram: processor die containing the CPU, register file, L1 cache, and L2/L3 cache, with a link to off-chip memory]

  3. MEMORY HIERARCHY: GENERAL PURPOSE PROCESSOR
     [Diagram: processor die containing the CPU, register file, L1 cache, L2/L3 cache, and memory controller, connected to off-chip DRAM memory]

  4. MEMORY SYSTEM STRUCTURE
     [Diagram: processor (PROC) connected directly to DIMMs]
     • 4 DDR3 channels
     • 64-bit data channels
     • 800 MHz channels
     • 1-2 DIMMs/channel
     • 1-4 ranks/channel
     MEMORY SYSTEM STRUCTURE
     [Diagram: processor (PROC) connected to DIMMs through a Scalable Memory Buffer (SMB)]
     • The link into the processor is narrow and high frequency
     • The Scalable Memory Buffer chip is a “router” that connects to multiple DDR3 channels (wide and slow)
     • Boosts processor pin bandwidth and memory capacity
     • More expensive, high power
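The channel figures on this slide imply a theoretical peak bandwidth. The following is a minimal sketch of that arithmetic, assuming the 800 MHz figure is the channel's I/O clock and that data moves on both clock edges (two transfers per clock, as in DDR). The function name and its parameters are illustrative and do not come from the slides.

```python
def peak_bandwidth_gb_s(clock_mhz, bus_bits, channels, transfers_per_clock=2):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumption: transfers_per_clock=2 models double data rate signalling.
    """
    transfers_per_s = clock_mhz * 1e6 * transfers_per_clock
    bytes_per_transfer = bus_bits // 8
    return transfers_per_s * bytes_per_transfer * channels / 1e9


# One 64-bit, 800 MHz DDR3 channel, then all four channels together.
per_channel = peak_bandwidth_gb_s(800, 64, channels=1)
total = peak_bandwidth_gb_s(800, 64, channels=4)
print(per_channel, total)
```

Under these assumptions each channel peaks at about 12.8 GB/s and the four channels together at about 51.2 GB/s. Sustained bandwidth is lower, because of the row activation, precharge, and refresh behaviour described on later slides.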

  5. CPU DIE PHOTOS
     [Photos: Intel Haswell, Intel Sandy Bridge]
     MEMORY HIERARCHY: GRAPHICS PROCESSOR

  6. GRAPHICS PROCESSOR DIE PHOTOS
     [Photo: Intel Xeon Phi]

  7. LOOKING AHEAD…
     • DRAM latency and power have a large impact on system
     • Even when cache hit rates are high!
     • DRAMs are odd and complex beasts
     • Knowing something about their behavior can aid optimization
     • Sometimes you get better results even when the data bandwidth increases!
     DRAM CHIP ORGANIZATION

  8. DRAM: DYNAMIC RANDOM ACCESS MEMORY
     • Designed for density (memory size), not speed
     • The quest for smaller and smaller bits means huge complication for the circuits
     • And complicated read/write protocols
     SEMICONDUCTOR MEMORY BASICS: STATIC MEMORY
     [Diagram: feedback cell storing 0/1]
     • “Static” memory uses feedback to store 1/0 data
     • Data is retained as long as power is maintained

  9. SEMICONDUCTOR MEMORY BASICS: STATIC MEMORY
     [Diagram: feedback cell storing 0/1, with an access control line]
     • “Static” memory uses feedback to store 1/0 data
     • Data is retained as long as power is maintained
     • Six transistors per bit

  10. SRAM CHIP ORGANIZATION
     [Block diagram: row decoder, 256 × 256 memory array, column I/O, column decoder, input data control, timing pulse generator, and read/write control; pins A0-A12, I/O1-I/O8, CS1, CS2, WE, OE, VCC, VSS]
     • Simple array of bit cells
     • This example is tiny - 64k (8k x 8)
     • Bigger examples might have multiple arrays
     SRAM CHIP ORGANIZATION
     • Simple access strategy
     • Apply address, wait, data appears on data lines (or gets written)
     • CS is “chip select”, OE is “output enable”, WE is “write enable”
     • SRAM is what’s used in on-chip caches
     • Also for embedded systems

  11. SRAM CHIP ORGANIZATION
     • Simple access strategy
     • Apply address, wait, data appears on data lines (or gets written)
     • CS is “chip select”, OE is “output enable”, WE is “write enable”
     • SRAM is what’s used in on-chip caches
     • Also for embedded systems
     Function table (× : H or L):
     WE   CS1  CS2  OE   Mode                        VCC current   I/O pin   Ref. cycle
     ×    H    ×    ×    Not selected (power down)   ISB, ISB1     High-Z    -
     ×    ×    L    ×    Not selected (power down)   ISB, ISB1     High-Z    -
     H    L    H    H    Output disable              ICC           High-Z    -
     H    L    H    L    Read                        ICC           Dout      Read cycle (1)-(3)
     L    L    H    H    Write                       ICC           Din       Write cycle (1)
     L    L    H    L    Write                       ICC           Din       Write cycle (2)
     [Read-cycle timing diagram: signals Address (valid address), CS1, CS2, OE, Dout (high impedance, then valid data); parameters tRC, tAA, tCO1, tCO2, tLZ1, tLZ2, tOE, tOLZ, tHZ1, tHZ2, tOHZ, tOH]
     SEMICONDUCTOR MEMORY BASICS: DYNAMIC MEMORY
     [Diagram: one-transistor, one-capacitor cell storing 1/0]
     • Data is stored as charge on a capacitor
     • Access transistor allows charge to be added or removed from the capacitor
     • One transistor per bit

  12. DYNAMIC MEMORY PHOTOMICROGRAPHS
     [Photos; sources: http://www.tf.uni-kiel.de/ and www.sdram-technology.info]
     SEMICONDUCTOR MEMORY BASICS: DYNAMIC MEMORY
     [Diagram: write driver connected to the bit line of a 1/0 cell]
     • Writing to the bit
     • Data from the driver circuit dumps charge on capacitor or removes charge from capacitor

  13. SEMICONDUCTOR MEMORY BASICS: DYNAMIC MEMORY
     [Diagram: 1/0 cell coupled through the bit line to a sense amplifier]
     • Reading from the bit
     • Data from capacitor is coupled to the bit line
     • Voltage change is sensed by the sense amplifier
     • Note - reading is destructive!
     • Charge is removed from capacitor during read
     DRAM ARRAY (MAT)
     [Diagram: row address into the row decoder; array over sense amplifiers and row buffer; column address into the column decoder; data out]
     • An entire row is first transferred to/from the Row Buffer
     • e.g. 16Mb array (4096x4096)
     • Row and Column = 12-bit addr
     • Row buffer = 4096b wide
     • One column is then selected from that buffer
     • Note that rows and columns are addressed separately
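The separate row and column addressing of the 4096x4096 example can be sketched as a split of a 24-bit bit index into two 12-bit fields. This is an illustration only: the function name is hypothetical, and putting the row in the upper bits is an assumption, since the slides do not say how a flat address maps onto rows and columns.

```python
ROW_BITS = 12  # 4096 rows
COL_BITS = 12  # 4096 columns, so the row buffer is 4096 bits wide


def split_address(bit_index):
    """Split a flat bit index into (row, column) for a 4096x4096 mat.

    Assumption: the row address occupies the upper 12 bits.
    """
    if not 0 <= bit_index < (1 << (ROW_BITS + COL_BITS)):
        raise ValueError("bit index out of range for a 16Mb array")
    row = bit_index >> COL_BITS               # selects the row to activate
    col = bit_index & ((1 << COL_BITS) - 1)   # selects one bit in the row buffer
    return row, col


print(split_address(4095))  # last column of row 0
print(split_address(4096))  # first column of row 1
```

With this mapping, consecutive bit indices stay in the same row until the column field wraps, which is what keeps sequential accesses inside one row buffer.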

  14. DRAM ARRAY (MAT)
     • DRAM arrays are very dense
     • But also very slow!
     • ~20ns to return data that is already in the Row Buffer
     • ~40ns to read new data into a Row Buffer (precharge…)
     • Another ~20ns if you have to write Row Buffer back first (Row Buffer Conflict)
     DRAM ARRAY (MAT)
     • Another issue: refresh
     • The tiny little capacitors “leak” into the substrate
     • So, even if you don’t read a row, you have to refresh it every so often
     • Typically every 64ms
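The three latency cases on this slide can be turned into a toy cost model. The sketch below assumes the approximate figures are totals per access: about 20 ns for a row buffer hit, about 40 ns when a new row must be read into the buffer, and a further 20 ns when the old row has to be written back first. The slide does not say whether the 40 ns already includes the column access, so that reading is an assumption, as are the function names.

```python
ROW_HIT_NS = 20     # data already in the row buffer
ROW_MISS_NS = 40    # a new row must be read into the row buffer
WRITEBACK_NS = 20   # extra cost of writing the old row back first


def access_latency_ns(open_row, requested_row):
    """Approximate latency of one access given the currently open row."""
    if open_row == requested_row:
        return ROW_HIT_NS
    if open_row is None:                 # nothing open yet: plain miss
        return ROW_MISS_NS
    return ROW_MISS_NS + WRITEBACK_NS    # row buffer conflict


def total_latency_ns(rows):
    """Sum the latency of a sequence of row accesses to a single array."""
    open_row = None
    total = 0
    for row in rows:
        total += access_latency_ns(open_row, row)
        open_row = row
    return total


print(total_latency_ns([5, 5, 5, 5]))  # one miss, then three hits
print(total_latency_ns([1, 2, 1, 2]))  # one miss, then three conflicts
```

In this model, four accesses to one row cost 100 ns in total, while four accesses that alternate between two rows cost 220 ns. Refresh is not modelled here.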

  15. DRAM INTERNAL MAT ORGANIZATION
     [Diagram: several DRAM arrays side by side, each with its own row decoder, sense amplifiers, row buffer, and column decoder, together forming x2, x4, x8, x16, x32, etc. data widths]
     DRAM CHIP ORGANIZATION
     • This is an x4 2Gb DRAM (512Mx4)
     • 8 x 256Mb banks
     • Each bank has multiple mats
     • “8n prefetch”
     • fetches 8x4 = 32 bits from the row buffer on each access
     • 8kb row buffer
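The numbers in this chip organization follow from a few divisions, sketched below. The sketch assumes binary prefixes (1 Gb = 2^30 bits, 1 kb = 2^10 bits); the variable names are illustrative. Dividing 2Gb across 8 banks gives 256Mb per bank.

```python
CHIP_BITS = 2 * 2**30         # 2Gb device
DATA_PINS = 4                 # x4 part
BANKS = 8
PREFETCH = 8                  # "8n prefetch"
ROW_BUFFER_BITS = 8 * 2**10   # 8kb row buffer

locations = CHIP_BITS // DATA_PINS                      # addressable 4-bit locations
bits_per_bank = CHIP_BITS // BANKS                      # capacity of one bank
bits_per_access = PREFETCH * DATA_PINS                  # bits taken from the row buffer per access
accesses_per_row = ROW_BUFFER_BITS // bits_per_access   # accesses before a row is exhausted
rows_per_bank = bits_per_bank // ROW_BUFFER_BITS

print(locations // 2**20, "M locations")         # 512M, matching "512Mx4"
print(bits_per_bank // 2**20, "Mb per bank")
print(bits_per_access, "bits per access")
print(accesses_per_row, "accesses per open row")
print(rows_per_bank, "rows per bank")
```

One 8kb row therefore holds 256 of the 32-bit prefetch units, so a single access uses only a small fraction of the row that was opened.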

  16. DRAM INTERNAL MAT ORGANIZATION
     DRAM COMMAND STATE MACHINE
     • Access commands/protocols are a little more complex than for SRAM…
     • Activate, Precharge, RAS, CAS
     • If open row, then just CAS
     • If wrong open row then write-back, Act, Pre, RAS, CAS
     • Lots of timing relationships!
     • This is what the memory controller keeps track of…
     • Micron DRAM datasheet is 211 pages…
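The open-row rule above can be sketched as a tiny per-bank tracker that returns the commands a read would need. This is a simplification and not the state machine from the slide or from any datasheet. It uses the conventional DDR command names and order: PRECHARGE closes the open row, ACTIVATE supplies the row address (RAS), and READ supplies the column address (CAS). The class and method names are hypothetical, and all timing constraints are ignored.

```python
class Bank:
    """Tracks which row (if any) is open in one bank."""

    def __init__(self):
        self.open_row = None

    def commands_for_read(self, row):
        """Return the command sequence needed to read from `row`."""
        if self.open_row == row:
            return ["READ"]                # open row: just CAS
        commands = []
        if self.open_row is not None:
            commands.append("PRECHARGE")   # wrong row open: close it first
        commands.append("ACTIVATE")        # open the requested row (RAS)
        commands.append("READ")            # select the column (CAS)
        self.open_row = row
        return commands


bank = Bank()
print(bank.commands_for_read(3))  # bank idle: activate, then read
print(bank.commands_for_read(3))  # same row: read only
print(bank.commands_for_read(7))  # different row: precharge first
```

A real memory controller additionally enforces the minimum delays between these commands, which is where most of the datasheet's complexity lies.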

  17. DRAM TIMING
     • Activate uses the row address (RAS) and bank address to activate and precharge a row
     • Read gives the column address (CAS) to select bits from the row buffer
     • Note burst of 8 words returned
     • Note data returned on both edges of clock (DDR)
     DRAM PACKAGES

  18. HIGHER LEVEL ORGANIZATION: DIMM, RANK, BANK, AND ROW BUFFER
     [Diagram: processor and memory controller connected by the address and data bus to a bank and its row buffer]
     • Bank - a set of arrays that are active on each request
     • Row Buffer: The last row read from the Bank
     • Typically on the order of 8kB (for each 64-bit read request!)
     • Acts like a secret cache!!!
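The “secret cache” effect can be illustrated by counting row buffer hits for two access patterns. The sketch below is a deliberately simple model and rests on several assumptions: a single bank with one open row, an 8kB row, 8-byte (64-bit) requests, and a linear mapping in which each 8kB of address space is one row. Real controllers interleave addresses across channels, ranks, and banks, so actual hit rates will differ. The function name is hypothetical.

```python
ROW_BYTES = 8 * 1024   # row buffer size used on the slide (order of 8kB)
REQUEST_BYTES = 8      # one 64-bit read request


def row_buffer_hit_rate(addresses, row_bytes=ROW_BYTES):
    """Fraction of requests that find their row already open."""
    if not addresses:
        return 0.0
    open_row = None
    hits = 0
    for addr in addresses:
        row = addr // row_bytes   # assumed linear address-to-row mapping
        if row == open_row:
            hits += 1
        open_row = row            # the requested row is now the open one
    return hits / len(addresses)


sequential = [i * REQUEST_BYTES for i in range(4096)]   # walks through 4 rows
strided = [i * ROW_BYTES for i in range(4096)]          # a new row every time

print(row_buffer_hit_rate(sequential))
print(row_buffer_hit_rate(strided))
```

In this model the sequential stream misses only 4 times in 4096 requests, while a stride of one full row never hits.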

  19. DRAM CHIP SUMMARY
     • DRAM is designed to be as dense as possible
     • Implications: slow and complex
     • Most interesting behavior: The Row Buffer
     • Significant over-fetch - 8kB fetched internally for a 64-bit bus request
     • Data delivered from an “open row” is significantly faster, and lower energy, than truly random data
     • This “secret cache” is the key to tweaking better performance out of DRAM!
     DRAM DIMM AND MEMORY CONTROLLER ORGANIZATION
     • Niladrish Chatterjee, NVIDIA Corporation

  20. DRAM DIMM AND ACCESS PIPELINE
     [Diagram: memory controller connected by the memory bus or channel to a DIMM; within it a rank, a DRAM chip or device, a bank, and an array; labels “1/8th of the row buffer” and “one word of data output”]
     • DIMMs are small PCBs on which DRAM chips are assembled
     • Chips are separated into ranks
     • A rank is a collection of chips that work in unison to service a memory request
     • There are typically 2 or 4 ranks on a DIMM
     DRAM DIMM AND ACCESS PIPELINE
     • The memory channel has data lines and a command/address bus
     • Data channel width is typically 64 (e.g. DDR3)
     • DRAM chips are typically x4, x8, x16 (bits/chip)
     • 64-bit data channel == sixteen x4 chips or eight x8 chips or four x16 chips or two x32 chips…
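The chip counts in the last bullet are the channel width divided by the per-chip data width. A minimal sketch of that division follows; the function name is illustrative.

```python
def chips_per_rank(channel_bits, chip_width):
    """Number of DRAM chips needed to fill the data channel in one rank."""
    if chip_width <= 0 or channel_bits % chip_width:
        raise ValueError("chip width must evenly divide the channel width")
    return channel_bits // chip_width


for width in (4, 8, 16, 32):
    print(f"x{width}: {chips_per_rank(64, width)} chips")
```

Because every chip in the rank responds to the same request, each one supplies only its own slice of the 64-bit word.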
